User:Michael A. White/Proposal
Abstract
- Project: Natural language-style semantic search for Semantic MediaWiki
- Name: Michael White (en.wp, homepage), mikewhite314@gmail.com
Working info
- Location: Medford, just outside Boston, Massachusetts, US
- Time zone: UTC-4
- Typical working hours: 12:00-22:00, but highly flexible (as early as 08:00, as late as 04:00).
- IM handles: Skype: mikewhite314; Gtalk/Jabber: mikewhite314@gmail.com
Project summary
Adding semantic structure to the vast amount of data in Wikipedia is a compelling value proposition that is the inspiration for my project. Semantic MediaWiki opens exciting new ways of both programmatic and human manipulation of data. For instance, semantic search on Wikipedia would compete very well against Powerset and Wolfram Alpha, but a way for non-technical users to query semantic data is needed. My project will improve and extend the AskTheWiki (ATW) extension, which enables users to make semantic searches without having to learn a query syntax by converting keyword queries to semantic queries.
To benefit from semantic search, end users who are unfamiliar with a wiki's nomenclature need to be able to search without knowing a complicated query syntax. A user should be able to type "Actor born in Boston height 180cm" and let the software figure out that they mean [[Category:Actor]] [[born in::Boston]] [[height::180cm]] (and not, for instance, [[Category:Actor]] [[born in::<q>[[Category:Boston]][[height::180cm]]</q>]]). ATW achieves this by translating keyword queries into SPARQL queries on an RDF export of the wiki using a Java servlet backend and displaying the results using a custom interface. My project will focus primarily on (1) modifying or rewriting ATW backend components that currently have unknown licensing status so ATW can be released, (2) making ATW translate human queries directly to SMW Ask queries rather than use its own SPARQL results objects, so SMW's Semantic Result Formats can be taken advantage of, (3) making ATW operate directly on the SMW database or an RDF interface to it, rather than requiring a full RDF dump, (4) making the parser recognize additional human language equivalents for Ask query features (e.g., comparators) that are currently not supported, and (5) improving the ATW results browser.
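To make the intended translation concrete, here is a minimal Python sketch of the keyword-to-Ask step. It is not ATW's actual algorithm: the vocabulary sets and the `keywords_to_ask` and `_starts_known_term` functions are hypothetical, and the greedy longest-match strategy is a simplification that assumes the wiki's category and property names are already known.

```python
# Hypothetical vocabulary; a real implementation would read these from the wiki.
CATEGORIES = {"actor"}
PROPERTIES = {"born in", "height"}


def _starts_known_term(tokens, start, max_len):
    """Return True if a known category or property name begins at tokens[start]."""
    for n in range(1, max_len + 1):
        phrase = " ".join(tokens[start:start + n]).lower()
        if phrase in CATEGORIES or phrase in PROPERTIES:
            return True
    return False


def keywords_to_ask(query: str) -> str:
    """Translate a keyword query into an SMW Ask query string (illustrative only)."""
    tokens = query.split()
    max_len = max(len(name.split()) for name in CATEGORIES | PROPERTIES)
    conditions = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest phrase first so "born in" wins over a shorter match.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            key = phrase.lower()
            if key in PROPERTIES:
                # The property value is every token up to the next known term.
                j = i + n
                value = []
                while j < len(tokens) and not _starts_known_term(tokens, j, max_len):
                    value.append(tokens[j])
                    j += 1
                if value:
                    conditions.append(f"[[{key}::{' '.join(value)}]]")
                    i = j
                    matched = True
                    break
            elif key in CATEGORIES:
                conditions.append(f"[[Category:{phrase}]]")
                i += n
                matched = True
                break
        if not matched:
            i += 1  # Skip tokens that cannot be interpreted.
    return " ".join(conditions)


print(keywords_to_ask("Actor born in Boston height 180cm"))
```

Because each property value stops at the next known term, this sketch produces the flat reading of the example query rather than the nested subquery reading. A real parser would need to rank competing interpretations instead of committing to the first match.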
If this is accomplished with time remaining, the following options could be explored:
- What is a good interface for integrating semantic query results on Special:Search? This would involve determining whether a given query targets the structured data or the unstructured data.
- Projects like DBPedia extract semantic properties from Wikipedia templates. How can extracted data be stored alongside explicit structured data and used for search, both for query interpretation and results?
About me
I am a second-year computer science student at Tufts University. I enjoy programming, most simply, because I enjoy creating deterministic abstractions and solving problems in the most efficient/clever/lazy way. I have a wide range of academic and personal interests and hobbies and am currently interested in linguistics, natural language processing, and artificial general intelligence, which I think are ultimately connected to Wikipedia's mission (I like to joke that once we have an intelligent machine, we could feed it Wikipedia and WikiHow and send it on its way). I thoroughly support the goal of making data on the web semantically structured in order to be machine-processable, because it ultimately allows humans to do vastly more interesting things with it.
I am a longtime Wikipedia editor (although not so active recently), but I think I can benefit Wikipedia more with code than with prose, and I plan to begin contributing regularly to MediaWiki regardless of the acceptance of my proposal. PHP/MySQL is my primary toolset and I also have significant experience with C++, JavaScript and HTML/CSS. I recently co-founded a profitable textbook comparison-shopping website and wrote its PHP/MySQL backend.[1] Courses I have taken include Data Structures, Algorithms, Programming Languages, Web Programming, and Artificial Intelligence. For more information about me, including courses I have taken, see my homepage.
Deliverables
Improving the AskTheWiki extension in the following ways, as noted above:
- Maintain all the current functionality while replacing or rewriting code that has unknown license status
- Return queries in Ask format rather than as SPARQL, so Semantic Result Formats can be used
- Operate directly on the SMW database or an RDF intermediary rather than requiring a complete RDF dump
- Implement all Ask query features for which there is a plausible natural language equivalent and a feasible implementation, including possibly comparators, wildcards, booleans, and subqueries
- Handle variations in human input, such as use of plurals and synonyms of the terms that are actually in the data
- Include incoming properties in the faceted results browser
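As an illustration of handling variations in human input, the following Python sketch maps plurals and synonyms onto terms that exist in the data. The `SYNONYMS` and `KNOWN_TERMS` tables and the `normalize_term` function are hypothetical, and the suffix stripping is deliberately naive; a real implementation would likely use a proper stemmer or lexical resource.

```python
# Hypothetical tables; real data would come from the wiki's categories and properties.
KNOWN_TERMS = {"actor", "born in", "height"}
SYNONYMS = {"birthplace": "born in", "performer": "actor"}


def normalize_term(word: str) -> str:
    """Map a user-supplied word onto a term that exists in the data, if possible."""
    key = word.lower()
    if key in KNOWN_TERMS:
        return key
    if key in SYNONYMS:
        return SYNONYMS[key]
    # Naive plural handling: strip "es" or "s" and look the stem up again.
    for suffix in ("es", "s"):
        if key.endswith(suffix):
            stem = key[: -len(suffix)]
            if stem in KNOWN_TERMS:
                return stem
            if stem in SYNONYMS:
                return SYNONYMS[stem]
    return word  # Unknown words pass through unchanged.


print(normalize_term("Actors"))      # plural of a known term
print(normalize_term("performers"))  # plural of a synonym
```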
If time permits:
- Usability testing and UI refinements
- Multi-language support for deliverable 4 above (for example, for [[height::>6 ft]], a localization can define a string that means "greater than")
- Explore ways to integrate search of structured and unstructured data
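The localization idea for comparators could look roughly like the Python sketch below, in which each language supplies phrases that map to Ask comparator symbols. The `COMPARATOR_PHRASES` table, its phrases, and the `property_condition` function are assumptions made for illustration, not part of ATW or SMW.

```python
# Hypothetical per-language phrase tables; the values are Ask comparator symbols.
COMPARATOR_PHRASES = {
    "en": {"greater than": ">", "more than": ">", "less than": "<", "not": "!"},
    "de": {"größer als": ">", "kleiner als": "<", "nicht": "!"},
}


def property_condition(prop: str, value_text: str, lang: str = "en") -> str:
    """Build an Ask condition, turning a leading comparator phrase into its symbol."""
    phrases = COMPARATOR_PHRASES.get(lang, {})
    # Check longer phrases first so multi-word phrases are not shadowed.
    for phrase in sorted(phrases, key=len, reverse=True):
        head = value_text[: len(phrase) + 1].lower()
        if head == phrase + " ":
            rest = value_text[len(phrase) + 1:].strip()
            return f"[[{prop}::{phrases[phrase]}{rest}]]"
    # No comparator phrase found: fall back to a plain equality condition.
    return f"[[{prop}::{value_text}]]"


print(property_condition("height", "greater than 6 ft"))
print(property_condition("height", "größer als 2 m", lang="de"))
```

Keeping the phrases in a per-language table means a new localization only has to add data, not parser code.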
Schedule
I would use the "Community Bonding Period" to continue familiarizing myself with the MW and SMW code (including by submitting unrelated patches) and plan the technical implementation aspects of my project in more detail with my mentor, so that I could begin coding in earnest as soon as the window opens.
- Milestone 1: AskTheWiki can be released, more or less in its current state (Deliverable 1). Due by end of week 4, but could be accomplished sooner depending on whether my mentor can get certain components released.
- Milestone 2: Lower-level backend changes (Deliverables 2 & 3). Due week 7.
- Milestone 3: Higher-level backend changes (Deliverables 4 & 5). Due week 11.
- Milestone 4: Deliverable 6. Due week 12.
Participation
I'd expect to have Skype check-ins with my mentor at least once a week, and email/IM contact more frequently as needed. I also intend to keep a weekly development blog and/or put high-level planning on this wiki in order to solicit feedback from the community, which I would also do via the email list (mainly smw-devel but also wikitech-l when appropriate) whenever useful. My code would be running on a test MediaWiki installation on my own server and I'd commit it to a public SVN repository daily.
Past open source experience
I haven't contributed to any open source projects as a developer. I have contributed to the PhpGedView and GRAMPS communities as a user in the past, genealogy being an interest of mine. I use open source software almost exclusively and have run a Linux-only desktop since 2005.
Footnotes
- ↑ I will be working on this company over the summer, but most of the work will occur after GSoC is done, as we ramp up for the fall semester in late August. I do not expect it to prevent me from committing the necessary hours to GSoC, nor would I allow it to.
External links
- Talking to the Semantic Web: Query Interfaces to Ontologies for the Rest of Us
- Semantic Wiki Search - paper about AskTheWiki
- Jena Triplestore Connector (SMW+)