User:SHL/GSoC2010/Status

Status page for Samuel Lampa's Google Summer of Code project, "General RDF export/import functionality for Semantic MediaWiki"

How to follow my progress
 * IRC: Ping me on samuell @ irc.freenode.net / #mediawiki
 * My GSoC blog posts
 * This page
 * My GSoC posts on twitter
 * Google code (New URL! Please update your links!)
 * SMW RDFIO SVN checkins (New URL! Please update your links!)
 * Demo website (though not 100% up to date, as I'm developing against a localhost site)

= Status updates =
Here I post longer wrap-ups of what has been done. For more frequent updates, see the GSoC-tagged posts on my blog.

'''This list is a bit outdated! See my blog for up to date content!'''

July 15 2010
Please note that I moved to a new Google code repo due to the name change (SMW RDF Connector --> SMW RDFIO). Please update your links and bookmarks!
 * http://code.google.com/p/smwrdfio/

See also the newly created extension page

July 14 2010 (2)
Created Extension page: RDFIO (See blog post on why I renamed from "SMW RDF Connector")

July 14 2010
(Originally posted here)

As you might have seen among the GSoC2010-tagged posts, I've had a rudimentary RDF/XML import and a SPARQL endpoint (only for querying so far!) up and running for a while. You should be able to set these up yourself by following one or more of the instructions in the Google code repo:


 * ARC2 Store Install (required for SPARQL endpoint)
 * ARC2 SPARQL Endpoint Install
 * RDF Import Install

I have since worked a bit on some use cases, which revealed a lot of intricacies to take into account on RDF import. One of them was a spinoff discussion, from a blog post by Egon Willighagen, which quite nicely outlines one of the motivations for having general RDF import in MediaWiki (read post, read discussion).

The last few days I've been working on heavily refactoring the import code, so that it is more general and easier to modify in new ways. There is still a lot to improve in the code, such as error handling, documentation, adding more options etc., so feel free to give feedback on the code! (Especially RDFImporter.php and EquivalendURIHandler.php, and preferably use the mailing lists: semediawiki-devel, semediawiki-user or mediawiki-l)

The RDF import seems to be the most challenging part of my project (and the export feature depends heavily on it), since it is the part where I'm breaking a bit of new ground, so feedback here is very welcome.

Choosing wiki titles for RDF entities on import - Feedback wanted
The single most challenging issue is how to select reasonable wiki titles for RDF entities on RDF import, based on the RDF data (one relevant blog post here). The need to be able to export the page with its original URI should not directly limit the choice, since this is already solved by storing the original URI as a property on each page.

The thoughts we have had so far are, in short:


 * First check whether the RDF entity in question has one of a list of properties, in prioritized order, that should be used as wiki title.
 * The first of these could be a special property which can be used to specify the title manually by including it in the imported RDF, like "hasWikiTitle".
 * The suggested list of properties so far can be seen in this blog post. This list should of course be configurable, and one question is how best to implement this configuration. A setting in LocalSettings.php? A wiki page?
 * If no property in this list matches, then the label of the RDF entity should be used. For example, if the entity's URI is http://bio2rdf.org/go:0032283, then 0032283 is used.
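
The selection order described above could be sketched roughly as follows. This is only an illustration in Python, not the actual import code: the function names, the triple representation, and the property URIs in `TITLE_PROPERTIES` are assumptions made for the example (only the idea of a "hasWikiTitle" property comes from the list above).

```python
# Hypothetical, prioritized list of properties to use as wiki title.
# A "hasWikiTitle" property comes first so that it overrides the others.
# The URIs below are placeholders chosen for illustration only.
TITLE_PROPERTIES = [
    "http://example.org/rdfio#hasWikiTitle",
    "http://www.w3.org/2000/01/rdf-schema#label",
    "http://purl.org/dc/elements/1.1/title",
]


def local_part(uri):
    """Return the part of a URI after the last '#', '/' or ':'."""
    cut = max(uri.rfind("#"), uri.rfind("/"), uri.rfind(":"))
    return uri[cut + 1:]


def choose_wiki_title(entity_uri, triples, title_properties=TITLE_PROPERTIES):
    """Pick a wiki title for entity_uri from (subject, predicate, object) triples."""
    # Collect the first value seen for each predicate of this entity.
    values = {}
    for subject, predicate, obj in triples:
        if subject == entity_uri:
            values.setdefault(predicate, obj)
    # Walk the configured properties in priority order.
    for prop in title_properties:
        if prop in values:
            return values[prop]
    # Fallback: use the last part of the entity's URI.
    return local_part(entity_uri)
```

With no matching property, `choose_wiki_title("http://bio2rdf.org/go:0032283", [])` falls back to `"0032283"`, as in the example above.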

How to configure namespace prefixes / pseudo-namespaces?
Using only the label of course carries the risk that multiple RDF entities convert to the same wiki title, which is not acceptable, for example, when using the wiki as a "one time RDF editor", which is one of the motivations for this project.

To solve this, one alternative (as a configurable option) could be to use a pseudo-namespace in the wiki title (e.g. "go" in the above example, which would result in "go:0032283" as the wiki title). This could be configured by creating a mapping between base URIs and pseudo-namespaces (i.e. "http://bio2rdf.org/go:" and "go", in this case).
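
A minimal sketch of such a mapping, again in Python purely for illustration. The names `PSEUDO_NAMESPACES` and `title_with_pseudo_namespace` are hypothetical, and preferring the longest matching base URI is an assumption of this sketch, not something decided in the project.

```python
# Hypothetical mapping from base URIs to pseudo-namespaces.
PSEUDO_NAMESPACES = {
    "http://bio2rdf.org/go:": "go",
}


def title_with_pseudo_namespace(uri, mapping=PSEUDO_NAMESPACES):
    """Return 'prefix:local' if a configured base URI matches, else None."""
    # Try longer base URIs first so that more specific mappings win
    # over more general ones (an assumption of this sketch).
    for base in sorted(mapping, key=len, reverse=True):
        if uri.startswith(base) and len(uri) > len(base):
            return mapping[base] + ":" + uri[len(base):]
    # No configured base URI matched; the caller has to fall back
    # to some other title strategy.
    return None
```

For the URI http://bio2rdf.org/go:0032283 this returns "go:0032283". For a URI with no configured base URI it returns `None`, so the caller can decide what to do instead.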

But then there is the question of how to configure this mapping. We've been thinking of a few options:
 * Let it be configured in the incoming RDF, by using a custom predicate "hasPseudoNameSpace"
 * Store a config in a Wiki article
 * Config in LocalSettings.php
 * On submitting data for import, analyze it first and present a screen with all the base URIs used, with fields to manually fill in the pseudo-prefixes to use.
 * Any combination of the above

I will keep working and try to figure out the most reasonable strategy together with Denny (who is my GSoC mentor), but feedback and comments are always welcome! (As said, preferably send feedback on the mailing lists: semediawiki-devel, semediawiki-user or mediawiki-l!)

-- Samuel / SHL 12:00, 14 July 2010 (UTC)