User:SHL/GSoC2010

The project has been accepted

See the Status page for how to follow my progress.

Identity

Name: Samuel Lampa
Email: samuel.lampa[at]gmail.com
Project title: General RDF export/import functionality for Semantic MediaWiki

Contact/working info

Timezone: Sweden (GMT +1)
Typical working hours: 14:00 - 02:00
IRC or IM networks/handle(s): Skype: samuel_lampa, IRC: freenode/samuell

Project summary

Extend the import/export functionality of Semantic MediaWiki (SMW) to also allow full, general RDF import.

My own background for the idea is to enable the use of SMW as a general collaborative RDF editor that can be integrated with workflow systems/scriptable workbench software (such as Bioclipse), to enable workflows of the type:

Import RDF to Wiki --> Collaboratively edit --> Export back in same format

...but this project would include some general reworkings of the import/export functionality, which that specific use case can take advantage of.

The ideas for a practical approach (based on mail conversation with Denny Vrandecic, who has declared interest in mentoring the project) are briefly presented in the deliverables section.

The main idea is to replace the defunct RAP with ARC as RDF library (this has already been discussed on SMW-devel), and to make use of the SMWWriter API to create a general SPARQL API with update functionality (preferably based on the "SPARQL Update" standard).

This would form the technical basis for implementing general RDF export/import functionality. What also needs to be added is functionality to map RDF URIs and/or RDF base URIs to wiki titles.

Two approaches have been discussed for that (probably both can be used, depending on use case):

Make use of the "Equivalent URI" property for mapping already existing wiki articles to RDF URIs. (On import, if an article contains an Equivalent URI annotation pointing to a specific URI, all properties for that URI in the imported RDF will be added to that wiki article.)
For RDF nodes without corresponding content in the wiki, make it possible to configure (either in the imported RDF data itself or in a central config) a mapping between RDF base URIs and wiki pseudo-namespaces.
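
The first approach can be illustrated with a small sketch. It is not SMW code: the dictionary of annotations, the function name and the sample triples are all made up for illustration, and a real implementation would read the Equivalent URI annotations from the wiki's store.

```python
from collections import defaultdict

# Hypothetical wiki state: article title -> URI from its "Equivalent URI" annotation.
EQUIVALENT_URI = {
    "Stuttgart": "http://dbpedia.org/resource/Stuttgart",
}


def assign_triples(triples, equivalent_uri):
    """Group imported (subject, predicate, object) triples by target.

    Subjects that match an Equivalent URI annotation are grouped under the
    annotated article title. All other subjects are returned separately,
    keyed by their URI, so that a title can be chosen for them another way
    (for example through a base URI to pseudo-namespace mapping).
    """
    uri_to_article = {uri: title for title, uri in equivalent_uri.items()}
    matched = defaultdict(list)
    unmatched = defaultdict(list)
    for subject, predicate, obj in triples:
        if subject in uri_to_article:
            matched[uri_to_article[subject]].append((predicate, obj))
        else:
            unmatched[subject].append((predicate, obj))
    return dict(matched), dict(unmatched)


if __name__ == "__main__":
    sample = [
        ("http://dbpedia.org/resource/Stuttgart",
         "http://www.w3.org/2000/01/rdf-schema#label", "Stuttgart"),
        ("http://dbpedia.org/resource/Uppsala",
         "http://www.w3.org/2000/01/rdf-schema#label", "Uppsala"),
    ]
    matched, unmatched = assign_triples(sample, EQUIVALENT_URI)
    print(matched)
    print(unmatched)
```

Subjects that end up in the unmatched group are the ones the second approach is meant to handle.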

The motivation for this is that, when importing RDF nodes without a corresponding wiki article, one has to choose the names of the new wiki articles somehow. Using the full URIs is impractical, and using just the part after the "base URI" risks creating duplicate pages (in case there are two RDF URIs with the same local name, e.g. the same property name, but different base URIs).

An example of this:

One would for example map the wiki pseudo-namespace "foaf:" to "http://xmlns.com/foaf/0.1/", so that importing a triple containing "http://xmlns.com/foaf/0.1/lives_in" results in the wiki article "Property:foaf:lives_in". Similarly, "http://dbpedia.org/resource/Stuttgart" might result in "dbpedia:Stuttgart", etc.
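
A minimal sketch of such a mapping, in both directions, could look as follows. The prefix table and the function names are assumptions made for illustration only; where the mapping is actually configured (in the RDF data or in a central config) is left open above.

```python
# Hypothetical central config: wiki pseudo-namespace -> RDF base URI.
PREFIXES = {
    "foaf": "http://xmlns.com/foaf/0.1/",
    "dbpedia": "http://dbpedia.org/resource/",
}


def uri_to_title(uri, prefixes=PREFIXES, is_property=False):
    """Return a wiki title for an RDF URI, or None if no base URI matches."""
    # Try the longest base URI first, so a more specific base wins
    # over a shorter one that is a prefix of it.
    for prefix, base in sorted(prefixes.items(),
                               key=lambda item: len(item[1]), reverse=True):
        if uri.startswith(base) and len(uri) > len(base):
            title = f"{prefix}:{uri[len(base):]}"
            return f"Property:{title}" if is_property else title
    return None


def title_to_uri(title, prefixes=PREFIXES):
    """Reverse mapping for export: wiki title -> RDF URI, or None."""
    if title.startswith("Property:"):
        title = title[len("Property:"):]
    prefix, sep, local = title.partition(":")
    if sep and local and prefix in prefixes:
        return prefixes[prefix] + local
    return None


if __name__ == "__main__":
    print(uri_to_title("http://xmlns.com/foaf/0.1/lives_in", is_property=True))
    print(uri_to_title("http://dbpedia.org/resource/Stuttgart"))
    print(title_to_uri("Property:foaf:lives_in"))
```

Having the reverse function is what makes the "export back in same format" step possible: a title created on import can be turned back into the URI it came from.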

About you

I'm a 27-year-old biotechnology student at Uppsala University with a strong interest in systems biology, computational biology, system design, semantic technologies and web development. I'm currently just finishing my M.Sc. degree in biotechnology (focusing on systems biology and bioinformatics).

Much of my technical experience comes from outside my studies: doing web design for 10+ years, web development with Drupal and MediaWiki for some 4 years, as well as summer work as a PC support technician/(Windows) network admin etc. The web development has been done through my father's small firm, RIL Partner AB, where we also provide web hosting for a few customers, running our own dedicated (Ubuntu) servers (which we optimized for MediaWiki and Drupal) that I administer. At RIL Partner we've been playing around quite a bit with MediaWiki and Semantic MediaWiki, testing out different ideas.

In the last few years I've actively focused on getting more hands-on coding experience, and hence did a PHP/MediaWiki web interface project at university, took bioinformatics courses, wrote a little Java web crawler for use with the Sphinx search engine etc. In my degree project, I'm gaining experience with Java coding, Eclipse RCP development and Prolog, as well as getting to know the W3C semantic formats and technologies.

The borders between studies, work and hobby tend to get a bit blurred for me, which does not leave very much spare time (I'm typically easier to reach by e-mail or Skype than by phone :) ). The time that is left over I typically spend hanging out with my family.

In the near future I hope to be able to work in the bioinformatics sector, or with systems and knowledge management tools for the Life Sciences. I'll probably continue open source development for Bioclipse and MediaWiki to some extent in the future, as I see both of them as great platforms for the kind of functionality I want to implement and work with. The above proposed GSoC project is highly interesting to me as it would be a killer feature for Bioclipse to be able to export data for community collaboration, and then retrieve it back again.

What drives me is a vision to enable better and more systematic knowledge discovery and integration in the biology / life sciences domain, by integrating semantic technologies with computational and simulation tools.

Deliverables

Required deliverables

The order of 1) and 2) does not matter.

1) Replace the RAP connection, using it as a starting point for connecting ARC with SMW (and probably the recently introduced SMWWriter API), so that a SPARQL/SPARQL Update API to the wiki knowledge base can be created.
2) Improve the Equivalent URI functionality so that it can be used for mapping URIs to wiki articles at import and export, and replace the current vocabulary import feature.
3) Connect the SPARQL endpoint with the Equivalent URI feature so that users can query the wiki with their own vocabulary.
4) Design an improved RDF export feature that allows specifying an ontology to use for export.
5) Implement import and update of RDF in the wiki, preferably using SPARQL Update as the interface. Implement namespace-based mappings of wiki titles to RDF base URIs for that (more info above).
6) Define a way of using mapping properties in the RDF.
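
To give an idea of what an import through a SPARQL Update interface might send, here is a rough sketch that builds an update request from a list of triples. It assumes the INSERT DATA form as it appears in SPARQL 1.1 Update; the proposal itself does not fix the syntax, and the helper names are hypothetical. The way objects are classified as URIs or literals is a deliberate simplification.

```python
def iri(value):
    """Wrap a URI in angle brackets, as SPARQL expects for IRIs."""
    return f"<{value}>"


def literal(value):
    """Serialize a plain string literal, escaping backslashes, quotes and newlines."""
    escaped = (value.replace("\\", "\\\\")
                    .replace('"', '\\"')
                    .replace("\n", "\\n"))
    return f'"{escaped}"'


def object_term(value):
    # Simplifying assumption: anything starting with http:// or https://
    # is treated as a URI, everything else as a plain literal.
    if value.startswith(("http://", "https://")):
        return iri(value)
    return literal(value)


def insert_data(triples):
    """Build an INSERT DATA request from (subject, predicate, object) triples.

    Subjects and predicates are always written as IRIs.
    """
    lines = [f"  {iri(s)} {iri(p)} {object_term(o)} ." for s, p, o in triples]
    return "INSERT DATA {\n" + "\n".join(lines) + "\n}"


if __name__ == "__main__":
    print(insert_data([
        ("http://dbpedia.org/resource/Stuttgart",
         "http://xmlns.com/foaf/0.1/name", "Stuttgart"),
    ]))
```

On the wiki side, each inserted triple would then have to be translated into an annotation on the article that the subject URI maps to.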

During the whole time: document, support, release.

If time permits

Implement a mapping tool.

Project schedule

...

Participation

I prefer having contact daily (or so) on a chat such as IRC or Skype (I'm hanging out daily at Freenode/#bioclipse right now for my degree project), plus e-mail for longer discussions. I also very much like the idea of using a blog (and really using it!) to document my progress (and make sure I don't forget things I've learned), and of using GitHub (or similar) for publishing source code.

Past open source experience

SWI-Prolog integration plugin for Bioclipse (GitHub repo, Project blog, Screencast)
A little patch to Yaron Koren's Semantic Forms.

Any other info

Made a web interface for a protein analysis tool (project done at the LCB in Uppsala), using MediaWiki, the MediaWiki API + external PHP scripts (Screencast)
A Java-based web crawler for use in combination with the Sphinx search engine
MediaWiki skin (demo)
Drupal theme