User:J.a.coffman/GSoc 2013 Proposal

Name and contact information
Name: Jeremy Coffman
Email: jcoffman93@gmail.com
IRC or IM networks/handle(s):
Typical working hours:

My working hours are fairly flexible, and I would be happy to change them. Barring any requests, the following is what I have planned so far:

From May 7 to August 29:

18:00-01:00 (UTC) or 10:00-17:00 (PST), 7 days per week

Synopsis
Currently, MediaWiki's data dump scripts dump the entirety of a wiki's contents each time they are run, instead of emitting only the changes made since the last run. This is highly inefficient, and it places unnecessary strain on wikis that use the MediaWiki engine. Users who download the data dumps face the same problem: they must download the entire dump rather than only the changes they need. The following issues must be addressed in order to adequately solve this problem:
 * 1) Comparing the state of the wiki to the state of the dumps, and identifying any revisions and pages that have been added or deleted.
 * 2) Storing these changes in a format that allows for easy compression of the data, while maintaining easy access to particular elements.
 * 3) Informing a user of what changes they need to download in order to update their dump files, and then applying these changes locally.
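As a rough illustration of step 1, the comparison amounts to a set difference over (page, revision) identifier pairs. This is only a sketch; the function and the in-memory sets are stand-ins for whatever the real scripts would read from the database and from the previous dump:

```python
# Hypothetical sketch of step 1: diff the wiki's current state against
# the state captured in the last dump. Revisions are identified here by
# (page_id, revision_id) pairs.

def diff_states(dump_revs, wiki_revs):
    """Return (added, deleted) sets of (page_id, rev_id) pairs."""
    added = wiki_revs - dump_revs      # revisions created since the last dump
    deleted = dump_revs - wiki_revs    # revisions gone (e.g. page deletions)
    return added, deleted

# Example: page 1 gained revision 3, and page 2 was deleted entirely.
dump_revs = {(1, 1), (1, 2), (2, 1)}
wiki_revs = {(1, 1), (1, 2), (1, 3)}
added, deleted = diff_states(dump_revs, wiki_revs)
print(sorted(added))    # [(1, 3)]
print(sorted(deleted))  # [(2, 1)]
```

Steps 2 and 3 then reduce to serializing the `added`/`deleted` sets in a compressible format and replaying them against a user's local copy.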

The dump format employed at present is essentially a literal rendering of each page in the wiki. It consists of a single XML document containing a list of pages, with each page containing a list of all of its revisions, along with other information, such as the page's title. This format is difficult to compress, as different pages have varying numbers of revisions, and we would like to avoid separating revisions from their parent pages. This issue can be solved by storing the data dumps as a sequence of revisions instead of pages. We can associate particular pages with revisions by including a unique page identifier within each pair of revision tags. In other words, the XML document would assume the following structure:

  <revision>
    <pageid>2</pageid>
    <timestamp>2001-01-15T13:15:00Z</timestamp>
    <contributor><username>Foobar</username></contributor>
    <comment>I have just one thing to say!</comment>
    <text>A bunch of text here.</text>
  </revision>
  <revision>
    <pageid>1</pageid>
    <timestamp>2001-01-15T13:10:27Z</timestamp>
    <contributor><ip>10.0.0.2</ip></contributor>
    <comment>new!</comment>
    <text>Words and things.</text>
  </revision>

Because each revision is a discrete unit, this structure allows us to compress an arbitrary number of revisions together. Sorting the revisions by timestamp would allow fast access to any revision. When a user wishes to update their local dump, the download script would simply find the latest timestamp in their dump and download all revisions with a greater timestamp. An additional script would then reformat the downloaded data into the format required for importing into a MediaWiki instance.
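A minimal sketch of that client-side update step, assuming the remote revision stream is kept sorted by timestamp (the function name and tuple layout are illustrative, not an existing API). ISO 8601 timestamps in UTC compare correctly as plain strings, so a binary search suffices:

```python
from bisect import bisect_right

# Hypothetical sketch of the incremental update: find the newest timestamp
# in the local dump, then fetch only revisions strictly newer than it.
# Each revision is modeled as (timestamp, page_id, text), sorted by timestamp.

def revisions_to_fetch(remote_revisions, latest_local_timestamp):
    """Return the suffix of remote_revisions newer than the local maximum."""
    timestamps = [rev[0] for rev in remote_revisions]
    # First index whose timestamp is strictly greater than the local maximum.
    start = bisect_right(timestamps, latest_local_timestamp)
    return remote_revisions[start:]

remote = [
    ("2001-01-15T13:10:27Z", 1, "Words and things."),
    ("2001-01-15T13:15:00Z", 2, "A bunch of text here."),
    ("2001-02-01T09:00:00Z", 1, "A later edit."),
]
new = revisions_to_fetch(remote, "2001-01-15T13:15:00Z")
print(new)  # only the 2001-02-01 revision is returned
```

A separate script would then transform the fetched revisions back into the page-grouped XML that MediaWiki's import tools expect.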

Deliverables

About you
I am currently completing my second year at Brandeis University, where I am studying Computer Science (and possibly also Linguistics and East Asian Studies), with a possible focus on Natural Language Processing. I have experience writing in Python, Ruby, Java, and JavaScript, with slight exposure to C.

Participation
I intend to make regular commits to a GitHub repository to allow my mentor(s) to monitor my progress. I am already subscribed to the wikitech-l mailing list, and I plan to be available on wikitech's IRC channel both during my working hours and, ideally, outside of them as well. In addition to communicating with my mentor(s) via whatever method they find most convenient, I expect to request help and seek advice from other community members via IRC. Since I plan to become a regular contributor to MediaWiki, I would also like to keep myself informed of the progress of ongoing projects in the community. This would give me some idea of what I might work on after my modifications to the dump scripts are ready to be integrated.

Past open source experience
While I do not have experience working on an explicitly open source project, I did collaborate with two other students and the City of Boston to develop an application for the city. The GitHub repository can be found here: Wheres My Lane

Any other info