User:J.a.coffman/GSoc 2013 Proposal

Incremental Data Dumps

Name and contact information

Name: Jeremy Coffman
Email: jcoffman93@gmail.com
IRC or IM networks/handle(s): jcoffman
Typical working hours:
My working hours are fairly flexible, and I would be happy to change them. Barring any requests, the following is what I have planned so far:
From May 9 to August 29:
18:00-01:00 (UTC) or 10:00-17:00 (PST), 7 days per week

Synopsis

Currently, MediaWiki's data dump scripts dump the entirety of a wiki's contents each time they are run, instead of applying only the changes made since the last run. This is highly inefficient and places unnecessary strain on wikis that use the MediaWiki engine. Those who download the data dumps suffer from the same issue: they must download the entire dump rather than only the changes they need. The following issues must be addressed to solve this problem adequately.

  1. Comparing the state of the wiki to the state of the dumps, and identifying any revisions and pages that have been added or deleted.
  2. Storing these changes in a format that allows for easy compression of the data, while maintaining easy access to particular elements.
  3. Informing a user of what changes they need to download in order to update their dump files, and then applying these changes locally.
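
To make the first item concrete, comparing the wiki's state to the dump's state can be reduced to a set difference over revision identifiers. The sketch below is a hypothetical illustration, not existing MediaWiki code; the function name and the shape of its inputs are assumptions about how the comparison might be expressed.

    # Hypothetical sketch of item 1 above: comparing the wiki's current state
    # to the state recorded in the last dump. The inputs are plain sets of
    # revision IDs; how they are obtained (database query, dump index) is
    # deliberately left open.

    def diff_revision_ids(wiki_rev_ids, dump_rev_ids):
        """Return (added, deleted) revision IDs relative to the last dump."""
        wiki_rev_ids = set(wiki_rev_ids)
        dump_rev_ids = set(dump_rev_ids)
        added = wiki_rev_ids - dump_rev_ids    # revisions created since the last dump
        deleted = dump_rev_ids - wiki_rev_ids  # revisions removed, e.g. by page deletion
        return added, deleted

    # Example: revisions 4 and 5 are new; revision 2 has been deleted.
    added, deleted = diff_revision_ids({1, 3, 4, 5}, {1, 2, 3})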

The dump format employed at present is essentially a literal representation of each page in the wiki. It consists of a single XML document containing a list of pages, with each page containing a list of all of its revisions, along with other information such as the page's title. This format is difficult to compress in independent chunks, as different pages have varying numbers of revisions, and we would like to avoid separating revisions from their parent pages. This issue can be solved by storing the data dumps as a sequence of revisions instead of pages, with each revision tagged with the timestamp of the dump that last catalogued it. We can associate revisions with their pages by including a unique page identifier within each pair of revision tags. In other words, the XML document would assume the following structure:

    <revision>
      <pageid>2</pageid>
      <timestamp>2001-01-15T13:15:00Z</timestamp>
      <dtimestamp>2001-01-31T15:02:00Z</dtimestamp>
      <contributor><username>Foobar</username></contributor>
      <comment>I have just one thing to say!</comment>
      <text>A bunch of text here.</text>
      <minor />
    </revision>
    <revision>
      <pageid>1</pageid>
      <timestamp>2001-01-15T13:10:27Z</timestamp>
      <dtimestamp>2001-01-31T15:02:00Z</dtimestamp>
      <contributor><ip>10.0.0.2</ip></contributor>
      <comment>new!</comment>
      <text>Words and things.</text>
    </revision>

Because each revision is a discrete unit, this structure allows us to compress an arbitrary number of revisions together. Separating revisions that belong to different pages would no longer be a concern, as their relationship can be reconstructed using the page id. Sorting the revisions by dump timestamp would also allow fast access to the revisions added since any given dump.
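
To illustrate that access pattern, the sketch below filters a revision file, sorted by dump timestamp, for everything newer than a given dump. It assumes the revisions are wrapped in a single root element and parsed with Python's standard library; the function name and file layout are assumptions for illustration only, not a finished design.

    import bisect
    import xml.etree.ElementTree as ET

    # Sketch only: assumes a well-formed XML file whose root element wraps
    # <revision> entries like those above, already sorted by <dtimestamp>.
    def revisions_since(dump_file, last_dump_timestamp):
        """Yield revisions whose dump timestamp is later than the given one."""
        revisions = list(ET.parse(dump_file).getroot().iter("revision"))
        dtimestamps = [rev.findtext("dtimestamp") for rev in revisions]
        # ISO 8601 timestamps in UTC sort lexicographically, so a binary
        # search over the raw strings finds the first newer revision.
        start = bisect.bisect_right(dtimestamps, last_dump_timestamp)
        for rev in revisions[start:]:
            yield rev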

The new format would only be employed for storage purposes. When a user wishes to update their local dump, the download script would simply submit the timestamp of the last dump they downloaded and fetch all revisions with a later dump timestamp. An additional script could then reformat the data into the old dump format to prevent conflicts with existing software.
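
A rough sketch of that reformatting step is shown below: revisions from the new format are grouped by their page ID and re-emitted under per-page elements. The element and function names here are illustrative assumptions, not the final script; page-level metadata such as titles would have to come from a separate index and is omitted.

    from collections import defaultdict
    import xml.etree.ElementTree as ET

    # Sketch of converting the revision-stream format back toward the old
    # page-oriented layout by grouping revisions under their parent pages.
    def regroup_by_page(revisions):
        """Build a page-oriented XML tree from an iterable of <revision> elements."""
        pages = defaultdict(list)
        for rev in revisions:
            pages[rev.findtext("pageid")].append(rev)

        root = ET.Element("mediawiki")
        for page_id, revs in pages.items():
            page = ET.SubElement(root, "page")
            ET.SubElement(page, "id").text = page_id
            for rev in revs:
                rev.remove(rev.find("pageid"))  # the page id now lives on the <page>
                page.append(rev)
        return ET.ElementTree(root)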

Deliverables

Required Deliverables

  • The new format for the dumps, as specified above.
  • Modification of existing scripts to support the new format.
  • New script to generate incremental dumps.
  • Script to handle dump downloads.
  • Script to convert downloaded revisions to the old format.
  • Necessary documentation.

If time permits

  • Developing a new format for SQL dumps.
  • Automatic updating of local dumps for users.

Project Schedule

Before May 27

Familiarize myself with the existing code and determine which elements can be incorporated into this project.

May 27 - June 17

Community bonding. I will collaborate with my mentor to develop a more rigorous specification of what I need to achieve during the program, as well as an appropriate schedule. I will also begin working on the code itself.

June 17 - June 30

Finalize the format of the new dumps and begin writing new scripts to generate dumps in the new format.

July 1 - July 14

Finalize scripts for dump generation. Begin to modify existing scripts to support the new format.

July 15 - July 21

Start working on the script people will use to download the dumps.

July 22 - July 28

Finish the download script and start working on the script that converts the downloaded revisions to the old format.

July 29 - August 29

Bug management, refactoring, improving documentation. This is also when I will begin to work on the optional deliverables. By the end of this period, the code should be ready for deployment. Though the goal is to finish by August 29, I can continue working until the Summer of Code ends on September 23.

About you

I am currently completing my second year at Brandeis University, where I am studying Computer Science (and possibly also Linguistics and East Asian Studies), with a possible focus on Natural Language Processing. I have experience writing Python, Ruby, Java, and JavaScript, with slight exposure to C.

Participation

I intend to make regular commits to a GitHub repository to allow my mentor(s) to monitor my progress. I am already subscribed to the wikitech-l mailing list, and I plan to be available on wikitech's IRC channel both during my working hours and, ideally, outside of them as well. In addition to communicating with my mentor(s) via whatever method they find most convenient, I expect to request help and seek advice from other community members via IRC. Since I plan to become a regular contributor to MediaWiki, I would also like to keep myself informed of the progress of ongoing projects in the community. This would give me some idea of what I might work on after my modifications to the dump scripts are ready to be integrated.

Past open source experience

While I do not have experience working on an explicitly open source project, I did collaborate with two other students and the City of Boston in developing an application for the city. The GitHub repository can be found here: Wheres My Lane

Any other info

See also