User:Legoktm/GSoC 2013

Identity

Name: Kunal Mehta (aka Legoktm)
Email: legoktm.wikipedia@gmail.com
Project title: Incremental Data Dumps

Contact/working info

Timezone: CDT until mid May, then PDT.
Typical working hours: I'm not sure, anytime really.
IRC or IM networks/handle(s): legoktm on freenode and just about everywhere else.

Project summary

Tracked in Phabricator
Task T30956

Currently, generating a new dump means producing an entire one, which takes much longer for large wikis (enwp) and forces consumers to download the entire dump again, even if 80% of the pages have not changed. To make the process much more efficient, it makes sense to produce dumps that contain only the content changed since the last dump was produced. A script could then be written to merge the incremental dumps into the last full dump to produce a complete dump for those who want it.

This means that we need to design a new format for the dumps, as well as an additional format for the incremental ones. Scripts will need to be written to convert the new format into the old/current format to maintain backwards compatibility.
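As a rough illustration of the idea above, here is a minimal Python sketch of producing an incremental from two snapshots and merging it back into the last full dump. Everything in it is an assumption made for illustration: the in-memory dictionaries stand in for real dump files, and the names `make_incremental` and `merge` are hypothetical, not part of any existing dump tooling.

```python
from typing import Dict, Optional

# Hypothetical in-memory model of a full dump: page ID -> latest page text.
Dump = Dict[int, str]
# Hypothetical incremental: page ID -> new text, or None if the page was deleted.
Incremental = Dict[int, Optional[str]]


def make_incremental(old: Dump, new: Dump) -> Incremental:
    """Record only the pages that were added, changed, or deleted since `old`."""
    inc: Incremental = {}
    for page_id, text in new.items():
        # New pages are missing from `old`; changed pages have different text.
        if old.get(page_id) != text:
            inc[page_id] = text
    for page_id in old:
        if page_id not in new:
            inc[page_id] = None  # deletion marker
    return inc


def merge(full: Dump, inc: Incremental) -> Dump:
    """Apply an incremental to a full dump and return the updated full dump."""
    merged = dict(full)
    for page_id, text in inc.items():
        if text is None:
            merged.pop(page_id, None)
        else:
            merged[page_id] = text
    return merged


old = {1: "Main Page", 2: "Foo", 3: "Bar"}
new = {1: "Main Page", 2: "Foo, edited", 4: "Baz"}
inc = make_incremental(old, new)
```

In this toy model the unchanged page 1 never appears in the incremental, and `merge(old, inc)` reproduces `new`. A real format would also have to carry revision history and metadata, which this sketch leaves out.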

Deliverables

Format for the incremental dumps
Format for the full dumps (aka the "new format")
Script to create these dumps
- Should work for both the text table and external storage
Script to convert full dumps from "old format" to "new format"
- This should be for both dump providers as well as dump consumers
Script to merge incremental dumps with the full one
- This should be for both dump providers as well as dump consumers

Required deliverables

A format for incremental dumps
A format for the full dumps (aka the "new format")
Updates to all the maintenance/dump*.php scripts to support incremental dumps
A script to generate the incremental dumps
A script to merge incremental dumps with a full dump to produce a new dump

If time permits

Dump-parsing scripts for the "new format" to help re-users with the transition
Look into different compression formats for optimization
SQL dumps?
User script to automatically download latest incremental and merge it into the locally stored full dump
???

Project schedule

Pre-start: Get familiar with the current process of how dumps are produced
May 28 - start
Come up with a format for the dumps - 1 week
- Need a list of events that should be captured in this format (edits, moves, deletes, revdel)
Write a Python script to generate incremental dumps - 2 weeks
- Can be tested on smaller wikis, but should be efficient enough to run on larger ones.
Implement changes into core as maintenance scripts - 2 weeks
- Need to modify backend, as well as pre-loading script.
Script to merge incrementals into full - 1 week
Look into better compression methods for incremental and full - 1 week
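To make the first schedule item more concrete, the following is one possible way the listed events (edits, moves, deletes, revdel) could be recorded, sketched in Python. The `DumpEvent` class, its field names, and the JSON Lines serialization are all assumptions for illustration; the actual format is exactly what the first week is meant to decide.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional

# The event kinds named in the schedule; the string labels are hypothetical.
EVENT_KINDS = {"edit", "move", "delete", "revdel"}


@dataclass
class DumpEvent:
    """One change captured by an incremental dump (hypothetical layout)."""
    kind: str
    page_id: int
    rev_id: Optional[int] = None      # revision affected by an edit or revdel
    new_title: Optional[str] = None   # target title for a move
    text: Optional[str] = None        # new page text for an edit

    def __post_init__(self) -> None:
        if self.kind not in EVENT_KINDS:
            raise ValueError(f"unknown event kind: {self.kind!r}")


def to_jsonl(events: List[DumpEvent]) -> str:
    """Serialize events as one JSON object per line."""
    return "\n".join(json.dumps(asdict(e), sort_keys=True) for e in events)


def from_jsonl(data: str) -> List[DumpEvent]:
    """Parse the output of to_jsonl back into events."""
    return [DumpEvent(**json.loads(line)) for line in data.splitlines() if line]


events = [
    DumpEvent(kind="edit", page_id=2, rev_id=101, text="Foo,\nedited"),
    DumpEvent(kind="move", page_id=3, new_title="Bar (disambiguation)"),
    DumpEvent(kind="delete", page_id=5),
    DumpEvent(kind="revdel", page_id=2, rev_id=99),
]
serialized = to_jsonl(events)
```

A line-oriented layout like this keeps each event independent, so an incremental can be streamed and applied in order. Whether that is efficient enough for large wikis is an open question the schedule's compression step would need to address.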

About you

I'm a long-time Wiki(p|m)edian who has been downloading and parsing dumps for years. I feel that the current system of dumps is bulky and hard for anyone to re-use or import. I'm a huge believer in the m:Right to fork, and incremental dumps would make exercising that right much easier.

Participation

We don't just want to know what you plan to accomplish; we want to know how. Briefly describe your work style: how you plan to communicate progress, where you plan to publish your source code while you're working, how and where you plan to ask for help. (We will tend to favor applicants that demonstrate a clear vision for what it means to be an active participant in our development community.)

I'm always on IRC, either answering questions or asking them. I'm familiar with git/gerrit and the workflow of submitting a patch, getting feedback, fixing, and repeating.

Past open source experience

I've contributed code to MediaWiki and extensions before (gerrit patches), am a long-time MediaWiki bot author, and am also a Manual:Pywikipediabot developer (commits), working on the rewrite project.

Any other info

I'm really not sure what to add here...