Incremental dumps

Name and contact information

 * Name: Petr Onderka
 * Email: gsvick@gmail.com
 * IRC or IM networks/handle(s):
   * jabber: gsvick@gmail.com
 * Location: Prague, Czech Republic
 * Typical working hours: 15:00–22:00 CEST (13:00–20:00 UTC)

Synopsis
Mailing list thread

Currently, creating a database dump of a larger Wikimedia site takes a very long time, because it is always done from scratch. Creating a new dump based on the previous one could be much faster, but that is not feasible with the current XML format. This project proposes a new binary format for the dumps that allows efficient modification, so that a new dump can be created from the previous one.

Another benefit is that this format would allow seeking, so users can directly access the data they are interested in. A similar format will also be created for incremental dumps, which will let users download only the changes made since the last dump and apply them to a previously downloaded dump.
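To illustrate what seeking could look like, here is a minimal sketch in Python. The layout is purely an assumption made for this example (a header pointing to an index of page offsets stored at the end of the file); it is not the format this project will define, and the names `write_dump`, `read_index` and `read_page` are hypothetical.

```python
import io
import struct

# Hypothetical layout, assumed only for this sketch:
#   [8-byte index offset][records ...][index]
# Each record is a 4-byte length followed by UTF-8 text.
# The index is a 4-byte count followed by (page_id, offset) entries.
HEADER = struct.Struct("<Q")
LENGTH = struct.Struct("<I")
ENTRY = struct.Struct("<IQ")


def write_dump(f, pages):
    """Write {page_id: text} to f, followed by an index of record offsets."""
    f.write(HEADER.pack(0))  # placeholder, patched once the index offset is known
    index = []
    for page_id, text in sorted(pages.items()):
        data = text.encode("utf-8")
        index.append((page_id, f.tell()))
        f.write(LENGTH.pack(len(data)))
        f.write(data)
    index_offset = f.tell()
    f.write(LENGTH.pack(len(index)))
    for page_id, offset in index:
        f.write(ENTRY.pack(page_id, offset))
    f.seek(0)
    f.write(HEADER.pack(index_offset))


def read_index(f):
    """Load the page_id -> offset map without reading any page text."""
    f.seek(0)
    (index_offset,) = HEADER.unpack(f.read(HEADER.size))
    f.seek(index_offset)
    (count,) = LENGTH.unpack(f.read(LENGTH.size))
    return dict(ENTRY.unpack(f.read(ENTRY.size)) for _ in range(count))


def read_page(f, index, page_id):
    """Seek directly to one page and read only its record."""
    f.seek(index[page_id])
    (length,) = LENGTH.unpack(f.read(LENGTH.size))
    return f.read(length).decode("utf-8")


buf = io.BytesIO()
write_dump(buf, {1: "Main Page", 7: "Prague", 42: "Dump"})
index = read_index(buf)
print(read_page(buf, index, 7))
```

With an index like this, a reader touches only the header, the index and the one record it needs, whereas a single XML stream has to be scanned from the start.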

Deliverables
I will start working right after my final exam on 27 June.

Required by midterm

 * script for creating dumps in the new format
   * using basic compression
   * includes full history dumps, current version dumps and “stub” dumps (which contain metadata only)
 * library for reading the dumps by users

Required by final deadline

 * script for creating dumps
   * using smarter compression techniques for better space efficiency
   * will also create incremental dumps (for all 3 types of dumps) in a similar format, containing only changes since the last dump
 * user script for applying incremental dump to previous dump
 * the file format of the dumps will be fully documented
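
The idea behind incremental dumps and the script that applies them can be sketched as follows. This is only an illustration under simplifying assumptions: a dump is modeled as an in-memory mapping from page ID to text, and the function names and the `upsert`/`delete` structure are hypothetical, not part of the planned format.

```python
def make_incremental(old, new):
    """Return only the changes needed to turn dump `old` into dump `new`."""
    return {
        # pages that were added or whose text changed
        "upsert": {pid: text for pid, text in new.items() if old.get(pid) != text},
        # pages that no longer exist in the new dump
        "delete": sorted(pid for pid in old if pid not in new),
    }


def apply_incremental(old, delta):
    """Apply an incremental dump to a previously downloaded dump."""
    result = dict(old)
    result.update(delta["upsert"])
    for pid in delta["delete"]:
        result.pop(pid, None)
    return result


previous = {1: "Main Page", 7: "Prague", 42: "Dump"}
current = {1: "Main Page", 7: "Prague, Czech Republic", 99: "New page"}

delta = make_incremental(previous, current)
updated = apply_incremental(previous, delta)
print(delta)
print(updated == current)
```

The unchanged page 1 does not appear in the delta at all, which is where the savings in download size would come from.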

Optional

 * user library for downloading only required parts of a dump (and only if they changed since the last download)
 * script to convert from the new format to the old format
 * SQL dumps?

TODO from here

About you
We don't just care about your project -- you are a person, and that matters to us! What drives you? What makes you want to make this the most awesomest wiki enhancement ever?

You don't need to write out your life story (we can read your blog if we want that), but we want to know a little about what makes you tick. Are you a Wikipedia addict wanting to make your own experience better? Did a wiki with usability problems run over your dog, and you're seeking revenge? :-) What does making this project happen mean to you?

Participation
We don't just want to know what you plan to accomplish; we want to know how. Briefly describe your work style: how you plan to communicate progress, where you plan to publish your source code while you're working, how and where you plan to ask for help. (We will tend to favor applicants that demonstrate a clear vision for what it means to be an active participant in our development community.)

Past open source experience
Do you have any past experience working in open source projects (MediaWiki or otherwise)? If so, tell us about it! If you have already written a feature or bugfix in a Wikimedia technology such as MediaWiki, link to it here; we will give strong preference to candidates who have done so.

Any other info
Please add any other relevant information -- UI mockups, references to related projects, a link to your proof of concept code, whatever. There are no specific requirements, but we love to see people who love what they're doing. Show us you're excited about this project and have an interest in the background and are considering how best to make your idea work.