Incremental dumps

Suggested file format

Name and contact information

 * Name: Petr Onderka
 * Email: gsvick@gmail.com
 * IRC or IM networks/handle(s):
 * jabber: gsvick@gmail.com
 * Location: Prague, Czech Republic
 * Typical working hours: 15:00–22:00 CEST (13:00–20:00 UTC)

Synopsis
Mailing list thread

Currently, creating a database dump of the larger Wikimedia sites takes a very long time, because it is always done from scratch. Creating a new dump based on the previous one could be much faster, but that is not feasible with the current XML format. This project proposes a new binary format for the dumps that allows efficient modification, and thus creating a new dump based on the previous one.

Another benefit is that the new format would also allow seeking, so a user could directly access the data they are interested in. A similar format will also be created that allows downloading only the changes made since the last dump and applying them to a previously downloaded dump.
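To illustrate how seeking could work, here is a minimal sketch of a dump with a per-page offset index in front of the records. The record layout, field widths, and function names here are all hypothetical, chosen for brevity; they are not the format this proposal will actually specify.

```python
import io
import struct

def write_dump(pages):
    """Write a toy seekable dump: [count][page_id -> offset index][records].

    The layout is illustrative only, not the proposed format.
    """
    body = io.BytesIO()
    index = {}  # page_id -> offset within the body section
    for page_id, text in pages.items():
        index[page_id] = body.tell()
        data = text.encode("utf-8")
        body.write(struct.pack("<I", len(data)))  # length prefix
        body.write(data)

    out = io.BytesIO()
    out.write(struct.pack("<I", len(index)))
    for page_id, offset in index.items():
        out.write(struct.pack("<II", page_id, offset))
    out.write(body.getvalue())
    return out.getvalue()

def read_page(dump, page_id):
    """Jump directly to one page without scanning the whole dump."""
    (count,) = struct.unpack_from("<I", dump, 0)
    index_start, entry = 4, struct.calcsize("<II")
    body_start = index_start + count * entry
    for i in range(count):
        pid, offset = struct.unpack_from("<II", dump, index_start + i * entry)
        if pid == page_id:
            (length,) = struct.unpack_from("<I", dump, body_start + offset)
            start = body_start + offset + 4
            return dump[start:start + length].decode("utf-8")
    raise KeyError(page_id)

dump = write_dump({10: "First page", 42: "Second page"})
print(read_page(dump, 42))  # -> Second page
```

An XML dump, by contrast, has to be decompressed and parsed from the start to reach any given page.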

Deliverables
I will start working right after my final exam on 27 June.

Required by midterm

 * script for creating dumps in the new format
 * using basic compression
 * includes full history dumps, current version dumps and “stub” dumps (which contain metadata only)
 * library for reading the dumps by users

Detailed timeline

 * 28 June – 30 June
 * create proposal of the file format
 * set up my working environment


 * 1 July – 7 July
 * write code for creating and updating stub current dump based on a range of revisions (getting required information from PHP, saving it in C/C++)


 * 8 July – 14 July
 * add other dump types (with history, with page text, with both; the text of each page will be compressed separately, using a general-purpose algorithm such as LZMA, the algorithm behind 7z)


 * 15 July – 21 July
 * handle articles and revisions that were deleted or undeleted


 * 22 July – 28 July
 * write library for reading dumps
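The per-page compression mentioned in the 8–14 July item can be sketched like this: each page's text is compressed independently, so a single page can be read back without decompressing anything else. The use of Python's `lzma` module here is just a stand-in for whatever general-purpose algorithm ends up being used.

```python
import lzma

# Compress each page's text independently, so one page can be
# decompressed without touching the others (illustrative only).
pages = {
    1: "== Heading ==\nSome wiki text.\n" * 50,
    2: "Another page with different text.\n" * 50,
}

compressed = {pid: lzma.compress(text.encode("utf-8"))
              for pid, text in pages.items()}

# Any single page can be restored on its own.
restored = lzma.decompress(compressed[2]).decode("utf-8")
assert restored == pages[2]

# Repetitive wiki text compresses well even per page.
print(len(pages[2]), "->", len(compressed[2]))
```

The trade-off versus compressing the whole dump as one stream is a somewhat worse compression ratio in exchange for random access and cheap in-place updates.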

Required by final deadline

 * script for creating dumps
 * using smarter compression techniques for better space efficiency
 * will also create incremental dumps (for all three dump types) in a similar format, containing only the changes since the last dump
 * user script for applying incremental dump to previous dump
 * the file format of the dumps will be fully documented
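The user script for applying an incremental dump would, conceptually, do something like the sketch below: take the previous full dump plus a list of changes, and produce an updated dump. The record shape (`action`/`page_id`/`text` dicts) is a placeholder for illustration, not the actual incremental format.

```python
def apply_incremental(base, incremental):
    """Merge an incremental dump into a previous full dump.

    `base` maps page_id -> text from the previous dump;
    `incremental` lists only the changes made since then.
    The record shape is a stand-in, not the proposed format.
    """
    updated = dict(base)
    for change in incremental:
        if change["action"] == "delete":
            updated.pop(change["page_id"], None)
        else:  # a new page, or a new revision of an existing page
            updated[change["page_id"]] = change["text"]
    return updated

base = {1: "Old intro", 2: "Stable article"}
changes = [
    {"action": "update", "page_id": 1, "text": "Rewritten intro"},
    {"action": "delete", "page_id": 2},
    {"action": "update", "page_id": 3, "text": "Brand new page"},
]
print(apply_incremental(base, changes))
# -> {1: 'Rewritten intro', 3: 'Brand new page'}
```

Since the incremental dump contains only changed pages, it should be far smaller than a full dump, which is what makes repeated downloads practical.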

Timeline

 * 1 August – 7 August
 * reading directly from MediaWiki, including deleted and undeleted pages and revisions


 * 8 August – 14 August
 * creating diff dumps (they contain the changes since the last dump and can be applied to an existing dump)


 * 15 August – 28 August (2 weeks)
 * implementing and tweaking revision text delta compression; decreasing dump size in other ways


 * 29 August – 11 September (2 weeks)
 * tweaking performance of reading and writing dumps
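The revision text delta compression planned for 15–28 August exploits the fact that consecutive revisions of a page are usually near-identical, so each revision can be stored as a small delta against the previous one. A minimal sketch using Python's `difflib` (the real implementation would likely use a binary delta algorithm; all names here are hypothetical):

```python
from difflib import SequenceMatcher

def make_delta(old, new):
    """Encode `new` as copy/insert instructions against `old`."""
    ops = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, old, new).get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))        # reuse a span of the old text
        elif tag in ("replace", "insert"):
            ops.append(("insert", new[j1:j2]))  # store only the new bytes
        # 'delete' needs no instruction: the span is simply not copied
    return ops

def apply_delta(old, delta):
    """Rebuild the new revision from the old text plus the delta."""
    parts = []
    for op in delta:
        if op[0] == "copy":
            parts.append(old[op[1]:op[2]])
        else:
            parts.append(op[1])
    return "".join(parts)

rev1 = "The quick brown fox jumps over the lazy dog."
rev2 = "The quick brown fox leaps over the lazy dog."
delta = make_delta(rev1, rev2)
assert apply_delta(rev1, delta) == rev2
```

This is essentially the idea behind git's pack files mentioned below: a full-history dump mostly stores small deltas rather than full revision texts.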

Optional

 * user library for downloading only required parts of a dump (and only if they changed since the last download)
 * script to convert from the new format to the old format
 * SQL dumps?

About you
For quite some time now, I've been interested in working with Wikipedia data, usually to create some sort of report out of it. Over time, I have accessed this data in pretty much every form: XML dumps, SQL dumps, the API, and the database itself via SQL on Toolserver. I haven't done much in this area lately (partly due to work and school), but the interest is still there.

I am also interested in compression and file formats in general, specifically things like the format of git's pack files or Protocol Buffers. Though this has been more of a passive interest, since I hadn't found anywhere to put it to use. Until now.

I'm also among the top users on Stack Overflow for the mediawiki and wikipedia tags.

Participation
If it's clear what I'm supposed to do (or what I want to do), I tend to work alone. For example, in my last GSoC (for another project), where the interface was set in stone, I didn't communicate with my mentor (or the community) much. This project is much more open-ended though, so I plan to talk with my mentor more (and to a lesser degree, the community).

I will certainly publish my changes to a public git repo at least daily. If I get access to my own branch in the official repo, I will push my changes there; otherwise, I will use GitHub. I already have some experience with Wikimedia's Gerrit, though I don't expect to use it much in this project.

Past open source experience
Lots of my work is open source in name only: I have published the code, but no one else has ever worked on it.

But for my bachelor thesis, I needed to extend the MediaWiki API, so I did just that. During my work on the thesis, I also noticed some bugs in the API, so I fixed them.

Last summer, I participated in GSoC for Mono. The goal was to finish the implementation of TPL Dataflow, a concurrency library, which I did successfully.

June report
As mentioned, I actually started working on 28 June, so there isn't much to report yet.

What I did do:


 * created a page describing suggested file format
 * imported dump of a tiny wiki to my local MediaWiki for testing
 * requested a repository for this project in gerrit

July report
Most of the work planned for July is done. The application can create dumps in the new format (which supports efficient incremental updates) from an existing XML dump. It can then convert a dump in the new format back to XML. The generated XML dumps are the same as the originals (with a few expected exceptions).

The original plan was to have a library for reading the new dumps, but this was changed to XML output, because that's more convenient for current users of dumps.

There are two items planned for July that I didn't get to: reading directly from MediaWiki (instead of from an existing XML dump) and handling deleted and undeleted pages and revisions.