Incremental dumps

Name and contact information

Name: Petr Onderka
Email: gsvick@gmail.com
IRC or IM networks/handle(s):
- jabber: gsvick@gmail.com
Location: Prague, Czech Republic
Typical working hours: 15:00–22:00 CEST (13:00–20:00 UTC)

Synopsis

Tracked in Phabricator: Task T30956

Currently, creating a database dump of the larger Wikimedia sites takes a very long time, because it is always done from scratch. Creating a new dump based on the previous one could be much faster, but that is not feasible with the current XML format. This project proposes a new binary format for the dumps that allows a dump to be modified efficiently, so that a new dump can be created from the previous one.

Another benefit is that this format would allow seeking, so a user can directly access the data they are interested in. A similar format will also be created for incremental downloads: it will contain only the changes since the last dump was made, which can be applied to a previously downloaded dump.
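To illustrate why an indexed binary layout allows seeking, here is a minimal sketch. The layout (records, then an index, then a fixed-size footer) and the names `write_dump` and `read_page` are assumptions made for this example only, not the format proposed by the project.

```python
import io
import struct

# Hypothetical layout, not the project's actual format:
# [record bytes ...][index entries][footer: index offset, entry count]
INDEX_ENTRY = struct.Struct("<QQI")  # page id, record offset, record length
FOOTER = struct.Struct("<QI")        # index offset, number of index entries


def write_dump(stream, pages):
    """Write {page_id: text} as raw records followed by an index and a footer."""
    index = []
    for page_id, text in sorted(pages.items()):
        data = text.encode("utf-8")
        index.append((page_id, stream.tell(), len(data)))
        stream.write(data)
    index_offset = stream.tell()
    for entry in index:
        stream.write(INDEX_ENTRY.pack(*entry))
    stream.write(FOOTER.pack(index_offset, len(index)))


def read_page(stream, wanted_id):
    """Seek to one page through the index, without reading the other records."""
    stream.seek(-FOOTER.size, io.SEEK_END)
    index_offset, count = FOOTER.unpack(stream.read(FOOTER.size))
    stream.seek(index_offset)
    for _ in range(count):
        page_id, offset, length = INDEX_ENTRY.unpack(stream.read(INDEX_ENTRY.size))
        if page_id == wanted_id:
            stream.seek(offset)
            return stream.read(length).decode("utf-8")
    raise KeyError(wanted_id)


buf = io.BytesIO()
write_dump(buf, {1: "Main Page", 7: "Prague"})
text = read_page(buf, 7)
```

With an XML dump, finding page 7 would mean parsing everything before it. Here the reader touches only the footer, the index, and the one record it needs.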

Deliverables

I will start working right after my final exam on 27 June.

Required by midterm

script for creating dumps in the new format
- using basic compression
- includes full history dumps, current version dumps and “stub” dumps (which contain metadata only)
library for reading the dumps by users

Detailed timeline

28 June – 30 June

create proposal of the file format
set up my working environment

1 July – 7 July

write code for creating and updating a stub current dump based on a range of revisions (getting the required information from PHP, saving it in C/C++)

8 July – 14 July

add other dump types (with history, with page text, with both; the text of each page will be compressed separately using a general-purpose algorithm such as LZMA, as used by 7z)
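The point of compressing each page separately is that one page can be replaced without touching the compressed data of any other page. A minimal sketch, assuming Python's standard `lzma` module as the general-purpose compressor (LZMA is the default algorithm of 7z); the function names are hypothetical:

```python
import lzma


def compress_pages(pages):
    """Compress the text of every page independently: {page_id: bytes}."""
    return {pid: lzma.compress(text.encode("utf-8")) for pid, text in pages.items()}


def update_page(compressed, page_id, new_text):
    """Recompress a single page; all other entries are left untouched."""
    compressed[page_id] = lzma.compress(new_text.encode("utf-8"))


def read_page_text(compressed, page_id):
    """Decompress only the requested page."""
    return lzma.decompress(compressed[page_id]).decode("utf-8")


blobs = compress_pages({1: "First page text", 2: "Second page text"})
update_page(blobs, 1, "First page text, edited")
```

The trade-off is a worse compression ratio than compressing the whole dump as one stream, since similarities between pages are not exploited.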

15 July – 21 July

handle articles and revisions that were deleted or undeleted

22 July – 28 July

write library for reading dumps

Required by final deadline

script for creating dumps
- using smarter compression techniques for better space efficiency
- will also create incremental dumps (for all 3 types of dumps) in a similar format, containing only changes since the last dump
user script for applying incremental dump to previous dump
the file format of the dumps will be fully documented

Timeline

1 August – 7 August

reading directly from MediaWiki, including deleted and undeleted pages and revisions

8 August – 14 August

creating diff dumps (they contain changes since last dump; can be applied to existing dump)
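As a rough illustration of the idea, a diff dump can be modelled as the set of revisions that were added or changed plus the ids of revisions that were removed. This sketch represents a dump as a plain dictionary from revision id to text; the structure and the names `make_diff` and `apply_diff` are assumptions for the example, not the project's diff format.

```python
def make_diff(old, new):
    """Describe how to turn dump `old` into dump `new`."""
    changed = {rid: text for rid, text in new.items()
               if rid not in old or old[rid] != text}
    deleted = sorted(rid for rid in old if rid not in new)
    return {"changed": changed, "deleted": deleted}


def apply_diff(old, diff):
    """Apply a diff to a previously downloaded dump, returning the new dump."""
    result = dict(old)
    for rid in diff["deleted"]:
        result.pop(rid, None)
    result.update(diff["changed"])
    return result


previous = {100: "first revision", 101: "second revision", 102: "vandalism"}
current = {100: "first revision", 101: "second revision (fixed)", 103: "new revision"}
diff = make_diff(previous, current)
updated = apply_diff(previous, diff)
```

Here `diff` carries two changed revisions and one deleted id, and the unchanged revision 100 is not transferred at all.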

15 August – 28 August (2 weeks)

implementing and tweaking revision text delta compression; decreasing dump size in other ways
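Delta compression relies on consecutive revisions of a page usually being very similar, so a revision can be stored as a list of "copy this range from the previous revision" and "insert this new text" operations. A minimal sketch using Python's standard `difflib`; this is a stand-in for whatever delta algorithm the project ends up using, and the operation encoding is hypothetical.

```python
from difflib import SequenceMatcher


def make_delta(base, target):
    """Encode `target` as copy/insert operations relative to `base`."""
    ops = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))           # reuse base[i1:i2]
        elif tag in ("replace", "insert"):
            ops.append(("insert", target[j1:j2]))  # store only the new text
        # "delete" needs no operation: the range is simply not copied
    return ops


def apply_delta(base, ops):
    """Rebuild the target revision from the base revision and its delta."""
    parts = []
    for op in ops:
        if op[0] == "copy":
            parts.append(base[op[1]:op[2]])
        else:
            parts.append(op[1])
    return "".join(parts)


rev1 = "Prague is the capital of the Czech Republic."
rev2 = "Prague is the capital and largest city of the Czech Republic."
delta = make_delta(rev1, rev2)
restored = apply_delta(rev1, delta)
```

Only the inserted text is stored literally in `delta`, while the unchanged parts are referenced by position in the previous revision.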

29 August – 11 September (2 weeks)

tweaking performance of reading and writing dumps

Optional

user library for downloading only required parts of a dump (and only if they changed since the last download)
script to convert from the new format to the old format
SQL dumps?

About you

For quite some time now, I've been interested in working with Wikipedia data, usually to create some sort of report out of it. Over time, I have accessed this data in pretty much every form: XML dumps,^[1] SQL dumps,^[2] the API^[3] and SQL queries against the database on Toolserver.^[4] I haven't done much in this area lately (partly due to work and school), but the interest is still there.

I am also interested in compression and file formats in general, specifically things like the format of git's pack files or Protocol Buffers. This has been more of a passive interest, though, since I hadn't found anywhere to use it. Until now.

I'm also among the top users on Stack Overflow for the mediawiki and wikipedia tags.

Participation

If it's clear what I'm supposed to do (or what I want to do), I tend to work alone. For example, in my last GSoC (for another project), where the interface was set in stone, I didn't communicate with my mentor (or the community) much. This project is much more open-ended though, so I plan to talk with my mentor more (and to a lesser degree, the community).

I will certainly publish my changes to a public git repo at least daily. If I get access to my own branch in the official repo, I will push my changes there; otherwise, I will use GitHub. I already have some experience with Wikimedia's Gerrit, though I don't expect to use it much in this project.

Past open source experience

Lots of my work is open source in name only: I have published the code, but no one else has ever worked on it.^[5]

But for my bachelor thesis, I needed to extend the MediaWiki API, so I did just that. During my work on the thesis, I also noticed some bugs in the API, so I fixed them.

Last summer, I participated in GSoC for Mono. The goal was to finish the implementation of the TPL Dataflow concurrency library, which I did successfully.

Updates

June report

As mentioned, I actually started working on 28 June, so there isn't much to report.

What I did do:

created a page describing the suggested file format
imported a dump of a tiny wiki into my local MediaWiki for testing
requested a repository for this project in Gerrit

July report

Most of the work planned for July is done. The application can create dumps in the new format (which can deal with incremental updates efficiently) from an existing XML dump. It can then convert a dump in the new format back to XML. The generated XML dumps are the same as the originals (with a few expected exceptions).

The original plan was to have a library for reading the new dumps, but this was changed to XML output, because that's more convenient for current users of dumps.

There are two items that I planned for July but didn't actually do: reading directly from MediaWiki (instead of an existing XML dump) and handling of deleted and undeleted pages and revisions.

August report

The timeline slipped in August. I have completed the first two planned features: creating incremental dumps directly from MediaWiki (i.e. not from XML dumps) and diff dumps (which can be used to update an existing dump). Work on compression, which was supposed to be finished in August, is currently in progress.

Notes

[1] Used in generating the Dusty articles report.
[2] I have written a library for processing SQL dumps from .NET and use it in the Category cycles report.
[3] My bachelor thesis was writing a library for accessing the API from C#.
[4] Used in generating cleanup listings for WikiProjects.
[5] Most of the work mentioned in the above notes belongs in this category.
