User:Sanja pavlovic/GSOC/OPW application

Contact info

 * Name: Sanja Pavlovic
 * E-mail: sanja.pavlovic@vikimedija.org, sanjoo911@gmail.com, pavlovic.sanja.91@gmail.com
 * IRC: sanjup
 * Location: Belgrade, Serbia

Project
Incremental data dumps


 * We offer data dumps of Wikipedia and other Wikimedia projects, allowing people to access this knowledge where an Internet connection is missing, slow, or expensive, to research edit patterns, and to data-mine our vast knowledge base. The dumps for the larger projects keep getting larger, e.g. 40GB for English Wikipedia. What is more, the update a month later will be another 40GB or more. In fact, only a small subset of that information actually changes between dumps, in the form of new pages, new revisions, or deleted revisions. Imagine if users of these files could download just the changes, plus a script that applies them. Imagine if the dumps could be written out using the previous month's dumps with such a scheme. Imagine running the German-language Wikipedia dumps in 3 days instead of the current 16. This could be achieved by designing the right output format for the XML files containing the text of all revisions.
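To make the idea of "download just the changes, plus a script that applies them" concrete, here is a minimal sketch in Python. It is not the actual dump format or the existing dump code: the in-memory layout (page id mapped to a dictionary of revisions), the change records, and the function name `apply_changes` are all assumptions made for illustration. A real solution would work on the XML files rather than on dictionaries.

```python
from copy import deepcopy


def apply_changes(previous, changes):
    """Build the current dump from the previous dump plus a list of changes.

    `previous` maps page_id -> {rev_id: text}. Each change is a tuple:
      ("add", page_id, rev_id, text)  a new revision (creates the page if needed)
      ("delete", page_id, rev_id)     a deleted revision
    This record layout is hypothetical and used only for illustration.
    """
    current = deepcopy(previous)  # leave the previous dump untouched
    for change in changes:
        op, page_id, rev_id = change[0], change[1], change[2]
        if op == "add":
            current.setdefault(page_id, {})[rev_id] = change[3]
        elif op == "delete":
            revisions = current.get(page_id, {})
            revisions.pop(rev_id, None)
            # Assumption: a page with no remaining revisions is dropped.
            if not revisions:
                current.pop(page_id, None)
        else:
            raise ValueError(f"unknown operation: {op!r}")
    return current


# Last month's dump: two pages with one revision each.
previous = {1: {10: "first text"}, 2: {20: "old text"}}

# This month's changes: a new revision, a new page, a deleted revision.
changes = [
    ("add", 1, 11, "second text"),
    ("add", 3, 30, "new page"),
    ("delete", 2, 20),
]

current = apply_changes(previous, changes)
```

In this sketch the user only needs the small `changes` list each month instead of the full dump, which is the saving the project aims for.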

Possible mentor
Ariel Glenn

The timeline of the project

 * First 2 weeks:
 * Getting familiar with the existing code for database dumps.


 * Week 3:
 * Thinking about possible solutions and testing them.


 * From week 4 to week 7:
 * Implementing the best solution.


 * From week 8 to week 10:
 * Testing the code.