User:Sanja pavlovic/GSOC/OPW application

Contact info

 * Name: Sanja Pavlovic
 * E-mail: sanja.pavlovic@vikimedija.org, sanjoo911@gmail.com, pavlovic.sanja.91@gmail.com
 * IRC: sanjup
 * Location: Belgrade, Serbia
 * Typical working hours: 15:00 - 23:00 CEST (13:00 - 21:00 UTC)

Project synopsis
Incremental data dumps


 * We offer data dumps of Wikipedia and other Wikimedia projects, allowing people to access this knowledge where an Internet connection is missing, slow or expensive, to research edit patterns, and to data-mine our vast knowledge base. The dumps for the larger projects are only getting larger, e.g. 40 GB for the English Wikipedia, and the update a month later will be another 40 GB or more. In fact, only a small subset of that information actually changes, in the form of new pages, new revisions, or deleted revisions. Imagine if users of these files could download just the changes, plus a script that applies them. Imagine if the dumps could be written out using the previous month's dumps with such a scheme. Imagine running the German-language Wikipedia dumps in 3 days instead of the current 16. This could be achieved by designing the right output format for the XML files containing text for all revisions.
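As a rough illustration of the "download just the changes, plus a script that applies them" idea, here is a minimal Python sketch. It assumes a hypothetical JSON-lines changes format with new_page, new_revision and delete_revision operations; designing the actual format is exactly what this project has to work out.

 import json

 def apply_changes(pages, changes_path):
     # Apply an incremental changes file to an in-memory snapshot of the
     # previous full dump. `pages` maps page_id -> {rev_id: revision_text}.
     with open(changes_path, encoding="utf-8") as changes:
         for line in changes:
             change = json.loads(line)  # one change per line
             if change["op"] == "new_page":
                 pages.setdefault(change["page_id"], {})
             elif change["op"] == "new_revision":
                 pages.setdefault(change["page_id"], {})[change["rev_id"]] = change["text"]
             elif change["op"] == "delete_revision":
                 pages.get(change["page_id"], {}).pop(change["rev_id"], None)
     return pages

With such a script, a user who already has last month's dump would only need to download the (much smaller) changes file and apply it locally.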

Possible mentor
Ariel Glenn

Deliverables
An improved script that will considerably shorten the dump process, along with all necessary changes to that and related scripts, as needed for compatibility.

For example, if we migrate from the XML to the JSON format, a new program or script will need to replace the current one.
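As an illustration only, such a converter could stream revisions out of the current XML and write them as JSON lines; the field names below are a simplification of the real dump schema.

 import json
 import xml.etree.ElementTree as ET

 def childtext(elem, local_name):
     # Find a direct child by its local tag name, ignoring any XML namespace.
     for child in elem:
         if child.tag.rsplit("}", 1)[-1] == local_name:
             return child.text
     return None

 def xml_revisions_to_json(xml_path, json_path):
     # Stream <revision> elements out of a dump-style XML file and write
     # one JSON object per revision, without loading the whole file.
     with open(json_path, "w", encoding="utf-8") as out:
         for _, elem in ET.iterparse(xml_path):
             if elem.tag.rsplit("}", 1)[-1] == "revision":
                 record = {
                     "id": childtext(elem, "id"),
                     "timestamp": childtext(elem, "timestamp"),
                     "text": childtext(elem, "text") or "",
                 }
                 out.write(json.dumps(record, ensure_ascii=False) + "\n")
                 elem.clear()  # free memory already written out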

Innovation
It is necessary to approach the database dump from a different angle, because the currently used method requires a large amount of time and processing power. Some of the ideas that occurred to me while thinking about this problem are listed below; the first two weeks after the formal start of the project are designated for testing these and other approaches.
 * decreasing the size and increasing the number of the files that the dump produces, which would shorten the duration of compression and decompression
 * creating an index of the dump files by the IDs they contain, which would speed up the search for entries (see the sketch after this list)
 * switching from the XML to the JSON format, which would considerably reduce the size of the dump files
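A minimal sketch of the indexing idea, assuming a hypothetical JSON-lines dump where every entry carries a page_id field: the index maps each page id to the byte offset of its first entry, so a lookup becomes a single seek() instead of a linear scan.

 import json

 def build_page_index(dump_path, index_path):
     # Record the byte offset of the first entry for every page id in a
     # JSON-lines dump, so a page can later be read with a single seek().
     index = {}
     offset = 0
     with open(dump_path, "rb") as dump:
         for line in dump:
             entry = json.loads(line)
             index.setdefault(str(entry["page_id"]), offset)
             offset += len(line)
     with open(index_path, "w", encoding="utf-8") as out:
         json.dump(index, out)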

The timeline of the project

 * Up to May 3rd
 * finishing the application and the first contribution


 * From May 4th to June 16th
 * familiarizing myself with the code base and the rest of the documentation


 * June 17th
 * officially starting the project


 * June 17th to June 30th
 * thinking about the ideas and testing their feasibility


 * July 1st to July 31st
 * writing the code


 * August 1st to August 31st
 * testing the code


 * September 1st to September 23rd
 * final checks and writing the documentation

About me
Hi, all!

My name is Sanja Pavlovic. I live in Belgrade, Serbia. I am currently in my third year of Journalism and Communicology studies at the Faculty of Political Sciences, University of Belgrade. I volunteer at Serbia's weekly magazine "Time".

Last year I started contributing to Wikinews, after which I soon wanted to become a member of Wikimedia Serbia. Within a few months I was elected president of Wikimedia Serbia's Media board. I held that position for about 7 months, after which I became a member of the Wikimedia Serbia Board. In that capacity, I actively participate in Wikimedia Serbia's decision-making process, and I also help out with the implementation of its projects and ideas.

During the last few months I have become interested in programming, so I started learning HTML, Python, PHP, and administration skills on my own, and shortly afterwards I began attending workshops at Belgrade's hackerspace, Hacklab Belgrade, where I am learning a lot.