User:Sanja pavlovic/GSOC/OPW application

Contact info

 * Name: Sanja Pavlovic
 * E-mail: sanja.pavlovic@vikimedija.org, sanjoo911@gmail.com, pavlovic.sanja.91@gmail.com
 * IRC: sanjup
 * Location: Belgrade, Serbia
 * Typical working hours: 15:00 - 23:00 CEST (13:00 - 21:00 UTC)

Project synopsis
Incremental data dumps


 * We offer data dumps of Wikipedia and other Wikimedia projects, allowing people to access this knowledge where an Internet connection is missing, slow, or expensive, to research edit patterns, and to data-mine our vast knowledge base. The dumps for the larger projects are only getting larger, e.g. 40GB for the English Wikipedia. What is more, the update a month later will be another 40GB or more. In fact, only a small subset of that information actually changes, in the form of new pages, new revisions, or deleted revisions. Imagine if users of these files could download just the changes, plus a script that applies the changes. Imagine if the dumps could be written out using the previous month's dumps with such a scheme. Imagine running the German-language Wikipedia dumps in 3 days instead of the current 16. This could be achieved by designing the right output format for the XML files containing the text of all revisions.
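The "download just the changes, plus a script that applies the changes" idea can be sketched roughly as below. The change-set format here (lists of new and deleted revisions keyed by page and revision ID) is entirely hypothetical; designing the real format is part of the project.

```python
# Minimal sketch of applying an incremental change set to a previous dump.
# The in-memory layout and the change-set format are made up for illustration.
def apply_changes(pages, changes):
    """pages: {page_id: {rev_id: text}}. changes may contain
    'new_revisions' (page_id, rev_id, text) and
    'deleted_revisions' (page_id, rev_id). Returns the updated pages."""
    for page_id, rev_id, text in changes.get("new_revisions", []):
        pages.setdefault(page_id, {})[rev_id] = text
    for page_id, rev_id in changes.get("deleted_revisions", []):
        pages.get(page_id, {}).pop(rev_id, None)
    return pages

old = {1: {10: "old text"}}
delta = {"new_revisions": [(1, 11, "new text"), (2, 20, "fresh page")],
         "deleted_revisions": [(1, 10)]}
updated = apply_changes(old, delta)
```

Given last month's dump and a much smaller change set, a user could rebuild the current dump locally instead of downloading the full 40GB again.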

Possible mentor
Ariel Glenn

Deliverables
An improved script that will considerably shorten the dump process, along with all necessary changes to that and related scripts, as needed for compatibility.


 * worker.py is the main part of the dump software, and it has already been patched (48012)
 * if we are going to migrate from XML to JSON format, a converter from JSON to XML should be created, so that people would still be able to use XML dumps
 * Bzip2Xml.py will need a major rewrite: it's likely that the final dumps will be tar'd bzip2 files; instead of searching for page IDs, it should have that information before the decompression starts.
 * pagerange.py contains code important for our project; it's likely that it would change a lot.
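For the JSON-to-XML converter mentioned above, a minimal sketch could look like this. The field names (`id`, `timestamp`, `text`) are assumptions for illustration; the real converter would follow whatever schema the JSON dumps end up using.

```python
# Hypothetical sketch of converting one JSON revision record back to XML,
# so consumers of the existing XML dumps are not left behind.
import xml.etree.ElementTree as ET

def revision_to_xml(rev):
    """Convert a revision dict to an XML <revision> element string.
    The field list is a placeholder, not the real export schema."""
    elem = ET.Element("revision")
    for key in ("id", "timestamp", "text"):
        child = ET.SubElement(elem, key)
        child.text = str(rev[key])
    return ET.tostring(elem, encoding="unicode")

xml = revision_to_xml({"id": 42, "timestamp": "2013-05-01T12:00:00Z",
                       "text": "Hello"})
```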

Innovation
It is necessary to approach the database dump from a different angle, because the currently used method requires a large amount of time and processing power. Some of the ideas that occurred to me while thinking about this problem are:
 * decreasing the size and increasing the number of files the dump produces, which would shorten the duration of compression and decompression
 * creating an index in the dump files by the IDs they contain, which would speed up searching for entries
 * switching from XML to JSON format, which would considerably reduce the size of the dump files
The first and second weeks after the formal start of the project are designated for testing these and other approaches.
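The indexing idea can be sketched as follows: record the byte offset of each page record while writing the dump, so a reader can seek directly to a page instead of scanning the whole file. The one-record-per-line format here is a made-up placeholder for the real dump layout.

```python
# Sketch of a page-ID index: map each page ID to the byte offset of its
# record, so lookups seek directly instead of decompressing everything.
import io

def build_index(stream):
    """Return {page_id: byte offset} for a stream of tab-separated
    'page_id<TAB>content' lines (a placeholder record format)."""
    index = {}
    offset = stream.tell()
    for line in stream:
        page_id = int(line.split(b"\t", 1)[0])
        index[page_id] = offset
        offset += len(line)
    return index

data = io.BytesIO(b"1\tfirst page\n7\tseventh page\n9\tninth page\n")
idx = build_index(data)
data.seek(idx[7])  # jump straight to page 7's record
```

With tar'd bzip2 output, the same idea would map page IDs to member files or compressed-block offsets rather than raw byte positions.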

The timeline of the project

 * Up to May 3rd
 * finishing the application and the first contribution (required only for the OPW)


 * From May 4th to June 16th
 * familiarizing myself with the code base and the rest of the documentation in detail


 * June 17th
 * officially starting the project


 * June 17th to June 30th
 * thinking about the ideas and testing their feasibility:
 * what is the difference between (de)compressing one large file and many smaller ones, in terms of time and CPU consumption?
 * what is the best method for indexing files? How can the existing code be used for this purpose?
 * does the JSON format give significantly better performance in comparison to XML?
 * etc.
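One of these feasibility checks can be prototyped in a few lines: serialize the same revision both ways, compress each with bzip2 (the format the dumps already use), and compare sizes. The revision layout and field names below are assumptions; a real test would use actual dump records.

```python
# Rough feasibility sketch: compressed size of one revision as JSON vs XML.
# The record layout is hypothetical; real measurements need real dump data.
import bz2
import json

rev = {"id": 42, "timestamp": "2013-05-01T12:00:00Z", "text": "Hello " * 100}
as_json = json.dumps(rev).encode("utf-8")
as_xml = ("<revision><id>%d</id><timestamp>%s</timestamp><text>%s</text>"
          "</revision>" % (rev["id"], rev["timestamp"], rev["text"])).encode("utf-8")
json_size = len(bz2.compress(as_json))
xml_size = len(bz2.compress(as_xml))
```

Running this over a large sample of real revisions, rather than one synthetic record, would show whether the XML tag overhead actually matters after compression.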


 * July 1st to July 31st
 * writing the code


 * August 1st to August 31st
 * testing the code


 * September 1st to September 23rd
 * final checks and writing the documentation

About me
Hi, all!

My name is Sanja Pavlovic. I live in Belgrade, Serbia. I am currently in my third year of Journalism and Communicology studies at the Faculty of Political Sciences, University of Belgrade. I volunteer at Serbia's weekly magazine "Time".

Last year I started contributing to Wikinews, after which I soon wanted to become a member of Wikimedia Serbia. In a few months I was elected president of the Wikimedia Serbia's Media board. I held that position for about 7 months, after which I became a member of the Wikimedia Serbia Board. In that capacity, I actively participate in the Wikimedia Serbia's decision making process, also helping out in implementation of its projects and ideas.

During the last few months I became interested in programming, so I started learning HTML, Python, PHP, and administration skills by myself, and shortly after I started coming to workshops in Belgrade's hackerspace, Hacklab Belgrade, where I am learning a lot.

I am very ambitious and eager to learn, especially when it comes to Wikimedia projects. So far I have mostly had the chance to participate in content creation, but I am also fascinated by the technical aspects of Wikimedia's projects. I am mostly interested in getting to know the inner workings of this technology and want to be able to contribute from that side as well.

I would be more than happy to continue working on the administration of Wikimedia projects even after GSOC/OPW ends.

Info plus

 * From May 23rd to May 27th, I will be attending the Wikimedia Hackathon in Amsterdam, where I will be able to talk to and consult with people from Wikimedia, and to improve my knowledge of the administration aspect of Wikimedia's projects.

 * Currently (up to the end of May), I'm working on a project named "Casting a wider net", which is conducted by the Internet Society Serbia and Wikimedia Serbia. The project summary is:

"The project 'Casting a wider net' consists of building a web application as an interface between citizens and members of parliament, i.e. representatives in the National Assembly of the Republic of Serbia (MPs), with an aim to facilitate a dialogue, monitor MPs' activities and their voting history, and influence decision-making processes. We want to demonstrate how the Internet, by being an open and interoperable communication technology, can make democratic processes more participatory and improve transparency of decision making in our society. In order to achieve that mid-term goal, we will firstly focus on Internet freedoms and regulations from the field, in perspective broadening the scope of subject matters."