User:Legoktm/GSoC 2013

Revision as of 08:41, 24 April 2013

Identity

  • Name: Kunal Mehta (aka Legoktm)
  • Email: legoktm.wikipedia@gmail.com
  • Project title: Incremental Data Dumps

Contact/working info

  • Timezone: CDT until mid-May, then PDT.
  • Typical working hours: I'm not sure, anytime really.
  • IRC or IM networks/handle(s): legoktm on freenode and just about everywhere else.

Project summary

Currently, when a new dump is generated, an entire one must be produced from scratch, which takes a very long time for large wikis (e.g. enwp) and forces consumers to download the entire dump again as well, even if 80% of the pages have not changed. To make the process much more efficient, it makes sense to produce dumps that contain only the content that has changed since the last dump was produced. A script could then be written to merge the incremental dumps into the last full dump, producing a complete dump for those who want it.
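
To make the merge idea concrete, here is a minimal sketch, assuming (purely for illustration) that both dumps have already been parsed into dictionaries keyed by page id and revision id; the real script would stream XML rather than hold everything in memory:

    def merge(full, incremental, deleted_pages=()):
        """Overlay an incremental dump onto the last full dump."""
        merged = {page: dict(revs) for page, revs in full.items()}
        for page in deleted_pages:  # pages deleted since the last dump
            merged.pop(page, None)
        for page, revs in incremental.items():
            merged.setdefault(page, {}).update(revs)  # new pages and revisions
        return merged

    full = {1: {10: 'old text'}, 2: {11: 'unchanged'}}
    incr = {1: {12: 'newer text'}, 3: {13: 'brand new page'}}
    print(merge(full, incr, deleted_pages=[2]))
    # {1: {10: 'old text', 12: 'newer text'}, 3: {13: 'brand new page'}}

Note that deletions have to be recorded in the incremental dump explicitly: the unchanged full dump has no other way to learn about them.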

This means that we need to design a new format for the full dumps as well as an additional format for the incremental ones. Scripts will need to be written to convert the new format into the old/current format to maintain backwards compatibility.
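
As a starting point for that design work, here is one possible shape for a single incremental entry; every element and attribute name below is hypothetical, since defining the real format is exactly what this project would do:

    import xml.etree.ElementTree as ET

    # Hypothetical layout: the dump names the cutoff it is relative to,
    # and each <page> records what happened to it since then.
    dump = ET.Element('incrementaldump', since='2013-04-01T00:00:00Z')
    page = ET.SubElement(dump, 'page', id='42', action='edit')
    ET.SubElement(page, 'title').text = 'Example'
    rev = ET.SubElement(page, 'revision', id='5501')
    ET.SubElement(rev, 'timestamp').text = '2013-04-20T12:34:56Z'
    ET.SubElement(rev, 'text').text = 'New wikitext for this revision.'
    print(ET.tostring(dump, encoding='unicode'))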

Deliverables

  • XML format for the incremental dumps
  • XML format for the full dumps (aka the "new format")
  • Script to create these dumps
    • Should work for both the text table and external storage
  • Script to convert full dumps from "old format" to "new format" (see the conversion sketch after this list)
    • This should be for both dump providers as well as dump consumers
  • Script to merge incremental dumps with the full one
    • This should be for both dump providers as well as dump consumers
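
For the backwards-compatibility direction mentioned in the summary, a converter mostly has to map new-format elements back onto the current export format's <page>/<revision> layout. A minimal sketch, reusing the hypothetical element names from the format sketch above and covering only a few of the real export fields:

    import xml.etree.ElementTree as ET

    def to_old_format(new_page):
        """Map one hypothetical new-format <page> onto the current layout."""
        old = ET.Element('page')
        ET.SubElement(old, 'title').text = new_page.findtext('title')
        ET.SubElement(old, 'id').text = new_page.get('id')
        for rev in new_page.iter('revision'):
            out = ET.SubElement(old, 'revision')
            ET.SubElement(out, 'id').text = rev.get('id')
            ET.SubElement(out, 'timestamp').text = rev.findtext('timestamp')
            ET.SubElement(out, 'text').text = rev.findtext('text')
        return old

    src = ET.fromstring(
        '<page id="42" action="edit"><title>Example</title>'
        '<revision id="5501"><timestamp>2013-04-20T12:34:56Z</timestamp>'
        '<text>New wikitext.</text></revision></page>')
    print(ET.tostring(to_old_format(src), encoding='unicode'))

A real converter would also have to carry contributors, comments, namespaces, and the rest of the export schema.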

Required deliverables

  • An XML format for incremental dumps
  • An XML format for the full dumps (aka the "new format")
  • Updates to all the maintenance/dump*.php scripts to support incremental dumps
  • A script to generate the incremental dumps (see the generation sketch after this list)
  • A script to merge incremental dumps with a full dump to produce a new dump
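
Generation might look roughly like the sketch below, assuming the 2013-era core schema: revision rows point at the text table via rev_text_id, and rows whose old_flags include 'external' store an External Storage URL instead of the text itself. The in-memory SQLite setup exists only so the sketch runs:

    import sqlite3

    LAST_DUMP = '20130401000000'  # MediaWiki binary(14) timestamp

    def fetch_from_external_storage(url):
        # Placeholder: the real script would look the blob up on the
        # External Storage cluster named in the URL (e.g. DB://cluster/id).
        return '<blob at %s>' % url

    def changed_revisions(cur):
        """Yield every revision saved since the last dump's cutoff."""
        cur.execute(
            """SELECT rev_page, rev_id, rev_timestamp, old_text, old_flags
                 FROM revision JOIN text ON old_id = rev_text_id
                WHERE rev_timestamp > ?""", (LAST_DUMP,))
        for page, rev, ts, text, flags in cur.fetchall():
            if 'external' in (flags or ''):
                text = fetch_from_external_storage(text)
            yield page, rev, ts, text

    # Tiny fake database so the query is demonstrable.
    con = sqlite3.connect(':memory:')
    con.executescript("""
        CREATE TABLE revision (rev_id, rev_page, rev_timestamp, rev_text_id);
        CREATE TABLE text (old_id, old_text, old_flags);
        INSERT INTO revision VALUES (12, 1, '20130420123456', 7);
        INSERT INTO text VALUES (7, 'DB://cluster24/9876', 'external,utf-8');
    """)
    print(list(changed_revisions(con.cursor())))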

If time permits

  • Dump parsing scripts for re-users for the "new format" to assist in transition
  • Look into different compression formats for optimization (see the sketch below)
  • SQL dumps?
  • ???
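
Comparing compression formats could start as simply as running the same excerpt of dump data through each stdlib codec and looking at the output sizes (the repetitive sample below just stands in for real dump text):

    import bz2, gzip, lzma

    sample = b'<revision><text>some wikitext</text></revision>' * 10000
    for name, codec in (('gzip', gzip), ('bz2', bz2), ('lzma', lzma)):
        print(name, len(codec.compress(sample)))

Decompression speed and random access (in the spirit of the multistream bz2 dumps) matter at least as much as raw size for dump consumers.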

Project schedule

  • Pre-stuff: Get familiar with the current process of how dumps are produced
  • May 28 - start
  • Come up with an XML format for the dumps - 1 week
    • Need a list of events that should be captured in this format (edits, moves, deletes, revdel); a first sketch follows this list
    • Try to keep compatibility with past formats
  • Write a simple Python script to generate incremental dumps (test on smaller wikis)
  • ...
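
A first pass at that event list, written down as data; the names are placeholders for whatever the format design settles on:

    # Change events an incremental dump entry must be able to represent.
    EVENTS = {
        'edit',    # a new revision of an existing page
        'move',    # page renamed; needs both the old and new titles
        'delete',  # a page (or selected revisions) removed
        'revdel',  # revision text/user/comment hidden via RevisionDelete
    }
    # Page creations and undeletions would probably need entries too.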

About you

I'm a long-time Wiki(p|m)edian who has been downloading and parsing dumps for years. I feel that the current system of dumps is bulky and hard for anyone to re-use or import. I'm a huge believer in the m:Right to fork, and this project would strengthen it greatly.

Participation

We don't just want to know what you plan to accomplish; we want to know how. Briefly describe your work style: how you plan to communicate progress, where you plan to publish your source code while you're working, how and where you plan to ask for help. (We will tend to favor applicants that demonstrate a clear vision for what it means to be an active participant in our development community.)

I'm always on IRC, either answering questions or asking them. I'm familiar with git/gerrit and the workflow of submitting a patch, getting feedback, fixing, and repeating.

Past open source experience

I've contributed code to MediaWiki and extensions before (gerrit patches), am a long-time MediaWiki bot author, and am a Manual:Pywikipediabot developer (commits), working on the rewrite project.

Any other info

I'm really not sure what to add here...
