User:Legoktm/GSoC 2013

Revision as of 08:41, 24 April 2013

Identity

  • Name: Kunal Mehta (aka Legoktm)
  • Email: legoktm.wikipedia@gmail.com
  • Project title: Incremental Data Dumps

Contact/working info

  • Timezone: CDT until mid-May, then PDT.
  • Typical working hours: I'm not sure, anytime really.
  • IRC or IM networks/handle(s): legoktm on freenode and just about everywhere else.

Project summary

Currently, when a new dump is generated, an entire one must be produced from scratch, which takes a very long time for large wikis (e.g. enwp) and forces consumers to download the entire dump again as well, even if 80% of the pages have not changed. To make the process much more efficient, it makes sense to produce dumps that contain only the content that has changed since the last dump was produced. A script could then be written to merge the incremental dumps into the last full dump, producing a complete dump for those who want it.
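
To make the merge idea concrete, here is a minimal sketch, assuming (purely for illustration) that both dumps have already been parsed into dictionaries keyed by page id and revision id; the real script would stream XML rather than hold everything in memory:

    def merge(full, incremental, deleted_pages=()):
        """Overlay an incremental dump onto the last full dump."""
        merged = {page: dict(revs) for page, revs in full.items()}
        for page in deleted_pages:  # pages deleted since the last dump
            merged.pop(page, None)
        for page, revs in incremental.items():
            merged.setdefault(page, {}).update(revs)  # new pages and revisions
        return merged

    full = {1: {10: 'old text'}, 2: {11: 'unchanged'}}
    incr = {1: {12: 'newer text'}, 3: {13: 'brand new page'}}
    print(merge(full, incr, deleted_pages=[2]))
    # {1: {10: 'old text', 12: 'newer text'}, 3: {13: 'brand new page'}}

Note that deletions have to be recorded in the incremental dump explicitly: the unchanged full dump has no other way to learn about them.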

This means that we need to design a new format for the full dumps as well as an additional format for the incremental ones. Scripts will need to be written to convert the new format into the old/current format to maintain backwards compatibility.
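
As a starting point for that design work, here is one possible shape for a single incremental entry; every element and attribute name below is hypothetical, since defining the real format is exactly what this project would do:

    import xml.etree.ElementTree as ET

    # Hypothetical layout: the dump names the cutoff it is relative to,
    # and each <page> records what happened to it since then.
    dump = ET.Element('incrementaldump', since='2013-04-01T00:00:00Z')
    page = ET.SubElement(dump, 'page', id='42', action='edit')
    ET.SubElement(page, 'title').text = 'Example'
    rev = ET.SubElement(page, 'revision', id='5501')
    ET.SubElement(rev, 'timestamp').text = '2013-04-20T12:34:56Z'
    ET.SubElement(rev, 'text').text = 'New wikitext for this revision.'
    print(ET.tostring(dump, encoding='unicode'))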

Deliverables

  • XML format for the incremental dumps
  • XML format for the full dumps (aka the "new format")
  • Script to create these dumps
    • Should work for both the text table and external storage
  • Script to convert full dumps from "old format" to "new format" (see the conversion sketch after this list)
    • This should be for both dump providers as well as dump consumers
  • Script to merge incremental dumps with the full one
    • This should be for both dump providers as well as dump consumers
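
For the backwards-compatibility direction mentioned in the summary, a converter mostly has to map new-format elements back onto the current export format's <page>/<revision> layout. A minimal sketch, reusing the hypothetical element names from the format sketch above and covering only a few of the real export fields:

    import xml.etree.ElementTree as ET

    def to_old_format(new_page):
        """Map one hypothetical new-format <page> onto the current layout."""
        old = ET.Element('page')
        ET.SubElement(old, 'title').text = new_page.findtext('title')
        ET.SubElement(old, 'id').text = new_page.get('id')
        for rev in new_page.iter('revision'):
            out = ET.SubElement(old, 'revision')
            ET.SubElement(out, 'id').text = rev.get('id')
            ET.SubElement(out, 'timestamp').text = rev.findtext('timestamp')
            ET.SubElement(out, 'text').text = rev.findtext('text')
        return old

    src = ET.fromstring(
        '<page id="42" action="edit"><title>Example</title>'
        '<revision id="5501"><timestamp>2013-04-20T12:34:56Z</timestamp>'
        '<text>New wikitext.</text></revision></page>')
    print(ET.tostring(to_old_format(src), encoding='unicode'))

A real converter would also have to carry contributors, comments, namespaces, and the rest of the export schema.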

Required deliverables

  • An XML format for incremental dumps
  • An XML format for the full dumps (aka the "new format")
  • Updates to all the maintenance/dump*.php scripts to support incremental dumps
  • A script to generate the incremental dumps (see the generation sketch after this list)
  • A script to merge incremental dumps with a full dump to produce a new dump
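
Generation might look roughly like the sketch below, assuming the 2013-era core schema: revision rows point at the text table via rev_text_id, and rows whose old_flags include 'external' store an External Storage URL instead of the text itself. The in-memory SQLite setup exists only so the sketch runs:

    import sqlite3

    LAST_DUMP = '20130401000000'  # MediaWiki binary(14) timestamp

    def fetch_from_external_storage(url):
        # Placeholder: the real script would look the blob up on the
        # External Storage cluster named in the URL (e.g. DB://cluster/id).
        return '<blob at %s>' % url

    def changed_revisions(cur):
        """Yield every revision saved since the last dump's cutoff."""
        cur.execute(
            """SELECT rev_page, rev_id, rev_timestamp, old_text, old_flags
                 FROM revision JOIN text ON old_id = rev_text_id
                WHERE rev_timestamp > ?""", (LAST_DUMP,))
        for page, rev, ts, text, flags in cur.fetchall():
            if 'external' in (flags or ''):
                text = fetch_from_external_storage(text)
            yield page, rev, ts, text

    # Tiny fake database so the query is demonstrable.
    con = sqlite3.connect(':memory:')
    con.executescript("""
        CREATE TABLE revision (rev_id, rev_page, rev_timestamp, rev_text_id);
        CREATE TABLE text (old_id, old_text, old_flags);
        INSERT INTO revision VALUES (12, 1, '20130420123456', 7);
        INSERT INTO text VALUES (7, 'DB://cluster24/9876', 'external,utf-8');
    """)
    print(list(changed_revisions(con.cursor())))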

If time permits

  • Dump parsing scripts for re-users for the "new format" to assist in transition
  • Look into different compression formats for optimization (see the sketch below)
  • SQL dumps?
  • ???
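
Comparing compression formats could start as simply as running the same excerpt of dump data through each stdlib codec and looking at the output sizes (the repetitive sample below just stands in for real dump text):

    import bz2, gzip, lzma

    sample = b'<revision><text>some wikitext</text></revision>' * 10000
    for name, codec in (('gzip', gzip), ('bz2', bz2), ('lzma', lzma)):
        print(name, len(codec.compress(sample)))

Decompression speed and random access (in the spirit of the multistream bz2 dumps) matter at least as much as raw size for dump consumers.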

Project schedule

  • Pre-stuff: Get familiar with the current process of how dumps are produced
  • May 28 - start
  • Come up with an XML format for the dumps - 1 week
    • Need a list of events that should be captured in this format (edits, moves, deletes, revdel); a first sketch follows this list
    • Try to keep compatibility with past formats
  • Write a simple Python script to generate incremental dumps (test on smaller wikis)
  • ...
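
A first pass at that event list, written down as data; the names are placeholders for whatever the format design settles on:

    # Change events an incremental dump entry must be able to represent.
    EVENTS = {
        'edit',    # a new revision of an existing page
        'move',    # page renamed; needs both the old and new titles
        'delete',  # a page (or selected revisions) removed
        'revdel',  # revision text/user/comment hidden via RevisionDelete
    }
    # Page creations and undeletions would probably need entries too.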

About you

I'm a long-time Wiki(p|m)edian who has been downloading and parsing dumps for years. I feel that the current system of dumps is bulky and hard for anyone to re-use or import. I'm a huge believer in the m:Right to fork, and this project would strengthen it greatly.

Participation

We don't just want to know what you plan to accomplish; we want to know how. Briefly describe your work style: how you plan to communicate progress, where you plan to publish your source code while you're working, how and where you plan to ask for help. (We will tend to favor applicants that demonstrate a clear vision for what it means to be an active participant in our development community.)

I'm always on IRC, either answering questions or asking them. I'm familiar with git/gerrit and the workflow of submitting a patch, getting feedback, fixing, and repeating.

Past open source experience

I've contributed code to MediaWiki and extensions before (gerrit patches), am a long-time MediaWiki bot author, and am a Manual:Pywikipediabot developer (commits), working on the rewrite project.

Any other info

I'm really not sure what to add here...
