Mediawiki-utilities/mwxml

This library contains a collection of utilities for efficiently processing MediaWiki’s XML database dumps. It aims to address two important concerns: the complexity and the performance of streaming XML parsing.


 * Complexity
 * Streaming XML parsing is gross. XML dumps consist of (1) some site metadata, (2) a collection of pages that contain (3) collections of revisions. This module lets you think about dump files in those terms and ignore the fact that you’re streaming XML: a [//pythonhosted.org/mwxml/iteration.html#mwxml.Dump mwxml.Dump] contains a [//pythonhosted.org/mwxml/iteration.html#mwxml.SiteInfo mwxml.SiteInfo] and an iterator of [//pythonhosted.org/mwxml/iteration.html#mwxml.Page mwxml.Page]s; a [//pythonhosted.org/mwxml/iteration.html#mwxml.Page mwxml.Page] contains page metadata and an iterator of [//pythonhosted.org/mwxml/iteration.html#mwxml.Revision mwxml.Revision]s; and a [//pythonhosted.org/mwxml/iteration.html#mwxml.Revision mwxml.Revision] contains revision metadata and text (see the sketch below).
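The iteration pattern looks roughly like the following minimal sketch, where the file path <code>dump.xml</code> is a placeholder and <code>mwxml.Dump.from_file()</code> is the library's documented constructor:

<syntaxhighlight lang="python">
import mwxml

# Build a Dump from an open XML file (from_file is the documented constructor)
dump = mwxml.Dump.from_file(open("dump.xml"))  # placeholder path

# Site metadata is available up front via the SiteInfo object
print(dump.site_info.name, dump.site_info.dbname)

for page in dump:            # iterate over mwxml.Page objects
    for revision in page:    # iterate over mwxml.Revision objects
        print(revision.id, len(revision.text or ""))
</syntaxhighlight>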


 * Performance
 * Performance is a serious concern when processing large XML database dumps. Regretfully, Python’s Global Interpreter Lock prevents threads from running on multiple CPUs, so this library provides [//pythonhosted.org/mwxml/map.html mwxml.map], a function that maps a dump-processing function over a set of dump files, using multiprocessing to distribute the work across multiple CPUs (see the sketch below).
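A minimal sketch of that pattern, closely following the library's documented example; the glob pattern and the per-revision work are placeholders:

<syntaxhighlight lang="python">
import glob
import mwxml

# Placeholder: one path per dump chunk to be processed in parallel
paths = glob.glob("/mnt/dumps/enwiki/*.xml.bz2")

def process_dump(dump, path):
    # Runs in a worker process; yielded values are collected in the parent
    for page in dump:
        for revision in page:
            yield revision.id

# mwxml.map distributes the dump files across multiple CPUs
for rev_id in mwxml.map(process_dump, paths):
    print(rev_id)
</syntaxhighlight>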

Examples

 * Extract link count changes (ipython/PAWS)
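As a rough illustration of what such an analysis might look like (this is not the linked notebook; the regex, the file path, and the output format are all assumptions), one can count <nowiki>[[...]]</nowiki> wikilinks in each revision and report the change from the previous revision:

<syntaxhighlight lang="python">
import re
import mwxml

# Naive wikilink matcher; a real analysis might parse links more carefully
LINK_RE = re.compile(r"\[\[[^\]]+\]\]")

dump = mwxml.Dump.from_file(open("dump.xml"))  # placeholder path

for page in dump:
    previous_count = 0
    for revision in page:
        link_count = len(LINK_RE.findall(revision.text or ""))
        print(page.title, revision.id, link_count - previous_count)
        previous_count = link_count
</syntaxhighlight>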