This library contains a collection of utilities for efficiently processing MediaWiki’s XML database dumps. There are two important concerns that this module intends to address: complexity and performance of streaming XML parsing.
- Streaming XML parsing is gross. XML dumps consist of (1) some site metadata, (2) a collection of pages that contain (3) collections of revisions. The module allows you to think about dump files in this way and ignore the fact that you’re streaming XML. A mwxml.Dump contains a mwxml.SiteInfo and an iterator of mwxml.Page objects. A mwxml.Page contains page metadata and an iterator of mwxml.Revision objects. A mwxml.Revision contains revision metadata and text.
- Performance is a serious concern when processing large XML database dumps. Regrettably, Python’s Global Interpreter Lock prevents us from running threads on multiple CPUs. This library provides mwxml.map(), a function that maps a dump-processing function over a set of dump files, using multiprocessing to distribute the work over multiple CPUs.
- Extract link count changes (ipython/PAWS)
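To make the pages-and-revisions model described above concrete, here is a minimal sketch of the kind of streaming loop the library spares you from writing. It uses only the Python standard library and is not mwxml's implementation: `SAMPLE_DUMP` and `iter_pages` are hypothetical names invented for this illustration, and the sample XML is a heavily simplified stand-in for a real dump (real dumps declare an XML namespace and carry many more fields). For brevity the sketch buffers each page's revisions in a list, whereas the text above describes mwxml.Page as exposing an iterator of revisions.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for a dump file.
SAMPLE_DUMP = """<mediawiki>
  <siteinfo><sitename>Example Wiki</sitename></siteinfo>
  <page>
    <title>Foo</title>
    <id>1</id>
    <revision><id>10</id><text>first</text></revision>
    <revision><id>11</id><text>second</text></revision>
  </page>
  <page>
    <title>Bar</title>
    <id>2</id>
    <revision><id>20</id><text>only</text></revision>
  </page>
</mediawiki>"""


def iter_pages(file_obj):
    """Yield (title, revisions) for one page at a time.

    revisions is a list of (revision_id, text) tuples. The file is parsed
    incrementally, so the whole document is never parsed in one step.
    """
    for _event, elem in ET.iterparse(file_obj, events=("end",)):
        if elem.tag == "page":
            title = elem.findtext("title")
            revisions = [
                (int(rev.findtext("id")), rev.findtext("text"))
                for rev in elem.iter("revision")
            ]
            yield title, revisions
            # Drop the page's children once the caller is done with it.
            elem.clear()


if __name__ == "__main__":
    dump_file = io.BytesIO(SAMPLE_DUMP.encode("utf-8"))
    for title, revisions in iter_pages(dump_file):
        for rev_id, text in revisions:
            print(title, rev_id, text)
```

Even in this toy form, the caller has to know tag names and manage element cleanup by hand. The Dump, Page, and Revision objects exist to hide that bookkeeping behind plain iteration.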