
Parsoid/DumpGrepper

From mediawiki.org

The dumpgrepper utility searches XML dumps for pages matching a regexp pattern. With a simple regexp, grepping an enwiki dump takes about 20 minutes.

The grepper operates on the actual wikitext, with the XML encoding already removed, so regexps do not need to account for XML entities. Patterns use JavaScript RegExp syntax.
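To illustrate why matching on decoded wikitext keeps patterns simple, here is a minimal sketch. It is not dumpgrepper's own code: `decodeXmlEntities` is a hypothetical helper written for this example, and the sample string is invented. It shows that a pattern written against plain wikitext fails on the XML-encoded form and succeeds once the entities are decoded.

```javascript
// Hypothetical helper (not part of dumpgrepper): decode the five
// predefined XML entities to recover the plain wikitext.
function decodeXmlEntities(text) {
  return text
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&quot;/g, '"')
    .replace(/&apos;/g, "'")
    .replace(/&amp;/g, "&"); // decoded last so "&amp;lt;" becomes "&lt;", not "<"
}

// Invented sample: wikitext as it might appear inside an XML dump.
const encoded = "&lt;ref name=&quot;a&quot;&gt;Smith &amp; Jones&lt;/ref&gt;";
const wikitext = decodeXmlEntities(encoded);

// A pattern written the way the wikitext looks in the editor.
const pattern = /<ref name="[^"]*">/;

console.log(wikitext);               // <ref name="a">Smith & Jones</ref>
console.log(pattern.test(encoded));  // false
console.log(pattern.test(wikitext)); // true
```

Without the decoding step, the same search would need a pattern such as `&lt;ref name=&quot;[^&]*&quot;&gt;`, which is harder to read and easier to get wrong.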

Installation

npm install -g dumpgrepper

Usage

bzcat /path/to/enwiki-latest-pages-articles.xml.bz2 | dumpgrepper '\| *link *='
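The pattern in the command above can be tried out in plain JavaScript before starting a long run over a dump. The sketch below uses invented sample lines and only demonstrates what the regexp itself matches; it does not reproduce dumpgrepper's output format.

```javascript
// Same pattern as in the shell command. In a JavaScript string literal the
// backslash must be doubled; on the command line the single quotes stop the
// shell from interpreting "|" and "*".
const pattern = new RegExp("\\| *link *=");

// Invented sample lines of wikitext.
const lines = [
  "[[File:Example.png|link=Main Page|20px]]", // match: "|link="
  "{{Infobox | link = Foo}}",                 // match: "| link ="
  "|  link=",                                 // match: several spaces after "|"
  "[[Main Page|a link to the main page]]",    // no match: "|" is followed by "a"
  "| links = 3",                              // no match: "links", not "link"
];

const matching = lines.filter((line) => pattern.test(line));
console.log(matching.length); // 3
```

The pattern allows any number of spaces around `link`, so it catches a `link` parameter however the wikitext is spaced, while the required `=` keeps parameters such as `links` out.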

See also

  • New 'insource' regexp search on wikitext of WMF wikis: Example query, Bug.
  • User:cscott made a hacked variant that lets you chain conditions, so you can say "pages with this but not that (optionally, on the same line)". See https://github.com/cscott/dumpgrepper. This was just a one-off for a particular wikitext migration; if it is more generally useful it could be cleaned up and merged.
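
The "this but not that" idea from the last item can be sketched as follows. This is one possible reading of the description, not the code from cscott's variant: both function names and the sample page are hypothetical, and the variant's actual options and semantics may differ.

```javascript
// Page-level check: the page matches `include` somewhere and `exclude` nowhere.
// Pass regexps without the "g" flag, because test() on a global regexp keeps
// state in lastIndex between calls.
function pageHasThisButNotThat(wikitext, include, exclude) {
  return include.test(wikitext) && !exclude.test(wikitext);
}

// Line-level check: at least one line matches `include` without also
// matching `exclude`.
function pageHasLineWithThisButNotThat(wikitext, include, exclude) {
  return wikitext
    .split("\n")
    .some((line) => include.test(line) && !exclude.test(line));
}

// Invented sample page: both lines use link=, only the second uses alt=.
const page = "[[File:A.png|link=Foo]]\n[[File:B.png|link=|alt=B]]";
const include = /\| *link *=/;
const exclude = /\| *alt *=/;

console.log(pageHasThisButNotThat(page, include, exclude));         // false
console.log(pageHasLineWithThisButNotThat(page, include, exclude)); // true
```

The two checks disagree on this sample: the page as a whole contains `alt=`, so the page-level check rejects it, but its first line has `link=` without `alt=`, so the line-level check accepts it.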