This page is a compilation of links, descriptions, and status reports of the various alternative MediaWiki parsers—that is, programs and projects, other than MediaWiki itself, which are able or intended to translate MediaWiki's text markup syntax into something else. Some of these have quite narrow purposes, while others are possible contenders for replacing the somewhat labyrinthine code that currently drives MediaWiki itself.
Many of the things linked here are likely to be out of date and under-maintained, or even abandoned. But in the interest of not duplicating the same work over and over, it seemed sensible to collect together what was "out there".
Fully featured round-tripping parser/runtime that powers VisualEditor on Wikipedia. Work is ongoing to provide an HTML-only read/edit interface and, later, to make it the default parser for MediaWiki. See the roadmap.
A fast, datamining-oriented C++ parser capable of processing the complete Wikipedia dump into XML and plain text in 2-3 hours. Open source, user friendly graphical interface. Windows installer available.
A Python library to convert wiki markup to a navigable string, which can be used to examine and manipulate templates. Written in pure Python, compatible with Python 2.7 and 3, with no dependencies.
Parses sections, templates with parameters, links, images, and categories; converts wiki tables to JS arrays and JS arrays back to wiki tables; and much more. You can modify parts of the wikitext, then regenerate the page with parsed.toString(). Runs on Node.js and in the browser.
Provides several accessor methods in an object tree to navigate to structural elements such as sections, tables, and links. Supports extracting table data as a list of lists. Available via pip; supports Python 3.
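To make "table data as a list of lists" concrete, here is a minimal hand-rolled sketch for simple tables only. It is not the API of the library described above: the function name `table_to_rows` and the sample data are hypothetical, and the sketch ignores cell attributes, nested tables, and cells that span several lines.

```python
def table_to_rows(wikitext):
    """Convert simple wiki tables into a list of rows (lists of cell strings)."""
    rows, current = [], None          # current is None while outside a table
    for raw in wikitext.splitlines():
        line = raw.strip()
        if line.startswith("{|"):     # table start
            current = []
        elif line.startswith("|}"):   # table end: flush the last row
            if current:
                rows.append(current)
            current = None
        elif current is None:         # text outside any table is skipped
            continue
        elif line.startswith("|-"):   # row separator
            if current:
                rows.append(current)
            current = []
        elif line.startswith("|+"):   # caption, not table data
            continue
        elif line.startswith("!"):    # header cells, separated by "!!"
            current.extend(c.strip() for c in line[1:].split("!!"))
        elif line.startswith("|"):    # data cells, separated by "||"
            current.extend(c.strip() for c in line[1:].split("||"))
    return rows


SAMPLE_TABLE = """\
{| class="wikitable"
|+ Parsers
! Name !! Language
|-
| Marker || Ruby
|-
| Gensim || Python
|}
"""

rows = table_to_rows(SAMPLE_TABLE)
```

A real library would also have to handle `rowspan`/`colspan` attributes and templates inside cells, which is where most of the complexity of wiki tables lies.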
Recursive-descent parser based on monadic parser combinators. Handles input that is not context-free, in particular the malformed HTML often found on Wikipedia.
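As an illustration of the general technique (not of this project's code), the following sketch shows monadic parser combinators driving a recursive-descent parse of a tiny wikitext subset: plain text and nested `{{template|arg}}` calls. Every name here (`unit`, `bind`, `literal`, `alt`, `many`, `text`, `template`, `document`) is hypothetical, and the sketch does not attempt the tolerant handling of malformed HTML mentioned above.

```python
# A parser is a function (s, i) -> (value, new_index), or None on failure.

def unit(value):
    """Monadic return: succeed with value without consuming input."""
    return lambda s, i: (value, i)

def bind(parser, f):
    """Monadic bind: run parser, then the parser that f builds from its value."""
    def run(s, i):
        result = parser(s, i)
        if result is None:
            return None
        value, j = result
        return f(value)(s, j)
    return run

def literal(token):
    """Match an exact string."""
    return lambda s, i: (token, i + len(token)) if s.startswith(token, i) else None

def alt(*parsers):
    """Ordered choice: the first alternative that succeeds wins."""
    def run(s, i):
        for parser in parsers:
            result = parser(s, i)
            if result is not None:
                return result
        return None
    return run

def many(parser):
    """Zero or more repetitions; stops when parser fails or consumes nothing."""
    def run(s, i):
        values = []
        while True:
            result = parser(s, i)
            if result is None or result[1] == i:
                return values, i
            values.append(result[0])
            i = result[1]
    return run

def text(s, i):
    """One or more characters that are not template syntax."""
    j = i
    while j < len(s) and s[j] not in "{|}":
        j += 1
    return (s[i:j], j) if j > i else None

def template(s, i):
    """'{{' name ('|' value)* '}}', where a value may contain templates."""
    arg = bind(literal("|"), lambda _: many(alt(template, text)))
    return bind(literal("{{"), lambda _:
           bind(text, lambda name:
           bind(many(arg), lambda args:
           bind(literal("}}"), lambda _:
           unit({"template": name.strip(), "args": args})))))(s, i)

document = many(alt(template, text))

tree, end = document("a {{b|c {{d}}|e}} f", 0)
```

The recursion lives in `template`, whose arguments are parsed with `template` again. This sketch simply stops at a stray `{`, `|`, or `}` at the top level; a production parser would need an error-recovery strategy instead.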
Well formed sequence of events, HTML/XHTML, other WikiMarkups
No
No
XWiki can be used as a full-fledged wiki supporting several WikiMarkups (including MediaWiki's markup). It also offers a standalone Rendering Engine that can be used as a Java library for parsing and rendering WikiMarkups. As of 2016/03, however, it cannot output MediaWiki format.
Stateful PEG parser based on Grako (Archived 2014-03-09 at the Wayback Machine), with a very clean separation of parsing stages, grammars and semantic transformations.
A portable .NET library that parses wikitext into an abstract syntax tree. For now it supports most of the common markup expressions except file links, double-underscored magic words, and tables.
Gensim is a robust open-source vector space modeling and topic modeling toolkit implemented in Python. Its segment_wiki script handles Wikipedia parsing and extraction.
Parse Wiki Text attempts to take all uncertainty out of parsing wiki text by converting it to another format that is easy to work with. The target format is Rust objects that can ergonomically be processed using iterators and match expressions.
Hand-crafted parser with template expansion, parser functions (core and extended), tag extensions (<ref>, <source>), wiki text parsing. Used for the WikiTaxi offline reader.
"JWPL (Java Wikipedia Library) is a free, Java-based application programming interface that allows to access all information contained in Wikipedia." "JWPL is for you: If you need structured access to Wikipedia in Java." The older parser is no longer maintained; JWPL now uses Sweble.
A simple client that enables you to query Wikipedia articles in English. The results are formatted in basic HTML. You can retrieve either a summary of an article (i.e. the part before the table of contents) or the full article.
YaCy is a search engine that includes a MediaWiki parser as one of its import modules. MediaWiki XML dumps are first converted to Dublin Core XML as an intermediate format and then inserted into the search index using the built-in Dublin Core importer.
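The first step of such a pipeline, converting dump pages into Dublin Core records, might look roughly like the sketch below. It is not YaCy's code (YaCy is written in Java), and the `records`/`record` wrapper elements, the choice of `dc:title`, `dc:date`, and `dc:description`, and the function name are assumptions made for illustration; only the MediaWiki export elements (`page`, `title`, `timestamp`, `text`) and the Dublin Core namespace URI are standard.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

def local(tag):
    """Strip the XML namespace so any export schema version is accepted."""
    return tag.rsplit("}", 1)[-1]

def dump_to_dublin_core(dump_xml):
    """Turn each <page> of a MediaWiki export into one Dublin Core record."""
    root = ET.fromstring(dump_xml)
    records = ET.Element("records")            # hypothetical wrapper element
    for page in root.iter():
        if local(page.tag) != "page":
            continue
        fields = {}
        for element in page.iter():
            name = local(element.tag)
            if name in ("title", "timestamp", "text"):
                # With several revisions per page, the last one wins.
                fields[name] = element.text or ""
        record = ET.SubElement(records, "record")
        ET.SubElement(record, f"{{{DC_NS}}}title").text = fields.get("title", "")
        ET.SubElement(record, f"{{{DC_NS}}}date").text = fields.get("timestamp", "")
        ET.SubElement(record, f"{{{DC_NS}}}description").text = fields.get("text", "")
    return ET.tostring(records, encoding="unicode")


SAMPLE_DUMP = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Example</title>
    <revision>
      <timestamp>2019-10-01T00:00:00Z</timestamp>
      <text>'''Example''' text.</text>
    </revision>
  </page>
</mediawiki>"""

dublin_core_xml = dump_to_dublin_core(SAMPLE_DUMP)
```

This version loads the whole document into memory, which is fine for a sample but not for a complete dump; a streaming approach such as `ET.iterparse` would be the natural replacement there. The raw wikitext is copied into `dc:description` unparsed, so a real importer would still need to strip or render the markup.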
Wiktionary parser. As of October 2019, downloads the article on-the-fly and parses "etymologies, definitions, pronunciations, examples, audio links and related words".
Primarily an offline reader for Wikimedia projects, with interwiki support. Libmwparser is a source-independent library that supports most of MediaWiki's syntax and some extensions such as math and gallery.
Marker is a Ruby implementation of a subset of the MediaWiki markup language, intended to bring MediaWiki's markup language to non-wiki applications with multiple output formats.
One of the common uses of alternative parsers is to dump wiki content into static form, such as HTML or PDF. Tim Starling has written a script which isn't a parser, but uses the MediaWiki internal code to dump an entire wiki to HTML, from the command-line. See Extension:DumpHTML. This has been used (years ago) to create the static dumps at https://dumps.wikimedia.org
There are also similar dumpers as part of the Kiwix project, for example mwoffliner, and you can query the RESTBase API to obtain HTML-format output with semantic information (such as transclusions) included.
MediaWiki lexer and MediaWiki flexer (not parsers as such, just grammar definitions; probably superseded by/within other projects below)
en:Wikipedia:Text editor support lists various scripts and extensions that provide features such as syntax highlighting for editors such as Emacs and Vim; some of these may include rudimentary parsing capabilities.
Here are some proof-of-concept rules for a subset of the MediaWiki markup: these are written in a metalanguage that treats preformatted text as source text, and everything else as comment.
Markup spec aims to produce a specification of MediaWiki's markup format.