Alternative parsers

From MediaWiki.org

This page compiles links, descriptions, and status reports for the various alternative MediaWiki parsers: programs and projects, other than MediaWiki itself, that are able or intended to translate MediaWiki's text markup syntax into something else. Some of these have quite narrow purposes; others are possible contenders for replacing the somewhat labyrinthine code that currently drives MediaWiki itself.

Many of the things linked here are likely to be out of date and under-maintained, even abandoned. But in the interest of not duplicating the same work over and over, it seemed sensible to collect together what was "out there".

Parsers that build an abstract syntax tree (AST) and provide access to it are listed under #Parsers providing an AST. Parsers that do not build an AST but extract some information are listed under #Parsers extracting some information. All remaining parsers are listed under #Other parsers.
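To make that distinction concrete, here is a minimal sketch in Python (standard library only). It is not taken from any parser listed on this page: the names `Section`, `build_section_tree`, and `extract_categories` are hypothetical, and the sketch assumes that only `== headings ==` and `[[Category:...]]` links matter. The first function builds a small tree that callers can navigate, which is the AST style; the second pulls out one kind of information without building any tree.

```python
import re
from dataclasses import dataclass, field

# Toy patterns for illustration only; real wikitext has many more cases.
HEADING_RE = re.compile(r"^(={1,6})\s*(.+?)\s*\1\s*$")
CATEGORY_RE = re.compile(r"\[\[Category:([^\]|]+)(?:\|[^\]]*)?\]\]")


@dataclass
class Section:
    """One node of a toy section tree (hypothetical, for illustration)."""
    title: str
    level: int
    text: list = field(default_factory=list)
    children: list = field(default_factory=list)


def build_section_tree(wikitext):
    """AST style: return a navigable tree of sections."""
    root = Section(title="", level=0)
    stack = [root]
    for line in wikitext.splitlines():
        match = HEADING_RE.match(line)
        if match:
            level = len(match.group(1))
            # Close every open section at the same or a deeper level.
            while stack[-1].level >= level:
                stack.pop()
            node = Section(title=match.group(2), level=level)
            stack[-1].children.append(node)
            stack.append(node)
        else:
            stack[-1].text.append(line)
    return root


def extract_categories(wikitext):
    """Extraction style: return category names only, with no tree."""
    return [name.strip() for name in CATEGORY_RE.findall(wikitext)]


sample = """Intro text.
== History ==
Early days.
=== Origins ===
Details.
== Usage ==
[[Category:Parsers]]
[[Category:Wikitext|sort key]]
"""

tree = build_section_tree(sample)
print([child.title for child in tree.children])
print([child.title for child in tree.children[0].children])
print(extract_categories(sample))
```

A tree of this kind can be walked, edited, and in principle serialized back to markup, whereas the extraction function discards everything except the values it was written to find.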

Parsers providing an AST

| Name and link | Principal author(s) | Language | Input | Output | Complete implementation | Can convert output back to markup | Comments / other info | License |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Parsoid | Gabriel Wicke and the Parsoid / Visual editor team | PEG / Node.js | markup, XML dumps, test cases | tokens, HTML5 DOM with RDFa and round-trip data | Yes | Yes | Fully-featured round-tripping parser/runtime that powers the Visual editor on Wikipedia. Work is ongoing to provide an HTML-only read/edit interface, and later to become the default parser for MediaWiki. See roadmap. | GPLv2+ |
| Wiki Parser | Dizzy Logic | C++ | XML dumps | Syntax tree in XML, plain text | No | No | A fast, datamining-oriented C++ parser capable of processing the complete Wikipedia dump into XML and plain text in 2-3 hours. Open source, with a user-friendly graphical interface. Windows installer available. | MIT License |
| mwparserfromhell | The Earwig | Python | markup | AST | almost | Yes | A Python library to convert wiki markup to a navigable string, which can be used to examine and manipulate templates. Written in pure Python, compatible with Python 2.7 and 3, with no dependencies. | MIT License |
| wtf_wikipedia | Spencer Kelly | JavaScript | markup | JSON | almost | No | Supports recursive links and templates, parses infoboxes and links, resolves special templates, parses images and categories. Runs server-side and in the browser. | MIT |
| Sweble Wikitext Parser | Hannes Dohrn | Java | markup | Abstract syntax tree, XML, HTML | almost | ? | Claims to be very thorough. There are three papers surrounding the Sweble Wikitext Parser. | Apache License 2.0 |
| wikitextparser | 5j9 | Python | markup | AST | almost | Yes | Provides several accessor methods in an object tree to navigate to structural elements like sections, tables, links etc. Supports extracting table data as a list of lists. Available via pip, supports Python 3. | GPLv3 |
| mwlib | PediaPress.com | Python with C library | markup and other | parse tree, HTML, PDF, XML, OpenDocument | No | ? | Used by MediaWiki's "Print/export" feature, see Reading/Web/PDF Functionality. | BSD |
| wb2pdf | Dirk Hünniger | Haskell | online article | LaTeX, PDF, Parse Tree | No | ? | Recursive descent based on monadic parser combinators. Allows for non-context-free input, especially the poorly formed HTML often found on Wikipedia. | GPL |
| XWiki Rendering Framework | XWiki dev team | Java | various WikiMarkups | Well formed sequence of events, HTML/XHTML, other WikiMarkups | No | No | XWiki can be used as a full-fledged wiki supporting several WikiMarkups (including MediaWiki's markup). It also offers a standalone Rendering Engine that can be used as a Java library for parsing/rendering WikiMarkups. Cannot output to MediaWiki format as of 2016/03, though. | LGPL |
| mediawiki-parser | Peter Potrowl, Erik Rose | Python | markup | XHTML, raw text, AST | No | No | GSoC-2011 project; the use of a PEG parser makes it easy to improve. Parser functions are not supported yet. | GPLv3 |
| smc.mw | Marcus Brinkmann | Python | markup | AST, HTML | No | No | Stateful PEG parser based on Grako[dead link], with a very clean separation of parsing stages, grammars and semantic transformations. | BSD |
| Pandoc | John MacFarlane | Haskell | markup | many & AST | No | not identical | Can convert a subset of MediaWiki markup to ~35 different formats (5 of which are flavors of markdown). | GPLv2 |
| MwParserFromScratch | CXuesong | C# | markup | AST | No | Yes | A portable .NET library that parses wikitext into an abstract syntax tree. For now it supports most of the common markup expressions except file links, double-underscored magic words, and tables. | Apache License |
| gensim.segment_wiki | RaRe Technologies | Python | MediaWiki XML | JSON | No | No | Gensim is a robust open-source vector space modeling and topic modeling toolkit implemented in Python; segment_wiki is its script for Wikipedia parsing and extraction. | LGPLv2.1 |
| parse_wiki_text | Fredrik Portström | Rust | markup | AST | No | No | Parse Wiki Text attempts to take all uncertainty out of parsing wiki text by converting it to another format that is easy to work with. The target format is Rust objects that can ergonomically be processed using iterators and match expressions. | modified MIT |

Proprietary

| Name and link | Principal author(s) | Language | Input | Output | Complete implementation | Comments / other info | License |
| --- | --- | --- | --- | --- | --- | --- | --- |
| WikiTaxi | Ralf Junker | Delphi / Pascal | MediaWiki markup, page or fragment | Node-tree, HTML, potentially others | almost | Hand-crafted parser with template expansion, parser functions (core and extended), tag extensions (<ref>, <source>), wiki text parsing. Used for the WikiTaxi offline reader. | No sources available |

Abandoned

| Name and link | Principal author(s) | Language | Input | Output | Complete implementation | Comments / other info | License |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DKPro JWPL parser | Torsten Zesch, Richard Eckart de Castilho, Oliver Ferschke, Elisabeth Niemann | Java | XML dump | API to access pages, outlinks, inlinks and more | No | "JWPL (Java Wikipedia Library) is a free, Java-based application programming interface that allows to access all information contained in Wikipedia." "JWPL is for you: If you need structured access to Wikipedia in Java." The older parser is no longer maintained; JWPL uses Sweble now. | LGPL |
| FlexBisonParse | Timwi | flex, bison and C | markup fragment | Custom XML | No | Intended as an eventual replacement for the parsing code inside MediaWiki itself. | |
| Wiki2XML | Magnus Manske | C++ | markup fragment (?) | Custom XML | No | Another aborted project on the way to 'flexbisonparse'. | |
| sanskrit-coders/wiki-tools | Vishvas Vasuki | Scala | MediaWiki text | MediaWiki text and section tree | No | Parses only MediaWiki sections, nothing else. One can parse a wiki page with multiple sections, get a section tree, and add, access and delete sections. | Creative Commons |
| Perl Wikipedia Toolkit[dead link] | Michal Jurosz | Perl | XML dump, SQL dump | Own parse tree, MediaWiki markup | No | Perl Wikipedia Toolkit developed for computer-assisted Wikipedia translation. (Little functionality.) | |
| WikiOnCD[dead link] | Andrew Rodland | Perl | SQL dump or markup | HTML, Parse tree (eventually?) | No | Started out as an offline wiki browser, but grew a parser when Wiki2static turned out to be too limiting. No web presence yet; code is in the SVN. | GPL |
| WikiPress Publisher[dead link] | Erwin Jurschitza | Delphi 7 | XML dump | DocBook XML, Digibib XML, HTML | No | Used for the German DVD; generates lists of bad markup. | No sources available |
| Saya.Parser.Wiki[dead link] | Nana Sakisaka | C++ | markup | Abstract syntax tree | No | Pure C++11 parser implemented with Boost.Spirit.Qi. | Boost Software License 1.0 |

Parsers extracting some information

| Name and link | Principal author(s) | Language | Input | Output | Complete implementation | Comments / other info | License |
| --- | --- | --- | --- | --- | --- | --- | --- |
| PHP-Wikipedia-Syntax-Parser | Don Wilson | PHP | markup | Associative array | No | Parses top-level sections, w:Wikipedia:Persondata, infoboxes, external links, categories, and interlanguage links. | GPL |
| Wiki-infobox-parser | Zhipeng Jiang | JavaScript | markup | JSON | No | A light Wikipedia Infobox Parser written in JavaScript. | MIT |

Other parsers

| Name and link | Principal author(s) | Language | Input | Output | Complete implementation | Comments / other info | License |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Mylyn WikiText | David Green | Java | Local files | HTML, DocBook, Eclipse Help, DITA, extensible | No | Integration with Ant and Eclipse runtime. | EPL |
| wikipedia-js | kenshiro_o | Node.js | markup | HTML | No | A simple client that enables you to query Wikipedia articles in English. The results are formatted in basic HTML. You can retrieve either a summary of an article (i.e. before the table of contents) or a full article. | MIT |
| WikiExtractor | Giuseppe Attardi, Antonio Fuschetto | Python | XML dumps | text | No | Simple and fast tool for extracting plain text from Wikipedia dumps. It performs template expansion and handles parser functions (core and extended). | GPL |
| Mediawiki2HTML Machine | Johannes Buchner | PHP | markup | HTML | No | Project for parsing without the MediaWiki engine. | AGPL3 + any later version |
| Java API (Bliki engine) | axelclk | Java | markup fragment | HTML, PDF | almost | Java Wikipedia API (supports ParserFunctions, Lua/Scribunto...). | EPLv1.0 or GPLv2.1+ |
| WikiCloth | nricciar | Ruby | markup | HTML | No | Ruby implementation of the MediaWiki markup language, including a fair amount of the parser functions. | MIT |
| YaCy | YaCy dev team | Java | XML dump | XML with Dublin Core Metadata | No | YaCy is a search engine, and a MediaWiki parser is included as one of the import modules. MediaWiki XML dumps are first converted to Dublin Core XML as an intermediate format and then inserted into the search index using the built-in Dublin Core importer. | GPL |

Abandoned

| Name and link | Principal author(s) | Language | Input | Output | Complete implementation | Comments / other info | License |
| --- | --- | --- | --- | --- | --- | --- | --- |
| libmwparser | Saitmoh | C | XML dumps, markup | XML, XHTML, Expanded WikiText | almost | Primarily an offline reader for Wikimedia wikis, with interwiki support. Libmwparser is a source-independent library which supports most of MediaWiki syntax and some extensions like math or gallery. | GPL |
| Wiky.php | Toni Lähdekorpi | PHP, Regular Expressions | markup | HTML | No | A tiny PHP library that uses only regular expressions to convert wiki markup to HTML. | Apache License/GPL/LGPL/MPL/CC |
| Wiky | Tanin Na Nakorn | Ruby | markup | HTML | No | A simple Ruby library to convert wiki markup to HTML. | Apache License |
| Wiky.js | Tanin Na Nakorn | JavaScript | markup | HTML | No | A simple JavaScript library to convert wiki markup to HTML (limited subset). | Apache License |
| txtwiki.js | Joao Sa | JavaScript | markup | Text | No | A JavaScript library to convert MediaWiki markup to plaintext. | MIT License |
| mw2html | Connelly Barnes | Python | Wiki URL | HTML | No | Minimal setup; gets the basic job of creating a static copy of the wiki done. | Public Domain |
| PHP5 WP | Dan Goldsmith | PHP | markup | HTML | No | Parser with a plugin framework for adding additional syntax. Configurable for alternative markup, e.g. PmWiki. | MPL 2.0 |
| JAMWiki | Ryan | Java | JAMWiki front-end | HTML | No | Java wiki engine that supports MediaWiki syntax. The roadmap also calls for XML import and export that will be compatible with MediaWiki. | LGPLv2 |
| InstaView | Pilaf | JavaScript | markup fragment | HTML | No | Provides instant preview while editing a page (without reloading). | BSD |
| InstaView | C. Scott Ananian | JavaScript | markup fragment | HTML | No | Port of Pilaf's code to node.js, volo, and the browser. | BSD |
| Tero-dump | Tero Karvinen | ? | Local wiki installation, including MySQL, PHP, web server | HTML | No | Scripts for grabbing the whole wiki; does not include images. | |
| Text_Wiki_Mediawiki | Multiple | PHP | markup | HTML, LaTeX, Plain text | No | Part of the Text_Wiki library. | LGPL |
| TomeRaider export | Erik Zachte | Perl | XML dump | TomeRaider database | No | See en:Wikipedia:TomeRaider database for more details. | |
| Waikiki | Magnus Manske | C++ | SQL dump (via SQLite) | HTML | No | Abandoned in favour of "flexbisonparse", but has been used inside some experimental "front ends". | |
| Wikiwyg[dead link] | Jim Higson | JavaScript | A live installation of MediaWiki | HTML (via XML) | No | More than just a parser; attempts to create a fully functional client-side interface. | |
| wik2dict | Guaka | Python | SQL dump | DICT | No | | |
| wiki2pdf | Stephan Walter | Python (and PHP) | markup fragment or set of online articles | LaTeX, PDF | No | Project is incomplete and dormant. | |
| WikiPDF | Felipe Sanches | Python (and PHP) | One selected article | LaTeX based on templates, PDF | No | MediaWiki extension that uses Stephan Walter's wiki2pdf as backend. | |
| Wikifilter | ? | C++ (VS) | XML dumps | HTML | No | A Windows program that uses Apache/IIS to serve the pages. Abandoned in 2006, before ParserFunctions were available. | |
| Wikipedia Dump Reader | Benjamin Thyreau | Python | XML dumps | On screen | No | Cross-platform viewer. | GPLv2/~BSD license |
| Marker | Ryan Blue | Ruby | markup (subset) | HTML or formatted text | No | Marker is a Ruby implementation of a subset of the MediaWiki markup language, intended to bring MediaWiki's markup language to non-wiki applications with multiple output formats. | GPL |
| Kiwi | Thomas Luce, Karl Matthias, AboutUs.org | C, Ruby, PEG | markup | HTML | almost | Kiwi is a PEG-based C implementation with Ruby bindings and a command line parser. It is very fast and supports most of the MediaWiki syntax. | BSD |
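
Several of the small converters listed above (Wiky.php is described as using only regular expressions) work by direct pattern substitution rather than by building a tree. The following Python sketch illustrates that general approach under stated assumptions; it is not code from any of those projects, the function name `wikitext_to_html` is hypothetical, and it covers only headings, bold, italics, and internal links.

```python
import html
import re


def wikitext_to_html(wikitext):
    """Convert a tiny wikitext subset to HTML using regular expressions only."""
    out = []
    for line in wikitext.splitlines():
        # Escape &, < and > first; apostrophes are left alone for '' markup.
        line = html.escape(line, quote=False)
        # Bold must be handled before italics, because ''' contains ''.
        line = re.sub(r"'''(.+?)'''", r"<b>\1</b>", line)
        line = re.sub(r"''(.+?)''", r"<i>\1</i>", line)
        # [[Target|label]] and [[Target]]; href holds the raw page title,
        # since resolving titles to real URLs is outside this sketch.
        line = re.sub(r"\[\[([^\]|]+)\|([^\]]+)\]\]", r'<a href="\1">\2</a>', line)
        line = re.sub(r"\[\[([^\]|]+)\]\]", r'<a href="\1">\1</a>', line)
        heading = re.match(r"^(={1,6})\s*(.+?)\s*\1\s*$", line)
        if heading:
            level = len(heading.group(1))
            out.append(f"<h{level}>{heading.group(2)}</h{level}>")
        elif line.strip():
            out.append(f"<p>{line}</p>")
    return "\n".join(out)


print(wikitext_to_html("== Title ==\nSome '''bold''' and ''italic'' text with a [[Main Page|link]]."))
```

Substitution of this kind is easy to write but has no notion of nesting or context, which is one reason such tools typically support only a limited subset of the markup (templates, tables, and parser functions are not handled here at all).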

A non-parser dumper

One of the common uses of alternative parsers is to dump wiki content into static form, such as HTML or PDF. Tim Starling has written a script which isn't a parser, but uses the MediaWiki internal code to dump an entire wiki to HTML from the command line. See Extension:DumpHTML. This has been used (years ago) to create the static dumps at https://dumps.wikimedia.org.

There are also similar dumpers as part of the Kiwix project, for example mwoffliner, and you can query the RESTBase API to obtain HTML-format output with semantic information (such as transclusions) included.
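
As a rough illustration of the REST approach, the Python sketch below builds a request URL and, when called, fetches the HTML for one page. The endpoint path `/api/rest_v1/page/html/{title}`, the default host, and the User-Agent string are assumptions made for this example and are not stated on this page; check the current API documentation of the wiki you are querying before relying on them. The helper names `rest_html_url` and `fetch_html` are hypothetical.

```python
from urllib.parse import quote
from urllib.request import Request, urlopen


def rest_html_url(title, host="en.wikipedia.org"):
    """Build the REST URL assumed to return HTML for a page title."""
    # Titles use underscores for spaces; safe="" also percent-encodes "/".
    encoded = quote(title.replace(" ", "_"), safe="")
    return f"https://{host}/api/rest_v1/page/html/{encoded}"


def fetch_html(title, host="en.wikipedia.org"):
    """Fetch the HTML for a page (needs network access; not called below)."""
    request = Request(
        rest_html_url(title, host),
        # Hypothetical User-Agent; replace with one that identifies your tool.
        headers={"User-Agent": "alt-parsers-example/0.1"},
    )
    with urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8")


print(rest_html_url("Main Page"))
```

Only the URL construction runs here; calling `fetch_html("Main Page")` would perform the actual request.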

Related topics