Alternative parsers
This page compiles links, descriptions, and status reports for the various alternative MediaWiki parsers: programs and projects, other than MediaWiki itself, that are able or intended to translate MediaWiki's text markup syntax into something else. Some of these have quite narrow purposes; others are possible contenders for replacing the somewhat labyrinthine code that currently drives MediaWiki itself.
Many of the things linked here are likely to be out of date and under-maintained, even abandoned. But in the interest of not duplicating the same work over and over, it seemed sensible to collect together what was "out there".
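To make "translating MediaWiki's markup into something else" concrete, here is a minimal sketch of the regular-expression approach that the smallest converters listed below describe. It is not taken from any of those projects. The function name `wiki_to_html`, the `/wiki/` link prefix, and the choice of supported constructs (headings, bold, italic, internal links, paragraphs) are all assumptions made for illustration, and real wikitext (templates, nested lists, tables) needs far more than this.

```python
import html
import re


def _link(match):
    # [[Target|label]] or [[Target]]; the "/wiki/" prefix is an assumed URL layout.
    target = match.group(1).strip()
    label = (match.group(2) or target).strip()
    href = "/wiki/" + target.replace(" ", "_")
    return f'<a href="{href}">{label}</a>'


def _inline(text):
    # Bold must be handled before italic, because ''' contains ''.
    text = re.sub(r"'''(.+?)'''", r"<b>\1</b>", text)
    text = re.sub(r"''(.+?)''", r"<i>\1</i>", text)
    return re.sub(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]", _link, text)


def wiki_to_html(markup):
    """Convert a very small subset of MediaWiki markup to HTML (illustrative only)."""
    out = []
    for raw in markup.splitlines():
        # Escape &, < and > but keep apostrophes, which the markup rules rely on.
        line = html.escape(raw.strip(), quote=False)
        if not line:
            continue  # blank lines only separate paragraphs here
        heading = re.fullmatch(r"(={1,6})\s*(.+?)\s*\1", line)
        if heading:
            level = len(heading.group(1))
            out.append(f"<h{level}>{_inline(heading.group(2))}</h{level}>")
        else:
            out.append(f"<p>{_inline(line)}</p>")
    return "\n".join(out)


if __name__ == "__main__":
    print(wiki_to_html("== Intro ==\n'''Bold''' and ''italic'' with [[Main Page|a link]]."))
```

A line-by-line regex pass like this is easy to write but cannot track nesting or context, which is one reason several of the projects below use grammar-based or hand-written parsers instead.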
Related topics
- One-pass parser
- MediaWiki lexer and MediaWiki flexer (not parsers as such, just grammar definitions; probably superseded by/within other projects below)
- en:Wikipedia:Text editor support lists scripts and extensions that add features such as syntax highlighting to editors like Emacs and Vim; some of these may include rudimentary parsing capabilities.
- Here are some proof-of-concept rules for a subset of the MediaWiki markup: these are written in a metalanguage that treats preformatted text as source text, and everything else as comment.
- Markup spec aims to produce a specification of MediaWiki's markup format.
- Help:Extension:ParserFunctions is the main parser extension for MediaWiki.
A non-parser dumper
One of the common uses of alternative parsers is to dump wiki content into static form, such as HTML or PDF. Tim Starling has written a script which isn't a parser, but uses the MediaWiki internal code to dump an entire wiki to HTML from the command line. See Extension:DumpHTML. This was used years ago to create the static dumps at http://static.wikipedia.org
Known implementations
| Name and link | Principal author(s) | Language | Input | Output | Comments / other info | License |
|---|---|---|---|---|---|---|
| Wiky.php | Toni Lähdekorpi | PHP, Regular Expressions | Markup | HTML | A tiny PHP library that uses only regular expressions to convert Wiki markup to HTML | Apache License/GPL/LGPL/MPL/CC |
| Wiky | Tanin Na Nakorn | Ruby | Markup | HTML | A simple Ruby library to convert Wiki markup to HTML | Apache License |
| Wiky.js | Tanin Na Nakorn | JavaScript | Markup | HTML | A simple JavaScript library to convert Wiki markup to HTML (limited subset) | Apache License |
| txtwiki.js | Joao Sa | JavaScript | Markup | Text | A JavaScript library to convert MediaWiki markup to plain text | MIT License |
| wikipedia-js | kenshiro_o | Node.js | Markup | HTML | A simple client that lets you query Wikipedia articles in English. The results are formatted in basic HTML. You can retrieve either a summary of an article (i.e. the part before the table of contents) or a full article | MIT |
| WikiExtractor | Giuseppe Attardi, Antonio Fuschetto | Python | SQL dump | text | Simple and fast tool for extracting plain text from Wikipedia dumps | GPL |
| mw2html | Connelly Barnes | Python | Wiki url | HTML | Minimal setup; does the basic job of creating a static copy of the wiki | |
| mwlib | PediaPress.com | Python with C library | Markup and other | parse tree, HTML, PDF, XML, OpenDocument | Part of cooperation between Wikimedia Foundation and PediaPress | BSD |
| Mediawiki2HTML Machine | Johannes Buchner | PHP | Markup | HTML | Project for parsing without the MediaWiki engine. | |
| PHP5 WP | Dan Goldsmith | PHP | Markup | HTML | Parser with a plugin framework for adding additional syntax. Configurable for alternative markup, e.g. PmWiki | MPL 2.0 |
| Mylyn WikiText | David Green | Java | Local files | HTML, DocBook, Eclipse Help, DITA, extensible | Integration with Ant and Eclipse runtime | |
| Java API (Bliki engine) and Eclipse Plugin | axelcl | Java | Markup fragment (supports ParserFunctions) | On-screen preview, HTML, PDF | Java Wikipedia API and a plugin for the Eclipse IDE for assisted editing of Wikipedia (and anything else MediaWiki-based) | |
| FlexBisonParse | Timwi | flex, bison and C | Markup fragment | Custom XML | Intended as an eventual replacement to the parsing code inside MediaWiki itself | |
| JAMWiki | Ryan | Java | JAMWiki front-end | HTML | Java Wiki engine that supports MediaWiki syntax. The roadmap also calls for XML import and export that will be compatible with MediaWiki. | |
| InstaView | Pilaf | JavaScript | Markup fragment | HTML | Provides instant preview while editing a page (without reloading). | |
| InstaView | C. Scott Ananian | JavaScript | Markup fragment | HTML | Port of Pilaf's code to node.js, volo, and the browser. | |
| Magnus' magic wiki-to-XML converter | Magnus Manske | PHP | Markup fragment or list of article titles | Custom XML, plain text, DocBook XML | Feature-complete parser (except math and timeline); pure PHP, so slow but portable. Can directly generate PDFs if DocBook infrastructure is installed | |
| Perl Wikipedia Toolkit | Michal Jurosz | Perl | XML dump, SQL dump | Own parse tree, MediaWiki markup | Perl Wikipedia Toolkit developed for computer-assisted Wikipedia translation. (Limited functionality.) | |
| Text_Wiki_Mediawiki | Multiple | PHP | Markup | HTML, Latex, Plain text | Part of the Text_Wiki library | |
| TomeRaider export | Erik Zachte | Perl | XML dump | TomeRaider database | See en:Wikipedia:TomeRaider database for more details | |
| Waikiki | Magnus Manske | C++ | SQL dump (via SQLite) | HTML | Abandoned in favour of "flexbisonparse", but has been used inside some experimental "front ends" | |
| Wikiwyg | Jim Higson | JavaScript | A live installation of MediaWiki | HTML (via XML) | More than just a parser; attempts to create a fully functional client-side interface | |
| wik2dict | Guaka | Python | SQL dump | DICT | | |
| wiki2pdf | Stephan Walter | Python (and PHP) | Markup fragment or set of online articles | LaTeX, PDF | Project is incomplete and dormant | |
| wb2pdf | Dirk Hünniger | Haskell | online article | LaTeX, PDF, Parse Tree | Recursive descent based on monadic parser combinators. Allows for non-context-free input, especially the malformed HTML often found on Wikipedia | GPL |
| WikiPDF | Felipe Sanches | Python (and PHP) | One selected article | LaTeX based on templates, PDF | MediaWiki extension that uses Stephan Walter's wiki2pdf as backend. | |
| Wiki2XML | Magnus Manske | C++ | Markup fragment (?) | Custom XML | Another aborted project on the way to 'flexbisonparse' | |
| HTML2FPDF | Renato A. C. | PHP | A PHP class that transforms HTML into a feed for FPDF resulting in a PDF file | HTML -> HTML2FPDF -> FPDF -> PDF | Not specifically for MediaWiki, but easy to install using an updated version of this tool: updated html2fpdf.php. See HTML2FPDF and Mediawiki for more instructions. | |
| WikiOnCD | Andrew Rodland | Perl | SQL Dump or markup | HTML, Parse tree (eventually?) | Started out as an offline wiki browser, but grew a parser when Wiki2static turned out to be too limiting. No web presence yet; code is in the SVN. | GPL |
| WikiTaxi | Ralf Junker | Delphi / Pascal | MediaWiki markup, page or fragment | Node-tree, HTML, potentially others | Hand-crafted parser with template expansion, parser functions (core and extended), tag extensions (<ref>, <source>), wiki text parsing. Used for the WikiTaxi offline reader. | no sources available |
| Wikifilter | ? | C++ (VS) | XML dumps | HTML | A Windows program that uses Apache/IIS to serve the pages. Abandoned in 2006, before ParserFunctions were available. | |
| Wikipedia Dump Reader | Benjamin Thyreau | Python | XML dumps | On screen | Cross platform viewer | GPLv2/~BSD license |
| WikiModel | MikhailKotelnikov | Java | Various WikiMarkups | Well formed sequence of events. Event listeners for XML/XHTML generation. Extensible. | WikiModel is a set of tools and a common API used to work with various wiki markups like CommonSyntax, Creole, MediaWiki, Confluence, JSPWiki, XWiki... WikiModel contains JavaCC-based validating and fixing parsers producing a guaranteed well-formed set of events (like SAX for XML documents). Contains an XHTML serializer. | Apache License |
| Marker | Ryan Blue | Ruby | Markup (subset) | HTML or formatted text | Marker is a Ruby implementation of a subset of the MediaWiki markup language, intended to bring MediaWiki's markup language to non-wiki applications with multiple output formats. | GPL |
| WikiCloth | nricciar | Ruby | Markup | HTML | Ruby implementation of the MediaWiki markup language, including a fair amount of the parser functions. | MIT |
| XWiki | XWiki dev team | Java | Various WikiMarkups | Well formed sequence of events, HTML/XHTML, other WikiMarkups | XWiki can be used as a full-fledged wiki supporting several WikiMarkups (including MediaWiki's markup). It also offers a standalone Rendering Engine that can be used as a Java library for parsing/rendering WikiMarkups. | LGPL |
| Kiwi | Thomas Luce, Karl Matthias, AboutUs.org | C, Ruby, PEG | Markup | HTML | Kiwi is a PEG-based C implementation with Ruby bindings and a command line parser. It is very fast and supports most of the MediaWiki syntax. Actively developed. | BSD |
| YaCy | YaCy dev team | Java | XML Dump | XML with Dublin Core Metadata | YaCy is a search engine and a MediaWiki parser is included as one of the import modules. MediaWiki xml dumps are first converted to Dublin Core XML as intermediate format and then inserted into the search index using the built-in Dublin Core importer. | GPL |
| MessageParser | Neil Kandalgaonkar | JavaScript | Markup | Abstract syntax tree, jQuery object, HTML | Designed for use with message strings, to allow enhanced interface in the browser, like pluralizing internationalized messages or attaching jQuery behaviour to links within a message | GPL |
| Sweble Wikitext Parser | Hannes Dohrn | Java | Markup | Abstract syntax tree, XML, HTML | Claims to be very thorough. | Apache License 2.0 |
| JWPL api | Torsten Zesch, Richard Eckart de Castilho, Oliver Ferschke, Elisabeth Niemann | Java | XML Dump | API to access pages, outlinks, inlinks and more | "JWPL (Java Wikipedia Library) is a free, Java-based application programming interface that allows to access all information contained in Wikipedia." "JWPL is for you: If you need structured access to Wikipedia in Java." | LGPL |
| libmwparser | Saitmoh | C | XML dumps, Markup | XML, XHTML, Expanded WikiText | Primarily a Wikimedia offline reader with interwiki support. Libmwparser is a source-independent library that supports most MediaWiki syntax and some extensions such as math or gallery | GPL |
| mediawiki-parser | Peter Potrowl, Erik Rose | Python | Markup | XHTML, raw text, AST | GSoC-2011 project; the use of a PEG parser makes it easy to improve. Parser functions are not supported yet. | GPL |
| Parsoid | Gabriel Wicke and the Parsoid / Visual editor team | PEG / JavaScript / Node.js | Markup, XML dumps, test cases | Tokens, HTML5 DOM with microdata and round-trip data | Wikimedia project in support of the Visual editor project. Support for template fetching and expansion and for parser functions (work in progress). Extension support planned via call-back to PHP parser. | GPL |
| mwparserfromhell | The Earwig | Python | Markup | AST | A Python library to convert Wiki markup to a navigable string, which can be used to examine and manipulate templates. Written in pure Python, compatible with Python 2.7 and 3, and no dependencies. | MIT License |
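Several entries in the table above take an XML dump as input. As a rough sketch of what that first step looks like, the following reads `(title, wikitext)` pairs from a MediaWiki XML export using only the Python standard library. The helper names `local_name` and `iter_pages` are hypothetical, the sample document is made up, and the code deliberately ignores the XML namespace so that it does not depend on a particular export schema version. If a page carries several revisions, only the text of the last one is kept.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    # "{namespace}page" -> "page"; tags without a namespace pass through unchanged.
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, wikitext) for each <page> in a MediaWiki XML export stream."""
    title, text = None, ""
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, ""
            elem.clear()  # release the finished page so large dumps stay streamable


# A made-up, heavily trimmed export used only for demonstration.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Example</title>
    <revision><text>'''Hello''' world</text></revision>
  </page>
</mediawiki>"""

if __name__ == "__main__":
    for page_title, wikitext in iter_pages(io.BytesIO(SAMPLE)):
        print(page_title, "->", wikitext)
```

The wikitext obtained this way is still raw markup; turning it into HTML, plain text, or a parse tree is the job of the parsers listed above.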
| Name and link | Principal author(s) | Language | Input | Output | Comments / other info | License |
|---|---|---|---|---|---|---|
| not listed | as yet unknown | RTF | Markup (or importable XML dump) | | | |
| not listed | as yet unknown | OpenOffice Formats | Markup (or importable XML dump) | | | |
| not listed | as yet unknown | Text Processing Formats | Markup (or importable XML dump) | | | |
- Tero-dump gives a 404 error.