Parsoid is an application which can translate back and forth, at runtime, between MediaWiki's wikitext syntax and an equivalent HTML/RDFa document model with enhanced support for automated processing and rich editing. It has been under development by a team at the Wikimedia Foundation since 2012. It is currently used extensively by VisualEditor and Flow, as well as a growing list of other applications.
Parsoid is a web service implemented in node.js (often referred to simply as node). For a quick overview, you can test-drive Parsoid using a node web service. Development happens in the Parsoid service repository in Git (see tree). If you need help, you can contact us in #mediawiki-parsoid or on the wikitext-l mailing list.
If you use the MediaWiki-Vagrant virtual machine development environment, you can simply add the visualeditor role to it; this sets up a working Parsoid along with Extension:VisualEditor.
See Parsoid/Setup for detailed instructions.
The Parsoid web API
Converting simple wikitext
You can convert simple wikitext snippets from the command line using the parse.js script:
echo '[[Foo]]' | node tests/parse
The parse script has many options; node parse --help lists them.
In Ubuntu 13 and 14, node has been renamed to nodejs. There, either create a symbolic link (or equivalent) or type:
echo '[[Foo]]' | nodejs tests/parse
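When the interpreter name differs between systems, a Node program can sidestep the question by launching the script with process.execPath, the path of the interpreter that is already running. The following is a minimal sketch, not part of Parsoid: runNodeScript is a hypothetical helper, and it assumes the target script (for example tests/parse) reads wikitext on stdin and writes its result to stdout, as the commands above do.

```javascript
const { spawnSync } = require('child_process');

// Hypothetical helper (not part of Parsoid): pipe `wikitext` to a Node
// script on stdin and return whatever the script writes to stdout.
// process.execPath is the interpreter currently running, so it does not
// matter whether the binary is installed as `node` or `nodejs`.
function runNodeScript(scriptPath, wikitext, extraArgs = []) {
  const result = spawnSync(process.execPath, [scriptPath, ...extraArgs], {
    input: wikitext,   // sent to the child's stdin
    encoding: 'utf8',  // return stdout/stderr as strings
  });
  if (result.error) {
    throw result.error; // the child could not be started at all
  }
  if (result.status !== 0) {
    throw new Error(
      `${scriptPath} exited with status ${result.status}: ${result.stderr}`
    );
  }
  return result.stdout;
}

// Assumed usage, run from a Parsoid checkout:
//   const html = runNodeScript('tests/parse', '[[Foo]]');
```

Using the synchronous variant keeps the sketch short; a long-running service would more likely use the asynchronous spawn API instead.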
Running the tests
To run all parser tests:
parserTests has quite a few options, which can be listed using node ./parserTests.js --help.
Debugging Parsoid (for developers)
See Parsoid/Debugging for debugging tips.
If you have questions, try to ping the team on the wikitext-l mailing list. If all that fails, you can also contact us by email at parser-team at the wikimedia.org domain.
The broad architecture looks like this:
    | wikitext
    V
PEG wiki/HTML tokenizer (or other tokenizers / SAX-like parsers)
    | Chunks of tokens
    V
Token stream transformations
    | Chunks of tokens
    V
HTML5 tree builder
    | HTML5 DOM tree
    V
DOM Postprocessors
    | HTML5 DOM tree
    V
(X)HTML serialization
    |
    +------------------> Browser
    |
    V
VisualEditor
This is essentially an HTML parser pipeline in which the regular HTML tokenizer is replaced by a combined wiki/HTML tokenizer, with additional functionality implemented as (mostly syntax-independent) token stream transformations.
- Token stream transformations are used to implement context-sensitive wiki-specific functionality (wiki lists, quotes for italic/bold, etc.). Templates are also expanded at this stage, which makes it possible to still render unbalanced templates such as table start / row / end combinations.
- The resulting tokens are then fed to an HTML5-spec compatible DOM tree builder (currently the 'html5' node.js module), which builds an HTML5 DOM tree from the token soup. This step already sanitizes nesting and enforces some content-model restrictions according to the rules of the HTML5 parsing spec.
- The resulting DOM is further manipulated using postprocessors. Currently, any remaining top-level inline content is wrapped into paragraphs in such a postprocessor. For output intended for viewing, further document model sanitization can be added here to get very close to what Tidy does in the production parser.
- Finally, the DOM tree can be serialized as XML or HTML.
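The stages described above can be illustrated with a toy pipeline. This is a sketch under strong simplifying assumptions, not Parsoid's code: it understands only plain text and the '' italic marker, it builds a plain object tree rather than a real HTML5 DOM, and every function name in it (tokenize, transformQuotes, buildTree, wrapInParagraph, serialize, toyParse) is hypothetical.

```javascript
// Stage 1: tokenizer. Splits wikitext into text tokens and raw quote tokens.
function tokenize(wikitext) {
  const tokens = [];
  for (const part of wikitext.split(/('')/)) {
    if (part === "''") {
      tokens.push({ type: 'quote' });
    } else if (part !== '') {
      tokens.push({ type: 'text', value: part });
    }
  }
  return tokens;
}

// Stage 2: token stream transformation. Whether a quote opens or closes
// italics depends on context, so it is resolved here, not in the tokenizer.
function transformQuotes(tokens) {
  let open = false;
  return tokens.map((tok) => {
    if (tok.type !== 'quote') return tok;
    open = !open;
    return { type: open ? 'startTag' : 'endTag', name: 'i' };
  });
}

// Stage 3: tree builder. Turns the token stream into a tree; elements that
// are still open at the end of input are implicitly closed.
function buildTree(tokens) {
  const root = { name: 'body', children: [] };
  const stack = [root];
  for (const tok of tokens) {
    const parent = stack[stack.length - 1];
    if (tok.type === 'text') {
      parent.children.push({ text: tok.value });
    } else if (tok.type === 'startTag') {
      const el = { name: tok.name, children: [] };
      parent.children.push(el);
      stack.push(el);
    } else if (tok.type === 'endTag' && stack.length > 1) {
      stack.pop();
    }
  }
  return root;
}

// Stage 4: DOM postprocessor. Wraps top-level inline content in a paragraph
// (everything is inline in this toy, so all children are wrapped together).
function wrapInParagraph(root) {
  if (root.children.length > 0) {
    root.children = [{ name: 'p', children: root.children }];
  }
  return root;
}

// Stage 5: serialization of the tree to an HTML string.
function escapeText(s) {
  return s.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
}

function serialize(node) {
  if (node.text !== undefined) return escapeText(node.text);
  const inner = node.children.map(serialize).join('');
  return `<${node.name}>${inner}</${node.name}>`;
}

// The whole pipeline, composed in the order shown in the diagram.
function toyParse(wikitext) {
  const tokens = transformQuotes(tokenize(wikitext));
  return serialize(wrapInParagraph(buildTree(tokens)));
}

console.log(toyParse("Hello ''world''"));
// prints <body><p>Hello <i>world</i></p></body>
```

Because the stages communicate only through tokens and a tree, an unclosed '' still produces well-formed output: the tree builder closes the dangling element, which is the same division of labour the list above describes for nesting repairs.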
- Task T92643 has Parsoid goals for Q4 2014.
- Parsoid deployment agenda on the Wikimedia cluster (code is normally deployed every Monday and Wednesday between 1pm and 1:30pm PST)
- Parsoid/MediaWiki DOM spec: Wiki content model spec using HTML/XML DOM and RDFa. The external interface for Parsoid, and designed to be useful as a future storage format.
- Parsoid/Round-trip testing: The round-trip testing setup we are using to test the wikitext -> HTML DOM -> wikitext round-trip on actual Wikipedia content.
- Parsoid/Visual Diffs Testing: Information about visual diff testing for comparing Parsoid's HTML rendering with the PHP parser's HTML rendering, plus a testreduce setup for doing mass visual diff tests.
- Parsoid: How Wikipedia catches up with the web -- blog post from March 2013 outlining why this problem is difficult and how we tackle it.
- A preliminary look at Parsoid internals [ Slides, Video ] -- tech talk from April 2014 that should still be a useful overview of how Parsoid tackles this problem.
- Parsoid/limitations: Limitations in Parsoid, mainly contrived templating (ab)uses that don't matter in practice. Could be extended to be similar to the preprocessor upgrade notes. (Might need updating)
- Parsoid/Roadmap: Old Parsoid roadmap. (NOW STALE and needs updating)
- Parsoid/Bibliography: Bibliography of related literature
- RT testing commits (useful to check regressions and fixes)
- Deployment instructions for Parsoid
- Kibana Parsoid dashboard
- Grafana dashboard for wt2html metrics
- Grafana dashboard for html2wt metrics
- Ganglia dashboard for Parsoid cluster