Parsoid

Parsoid is an application which can translate back and forth, at runtime, between MediaWiki's wikitext syntax and an equivalent HTML/RDFa document model with enhanced support for automated processing and rich editing. It has been under development by a team at the Wikimedia Foundation since 2012. It is currently used extensively by VisualEditor and Flow, as well as a growing list of other applications.

Parsoid is structured as a web service, and is written in JavaScript, making use of Node.js. It is intended to provide flawless back-and-forth conversion, i.e. to avoid both "dirty diffs" and any information loss. On Wikimedia wikis, for several applications, Parsoid is currently proxied behind RESTBase, which stores the HTML translated by Parsoid.

For more on the overall project, see this blog post from March 2013. To read about the HTML model being used, see MediaWiki DOM spec.
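To make the document model concrete, here is a hand-written snippet in the style of Parsoid's HTML/RDFa output for the wikitext [[Foo|bar]]. This is illustrative only; the MediaWiki DOM spec is the authoritative reference, and attribute details may vary by version.

```javascript
// Illustrative only: Parsoid-style HTML/RDFa output for "[[Foo|bar]]".
// The rel="mw:WikiLink" annotation marks the element as a wikilink.
const parsoidHtml =
  '<a rel="mw:WikiLink" href="./Foo" title="Foo">bar</a>';

// Clients can recover the link target and label without re-parsing wikitext.
const match = parsoidHtml.match(/href="\.\/([^"]+)"[^>]*>([^<]+)</);
console.log(match[1]); // "Foo" (the link target)
console.log(match[2]); // "bar" (the link label)
```

This is the point of the RDFa model: automated tools and rich editors can work on semantically annotated HTML instead of raw wikitext.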



Getting started
Parsoid is a web service implemented using node.js, often referred to simply as node. For a quick overview, you can test-drive Parsoid through its web service. Development happens in the Parsoid repository in Git. If you need help, you can contact us in #mediawiki-parsoid or on the wikitext-l mailing list.

If you use the MediaWiki-Vagrant virtual-machine development environment, you can simply add the visualeditor role to it, and it will set up a working Parsoid along with Extension:VisualEditor.

Parsoid setup
See Parsoid/Setup for detailed instructions.

Troubleshooting
See the troubleshooting page.

The Parsoid web API
See Parsoid/API.
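As a sketch of what a client request might look like, the snippet below builds a request against the endpoint shape Parsoid has used (POST /{domain}/v3/transform/wikitext/to/html/{title}, default port 8000). The endpoint layout, port, and parameter names here are assumptions; check Parsoid/API for the authoritative reference.

```javascript
// Sketch only: constructing (not sending) a wikitext-to-HTML transform
// request. Endpoint path, port, and body field names are assumptions
// based on Parsoid's v3 API shape -- verify against Parsoid/API.
const domain = 'en.wikipedia.org';            // example wiki domain
const title = encodeURIComponent('Main_Page'); // example page title
const url =
  `http://localhost:8000/${domain}/v3/transform/wikitext/to/html/${title}`;
const body = JSON.stringify({ wikitext: "'''Hello''', world" });

console.log(url);
// With a running Parsoid service, this payload could be POSTed with any
// HTTP client using Content-Type: application/json.
```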

Converting simple wikitext
You can convert simple wikitext snippets from the command line using the parse.js script:

echo 'Foo' | node tests/parse

The parse script has a lot of options; running it with the --help flag lists them.

Development
Code review happens in Gerrit. See Gerrit/Getting started and ping us in #mediawiki-parsoid.

Running the tests
To run all parser tests:

npm test

parserTests has quite a few options now, which can be listed by running it with the --help flag.

Debugging Parsoid (for developers)
See Parsoid/Debugging for debugging tips.

Todo
Our plans for Q4 2014 are spelled out in our quarterly goals page. All pending tasks and bugs are tracked in our bug list.

If you have questions, try to ping the team on #mediawiki-parsoid, or send a mail to the wikitext-l mailing list. If all that fails, you can also contact us by email at parser-team at the wikimedia.org domain.

Architecture
The broad architecture looks like this:

wikitext
   |
   V
PEG wiki/HTML tokenizer        (or other tokenizers / SAX-like parsers)
   | Chunks of tokens
   V
Token stream transformations
   | Chunks of tokens
   V
HTML5 tree builder
   | HTML5 DOM tree
   V
DOM Postprocessors
   | HTML5 DOM tree
   V
(X)HTML serialization
   |
   +--> Browser
   |
   V
VisualEditor

So this is basically an HTML parser pipeline, with the regular HTML tokenizer replaced by a combined wiki/HTML tokenizer, and with additional functionality implemented as (mostly syntax-independent) token stream transformations.


 * 1) The PEG-based tokenizer produces a combined token stream from wiki and HTML syntax. The PEG grammar is a context-free grammar that can be ported to different parser generators, mostly by adapting the parser actions to the target language. Currently we use pegjs to build the actual JavaScript tokenizer for us. We try to do as much work as possible in the grammar-based tokenizer, so that the emitted tokens are already mostly syntax-independent.
 * 2) Token stream transformations are used to implement context-sensitive wiki-specific functionality (wiki lists, quotes for italic/bold etc). Templates are also expanded at this stage, which makes it possible to still render unbalanced templates like table start / row / end combinations.
 * 3) The resulting tokens are then fed to an HTML5 tree builder (currently the 'html5' node.js module), which builds an HTML5 DOM tree from the token soup. This step already sanitizes nesting and enforces some content-model restrictions according to the rules of the HTML5 parsing spec.
 * 4) The resulting DOM is further manipulated using postprocessors. Currently, any remaining top-level inline content is wrapped into paragraphs in such a postprocessor. For output intended for viewing, further document-model sanitization can be added here to get very close to what Tidy does in the production parser.
 * 5) Finally, the DOM tree can be serialized as XML or HTML.
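The stages above can be sketched with a deliberately tiny toy model. This is NOT Parsoid code: it tokenizes only '''bold''' quotes, rewrites the quote tokens in a token-stream transformation, and collapses the tree-building and serialization stages into a simple string join.

```javascript
// Toy model of the pipeline above -- NOT Parsoid code.
function tokenize(wikitext) {
  // Stage 1: split into quote tokens and plain-text tokens.
  return wikitext.split(/(''')/).filter(s => s !== '')
    .map(s => (s === "'''" ? { type: 'quote' } : { type: 'text', value: s }));
}

function transformQuotes(tokens) {
  // Stage 2: context-sensitive token stream transformation -- alternate
  // quote tokens between opening and closing bold tags.
  let open = false;
  return tokens.map(t => {
    if (t.type !== 'quote') return t;
    open = !open;
    return { type: 'tag', name: open ? '<b>' : '</b>' };
  });
}

function serialize(tokens) {
  // Stages 3-5 collapsed: in real Parsoid, an HTML5 tree builder and DOM
  // postprocessors sit between transformation and serialization.
  return tokens.map(t => (t.type === 'text' ? t.value : t.name)).join('');
}

const html = serialize(transformQuotes(tokenize("plain '''bold''' plain")));
console.log(html); // "plain <b>bold</b> plain"
```

The key design point this illustrates: quote handling is context-sensitive (an opening quote changes the meaning of the next one), so it lives in a token-stream transformation rather than in the context-free PEG grammar.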

Technical documents

 * Parsoid goals for Q4 2014.
 * Parsoid deployment agenda on Wikimedia cluster (code normally deployed every Monday and Wednesday between 1pm - 1:30pm PST)
 * Parsoid/MediaWiki DOM spec: Wiki content model spec using HTML/XML DOM and RDFa. The external interface for Parsoid, and designed to be useful as a future storage format.
 * Parsoid/Round-trip testing: The round-trip testing setup we are using to test the wikitext -> HTML DOM -> wikitext round-trip on actual Wikipedia content.
 * Parsoid/Visual Diffs Testing: Info about visual diff testing for comparing Parsoid's HTML rendering with the PHP parser's HTML rendering, plus a testreduce setup for running mass visual diff tests.
 * Parsoid: How Wikipedia catches up with the web -- blog post from March 2013 outlining why this problem is difficult and how we tackle it.
 * A preliminary look at Parsoid internals [ Slides, Video ] -- tech talk from April 2014 that should still be a useful overview of how Parsoid tackles this problem.
 * Parsoid/limitations: Limitations in Parsoid, mainly contrived templating (ab)uses that don't matter in practice. Could be extended to be similar to the preprocessor upgrade notes. (Might need updating.)
 * Parsoid/Roadmap: old Parsoid roadmap. (Now stale; needs updating.)
 * Parsoid/Bibliography: Bibliography of related literature

Useful links for Parsoid developers

 * Parsoid/Deployments
 * RT testing commits (useful to check regressions and fixes)
 * Deployment instructions for Parsoid
 * Kibana Parsoid dashboard
 * Grafana dashboard for wt2html metrics
 * Grafana dashboard for html2wt metrics
 * Ganglia dashboard for Parsoid cluster
 * See Parsoid/Debugging for debugging tips.