Parsoid

The Parsoid team is developing a wiki runtime which can translate back and forth between MediaWiki's wikitext syntax and an equivalent HTML / RDFa document model with better support for automated processing and rich editing. It powers VisualEditor, Flow and a growing list of other applications. A major (and not easy) requirement is to avoid both 'dirty diffs' and information loss in the conversion. A good overview can be found in this blog post. Our roadmap describes what we are currently up to.

Artist's impression of the Parsoid HTML5 + RDFa wiki runtime

Getting started

Parsoid is a web service implemented in node.js (often referred to simply as node). For a quick overview, you can test drive Parsoid using a node web service. Development happens in the Parsoid extension in Git (see tree). If you need help, you can contact us in #mediawiki-parsoid or on the wikitext-l mailing list.

If you use the MediaWiki-Vagrant development environment (a virtual machine), you can simply add the visualeditor role to it; this sets up a working Parsoid along with Extension:VisualEditor, as shown below.
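
For example, this is roughly the following (the exact command names can differ between MediaWiki-Vagrant versions; older versions use vagrant enable-role instead of vagrant roles enable):

vagrant roles enable visualeditor
vagrant provision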

Parsoid setup

See Parsoid/Setup for detailed instructions.
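
As a rough sketch only (file names and settings have varied between Parsoid versions, so treat Parsoid/Setup as authoritative), the service is pointed at your wiki's api.php in its localsettings.js, and then started from the Parsoid checkout with node api/server.js:

// localsettings.js -- illustrative example, not a complete configuration
exports.setup = function( parsoidConfig ) {
    // register a wiki prefix and point it at that wiki's api.php
    parsoidConfig.setInterwiki( 'localwiki', 'http://localhost/w/api.php' );
};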

Troubleshooting

See the troubleshooting page.

Converting simple wikitext

You can convert simple wikitext snippets using our parse.js script:

cd tests
echo '[[Foo]]' | node parse

More options are available with

node parse --help

The Parsoid web API

Parsoid converts MediaWiki's Wikitext to XHTML5 + RDFa and back.
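
For example, a wikitext link such as [[Foo|the link text]] comes back as HTML annotated with RDFa attributes, roughly of the form shown below (simplified; see the Parsoid DOM spec for the exact markup):

<p><a rel="mw:WikiLink" href="./Foo">the link text</a></p>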

Common HTTP headers supported in all entry points

Accept-Encoding 
Please accept gzip; the service can return gzip-compressed responses.
Cookie 
Cookie header that will be forwarded to the API. This makes it possible to use Parsoid with private wikis. Setting a cookie implicitly disables all caching for security reasons, so do not send a cookie for public wikis if you care about caching.
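
For example, with curl (the localhost:8000 address, the localwiki prefix and the cookie value are placeholders for your own setup):

# request gzip-compressed HTML, forwarding a session cookie for a private wiki
curl --compressed -H 'Cookie: my_session=abc123' http://localhost:8000/localwiki/Main_Page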

Entry points

The {prefix} in these examples refers to the configured wiki id as available in the siteinfo API request. Examples: 'enwiki', 'frwiki', 'dewiki' etc. {page} refers to the canonical page name with spaces replaced with underscores. Examples: 'Main_Page', 'Barack_Obama' etc.

GET /{prefix}/{page}?oldid=12345 
Get HTML for a given page revision. Example: /enwiki/Main_Page?oldid=598252063.
POST /{prefix}/{page} 
Convert passed-in wikitext or HTML.
The {page} path component should be provided if available. Both it and the oldid are needed for clean round-tripping of HTML retrieved earlier with GET /{prefix}/{page}?oldid=12345.
HTML to Wikitext
oldid
the revision id this is based on (if any)
html
HTML to serialize to wikitext
Wikitext to HTML
wt
Wikitext to parse to HTML
body (optional) 
boolean flag; if set, only return the HTML body.innerHTML instead of a full document

Convenience method:

GET /{prefix}/{page} 
Get HTML for the latest page revision; this redirects to the full /{prefix}/{page}?oldid=12345 form.
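
For illustration, here are curl invocations against a local Parsoid service (the localhost:8000 address and the enwiki prefix are assumptions about your setup):

# HTML for a specific revision of a page
curl 'http://localhost:8000/enwiki/Main_Page?oldid=598252063'

# wikitext -> HTML, returning only body.innerHTML
curl --data-urlencode 'wt=[[Foo]]' --data-urlencode 'body=1' http://localhost:8000/enwiki/Main_Page

# HTML -> wikitext, based on the revision the HTML was originally fetched from
curl --data-urlencode 'oldid=598252063' --data-urlencode 'html=<p>Hello</p>' http://localhost:8000/enwiki/Main_Page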

There are additional form-based debugging tools available at the service root / (e.g. http://parsoid.wmflabs.org/). These are not part of the API and can change or disappear at any time.

Development

Code review happens in Gerrit. See Gerrit/Getting started and ping us in #mediawiki-parsoid.

Running the tests

To run all parser tests:

npm test

parserTests has quite a few options now, which can be listed using node ./parserTests.js --help.

An alternative wrapper taking wikitext on stdin and emitting HTML on stdout is modules/parser/parse.js:

cd tests
echo '{{:Main Page}}' | node parse.js

This example will transclude the English Wikipedia's en:Main Page including its embedded templates. Also check out node parse.js --help for options.

You can also try to round-trip a page and check for the significance of the differences. For example, try

cd tests
node roundtrip-test.js --wiki mw Parsoid

This example will run the roundtripper on this page (the one you're reading, including all of this text) and report the results. It will also attempt to determine whether the differences in wikitext create any differences in the display of the page. If not, it reports the difference as "syntactic".

Finally, if you really want to hammer the Parsoid codebase to see how we're doing, you can try running the roundtrip testing environment on your computer with a list of titles.

As if that weren't enough, we've also added a --selser option, with multiple related options, to the parserTests.js script. The way it works:

cd tests
node parserTests.js --selser

You can also write out change files, read them in, and specify any number of iterations of random changes to go through. There's also a plan to pass in actual changes to the tests, but that work is still in progress.

Debugging Parsoid (for developers)

See Parsoid/Debugging for debugging tips.

Monthly high-level status summary

2014-03-monthly:

Presentation slides from the Parsoid team's quarterly review meeting on March 28

March saw the Parsoid team continuing with a lot of unglamorous bug fixing and tweaking. Media / image handling in particular received a good amount of love, and is now in a much better state than it used to be. In the process, we discovered a lot of edge cases and inconsistent behavior in the PHP parser, and fixed some of those issues there as well.

We wrapped up our mentorship for Be Birchall and Maria Pecana in the Outreach Program for Women. We revamped our round-trip test server interface and fixed some diffing issues in the round-trip test system. Maria wrote a generic logging backend that lets us dynamically map an event stream to any number of logging sinks. This is a huge step up from the basic console.error-based error logging we have relied on so far.

We also designed and implemented an HTML templating library which combines the correctness and security support of a DOM-based solution with the performance of string-based templating. This is implemented as a compiler from KnockoutJS-compatible HTML syntax to a JSON intermediate representation, and a small and very fast runtime for the JSON representation. The runtime is now also being ported to PHP in order to gauge the performance there as well. It will also be a test bed for further forays into HTML templating for translation messages and eventually wiki content.

(See all status reports)

Todo

Our big plans are spelled out in some detail in our roadmap. Smaller-step tasks are tracked in our bug list.

If you have questions, try to ping the team in #mediawiki-parsoid, or send a mail to the wikitext-l mailing list. If all that fails, you can also contact Gabriel Wicke by mail.

Architecture

The broad architecture looks like this:

     | wikitext
     V
 PEG wiki/HTML tokenizer         (or other tokenizers / SAX-like parsers)
     | Chunks of tokens
     V
 Token stream transformations 
     | Chunks of tokens
     V
 HTML5 tree builder 
     | HTML5 DOM tree
     V
 DOM Postprocessors 
     | HTML5 DOM tree
     V
 (X)HTML serialization
     |
     +------------------> Browser
     |
     V
 VisualEditor

So basically this is an HTML parser pipeline, with the regular HTML tokenizer replaced by a combined wiki/HTML tokenizer, and with additional functionality implemented as (mostly syntax-independent) token stream transformations. A conceptual code sketch follows the numbered list below.

  1. The PEG-based wiki tokenizer produces a combined token stream from wiki and html syntax. The PEG grammar is a context-free grammar that can be ported to different parser generators, mostly by adapting the parser actions to the target language. Currently we use pegjs to build the actual JavaScript tokenizer for us. We try to do as much work as possible in the grammar-based tokenizer, so that the emitted tokens are already mostly syntax-independent.
  2. Token stream transformations are used to implement context-sensitive wiki-specific functionality (wiki lists, quotes for italic/bold, etc.). Templates are also expanded at this stage, which makes it possible to still render unbalanced templates like table start / row / end combinations.
  3. The resulting tokens are then fed to an HTML5-spec-compatible DOM tree builder (currently the 'html5' node.js module), which builds an HTML5 DOM tree from the token soup. This step already sanitizes nesting and enforces some content-model restrictions according to the rules of the HTML5 parsing spec.
  4. The resulting DOM is further manipulated using postprocessors. Currently, any remaining top-level inline content is wrapped into paragraphs in such a postprocessor. For viewing output, further document-model sanitization can be added here to get very close to what Tidy does in the production parser.
  5. Finally, the DOM tree can be serialized as XML or HTML.
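
As a purely conceptual sketch (the module and function names below are invented for illustration and do not match Parsoid's actual source), the stages compose roughly like this:

// illustrative only -- not real Parsoid code
function wikitextToDom( wikitext ) {
    var tokens = pegTokenizer.tokenize( wikitext );   // 1. PEG wiki/HTML tokenizer
    tokens = tokenTransformers.process( tokens );     // 2. token stream transformations (lists, quotes, template expansion)
    var doc = html5TreeBuilder.build( tokens );       // 3. HTML5 tree builder
    domPostProcessors.run( doc );                     // 4. DOM postprocessors (e.g. paragraph wrapping)
    return doc;                                       // 5. ready for (X)HTML serialization
}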


Technical documents

See also