Parsoid
| Group: | Features |
| Start: | |
| End: | |
| Team: | C. Scott Ananian, Mark Holmquist, Robert Smith, Subramanya Sastry |
| Lead: | James Forrester, Gabriel Wicke |
| Status: | See updates |
The Parsoid team is developing a wiki runtime which can translate back and forth between MediaWiki's wikitext syntax and an equivalent HTML / RDFa document model with better support for automated processing and visual editing. Its main current use is the VisualEditor project. A major (and difficult) requirement is to avoid information loss and 'dirty diffs', that is, unintended changes to parts of the wikitext that were not edited, in the conversion. A good overview can be found in this blog post. Our roadmap describes what we are currently up to.
Getting started
For a quick overview, you can test drive Parsoid using a node web service. Development happens in the Parsoid extension in Git (see [1]). The parser tests use the parserTests.txt file from the core module.
Parsoid setup
For an anonymous checkout:
git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/Parsoid.git
If you plan to hack on Parsoid, please follow the Gerrit 'getting started' docs and use an authenticated checkout URL instead, such as:
git clone ssh://USERNAME@gerrit.wikimedia.org:29418/mediawiki/extensions/Parsoid.git
Use node.js 0.8, not 0.10
You need node.js 0.8.x (0.8.22 works best). Anything below 0.8 won't work due to problems with our dependencies, and node 0.9+ won't work yet due to changes in the stream API and other issues (see bug 45994). On Debian, the following should work: apt-get install nodejs nodejs-legacy npm.
In the MediaWiki-Vagrant virtual machine, you will probably need to do the following instead after running vagrant ssh:
$ wget http://nodejs.org/dist/v0.8.22/node-v0.8.22-linux-x64.tar.gz
$ tar xzf node-v0.8.22-linux-x64.tar.gz
$ sudo cp node-v0.8.22-linux-x64/bin/node /bin/node
$ cd node-v0.8.22-linux-x64/lib/node_modules/npm
$ ./configure
$ sudo make
$ sudo make install
Install dependencies
First, install the npm dependencies:
cd Parsoid/js
npm install
Configuration
If you would like to point the Parsoid web service to your own wiki, go to the Parsoid/js/api directory and create a localsettings.js file based on localsettings.js.example. Use parsoidConfig.setInterwiki to point to the MediaWiki instance(s) you want to use.
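As a rough illustration, a localsettings.js might look like the sketch below. It assumes that the settings file exposes a setup function which receives the parsoidConfig object, and that setInterwiki takes a prefix followed by the URL of the wiki's api.php; both are assumptions to be checked against localsettings.js.example. The 'localhost' prefix and the URL are hypothetical placeholders.

```javascript
// Sketch of Parsoid/js/api/localsettings.js.
// ASSUMPTION: the file exposes setup(parsoidConfig), and setInterwiki
// takes (prefix, apiUrl). Confirm both in localsettings.js.example.
function setup(parsoidConfig) {
    // 'localhost' is a hypothetical prefix; the URL is a hypothetical
    // api.php endpoint of your own MediaWiki instance.
    parsoidConfig.setInterwiki('localhost', 'http://localhost/w/api.php');
}

// Export the function for node's CommonJS module loader.
if (typeof module !== 'undefined' && module.exports) {
    module.exports.setup = setup;
}
```

Call setInterwiki once per MediaWiki instance you want the service to use.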
Run the server
You should be able to run the Parsoid web service using:
cd Parsoid/js
node api/server.js
This will start the Parsoid HTTP service on port 8000. To test it, point your browser to http://localhost:8000/.
Converting simple wikitext
You can convert simple wikitext snippets using our parse.js script:
cd Parsoid/js/tests
echo '[[Foo]]' | node parse
More options are available with
node parse --help
Running the tests
To run all parser tests:
cd Parsoid/js
npm test
parserTests.js has quite a few options, which can be listed using node ./parserTests.js --help.
An alternative wrapper taking wikitext on stdin and emitting HTML on stdout is the parse.js script in the tests directory:
cd Parsoid/js/tests
echo '{{:Main Page}}' | node parse.js
This example will transclude the English Wikipedia's en:Main Page including its embedded templates. Also check out node parse.js --help for options.
You can also try to round-trip a page and check for the significance of the differences. For example, try
cd Parsoid/js/tests
node roundtrip-test.js --wiki mw Parsoid
This example will run the roundtripper on this page (the one you're reading, including all of this text) and report the results. It will also attempt to determine whether the differences in wikitext create any differences in the display of the page. If not, it reports the difference as "syntactic".
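The classification idea can be sketched in a few lines. This is only a toy model and not the logic of roundtrip-test.js: the function names, the 'identical' and 'semantic' labels, and the stand-in renderer (which only understands ''italic'' markup) are all assumptions made for illustration.

```javascript
// Toy classification of a round-trip difference (not Parsoid's code).
// `render` is a hypothetical function that maps wikitext to HTML.
function classifyRoundTrip(original, roundTripped, render) {
    if (original === roundTripped) {
        return 'identical';
    }
    // Collapse whitespace so purely cosmetic HTML differences are ignored.
    function normalize(html) {
        return html.replace(/\s+/g, ' ').trim();
    }
    // Different wikitext with the same rendering is only a syntactic change.
    return normalize(render(original)) === normalize(render(roundTripped)) ?
        'syntactic' : 'semantic';
}

// Stand-in renderer for the example: handles ''italic'' and nothing else.
function toyRender(wikitext) {
    return wikitext.replace(/''(.+?)''/g, '<i>$1</i>');
}
```

With this stand-in renderer, classifyRoundTrip("a  ''b''", "a ''b''", toyRender) returns 'syntactic' because only whitespace changed, while classifyRoundTrip("''b''", "b", toyRender) returns 'semantic' because the italics were lost.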
Finally, if you really want to hammer the Parsoid codebase to see how we're doing, you can try running the roundtrip testing environment on your computer with a list of titles.
We have also added a --selser option, along with several related options, to the parserTests.js script. Basic usage:
cd Parsoid/js/tests
node parserTests.js --selser
You can also write out change files, read them in, and specify any number of iterations of random changes to go through. Passing actual changes in to the tests is planned, but that work is still in progress.
Monthly high-level status summary
A major image handling overhaul enabled rendering and editing of all image-related parameters with a relatively simple DOM structure. Template and extension editing was improved to support editing of templates within extensions. This lets editors modify and add templated citations in VisualEditor, an important feature to improve the quality of articles in Wikipedia.
On the performance front, we are now reusing expensive template, extension and image expansions from our own previous output to avoid most API queries after an edit. This is necessary to avoid overloading the API when tracking all edits on Wikimedia projects. A cache infrastructure with appropriate purging was set up and will be tested at full load through June.
At the Amsterdam hackathon, we helped developers leverage our rich HTML+RDFa DOM output for projects like a Wikipedia-to-SMS service or the Kiwix offline Wikipedia reader.
Todo
Our big plans are spelled out in some detail in our roadmap. Smaller-step tasks are tracked in our bug list.
We also have a slightly outdated list of tasks we saw ahead at some point in the past interspersed with notes on (still) open issues. The Todo page is in the process of being cleaned up / moved into Bugzilla, so don't pick a task on it without asking us about the status first.
If you have questions, try to ping the team on #mediawiki-parsoid, or send an email to the wikitext-l mailing list. If all that fails, you can also contact Gabriel Wicke by email.
Architecture
The broad architecture looks like this:
| wikitext
V
PEG wiki/HTML tokenizer (or other tokenizers / SAX-like parsers)
| Chunks of tokens
V
Token stream transformations
| Chunks of tokens
V
HTML5 tree builder
| HTML5 DOM tree
V
DOM Postprocessors
| HTML5 DOM tree
V
(X)HTML serialization
|
+------------------> Browser
|
V
Visual Editor
This is essentially an HTML parser pipeline in which the regular HTML tokenizer is replaced by a combined wiki/HTML tokenizer, with additional functionality implemented as (mostly syntax-independent) token stream transformations.
- The PEG-based wiki tokenizer produces a combined token stream from wiki and HTML syntax. The PEG grammar is a context-free grammar that can be ported to different parser generators, mostly by adapting the parser actions to the target language. Currently we use pegjs to build the actual JavaScript tokenizer for us. We try to do as much work as possible in the grammar-based tokenizer, so that the emitted tokens are already mostly syntax-independent.
- Token stream transformations are used to implement context-sensitive wiki-specific functionality (wiki lists, quotes for italic/bold etc). Templates are also expanded at this stage, which makes it possible to still render unbalanced templates like table start / row / end combinations.
- The resulting tokens are then fed to an HTML5-spec-compatible DOM tree builder (currently the 'html5' node.js module), which builds an HTML5 DOM tree from the token soup. This step already sanitizes nesting and enforces some content-model restrictions according to the rules of the HTML5 parsing spec.
- The resulting DOM is further manipulated using postprocessors. Currently, any remaining top-level inline content is wrapped into paragraphs in such a postprocessor. For output for viewing, further document model sanitization can be added here to get very close to what tidy does in the production parser.
- Finally, the DOM tree can be serialized as XML or HTML.
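The stages listed above can be illustrated with a deliberately tiny model. The sketch below is not Parsoid's code: it handles only ''italic'' quotes, uses a hand-written splitter instead of a PEG-generated tokenizer, and builds a plain object tree instead of using the 'html5' module. All function names are hypothetical.

```javascript
// Toy model of the pipeline stages; NOT Parsoid's actual implementation.

// 1. Tokenizer: turn wikitext into syntax-level tokens.
function tokenize(wikitext) {
    return wikitext.split(/('')/).filter(Boolean).map(function (chunk) {
        return chunk === "''" ? { type: 'quote' } : { type: 'text', value: chunk };
    });
}

// 2. Token stream transformation: whether a quote opens or closes <i>
//    depends on context, so it is resolved here and not in the tokenizer.
function transformQuotes(tokens) {
    var open = false;
    return tokens.map(function (token) {
        if (token.type !== 'quote') { return token; }
        open = !open;
        return { type: open ? 'start' : 'end', name: 'i' };
    });
}

// 3. Tree builder: elements still open at the end are closed implicitly.
function buildTree(tokens) {
    var root = { name: '#root', children: [] };
    var stack = [root];
    tokens.forEach(function (token) {
        var top = stack[stack.length - 1];
        if (token.type === 'text') {
            top.children.push({ text: token.value });
        } else if (token.type === 'start') {
            var node = { name: token.name, children: [] };
            top.children.push(node);
            stack.push(node);
        } else if (token.type === 'end' && stack.length > 1) {
            stack.pop();
        }
    });
    return root;
}

// 4. DOM postprocessor: wrap top-level inline content in a paragraph.
function wrapInParagraph(root) {
    if (root.children.length === 0) { return root; }
    return { name: '#root', children: [{ name: 'p', children: root.children }] };
}

// 5. Serializer: emit HTML, escaping text nodes.
function escapeHtml(text) {
    return text.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
}
function serialize(node) {
    if (typeof node.text === 'string') { return escapeHtml(node.text); }
    var inner = node.children.map(function (child) { return serialize(child); }).join('');
    return node.name === '#root' ? inner : '<' + node.name + '>' + inner + '</' + node.name + '>';
}

// The whole pipeline, stage by stage.
function parse(wikitext) {
    return serialize(wrapInParagraph(buildTree(transformQuotes(tokenize(wikitext)))));
}
```

For example, parse("Hello ''world''") returns "<p>Hello <i>world</i></p>", and the unbalanced input "''unclosed" still yields the well-formed "<p><i>unclosed</i></p>" because the tree builder closes the open element.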
Technical documents
- Parsoid/Roadmap: What we are up to.
- Parsoid/MediaWiki DOM spec: Wiki content model spec using HTML/XML DOM and RDFa. The external interface for Parsoid, and designed to be useful as a future storage format.
- Parsoid/Round-trip testing: The round-trip testing setup we are using to test the wikitext -> HTML DOM -> wikitext round-trip on actual Wikipedia content.
- /test cases: Please add interesting snippets or pages.
- If you feel masochistic, check out our broken wikitext tar pit.
- Minimization of DOM tags: Primarily used for minimizing nesting of inline tags (mainly bold and italic).
See also
- Future/Parser plan: Early (now relatively old) design ideas and issues
- User:GWicke: Some notes on existing wiki and HTML parsers, should really be moved to general documentation
- Special:PrefixIndex/Parsoid/: Parsoid-related pages on this wiki