Parsoid

From MediaWiki.org

The Parsoid project aims to develop a more consistent wikitext parser that translates MediaWiki's well-known syntax into an equivalent representation with better support for automated processing and visual editing. It is developed in parallel with, and in support of, the visual editor project as a future core project. A major requirement is the ability to reverse this translation (serialize back to wikitext) without introducing 'dirty diffs' or information loss. Wiki pages remain editable as plain wikitext.

Architecture

The broad architecture looks like this:

    | wikitext
    V
PEG wiki/HTML tokenizer         (or other tokenizers / SAX-like parsers)
    | Chunks of tokens
    V
Token stream transformations 
    | Chunks of tokens
    V
HTML5 tree builder 
    | HTML5 DOM tree
    V
DOM Postprocessors 
    | HTML5 DOM tree
    V
(X)HTML serialization
    |
    +------------------> Browser
    |
    V
Visual Editor

In essence, this is an HTML parser pipeline in which the regular HTML tokenizer is replaced by a combined wiki/HTML tokenizer, with additional functionality implemented as (mostly syntax-independent) token stream transformations.
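As a rough illustration of this pipeline shape, the sketch below chains three toy stages so that each one consumes the output of the previous one. All names here (tokenize, transformQuotes, serialize, runPipeline) are hypothetical and are not Parsoid's actual API. The toy tokenizer only knows the ''' bold marker, and the serializer stands in for both tree building and HTML serialization.

```javascript
// Toy tokenizer (assumption): splits text on the ''' bold quote marker.
function tokenize(wikitext) {
  return wikitext
    .split(/(''')/)
    .filter((chunk) => chunk !== '')
    .map((chunk) =>
      chunk === "'''" ? { type: 'quote' } : { type: 'text', value: chunk });
}

// Toy token stream transformation: turns alternating quote tokens
// into start and end tags for <b>.
function transformQuotes(tokens) {
  let open = false;
  return tokens.map((token) => {
    if (token.type !== 'quote') return token;
    open = !open;
    return { type: open ? 'startTag' : 'endTag', name: 'b' };
  });
}

// Toy serializer: stands in for tree building plus HTML serialization.
// It does no escaping and does not repair unbalanced tags.
function serialize(tokens) {
  return tokens
    .map((t) =>
      t.type === 'text' ? t.value
        : t.type === 'startTag' ? `<${t.name}>` : `</${t.name}>`)
    .join('');
}

// The pipeline is simply the stages applied in order.
function runPipeline(wikitext, stages) {
  return stages.reduce((value, stage) => stage(value), wikitext);
}

const html = runPipeline("plain '''bold''' text",
  [tokenize, transformQuotes, serialize]);
console.log(html); // plain <b>bold</b> text
```

Because the quote handling lives in its own stage, the tokenizer stays free of context-sensitive logic, which is the separation the architecture aims for.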

  1. The PEG-based wiki tokenizer produces a combined token stream from wiki and HTML syntax. The grammar is written as a parsing expression grammar (PEG), which can be ported to different parser generators, mostly by adapting the parser actions to the target language. Currently we use pegjs to build the actual JavaScript tokenizer. We try to do as much work as possible in the grammar-based tokenizer, so that the emitted tokens are already mostly syntax-independent.
  2. Token stream transformations are used to implement context-sensitive wiki-specific functionality (wiki lists, quotes for italic/bold, etc.). Templates are also expanded at this stage, which makes it possible to still render unbalanced templates such as table start / row / end combinations.
  3. The resulting tokens are then fed to an HTML5-spec-compatible DOM tree builder (currently the 'html5' node.js module), which builds an HTML5 DOM tree from the token soup. This step already sanitizes nesting and enforces some content-model restrictions according to the rules of the HTML5 parsing spec.
  4. The resulting DOM is further manipulated using postprocessors. Currently, any remaining top-level inline content is wrapped into paragraphs in such a postprocessor. For output for viewing, further document model sanitation can be added here to get very close to what tidy does in the production parser.
  5. Finally, the DOM tree can be serialized as XML or HTML.
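The paragraph wrapping described in step 4 can be sketched on a toy tree. This is a minimal illustration, not Parsoid's postprocessor: the node shape ({ type, name, children } and { type: 'text', value }), the BLOCK_ELEMENTS set and the function names are all assumptions made for this example.

```javascript
// Assumed (incomplete) set of block-level element names for this toy example.
const BLOCK_ELEMENTS = new Set(['p', 'ul', 'ol', 'table', 'h1', 'h2', 'div']);

// Text nodes and non-block elements count as inline content.
function isInline(node) {
  return node.type === 'text' || !BLOCK_ELEMENTS.has(node.name);
}

// Wrap each run of consecutive top-level inline nodes in a <p> element;
// block-level nodes pass through unchanged.
function wrapTopLevelInline(children) {
  const result = [];
  let run = [];
  const flush = () => {
    if (run.length > 0) {
      result.push({ type: 'element', name: 'p', children: run });
      run = [];
    }
  };
  for (const node of children) {
    if (isInline(node)) {
      run.push(node);
    } else {
      flush();
      result.push(node);
    }
  }
  flush();
  return result;
}

const body = [
  { type: 'text', value: 'Intro ' },
  { type: 'element', name: 'b', children: [{ type: 'text', value: 'bold' }] },
  { type: 'element', name: 'ul', children: [] },
  { type: 'text', value: 'Outro' },
];
const wrapped = wrapTopLevelInline(body);
console.log(wrapped.map((node) => node.name)); // [ 'p', 'ul', 'p' ]
```

Running this as a separate pass over the finished tree keeps the tree builder itself generic, so further postprocessors can be added without touching earlier stages.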

Getting started

For a quick overview, you can test drive Parsoid using a node web service. Development happens in the VisualEditor extension in Git (see modules/parser and tests/parser). The parser tests use the parserTests.txt file from the core module.

git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/VisualEditor.git

You need node.js 0.4+, and npm 1.0+. If you have npm 0.x (as shown by npm -v), please upgrade npm first with npm install npm or curl http://npmjs.org/install.sh | sudo sh.

First, install the npm dependencies:

cd extensions/VisualEditor/modules/parser
npm install

You can also install globally, using sudo npm install -g on Linux. If you install the dependencies locally, you will probably need to export NODE_PATH=node_modules. If you run into problems with npm on Ubuntu Oneiric, try updating it manually as described above, using npm install npm or curl http://npmjs.org/install.sh | sudo sh.

When this is in place, you should be able to run all parser tests using:

cd extensions/VisualEditor/modules/parser
npm test

parserTests.js now has quite a few options, which can be listed using node ./parserTests.js --help.

An alternative wrapper taking wikitext on stdin and emitting HTML on stdout is modules/parser/parse.js:

cd extensions/VisualEditor/modules/parser
echo '{{:Main Page}}' | node parse.js

This example will transclude the English Wikipedia's en:Main Page including its embedded templates. Also check out node parse.js --help for options.
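The same stdin/stdout wrapper can also be driven from a node script. The sketch below is an assumption-laden helper, not part of the repository: wikitextToHtml is a hypothetical name, and the default command simply mirrors the shell example above, so it only works when run from modules/parser with the dependencies installed.

```javascript
const { spawnSync } = require('child_process');

// Pipe wikitext to a command on stdin and return what it prints on stdout.
// The defaults ('node parse.js') mirror the shell example and assume the
// current working directory is modules/parser.
function wikitextToHtml(wikitext, command = 'node', args = ['parse.js']) {
  const result = spawnSync(command, args, { input: wikitext, encoding: 'utf8' });
  if (result.error) throw result.error;
  if (result.status !== 0) {
    throw new Error(
      `${command} exited with status ${result.status}: ${result.stderr}`);
  }
  return result.stdout;
}

// Usage, from inside modules/parser:
// console.log(wikitextToHtml('{{:Main Page}}'));
```

The command and arguments are parameters so that the helper can be pointed at a different script or exercised with a stand-in process.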

Enjoy!

Monthly high-level status summary

Shared with the Visual editor project.

2012-05-23:

Gabriel has set up a very basic Parsoid service.

  • browse the English Wikipedia as Parsoid sees it
  • POST wikitext -> HTML DOM
  • POST HTML DOM -> wikitext

Note: round-tripping is limited. It does not yet preserve variable whitespace, templates, and other complex constructs.

Currently 154 parser tests are passing in the new --roundtrip mode that Subbu added last week.
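What a round-trip check verifies can be sketched as follows. This is a toy illustration and not the --roundtrip implementation: roundTripsCleanly, toyParse and toySerialize are hypothetical names, and the toy pair only understands ''' bold markup.

```javascript
// True when serializing the parsed form reproduces the input exactly,
// i.e. the round trip introduces no 'dirty diff'.
function roundTripsCleanly(wikitext, parse, serialize) {
  return serialize(parse(wikitext)) === wikitext;
}

// Toy parser/serializer pair for '''bold''' only (not Parsoid code).
const toyParse = (wt) => wt.replace(/'''(.*?)'''/g, '<b>$1</b>');
const toySerialize = (html) => html.replace(/<b>(.*?)<\/b>/g, "'''$1'''");

// Quote syntax survives the round trip.
console.log(roundTripsCleanly("a '''b''' c", toyParse, toySerialize)); // true

// HTML syntax in the source does not: the toy parser does not record
// which syntax produced the <b>, so the serializer emits ''' instead.
console.log(roundTripsCleanly('a <b>b</b> c', toyParse, toySerialize)); // false
```

The second case shows why the parsed representation needs to carry source information if the original syntax is to be restored on serialization.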

Also, the team had a meeting with James.

(See all status reports)

Todo

If you would like to hack on the Parsoid parser, we have a list of the tasks we currently see ahead, interspersed with notes on open issues. Some tasks are marked as especially well suited for newcomers. If you have questions, try to ping gwicke on #mediawiki or send a mail to the wikitext-l mailing list. If all that fails, you can also contact Gabriel Wicke by mail.

parserTests.js result history

Total 672 tests (including normally disabled ones).

  • 15:04, 29 November 2011 (UTC): 50 passed, 4m45
  • 15:07, 29 November 2011 (UTC): 55 passed, 4m40
  • 16:27, 1 December 2011 (UTC): 139 passed, 8m50
  • 22:14, 6 December 2011 (UTC): 169 passed, 7m30
  • 17:32, 7 December 2011 (UTC): 180 passed, 7m35
  • 11:13, 12 December 2011 (UTC): 180 passed, 0m14 (5 seconds with --cache) after no longer rebuilding the tokenizer for each test
  • 00:11, 22 January 2012 (UTC): 220 passed, 6.1s with --cache
  • 12:53, 1 February 2012 (UTC): 222 passed, 6.3s with --cache
  • 17:36, 7 February 2012 (UTC): 232 passed, 6.6s with --cache
  • 19:08, 13 February 2012 (UTC): 238 passed, 7.1s with --cache
  • 17:30, 17 February 2012 (UTC): 244 passed, 7.0s with --cache
  • 20:07, 20 February 2012 (UTC): 249 passed, 7.0s with --cache
  • 16:43, 22 February 2012 (UTC): 251 passed, 6.9s with --cache
  • 18:14, 5 March 2012 (UTC): 268 passed, 7.0s with --cache
  • 18:20, 3 April 2012 (UTC): 288 passed, 8.2s with --cache
  • 16:38, 26 April 2012 (UTC): 303 passed, 5.6s with --cache

See /parserTests result for the raw output of running parserTests.js --no-color.


Technical documents

  • /HTML5 DOM with microdata: Design for the embedding of Wiki information into the HTML5 DOM produced by the parser, and used for communication with the visual editor. Not implemented, but gives the general idea.
  • Parsoid/RDFa vocabulary: Actual design work using RDFa instead of Microdata. Implementation in progress.
  • /test cases: Please add interesting snippets or pages.

See also

  • Future/Parser plan: Early (now relatively old) design ideas and issues
  • User:GWicke: Some notes on existing wiki and HTML parsers, should really be moved to general documentation