Parsoid/status

Last update on: 2012-11-monthly

2011-02-09
A dedicated project was created for Parsoid. Status updates prior to this date were included in the Visual editor updates.

2012-08-20 (MW 1.20wmf10)
The Parsoid team worked on the final tasks in the JS prototype, in preparation for the C++ port. In preparation for production use, the port will allow efficient integration with PHP and Lua, improve performance and, in the longer term, allow parallelization of the parser.

An important milestone we reached is the implementation and verification of the template DOM range encapsulation algorithm, which now identifies all template-affected parts of the DOM for round-tripping and protection in the VisualEditor. We are currently implementing template round-tripping based on this. Other new features include oldid support so that previous versions of pages can be edited, rather than just the current one, and more complete error reporting in the web service. Wikitext escaping in the serializer is much improved, and now also handles interactions across multiple DOM nodes. An ongoing task has been improving test coverage to enable us to refactor code with more confidence and also help test the correctness of the C++ port.

Most details of the C++ port were researched. A basic build system including the selected libraries was set up, and design work on the basic data structures has started, ahead of full porting which we expect to start next iteration.

The full list of Parsoid bugs closed in the last two weeks is available in Bugzilla.

2012-08-monthly
The Parsoid team reached a major milestone in August by implementing a template output encapsulation algorithm, and started to use it to support expanded template round-tripping. In parallel with this and the usual smaller tweaks, work on a C++ port of the parser was started. The port is expected to allow an efficient integration with PHP and Lua, improve performance and allow the parallelization of the parser in the longer term.

2012-09-03 (MW 1.20wmf11)
The Parsoid team reached a major milestone with basic round-tripping of expanded templates and the Cite extension. This includes the protection of closely coupled and unbalanced table start / row / end templates, which makes it possible to protect and later edit these in the Visual Editor.

On the C++ side, the work has now started to port the existing JavaScript code, starting with the Tokenizer. Basic token data structures and a reference counting scheme are implemented. The integration of the boost.asio event loop for asynchronous and parallel operations and the adaptation of the libhubbub HTML5 tree builder and libxml2 DOM are next steps.

The full list of Parsoid commits is available in Gerrit.

2012-09-17
In the JavaScript Parsoid implementation, we further improved support for round-tripping of templates and numerous other constructs. We now have an additional thirty parser tests and a similar number of round-trip tests passing. We started work on automated round-trip testing on dumps to provide a benchmark for progress and to identify the most important problem areas to focus on. We also added edit support for behavior switches and category links. To support selective serialization of the edited sections of the document without dirty diffs in unmodified sections, we are now associating DOM nodes with the source wikitext that produced that DOM.

On the C++ port, the data structures and synchronization / queueing strategies are now nearly complete. The tokenizer can handle very simple content (mainly headings) and populate the data structures. We started work on TokenTransformManagers. However, due to resource constraints, the C++ port is currently a part-time effort, with most effort going into the temporary JavaScript implementation as the safest bet for the December release.

The full list of Parsoid commits is available in Gerrit.

2012-09-monthly
The Parsoid team spent September improving the JavaScript prototype to get it ready for the December release, and improving the C++ port for longer-term deployment. The original plan to finish the C++ port before the December release looks very risky with the limited resources available, so the plan is to release the JavaScript prototype instead.

On the JavaScript side, the focus was on round-tripping of templates and other constructs such as the Cite extension, support for category links and "magic words". Many parser tests were added, and a new milestone of 603 passing round-trip tests (with 218 to go) was reached. First steps towards round-trip testing on a full dump were taken.

In the C++ implementation, the tokenizer can now support very simple content and use it to populate the internal data structures. Basic interfaces for asynchronous and parallel processing were defined. An XML DOM abstraction layer was introduced to make DOM-related algorithms independent of the DOM library used. The focus on the JavaScript prototype for the release limited the progress on the C++ implementation.

2012-10-01
JavaScript implementation:
 * Many improvements to template round-tripping and DOM source range calculations
 * Reworked paragraph wrapping to be more bug-for-bug compatible with the PHP parser
 * Many small tokenizer and round-trip fixes
 * Added many new parser tests
 * 603 round-trip tests passing, 218 to go

C++ implementation:
 * Basic token transformer skeletons with boost.asio integration are hooked up
 * New XML DOM abstraction interface separates DOM-based code from the DOM library used; the PugiXML DOM backend is used for performance and memory footprint
 * Takes back seat to JavaScript prototype implementation due to resource constraints

The full list of Parsoid commits is available in Gerrit.

2012-10-15
JavaScript implementation:
 * Many improvements to template round-tripping and DOM source range calculations
 * Added many new parser tests
 * Test runner now runs various round-trip test modes based on parser tests
 * Passing wikitext-to-wikitext round-trip tests up to 618 from 608; 1343 tests passing in total
 * Set up continuous integration with Jenkins, which runs parser tests on a separate virtual machine on each commit
 * Created round-trip test infrastructure on full dumps with classification into syntactic-only / semantic diffs, adding distributed client-server mode to speed it up
 * Big articles like Barack Obama are now close to round-tripping without semantic differences

C++ implementation:
 * Generalized pipeline interfaces
 * Implemented HTML5 tree builder with XML DOM backend
 * Designed and implemented token stream transformer APIs with usability improvements on the JavaScript version
 * Added Scope class (~preprocessor frame) and simplified expansion logic vs. JavaScript implementation
 * Parses simple wikitext all the way to XML DOM

The full list of Parsoid commits is available in Gerrit.

2012-10-monthly
<section begin="2012-10-monthly"/>The Parsoid team focused on testing the JavaScript prototype parser against a corpus of 100,000 randomly selected articles from the English Wikipedia. A distributed MapReduce-like system, which uses several virtual machines on Wikimedia Labs, constantly converts articles to HTML DOM and back again to wikitext using the latest version of Parsoid. For a little over 75% of these articles, this results in exactly the same wikitext, as we intend. For another 18% of these articles, there are some differences in the wikitext, but these are so minor that they don't result in any differences in the produced HTML structure when it is re-parsed. In the production version of Parsoid, which will attempt to retain original wikitext as far as possible, these minor differences will only show up, if at all, around content that the user edited. Finally, just under 7% of articles still contain errors that change the produced HTML structure. These issues are the focus of the current work in preparation for the December release.<section end="2012-10-monthly"/>
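The three-way classification described above can be illustrated with a minimal sketch. It is not the Parsoid test runner: the function names, the category labels and the `to_html` callback are hypothetical, and the callback stands in for whatever parser converts wikitext to a comparable HTML structure.

```python
from collections import Counter
from typing import Callable, Dict, Iterable


def classify_round_trip(original: str, round_tripped: str,
                        to_html: Callable[[str], str]) -> str:
    """Classify one wikitext -> HTML -> wikitext round trip.

    `to_html` is a hypothetical stand-in for the real parser.
    """
    if original == round_tripped:
        return "exact"        # identical wikitext
    if to_html(original) == to_html(round_tripped):
        return "syntactic"    # wikitext differs, parsed output does not
    return "semantic"         # parsed output differs: a real error


def summarize(categories: Iterable[str]) -> Dict[str, float]:
    """Return the percentage of results in each category."""
    counts = Counter(categories)
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in ("exact", "syntactic", "semantic")}
    return {name: 100.0 * counts[name] / total
            for name in ("exact", "syntactic", "semantic")}


def toy_to_html(wikitext: str) -> str:
    """Toy 'parser' for illustration only: collapses whitespace runs."""
    return " ".join(wikitext.split())
```

With the toy parser, a pair that differs only in whitespace is reported as `syntactic`, while a pair with different words is reported as `semantic`.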

2012-11-12
<section begin="2012-11-12"/>Most of the work has been on the JavaScript implementation and testing infrastructure, in preparation for the December release. The automated testing of wikitext->HTML->wikitext now has 75.8% of articles round-tripping to exactly the same wikitext, and 94.5% round-tripping without changes to the nature of the page (the additional ~19% have changes in the source wikitext). The latter figure is up from about 85% two weeks ago (rather than the 93% stated in the previous report: we discovered and fixed a bug in the error accounting process). The Barack Obama article now round-trips without any diffs.

A first iteration of the selective serialization algorithm is in development. This algorithm will hide purely syntactic differences in unmodified parts of the page by using the original wikitext for those. It heavily relies on our calculation of source ranges for each DOM element.
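The idea can be sketched as follows, assuming a simplified model in which the document is a flat list of nodes, each carrying a source range and a modification flag. The `Node` class, the `selective_serialize` function and the `serialize` callback are hypothetical names for illustration; the actual algorithm works on a DOM tree and is still in development.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Node:
    """Hypothetical, flattened stand-in for a DOM element."""
    content: str
    # (start, end) offsets into the original wikitext, if known
    source_range: Optional[Tuple[int, int]] = None
    modified: bool = False


def selective_serialize(nodes: List[Node], original: str,
                        serialize: Callable[[Node], str]) -> str:
    """Reuse original wikitext for unmodified nodes, serialize the rest."""
    parts = []
    for node in nodes:
        if not node.modified and node.source_range is not None:
            start, end = node.source_range
            parts.append(original[start:end])   # copy source verbatim
        else:
            parts.append(serialize(node))       # fall back to serializer
    return "".join(parts)


# Example: only the second node was edited, so the heading keeps its
# original spelling, including the spaces inside the markers.
original = "== A ==\n'''x'''\n"
nodes = [
    Node("A", source_range=(0, 8)),
    Node("'''y'''", source_range=(8, 16), modified=True),
]
result = selective_serialize(nodes, original, lambda n: n.content + "\n")
```

Because unmodified nodes are copied from the original text, any purely syntactic normalization the serializer would otherwise apply never reaches those parts of the page.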

The full list of Parsoid commits is available in Gerrit.<section end="2012-11-12"/>

2012-11-monthly
<section begin="2012-11-monthly"/>In preparation for the upcoming deployment on the English Wikipedia, the Parsoid team concentrated on the preservation of existing content. Automated round-trip testing on 100,000 randomly chosen pages from the English Wikipedia using distributed test runners helped to identify many issues, which were fixed and often resulted in new minimal test cases being added to the parser test suite. Currently, 79.4% of pages (up from about 65% last month) round-trip without any differences at all, an additional 18% round-trip with only minor differences (whitespace, quote style, etc.), and the remaining 2.6% of pages have differences that still need fixing (down from about 15% last month).

Selective serialization will further avoid dirty diffs in unmodified parts of a page by using the original wikitext for those. This should help with the roughly 20% of pages that had any kind of difference in wikitext. The implementation of this algorithm is currently being finalized.

The full list of Parsoid commits is available in Gerrit.<section end="2012-11-monthly"/>