Parsoid/status

Last update on: 2013-05-monthly

2011-02-09
A dedicated project was created for Parsoid. Status updates prior to this date were included in the Visual editor updates.

2012-08-20 (MW 1.20wmf10)
The Parsoid team worked on the final tasks in the JS prototype, in preparation for the C++ port. In preparation for production use, the port will allow efficient integration with PHP and Lua, improve performance and, in the longer term, allow the parallelization of the parser.

An important milestone we reached is the implementation and verification of the template DOM range encapsulation algorithm, which now identifies all template-affected parts of the DOM for round-tripping and protection in the VisualEditor. We are currently implementing template round-tripping based on this. Other new features include oldid support so that previous versions of pages can be edited, rather than just the current one, and more complete error reporting in the web service. Wikitext escaping in the serializer is much improved, and now also handles interactions across multiple DOM nodes. An ongoing task has been improving test coverage to enable us to refactor code with more confidence and also help test the correctness of the C++ port.

Most details of the C++ port were researched. A basic build system including the selected libraries was set up, and design work on the basic data structures has started, ahead of full porting which we expect to start next iteration.

The full list of Parsoid bugs closed in the last two weeks is available in Bugzilla.

2012-08-monthly
The Parsoid team reached a major milestone in August by implementing a template output encapsulation algorithm, and started to use it to support expanded template round-tripping. In parallel with this and the usual smaller tweaks, work on a C++ port of the parser was started. The port is expected to allow an efficient integration with PHP and Lua, improve performance and allow the parallelization of the parser in the longer term.

2012-09-03 (MW 1.20wmf11)
The Parsoid team reached a major milestone with basic round-tripping of expanded templates and the Cite extension. This includes the protection of closely coupled and unbalanced table start / row / end templates, which makes it possible to protect and later edit these in the VisualEditor.

On the C++ side, the work has now started to port the existing JavaScript code, starting with the Tokenizer. Basic token data structures and a reference counting scheme are implemented. The integration of the boost.asio event loop for asynchronous and parallel operations and the adaptation of the libhubbub HTML5 tree builder and libxml2 DOM are next steps.

The full list of Parsoid commits is available in Gerrit.

2012-09-17
In the JavaScript Parsoid implementation, we further improved support for round-tripping of templates and numerous other constructs. We now have an additional thirty parser tests and a similar number of round-trip tests passing. We started work on automated round-trip testing on dumps to provide a benchmark for progress and to identify the most important problem areas to focus on. We also added edit support for behavior switches and category links. To support selective serialization of the edited sections of the document without dirty diffs in unmodified sections, we are now associating DOM nodes with the source wikitext that produced that DOM.

On the C++ port, the data structures and synchronization / queueing strategies are now nearly complete. The tokenizer can handle very simple content (mainly headings) and populate the data structures. We started work on TokenTransformManagers. However, due to resource constraints, the C++ port is currently a part-time effort, with most effort going into the temporary JavaScript implementation as the safest bet for the December release.

The full list of Parsoid commits is available in Gerrit.

2012-09-monthly
The Parsoid team spent September improving the JavaScript prototype to get it ready for the December release, and improving the C++ port for longer-term deployment. The original plan to finish the C++ port before the December release looks very risky with the limited resources available, so the plan is to release the JavaScript prototype instead.

On the JavaScript side, the focus was on round-tripping of templates and other constructs such as the Cite extension, support for category links and "magic words". Many parser tests were added, and a new milestone of 603 passing round-trip tests (with 218 to go) was reached. First steps towards round-trip testing on a full dump were taken.

In the C++ implementation, the tokenizer can now support very simple content and use it to populate the internal data structures. Basic interfaces for asynchronous and parallel processing were defined. An XML DOM abstraction layer was introduced to make DOM-related algorithms independent of the DOM library used. The focus on the JavaScript prototype for the release limited the progress on the C++ implementation.

2012-10-01
JavaScript implementation:
 * Many improvements to template round-tripping and DOM source range calculations
 * Reworked paragraph wrapping to be more bug-for-bug compatible with the PHP parser
 * Many small tokenizer and round-trip fixes
 * Added many new parser tests
 * 603 round-trip tests passing, 218 to go

C++ implementation:
 * Basic token transformer skeletons with boost.asio integration are hooked up
 * New XML DOM abstraction interface to separate DOM-based code from the DOM library used; using the PugiXML DOM backend for performance and memory footprint
 * Takes back seat to JavaScript prototype implementation due to resource constraints

The full list of Parsoid commits is available in Gerrit.

2012-10-15
JavaScript implementation:
 * Many improvements to template round-tripping and DOM source range calculations
 * Added many new parser tests
 * Test runner now runs various round-trip test modes based on parser tests
 * Wikitext to wikitext round-trip tests up to 618 from 608. Total 1343 tests passing
 * Set up continuous integration with Jenkins, runs parser tests on separate virtual machine on each commit
 * Created round-trip test infrastructure on full dumps with classification into syntactic-only / semantic diffs, adding distributed client-server mode to speed it up
 * Big articles like Barack Obama are now close to round-tripping without semantic differences

C++ implementation:
 * Generalized pipeline interfaces
 * Implemented HTML5 tree builder with XML DOM backend
 * Designed and implemented token stream transformer APIs with usability improvements on the JavaScript version
 * Added Scope class (~preprocessor frame) and simplified expansion logic vs. JavaScript implementation
 * Parses simple wikitext all the way to XML DOM

The full list of Parsoid commits is available in Gerrit.

2012-10-monthly
<section begin="2012-10-monthly"/>The Parsoid team focused on testing the JavaScript prototype parser against a corpus of 100,000 randomly-selected articles from the English Wikipedia. A distributed MapReduce-like system, which uses several virtual machines on Wikimedia Labs, constantly converts articles to HTML DOM and back again to wikitext using the latest version of Parsoid. For a little over 75% of these articles, this results in exactly the same wikitext, as we intend. For another 18% of these articles, there are some differences in the wikitext, but these are so minor that they don't result in any differences in the produced HTML structure when it is re-parsed. In the production version of Parsoid, which will attempt to retain original wikitext as far as possible, these minor differences will only show up, if at all, around content that the user edited. Finally, just under 7% of articles still contain errors that change the produced HTML structure. These issues are the focus of the current work in preparation for the December release.<section end="2012-10-monthly"/>
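The three result classes described above can be illustrated with a minimal Python sketch. This is not the actual test runner: `classify_round_trip`, `toy_structure` and the sample pairs are hypothetical, and `toy_structure` is only a stand-in for re-parsing wikitext to an HTML DOM (it merely ignores whitespace differences).

```python
from collections import Counter
from typing import Callable


def classify_round_trip(original: str, round_tripped: str,
                        to_structure: Callable[[str], object]) -> str:
    """Classify one wikitext -> HTML -> wikitext round trip."""
    if original == round_tripped:
        return "exact"          # identical wikitext
    if to_structure(original) == to_structure(round_tripped):
        return "syntactic"      # wikitext differs, produced structure does not
    return "semantic"           # produced structure changed


def toy_structure(wikitext: str) -> list:
    # Hypothetical stand-in for a real re-parse: one entry per non-empty
    # line, with runs of whitespace collapsed.
    return [" ".join(line.split()) for line in wikitext.splitlines() if line.strip()]


def summarize(pairs) -> dict:
    """Return the percentage of pairs that fall into each class."""
    counts = Counter(classify_round_trip(o, r, toy_structure) for o, r in pairs)
    total = sum(counts.values())
    return {k: 100.0 * counts[k] / total for k in ("exact", "syntactic", "semantic")}


pairs = [
    ("== Title ==\nText", "== Title ==\nText"),  # identical
    ("==Title==\nText", "==Title==\nText "),     # trailing whitespace only
    ("* a\n* b", "* a\n* c"),                    # content changed
    ("foo", "foo"),                              # identical
]
print(summarize(pairs))
```

The comparison function is passed in as a parameter so that a real parser could replace the toy one without changing the classification logic.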

2012-11-12
<section begin="2012-11-12"/>Most of the work has been on the JavaScript implementation and testing infrastructure, in preparation for the December release. In the automated wikitext->HTML->wikitext testing, 75.8% of articles now round-trip to exactly the same wikitext, and 94.5% round-trip without changes to the nature of the page (the additional ~19% have changes in source wikitext only). The 94.5% figure is up from about 85% two weeks ago (rather than the 93% stated in the previous report; we discovered and fixed a bug in the error accounting process). The Barack Obama article now round-trips without any diffs.

A first iteration of the selective serialization algorithm is in development. This algorithm will hide purely syntactic differences in unmodified parts of the page by using the original wikitext for those. It heavily relies on our calculation of source ranges for each DOM element.
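As a rough illustration of the idea (not the Parsoid algorithm itself), the following Python sketch reuses the original wikitext for unmodified nodes based on their source ranges. The `Node` class, its fields and `selective_serialize` are hypothetical names, and the sketch assumes a flat list of non-overlapping top-level nodes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    start: int              # offset of the node's source in the original wikitext
    end: int                # end offset (exclusive)
    modified: bool          # True if the editor changed this node
    new_wikitext: str = ""  # freshly serialized wikitext, used only if modified


def selective_serialize(original: str, nodes: List[Node]) -> str:
    """Emit original source for unmodified nodes, new wikitext for modified ones."""
    out = []
    cursor = 0
    for node in sorted(nodes, key=lambda n: n.start):
        # Text between nodes (separators) is copied from the original untouched.
        out.append(original[cursor:node.start])
        if node.modified:
            out.append(node.new_wikitext)
        else:
            out.append(original[node.start:node.end])
        cursor = node.end
    out.append(original[cursor:])
    return "".join(out)


original = "==Heading==\n\n''old''  text\n"
nodes = [
    Node(0, 11, modified=False),                               # the heading
    Node(13, 26, modified=True, new_wikitext="''new'' text"),  # the paragraph
]
print(repr(selective_serialize(original, nodes)))
```

In this example the heading keeps its original spelling `==Heading==` even if a full serializer would normalize it differently, so the resulting diff contains only the edited paragraph.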

The full list of Parsoid commits is available in Gerrit.<section end="2012-11-12"/>

2012-11-monthly
<section begin="2012-11-monthly"/>In preparation for the upcoming deployment on the English Wikipedia, the Parsoid team concentrated on the preservation of existing content. Automated round-trip testing on 100,000 randomly chosen pages from the English Wikipedia using distributed test runners helped to identify many issues, which were fixed and often resulted in new minimal test cases being added to the parser test suite. Currently, 79.4% of test articles (up from about 65% last month) round-trip without any differences at all, an additional 18% round-trip with only minor differences (whitespace, quote style etc.), and the remaining 2.6% of pages have differences that still need fixing (down from about 15% last month). Selective serialization will further avoid dirty diffs in unmodified parts of a page by using the original wikitext for those. This will help with the roughly 20% of pages that had any kind of difference in wikitext. The implementation of this algorithm is currently being finalized.<section end="2012-11-monthly"/>

2012-12-monthly
<section begin="2012-12-monthly"/>The Parsoid project reached a major milestone with its first deployment to the English Wikipedia along with the VisualEditor. This was a major test for Parsoid, as it needed to handle the full range of arbitrary and complex existing wiki content including templates, tables and extensions for the first time.

As witnessed by the clean edit diffs, Parsoid passed this test with flying colors. This represents very hard work by the team (Gabriel Wicke, Subramanya Sastry and Mark Holmquist) on automated round-trip testing and the completion of a selective serialization strategy just in time for the release.

After catching their breath, the team now has its sights on the next phase in Parsoid development. This includes a longer-term strategy for the integration of Parsoid and HTML DOM into MediaWiki, performance improvements and better support for complex features of wikitext.<section end="2012-12-monthly"/>

2013-01-monthly
<section begin="2013-01-monthly"/>In January, the Parsoid team did some spring cleaning and bug fixing. The serialization subsystem was overhauled: it now features simpler and more robust separator handling. Selective serialization was rewritten to deal with content deletions. It also features DOM diff-based change detection that does not rely on client-side change marking. Support for non-English wikis and local configurations was also improved a lot, and will likely stabilize in the coming weeks.

The team also discussed and documented the longer-term Parsoid / MediaWiki strategy in the Parsoid roadmap. The performance-oriented C++ port was deprioritized in favor of DOM-based performance improvements and HTML storage. The basic idea behind storing (close to) fully processed HTML is to speed things up by doing no significant parsing on page view at all. In the longer term, VisualEditor-only wikis can avoid a dependency on Parsoid by switching to HTML storage exclusively. Overall, the plan is to leverage the Parsoid-generated HTML/RDFa DOM format inside MediaWiki core to enable better performance and editing capabilities in the future.<section end="2013-01-monthly"/>

2013-02-monthly
<section begin="2013-02-monthly"/>The Parsoid team continued to improve support for non-English wikis. This involved exposing more configuration information through the MediaWiki API and using it throughout Parsoid. The support is now reasonably complete, but needs testing. The round-trip testing framework needs to be adapted to support running tests on pages from multiple wikis.

A new contributor, C. Scott Ananian, improved Parsoid's performance by switching the DOM library from JSDom to Domino. He also improved image handling and contributed numerous other patches.

The tokenizer was modified to parse one top-level block at a time, which helps to spread out API requests and minimize the number of tokens in flight. The serializer is in the process of being rewritten to work on DOM input to benefit from the context provided by the DOM. This rewrite is expected to simplify the logic significantly, and help fix some more selective serialization issues that are blocking a deployment to production.

We also used the ops and core hackathon to discuss and refine our storage plans. Finally, we wrote a blog post about Parsoid on the WMF tech blog.<section end="2013-02-monthly"/>

2013-03-monthly
<section begin="2013-03-monthly"/>In March, the Parsoid team continued with improvements to internationalization, serialization, and extension handling.

The parser test framework now supports language-specific tests, which required support for loading language-specific default settings in Parsoid.

The serializer is now fully DOM-based and uses constraint-based newline / white-space separator handling, which will make the serializer less sensitive to newlines and whitespace in HTML. Round-trip test results of 82% (pages without any diffs) and 98% (pages without semantic diffs) indicate that the new serializer is on par with the old serializer currently deployed in production.
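One way to picture constraint-based separator handling is the Python sketch below. It is an assumption-laden illustration, not the serializer's actual logic: `resolve_newlines`, the `(min, max)` constraint tuples and the conflict rule are all hypothetical. Each neighboring node contributes a minimum and maximum newline count, and the newline count found in the input is kept whenever it satisfies the combined constraints.

```python
from typing import List, Optional, Tuple


def resolve_newlines(constraints: List[Tuple[int, int]],
                     original_count: Optional[int] = None) -> int:
    """Pick the number of newlines to emit between two adjacent nodes.

    constraints: one (min, max) pair per neighboring node (hypothetical format).
    original_count: newlines seen in the input between these nodes, if known.
    """
    lo = max(c[0] for c in constraints)
    # If the constraints conflict, favor the largest minimum (an assumption).
    hi = max(lo, min(c[1] for c in constraints))
    if original_count is None:
        return lo
    # Keep the input's newline count when allowed, otherwise clamp it.
    return min(max(original_count, lo), hi)


# Hypothetical constraints: the first node wants 1 to 3 newlines after it,
# the second node accepts 0 to 3 newlines before it.
print(resolve_newlines([(1, 3), (0, 3)], original_count=2))
print(resolve_newlines([(1, 3), (0, 3)], original_count=7))
```

Because the result depends only on the constraints and, where permitted, the input's own spacing, extra or missing whitespace in the HTML has little effect on the emitted wikitext.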

Extension content is now parsed all the way to DOM, which enforces proper nesting. The generic support for balanced fragment parsing will later also be applied to templates. Parsing of transclusion directives (includeonly and friends) has also been improved and simplified.

The DOM specification for images and templated / extension content was fleshed out in preparation for full editing support.

Late in March, C. Scott Ananian joined us as a contractor. Welcome!<section end="2013-03-monthly"/>

2013-04-monthly
<section begin="2013-04-monthly"/>In April, the Parsoid team successfully deployed the cumulative work done over the last four months. This includes support for non-English wiki configurations, a rewritten serialization subsystem based on server-side DOM diffs, category link and basic template parameter editing support and a long list of fixes and improvements.

Several other features for the July release are on track. The specifications for extensions containing templates and templates containing extensions were fleshed out and are currently being implemented. Similarly, our specs for images and thumbnails were vastly improved so that we will soon support full editing for all parameters.

We also improved our code quality and testing infrastructure.

In preparation for the July release, we did more benchmarking and capacity planning. A caching strategy that avoids overwhelming the API with requests was developed, hardware to run Parsoid was ordered, and work on the implementation started.<section end="2013-04-monthly"/>

2013-05-monthly
<section begin="2013-05-monthly"/>In May, the Parsoid team implemented several new features, as well as important performance optimizations in preparation for the July VisualEditor release.

A major image handling overhaul enabled rendering and editing of all image-related parameters with a relatively simple DOM structure. Template and extension editing was improved to support editing of templates within extensions. This lets editors modify and add templated citations in VisualEditor, an important feature to improve the quality of articles in Wikipedia.

On the performance front, we are now reusing expensive template, extension and image expansions from our own previous output to avoid most API queries after an edit. This is necessary to avoid overloading the API when tracking all edits on Wikimedia projects. A cache infrastructure with appropriate purging was set up and will be tested at full load through June.
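The reuse of earlier expansions can be sketched in Python as follows. This is a simplified illustration under stated assumptions: `ExpansionCache`, `fake_api` and the idea of keying expansions by their wikitext source are hypothetical, and real purging rules would be more involved.

```python
from typing import Callable, Dict


class ExpansionCache:
    """Reuse expansions from previous output; call the API only for new ones."""

    def __init__(self, fetch: Callable[[str], str]):
        self._fetch = fetch                     # stand-in for an API query
        self._expansions: Dict[str, str] = {}
        self.api_calls = 0

    def seed_from_previous_output(self, previous: Dict[str, str]) -> None:
        # Expansions recovered from the previously generated output.
        self._expansions.update(previous)

    def expand(self, source: str) -> str:
        if source not in self._expansions:
            self.api_calls += 1
            self._expansions[source] = self._fetch(source)
        return self._expansions[source]

    def purge(self, source: str) -> None:
        # Drop an entry, e.g. after the underlying template was edited.
        self._expansions.pop(source, None)


def fake_api(source: str) -> str:
    # Hypothetical expansion: wrap the transclusion body in a span.
    return "<span>" + source.strip("{}") + "</span>"


cache = ExpansionCache(fake_api)
cache.seed_from_previous_output({"{{cite|a}}": "<span>cite|a</span>"})
html = [cache.expand(s) for s in ["{{cite|a}}", "{{cite|b}}", "{{cite|a}}"]]
print(cache.api_calls)
```

Here only the newly added `{{cite|b}}` triggers a call to the stand-in API; the unchanged transclusions are served from the seeded entries.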

At the Amsterdam hackathon, we helped developers leverage our rich HTML+RDFa DOM output for projects like a Wikipedia-to-SMS service or the Kiwix offline Wikipedia reader.<section end="2013-05-monthly"/>