Parsoid/C++

Libraries

 * peg/leg is a parser generator that emits plain C. It supports the same grammar features as PEG.js, and actions can include C++ code.
 * DOM library candidates:

WebKit
Advantages: HTML5 tree builder, reasonable token objects, JS runtime integration. Disadvantages: complex integration and build system.

The parser code is at. We'd need to write a wrapper similar to HTMLDocumentParser.

PhantomJS is an example of a runtime that embeds WebKit. It supports loading web pages, running any JS embedded in them, and exporting the resulting DOM.

Memory management
Token chunks are cached and shared between concurrent expansion threads, so a mechanism like refcounting would be needed. The WebKit documentation on refcounted pointers is a useful reference: http://www.webkit.org/coding/RefPtr.html. Refcounting per chunk rather than per token should help amortize the overhead of thread-safe refcounting.
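A minimal sketch of per-chunk refcounting, assuming std::shared_ptr as the refcounting mechanism (the notes above do not commit to one, and WebKit's RefPtr would be an alternative). Token, TokenChunk, makeChunk and countTokensConcurrently are hypothetical names used only for illustration. Chunks are treated as immutable once built, so readers need no locking and the only shared mutable state is the reference count.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical token type; real tokens would carry attributes, offsets, etc.
struct Token {
    std::string text;
};

// A chunk is never modified after construction, so it can be read concurrently.
using TokenChunk = std::vector<Token>;
using ChunkPtr = std::shared_ptr<const TokenChunk>;

// Build a chunk once. A single atomic reference count covers every token in it.
ChunkPtr makeChunk(std::vector<Token> tokens) {
    return std::make_shared<TokenChunk>(std::move(tokens));
}

// Share one chunk with several threads and sum the token counts they observe.
std::size_t countTokensConcurrently(const ChunkPtr& chunk, unsigned threads) {
    assert(chunk != nullptr);
    std::vector<std::size_t> seen(threads, 0);
    std::vector<std::thread> pool;
    for (unsigned i = 0; i < threads; ++i) {
        // Capturing by value copies the shared_ptr: one atomic increment per
        // thread and chunk, independent of how many tokens the chunk holds.
        pool.emplace_back([chunk, &seen, i] { seen[i] = chunk->size(); });
    }
    for (auto& t : pool) t.join();
    std::size_t total = 0;
    for (std::size_t n : seen) total += n;
    return total;
}
```

The cost of the atomic increment and decrement is paid once per chunk handed to a thread, which is the amortization argument made above.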

Thread architecture
We would like to split parser execution across at least three threads: one each for the tokenizer, the token stream transforms, and the tree builder. The token stream transforms can be parallelized further.
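The three-stage split can be sketched with one thread per stage and a blocking queue between neighbours. This is only an illustration under assumptions: Channel and runPipeline are hypothetical names, and the stages are stand-ins (whitespace splitting, upper-casing, and collecting into a vector) rather than a real tokenizer, transform or tree builder.

```cpp
#include <cassert>
#include <cctype>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <sstream>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Minimal blocking queue. close() tells the consumer that no more items follow.
template <typename T>
class Channel {
public:
    void push(T value) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            assert(!closed_ && "push after close");
            items_.push(std::move(value));
        }
        ready_.notify_one();
    }
    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        ready_.notify_all();
    }
    // Blocks until an item arrives; returns false once closed and drained.
    bool pop(T& out) {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !items_.empty() || closed_; });
        if (items_.empty()) return false;
        out = std::move(items_.front());
        items_.pop();
        return true;
    }
private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<T> items_;
    bool closed_ = false;
};

// Runs the three stand-in stages concurrently and returns the collected output.
std::vector<std::string> runPipeline(const std::string& source) {
    Channel<std::string> rawTokens, transformed;
    std::vector<std::string> tree;  // written only by the tree builder thread

    std::thread tokenizer([&] {
        std::istringstream in(source);
        std::string word;
        while (in >> word) rawTokens.push(word);
        rawTokens.close();
    });
    std::thread transformer([&] {
        std::string tok;
        while (rawTokens.pop(tok)) {
            for (char& c : tok)
                c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
            transformed.push(std::move(tok));
        }
        transformed.close();
    });
    std::thread treeBuilder([&] {
        std::string tok;
        while (transformed.pop(tok)) tree.push_back(std::move(tok));
    });

    tokenizer.join();
    transformer.join();
    treeBuilder.join();
    return tree;
}
```

Because each queue has a single producer and a single consumer, token order is preserved. Parallelizing the transform stage further would need some way to restore order, which this sketch does not attempt.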

PHP generally supports async callbacks into the interpreter, but the callbacks need to be performed in a single-threaded fashion. Error handling in PHP is normally done via longjmp, which should be avoided since it would leave the parser in an undefined state. It might be possible to wrap all callbacks in try/catch and register error handlers so that the longjmp is avoided.
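On the C++ side, the single-threaded requirement could be met by funnelling every callback through one gate. The sketch below is generic and does not use any PHP API: CallbackGate and countThroughGate are hypothetical names, and a C++ exception merely stands in for an error raised by the callback. Note that a C++ try/catch does not intercept a longjmp, so this only shows the serialization and the error boundary, not how PHP's bailout would actually be prevented.

```cpp
#include <cassert>
#include <exception>
#include <functional>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Serializes callbacks from any number of threads and contains their errors.
class CallbackGate {
public:
    // Runs cb while holding the gate. Returns false and records a message if
    // cb throws, so the caller's state is never unwound unexpectedly.
    bool call(const std::function<void()>& cb) {
        assert(cb && "callback must be callable");
        std::lock_guard<std::mutex> lock(mutex_);
        try {
            cb();
            return true;
        } catch (const std::exception& e) {
            lastError_ = e.what();
        } catch (...) {
            lastError_ = "unknown error";
        }
        return false;
    }
    std::string lastError() {
        std::lock_guard<std::mutex> lock(mutex_);
        return lastError_;
    }
private:
    std::mutex mutex_;
    std::string lastError_;
};

// Usage sketch: worker threads bump a plain int, but only through the gate.
int countThroughGate(CallbackGate& gate, int threads, int callsPerThread) {
    int counter = 0;  // deliberately not atomic; the gate provides exclusion
    std::vector<std::thread> pool;
    for (int i = 0; i < threads; ++i) {
        pool.emplace_back([&gate, &counter, callsPerThread] {
            for (int j = 0; j < callsPerThread; ++j)
                gate.call([&counter] { ++counter; });
        });
    }
    for (auto& t : pool) t.join();
    return counter;
}
```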

For the parser, template source retrieval would be the main application of parallel IO. The need for it can be reduced considerably by passing the parser the source of included templates up front: the list of templates is available in the links table, and their source can be retrieved with a batch lookup from memcached. Only cache misses would then trigger sequential DB retrieval.
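The batch-then-fallback lookup could look roughly like this. Everything here is an assumption made for illustration: TemplateStore, batchGet, fetchOne and loadTemplateSources are hypothetical names, with batchGet standing in for a memcached multi-key get and fetchOne for a single DB query. No real memcached or database client API is used.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

using SourceMap = std::unordered_map<std::string, std::string>;

// Hypothetical backends, injected so the sketch stays self-contained.
struct TemplateStore {
    // Returns the sources of all titles found in the cache, in one round trip.
    std::function<SourceMap(const std::vector<std::string>&)> batchGet;
    // Fetches a single template source from the database.
    std::function<std::string(const std::string&)> fetchOne;
};

// Resolves every title: one batch cache lookup, then one DB fetch per miss.
// If misses is non-null, it receives the number of DB fetches performed.
SourceMap loadTemplateSources(const std::vector<std::string>& titles,
                              const TemplateStore& store,
                              std::size_t* misses = nullptr) {
    assert(store.batchGet && store.fetchOne);
    SourceMap sources = store.batchGet(titles);
    std::size_t missCount = 0;
    for (const auto& title : titles) {
        if (sources.find(title) == sources.end()) {
            // Cache miss: fall back to sequential DB retrieval.
            sources[title] = store.fetchOne(title);
            ++missCount;
        }
    }
    if (misses) *misses = missCount;
    return sources;
}
```

The resulting map would be handed to the parser before it starts, so that template expansion does not block on IO for anything the links table already predicted.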