Parsoid/C++

Libraries

 * peg/leg is a parser generator written in plain C. It supports the same grammar features as PEG.js, and its actions can include C++ code.
 * DOM library candidates:

WebKit
Advantages: HTML5 tree builder, reasonable token objects, JS runtime integration. Disadvantages: complex integration and build system.

The parser code is at. We'd need to write a wrapper similar to HTMLDocumentParser.

PhantomJS is an example of a runtime that embeds WebKit. It supports loading web pages, running any JS embedded in them, and exporting the resulting DOM.

Memory management
Token chunks are cached and shared between concurrent expansion threads, so a mechanism like refcounting would be needed to manage their lifetime. The WebKit documentation on refcounted pointers is a useful reference: http://www.webkit.org/coding/RefPtr.html. Counting references per chunk rather than per token should help to amortize the overhead of thread-safe refcounting.
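
A minimal sketch of per-chunk refcounting, using std::shared_ptr as a stand-in for a WebKit-style RefPtr (its reference count is updated atomically). The names Token, TokenChunk, makeChunk, countTextChars and expandConcurrently are hypothetical and only illustrate the idea; they are not part of any existing Parsoid or WebKit API.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical token type; real token objects would carry more state.
struct Token {
    std::string type;
    std::string text;
};

// An immutable batch of tokens. The reference count lives on the chunk,
// so the atomic increment/decrement is paid once per chunk, not per token.
using TokenChunk = std::vector<Token>;
using TokenChunkPtr = std::shared_ptr<const TokenChunk>;

TokenChunkPtr makeChunk(std::vector<Token> tokens) {
    return std::make_shared<const TokenChunk>(std::move(tokens));
}

// Stand-in for an expansion step: counts the characters in text tokens.
std::size_t countTextChars(const TokenChunk& chunk) {
    std::size_t total = 0;
    for (const Token& tok : chunk) {
        if (tok.type == "text") total += tok.text.size();
    }
    return total;
}

// Runs the same read-only step on `workers` threads that share one chunk.
std::size_t expandConcurrently(const TokenChunkPtr& chunk, int workers) {
    std::vector<std::size_t> results(static_cast<std::size_t>(workers), 0);
    std::vector<std::thread> threads;
    for (int i = 0; i < workers; ++i) {
        // Capturing the shared_ptr by value takes one reference per thread.
        threads.emplace_back([chunk, &results, i] {
            results[static_cast<std::size_t>(i)] = countTextChars(*chunk);
        });
    }
    for (std::thread& t : threads) t.join();

    std::size_t sum = 0;
    for (std::size_t r : results) sum += r;
    return sum;
}
```

Because the chunk is const once published, the threads only need synchronization for the reference count itself, not for the token data.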

Thread architecture
We would like to split parser execution across at least three threads: one each for the tokenizer, the token stream transforms and the tree builder. The token stream transforms can be parallelized further.
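
A rough sketch of such a pipeline, assuming C++17 and one blocking queue between each pair of stages. Channel and runPipeline are hypothetical names, and the stages are toy placeholders: whitespace splitting for the tokenizer, uppercasing for the transform, and a flat vector for the tree.

```cpp
#include <cctype>
#include <condition_variable>
#include <mutex>
#include <optional>
#include <queue>
#include <sstream>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical blocking FIFO connecting two pipeline stages.
template <typename T>
class Channel {
public:
    void push(T value) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(std::move(value));
        }
        ready_.notify_one();
    }

    // Signals that no further items will be pushed.
    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        ready_.notify_all();
    }

    // Blocks until an item arrives; returns nullopt once closed and drained.
    std::optional<T> pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !queue_.empty() || closed_; });
        if (queue_.empty()) return std::nullopt;
        T value = std::move(queue_.front());
        queue_.pop();
        return value;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<T> queue_;
    bool closed_ = false;
};

std::vector<std::string> runPipeline(const std::string& input) {
    Channel<std::string> rawTokens;    // tokenizer -> transform
    Channel<std::string> transformed;  // transform -> tree builder
    std::vector<std::string> tree;     // written only by the builder thread

    std::thread tokenizer([&] {
        std::istringstream in(input);
        std::string word;
        while (in >> word) rawTokens.push(word);
        rawTokens.close();
    });

    std::thread transformer([&] {
        while (auto tok = rawTokens.pop()) {
            for (char& c : *tok) {
                c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
            }
            transformed.push(std::move(*tok));
        }
        transformed.close();
    });

    std::thread builder([&] {
        while (auto tok = transformed.pop()) tree.push_back(std::move(*tok));
    });

    tokenizer.join();
    transformer.join();
    builder.join();
    return tree;
}
```

With a single producer and a single consumer per channel, token order is preserved. Running several transform threads in parallel would need an explicit reordering step before the tree builder, which this sketch does not attempt.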

PHP generally supports async callbacks into the interpreter, but the exact synchronization requirements for this need to be investigated.