Parsoid/C++

Libraries

 * peg/leg is a plain C library which supports the same grammar features as PEG.js, and can include C++ code in actions.
 * DOM library candidates:

Webkit
Advantages: HTML5 tree builder, reasonable token objects, JS runtime integration. Disadvantages: complex integration and build system.

The parser code is at. We'd need to write a wrapper similar to HTMLDocumentParser.

Related projects:
 * PhantomJS embeds a headless WebKit which supports loading web pages, running any JS that is embedded in them, and exporting the resulting DOM. It performs full rendering and can save screenshots or PDFs of pages.
 * Webkitdriver is a headless port which cuts out the rendering portion (and replaces it with stubs).

Gecko

 * parser source, DOM source
 * Also on Github

Related projects:
 * Crowbar is a Gecko wrapper without rendering

TODO:
 * We should investigate whether we can extract the required pieces from Gecko and use them independently.

hubbub
A simple stand-alone HTML5 parser written for the NetSurf browser project, about 18k lines of simple C. The example libxml DOM integration would be a good starting point for a minimalist solution that still provides libxml features like XSLT and XPath.

pugixml
Another lightweight parser and DOM generator, nice source that looks portable. The list of outstanding bugs is inoffensive.

Boost.PropertyTree
Based on RapidXML.

Memory management
Token chunks are cached and shared between concurrent expansion threads, so a mechanism like refcounting would be needed. This documentation about refcounted pointers in WebKit is quite interesting: http://www.webkit.org/coding/RefPtr.html. Doing this per-chunk should help to amortize the overheads of thread-safe refcounting.
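The per-chunk idea can be sketched as follows. This is only an illustration: it uses std::shared_ptr from the standard library (whose reference count is updated atomically) rather than WebKit's RefPtr, and the names Token, TokenChunk, TokenChunkPtr and makeChunk are hypothetical, not existing Parsoid types.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical token type; real token objects would carry more state.
struct Token {
    std::string type;
    std::string text;
};

// An immutable batch of tokens. A single reference count covers the whole
// chunk, so the cost of atomic increments and decrements is paid once per
// chunk instead of once per token.
class TokenChunk {
public:
    explicit TokenChunk(std::vector<Token> tokens)
        : tokens_(std::move(tokens)) {}

    const std::vector<Token>& tokens() const { return tokens_; }

private:
    std::vector<Token> tokens_;
};

// Shared, read-only handle. Copying the handle from several threads is safe
// because the control block's counter is atomic; the chunk itself is const.
using TokenChunkPtr = std::shared_ptr<const TokenChunk>;

TokenChunkPtr makeChunk(std::vector<Token> tokens) {
    return std::make_shared<const TokenChunk>(std::move(tokens));
}
```

A cache and several expansion threads would each hold a TokenChunkPtr copy; the chunk is freed when the last copy goes away. An intrusive counter in the style of RefPtr would avoid the separate control block, at the price of more hand-written code.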

Thread architecture and PHP interfacing
We would like to split parser execution into at least three separate threads: the tokenizer, the token stream transforms and the tree builder. Token stream transforms can be parallelized further. These threads could be started on demand for each call to the parser, or kept around in a thread pool. PHP extensions can register a module setup function which is called when Apache starts up, and can allocate SAPI-global state that is preserved across requests. This could be used to set up a Parsoid thread pool per SAPI.
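A minimal sketch of the three-stage split, assuming C++17 and threads started on demand for a single call (no thread pool, no PHP integration). Channel and runPipeline are hypothetical names, and the stages are toy stand-ins: the "tokenizer" splits on whitespace, the "transform" upper-cases each token, and the "tree builder" just joins tokens into a string.

```cpp
#include <cctype>
#include <condition_variable>
#include <mutex>
#include <optional>
#include <queue>
#include <sstream>
#include <string>
#include <thread>
#include <utility>

// Hypothetical blocking FIFO connecting two pipeline stages.
template <typename T>
class Channel {
public:
    void push(T value) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(std::move(value));
        }
        ready_.notify_one();
    }

    // Signals that no further items will be pushed.
    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        ready_.notify_all();
    }

    // Blocks until an item is available; returns nullopt once the channel
    // is closed and drained.
    std::optional<T> pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !queue_.empty() || closed_; });
        if (queue_.empty()) return std::nullopt;
        T value = std::move(queue_.front());
        queue_.pop();
        return value;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<T> queue_;
    bool closed_ = false;
};

// Runs tokenizer, transform and tree builder on separate threads.
std::string runPipeline(const std::string& source) {
    Channel<std::string> tokens;
    Channel<std::string> transformed;
    std::string tree;

    std::thread tokenizer([&] {
        std::istringstream in(source);
        std::string word;
        while (in >> word) tokens.push(word);
        tokens.close();
    });

    std::thread transformer([&] {
        while (auto tok = tokens.pop()) {
            for (char& c : *tok)
                c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
            transformed.push(std::move(*tok));
        }
        transformed.close();
    });

    std::thread builder([&] {
        while (auto tok = transformed.pop()) {
            if (!tree.empty()) tree += ' ';
            tree += *tok;
        }
    });

    tokenizer.join();
    transformer.join();
    builder.join();
    return tree;  // safe to read: the builder thread has been joined
}
```

Each channel has a single producer and a single consumer, so token order is preserved. Parallelizing the transform stage further would need some way to reorder or tag results, which this sketch does not attempt.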

PHP generally supports synchronous callbacks into the interpreter (example: luasandbox.c). Callbacks need to be performed in a single-threaded fashion. Asynchronous callbacks (with the main PHP thread running in parallel) are not supported, as all internal state (memory allocation etc.) assumes single-threaded execution. Simple asynchronous signaling can still be performed via the socket-like stream API or file descriptors.

Error handling in PHP is normally done via longjmp, which should be avoided since it would leave the parser in an undefined state. It might be possible to wrap all callbacks in try/catch and register error handlers so that the longjmp is avoided.

For the parser, template source retrieval would be the main application for parallel IO. The need for this can be reduced considerably by passing the parser the source of included templates (the list is available in the links table), retrieved using a batch lookup from memcached. Only cache misses would then trigger the sequential DB retrieval.
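The lookup order described above can be sketched as follows. Everything here is a hypothetical stand-in: TemplateStore models memcached and the database as in-memory maps, and cacheGetMulti, dbGet and prefetchTemplates are illustrative names, not real memcached or MediaWiki APIs.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

using SourceMap = std::map<std::string, std::string>;

// Hypothetical stand-in for the cache and the database.
struct TemplateStore {
    SourceMap cache;     // plays the role of memcached
    SourceMap database;  // plays the role of the DB
    int dbLookups = 0;   // counts sequential DB retrievals

    // One batched request for all titles; returns only the hits.
    SourceMap cacheGetMulti(const std::vector<std::string>& titles) const {
        SourceMap hits;
        for (const auto& title : titles) {
            auto it = cache.find(title);
            if (it != cache.end()) hits.insert(*it);
        }
        return hits;
    }

    // Single-title DB retrieval; returns false if the template is missing.
    bool dbGet(const std::string& title, std::string& out) {
        ++dbLookups;
        auto it = database.find(title);
        if (it == database.end()) return false;
        out = it->second;
        return true;
    }
};

// Collects the sources of all included templates before parsing starts:
// one batch cache lookup, then a sequential DB lookup per cache miss.
SourceMap prefetchTemplates(TemplateStore& store,
                            const std::vector<std::string>& titles) {
    SourceMap sources = store.cacheGetMulti(titles);
    for (const auto& title : titles) {
        if (sources.count(title)) continue;  // cache hit, nothing to do
        std::string text;
        if (store.dbGet(title, text)) sources[title] = text;
    }
    return sources;
}
```

The resulting map would be handed to the parser up front, so the parser itself only needs to do IO for templates that were not in the list.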

Bindings to other languages
Writing an API description in SWIG will make it simple to provide bindings for PHP, Python and other scripting languages.

There's no portable way to make callbacks into the scripting language, but I think we can store variables in a shared memory space writable by either peer. Maybe it's possible to declare the complete domain of environment variables before calling the parser extension.