During the Parsoid/PHP porting process, we've come across some issues which, while not blockers to our porting effort, would be nice to address. We're not going to work on these (at least not right away) since they are not on our critical path, but we'd love to get help on these.
PHP DOM issues
The PHP DOM extension is a wrapper around libxml2 with a thin layer of DOM compatibility on top ("To some extent libxml2 provides support for the following additional specifications but doesn't claim to implement them completely [...] Document Object Model (DOM) Level 2 Core [...] but it doesn't implement the API itself, gdome2 does this on top of libxml2"). It is far from a modern, standards-compliant HTML5 DOM implementation and is barely maintained, let alone kept in sync with the WHATWG's pace of change. phab:T215000 has details on the issues and our workarounds, and phab:T218183 (help wanted!) describes a task to audit existing uses of the DOM extension in the MediaWiki code base. phab:T217867 (help wanted!) is a wish for a modern DOM implementation for PHP, either by binding to an existing library (C or Rust or ...?) or by porting domino or another existing implementation.
Some issues we've worked around could use actual fixes upstream:
- Case of tag names in the DOM: phab:T217700
- Performance of
- Performance of
- Fixes for DOMDocument#getElementById: phab:T215000#5002986, PHP bug #77686
- Fix node type 13 at root level of
- DOMDocument#createElement doesn't set namespace: phab:T215000#5003044
- DOMNode::normalize() doesn't remove empty nodes: phab:T215000#5290553, PHP bug #78221
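To see how a given PHP build behaves on a few of these points, a small probe along the following lines may help. This is only a sketch: `probeDomQuirks()` is a hypothetical helper, not part of Parsoid or MediaWiki, and the linked tasks remain the authoritative descriptions of each issue. For comparison, the WHATWG DOM reports an uppercase `DIV` and the XHTML namespace for HTML elements in an HTML document, and `normalize()` is specified to remove empty text nodes.

```php
<?php
// Hypothetical probe: reports how the PHP DOM extension behaves on this build.
function probeDomQuirks(): array {
    $doc = new DOMDocument();
    $doc->loadHTML('<!DOCTYPE html><html><body><div></div></body></html>');
    $div = $doc->getElementsByTagName('div')->item(0);

    // An element created through the non-namespaced factory method.
    $created = $doc->createElement('span');

    // Add an empty text node, then ask the DOM to normalize the subtree.
    $div->appendChild($doc->createTextNode(''));
    $div->normalize();

    return [
        'parsedNodeName' => $div->nodeName,            // case of tag names
        'parsedNamespace' => $div->namespaceURI,       // namespace of parsed elements
        'createdNamespace' => $created->namespaceURI,  // namespace of created elements
        // Number of children left after normalize(); this depends on the PHP version.
        'childrenAfterNormalize' => $div->childNodes->length,
    ];
}

var_dump(probeDomQuirks());
```

Comparing the output across PHP versions is a cheap way to check whether an upstream fix has landed before removing a workaround.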
The MediaWiki code base has a number of different, incompatible ways to do the same thing. We'd like to standardize on a small number of libraries. For example, the HtmlFormatter library (used by MobileFrontend/CirrusSearch/TextExtracts) contains workarounds for PHP DOM issues which are similar-but-not-quite-the-same as the workarounds implemented for Parsoid. It also contains a CSS-to-XPath translator which is (again) similar-but-not-quite-the-same as the one used by Parsoid (and elsewhere). phab:T217360 (help wanted!) suggests converting HtmlFormatter to use Remex and Zest.
Lots of other places in MediaWiki use the PHP DOM extension's loadHTML methods, with various workarounds for its bugs. They should be unified to use Remex (once some performance issues are dealt with; see phab:T212543) and an appropriate serializer.
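For readers unfamiliar with this pattern, the sketch below shows the kind of wrapper that tends to accumulate around loadHTML. It is an illustration only: `loadHtmlFragment()` is a hypothetical name, and the two workarounds shown (forcing UTF-8 through a meta declaration, and silencing libxml2's complaints about HTML5 tags it does not recognize) are common general-purpose ones, not necessarily the ones any particular MediaWiki component uses.

```php
<?php
// Hypothetical helper illustrating common loadHTML workarounds.
function loadHtmlFragment(string $html): DOMDocument {
    // Without an explicit charset declaration, libxml2 may not treat the
    // input as UTF-8, so wrap the fragment in a document that declares it.
    $wrapped = '<!DOCTYPE html><html><head>'
        . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
        . '</head><body>' . $html . '</body></html>';

    $doc = new DOMDocument();
    // Collect parse errors internally instead of emitting PHP warnings,
    // e.g. for HTML5 elements unknown to libxml2's HTML parser.
    $previous = libxml_use_internal_errors(true);
    try {
        $doc->loadHTML($wrapped);
    } finally {
        libxml_clear_errors();
        libxml_use_internal_errors($previous);
    }
    return $doc;
}
```

Each caller that carries its own slightly different copy of such a wrapper is a candidate for the unification described above.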
Remex (HTML parser) improvements
We replaced tidy with Remex! However, Remex has some documentation debt. It would be nice to get some help improving Remex as a stand-alone library and evangelizing its use. Examples include improving documentation (phab:T217849, help wanted!), adding utility methods to make it easier to use (phab:T217850, help wanted!), and further improving performance (phab:T212543). Other feature requests:
- Indexing ID attributes: phab:T217696, help wanted!
- Upstream PHP feature request: https://bugs.php.net/bug.php?id=77744
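Until something like this exists, callers that need fast lookups by ID can build their own index over a parsed document. The following is a minimal sketch under that assumption: `buildIdIndex()` is a hypothetical helper written against the PHP DOM extension, not a Remex API and not the design proposed in the task.

```php
<?php
// Hypothetical helper: map each id attribute value to the first element
// carrying it, in document order.
function buildIdIndex(DOMDocument $doc): array {
    $index = [];
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//*[@id]') as $element) {
        $id = $element->getAttribute('id');
        // Keep the first occurrence so duplicate ids resolve predictably.
        if (!isset($index[$id])) {
            $index[$id] = $element;
        }
    }
    return $index;
}
```

An index built this way goes stale as soon as the tree is mutated, which is one reason to prefer having the parser or the DOM implementation maintain it.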
Help with actual Parsoid porting
If you want to help with porting, feel free to jump into the #mediawiki-parsoid channel and ask. Or email ssastry or parsing-team.