Parsing/Notes/Two Systems Problem

Why this page?
This document collects the proposals and ideas raised in different venues / discussions to address the 2-systems (wikitext) parser problem. Relatedly, the requirements of a single unified parser aren't fully agreed upon yet, especially around support for 3rd party wikis and what kind of wikitext evolution we want to get to.

This page is an attempt to pull together all related concerns and proposals in one place so we can have a more informed discussion about how to move ahead wrt solving the 2-systems problem. Some of these proposals and sections need to be filled out in more detail. The proposals on this page should not be considered a roadmap or what will or should happen. These notes have been pulled together for discussion and will inform RFCs, possible dev summit proposals, and talks. The parsing team roadmap will evolve through those other processes.

2-systems problem in parsing land
Right now, in parsing land, we have the 2-systems problem in 2 different ways.
 * For Wikimedia wikis, there are two HTML renderings of a page
 * PHP Parser HTML - used for read views, action=edit previews
 * Parsoid HTML - used by VE, CX, Flow, MCS, OCG + Google, Kiwix

This is not tenable in the long term. Migrating to a single parser that supports both read and edit requirements would solve both of these in one shot, but let us consider the two separately because some proposed solutions don't require consolidation behind a single parser.
 * For MediaWiki the software, we have two different parsers
 * PHP parser - does not support VE, CX, OCG, Kiwix, has limited support for Flow
 * Parsoid - does not replace all PHP parser components + 3rd party installation woes + awkward boundary between Parsoid & MediaWiki + some performance concerns

Source of output differences between Parsoid & PHP parser
There are 2 primary reasons why output differs between Parsoid & the PHP parser:

DOM processing vs. string processing
In order to support HTML editing clients like VE, Parsoid tries hard to map the macro-preprocessing-like parsing model to a DOM-centric model. This mostly works, except when it doesn't (link to other documents for more info).

Reliance on state that Parsoid doesn't have direct access to
Right now, Parsoid is structured as a separate stateless service and Parsoid HTML = f(wikitext, mw-config). Parsoid caches the wiki configs at startup and computes output HTML as a function of the input wikitext. This model is not reflective of how the PHP parser generates its output HTML.

Right now, MediaWiki HTML output depends on much more than that, i.e. MediaWiki-HTML = F_php(wikitext, mw-config, media-resources, user-state, site-messages, corpus-state, PHP-parser-hooks, tidy-hacks). These inputs include:
 * input wikitext
 * wiki config (including installed extensions)
 * media resources (images, audio, video)
 * PHP parser hooks that expose parsing internals and implementation details (not replicable in other parsers)
 * wiki messages (ex: cite output)
 * state of the corpus and other db state (ex: red links, bad images)
 * user state (prefs, etc.)

Parsoid gets away with its assumption by (a) calling the MediaWiki API to support rendering of media resources, native extensions and other hooks, and (b) proposing to run HTML transformations to support things like red links, bad images and user preferences. Parsoid supports some Tidy hacks, but has rendering differences nevertheless. Replacing Tidy with an HTML5 parser effectively eliminates the tidy-hacks as one of the inputs.

Possible refactoring of this computation model
However, output still depends on external state, which could be handled differently. For example, changes to the corpus (a change in link state) can force a reparse of the entire page. Similarly, a page may need to be reparsed to accommodate user preferences or a change in site messages. If, instead, we could have MediaWiki-HTML be a series of composable transformations, then we could improve cacheability and untangle some of the dependencies.

i.e. make MediaWiki HTML = f1(f2(f3(wikitext, mw-config), user-state), corpus-state) and so on. This lets you cache the output of f3 and apply the remaining transformations to it (ideally client-side since that lets you return cached output rapidly, but could also be server-side if necessary).
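
As a rough illustration of the composition above, here is a minimal Python sketch. Everything in it is hypothetical: the function names simply mirror f1, f2 and f3 from the formula, the toy "parser" only understands [[Target]] links, the links_in_new_tab preference is made up, and tagging links to missing pages with class="new" is only a stand-in for red-link handling. It is not how Parsoid or the PHP parser work; it only shows that the expensive step can be cached while the cheaper passes run per request.

```python
import re
from functools import lru_cache


@lru_cache(maxsize=None)
def f3(wikitext: str, mw_config: str) -> str:
    """Base render: depends only on wikitext and wiki config, so it is cacheable."""
    # Toy stand-in for the expensive parse: only [[Target]] links are handled.
    return re.sub(r"\[\[([^\]|]+)\]\]", r'<a href="/wiki/\1">\1</a>', wikitext)


def f2(html: str, user_state: dict) -> str:
    """Per-user pass (hypothetical preference): open links in a new tab."""
    if user_state.get("links_in_new_tab"):
        return html.replace("<a ", '<a target="_blank" ')
    return html


def f1(html: str, existing_pages: frozenset) -> str:
    """Corpus pass: tag links whose target page does not exist."""
    def mark(match: re.Match) -> str:
        if match.group(1) in existing_pages:
            return match.group(0)
        return match.group(0).replace("<a ", '<a class="new" ', 1)

    return re.sub(r'<a [^>]*href="/wiki/([^"]+)"[^>]*>', mark, html)


def render(wikitext: str, mw_config: str, user_state: dict,
           existing_pages: frozenset) -> str:
    """MediaWiki HTML = f1(f2(f3(wikitext, mw-config), user-state), corpus-state)."""
    return f1(f2(f3(wikitext, mw_config), user_state), existing_pages)
```

With this shape, a change in link state or user preferences reruns only f1 or f2; f3 is served from the cache as long as the wikitext and config are unchanged.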

The original proposal for using Parsoid HTML for read views was based on this idea of composable transformations. This idea is up for discussion.

Requirements of the unified parser

 * Needs to support all existing HTML editing clients (VE, CX, Flow, etc.) and functionality
 * Full feature and output compatibility wrt reading clients, tools, bots, etc
 * Needs to be "performant enough" for use on the Wikimedia cluster
 * Ideally, it will support the long-term move to DOM-based semantics
 * Cannot drastically break rendering of old revisions
 * What kind of 3rd party installation considerations are relevant? Many 3rd party wikis might have expectations of being able to use the latest and greatest m/w features (VE & Flow) found on Wikipedias.
 * Do all installation solutions (shared hosting, 1 click-install, VMs, containers etc.) need to provide this support? This question was the focus of a 2016 dev summit discussion and hasn't yet been resolved satisfactorily. The answer to this question has implications for the unified parser discussion. Alternatively, the decision made for the unified parsing solution will de facto dictate what is possible for 3rd parties. The question to resolve: which is the dog, which is the tail .. ? :)
 * Note that even if we have a PHP/PHP-binary-compatible version of Parsoid for 3rd party wikis (options 2 & 3 below), there are also RESTBase, Citoid, Graphoid and Mathoid, which are all node.js services as well. RESTBase could be optional as long as those wikis don't care about dirty diffs with VE edits, but, if they want Citoid (or maps), are we now expecting a PHP/PHP-binary-compatible port of Citoid (and Graphoid) as well?
 * On the other hand, we did make it reasonably easy to use Scribunto without installing any PHP extensions or running external services, even though the WMF cluster does use a PHP extension. Image scaling, pre-Mathoid math rendering, and Tidy have long worked on a plain PHP installation too ([Brad]: AFAIK we're wanting to kill Tidy because it doesn't do HTML5 well, not because the integration is a problem; [SSS]: Yes, that is correct, integration is not a driver for Tidy replacement).
 * However, if a 3rd party wiki does not want VE, but only wants wikitext editing without a non-PHP language dependency for the foreseeable future, then that eliminates option 1. But, that means not all 3rd party wikis will get the latest and greatest features of MediaWiki as found on Wikipedias.
 * Question: Perhaps we need a support matrix for installation modes like for browsers .. i.e. what installation modes must support what features, which at least provides some agreement that such a thing is needed (rather than the amorphous "everything must support everything" requirement) and let the discussion and debate move towards how the table cells are filled up.

Tangent: Nomenclature
A small tangent. Parser is no longer an accurate name since we are now looking at transformations in both directions between wikitext and HTML. Separately, the content-handler abstraction enables other content representations, for example, Markdown, JSON, HTML, etc.

All that said, wikitext is still the primary content format for Wikimedia and many 3rd party wikis. And, if we restrict our attention to just this wikitext component, perhaps a wikitext runtime/implementation (?) or something else is a better name? (Maybe a bi-directional compiler (or, transpiler))

Possible approaches for addressing the 2-systems problem
Here are some options that have been proposed.

Option 1: Deprecate the core parser and replace it with Parsoid
+ no need to replicate new functionality in PHP parser

+ a faster path to get to the end goal

- eliminates the simple install option for MediaWiki, the software.

- introduces a hard dependency on a non-PHP component.

? with a future simplified processing model, there is potential to run this code on the client. I've been asked about this more than once over the years, including very recently. It would be feasible perf-wise since p95 times are 3-4s, but it is not really practical because today's Parsoid is bulky. This is not really a serious consideration, but I am throwing it out there since I didn't want to lose track of it.

Some solutions that have been proposed for the negatives here are:
 * Embed PHP inside JavaScript via node-php-embed (requires existing installs to shift to node.js providers)
 * Use container solutions like docker as a distribution solution for PHP MediaWiki + non-PHP services (could work, but unclear how much consensus / energy there is behind this)
 * When asked about using docker as a deployment method for other software that has made that choice, our own Ops considered it problematic (private mailing list link). Other organizations might have the same sort of reservations were we to choose that solution.

Option 2: Port Parsoid functionality to PHP and ditch Parsoid
+ everything is back in PHP land

- non-trivial amount of work

- may not be performant enough for WMF's use especially since Parsoid requires an in-memory DOM and good GC support. This is an unverified assertion. But, this is likely very true for Zend. It is unclear if this is true for HHVM + RepoAuthoritative config which might potentially meet the bar -- requires testing and analysis.

Option 3: Port Parsoid to a more performant PHP-binary compatible language (C / C++)
This was the original 2012 plan when Parsoid was first being developed and the then Parsoid team even embarked on a proof of concept for this port, but quickly abandoned it since it was diverting scarce developer time from the more urgent task at hand -- supporting rollout of the Visual Editor.

+ improved performance

+ could potentially integrate Parsoid functionally more tightly with core and could eliminate the whole 3rd party dance

- non-trivial amount of work

- more complex programming and debugging environment

Option 4: Define a spec for wikitext, extensions, HTML output
The spec should not be tied to implementation internals and should specify wikitext -> html and html -> wikitext behavior.
 * html -> wikitext might be an optional component of the spec.
 * html -> wikitext might generate normalized wikitext and generate dirty diffs on edits. No-dirty-diffs serialization would be an implementation detail.

+ How many parsers there are is irrelevant as long as they are spec-compliant.

+ Different implementations might have different maintainers. WMF cluster can continue to use Parsoid or a Java port or a C++ port or Rust port or whatever. Let MediaWiki Foundation or whichever group represents the interest of shared hosting providers take primary responsibility for the PHP parser.

- Getting to the spec is not trivial, but Parsoid's implementation provides a way out. This immediately makes the current PHP parser non-compliant since template wrapping is non-trivial in its current model. So, this is not entirely different from option 1, except that the focus here is on the spec, not on the lone compliant implementation.

Option 5: Maintain both PHP parser and Parsoid
+ If we solve the feature and output equivalence pieces for wt2html (see section below) and continue to clean up wikitext and implement this in both PHP and Parsoid land, we end up much closer to getting to a spec and actually adopting the 'define a spec' proposal.

+ Continue to provide 3rd party wikis with a PHP-only install option in wikitext-editor-only scenarios, i.e. no VE for shared-install 3rd party wikis.

- Kicking the ball down the road.

- Ongoing maintenance burden of feature compatibility in two parsers that have two different computational models (string vs. dom). This negative can be mitigated if we don't provide newer wt2html features that we are considering and that are better implemented in a DOM-based implementation (ex: tagging, "balanced" templates, etc.)

Option 6: Modularizing parser interface ("Zero parsers in core")
If we extract a good interface to the parser, you could make the selection of parser (PHP, Parsoid, markdown, wikitext 2.0, html-only) a configurable option. WMF could move to Parsoid-as-a-service as the primary parser while still allowing 3rd parties to run the "legacy" PHP parser as an independent component (or even run Parsoid embedded in PHP via node-php-embed, if they are allergic to the services model). The community can take up the responsibility for maintaining the PHP parser (or node-php-embed).

The endpoint for a modular interface would be "zero parsers in core", where all parsers are available as libraries, and no parser gets preferential treatment in core. A template API can be created alongside the parser API to allow parsers to share template semantics (see option 7 below).
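
To make the "configurable option" idea concrete, here is a minimal Python sketch of a pluggable parser registry. All names in it (WikiParser, register_parser, get_parser, the "parser" config key and the two stub parsers) are hypothetical and chosen only for illustration; no such interface exists in MediaWiki core today, and a real interface would need far more than a single to_html method.

```python
import html
from typing import Callable, Dict, Protocol


class WikiParser(Protocol):
    """Hypothetical minimal contract every parser library would satisfy."""

    def to_html(self, source: str) -> str: ...


class LegacyStubParser:
    """Stand-in for a string-based parser shipped as a separate library."""

    def to_html(self, source: str) -> str:
        return "<p>" + html.escape(source) + "</p>"


class HtmlOnlyParser:
    """Stand-in for an 'html-only' mode where stored content is already HTML."""

    def to_html(self, source: str) -> str:
        return source


# Core keeps only a registry of factories; it ships no parser of its own.
PARSERS: Dict[str, Callable[[], WikiParser]] = {}


def register_parser(name: str, factory: Callable[[], WikiParser]) -> None:
    PARSERS[name] = factory


def get_parser(config: dict) -> WikiParser:
    """Pick the parser named in the site configuration."""
    name = config["parser"]
    if name not in PARSERS:
        raise ValueError(f"no parser registered as {name!r}")
    return PARSERS[name]()


register_parser("legacy-stub", LegacyStubParser)
register_parser("html-only", HtmlOnlyParser)
```

A site would then select its parser with something like get_parser({"parser": "legacy-stub"}), and core code would only ever call to_html on whatever comes back.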

+ Splits maintenance responsibility for different use cases (just like in Option 4) by providing a pluggable interface just like VirtualRestService

+ Wikitext-only use cases continue to be supported via the PHP parser (this doesn't have to live in core, it can be a composer library)

- PHP-only installs may not get the latest and greatest DOM-based features (VE, etc.)

- Once the Wikimedia cluster no longer uses the PHP parser, the parser functionality could potentially diverge without maintainers for it. However, Tim has observed a couple of times in the past that the PHP parser has not required too much maintenance. It continues to be stable and hasn't seen a lot of code changes over the years. So, barring newly discovered security issues or annoying bug fixes, it could continue to support existing MediaWiki installations just fine.

Option 7: Modularizing template interface ("Zero template engines in core")
This begins with the parser interface and librarization of option 6, but also adds a Template Engine API, based on DOM semantics. When a parser library encounters a template invocation (regardless of the exact syntax), it invokes the template engine API to expand the template into a DOM fragment. This expansion likely invokes the parser API recursively on the template source, but the template markup language or parser implementation need not be the same as the article markup language or parser implementation. The DOM fragment returned by the template engine is spliced into the article DOM. Ideally we develop common semantics for how that is done (see T149658) so most engines can reuse an implementation provided in core.
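
The flow described above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: TemplateEngine, DictTemplateEngine and parse_paragraph are hypothetical names, templates are stored as XML snippets with $1-style placeholders, a fragment may only contain elements at its top level, and the recursive call back into the parser API is left out. It uses the standard library's xml.etree.ElementTree as a stand-in for a real DOM.

```python
import html
import re
import xml.etree.ElementTree as ET
from typing import Dict, List, Optional, Protocol


class TemplateEngine(Protocol):
    """Hypothetical engine API: expand an invocation into a DOM fragment."""

    def expand(self, name: str, args: List[str]) -> List[ET.Element]: ...


class DictTemplateEngine:
    """Toy engine: templates are XML snippets with $1, $2, ... placeholders."""

    def __init__(self, templates: Dict[str, str]) -> None:
        self.templates = templates

    def expand(self, name: str, args: List[str]) -> List[ET.Element]:
        source = self.templates[name]
        for i, arg in enumerate(args, start=1):
            source = source.replace(f"${i}", html.escape(arg))
        # Wrap so the snippet may hold several top-level elements, then unwrap.
        return list(ET.fromstring(f"<fragment>{source}</fragment>"))


INVOCATION = re.compile(r"\{\{([^{}|]+)((?:\|[^{}|]*)*)\}\}")


def _append_text(parent: ET.Element, last: Optional[ET.Element], text: str) -> None:
    """Attach plain text after the last spliced node (or at the start)."""
    if not text:
        return
    if last is None:
        parent.text = (parent.text or "") + text
    else:
        last.tail = (last.tail or "") + text


def parse_paragraph(text: str, engine: TemplateEngine) -> ET.Element:
    """Toy article parser: builds a <p> and splices in template fragments."""
    p = ET.Element("p")
    last: Optional[ET.Element] = None
    pos = 0
    for m in INVOCATION.finditer(text):
        _append_text(p, last, text[pos:m.start()])
        args = m.group(2).split("|")[1:]
        for node in engine.expand(m.group(1).strip(), args):
            p.append(node)  # splice the fragment into the article DOM
            last = node
        pos = m.end()
    _append_text(p, last, text[pos:])
    return p
```

The article parser never looks inside the template source; it only sees the returned fragment, which is what would let the template language differ from the article language.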

+ Allows interoperability between various markup languages

+ Encourages third-party use of "wikitext" by decoupling it from the MediaWiki-specific details of template expansion

- Alters core semantics of template expansion; requires MediaWiki to manipulate DOM-structured output, which could increase memory pressure.

High-level observations about the proposals
The proposals cleanly separate into two camps.

Options 1 - 3 require a clear consolidation behind a single parser (whether Parsoid, PHP-port of Parsoid, C++ port of Parsoid).

Options 4 - 6 don't require a clear consolidation behind a single parser and provide for a multiple-implementation world.

Options 1 - 3 are all fairly disruptive and potentially suck up scarce developer resources (except maybe option 1, but it seems like a non-starter for MediaWiki the software package).

Options 4 - 6, in reality, fit better together as part of a strategy that lets us move to a multiple parsers world without disrupting anyone's world seriously. It also provides a clean boundary between feature upgrades. We will continue to support existing MediaWiki installations, but any MediaWiki user that wants / needs newer features might have to shift to an installation mode that is not a shared PHP install. That said, in the foreseeable future, if the Wikitext 2.0 evolution strategy succeeds and we move to a "simpler" processing model, it is conceivable to think of a PHP-only solution that moves the parsing functionality into "core". However, that still doesn't do much for 3rd parties since they would still need to do the painful and necessary work of fixing up their wikitext in some ways.

First steps: feature & output equivalence between Parsoid & PHP-parser wt2html output
No matter which of the paths above we take, output (and feature) equivalence of PHP parser and Parsoid output is necessary. If the PHP parser were to be replaced with Parsoid today on the WMF cluster, there would be lots of breakage. There are probably a lot of tools, gadgets, bots and extensions that are tied to the PHP parser output and parser hooks.

Given that Parsoid & PHP parser are the two realistic candidates out there, and a PHP port is not a realistic option for the WMF cluster, and any Parsoid port has to deal with output equivalence issues for read views, strategically, it makes sense to focus on output & feature "equivalence" first and resolve that problem.

However, the direction this work takes will be influenced by the other decision. For example, if we go with option 1, functionality that Parsoid supports need not be implemented in the PHP parser, i.e. Parsoid's support can be a superset of what the PHP parser supports. If not, both PHP parser and Parsoid need to be in sync wrt new features (ex: DOM-based solutions like "balanced" templates).

Here is the work that needs doing:

Move towards feature equivalence (those that affect/depend on output)

 * red link, bad image support in Parsoid HTML (to be done -- solutions outlined in phab tickets)
 * language variant support (work in progress in 2016)
 * multimedia support (work in progress in 2016)
 * port wikitext-parsing extensions (work in progress in 2016)
 * define an extension spec that is not tied to parsing implementation details and port parser-hook-based extensions to this new model (related: see in output equivalence section)
 * identify how things like abuse filter and other PHP-parser-html based tools would work with the new output and develop plans to port them.
 * ideally, this would happen gradually as part of getting to output equivalence.
 * Reduce dependence on site messages. Ex: style Cite output using CSS

Move towards output equivalence

 * PHP parser: move to unify output markup
 * Ex: Use for images https://phabricator.wikimedia.org/T118517
 * PHP parser: replace Tidy (ongoing in 2016)
 * PHP parser: deprecate and remove parser hooks that are tied to PHP parser internals (potentially controversial proposal never proposed before now)
 * [CSA] it would be worth digging down and figuring out exactly which parser hooks we're talking about here. Some of them might still be implementable with parsoid.
 * [SSS] See https://www.mediawiki.org/wiki/Manual:Parser.php#Hooks
 * Parsing/Parser Hooks Stats has information about which parser hooks are being used by extensions installed on the WMF cluster. That table is heartening since most extensions don't use the hooks, and the most commonly used hooks have equivalents in a parser that doesn't expose its internals.
 * Parsoid: in a post-Tidy world, run mass visual diffs to identify other sources of rendering diffs
 * updated visual diff infrastructure (in place in 2016)
 * identify and fix remaining bugs / diffs (to be done)

Long-term desirables
The longer-term direction would be to start cleaning up the wikitext processing model.
 * Move towards a DOM-centric instead of string-centric processing model.
 * Deprecate parser functions and migrate most of that functionality to Scribunto.
 * Extensions should not be tied to specifics of a parser's implementation, i.e. parser tag hooks like  should be deprecated and removed ( https://www.mediawiki.org/wiki/Manual:Parser.php ). We should examine the functionality that is currently supported by these hooks and see how we could support them instead.

Related documents

 * 1) Parsoid performance landscape -- July 2015 document and pertinent because of some intersection with the notes here. The decisions made in July 2015 can always be revisited.
 * 2) Wikitext processing model -- February 2015 document and useful for consideration when looking at longer-term direction of what the parser should do.
 * 3) Document composability -- April 2016 document that discusses the need for on-wiki documents to be composable from individual fragments. This is again useful when looking at longer-term direction of what is needed of the parser.