Parsing/Notes/Two Systems Problem

Why this page?
This document is an attempt to present the proposals and ideas that have been mooted in different places to address the 2-systems (wikitext) parser problem. Relatedly, the requirements of a single unified parser aren't fully agreed upon yet, especially around support for 3rd party wikis and how we want wikitext to evolve.

This page is an attempt to pull together all related concerns and proposals in one place so we can have a more informed discussion about how to move ahead wrt solving the 2-systems problem. Some of these proposals and sections need to be filled out in more detail. The proposals on this page should not be considered a roadmap or a statement of what will or should happen. These notes have been pulled together for discussion and will inform RFCs, possible dev summit proposals, and talks. The parsing team roadmap will evolve through those other processes.

2-systems problem in parsing land
Right now, in parsing land, we have the 2-systems problem in 2 different ways.
 * For Wikimedia wikis, there are two HTML renderings of a page:
   * PHP Parser HTML - used for read views, action=edit previews
   * Parsoid HTML - used by VE, CX, Flow, MCS, OCG + Google, Kiwix

This is not tenable in the long term. Migrating to a single parser that supports both read and edit requirements would solve both of these in one shot, but let us consider the two separately because at least one path doesn't require consolidation behind a single parser.
 * For MediaWiki the software, we have two different parsers:
   * PHP parser - does not support VE, CX, OCG, Kiwix, has limited support for Flow
   * Parsoid - does not replace all PHP parser components + 3rd party installation woes + awkward boundary between Parsoid & MediaWiki + some performance concerns

Source of output differences between Parsoid & PHP parser
There are 2 primary reasons why the output of Parsoid and the PHP parser differs:

DOM processing vs. string processing
In order to support HTML editing clients like VE, Parsoid tries hard to map the macro-preprocessing-like parsing model to a DOM-centric model. This mostly works, except when it doesn't (link to other documents for more info).

Reliance on state that Parsoid doesn't have direct access to
Right now, Parsoid is structured as a separate stateless service and Parsoid HTML = f(wt, mw-config). Parsoid caches the wiki configs at startup and computes output HTML as a function of the input wikitext. This model is not reflective of how the PHP parser generates its output HTML.

Right now, MediaWiki HTML output depends on the inputs listed below, i.e. MediaWiki-HTML = F_php(wt, mw-config, media-resources, user-state, site-messages, corpus-state, PHP-parser-hooks, tidy-hacks):
 * input wikitext
 * wiki config (including installed extensions)
 * media resources (images, audio, video)
 * PHP parser hooks that expose parsing internals and implementation details (not replicable in other parsers)
 * wiki messages (ex: cite output)
 * state of the corpus and other db state (ex: red links, bad images)
 * user state (prefs, etc.)

Parsoid gets away with its assumption by (a) calling the MediaWiki API to support rendering of media resources, native extensions and other hooks, and (b) proposing to run HTML transformations to support things like redlinks, bad images, and user preferences. Parsoid supports some Tidy hacks, but has rendering differences nevertheless. Replacing Tidy with an HTML5 parser effectively eliminates tidy-hacks as one of the inputs.

Possible refactoring of this computation model
However, output still depends on external state, which could be handled differently. For example, changes to the corpus (a change in link state) can force a reparse of the entire page. The same goes for reparsing a page to accommodate user preferences or a change in site messages. If instead we could make MediaWiki-HTML a series of composable transformations, then we could improve cacheability and untangle some of the dependencies.

i.e. make MediaWiki-HTML = f1(f2(f3(wt, mw-config), user-state), corpus-state) and so on. This lets us cache the output of f3 and apply the remaining transformations to it (ideally client-side, since the cached output can then be returned rapidly, but server-side if necessary).
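
As a rough illustration of the composition f1(f2(f3(wt, mw-config), user-state), corpus-state), here is a minimal Python sketch. Everything in it is an assumption made for illustration: the function bodies, the `article_path` config key, the `font` preference, and the toy `[[Link]]` syntax handling are hypothetical and are not Parsoid or MediaWiki code. The point is only that f3 depends on nothing but the wikitext and the wiki config, so its output can be cached and reused when user state or corpus state changes.

```python
import re
from functools import lru_cache

# Toy link syntax only: [[Target]]. Real wikitext is far richer.
LINK = re.compile(r"\[\[([^\]|]+)\]\]")


@lru_cache(maxsize=1024)
def f3(wt: str, mw_config: frozenset) -> str:
    """State-independent parse: depends only on (wt, mw-config), so it is cacheable."""
    article_path = dict(mw_config).get("article_path", "/wiki/")
    return LINK.sub(
        lambda m: f'<a href="{article_path}{m.group(1)}" title="{m.group(1)}">{m.group(1)}</a>',
        wt,
    )


def f2(markup: str, user_state: dict) -> str:
    """User-state pass: wraps the cached HTML according to a hypothetical 'font' preference."""
    return f'<div class="user-font-{user_state.get("font", "default")}">{markup}</div>'


def f1(markup: str, existing_titles: set) -> str:
    """Corpus-state pass: marks links to missing pages (red links) with class="new"."""
    def mark(m: re.Match) -> str:
        if m.group(1) in existing_titles:
            return m.group(0)
        return m.group(0).replace("<a ", '<a class="new" ', 1)
    return re.sub(r'<a href="[^"]*" title="([^"]*)">', mark, markup)


def render(wt: str, mw_config: frozenset, user_state: dict, existing_titles: set) -> str:
    return f1(f2(f3(wt, mw_config), user_state), existing_titles)
```

In this sketch, creating the page "Bar" changes only the `existing_titles` argument: a second call to `render` reuses the cached f3 output and only reruns the two cheap passes.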

The original proposal for using Parsoid HTML for read views was based on this idea of composable transformations. This idea is up for discussion.

Requirements of the unified parser

 * Needs to support all existing HTML editing clients (VE, CX, Flow, etc.) and functionality
 * Full feature and output compatibility wrt reading clients, tools, bots, etc
 * Needs to be "performant enough" for use on the Wikimedia cluster
 * Ideally, it will support the long-term move to DOM-based semantics
 * Cannot drastically break rendering of old revisions
 * What kind of 3rd party installation considerations are relevant? Many 3rd party wikis might expect to be able to use the latest and greatest MediaWiki features (VE & Flow) found on Wikipedias.
   * Do all installation solutions (shared hosting, 1-click install, VMs, containers, etc.) need to provide this support? This question was the focus of a 2016 dev summit discussion and hasn't yet been resolved satisfactorily. The answer to this question has implications for the unified parser discussion. Alternatively, the decision made for the unified parsing solution will de facto dictate what is possible for 3rd parties. The question to resolve: which is the dog, which is the tail .. ? :)
   * Note that even if we have a PHP/PHP-binary-compatible version of Parsoid for 3rd party wikis (the PHP port and C / C++ port options discussed under possible approaches below), RESTBase, Citoid, Graphoid, and Mathoid are all node.js services as well. RESTBase could perhaps be optional as long as those wikis don't care about dirty diffs with VE edits, but if they want Citoid (or maps), are we now expecting a PHP/PHP-binary-compatible port of Citoid (and Graphoid) as well?
   * On the other hand, we did make it reasonably easy to use Scribunto without installing any PHP extensions or running external services, even though the WMF cluster does use a PHP extension. Image scaling, pre-Mathoid math rendering, and Tidy have long worked on a plain PHP installation too (AFAIK we're wanting to kill Tidy because it doesn't do HTML5 well, not because the integration is a problem).
   * However, if a 3rd party wiki does not want VE and only wants wikitext editing without a non-PHP language dependency for the foreseeable future, then that eliminates option 1 (replacing the core parser with Parsoid). But that means not all 3rd party wikis will get the latest and greatest MediaWiki features found on Wikipedias.
   * Question: Perhaps we need a support matrix for installation modes, like the ones for browsers, i.e. what installation modes must support what features. That would at least provide some agreement that such a thing is needed (rather than the amorphous "everything must support everything" requirement) and let the discussion and debate move towards how the table cells are filled in.

Tangent: Nomenclature
A small tangent. "Parser" is no longer an accurate name since we are now looking at transformations in both directions between wikitext and HTML. Separately, the content-handler abstraction enables other content representations, for example, Markdown, JSON, HTML, etc.

All that said, wikitext is still the primary content format for Wikimedia and many 3rd party wikis. And, if we restrict our attention to just this wikitext component, perhaps a wikitext runtime/implementation (?) or something else is a better name? (Maybe a bi-directional compiler (or transpiler).)

Possible approaches for addressing the 2-systems problem
Here are some options that have been mooted.

Deprecate the core parser and replace it with Parsoid
+ no need to replicate new functionality in PHP parser

+ a faster path to get to the end goal

- eliminates the simple install option for mediawiki, the software.

- introduces a hard dependency on a non-php component.

Port Parsoid functionality to PHP and ditch Parsoid
+ everything is back in PHP land

- non-trivial amount of work

- not performant enough for WMF's use

Port Parsoid to a more performant PHP-binary compatible language (C / C++)
This was the original 2012 plan when Parsoid was first being developed and the then Parsoid team even embarked on a proof of concept for this port, but quickly abandoned it since it was diverting scarce developer time from the more urgent task at hand -- supporting rollout of the Visual Editor.

+ improved performance

+ could potentially integrate Parsoid functionality more tightly with core and could eliminate the whole 3rd party dance

- non-trivial amount of work

- more complex programming and debugging environment

[CSA note: or use node-php-embed to embed JavaScript implementation inside PHP...]

Define a spec for wikitext, extensions, HTML output
The spec should not be tied to implementation internals and should specify wt -> html and html -> wt behavior.
 * html -> wt might be an optional component of the spec
 * html -> wt might generate normalized wt and generate dirty diffs on edits. No-dirty-diffs serialization would be an implementation detail

+ How many parsers there are is irrelevant as long as they are spec-compliant.
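
To make the distinction between a spec-level round trip and a dirty diff concrete, here is a small Python sketch. It is purely illustrative: `ToyImplementation` and `roundtrip_report` are hypothetical names, and the toy handles only `'''bold'''` and `<b>` markup, which is nowhere near real wikitext. The check it performs is one plausible reading of what a spec could require: serializing HTML back to wikitext and reparsing must reproduce the same HTML, even if the wikitext itself comes back normalized.

```python
import re
from typing import Protocol


class WikitextImplementation(Protocol):
    """Any implementation the (hypothetical) spec would cover."""
    def wt2html(self, wt: str) -> str: ...
    def html2wt(self, markup: str) -> str: ...


class ToyImplementation:
    """Handles only '''bold''' and literal <b> tags; everything else passes through."""

    def wt2html(self, wt: str) -> str:
        # Quote syntax becomes <b>; literal <b> tags in the wikitext are left as they are.
        return re.sub(r"'''(.+?)'''", r"<b>\1</b>", wt)

    def html2wt(self, markup: str) -> str:
        # Always serializes to quote syntax, i.e. produces normalized wikitext.
        return re.sub(r"<b>(.+?)</b>", r"'''\1'''", markup)


def roundtrip_report(impl: WikitextImplementation, wt: str) -> dict:
    markup = impl.wt2html(wt)
    wt_again = impl.html2wt(markup)
    return {
        # Spec-level property: reparsing the serialized wikitext gives the same HTML.
        "semantic_roundtrip": impl.wt2html(wt_again) == markup,
        # Implementation-level property: the wikitext source changed on round trip.
        "dirty_diff": wt_again != wt,
    }
```

For the input `a <b>x</b> '''y'''`, this toy round-trips semantically but rewrites `<b>x</b>` as `'''x'''`, which is exactly the kind of dirty diff that a no-dirty-diffs serializer would have to avoid as an implementation detail.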

+ Different implementations might have different maintainers. WMF cluster can continue to use Parsoid or a Java port or a C++ port or Rust port or whatever. Let mediawiki foundation or whichever group represents the interest of shared hosting providers take primary responsibility for the PHP parser.

- Getting to the spec is not trivial, but Parsoid's implementation provides a way out. This immediately makes the current PHP parser non-compliant since template wrapping is non-trivial in its current model. So, this is not entirely different from solution 1, except that the focus here is on the spec, not on the lone compliant implementation.

Maintain both PHP parser and Parsoid
+ If we solve the feature and output equivalence pieces for wt2html (see section below) and continue to clean up wikitext and implement this in both PHP and Parsoid land, we end up much closer to getting to a spec and actually adopting the 'define a spec' proposal.

+ Continue to provide 3rd party wikis with a php-only install option in wikitext-editor-only scenarios, i.e. no VE for shared-install 3rd party wikis.

- Kicking the ball down the road.

- Ongoing maintenance burden of feature compatibility in two parsers that have two different computational models (string vs. dom). This negative can be mitigated if we don't provide newer wt2html features that we are considering and that are better implemented in a DOM-based implementation (ex: tagging, "balanced" templates, etc.)

Modularizing parser interface
If we extract a good interface to the parser, we could make the selection of parser (PHP, Parsoid, markdown, wikitext 2.0, html-only) a configurable option. WMF could move to Parsoid-as-a-service as the primary parser while still allowing 3rd parties to run the "legacy" PHP parser as an independent component (or even run Parsoid embedded in PHP via node-php-embed, if they are allergic to the services model). The community can take up the responsibility for maintaining the PHP parser (or node-php-embed).
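
A minimal sketch of what such a configurable parser selection could look like, written in Python for brevity. All names here (`WikitextParser`, `LegacyParser`, `ServiceParser`, `make_parser`, and the `parser` and `transport` config keys) are hypothetical and do not correspond to any existing MediaWiki interface. The service-backed parser takes an injected transport callable so that the sketch does not assume any particular HTTP API.

```python
import html
from typing import Callable, Dict, Protocol


class WikitextParser(Protocol):
    """The narrow interface that callers would depend on."""
    def to_html(self, wikitext: str) -> str: ...


class LegacyParser:
    """Stand-in for an in-process parser; it only escapes and wraps the text."""

    def to_html(self, wikitext: str) -> str:
        return f"<p>{html.escape(wikitext)}</p>"


class ServiceParser:
    """Stand-in for a parser running as a separate service."""

    def __init__(self, transport: Callable[[str], str]) -> None:
        # The transport hides how the service is reached (HTTP, embedding, ...).
        self._transport = transport

    def to_html(self, wikitext: str) -> str:
        return self._transport(wikitext)


def make_parser(config: Dict[str, object]) -> WikitextParser:
    """Pick a parser implementation from site configuration."""
    kind = config.get("parser", "legacy")
    if kind == "legacy":
        return LegacyParser()
    if kind == "service":
        return ServiceParser(config["transport"])
    raise ValueError(f"unknown parser: {kind!r}")
```

Callers only ever see `to_html`, so a site could switch between implementations by changing one configuration value, provided the implementations agree on output.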

First steps: feature & output equivalence between Parsoid & PHP-parser wt2html output
No matter which of the paths above we take, output (and feature) equivalence between the PHP parser and Parsoid is necessary. If the PHP parser were replaced with Parsoid today on the WMF cluster, there would be lots of breakage. There are probably a lot of tools, gadgets, bots, and extensions that are tied to the PHP parser output and parser hooks.

Parsoid and the PHP parser are the two realistic candidates out there, a PHP port is not a realistic option for the WMF cluster, and any Parsoid port has to deal with output equivalence issues for read views. Given all that, it makes strategic sense to focus on output & feature "equivalence" first and resolve that problem.

However, the direction this work takes will be influenced by which of the approaches above is chosen. For example, if we go with option 1 (deprecating the core parser in favor of Parsoid), functionality that Parsoid supports need not be implemented in the PHP parser, i.e. Parsoid's support can be a superset of what the PHP parser supports. If not, both the PHP parser and Parsoid need to be kept in sync wrt new features (ex: DOM-based solutions like "balanced" templates).

Here is the work that needs doing:

Move towards feature equivalence (those that affect/depend on output)

 * red link, bad image support in Parsoid HTML (to be done -- solutions outlined in phab tickets)
 * language variant support (work in progress in 2016)
 * multimedia support (work in progress in 2016)
 * port wikitext-parsing extensions (work in progress in 2016)
 * define an extension spec that is not tied to parsing implementation details and port parser-hook-based extensions to this new model (related: see the output equivalence section)
 * identify how things like abuse filter and other php-parser-html based tools would work with the new output and develop plans to port them.
   * ideally, this would happen gradually as part of getting to output equivalence.
 * Reduce dependence on site messages. Ex: Cite output styled using CSS

Move towards output equivalence

 * PHP parser: move to unify output markup
   * Ex: Use for images https://phabricator.wikimedia.org/T118517
 * PHP parser: replace Tidy (ongoing in 2016)
 * PHP parser: deprecate and remove parser hooks that are tied to php parser internals (potentially controversial proposal never mooted before now)
   * [CSA] it would be worth digging down and figuring out exactly which parser hooks we're talking about here. Some of them might still be implementable with parsoid.
   * [SSS] See https://www.mediawiki.org/wiki/Manual:Parser.php#Hooks
   * Parsing/Parser Hooks Stats has information about which parser hooks are being used by extensions installed on the WMF cluster. That table is heartening since most extensions don't use the hooks, and the most commonly used hooks have equivalents in a parser that doesn't expose its internals.
 * Parsoid: in a post-Tidy world, run mass visual diffs to identify other sources of rendering diffs
   * updated visual diff infrastructure (in place in 2016)
   * identify and fix remaining bugs / diffs (to be done)
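
The visual diffs mentioned above compare rendered pages. As a much cheaper textual stand-in (explicitly not the visual diff infrastructure itself), the following Python sketch normalizes two HTML strings so that attribute order and insignificant whitespace don't count as differences, and then diffs the resulting token streams. The function names and the choice of normalizations are assumptions made for illustration only.

```python
import difflib
from html.parser import HTMLParser
from typing import List


class _Tokens(HTMLParser):
    """Flattens markup into a list of normalized tag and text tokens."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.tokens: List[str] = []

    def handle_starttag(self, tag, attrs):
        # Sort attributes so that attribute order is not reported as a difference.
        rendered = "".join(f' {k}="{v}"' for k, v in sorted((k, v or "") for k, v in attrs))
        self.tokens.append(f"<{tag}{rendered}>")

    def handle_endtag(self, tag):
        self.tokens.append(f"</{tag}>")

    def handle_data(self, data):
        # Collapse whitespace runs and drop whitespace-only text nodes.
        text = " ".join(data.split())
        if text:
            self.tokens.append(text)


def normalize(markup: str) -> List[str]:
    parser = _Tokens()
    parser.feed(markup)
    parser.close()
    return parser.tokens


def output_diff(php_html: str, parsoid_html: str) -> List[str]:
    """Unified diff of the normalized token streams; an empty list means equivalent."""
    return list(difflib.unified_diff(
        normalize(php_html), normalize(parsoid_html), "php", "parsoid", lineterm=""
    ))
```

A check like this can only flag markup-level differences; whether two different markups render identically still needs a real visual comparison.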

Long-term desirables
The longer-term direction would be to start cleaning up the wikitext processing model.
 * Move towards a DOM-centric instead of string-centric processing model.
 * Deprecate parser functions and migrate most of that functionality to Scribunto.
 * Extensions should not be tied to specifics of a parser's implementation, i.e. parser tag hooks like  should be deprecated and removed ( https://www.mediawiki.org/wiki/Manual:Parser.php ). We should examine the functionality that is currently supported by these hooks and see how we could support them instead.

Related documents

 * 1) Parsoid performance landscape -- July 2015 document and pertinent because of some intersection with the notes here. The decisions made in July 2015 can always be revisited.
 * 2) Wikitext processing model -- February 2015 document and useful for consideration when looking at longer-term direction of what the parser should do.
 * 3) Document composability -- April 2016 document that discusses the need for on-wiki documents to be composable from individual fragments. This is again useful when looking at longer-term direction of what is needed of the parser.