Core Platform Team/Initiative/Unify Parsers-Phase 2/Initiative Description

Project Leads
Subbu Sastry

Current state
Blocked, waiting for phase 1 to be complete.

Some of the work will remain loosely defined until several tasks that are expected to shape the rest of the project are complete. See milestones and major tasks below.

Expected start
Late FY1920 Q1

Summary
MediaWiki currently has two wikitext parsers, the legacy parser and Parsoid, which support different use cases. This project aims to arrive at a single parser that supports all use cases.

Significance and motivation
Parsoid was developed to support HTML-editing clients and is also used for some, but not all, read-view use cases. Maintaining two parsers is not tenable in the long term: it hamstrings development and upgrades to the parsing codebase, wikitext, and templates, because every change has to be supported in both codebases. More importantly, the two parsers have different parsing pipelines, which makes replicating functionality in both more complex.

We would like to consolidate behind Parsoid as the new default parser given its support for HTML clients, its annotated HTML output, and its more structured internal pipeline. This requires identifying all output and feature incompatibilities between Parsoid and the legacy parser and bridging those gaps. It may also require updates to bots, gadgets, extensions, and wikitext. This project aims to minimize all such changes by handling any differences with appropriate tooling and support.

Once Parsoid is deployed as the default and only parser for all wikitext-based use cases, we can embark on much-needed work to enhance wikitext and templates, making them easier to use, more performant, less error-prone, and easier to write tools for.

Milestones and major tasks

 * Fix known issues in Parsoid relating to using Parsoid HTML for read views
    * Complete language variant support
    * Address any other issues in Parsoid/Known differences with PHP parser output
 * Finish updating legacy PHP parser media output to match Parsoid
    * This might require updates to some bots and gadgets
 * Identify any other Parsoid feature gaps (This can/will reveal new work)
 * Finalize new parser hooks API (Parsoid and legacy PHP parser have different pipelines and internals)
    * Migrate over Wikimedia extensions using existing hooks
 * Compatibility Testing (this can/will reveal new work)
    * Establish regular visual diff QA runs to identify uncaught issues
    * Analyze results and file Parsoid bugs or identify any wikitext changes required on wikis
    * Decide on what compatibility is acceptable (100% compatibility is not achievable and there might be insignificant output differences)
    * Connect with CL and engage with community if we require any wikitext / templates to be fixed (This can/will reveal new work)
 * Production Readiness
    * Improve Parsoid performance (undefined until phase 1 is complete)
    * Switch over all read views to Parsoid on the Wikimedia cluster
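
As an illustration of the compatibility-testing milestone above, the following is a minimal sketch of how pages could be triaged by output similarity. It is not the project's QA tooling: it substitutes a plain text comparison of HTML strings for the visual diffs named in the plan, and the function names, the sample pages, and the 0.98 threshold are all hypothetical, chosen only to show how an "acceptable compatibility" cutoff might be applied.

```python
import difflib
import re


def normalize(html: str) -> str:
    """Collapse runs of whitespace so purely cosmetic differences are ignored."""
    return re.sub(r"\s+", " ", html).strip()


def similarity(legacy_html: str, parsoid_html: str) -> float:
    """Return a ratio in [0, 1]; 1.0 means identical after normalization."""
    matcher = difflib.SequenceMatcher(
        None, normalize(legacy_html), normalize(parsoid_html)
    )
    return matcher.ratio()


def triage(pages: dict, threshold: float = 0.98) -> list:
    """Return titles whose outputs score below the threshold, worst first.

    `pages` maps a title to a (legacy_html, parsoid_html) pair.
    The threshold is an assumed placeholder, not a project decision.
    """
    scored = [(similarity(a, b), title) for title, (a, b) in pages.items()]
    return [title for score, title in sorted(scored) if score < threshold]


# Hypothetical sample data: page "A" differs only in whitespace,
# page "B" differs in both structure and content.
pages = {
    "A": ("<p>Hello  world</p>", "<p>Hello world</p>"),
    "B": ("<p>Hello</p>", "<section><p>Goodbye</p></section>"),
}
flagged = triage(pages)
```

Pages that end up in `flagged` would be candidates for manual review, a Parsoid bug report, or a wikitext fix. Where to set the threshold is exactly the open question of what output disparity is acceptable.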

Outcome
Reduce complexity in core

Baseline

 * TBD

Target

 * TBD

Methodology and rationale
TBD

Time and resource estimate
18-24 months

3.5 FTE and 0.5 Engineering and Project Manager for the duration

Other engineers may be brought in to augment the team, but more clarity is needed.

Dependencies
Reduce Extension Interface Surface Area

Collaborators

 * Parsing Team
 * Core Platform
 * Performance
 * SRE

Stakeholders

 * Client teams (Web, VE, Flow, CX, Apps)
 * Bot, Gadget, and Extension authors (only as pertaining to the Wikimedia cluster initially)
 * Editing community
 * Core Platform

Open questions

 * To what extent do we want to refactor the Parsing Interface in Core? It is currently coupled with the legacy wikitext parser and the templating implementation.
 * What is acceptable output disparity between Parsoid and the PHP parser? How do we decide this? What qualitative analysis should be used?
 * What are our strategies for engaging with the community on any changes this might require them to make?
 * What additional work is required on the Linter extension to better support editors with any required wikitext and template changes?

Phabricator
https://phabricator.wikimedia.org/tag/parsoid-read-views/

Plans and RFCs

 * The Long And Winding Road To Making Parsoid The Default MediaWiki Parser (slides, video)

Other documents

 * Parsoid/Known differences with PHP parser output
 * Parsing/Parser Hooks Stats
 * Parsing/Media structure
 * Parsoid/LanguageConverter