Unify Parsers-Phase 2
MediaWiki currently has two wikitext parsers, the legacy parser and Parsoid, which support different use cases. This project aims to arrive at a single parser that supports all of them.
- Significance and Motivation
Parsoid was developed to support HTML-editing clients, and it is also used by some, but not all, read-view use cases. Maintaining two parsers is not tenable in the long term: it hamstrings development and upgrades to the parsing codebase, wikitext, and templates, because every change has to be supported in both codebases. More importantly, the parsing pipelines in the two parsers differ, which makes replicating functionality in both more complex.
We would like to consolidate behind Parsoid as the new default parser, given its support for HTML clients, its annotated HTML output, and its more structured internal pipeline. This requires identifying all output and feature incompatibilities between Parsoid and the legacy parser and bridging those gaps. It may also require updating (a) bots, (b) gadgets, (c) extensions, and (d) wikitext. This project aims to minimize all such changes by handling any differences with appropriate tooling and support.
Once Parsoid is deployed as the default and only parser for all wikitext-based use cases, we can embark on much-needed work to enhance wikitext and templates and make them easier to use, more performant, less error-prone, and easier to write tools for.
- Baseline Metrics
- Target Metrics
- Client teams (Web, VE, Flow, CX, Apps)
- Bot, Gadget, and Extension authors (only as pertaining to the Wikimedia cluster initially)
- Editing community
- Core Platform
- Known Dependencies/Blockers
Epics, User Stories, and Requirements
- Fix known issues in Parsoid relating to using Parsoid HTML for read views
- Finish updating legacy PHP parser media output to match Parsoid
- This might require updates to some bots and gadgets
- Identify any other Parsoid feature gaps (This can/will reveal new work)
- Finalize new parser hooks API (Parsoid and legacy PHP parser have different pipelines and internals)
- Migrate over Wikimedia extensions using existing hooks
- Compatibility Testing (this can/will reveal new work)
- Establish regular visual diff QA runs to identify uncaught issues
- Analyze results and file Parsoid bugs or identify any wikitext changes required on wikis
- Decide on what compatibility is acceptable (100% compatibility is not achievable and there might be insignificant output differences)
- Connect with CL and engage with community if we require any wikitext / templates to be fixed (This can/will reveal new work)
- Production Readiness
- Improve Parsoid performance (undefined until phase 1 is complete)
- Switch over all read views to Parsoid on the Wikimedia cluster
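The compatibility-testing items above (regular visual diff runs, analyzing results, and deciding what level of compatibility is acceptable) could be triaged roughly as sketched below. This is only an illustration under stated assumptions: the `DiffResult` record, the `classify` and `summarize` helpers, and both threshold values are hypothetical and do not describe any existing visual diff tooling. The actual acceptance criteria remain an open decision for this project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiffResult:
    """Hypothetical record of one page compared under both parsers."""
    title: str
    differing_pixels: int
    total_pixels: int

    @property
    def score(self) -> float:
        # Fraction of pixels that differ between the two renderings.
        if self.total_pixels == 0:
            return 0.0
        return self.differing_pixels / self.total_pixels


def classify(result: DiffResult,
             insignificant: float = 0.001,
             needs_review: float = 0.02) -> str:
    """Bucket a result; both thresholds are placeholder assumptions."""
    if result.score <= insignificant:
        return "acceptable"
    if result.score <= needs_review:
        return "review"
    return "file-bug"


def summarize(results: list[DiffResult]) -> dict[str, int]:
    """Count how many pages fall into each bucket for one QA run."""
    counts = {"acceptable": 0, "review": 0, "file-bug": 0}
    for result in results:
        counts[classify(result)] += 1
    return counts


if __name__ == "__main__":
    run = [
        DiffResult("Page A", 0, 1000),
        DiffResult("Page B", 10, 1000),
        DiffResult("Page C", 500, 1000),
    ]
    print(summarize(run))
```

Keeping the thresholds as parameters means the acceptable level of disparity can be adjusted once it has been agreed on, without changing the triage logic.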
Time and Resource Estimates
- Estimated Start Date
Late FY1920 Q1
- Actual Start Date
- Estimated Completion Date
- Actual Completion Date
- Resource Estimates
3.5 FTE, plus 0.5 Engineering and Project Manager, for the duration
Possibly augmented by other engineers, but more clarity is needed.
- Parsing Team
- Core Platform
- To what extent do we want to refactor the Parsing Interface in Core? It is currently coupled with the legacy wikitext parser and the templating implementation.
- What output disparity between Parsoid and the PHP parser is acceptable? How do we decide this? What qualitative analysis should be used?
- What are our strategies for engaging with the community on any changes this might require them to make?
- What additional work is required on the Linter extension to better support editors with any required wikitext and template changes?
- Other Documents
- Parsoid/Known differences with PHP parser output
- Parsing/Parser Hooks Stats
- Parsing/Media structure
Blocked, waiting for phase 1 to be complete.
Some work remains loosely defined until several tasks are complete; those tasks are expected to define the rest of the project.