Parsing

The parsing team is, as the name indicates, responsible for maintaining the wikitext parsing products.

You can find us on IRC, or reach us by email at the parsing-team address on the wmf domain.

Mission Statement

 * (Input) Advance wikitext as a language (easier to write, faster to parse, less error-prone).
 * (Output) Make wikitext content easier to analyze, manipulate, and evolve, in support of tools like VisualEditor, Content Translation, Wikitext Linting, and others.
 * (Parsers) Arrive at a unified parser for reading as well as editing.

History
Prior to May 2015, the former MediaWiki core team was responsible for the MediaWiki core PHP parser, and the Parsoid team was responsible for Parsoid. In May 2015, the Parsoid team was renamed the Parsing team, and both the core PHP parser and Parsoid were brought under its purview; this coincided with Tim Starling joining the group. Kunal joined the team in April 2016. As of April 2017, Tim and Kunal are part of the MediaWiki Platform team.

Current projects

 * Parser Unification as part of the Platform Evolution Cross-Departmental Program.
   * In January 2018, we started work in earnest to move Parsoid into MediaWiki Core by porting it to PHP.
   * As of December 20, 2019, we have switched all traffic from Parsoid/JS to Parsoid/PHP.
 * Parsoid:
   * Addressing the needs of editing products deployed on the Wikimedia cluster (VisualEditor, Flow, Content Translation, bots) and non-editing projects (OCG, Google, Kiwix).
   * Making progress towards enabling read views with Parsoid HTML.
   * T43716: Language variant support in Parsoid based on Finite State Transducer formalism (T191925)
   * T156099: Continue refining Parsoid's extension API
     * The goal is to hide parsing and implementation details while supporting existing WMF extensions that use the PHP Parser hooks.
 * PHP Parser:
   * Using the  tag for media markup
 * Evolving wikitext
   * Various RFCs that seek to evolve wikitext, which can help address technical debt in content. A complete list is on Phabricator: RfCs tagged with "Parsing Team". Some examples:
     * T114432: Heredoc-style syntax for long template arguments (stalled on other projects for now)
     * Balanced templates (stalled on other projects for now)
 * Testing infrastructure associated with these projects

Get involved
For updates and calls to action, please see and bookmark Parsing/Get involved. We will be using that space to push out community notifications around our work.

Code repositories that the Parsing team is directly responsible for or shares responsibility for with other teams

 * Parsoid (both JS & PHP versions)
 * Node.js libraries that Parsoid depends on
   * domino HTML5 parser and DOM
   * prfun promises library
   * wikipeg PEG parser (previously a fork of pegjs)
 * MediaWiki-Parser
 * RemexHtml HTML5 parser
 * MediaWiki Extensions
   * ParsoidBatchAPI (this will be retired and undeployed from the Wikimedia cluster now that Parsoid/JS is going to be decommissioned)
   * Linter
   * ParserMigration
 * QA tools
   * TestReduce
   * VisualDiff
   * UprightDiff

Long-term directions as of November 2016
The parsing team held its offsite in October 2016 in Seattle. Going into the offsite, the two-systems problem (maintaining both the PHP parser and Parsoid) was one of the core problems we wanted to tackle, and it informed and influenced a number of the discussions we had there.

Consequently, we arrived at the following broad outline of where we want to move in the coming years. This is deliberately a broad outline without a lot of detail. The specifics will evolve based on team discussions, consultation with other WMF teams and the architecture committee, input from editors, and results and feedback from the work we undertake.

We will separately publish a list of specific projects that we are going to be working on as we move along these directions.

A: Move to adopt Parsoid as the primary MediaWiki wikitext parser

 * We'll adopt the Parsoid DOM Spec as the versioned output spec for wikitext.
 * DOM features in Parsoid won't be replicated in the PHP parser (except those that might be easy to support and that will help with adopting Parsoid HTML for read views and with evolving wikitext semantics).
 * Once we address output and feature incompatibilities between the PHP parser and Parsoid, we'll use Parsoid HTML for read views as well as editing.
 * Today's PHP parser will be treated as a legacy implementation and will be deprecated and removed in the long term.

B: Make MediaWiki friendly to multiple parser implementations

 * Make the parser API used by extensions (e.g. parser hooks) implementation-neutral. The current parser hooks provided by the PHP parser don't all have Parsoid equivalents, since they refer to PHP parser internals.
 * Update the parsing API used in MediaWiki (as necessary) to ensure that alternative implementations (e.g. markdown, wikitext 2.0, the legacy PHP parser) can be plugged in.
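
As a rough illustration of what "implementation-neutral" could look like, here is a minimal sketch. It is written in Python purely for brevity (MediaWiki itself is PHP), and every name in it (WikitextParser, ParserRegistry, EscapingParser, PreformattedParser) is hypothetical and not part of MediaWiki or Parsoid. The point is only that callers depend on a small interface, so any implementation can be plugged in behind it.

```python
import html
from typing import Dict, Protocol


class WikitextParser(Protocol):
    """Hypothetical implementation-neutral interface: source text in, HTML out."""

    def parse(self, source: str) -> str:
        ...


class EscapingParser:
    """Toy stand-in for one implementation: escapes the input and wraps it in <p>."""

    def parse(self, source: str) -> str:
        return "<p>" + html.escape(source) + "</p>"


class PreformattedParser:
    """Toy stand-in for an alternative implementation: emits a <pre> block."""

    def parse(self, source: str) -> str:
        return "<pre>" + html.escape(source) + "</pre>"


class ParserRegistry:
    """Maps a name to a parser so callers never touch a parser's internals."""

    def __init__(self) -> None:
        self._parsers: Dict[str, WikitextParser] = {}

    def register(self, name: str, parser: WikitextParser) -> None:
        self._parsers[name] = parser

    def render(self, name: str, source: str) -> str:
        if name not in self._parsers:
            raise KeyError(f"no parser registered under {name!r}")
        return self._parsers[name].parse(source)


registry = ParserRegistry()
registry.register("default", EscapingParser())
registry.register("plain", PreformattedParser())

print(registry.render("default", "a < b"))
print(registry.render("plain", "a < b"))
```

Neither toy class parses real wikitext. They exist only to show that swapping implementations requires no change on the caller's side, which is the property an implementation-neutral API is meant to provide.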

C: Migrate wikitext semantics towards a DOM-composition based processing model
We are broadly referring to this body of work as "wikitext 2.0" work.
 * Clean up the templating model to return balanced DOM fragments, and introduce syntactical features to support this cleanup (e.g. heredoc syntax).
 * Propose and adopt a spec for DOM-fragment composition.
 * Any additional work to address rough edges in wikitext.
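
To make the notion of a "balanced" fragment concrete, here is a small sketch using only Python's standard html.parser module. It assumes that "balanced" means every tag opened inside a fragment is also closed inside it, in order. That is a deliberately strict simplification (a real HTML5 parser tolerates optional end tags such as a missing </li>), and it is not the definition used by any actual proposal. The names BalanceChecker and is_balanced are made up for this example.

```python
from html.parser import HTMLParser

# HTML void elements never take a closing tag, so they cannot unbalance a fragment.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
                 "link", "meta", "source", "track", "wbr"}


class BalanceChecker(HTMLParser):
    """Tracks open tags with a stack; any mismatch marks the fragment unbalanced."""

    def __init__(self) -> None:
        super().__init__()
        self.stack = []
        self.balanced = True

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_ELEMENTS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS:
            return
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            # Closing tag with no matching open tag inside this fragment.
            self.balanced = False


def is_balanced(fragment: str) -> bool:
    """Return True if every tag opened in the fragment is closed in it, in order."""
    checker = BalanceChecker()
    checker.feed(fragment)
    checker.close()
    return checker.balanced and not checker.stack


table_start = "<table><tr><td>cell"   # output of a hypothetical "table start" template
table_end = "</td></tr></table>"      # output of a hypothetical "table end" template

print(is_balanced("<div><b>text</b></div>"))
print(is_balanced(table_start))
print(is_balanced(table_end))
print(is_balanced(table_start + table_end))
```

The last two fragments are each unbalanced on their own and only become well-formed once concatenated. Output that makes sense only as a string, not as a self-contained DOM fragment, is the kind of case a DOM-composition model would need to handle.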

D: Develop and promote specs
Develop parser specifications to aid with MediaWiki output interoperability, extension development, template and extension authors, and compliance of (alternate) wikitext implementations.

There are 4 different pieces to this:
 * 1) Output spec: Parsoid DOM Spec (in place). This spec helps with interoperability of MediaWiki output. We'll clean up and update the documentation to make it friendlier.
 * 2) Implementation compliance spec: Parser tests (in place). We'll clean up the test infrastructure and make it more usable and maintainable.
 * 3) Implementation-neutral extension and Parser API spec: To be developed (enables pluggability of parsers, and lets extensions hook into the parser). This will help extension authors write extensions that can be supported in any implementation of a wikitext parser (vs. being tied to a specific parser's internals). This work will be done as part of B: above.
 * 4) Language (wikitext) spec: To be developed in conjunction with evolving wikitext semantics ("wikitext 2.0" in C: above). We will NOT attempt a spec for wikitext of today.
   * We will develop a base markup spec for wikitext markup.
   * We will develop a DOM fragment composition spec that specifies how template and extension output will compose with the markup of a page.

E: Evaluate feasibility of porting Parsoid to PHP
See Parsing/Notes/Moving Parsoid Into Core for a more detailed note about this.

Since at least 2013, there have been calls to port Parsoid to PHP for several reasons: the need to support 3rd party monolithic MediaWiki installations, the somewhat awkward service boundary around Parsoid, and the additional language to support on the cluster.

So far, we have always rejected this proposal because of the absence of HTML5 parsing libraries in PHP, potentially serious performance drawbacks (compared to the PHP parser, Parsoid does a lot more work, including a lot of DOM passes), and the fact that Parsoid was still in its early days and not yet an established part of the Wikimedia product ecosystem.

As of late 2016, the picture is somewhat different from 2013. Parsoid is fairly well established, has clear utility to the Wikimedia projects, and is not going away. The Wikimedia cluster has been using HHVM, which enables much higher PHP performance. Separately, as part of work to replace Tidy and support balanced template output, the parsing team has worked on two separate PHP HTML5 parsers, both of which seem viable at this point. This updated landscape lets us revisit the unresolved questions from 2013: whether a PHP port of Parsoid should be undertaken, and whether it would be viable for use on the Wikimedia cluster. However, it is unclear whether a PHP port of Parsoid would really serve the interests of 3rd party MediaWiki users, since they may not have access to HHVM on their installations. For them, the serious performance drawbacks of using vanilla PHP for Parsoid's much higher computational requirements may remain unaddressed.

Without committing to doing a full port of Parsoid to PHP, and without committing to adopting it for the Wikimedia cluster, we nevertheless want to evaluate and experiment with porting Parsoid to PHP. We believe we have pathways available to us that don't require us to do a full port of Parsoid in order to evaluate its feasibility. Separately, some of the performance and code cleanup work we need to do before we undertake this experiment will be beneficial to Parsoid anyway.

We will embark on this experiment once we are further along the path of direction A above and have sufficiently addressed output and feature incompatibilities between Parsoid and the PHP parser.

In the event that a PHP port proves viable for adoption, the API that Parsoid provides will remain unchanged. RESTBase will continue to proxy Parsoid output and to store one version of output per article. Applications like Mobile Content Service will continue to use their existing transformations for their needs.

Since this is a potentially risky and controversial direction, we'll prepare a separate note that discusses it in greater detail.