Parsing

From mediawiki.org

The parsing team is, as the name indicates, responsible for maintaining the wikitext parsing products.

You can find us on IRC at #mediawiki-parsoid or reach us by email at the parsing-team address on the wmf domain.

Mission Statement

  • (In) Advance wikitext as a language (easier to write, faster to parse, less error-prone for authors/editors).
  • (Modify) Support robust and reliable content modification, via wikitext, HTML, or beyond.
  • (Out) Make our rendered content more standards-compliant and easier to analyze.
  • (Tools) Support tools such as VisualEditor, Content Translation, Wikitext Linting, and others.
  • (Parsers) Unified reusable parser library for reading as well as editing.

History

Prior to May 2015, the MediaWiki core team was responsible for the MediaWiki core PHP parser, and the Parsoid team was responsible for Parsoid. In May 2015, the Parsoid team was rebranded as the Parsing team, and both the core PHP parser and Parsoid were brought under its purview. This coincided with Tim Starling joining the group. Kunal joined the team in April 2016. As of April 2017, Tim and Kunal are part of the MediaWiki Platform team.

Current projects

  • Parser Unification as part of the Platform Evolution Cross-Departmental Program.
    • As of Jan 2018, we started work in earnest to move Parsoid into MediaWiki Core by porting it to PHP.
    • As of Dec 20 2019, we have switched over all traffic from Parsoid/JS to Parsoid/PHP.
    • As of May 2019, we are working to prepare Parsoid to be used for all wikitext use cases within MediaWiki, initially only on Wikimedia wikis. That goal is still at least a year away, possibly longer.
      • We are extracting an abstract Parser class that the legacy parser and Parsoid can both extend
      • We are working on an Extension API for extensions to hook with Parsoid
      • Picked up work related to T43716: Language variant support in Parsoid based on Finite State Transducer formalism (T191925)
      • Picked up work related to T118517: Using the <figure> tag for media markup
      • Addressing other functionality gaps and bugs
  • Parsoid:
    • Addressing needs of editing products deployed on the Wikimedia cluster (VisualEditor, Flow, Content Translation, bots) and non-editing projects (OCG, Google, Kiwix).
    • Specifically, we are assisting the Editing Team in its talk page reboot project.
  • Evolving wikitext -- all these projects are on hold now
    • Various RfCs seek to evolve wikitext in ways that help address technical debt in content. A complete list is on Phabricator: RfCs tagged with "Parsing Team". Some examples:
      • T114432: Heredoc-style syntax for long template arguments -- stalled on other projects for now
      • T114445: Balanced templates -- stalled on other projects for now
  • Testing infrastructure associated with these projects

Get involved

For updates and calls to action, please see and bookmark Parsing/Get involved. We will be using that space to push out community notifications around our work.

Code repositories that the Parsing team is either directly responsible for or shares responsibility with other teams

  • Parsoid (both JS & PHP versions)
  • Node.js libraries that Parsoid depends on
    • domino HTML5 parser and DOM
    • prfun promises library
    • wikipeg PEG parser (previously a fork of pegjs)
  • MediaWiki-Parser
  • RemexHtml HTML5 parser
  • MediaWiki Extensions
    • ParsoidBatchAPI (this will be retired and undeployed from the Wikimedia cluster now that Parsoid/JS is going to be decommissioned)
    • Linter
    • ParserMigration (this is not currently deployed to the Wikimedia cluster)
  • QA tools
    • TestReduce
    • VisualDiff
    • UprightDiff

Long-term directions as of November 2016

The Parsing team held its offsite in October 2016 in Seattle. Going into the offsite, the two-systems problem (maintaining both the PHP parser and Parsoid) was one of the core problems we wanted to tackle, and it informed and influenced a number of our discussions there.

Consequently, we arrived at the following broad outline of where we want to move in the coming years. It deliberately omits detail. The specifics will evolve based on team discussions, consultation with other WMF teams and the architecture committee, input from editors, and results and feedback from the work we undertake.

We will separately publish a list of specific projects that we are going to be working on as we move along these directions.

A: Move to adopt Parsoid as the primary MediaWiki wikitext parser

  • We'll adopt the Parsoid DOM Spec as the output spec for wikitext which will be a versioned spec.
  • DOM features in Parsoid won't be replicated in the PHP parser (except those that are easy to support and that help with adopting Parsoid HTML for read views or with evolving wikitext semantics).
  • Once we address output and feature incompatibilities between the PHP parser and Parsoid, we'll use Parsoid HTML for read views as well as editing.
  • Today's PHP parser will be treated as a legacy implementation, to be deprecated and removed in the long term.

B: Make MediaWiki friendly to multiple parser implementations

  • Make the parser API used by extensions (e.g., parser hooks) implementation-neutral. The current parser hooks provided by the PHP parser don't all have Parsoid equivalents, since some refer to PHP parser internals.
  • Update the parsing API used in MediaWiki (as necessary) to ensure that alternative implementations (e.g., markdown, wikitext 2.0, the legacy PHP parser) can be plugged in.
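
As a rough illustration of what "pluggable" could mean here, the following Python sketch registers several parser implementations behind one interface. Everything in it is hypothetical: `WikiParser`, `register_parser`, `render`, and the two stub parsers are invented for this example and are not MediaWiki or Parsoid APIs. The stubs only handle bold markup so that the example stays small.

```python
from abc import ABC, abstractmethod
from typing import Dict


class WikiParser(ABC):
    """Hypothetical implementation-neutral parser interface (not a MediaWiki API)."""

    @abstractmethod
    def parse(self, source: str) -> str:
        """Return an HTML string for the given source text."""


def _wrap_bold(source: str, marker: str) -> str:
    # Text between an odd/even pair of markers is wrapped in <b>...</b>.
    parts = source.split(marker)
    return "".join(
        f"<b>{part}</b>" if i % 2 == 1 else part for i, part in enumerate(parts)
    )


class LegacyWikitextStub(WikiParser):
    """Stand-in for a wikitext parser; only understands '''bold'''."""

    def parse(self, source: str) -> str:
        return "<p>" + _wrap_bold(source, "'''") + "</p>"


class MarkdownStub(WikiParser):
    """Stand-in for an alternative markup; only understands **bold**."""

    def parse(self, source: str) -> str:
        return "<p>" + _wrap_bold(source, "**") + "</p>"


# Callers look parsers up by content format and never touch parser internals.
PARSERS: Dict[str, WikiParser] = {}


def register_parser(content_format: str, parser: WikiParser) -> None:
    PARSERS[content_format] = parser


def render(content_format: str, source: str) -> str:
    return PARSERS[content_format].parse(source)


register_parser("wikitext", LegacyWikitextStub())
register_parser("markdown", MarkdownStub())

print(render("wikitext", "'''bold''' text"))
print(render("markdown", "**bold** text"))
```

Both calls go through the same `render` entry point, so code written against `WikiParser` would not depend on which implementation is registered. That is the property the bullets above ask of the extension and parsing APIs.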

C: Migrate wikitext semantics towards a DOM-composition based processing model

  • Clean up the templating model to return balanced DOM fragments, and introduce syntactical features to support this cleanup (e.g., heredoc syntax).
  • Propose and adopt a spec for DOM-fragment composition.
  • Any other additional work to address rough edges in wikitext.

We are broadly referring to this body of work as "wikitext 2.0" work.
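
To make "balanced DOM fragments" concrete, here is a minimal Python sketch. It assumes a deliberately simplified notion of balance (every non-void start tag is closed in order within the same fragment) and uses the standard library's `html.parser`. The names `is_balanced` and `compose` are hypothetical. This is not Parsoid's algorithm or the proposed composition spec, only an illustration of why balanced template output is easier to compose.

```python
from html.parser import HTMLParser
from typing import List

# Elements that never take a closing tag in HTML.
VOID_TAGS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}


class _BalanceChecker(HTMLParser):
    """Tracks open elements with a stack; flags any mismatched end tag."""

    def __init__(self) -> None:
        super().__init__()
        self.stack: List[str] = []
        self.balanced = True

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if not self.stack or self.stack.pop() != tag:
            self.balanced = False


def is_balanced(fragment: str) -> bool:
    """True if the fragment opens and closes all of its own elements."""
    checker = _BalanceChecker()
    checker.feed(fragment)
    checker.close()
    return checker.balanced and not checker.stack


def compose(fragments: List[str]) -> str:
    """Concatenate fragments, rejecting any that is not balanced on its own."""
    for fragment in fragments:
        if not is_balanced(fragment):
            raise ValueError(f"unbalanced fragment: {fragment!r}")
    return "".join(fragments)


# A template that emits a whole table is balanced ...
whole_table = "<table><tr><td>cell</td></tr></table>"
# ... while a "table start" style template leaves elements open.
table_start = "<table><tr><td>"

print(is_balanced(whole_table))
print(is_balanced(table_start))
print(compose(["<p>Intro</p>", whole_table]))
```

When every fragment is balanced by itself, each one can be parsed, cached, and inserted independently, and concatenating them cannot change how neighbouring content is interpreted.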

D: Develop and promote specs

Develop parser specifications to aid with MediaWiki output interoperability, extension development, template and extension authors, and compliance of (alternate) wikitext implementations.

There are four pieces to this:

  1. Output spec: Parsoid DOM Spec (in place). We'll clean up and update the documentation to make it friendlier. This spec helps with interoperability of MediaWiki output.
  2. Implementation compliance spec: Parser tests (in place). We'll clean up the test infrastructure and make it more usable and maintainable.
  3. Implementation-neutral extension and Parser API spec: To be developed (enables pluggability of parsers, and extensions to hook into parser). This will help extension authors write extensions that can be supported in any implementation of a wikitext parser (vs. being tied to a specific parser's internals). This is work that will be done as part of B: above.
  4. Language (wikitext) spec: To be developed in conjunction with evolving wikitext semantics ("wikitext 2.0" in C: above). We will NOT attempt a spec for wikitext of today.
    • We will develop a base markup spec for wikitext markup
    • We will develop a DOM fragment composition spec that specifies how template and extension output will compose with the markup of a page

See also