Parsing

From MediaWiki.org

The parsing team is, as the name indicates, responsible for maintaining the wikitext parsing products.

You can find us on IRC at #mediawiki-parsoid or reach us by email at the parsing-team address on the WMF domain.

Mission Statement

  • (Input) Advance wikitext as a language (easier to write, faster to parse, less error-prone).
  • (Output) Make wikitext content easier to analyze, manipulate, and evolve, in order to support tools like VisualEditor, Content Translation, wikitext linting, and others.
  • (Parsers) Arrive at a unified parser for reading as well as editing.

History

Prior to May 2015, the erstwhile MediaWiki core team was responsible for the MediaWiki core PHP parser, and the Parsoid team was responsible for Parsoid. In May 2015, the Parsoid team was rebranded as the Parsing team, and both the core PHP parser and Parsoid were brought under its purview; this coincided with Tim Starling joining the group. Kunal joined the team in April 2016. As of April 2017, Tim and Kunal are part of the MediaWiki Platform team. As of July 2017, we have an open position for a new person to join our team.

Current projects

  • Parsoid:
    • Addressing needs of editing products deployed on the Wikimedia cluster (VisualEditor, Flow, Content Translation, bots) and non-editing projects (OCG, Google, Kiwix).
    • Making progress towards enabling read views with Parsoid HTML.
    • Using Parsoid as a wikitext linting tool (see T48705). As of June 2017, we have deployed the Linter extension to Wikimedia wikis to expose this to users, and we are using it to help us replace Tidy on the Wikimedia cluster.
  • PHP core parser:
    • Replacing Tidy (task T89331): Replacing Tidy with an HTML5 parser (which not only addresses longstanding Tidy complaints, but also moves the core parser output closer to Parsoid's).
  • Evolving wikitext
  • Testing infrastructure associated with these projects

Get involved

For updates and calls to action, please see and bookmark Parsing/Get involved. We will be using that space to push out community notifications around our work.

Long-term directions as of November 2016

The parsing team had its offsite in October 2016 in Seattle. Going into the offsite, the two-systems problem (maintaining both the PHP parser and Parsoid) was one of the core problems we wanted to tackle. That informed and influenced a number of discussions we had at the offsite.

Consequently, we arrived at the following broad outline of where we want to go in the coming years. This is deliberately a broad outline without a lot of details. The specifics will evolve based on team discussions, consultation with other WMF teams and the architecture committee, input from editors, and results and feedback from the work we undertake.

We will separately publish a list of specific projects that we are going to be working on as we move along these directions.

A: Move to adopt Parsoid as the primary MediaWiki wikitext parser

  • We'll adopt the Parsoid DOM Spec as the versioned output spec for wikitext.
  • DOM features in Parsoid won't be replicated in the PHP parser (except those that are easy to support and that help with adopting Parsoid HTML for read views or with evolving wikitext semantics).
  • Once we address output and feature incompatibilities between the PHP parser and Parsoid, we'll use Parsoid HTML for read views as well as editing.
  • Today's PHP parser will be treated as a legacy implementation and will be deprecated and removed in the long term.

B: Make MediaWiki friendly to multiple parser implementations

  • Make the parser API used by extensions (e.g. parser hooks) implementation-neutral. The current parser hooks provided by the PHP parser don't all have Parsoid equivalents since they refer to PHP parser internals.
  • Update the parsing API used in MediaWiki (as necessary) to ensure that alternative implementations (e.g. Markdown, wikitext 2.0, the legacy PHP parser) can be plugged in.
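
As a rough illustration of what "implementation-neutral" could mean here, the following Python sketch has callers depend only on a small interface while interchangeable implementations are registered by name. All names in it (ContentParser, ParserRegistry, PlainTextParser, ShoutingParser) are hypothetical and chosen for this example; it is not the MediaWiki or Parsoid API, and the toy parsers do not parse wikitext.

```python
import html
from typing import Dict, Protocol


class ContentParser(Protocol):
    """Hypothetical implementation-neutral interface: source text in, HTML out."""

    def parse(self, source: str) -> str:
        ...


class PlainTextParser:
    """Toy stand-in for one implementation: escapes text and wraps it in <p>."""

    def parse(self, source: str) -> str:
        return "<p>" + html.escape(source) + "</p>"


class ShoutingParser:
    """Toy stand-in for an alternative implementation with different output."""

    def parse(self, source: str) -> str:
        return "<p>" + html.escape(source.upper()) + "</p>"


class ParserRegistry:
    """Maps a content-format name to whichever implementation is plugged in."""

    def __init__(self) -> None:
        self._parsers: Dict[str, ContentParser] = {}

    def register(self, name: str, parser: ContentParser) -> None:
        self._parsers[name] = parser

    def parse(self, name: str, source: str) -> str:
        # Callers depend only on the interface, never on parser internals.
        return self._parsers[name].parse(source)


registry = ParserRegistry()
registry.register("plain", PlainTextParser())
registry.register("shout", ShoutingParser())
```

Because the registry only ever calls parse(), swapping one implementation for another requires no change in calling code, which is the property the bullets above ask of the extension-facing API.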

C: Migrate wikitext semantics towards a DOM-composition based processing model

  • Clean up the templating model to return balanced DOM fragments, and introduce syntactic features to support this cleanup (e.g. heredoc syntax).
  • Propose and adopt a spec for DOM-fragment composition.
  • Any other additional work to address rough edges in wikitext.

We are broadly referring to this body of work as "wikitext 2.0" work.
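
To make "balanced" concrete: a template that emits an opening tag and leaves it for a different template to close does not produce a balanced fragment on its own. The Python sketch below is a deliberately simplified check for that property using the standard library's html.parser. It is an assumption-laden illustration, not how Parsoid or the PHP parser work: it only matches start and end tags on a stack and ignores HTML5's optional-end-tag rules. The function name is_balanced_fragment is hypothetical.

```python
from html.parser import HTMLParser

# HTML void elements never take an end tag, so they are not tracked.
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}


class _BalanceChecker(HTMLParser):
    """Tracks open tags on a stack and flags any mismatched end tag."""

    def __init__(self) -> None:
        super().__init__()
        self.stack = []
        self.balanced = True

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_ELEMENTS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS:
            return
        # An end tag must close the most recently opened element.
        if not self.stack or self.stack.pop() != tag:
            self.balanced = False


def is_balanced_fragment(markup: str) -> bool:
    """Return True if every opened tag is closed, in order, within markup."""
    checker = _BalanceChecker()
    checker.feed(markup)
    checker.close()
    return checker.balanced and not checker.stack
```

Under this check, "<div><b>x</b></div>" counts as balanced, while "<table><tr><td>x" (a fragment that relies on something else to close the table) does not.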

D: Develop and promote specs

Develop parser specifications to support MediaWiki output interoperability, extension development, template and extension authors, and compliance of (alternate) wikitext implementations.

There are four pieces to this:

  1. Output spec: Parsoid DOM Spec (in place). We'll clean up and update the documentation to make it friendlier. This spec helps with interoperability of MediaWiki output.
  2. Implementation compliance spec: Parser tests (in place). We'll clean up the test infrastructure and make it more usable and maintainable.
  3. Implementation-neutral extension and Parser API spec: To be developed (enables pluggability of parsers, and lets extensions hook into the parser). This will help extension authors write extensions that can be supported in any implementation of a wikitext parser (vs. being tied to a specific parser's internals). This work will be done as part of B above.
  4. Language (wikitext) spec: To be developed in conjunction with evolving wikitext semantics ("wikitext 2.0" in C above). We will NOT attempt a spec for the wikitext of today.
    • We will develop a base markup spec for wikitext markup.
    • We will develop a DOM fragment composition spec that specifies how template and extension output will compose with the markup of a page.

E: Evaluate feasibility of porting Parsoid to PHP

Since at least 2013, there have been calls to port Parsoid to PHP for several reasons: the need to support third-party monolithic MediaWiki installations, the somewhat awkward service boundary around Parsoid, and the additional language to support on the cluster.

So far, we have always rejected this proposal for three reasons: the absence of HTML5 parsing libraries in PHP, potentially serious performance drawbacks (compared to the PHP parser, Parsoid does a lot more work, including many DOM passes), and the fact that Parsoid was in its early days and not yet an established part of the Wikimedia product ecosystem.

As of late 2016, this picture is a bit different from 2013. Parsoid is fairly well established, with clear utility to the Wikimedia projects, and is not going away. The Wikimedia cluster has been using HHVM, which enables much higher PHP performance. Separately, as part of work to replace Tidy and support balanced template output, the parsing team has worked on two separate PHP HTML5 parsers, both of which seem viable at this point. This updated landscape lets us revisit the unresolved questions from 2013: whether a PHP port of Parsoid should be undertaken, and whether it would be viable for use on the Wikimedia cluster. However, it is unclear whether a PHP port of Parsoid would really serve the interests of third-party MediaWiki users, since they may not have access to HHVM on their installations. For them, the serious performance drawbacks of using vanilla PHP for Parsoid's much higher computational requirements may remain unaddressed.

Without committing to doing a full port of Parsoid to PHP, and without committing to adopting it for the Wikimedia cluster, we nevertheless want to evaluate and experiment with porting Parsoid to PHP. We believe we have pathways available to us that don't require us to do a full port of Parsoid in order to evaluate its feasibility. Separately, some of the performance and code cleanup work we need to do before we undertake this experiment will be beneficial to Parsoid anyway.

We will embark on this experiment once we are further along the path of direction A above and have sufficiently addressed output and feature incompatibilities between Parsoid and the PHP parser.

In the event that a PHP port proves viable for adoption, the API that Parsoid provides will remain unchanged. RESTBase will continue to proxy Parsoid output and to store one version of output per article, and applications like Mobile Content Service will continue to use their existing transformations for their needs.

Since this is a potentially risky and controversial direction, we'll prepare a separate note that discusses it in greater detail.

See also