Parsoid/Extension API

Introduction
Terminology:


 * In the rest of this document, we use the term wikitext engine interchangeably with wikitext parser. Parsoid not only parses wikitext and generates HTML but also serializes HTML to wikitext. As such, wikitext engine is a better term.
 * wt2html is a shortcut for wikitext to HTML transformation.
 * html2wt is a shortcut for HTML to wikitext transformation.

This page only concerns extensions that interact with the wikitext engine either to process wikitext or because they register for one of the many parser hooks currently supported by the MediaWiki core wikitext engine. In this first pass drafting the extension API here, we are going to only deal with extensions that implement tag handlers. The Parsoid codebase has support for extensions that implement content handlers as well and in the next round of updates, we will update this page to document that support.

Types of extension tags
It is useful to broadly distinguish between a few types of extension tags for the purpose of figuring out how to interact with Parsoid and the extension API.


 * Type 1: Extension tags like  and   that don't treat their contents as wikitext at all, i.e.
 * Type 2: Extension tags like  that wrap regular wikitext and use the parser's output more or less as is, i.e.
 * Type 3: Extension tags like  that wrap regular wikitext but preprocess their source to generate wikitext that they need the parser to generate output for, but then use the parser's output more or less as is, i.e.
 * Type 4: Extension tags like  that have wikitext-like snippets in their source that they preprocess to feed to the parser, and then stuff the parser's output in a DOM tree they construct separately, i.e.

For lack of imagination, we are just using names like type 1, 2, 3, 4 so we can refer back to them in the rest of the document. We believe this broad categorization can be mildly helpful in making sense of how to use Parsoid's extension API. This categorization may not have much value outside this page and should not be given more merit than it deserves.

Core Parsoid concepts to be familiar with
Annotations added to extension wrappers: Parsoid decorates extension wrappers with a few attributes  that lets Parsoid's clients (reading or editing) demarcate content that comes from an extensions and convey additional information about it.

Selective Serialization: Parsoid has a html2wt mode for edited documents where it diffs the original and edited HTML and runs the html2wt transformation only for edited DOM subtrees. For the unedited subtrees, it uses source offsets for those subtrees to emit the original wikitext for those subtrees. This technique is critical to prevent "dirty diffs" when edited documents are converted to wikitext. Not all wikis care about this issue but it is a significant concern for Wikimedia run wikis.

DSR (Dom Source Range) offsets: As part of transforming wikitext to HTML, Parsoid maps DOM elements to the source wikitext substring that generated that DOM element. Based on this mapping, Parsoid assigns a 4-element offset array to all DOM elements (with caveats that we'll skip here for now). While this is considered Parsoid-internals information as far as Parsoid's clients are concerned, extensions might need to be aware of this concept in case they want to generate (or ensure the accuracy of) these offsets for DOM elements in their output.

DSR considerations will likely only apply to extensions of types 2-4, and only if they decide to implement their own html2wt transformations and only if they decide to support selective serialization. So, for a majority of extensions, for initial implementations, this may not be relevant. For type 2 extensions, Parsoid's computed DSR offsets are likely going to be accurate and so, only extensions of type 3 and 4 would need to worry about this.

All that said, as a first approximation, it is always safe to set a parse-time option asking the API to null out all computed DSR offsets (see more below).

API & Hooks
In the Parsoid regime, extensions will NOT get direct access to the wikitext engine. All interaction happens through an extension API and hooks; extensions can register "hook listeners" by implementing interfaces that declare those hooks. Unlike the current set of parser hooks, Parsoid hooks are primarily transformation hooks. While some of them might refer to a timeline in the processing of the input document (whether wikitext or DOM) like initialization, post-processing, or finalization, any such exposed events do not reference implementation-specific pipeline events (before/after some pipeline stage).

For now, we only support three transformation hooks,,  , and a DOM post processor. As we analyze more extensions and get feedback, we will consider what other hooks might becomes necessary and how to support them.

Support for html2wt transformations
Parsoid provides a default html2wt transformation based on information encoded in the  attribute during the wikitext to HTML transformation. Given this basic support, Visual Editor can only provide extremely basic editing support (direct editing of the  attribute likely). However, if extensions intend to provide custom editing support for editing clients like Visual Editor, they should implement the  transformation to convert edited HTML back to appropriate wikitext. Currently, Parsoid does not have selective serialization support for all extensions (support for Cite may have been implicitly baked in). But, at that time, Parsoid will expose more interface methods in the core extension tag interface if extensions choose to more carefully control how their extension HTML is serialized back to wikitext.

No support for global ordering
Extensions should not expect to maintain global document state within the extension where ordering matters. Parsoid does not guarantee that repeated occurrences of the same extension tag will be processed in the same order in which they are seen on the page (for ex: because of batched processing or cooperative multitasking). Nor should extensions assume that they will be invoked for every instance that is seen in wikitext (for ex: because we reuse parsed content from a cache). All Parsoid guarantees is that in the final output of the page, the output for extension tags will be found in the same order as they showed up in source wikitext.

Given this implementation flexibility that Parsoid reserves for itself, global state like counters cannot be reliably maintained by the extension. Extensions will get access to the fully processed DOM of the page which they could inspect to reconstruct source ordering. Parsoid's Cite implementation is one example of this scenario.

Extension registration and configuration
Parsoid will use the same extension registration interface that core uses. While we haven't yet worked out the specific details of how  will need to be updated for Parsoid, Parsoid will only recognize those extensions that implement the   interface. Currently, this interface has exactly one method:. The config object is an associative array with the following fields currently:


 * : The name of the extension
 * : If an extension implements extension tags (ex: Cite implements and :  Style modules that this extensions exports and need to be included in the list of modules on the page if this extension tag is used on the page.
 * FIXME: Should this be a per-extension-tag configuration, vs a per-extension configuration?
 * FIXME: Should this be a more generic modules property vs. being a styles property?
 * : If an extension needs to inspect the global document after is it is constructed, extensions are expected to register a DOM processor that is invoked by Parsoid. This DOM processor is expected to transform the DOM in place as appropriate. To future proof against new DOM transformations that Parsoid might support for extensions,  is an associative array that maps a transformation to an implementation class that implements an interface. Currently, only the   transformation is supported.

ExtensionTagHandler abstract class
Implementations of extension tags should extend the  abstract class. This class provides four methods:,  ,  ,. Extensions are only expected to implement the  method at the very least (otherwise, what is this extension tag even doing?). Parsoid takes care of annotating the output DOM fragment returned by the  method so that clients that process Parsoid HTML can demarcate extension output and extract other information from it besides enabling Parsoid's default html2wt transformation for this tag.

Please look at the docs for this class for more specific details about these methods. FIXME: Need to improve inline docs for this class.

Configuring extension tags
The extension tag config object is an associative array with the following fields currently:


 * : The name of the tag
 * : The class that provides the implementation of this tag. This class should extend the  abstract class as above
 * : This has 2 properties, one for wt2html and another for html2wt
 * : This options block dictates how the DOM fragment returned by the  method should be handled. Currently, only one option exists. The vast majority of extensions will not need this.
 * : By default, Parsoid takes the DOM fragment returned by the  method and splices it into the parent document in the appropriate place. However, if   is , Parsoid will leave a marker instead and store the fragment in a map. It is expected that the extension's   DOM processor will appropriately deal with these DOM fragments and manipulate them. For example, the Cite extension relies on this to migrate the ref's fragments to the references section and leave behind a citation that is appropriately globally numbered.
 * : This options block influences Parsoid's HTML to wikitext transformation. Given that extensions might implement their own  implementation, these options primarily influence how the generated wikitext interacts with its context. Currently, only one option exists.
 * : By default, the wikitext from converting the HTML is rendered inline. However, if extensions specify a  value for this property, the wikitext output is rendered on its own separate line.

Parsoid API for extensions
As part of implementing the various methods (sourceToDom, domToWikitext, etc) for extension tags and the DOM post processors, extensions might need access to certain kinds of information or functionality. For example, extensions that intend to handle wikitext as part of their implementation will rely on Parsoid to convert that wikitext to DOM. Or, they might need access to configuration information for the wiki, or the page. Or, they might need to log error messages or metrics. The  class provides this API. Please look at the linked docs for specific details about the interface. The following sections document the API methods in a few broad classes with some discussion of how / where to use them.sourceToDom

Converting wikitext to DOM
Extensions that used  when interacting with the MediaWiki core parser have two different methods to choose from:


 * transforms an extension tag to a DOM tree rooted in a requested wrapper tag (ex: div, span, sup). Extensions of type 2 and 3 are most likely going to use this API method.
 * transforms wikitext to a DOM tree. Extensions of type 4 are most likely going to use this API method.

The wikitext passed in to these methods are processed fully - there is no notion of partially processed wikitext in Parsoid. The following options are provided which are of declarative / semantic nature. Beyond this, extensions will not be able to turn on / off specific pieces of the parsing pipeline. In the long run, this makes for simpler semantics and more robust code since (a) the underlying implementation can be changed without breaking extensions (b) wikitext doesn't behave differently when used outside extensions and inside extensions which makes for a better user experience.


 * : This specifies that the output of this wikitext will be embedded in an inline / phrasing HTML context. This effectively turns off paragraphs and pre behavior.

Converting HTML to DOM and DOM to HTML
During both wt2html and html2wt transformations, Parsoid maintains the DOM in an optimized form where data attributes are not directly stored on the DOM. This is an implementation detail that extensions should not be concerned about and the specifics of this representation might change in the future. However, this has implications for when extensions need to convert HTML to a DOM and vice versa.


 * : Where extensions need to parse HTML and construct a DOM document (for example, creating an empty base document, or for processing HTML snippets), they should use the  method since it returns a DOM that is in Parsoid's canonical form. All of Parsoid core code assumes this canonical representation and without that, extensions might experiences subtle (or not so subtle) failures in certain scenarios.
 * : For the same reason as above, where extensions need to serialize a DOM node to string, they should use the  method which knows about the internal data representation while serializing. This provides additional options. One is to get innerHTML vs. outerHTML. The other option turns on a performance optimization (where the user doesn't expect to continue to using the DOM) that puts the DOM in a non-canonical form.

Sanitization helpers
While we originally planned to proxy a subset of sanitization helpers through the  object, after analyzing how current extensions use the Sanitizer code, and in light of T247804 and in the interest of reducing disruption, once T247804 is resolved, extensions will be able to use the   class directly.

Converting DOM to wikitext
This part of the API is only relevant to extensions that intend to provide custom editing support for their extensions in editing clients like VisualEditor. For example, the Cite and Gallery extensions make use of this API. The methods in this section mirror those in the wikitext to DOM section.


 * : All extension tags will need to use this. This method takes care of converting the HTML attributes to the extension's arguments while handling Parsoid-specific annotations.
 * : Use this method to convert input DOM to wikitext. Extensions of type 2 or type 3 will primarily benefit from this API method. There are no options provided to specify the context since the result wikitext is meant to be used as is between  and.
 * : Extensions that need to convert a HTML string (instead of a DOM) would use this method. As you might imagine, this is just a convenience function that chains  and   internally.
 * : Extensions of type 4 that used the  method will most likely need this method to convert DOM fragments to wikitext. The extension will have to extract relevant DOM fragments from its input DOM and convert those fragments to wikitext. This method provides additional arguments to control this conversion to wikitext. (FIXME: Should this API method get a better name?)
 * : This is a bit-wise OR of one or more flags that specifies context for this wikitext (ex: caption, option, link, start-of-line, etc.).
 * : If true, this indicates that the wikitext should be a single-line output (so, for example, lists, tables, and other multi-line constructs cannot be present)
 * : Type 4 extensions may need to escape wikitext-like constructs in a string so that the string can be used as part of a larger wikitext fragment without breaking those semantics.. For example, wikitext used in template arguments cannot use the  as is, wikitext used in table cells cannot use   as is, and so on. This method lets extensions delegate this logic to Parsoid and provides a few pre-defined context options currently and this will be expanded in the future based on usage and further analysis. Note that type 2 & 3 extensions that use   or   will not have to deal with this - Parsoid handles this automatically on their behalf.

Sundry API methods
... to be completed ...

Helpers and Utility classes
Besides the extension API, extensions also have access to the following additional classes from the Parsoid codebase and these classes are subject to the standard MediaWiki code deprecation and removal policies.


 * : This class provides a compatibility layer on top of PHP's libxml library to fill in gaps in DOM2 support that Parsoid relies on. Extension authors are strongly encouraged to use these compatibility methods when operating on the DOM.
 * : This class provides a number of potentially useful DOM helpers that extension authors might find helpful. However, extension authors could potentially develop their code without having to use these helpers.
 * ... to be completed ... ex: SiteConfig, PageConfig, DomSourceRange

Examples
Let us look at a few simple examples that will hopefully help make some sense of how this works.

RawHTML
This extension is used by parser tests and the code below is the entirety of the extension. The code should be self-explanatory.

Cite
Let us look at snapshots of a slightly more complex extension.

Config
The config is pretty straightforward. It declares two extension tags, style modules, and a wt2htmlPostProcessor. The actual implementation of the ref and references tags are in their own classes.

Ref.php
Let us take a look at the implementation of the ref tag. We won't present the entire implementation, but just a snippet of it to demonstrate the use of the API. This snippet demonstrates the use of the API to convert wikitext to DOM. That code is the entire implementation of the ref tag's processing. It simply parses the wrapped wikitext to DOM and wraps it in a  tag. It does not migrate the content of the ref to the references section, nor does it leave behind a numbered link to that section. This handler cannot do either of those tasks because (a) it does not have access to the entire document, and (b) as we noted earlier, you cannot maintain global counters reliably. Both of these tasks are accomplished by the wt2htmlPostProcessor defined in the config section earlier.

RefProcessor.php
... to be completed ...