Parsoid/Extension API

Introduction
Terminology:


 * In the rest of this document, we use the term wikitext engine interchangeably with wikitext parser. Parsoid not only parses wikitext and generates HTML but also serializes HTML to wikitext. As such, wikitext engine is a better term.
 * wt2html is a shortcut for wikitext to HTML transformation.
 * html2wt is a shortcut for HTML to wikitext transformation.

This page only concerns extensions that interact with the wikitext engine either to process wikitext or because they register for one of the many parser hooks currently supported by the MediaWiki core wikitext engine. In this first pass drafting the extension API here, we are going to only deal with extensions that implement tag handlers. The Parsoid codebase has support for extensions that implement content handlers as well and in the next round of updates, we will update this page to document that support.

Types of extension tags
It is useful to broadly distinguish between a few types of extension tags for the purpose of figuring out how to interact with Parsoid and the extension API.


 * Type 1: Extension tags like  and   that don't treat their contents as wikitext at all, i.e.
 * Type 2: Extension tags like  that wrap regular wikitext and use the parser's output more or less as is, i.e.
 * Type 3: Extension tags like  that wrap regular wikitext but preprocess their source to generate wikitext that they need the parser to generate output for, but then use the parser's output more or less as is, i.e.
 * Type 4: Extension tags like  that have wikitext-like snippets in their source that they preprocess to feed to the parser, and then stuff the parser's output in a DOM tree they construct separately, i.e.

For lack of imagination, we are just using names like type 1, 2, 3, 4 so we can refer back to them in the rest of the document. We believe this broad categorization can be mildly helpful in making sense of how to use Parsoid's extension API. This categorization may not have much value outside this page and should not be given more merit than it deserves.

Core Parsoid concepts to be familiar with
Annotations added to extension wrappers: Parsoid decorates extension wrappers with a few attributes that let Parsoid's clients (reading or editing) demarcate content that comes from an extension and convey additional information about it.

Selective Serialization: Parsoid has an html2wt mode for edited documents where it diffs the original and edited HTML and runs the html2wt transformation only for edited DOM subtrees. For the unedited subtrees, it uses their source offsets to emit the original wikitext unchanged. This technique is critical to prevent "dirty diffs" when edited documents are converted to wikitext. Not all wikis care about this issue, but it is a significant concern for Wikimedia-run wikis.
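The idea behind selective serialization can be sketched in a few lines. This is a toy model, not Parsoid's implementation: the `Subtree` record, the `edited` flag, and `toy_html2wt` are hypothetical names introduced only for illustration, and real DOM diffing is reduced to a precomputed boolean.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Subtree:
    """A top-level DOM subtree, reduced to what this sketch needs."""
    html: str        # current (possibly edited) HTML of the subtree
    src_start: int   # start offset into the original wikitext
    src_end: int     # end offset (exclusive) into the original wikitext
    edited: bool     # assumed result of diffing original vs. edited HTML


def selective_serialize(
    original_wikitext: str,
    subtrees: List[Subtree],
    html2wt: Callable[[str], str],
) -> str:
    """Reuse original source for unedited subtrees; convert only edited ones."""
    out = []
    for tree in subtrees:
        if tree.edited:
            out.append(html2wt(tree.html))
        else:
            out.append(original_wikitext[tree.src_start:tree.src_end])
    return "".join(out)


def toy_html2wt(html: str) -> str:
    """Hypothetical normalizing serializer: always emits ''' for bold."""
    return html.replace("<b>", "'''").replace("</b>", "'''")


# The author wrote bold with raw <b> tags. A full html2wt pass would
# rewrite that untouched line as '''Title''', which is a dirty diff.
original = "<b>Title</b>\nold text\n"
subtrees = [
    Subtree("<b>Title</b>\n", 0, 13, edited=False),
    Subtree("new text\n", 13, 22, edited=True),
]
result = selective_serialize(original, subtrees, toy_html2wt)
```

In this sketch the unedited first line keeps its original `<b>` markup because it is copied from the source by offset, while only the edited second line goes through the serializer.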

DSR (DOM Source Range) offsets: As part of transforming wikitext to HTML, Parsoid maps DOM elements to the source wikitext substring that generated that DOM element. Based on this mapping, Parsoid assigns a 4-element offset array to all DOM elements (with caveats that we'll skip here for now). While this is considered Parsoid-internals information as far as Parsoid's clients are concerned, extensions might need to be aware of this concept in case they want to generate (or ensure the accuracy of) these offsets for DOM elements in their output.
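To make the 4-element offset array concrete, here is a minimal sketch. It assumes the four elements are the start offset, the end offset, the width of the opening markup, and the width of the closing markup; that layout, the `Dsr` class, and its method names are assumptions of this example rather than something this page specifies.

```python
from typing import NamedTuple


class Dsr(NamedTuple):
    """Toy DOM source range; the field layout is an assumption."""
    start: int        # offset where the element's source begins
    end: int          # offset just past the element's source
    open_width: int   # width of the opening markup
    close_width: int  # width of the closing markup

    def outer(self, src: str) -> str:
        """Source substring that produced the whole element."""
        return src[self.start:self.end]

    def inner(self, src: str) -> str:
        """Source substring between the opening and closing markup."""
        return src[self.start + self.open_width:self.end - self.close_width]


src = "Some '''bold''' text"
bold = Dsr(start=5, end=15, open_width=3, close_width=3)
outer_src = bold.outer(src)
inner_src = bold.inner(src)
```

With offsets like these, an unedited element can be mapped straight back to the wikitext that produced it, which is what selective serialization relies on.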

DSR considerations will likely only apply to extensions of types 2-4, and only if they decide to implement their own html2wt transformations and only if they decide to support selective serialization. So, for a majority of extensions, for initial implementations, this may not be relevant. For type 2 extensions, Parsoid's computed DSR offsets are likely going to be accurate and so, only extensions of type 3 and 4 would need to worry about this.

All that said, as a first approximation, it is always safe to set a parse-time option asking the API to null out all computed DSR offsets (see more below).

API & Hooks
In the Parsoid regime, extensions will NOT get direct access to the wikitext engine. All interaction happens through an extension API and hooks; extensions can register "hook listeners" by implementing interfaces that declare those hooks. Unlike the current set of parser hooks, Parsoid hooks are primarily transformation hooks. While some of them might refer to a timeline in the processing of the input document (whether wikitext or DOM) like initialization, post-processing, or finalization, any such exposed events do not reference implementation-specific pipeline events (before/after some pipeline stage).

For now, we only support the following transformation hooks:,  ,  , and a DOM post processor. As we analyze more extensions and get feedback, we will consider what other hooks might become necessary and how to support them.

Clarification / disambiguation of an overloaded term: We use events here to refer to implicit parser pipeline timeline events (ex: completion of tokenization, link parsing, DOM building, etc), not explicit events emitted for the purpose of logging, metrics, instrumentation, etc. and anything that is handled by the event infrastructure like Kafka, etc. Those events are outside the purview of the parser and Parsoid.

No support for global ordering
Extensions should not expect to maintain global document state within the extension where ordering matters. Parsoid does not guarantee that repeated occurrences of the same extension tag will be processed in the same order in which they are seen on the page (for example, because of batched processing or cooperative multitasking). Nor should extensions assume that they will be invoked for every instance that is seen in wikitext (for example, because Parsoid reuses parsed content from a cache). All Parsoid guarantees is that in the final output of the page, the output for extension tags will be found in the same order as they showed up in source wikitext.

Given this implementation flexibility that Parsoid reserves for itself, global state like counters cannot be reliably maintained by the extension. Extensions will get access to the fully processed DOM of the page which they could inspect to reconstruct source ordering. Parsoid's Cite implementation is one example of this scenario.
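The pattern described above (stateless per-tag handling, followed by a document-level pass that reconstructs ordering) can be sketched as follows. This is a toy model with hypothetical names (`handle_tag`, `post_process`); it does not use Parsoid's API and only mimics an engine that invokes handlers in arbitrary order.

```python
import random
from typing import Dict, List


def handle_tag(content: str) -> Dict[str, str]:
    """Per-tag handler: sees only its own content and keeps no counter."""
    return {"content": content, "label": ""}


def post_process(fragments: List[Dict[str, str]]) -> List[str]:
    """Document-level pass: number fragments in final document order."""
    for number, fragment in enumerate(fragments, start=1):
        fragment["label"] = f"[{number}]"
    return [f"{f['label']} {f['content']}" for f in fragments]


tag_contents = ["first note", "second note", "third note"]

# Simulate an engine that processes tag instances in arbitrary order.
order = list(range(len(tag_contents)))
random.Random(7).shuffle(order)
results: Dict[int, Dict[str, str]] = {}
for index in order:
    results[index] = handle_tag(tag_contents[index])

# The only guarantee modeled here: output lands in source order.
in_document_order = [results[i] for i in range(len(tag_contents))]
footnotes = post_process(in_document_order)
```

Because numbering happens only in the final pass over the assembled output, the result is the same no matter which order the handlers ran in.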

Support for html2wt transformations
Parsoid provides a default html2wt transformation based on information encoded in the  attribute during the wikitext to HTML transformation. Given this basic support, VisualEditor can only provide extremely basic editing support (likely direct editing of the  attribute). However, if extensions intend to provide custom editing support for editing clients like VisualEditor, they should implement the  transformation to convert edited HTML back to appropriate wikitext. Currently, Parsoid does not have selective serialization support for all extensions (support for Cite may have been implicitly baked in). When that support is added, Parsoid will expose more interface methods in the core extension tag interface so that extensions can more carefully control how their extension HTML is serialized back to wikitext.

Extension registration and configuration
Parsoid-compatible extensions can be registered via the  property in the   file. This property can either specify the Parsoid configuration inline OR provide an  declaration that provides an implementation of the Wikimedia\Parsoid\Ext\ExtensionModule interface.

The config object is an associative array with the following fields currently:


 * : The name of the extension
 * : If an extension implements extension tags (ex: Cite implements  and  ), this property is an array of configuration objects for each such extension tag (see configuration spec below).
 * : Style modules that this extension exports and that need to be included in the list of modules on the page if this extension tag is used on the page.
 * FIXME: (1) Should this be a per-extension-tag configuration, vs a per-extension configuration? (2) Should this be a more generic modules property vs. being a styles property?
 * : This is an array of ObjectFactory declarations, each of which returns a class that extends the Wikimedia\Parsoid\Ext\DOMProcessor abstract class.

Configuring extension tags
The extension tag config object is an associative array with the following fields currently:


 * : The name of the tag
 * : ObjectFactory declaration that provides a class implementing this tag. The class should extend the Wikimedia\Parsoid\Ext\ExtensionTagHandler abstract class.
 * : This has 2 properties, one for wt2html and another for html2wt
 * : This options block dictates how the DOM fragment returned by the  method should be handled. Currently, only one option exists. The vast majority of extensions will not need this.
 * : By default, Parsoid takes the DOM fragment returned by the  method and splices it into the parent document in the appropriate place. However, if   is , Parsoid will leave a marker instead and store the fragment in a map. It is expected that the extension's   DOM processor will appropriately deal with these DOM fragments and manipulate them. For example, the Cite extension relies on this to migrate the ref's fragments to the references section and leave behind a citation that is appropriately globally numbered.  While this is still in Parsoid master, this option is in the process of being removed / rebranded for better semantic coherence.
 * : This options block influences Parsoid's HTML to wikitext transformation. Given that extensions might implement their own  implementation, these options primarily influence how the generated wikitext interacts with its context. Currently, only one option exists.
 * : By default, the wikitext from converting the HTML is rendered inline. However, if extensions specify a  value for this property, the wikitext output is rendered on its own separate line.

ExtensionTagHandler abstract class
This class provides four methods:,  ,  ,.

Extensions are expected to implement the  method at the very least. Parsoid annotates the output DOM fragment returned by the  method so that clients that process Parsoid HTML can demarcate extension output and extract other information from it besides enabling Parsoid's default html2wt transformation for this tag. Please look at the docs for this class for more specific details about these methods.

DOMProcessor abstract class
Currently, the only real supported DOM processor is the  method. This method will be provided an instance of the ParsoidExtensionAPI object as well as the DOM for the wikitext being processed.

We do anticipate supporting a  method in the future which will be invoked when converting HTML to wikitext. This processor will be invoked at the beginning to give extensions a chance to preprocess the DOM or extract any information necessary for later use. While the DOMProcessor class provides a method for this, this is not hooked up anywhere in Parsoid currently.

In future, based on need, other DOM processors might be supported.

Parsoid API for extensions
As part of implementing the various methods (, , etc) for extension tags and the DOM processors, extensions might need access to certain kinds of information or functionality. For example, extensions that intend to handle wikitext as part of their implementation will rely on Parsoid to convert that wikitext to DOM. Or, they might need access to configuration information for the wiki, or the page. Or, they might need to log error messages or metrics. The Wikimedia\Parsoid\Ext\ParsoidExtensionAPI class provides this API. Please look at the linked docs for specific details about the interface. The following sections document the API methods broadly with some discussion of how / where to use them.

Converting wikitext to DOM
Extensions that used  when interacting with the MediaWiki core parser have two different methods to choose from:


 * transforms an extension tag to a DOM tree rooted in a requested wrapper tag (ex: div, span, sup). Extensions of type 2 and 3 are most likely going to use this API method.
 * transforms wikitext to a DOM tree. Extensions of type 4 are most likely going to use this API method.

The wikitext passed in to these methods is processed fully; there is no notion of partially processed wikitext in Parsoid. The following options are provided, all of which are declarative / semantic in nature. Beyond this, extensions will not be able to turn on / off specific pieces of the parsing pipeline. In the long run, this makes for simpler semantics and more robust code since (a) the underlying implementation can be changed without breaking extensions and (b) wikitext doesn't behave differently when used outside extensions and inside extensions, which makes for a better user experience.


 * : With this option, you can specify the embedding context for the DOM. Currently, the only available value is  to specify that the output of this wikitext will be embedded in an inline / phrasing HTML context. This effectively turns off paragraphs and pre behavior. In the future, other context values (like table cell, list item, link etc.) might be supported. Most extensions won't need to specify this option unless the extension is only meant to be used in such contexts. For example, Cite uses this option as a backward-compatibility hack to support its paragraph wrapping and space-indented-pre behavior (which don't make sense when content is meant to be used in an inline / phrasing HTML context).

Converting HTML to DOM and DOM to HTML
During both wt2html and html2wt transformations, Parsoid maintains the DOM in an optimized form where data attributes are not directly stored on the DOM. This is an implementation detail that extensions should not be concerned about and the specifics of this representation might change in the future. However, this has implications for when extensions need to convert HTML to a DOM and vice versa.


 * : Where extensions need to parse HTML and construct a DOM document (for example, creating an empty base document, or for processing HTML snippets), they should use the  method since it returns a DOM that is in Parsoid's canonical form. All of Parsoid core code assumes this canonical representation and without that, extensions might experience subtle (or not so subtle) failures in certain scenarios.
 * : For the same reason as above, where extensions need to serialize a DOM node to string, they should use the  method which knows about the internal data representation while serializing. This provides additional options. One is to get innerHTML vs. outerHTML. The other option turns on a performance optimization (where the user doesn't expect to continue using the DOM) that puts the DOM in a non-canonical form.
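The following toy model illustrates why a representation-aware loader and serializer matter when data attributes are kept off the DOM. Everything here is hypothetical (`ToyDocument`, `load`, `to_html`, `naive_to_html`, and the `data-example` attribute); it is a sketch of the general idea and says nothing about how Parsoid actually stores this data.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Node:
    tag: str
    attrs: Dict[str, str] = field(default_factory=dict)
    text: str = ""


def render(tag: str, attrs: Dict[str, str], text: str) -> str:
    """Serialize one element with attributes in sorted order."""
    rendered = "".join(f" {k}='{v}'" for k, v in sorted(attrs.items()))
    return f"<{tag}{rendered}>{text}</{tag}>"


class ToyDocument:
    """Keeps structured data in a side table instead of on the nodes."""

    def __init__(self) -> None:
        self.nodes: List[Node] = []
        self.bag: Dict[int, Dict[str, Any]] = {}  # node index -> parsed data

    def load(self, tag: str, attrs: Dict[str, str], text: str = "") -> int:
        """Aware loader: move data-* attributes into the side table."""
        plain = {k: v for k, v in attrs.items() if not k.startswith("data-")}
        data = {k: json.loads(v) for k, v in attrs.items() if k.startswith("data-")}
        self.nodes.append(Node(tag, plain, text))
        index = len(self.nodes) - 1
        self.bag[index] = data
        return index

    def to_html(self, index: int) -> str:
        """Aware serializer: restore the side-table data as attributes."""
        node = self.nodes[index]
        attrs = dict(node.attrs)
        for name, value in self.bag.get(index, {}).items():
            attrs[name] = json.dumps(value, separators=(",", ":"))
        return render(node.tag, attrs, node.text)


def naive_to_html(node: Node) -> str:
    """Serializer that only looks at the node and misses the side table."""
    return render(node.tag, node.attrs, node.text)


doc = ToyDocument()
idx = doc.load("span", {"class": "x", "data-example": '{"name":"demo"}'}, "hi")
aware = doc.to_html(idx)
naive = naive_to_html(doc.nodes[idx])
```

The naive serializer silently drops the structured data, which is the kind of subtle failure the bullets above warn about.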

Sanitization helpers
We originally planned to proxy a subset of sanitization helpers through the  object. However, after analyzing how current extensions use the Sanitizer code, and in light of T247804 and in the interest of reducing disruption, extensions will be able to use the  class directly once T247804 is resolved.

Converting DOM to wikitext
This part of the API is only relevant to extensions that intend to provide custom editing support for their extensions in editing clients like VisualEditor. For example, the Cite and Gallery extensions make use of this API. The methods in this section mirror those in the wikitext to DOM section.


 * : All extension tags will need to use this. This method takes care of converting the HTML attributes to the extension's arguments while handling Parsoid-specific annotations.
 * : Use this method to convert input DOM to wikitext. Extensions of type 2 or type 3 will primarily benefit from this API method. There are no options provided to specify the context since the result wikitext is meant to be used as is between  and.
 * : Extensions that need to convert a HTML string (instead of a DOM) would use this method. As you might imagine, this is just a convenience function that chains  and   internally.
 * : Extensions of type 4 that used the  method will most likely need this method to convert DOM fragments to wikitext. The extension will have to extract relevant DOM fragments from its input DOM and convert those fragments to wikitext. This method provides additional arguments to control this conversion to wikitext. (FIXME: Should this API method get a better name?)
 * : This is a bit-wise OR of one or more flags that specifies context for this wikitext (ex: caption, option, link, start-of-line, etc.).
 * : If true, this indicates that the wikitext should be a single-line output (so, for example, lists, tables, and other multi-line constructs cannot be present)
 * : Type 4 extensions may need to escape wikitext-like constructs in a string so that the string can be used as part of a larger wikitext fragment without breaking those semantics. For example, wikitext used in template arguments cannot use the  as is, wikitext used in table cells cannot use   as is, and so on. This method lets extensions delegate this logic to Parsoid. It currently provides a few pre-defined context options, which will be expanded in the future based on usage and further analysis. Note that type 2 & 3 extensions that use   or   will not have to deal with this, since Parsoid handles it automatically on their behalf.
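As a rough illustration of context-dependent escaping driven by bit-wise OR'ed flags, consider the sketch below. The `Ctx` flags, the `escape_for_context` function, and the specific escaping rules are all assumptions made for this example; Parsoid's actual contexts and escaping logic are considerably more involved.

```python
from enum import IntFlag


class Ctx(IntFlag):
    """Hypothetical context flags, combinable with bit-wise OR."""
    NONE = 0
    TEMPLATE_ARG = 1
    START_OF_LINE = 2


LINE_START_MARKUP = ("*", "#", ":", ";")


def escape_for_context(text: str, ctx: Ctx) -> str:
    """Escape only the constructs that are significant in the given context."""
    if ctx & Ctx.TEMPLATE_ARG:
        # A bare pipe would be read as an argument separator.
        text = text.replace("|", "{{!}}")
    if ctx & Ctx.START_OF_LINE and text[:1] in LINE_START_MARKUP:
        # A leading list character would start a list item.
        text = "<nowiki>" + text[0] + "</nowiki>" + text[1:]
    return text


plain = escape_for_context("* a|b", Ctx.NONE)
in_arg = escape_for_context("a|b", Ctx.TEMPLATE_ARG)
both = escape_for_context("* a|b", Ctx.TEMPLATE_ARG | Ctx.START_OF_LINE)
```

The same string is left alone, partially escaped, or fully escaped depending on where it will be embedded, which is why delegating this decision to the wikitext engine is preferable to hand-rolled escaping in each extension.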

Sundry API methods

 * Extension argument methods : Please refer to the documentation for the specific details of how to use this method, but if you need to loop over them, modify or add args, or sanitize them, these API methods are your friends.
 * : Use this method if your extension needs to render an image from an image name and an array of options (each of which currently needs to be preceded by a "|" prefix), without having to construct a wikitext string from them. If your extension also intends to generate DSR information, you will need to provide source offsets for the option strings you pass into the method.
 * : If your extension content can be used in wikitext that doesn't actually render (ex: image captions for inline images, language variant markup, etc.), and you need to run a DOM processor (ex: the wtPostProcess) on all uses of the extension independent of whether it rendered or not, you will need to use this method when you walk the DOM tree by passing in a handler that can be invoked on HTML strings.
 * Many other methods to get information: Site or Page config, extension information like tag offsets, full extension source as seen on the page, whether it was self-closed, whether this extension tag was used in a template, methods to get the URI for the page, get the URI for a title, make a title, etc. Please reference the generated documentation for this class for the full listing of methods and how to use them.

Helpers and Utility classes
Besides the  class, extensions also have access to the following additional classes from the Parsoid codebase and these classes are subject to the standard MediaWiki code deprecation and removal policies.


 * namespace: The various  classes provide a number of helpers that extension authors might find useful.
 * Wikimedia\Parsoid\Utils\DOMCompat: This class provides a compatibility layer on top of PHP's libxml library to fill in gaps in DOM2 support that Parsoid relies on. Extension authors are strongly encouraged to use these compatibility methods when operating on the DOM. We may move this class into the Wikimedia\Core namespace or possibly upstream it into Remex.
 * Wikimedia\Core\DomSourceRange: This class implements the DSR concept described above.
 * Wikimedia\Config\SiteConfig and Wikimedia\Config\PageConfig classes for extensions that need to access wiki or page config information.

Mapping existing parser hooks
With the current MediaWiki core wikitext parser, extensions have access to a number of parser hooks at different points in the parsing pipeline. A vast majority of use cases are subsumed by the  transformation as well as the   DOM pass.

In this document, we'll attempt to provide a mapping from existing parser hooks to equivalent code patterns. There is unlikely to be an exact 1:1 mapping since the processing model is quite different but for the most part, we'll provide guidelines about how to implement your use case that uses an existing parser hook.


 * : This should be subsumed by the extension config for the most part. But, if it becomes necessary where the extension needs to initialize some state it cannot do in its constructor, we can provide a hook for this.
 * : Parsoid doesn't distinguish between these states and for the most part, the  DOM Processor should cover use cases not handled implicitly by the   transformation hook.
 * : As far as we can tell, this hook might not be necessary with Parsoid at all since extensions can never access Parsoid's internal state directly and can only go through the API and Parsoid ensures clean state for every use.
 * : Parsoid does not have notions of strip state right now. One of the primary uses of the strip state mechanism is to tunnel extension output through the parser without further mangling. Parsoid already does this by default for all extension output. But, if there are uses of strip state functionality that aren't covered by the API, we'll investigate that and figure out how to support it.

... to be completed ...

Mapping parser methods to ParsoidExtensionAPI methods
Currently extensions make use of one or more of the following methods to deal with wikitext:. The equivalents of these in ParsoidExtensionAPI would be one of. However, one of the significant differences in functionality is that there is no notion of "half-parsed" or "fully-parsed" wikitext in Parsoid. You always get a DOM that is processed to the same stage in the parsing pipeline.

There is also no strip-tag notion in Parsoid currently. Extensions seem to primarily make use of it to tunnel content through the parser without further processing. In Parsoid, all extension output (the DOM produced by one of the above methods) is always tunneled through the parser and expanded into the DOM before handing it off to additional processing that operates on the final DOM (including the DOM post processors that extensions might register for). So, extensions should not have to deal with this detail. As such, you will find all such methods absent in Parsoid's extension API.

Examples
Let us look at a few simple examples that will hopefully help make some sense of how this works.

RawHTML
This extension is used by parser tests and the code below is the entirety of the extension. The code should be self-explanatory.

Cite
Let us look at snapshots of a slightly more complex extension. The configuration for this extension is available earlier in this document (Example 2 in the extension registration section).

Ref.php
Let us take a look at the implementation of the ref tag. We won't present the entire implementation, but just a snippet of it to demonstrate the use of the API. This snippet demonstrates the use of the API to convert wikitext to DOM. That code is the entire implementation of the ref tag's processing. It simply parses the wrapped wikitext to DOM and wraps it in a  tag. It does not migrate the content of the ref to the references section, nor does it leave behind a numbered link to that section. This handler cannot do either of those tasks because (a) it does not have access to the entire document, and (b) as we noted earlier, you cannot maintain global counters reliably. Both of these tasks are accomplished by the wt2htmlPostProcessor defined in the config section earlier.

RefProcessor.php
... to be completed ...