Parsoid/OutputTransform

This page discusses a proposal for an "Html2Html" conversion infrastructure in core. Loosely speaking, this is a set of "post processing" passes on HTML content conforming to the MediaWiki DOM spec.

Desiderata
These are desired features that affect the fundamental architecture.


 * 1) Abstract HTML-as-a-string and DOM. Core contains a mix of string-based and DOM-based processing.  Although we'd like to migrate the codebase over time to exclusively DOM-based processing to avoid footguns like unbalanced tags, improper content escaping, etc., this will take some time.  Further, ParserOutput content is serialized as JSON using a string representation of the HTML, so content coming from the cache will always originate in string form; and of course the final output to the browser will require serialization to a string as well.
    * At a minimum, there should be a thin abstraction over the current representation so that we avoid unnecessary serialize-to-string -> deserialize-to-DOM pairs.  A pass that takes a DOM as input and emits a DOM should be able to chain with another such pass without serializing to a string in between; and both should work correctly whether given a DOM as input (eg from Parsoid directly as a result of an initial parse) or a string (eg from ParserCache).
    * Longer-term, it would be helpful for the transition away from string-based processing to have a representation that effectively allows something like document.write on a DOM; that is, appending HTML strings to an existing DOM so that methods like   don't have to serialize the DOM to a string, do the append, then reparse as HTML.  Note that   are currently implemented in terms of   even though their output is guaranteed to be balanced; in the future we can additionally skip the serialization of the parsed wikitext to a string before the   call and append the DOMs directly.
    * Note that   stores content in string   but a related refactor would reduce the metadata duplication between   and   and have   store its content (including metadata) in a   (T301020, but not on the critical path for html2html work).
    * This implies that   has an object type, which plays nicely with the JSON serialization framework to (a) store itself as a string, and (b) reconstitute itself from a string. (Related to T327439 in the sense that we need to build consensus around JsonCodec, but I don't think anything is technically blocking this, and it is not a blocker for the rest of the work either.)
 * 2) Support HTML "Flavors". We will inevitably build several chains of postprocessing steps.  For example, to generate PCS output we might take the "core" DOM, apply redlink processing, apply language variant conversion, and then apply PCS transformations.  In our discussions (T293512) we agreed that we didn't want to expose the full granularity of each processing chain to the end user (ie, request "core+redlink+langvar+pcs" with the potential cache explosion implied), but instead hard-code certain chains and give them names, so you can request the "pcs" version of the HTML (for example).  We need a way to register well-known names and configure the corresponding pipelines in a clean way.  In addition to flavors for each supported language variant, which are orthogonal, the currently-known flavors are:
    * core ("Editable" aka "what's given to Visual Editor") (may split this further and introduce a   flavor w/ data-parsoid and section tags stripped, etc)
    * article ("Read views" aka the main article read views HTML, which is a stripped version of core)
    * pcs (a differently-stripped and restructured version of core used for apps)
    * mobile (eventually, this will be the output of "MobileFrontend" and used for mobile web views)
 * 3) Support caching of intermediate results. Ideally this is orthogonal to the flavor and pipeline configurations, so that caching is an ops decision made at a configuration level, and does not require any code changes.  This caching *must* integrate with the basic invalidation mechanisms in the ParserCache so that out-of-date content is never served.  Various decisions here:
    * Does caching happen only at the flavor level, or can it be done more fine-grained? (ie, if pcs and mobile both start with "core plus redlinks plus language conversion" can that be tagged as cacheable, or do we need to explicitly name that pipeline as a flavor in order to cache it?)
    * Can caching have different max-cache-size and max-lifetime settings than the main cache, or is this just a fork of the main cache?
    * The main cache is split three ways: the latest-revision cache, the old-versions cache, and the flagged-revision cache (the last implemented in an Extension, not in core).  Do these splits automatically apply to cached flavors as well?
    * Can we generate flavors recursively; that is, do the pipelines for all flavors generate everything from scratch?  Or, in order to generate the "article" flavor do we attempt first to generate (and possibly cache) the "core" flavor?  What happens if we time out after recursively generating flavors A, B, and C as subtasks but before we finish generating the flavor actually requested?
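
As a rough illustration of the "register well-known names and configure the corresponding pipelines" idea from the flavors item above, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the FlavorRegistry class, its method names, and the three toy passes are hypothetical and do not correspond to any existing MediaWiki API. Passes are modelled as string-to-string functions only to keep the example self-contained; real passes would more likely work on a DOM.

```python
from typing import Callable, Dict, List

# Hypothetical: a pass maps HTML to HTML. Strings keep the sketch small.
Pass = Callable[[str], str]


class FlavorRegistry:
    """Maps well-known flavor names to fixed chains of passes (illustrative only)."""

    def __init__(self) -> None:
        self._pipelines: Dict[str, List[Pass]] = {}

    def register(self, name: str, passes: List[Pass]) -> None:
        # Names are registered once; callers cannot redefine a flavor.
        if name in self._pipelines:
            raise ValueError(f"flavor already registered: {name}")
        self._pipelines[name] = list(passes)

    def names(self) -> List[str]:
        return sorted(self._pipelines)

    def transform(self, name: str, html: str) -> str:
        # Only registered names are accepted, so "core+redlink+langvar+pcs"
        # style ad-hoc chains cannot be requested.
        if name not in self._pipelines:
            raise KeyError(f"unknown flavor: {name}")
        for run_pass in self._pipelines[name]:
            html = run_pass(html)
        return html


# Toy stand-ins for real passes; each just appends a marker comment.
def add_redlinks(html: str) -> str:
    return html + "<!--redlinks-->"


def convert_variant(html: str) -> str:
    return html + "<!--langvar-->"


def pcs_restructure(html: str) -> str:
    return html + "<!--pcs-->"


registry = FlavorRegistry()
registry.register("core", [])
shared_prefix = [add_redlinks, convert_variant]
registry.register("pcs", shared_prefix + [pcs_restructure])

print(registry.names())
print(registry.transform("pcs", "<p>Hi</p>"))
```

Note that the shared prefix of passes is reused when building the "pcs" chain but has no name of its own in this sketch, which is exactly the situation the fine-grained caching question above asks about.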

API Issues
These are mostly bikeshed colors, but still require a specific decision.


 * 1) Where does this code live? In T293512 we settled on   or   as the package name, but there are still many names left to choose.  Is there an "OutputTransformService"?  How are flavors registered?  Ideally RedlinkTransform would be a class, but should it also be a package, ie   ?  And where does the current ParserOutput::getText code get moved to?  ?
 * 2) What's the API signature? Is the input a , a  , or a  ?  Is the transformation in-place or must we clone? In theory all of this new infrastructure is just implementing  ; that is, at some level for compatibility we have a ParserOutput as input and want a string as output.  But:
    * 1) In some cases we might want a DOM as output (desiderata #1)
    * 2) In some cases we might want a ParserOutput as output (ie, a post processing pass might also populate some metadata fields)
    * 3) We need to integrate with ParserOutputAccess if we want caching, and the main method there is    That is, we might want to specify a particular flavor in the , instead of/in addition to the flavor specified by the options in the   call.  Should there be a "flavor" property in ParserOptions?  Or not?  Maybe flavor caching is handled separately?  If flavors are directly exposed in the ParserOptions, then by default every flavor will split the parser cache and be cached, and maybe we don't necessarily want that.  If ParserOutputAccess decides whether or not to cache a specific flavor request, that implies that we can only cache "flavors", not arbitrary sequences of passes, but maybe that's ok.  Note that   returns a  , so this implies case #2 above (ie, the flavor pass pipeline returns a   not just a string).
 * 3) It seems like we'd want to integrate with Content/ContentHandler as well. The current interaction of ParserOptions with Content/ContentHandler is a bit awkward.  One option is for   to call   -- if we can make that work the way we want with caching, etc.
 * 4) What happens to the ParserAfterTidy etc hooks? These hooks effectively define new flavors (alternate flavors? flavor variants?).  An issue arising with DT is that the modifications *probably* don't want to apply to the "core" flavor (ie, what VE consumes) -- although it's possible an extension might tweak the HTML and *also* define a VE extension to handle the tweaked HTML.  The result of the post processing may want to be cached, or not. (This should be handled by desiderata #3, that is, the caching should be separately configurable.)  If a hook asks to be applied to "read views", does that include "article", "pcs" and "mobile"?  What happens if we introduce an additional flavor in the future?  Can extensions define a flavor?

Initial Straw Dog

 * 1) Very thin immutable "HtmlOrDom" class (better name please!), with   and   methods, as well as appropriate JSON serialization/deserialization support.    is still a string (for now) but we try to use   whenever possible, which returns an   wrapper.  (Eventually   can be migrated to   and we can use JSON serialization of that object type, but we'll separate out that change because it's a parser cache format migration.)
    * Proposed alternative:   with methods   and  .  This naming is less tied to the Parsoid-specific distinction between "html" and "dom".
 * 2) Hollow out   so that all it does is:
    * Actually,  might return a   so maybe   is what we want.
 * 3) Add , and add support in   to check the flavor and, if it is not 'core', to call  .  It will also consult MW config to determine whether or not to cache the result.
 * 4) Add a hook in FlavorDispatcher to allow inserting additional passes into the chain for any flavor; this will replace the ParserOutputAfterTidy hook.
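
To make the "very thin HtmlOrDom" idea more concrete, here is a minimal sketch in Python of a wrapper that holds either form and converts lazily, so that chained DOM-to-DOM passes never round-trip through a string. It rests on several assumptions: the method names get_html and get_dom are hypothetical (the proposal's actual method names are not reproduced here), xml.dom.minidom stands in for a real HTML parser and therefore only accepts well-formed markup, and the two counters exist purely to show how many conversions happened.

```python
from typing import Optional
from xml.dom import minidom


class HtmlOrDom:
    """Holds a document as a string, a DOM, or both; converts only on demand."""

    def __init__(self, html: Optional[str] = None,
                 dom: Optional[minidom.Document] = None) -> None:
        if html is None and dom is None:
            raise ValueError("either html or dom is required")
        self._html = html
        self._dom = dom
        self.parse_count = 0      # illustrative bookkeeping only
        self.serialize_count = 0  # illustrative bookkeeping only

    def get_dom(self) -> minidom.Document:
        if self._dom is None:
            # minidom is an XML parser; a real implementation would use an
            # HTML5 parser. This is an assumption of the sketch.
            self._dom = minidom.parseString(self._html)
            self.parse_count += 1
        return self._dom

    def get_html(self) -> str:
        if self._html is None:
            self._html = self._dom.documentElement.toxml()
            self.serialize_count += 1
        return self._html


def set_attr_pass(doc: HtmlOrDom, name: str, value: str) -> HtmlOrDom:
    """Toy DOM-to-DOM pass. The input wrapper must be discarded afterwards,
    because the DOM it holds is mutated in place."""
    dom = doc.get_dom()
    dom.documentElement.setAttribute(name, value)
    return HtmlOrDom(dom=dom)


# Content arrives as a string (as it would from a cache), then two DOM
# passes run back to back, and only the final result is serialized.
start = HtmlOrDom(html="<p>Hello <b>world</b></p>")
step1 = set_attr_pass(start, "class", "lead")
step2 = set_attr_pass(step1, "lang", "en")
result = step2.get_html()
print(result)
```

In this sketch the string is parsed once and serialized once regardless of how many DOM passes are chained; had the content arrived as a DOM (for example directly from an initial parse), no parse would happen at all.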

Notes about somewhat arbitrary decisions made:


 * 1) Transforms take in ParserOutput and return ParserOutput.  Let's say we *don't* do a complete clone and explicitly state that the input ParserOutput should be discarded after the transform (and that it is valid to return the input object as the output).  Within the transform we start by getting the DOM via ParserOutput::getHtmlOrDOM->getDOM.  This is setting us up to potentially mutate the DOM in-place, which is why we want to also explicitly state that the input ParserOutput could be mutated by this.  Probably want to keep a mutation counter of some sort so that we can make some assertions that the ParserOutput hasn't been changed "behind our back".
 * 2) Punting on exactly how the flavor pipelines are set up, but I'd have to make some decisions to implement the FlavorDispatcher.
 * 3) Not implementing a `flavor` option in ParserOutput::getText for now; instead putting the option in ParserOptions.  In theory getText could support additional "uncached" flavors, but I think it's best to have all flavors go through the caching mechanism (ParserOutputAccess) and let it be a policy decision on the part of ParserOutputAccess which ones are cached.
 * 4) Also, individual passes are never cached, only "flavors".
 * 5) Hooks can alter flavors, but we are not adding a mechanism to define new flavors at this time.
 * 6) Language Variants are done via DefaultTransform (aka ParserOutput::getText) but are *not* separate flavors (and thus not independently cacheable).
 * 7) Just doing the very minimal HtmlOrDom thing for now, not trying to solve the more general "OutputPage should use a ParserOutput and some sort of document.write mechanism" at this time.
 * 8) I used static methods for   and   for simplicity; in theory these should perhaps be dynamic methods of service objects.
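
The "mutation counter" mentioned in the first note could work roughly as in the following Python sketch. It is only one possible reading of that note: ParserOutputSketch is a toy stand-in for ParserOutput, and MutationGuard and uppercase_transform are hypothetical names invented for this example. The idea shown is that a holder records the counter value when it takes a reference, and can later assert that nothing changed the object in the meantime.

```python
class ParserOutputSketch:
    """Toy stand-in for ParserOutput with a counter bumped on every write."""

    def __init__(self, text: str) -> None:
        self._text = text
        self._mutations = 0

    @property
    def mutation_count(self) -> int:
        return self._mutations

    def get_text(self) -> str:
        return self._text

    def set_text(self, text: str) -> None:
        self._text = text
        self._mutations += 1


class MutationGuard:
    """Remembers the counter at creation time and checks it later."""

    def __init__(self, output: ParserOutputSketch) -> None:
        self._output = output
        self._expected = output.mutation_count

    def check(self) -> None:
        if self._output.mutation_count != self._expected:
            raise RuntimeError("ParserOutput was mutated after the guard was taken")


def uppercase_transform(output: ParserOutputSketch) -> ParserOutputSketch:
    """Toy transform that mutates its input in place and returns the same
    object, which the note above explicitly allows."""
    output.set_text(output.get_text().upper())
    return output


original = ParserOutputSketch("<p>hello</p>")
stale_guard = MutationGuard(original)   # taken before the transform
result = uppercase_transform(original)
fresh_guard = MutationGuard(result)     # taken after the transform
fresh_guard.check()                     # passes: nothing changed since

try:
    stale_guard.check()
    stale_detected = False
except RuntimeError:
    stale_detected = True               # the pre-transform holder is told off
```

A caller that kept using the input after handing it to a transform would trip the stale guard, which is the "behind our back" case the note wants to catch.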

Open Questions / Work In Progress
The interaction of this framework with DiscussionTools still has a few open ends. First: should  be a separate flavor (which would enable it to be cached) or is it a modification of the standard   flavor? Right now, we're going with the latter, but we'd need to build out an interface for "extensions which want to define their own flavor" if we wanted/needed to shift to the former.

Secondly, the "modification of standard flavor" hook is assumed to be a post-processing phase. But DiscussionTools also uses  Parsoid markup to build its discussion metadata structure, and might not be comfortable working with "article" markup, which will be stripped for bandwidth reasons. Maybe it will work fine (it works with legacy HTML, after all, which is an even more-stripped markup). But the discussion tools pass might actually want to be a pre-processing pass, that is, inserted into the pipeline before the HTML-stripping that Parsoid does for the  flavor. That would require some thought, and it might work better implemented as "discussion tools implements its own flavor based on the  flavor" than trying to support a hook mechanism that allows insertions at arbitrary places in the processing pipeline.

DiscussionTools reports that parsing the discussion tree on a very large page could cost half a second or more (eg 484ms on a village pump archive) and so they very much want the read-view output to be cached. Parsing the discussion tree on https://en.m.wikipedia.org/wiki/Wikipedia:Requests_for_comment/Rollback_of_Vector_2022 takes 592ms; transclusion-expansion on that page costs 947ms, so DT is reasonable by comparison, but the proposal ends up splitting the transclusion-expansion ("edit flavor") from the DT transformation ("read flavor"). Three possible solutions:


 * 1) Perhaps the "should cache" decision should be hookable so that we can selectively cache discussion tools etc even if "in general" we don't cache the read views flavor.
 * 2) Perhaps there should be a mechanism for DT to "undo" its transformation so that it could apply it to the cached 'canonical/edit' flavor and still have those changes "undone" before VE gets its hands on that flavor.
 * 3) Automatically cache any flavor which takes over Xms to generate, which should also address these outliers.
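
The first and third options above can be combined in a single "should cache" decision. The Python sketch below is one hypothetical shape for it, not a description of ParserOutputAccess: the CachePolicy class, the hook signature (flavor, page title, elapsed milliseconds), the "Talk:" prefix test, and the 500ms threshold are all assumptions chosen for illustration. A hook may return True or False to force a decision, or None to defer to configuration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Set

# Hypothetical hook signature: (flavor, page_title, elapsed_ms) -> decision.
# Returning None means "no opinion".
CacheHook = Callable[[str, str, float], Optional[bool]]


@dataclass
class CachePolicy:
    cached_flavors: Set[str] = field(default_factory=set)
    slow_threshold_ms: Optional[float] = None   # None disables option 3
    hooks: List[CacheHook] = field(default_factory=list)

    def should_cache(self, flavor: str, title: str, elapsed_ms: float) -> bool:
        # Option 1: hooks get the first say.
        for hook in self.hooks:
            decision = hook(flavor, title, elapsed_ms)
            if decision is not None:
                return decision
        # Configured flavors are always cached.
        if flavor in self.cached_flavors:
            return True
        # Option 3: cache anything that was slow to generate.
        return (self.slow_threshold_ms is not None
                and elapsed_ms > self.slow_threshold_ms)


def discussion_hook(flavor: str, title: str, elapsed_ms: float) -> Optional[bool]:
    """Toy hook: always cache the read view of talk pages."""
    if flavor == "article" and title.startswith("Talk:"):
        return True
    return None


policy = CachePolicy(
    cached_flavors={"core"},
    slow_threshold_ms=500.0,   # illustrative value, not a recommendation
    hooks=[discussion_hook],
)

print(policy.should_cache("article", "Main Page", 10.0))
print(policy.should_cache("article", "Talk:Main Page", 10.0))
print(policy.should_cache("article", "Some very large page", 592.0))
```

The second option (letting an extension "undo" its transformation) is not modelled here, since it concerns the transform itself rather than the caching decision.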