Parsoid/Parser Unification/Output flavors

From mediawiki.org

This page summarizes the current discussion on output flavors and the associated caching considerations. It does not describe a current implementation and is not a formal design document (although it may evolve in this direction).

Changelog

February 29th 2024 - initial revision

March 7th 2024 - expansion with more details on cache operation

What are output flavors?

For both the legacy parser and Parsoid, the "type" of parse is specified in a ParserOptions object and the "result" of a parse is a ParserOutput object,[1] accessed via ParserOutputAccess, which caches results in ParserCache. Callers typically use ParserOutput::getText() with a set of options to indicate which post-cache transformations should be applied; without loss of generality we will treat these as additional ParserOptions properties (T293512, T350626).

The output flavors concept is an attempt to normalize which set of options can/should be used in post-cache transformations, with the goal of being able to cache a subset of outputs without risking combinatorial explosion. In general, we see these flavors as a set of options.

We have identified several flavors that we may want to define:

  • the "canonical" flavor (also called "vanilla" or "raw"), which is the direct output of the parser. This flavor is served via REST APIs and used by some clients.
  • the "edit" flavor, which is the content served to VisualEditor. The "edit" flavor is presently identical to Parsoid's "canonical" output, although VE currently does a few client-side transformations (section stripping, for example), which could be moved server-side if the edit flavor diverged from the canonical flavor.
  • the "read views" flavor, which is the raw flavor plus the post-cache transformations that add link color processing, section edit links, modifications due to the skin, etc.
  • the "mobile views" flavor, which is currently (March 2024) derived directly from the "read views" flavor, with additional transformations performed in the OutputPageBeforeHTML hook.

More options may be considered: we may want to provide a Kiwix flavor, for instance, or cache "pre-skin" output to avoid splitting the cache by skin, or allow extensions to define their own flavors (for example, at one point DiscussionTools had an alternate "read views" flavor used by folks opting in to the DiscussionTools beta; this was done using an earlier form of cache-splitting).

Caching multiple flavors

The main change being proposed here is to allow generating one flavor using a cached result of a different flavor. A cache-splitting mechanism already exists that would allow (e.g.) content modified by DiscussionTools to be cached, but the mechanism would cache the result of a new from-scratch parse, followed by the DiscussionTools mutation. The flavor mechanism allows the reuse of cached (say) "canonical" content to generate the "read views" flavor, without requiring another from-scratch parse.

Opening the possibility to cache different flavors would enable us to tweak the space/time compromise of the cached vs computed parses.

Dragons

The primary key to the ParserCache is the ParserOptions object, but the interaction of the two is quite complex.

First, in production, expiring the entire ParserCache at once would bring down the site. So (as explained by a comment in ParserOptions::optionsHash) the key only records properties with non-canonical values. Since the default values are not included in the key, changing the default value of a parser option will not immediately invalidate the cache; instead, entries with the "old" default value will be silently returned from cache until they eventually expire.
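The effect of leaving canonical values out of the key can be illustrated with a small Python model. This is a sketch under stated assumptions, not MediaWiki code: the function `options_hash`, the option names, and the key format are all invented for illustration.

```python
def options_hash(options, canonical):
    """Build a key from only those options whose value is non-canonical.

    Hypothetical model: the real key format and option names differ.
    """
    parts = [
        f"{name}={value}"
        for name, value in sorted(options.items())
        if canonical.get(name) != value
    ]
    return "!".join(parts) if parts else "canonical"


# Canonical values before and after a (hypothetical) change of default.
old_canonical = {"thumbsize": 2, "userlang": "en"}
new_canonical = {"thumbsize": 3, "userlang": "en"}

# A request using the canonical values produces the same key in both cases,
# so entries rendered with the old default keep being served.
key_before = options_hash(dict(old_canonical), old_canonical)
key_after = options_hash(dict(new_canonical), new_canonical)

# A non-canonical value does appear in the key.
key_french = options_hash({"thumbsize": 2, "userlang": "fr"}, old_canonical)
```

In this model `key_before` and `key_after` are identical, which is exactly why a changed default does not flush the cache: nothing in the key records which default was in force when the entry was written.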

Second, for obscure historical reasons, the "canonical" values are the ParserOptions used by anonymous users (generated by ParserOptions::newFromAnon()). The default options may differ.

Third, certain options are "lazy", including date format and speculative values for revision ID and page ID. These are not initialized to specific values until "as late as possible".[2]

Fourth, only the options returned by ParserOptions::allCacheVaryingOptions() are included in the key. That excludes some executable properties (currentRevisionRecordCallback, templateCallback, etc)[3] as well as options used for tracking or debugging but which don't affect the output (renderReason), lazy options, and "ineffective" options (which control post-processing but not the initial parse). Extension code can add additional properties to ParserOptions using the ParserOptionsRegister hook, which can either be included or excluded from the cache key.

Finally, ParserOptions tracks which of its options are actually "used", using ParserOptions::$onAccessCallback. Any options which could affect the output but were never actually accessed during a particular parse are excluded from the cache key. This prevents splitting the cache based on (for example) different desired date formats if nothing on the page actually formatted a date.
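As a rough illustration of used-option tracking, the following Python sketch attaches an access callback to an options object and records which options a toy parser reads. The class `TrackedOptions`, the `{{date}}` marker, and the toy `parse` function are assumptions made for this example; they only mimic the idea of an on-access callback.

```python
class TrackedOptions:
    """Hypothetical options object that reports every option read."""

    def __init__(self, values, on_access=None):
        self._values = dict(values)
        self._on_access = on_access

    def get(self, name):
        # Notify the listener before returning the value.
        if self._on_access is not None:
            self._on_access(name)
        return self._values[name]


def parse(text, options):
    """Toy parser: reads 'dateformat' only if the text formats a date."""
    if "{{date}}" in text:
        return text.replace("{{date}}", "<" + options.get("dateformat") + ">")
    return text


def parse_and_track(text, values):
    """Parse `text` and return (html, set of option names actually read)."""
    used = set()
    options = TrackedOptions(values, on_access=used.add)
    return parse(text, options), used


plain_html, plain_used = parse_and_track("hello", {"dateformat": "dmy"})
dated_html, dated_used = parse_and_track("on {{date}}", {"dateformat": "dmy"})
```

The page without a date leaves `plain_used` empty, so the date format need not split the cache for it; the page with a date reports `dateformat` as used.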

This last item introduces a chicken-and-egg problem: if the cache key depends on the used options, how can I know how to construct the key without having parsed the page already to determine the used options?

To solve this dilemma, ParserCache contains a separate "metadata" cache, which maps a given page ID (not revision ID!) to a set of used options for that page.[4] The page is first looked up in the metadata cache. If no entry is found, then the page is certainly not in the main cache and we perform a from-scratch parse and save. If a metadata entry is found, then we use the "used options" found there to construct the main cache key. Note that there may be multiple values of the used options, so multiple entries may be present in the main cache for the given page ID. When a new revision is parsed, the initial lookup may use the "wrong" used options, but any entry fetched with that key will mismatch on the revision ID and be discarded; the first parse of the new revision will also reset the used options in the metadata cache.
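The two-step lookup described above can be modelled in a few lines of Python. This is a simplified sketch, not the ParserCache implementation: the class `TwoLevelCache`, its key format, and the choice to overwrite the used options on every save are assumptions made for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class TwoLevelCache:
    """Hypothetical model of a metadata cache in front of a main cache."""

    metadata: dict = field(default_factory=dict)  # page_id -> used option names
    main: dict = field(default_factory=dict)      # key -> (rev_id, html)

    @staticmethod
    def _key(page_id, options, used):
        # Only the used options contribute to the main cache key.
        varying = "!".join(f"{name}={options[name]}" for name in sorted(used))
        return f"{page_id}:{varying}"

    def get(self, page_id, rev_id, options):
        used = self.metadata.get(page_id)
        if used is None:
            return None  # no metadata entry: the page cannot be in the main cache
        entry = self.main.get(self._key(page_id, options, used))
        if entry is None or entry[0] != rev_id:
            return None  # miss, or an entry for another revision: discard
        return entry[1]

    def save(self, page_id, rev_id, options, used, html):
        # Simplification: every save resets the used options for the page.
        self.metadata[page_id] = frozenset(used)
        self.main[self._key(page_id, options, used)] = (rev_id, html)


cache = TwoLevelCache()
opts_fr = {"userlang": "fr", "dateformat": "dmy"}

cold_miss = cache.get(1, 10, opts_fr)
cache.save(1, 10, opts_fr, {"userlang"}, "<p>bonjour</p>")
hit = cache.get(1, 10, opts_fr)
# 'dateformat' was not used, so a different date format still hits.
hit_other_date = cache.get(1, 10, {"userlang": "fr", "dateformat": "mdy"})
# 'userlang' was used, so a different language misses.
miss_other_lang = cache.get(1, 10, {"userlang": "en", "dateformat": "dmy"})
# A newer revision finds an entry under the old key but rejects it.
miss_new_revision = cache.get(1, 11, opts_fr)
```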

Note the following two necessary conditions:

  • Every parse of a given revision ID in a given ParserCache is expected to result in the same "used options". If the used options change for different parses of a given revision ID then cache lookups could produce unexpected results.[5]
  • Only parses of "the latest" revision ID will be stored in the ParserCache. The metadata cache does not contain "used options" for anything but the single latest revision of a page ID.

Different flavors may end up getting different used options. For instance, a "raw" Parsoid parse may not be localized (and hence not use the language-related options) but still have internationalized content that would need resolving for the "read view" flavors (which would consequently use the language-related options).[6] Hence, the "revision ID => used options" invariant may not hold if the flavors coexist in the same ParserCache. This would argue for either tweaking the metadata cache key to also include the flavor, or insisting that flavors are cached in distinct ParserCache objects and hence have distinct metadata caches.
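The first of these two remedies can be pictured with a minimal Python sketch. The helper `metadata_key` and the flavor names are hypothetical; the point is only that a metadata key without the flavor lets one flavor's used options overwrite another's.

```python
def metadata_key(page_id, flavor=None):
    """Hypothetical metadata cache key, optionally split by flavor."""
    return (page_id,) if flavor is None else (page_id, flavor)


# One metadata slot per page: the second flavor overwrites the first.
shared = {}
shared[metadata_key(7)] = frozenset()              # "raw": no language options used
shared[metadata_key(7)] = frozenset({"userlang"})  # "read": language options used

# One metadata slot per (page, flavor): both sets of used options survive.
split = {}
split[metadata_key(7, "raw")] = frozenset()
split[metadata_key(7, "read")] = frozenset({"userlang"})
```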

To complete our tour of dragons, there are two additional "parser cache" mechanisms at work. The RevisionOutputCache stores "old revisions", i.e. any parse which is not of the latest revision. It reuses the ParserOptions::optionsHash() mechanism but skips the separate metadata cache; hence "unused options" are not omitted from the cache key used by RevisionOutputCache. In addition, the FlaggedRevisions extension has its own FlaggedRevsParserCache used for storing the output of "stable" page revisions. It reuses the mechanisms of ParserCache wholesale, including the metadata cache for used options, with the sole exception being that it is only used for stable revision IDs. So instead of storing only "the latest revision ID" for a given page it stores only "the latest stable revision ID", with the same two necessary conditions modified to match.

Grand Unified Theory of ParserOptions

Principles

Currently, a full parse of a page depends on ParserOptions and on an array of options passed to ParserOutput::getText() that decide which post-cache transforms are applied. Ideally, this array of options would be folded into ParserOptions so that they can be a part of the ParserCache key in conjunction with the flavor setting. This way, these options would have the opportunity to be part of the cache key if they are used in a given flavor. (Note that this requires thinking about the independence of the usage of said keys to ensure both correctness and lack of combinatorial explosion of the cache keys.)

This opens the opportunity to handle cache misses in a way that "downgrades" the key in order to get a variation of the output that is close, but not exactly the one we want, and to update it with a low-complexity operation (compared to a full reparse). The most straightforward example of such a downgrade would be a mutation of the ParserOptions to go from a (cache-miss) "read-view" flavor to a (hopefully cached) "raw" flavor version, to which the output pipeline would be applied to reach the "read-view" state. In the case where there is a string ParserOptions::$flavor property, downgrading could be simply ParserOptions::setFlavor($weakerFlavor), but this also works if flavors combine a bundle of different ParserOptions settings.

The current idea is to provide a "downgradeKey" (bikeshed "weakensKey") mechanism, returning a new ParserOptions that would correspond to a "weaker" version of the parse that might be present in the cache. We could imagine extending this type of mechanism to handle language variants, dark modes, etc., depending on which versions of the parse we want to cache. This policy would be kept separate from ParserOptions, which (despite a lot of legacy cruft) is "mostly" a pure value object.
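A minimal Python sketch of such a downgrade path follows, assuming a single string flavor property. Everything here is hypothetical: the `Options` class, the `WEAKER` chain, the `UPGRADE` transforms, and the decision to cache every intermediate flavor are illustrative choices, not a proposed API.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Options:
    """Hypothetical stand-in for a ParserOptions value object."""
    flavor: str
    userlang: str = "en"


# Hypothetical policy, kept outside Options: which flavor is "weaker".
WEAKER = {"mobile": "read", "read": "raw"}

# Hypothetical cheap transforms deriving a flavor from the next weaker one.
UPGRADE = {
    "read": lambda html: html + "<!-- edit links -->",
    "mobile": lambda html: html + "<!-- mobile -->",
}


def downgrade_key(options):
    """Return options for the next weaker flavor, or None if there is none."""
    weaker = WEAKER.get(options.flavor)
    return None if weaker is None else replace(options, flavor=weaker)


def get_output(cache, options, parse):
    """Serve from cache, else derive from a weaker flavor, else parse."""
    if options in cache:
        return cache[options]
    weaker = downgrade_key(options)
    if weaker is None:
        html = parse(options)  # weakest flavor: from-scratch parse
    else:
        html = UPGRADE[options.flavor](get_output(cache, weaker, parse))
    cache[options] = html
    return html


calls = []


def fake_parse(options):
    calls.append(options.flavor)
    return "<p>hi</p>"


cache = {}
raw_html = get_output(cache, Options("raw"), fake_parse)
read_html = get_output(cache, Options("read"), fake_parse)
mobile_html = get_output(cache, Options("mobile"), fake_parse)
```

Only the first request triggers `fake_parse`; the "read" and "mobile" outputs are derived from cached content. Whether intermediate flavors should themselves be cached is a policy question this sketch does not try to answer.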

Considerations about combinatorics of cache keys

There are considerations about having more complex key downgrade paths than just "drop the flavor to raw" so that we could further tweak the space/time compromise of the cache. This needs careful thought, however, because multiple orthogonal dimensions may be involved in the cache key combination. For instance, one could imagine a scenario in which a key that refers to a given flavor F1 and a language variant L-l1 could be downgraded to

  • flavor F1, language L-l2
  • flavor F1, no language definition
  • flavor F2 (subset of options of F1), language L-l1

and F2 could in turn be downgraded according to these options. Adding a few options on top of these would make the search space grow very quickly, and trying to access all these cache keys might end up being more expensive than re-parsing the page from scratch.
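To get a feel for the growth, the following Python sketch enumerates the candidate keys when each dimension has its own chain of fallback values. It assumes the dimensions are independent and that every combination is a valid key, so the count is an upper bound; the function name and the chains are illustrative only.

```python
from itertools import product


def candidate_keys(chains):
    """All combinations of one value per dimension (upper bound on lookups)."""
    return list(product(*chains))


# The example from the text: two flavors, three language settings.
flavor_chain = ["F1", "F2"]
language_chain = ["L-l1", "L-l2", None]
keys = candidate_keys([flavor_chain, language_chain])

# Four independent dimensions with three fallback values each.
many_keys = candidate_keys([range(3)] * 4)
```

Two dimensions already give 6 candidate keys here, and four dimensions of three values each give 81: the number of keys is the product of the chain lengths.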

The logic of what sorts of "weaker" keys to try is a combination of cache policy (what sorts of things are we caching?) as well as the OutputTransform pipeline (what sorts of things can be made from what other sorts of things). Ideally this logic/policy can be isolated to a separate class, and not leak directly into ParserOptions, ParserOutputAccess, or the OutputTransform pipeline.

Notes

  1. ↑ For mostly-historical reasons Parsoid maps the ParserOptions to a combination of PageConfig, DataAccess, and an $options array passed to a method in the top-level Parsoid class, and Parsoid's result is collected using a PageBundle and a ContentMetadataCollector object. These are mapped to ParserOptions/ParserOutput in core, however, so these internal details can/will be ignored for the remainder of this document.
  2. ↑ Date format appears to be deferred because it is costly (requires a database query) to fetch the User object containing the preferred date format; it might also be attempting to avoid adding a dependency on the User unless absolutely necessary. The speculative revision and page IDs appear to be deferred because the "guesses" about the likely revision and page IDs which will be assigned will be better if the guesses are made as close in time to the actual database store as possible.
  3. ↑ The executable properties can certainly affect the output, for example by redirecting template references or using not-the-latest-revisions for transclusions. (The Translate extension makes {{Foo}} look first at (eg) {{Foo/fr}} for example.) The expectation seems to be that anyone who sets these executable properties will either not cache the output, or will use the ParserOutput::$mExtraKey mechanism to add an appropriate key suffix corresponding to the effect of their executable code.
  4. ↑ See ParserCache::getMetadata(). This is an in-memory cache, with a fallback to storage shared with the main ParserCache but with a different key format.
  5. ↑ This is mitigated by the fact mentioned earlier that "canonical values" of a certain option are not included in the key. The problematic case is where the first parse of a page has "foo" set to a non-canonical value but "foo" is not included in the used options set. The key for such a parse will not include "foo", because foo is "unused". Later, a different parse adds "foo" to the used options list. Finally, a subsequent request for "foo" set to the canonical value will also generate a key omitting foo (this time because foo is "canonical" even though it is "used"), which will hit in the cache on that earlier parse where foo was set to a non-canonical value. This shouldn't matter because by excluding "foo" from the used options the earlier parse was asserting that "foo" should have no effect on the result, whatever its value. But Bugs Happen and there is the possibility that now the result fetched from the cache will have "foo" visibly at odds with what was requested.
  6. ↑ "Used options" tracking is done via a dynamic callback attached to the ParserOptions. So in theory any access to ParserOptions from within the OutputTransform code could cause the set of used options to change.
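The key collision described in note 5 can be reproduced with a toy Python model in which an option enters the key only if it is both used and non-canonical. The option name "foo" comes from the note; the function `cache_key`, the values "a" and "b", and the key format are assumptions made for illustration.

```python
CANONICAL = {"foo": "a"}  # hypothetical canonical value of "foo"


def cache_key(options, used):
    """Key from options that are both used and non-canonical (toy model)."""
    return "!".join(
        f"{name}={options[name]}"
        for name in sorted(used)
        if options[name] != CANONICAL[name]
    )


# First parse: "foo" is non-canonical but not reported as used.
key_first = cache_key({"foo": "b"}, used=set())
# Later request: "foo" is now in the used set but has its canonical value.
key_later = cache_key({"foo": "a"}, used={"foo"})
# For comparison: "foo" used and non-canonical gets its own key.
key_distinct = cache_key({"foo": "b"}, used={"foo"})
```

Here `key_first` and `key_later` are the same (empty) key, so the later request would be served the entry produced with the non-canonical value.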