Parsoid/Minimal performance strategy for July release

See also: Parsoid/Ops needs for July 2013

Peak edit rates are around 50 requests per second across all Wikipedias. Re-parses triggered by template edits can reach single-digit millions per day, which works out to another 50 requests per second averaged over a day. In July we need to sustain rates close to these, as the VisualEditor is scheduled to become the default editor on all Wikipedias. Parsoid itself can be scaled up with more machines. We do however use the MediaWiki API to expand templates and extensions and to retrieve information about images. This can mean hundreds of API requests for large pages, which would overload the API cluster.

We have a long-term performance strategy, outlined in our roadmap, that will also address the API overload problem. We might, however, not be able to implement enough of it before July, so a minimal backup strategy to avoid overloading the API cluster is needed.

Leverage cached parse results to avoid API overload on edit

We have a Varnish cache in front of the Parsoid cluster which caches the parse result for a given revision (see the Parsoid page on wikitech). We can use this cached parse result to speed up subsequent parses. The main pieces of information we need to avoid API requests (template and extension expansions, image dimensions and paths) are all available in the previous version's HTML and are marked up in a way that makes them relatively easy to extract and reuse.
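
To illustrate the kind of extraction this enables, here is a minimal sketch in PHP using DOMDocument/DOMXPath. The typeof values (mw:Transclusion, mw:Image), the about grouping and the data-mw attribute are assumptions about Parsoid's output markup and would need to be checked against the actual HTML; in practice this extraction would happen inside Parsoid itself.

// Sketch only: pull reusable expansion data out of a cached Parsoid DOM.
// The attribute and typeof names below are assumptions, not a spec.
function extractExpansions( $html ) {
    $doc = new DOMDocument();
    @$doc->loadHTML( $html ); // suppress warnings about HTML5 markup
    $xpath = new DOMXPath( $doc );

    $expansions = array( 'transclusions' => array(), 'images' => array() );

    // Template / extension expansions, assumed to be marked with
    // typeof="mw:Transclusion" and to carry their arguments in data-mw.
    foreach ( $xpath->query( '//*[contains(@typeof, "mw:Transclusion")]' ) as $node ) {
        $key = $node->getAttribute( 'about' ); // groups multi-node output
        $expansions['transclusions'][$key] = array(
            'data-mw' => $node->getAttribute( 'data-mw' ),
            'html' => $doc->saveHTML( $node ),
        );
    }

    // Image information (dimensions, thumb paths), assumed typeof="mw:Image".
    foreach ( $xpath->query( '//*[contains(@typeof, "mw:Image")]' ) as $node ) {
        $expansions['images'][] = $doc->saveHTML( $node );
    }

    return $expansions;
}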

On edit

  • Retrieve the previous version's HTML DOM from the cache (using the oldid GET parameter)
  • Extract template, extension and image data from it and pre-populate internal caches with it
  • Parse the new page, which will trigger API requests only for changed template transclusions, extensions and images
  • Purge the old version from the cache (a rough sketch of the fetch and purge steps follows below)
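
A minimal sketch of the fetch and purge steps, using plain curl. The cache URL layout ($cacheBase/$prefix/$title?oldid=N) and the assumption that the Varnish frontend accepts HTTP PURGE requests are hypothetical and depend on the actual routing and VCL configuration.

// Sketch only: fetch the previous revision's HTML from the Parsoid Varnish
// cache, and purge it again once the new parse has succeeded.
function fetchCachedParse( $cacheBase, $prefix, $title, $oldid ) {
    $url = "$cacheBase/$prefix/" . rawurlencode( $title ) . "?oldid=$oldid";
    $ch = curl_init( $url );
    curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
    $html = curl_exec( $ch );
    curl_close( $ch );
    return $html === false ? null : $html;
}

function purgeCachedParse( $cacheBase, $prefix, $title, $oldid ) {
    $url = "$cacheBase/$prefix/" . rawurlencode( $title ) . "?oldid=$oldid";
    $ch = curl_init( $url );
    // HTTP PURGE only works if the Varnish VCL is configured to allow it.
    curl_setopt( $ch, CURLOPT_CUSTOMREQUEST, 'PURGE' );
    curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
    curl_exec( $ch );
    curl_close( $ch );
}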

On HTMLCacheUpdate job after template / image edit

Templates and images in particular can be modified, so we'll have to make sure our cached expansions are not getting too stale. A simple and promising option is to piggyback onto the HTMLCacheUpdate job with a hook. The hook action can then either purge + re-request or implicitly refresh the Varnish copy.

  • Request the new version with the 'Cache-Control: no-cache' header set. With the proper configuration, this will cause Varnish to go to the origin. The Cache-Control header will be forwarded (TODO: verify!), which Parsoid can use as an indication to fully expand all templates from scratch. Varnish will update its cached copy implicitly when configured to do so. (A sketch of such a refresh request follows below.)

For all of this to work, the current cache-busting page_touched parameter needs to be removed from the GET URL.
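
A sketch of the refresh request, again with plain curl. The URL layout and parameters are placeholders, and whether Varnish actually forwards the header and replaces its cached object is exactly the VCL question flagged above.

// Sketch only: ask Varnish to refetch a page from Parsoid by re-requesting it
// with Cache-Control: no-cache, implicitly refreshing the cached copy.
function refreshParsoidCache( $cacheBase, $prefix, $title, $latestRevId ) {
    $url = "$cacheBase/$prefix/" . rawurlencode( $title ) . "?oldid=$latestRevId";
    $ch = curl_init( $url );
    curl_setopt( $ch, CURLOPT_HTTPHEADER, array( 'Cache-Control: no-cache' ) );
    curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
    $ok = curl_exec( $ch ) !== false;
    curl_close( $ch );
    return $ok;
}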

Cache invalidation hooks

We'll need a chunked and deferred job similar to HTMLCacheUpdate (TODO: can we subclass and override invalidateTables?). Given a title and a link table name, it should find all titles to purge and purge them. The revision ID (for the oldid GET parameter) can be accessed via Title::getLatestRevID() and WikiPage::getLatest().
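
A rough sketch of such a job. It assumes backlinks can be enumerated via Title::getBacklinkCache()->getLinks(); the class name, the $wgParsoidCacheBase setting and the purge URL layout are hypothetical, and a real implementation would partition the backlink range into chunks like HTMLCacheUpdate does rather than iterating everything in one job run.

// Hypothetical deferred job: given a title and a link table (templatelinks,
// imagelinks), purge the Parsoid cache entry of every page that uses it.
class ParsoidCacheUpdateJob extends Job {
    public function __construct( $title, $params ) {
        parent::__construct( 'parsoidCacheUpdate', $title, $params );
    }

    public function run() {
        $table = $this->params['table']; // e.g. 'templatelinks'
        // All pages transcluding / embedding $this->title. A real job would
        // process this in chunks, as HTMLCacheUpdate does.
        $titles = $this->title->getBacklinkCache()->getLinks( $table );
        foreach ( $titles as $title ) {
            $this->purgeOne( $title );
        }
        return true;
    }

    protected function purgeOne( Title $title ) {
        global $wgParsoidCacheBase, $wgDBname; // $wgParsoidCacheBase is made up
        $url = "$wgParsoidCacheBase/$wgDBname/" .
            rawurlencode( $title->getPrefixedDBkey() ) .
            '?oldid=' . $title->getLatestRevID();
        $ch = curl_init( $url );
        curl_setopt( $ch, CURLOPT_CUSTOMREQUEST, 'PURGE' ); // needs matching VCL
        curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
        curl_exec( $ch );
        curl_close( $ch );
    }
}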

ArticleEditUpdates

Main edit hook. Used to schedule updates to the page itself and templatelinks updates.

public static function onArticleEditUpdates( &$article, &$editInfo, $changed ) { ... }
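
A hedged sketch of what this handler might do: refresh the Parsoid cache entry of the edited page itself, leaving backlink purges to the HTMLCacheUpdate piggyback described above. Written as a plain function for brevity; $wgParsoidCacheBase and the URL layout are hypothetical.

// Sketch only; error handling and deferral are omitted.
function parsoidOnArticleEditUpdates( &$article, &$editInfo, $changed ) {
    global $wgParsoidCacheBase, $wgDBname; // $wgParsoidCacheBase is made up
    $title = $article->getTitle();
    $url = "$wgParsoidCacheBase/$wgDBname/" .
        rawurlencode( $title->getPrefixedDBkey() ) .
        '?oldid=' . $article->getLatest();
    $ch = curl_init( $url );
    // Ask Varnish to refetch from Parsoid and implicitly refresh its copy.
    curl_setopt( $ch, CURLOPT_HTTPHEADER, array( 'Cache-Control: no-cache' ) );
    curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
    curl_exec( $ch );
    curl_close( $ch );
    return true;
}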

ArticleDeleteComplete

Similar to the above. TODO: Are the template links etc. already gone from the link tables at this point?

public static function onArticleDeleteComplete( &$article, User &$user, $reason, $id ) { ... }

ArticleUndelete

Called from SpecialUndelete. Links are updated before this hook is run.

public static function onArticleUndelete( $title, $create ) { ... }

ArticleRevisionVisibilitySet

public static function onArticleRevisionVisibilitySet( &$title ) { ... }

TitleMoveComplete

Basically purge both old and new titles.

public static function onTitleMoveComplete( Title &$title, Title &$newtitle, User &$user, $oldid, $newid ) { ... }
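
A hedged sketch of that purge, again with a hypothetical $wgParsoidCacheBase and URL layout. Picking the right oldid for the old title is not obvious (it may now be a redirect or gone), so this simply purges whatever revision is currently the latest under each title.

// Sketch only: purge the Parsoid cache entries for both the old and new title.
function parsoidOnTitleMoveComplete( Title &$title, Title &$newtitle, User &$user, $oldid, $newid ) {
    global $wgParsoidCacheBase, $wgDBname; // $wgParsoidCacheBase is made up
    foreach ( array( $title, $newtitle ) as $t ) {
        $url = "$wgParsoidCacheBase/$wgDBname/" .
            rawurlencode( $t->getPrefixedDBkey() ) .
            '?oldid=' . $t->getLatestRevID();
        $ch = curl_init( $url );
        curl_setopt( $ch, CURLOPT_CUSTOMREQUEST, 'PURGE' ); // needs matching VCL
        curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
        curl_exec( $ch );
        curl_close( $ch );
    }
    return true;
}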

FileUpload

Purge pages in imagelinks. Won't work for Commons IIRC.

public static function onFileUpload( $file ) { ... }
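
A sketch of how this could reuse the hypothetical ParsoidCacheUpdateJob class from the cache invalidation section above, queueing an imagelinks purge for the uploaded file's title. As noted, this only covers local imagelinks, not usages of a Commons file on other wikis.

// Sketch only: queue a deferred imagelinks purge for the uploaded file.
function parsoidOnFileUpload( $file ) {
    // $file is a LocalFile; its Title is what imagelinks rows point at.
    $job = new ParsoidCacheUpdateJob(
        $file->getTitle(),
        array( 'table' => 'imagelinks' )
    );
    JobQueueGroup::singleton()->push( $job );
    return true;
}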