Parsoid/Minimal performance strategy for July release

See also: Parsoid/Ops needs for July 2013

Peak edit rates are around 50 requests per second across all Wikipedias. Re-parses triggered by template edits can reach single-digit millions per day, which averages out to roughly another 50 requests per second over a day. In July, we need to sustain rates close to that, as the Visual Editor is scheduled to become the default editor on all Wikipedias. Parsoid itself can be scaled up with more machines. However, we use the MediaWiki API to expand templates and extensions and to retrieve information about images. This can mean hundreds of API requests on large pages, which would overload the API cluster.
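As a quick check of the figures above, the sketch below converts between a daily request count and an average per-second rate. The function names are illustrative only, and 4.32 million per day is simply the daily count that corresponds to exactly 50 requests per second, not a measured value.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def average_rate(requests_per_day: float) -> float:
    """Average requests per second for a given daily request count."""
    return requests_per_day / SECONDS_PER_DAY


def requests_per_day(rate_per_second: float) -> float:
    """Daily request count implied by a sustained per-second rate."""
    return rate_per_second * SECONDS_PER_DAY


# A sustained 50 requests/second corresponds to 4.32 million requests/day,
# which is within the "single-digit millions per day" range.
print(requests_per_day(50))      # 4320000
print(average_rate(4_320_000))   # 50.0
```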

We have a long-term performance strategy, as outlined in our roadmap, that will also address the API overload problem. However, we might not be able to implement enough of it before July, so we need a minimal backup strategy to avoid overloading the API cluster.

Leverage cached parse results to avoid API overload on edit
We have a Varnish cache in front of the Parsoid cluster, which caches the parse result for a given revision (see the Parsoid page on wikitech). We can use this cached parse result to speed up subsequent parses. The main things we need in order to avoid API requests (template / extension expansions and image dimensions / paths) are available in the previous version and are marked up in a way that makes them relatively easy to extract and reuse.

On edit

 * Retrieve previous version's HTML DOM from cache (using oldid in the GET parameter)
 * Extract template, extension and image data from it and pre-populate internal caches with it
 * Parse new page, which will trigger API requests only on changed template transclusions / extensions / images.
 * Purge old version from cache
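The on-edit steps above could be sketched roughly as follows. This is a minimal illustration, not Parsoid's implementation: the function name parse_with_reuse, the use of a transclusion's wikitext source as the cache key, and the expand_via_api callback standing in for a MediaWiki API request are all assumptions made for the example. Extracting the expansions from the cached DOM is assumed to have happened already.

```python
from typing import Callable, Dict, Iterable

# Hypothetical representation: wikitext source of a transclusion -> expanded HTML.
Expansions = Dict[str, str]


def parse_with_reuse(
    transclusions: Iterable[str],
    previous: Expansions,
    expand_via_api: Callable[[str], str],
) -> Expansions:
    """Expand the new revision's transclusions, reusing old results where possible."""
    cache = dict(previous)  # pre-populate the internal cache from the old version
    result: Expansions = {}
    for source in transclusions:
        if source not in cache:
            # Only new or changed transclusions trigger an API request.
            cache[source] = expand_via_api(source)
        result[source] = cache[source]
    return result


# Usage with a fake API that records which sources it was asked to expand.
api_calls = []


def fake_api(source: str) -> str:
    api_calls.append(source)
    return "<span>" + source + "</span>"


old_expansions = {"{{A}}": "<b>A</b>", "{{B}}": "<b>B</b>"}
new_expansions = parse_with_reuse(["{{A}}", "{{C}}"], old_expansions, fake_api)
print(api_calls)  # ['{{C}}']
```

In this toy run only {{C}} reaches the fake API, since {{A}} is served from the previous version's data and {{B}} no longer appears on the page.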

On HTMLCacheUpdate job after template / image edit
Templates and images in particular can be modified, so we have to make sure our cached expansions do not become too stale. A simple and promising option is to piggyback onto the HTMLCacheUpdate job with a hook. The hook action can then either purge and re-request the page, or implicitly refresh the Varnish copy.


 * Request the new version with a 'Cache-Control: no-cache' header set. With the proper configuration, this will cause Varnish to go to origin. The Cache-Control header will be forwarded (TODO: verify!), and Parsoid can use it as an indication to fully expand all templates from scratch. Varnish will update its cached copy implicitly when configured to do so.
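Both ends of this refresh request could look roughly like the sketch below. It only builds the request and does not send it. The URL layout, the host parsoid.example, and the function names are assumptions for illustration. Whether Varnish actually forwards the header is still the open TODO noted above.

```python
from typing import Mapping
from urllib.parse import quote
from urllib.request import Request


def build_refresh_request(base_url: str, title: str, oldid: int) -> Request:
    """Build (but do not send) a GET request asking for a fresh parse."""
    url = f"{base_url}/{quote(title)}?oldid={oldid}"  # hypothetical URL layout
    # Varnish only honours this client header if its VCL is configured to.
    return Request(url, headers={"Cache-Control": "no-cache"})


def wants_full_expansion(headers: Mapping[str, str]) -> bool:
    """Origin side: does the incoming request carry a no-cache directive?"""
    for name, value in headers.items():
        if name.lower() == "cache-control":  # header names are case-insensitive
            directives = [d.strip().lower() for d in value.split(",")]
            return "no-cache" in directives
    return False


req = build_refresh_request("http://parsoid.example/enwiki", "Main Page", 123)
print(req.full_url)  # http://parsoid.example/enwiki/Main%20Page?oldid=123
print(wants_full_expansion(dict(req.header_items())))  # True
```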

For all of this to work, the current cache-busting page_touched parameter needs to be removed from the GET URL, so that a given revision always maps to the same cache entry.
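To illustrate why the parameter matters, the sketch below strips page_touched from a request URL. Two URLs that differ only in page_touched then collapse to one, which is what lets a refresh replace the existing cached copy instead of creating a new one. The example URLs and the helper name are assumptions, and in practice the client would simply stop sending the parameter.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_param(url: str, name: str = "page_touched") -> str:
    """Return the URL with every occurrence of the given query parameter removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != name
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


# Hypothetical URLs for the same revision, requested after two different touches.
first = "http://parsoid.example/enwiki/Foo?oldid=123&page_touched=20130601000000"
second = "http://parsoid.example/enwiki/Foo?oldid=123&page_touched=20130602000000"

print(strip_param(first))                          # http://parsoid.example/enwiki/Foo?oldid=123
print(strip_param(first) == strip_param(second))   # True
```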

Relevant links:
 * Implicit refresh without purge: https://www.varnish-cache.org/trac/wiki/VCLExampleEnableForceRefresh
 * Parsoid page on wikitech

Possibly relevant for other invalidation approaches:
 * |info&rvprop=content&titles=Foo returns both touched and timestamp in the page source response. After an edit, touched can theoretically be at most a few seconds behind the revision timestamp, but this is close enough to distinguish template updates from edit updates.
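A comparison along these lines might be sketched as follows. The 10-second slack, the function names, and the assumption that both values arrive as ISO 8601 UTC strings such as 2013-06-01T12:00:00Z are choices made for this example only.

```python
from datetime import datetime, timedelta, timezone

API_TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"  # assumed timestamp format


def parse_api_time(value: str) -> datetime:
    """Parse an ISO 8601 UTC timestamp string into an aware datetime."""
    return datetime.strptime(value, API_TIME_FORMAT).replace(tzinfo=timezone.utc)


def is_template_update(touched: str, rev_timestamp: str, slack_seconds: int = 10) -> bool:
    """Guess whether the page was touched by something other than its last edit.

    If touched and the revision timestamp are within the slack window, the
    touch is attributed to the edit itself; otherwise to a later template or
    image update. The slack value is an arbitrary assumption.
    """
    delta = abs(parse_api_time(touched) - parse_api_time(rev_timestamp))
    return delta > timedelta(seconds=slack_seconds)


print(is_template_update("2013-06-01T12:00:03Z", "2013-06-01T12:00:00Z"))  # False
print(is_template_update("2013-06-02T08:00:00Z", "2013-06-01T12:00:00Z"))  # True
```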