User:Daniel Kinzler (WMDE)/DependencyEngine

MediaWiki tracks dependencies between page content and rendered HTML (ParserOutput) in "links tables", like pagelinks, templatelinks, or imagelinks. It tracks dependencies between parser output and fully rendered HTML pages (with all the chrome) implicitly, via mappings of Titles to sets of URLs. It implements purging and re-rendering with various jobs, like RefreshLinksJob and HTMLCacheUpdateJob, which carry with them lists of page IDs to process. When we throw Parsoid output, Wikidata usage, centralized Lua modules, and per-user-language rendering into the mix, things get rather complicated, and they don't scale well, as the recent problems with job queue explosions show.

I propose to build a service that does just one thing: track dependencies between resources, and make sure resources get updated when needed. This page only outlines the interface and behavior; the underlying technology is to be decided. A baseline implementation could easily be based on SQL, but that would probably not scale to the needs of the Wikimedia cluster.

Service methods:
 * update( $resourceURL, $artifactHash, $timestamp, $dependencies[] ): this is called when $resourceURL is updated. The resource URL and its dependencies are recorded, along with the timestamp and the hash (the hash is included just for good measure), replacing any older information about $resourceURL. This builds a directed dependency graph, with a URL, timestamp, and hash on each node. The graph can have several billion nodes, but the median path length would be low, probably lower than 10. The median in- and out-degree of each node would be similarly low, though there would be outliers with a degree of several million.
 * popDirty( $n ): returns up to $n URLs of resources that are currently dirty. The receiver is then responsible for updating these resources; some transaction logic will be needed to make sure no updates are lost. A resource is dirty when it is older than one of its parents. It should be safe for several processes to call popDirty in parallel, and popDirty should return dirty URLs with older timestamps before those with newer timestamps.
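The two methods above can be sketched as an in-memory prototype. This is an illustrative assumption of how the service might behave, not a proposed implementation; the class and attribute names (nodes, parents, children) are invented for this sketch, and a real service would use persistent, indexed storage rather than scans.

```python
class DependencyEngine:
    """In-memory sketch of the proposed dependency-tracking service."""

    def __init__(self):
        self.nodes = {}      # url -> (artifact_hash, timestamp)
        self.parents = {}    # url -> set of dependency URLs
        self.children = {}   # url -> set of dependent URLs (reverse edges)

    def update(self, url, artifact_hash, timestamp, dependencies):
        # Replace any older information about this resource:
        # first drop the reverse edges of the old dependency set.
        for dep in self.parents.get(url, set()):
            self.children.get(dep, set()).discard(url)
        self.nodes[url] = (artifact_hash, timestamp)
        self.parents[url] = set(dependencies)
        for dep in dependencies:
            self.children.setdefault(dep, set()).add(url)

    def _is_dirty(self, url):
        # A resource is dirty when it is older than one of its parents.
        _, ts = self.nodes[url]
        return any(
            dep in self.nodes and self.nodes[dep][1] > ts
            for dep in self.parents.get(url, ())
        )

    def pop_dirty(self, n):
        # Oldest first, as required; a real implementation would keep
        # a dirty-set index instead of scanning every node.
        dirty = sorted(
            (ts, url) for url, (_, ts) in self.nodes.items()
            if self._is_dirty(url)
        )
        return [url for _, url in dirty[:n]]
```

Note that this sketch does not lease or lock the returned URLs; the transaction logic mentioned above (so that concurrent callers of popDirty neither lose nor double-process updates) is deliberately left out.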

One problem with this design is that it is prone to cycles, which would lead to indefinite churn. Cycles could be detected and broken periodically (perhaps using a map/reduce approach), or they could be detected on insert (which should be feasible given the small path length).
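Detection on insert could look something like the following: a bounded depth-first walk along dependency edges from the proposed new dependencies, checking whether the walk ever reaches the node being updated. The function name, the `parents` mapping, and the depth cap are assumptions for this sketch; the depth cap leans on the small median path length noted above.

```python
def would_create_cycle(parents, url, new_deps, max_depth=50):
    """Return True if pointing `url` at `new_deps` would close a cycle.

    `parents` maps each URL to the set of URLs it depends on.
    Illustrative sketch only; a real check would run inside the
    service's own storage layer.
    """
    stack = [(dep, 0) for dep in new_deps]
    seen = set()
    while stack:
        node, depth = stack.pop()
        if node == url:
            return True  # reached the updating node again: cycle
        if node in seen or depth >= max_depth:
            continue
        seen.add(node)
        stack.extend((p, depth + 1) for p in parents.get(node, ()))
    return False
```

For example, with B depending on A and C depending on B, making A depend on C would close the cycle A → C → B → A and be rejected, while a fresh node depending on C would be fine.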

The entire system doesn't need full ACID guarantees, just eventual consistency. Losing information for a few nodes would not be too bad, since they can easily be re-generated. Losing the entire graph would, however, be problematic, since all content would need to be re-parsed to rebuild it.

If we had this kind of service, and could easily scale it horizontally, it would give us the freedom to re-use resources freely, without the need to worry about tracking and purging. It would Just Work (tm).

Scale: We will probably want two implementations of this: one based directly on SQL, which should scale up to a few million resources, and another implementation suitable for large wiki clusters like the one run by Wikimedia.
 * The system should be built to handle dependencies between ten billion resources.
 * Resources would typically have around ten dependencies, but some may have thousands.
 * Most resources are leaves, but some are used by millions of other resources.
 * Average path length in the (directed, acyclic, disconnected) graph is probably around ten.
 * The graph needs to be able to handle thousands (maybe tens of thousands) of updates per second.
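A back-of-envelope calculation from the figures above gives a feel for the storage involved. The node and edge counts come from the requirements; the bytes-per-edge figure is an assumption for illustration, not part of the proposal.

```python
# Rough scale estimate from the requirements listed above.
NODES = 10_000_000_000       # ten billion resources
AVG_DEPS = 10                # typical number of dependencies per resource
EDGES = NODES * AVG_DEPS     # on the order of 1e11 edges

BYTES_PER_EDGE = 16          # assumed: two 8-byte node identifiers
edge_storage_tb = EDGES * BYTES_PER_EDGE / 1e12

print(EDGES)                 # 100000000000
print(edge_storage_tb)       # 1.6 (terabytes, before indexes and replication)
```

Even under these optimistic assumptions the edge table alone runs to terabytes, which is why a single SQL instance is only expected to cover the few-million-resource case.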

Related proposals:
 * Dependency graph storage
 * Requirements for change propagation
 * Publish all resource changes to a single topic
 * Use varnish xkey to purge output of Special:EntityData when appropriate