User:Daniel Kinzler (WMDE)/DependencyEngine

MediaWiki tracks dependencies between page content and rendered HTML (ParserOutput) in "links tables", like pagelinks, templatelinks or imagelinks. It tracks dependencies between parser output and fully rendered HTML pages with all the chrome implicitly, via mappings of Titles to sets of URLs. It implements purging and re-rendering with various jobs, like RefreshLinksJob and HTMLCacheUpdateJob, which carry with them lists of page IDs to process. When we throw Parsoid output, Wikidata usage, centralized Lua modules and per-user-language rendering into the mix, things get rather complicated, and they do not scale well, as the recent problems with JobQueue explosions show.

I propose to build a service that does just one thing: track dependencies between resources, and make sure resources get updated when needed. This page only outlines the interface and behavior; the underlying technology is to be decided. A baseline implementation could easily be based on SQL, but that would probably not scale to the needs of the Wikimedia cluster.

Service methods:
 * update( $resourceURL, $artifactHash, $timestamp, $dependencies[] ): called when $resourceURL is updated. The resource URL and its dependencies are recorded, along with the timestamp and the hash (the hash is here just for good measure), replacing any older information about $resourceURL. This builds a directed dependency graph, with URL, timestamp and hash on each node. The graph can have several billion nodes, but the median path length would be low, probably lower than 10. The median in- and out-degree of each node would be similarly low, though there would be outliers with a degree of several million.
 * popDirty( $n ): returns up to $n URLs of resources that are currently dirty. The receiver is then responsible for updating these resources. Some transaction logic will be needed to make sure no updates are lost. A resource is dirty when it is older than any of its parents (the resources it depends on). It should be safe for several processes to call popDirty in parallel. popDirty should return dirty URLs with an older timestamp before those with a newer timestamp.
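
The two methods above can be sketched as a small in-memory model. This is only an illustration of the intended behavior, not a proposal for the storage layer: the class name `DependencyEngine`, the `claimed` set standing in for the transaction logic, and the full scan in `pop_dirty` are all assumptions made for the sake of the example, and a real service would need persistent, horizontally scalable storage and an index of dirty nodes.

```python
import heapq
from dataclasses import dataclass, field


@dataclass
class Node:
    artifact_hash: str
    timestamp: float
    dependencies: set = field(default_factory=set)


class DependencyEngine:
    """Hypothetical in-memory model of update() and popDirty()."""

    def __init__(self):
        self.nodes = {}       # resource URL -> Node
        self.claimed = set()  # URLs handed out by pop_dirty, not yet updated

    def update(self, resource_url, artifact_hash, timestamp, dependencies):
        # Replace any older information about this resource.
        self.nodes[resource_url] = Node(artifact_hash, timestamp, set(dependencies))
        # The resource has been regenerated, so release the claim on it.
        self.claimed.discard(resource_url)

    def is_dirty(self, url):
        # Dirty: older than at least one resource it depends on.
        node = self.nodes[url]
        return any(
            dep in self.nodes and self.nodes[dep].timestamp > node.timestamp
            for dep in node.dependencies
        )

    def pop_dirty(self, n):
        # Naive full scan; assumption: a real service would index dirty nodes.
        candidates = [
            (node.timestamp, url)
            for url, node in self.nodes.items()
            if url not in self.claimed and self.is_dirty(url)
        ]
        # Oldest timestamps first, as required of popDirty.
        result = [url for _, url in heapq.nsmallest(n, candidates)]
        # Mark as claimed so parallel callers do not receive the same URL.
        self.claimed.update(result)
        return result
```

In this sketch a claim is only released by a later `update` call, so a worker that dies after `pop_dirty` would leave its URLs stuck. Something like a claim timeout would be needed to cover that case; the sketch leaves it out.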

One problem with this design is that it is prone to cycles, which would lead to indefinite churn. Cycles could be periodically detected and broken (perhaps using a map/reduce approach), or they could be detected on insert (which should be OK due to the small path length).
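
Detection on insert could look roughly like the following sketch. It assumes the graph is available as a mapping from each URL to the set of URLs it depends on; the function name `would_create_cycle` and that data layout are made up for illustration. The check walks upward from the proposed dependencies and reports a cycle if it reaches the resource being updated.

```python
def would_create_cycle(deps_of, resource_url, new_dependencies):
    """Return True if making resource_url depend on new_dependencies closes a cycle.

    deps_of: dict mapping a URL to the set of URLs it depends on
    (hypothetical layout, used only for this example).
    """
    stack = list(new_dependencies)
    seen = set()
    while stack:
        url = stack.pop()
        if url == resource_url:
            # The resource is reachable from its own dependencies.
            return True
        if url in seen:
            continue
        seen.add(url)
        # Follow the dependencies of this node further up the graph.
        stack.extend(deps_of.get(url, ()))
    return False


# Example: page depends on tpl, tpl depends on module.
deps_of = {"page": {"tpl"}, "tpl": {"module"}, "module": set()}
print(would_create_cycle(deps_of, "module", ["page"]))
print(would_create_cycle(deps_of, "module", ["other"]))
```

The walk only follows dependency edges upward, so its cost depends on how many ancestors the new dependencies have rather than on how many resources depend on them.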

The entire system doesn't need full ACID guarantees, just eventual consistency. Losing information for a few nodes would not be too bad, since they can easily be re-generated. Losing the entire graph would however be problematic, since all content would need to be re-parsed to rebuild it.

If we had this kind of service, and were able to easily scale it horizontally, it would give us the freedom to re-use resources freely, without the need to worry about tracking and purging. It would Just Work (tm).