User:Daniel Kinzler (WMDE)/DependencyEngine
MediaWiki tracks dependencies between page content and rendered HTML (ParserOutput) in "links tables", like pagelinks or templateslinks or imagelinks. It tracks dependencies between parser output and fully rendered HTML pages with all the chrome implicitly, via mappings of Titles to sets of URLs. It implements purging and re-rendering with various jobs, like RefreshLinksJob and HTMLCacheUpdateJob, which carry with them lists of page IDs to process. When we throw parsoid output, wikidata usage, centralized Lua modules and per-user-language rendering into the mix, things get rather complicated. And don't scale well, as the recent problems with JobQueue explosions show.
I propose to build a service that does just one thing: track dependencies between resources, and make sure resources get updated when needed. This page will only outline the interface and behavior, the underlying technology is to be decided. A baseline implementation could easily be based on SQL, but that would probably not scale to the needs or the Wikimedia cluster.
- update( $resourceURL, $artifactHash, $timestamp, $dependencies ): this is called when $resourceURL is updated. The resource URL and its dependencies are recorded, along with the timestamp and the hash (the hash is here just for good measure), replacing any older information about the $resourceURL. This builds a directed dependency graph, with URL, timestamp and hash on each node. The graph can have several billion nodes, but the median path length would be low, probably lower than 10. Median in- and out-degree of each node would be similarly low, though there would be outliers with a degree of several million.
- popDirty( $n ): returns up to n URLs of resources that are currently dirty. The receiver is then responsible for updating these resources. Some transaction logic will be needed to make sure no updates are lost. A resource is dirty when it is older than its parents. It should be safe for several processes to call popDirty in parallel. popDirty() should return direct URLs with a lower timestamp before those with a newer timestamp.
One problem with this design is that it is prone to cycles, which would lead to indefinite churn. Cycles could be periodically detected and broken (perhaps using a map/reduce approach) or they could be detected in insert (should be ok due to small path length).
The entire system doesn't need full ACID guarantees, just eventual consistency. Losing information for a few nodes would not be too bad, they can easily be re-generated. Losing the entire graph would however be problematic, since all content would need to be re-parsed to rebuild it.
Note that the system is built around resource URLs (resp. URIs). Such URLs should be assigned not only to any editable (primary) resource, but also to any artifact that can be re-generated automatically. E.g. there would be a URL for a pages Wikitext, and another URLs for the page's rendered HTML, and another URL for the page's representation in the cirrus index, and URLs for the page's representations in the CDN (cached with all the chrome), etc.
- The system should be built to handle dependencies ten billion resources.
- Resources would typically have around ten dependencies, but some may have thousands.
- Most resources are leaves, but some are used by millions of other resources.
- Average path length in the (directed, acyclic, disconnected) graph is probably around ten.
- The graph needs to be able to handle thousands (maybe tens of thousands) of updates per second.
We will probably want two implementations of this: one based directly on SQL, which should be OK to scale up to a few million resources. And another implementation suitable for large wiki clusters like the one run by Wikimedia. If we had this kind of service, and we would be able to easily scale it horizontally, this would give us the freedom to re-use resources freely, without the need to worry about tracking and purging. It would Just Work™.