Requests for comment/Requirements for change propagation


 * This is a copy of T102476, converted to wikitext. We should probably migrate this content over to mediawiki.org (since MediaWiki is better for changetracking and collaborative editing) and leave Phab for an abstract/summary plus the things that work well in Phab (state tracking, discussion, and Phab task linking) -- RobLa-WMF (talk) 06:26, 16 May 2016 (UTC)

At a high level, change propagation involves three main ingredients:


 * 1) Distributing change events, and
 * 2) processing and propagating those change events using
 * 3) a dependency graph.

Publish / subscribe event bus
Many of our internal services are interested in following a common set of events. The most popular events are perhaps those related to edits, which are followed to keep derived content and caches up to date. Our current mechanisms for event propagation aren't very reliable, lead to a lot of duplicated effort, and aren't usable for services other than MediaWiki core.

We intend to set up a more reliable and scalable event bus along with a standard set of update events. This event bus will support a reliable publish/subscribe consumer pattern, which means that several clients can reliably follow the same event, thereby decoupling producers from consumers. This should be architected so that it can be used on any MediaWiki installation on the public Internet (think instant Wikidata or instant Commons). Some of these clients are only interested in a small subset of the data, while others will want all of it. See T84923 for more background on the event bus.

Dependency tracking
In our applications, dependencies normally form a DAG. We currently have specific link tables tracking some kinds of dependencies (image links, template links, etc.), but nothing that can track tree-shaped relationships in an extensible manner. The current mechanisms are further limited to a single project, which means that they can't be directly used by Wikidata and other shared projects.

A big challenge with such a dependency graph is its maintenance. Dependencies are added and removed all the time, and this needs to be reliably reflected in the dependency graph. Ideally the maintenance of dependency information should be automated to avoid the need to write custom update logic in each service.
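As a rough illustration of automated maintenance, the sketch below replaces a resource's dependency set on each re-render and derives the added and removed edges from the difference, so that no service needs custom update logic. The class, its in-memory storage, and the example URLs are all hypothetical; nothing here reflects an existing implementation.

```python
from collections import defaultdict


class DependencyGraph:
    """In-memory adjacency lists keyed by resource URL (illustrative only)."""

    def __init__(self):
        self._uses = defaultdict(set)     # dependent -> resources it uses
        self._used_by = defaultdict(set)  # resource -> its dependents

    def set_dependencies(self, dependent, resources):
        """Replace the dependencies of `dependent`; return (added, removed)."""
        new = set(resources)
        old = self._uses[dependent]
        added, removed = new - old, old - new
        for res in added:
            self._used_by[res].add(dependent)
        for res in removed:
            self._used_by[res].discard(dependent)
        self._uses[dependent] = new
        return added, removed

    def dependents(self, resource):
        """Return the resources that need updating when `resource` changes."""
        return sorted(self._used_by.get(resource, ()))
```

A renderer would call `set_dependencies` with whatever it actually used while rendering, which keeps the graph in sync as a side effect of normal processing.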

Addressing of components

To allow addressing, each node in the dependency graph needs a unique identifier. By using deterministic identifiers based on the description of the item, we can avoid duplicate work. It would also be desirable if those identifiers could be used directly to dereference the dependency, ideally in a way that supports loose coupling of systems across projects. URLs, or more generally HTTP requests, can satisfy these requirements. There are length limitations for GET request URLs (2083 characters in IE), but those can likely be worked around with request storage (GET with a hash of the request) and a POST fall-back for dynamic requests from clients like VisualEditor.
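One possible way to obtain such deterministic identifiers is to hash a canonical serialization of the request. The sketch below assumes SHA-256 over sorted-key JSON; the function name and the choice of hash are illustrative assumptions, not part of the proposal.

```python
import hashlib
import json


def request_id(method, url, body=None):
    """Derive a stable identifier from a canonical description of a request."""
    canonical = json.dumps(
        {"method": method.upper(), "url": url, "body": body},
        sort_keys=True,          # key order must not affect the identifier
        separators=(",", ":"),   # no insignificant whitespace
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Two services describing the same request then arrive at the same identifier independently, and a long request could be stored once and later fetched with a short GET carrying only the hash.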

For fine-grained template updates or subscriptions, it would also be useful if we could identify fragments of a resource in a standard manner. In a URL, this could potentially be encoded as a query string or fragment identifier. It is important that we make this mechanism uniform and deterministic.

Change propagation
Change propagation can be broadly implemented using two techniques: push and pull. A push-based change propagation service listens to specific event streams, and then figures out which resources should be updated by consulting the dependency graph and event properties. The update of those resources triggers additional change events, which can then recursively trigger additional updates. Push is generally preferred if there are many reads of each dependent bit of content, or where lowest possible read latency is required. This is the strategy we currently pursue for template updates.

In pull-based change propagation, dependencies are simply checked on each access. This can be implemented by rendering everything from scratch on each access or with a slightly more efficient freshness check. Pull is preferred if low propagation latencies need to be supported with high fan-out, and if there are few reads per change.

Studies have shown that an adaptive combination of push and pull is optimal if the distribution of the number of dependencies and updates is skewed. Template updates are skewed in the number of uses (with some templates used in more than 7 million pages), but currently less so in the number of edits. Our current approach of re-rendering all seven million articles can easily result in large backlogs of template updates. It might be useful to consider pull-based or hybrid solutions (where only a timestamp is propagated and polled) as an alternative to pure push.
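The hybrid variant, where only a timestamp is pushed and freshness is checked on read, could look roughly like the following sketch. The class and method names are hypothetical, timestamps are plain integers, and the dependency list is passed in by the caller purely to keep the example short.

```python
class HybridCache:
    """Push only change timestamps; re-render lazily when a read finds stale content."""

    def __init__(self, render):
        self._render = render    # callable: resource -> content
        self._changed_at = {}    # dependency -> timestamp of its last change
        self._entries = {}       # resource -> (rendered_at, content)

    def note_change(self, dependency, timestamp):
        """Push step: record the timestamp only, without re-rendering anything."""
        previous = self._changed_at.get(dependency, timestamp)
        self._changed_at[dependency] = max(timestamp, previous)

    def read(self, resource, deps, now):
        """Pull step: re-render only if a dependency changed since the last render."""
        entry = self._entries.get(resource)
        newest = max(
            (self._changed_at[d] for d in deps if d in self._changed_at),
            default=None,
        )
        stale = entry is None or (newest is not None and entry[0] <= newest)
        if stale:
            entry = (now, self._render(resource))
            self._entries[resource] = entry
        return entry[1]
```

With this shape, an edit to a template used on millions of pages costs a single timestamp write, and the rendering cost is paid only for the pages that are actually read afterwards.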

Implementation sketch

After an event, for example the saving of a new page revision, an event message is enqueued into a topic of the distributed queue. The message contains the identifier of the event source, the kind of event and various event-specific metadata. On the other end of the distributed queue, several clients are reading messages off the queue. Each independent client (group in Kafka's case) maintains its own offset (or offsets), which lets multiple consumers react to the same event. When receiving a message, each of these consumers performs a client-specific action.
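The per-group offset idea can be illustrated with a toy in-memory topic. This is not the Kafka client API; the class, its methods, and the group names are made up for illustration, and a real queue would also be partitioned, persistent and replicated.

```python
class Topic:
    """Append-only log; each consumer group tracks its own committed offset."""

    def __init__(self):
        self._log = []
        self._offsets = {}  # group name -> index of the next unread message

    def publish(self, source, kind, **meta):
        """Enqueue an event: source identifier, kind of event, and metadata."""
        self._log.append({"source": source, "kind": kind, "meta": meta})

    def poll(self, group):
        """Return the next unread message for `group`, or None if caught up."""
        offset = self._offsets.get(group, 0)
        return self._log[offset] if offset < len(self._log) else None

    def commit(self, group):
        """Mark the current message as processed for `group` only."""
        offset = self._offsets.get(group, 0)
        self._offsets[group] = min(offset + 1, len(self._log))
```

Because offsets are kept per group, a slow consumer never holds back a fast one, and a consumer that restarts simply resumes from its last committed offset.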

When a change propagation worker receives a message, it will look up a chunk of dependencies in the dependency graph storage. For each dependency, it will call the provided URL (or request template), passing along information from the original event. Each of these dependent updates triggers another update event, recursively propagating the change through the system. When the number of dependencies is large, the worker will enqueue a follow-up event to trigger the processing of the next page of dependencies later. Once the chunk is fully processed, the worker commits its offset and requests the next message. Should any dependency update fail, it will enqueue a retry event in a separate topic to make sure that the update is retried a few times. Persistently failing jobs will be retired to a 'dead letter' queue for later inspection.
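The worker loop described above might be sketched as follows. Everything here is a simplification under stated assumptions: the queues are in-process deques rather than topics, the graph is a plain dict, `PAGE_SIZE` and `MAX_RETRIES` are arbitrary illustrative values, and `update` stands in for the HTTP request to the dependent resource.

```python
from collections import deque

PAGE_SIZE = 2    # hypothetical number of dependencies handled per event
MAX_RETRIES = 2  # hypothetical number of retries before giving up


def process(event, graph, update, main, retry, dead):
    """Handle one event: update one page of dependents, enqueue follow-ups."""
    start = event.get("start", 0)
    if "target" in event:
        # Retry event: re-attempt a single failed update.
        targets, more = [event["target"]], False
    else:
        deps = graph.get(event["source"], [])
        targets = deps[start:start + PAGE_SIZE]
        more = start + PAGE_SIZE < len(deps)
    for target in targets:
        try:
            update(target, event)
            # A successful update is itself a change that may have dependents.
            main.append({"source": target})
        except Exception:
            attempts = event.get("attempts", 0) + 1
            failed = {"source": event["source"], "target": target,
                      "attempts": attempts}
            (retry if attempts <= MAX_RETRIES else dead).append(failed)
    if more:
        # Continue with the next page of dependencies in a later event.
        main.append({"source": event["source"], "start": start + PAGE_SIZE})


def drain(first_event, graph, update):
    """Run until both queues are empty; return the dead-letter list."""
    main, retry, dead = deque([first_event]), deque(), []
    while main or retry:
        queue = main if main else retry
        process(queue.popleft(), graph, update, main, retry, dead)
    return dead
```

Paging through follow-up events keeps each unit of work small, so a template with millions of dependents does not block the queue behind a single long-running message.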

Current status

 * T84923: Reliable publish / subscribe event bus is deployed, and provides event streams for edits and resource changes. Under the hood, all topics are prefixed by source datacenter and replicated (see T127718), which lets us cleanly move event processing between datacenters.
 * T117933: Change propagation service, phase 1 is gradually being rolled out at the moment. Driven by a declarative config file, this service subscribes to EventBus topics, and processes events by making HTTP requests to other services, or by sending purges to Varnishes. Events are consumed from specific topics, and can be further filtered by arbitrary properties, including URL patterns.
 * An example module for iterative backlink processing was already created. This module nicely separates the expansion of dependencies from their processing, and can serve as a model for further iterative dependency expansion.
 * T126687: RFC: Publish all resource changes to a single topic introduced a single topic recording URL-based resource changes. This topic is intended to be used for CDN purges, and is already used to trigger secondary updates in the ChangeProp service. @Smalyshev and @aaron are looking into sending all MediaWiki CDN purge requests to this topic.
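To make the declarative subscription model of the change propagation service more concrete, the sketch below matches events against rules by topic and by property patterns, then fills a request template from the captured groups. The rule structure, topic names, and URLs are hypothetical and do not reproduce the actual ChangeProp config format.

```python
import re

# Hypothetical rules; the real ChangeProp config format is not shown here.
RULES = [
    {
        "name": "rerender_summary",
        "topic": "resource_change",
        "match": {
            "uri": re.compile(
                r"^https?://[^/]+/api/v1/page/html/(?P<title>[^/]+)$"
            ),
        },
        "request": "/page/summary/{title}",
    },
]


def plan_requests(topic, event, rules=RULES):
    """Return the request paths triggered by `event` arriving on `topic`."""
    planned = []
    for rule in rules:
        if rule["topic"] != topic:
            continue
        groups = {}
        for prop, pattern in rule["match"].items():
            found = pattern.match(str(event.get(prop, "")))
            if found is None:
                break  # one property failed to match: skip this rule
            groups.update(found.groupdict())
        else:
            # All properties matched: fill in the request template.
            planned.append(rule["request"].format(**groups))
    return planned
```

Keeping the rules as data means that a new consumer can be added by editing configuration rather than by deploying new code.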

Next steps and open questions


 * ChangeProp service expansion: ChangeProp will gradually expand to cover more use cases. Initially the services team will focus on RESTBase's use cases (including red link, template and media re-renders), and will also move CDN purging from RESTBase to ChangeProp.
 * Reliable CDN purging: There have been various discussions about making CDN purging more reliable. Current ideas include running Kafka clients on each Varnish node, which would effectively replace the best-effort multicast setup. However, there are also plans to reduce the purge volume by using alternate keys (like Varnish's XKey / T122881). A single purge would match all resources associated with an underlying resource like a page. However, with asynchronous updates it will be tricky to determine the best time to issue such a purge. We will still need to issue several purges after a primary event. Coordinating these is currently an open problem.
 * Reliable RCStream: @Ottomata has been looking into leveraging Kafka events in RCStream. This can potentially let clients catch up after being disconnected.
 * Cross-project dependency tracking and change propagation: We currently don't have any general way to track dependencies across projects. Special-case mechanisms were developed for Commons and to some degree Wikidata, but other applications (like T91162: RFC: Shadow namespaces) will need dependency tracking abilities as well. It would be good to generalize this infrastructure, so that efforts can be shared across several use cases. Open questions in this space:
 * API requirements and possible designs for querying and updating dependencies.
 * Dependency graph storage: T105766: RFC: Dependency graph storage; sketch: adjacency list in DB discusses some options for storing such dependencies in a general manner, but it's early days and we should probably make our requirements more precise before diving too deeply into the concrete design.