Wikimedia Technical Conference/2018/Session notes/Identifying the requirements and goals for dependency tracking and events

= Session Setup - https://phabricator.wikimedia.org/T206068 =

Facilitator Instructions: /Session_Guide#Session_Guidance_for_facilitators

= Questions to answer during this session =

= Daniel’s personal brain dump =


 * Purpose: update things when stuff they depend on changes, recursively. Replace “links tables” and “RefreshLinksJobs” with something more scalable, more flexible and extensible, more granular, and cross-wiki.
 * Two components: event bus (or http polling) and dependency graph. My focus is on the dependency graph.
 * Idea: each node in the graph can be “touched”; all nodes that depend on the touched node become “dirty” and need to be touched in turn (see the sketch after this list).
 * Graph properties: directed, not connected, acyclic (guaranteeing this is going to be tricky), shallow. Low median degree (in and out), but some nodes with very high degree. Roughly two orders of magnitude more edges than nodes.
 * Initial graph size: edges ~ size of all link tables combined < 10 billion; nodes ~ number of all pages on all wikis < 1 billion. Increasing granularity -> fewer superfluous updates, but a larger graph.
 * Splitting by wiki is tempting, but won’t work cleanly (commons, wikidata, global user pages, etc)
 * Writes: 100 nodes / 1000 edges per second.
 * Unfinished idea of how change propagation can work: https://www.mediawiki.org/wiki/User:Daniel_Kinzler_(WMDE)/DependencyEngine
 * Increasing granularity: isolated template rendering, tracking dependency on wikidata statements, distinguishing dependency on page content and page title, etc.
 * Decreasing overhead: deduplication and coalescing of events (e.g. subsequent edits; coalescing always means delay), ignoring redundant updates.
 * Technology questions: interface and model (HTTP and/or events), graph storage technology, horizontal scalability, availability and persistence requirements.
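
A minimal sketch of the touch/dirty propagation idea, assuming an in-memory adjacency map; a production store would be a database or dedicated graph store, and all names here are illustrative:

<syntaxhighlight lang="python">
from collections import defaultdict, deque

class DependencyGraph:
    """Toy model: an edge points from a node to the nodes that depend on it."""

    def __init__(self):
        self.dependents = defaultdict(set)

    def add_dependency(self, node, depends_on):
        self.dependents[depends_on].add(node)

    def touch(self, start):
        """Mark every transitive dependent of `start` dirty, breadth-first.
        The `seen` set guards against cycles, which the real graph would
        have to forbid or detect."""
        dirty = set()
        queue = deque([start])
        seen = {start}
        while queue:
            current = queue.popleft()
            for dependent in self.dependents[current]:
                if dependent not in seen:
                    seen.add(dependent)
                    dirty.add(dependent)
                    queue.append(dependent)
        return dirty

# Touching the template dirties the article, and transitively the portal.
g = DependencyGraph()
g.add_dependency("enwiki:Berlin", depends_on="commons:Template:Infobox")
g.add_dependency("enwiki:Portal:Germany", depends_on="enwiki:Berlin")
print(g.touch("commons:Template:Infobox"))
# {'enwiki:Berlin', 'enwiki:Portal:Germany'}
</syntaxhighlight>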

= Attendees list =


 * About 16 people.

= Structured notes =

There are five sections to the notes:


 * 1) Questions and answers: Answers the questions of the session
 * 2) Features and goals: What we should do based on the answers to the questions of this session
 * 3) Important decisions to make: Decisions which block progress in this area
 * 4) Action items: Next actions to take from this session
 * 5) New questions: New questions revealed during this session

= Questions and answers =

Please write in your original questions. If you came up with additional important questions that you answered, please also write them in. (Do not include “new” questions that you did not answer; instead add them to the new questions section.)

= Features and goals =

= Important decisions to make =

= Action items =

= New Questions =

= Detailed notes =

Place detailed ongoing notes here. The secondary note-taker should focus on filling any [?] gaps the primary scribe misses, and writing the highlights into the structured sections above. This allows the topic-leader/facilitator to check on missing items/answers, and thus steer the discussion.


 * Goals: purge the edge caches (Varnish). The process/protocol used for this is opportunistic. There are cases where purging the cache has a lot of race conditions; there are people who [make?] past versions of a page.
 * D: is it just about edge cache, or anything? Anything.
 * 10 minutes of writing down issues, including the producer and consumer
 * Examples of EventBus in production? Just the problems.
 * D: Are we asking: if we wanted to rebuild it, what are the problems we'd need to address? - No, just looking at current problems that exist -- caching and purging. Don’t think about EventBus. Think lower-case “e” events.

The issues were clustered.

The group dot-voted on the issues they wanted to talk about.

Top 5 issues from voting:


 * 1) How do we notify external users of changes?
 * 2) Wikidata change dispatching
 * 3) Some edits require millions of pages to be re-rendered
 * 4) Template changes don’t show up in articles until a purge or a long delay
 * 5) Content purges (senders: all services; receivers: all caches; problems: 1) unreliable transport, 2) race conditions, 3) scaling for multiple views)

Group 1 Discussion: (Greg, ...) - Purging issues

What is the nature of the issue?

What obvious solutions exist?

The importance of the issue (low/medium/high)


 * B: short summary of how they fit together
 * Edge caches cache by URI; we have to get purges from cacheable content services, using a transport between the services and the caches
 * Uses UDP, efficient but lossy
 * Services, emit purge events, of a URI
 * Problems:
 * Unreliable UDP: a pub/sub hub deals with downtime; easy to solve conceptually
 * Race condition on purging: there are two layers of caches; if a purge reaches them in random order it can cause stale content to be re-cached. The requirement this imposes is on the transport mechanism, e.g. layer 1 must consume it before layer 2 (see the sketch after this discussion)
 * Should the caches themselves transmit the purge events?
 * There are many machines in different DCs
 * Scaling: for one article there might be a lot of views (mobile, desktop, page preview); when we invalidate a page there are actually something like 15 things that need to be invalidated, which means the scale could be on the order of a million destinations due to template purges
 * A key associated with the various views/etc
 * “X-Key” was a potential solution, but maybe doesn’t scale to a million articles from a template
 * We probably shouldn’t have a million events in any of the transports
 * Some sort of alternate indexing system like X-Key
 * The other issue: “Some edits require millions of pages to be re-rendered”
 * Cascading updates
 * T: If we were to have X-key or similar, do we actually want everything that uses the infobox to be purged at the same time?
 * B: probably no
 * T: when you edit an infobox, we don’t send millions of jobs at once; the jobqueue does the purges in chunks of 100 or so
 * B: we wouldn’t want to do this with an X-key
 * We’re trying to reduce our TTL: 4 weeks is way too long; even less than 24 hours is enough, since the vast majority of hits come within 6 hours
 * The 24 hours is operationally important for being able to turn off some cache data centers
 * An LRU-type policy? Yes, and when we move to ATS it’ll be LFU (least frequently used)
 * If the purge will take more than 24 hours to complete, why not just let the cache policy do it for you?
 * It’s not just HTML but also link tables in the DB; we don’t want those to be tied to the cache purging system
 * We’ll need to jobqueue the millions of re-parses
 * S What’s an acceptable stale content window?
 * B: that’s a deep dive….
 * T: If I edit a template, how long is reasonable for a page that uses it to not update?
 * T: from a product perspective, if a user edits an article it’s pretty important; the user who did it is most important
 * J: prioritize reverts
 * B: might not have to
 * T: template edits are what require the millions of updates
 * A: wikidata is the content use case
 * B: another wikidata snowflake issue
 * T: infobox edit will take 3 days to re-parse, but 24 hours for the cache layer to get rid of it
 * Rough consensus :)
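
One conceivable way to satisfy the "layer 1 before layer 2" ordering discussed above is to purge the tiers strictly in sequence. A hedged sketch, assuming plain HTTP PURGE requests; the hostnames and tier order are illustrative, not the actual WMF topology:

<syntaxhighlight lang="python">
import urllib.request

# Hypothetical cache tiers; real deployments span multiple DCs.
FIRST_TIER = ["cache-a1.example", "cache-a2.example"]
SECOND_TIER = ["cache-b1.example", "cache-b2.example"]

def purge_host(host, path):
    """Send an HTTP PURGE for one path to one cache host."""
    req = urllib.request.Request(f"http://{host}{path}", method="PURGE")
    urllib.request.urlopen(req, timeout=5)

def ordered_purge(path):
    """Finish purging one whole tier before starting the next. The tier
    that other caches re-fill from has to be purged first, otherwise
    stale content can be fetched and re-cached."""
    for tier in (FIRST_TIER, SECOND_TIER):
        for host in tier:
            purge_host(host, path)

ordered_purge("/wiki/Berlin")
</syntaxhighlight>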

Group 2 discussion: (Nick, Lydia, Daniel, MarkB, ErikB, Ramsey, Corey, Alexia, Karsten) - Wikidata issues


 * Wikidata change dispatching
 * Some edits require millions of pages to be re-rendered

Notes:


 * When someone makes a change, we want to notify the rest of the world about it. Specifically Wikipedia, but also all Wikimedia projects, and in the future the rest of the world who use our items and properties to describe things in their own software. Internal dispatching is kinda working, but not perfect. Problems: when someone changes an item that is used in a lot of articles, like the "label for IMDb" property as used in authority control, that means purging a huge amount of pages. We currently limit this to [?1000 articles?].
 * Er: the limit is not the event system?
 * D: what's capped is recent changes/watchlist system.
 * L: it takes time to make these changes show up in RC/Watchlist, and if it takes too long, they drop off. If it takes 5 minutes to show up, then it's below the fold.
 * We try to reduce redundant purges by having highly granular tracking: not only which items are used on which pages, but also … This decreases purges by 2 orders of magnitude. You can optimize for purging, but that means more churn of tables.
 * Also we don't want to re-render all the time, or flood RC with irrelevant stuff (which was a problem in the past and was turned off).
 * What was irrelevant?
 * E.g. the mayor of Berlin is used on dewiki, but only the population of Berlin is used on enwiki; previously, if someone changed the mayor it would show up everywhere.
 * We have one mechanism for purging and for RC. They have different latency requirements and scaling issues.
 * L: 2nd issue: we want more and more people to use items and props in their own software, and we need to tell them if values change.
 * Ramsey: federation?
 * Tim: complete copies?
 * L: commons
 * T: So they only want to know about changes they're subscribed to
 * D: question is transport - if you have to notify a thousand external sites, and then 3 seconds later about the next edit…
 * T: digests?
 * L: instant notification probably not as important externally as it is internally
 * D: Requirements are: RC - seconds, re-rendering articles - minutes, external users - hours
 * Latency not the same as importance
 * Erik: what is source of latency today?
 * D: application surface doing rendering of stuff
 * E: not enough jobrunners?
 * D: yes-ish
 * T: can't have a burst of 100,000 edits at once
 * E: bursty-happy?
 * D: going to have bursts of something used 5 million times, which fills the queue
 * M: re: filling RC..
 * D: we want to report the event with the correct timestamp, so we kinda need 2 timestamps… how do we make that clear to the user?
 * T: Why do you need to know the original timestamp?
 * C: is it more important to see it, or to show it chronologically
 * E: Stas has had problems with this. Wikidata query is delayed by 10 mins.
 * D: tangent… that makes it more important… edit a page, pushed to RC, RC reacts, API asks for external links on the page, API gets old results.
 * M: need to make it faster.
 * D: yes, reduces the problem. But want a guarantee of causality
 * C: order events by internal to outside. Should never give the old view.  Make sure things are updated before re-rendering
 * D: do we not publish the event, i.e. delay publication of the RC event until internal events have all happened? This would be asynchronous, which is a change.
 * C: Islands of synchronicity - finish this before doing that
 * D: still have the user waiting. If we say don't push until everything is ready, then…
 * D: that reduces the problems but doesn't eliminate it.
 * D: still get into api [race conditions?]
 * M: What are the bottlenecks exactly
 * D: no way for an external user to get a consistent view. Poll for latest, and you might be getting old things. Could have a placeholder in RC saying "something will be added here" - you could poll RC for "give me quickest" or "give me done" (sketched after these notes)
 * M: get efficient data structure of all dependencies - and set a bit for that, so that you know something is pending.
 * C: is this like graphDB
 * D: Do you know a graphDB that meets our scaling and performance for this? We're talking 100 billion edges, and [?] thousand changes a second.
 * C: is that graph problem validated?
 * E: google must have bigger graphs than that, but unsure if it exists in the FOSS world
 * ACTION: Investigate graphDB options.
 * ACTION: Investigate current bottlenecks.
 * We need to build the actual graph and mechanism to feed it.
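
The "two timestamps" and "give me quickest / give me done" ideas above could look roughly like this. A sketch only; the field and function names are hypothetical, not an existing MediaWiki API:

<syntaxhighlight lang="python">
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ChangeEvent:
    page: str
    revision: int
    event_time: float                       # when the edit actually happened
    published_time: Optional[float] = None  # set once derived updates finish

    @property
    def is_settled(self) -> bool:
        return self.published_time is not None

def poll(events: List[ChangeEvent], mode: str = "quickest") -> List[ChangeEvent]:
    """'quickest' returns everything, including still-pending placeholders;
    'done' returns only events whose internal updates have completed."""
    if mode == "done":
        return [e for e in events if e.is_settled]
    return list(events)
</syntaxhighlight>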

Bucketing of changes is important to the Wikidata people

Goal:


 * Be able to purge multiple URLs in Varnish (see the sketch below)
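
A hedged sketch of what purging multiple URLs at once might look like with an X-Key style surrogate key, loosely modeled on the Varnish xkey vmod; the header name, endpoint, and key scheme are assumptions, not the actual WMF setup:

<syntaxhighlight lang="python">
import urllib.request

def purge_by_key(cache_host, key):
    """Ask an xkey-aware cache to drop every object tagged with `key`
    (e.g. every rendered page that transcludes a given template)."""
    req = urllib.request.Request(
        f"http://{cache_host}/",
        headers={"xkey-purge": key},  # assumed to be handled in VCL
        method="PURGE",
    )
    urllib.request.urlopen(req, timeout=5)

purge_by_key("cache.example.org", "enwiki:Template:Infobox")
</syntaxhighlight>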

Action:


 * Investigate why we have bottlenecks
 * Investigate ways to represent the dependency graph (in Daniel's head)


 * Needs help answering questions about running DBs at scale.

Questions:


 * Will a modern event platform survive the number of events?
 * What's the acceptable latency for external change propagation (i.e., when we notify external subscribers of changes)?
 * BB: Are we talking browser caches?
 * D: we're talking federated wikis. - Places using wikidata items/props.
 * BB: Once it's async…
 * D: async doesn't mean no bounds; what are those bounds?
 * Product question! What do users require?

Decisions:


 * Event Transport: Must be persistent and reliable.
 * Event Transport: Must support ordered subscription groups
 * BB: a stream of purge events with pubs and subs, and two different groups of subs (A and B), so that we can say: deliver to group A before group B (see the sketch below)
 * D: It’s one possible way to solve it, but is it a requirement? Have A send along to B?
 * Except A and B are different clusters of machines
 * It's possible that we go looking for a transport that gives these things and find there is none.
 * Caches (edge) - hard 24h limit
 * Hit rate? It varies depending on your perspective, from 90-98%
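
A minimal model of the "ordered subscription groups" requirement: group B only sees an event after every consumer in group A has acknowledged it. In-process queues stand in for a persistent transport (e.g. Kafka topics); this illustrates the semantics, not a design:

<syntaxhighlight lang="python">
from queue import Queue

class OrderedGroups:
    """Relay events to group B only once all of group A has acked them."""

    def __init__(self, group_a_size):
        self.group_a = Queue()  # stands in for a durable topic
        self.group_b = Queue()
        self.group_a_size = group_a_size
        self.acks = {}

    def publish(self, event):
        self.acks[event] = 0
        self.group_a.put(event)

    def ack_from_group_a(self, event):
        self.acks[event] += 1
        if self.acks[event] == self.group_a_size:
            self.group_b.put(event)  # now the second group may consume it

transport = OrderedGroups(group_a_size=2)
transport.publish("purge:/wiki/Berlin")
transport.ack_from_group_a("purge:/wiki/Berlin")
transport.ack_from_group_a("purge:/wiki/Berlin")
print(transport.group_b.get_nowait())  # purge:/wiki/Berlin
</syntaxhighlight>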