Technical decision making/Decision records/T291120

Original Google Doc with Decision Record

What are your constraints?

General Assumptions and Requirements Source

The person or document that this requirement comes from.

We will not fully solve comprehensiveness all at once.  Instead, we will aim to increase comprehensiveness by targeting a subset of important use cases. We will initially target:
  • Wikitext content (MCR?)
  • Wikitext diffs
  • HTML content
  • Page links changes
  • Wikibase entity data
  • Edit history of user
  • Page redirects info
  • Wikimedia Enterprise needs
This Decision Record specifically addresses getting more MediaWiki state into events. It does not include recommendations or solutions for building other event-driven services. However, it will make it possible to build services that use MediaWiki state in events.
Security Requirements

Describe any security limitations or constraints for the proposed solutions.

We don’t currently have any access control over how engineers produce or consume event stream data.  This will not be solved in this decision record, but we should keep it in mind while designing solutions.
Privacy Requirements

Describe potential privacy limitations or constraints for the proposed solutions and how they will be mitigated.

Emitting state changes into streams means that those changes will persist immutably for the retention period of each stream.  We should be careful with PII: (1) never expose it publicly, and (2) include a way to remove it, for example by using Kafka compacted topics. Using compacted topics might be out of scope for this Decision Record, but should be considered.
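
As a rough illustration of why compacted topics help with PII removal, here is a minimal sketch of compaction semantics in plain Python. It is not Kafka code: the `compact` function, the `user:N` keys, and the `user_text` field are hypothetical names chosen for this example. It assumes the usual compacted-topic behaviour, where only the latest record per key is retained and a record with a null value (a tombstone) eventually removes the key.

```python
from typing import Dict, List, Optional, Tuple

# A record is (key, value). A value of None stands in for a tombstone.
Record = Tuple[str, Optional[dict]]


def compact(log: List[Record]) -> List[Record]:
    """Keep only the latest record per key, and drop keys whose latest
    record is a tombstone. Surviving records keep their relative order."""
    latest: Dict[str, Tuple[int, Optional[dict]]] = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)  # later offsets overwrite earlier ones
    survivors = sorted(latest.items(), key=lambda item: item[1][0])
    return [(key, value) for key, (_, value) in survivors if value is not None]


# Hypothetical events keyed by user; the last record is a removal request.
log = [
    ("user:1", {"user_text": "Alice"}),
    ("user:2", {"user_text": "Bob"}),
    ("user:1", None),  # tombstone: remove everything stored under user:1
]
compacted = compact(log)
```

In a real Kafka deployment, compaction runs in the background and tombstones are themselves kept for a configurable period, so removal is eventual rather than immediate.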

Decision

Selected Option Option 4: Streaming Service(s)
Rationale Option 1. Do Nothing and Option 3. JobQueue can easily be ruled out. JobQueue was quickly dismissed by Petr as not reliable enough. 

Option 2. EventBus is a possibility, but only if we planned to add just 1 or 2 new event streams. 

This leaves Option 4. Streaming Service(s).  Doing this in a streaming service allows us to have a decoupled deployment from MediaWiki, and will be more flexible when adding new event streams.  It also allows us to build more expertise and tooling for doing this for new products in the future.

Data See Options below.
Informing Andrew Otto and Luke Bowmaker will be informing and working with others on this. Our initial need for interfacing with others will be to vet any new event streams we design to be sure they meet potential needs.
Who Andrew Otto
Date 2022-03-02

What are your options?

Option 1: Do Nothing
Description Continue using the existing MediaWiki event streams for state transfer.
Benefits No streaming services to maintain.
Risks Teams that need state outside of MediaWiki will have to get it themselves, injecting latency and complexity into their data pipelines.

Limits our ability to create timely and more relevant data products.

Effort More for teams implementing services.
Option 2: MediaWiki directly produces more events
Description MediaWiki directly emits more comprehensive event streams via EventBus extension.
Benefits
  • No streaming pipelines to maintain
  • Easy to do now
Risks
  • Not clear that we can get all we need via existing MediaWiki hooks.
  • Makes MediaWiki do much more on the app servers, which may increase load and/or latency in user interactions.
  • May make solving the consistency problem more difficult.
  • Produces to EventGate rather than Kafka directly: fewer consistency guarantees
  • No way to bootstrap using historical data
Effort
  • If the risks are not considered, then this could be implemented in a quarter.
Costs
Testing
Performance & Scaling Since MediaWiki itself will be producing more data at request time, we need to be very careful about what happens on MediaWiki app servers.
Deployment
  • Comprehensive events may be large; we need to be careful that Kafka can handle them.
Rollback and reversibility Until there are active consumers of new streams, rollback is just as easy as any other code change.
Operations & Monitoring
Additional References
Consultations
Consulted party 1 Search Platform - Zbyszko Papierski
Consulted party 2 WMDE Wikibase and Wikidata - Leszek Manicki
Consulted party 3 Platform Engineering - Petr Pchelko
Consulted party 4 SRE - Giuseppe Lavagetto
Option 3: Change-Prop / MW Job Queue produces more events
Description Create new MW jobs that react to existing MediaWiki notification events (e.g. revision-create) to produce new MediaWiki event streams (e.g. wikitext or HTML revision content).
Benefits
  • MW Job queue exists and is maintained by Platform Eng and SRE.
Risks
  • Likely requires the MW job to access the MariaDB database to get data
  • Can only react to one event at a time
  • Produces to EventGate rather than Kafka directly: fewer consistency guarantees
  • Will add load to MW job servers.
  • No way to bootstrap using historical data.
  • Jobs are delayed and can be lost.
Effort 2 quarters?
Costs Maintenance of new jobs. Will need to be owned by an engineering team.
Testing
Performance & Scaling
  • Need to be careful about requesting too much from the MediaWiki MariaDB (especially if we have to bootstrap a stream with historical data).
  • Comprehensive events may be large; we need to be careful that Kafka can handle them.
Deployment Jobs will only be run in active DC(?)
Rollback and reversibility Reversible until someone starts consuming the streams it produces.
Operations & Monitoring
  • Latency of events (how long does it take between a revision create in MW and a new revision content event to be produced)
Additional References
Consultations
Consulted party 1 SRE Data Persistence - Manuel Arostegui
Consulted party 2 WMDE Wikibase and Wikidata - Leszek Manicki
Consulted party 3 Platform Engineering - Petr Pchelko
Consulted party 4 SRE - Giuseppe Lavagetto
Option 4: Streaming service produces more events
Description Streaming service(s) react to existing MediaWiki notification events (e.g. revision-create), ask the MediaWiki API for more data (e.g. wikitext or HTML revision content), and produce new event streams.  Tech TBD, but could be Flink, Kafka Streams, Knative Eventing, etc.
Benefits
  • Independent from MediaWiki monolith (only coupled via the API)
  • Easy to add new data and streams once we have a baseline service implemented
  • Produces to Kafka directly: more consistency guarantees
  • If needed, possible to get data from data sources other than MW to include in the event.
  • We will need to build more streaming apps in the future; doing this builds the expertise and tooling to support that.
Risks
  • Not clear if we can get all we need from MediaWiki API
  • Operating streaming services is new for us (Search Platform has experience now).
Effort 2 or 3 quarters to get the initial service in production.  After that, minimal effort to add more data streams.
Costs Maintenance of streaming service(s). Will need to be owned by an engineering team.
Testing
Performance & Scaling
  • Need to be careful about requesting too much from the MediaWiki API (especially if we have to bootstrap a stream with historical data).
  • Comprehensive events may be large; we need to be careful that Kafka can handle them.
Deployment Multi-datacenter deployment of streaming pipelines is complicated.  Search Platform has settled on a pattern (active-active with multi-DC compute). We may choose a different pattern here, since the existing MW events are not multi-compute.
Rollback and reversibility Since this is a separate service, it is reversible until someone starts consuming the streams it produces.
Operations & Monitoring
  • Stream throughput
  • Latency of events (how long does it take between a revision create in MW and a new revision content event to be produced)
  • Late events
Additional References Data Platform - Event Driven Services
Consultations
Consulted party 1 Search Platform - Zbyszko Papierski
Consulted party 2 WMDE Wikibase and Wikidata - Leszek Manicki
Consulted party 3 Platform Engineering - Petr Pchelko
Consulted party 4 SRE - Giuseppe Lavagetto
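
The flow described for Option 4 (react to a notification event, ask the MediaWiki API for more data, produce a new event) can be sketched roughly as follows. This is an illustration under stated assumptions, not the chosen implementation: the function names `enrich_revisions`, `fetch_content`, and `fake_fetch_content` are hypothetical, and so are the event fields `domain`, `rev_id`, `page_title`, and `content`. The API call is injected as a function so the sketch runs without network access; a real service would call the MediaWiki API there and produce the result to Kafka.

```python
from typing import Callable, Dict, Iterable, Iterator


def enrich_revisions(
    events: Iterable[Dict],
    fetch_content: Callable[[str, int], str],
) -> Iterator[Dict]:
    """For each notification event, look up the revision content and
    yield a new, enriched event. Input events are not modified."""
    for event in events:
        content = fetch_content(event["domain"], event["rev_id"])
        enriched = dict(event)  # shallow copy, so the source event stays as-is
        enriched["content"] = content
        yield enriched


def fake_fetch_content(domain: str, rev_id: int) -> str:
    """Stand-in for a MediaWiki API request (assumption: hypothetical helper)."""
    return f"wikitext of revision {rev_id} on {domain}"


# Hypothetical revision-create notifications.
incoming = [
    {"domain": "en.wikipedia.org", "rev_id": 123, "page_title": "Example"},
]
outgoing = list(enrich_revisions(incoming, fake_fetch_content))
```

Keeping the lookup behind a function argument mirrors the point that the service is coupled to MediaWiki only via the API, and it leaves room for pulling data from other sources.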

Resource:

https://www.atlassian.com/blog/inside-atlassian/make-team-decisions-without-killing-momentum

Use Cases and required MediaWiki state events

The table below is meant to show that, even though we could solve many comprehensiveness problems with Option 2 (EventBus), that still leaves a gap in using the data to compute something new. We likely need a centralized platform to compute new datasets.

For each project: what additional data was needed that streams didn’t have, which option would have solved the problem, why it didn’t use streams, what it used, and the impact of not using streams.

Image Suggestions - suggests an image from Commons if an article doesn’t have one
  • Additional data needed: Images linked to article
  • Option that would have solved this: 4. 2 - could only help with additional data requests which are fairly small
  • Why it didn’t use streams: Additional data is easy to get from MW API but requires an event compute component to run algo that wasn’t easily available
  • What it used: Scheduled monthly batch job
  • Impact of not using streams: Data only refreshed monthly so dataset can be stale or missing for up to a month

WikiWho (T293386) - assigns ownership to each word of article and revisions
  • Additional data needed: Revision diffs
  • Option that would have solved this: 4. 2 - could only help with additional data requests but needs
  • Why it didn’t use streams: It does but it’s built outside WMF using RabbitMQ. Is being brought in house using combo of WCS
  • What it used: Copy over of systems used by community - https://phabricator.wikimedia.org/F34639572
  • Impact of not using streams: WMF increased technical debt for components we may not support (Python pickles, Postgres, etc)

Sections - ML/AI model to define sections of an article
  • Additional data needed: Wikitext?
  • Option that would have solved this: 2 - if LiftWing could listen directly to a new stream. 4 - may be needed to transform data/store as we want
  • Why it didn’t use streams: Project in early phases so still might
  • What it used: N/A
  • Impact of not using streams: If we rely on monthly data then dataset can be stale or missing for up to a month

Similar Users
  • Additional data needed: Edit history of user?
  • Option that would have solved this: Would it be too much for 2 to provide this? 4 - could solve with MW API call and then compute part for algo
  • Why it didn’t use streams: Startup costs of event platform like Flink are high for one project
  • What it used: Scheduled monthly batch job
  • Impact of not using streams: Data only refreshed monthly so dataset can be stale or missing for up to a month

Enterprise?

Wikidata Query Service
  • Additional data needed: Well ordered diffs of the RDF data
  • Option that would have solved this: 4 - needs some way to hold state so updates can be ordered
  • Why it didn’t use streams: It did but took a lot of effort to get going
  • What it used: It used Flink streams
  • Impact of not using streams: N/A

Search updates (The search platform is currently thinking of a possible rewrite using other technologies)
  • Additional data needed: Content (wikitext + html), redirects information (perhaps more, still exploring)
  • Option that would have solved this: 4 - as we probably want to batch updates but also join multiple streams (pageviews data, ores scores, …)
  • Why it didn’t use streams: Current setup is written inside MW, this system has the largest footprint on the jobqueue
  • What it used: MW JobQueue
  • Impact of not using streams: The current JobQueue cannot handle the load induced by CirrusSearch due to its design

Decision Record Drafting Meeting Notes

2022-03-01: Otto and Giuseppe discussion:

  • Possible we want to enrich events with stuff that might come from other places than MW.
  • Want to free MW app worker as soon as possible.
  • 1 or 100 more API requests per edit is still okay.
  • Stream processing approach is the more long term sustainable one.
  • BUT, if you want something here and now for some short-term goal, EventBus is okay. Worry: that thing will remain there forever.  Don’t want to maintain both forever.  
  • Need to make developing services around the big thing easier.  They tend to want to store the data in docker image now.  
  • Preference for stream processing over eventbus.

Feb 15, 2022 | petr & otto discussion

Attendees: Petr Pchelko Andrew Otto Dan Andreescu

Notes

  • PP: we should dismiss the job queue idea.  Worst of both worlds.  Still in PHP, but jobs are delayed and can get lost.  All downsides.
  • Making just content events might be ok in eventbus.  But if we have 500 new events, maintaining in MW might be difficult.  
  • PP: What about consistency?    
  • DA: perhaps Debezium on just the content table for content events.  Rev_id and content, that’s it.  This should be a considered solution.
  • PP: then we could generalize it: when MW table schema is ‘reasonable’ we could just use Debezium for other things too.  When not reasonable, use EventBus.
  • AO: people also will want html content, and page links changes.
  • PP: maybe sending 4 MB of content and 12 MB of HTML on every edit in a PHP deferred update (EventBus) isn’t great.
  • PP: my preferred solution: start with EventBus, then do separate streaming service.  If fat events get traction and we need more and more, then we do streaming service solution.
  • DA: would be easiest now, but what about the performance of producing all that data from the app servers after an edit?
  • What would giuseppe say?  Will this bog down app servers?

Action items

  • Talk to SRE about emitting from EventBus, if okay with them, let’s do it.
  • However, if this needs to emit many different kinds of events, then maybe doing it in EventBus is not that flexible and we should do streaming service anyway.
  • Talk with Giuseppe: he prefers streaming service idea.  Doing this in EventBus will likely just be tech debt.


Feb 14, 2022 | Discuss Comprehensive MediaWiki Events Decision Record

Attendees: Luke Bowmaker Andrew Otto Petr Pchelko Leszek Manicki David Causse Andy Craze


Note

  • https://libwas.readthedocs.io/en/latest/ What MW state would be most useful to have in streams now?
    • Wikitext content
    • Wikitext diffs
    • Html content
    • Page links changes
    • Wikibase entity data
    • Citation changes?  (is this different than links?)
  • AC: ORES preprocessing for models?
    • Most are just fetching article text or diff.
    • Every ores model is at the revision level, text and diffs most useful
    • In the future, lots of things we can do, depends on use case.
  • LM: From wikidata/wikibase
    • Could be rubbish!? :)
    • Wikidata edits are slow sometimes because of AbuseFilter. Could we build this functionality outside of the request pipeline?
    • AbuseFilter: Community can set up their own filters, which can slow things down.  This is done before page save.
  • DC: Redirects? These are separate from pages.  When a redirect is added to a page, we would like to have an event for this.  Consider page as an object with its redirects.
    • Existing events have page_is_redirect flag.  We could put where the redirect is to by asking MW.  
    • Other side too.  What pages redirect TO a page?  Page A is redirected from Pages X, Y, Z.  
    • PP: redirect sources are stored in a denormalized table, I think page links.

Solution discussion:

  • MW Job Queue vs Stream Processor
  • PP: page content is immutable.  You can attach it to whatever event at any time in the future.  Async is okay here, it will be correct.  Doesn’t really matter if job queue or not.  
    • Option 2: at request time (EventBus) is actually okay.
    • MW Job doesn’t really add much.  It’s just more async.  Just adding a step that doesn’t really give you anything.
    • Option 4 is cool, especially if MySQL External Store had its own API separate from MW.    
    • Option 4 isn’t really decoupled; it’s a separate deployment unit, that’s something.  But is it worth it?
    • AO: Option 2 and 3 have to POST to EventGate.
      • PP: there are maybe ok PHP kafka producers now?
    • PP: There are a ton of things that are coded in MW PHP.  Having to recode that in other languages is annoying. E.g. MW normalizing page titles.  
    • PP: What are you getting from doing this from just having all consumers asking API for what they need?  
    • DC: reading directly from MW events: ordering is hard to accomplish.  Reading multiple topics.  Streaming processor helps, but it is complicated.
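
To make the ordering concern from the last note concrete, here is a small sketch of merging several topics into one stream ordered by event time. It is a simplification under explicit assumptions: each input is bounded and already ordered by its own timestamp, and all timestamps are ISO 8601 strings in UTC, so they sort lexicographically. The function name `merge_by_event_time`, the `dt`, `stream`, and `page_id` fields, and the sample stream names are illustrative, not taken from an existing schema.

```python
import heapq
from typing import Dict, Iterable, Iterator


def merge_by_event_time(*topics: Iterable[Dict]) -> Iterator[Dict]:
    """Merge per-topic event sequences, each already sorted by 'dt',
    into a single sequence sorted by 'dt'."""
    return heapq.merge(*topics, key=lambda event: event["dt"])


# Two hypothetical topics about the same page.
page_events = [
    {"dt": "2022-02-14T10:00:00Z", "stream": "page-create", "page_id": 1},
    {"dt": "2022-02-14T10:02:00Z", "stream": "page-delete", "page_id": 1},
]
revision_events = [
    {"dt": "2022-02-14T10:01:00Z", "stream": "revision-create", "page_id": 1},
]

merged = list(merge_by_event_time(page_events, revision_events))
order = [event["stream"] for event in merged]
```

Real streams are unbounded and events can arrive late, so a consumer cannot simply wait for "the next smallest timestamp" as this sketch does. Handling that is part of what a stream processor provides, and part of why it is more complicated.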