Decision Record: Where Do We Build the Instrument Configurator?
- Status: open for feedback
- Recommender: Sam Smith
- Decider: Will Doran
- To be consulted:
  - Agreers:
    - Service Operations
    - DPE SRE
    - Security
  - Inputers:
    - MediaWiki Engineering
- Agreers:
- To be informed: SDS 2.5 Steering Committee
- Date authored: 2024-01-29
- Target decision date: 2024-03-15
Technical Story: [SPIKE] Draft of Mediawiki extension proposal for Metrics Platform Instrumentation (& Experimentation)
Keywords
instrument, instrumentation, config, configuration, mediawiki, extension, service, services
Context and Problem Statement
From SDS2.5.2: Instrumentation Configuration
If we create an instrumentation configuration system that has a low technical barrier to entry, we can
- reduce the amount of engineering time required to create and manage instruments
- decrease the time to data in order to enable confident data-based decision making across product decision makers.
In order to accomplish SDS2.5.2, we must first decide where to build the configurator. If we do not decide, then we cannot deliver on SDS2.5.2.
Decision Drivers
- Data Products’ commitment to delivering a Minimum Lovable Product for SDS2.5.2 by EOQ3
- Performance. The instrument configurator must be able to deliver instrumentation configuration to all Wikipedia users
- Security
- The levels of support from Site Reliability Engineering (SRE), Service Operations (ServiceOps), Data Platform Engineering SRE (DPE SRE), and Release Engineering (RelEng)
- The long-term goal of having a unified tool for configuring instruments and experiments across all Wikimedia-hosted sites
- The competencies of Metrics Platform/Data Products engineers and the amount of engineering effort required
- Long-term maintainability.
Considered Options
- Standalone application (app) in the WikiKube cluster with a bridging MediaWiki extension
- Standalone app in the dse-k8s cluster with a bridging MediaWiki extension
- MediaWiki extension
Recommendation
Option 2: Standalone application in the dse-k8s cluster with a bridging extension.
The guidance we received from the Principal Architect for MediaWiki Platform aligns with this recommendation.
Positive Consequences
- The option maintains performance and security
- The option cleanly separates the domains of instrumentation and its configuration, which should be in the Data Products domain
- The option allows for future expansion to feature flagging/experimentation (see T335034: [Goal] Build the Experiment Control Plane)
- Following on from the above, this option allows us to more easily experiment and pivot with solutions for this and for feature flagging/experimentation (see also T335482: Investigate open-source feature flagging/experimentation platforms)
- The same app can serve the Beta Cluster, Wikipedia, and non-MediaWiki sites hosted by WMF so we can know the complete history of an instrument (see Where Do We Build the Instrument Configurator)
Risks and Mitigations
- Performance and security reviews have difficult-to-predict outcomes. We will consult with the Security team early, providing them with the Decision Record and Design Document once an option is selected. We will also continue to engage proactively with the MediaWiki Platform Team to ensure alignment
- We may have to request a concept review from SRE to help us work through the various failure scenarios of a distributed system. To mitigate this, we will consult with ServiceOps early and, in the immediate term, use the dse-k8s cluster, which will allow us to prototype more efficiently
- We will be using the dse-k8s cluster for a non-mainstream use case, though one that still falls within its remit. Technically, this kind of service exists in a liminal space: WikiKube is geared specifically toward MediaWiki-related services and dse-k8s is geared toward data engineering work, so this service could be argued to belong in either. As part of the work, we accept the potential effort involved in porting from dse-k8s to WikiKube if that is required in the long term
- In using dse-k8s, we accept that support is provided only during working hours and that availability is not guaranteed. In the event that dse-k8s or the app is unavailable, the bridging extension will disable all instruments
- In using the Data Platform PostgreSQL cluster, we accept that the cluster is not multi-DC. To mitigate the risk of performance degradation during the eqiad-codfw DC switchover, the app will maintain its own cache
- All considered options require that we do some MediaWiki development. Two members of the team have a lot of experience in this domain and will share their knowledge throughout development with team members who have less experience
- We are aware that there is the beginning of an effort to abstract away the source of configuration from MediaWiki so that the current static configuration blob can be moved into an external data store. We accept that there will be work involved in porting the system to that paradigm
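The fail-closed behaviour described in the dse-k8s availability risk above can be sketched as follows. This is a minimal illustration in Python rather than the bridging extension's actual code; the `fetch` callable, the `enabled` field, and the function name are assumptions made for the example.

```python
from typing import Any, Callable, Dict, List


def load_instruments(fetch: Callable[[], List[Dict[str, Any]]]) -> List[Dict[str, Any]]:
    """Return the enabled instrument configs, or none if the app is unreachable.

    `fetch` is a hypothetical callable that requests instrument
    configuration from the configurator app.
    """
    try:
        return [item for item in fetch() if item.get("enabled", False)]
    except (ConnectionError, TimeoutError):
        # Fail closed: if dse-k8s or the app is unavailable,
        # every instrument is treated as disabled.
        return []
```

Failing closed means an outage costs some data collection but never leaves instruments running with stale or unknown configuration.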
Standalone Application in the WikiKube Cluster
Dimension | Answer | Notes |
---|---|---|
Collaborating teams? | SRE, ServiceOps, Data Persistence, RelEng | ServiceOps have already signaled that they cannot provide support until April |
Is the deployment path clearly defined? | Yes | |
Estimated time to build and deploy? | 10-12 weeks | |
Availability after deployment? | High | |
Extension required? | Yes | T355599: Where Do We Build the Instrument Configurator |
Does this affect build or commission? | No | Must |
Programming languages available? | JS (frontend), PHP, Node.js (backend), Go, Python | |
Standalone Application in the dse-k8s Cluster
Dimension | Answer | Notes |
---|---|---|
Collaborating teams? | DPE SRE, RelEng | |
Is the deployment path clearly defined? | Yes | |
Estimated time to build and deploy? | 10-12 weeks | |
Availability after deployment? | Variable | The dse-k8s cluster has no SLO for availability. dse-k8s is only deployed in the eqiad DC. If there were a catastrophic event in eqiad, the service and its functionality would not be available until eqiad was available again. |
Extension required? | Yes | T355599: Where Do We Build the Instrument Configurator |
Does this affect build or commission? | No | Must |
Programming languages available? | JS (frontend), PHP, Node.js (backend), Go, Python, Java | |
MediaWiki Extension
Dimension | Answer | Notes |
---|---|---|
Collaborating teams? | Data Persistence, RelEng | |
Is the deployment path clearly defined? | Yes | |
Estimated time to build and deploy? | 8-9 weeks | |
Availability after deployment? | High | Two team members are already deployers and can onboard other team members. |
Does this affect build or commission? | Yes | We cannot commission a third-party piece of software to act as the instrument configurator. |
Programming languages available? | PHP, JS | |
Storage
Cluster | Owner | Notes |
---|---|---|
Main MariaDB cluster | Data Persistence | Multi-DC, writes in primary DC, reads possibly from both (needs app to be capable of that) |
Analytics Meta MariaDB cluster | DPE SRE | In the eqiad DC. When codfw becomes the primary DC, our app would still be talking to the DB in eqiad, decreasing perceived performance for the app user |
Data Platform PostgreSQL cluster | DPE SRE | See the above |
Cassandra RESTBase cluster | Data Persistence | Multi-DC, read/writes to both DCs, eventually consistent |
We predict a row size of 3.34 KiB and estimate that there will be on the order of 100s of rows.
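As a rough worked example of what that estimate implies for total storage, the arithmetic can be sketched as below. The bounds of 100 and 1,000 rows are assumptions standing in for "100s of rows"; the source gives no exact figures.

```python
ROW_SIZE_KIB = 3.34  # predicted row size, taken from the estimate above


def estimated_storage_kib(rows: int, row_size_kib: float = ROW_SIZE_KIB) -> float:
    """Return the estimated table size in KiB for a given number of rows."""
    return rows * row_size_kib


# Assumed lower and upper bounds for "on the order of 100s of rows".
low_kib = estimated_storage_kib(100)
high_kib = estimated_storage_kib(1000)
print(f"{low_kib:.0f} KiB to {high_kib / 1024:.2f} MiB")
```

Under these assumptions the data set stays in the range of a few hundred KiB to a few MiB, so capacity is unlikely to distinguish between the clusters listed above.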
Notes
Concept, security, and performance reviews are required for all options.
Bridging Extension
If we opt to deploy a standalone application, we must also build and deploy a bridging extension that adapts the output of the app to MediaWiki and gives access to MediaWiki's internals, configuration, and the extensions involved in instrumentation on the Wikipedias.
For example, the extension should be responsible for embedding the output of the app in a ResourceLoader config module. If it doesn’t, then the browser must make a request directly to the app to fetch instrument configuration, which would need signoff from SRE and Performance.
On the other hand, we could build the bridge inside an extension that is already in production, EventLogging. Data Engineering owns EventLogging, so this would require a little coordination with them. However, it would be less flexible in the long term.
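The embedding step mentioned in this section could look roughly like the sketch below: the bridge takes the app's output and serializes only what the browser needs, so that the payload can be shipped inside a config module instead of being fetched from the app. It is written in Python for illustration only; the field names (`name`, `enabled`, `sample_rate`) and the payload shape are assumptions, not the app's real schema.

```python
import json
from typing import Any, Dict, List


def build_config_payload(app_output: List[Dict[str, Any]]) -> str:
    """Serialize enabled instruments into a JSON blob for embedding.

    The field names used here are hypothetical.
    """
    instruments = [
        {"name": item["name"], "sample_rate": item.get("sample_rate", 1.0)}
        for item in app_output
        if item.get("enabled", False)
    ]
    # sort_keys keeps the output stable, which helps caching of the module.
    return json.dumps({"instruments": instruments}, sort_keys=True)
```

Because the payload is produced server-side, the browser never needs to contact the app directly.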
Auth(n|z)
If we opt to build and deploy a MediaWiki extension, then authn and authz will be implemented using MediaWiki’s user rights and groups subsystem.
If we opt to deploy an app, then it must support OpenID and/or OAuth.
OpenID will allow us to authenticate and authorize users using their Wikitech account via CAS-SSO. This flow is familiar to users who have authenticated with various Wikimedia-hosted apps, e.g. Superset and Turnilo.
OAuth will allow us to authenticate and authorize users using their Wikitech account. This flow is familiar to users who have authorized tools to interact with wikis on their behalf. To authorize users, we must grant them a custom MediaWiki user right which is allowlisted in the app. Fortunately, we can define rights in the bridging extension.
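The allowlist check on the app side can be sketched as below, assuming the app learns the user's rights during the OAuth flow. The right name `instrument-configurator-admin` is hypothetical; the real right would be whatever the bridging extension defines.

```python
from typing import Iterable, Set

# Hypothetical custom user right, defined in the bridging extension.
ALLOWED_RIGHTS: Set[str] = {"instrument-configurator-admin"}


def is_authorized(user_rights: Iterable[str], allowed: Set[str] = ALLOWED_RIGHTS) -> bool:
    """Return True if the user holds at least one allowlisted right."""
    return not allowed.isdisjoint(user_rights)
```

For example, a user whose rights include the custom right is authorized, while a user with only `read` and `edit` is not.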
Purview
If we opt to build and deploy a MediaWiki extension, then there will be separate instances running on the Beta Cluster and on the production app servers. These instances will not be able to communicate with each other, so the user will have to manually stitch together the history of an instrument.
If we opt to build an app instead, we will have to make the app and the bridging extension environment-aware. For example, the bridging extension running on the Beta Cluster would be configured to request instrument configurations for the Beta Cluster, and so on.
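One way to picture the environment-aware request is the sketch below, in which each deployment of the bridging extension is configured with its environment name and passes it to the app. The base URL, the `environment` query parameter, and the set of environment names are all assumptions made for illustration.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; the real app URL is not specified in this record.
APP_BASE_URL = "https://configurator.example.org/api/instruments"
KNOWN_ENVIRONMENTS = {"beta", "production"}  # assumed environment names


def instruments_url(environment: str, base_url: str = APP_BASE_URL) -> str:
    """Build the URL the bridging extension would request for its environment."""
    if environment not in KNOWN_ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment!r}")
    return f"{base_url}?{urlencode({'environment': environment})}"
```

Because a single app serves every environment, the full history of an instrument stays in one place while each environment only receives its own configuration.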
Flexibility
In the near future, Data Products will also be deciding where to build an equivalent experiment configurator app – either a third-party solution or our own. Opting to deploy an app leaves us in a better position to explore third-party solutions to experiment configuration later with as little follow-up work as possible.
Links
Additional Comments
[Anyone can add anything that doesn’t neatly fit into the format above]