Platform Engineering Team/Data Value Stream/Data Demo September 08, 2021

Data DEMO 2021-09-08


Topic	Objectives	Presenter
First Demo	first Demo Session	Data Team

Notes/Q&A

Airflow Test System
- Links
  - Platform engineering instance
    - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow#platform_eng
  - Specs for VM
    - https://phabricator.wikimedia.org/T284934
  - Original algorithm
    - https://github.com/mirrys/ImageMatching/blob/main/algorithm.ipynb
  - Draft of refactoring efforts
    - https://github.com/clarakosi/ImageMatching/blob/refactoring/algorithm_v2.ipynb
- Can we experiment with breaking the algo down further? Moving away from a Jupyter notebook?
  - There is an underlying assumption that the researchers need to be able to change the notebook. However, it is likely inevitable that we will have to move away from it; this is likely the next step.
- Since this clearly relates to Research work, are they part of this? Are there other use cases for this? Is the goal to use this across the org?
  - Research is definitely a primary stakeholder, and you are right that there are other needs throughout the org. There is a need to play and experiment, for which a Jupyter notebook could work depending on the case, but there could equally be something that is a recurring need. For example, structured data also has this need. There is no comprehensive solution yet, but this is an area of focus.
- We need to answer who owns the code we run - this work has been largely optimised for now - this is one tool to process data. Who is going to be responsible to do this kind of work? In other places, I have seen the clash of cultures between throwaway code in order to get answers vs engineers who want to use this software in production. We have a gap between research code and what it means to be in production. How do we get to a decision making point for this?
  - Foreshadowing the discussion for tomorrow. One piece is how far we will go: what is the path to production, what are the handoffs, what are the responsibilities. We won’t have the answers yet, but we should have guidelines and frameworks. This is a key piece: ownership.
Cassandra AQS Cluster Migration
- Comment: depending on the metric you're looking at (tablestats vs. load), you may be looking at uncompressed data size: "tablestats" is an estimate of the size of data stored, while "load" is basically the size of files on disk after compression.
- With the new setup, what is the implication for things like Similarusers, which live in this space but for which we have to do horrible, horrible things to move the data around? Will there be an improvement, or not yet?
  - The current AQS Cassandra cluster lives inside the network, so nothing changes. However, if we needed k8s services to access the cluster, this would be relatively simple firewall work.
- Would it be fair to assume we could start thinking about current use cases now? Is this cluster where they would live?
  - Yep, we don’t need to wait on the cluster. It wouldn’t have made sense to add use cases to the complexity of the migration. We’re turning the cluster into multi-tenant storage, but not physically putting data onto the cluster until the migration is complete.
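
The comment above about tablestats vs. load can be illustrated with a small calculation. This is a minimal sketch: the function name and both byte counts are hypothetical, chosen only to show how the two metrics relate, and are not measurements from the AQS cluster.

```python
def compression_ratio(load_bytes: int, uncompressed_bytes: int) -> float:
    """Return the on-disk size as a fraction of the estimated uncompressed size."""
    if uncompressed_bytes <= 0:
        raise ValueError("uncompressed_bytes must be positive")
    return load_bytes / uncompressed_bytes


# Hypothetical figures, for illustration only.
tablestats_estimate = 4_000_000_000  # estimated size of data stored, in bytes
load_on_disk = 1_000_000_000         # size of files on disk after compression, in bytes

ratio = compression_ratio(load_on_disk, tablestats_estimate)
print(f"on-disk size is {ratio:.0%} of the uncompressed estimate")
```

With numbers like these, two dashboards reporting "data size" for the same table could differ by a factor of four, depending on which metric each one reads.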
AQS 2.0
- https://gitlab.wikimedia.org/eevans/aqs
- We’re lacking staging ENVs
- There is a tool built by SRE, Pontoon, that lets you build out ENVs
- Here we used Docker Compose
- Is there a way to get all the pieces in the same place, minikube for example? Is there a way to move from this middle ground toward a staging ENV?
  - This is a general problem here
  - It takes more than just this team; multiple teams need to think through it. As part of the frameworks, we want to introduce what a path to production is for the specific value streams, so we can say what we need from other teams: identifying our requirements and building out our use case for what we would need from a path to production.
    - Can we version services in production? We have to work across tech to build this out
      - How does GitLab etc. get us to a better place?
  - Are there ways we could get traffic? Is the approach still viable with changeprop? Yes, but only with restbase; it’s still a patch or band-aid, not a real testing env strategy. One area we could try out is traffic replication. Kafka is well suited to this, but it is really only useful for testing restbase.