Data Platform Engineering/Data Platform SRE/Status Update/2026-03-13
Highlights
- Kafka Upgrade: We're ready to migrate all MirrorMaker instances to Kubernetes, which will simplify the Kafka upgrade, as only Kafka itself will run on the brokers. We have written cookbooks to manage the rolling upgrade/rollback of a given cluster. We are connecting producers/consumers to kafka-test to observe their behavior pre/mid/post upgrade.
- Druid upgrade/expansion: The upgrade of druid-public is scheduled for Monday 16th at 10:30 (task T278056).
- OpenSearch on Kubernetes: Good progress on gathering ingressgateway metrics to enable SLOs. For T417187 (Import/adapt OpenSearch Operator and OpenSearch Cluster 3.x helm charts), we confirmed that the WMF OpenSearch Operator image works with the new version of the chart, and we are currently testing our OpenSearch image. Good progress also on enabling custom readahead settings for block devices serving OpenSearch on k8s.
- Spark on Kubernetes: We are adding 5 hosts with local storage to enable Spark.
- Misc: Fixed all outstanding disk issues on Hadoop and closed T415002 (Unusually high disk errors on the an-worker nodes since upgrading the disks). Now monitoring for a reduction in the failure rate.
- k8s / canary events crash: We had a complete crash of our k8s cluster, related to the Canary Events job creating too many pods. We're increasing the robustness of the system with additional limits, tuning retry strategies, and rewriting the job to make efficient use of resources (task T419457).
OpenSearch
- T408586 ☂️ OpenSearch on K8s: Ensure that our first tenant workload is ready for production ☂️ (Resolved)
- T398986 Deploy next iteration of iPoid in dse-k8s-codfw (Resolved)
- T416167 Create SLOs for OpenSearch on k8s (Duplicate)
- T416269 Create SLIs for opensearch on k8s (Duplicate)
GrowthBook
- T406580 Investigate limiting GrowthBook's access to the Data Lake (Declined)
- T405749 [EPIC Deploy GrowthBook - FY25/26 SDS 2.2] (Resolved)
User Management
- T419029 Grant Access to ops for ebernhardson (Invalid)
- T416495 Check home/HDFS leftovers of chandra-wmde (Resolved)
- T419167 Grant Mikhail access to Presto UI to help troubleshoot GrowthBook queries (Resolved)
Hardware
- T405276 Y2526 Q2:eqiad:(3) - expansion wdqs - Config Custom (Resolved)
- T419000 Degraded RAID on an-worker1205 (Resolved)
- T416166 Follow-up: Degraded Disk Not Yet Added to RAID (an-worker1175, an-worker1199) (Resolved)
- T415002 Unusually high disk errors on the an-worker nodes since upgrading the disks (Resolved)
Misc / Operations
- T419540 Make canary-events use a single airflow task per dg-run instead of one per stream (Resolved)
- T419457 dse-k8s control plane OOM (Resolved)
- T418388 Upgrade DSE k8s opensearch clusters to 3.5.0 (Resolved)
- T418380 Review the security posture of DPE SRE's Turnilo Kubernetes deployment (Resolved)
- T385551 Roll out wdqs-categories reload job to dse-k8s (Declined)
- T396501 Decide on/run a benchmark for DPE SRE-owned OpenSearch clusters (Resolved)