Data Platform Engineering/Data Platform SRE/Status Update/2026-03-13

From mediawiki.org

Highlights

  • Kafka upgrade: We're ready to migrate all MirrorMaker instances to Kubernetes, which will simplify the Kafka upgrade, as only Kafka itself will run on the brokers. We have written cookbooks to manage the rolling upgrade/rollback of a given cluster. We are connecting producers/consumers to kafka-test to observe their behavior pre/mid/post upgrade.
  • Druid upgrade/expansion: The upgrade of druid-public is scheduled for Monday the 16th at 10:30 (task T278056).
  • OpenSearch on Kubernetes: Good progress on gathering ingressgateway metrics to enable SLOs. T417187 Import/adapt OpenSearch Operator and OpenSearch Cluster 3.x helm charts: confirmed that the WMF OpenSearch Operator image works with the new version of the chart; currently testing our OpenSearch image. Good progress on enabling custom readahead settings for block devices serving OpenSearch on k8s.
  • Spark on Kubernetes: Adding 5 hosts with local storage to enable Spark.
  • Misc: Fixed all outstanding disk issues on Hadoop and closed T415002 (Unusually high disk errors on the an-worker nodes since upgrading the disks). Now monitoring for a reduction in the failure rate.
  • k8s / Canary Events crash: We had a complete crash of our k8s cluster, related to the Canary Events job creating too many pods. We're increasing the robustness of the system by adding limits, tuning retry strategies, and rewriting the job to make efficient use of resources (task T419457).

OpenSearch


GrowthBook


User Management


Hardware


Misc / Operations
