Kubernetes SIG/Meetings/2023-05-23

Agenda:

Introductions for new members (if any)
SIG administrivia:
- Nothing to report
Misc:
- Gonna strike out the Training Needs topic, objections?
- Action Items from previous meeting
  - Speak to our respective managers and get sign-off for pre-planning this upgrade.
  - Get acquainted with reading changelogs and security bulletins
  - Read https://wikitech.wikimedia.org/wiki/Kubernetes/Kubernetes_Infrastructure_upgrade_policy
Topic: Best practices for taking community helm chart and “wikimediafying” it?
- Public helm charts, not wikipedia community charts
- Logging and metrics (labels) unlikely to be compatible
- Network policies unlikely to be compatible
- Ingress support?
- [JM] We already have some of these components, but none live on the “service” side; they are all k8s cluster components, i.e. stuff that is deployed from admin_ng (e.g. flink-operator, cert-manager). There are a few other things where we have adapted the chart to our needs. For a piece of software that we/WMF/a volunteer have written, it probably makes sense to tell everyone to use our scaffolding (sextant) stuff and keep it easier to manage (for both the cluster operators and the deployers). In the case of cluster components, it makes more sense to fork and adapt to our needs.
- [BT] 2 use cases up to now. 1st one was Datahub. There was an upstream helm chart and the recommendation was “don’t bother, it will become a mess”. Went with scaffolding and, for better or worse, it works well. Deployed by normal deployers and it’s fine. 2nd one was spark-operator. The assessment of the upstream helm chart’s quality (the Datahub experience probably helped) was that it wasn’t very complete. Created it again with scaffolding; it took a long time to do and get through, but now happy with it. Upstream isn’t very active. This one is a cluster component. Would argue for forking if the quality of the upstream helm chart is good enough.
- [JM] The flink operator helm chart was way better and higher quality.
- [BT] Formalize a process where a couple of people do an assessment?
- [TK] Sorta a list of Dos and Don'ts?
- [SD] Wondering if we are missing an opportunity to collaborate with the open source community. We might be enforcing antipatterns if we are forking. We are also possibly siloing ourselves. Merge our changes upstream? The community sees what we are doing, and if we meet resistance it might be a sign of an antipattern. Similarly, if we start from scratch, does it make sense to upstream the helm chart and see if it benefits the open source community in general?
- [AO] We are trying to do this with the flink-operator, it depends on the community of course. Helm charts, possibly due to being templates, appear to be easier to maintain and bring up to date.
- [JW] So far, deployers mostly haven’t needed upstream helm charts; it is the cluster operators who have had that use case up to now.
- [JM] The above is absolutely correct. Re: Stef’s comment, we have seen both our changes being merged upstream and cases where we know we have anti-patterns. Network policies are one such case: no upstream charts have netpols, especially not in both directions. It might be because no one is firewalling off applications.
- [TK] I suspect many rely entirely on off-cluster netfiltering
- [JM] Datahub is the exception: a service outside of admin_ng with an upstream chart available
- [CD] OpenTelemetry will go into admin_ng, it’s a service for SREs
- [SD] CI is using upstream charts in the CI environment. E.g. using the upstream MariaDB chart because it’s super easy. Pre-patch production autoscaling <insert name>. Wouldn’t go to production with those.
- [JW] Gitlab-runner chart as well.
- [AK] Evaluate upstream helm-chart case by case for new services and spend ~1 day to evaluate quality
- [JM] Assess the upstream chart and see how complicated it is to re-write from scratch. It depends heavily on size and the nature
- [TK] It’s a spectrum, not necessarily a binary decision. Upstream might not necessarily be happy with our changes. The receptiveness of upstream and the usability of the chart are guiding factors
- [SD] Playbook, not a policy. Be flexible, recognize that helm chart is a new space and we might hit friction
- [AK] Action Item: Alex to write a one pager, to be reviewed by the group, roughly defining the decision to support using upstream charts, provided they go through a playbook-based assessment process. Key areas to assess (in no specific order):
  - Receptiveness of upstream
  - Reusability of chart
  - Overall quality of chart
  - Amount of work it would take to bring the upstream chart up to the level of quality we want to strive for.
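
Note on the netpol point above: for anyone assessing a chart, this is roughly what a network policy "in both directions" looks like. It is only a minimal sketch in Python that builds a standard networking.k8s.io/v1 NetworkPolicy manifest with both Ingress and Egress policy types. The function name, the "app" label, the ports and the CIDR are hypothetical placeholders, not values from our charts or from the sextant scaffolding.

```python
import json


def both_direction_netpol(name, app_label, ingress_port, egress_cidr, egress_port):
    """Build a NetworkPolicy manifest that restricts ingress AND egress.

    All argument values are illustrative assumptions; a real chart would
    template these from its own values file.
    """
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": name},
        "spec": {
            # Pods the policy applies to (hypothetical "app" label).
            "podSelector": {"matchLabels": {"app": app_label}},
            # Listing both types means traffic in each direction is
            # denied unless a rule below allows it.
            "policyTypes": ["Ingress", "Egress"],
            # Allow inbound TCP on the service port from any source.
            "ingress": [
                {"ports": [{"protocol": "TCP", "port": ingress_port}]}
            ],
            # Allow outbound TCP only to one CIDR and port.
            "egress": [
                {
                    "to": [{"ipBlock": {"cidr": egress_cidr}}],
                    "ports": [{"protocol": "TCP", "port": egress_port}],
                }
            ],
        },
    }


if __name__ == "__main__":
    manifest = both_direction_netpol(
        "example-app", "example-app", 8080, "10.0.0.0/8", 3306
    )
    # kubectl accepts JSON manifests as well as YAML.
    print(json.dumps(manifest, indent=2))
```

An upstream chart that ships nothing like this (or ingress rules only) would need the egress side added before it fits our clusters, which feeds into the "amount of work" criterion.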