Wikimedia Cloud Services team/goals/2023-24 Q1

Availability

 * Komla out all quarter
 * Arturo out all of August
 * David out for half of July and one week of August

Goals

 * Toolforge
 * Continue work on Envvars service (David)
 * https://phabricator.wikimedia.org/search/query/WVN9f5IZgGkm/
 * Toolforge build service beta round 2 (David)
 * https://phabricator.wikimedia.org/T335249
 * NFS-free webservice (David/Taavi)
 * Maybe https://phabricator.wikimedia.org/T334081
 * Migration of projects from the grid (mostly communication, not active hand-holding this quarter)
 * https://phabricator.wikimedia.org/T267374
 * Continuous delivery (GitLab pipelines) (Arturo/David)
 * https://phabricator.wikimedia.org/T341084
 * Organize a Toolforge-wide workgroup to replace subtask-specific workgroups (David)
 * https://wikitech.wikimedia.org/wPortal:Toolforge/Ongoing_Efforts/Toolforge_Build_Service_Ongoin_Effort_page
 * Reduce and define the UI surface of the Toolforge platform via CLI and API definitions (David + Taavi + Arturo)
 * https://phabricator.wikimedia.org/tag/toolforge/ ?
 * Kubernetes version updates (Taavi + Arturo)
 * https://phabricator.wikimedia.org/T298005
 * Superset
 * Continue operational exploration and responding to feedback (Rook)
 * https://phabricator.wikimedia.org/tag/superset.wmcloud.org/
 * Cloud-vps + infra
 * OpenStack upgrade to A (Francesco + Andrew)
 * https://phabricator.wikimedia.org/T341285
 * Magnum
 * Phabricator topic: https://phabricator.wikimedia.org/tag/openstack-magnum/
 * Experiment with magnum-ui and decide whether it is useful in its current state (Rook + Andrew)
 * https://phabricator.wikimedia.org/T328711
 * Automated testing (Rook and/or Andrew)
 * Prepare service for non-wmcs users (Rook + Andrew)
 * https://phabricator.wikimedia.org/T328712
 * Trove
 * Automated testing (Rook and/or Andrew)
 * Investigate possible PostgreSQL improvements (Francesco)
 * https://phabricator.wikimedia.org/T337396
 * Continue to shepherd upstream fixes (Andrew)
 * https://review.opendev.org/c/openstack/trove/+/875262?usp=dashboard
 * https://review.opendev.org/c/openstack/trove/+/869511?usp=dashboard
 * NFS
 * Continue to improve/stabilize/expand cinder backups (Andrew)
 * https://phabricator.wikimedia.org/T292546
 * Review/improve observability (Andrew + Taavi)
 * https://phabricator.wikimedia.org/T333477
 * Network redesign
 * Migrate Swift to the new network setup in codfw1dev (Arturo + Andrew)
 * https://phabricator.wikimedia.org/T338937
 * Design eqiad implementation (Arturo)
 * https://phabricator.wikimedia.org/T341060
 * Ceph
 * Host OS upgrades (Francesco + Raymond)
 * https://phabricator.wikimedia.org/T309789
 * Ceph version upgrade, after the host OS upgrades (David + Raymond)
 * https://phabricator.wikimedia.org/T306820
 * Multirack HA (David)
 * https://phabricator.wikimedia.org/T297083
 * Team infra
 * Cumin
 * Finish work on dedicated cumin exec hosts (Francesco)
 * https://phabricator.wikimedia.org/T319401
 * https://phabricator.wikimedia.org/T325067
 * Observability
 * Continue to improve VM + Prometheus + Alertmanager integration (Taavi)
 * Move more alerts from Icinga to Alertmanager (Taavi)
 * Enable silences from alerts.w.o for metricsinfra alerts (depends on the o11y team's Alertmanager upgrades)
 * Hackathon Support (Slavina)
 * Documentation (Tricia, support from Francesco)
 * https://phabricator.wikimedia.org/T327319