Wikimedia Release Engineering Team/Deployment pipeline/2018-12-20

Last Time

 * 2018-11-08
 * Archive

General

 * "I survived another meeting that could have been an email"
 * Strive for this not to be true
 * Sometimes it is
 * Let's be bold about skipping (but let's have an email version instead)


 * topic: discuss Beta aka deployment-prep and k8s
 * (couldn't find task that tracks this)
 * but we have a patch instead: https://gerrit.wikimedia.org/r/c/operations/puppet/+/478637
 * Marko: is beta important? If so something should be done. Have run into this since the last meeting
 * Joe: I would like to move to a proper staging environment; do things have to run in beta? Probably not, but sometimes they are needed
 * Marko: A higher percentage of the Puppet code we use in production will become obsolete, or maybe won't be in Puppet at all
 * Joe: whatever is needed to test a mediawiki extension is probably needed there (for services)
 * Joe: hiera to run this image, use this config, etc. Want to avoid setting up a k8s cluster to run in beta that is different than production.
 * Marko: try this next quarter for eventgate
 * Joe: I want to try with mathoid Soon™


 * Track and install additional npm packages for all service container images
 * SRE nodeX base image in the operations base image repo
 * Joe: gc-stats?
 * Marko: used for sending stats
 * Dan: There's another way to do this with Blubber that doesn't involve relying on a "custom" docker-pkg base image
 * Upside: services gain more ability to make their own changes
 * Downside: lots of Blubber file duplication
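As a rough illustration of the Blubber-based alternative, a per-service config might look something like this (a sketch only, assuming Blubber v4 syntax; the base image name, paths, and variant layout are hypothetical, and a package like gc-stats would come from the service's own package.json rather than a shared base image):

```yaml
# Hypothetical per-service .pipeline/blubber.yaml -- every name here is
# illustrative, not a real production config.
version: v4
base: docker-registry.wikimedia.org/nodejs10-slim  # assumed base image name
apt:
  packages: [build-essential]  # native npm modules (e.g. gc-stats) need a compiler
variants:
  build:
    node:
      requirements: [package.json, package-lock.json]
  production:
    copies: [build]
```

This is the duplication noted as the downside: each service repeats a block like this instead of inheriting it from a custom docker-pkg base image.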


 * Allow access to blubberoid.discovery.wmnet:8748
 * Summary so far:
 * Use Cases: local development, CI, Pipeline building prod images
 * Dan: single deployment for developers and CI and prod unifies environments (due to things like policy files [not currently in use, but is useful])
 * Alex: WMCS can't talk to wmnet, so opening to WMCS == opening to everyone
 * Alex: Blubber as a Service (BaaS) works counter to unified tooling because it neglects offline/low-bandwidth use-case
 * Dan: I don't see how the service model works counter to unifying but perhaps it works counter to an offline dev-env requirement that we haven't named. That's fine but we shouldn't conflate the requirements
 * Joe: people download and install so much untrusted binary garbage from GitHub; we can distribute binaries for Linux/Windows/macOS quite easily, I think?
 * Thcipriani: FWIW, we do have garbage binaries built via the `make release` target in the repo, currently posted on my people page, unfortunately: https://people.wikimedia.org/~thcipriani/blubber/
 * Lars: it would be good to avoid perpetuating bad security practices? Sure, that wasn't my point :)
 * Joe: I wouldn't point developers to BaaS, but it could be exposed publicly -- low potential for abuse
 * Dan: I don't see much potential for abuse either
 * liw: provides means to overload CPU, but maybe k8s policies can prevent this
 * alex: we have policies already (1800 millicores is blubber's limit -- max found via testing with Jeena)
 * alex: I worry that BaaS becomes critical to the tooling due to networking problems for developers -- out-of-date policy files, out-of-date Blubber
 * fselles: could commit output from blubberoid into some repo
 * joe: could generate lots of variants from one blubber file; I think we could tell folks to download the binary from gerrit
 * Joe: I worry about a tool that creates images for the k8s cluster being dependent on the k8s cluster -- maybe we should run Blubberoid in its own container -- I need to think this through
 * fselles: we have 2 clusters; also, we should trust Blubberoid
 * Joe: redundancy probably means this is OK
 * compromise: use blubber for local development, and blubberoid for CI
 * TASK: releases should be updated automagically
 * EPIC TASK: for developer tooling to keep track of this discussion
 * Alex: we know the components of the developer tooling, but we don't know how those will fit together yet
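The CPU guardrail alex mentions can be expressed as an ordinary Kubernetes resource limit on the Blubberoid container. Only the 1800 millicore figure comes from this discussion; the rest of this pod-spec fragment is an illustrative assumption:

```yaml
# Illustrative pod-spec fragment; only the 1800m cap is from the meeting.
containers:
  - name: blubberoid
    resources:
      requests:
        cpu: 500m    # assumed request value
      limits:
        cpu: 1800m   # Blubberoid's limit, max found via testing (alex and Jeena)
```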

RelEng

 * Initial production image build fails helm test
 * just check for .pipeline/helm.yaml
 * thcipriani: ooooh...
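A minimal sketch of the proposed guard, assuming the pipeline can shell out before invoking Helm (the wiring and messages here are assumptions; only the `.pipeline/helm.yaml` path is from the notes):

```shell
# Sketch: run `helm test` only when the repo ships a Helm test config.
should_run_helm_test() {
  [ -f .pipeline/helm.yaml ]
}

if should_run_helm_test; then
  echo "found .pipeline/helm.yaml; running helm test"
  # helm test "$RELEASE"  # assumed invocation; depends on pipeline setup
else
  echo "no .pipeline/helm.yaml; skipping helm test"
fi
```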


 * Cleaning old image tags (confuses version sort): https://people.wikimedia.org/~thcipriani/docker/wikimedia/mediawiki-services-mathoid/tags/ ?
 * Currently no way to delete images on the registry

Serviceops

 * TEC3 goal posted by Mark
 * Lots of services for next quarter
 * ORES is going to consume some time


 * changeprop/cpjobqueue at least a month apart?
 * Marko: need some clarification; I don't think that's doable. We need the same version of the Kafka driver, and since these share a repo, I'm not sure how to use Node 6 and Node 10 with the same driver version
 * Joe: cpjobqueue is scary to move (we can only handle a few minutes of outage for that service). If we need to stagger these repos we could maybe use the same deploy repo
 * Joe: we could maybe use git branches, or something, for a short period: we shouldn't migrate both at the same time
 * Marko: we need heuristics for resource allocation for these services
 * Alex: both of these are hard to benchmark
 * fselles: try to assign similar resources and adjust using monitoring
 * Joe: I think we're not limited to 1 process per pod
 * Alex: we do not want to use ncpu

Services
= As Always =
 * Release Pipeline Workboard
 * Meeting notes