Wikimedia Release Engineering Team/Deployment pipeline/2019-01-29

This discussion took place at the 2019 WMF All-Hands at the Bentley Reserve.

Last Time

 * 2018-12-20
 * Archive

Current Quarter Goals

 * TEC3:O6:O:6.1:Q3: Deployment Pipeline Documentation
 * TEC3:O3:O3.1:Q3: Move cxserver, citoid, changeprop, eventgate (new service) and ORES (partially) through the production CD Pipeline

General

 * ideal flow through the pipeline from the developer perspective and/or mapping the current flow from project inception to production.
 * More concrete idea: talk about what's missing for CDep Blubberoid to production
 * Alex: the problem with option B is that it changes things drastically
 * Joe: MVP for developers so we need to provide a decent experience, focus on what that is in 2 months time
 * Greg: coming up with an ideal workflow depends on what the endpoint is, i.e., a CDep is going to be a different experience
 * Joe: How many interactions do people need to have in order to get something through the pipeline. Right now it's coming to us to figure out what to do.
 * Greg: Avoid the cargo culting.
 * Tyler: This dovetails with conversation at the RelEng offsite.  How many points of contact and how can we reduce that?  Bringing in a bigger view would be useful.

Summarizing RelEng exercise:
 * Ideal next step:  Like toolforge but for production.
 * Dev requests a project that goes into a proper namespace on Gerrit.
 * Sets up CI, etc.
 * On repo creation adds a dotfile that configures pipeline.
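
A minimal sketch of the dotfile idea above, assuming the dotfile lives at .pipeline/config.yaml (the notes mention both .pipeline/config and .pipeline/config.yaml). The function names are hypothetical and this is not how the existing CI jobs are wired up; it only shows the "a config file is present, so set up the pipeline" check.

```python
from pathlib import Path

# Assumed dotfile location; the notes use both ".pipeline/config"
# and ".pipeline/config.yaml".
PIPELINE_CONFIG = Path(".pipeline") / "config.yaml"


def has_pipeline_config(repo_root):
    """Return True if the checkout at repo_root contains the pipeline dotfile."""
    return (Path(repo_root) / PIPELINE_CONFIG).is_file()


def repos_needing_pipeline(repo_roots):
    """Hypothetical helper: list the checkouts that should get pipeline jobs."""
    return [str(root) for root in repo_roots if has_pipeline_config(root)]
```

A seed job could run a check like this over new repositories and create pipeline jobs only for those that opt in by adding the file.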

Discussion:
 * Dan: What's correct form of feedback for a developer?
 * Alex: Gerrit is the thing that developers interact with, so that should be the thing that users interact with, we shouldn't make developers click through to several sites
 * Joe: This is a common problem with how we report feedback to Gerrit, but the amount of indirection means this is a bigger problem than it actually is. There's more interaction and it's a more complicated set of jobs.
 * Mukunda: Deciphering console output is a mess.
 * Joe: Summary: creation of a pipeline should be automatic as soon as someone puts a .pipeline/config into their repository. Feedback from the pipeline needs to be better. Not have the link many pages down in gerrit.
 * James: the "standard" is GitHub: you get a comment from a bot, you click that, you see Travis, you see a red X, you click that, you read that.
 * Travis output isn't that great either, basically.
 * Alex: for the failure scenario it's fine to send people down a deep path, but in a success scenario we need something simple
 * Joe: Do we publish an image for each successful merge? (Yes)
 * Alex:  We publish for each successfully merged commit.
 * Joe: for whatever we merge we should get back the url of the artifact for the image
 * Lars: if you change the interface so that the link to the artifact is in the metadata area (???)
 * Dan: What's the MVP for a feedback mechanism in the short term?
 * Alex: docker image plus version, also nice to have a link to the entire pipeline state so you know it's step 1 vs step 2.
 * Alex: In Gerrit you know where you are in the process of getting it deployed. "I'm in step 2 of 5 steps to production"
 * Dan: adding another label to Gerrit would be simple, like an "image built" label with links to the Docker registry URL
 * Tyler: Summary of the ideal workflow now:
 * How do you request a project currently? A task. For now keep that for the MVP.
 * Somehow get the URL of the image into the Gerrit UI on successful build, and a link to the successful run
 * QUESTION: do we want to change the image creation process?
 * sidenote: no image per patchset :)
 * Joe: retention of our (old?) images needs an answer.
 * Joe: if we move to CDep it'll be impossible to store everything; for most things moving in that direction, keep the latest N versions
 * Joe: questions of the workflow in the CI pipeline
 * a developer wants to build a nodejs project in the pipeline: are there things I need to do that are different here than what I used to do?
 * Mukunda: not much, just the .pipeline config
 * Dan: the blubber config has the entry point
 * Lars: we give a number of options to choose from, otherwise we end up with 100 projects copy/pasting but there turns out to be an issue so we have to upgrade them all
 * Joe: ... less free blubber templates...
 * Dan: blubber has proven to be flexible, which is good, without much modification at all. The importance of explicitness and tie in entrypoints/dependencies. Hesitant to make it more constrained than it is.
 * James: we have CI entry points across 2000 repos, we have bots to sync them together, not too worried about it being c/p and fix it later.
 * Lars: I'm convinced ^
 * Joe: there is a value in using containers so that developers are contained :)
 * Joe: would it be possible for someone to build their blubber image starting from an image not in our registry
 * Tyler/Dan: yes
 * Tyler: however we have a policy file that was built for this scenario
 * Joe: let's make it clear so that CI uses that policy file
 * Dan: the pipeline job references it, which is centrally located (and away from developers ;) And you can get really specific with it.
 * Lars: we can make it do what we need. We should allow our developers to do something useful without constraining them too much.
 * Lars: For an MVP of CDep, we need to get it started and then iterate.
 * Joe: we just want to build images that start from ones we (bless)
 * Lars: we need to know what versions of what is in each image
 * Joe: that will be a part of debmonitor (as planned)
 * Fabian: sometimes updates have to be done; what happens when we update Debian? We need to figure out the underlying services
 * Joe: we will know because when we build an image through the pipeline we submit it to a thing that analyzes it with debmonitor. How do we update those images after building? TBD.
 * Tyler: we have a task about mass rebuilding all the images
 * Tyler: to answer your nodejs developer question:
 * https://wikitech.wikimedia.org/wiki/Blubber
 * https://wikitech.wikimedia.org/wiki/Blubber/Tutorial/HelloWorld
 * Joe: James needs to convince audiences to migrate to it
 * Joe: from SRE's side, what does a developer need to do..
 * Alex: you get your image, you're happy, the pipeline deployed it to CI staging...
 * ssh deploy1001, scap-helm, 100 lines of bash, give it an image version, it deploys, to eqiad, or staging
 * the user interface includes setting things via ENV variables
 * moriel has already used it herself
 * it's ugly UI
 * currently evaluating replacing it with helmfile ( https://github.com/roboll/helmfile )
 * TODO pipeline should use helmfile
 * things devs can't do: LVS, DNS, etc
 * Lars: there are a few review points, e.g.: does this project make any sense for Wikimedia? Does it need a security review?
 * Joe: how it's done now ^
 * Lars: not just security but also SRE, design an implementation that's suitable for production
 * James: will the helmfile configs be in the repo itself or somewhere else? pros and cons...
 * Joe:
 * Alex: operations/deployment-charts
 * Joe: to get into production
 * create a helm chart via the scaffold script in the deployment-charts repo
 * review from SRE, setting up DNS, load balance it
 * Greg's arms are getting tight... slowing down with note taking
 * thcipriani picks up the baton!
 * Dan: We could make this part of our setup skaffold project, i.e., filing a task, what the skaffold script creates is probably confusing for newcomers
 * James: how often are people going to do this?
 * Joe: if we make the process good enough then we would probably see more services being created, but I think making a few requests is a Good Thing
 * Lars: in 1996 someone wrote a packaging helper and we went from a very small amount of packages to 600 packages
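
Joe's suggestion in the discussion above to keep only the latest N versions of an image can be sketched as follows. This is a minimal illustration under stated assumptions: the input shape (a list of (tag, build_time) pairs) and the function name are hypothetical, and nothing here talks to a real registry.

```python
def split_for_retention(images, keep):
    """Split images into (kept, prunable), newest first.

    images: list of (tag, build_time) pairs; build_time is any sortable value.
    keep:   number of most recent images to retain (the "N" in "latest N").
    """
    if keep < 0:
        raise ValueError("keep must be non-negative")
    # Sort newest first, then keep the first N and mark the rest as prunable.
    ordered = sorted(images, key=lambda item: item[1], reverse=True)
    return ordered[:keep], ordered[keep:]
```

Returning both halves rather than deleting anything keeps the decision reviewable: a cleanup job could log the prunable list before acting on it.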

Beta

 * Joe: has a solution, running an image in docker
 * Antoine: devs want to test stuff in beta with an updated service image in staging, and ask the backend to do their thing
 * Joe: open staging to public internet
 * Alex is sad about that idea
 * Dan: deploy to the service namespace automatically as part of the pipeline
 * Tyler: Can we have a k8s in labs for labs use
 * Alex: BGP, LVS, Calico -- none of these things exist in labs
 * Exposing staging to beta cluster would require staging to be open to the public internet
 * TODO we'll need some way to update this automagically in beta...restart and pull

next steps

 * Lars: If Dan and Lars want to do this, what do we do next?
 * Joe: we're a bit behind on SRE side
 * Lars: not the general person, me and Dan :)
 * Joe: oh ok :)
 * Alex: missing a token in production
 * Dan: try to include the reporting back to Gerrit (image URI etc)
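
As a rough sketch of the feedback MVP discussed earlier (Docker image plus version, where the change is in the pipeline, and a link to the run), the text reported back to Gerrit might be assembled like this. The message layout, the function name, and the example values are all assumptions for illustration; posting the message to Gerrit is out of scope here.

```python
def format_build_feedback(image, tag, step, total_steps, run_url):
    """Build a short, human-readable pipeline summary for a code review comment."""
    if not 1 <= step <= total_steps:
        raise ValueError("step must be between 1 and total_steps")
    return "\n".join([
        f"Image built: {image}:{tag}",
        f"Pipeline progress: step {step} of {total_steps}",
        f"Pipeline run: {run_url}",
    ])


# Example with placeholder values (not real registry or CI URLs).
print(format_build_feedback(
    "registry.example/my-service", "2019-01-29-abc123",
    2, 5, "https://ci.example/job/my-service/42/",
))
```

Keeping the success message to three lines follows Alex's point that the success path should be simple, with the deeper link reserved for when something fails.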

Other questions

 * How will you know what is deployed?
 * How will you troubleshoot logs?
 * How do you troubleshoot deployments?
 * How do we rollback?

TODOs

 * TODO: write blubber policy to ensure that we're using only wmf base images
 * TODO: file task to automagically create job from seed job
 * TODO: file task about automagic setup of pipeline on .pipeline/config.yaml creation
 * TODO: continuous deployment, what's missing? a k8s api token on contint1001
 * TODO: support documentation like the one tyler did for the portal and pipeline/helmfile and deployment
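
The first TODO (only allow WMF base images) could be prototyped as a plain check like the one below. This is not Blubber's policy file format; it is a hedged sketch that assumes a parsed config is a dict with an optional top-level "base" and per-variant "base" keys, and that blessed images share the registry prefix shown. The prefix, the image names in the example, and the function names are assumptions.

```python
# Assumed prefix for blessed base images; adjust to the real policy.
ALLOWED_PREFIX = "docker-registry.wikimedia.org/"


def base_images(config):
    """Return (location, image) pairs for every base image the config names."""
    found = []
    if config.get("base"):
        found.append(("base", config["base"]))
    for name, variant in (config.get("variants") or {}).items():
        if variant and variant.get("base"):
            found.append((f"variants.{name}.base", variant["base"]))
    return found


def policy_violations(config, allowed_prefix=ALLOWED_PREFIX):
    """List the config locations whose base image is outside the allowed registry."""
    return [
        location
        for location, image in base_images(config)
        if not image.startswith(allowed_prefix)
    ]


# Example config with illustrative image names.
example = {
    "base": "docker-registry.wikimedia.org/nodejs-slim",
    "variants": {"test": {"base": "docker.io/library/node"}, "production": {}},
}
print(policy_violations(example))
```

Because the check runs in the centrally located pipeline job rather than in each repository, developers cannot bypass it by editing their own config.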

Services

As Always

 * Release Pipeline Workboard
 * Meeting notes