Wikimedia Release Engineering Team/Deployment pipeline/2019-03-14

= 2019-03-14 =

Last Time

 * 2019-02-28
 * Archive

Current Quarter Goals

 * Roughly 2 weeks left!


 * TEC3:O6:O:6.1:Q3: Deployment Pipeline Documentation


 * TEC3:O3:O3.1:Q3: Move cxserver, citoid, changeprop, eventgate (new service) and ORES (partially) through the production CD Pipeline
 * cxserver
 * Images built via deployment pipeline
 * Namespaces created for k8s eqiad/codfw
 * 8% of traffic (evidently as of 3 hrs ago :))
 * Plan is to finish Tuesday moving the remainder of traffic!


 * ✅ citoid
 * Images built via deployment pipeline
 * Deployed


 * changeprop
 * Should we bump this?
 * marko: we have to fix the Kafka driver depending on the Node version and Kafka version: how will we handle different versions of different things?
 * alex: side-step the problem and build the image with node6


 * ✅ eventgate
 * Image built via pipeline
 * Chart
 * Deployed


 * ✅ (for this quarter I'd guess?) ORES
 * cf: Dan's comments

Services to migrate

 * cpjobqueue
 * marko: can use the node6 image, but scaling is still a problem: sometimes it uses a lot of resources, sometimes it does nothing. I worry about scaling. How do we determine the resources needed so that the service doesn't starve?
 * jeena: are we against autoscaling?
 * alex: autoscaler is not yet deployed. Could we work from current scb capacity?
 * marko: will have to continue conversation about number of workers per pod -- we don't want 100 pods, nor do we want to have 1 pod that is massive, so we'll have to find a balance
 * liw: are there means to perform benchmarks and capacity tests?
 * marko: we know current resource usage
 * alex: we have ways to perform benchmarks (jeena used that for blubberoid), but in this case we have prod services already
 * marko: the most important thing is to get everything correct for when surges happen
 * alex: I think we can accommodate it; we're adding more capacity next quarter, and we can also add more pods as needed. That provides more flexibility than the current environment
 * marko: still manual
 * alex: yes, manual, but we don't have any way to scale currently, so this is an improvement
 * marko: worst-case scenario is that cpjobqueue would "just" begin to lag
 * ORES
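The workers-per-pod balance marko raises is, at its core, simple arithmetic: too few workers per pod means pod sprawl, too many means one massive pod. A minimal sketch (the capacity figure of 48 workers is illustrative only, not real cpjobqueue/scb numbers):

```python
import math

def pod_count(total_workers: int, workers_per_pod: int) -> int:
    """Pods needed to run total_workers at a fixed per-pod worker count."""
    return math.ceil(total_workers / workers_per_pod)

# Illustrative trade-off: same total capacity, different pod shapes.
for per_pod in (1, 4, 12, 48):
    print(f"{per_pod} workers/pod -> {pod_count(48, per_pod)} pods")
```

The balance point would come from real resource usage data, which marko notes we already have for the current deployment.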

New Services

 * mobrovac: RESTBase?
 * marko: for next quarter (Q4) we want to split RESTBase into two services: an API routing layer and a storage layer (current thinking) -- storage on the Cassandra nodes (where RESTBase is now), API routing on k8s


 * alex: termbox (wmde) -- renders javascript for wikidata; session storage for CPT -- moves sessions into cassandra; Discourse for Quim

General

 * Install heapdump and gc-stats when env production
 * tl;dr: installing Node dependencies into images is hard, and we're trying to figure out where to do that
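One possible shape for this, sketched as a build-time hook (the conditional-install mechanism here is an assumption for illustration, not the actual Blubber behaviour; heapdump and gc-stats are the modules named above and are native addons, so they have to be compiled inside the image):

```shell
# Hypothetical sketch: pull in Node profiling modules only for
# production images. The NODE_ENV guard is an assumed mechanism,
# not how Blubber actually selects variants.
NODE_ENV="${NODE_ENV:-development}"
if [ "$NODE_ENV" = "production" ]; then
  EXTRA_DEPS="heapdump gc-stats"
  # npm install --no-save $EXTRA_DEPS   # would run during image build
else
  EXTRA_DEPS=""
fi
echo "extra deps: ${EXTRA_DEPS:-none}"
```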


 * QUESTION: what is the plan with "evaluation environments"?
 * https://docs.google.com/document/d/1QU_6Svn4iduK0TPLSOghYP4g1lK-byCv-0ZKoHfIAVY/edit#heading=h.6gq2j7lm5pz8
 * Tyler: Is this something happening in the near term / something RelEng should be involved in?
 * Alex: What is that?

TODOs from last time

 * TODO various attack vectors document to start
 * antoine and I started to talk about it
 * thcipriani to more thoroughly noodle


 * TODO: support documentation like the one Tyler did for the portal and pipeline/helmfile and deployment
 * https://wikitech.wikimedia.org/wiki/Deployment_pipeline now exists, https://wikitech.wikimedia.org/wiki/Continuous_Delivery has been deleted.


 * TODO: Joe & James_F to work on eventual 2019-04-01 email
 * Beware: announcements on 04/01 can be mistaken for an April Fools' joke


 * ✅ TODO: improve feedback from pipeline -- link to actual failing job, show images, and tags as applicable
 * still no feedback for git tags https://phabricator.wikimedia.org/T177868#4984766
 * tags also currently "failing", i.e., the run of test-and-publish fails (due to not being able to comment), but test and publish actually succeeds
 * image names point to the internal registry
 * might be nice to vote on a label
 * failure feedback is much improved IMO
 * https://gerrit.wikimedia.org/r/c/mediawiki/tools/scap/+/495398/#message-63fc709bee82599ca720c5c293802587f1a9800d

RelEng

 * Dan starting work on .pipeline/config.yaml
 * The pipeline should provide a way to save artifacts from a stage
 * .pipeline/config.yaml Proposal The Latest™
 * marko: how do services relate to the blubber.yaml?
 * dan: you could use the same Blubber file if you want, or specify a separate file if that makes sense. I want to have sensible defaults for these things, but if you do have special requirements you should be able to specify those and control the execution and steps in the pipeline. You can specify variants that are built and run in parallel in addition to the sequential steps of the pipeline.
 * marko: if I have one service and I want to use this to tell jenkins what to do that could also be done?
 * dan: yep. This has come up since we have people who want to run helm test, but don't want to deploy to k8s. There are other use-cases that want test variants but not run helm test. This allows folks to specify which parts of the pipeline execute and in which order
 * brennen: what happens if we wind up with CI tooling that conflicts with this?
 * dan: what we have now is written in groovy so we'll have to refactor unless we move to jenkins x -- it's possible that this could be a benefit -- perhaps there could be a translation layer
 * hashar: the groovy is very minimal at this point so should be easy to refactor -- let's migrate every year to ensure that we keep our code to a minimum! Point taken on potential of creating the next tech debt though.
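For concreteness, a hypothetical sketch of what a .pipeline/config.yaml could look like under this kind of proposal. Field names here are guesses for illustration only; see the linked proposal doc for the real schema:

```yaml
# Hypothetical sketch -- not the actual proposal's schema.
pipelines:
  test:
    blubberfile: blubber.yaml   # or a service-specific Blubber file
    stages:
      - name: unit              # variants can be built/run in parallel
        build: test
        run: true
      - name: publish
        build: production
        publish:
          image: true
# helm test could then be an opt-in stage rather than always running,
# addressing the use-case dan mentions above.
```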


 * Tyler: We're migrating stuff to v4 of Blubber.


 * Jeena: Talking with Greg about local dev environment, we're working on the mediawiki part whereas pipeline is working on services. However:  Seems like it's not really useful for developers if they can't run services in the local env.  We've been adding services like RESTBase and parsoid; Greg also mentioned Zotero.  These aren't classified as a priority to move to the pipeline for various reasons.  For example RESTBase.
 * marko: you can use SQLite for RESTbase.
 * Jeena: So there's not going to be an image built in near future...
 * marko: Shouldn't be too much of an issue. The task becomes repetitive.
 * Jeena: My thought was: we're not officially putting them into k8s / the prod pipeline... Is it ok if we build images in the pipeline that aren't going to production?
 * marko: We do serve the images... Well, we could.
 * Maybe could use different tags, create them semi-manually for the transition period. (similar to https://github.com/wikimedia/mediawiki-containers )
 * alex: depends on the service we're talking about. RESTBase and parsoid? moving to the pipeline is a saner approach than building manually
 * dan: we could still use the same process
 * it's going to depend on the service whether or not we put things through the pipeline. I.e. some services (MediaWiki) are not going to fit through the pipeline currently and in those instances we'll have to build manually (e.g., with docker-pkg)


 * Antoine: we should track which versions of Debian packages are in which container image -- weren't we talking about a system for this? This is going to be an issue soon. How do we know which images need to be rebuilt for an update?
 * alex: adding support for this to debmonitor, but it is not resourced. We want to do *exactly that*. We're writing an image lifecycle document inside serviceops
 * hashar: if you have documents I'd be happy to read them
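The kind of tracking Antoine describes can be sketched quite simply: capture each image's package listing (e.g. from `dpkg-query -W -f '${Package} ${Version}\n'`) and compare it against pending updates. A hedged sketch with made-up image names and versions (a real version comparison should use `dpkg --compare-versions` rather than string inequality):

```python
def parse_dpkg_listing(text):
    """Parse 'package version' lines (dpkg-query -W output) into a dict."""
    pkgs = {}
    for line in text.strip().splitlines():
        name, version = line.split(maxsplit=1)
        pkgs[name] = version
    return pkgs

def images_to_rebuild(image_listings, security_updates):
    """Return images holding a package that differs from the fixed version.

    Uses string inequality for simplicity; real code should do a proper
    Debian version comparison.
    """
    stale = []
    for image, listing in image_listings.items():
        pkgs = parse_dpkg_listing(listing)
        for pkg, fixed in security_updates.items():
            if pkg in pkgs and pkgs[pkg] != fixed:
                stale.append(image)
                break
    return stale

# Illustrative data only -- not real image contents.
listings = {
    "cxserver:2019-03-14": "libssl1.1 1.1.0j-1\ncurl 7.52.1-5\n",
    "citoid:2019-03-14": "libssl1.1 1.1.0k-1\n",
}
updates = {"libssl1.1": "1.1.0k-1"}
print(images_to_rebuild(listings, updates))  # → ['cxserver:2019-03-14']
```

This is essentially the gap alex mentions debmonitor could fill once image support is resourced.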

Services

As Always

 * Release Pipeline Workboard
 * Meeting notes