Wikimedia Release Engineering Team/CI Futures WG/Meetings/2019-07-18

From mediawiki.org

Purpose

Let's chat about where we are and what we can/need to do in the near term.

The instigation for this meeting is the Zuul v2 / Python 2 EOL; we need to determine where we are in the relevant timelines.

See also: https://docs.google.com/spreadsheets/d/1TrkGTfPLR0C74va3XyY6faYplSh6UggGiPdmxIVm1uo/edit#gid=0

Notes

gitlab update

  • gitlab and one runner, with ansible
  • 3 components to address CI Arch
    • VCS worker fetches components
    • artifact store, outside of gitlab due to not being able to figure out gitlab's release storage
    • test environment, just publishes the artifacts
    • missing 4th component: a controller that's triggered by something, tells the VCS worker to fetch, and tells the deployment worker to pull from the artifact store and push to the test env.
      • to hide gitlab behind it
  • Process:
    • Pushing to gitlab triggers commit stage
    • commit stage builds binaries/runs unit tests, uploads to an artifact store (ick's)
    • deployment worker gets artifacts and puts them into a test env
  • hoping (at least some of) the written components can be shared across different POC options
  • timing: either Friday or Monday Lars should have something for others to play with
  • currently using a hello world C program
    • Building blubber might be a good first test
    • Worker can be a VM, a container in k8s, bare-metal machine
    • Register workers for GitLab by running a script and passing a secret from the master
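The flow described above (push triggers the commit stage, the artifact goes to a store, a deployment worker publishes it to a test env, and a controller ties it together) could be modelled roughly as follows. This is only a toy sketch, not Lars's PoC: every class and method name is hypothetical, the build is faked with string formatting, and nothing here talks to GitLab or ick.

```python
from dataclasses import dataclass, field


@dataclass
class ArtifactStore:
    """Hypothetical stand-in for the artifact store kept outside GitLab."""
    blobs: dict = field(default_factory=dict)

    def upload(self, name, data):
        self.blobs[name] = data

    def download(self, name):
        return self.blobs[name]


@dataclass
class Environment:
    """Hypothetical stand-in for the test env, which just publishes artifacts."""
    published: dict = field(default_factory=dict)

    def publish(self, name, data):
        self.published[name] = data


class VcsWorker:
    """Fetches the source for a commit (faked as a string here)."""

    def fetch(self, commit):
        return f"source@{commit}"


class Controller:
    """The missing 4th component: reacts to a trigger and drives the others."""

    def __init__(self, vcs, store, env):
        self.vcs = vcs
        self.store = store
        self.env = env

    def on_push(self, commit):
        # Commit stage: fetch the source and "build" it. A real commit
        # stage would also run unit tests before uploading anything.
        source = self.vcs.fetch(commit)
        artifact = f"binary({source})"
        name = f"hello-{commit}"
        self.store.upload(name, artifact)
        # Deployment: pull the artifact from the store, push it to the test env.
        self.env.publish(name, self.store.download(name))
        return name
```

Callers only ever see `Controller.on_push`, which is one way to read "to hide gitlab behind it": whatever triggers and fetches could be swapped between PoC options without changing the other components.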

Migration plans

  • Tyler: Wanted to get to actual plan of attack for Zuul v2. 2 potential approaches:
    • Migrate to v3 as interim step
    • Do proof of concepts and do the work of migrating and building out new solution all in one step
  • Python 2 reaches end of life at the end of 2019
    • we *may* need to be off of python2 by the end of 2019
    • TODO: Ask SRE if Python 2 is really going away.
  • Tyler: Python 2 aside, we're already past EOL for Zuul v2. We've already had problems with this (Gearman isn't getting patched).
  • Antoine: oh yeah the way Zuul talks with Jenkins is ... outdated and prone to breakage anytime Jenkins breaks something
    • Jenkins plugin maintained by openstack (4 years ago) -- that person isn't involved in the openstack project -- and openstack doesn't even use jenkins
    • Jenkins is going to make a lot of backwards-incompatible changes
    • This is already the case for the pipeline jobs (Deployment Pipeline) -- they don't register in gearman -- so we cannot trigger the jobs from zuul -- we have a hack in place
  • Greg: we're already hacking around zuul with Pipeline. Question: Dan will be back next week to move forward with the PoC; can we have him focus on argo?
  • liw: I want to talk with Dan about argo
  • Tyler: I feel like we ought to focus on Zuul v3 in the near term. We're going to get bitten (security issues)
  • Greg: Other than switch to Ansible, what does that mean for infra? How involved is that really?
  • Antoine: We could migrate Jenkins jobs to Ansible, have v2 run those, then migrate those to v3 - turns out to be way more complicated.
  • Tyler: Is it possible to run 2 versions of Zuul server with different configurations to watch for different events?
  • Antoine: Yes.
    • The work that's already happened to migrate to docker makes most jenkins jobs very trivial
    • docker run + jenkins to store output
    • all of the logic is inside of shell scripts inside of docker containers
  • Greg: infra changes: what does that mean hardware-wise? WMCS?
  • Antoine: New CI solution cannot use WMCS.
    • k8s + duplication of effort?
  • Lars: We need somewhere to run GitLab, but GL can use k8s containers as runners...
  • Tyler: I disagree with the idea that Zuul v3 is an equivalent amount of effort to full new CI. Production code isn't gonna stop moving. Fewer unknown-unknowns with Zuul v3. Unknowns for GitLab are bigger. We know Zuul v3 can handle the complex use cases.
  • Željko: Other solutions might be equivalent work for unknown payoff? tl;dr: Move to v3 then...
  • Tyler: Summary is there will be a big effort at the start of all projects; v3 will not have a long tail, others might.
  • Lars: Even though v3 is entirely different?
  • Tyler / Željko: Concepts / features are similar, implementation differs.
  • Tyler: Is this right? Will it be easy once things are set up?
  • Antoine: Still has concepts of pipelines.
    • zookeeper + nodepool ... this stuff is new
    • Does gearman still exist?
    • PoC still needed
    • Migration should be straightforward after some initial PoC
    • Where are artifacts stored? Dunno
    • This requires a whole new infrastructure, just like all of our solutions (argo, gitlab), but after the infra is set up, migration of the jobs should be straightforward
    • Side note: Brennen's notes on Zuul v3 from earlier: https://phabricator.wikimedia.org/T218138
  • Greg: Maybe we do v3 in a way that's aimed at speed / efficiency and not necessarily how we'd do it if we were staying with v3 long-term.
    • Do we see any infra issues that we'd be blocked on?
  • Tyler: Zookeeper and Nodepool will be contentious.
  • Antoine: And Ansible.
  • Lars: Suggested next action: Discuss with SRE, make decisions based on feedback.
    • Hopefully temporary Zuul v3 migration.
    • TODO set up a meeting with SRE about this
    • TODO thcipriani to start google doc:
      • What goes in doc:
        • Background of problem (Tyler)
        • Current POCs (eg: gitlab) (Lars)
        • Constraints with EOL Zuul v2 (Antoine)
        • Estimates of implementations for each option (collaborative)
          • 1) zuulv3 now, migration to $something later
          • 2) zuulv2 for a while past deprecation
      • Doc should be ready to share early next week (2019-07-22)
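
Antoine's point in the discussion above, that most Jenkins jobs now reduce to "docker run + jenkins to store output", could be sketched as below. This is a minimal illustration under assumptions, not the actual job definitions: the function names, the `/src` mount point, and the idea of a single entry script are hypothetical. It shows how little the job runner itself has to do once the logic lives in shell scripts inside the container.

```python
import subprocess
from pathlib import Path


def docker_command(image, script, workspace):
    """Build the argv for one job: mount the workspace, run the image's script.

    The /src mount point and the single entry script are assumptions.
    """
    return [
        "docker", "run", "--rm",
        "--volume", f"{workspace}:/src",
        "--workdir", "/src",
        image, script,
    ]


def run_job(argv, log_path):
    """Run argv, store combined stdout/stderr in log_path, return the exit code."""
    result = subprocess.run(
        argv, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    Path(log_path).write_text(result.stdout)
    return result.returncode
```

Because the runner only needs an exit code and a log file, the same two steps could in principle be driven by Jenkins, Zuul v3, GitLab, or Argo.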