User:LarsWirzenius/NewCI

= META =

This is v3, being written, work in progress. Feedback to wikitech-l or lwirzenius@wikimedia.org or via the discussion page. Please don't edit this wiki page directly, for now. Thanks.

TODO
Things that need changing in this page:


 * write missing chapters (marked FIXME)
 * review/fix the security embargo use case
 * copyedit

Changes from v2

 * Use cases and processing of changes has been entirely rewritten.
 * Added open questions (see below), based on comments to previous versions.
 * Clarified that “CI implementation” and “Transitioning to new CI system” are waiting for RelEng to experiment.
 * Clarified what production-like environments are. Combined the two sections disucssing environments into one.
 * Added the ArtifactStorageManagement, BuildRecursively, and DebianPackages requirements.
 * Wrote proposal for handling embargoed security fixes.
 * Added suggestion that credentials-holding components protect against Rowhammer, Spectre, etc.
 * Added Cucumber/Gherkin-style scenarios for some use cases, outlined some more.
 * Various minor improvements to spelling, grammar, style.
 * Converted from Google Docs to a wiki page for better maintainability and software freedom.

Changes from v1
This is a summary of the changes compared to the previous version circulated for comment. It seemed easier to make a new version than edit the previous one while people were reading and commenting on it.


 * Added section on what CI/CDel/CDep are
 * Added SELFSERVE/MIGRATION requirement.
 * Renamed POSTMERGETESTS to REPOGROUPS and reworded it explicitly consider concurrent changes to multiple repositories.
 * Added MAINTAINED/KNOWNLANG requirement.
 * Combined GERRIT and OTHERGITORTICKETING under new WORKFLOWTOOLING requirement. Renamed OTHERGITORTICKETING to PHABRICATOR.
 * Added AUTOMATEDEPLOYMENT/STOPCHANGES.
 * Added Scala to PROGLANGS.
 * Reworded CACEDEPS requirement to specify some package repositories, and not get into details.
 * Dropped HA requirement, as the description doesn't really seem useful.
 * Added TESTINSTANCES wishlist.
 * Raises EMBARGO up from wishlist.
 * Added an optional code health stage.
 * Added section about projects that don't target WMF production.
 * Added section on not being production-like.
 * Added a section clarifying environments.
 * Many minor tweaks and improvements.
 * Added some claification about security reviews still being needed.
 * Change requirement ids to be CamelCase instead of ALLUPPERCASE.

Open questions

 * How should the Understandable requirement be tested? Specifically, how do we measure that our developers can use CI productively?
 * How can we protect the dependency cache (CacheDeps requirement) against poisoning with vulnerable versions?
 * How can we do security patches (Embargo requirement) securely, without breaking embargo, in an automated fashion?
 * If we can’t convert all deployments to be to Kubernetes containers, do we keep manual deployments?
 * Automated deployments need automated test suites of sufficient quality that they give us confidence to deploy if they pass. Alternatively, we could have automated rollbacks at the first sign of trouble. What’s the best approach for MediaWiki and its extensions?

= Introduction =

The WMF RelEng team plans replacement of the current WMF CI system with one of Argo, GitLab CI, Zuul v3. These were selected in the first phase of the CI WG.

We aim to do “continuous deployment”, not only “continuous integration” or “continuous delivery”. The goal is to deploy changes to production as often and as quickly as possible, without compromising on the safety and security of the production environment.

This document goes into more detail of how the new CI system should work, without (yet) discussing which replacement is chosen. A meta architecture if you wish.

It is assumed as of the writing of this document that future CI will build on and deploy to containers orchestrated by Kubernetes.

An important change is that we aim to change things so that as much as possible, all software deployments are made to containers orchestrated by Kubernetes, in the long run. There will by necessity be a long transition period when other kinds of deployments will continue to be necessary. Also, some software only needs to be built and published, rather than deployed, such as Debian packages and Docker base images.

On continuous integration, delivery, and deployment
Somewhat simplistically, continuous integration (CI) is the practise of integrating changes from all developers in a project to the main line of development frequently. Crudely, changes get merged at least once a day, if the software still builds, and that its tests pass. The benefit of the practise is that there is rarely a lot of effort that needs to be spent on merging, because changes are small, and there are rarely merge conflicts. If something breaks, debugging is usually easy, again because the change is small. In WMF, we use Gerrit to make sure changes are not even merged unless the software builds and tests pass. Further, it is less likely that different parts of the development community spend time on features or other changes that will conflict. This encourages developers to communicate about their intentions and collaborate on significant changes before committing effort on something that’s going to require a different approach to work with other changes that are happening at the same time.

Continuous delivery builds on top of CI to deliver the software after every change. This might, for example, mean publishing release tarballs after every successful integration. Effectively, the software is released after every successful change. The benefit of this is that those who want can easily use the very latest of the software, giving fast feedback to the developers about the direction of the software.

Continuous deployment builds on top of CI to also deploy the software to production. This gives some of the same benefits as continuous delivery, but is more suited for web applications and web sites.

Vision for CI
This is Lars’s personal opinion, for now, but it’s based on discussions with various people while at WMF. It’s not expected to be new, radical, or controversial, compared to the status quo.

In the future, CI at WMF serves WMF, its developers, and the Wikimedia movement by making software development more productive, more confident, and faster. The cycle time of changes (the time from idea to running in production) is short: for a trivial change, as little as five minutes. At the same time, the safety and security of production is protected: malicious changes do not get deployed, breaking changes are rare, and can easily be fixed or the problematic change reverted.

A major change here is to start doing continuous deployment instead of a weekly train.

Production here means all the software needed to run all the sites (Wikipedias in different languages, Commons, etc), as well as supporting services, including tooling and services that supports development.

Overall solution approach
The overall approach to the architecture of the CI system, and the workflow supported by it, is to make all changes via version control (git), which includes code, configuration, and scripts for building and deploying. When a change is pushed to version control, CI builds and tests the change, humans review the change, and if all seems to be in order, CI deploys to production.

It should be pointed out explicitly that continuous deployment relies on an automated test suite that the movement, and its developers, Release Engineering, and SRE are comfortable with. Everyone needs to trust that the tests catch problems with sufficient likelihood that if the tests pass, it is safe to deploy to production. This may take some time to achieve, but it will enable us to develop software with much greater productivity.

Regardless of how good the tests will be, we will need to have a good way to roll back any changes to a previous working version. A good test suite and rolling back will enable us to deploy boldly.

Since we currently lack a sufficient test suite, we may want to take the approach of automatically reverting changes that cause error messages in log files.

On environments
The new CI system will need to provide various environments, which are sufficiently production-like for testing or running the software. These environments will include all the components that are needed for simulating production, for the tests in question. The specific components depend on the test, and a mechanism for specifying them will be built detail during the CI implementation phase. Ideally, the main difference between a test environment and production is capacity.

For example, an environment might include MediaWiki, MariaDB, and the microservices we have supporting wikis, all configured to work together, and independently from other environments. The various components might all run in the same container or VM, or in several, depending on what makes the most sense.

It will be possible to specify that a production-like environment has multiple wikis, to test things that require interaction between multiple wikis.

The environments for running acceptance and other integration tests will be implemented in a way that makes them indistinguishable from a newly built fresh environment.

The production-like environments will be created automatically. They will be sufficiently similar to production to allow tests to happen, in such a way that tests passing in a production-like test environment gives us reasonable confidence that the software will work in the real production environments as well.

Sometimes there’s a need to use different versions of software than in production in test environments. For example, when preparing a migration to a new version of PHP. Production will stay with the currently best version of PHP, but there’s a need for test environments with the next version of PHP so that tests and related debugging can be done using that. This will be supported.

On security and security reviews
Nothing in this document is meant to lessen how security reviews are handled. Just like with the current status quo, if a MediaWiki extension embeds (vendors) a dependency into its own source tree, the vendored version needs to be acceptable to those responsible for reviewing software running in production from a security perspective.

Stakeholders
Stakeholders in the WMF CI system include:


 * RelEng, who are responsible for keeping CI running
 * SRE, who are responsible for the infrastructure on which CI runs
 * MediaWiki developers, who develop MW and its extensions, and will (eventually) be doing MW releases for external MW deployers
 * staff and volunteer developers of anything else built, tested, and deployed by CI
 * WMF, who pays for CI using donations
 * The Wikimedia movement, who use sites and services operated by WMF
 * all of humanity, who benefit from having knowledge freely disseminated

= Requirements =

This chapter lists the requirements we have for the CI system and which we design the system to fulfill.

Each requirement is given a semi-mnemonic unique identifier, so it can be referred to easily.

The goal is to make requirements be as clear and atomic as possible, so that the implementation can be more easily evaluated against the requirement: it’s better to split a big, complicated requirement into smaller ones so they can be considered separately. Requirements can be hierarchical: The original requirement can be a parent to all its parts. Further, a way to objectively check if a requirement is met should be outlined.

Note that these requirements are not meant to constrain or dictate implementation. The requirement specifies what is needed, not how it is achieved.

These requirements were originally written up in the WG wiki pages and have been changed compared to that (as of the 21 March 2019 version): Phase 1 report. This document is now the maintained version of the requirements.

Scope of work
These are non-negotiable requirements that must all be fulfilled by our future CI system.


 * SelfHostable Must be hostable by the Foundation. It’s not acceptable for WMF to rely on an outside service for this, to achieve the security, reliability, and privacy required of Wikimedia.
 * FreeSoftware Must be free software / open source. “Open core” like GitLab can be good enough, as long as we only need the parts that provide software freedom. This is partly due to the SelfHostable requirement, but also because free software is a form of free knowledge, and it’s a WMF value is to prefer open source.
 * GitSupport Must support git. We’re not switching version control systems for CI.
 * SelfServe Must support self-serve CI, meaning we don’t block people if they want CI for a new repo. Due to ProtectProduction, there will probably be some human approval process for new projects, but as much as possible, people should be allowed to do their work without having to ask permission.
 * SelfServePipeline Should allow the developers to define or declare at least parts of the pipeline jobs in the repository: what commands to run for building, testing, etc.
 * Understandable Must be understandable without too much effort to our developers so that they can use CI/CD productively. Acceptance criteria: our developers can use CI productively, after given a short, 1-page tutorial on how to specify build instructions for their software.
 * Migration There needs to be a migration strategy to move existing repositories to the new CI.
 * RepoGroups All automated tests must pass both before and after merging a change, for each repository separately and all the repositories together that get deployed together to production. For example, tests for each extension must pass, but also for mediawiki with all the extensions that we deploy to production, and any supporting services. This can be achieved by various means (run tests twice, or make sure target branch doesn’t change in between). The justification is that there can be changes to another repository while tests for one repository runs, and those changes may break things for this repository.

Requirements
These are not absolute requirements, and can be negotiated, but only to a minor degree.


 * Rollback: If a change is deployed to production and turns out to cause problems, it is easy and quick to revert the change.
 * Fast Must be fast enough that it isn’t perceived as a bottleneck by developers. We will need a metric for this.
 * ShortCycleTime Must enable us to have a short cycle time (from idea to running in production). CI is not the only thing that affects this, but it is an important factor. We probably need a metric for this.
 * Transparent Must make its status and what-is-going-on visible so that its operation can be monitored and so that our developers can check the status of their builds themselves. Also the overall status of CI, for example, so they can see if their build is blocked by waiting on others.
 * Feedback Must provide feedback to the developers as early as possible for the various stages of a build, especially the early stages (“got the source from git”, “building”, “running unit tests”, etc.). The goal is to give feedback as soon as possible, especially in the case of the build failing.
 * Feedback2 Must support providing feedback via Gerrit, IRC, and email, at the very least. These are our current main feedback channels. We don’t want to replace these other tools at this time.
 * Secure Must be secure enough that we can open it to community developers to use without too much supervision.
 * ProtectProduction Must protect production by detecting problems before they’re deployed, or revert troublesome changes after deployment, and must in general support a sensible CI/CD pipeline. This is necessary both for the safety and security of our production systems, a higher speed of development, and higher productivity. The protection brings developer confidence, which tends to bring even more speed and productivity.
 * EnforceTests Must allow Release Engineering team to enforce tests on top of those specified by system consumers, to allow us to set minimal technical standards. The repository owner mustn’t be able to disable or skip the enforced tests.
 * MaxBuildTime Should have build timeouts so that a build may fail if it takes too long. Among other reasons, this is useful to automatically work around builds that get “stuck” indefinitely.
 * Isolation Builds must done in isolated environments or sandboxes. Jobs should not be able to affect the environment of other jobs.
 * Maintained Must be maintained and supported upstream, and we should expect that to be the case for at least the next five years. The CI system should not require substantial development from the Foundation. Some customization is expected to be necessary.
 * Relationship We should have a good relationship with the CI system’s upstream.
 * KnownLang It would be preferable for the CI system to be implemented in a language known by foundation staff to make customizations, debugging, fixing, etc, easier.
 * Metrics Must enable us to instrument it to get metrics for CI use and effectiveness as we need. Things like cycle times, build times, build failures, etc. FIXME: Start collecting a list of metrics we must have, want to have, or would like to have.
 * WorkflowTooling CI system must not tie us to specific tooling for workflow. We must have the freedom to change tooling in the future without it being a major project.
 * Gerrit Must work with Gerrit as that’s what we have now for change review and will continue to use, unless we want to change at a later date.
 * Phabricator Must work with Phabricator for ticketing, as that’s what we have and will continue to use, unless we want to change at a later date.
 * Promotion Must promote (copy) Docker images and other build artifacts from “testing” to “staging” to “production”, rather than rebuilding them, since rebuilding takes time and can fail or produce a different result. Once a binary, Docker image, or other build artifact has been built, exactly that artifact should be tested, and eventually deployed to production.
 * LocalTests Must allow a developer to replicate locally the tests that CI runs. This is necessary to allow lower friction in development, as well as to aid debugging. For example, if CI builds and tests using Docker container, a developer should be able to download the same image and run the tests locally.
 * AutomatedDeployment Must allow deployment to be fully automated.
 * StopChanges The developers must have a way to stop changes from being merged into a repository, and therefore from being deployed. This is needed, for example, if there’s a problem in the way deployments work, and new changes should wait until the deployment is fixed.
 * AutomatedSelfDeployment The CI system itself must be automatically deployable by us or SRE, onto a fresh server.
 * DeployCIBuilt Only build artifacts built and tested by CI may be deployed to production by CI itself.
 * Scalable Must be scalable to the projects and changes we work with. This includes number of projects, number of changes, and kind of changes.
 * HScalable Must be horizontally scalable: we need to be able to add more hardware easily to get more capacity. This is particularly important for build workers, which are the mostly likely bottleneck. Also, probably environments used for testing.
 * ProgLangs Must be able to support all programming languages we currently support or are likely to support in the future. These include, at least, shell, Python, Ruby, JavaScript, Java, Scala, PHP, Puppet and Go. Some languages may be needed in several versions.
 * OutputLinks Must support HTTP linking to build results for easier reference and discussion. This way a build log, or a build artifact, can be referenced using a simple HTTP (or HTTPS) link in discussions.
 * ArtifactArchive Should allow archiving build logs, executables, Docker images, and other build artifacts for a long period.
 * ArtifactStorageManagement The artifact store should delete old artifacts in order avoid filling all storage.
 * Retention The retention period should be configurable based on artifact type, and whether the build ended up being deployed to production.
 * ConfigVC Must keep configuration in version control. This is needed so that we can track changes over time.
 * Gating Must support gating / pre-merge testing. FIXME: This needs to be explained.
 * PeriodicBuilds Must support periodic / scheduled testing. This is needed so that we can test that changes to the environment haven’t broken anything. An example would be changes to Debian, upon which we base our container images.
 * CIMerges Must support tooling to do the merging, instead of developers. We don’t want developers merging by hand and pushing the merges. CI should test changes and merge only if the tests pass, so that the branches for main lines of development are always releasable.
 * TestVC Must support storing tests in version control. This is probably best achieved by having tests be stored in the same git repository where the code is.
 * BuildDeps The repository must declare its build dependencies, and CI must make sure they are provided in the build environment.
 * BuildRecursively After a repository (or project) has been built successfully, build and test anything that depends on it. For example, if MediaWiki core uses a library to generate cat videos randomly, build MW after the library has built successfully.
 * TestServices Must support services for tests — i.e., some PHPUnit tests require MySQL. These are most important for integration tests. Proper unit tests do not depend on any external stuff. However, integration tests may well need MediaWiki, some specific extensions, and backing services, such as databases, “oid” services, and possibly more. CI needs to be able to provide such environments for testing as defined by developers.
 * CacheDeps Must support dependency caching of popular package repositories: Debian, npm, pypi, and probably more. This is needed for speed, if nothing else. However, this needs to be done securely.
 * GoodOps: The CI system needs to be maintained in a proper way, with monitoring, multiple people who can get notified of problems and can fix things.
 * Monitoring: The CI system needs to have monitoring to automatically alert of problems.
 * OnCall: There must be multiple, named people who can be alerted to deal with problems in CI that users can’t deal with themselves. These people need to cover enough time zones to be available at almost all times, and have enough overlap to cover for each other when they’re not available due to illness, vacation, travel, or other such reasons.
 * SLA: RelEng provides a service-level agreement, for how well CI serves WMF and the movement.
 * CleanState Builds and tests must run in a clean, well-defined environment.
 * CleanBuildEnv Must provide a build environment with only explicitly declared build dependencies installed.
 * CleanWorkspace Must provide a clean workspace for each build or test run - either a clean VM or container. The workspace should have the source code of the project to be built, with the right commit checked out from version control, and anything else explicitly declared, but nothing else.
 * Secrets Must support secure storage of credentials / secrets.
 * AllTests When merging a change, all tests for all components must pass. For example, when changing one MediaWiki extension, which is depended on by another extension, the change should be rejected if either extension’s tests fail.
 * Embargo Would be nice to be able to run for secret/security patches. This means CI should be able to build and deploy changes that can’t be made public yet, for security embargo reasons.
 * DebianPackages Support building and publishing Debian packages (.deb) for software.

Wishlist items
These requirements are easily negotiated and can be left unfulfilled if hard to achieve.


 * LiveLog Should have live console output of build so it can be viewed while the build is running, without having to wait until the build is finished.
 * RateLimit Should have rate limiting - one user/project can not take over most/all resources.
 * CheckSig Should support validation and creation of GPG/PGP-signed git commits as well as artifacts and generated containers.
 * LimitBoilerplate Would be nice for test abstractions to limit boiler-plate, i.e., all of our services are tested roughly the same way without having to copy instructions to every repository.
 * PrioritizeJobs Would be nice to prioritize jobs.
 * Use case: if there is a queue of jobs, there should be some mechanism of jumping that queue for jobs that have a higher priority.
 * We currently have a Gating queue that is a higher priority than periodic jobs that calculate Code Coverage.
 * We could run jobs that have historically been fast in a separate queue.
 * PostMergeBisect Would be nice to post-merge git-bisect to find patch that caused a particular problem with, say, a Selenium test.
 * DeployWherever Would be nice to have a mechanism for deployment to staging, production, toollabs. We could do with a way to deploy to any of several possible environments, for various use cases, such as bug reproduction, manual exploratory testing, capacity testing, and production.
 * TestInstances Should provide test instances for proposedt changes. If a developer pushes change 12765 to Gerrit, CI should build something like https://12765.test.wikimedia.org with the change running a sufficiently production-like environment that the change can be tested “for real” by testers and volunteering users. This might only happen for changed marked specially.
 * Mobile Would be nice to support building and testing mobile applications (at minimum for iOS and Android).

= The CI pipeline =

FIXME: insert pipeline image of the default pipeline

CI will provide a default pipeline for all projects. Projects may use that or specify another one. Each project will be able to pick the stages it wants in its pipeline, though this may be overridden by Release Engineering or SRE to protect the sites (ProtectProduction requirement).

The pipeline will be divided into several stages. Mandatory stages for all changes and all projects are the commit and acceptance test stages. Other stages may be added to specific changes or projects as needed. Most projects will need either a stage to publish releases or deploying to production.

The goal is that if the commit and acceptance test stages pass, the change is a candidate that can be deployed to production, unless the project is such that it needs (say) manual testing or other human decision for the production deployment decision. Likewise, if the component or the change is particularly security or performance sensitive, stages for checking those aspects may be required. CI will have ways of indicating the required changes per component, and also per change. (FIXME: It is unclear how this will be managed.)

If the commit or acceptance stage fails, there is not a production candidate. The pipeline as a whole fails. Any artifacts built by the pipeline will not be deployable to production, but they may be deployable to test environments, or downloaded by developers for inspection.

The commit stage
The commit stage builds any deployable artifacts, such as executable binaries, minimized JavaScript, translation files, or Docker images. It is important that artifacts don’t get rebuilt by later stages, because rebuilding does not always result in bitwise identical output, and any change in the artifacts may invalidate any testing that has been done. Instead the goal is to build once, test the artifacts, and deploy the tested artifacts, instead of rebuilding and maybe deploying something different than what was tested.

The commit stage also runs lint and unit tests, and any other tests that can be run in isolation from other parts of the system, and that are also quick. The commit stage does not have access to backing services, such as databases or other components of the overall system. For example, when the pipeline processes a change to a MediaWiki extension, the commit stage doesn’t have access to a running MediaWiki core or the MariaDB MediaWiki uses. Integration or system tests should be done in the acceptance test stage.

The commit stage also runs code health checks. (Unless they are too slow.)

The commands to build (compile) or run automated tests are stored in the repository, either explicitly, or by indicating the type of build needed. There might, for example, be a .pipeline/config.yaml file in the repository, which specifies that make is the command that builds the artifacts. Otherwise, the file may specify that it’s a Go project, and CI would know how to build a Go project. In this case RelEng can change the commands to build a Go project by changing CI only, without having to change each git repository with a Go program.

Only the declarative style will be possible for building Docker images (e.g., .pipeline/blubber.yaml), as we want control over how that is done (Secure requirement).

CI may enforce specific additional commands to run, to build or test further things; this can be used by RelEng to enforce certain things. For example, we may enforce code health checks, or enable (or disable) debug symbols in all builds. Such enforcement will be done in collaboration with our developers.

Any build dependencies needed during the commit stage must be specified explicitly. For example, the minimum required version of Go that should be installed in the build environment would be a build dependency. If a project build-depends on another project, it needs to specify which project, and which artifacts it needs installed from the other project. Explicit build-dependencies are more work, but result in fewer problems due to broken heuristics.

A code health stage
If the commit stage is too restrictive for running some code health tools, they can be run a separate stage instead. Or possibly moved to the acceptance test stage, or a separate stage that can run in parallel with other stages. We’ll figure it out during implementation when we see what’s needed.

An integration stage
We may want to optionally have an integration stage. This would be like the acceptance stage, but run tests that are more developer oriented than acceptance tests are. For example, some of browser-based (Selenium) tests may fit here. But if they fit well into a project’s acceptance stage, they can go there.

The acceptance test stage
During the acceptance test stage CI deploys artifacts built in the commit stage to an isolated production-like system that has the same versions of all software as production, except for the changes being processed by the pipeline. CI will then run automated acceptance tests, and other integration and system tests, against the deployed software. The test environment is clean and empty, and well-known, unless and until the test suite inserts data or makes changes.

The deployment stage
If prior stages have passed successfully, and manual code review (“Gerrit CR:+2 vote”) has approved the change, this stage deploys the change to production.

Manual tests
Testers may instruct CI to deploy any recently built set of artifacts to a dedicated test environment, and can use the software in that environment where it is isolated from others, and won’t suddenly change underneath them. The details of how this will be implemented are to be determined later.

This feature of the CI can also be used to demonstrate upcoming features that are not yet ready to be deployed to or enabled in production.

(“Recent” means any artifacts not yet purged from the artifact store. See ArtifactArchive requirement.)

Capacity tests, non-functional requirements
Capacity tests, and other tests for non-functional requirements, will also be done in dedicated, isolated production-like environments. RelEng will work with the performance team to sort out the details.

On repositories, components, projects not targeting WMF production
We can support projects, components, and repositories that don’t target WMF production. As an example, MediaWiki tarball releases. For such projects, we can have the commit stage build the tarball (or whatever is released for the project), and acceptance or other stages test that. We can even deploy the projects to test environments to test the code in a production-like environment. We can test things with many versions of PHP, many Linux distributions, and any other variant we want to support. However, most of this document aims at improving things in WMF production, so other things are given less attention, at least for now.

= CI Architecture =

The WMF development ecosystem
FIXME: Insert pic of the WMF develompent ecosystem

The figure above is simplistic, but gives a general idea of what happens when a developer has finished with a change:


 * 1) developer pushes a change to Gerrit, which triggers CI
 * 2) CI builds and tests change (commit stage)
 * 3) CI deploys to a test environment, runs tests against that (acceptance test stage); if everything is OK, Gerrit is notified and requests code reviews from relevant parties
 * 4) testers can request CI to deploy the change to an environment dedicated for manual testing
 * 5) after a successful code review, CI merges changes to the branch, runs all automated tests again, and deploys to the production environment

The commit and acceptance test stages are triggered as soon as a developer pushes changes to be reviewed. Human reviews won’t be requested automatically until the two stages pass, as there’s no point in spending human attention on things that are not going to be candidates for deployment to production. Reviews for changes marked work-in-progress won’t be requested automatically. Developers may request reviews of work-in-progress changes when they want.

Other stages may run in parallel with code review, and if they fail they may nullify the release candidacy of the change. For example, stages for manual and capacity testing, and security test/review; depending on the change and the component in question, some or all of these may be necessary.

CI internal components
FIXME: Insert pic of CI internal components

Note: this is speculative for now, until we actually implement at least one of the candidates.

The CI system consists, at least conceptually, of several internal components, depicted in the architecture diagram above. The actual implementation may end up using different components, but the conceptual roles will probably still be there.


 * CI interacts with Gerrit and production externally. Also test environments, but those ephemeral. Gerrit is code review and git server, and produces an event stream to which CI reacts.
 * The controller listens to the Gerrit event stream and commands the various workers to do things as appropriate. It is the conductor of the symphonic CI orchestra.
 * The VCS worker interacts with Gerrit in its role as a git server. The VCS worker is the only CI component which has credentials for accessing Gerrit. The VCS worker retrieves source code via git, does merge and other git operations, pushes changes back to Gerrit (when it’s OK to do that), and publishes snapshots of the working tree to the artifact store. Build workers get their source code from the artifact store, not from Gerrit directly. This avoids having to give build workers credentials, or network access, to gerrit, and having to push speculative changes to Gerrit.
 * The artifact store stores blobs of things that CI needs or produces: git working trees, built artifacts, etc.
 * The build worker runs builds and tests. It gets the source code from the artifact store and publishes built artifact back to the artifact store. There can be any number of build workers running in VMs or containers. The build worker sets up (or installs) build dependencies in the build environment, before it starts the actual build, based on information provided by the repository being built. Depending on the way the build is configured, the actual build may prevent network access.
 * The deployment worker deploys built artifacts to an environment, such as a test environment or the production environment.
 * The log store stores build logs, what Jenkins calls “console output”: the stdout and stderr of the command run to do a build, and to run tests. Also the commands to deploy, and to interact with Gerrit.
 * A test environment is an isolated environment that is sufficiently like production to running tests. The test may be automated or manual. There can be any number of test environment, and they might be ephemeral, if that turns out to be useful.
 * The production environment is what the world sees. It’s the actual Wikipedia in all the various languages, and all the other sites we run.

The components that have security sensitive credentials (VCS worker, deployment worker) should probably run on separate physical hardware from the rest, to avoid exposing the credentials to Rowhammer, Spectre, and other CPU security issues. Alternatively, the software implementing the components might store the credential encrypted in RAM with a random key, and only decrypt them while the credential is used.

= Code review: what it's for and not for =

Code review is for checking a proposed change in ways that can't be automated because they require human appraisal. Code review attempts to answer questions like "is this something we want at all?", "is this a good way of making this change?", and "is this done in good style and easily understood by other humans?".

Code review is not for checking for syntax errors, code formatting, or that any automated tests pass. These are all things that can and should be checked for automatically. But code review can check them too, especially if CI does not already check them automatically. There's no hard limit on what code review can check for, but it's silly to spend human effort on things that can be checked by automation.

= Use cases =

FIXME: Discuss security embargoed changes.

Terminology
We have many projects, used to run many sites and services, and much concurrent development is happening. To simplify (and systematize) the discussion, the following classification and terminology is used in this document. It may not be applicable in other contexts.

Lone, concurrent, and coupled changes
A change may be happening alone: no other change that affects the same repository, site, or service is happening at the same time. This is called a lone change.

Often, two or more changes are happening at the same time. If they don't depend on each other, they are called concurrent changes. If they do depend on each other, they are called coupled changes.

FIXME: Coupled changes aren't discussed anywhere yet.

Simple and complex sites and services
Our source code is stored in git repositories hosted on Gerrit. Services, and the software running sites (such as MediaWiki) may be built out of one or several repositories. For example, MediaWiki is often deployed with a set of extensions, each of which comes from a separate repository.

Also, one repository may produce more than one service, depending on how it is built or configured.

A site or service that is built out of one repository is called simple, even if that repository can build multiple services. Sites and services that are built out of more than one repository is called complex.

Successful, failing, rejected, and conflicting changes
When a developer considers a change to be ready to be deployed to production, they push it to Gerrit for build, test, review, and deployment. Gerrit tells CI to build and test the change. If the build or automated tests fail, the change as a whole fails, and the processing it ends there. The developer is notified, and if they want to, they can fix the problem and push an update to the change, at which point the process starts over, as if the new version of the change was a completely new change.

If the CI build and test works, Gerrit asks other developers to do a code review of the change. If a code reviewer votes CR+2, processing of the change continues (see below). A CR-1 vote rejects the change, which ends the process (but the developer may fix the problem and push a new change, starting the process over). In all other cases the change is stalled at the code review phase, possibly for a long time. (It is not the job of Gerrit or CI to chase after reviewers, but we may build something for that later.)

After a CR+2 vote, CI does a merge of the change in a local copy of the git repository to the master branch. If the merge fails, possibly because the master branch has moved on, the change is conflicting, processing it ends, and the developer is notified. The developer may push a fix, but that either A or B happens. We need to discuss which one we choose.

A: The whole process starts over, and the code review that was already done is forgotten. The justification for this is that the new version of the change may be radically different and a fresh code review is warranted. On the other hand, a re-review of a change that isn't radically new should be quick and not much work, as Gerrit allows you to see the differences between to versions of a change.

B: The developer may push a new version of the change, but it will not trigger a new code review. CI will still need to build, test, and merge the new version of the change successfully, but if it does, the process continues as if the merge hadn't failed. This would require us to trust our developers to remember to ask for a review if they make radical changes in the new change version.

Otherwise, CI builds and tests it's local, modified copy of master, and if this fails, notifies the developer, at which point A or B above happens again, same as for a merge conflict.

If merge, build, and test are all successful, CI deploys the change to production. This is called a successful change. A successful change is not expected to get updated again.

If there are problems found in production, it will be handled either by making a new change to fix the problem, or reverting the change where the problem occurred. In either case, the change (or revert) is pushed to Gerrit as a new change, and processed from there as any other change.

Reverting the change is a form of rollback. We can also do rollback by telling Kubernetes to switch to using container based on a previous Docker image version. Either way, we need to have a reliable, and fast, way of handling rollbacks. (Ideally, also a way of triggering rollbacks automatically, but it is probably sufficient to trigger manually, at least initially.)

A change to a repository that is used by more than one site or service
As a rule, when a repository is changed, all sites and services built from the repository need to be re-built and re-tested and re-deployed, unless CI can somehow prove it's unnecessary.

For example, if a two Wikipedia sites both use the same extension, and a change is proposed to the extension, software running both sites need to be built and tested and deployed. It's not enough that one site works.

CI will act *as if* it only merges changes one at a time, including running post-merge builds, tests, and deployments. Only after the change has been successfully deployed will CI start processing the next change.

This may be much too slow, and thus CI may do things concurrently, as long as the effect is the same. It can't merge one change, and while doing the build, test, or deployment of that change, merge another. This is because if it does, and there is a problem with the first change, this may affect the second change, including preventing the second change from working at all.

CI can optimize this by only applying changes to non-overlapping sets of repositories. For example, say site A uses the repositories mediawiki-core, mw-ext-maps, and mw-ext-math; site B uses mediawiki-core and mw-ext-video; and service C uses only blubberoid. A change for site A that affects mediawiki-ext-maps has to finish deploying before any change to site B (meaning mediawiki-core or mw-ext-video) is processed, since A and B share the mediawiki-core repository. However, while the change to site A is being processed, a change to site C (blubber) can be process concurrently, as it does not affect anything for site A, and vice versa.

META: We should possibly model this and do some experiments to see if it works well enough.

Non-successful changes
Any change (or version of change) where a build or automated tests fail, or code review rejects the change, terminates processing of the change, until and unless the developer pushes a new version of the change. If they do, the new version is processed as if the previous versions had never existed.

If a change fails after a CR+2 code review vote, it is handled as described above (option A or B).

Processing a change
All changes get handled using the process, as follows.

given a service baroid from foo.git on Gerrit (head of master being REF0) and production runs REF0 of baroid when developer pushes change CHANGE1 for foo.git to Gerrit, giving commit REF1 then Gerrit triggers CI to build and test CHANGE1 and CI votes V+1 on CHANGE1 when reviewer votes CR+2 on CHANGE1 then CI merges the CHANGE1 locally to master, giving master tip REF1 and CI builds and tests its local master branch and CI deploys the artifacts from its local master to production and production runs version REF1 of baroid and CI pushes the merge to Gerrit and Gerrit has tip for master for foo.git as REF1

FIXME: This doesn't yet handle complex sites and services. It also doesn't handle mutexing against other concurrent changes.

= Other aspects to consider for CI =

Security embargoed change
NOTE: There is yet no good solution described here. Further thought is needed.

A general goal of the future CI system is to make all changes go via git, so that they can be tracked. At the same time, we must not break an embargo on security fixes. The embargo means the fixes probably need to be in private git repositories, unless we can in the future have secure private branches, which Gerrit currently doesn't provide in a way that we trust. The embargo also means we can't publish a Docker image with the fixes applied, since the PHP, JS, and other code is in cleartext in the Docker image. Alternative approaches that fulfill these requirements are listed here for discussion.


 * We could not publish Docker images at all. This makes it harder for our developers to develop, test, and debug issues locally. Probably not acceptable.


 * We could only publish Docker images when they don't have embargoed security fixes. This would still mean local development etc is hampered, but also leaks the information that there is an embargoed fix in the pipeline. Probably not acceptable.


 * We could build, test, and publish a Docker image from public sources, and build, test, and deploy to production a second one, with the security changes applied. The fundamental problem here is that the published Docker image, which our developer would use for local development, is different from what actually runs in production. However, this would be a temporary problem that only lasts as long as the embargo. This is effectively what we currently have. Security patches are applied by the train conductor during deployment, on top of the code that is in public git repositories.

If we go for the last option, someone with access to security patches would need to be responsible for dealing with problems found in the second Docker image, but not in the first one. Basically, the process would be something like:


 * build public Docker image
 * deploy public image to a test environment
 * run acceptance tests against that test environment
 * build private Docker image by applying security patches
 * deploy private image to a second test environment
 * run acceptance tests against second test environment
 * deploy private image to production

If anything goes wrong with the building, testing, or deploying of the second image, it needs to be debugged by someone with access to the security patches. Ideally, this would be the person or people who wrote the security patch in the first place, possibly backed by the release engineering team.

Log storage
FIXME: This needs to be written fully.


 * We want to capture the build log or “console output” (stdout, stderr) of the build and store it. This is an invaluable tool for developers to understand what happens in a build, and especially why it failed.
 * Ideally, the build log is formatted in a way that’s easy for humans to read.
 * It’d also be nice if the build log can be easily processed programmatically, to extract information from it automatically. More than nice, having log files that programs can automatically extract information from would be so good it’s almost a requirement. But it may be difficult to achieve for some programs and build systems.
 * We may want to store build logs for extended periods of time so that we can analyze them later.
 * FIXME: We should have a list of metrics we want to gather from CI logs.

Artifact storage
FIXME: This needs to be written fully.


 * Artifacts are all the files created during the build process that may be needed for automated testing or deployment to production or any other environment: executable binaries, minimized JavaScript, automatically generated documentation from source code.
 * We basically need to store arbitrary blobs for some time. We need to retrieve the blobs for deployment, and possibly other reasons.
 * We may want to store artifacts that get deployed to production for a longer time than other artifacts so that we can keep a history what was in production at any recent-ish point in time.
 * We will want to trace back from each artifact which git repository and commit, and CI job/build, it came from.
 * We can de-duplicate artifacts (a la backup programs) to save on space. Even so, we will want to automatically expire artifacts on some flexible schedule to keep storage needs in control.
 * We need to decide when we can make these artifacts publicly accessible.
 * Artifact storage must be secure, as everything that gets deployed to production goes via it.
 * There are some artifact storage systems we can use.

Credentials management and access control
Credentials and other secrets are needed to allow access to servers, services, and files. They are highly security sensitive data. The CI system needs to protect them, but allow controlled use of them.

Example: a CI job needs to deploy a Docker image with a tested and reviewed change as a container orchestrated by production Kubernetes. For this, it needs to authenticate itself to the Kubernetes API. This is typically done by a username/password combination, but might be an API token of some kind (though it doesn’t really matter; it’s all just secret bits at some level). How will the future CI system handle this?

Example: for tests, and in production, a MediaWiki container needs access to a MariaDB database, and MW needs to authenticate itself to the database. MW gets the necessary credentials for this from its configuration, which CI will install during deployment. The configuration will be specific for what the container is being used: if it’s for testing a change, the configuration only allows access to a test database, but for production it provides access to the production database.

Builds are done in isolated containers. These containers have no credentials. Build artifacts are extracted from the containers and stored in an artifact storage system by the CI system, and this extraction is done in a controlled environment, where only vetted code is run, not code from the repository being tested. The build environment can’t push artifacts directly to the artifact store.

Deployments happen in controlled environments, with access to the credentials needed for deployment. The deployment retrieves artifacts from the artifact storage system. The deployments are to containers, and the deployed containers don’t have any credentials, unless CI has been configured to install them, in which case CI installs only the credentials for the intended use of the container.

Note that credentials should not come directly from the source code of the deployed program. CI deploys configuration when it deploys the software. This way, the same software (build artifacts) can be deployed to different environments. (This may be complicated by the way MediaWiki is configured, using a PHP file in the source tree. This will need discussion.)

Tests run against software deployed to containers, and those containers only have access to the backing services needed for the test, and may even be firewalled to not have access to any other network locations.

Suggestion: Deployments will be done dedicated deployment environments, which run a “pingee” service. When a pipeline executes a deployment stage, deploying to any environment, the stage runs in a suitable container, but doesn’t actually do the deployment itself. Instead, it “pings” a deployment service, with information on who is deploying, what, and where. The deployment service then inspects the change, and if it looks acceptable, does the actual deployment to the desired environment. The deployment service has access to the credentials it needs for accessing the artifacts and doing the deployment. There may be several deployment services, for deploying to environments with different security needs.

= CI implementation =

FIXME This needs to be written, but it needs a lot of thinking and experimentation first. RelEng is prototyping with GitLab, Argo, and Zuul v3.

= Transitioning to new CI system =

FIXME This needs to be written. The migration plan will depend on the details of the new system, and so will need to wait until we’ve chosen what to implement. As it stands, there are at least the following questions to solve:


 * MediaWiki isn’t going to be ready to deploy to a container for many months. How will we deal with this? Continue doing manual deployments? Automate the deployment to non-containers?
 * MediaWiki does not currently seem to have a test suite that gives us sufficient confidence to do automated deployments. What can and should we do about this?

= Maintenance =

FIXME this needs to be written.

This chapter discusses how maintenance of the new CI system will be divided between various teams.