User:GLavagetto (WMF)/Container image policy for production

Introduction
With the move of production services to Kubernetes, and the widespread use of containers in other areas, we face a set of challenges that our traditional approach to managing the security and consistency of the environment does not handle: almost all of our existing tools break down in this context.

Specifically:


 * Container images are built as immutable layers, so applying a security update means rebuilding and redeploying every layer that depends on the vulnerable (and now updated) layer. If the vulnerable layer is the base one, that can mean the entire tree.
 * It can become very hard to keep track of what is installed in each container image.
 * Updating software in a higher-level image often depends on updating the base images.
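
To make the first point concrete, here is a minimal sketch of how the set of images needing a rebuild could be computed from a tree of images. The tree itself is hypothetical (only node10-devel is a name taken from this page), and the function is an illustration, not part of our tooling.

```python
from collections import deque

# Hypothetical image tree: each image maps to the image it is built FROM.
# Only "node10-devel" is a name used on this page; the rest are made up.
PARENT = {
    "base": None,
    "node10-devel": "base",
    "ci-runner": "base",
    "service-a": "node10-devel",
}


def images_to_rebuild(parent_of, vulnerable):
    """Return the vulnerable image plus every image built on top of it."""
    # Invert the child -> parent map into parent -> children.
    children = {}
    for image, parent in parent_of.items():
        if parent is not None:
            children.setdefault(parent, []).append(image)

    # Breadth-first walk down the tree, starting at the vulnerable image.
    affected = []
    queue = deque([vulnerable])
    while queue:
        image = queue.popleft()
        affected.append(image)
        queue.extend(sorted(children.get(image, [])))
    return affected
```

Calling `images_to_rebuild(PARENT, "base")` returns every image in the tree, while `images_to_rebuild(PARENT, "node10-devel")` returns only that image and `service-a`.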

We are building a full toolchain to manage the problems above, which is described in the next section. This toolchain starts from a few technical assumptions that all container images deployed in production MUST follow. Any deviation from these assumptions WILL break our ability to ensure the security of what runs in production in a timely manner. The last section of this policy spells out and justifies those assumptions.

Our toolchain
We take a multi-layered approach to building images, with different tools for images with different purposes. Specifically:


 * Base images: these are our minimal "basic distribution" images, upon which everything else is built. We use bootstrap-vz for those.
 * All images that are used independently of software projects developed in-house, or that are used as a base for building other images, use docker-pkg. Docker-pkg is a simple helper that provides basic templating for Dockerfiles and can build and update images while respecting a dependency chain. Examples of such images are our node10-devel image, which we use as a base to build nodejs-based applications, and all of the images we run in CI.
 * All images that are built from our own software and are intended to be used in production use our own deployment pipeline; the containerization is defined via the Blubber control file under .pipeline/blubber.yaml.

All the images that we use in production or in CI are built upon this toolchain. We are progressively building a system to monitor and manage the software installed inside those images, and to trigger a full rebuild of the whole stack if a critical vulnerability is found in one of the base images (say, a new Shellshock or Heartbleed).
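
A full rebuild of the stack has to respect the dependency chain: an image can only be rebuilt after the images it is based on. The sketch below shows one way to derive such an order with Python's standard `graphlib` module. It is an illustration under assumptions, not how docker-pkg or the pipeline is implemented, and every image name except node10-devel is hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency chain: image -> images it is built on.
DEPENDS_ON = {
    "base": [],
    "node10-devel": ["base"],
    "ci-runner": ["base"],
    "service-a": ["node10-devel"],
}


def rebuild_order(depends_on):
    """Return an order in which each image follows the images it depends on."""
    # TopologicalSorter treats the mapped values as predecessors, so
    # static_order() yields base images before the images built on them.
    return list(TopologicalSorter(depends_on).static_order())
```

With the data above, `rebuild_order(DEPENDS_ON)` always places `base` first and `node10-devel` before `service-a`; the relative position of unrelated images is not significant.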

More technical information about the various image types and tools can be found on Wikitech.

Policy
This toolchain can only help us detect, manage and resolve security vulnerabilities in a timely manner, which is sometimes of the utmost importance, if a few requirements hold. Every image that we use in production must meet all of the following:


 * It MUST be rebuilt from scratch on our infrastructure, meaning it must not be based on any external image.
 * It MUST be based on Debian, so that we can collect the status of its packages in debmonitor.
 * It MUST make use of the toolchain, as explained above.
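
As an illustration of the first requirement, the sketch below flags base images in a Dockerfile that do not come from an internal registry. It assumes a Dockerfile is available as text and uses a placeholder registry prefix (`registry.example.org/`), which is hypothetical. It is a simple text check, not an existing tool, and it does not verify the Debian or toolchain requirements.

```python
import re

# Hypothetical internal registry prefix; substitute the real one.
INTERNAL_REGISTRY = "registry.example.org/"

# Matches "FROM [--platform=...] <image>" and captures the image reference.
FROM_LINE = re.compile(r"^\s*FROM\s+(?:--platform=\S+\s+)?(\S+)", re.IGNORECASE)
# Matches the optional "AS <stage>" suffix of a FROM line.
STAGE_ALIAS = re.compile(r"\s+AS\s+(\S+)\s*$", re.IGNORECASE)


def external_base_images(dockerfile_text, internal_prefix=INTERNAL_REGISTRY):
    """Return images named in FROM lines that are not on the internal registry."""
    stage_names = set()
    external = []
    for line in dockerfile_text.splitlines():
        match = FROM_LINE.match(line)
        if not match:
            continue
        image = match.group(1)
        # Assumption: "scratch" and earlier build stages do not count as
        # external images, since nothing is pulled from outside for them.
        is_local = image == "scratch" or image in stage_names
        if not is_local and not image.startswith(internal_prefix):
            external.append(image)
        alias = STAGE_ALIAS.search(line)
        if alias:
            stage_names.add(alias.group(1))
    return external
```

An empty result means every `FROM` line points at the internal registry, at `scratch`, or at an earlier build stage; any other image reference is returned so it can be reviewed.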

Any change or exception to this policy needs to be agreed upon with the relevant SRE teams.