Wikimedia Release Engineering Team/MW-in-Containers thoughts

What we want

Objective: MediaWiki* is automatically packaged into (one or more) OCI container images**, which are semi-automatically deployed*** into Wikimedia production as Kubernetes pods.

"MediaWiki"

All of what we consider the current appserver layer running on mw\d\d\d\d servers (plus the Parsoid sub-cluster), as currently deployed through scap.
- Included:
  - the mediawiki/* repos, including the code running on the appservers and on the general and specialist jobrunners
  - the core site configuration in operations/mediawiki-config's CommonSettings.php and related files like static assets
  - the appserver layer of Apache and PHP-FPM and related code (like etcd clients) and APCu
  - specialist code like ghostscript (for PDF rendering), lilypond (for sheet music rendering) and ffmpeg (for video transcoding and scaling)
  - built artefacts that change based on the code but are invariant based on request; currently, the l10n CDB/PHP files, perhaps also base ResourceLoader bundles?
- Excluded:
  - the persistent data-heavy layers (primary and replica MySQL instances, external store MySQL instances, logstash, kafka, Swift, …)
  - the caching layers which vary on content more than code (memcached, Varnish, and Apache Traffic Server)
  - the independent services running in their own k8s pools (mathoid, page content service, kask, …) or independently on bare metal (ElasticSearch, Kafka transit, Thumbor, …)
  - site-variant configuration, as currently specified in operations/mediawiki-config's InitialiseSettings.php
TBDs:
- What is missing from or wrong in this list?

"automatically packaged into (one or more) OCI container images"

When triggered by base image updates or by code being merged, an automated process selects the correct version of each component and assembles them into a warmed-up container image (or set of images) that is tested, verified, and considered ready for production.
TBDs:
- Do we run Apache on the same container as PHP-FPM or isolated from it? Do jobs run in the same container or a different one? Etc.
- How do we handle local security patches in a way that applies reliably? And how do we avoid disclosing their existence / contents?
- How do we handle upstream-disclosed security patches in a way that applies reliably? And how do we avoid disclosing their existence / contents?
- How do we debounce/group changes to build? Currently the scap init i18n build and sync process alone takes ~45 minutes, and we land roughly 600 patches a week into production repos (i.e. one per ~17 minutes)
  - Initial start point: Build automatically every 24 hours (perhaps at 04:00 UTC as current global commit trough) [chat]
- How can we make this build as fast as possible (re-using layers?) without loss of generality (MediaWiki is monolithic, so a change in one extension could break all the others)?
- How do we audit changes if the built artefacts are private (due to local and upstream mitigations and pre-release security patches)?
- How do we apply emergency instant fixes (other than the static site config)? Do we need that still?
- How do we change our current build & test pipeline such that we have confidence in its results enough to deploy code?
- Are we still just building from master, or do we want to move to manual feature-branch-based picks?
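The debounce/grouping question above comes down to simple arithmetic. A minimal sketch, using only the figures quoted in the list (roughly 600 patches a week, a ~45 minute build, and a 24-hour initial build cadence); the function names are hypothetical, and it assumes patches land evenly across the week, which the commit trough mentioned above shows is not really the case:

```python
MINUTES_PER_WEEK = 7 * 24 * 60  # 10080


def minutes_between_patches(patches_per_week: float) -> float:
    """Average gap between merged patches, assuming an even spread."""
    return MINUTES_PER_WEEK / patches_per_week


def patches_per_build(patches_per_week: float, build_interval_minutes: float) -> float:
    """Average number of patches that accumulate in one build window."""
    return patches_per_week * build_interval_minutes / MINUTES_PER_WEEK


if __name__ == "__main__":
    # Average gap between patches at 600 patches a week.
    print(round(minutes_between_patches(600), 1))      # 16.8
    # Patches that land while a single ~45 minute build is running.
    print(round(patches_per_build(600, 45), 1))        # 2.7
    # Patches grouped into each build on a 24-hour cadence.
    print(round(patches_per_build(600, 24 * 60), 1))   # 85.7
```

On these averages a build started on every merge could never keep up, while a daily build would group on the order of 85 patches per image, which is relevant to the auditing and bisecting questions in the list.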
TODOs:
- Provide a mechanism to build and inject the site-variant configuration in a static form (see T223602)
- Provide a mechanism to warm up the APCu cache ahead of an image going live (replay last N production GET requests or similar?)
- Provide a mechanism to pre-build the common ResourceLoader requests and inject them into the Varnish caches.
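As an illustration of the "replay last N production GET requests" idea for APCu warm-up, here is a minimal sketch of the selection step only. It assumes the request log has already been parsed into (method, URL) pairs in chronological order; that input shape and the function name `last_n_get_urls` are hypothetical, and actually replaying the URLs against a not-yet-live container is left out:

```python
from collections import deque
from typing import Iterable, List, Tuple


def last_n_get_urls(requests: Iterable[Tuple[str, str]], n: int) -> List[str]:
    """Return the URLs of the last n GET requests, oldest first.

    `requests` is assumed to be chronological (method, url) pairs.
    """
    recent: deque = deque(maxlen=n)  # keeps only the n most recent entries
    for method, url in requests:
        if method.upper() == "GET":
            recent.append(url)
    return list(recent)


if __name__ == "__main__":
    log = [
        ("GET", "/wiki/A"),
        ("POST", "/w/api.php"),
        ("GET", "/wiki/B"),
        ("GET", "/wiki/C"),
    ]
    print(last_n_get_urls(log, 2))  # ['/wiki/B', '/wiki/C']
```

Restricting the replay to GET requests keeps the warm-up free of side effects, at the cost of not exercising code paths that are only reached through writes.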

"semi-automatically deployed"

On some trigger, the k8s production deployment state is updated to add the new version at a low percentage of traffic. It is then slowly scaled out to answer user requests until it is the only version running, or it is rolled back and removed if things go wrong.
TBDs:
- Will we have a staging environment for manual final verification before deployment?
- How do we trigger deployments? Just manually? Automatic every hour during "business hours"?
- Who can trigger deployments? All current deployers? All mergers (no)? Just SRE Service Ops and RelEng? Etc.
- How do we tell the controller (?) which deployment state to route a given request to?
- How does use of the new version scale out? By wiki in risk-led phases (like current train)? By use case?
- What metrics are we going to use to judge the pool scale-out?
  - Reliability: Logstash is quite noisy (and misses some things)?
  - Performance: ?!?
  - Features: ?!?
- Do we plan to generally run one and at most two flavours at once, or would we run more than that ever?
- How long a window of "old" pods would we keep around to roll back to?
- How do we handle content-variant cache purges in a way that scales? Can we make this less manual and more reliable?
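The scale-out-or-roll-back behaviour described at the top of this section can be sketched as a simple loop. This is only an illustration under stated assumptions: the traffic stages, the single error-rate metric, and the threshold are all hypothetical (which metrics to use is one of the open questions above), and `error_rate_at` stands in for whatever monitoring query would be used in practice:

```python
from typing import Callable, List, Sequence


def staged_rollout(
    stages: Sequence[int],
    error_rate_at: Callable[[int], float],
    max_error_rate: float,
) -> List[int]:
    """Step through traffic percentages for the new version.

    Returns the percentages visited, in order. A trailing 0 means the
    new version was rolled back and removed.
    """
    history: List[int] = []
    for percent in stages:
        history.append(percent)
        # Hypothetical health check: observed error rate at this stage.
        if error_rate_at(percent) > max_error_rate:
            history.append(0)  # roll back: new version serves no traffic
            return history
    return history


if __name__ == "__main__":
    stages = [1, 10, 50, 100]  # hypothetical traffic percentages

    # A healthy release reaches 100% of traffic.
    print(staged_rollout(stages, lambda p: 0.001, 0.01))  # [1, 10, 50, 100]

    # A release that degrades under load is rolled back at 50%.
    print(staged_rollout(stages, lambda p: 0.2 if p >= 50 else 0.001, 0.01))  # [1, 10, 50, 0]
```

The sketch stops at the first unhealthy stage rather than retrying, which matches the "rolled back and removed if things go wrong" wording; how long old pods stay available for that rollback is a separate question in the list above.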

What changes

Deployments will now be:
- atomic;
- more isolated; and
- scaled out and rolled back without manual intervention
…

How we get there

…