Wikimedia Release Engineering Team/MW-in-Containers thoughts

What we want

 * Objective: MediaWiki* is automatically packaged into a k8s pod**, which is semi-automatically deployed*** into Wikimedia production

"MediaWiki"

 * All of what we consider the current appserver layer running on  servers (plus the Parsoid sub-cluster), as currently deployed through scap.
 * Included:
   * the  repos, including the code running on the appservers and the general and specialist jobrunners
   * the core site configuration in 's   and related files like static assets
   * the appserver layer of Apache and PHP-FPM and related code (like  clients) and APCu
   * specialist code like ghostscript (for PDF rendering), lilypond (for sheet music LaTeX rendering) and ffmpeg (for video transcoding and scaling)
   * built artefacts that vary with the code but not with the request; currently the l10n CDB/PHP files, and perhaps also the base ResourceLoader bundles?
 * Excluded:
   * the persistent data-heavy layers (primary and replica MySQL instances, external store MySQL instances, logstash, kafka, Swift, …)
   * the caching layers which vary on content more than code (memcached, Varnish, and Apache Traffic Server)
   * the independent services running in their own k8s pools (mathoid, page content service, kask, …) or independently on bare metal (ElasticSearch, Kafka transit, Thumbor, …)
   * site-variant configuration, as currently specified in 's
 * TBDs:
   * What is missing from or wrong in this list?

"automatically packaged into a k8s pod"

 * When triggered by base image updates or code being merged, an automated process selects the correct versions of each of the components and assembles them into a warmed-up pod that is tested, verified, and considered ready for production
 * TBDs:
   * Do we run Apache in the same container as PHP-FPM or isolated from it? Do jobs run in the same container or a different one? Etc.
   * How do we handle local security patches in a way that applies reliably? And how do we avoid disclosing their existence / contents?
   * How do we handle upstream-disclosed security patches in a way that applies reliably? And how do we avoid disclosing their existence / contents?
   * How do we debounce/group changes to build? Currently the  i18n build and sync process alone takes ~45 minutes and we land roughly 600 patches a week into production repos (i.e. one every ~17 minutes on average)
   * How can we make this build as fast as possible (re-using layers?) without loss of generality (MediaWiki is monolithic, so a change in one extension could break all the others)?
   * How do we audit changes if the built artefacts are private (due to local and upstream mitigations and pre-release security patches)?
   * How do we apply emergency instant fixes (other than the static site config)? Do we still need that?
   * How do we change our current build & test pipeline so that we have enough confidence in its results to deploy code?
   * Are we still just building from master, or do we want to move to manual feature-branch-based picks?
 * TODOs:
   * Provide a mechanism to build and inject the site-variant configuration in a static form (see T223602)
   * Provide a mechanism to warm up the APCu cache ahead of an image going live (replay last N production GET requests or similar?)
   * Provide a mechanism to pre-build the common ResourceLoader requests and inject them into the Varnish caches.
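
To make the debounce/group question concrete, here is a minimal sketch of one possible grouping policy, written in Python. It is an assumption for illustration, not a proposed design: a single builder starts a build as soon as a merge arrives, and every merge that lands while a build is running is folded into the next build. The function names (`mean_gap_minutes`, `batch_merges`) are hypothetical; only the ~45 minute build time and the 600 patches a week come from the notes above.

```python
MINUTES_PER_WEEK = 7 * 24 * 60  # 10080


def mean_gap_minutes(patches_per_week):
    """Average gap between merges, in minutes."""
    return MINUTES_PER_WEEK / patches_per_week


def batch_merges(merge_times, build_minutes):
    """Group merge timestamps (in minutes) into builds.

    Assumed policy: one builder; an idle builder starts a build for the
    merge that wakes it; merges arriving during a build wait and are all
    included in the next build, which starts when the current one ends.
    Returns a list of (build_start_time, [merge_times_in_build]).
    """
    batches = []
    pending = []   # merges waiting for the builder to become free
    free_at = 0.0  # time at which the builder is next idle
    for t in sorted(merge_times):
        if pending and free_at <= t:
            # The builder finished before this merge: flush the queue.
            batches.append((free_at, pending))
            free_at += build_minutes
            pending = []
        if not pending and free_at <= t:
            # Builder is idle: build this merge straight away.
            batches.append((t, [t]))
            free_at = t + build_minutes
        else:
            pending.append(t)
    if pending:
        batches.append((free_at, pending))
    return batches


if __name__ == "__main__":
    print(round(mean_gap_minutes(600), 1))  # 16.8
    for start, merges in batch_merges([0, 17, 34, 51, 68, 85, 102], 45):
        print(start, merges)
```

Under these assumptions, 600 patches a week is one merge every 16.8 minutes, so a 45 minute build absorbs between two and three merges on average (45 / 16.8 ≈ 2.7). Real merge traffic is bursty rather than evenly spaced, so actual batch sizes would vary more than this.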

"semi-automatically deployed"

 * On some trigger, the new pod is added into the production pool and slowly scaled out to answer user requests until it is the only pod running or is removed
 * TBDs:
   * Will we have a staging environment for manual final verification before deployment?
   * How do we trigger deployments? Just manually? Automatically every hour during "business hours"?
   * Who can trigger deployments? All current deployers? All mergers (no)? Just SRE Service Ops and RelEng? Etc.
   * How do we tell the controller (?) which pod to route a given request to?
   * How does usage of the new pod scale out? By wiki in risk-led phases (like the current train)? By use case?
   * What metrics are we going to use to judge the pool scale-out?
     * Reliability: Logstash is quite noisy (and misses some things)?
     * Performance: ?!?
     * Features: ?!?
   * Do we plan to generally run one, and at most two, flavours at once, or would we ever run more than that?
   * How long a window of "old" pods would we keep around to roll back to?
   * How do we handle content-variant cache purges in a way that scales? Can we make this less manual and more reliable?
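
The scale-out behaviour described above ("slowly scaled out ... until it is the only pod running or is removed") can be sketched as a staged loop with a health gate. This is a hedged illustration in Python only: the stage sizes, the single error-rate metric, the threshold, and the names `roll_out` and `error_rate_at` are all assumptions, since the metrics and phases are still open questions in the list above.

```python
from typing import Callable, List, Sequence, Tuple


def roll_out(
    stages: Sequence[float],
    error_rate_at: Callable[[float], float],
    max_error_rate: float,
) -> Tuple[str, List[float]]:
    """Walk the new pod through increasing traffic shares.

    stages:         traffic shares to try in order, e.g. (0.01, 0.1, 0.5, 1.0)
    error_rate_at:  hypothetical probe returning the observed error rate
                    while the new pod serves the given share of requests
    max_error_rate: gate; exceeding it removes the new pod (share 0.0)

    Returns the outcome and the sequence of traffic shares that were applied.
    """
    history: List[float] = []
    for share in stages:
        history.append(share)
        if error_rate_at(share) > max_error_rate:
            # Gate failed: take the new pod out of the pool entirely.
            history.append(0.0)
            return "rolled_back", history
    return "complete", history


if __name__ == "__main__":
    stages = (0.01, 0.1, 0.5, 1.0)
    # A build that stays healthy at every stage.
    print(roll_out(stages, lambda share: 0.001, 0.01))
    # A build that only misbehaves under heavier load.
    print(roll_out(stages, lambda share: 0.05 if share >= 0.5 else 0.001, 0.01))
```

The stages here are plain traffic fractions; phasing by wiki or by use case, as asked above, would replace the `stages` sequence with a list of groups while keeping the same gate-then-advance shape.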

What changes

 * Deployments will now be:
   * atomic;
   * more isolated; and
   * scaled out and rolled back without manual intervention