Wikimedia Release Engineering Team/Runbooks

This page lists the runbooks of the Wikimedia Release Engineering Team: step-by-step instructions for routine tasks and, especially, for when things go wrong.

Gerrit

 * Monitoring/Metrics
 * Take a thread dump
 * GitHub replicas

GitLab

 * Provisioning a new shared runner

Configuration

 * Add/modify CI for a new/existing repo (Zuul)
 * Adding a new release pipeline for MediaWiki to CI (Zuul)
 * Add/modify a new type of CI job (Jenkins Job Builder)
 * Add/modify a new Docker environment for CI jobs (Dockerfiles)
 * Creating and deploying a new Quibble release (Dockerfiles + JJB for the critical CI workflow)
 * Update doc.wikimedia.org static content (docroot)
 * Replay a Gerrit CI event into Zuul to re-trigger jobs

Infrastructure

 * Clear part of Jenkins, when jobs are deadlocked ("waiting on executors") / Jenkins stuck
 * Restart Zuul (and drop all running jobs!)
 * Agent remote call failed
 * Upgrade Jenkins
 * Work requests waiting in Zuul is CRITICAL: this is usually a long chain of patchsets
   * If you caught it fast enough, you can follow the steps at Continuous_integration/Zuul
   * Otherwise, your options are to wait or to restart Zuul
 * Adding a new Jenkins agent
 * Deploy doc.wikimedia.org changes
 * Switch primary host for doc.wikimedia.org

Phabricator

 * Phabricator Administrative Commands

Deployments Schedule

 * Generating the wikitech:Deployments Page
 * Generating the train blocking tasks on Phab