Wikimedia Release Engineering Team/Train deploys

Pairing on the Train
As of October 2019, there are two people assigned to each week's train: one as primary and one as backup. These are rough guidelines for sharing the work, and should be improved as we learn more.


 * On Monday, communicate with your partner and establish how you'll collaborate over the course of the week.
 * A useful pattern: post updates on IRC while your partner is working, and on the train blocker ticket when they're offline.
 * Liberal use of video chat for pairing on hard problems is encouraged.
 * It seems to work well to have the primary do the work of cutting the branch, syncing wikis, etc., while the backup keeps an eye on logs, works on improvements to deploy tooling, and is generally an extra pair of eyes for the whole process.
 * If you are in doubt about any part of the process and it's during your partner's working hours, consult them first and get their help in resolving your questions.
 * If one member of the pair is in the European window and one is in the American window, both train deployment windows should be reserved on the Deployments calendar. This gives a backup deployer a defined window for moving the train forward outside the primary's working hours, if it becomes necessary.
 * If the train is blocked or there are any other issues, communicate the transfer of responsibility on the train blocker ticket by assigning it to the responsible party and leaving a note.

Breakage
There will be times when this process does not go smoothly. Below are guidelines for what to do when that happens.

In general, if an unexplained error occurs within one hour of a train deployment, always roll back the train. Even when there are many possible causes for ongoing issues, rolling back is especially important because it eliminates the train as one of those causes.

Rollback
Rolling back a wikiversions change should be pretty quick. Roll back production before you send patches up to Gerrit, since waiting on Jenkins may take a while:

Example:

 * Wait for the patch to merge and the fetch back down to the deployment server.
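A minimal sketch of what the rollback looks like on the deployment server, assuming scap's `update-wikiversions` and `sync-wikiversions` subcommands; the hostname, group, version numbers, and task ID here are illustrative placeholders, not the exact values for any given week:

```shell
# Illustrative rollback of group1 to the previous branch.
# Versions and the task number below are placeholders.
ssh deployment.eqiad.wmnet                        # the active deployment server
cd /srv/mediawiki-staging
scap update-wikiversions group1 1.32.0-wmf.12     # rewrite wikiversions.json
scap sync-wikiversions "Rollback group1 to 1.32.0-wmf.12 (T191059)"
```

Afterwards, send the corresponding wikiversions change up to Gerrit so the repository matches what is actually deployed.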

Places to Watch for Breakage
Train deployers should check for breakage as they roll out the train; they are effectively the first line of defense for train deploys. Some of the places to watch for breakage:


 * IRC
 * the primary channel, plus other useful channels
 * for the full list of channels, see MediaWiki on IRC and IRC/Channels
 * mwlog1001
 * logspam-watch
 * the log files on that host
 * Logstash Fatal Monitor
 * Logstash MediaWiki Errors
 * Logstash "mediawiki-new-errors" dashboard (linked from logstash front page)
 * Showing only timeout errors (see T204871)
 * Group-specific Logstash Dashboards:
 * group0
 * group1
 * Grafana Varnish error-rate dashboard (HTTP 5XX % should have 3+ 0s after the decimal point, e.g. 0.0001%)
 * Grafana Frontend Responses NGINX vs Varnish
 * Grafana Production Logging
 * Minerva Client Errors - Browser JS error counts (mobile Wikipedias only)
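The Varnish 5XX threshold above ("3+ zeros after the decimal point") can be expressed as a quick sanity check. This helper is a hypothetical illustration of the rule of thumb, not part of any deploy tooling:

```python
def error_rate_ok(errors_5xx: int, total: int, max_rate: float = 0.00001) -> bool:
    """Return True if the 5XX share of responses is at or below max_rate.

    max_rate=0.00001 corresponds to 0.001%, i.e. "3+ zeros after the
    decimal point" when the rate is written as a percentage.
    """
    if total == 0:
        return True  # no traffic, nothing to alarm on
    return errors_5xx / total <= max_rate

# 50 errors in 10 million requests is 0.0005%, within the rule of thumb;
# 5,000 errors in 10 million is 0.05%, which warrants investigation.
print(error_rate_ok(50, 10_000_000))     # True
print(error_rate_ok(5_000, 10_000_000))  # False
```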

If the train is blocked

 * A task will be assigned to you, for example T191059 (1.32.0-wmf.13 deployment blockers)
 * Any open subtasks block the train from moving forward. This means no further deployments until the blockers are resolved.

Checklist

If there are blocking tasks, please do the following:


 * Make sure all tasks blocking the train are set to the appropriate priority in Phabricator
 * Comment on the task asking for an ETA, or whether it can be resolved by reverting a recent commit.
 * Send e-mail to:
 * ops@lists.wikimedia.org
 * wikitech-l@lists.wikimedia.org
 * Subject:
 * Body:
 * Add relevant people (see Developers/Maintainers) to the blocking task
 * Ping relevant people in IRC
 * Once the train is unblocked, be sure to thank the folks who helped unblock it

Monday: Sync up with your deployment partner
See the train pairing section above.

Before the deploy window
Depending on how practiced you are and where you choose to run commands (full clones of mediawiki-core from outside the cluster can take a while), the steps typically take 30 to 90 minutes.

Incident documentation

 * If there were problems during the train, follow instructions at Incident documentation on incident reports and post-mortem review.
 * Use the form to create a new page. Example: Incident documentation/20181212-Train-1.33.0-wmf.8.
 * For the Timeline section, events from the SAL and the Phabricator task are a good start.