Wikimedia Apps/Team/RESTBase services for apps/Deployment process

Developer setup for updating deployment repo
Check out the basics for deployment.

We build the deploy repo in Docker to ensure that the Node modules have the correct binaries. The build works with Docker for Mac or on a Linux machine. Before your first build of the deploy repo, run through the setup instructions.

Update and build the deployment repo
Sync the code and deploy repos with current master:

If using Docker for Mac, start the Docker daemon by clicking the whale icon in the menu bar. (On Linux the daemon should already be running.) Then run the tests in Docker and build the new commit for the deploy repo:

And push to Gerrit:

You will find the new patch in the deploy repo in Gerrit.

Deploy to Beta Cluster
The steps are similar to those for production (see Deploy to Production below), but use different machines.

To deploy: (instead of ssh deployment .)

To verify something on the box you can ssh into.

Use  (instead of  ) to see if there are issues. You may want to log manually in this channel using  until T156079 is resolved. This logs to the Releng team's Server Admin Log.

More about: Beta Cluster

Deploy to Production
Scan through recent chat in  channel on IRC to make sure there's nothing blocking the deploy.

Optional: Look at deployment logs:

In another terminal start the actual deployment:

The scap deploy command above takes a reason string as its argument. If this string mentions Phabricator tasks, those tasks receive comments when the deployment starts and when it finishes. For example, if the deployment includes fixes for tasks T123 and T234, you could run this instead of the last command:
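As a rough illustration of the reason string described above, the following sketch assembles a message that names the Phabricator tasks. The helper name build_deploy_reason and the wording of the message are assumptions made for this example and are not part of scap; the only things taken from this page are that scap deploy accepts a reason string and that task IDs in it trigger comments on those tasks.

```shell
#!/bin/sh
# Hypothetical helper (not part of scap): build a reason string that
# mentions every Phabricator task fixed by this deployment.
# Usage: build_deploy_reason "summary" T123 T234 ...
build_deploy_reason() {
    summary=$1
    shift
    if [ "$#" -eq 0 ]; then
        printf '%s\n' "$summary"
    else
        # "$*" joins the remaining arguments with single spaces.
        printf '%s (%s)\n' "$summary" "$*"
    fi
}

# Print the command instead of running it, so the sketch is safe anywhere.
reason=$(build_deploy_reason "Update mobileapps" T123 T234)
printf 'scap deploy "%s"\n' "$reason"
```

The sketch only prints the command. On the deployment host you would run the printed line yourself from the deploy repo directory.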

The command deploys first to the canary server scb2001.codfw.wmnet. In a different terminal you can log in to the canary server and verify that the service responds as expected. Examples:

Once satisfied press  in the deployment terminal to continue deploying on the other servers without asking again. You can also press  to be prompted after every group.

The reason string passed to the scap deploy command shows up in IRC and SAL, once at the start and once at the end of the deployment.

Consider running the following commands from the same directory to check the deployment:

In case of issues, see how to undo a deploy.

See also scap3 and the deployment guide for further information.

Consider purging URLs
If the pagelib has changed we should consider purging the pagelib URLs. See Purge Varnish cache below.

Tagging deployments in Git
Production deployments are tracked with git tags in the main mobileapps repo. The most recent commit included in each deployment is given a tag in this format:  (e.g., ).

The mobileapps repo contains a shell script at scripts/git-deploy.sh that is used to apply these tags. Tags are cryptographically signed and a GPG signing key is therefore required. See the Git tag setup section for the one-time setup of that.

Note: If you use a different machine for tagging, first update the source and deploy repos on that machine.

Then run:

Example:

To verify that it worked, do either of the following:
 * Pull the tag into a different clone of the repo.
 * A bit later, check that the new tag appears on GitHub.
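The first check can be scripted. The sketch below is a minimal example, assuming a second local clone of the mobileapps repo; the helper name tag_exists, the clone path, and the tag variable are placeholders and do not come from scripts/git-deploy.sh.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given tag exists in the given clone.
# Usage: tag_exists /path/to/clone tag-name
tag_exists() {
    git -C "$1" rev-parse --quiet --verify "refs/tags/$2" >/dev/null
}

# Example (path and tag name are placeholders):
#   git -C ~/other-clone/mobileapps fetch --tags
#   tag_exists ~/other-clone/mobileapps "$tag" && echo "tag present"
#   # Also checks the GPG signature; needs the signer's public key locally.
#   git -C ~/other-clone/mobileapps tag -v "$tag"
```

The helper only confirms that the tag arrived in the other clone. The commented git tag -v call is what checks the signature.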

Update tasks in Phabricator
Move the tasks in the 'To deploy' column of the Product Infrastructure Kanban board to the 'Sign off' column and add a comment with the deploy tag if not already there.

Monitor log files
A few minutes after the deploy has finished, monitor Logstash for RESTBase and mobileapps.

Logs
The service is running on the following machines:

In your first terminal, tail the log file:

Alternatively:

Restart from deploy host via scap
From the deploy host, restart the mobileapps service Node.js processes on a single host, for example scb2003:

(-l is shorthand for --limit-hosts.)

Restart (directly on machine)
In another terminal restart the mobileapps service Node.js processes:

Simple checks
Check version and run the automatic monitoring check manually:

Wait 5-10 minutes, watching the log file and #wikimedia-operations for alerts.

Other things to check:
 * Uptime of service:
 * Versions:
 * If Swagger spec was changed for this deploy:
 * Example command to check an endpoint:
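To run the same check against every host, it can help to generate the URLs first. The sketch below is only an illustration: the host names are the ones listed in the Grafana and Icinga sections of this page, while the port 8888 and the /_info route (the usual service-template-node info route) are assumptions to adjust to the real service configuration. The helper name info_urls is made up for this example.

```shell
#!/bin/sh
# Hosts as listed in the Grafana and Icinga sections of this page.
EQIAD_HOSTS="scb1001 scb1002 scb1003 scb1004"
CODFW_HOSTS="scb2001 scb2002 scb2003 scb2004 scb2005 scb2006"

# ASSUMPTION: the service port; override with PORT=... if it differs.
PORT=${PORT:-8888}

# Hypothetical helper: print one info URL per host.
# Usage: info_urls DATACENTER host...
info_urls() {
    dc=$1
    shift
    for host in "$@"; do
        # ASSUMPTION: /_info is the service-template-node style info route.
        printf 'http://%s.%s.wmnet:%s/_info\n' "$host" "$dc" "$PORT"
    done
}

# Word splitting of the host lists is intentional here.
info_urls eqiad $EQIAD_HOSTS
info_urls codfw $CODFW_HOSTS
```

From a machine that can reach these hosts, the output could be piped to something like xargs -n1 curl -s to compare the reported versions.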

Refresh RESTBase cache
Refresh the aggregated featured feed stored in RESTBase/Cassandra for a single day. Example to run this from the prod cluster:

Notes:
 * Adjust the date (and wikipedia.org subdomain if necessary).
 * Another RESTBase machine could be used, too, but only one is needed to update the entry in Cassandra storage.
 * There's still the Varnish cache; see Purge Varnish cache below.
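The date and domain adjustment from the notes above can be sketched as a small helper. This example builds the public REST API URL of the featured feed for one day; the helper name featured_feed_url and the sample date are made up for illustration. The internal RESTBase host and port used from the prod cluster are not shown on this page and are left out here.

```shell
#!/bin/sh
# Hypothetical helper: print the public REST API URL of the featured feed
# for one day. Usage: featured_feed_url DOMAIN YYYY-MM-DD
featured_feed_url() {
    domain=$1
    day=$2
    case $day in
        [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]) ;;
        *) echo "expected YYYY-MM-DD, got: $day" >&2; return 1 ;;
    esac
    year=${day%%-*}     # text before the first dash
    rest=${day#*-}      # text after the first dash (MM-DD)
    month=${rest%-*}
    dom=${rest#*-}
    printf 'https://%s/api/rest_v1/feed/featured/%s/%s/%s\n' \
        "$domain" "$year" "$month" "$dom"
}

# Sample domain and date; adjust both as described in the notes.
featured_feed_url en.wikipedia.org 2017-01-23

# ASSUMPTION: a refresh is requested with a no-cache header, for example
#   curl -H 'Cache-Control: no-cache' "$(featured_feed_url en.wikipedia.org 2017-01-23)"
```

Validating the date format up front avoids requesting a malformed path by accident.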

Purge Varnish cache
See Multicast_HTCP_purging#One-off_purge on Wikitech

From mwmaint1002.eqiad.wmnet (terbium or deployment?). Examples:

Logstash/Kibana
mobileapps, RESTBase (direct), RESTBase (ES), Parsoid

Performance

 * mobileapps
 * Event bus delays: look at page-edit_delay Parsoid HTML and mobile-rerender-resource-change_delay
 * RESTBase backend requests: Parsoid rates
 * Public entry point request rates, req/s

Configuration

 * If only one of the servers sees load: (also see each host's weight)
 * Check conftool/eqiad/mobileapps
 * https://noc.wikimedia.org/conf/

Load

 * CPU: eqiad, codfw

Grafana
mobileapps, RESTBase, EventBus
 * eqiad: scb1001, scb1002, scb1003, scb1004; Cassandra: enwiki, other
 * codfw: scb2001, scb2002, scb2003, scb2004, scb2005, scb2006; Cassandra: enwiki, other

Icinga
Icinga (lower case user name):
 * eqiad: Mobileapps LVS eqiad; scb1001, scb1002, scb1003, scb1004
 * codfw: Mobileapps LVS codfw; scb2001, scb2002, scb2003, scb2004, scb2005, scb2006

Links

 * ServiceTemplateNode/Deployment
 * wikitech:Services/Deployment
 * older: Marko's deployment page
 * Deployment notes from service template
 * Docker
 * scap3 documentation
 * How to find basic info of last scap3 deployment