Wikimedia Apps/Team/RESTBase services for apps/Deployment process

Developer setup for updating deployment repo
Check out the basics for deployment.

We build the deploy repo in Docker, using Docker for Mac or a Linux machine, to ensure that the Node modules have the correct binaries. Before your first build of the deploy repo, run through the setup instructions.

Update and build the deployment repo
Sync the code and deploy repos with current master:

If using Docker for Mac, start the Docker daemon by clicking the whale icon in the menu bar. (On Linux the daemon should start automatically.) Then run the tests in Docker and build the new commit for the deploy repo:

And push to Gerrit:

You will find the new patch in Gerrit at https://gerrit.wikimedia.org/r/#/projects/mediawiki/services/mobileapps/deploy,dashboards/default:recent.

Troubleshooting
If something goes wrong you'll need to bring the deploy repo back to its original state. Commands like these might help:

Hint: If the  command fails you can add the   option and try again.

Deploy to Beta Cluster
The steps are similar to those in Deploy to Production, but use different machines.

To deploy:

To verify something on the box:

Use  (instead of  ) to see if there are issues. You may want to log manually in this channel using  until T156079 is resolved. This logs to the Releng team's Server admin log.

More about: Beta Cluster

Deploy to Production
Scan through recent chat in  channel on IRC to make sure there's nothing blocking the deploy.

Look at deployment logs:

In another terminal start the actual deployment:

The command deploys first to the canary server, scb2001.codfw.wmnet. In a different terminal you can log in to the canary server and verify that the service responds as expected. Examples:

Once satisfied press  in the deployment terminal to continue deploying on the other servers without asking again. You can also press  to be prompted after every group.

The string parameter passed to the scap deploy command will show up in IRC  and SAL: once at the start of the deployment and once at the end.

Consider running the following commands from the same directory to check the deployment:

In case of issues see how to undo deploy.

See also scap3 and deployment guide for further info.

Tagging deployments in Git
Production deployments are tracked with git tags in the main mobileapps repo. The most recent commit included in each deployment is given a tag in this format:  (e.g., ).

The mobileapps repo contains a shell script at scripts/git-deploy.sh that is used to apply these tags. Tags are cryptographically signed and a GPG signing key is therefore required. See the Git tag setup section for the one-time setup of that.

Note: First update the source and deploy repos on your machine if you use another machine for tagging!

Then run:

Example:

Lately, Bernd has been getting an error message back from this command. When that happens, check whether the tag was pushed to origin. If not, then do something like this:
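For the check itself, one option is to look for the tag in the output of `git ls-remote --tags origin`. The following is a minimal sketch of that idea, not part of scripts/git-deploy.sh; the helper names and the sample tag name are hypothetical.

```python
import subprocess


def tag_in_ls_remote(ls_remote_output: str, tag: str) -> bool:
    """Return True if `tag` is listed in `git ls-remote --tags` output."""
    wanted = f"refs/tags/{tag}"
    for line in ls_remote_output.splitlines():
        parts = line.split("\t")
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        ref = parts[1].strip()
        # Annotated (including signed) tags are listed twice; the second
        # entry has a "^{}" suffix and points at the tagged commit.
        if ref == wanted or ref == wanted + "^{}":
            return True
    return False


def remote_has_tag(tag: str, remote: str = "origin") -> bool:
    """Ask the remote for its tags and check whether `tag` is among them."""
    result = subprocess.run(
        ["git", "ls-remote", "--tags", remote],
        capture_output=True,
        text=True,
        check=True,
    )
    return tag_in_ls_remote(result.stdout, tag)
```

Run `remote_has_tag("<your tag>")` from inside the mobileapps checkout; a `False` result suggests the tag exists only locally and still needs to be pushed.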

Update tasks in Phabricator
Move the tasks in the 'To deploy' column of the Reading Infrastructure Kanban board to the 'Deployed' column and add a comment with the deploy tag.

Logs
The service is running on the following machines:

In your first terminal, tail the log file:

Alternatively:

Restart
In another terminal restart the mobileapps service Node.js processes:

Simple checks
Check version and run the automatic monitoring check manually:

Wait 5-10 minutes, watching the log file and #wikimedia-operations for alerts.

Other things to check:
 * Uptime of service:
 * Versions:
 * If Swagger spec was changed for this deploy:
 * Example command to check an endpoint:

Refresh RESTBase cache
Refresh the aggregated featured feed stored in RESTBase/Cassandra for a single day. Example to run this from the prod cluster:

Notes:
 * Adjust the date (and wikipedia.org subdomain if necessary).
 * Another RESTBase machine could be used, too, but only one is needed to update the entry in Cassandra storage.
 * There's still the Varnish cache, see Purge Varnish cache.
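To make the "adjust the date and subdomain" note concrete, here is a small sketch that builds the featured feed URL for a given day. It only constructs the URL and sends no request. The path layout follows the public REST API (`/api/rest_v1/feed/featured/yyyy/mm/dd`), which is an assumption here: the internal RESTBase host, port, and any headers needed to force a refresh are not covered, and the function names are hypothetical.

```python
from datetime import date


def featured_feed_path(day: date) -> str:
    """Path of the aggregated featured feed for one day (zero-padded)."""
    return f"/api/rest_v1/feed/featured/{day:%Y/%m/%d}"


def featured_feed_url(day: date, subdomain: str = "en") -> str:
    """Public URL of the featured feed for `day` on <subdomain>.wikipedia.org."""
    return f"https://{subdomain}.wikipedia.org{featured_feed_path(day)}"


if __name__ == "__main__":
    print(featured_feed_url(date.today()))
```

Passing a different `subdomain` (for example `"de"`) targets another language edition.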

Purge Varnish cache
See Multicast_HTCP_purging#One-off_purge on Wikitech

From terbium (or tin?). Example for English announcements feed:

Dashboards

 * Performance:
   * Event bus delays: look at page-edit_delay Parsoid HTML and mobile-rerender-resource-change_delay
   * RESTBase backend requests: Parsoid rates
   * Public entry point request rates, req/s
   * Mobileapps request rates and latencies
 * If only one of the servers sees load: (also see each host's weight)
   * Check conftool/eqiad/mobileapps
 * Load:
   * CPU: eqiad, codfw

Other things you might want to monitor:
 * logstash/Kibana: mobileapps, restbase
 * ICINGA Web UI
 * Grafana: mobileapps, RESTBase, EventBus
 * eqiad: scb1001, scb1002, scb1003, scb1004; Cassandra
 * codfw: scb2001, scb2002, scb2003, scb2004, scb2005, scb2006; Cassandra
 * ICINGA mobileapps endpoints health:
   * eqiad: scb1001, scb1002, scb1003, scb1004
   * codfw: scb2001, scb2002, scb2003, scb2004, scb2005, scb2006
 * Ganglia (OLD, prefer Grafana):
   * eqiad: scb1001, scb1002, scb1003, scb1004
   * codfw: scb2001, scb2002, scb2003, scb2004, scb2005, scb2006
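When checking several hosts in turn it can help to generate the host names rather than type them. The sketch below expands the lists above into fully qualified names. The `<host>.<datacenter>.wmnet` pattern is taken from the canary name scb2001.codfw.wmnet; applying it to the eqiad hosts is an assumption, and the names `SCB_HOSTS` and `fqdns` are hypothetical.

```python
# Hosts per datacenter, copied from the lists above.
SCB_HOSTS = {
    "eqiad": ["scb1001", "scb1002", "scb1003", "scb1004"],
    "codfw": ["scb2001", "scb2002", "scb2003", "scb2004", "scb2005", "scb2006"],
}


def fqdns(hosts: dict[str, list[str]] = SCB_HOSTS, domain: str = "wmnet") -> list[str]:
    """Expand short host names to <host>.<datacenter>.<domain>."""
    return [
        f"{name}.{datacenter}.{domain}"
        for datacenter, names in hosts.items()
        for name in names
    ]


if __name__ == "__main__":
    for host in fqdns():
        print(host)
```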

See also Dealing with deploy problems and reverting deploys.

Links

 * ServiceTemplateNode/Deployment
 * wikitech:Services/Deployment
 * older: Marko's deployment page
 * Deployment notes from service template
 * Docker
 * scap3 documentation
 * How to find basic info of last scap3 deployment