Wikimedia Apps/Team/RESTBase services for apps/Deployment process

Developer setup for updating deployment repo

We build the deploy repo in Docker, using Docker for Mac or a Linux machine, to ensure that the Node modules have the correct binaries. Before you start the first build of the deploy repo, run through the setup instructions.

Update and build the deployment repo

Sync the code and deploy repos with current master:

cd ~/code/mcs/mobileapps
git status
git reset --hard origin/master
git checkout .
git clean -fd
git checkout master
git pull
git status
git --no-pager log --decorate -n 1
rm -rf node_modules/
npm ci service-runner

cd ../deploy
git status
git reset --hard origin/master
git checkout .
git clean -fd
git checkout master
git pull
git --no-pager log --decorate -n 1
git submodule update --init
git branch
git status
cd ../mobileapps
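
Both repos go through the same reset sequence, so it could be wrapped in a small helper. The following is only a sketch: the names run and sync_to_master and the DRY_RUN variable are hypothetical and not part of either repo. Unlike the listing above, it fetches before resetting so that origin/master is current, and it leaves out the npm and submodule steps.

```shell
# Hypothetical helper: run a command, or only print it when DRY_RUN is set.
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Hypothetical helper: bring the clone in directory $1 to a clean origin/master.
# Warning: this discards local changes and untracked files, like the steps above.
sync_to_master() {
  repo_dir=$1
  run git -C "$repo_dir" fetch origin &&
  run git -C "$repo_dir" checkout master &&
  run git -C "$repo_dir" reset --hard origin/master &&
  run git -C "$repo_dir" clean -fd
}

# Usage (not run here); DRY_RUN=1 only prints the git commands:
#   DRY_RUN=1 sync_to_master ~/code/mcs/mobileapps
#   sync_to_master ~/code/mcs/deploy
```

Printing the commands first is a cheap way to check the target directory before anything is discarded.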

If using Docker for Mac, start the Docker daemon by clicking the whale icon in the menu bar. (On Linux it should start automatically.) Run the tests in Docker and build the new commit for the deploy repo:

./server.js build --deploy-repo --force

And push to Gerrit:

cd ../deploy
git review

You will find the new patch in the deploy repo in Gerrit.

Give the patch Code-Review +2 (C:+2) in Gerrit.

Deploy to Beta Cluster

The steps are similar to #Deploy to Production, but use different machines.

To deploy:

ssh deployment-deploy03.deployment-prep.eqiad.wmflabs

(instead of ssh deployment.)

Example URL: https://en.wikipedia.beta.wmflabs.org/api/rest_v1/page/mobile-html/Dog.

To verify something on the box you can ssh into deployment-mcs01.deployment-prep.eqiad.wmflabs.

Use #wikimedia-releng (instead of #wikimedia-operations) to see if there are issues. You may want to log manually in this channel using !log until phab:T156079 is resolved. This logs to the Releng team's Server admin log.

More about: Beta Cluster

Deploy to Production

Scan through recent chat in #wikimedia-operations channel on IRC to make sure there's nothing blocking the deploy.

Optional: Look at deployment logs:

ssh deployment  #deployment.eqiad.wmnet, currently points to deploy1001
cd /srv/deployment/mobileapps/deploy/
scap deploy-log

In another terminal start the actual deployment:

ssh deployment
cd /srv/deployment/mobileapps/deploy/
git pull && git submodule update
git log -n 1
scap deploy "`git log --pretty=format:'%s' -n 1`"

The scap deploy command above takes a reason string as its argument. If this string contains Phabricator task IDs, those tasks get comments about the deployment (at start and at finish). For example, if the deployment contains fixes for tasks T123 and T234, you could replace the last command with:

scap deploy "`git log --pretty=format:'%s' -n 1` (T123 T234)"
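
The reason string could also be assembled by a small helper, so that the task IDs are passed as separate arguments. This is a minimal sketch; the function name deploy_reason is hypothetical and not part of scap or the deploy repo.

```shell
# Hypothetical helper: build the reason string for `scap deploy`.
# $1 is the commit subject; any further arguments are Phabricator task IDs.
deploy_reason() {
  subject=$1
  shift
  if [ "$#" -gt 0 ]; then
    # "$*" joins the remaining arguments with single spaces.
    printf '%s (%s)\n' "$subject" "$*"
  else
    printf '%s\n' "$subject"
  fi
}

# Usage on the deploy host (not run here):
#   scap deploy "$(deploy_reason "$(git log --pretty=format:'%s' -n 1)" T123 T234)"
```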

The command deploys first to the canary server scb2001.codfw.wmnet. In a different terminal you can log in to the canary server and verify that the service responds as expected. Examples:

ssh scb2001
curl localhost:8888/_info/version
curl localhost:8888/en.wikipedia.org/v1/page/mobile-sections/Dog
curl localhost:8888/en.wikipedia.org/v1/feed/announcements | jq .
curl localhost:8888/en.wikipedia.org/v1/page/news | jq .
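
To see at a glance whether the canary answers, the requests above could be reduced to their HTTP status codes. The following is a sketch under assumptions: the function name check_endpoints is hypothetical, and it treats any status other than 200 as a failure, which may be too strict for some endpoints.

```shell
# Hypothetical helper: request each path on a host and report the HTTP status.
# Returns non-zero if any path does not answer with 200.
check_endpoints() {
  base_url=$1
  shift
  failures=0
  for path in "$@"; do
    # -s: silent, -o /dev/null: discard the body, -w: print only the status code
    status=$(curl -s -o /dev/null -w '%{http_code}' "$base_url$path")
    printf '%s %s\n' "$status" "$path"
    [ "$status" = "200" ] || failures=$((failures + 1))
  done
  [ "$failures" -eq 0 ]
}

# Usage on the canary (not run here):
#   check_endpoints localhost:8888 /_info/version \
#     /en.wikipedia.org/v1/page/mobile-sections/Dog \
#     /en.wikipedia.org/v1/feed/announcements
```

This only checks that the service answers; looking at the actual response bodies, as in the curl examples, is still worthwhile.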

Once satisfied, press c in the deployment terminal to continue deploying to the other servers without being asked again. You can also press y to be prompted after every group.

The reason string passed to scap deploy shows up in IRC #wikimedia-operations and in SAL, once at the start and once at the end of the deployment.

Consider running the following commands from the same directory to check the deployment:

grep '^t\|commit\|user' .git/DEPLOY_HEAD
git --no-pager log --decorate -n 1

In case of issues see how to undo deploy.

See also scap3 and deployment guide for further info.

Consider purging URLs

If the pagelib has changed we should consider purging the pagelib URLs. See Purge Varnish cache below.

Tagging deployments in Git

Production deployments are tracked with git tags in the main mobileapps repo. The most recent commit included in each deployment is given a tag in this format: <deploy/YYYY-MM-DD/{deploy repo short commit hash}> (e.g., deploy/2016-01-12/683d73e).
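
To illustrate the tag format only, here is a minimal sketch that composes such a tag name. The function name deploy_tag is hypothetical, and this is not how the tagging script in the repo is necessarily implemented.

```shell
# Hypothetical helper: compose a deploy tag name.
# $1 is the deploy date (YYYY-MM-DD), $2 the short commit hash of the deploy repo.
deploy_tag() {
  printf 'deploy/%s/%s\n' "$1" "$2"
}

# Usage (not run here):
#   deploy_tag "$(date +%Y-%m-%d)" "$(git -C ~/code/mcs/deploy rev-parse --short HEAD)"
```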

The mobileapps repo contains a shell script at scripts/tag-deploy.sh that is used to apply these tags. Tags are cryptographically signed, so a GPG signing key is required. See the Git tag setup section for the one-time setup.

Note: If you use a different machine for tagging, first update the source and deploy repos on that machine!

Then run:

./scripts/tag-deploy.sh

Example:

cd ~/code/mcs/deploy
git checkout master
git pull --rebase origin master
git submodule update --init
cd ~/code/mcs/mobileapps
git pull --rebase origin master
./scripts/tag-deploy.sh

To verify it worked you can do either of these:

You can fetch the tag from a different clone of the repo.
A bit later you can see the new tag on GitHub.

Update tasks in Phabricator

Move the tasks in the 'To deploy' column of the Product Infrastructure Kanban board to the 'Sign off' column and add a comment with the deploy tag if not already there.

Monitor log files

A few minutes after the deploy has finished, monitor Logstash for RESTBase and mobileapps.

Troubleshooting & Restarting services

Logs

Scap3 handles most of the command-line steps here, so this section is kept mostly for troubleshooting purposes.

The service is running on the following machines:

scb1001.eqiad.wmnet
scb1002.eqiad.wmnet
scb1003.eqiad.wmnet
scb1004.eqiad.wmnet
scb2001.codfw.wmnet
scb2002.codfw.wmnet
scb2003.codfw.wmnet
scb2004.codfw.wmnet
scb2005.codfw.wmnet
scb2006.codfw.wmnet

In your first terminal, tail the log file:

tail-mobileapps -f

Alternatively:

tail -2000 /srv/log/mobileapps/main.log \
| grep -v 'Could not find a definition for' | grep -v 'missingtitle' | grep -v 'Page or revision not found' | grep -v '501: unsupported_language'
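
The chained grep -v calls could be collapsed into a single grep invocation. A minimal sketch, assuming a hypothetical helper name filter_noise; the excluded patterns are the ones from the command above.

```shell
# Hypothetical helper: drop the known-noisy messages from log lines on stdin.
# -v inverts the match, -F treats patterns as fixed strings,
# and each -e adds one pattern to exclude.
filter_noise() {
  grep -v -F \
    -e 'Could not find a definition for' \
    -e 'missingtitle' \
    -e 'Page or revision not found' \
    -e '501: unsupported_language'
}

# Usage on a service host (not run here):
#   tail -2000 /srv/log/mobileapps/main.log | filter_noise
```

Adding or removing a noisy message then means editing one line of the helper.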

Restart from deploy host via scap

From the deploy host, restart the mobileapps service Node.js processes on a single host, for example scb2003:

cd /srv/deployment/mobileapps/deploy/
scap deploy --service-restart -l scb2003.codfw.wmnet "Restarting mobileapps on scb2003"

(-l is shorthand for --limit-hosts)

Restart (directly on machine)

In another terminal restart the mobileapps service Node.js processes:

cd /srv/deployment/mobileapps/deploy/
git log -n 1

ps -ef|grep mobileapps|wc -l
sudo service mobileapps restart
ps -ef|grep mobileapps|wc -l

Simple checks

Check version and run the automatic monitoring check manually:

check-mobileapps
# runs:
/usr/local/lib/nagios/plugins/service_checker 127.0.0.1 http://localhost:8888

Wait 5-10 minutes, watching the log file and #wikimedia-operations for alerts.

Other things to check:

Uptime of service:

sudo service mobileapps status

Versions:

curl localhost:8888/_info/version

If Swagger spec was changed for this deploy:

curl localhost:8888/?spec

Example command to check an endpoint:

curl localhost:8888/en.wikipedia.org/v1/feed/announcements
# beta cluster:
curl localhost:8888/en.wikipedia.beta.wmflabs.org/v1/feed/announcements

Refresh RESTBase cache

Refresh the aggregated featured feed stored in RESTBase/Cassandra for a single day. Example to run this from the prod cluster:

curl -H 'Cache-Control: no-cache' https://restbase.discovery.wmnet:7443/en.wikipedia.org/v1/feed/featured/2017/01/11

Notes:

Adjust the date (and wikipedia.org subdomain if necessary).
Another RESTBase machine could be used, too, but only one is needed to update the entry in Cassandra storage.
There's still the Varnish cache, see

curl -sI https://en.wikipedia.org/api/rest_v1/feed/featured/2017/01/11 | grep '^cache-control:'
cache-control: s-maxage=300, max-age=60
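
Since the date and the wiki domain are the parts that change, the refresh URL could be built from them. This is a sketch only: the function name featured_feed_url is hypothetical, and the host and port are copied from the example above.

```shell
# Hypothetical helper: build the internal RESTBase URL of the featured feed for one day.
# $1 is the wiki domain, $2 the date as YYYY-MM-DD.
featured_feed_url() {
  domain=$1
  # Turn 2017-01-11 into 2017/01/11 for the URL path.
  date_path=$(printf '%s' "$2" | sed 's|-|/|g')
  printf 'https://restbase.discovery.wmnet:7443/%s/v1/feed/featured/%s\n' "$domain" "$date_path"
}

# Usage from the production cluster (not run here):
#   curl -H 'Cache-Control: no-cache' "$(featured_feed_url en.wikipedia.org 2017-01-11)"
```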

Purge Varnish cache

See Multicast_HTCP_purging#One-off_purge on Wikitech

From mwmaint1002.eqiad.wmnet (terbium or deployment?). Examples:

echo 'https://meta.wikimedia.org/api/rest_v1/data/css/mobile/base' | mwscript purgeList.php
echo 'https://meta.wikimedia.org/api/rest_v1/data/css/mobile/pcs' | mwscript purgeList.php
echo 'https://meta.wikimedia.org/api/rest_v1/data/javascript/mobile/pcs' | mwscript purgeList.php
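
The three purges above could be sent in one go. The following sketch assumes that purgeList.php accepts several URLs on standard input, one per line; verify this on the maintenance host before relying on it. The function name pagelib_urls is hypothetical.

```shell
# Hypothetical helper: print the pagelib URLs from the examples above, one per line.
pagelib_urls() {
  base='https://meta.wikimedia.org/api/rest_v1/data'
  for path in css/mobile/base css/mobile/pcs javascript/mobile/pcs; do
    printf '%s/%s\n' "$base" "$path"
  done
}

# Usage on the maintenance host (not run here):
#   pagelib_urls | mwscript purgeList.php
```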