Continuous integration/Zuul

Zuul is a Python daemon that acts as a gateway between Gerrit and Jenkins. It listens to the Gerrit event feed and triggers job functions registered by Jenkins through the Jenkins Gearman plugin. The job triggering specification is written in YAML and hosted in a git repository.

Operational information
A few monitoring probes in Icinga alert members of the 'contint' group. For the Gearman wait queue, one can look at the Grafana board https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10. An alert is often just a spike of requests for the zuul-merger, and sometimes it is due to all Jenkins executors being busy. The alarm usually resolves itself.

On the CI master, one can look at the work status using zuul-gearman.py status (see the Debugging section below).

Architecture overview
''The settings described below come mostly from configuration that is maintained in Puppet. They might not be up to date on this wiki page.''

Zuul maintains an SSH connection with the Gerrit master. It connects as a dedicated user and issues a Gerrit command that provides a JSON feed of anything happening in Gerrit that is visible to the jenkins-bot user.

On startup, the main process forks to boot an embedded Gearman server used to communicate with Jenkins. A separate, independent merger process connects to zuul-server and handles the git merges of proposed patches on the tip of the target branch.

Zuul git repositories
Whenever a new project is detected, Zuul clones a non-bare repository from the Gerrit master under the base path set in zuul.conf. Zuul uses non-bare repositories to merge the received patchsets against the tip of the branch they target. The end result is often a merge commit, which is marked with a git reference (refs/zuul/<branch>/Z…). The reference is passed when triggering a job so that Jenkins can ultimately fetch it.

The local merge commits are available neither publicly nor in Gerrit. Nonetheless, the Zuul repositories are made available to the Wikimedia internal network over the git protocol on port 9418. This is done with a git daemon, which is restricted to the internal network by ferm rules defined in Puppet.

Access by replica to Zuul repositories
The Zuul repositories should be accessed through a hostname that points to the active contint server.

On the server one can clone the mediawiki/core repository, though the master branch there will not be the one from Gerrit but a random patch merge.

As of July 2014, work is ongoing to have a Zuul merger run on the second server, lanthanum.eqiad.wmnet.

A second merger on lanthanum is not implemented yet since labs instances do NOT have access to production private IP addresses.

Git replications
Note that the continuous integration production servers also receive replicated Git repositories. Those are bare repositories, which are not suitable for testing patch sets via Zuul. The replication has been set up for two main uses:
 * taking snapshots with a command that is not supported by Gerrit 2.8
 * using them as reference repositories, so that Jenkins replicas do not have to fetch the whole repository over the network. Git clone will create hardlinks since those repositories are on the same disk (SSD) as the workspace.

Triggering
When an event is received, Zuul passes it through a workflow specification defined in a YAML file. Zuul communicates with its internal Gearman daemon to launch a Gearman function and then resumes processing. The Gearman server receives a set of parameters from Zuul, such as the project name and commit SHA1, and finds a suitable worker to execute the function. As of January 2014 there is only one worker, which is the Continuous integration Jenkins master server. Jenkins runs the job and executes a Gearman function to report back the test results. The report is handled by the Jenkins worker, which updates job descriptions, and by Zuul itself, which reports back in Gerrit as a comment.

Whenever Jenkins is not reachable, or a job is deleted while running, the build result is considered lost and Zuul reports the status of the build as LOST.

Split between check and test
Jobs executed on patch upload are split into two groups: jobs that execute code from the uploaded patch run only in the test pipeline, while jobs that do not execute such code run in both the check and test pipelines. This is so that unknown registered accounts can't execute code on the Jenkins replicas. (This will not be needed any more once everything runs in Continuous_integration/Architecture/Isolation.)

The white list for the test pipeline and the negated white list for the check pipeline should be kept in sync.

Debugging
The Gearman server is embedded inside Zuul and uses a Python module. You can send administrative commands to the server with our zuul-gearman.py utility.

To list the jobs registered in Gearman, send the status administrative command to the Zuul Gearman server.

The fields read as:
 * the Gearman function name (a prefix followed by the Jenkins job name)
 * the number of currently queued instances of that job
 * the number of currently running jobs
 * the number of workers for the job (there is one Gearman worker per executor)
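
As a rough illustration of the field layout above, the following sketch parses one line of status output. It assumes the tab-separated layout seen in the merger:merge example further down this page (function, queued, running, workers). The FunctionStatus type and the parse_status_line helper are names made up for this example; they are not part of Zuul or of zuul-gearman.py.

```python
from typing import NamedTuple


class FunctionStatus(NamedTuple):
    """One line of Gearman 'status' output (hypothetical helper type)."""
    function: str  # Gearman function name
    queued: int    # instances of the job currently queued
    running: int   # instances of the job currently running
    workers: int   # workers registered for the job


def parse_status_line(line: str) -> FunctionStatus:
    """Split a tab-separated status line into its four fields.

    Assumes exactly four tab-separated fields; anything else raises
    ValueError, which is the desired behaviour for a sketch.
    """
    function, queued, running, workers = line.rstrip("\n").split("\t")
    return FunctionStatus(function, int(queued), int(running), int(workers))
```

For instance, parse_status_line("merger:merge\t2803\t2\t2") yields a record whose queued field is 2803.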

The list of workers and the jobs attached to them is obtained with a second administrative command.

The fields read as:
 * worker number
 * worker IP address
 * worker name. The Jenkins Gearman plugin forges it from: node name, '_exec-', executor slot
 * list of functions the worker can handle
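
The worker fields can be pulled apart in a similar way. This is only a sketch: it assumes each line is laid out as "number IP name : function function ...", with a " : " separator before the function list, and that the worker name follows the node name, '_exec-', executor slot pattern described above. The Worker type and parse_worker_line are hypothetical names.

```python
from typing import List, NamedTuple


class Worker(NamedTuple):
    """One line of worker output (hypothetical helper type)."""
    number: int           # worker number
    ip: str               # worker IP address
    node: str             # Jenkins node name, taken from the worker name
    executor: int         # executor slot, taken from the worker name
    functions: List[str]  # functions the worker can handle


def parse_worker_line(line: str) -> Worker:
    """Parse 'number ip name : func func ...' (assumed layout)."""
    # Pad with a space so a worker with no functions still has ' : '.
    head, _, tail = (line.rstrip() + " ").partition(" : ")
    number, ip, name = head.split()
    # The name is forged as <node name>_exec-<executor slot>.
    node, _, slot = name.rpartition("_exec-")
    return Worker(int(number), ip, node, int(slot), tail.split())
```

A name that does not contain '_exec-' makes the int() conversion fail, so real tooling would need to handle workers that were not registered by the Jenkins Gearman plugin.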

One can use netcat as well:

echo status|nc -q 3 localhost 4730|grep TemplateData

The -q 3 option is a three-second timeout.
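
The netcat one-liner can also be approximated from Python. The sketch below assumes the server listens on localhost port 4730, as in the command above, and that a multi-line reply ends with a line containing a single dot, which is how the Gearman administrative protocol is understood to terminate such replies. The helper names (read_until_dot, gearman_admin, grep) are invented for this example.

```python
import socket
from typing import Callable, List


def read_until_dot(recv: Callable[[int], bytes]) -> bytes:
    """Read chunks until the reply ends with a lone '.' line or EOF."""
    buf = b""
    while not (buf == b".\n" or buf.endswith(b"\n.\n")):
        chunk = recv(4096)
        if not chunk:  # connection closed by the server
            break
        buf += chunk
    return buf


def gearman_admin(command: str, host: str = "localhost",
                  port: int = 4730, timeout: float = 3.0) -> str:
    """Send one administrative command and return the raw text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii") + b"\n")
        return read_until_dot(sock.recv).decode("utf-8", errors="replace")


def grep(text: str, needle: str) -> List[str]:
    """Keep only the lines containing `needle`, like the grep above."""
    return [line for line in text.splitlines() if needle in line]
```

With these helpers, grep(gearman_admin("status"), "TemplateData") plays the same role as the shell pipeline. The three-second socket timeout mirrors the netcat option only loosely: it bounds each network operation rather than the time spent waiting after the end of input.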

You can generate a thread dump by sending a signal to the zuul process. The result is sent to the debug log. Warning: do not send the signal to the forked zuul process that runs the Gearman server; doing so terminates it and causes havoc.

Replay events
Use the zuul enqueue command on the contint host (e.g. contint1001) to replay a Gerrit event to Zuul. This will then queue the same Jenkins jobs as if the event had just occurred.

This can be useful when iterating locally on a Jenkins job that is managed via JJB (e.g. if it is difficult or impossible to trigger such a build directly in Jenkins, or when testing logic for the Zuul merger or the Zuul environment variables themselves), or after creating a documentation publishing job, to generate documentation for a backlog of previous releases.

Below are some examples:

# Patch jobs
zuul enqueue --trigger gerrit --pipeline test --project fresh --change 591214,1

# Post-merge jobs
zuul enqueue --trigger gerrit --pipeline postmerge --project mediawiki/extensions/EventLogging --change 591769,1

# Release tag jobs
zuul enqueue-ref --trigger gerrit --pipeline publish --project mediawiki/php/luasandbox --ref 'refs/tags/3.0.3' --newrev '41dfc79bbcd619e50f7dc44891d19b9b3f812aa9' --oldrev '0000000000000000000000000000000000000000'
zuul enqueue-ref --trigger gerrit --pipeline publish --project oojs/core --ref 'refs/tags/v2.0.0' --newrev '3cad296dc5b722c5061c12ae75c13fa8102fc693' --oldrev '0000000000000000000000000000000000000000'
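
When a backlog of release tags has to be replayed, the release tag commands can be generated instead of typed by hand. The sketch below only builds the argument list, using the same options as the release tag examples on this page (--trigger, --pipeline, --project, --ref, --newrev, --oldrev). The enqueue_tag_command helper and the NULL_REV constant are hypothetical names, and the default "publish" pipeline is simply the one used in those examples.

```python
import shlex
from typing import List

# All-zero SHA-1, passed as --oldrev to say the tag did not exist before.
NULL_REV = "0" * 40


def enqueue_tag_command(project: str, tag: str, newrev: str,
                        pipeline: str = "publish") -> List[str]:
    """Build the argv that replays the creation of a release tag."""
    return [
        "zuul", "enqueue-ref",
        "--trigger", "gerrit",
        "--pipeline", pipeline,
        "--project", project,
        "--ref", f"refs/tags/{tag}",
        "--newrev", newrev,
        "--oldrev", NULL_REV,
    ]


# Print a shell-quoted command line for review before running it by hand.
print(shlex.join(enqueue_tag_command(
    "oojs/core", "v2.0.0", "3cad296dc5b722c5061c12ae75c13fa8102fc693")))
```

Printing the command rather than executing it keeps a human in the loop, which seems prudent for something that queues real Jenkins jobs.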

Change configuration
Clone the integration/config repository:

git clone -o gerrit ssh://gerrit.wikimedia.org:29418/integration/config.git

The Zuul configuration file is zuul/layout.yaml. Edit the file, push your commit to Gerrit, then ask for review.

Deploy configuration
Once your configuration change is merged, it needs to be deployed on the continuous integration server. This can be done by anyone allowed to sudo as the zuul user.

The deployment is done with a shell script kept in the repository and run from the configuration directory. The script will:

 * ssh to the contint server where the Zuul scheduler runs,
 * update the local git clone of integration/config,
 * show a diff of the changes,
 * ask you to accept the diff,
 * if you are happy with the changes, update (rebase) the repository and reload the Zuul scheduler service.

IMPORTANT: In a second terminal you might want to watch the Zuul log file:

$ tail -f -n100 /var/log/zuul/zuul.log

Announce the deployment in the RelEng SAL.

If you see any error in the log file, you should revert your change locally and reload the daemon again (and revert the patch in Gerrit, and merge the revert).

Restart

 * Graceful

A plain "restart" is graceful.

ssh contint.wikimedia.org
sudo /usr/sbin/service zuul restart && tail -f -n100 /var/log/zuul/zuul.log


 * Forced

A plain restart waits for currently queued jobs to finish. If Zuul is unresponsive, restarting is futile, as that will leave it no less stuck than it already is. In that case, perform a stop followed by a start. The stop command, contrary to restart, is not graceful: it terminates the process immediately with no regard for currently running or queued jobs.

ssh contint.wikimedia.org
sudo /usr/sbin/service zuul stop
sudo /usr/sbin/service zuul start
tail -n100 /var/log/zuul/zuul.log

WMF Setup
The Zuul source code is maintained by OpenStack. The WMF maintains a copy of their git repository in its own Gerrit installation, under the integration/zuul project. The Continuous Integration team manually updates our master branch from the OpenStack master.

The Puppet module zuul handles installation. It clones the source code from the WMF git repository and installs it on the server. WMF-specific configuration is handled via our Puppet role classes, which invoke the zuul module with a set of parameters that fit our context. Changes to this configuration must be approved by the Operations team.

Zuul has additional configuration to fine-tune how jobs are triggered. Since it is regularly updated by the people in charge of Continuous Integration, the related configuration files have been extracted to a git repository outside of Operations' responsibility: integration/config. This lets CI people make changes without bothering Operations with configuration changes that are harmless to most WMF servers. A wrong change can still render Zuul inoperable, but CI people should be able to fix it by themselves.

Log files are available under /var/log/zuul and are rotated daily. zuul.log should cover most needs; if not, debug.log has more detailed information. The logging configuration is handled by the Puppet module zuul, which installs the logging configuration file.

The configuration repository is initially deployed by Puppet, which simply clones it onto the server, and the Zuul configuration refers to that clone. Whenever a change is merged in integration/config, one needs to update the git working directory and reload Zuul. Watch the log file: since Zuul does not validate its configuration, a single typo in the zuul/layout.yaml file can make it unstable.

New package
We deploy Zuul using Debian packages. The Debian sources are kept in dedicated branches of the integration/zuul repository.

The quilt patches are maintained with a tool that grabs the patches from sub-branches.

To build for Jessie:

ssh integration-slave-jessie-1001.integration.eqiad.wmflabs
git clone https://gerrit.wikimedia.org/r/integration/zuul
cd zuul
git checkout origin/upstream
git checkout debian/jessie-wikimedia

# We use dh-virtualenv, which fetches from PyPI
echo "USENETWORK=yes" > ~/.pbuilderrc

sudo -s DEB_BUILD_OPTIONS=nocheck GIT_PBUILDER_AUTOCONF=no DIST=jessie WIKIMEDIA=yes git-buildpackage -us -uc --git-builder=git-pbuilder

You should then have the resulting package files in the parent directory:

$ ls -1 ../zuul_*
zuul_2.5.1.orig.tar.gz
zuul_2.5.1-wmf10_amd64.changes
zuul_2.5.1-wmf10_amd64.deb
zuul_2.5.1-wmf10.debian.tar.xz
zuul_2.5.1-wmf10.dsc

The build creates the source tarball from a local branch. Make sure that local branch matches the version you are packaging.

You should diff the package against the previous one to spot potential differences, either with a package diff tool or by extracting both packages:

$ dpkg-deb -x zuul_2.5.1-wmf9_amd64.deb current
$ dpkg-deb -x zuul_2.5.1-wmf10_amd64.deb new
$ colordiff -ur current new

Or to review only source code modifications:

$ colordiff -ur {wmf2,wmf3}/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul
diff -ur wmf2/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/lib/gerrit.py wmf3/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/lib/gerrit.py
--- wmf2/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/lib/gerrit.py	2015-02-05 15:46:17.000000000 +0000
+++ wmf3/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/lib/gerrit.py	2015-07-23 14:50:19.000000000 +0000
@@ -120,7 +120,7 @@
             if v is True:
                 cmd += ' --%s' % k
             else:
-                cmd += ' --label %s=%s' % (k, v)
+                cmd += ' --%s %s' % (k, v)
         cmd += ' %s' % change
         out, err = self._ssh(cmd)
         return err

Actually upgrade
On the contint host, as root, stop the servers and uninstall Zuul entirely:

/etc/init.d/zuul stop
/etc/init.d/zuul-merger stop
pip uninstall zuul

Repeat pip uninstall zuul in case several versions were installed, until you get a message confirming that it is no longer installed:

Cannot uninstall requirement zuul, not installed
Storing complete log in /root/.pip/pip.log

Change the master branch of the local git working copy to point to the desired commit. Do this on contint, as root, and review the incoming changes first.

If happy with the changes, continue:

git reset --hard origin/master
HTTP_PROXY=. HTTPS_PROXY=. python setup.py install

If easy_install attempts to download a Python module, it will bail out. You will then have to roll master back to the previous commit and package the missing Python module.

MAKE SURE the layout still validates:

zuul-server -c /etc/zuul/zuul.conf -l /etc/zuul/wikimedia/zuul/layout.yaml -t

Any stack trace there means Zuul will not be able to reload the configuration. Roll back.

Restart the services:

/etc/init.d/zuul-merger start
/etc/init.d/zuul start

Check /var/log/zuul/debug.log and /var/log/zuul/merger-debug.log to verify that the daemons start properly. Once they have settled, you can update a dummy patch in Gerrit to confirm.

Force merge
A force merge is clicking "Submit" in Gerrit while Zuul is still working through tests, so that the patch is merged before Zuul expects it to be. This puts Zuul in a bad state and clogs the queue.

Gearman deadlock
The Gearman server sometimes deadlocks when a job is created in Jenkins. The Gearman process is still around but TCP connections time out completely and it does not process anything. The workaround is to disconnect Jenkins from the Gearman server:


 * 1) Open https://integration.wikimedia.org/ci/configure logged in with a WMF ldap account
 * 2) Log what you're about to do in the RelEng SAL
 * 3) Search for "Gearman"
 * 4) Untick checkbox "Enable Gearman"
 * 5) "Save" at the bottom
 * 6) Search for "Gearman"
 * 7) Tick checkbox "Enable Gearman"
 * 8) "Save" at the bottom

Jenkins execution lock
Sometimes a Jenkins node (in particular deployment-deploy03, which runs the Beta Cluster update jobs) gets stuck. To recover:


 * 1) Open https://integration.wikimedia.org/ci/computer/deployment-deploy03/
 * 2) Log what you're about to do in the RelEng SAL
 * 3) Mark node as temporarily offline (there's a button at the top right of the page)
 * 4) Disconnect (there's a link in the left hand panel of the page)
 * 5) Relaunch the replica agent
 * 6) Bring node back online

Very high queue of merger:merge functions
Zuul might be flooded with a large number of merger:merge functions to trigger, usually because a single repository is sending far too many patches. When the server cannot be restarted (that would lose the queue), one can make merger:merge fail fast by preventing read access to the git repository.

To confirm, check the number of waiting jobs on the Zuul master. In the example below it is 2803:

$ zuul-gearman.py status|grep merger:merge
merger:merge	2803	2	2

Identify the spamming repository by following the merger debug log on the zuul-merger instances:

tail -f /var/log/zuul/merger-debug.log

You should see a flood of messages such as:

DEBUG zuul.Repo: CreateZuulRef master/Zxxxx at yyy on 

On the zuul-merger instances, change ownership to root and prevent reads from the zuul user:

chown root:root /srv/zuul/git/someproject/.git
chmod go-rx /srv/zuul/git/someproject/.git

The merger:merge function will thus fail quickly and errors will show up in the log. Wait until the queue has been drained to a more reasonable level:

$ zuul-gearman.py status|grep merger:merge
merger:merge	19	2	2

Then restore the ownership and permissions:

chown zuul:zuul /srv/zuul/git/someproject/.git
chmod go+rx /srv/zuul/git/someproject/.git
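
Deciding when the queue is "reasonable" again can be scripted. The sketch below takes the text printed by zuul-gearman.py status and reports whether the merger:merge backlog has dropped below a threshold. It assumes the tab-separated layout shown in the examples above. The threshold of 50 is an arbitrary value chosen for illustration, and queued_for and is_drained are hypothetical helper names.

```python
def queued_for(status_text: str, function: str = "merger:merge") -> int:
    """Return the queued count for `function`, or 0 if it is not listed."""
    for line in status_text.splitlines():
        fields = line.split("\t")
        # Expected layout: function, queued, running, workers.
        if len(fields) == 4 and fields[0] == function:
            return int(fields[1])
    return 0


def is_drained(status_text: str, threshold: int = 50) -> bool:
    """True once the merger:merge backlog is at or below `threshold`."""
    return queued_for(status_text) <= threshold
```

With the two sample outputs above, the first (2803 queued) is not drained and the second (19 queued) is, so the permissions would only be restored after the second reading.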

All Gerrit patches complain of merge conflicts
This appears to be caused by gerrit-bot holding open SSH connections and hitting the connection limit.

It is usually resolved by restarting Zuul per https://phabricator.wikimedia.org/T308943#7947453

In case that doesn't work, check the SSH connections to Gerrit with the show-connections command. You'll need to be in the Gerrit Administrators group to do this (see Gerrit:cmd-show-connections).

There should be two connections. More than two is a bad sign: it means something is hung in Zuul. Try to kill the oldest connection (the first in the list). You'll need to be in the Gerrit Administrators group to do this (see Gerrit:cmd-close-connection).