Continuous integration/Zuul

shortcut: CI/Z

Zuul is a Python daemon which acts as a gateway between Gerrit and Jenkins. It listens to the Gerrit stream-events feed and triggers job functions registered by Jenkins using the Jenkins Gearman plugin. The job triggering specification is written in YAML and hosted in the git repository integration/config.git as /zuul/layout.yaml .

Operational information

What            Where
Server          contint1001.wikimedia.org (scheduler + merger)
                lanthanum.eqiad.wmnet (merger)
Puppet classes  manifests/role/zuul.pp
                modules/contint
                modules/zuul
Config          /etc/zuul/zuul.conf
Init scripts    /etc/init.d/zuul
                /etc/init.d/zuul-merger
Log             /var/log/zuul/*.log
Quick checks    pgrep -l zuul (should yield zuul-merger and 2 x zuul-server on contint1001)
                https://integration.wikimedia.org/zuul/

Architecture overview

The settings described below come mostly from /etc/zuul/zuul.conf, which is maintained in Puppet. They might not be up to date on this wiki page.

Zuul maintains an ssh connection with the Gerrit master. It connects as the user jenkins-bot and issues the Gerrit command stream-events, which provides a JSON feed of anything happening in Gerrit that can be seen by the jenkins-bot user.
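To illustrate what consuming that feed involves: Gerrit emits one JSON object per line, each with a type field. The following is a minimal sketch and not Zuul's own code; the sample events are hand-written and their values are placeholders.

```python
import json


def parse_stream_events(lines, wanted_types=("patchset-created",)):
    """Yield decoded Gerrit events whose "type" is in wanted_types."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        event = json.loads(line)
        if event.get("type") in wanted_types:
            yield event


# Hand-written sample resembling the feed (all values are placeholders).
sample_feed = [
    '{"type": "patchset-created", "change": {"project": "mediawiki/core",'
    ' "branch": "master", "number": "12345"},'
    ' "patchSet": {"number": "1", "ref": "refs/changes/45/12345/1"}}',
    '{"type": "comment-added", "change": {"project": "mediawiki/core",'
    ' "branch": "master", "number": "12345"}}',
]

events = list(parse_stream_events(sample_feed))
for event in events:
    print(event["change"]["project"], event["patchSet"]["ref"])
```

In practice the lines would come from the standard output of an ssh session running gerrit stream-events rather than from a list.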

The main process is zuul-server . On startup it forks to boot an embedded Gearman server used to communicate with Jenkins. A second, independent process is zuul-merger , which connects to zuul-server and handles the git merges of proposed patches on the tip of the target branch.

Zuul git repositories

Whenever a new project is detected, Zuul clones a non-bare repository from the Gerrit master under the base path defined by git_dir in zuul.conf. As of September 2013, that is /srv/ssd/zuul/git . Zuul uses non-bare repositories to merge the received patchsets against the tip of the branch they are made against. The end result is often a merge commit, which is marked with a git reference (under refs/zuul/<branch>/Z…). The reference is passed when triggering a job so Jenkins can ultimately fetch it.

The local merge commits are not available publicly nor in Gerrit. Nonetheless, the Zuul repositories are made available to the Wikimedia internal network over the git protocol on port 9418. This is made possible by git-daemon, configured via /etc/default/git-daemon . The daemon is restricted to the internal network using ferm rules defined in Puppet.

Access by slaves to Zuul repositories

The Zuul repositories should be accessed with the hostname zuul.eqiad.wmnet which points to the server hosting Zuul (as of September 2017: contint1001.wikimedia.org).

On the server one can clone the mediawiki/core repository using: git clone git://zuul.eqiad.wmnet:9418/mediawiki/core/ though the master branch there will not be the one from Gerrit but an arbitrary patch merge.
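Since the master branch of these repositories is not meaningful, a job would normally fetch the specific Zuul reference it was given. A minimal sketch of building those git commands follows; the function name is illustrative and the reference used in the example is a made-up placeholder, not a real ref.

```python
def zuul_fetch_commands(project, ref, host="zuul.eqiad.wmnet", port=9418):
    """Return the git commands that fetch and check out a Zuul reference.

    The commands are meant to run inside an existing clone of the project.
    """
    url = "git://%s:%d/%s" % (host, port, project)
    return [
        ["git", "fetch", url, ref],
        ["git", "checkout", "FETCH_HEAD"],
    ]


# "refs/zuul/master/Zexample" is a placeholder, not a real reference.
commands = zuul_fetch_commands("mediawiki/core", "refs/zuul/master/Zexample")
for command in commands:
    print(" ".join(command))
```

Each command list can be passed to subprocess.check_call() with cwd set to the workspace.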

As of July 2014, work is ongoing to run a Zuul merger on a second server, lanthanum.eqiad.wmnet. The flow overview is:

Drawing of Wikimedia continuous integration flows between Zuul mergers and Jenkins slave client.

A second merger on lanthanum is not implemented yet since labs instances do NOT have access to production private IP addresses.

Git replications

Note that the continuous integration production servers also receive Git repositories under /srv/ssd/gerrit . Those are bare repositories, which are not suitable for testing patch sets via Zuul. The replication has been set up for two main uses:

  • take snapshots via git archive, which is not supported by Gerrit 2.8
  • use them as reference repositories so that Jenkins slaves do not have to fetch the whole repository over the network. Git clone will create hardlinks since those repositories are on the same disk (ssd) as the workspace.
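The two uses above can be sketched as command builders. This is only an illustration: the on-disk layout of the replicas (<base>/<project>.git) is an assumption, and the helper names are hypothetical.

```python
import posixpath

REPLICA_BASE = "/srv/ssd/gerrit"  # base path taken from the text above


def replica_path(project):
    # Assumption: each replica is stored as <base>/<project>.git
    return posixpath.join(REPLICA_BASE, project + ".git")


def snapshot_command(project, treeish="HEAD"):
    """git archive run against the local bare replica."""
    return ["git", "--git-dir=" + replica_path(project),
            "archive", "--format=tar", treeish]


def reference_clone_command(project, remote_url, workspace):
    """Clone from remote_url while borrowing objects from the local replica."""
    return ["git", "clone", "--reference", replica_path(project),
            remote_url, workspace]


print(" ".join(snapshot_command("mediawiki/core")))
```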

Triggering

When an event is received, Zuul passes it through a workflow specification defined in a YAML file (available in integration/config.git ). Zuul communicates with its internal Gearman daemon to launch a Gearman function, then resumes processing. The Gearman server receives from Zuul a set of parameters such as the project name and commit SHA1; it then finds a suitable worker to execute the function. As of January 2014 there is only one worker, which is the continuous integration Jenkins master server. Jenkins runs the job and executes a Gearman function to report back the test results, which are handled by the Jenkins worker to update job descriptions and by Zuul itself to report back in Gerrit as a comment.

Whenever Jenkins is not reachable or a job is deleted while running, the build result is considered lost and Zuul reports the status of the build as LOST.
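The event-to-function mapping can be pictured with a deliberately simplified model. The dictionary below is not the real layout.yaml schema, and the pipeline names and project-to-job assignments are hypothetical; only the build: prefix of Gearman function names is taken from the status output shown under Debugging.

```python
# Simplified stand-in for the workflow specification (not the real schema).
LAYOUT = {
    "pipelines": [
        {"name": "test", "events": ["patchset-created"]},
        {"name": "gate", "events": ["comment-added"]},  # hypothetical name
    ],
    "projects": [
        {
            "name": "mediawiki/extensions/TemplateData",
            "test": ["mwext-TemplateData-lint", "mwext-TemplateData-jslint"],
            "gate": ["mwext-TemplateData-testextensions-master"],
        },
    ],
}


def gearman_functions_for(layout, event):
    """Return the Gearman function names to launch for a Gerrit event."""
    project_name = event["change"]["project"]
    functions = []
    for pipeline in layout["pipelines"]:
        if event["type"] not in pipeline["events"]:
            continue
        for project in layout["projects"]:
            if project["name"] != project_name:
                continue
            for job in project.get(pipeline["name"], []):
                functions.append("build:" + job)
    return functions


event = {"type": "patchset-created",
         "change": {"project": "mediawiki/extensions/TemplateData"}}
triggered = gearman_functions_for(LAYOUT, event)
print(triggered)
```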

Split between check and test

Jobs executed on patch upload are split between those that execute code from the uploaded patch, which run only in the test pipeline, and those that do not, which run in both the check and test pipelines. This is so that unknown registered accounts can't execute code on the Jenkins slaves. (This will not be needed any more once everything runs in Continuous_integration/Architecture/Isolation.)

The white list for the test pipeline and the negated white list for the check pipeline should be kept in sync.
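One way to picture the white list and its negation is a single list from which both decisions are derived. This is a sketch only: the patterns are hypothetical, and the real filters live in zuul/layout.yaml.

```python
import re

# Hypothetical white list of trusted uploader addresses.
WHITELIST = [r"@wikimedia\.org$", r"^trusted@example\.org$"]


def pipeline_for(uploader_email, whitelist=WHITELIST):
    """Return "test" for white-listed uploaders and "check" for everyone else."""
    trusted = any(re.search(pattern, uploader_email) for pattern in whitelist)
    return "test" if trusted else "check"


print(pipeline_for("dev@wikimedia.org"))
print(pipeline_for("stranger@example.com"))
```

Deriving both outcomes from one list keeps them in sync by construction, which is the property the two hand-maintained lists have to preserve.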

Debugging

To list the jobs registered in Gearman, you can use the zuul-gearman.py utility to send the status administrative command to the Zuul Gearman server:

$ /usr/local/bin/zuul-gearman.py status
build:mwext-TemplateData-phpcs-HEAD:hasSlaveScripts    0    0    13
build:mwext-TemplateData-lint    0    0    13
build:mwext-TemplateData-lint:hasSlaveScripts    0    0    13
build:mwext-TemplateData-testextensions-master:hasSlaveScripts    0    0    13
build:mwext-TemplateData-testextensions-master    0    0    13
build:mwext-TemplateData-jslint    0    0    13
build:mwext-TemplateData-phpcs-HEAD    0    0    13
build:mwext-TemplateData-qunit:gallium    0    0    5
build:mwext-TemplateData-qunit    0    0    5
build:mwext-TemplateData-jslint:hasSlaveScripts    0    0    13
$

The fields read as:

  • jobs registered
  • the number of queued instances of that job
  • the number of currently running jobs
  • and the number of workers for the job
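When the listing is long, it can help to parse it. The following is a minimal sketch that turns status output into records using the four fields described above; the record and function names are illustrative.

```python
from collections import namedtuple

FunctionStatus = namedtuple("FunctionStatus", "name queued running workers")


def parse_status(output):
    """Parse the output of the status command into FunctionStatus records."""
    rows = []
    for line in output.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue  # skip prompts, blank lines and anything not a data row
        name, queued, running, workers = fields
        rows.append(FunctionStatus(name, int(queued), int(running), int(workers)))
    return rows


sample = """\
build:mwext-TemplateData-lint    0    0    13
build:mwext-TemplateData-qunit    0    0    5
"""

rows = parse_status(sample)
# Functions that nobody can run are usually the interesting ones.
without_workers = [row.name for row in rows if row.workers == 0]
print(rows[0].name, rows[0].workers)
```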


The list of workers and their attached jobs is obtained with the workers command. The output below is cut to 72 characters and to the first 6 lines:

$ zuul-gearman.py workers|cut -b-72
13 208.80.154.135 - : 
14 208.80.154.135 Zuul Merger : merger:merge merger:update
15 127.0.0.1 lanthanum_exec-3 : build:mwext-Diagnosis-phpcs-HEAD:UbuntuP
16 127.0.0.1 wikidata-jenkins3_exec-3 : build:mwext-Wikidata-client-none
19 127.0.0.1 deployment-tin.eqiad_exec-3 : build:label=deployment-tin
20 127.0.0.1 wikidata-jenkins1_exec-2 : build:mwext-Wikidata-client-none

The fields read as:

  • worker number
  • worker IP address
  • worker name. The Jenkins Gearman plugin forges it from: node name, '_exec-', executor slot
  • list of functions the worker can handle
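The fields above can be extracted as follows. This is a sketch that assumes the separator between the worker name and its functions is always " :", as in the sample output; the helper names are illustrative.

```python
def parse_worker_line(line):
    """Split one line of the workers output into its four fields."""
    head, _, functions = line.partition(" :")
    number, address, name = head.split(None, 2)
    return {
        "number": int(number),
        "address": address,
        "name": name,
        "functions": functions.split(),
    }


def split_worker_name(name):
    """Return (node, executor slot), or (name, None) for non-Jenkins workers."""
    node, separator, slot = name.rpartition("_exec-")
    if not separator:
        return name, None
    return node, int(slot)


merger = parse_worker_line(
    "14 208.80.154.135 Zuul Merger : merger:merge merger:update")
executor = parse_worker_line(
    "16 127.0.0.1 wikidata-jenkins3_exec-3 : build:mwext-Wikidata-client-none")
print(merger["name"], merger["functions"])
print(split_worker_name(executor["name"]))
```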

One can use netcat as well:

echo status | nc -q 3 localhost 4730 | grep TemplateData

The -q 3 option makes nc quit three seconds after reaching the end of its input.
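The same exchange can be done from Python with a plain socket. This sketch relies on the Gearman administrative protocol being line-based text, with the status and workers responses terminated by a line containing a single dot; the function names are illustrative.

```python
import socket


def send_admin_command(sock, command):
    """Send one administrative command and return the response lines."""
    sock.sendall(command.encode("ascii") + b"\n")
    buf = b""
    # The response ends with a line holding a single "."
    while not (buf == b".\n" or buf.endswith(b"\n.\n")):
        chunk = sock.recv(4096)
        if not chunk:
            break  # server closed the connection
        buf += chunk
    lines = buf.decode("ascii", "replace").splitlines()
    return [line for line in lines if line != "."]


def gearman_admin(command, host="localhost", port=4730, timeout=3.0):
    """Open a connection, run one command and close the connection."""
    sock = socket.create_connection((host, port), timeout=timeout)
    try:
        return send_admin_command(sock, command)
    finally:
        sock.close()
```

For example, gearman_admin("status") returns the same lines as the nc command above; a socket.timeout here is a hint that the server is not answering.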


You can generate a thread dump by sending SIGUSR2 to the zuul process. The result is sent to the debug log in /var/log/zuul/debug.log . Warning: do not send the signal to the forked zuul process which runs the Gearman server; doing so terminates it and causes havoc.
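For readers unfamiliar with the technique, here is a sketch of how a Python daemon can produce a thread dump on SIGUSR2. It illustrates the general approach and is not Zuul's actual handler; the function names are hypothetical, and the dump goes to stderr rather than to a log file.

```python
import signal
import sys
import threading
import traceback


def format_thread_dump():
    """Return the current stack of every thread as one string."""
    names = {thread.ident: thread.name for thread in threading.enumerate()}
    chunks = []
    for ident, frame in sys._current_frames().items():
        chunks.append("Thread: %s (%s)" % (names.get(ident, "unknown"), ident))
        chunks.append("".join(traceback.format_stack(frame)))
    return "\n".join(chunks)


def dump_handler(signum, frame):
    sys.stderr.write(format_thread_dump() + "\n")


def install_dump_handler():
    """Register the handler; must be called from the main thread."""
    signal.signal(signal.SIGUSR2, dump_handler)


dump = format_thread_dump()
```

After install_dump_handler() has run, kill -USR2 <pid> prints the dump without stopping the process.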

Update configuration

Change configuration

Clone the integration/config.git repository:

git clone -o gerrit ssh://gerrit.wikimedia.org:29418/integration/config.git

The Zuul configuration file is zuul/layout.yaml . Edit the file and push your commit to Gerrit then ask for review.

Deploy configuration

Once your configuration change is merged, it needs to be deployed on the continuous integration server (contint1001.wikimedia.org as of Nov 2016). This can be done by anyone allowed to sudo as the zuul user.

The deployment is done using Fabric, a Python-based DSL to run commands. You can install it either with sudo apt-get install fabric or, for a fresher version, with pip install --user fabric; the latter puts the script at ~/.local/bin/fab.

Then (from the config directory): fab deploy_zuul

That will ssh to the server hosting the Zuul scheduler, update the git repository and show a diff of the changes. If you are happy with them, accept the diff; the repository will then be rebased and the Zuul scheduler reloaded.

IMPORTANT: In a second terminal you might want to have a look at the Zuul log file:

$ tail -f -n100 /var/log/zuul/zuul.log

Announce the deployment to the RelEng SAL via !log in #wikimedia-releng.

If you see any error in the log file, you should revert your change locally (git reset --hard HEAD^) and reload the daemon again (and revert the patch in Gerrit, and merge the revert).

Restart

Graceful

A plain "restart" is graceful.

ssh contint1001
sudo -su zuul
/etc/init.d/zuul restart && tail -f -n100 /var/log/zuul/zuul.log

Forced

A plain restart waits for currently queued jobs to finish. If Zuul is unresponsive, restarting will be futile, as that will leave it no less stuck than it already is. In that case, perform a stop followed by a start. The stop command, contrary to restart, is not graceful and terminates the process immediately with no regard for currently running or queued jobs.

ssh contint1001
sudo -su zuul
/etc/init.d/zuul stop
/etc/init.d/zuul start
tail -n100 /var/log/zuul/zuul.log

WMF Setup

Zuul source code is maintained by OpenStack; the WMF maintains a copy of their git repository in its own Gerrit installation under the project integration/zuul . The Continuous Integration team manually updates our master branch from the OpenStack master.

The puppet module zuul handles installation. It clones the source code from the WMF git repository and installs it on the server using python setup.py . WMF-specific configuration is handled via our puppet role classes: role::zuul::production and role::zuul::labs . The role classes invoke the zuul module using a set of parameters that fit our context. Changes to this configuration must be approved by the Operations team (it is in the project operations/puppet ).

Zuul has additional configuration to finely tune how jobs are triggered. Since this is regularly updated by the people in charge of Continuous Integration, the related configuration files have been extracted to a git repository outside of Operations' responsibility: integration/config . This lets CI people make changes without bothering Operations with configuration changes that are harmless to most WMF servers. A wrong change can still render Zuul inoperable, but CI people should be able to fix it by themselves.

Log files are available under /var/log/zuul/ and are rotated daily. zuul.log should cover most needs; if not, debug.log has more detailed information. The logging configuration is handled via the puppet module zuul, which copies the file to /etc/zuul/logging.conf .

The configuration repository is initially deployed by puppet simply by cloning the repository under /etc/zuul/wikimedia . The /etc/zuul/zuul.conf file refers to it. Whenever a change is merged in integration/config, one needs to update the git working directory and reload Zuul. Watch the log file: since Zuul does not validate its configuration on reload, a typo in the zuul/layout.yaml file can easily make it unstable.

Upgrading

New package

draft

We deploy Zuul using Debian packages. The Debian sources are in integration/zuul.git in the branches debian/os-version .

The quilt patches under debian/patches are maintained via gbp-pq, which grabs the patches from the sub-branches patch-queue/debian/os-version .

To build for Precise:

ssh integration-slave-jessie-1001.integration.eqiad.wmflabs
git clone https://gerrit.wikimedia.org/r/p/integration/zuul
cd zuul
# Create the local upstream branch the source tarball is generated from
git checkout -b upstream origin/upstream
git checkout debian/precise-wikimedia

# Allow network access during the pbuilder build
echo "USENETWORK=yes" > ~/.pbuilderrc
sudo -s
DEB_BUILD_OPTIONS=nocheck GIT_PBUILDER_AUTOCONF=no DIST=precise WIKIMEDIA=yes git-buildpackage -us -uc --git-builder=git-pbuilder

You should then have the resulting package files in the parent directory:

$ ls -1 ../zuul_*
zuul_2.0.0-327-g3ebedde.orig.tar.gz
zuul_2.0.0-327-g3ebedde-wmf3precise1_amd64.build
zuul_2.0.0-327-g3ebedde-wmf3precise1.debian.tar.xz
zuul_2.0.0-327-g3ebedde-wmf3precise1.dsc
$

git-buildpackage creates the source tarball based on your local upstream branch. Make sure your local branch matches the version in the debian/changelog.
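To check that match, the upstream part of the version can be extracted from the first line of debian/changelog: it is everything before the last hyphen of the version string. A minimal sketch follows; the changelog header is a hand-written example consistent with the file names listed above, not a quote from the real changelog.

```python
import re


def upstream_version(changelog_header):
    """Extract the upstream version from a debian/changelog header line."""
    match = re.match(r"^\S+ \(([^)]+)\)", changelog_header)
    if match is None:
        raise ValueError("not a changelog header: %r" % changelog_header)
    version = match.group(1).split(":", 1)[-1]  # drop an optional epoch
    upstream, separator, _revision = version.rpartition("-")
    # Without a hyphen the package is native and the whole string is upstream.
    return upstream if separator else version


# Illustrative header line (assumed, see the lead-in).
header = "zuul (2.0.0-327-g3ebedde-wmf3precise1) precise-wikimedia; urgency=low"
print(upstream_version(header))
```

The result has the shape of git describe output, so comparing it with git describe run on the local upstream branch is one plausible way to spot a mismatch.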

You should compare the package with the previous one to spot unexpected differences, either with debdiff or by extracting both packages:

$ dpkg-deb -x zuul_2.0.0-327-g3ebedde-wmf2precise1_amd64.deb current
$ dpkg-deb -x zuul_2.0.0-327-g3ebedde-wmf3precise1_amd64.deb new
$ colordiff -ur current new

Or to review only source code modifications:

$ colordiff -ur {wmf2,wmf3}/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul
diff -ur wmf2/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/lib/gerrit.py wmf3/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/lib/gerrit.py
--- wmf2/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/lib/gerrit.py	2015-02-05 15:46:17.000000000 +0000
+++ wmf3/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/lib/gerrit.py	2015-07-23 14:50:19.000000000 +0000
@@ -120,7 +120,7 @@
             if v is True:
                 cmd += ' --%s' % k
             else:
-                cmd += ' --label %s=%s' % (k, v)
+                cmd += ' --%s %s' % (k, v)
         cmd += ' %s' % change
         out, err = self._ssh(cmd)
         return err
$

Actually upgrade

On contint1001 as root, stop the servers and uninstall Zuul entirely:

/etc/init.d/zuul stop
/etc/init.d/zuul-merger stop
pip uninstall zuul

In case several versions were installed, repeat pip uninstall zuul until you get a message confirming that it is no longer installed:

Cannot uninstall requirement zuul, not installed
Storing complete log in /root/.pip/pip.log

Change the master branch of the local git working space to point to the desired commit. On contint1001 as root:

cd /usr/local/src/zuul
git remote update
git log --oneline --decorate --graph master..origin/master

If happy with the changes, continue:

git reset --hard origin/master
HTTP_PROXY=. HTTPS_PROXY=. python setup.py install

If easy_install attempts to download a Python module, it will bail out. You will have to roll back master to the previous commit and package the missing Python module.

MAKE SURE the layout still validates:

zuul-server -c /etc/zuul/zuul.conf -l /etc/zuul/wikimedia/zuul/layout.yaml -t

Any stack trace there means Zuul will not be able to reload the configuration. Roll back.
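If the check is wrapped in a script, the sketch below runs the same command and looks for a Python stack trace in its output. The exit status of zuul-server -t is not documented here, so the sketch deliberately inspects the output only; the function names are hypothetical.

```python
import subprocess

VALIDATE_COMMAND = [
    "zuul-server",
    "-c", "/etc/zuul/zuul.conf",
    "-l", "/etc/zuul/wikimedia/zuul/layout.yaml",
    "-t",
]


def has_stack_trace(output):
    """True when the output contains the start of a Python traceback."""
    return "Traceback (most recent call last)" in output


def layout_validates(command=VALIDATE_COMMAND):
    """Run the validation command and return (ok, combined output)."""
    process = subprocess.Popen(command, stdout=subprocess.PIPE,
                               stderr=subprocess.STDOUT,
                               universal_newlines=True)
    output, _ = process.communicate()
    return not has_stack_trace(output), output
```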

Restart the services:

/etc/init.d/zuul-merger start
/etc/init.d/zuul start

Check /var/log/zuul/debug.log and /var/log/zuul/merger-debug.log to verify that the daemons started properly. Once they have settled, you can update a dummy patch in Gerrit to confirm.

Known issues

Gearman deadlock

The Gearman server sometimes deadlocks when a job is created in Jenkins. The Gearman process is still around, but TCP connections time out completely and it does not process anything. The workaround is to disconnect Jenkins from the Gearman server and reconnect it:

  1. Open https://integration.wikimedia.org/ci/configure logged in with a WMF LDAP account
  2. Log what you're about to do at the RelEng SAL via #wikimedia-releng !log
  3. Search for "Gearman"
  4. Untick the checkbox "Enable Gearman"
  5. "Save" at the bottom
  6. Search for "Gearman"
  7. Tick checkbox "Enable Gearman"
  8. "Save" at the bottom

Jenkins execution lock

Sometimes a Jenkins slave (in particular deployment-tin) gets stuck. The workaround is:

  1. Open https://integration.wikimedia.org/ci/computer/deployment-tin.eqiad/
  2. Log what you're about to do at the RelEng SAL via #wikimedia-releng !log
  3. Mark node as temporarily offline (there's a button at the top right of the page)
  4. Disconnect (there's a link in the left hand panel of the page)
  5. Relaunch slave agent
  6. Bring node back online
