Continuous integration/Zuul

Zuul is a python daemon which acts as a gateway between Gerrit and Jenkins. It listens to Gerrit stream-events feed and trigger jobs function registered by Jenkins using the Jenkins Gearman plugin. The jobs triggering specification is written in YAML and hosted in the git repository integration/config.git as /zuul/layout.yaml.

Architecture overview
''Settings described below comes mostly from /etc/zuul/zuul.conf which is maintained in puppet. They might not be up-to-date on this wiki page''.

Zuul maintains an ssh connection with the Gerrit master. It connects as the user jenkins-bot and issue the Gerrit command stream-events which provides a JSON feed of anything happening in Gerrit that can be seen by the jenkins-bot user.

The main process is zuul-server. On startup it forks to boot an embedded Gearman server used to communicate with Jenkins. Another independent process is zuul-merger which connects to zuul-server and handles the git merges of proposed patches on tip of the target branch.

Zuul git repositories
Whenever a new project is detected, Zuul clones a non-bare repository from Gerrit master under the base path defined by git_dir</tt> in zuul.conf. As of September 2013, that is /srv/ssd/zuul/git</tt>. Zuul uses non-bare repositories to merge the received patchsets against the tip of the branch they are made against. The end result is often a merge commit which is marked as a git reference under refs/zuul/ /Z…). The reference is passed when triggering job so Jenkins can ultimately fetch it.

The local merge commits are not available publicly nor in Gerrit. Nonetheless, the Zuul bare repositories are made available to Wikimedia internal network over the git protocol on port 9418. This is made possible by using git-daemon</tt> configured via /etc/default/git-daemon</tt>. The daemon is restricted to internal network using ferm rules defined in puppet.

Access by slave to Zuul repositories
The Zuul repositories should be accessed with the hostname zuul.eqiad.wmnet</tt> which points to the server hosting Zuul (as of January 2014: gallium.wikimedia.org).

On the server one can clone the mediawiki/core repository using: git clone git://zuul.eqiad.wmnet:9418/mediawiki/core/</tt> though the master branch there will not be the one from gerrit but a random patch merge.

As of July 2014, an ongoing work is being conducted to have a Zuul merger to run on the second server lanthanum.eqiad.wmnet. The flow overview is:



A second merger on lanthanum is not implemented yet since labs instances do NOT have access to production private IP addresses.

Git replications
Note that the continuous integration production servers also receive Git repositories under /srv/ssd/gerrit</tt>. Thoses are bare repositories which are not suitable for testing patch sets via Zuul. The replication has been setup for two main usage:
 * take snapshots via git archive</tt> which is not supported by Gerrit 2.8
 * use them as a reference repository to avoid Jenkins slaves to fetch the whole repository over the network. Git clone will creates hardlinks since those repositories are on the same disk (ssd) as the workspace.

Triggering
When an event is received, Zuul would pass it via a workflow specification defined in a YAML file (available in integration/config.git</tt>). Zuul will communicate with its internal Gearman daemon to launch a Gearman function and resume proceeding. The Gearman server receives from Zuul a set of parameters such as the project name and commit SHA1, it then find a suitable worker to execute the function. As of January 2014 there is only one worker which is the Continuous integration Jenkins master server. Jenkins runs the job and execute a Gearman function to report back test results which is handled by Jenkins worker to update job descriptions and by Zuul itself to report back in Gerrit as a comment.

Whenever Jenkins is not reacheable or a job got deleted while running, the build result will be considered lost and Zuul will report the status of the build to be LOST.

Debugging
To list jobs registered in Gearman, you can use the zuul-gearman.py</tt> utility to send the status</tt>administrative commands to Zuul Gearman server:

The fields read as:
 * jobs registered
 * the number of queued instances of that job
 * the number of currently running jobs
 * and the number of workers for the job

The list of workers and their attached job is obtained with the workers</tt> command. Output cut to 72 characters and first 6 lines:

The fields read as:
 * worker number
 * worker IP address
 * worker name. The Jenkins Gearman plugin forge it using: node name, '_exec-', executor slot
 * list of function the worker can handle

One can use netcat as well:

echo status|nc -q 3 localhost 4730|grep TemplateData

-q 3</tt> is a three seconds timeout.

You can generate a thread dump by sending SIGUSR2</tt> to the zuul process. The result is send to the debug log in /var/log/zuul/debug.log</tt>. Warning: do not send the signal to the forked zuul process which runs the gearman process, it will terminate it and causes havoc.

Change configuration
Clone the integration/config.git</tt> repository: git clone -o gerrit ssh://gerrit.wikimedia.org:29418/integration/config.git

The Zuul configuration file is <tt>zuul/layout.yaml</tt>. Edit the file and push your commit to Gerrit then ask for review.

Deploy configuration
Once your configuration change is merged it needs to be deployed on the continuous integration server (gallium.wikimedia.org as of Oct 2014). This can be done by someone allowed to sudo as zuul user.

yourself@host$ sudo -su zuul zuul@host$ cd /etc/zuul/wikimedia zuul@host$ git remote update

Make sure that you are only going to deploy your change by reviewing the log between the local master branch and the remote one:.

zuul$ git log -p HEAD..origin/master zuul/

--- a/zuul/layout.yaml +++ b/zuul/layout.yaml @@ -348,16 +348,6 @@ jobs: - name: ^mwext-TranslationNotifications-testextensions.* voting: false - # New parsoid tests, not sure if they're working yet, but we want to run it to find out. - - name: ^parsoid-testCommit$ -   voting: false - - name: ^parsoid-server-sanity-check$ -   voting: false projects: zuul@host$ Apply the change:

zuul@host$ git rebase First, rewinding head to replay your work on top of it... Fast-forwarded master to refs/remotes/origin/master.

IMPORTANT: In a second terminal have a look at the Zuul log file: $ tail -f -n100 /var/log/zuul/zuul.log

Announce deployment to RelEng SAL via  in.

Then reload the daemon while watching the log file.

zuul@host$ /etc/init.d/zuul reload * Reloading Zuul zuul                    [OK]

If you see any error in the log file, you should revert your change locally and reload the daemon again (and revert the patch in Gerrit, and merge the revert).

After a few seconds, check Zuul is correctly running:

$ /etc/init.d/zuul status * zuul is running

Restart

 * Graceful

A plain "restart" is graceful.

ssh gallium sudo -su zuul /etc/init.d/zuul restart && tail -f -n100 /var/log/zuul/zuul.log


 * Forced

A plain restart waits for currently queued jobs to finish. If you're in a position where Zuul is unresponsive, restarting will be futile as that will leave it no less stuck then it already is. In that case, perform a  followed by a. The stop command, contrary to restart, is not graceful and terminates the process immediately with no regard for currently running or queued jobs.

ssh gallium sudo -su zuul /etc/init.d/zuul stop /etc/init.d/zuul start tail -n100 /var/log/zuul/zuul.log

WMF Setup
Zuul source code is maintained by OpenStack, the WMF maintains a copy of their git repository in its own Gerrit installation under the project. Integration team manually update our master based on OpenStack master.

Installation is handled by the puppet module zuul which takes care of cloning the source code from the WMF git repository and install it on the server using <tt>python setup.py</tt>. WMF specific configuration is handled via our puppet role classes: <tt>role::zuul::production</tt> and <tt>role::zuul::labs</tt>. The role classes will invoke the zuul module using a set of parameter that fit our context. Changes to that configuration must be approved by the operations team (it is in <tt>operations/puppet.git</tt>).

Zuul has another configuration to finely tune how to trigger jobs. Since it is going to be updated by people in charge of continuous integration, the related configuration files has been extracted to a git repository out of operations responsibility:. This let integration people to do their change without bothering ops with configuration changes which are harmless to most WMF servers. A wrong change can still render Zuul non operant though but the integration people should be able to fix it by themselves.

Log files are available under <tt>/var/log/zuul/</tt> and are rotated daily. <tt>zuul.log</tt> should cover most needs, if not the <tt>debug.log</tt> has extended informations. The logging configuration is handled via the puppet module zuul which copy the file in <tt>/etc/zuul/logging.conf</tt>.

The configuration repository is initially deployed by puppet simply by cloning the repository under <tt>/etc/zuul/wikimedia</tt>. The <tt>/etc/zuul/zuul.conf</tt> refers to it. Whenever a change is merged in integration/config, one needs to update the git working directory and reload zuul. Watch out the log file, since Zuul does not validate its configuration, it can well be made unstable whenever a typo appear in the zuul/layout.yaml file.

prechecks
'''work in progress ... Antoine &#34;hashar&#34; Musso (talk) Nov 2013'''

Python dependencies MUST be available as packages and installed via puppet. You will want to test out the dependencies, if anything is downloaded it will need to be packaged.

Before checking the dependencies, we will add the distribution packages to python path list:

export PYTHONPATH=/usr/lib/python2.7/dist-packages

Then create a virtual environment and attempt an install with download disabled:

$ virtualenv venv $ venv/bin/python setup.py easy_install --allow-hosts=None.

If anything is missing, setup.py will issue a stacktrace or at least exit code 1.

Actually upgrade
On gallium as root, stop the servers and uninstall Zuul entirely:

/etc/init.d/zuul stop /etc/init.d/zuul-merger stop pip uninstall zuul

Repeat <tt>pip uninstall zuul</tt> in case several versions were installed until you have a message confirming it is not:

Cannot uninstall requirement zuul, not installed Storing complete log in /root/.pip/pip.log

Change the <tt>master</tt> branch of the local git working space to point to the desired commit. On gallium as root:

cd /usr/local/src/zuul git remote update git log --oneline --decorate --graph master..origin/master

If happy with the changes, continue:

git reset --hard origin/master HTTP_PROXY=. HTTPS_PROXY=. python setup.py install

If easy_install attempts to download a python module, it will bails out. You will have to rollback master to whatever previous commit and package the missing python module.

MAKE SURE the layout still validates:

zuul-server -c /etc/zuul/zuul.conf -l /etc/zuul/wikimedia/zuul/layout.yaml -t

Any stack trace there mean Zuul will not be able to reload the configuration. Rollback.

Restart the services:

/etc/init.d/zuul-merger start /etc/init.d/zuul start

Check /var/log/zuul/debug.log and /var/log/zuul/merger-debug.log to verify the daemon start properly. Once they have settled, you can change a dummy patch in Gerrit to confirm.

Gearman deadlock
The Gearman server sometime deadlock when a job is created in Jenkins. The Gearman process is still around but TCP connections time out completely and it does not process anything. The workaround is to disconnect Jenkins from the Gearman server:


 * 1) Open https://integration.wikimedia.org/ci/configure logged in with a WMF ldap account
 * 2) Log what you're about to do at the RelEng SAL via
 * 3) Search for "Gearman"
 * 4) Untick checbox "Enable Gearman"
 * 5) "Save" at the bottom
 * 6) Search for "Gearman"
 * 7) Tick checkbox "Enable Gearman"
 * 8) "Save" at the bottom

Jenkins execution lock
Sometimes a Jenkins slave (in particular gallium) gets stuck


 * 1) Open https://integration.wikimedia.org/ci/computer/gallium
 * 2) Log what you're about to do at the RelEng SAL via
 * 3) Mark node as temporarily offline (there's a button at the top right of the page)
 * 4) Disconnect (there's a link in the left hand panel of the page)
 * 5) Relaunch slave agent
 * 6) Bring node back online