Continuous integration/Jenkins

Jenkins is a Java tool used to handle recurring tasks such as running tests or building packages. Our primary install is at https://integration.wikimedia.org/ci/.

The tool is permanently connected to our review tool (Gerrit) and can be made to react to changes submitted to Gerrit. A typical example is running MediaWiki unit tests whenever a change is submitted to the mediawiki/core.git repository.

The configuration of individual jobs is abstracted via Jenkins job builder. The jobs are triggered with Zuul (which processes Gerrit events), and configured in the integration/config.git repository.

If you are on the allow list, you can force Zuul to run all tests on a patchset by adding a comment beginning with the word recheck in Gerrit.

Most of the infrastructure is detailed on wikitech.

Comment commands

You will need to be added to the allow list to trigger these commands.

You can monitor the running tests at https://integration.wikimedia.org/zuul/. As documented there, you can post a number of special commands as a Gerrit comment to make Zuul (re)run the tests defined in a repo on a given patchset:

recheck – the test jobs in the 'test' pipeline (i.e., the standard tests; this is a sub-set of the merge tests)
check coverage – the test job in the 'coverage' pipeline (for PHP code coverage)
check perf – the test job in the 'patch-performance' pipeline
check codehealth – the test job in the 'codehealth' pipeline
check experimental – any test jobs in the 'experimental' pipeline (this might be new versions of PHP, Node.js, Python, etc. that aren't yet tested, or alternative environment systems or databases such as PostgreSQL/SQLite)
check php – any test jobs in the 'php' pipeline (the tests that will be run on merge that are missing from the 'test' pipeline); you can also trigger this with the legacy triggers check php5, check zend, check sqlite, and check postgres.^[1]
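
The mapping from comment text to Zuul pipeline in the list above can be summarized as a small lookup. This is only an illustrative sketch: the function name pipeline_for_comment is hypothetical, and the prefix matching is a simplification. The real matching is done by the Zuul configuration in integration/config.git.

```shell
# Hypothetical helper (not part of Zuul): print the pipeline that a Gerrit
# comment would trigger, following the list above. Returns 1 and prints
# nothing if the comment does not start with a known command.
pipeline_for_comment() {
    case "$1" in
        recheck*)               echo "test" ;;
        "check coverage"*)      echo "coverage" ;;
        "check perf"*)          echo "patch-performance" ;;
        "check codehealth"*)    echo "codehealth" ;;
        "check experimental"*)  echo "experimental" ;;
        # "check php"* also covers the legacy trigger "check php5".
        "check php"*|"check zend"*|"check sqlite"*|"check postgres"*)
                                echo "php" ;;
        *)                      return 1 ;;
    esac
}
```

For example, pipeline_for_comment "check php5" prints php, while a comment that is not a command makes the function return a non-zero status.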

Local Installation

This section describes a potentially obsolete way to install the stack on a local machine.

Automatic installation

curl https://raw.github.com/valhallasw/wikimedia-mkjenkins/master/mkjenkins.sh | bash

To make the install go faster, it helps to have a mediawiki-core checkout in ~/src/mediawiki-core. If this repository exists, the script makes a local clone of it. If it doesn't, the script downloads from Gerrit instead (slow!).
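
The choice between a local clone and a download from Gerrit can be sketched as follows. This is not the code of the mkjenkins script: the function name mw_core_clone_source is hypothetical, and "the repository exists" is approximated here by a plain directory test. The Gerrit URL is the one used in the manual installation steps.

```shell
# Hypothetical helper: print where mediawiki-core should be cloned from.
# Assumption: an existing directory counts as an existing checkout.
mw_core_clone_source() {
    checkout="${1:-$HOME/src/mediawiki-core}"
    if [ -d "$checkout" ]; then
        # Fast path: clone from the local checkout.
        echo "$checkout"
    else
        # Slow path: download from Gerrit.
        echo "https://gerrit.wikimedia.org/r/mediawiki/core.git"
    fi
}
```

Called without an argument, the function checks ~/src/mediawiki-core; its output can be passed to git clone as the source.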

Manual installation

git clone https://gerrit.wikimedia.org/r/integration/jenkins.git ~/.jenkins
- ~/.jenkins is the default Jenkins configuration directory
WM-specific configuration patch I - ln -s $HOME/.jenkins /var/lib/jenkins
- because some jobs assume Jenkins is installed in /var/lib/jenkins
Download Jenkins (jenkins.war) and place it in ~/.jenkins
Install the following plugins (download into ~/.jenkins/plugins):
- git
- git-client
- ansicolor
- notification
- scm-api
- timestamper
- build-timeout
- xunit
Download jenkins-job-builder (JJB) and its configuration; see Continuous_integration/Jenkins job builder for more information. You don't need a password when you install Jenkins locally.
WM-specific configuration patch II - Patch the JJB configuration that depends on Zuul (see the mkjenkins script for a diff)
If you already have a checkout of mediawiki-core: git clone --mirror -l -- your_existing_checkout /var/lib/jenkins/git/mw-core-bare. Otherwise, git clone --mirror -- https://gerrit.wikimedia.org/r/mediawiki/core.git /var/lib/jenkins/git/mw-core-bare
Start Jenkins: cd ~/.jenkins && java -jar jenkins.war&
When Jenkins is running, install the JJB jobs: rm -f $HOME/.cache/jenkins_jobs/jenkins_jobs_cache.yml && jenkins-jobs --conf jenkins_jobs.ini update config/.
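
Before starting Jenkins, it can be useful to confirm that every plugin from the list above was actually downloaded. The following is a minimal sketch, assuming each plugin is stored as NAME.hpi or NAME.jpi in the plugins directory; the function name missing_plugins is hypothetical.

```shell
# Plugin names taken from the list in the manual installation steps.
REQUIRED_PLUGINS="git git-client ansicolor notification scm-api timestamper build-timeout xunit"

# Hypothetical helper: print each required plugin that has neither a .hpi
# nor a .jpi file in the given directory (default: ~/.jenkins/plugins).
missing_plugins() {
    plugin_dir="${1:-$HOME/.jenkins/plugins}"
    for name in $REQUIRED_PLUGINS; do
        if [ ! -e "$plugin_dir/$name.hpi" ] && [ ! -e "$plugin_dir/$name.jpi" ]; then
            echo "$name"
        fi
    done
}
```

An empty output from missing_plugins means all listed plugins are present.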

Issue?

Hung beta code/db update

Tracked in Phabricator
Task T72597

This deadlock seems to happen more often than not following or during a database update that is taking a while to complete.

Take deployment-deploy03 offline in Jenkins https://integration.wikimedia.org/ci/computer/deployment-deploy03/markOffline
Kill any jenkins jobs running on deployment-deploy03 via Jenkins UI
Kill all pending jobs in the Jenkins queue that are "waiting on executors"
Disconnect deployment-deploy03 https://integration.wikimedia.org/ci/computer/deployment-deploy03/disconnect
Bring deployment-deploy03 back online (button labeled "Bring this node back online")
Launch the agent (button labeled "Launch slave agent")
Check agent log to see that it connected https://integration.wikimedia.org/ci/computer/deployment-deploy03/log

Sometimes you have to do this whole dance several times before Jenkins realizes that there are a bunch of executors that it can use.

Alternate method:

Go to https://integration.wikimedia.org/ci/manage
Go to "Configure System"
Search page for "Enable Gearman"
Un-check the checkbox
Save
Wait 30s
Check the "Enable Gearman" checkbox
Save

This second method may interrupt communication between running Jenkins jobs and Zuul but it seems to work even when the offline/online method fails to clear the deadlock.

Alternate alternate method:

Login to https://integration.wikimedia.org/ci
Hit the red [x] to cancel one pending beta-scap-eqiad job
That's it!

It seems that this is some conflict between the Jenkins native scheduler and the Gearman scheduler. Cancelling the build seems to fix the problem. Subsequent builds will deploy the same code to production.

Restart

Zuul should not be restarted when Jenkins is. Zuul preserves the queue and continues after the Jenkins restart.

Via web interface

Apply the self-serve Jenkins repair!

With a safeRestart, any currently running job blocks the restart until it finishes or is canceled, so long-running jobs should be killed. Check the main Jenkins dashboard and cancel any long-running jobs there. Bonus points: on the Zuul dashboard, make a note of the patches whose jobs you canceled, and comment "recheck" on any patches in the test queue that you aborted.

Head to https://integration.wikimedia.org/ci/safeRestart
Log in with your labs account, which must be part of the 'wmf' LDAP group
Press "Yes"
Log the restart in #wikimedia-operations: "!log restarting stuck Jenkins".

Shell

On the active host (run host contint.wikimedia.org to see which host is currently active):

sudo systemctl restart jenkins

And then wait a while. Monitor logs via sudo journalctl -f -u jenkins

OOM Issues

Troubleshooting

Whenever Jenkins appears to be stuck or facing high CPU usage, you will want to look at the Java threads: https://integration.wikimedia.org/ci/threadDump

To get the same thread dump from the CLI:

   jstack -l -F <pid of jenkins>

The last time this happened (2017-05-20), a restart of Jenkins "fixed" the problem, but we were unable to troubleshoot it without a stack trace from jstack.
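
To avoid losing the stack trace again, the thread dump can be saved to a file before restarting. The following is a minimal sketch using the jstack invocation shown above; the function name save_thread_dump, the JSTACK override variable, and the output file naming are all hypothetical.

```shell
# Hypothetical helper: save a thread dump for the given PID to a timestamped
# file in the given directory (default: current directory) and print the
# file name. JSTACK may be set to use a different jstack binary.
save_thread_dump() {
    pid="$1"
    out_dir="${2:-.}"
    out_file="$out_dir/jenkins-threads-$(date +%Y%m%dT%H%M%S).txt"
    # Same flags as the jstack command documented above.
    "${JSTACK:-jstack}" -l -F "$pid" > "$out_file" || return 1
    echo "$out_file"
}
```

For example, save_thread_dump <pid of jenkins> /tmp writes the dump under /tmp and prints the path, which can then be attached to a task before Jenkins is restarted.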

Build failures look unrelated

Sometimes, changes in other repositories may cause your builds to fail. You can check the Shared Build Failure board to see if any existing issues are similar to your build failure; if there aren’t any, and you’re reasonably certain that your build failures are unrelated to your own changes, you can create a new task.

Agent remote call failed

Errors like 11:53:37 FATAL: Remote call on integration-agent-docker-1001 failed are caused by problems in the Java agent process running on each agent machine.

To fix these errors, try restarting the agent on the target machine.

Take agent offline in Jenkins at https://integration.wikimedia.org/ci/computer/AGENTNAME/markOffline
Disconnect the agent: https://integration.wikimedia.org/ci/computer/AGENTNAME/disconnect

SSH into the agent and kill the Java slave.jar process:

thcipriani@integration-agent-docker-1001:~$ ps aux | grep -i jav[a]
jenkins+ 31931  0.9  1.9 12195832 483676 ?     Ssl  Feb19 158:55 java -jar slave.jar
thcipriani@integration-agent-docker-1001:~$ sudo kill -9 31931

Bring node back in Jenkins web ui: https://integration.wikimedia.org/ci/computer/AGENTNAME/toggleOffline
Relaunch the agent on the machine via Jenkins web ui: https://integration.wikimedia.org/ci/computer/AGENTNAME/launchSlaveAgent

Ensure the agent process has launched on the agent machine itself, i.e., ensure that there is a new PID for the slave.jar process:

thcipriani@integration-agent-docker-1001:~$ ps aux | grep -i jav[a]
jenkins+ 10618 27.8  0.5 10419524 141168 ?     Ssl  16:43   0:05 java -jar slave.jar
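
The PID comparison above can be scripted by extracting the PID column from the ps output. This is a minimal sketch; the function name slave_pids is hypothetical, and it assumes the agent command line is exactly "java -jar slave.jar" as in the output shown above.

```shell
# Hypothetical helper: read `ps aux` output on stdin and print the PID
# (second column) of every "java -jar slave.jar" process. The escaped dot
# in the pattern keeps awk's own command line from matching itself.
slave_pids() {
    awk '/java -jar slave\.jar/ { print $2 }'
}
```

Run ps aux | slave_pids before killing the process and again after relaunching the agent; the agent has restarted if the second PID differs from the first.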

Debugging

Start Jenkins with Java option:

-Dhudson.plugins.git.GitSCM.verbose="true"

Text thread dump: https://integration.wikimedia.org/ci/monitoring?part=threadsDump