Beta Cluster/status

Last update on: 2014-02-monthly

2012-05-10
Chris, Sam, Antoine, Faidon, and Ryan met in San Francisco the week of May 7 to bootstrap work on this project. Current focus is getting media handling working smoothly.

2012-05-15
As of May 15th: 
 * Apache instances have been built entirely using Puppet classes; the old one will be removed. All queries (thumbs/regular text/bits) hit the application Apaches, with upload.beta.wmflabs.org pointing to the IP address shared by all wikis.
 * MediaWiki logging is fine.
 * Blocker: /home/wikipedia needs a decent place with a lot of disk space to host MediaWiki checkouts, MediaWiki logs and syslogs.
 * Blocker: no syslog-server yet, since it conflicts with a base class which is always installed.
 * MediaWiki configuration files are in the process of being merged from prod to labs.

2012-05-20
Project is now a bit more on par with production status.
 * A job runner has been set up and is currently catching up with all the pending jobs. Apparently, that includes some video resizing for TimedMediaHandler.
 * All code has been updated to a recent version and all databases have been upgraded.
 * Uploading files should work again (as of May 17th).

2012-05-monthly
Chris McMahon, Sam Reed, Antoine Musso, Faidon Liambotis, and Ryan Lane met in San Francisco the week of May 7 to bootstrap work on this project, kickstarting a process of aligning the configuration with our production cluster. Apache web server instances are now completely configured automatically using Puppet classes. A few key Wikimedia configuration files that were previously managed via a private Subversion repository are now managed in a public Git repository. Much work remains to make this a stable testing environment, which will continue in June.

2012-06-25
TimedMediaHandler has been set up, though transcoding is not operational yet, since that would require a fully functional job queue. We discovered that the version of Ubuntu currently used in production (Lucid) won’t work with TimedMediaHandler. As a result, Antoine and Faidon updated the Puppet configurations for the Apache web servers to run on the next generation Ubuntu (Precise).

Administrative tools have been set up closely following the way it is done in production. As an example, beta uses the exact same workflow to update the l10n cache. We will work on fetching l10n updates from translatewiki.

2012-06-monthly
The primary focus of Beta cluster work in June was in service to TimedMediaHandler (TMH). TMH has been set up, though transcoding is not operational yet, since that would require a fully functional job queue. The team discovered that the version of Ubuntu currently used in production (Lucid) won’t work with TimedMediaHandler. As a result, Antoine and Faidon updated the Puppet configurations for the Apache web servers to run on the next generation Ubuntu (Precise).

Administrative tools have been set up closely following the way it is done in production. For example, the Beta Cluster now uses the exact same workflow to update the l10n cache as we do in production. The team plans to further improve this by fetching l10n updates from translatewiki.

2012-07-16
At the beginning of July, the labs instances were migrated to new, more powerful hardware, improving performance by an order of magnitude. Unfortunately, some instances were corrupted in the process, but thanks to our extensive use of Puppet, replacing them was pretty fast.

Antoine wrote an overview of the beta cluster; it still needs to be amended with sections about how to update code and debug issues.

2012-07-23
The MediaWiki code and extensions are now being updated on a regular basis. Petr Benan is starting to implement the IRC feed system for bot consumption. We received attention from spammers; several countermeasures have been applied, such as the Captcha system enabled by Platonides and automatic blocking of known open proxies. Jan Gerber is improving the job queue system so that it can fit in beta; that is a prerequisite for the TimedMediaHandler extension, which would let us test video transcoding. Thumbnails are still not working correctly, and a workaround is still in progress.

2012-07-30
<section begin="2012-07-30"/>All beta instances are now running out of the shared /data/project directory provided by the labs infrastructure instead of an NFS instance. Platonides has set up Captcha for user creation to help prevent spam, and some well-known IP addresses have been banned. Jan Gerber is successfully using the infrastructure to work on TimedMediaHandler, especially the job system that will process the video transcoding. Finally, Ryan Kaldari is using beta to set up E2 extensions.<section end="2012-07-30"/>

2012-07-monthly
<section begin="2012-07-monthly"/>The beta cluster infrastructure is now mostly in our configuration change engine (Puppet) and is starting to be used by third parties. The Features team and Jan Gerber are now taking advantage of the beta cluster to stage changes for production. We have set up Captcha and IP blocking to reduce the amount of spam being generated on the beta wikis. An overview document has been started to help introduce new people to the beta cluster.<section end="2012-07-monthly"/>

2012-08-03
<section begin="2012-08-03"/>This past week focused on cleaning out the cluster and working with ops to finish up the housework. All instances are now running on new hardware thanks to Andrew Bogott, and all make use of the project storage path (/data/project), which was upgraded by Ryan Lane to use the latest GlusterFS release.

Most obsolete and experimental instances have been removed.

The overall documentation has been expanded.<section end="2012-08-03"/>

2012-08-31
<section begin="2012-08-31"/>The MediaWiki core and extensions are now automatically updating. From now on, the beta cluster always uses the very latest version published under the master branch of each repository.<section end="2012-08-31"/>

2012-08-monthly
<section begin="2012-08-monthly"/>The MediaWiki core and its extensions are now automatically updating, and the beta cluster is now always using the very latest version published under the master branch of each Git repository.<section end="2012-08-monthly"/>

2012-09-24
<section begin="2012-09-24"/>bits.beta.wmflabs.org is now fully managed by Puppet. It serves the assets of MediaWiki and its extensions, as well as geographical lookup of IP addresses: http://bits.beta.wmflabs.org/geoiplookup<section end="2012-09-24"/>

2012-09-monthly
<section begin="2012-09-monthly"/>In September, QA Lead Chris McMahon announced that the Beta cluster is a fit test environment: code is routinely deployed there ahead of production, the test environment emulates the production environment closely, and we can easily and reliably manipulate aspects of the test environment (configuration, permissions, etc.) for testing purposes. Also, bits.beta.wmflabs.org is now fully managed by Puppet. It serves the assets of MediaWiki and its extensions, as well as geographical lookup of IP addresses. Some work remains to be done (performance tuning, configuration) but the infrastructure is in place for software testing and browser test automation.<section end="2012-09-monthly"/>

2012-10-monthly
<section begin="2012-10-monthly"/>The MediaWiki configuration on the beta cluster still has a few remaining live hacks that prevent it from being upgraded smoothly. The final bits have been tracked down and will need a final sprint.<section end="2012-10-monthly"/>

2012-11-06
<section begin="2012-11-06"/>We are working on getting Zuul in place so Jenkins can talk to Gerrit - that's Antoine's goal. The goal for the NL hackathon is to work on CI in general. We hope that the NL hackathon will aid in speeding Beta cluster work. Filipin is working on getting CloudBees into a slave for WMF's Jenkins installation -- that's one of the NL hackathon goals.<section end="2012-11-06"/>

2012-11-13
<section begin="2012-11-13"/>Deployed AFTv5 to beta cluster and New Page Patrol is being maintained there as well. Still working on issues of ongoing maintenance. <section end="2012-11-13"/>

2012-11-30
<section begin="2012-11-30"/>Beta played a role in handling a recent issue with a defect that escaped to production. Beta remains the primary host for AFTv5 testing, including browser test automation.<section end="2012-11-30"/>

2012-11-monthly
<section begin="2012-11-monthly"/>We deployed ArticleFeedbackv5 to the beta cluster, which is the primary host for AFTv5 testing, including browser test automation. New Page Patrol is being maintained there as well. We are still working on issues of ongoing maintenance, and this cluster played a role in catching a defect that recently escaped to production.<section end="2012-11-monthly"/>

2012-12-11
<section begin="2012-12-11"/>Arthur Richards is organizing a quick meeting to kick start the deployment of MobileFrontend on beta. Asked for support of PageTriage in test envs apropos of conversation with Ryan Kaldari https://bugzilla.wikimedia.org/show_bug.cgi?id=43203<section end="2012-12-11"/>

2012-12-monthly
<section begin="2012-12-monthly"/>The project to support MobileFrontend in Beta labs continues. We intend for Beta labs to become a test environment for the new git-deploy script from the Operations team: this should be helpful in ongoing maintenance of the environment.<section end="2012-12-monthly"/>

2013-01-22
<section begin="2013-01-22"/>Kickoff held for Mobile Frontend support for beta labs, scope in place, project proceeding. <section end="2013-01-22"/>

2013-01-29
<section begin="2013-01-29"/>Support for MobileFrontend on beta labs is underway. Also proposed a novel use for beta to answer needs from the AFT project and E3. <section end="2013-01-29"/>

2013-01-monthly
<section begin="2013-01-monthly"/>The main use for the Beta Cluster in January was to test git-deploy. Zeljko Filipin continues to run regular tests there. Antoine Musso, Max Semenik, and Andrew Bogott are setting up MobileFrontend to run on Beta for testing purposes.<section end="2013-01-monthly"/>

2013-02-05
<section begin="2013-02-05"/>Work continued to add support for MobileFrontend on the Beta cluster, and a discussion to support a pre-release version of AFTv5 is underway.<section end="2013-02-05"/>

2013-02-12
<section begin="2013-02-12"/>MobileFrontend support nearly done. Support for other extensions under discussion, especially db updates.<section end="2013-02-12"/>

2013-02-25
<section begin="2013-02-25"/>Automated support for database updates should be in place this week, which will greatly increase the value of beta cluster as a test environment for multiple extensions not well supported now. <section end="2013-02-25"/>

2013-03-05
<section begin="2013-03-05"/>Search being added to beta, MobileFrontend tweaks. Discussing new ways to use beta as a result of San Francisco gathering.<section end="2013-03-05"/>

2013-02-monthly
<section begin="2013-02-monthly"/>We are adding search to the beta cluster following Mobile Frontend tweaks. We are discussing new ways to use the beta cluster as a result of our San Francisco gathering in February.<section end="2013-02-monthly"/>

2013-03-12
<section begin="2013-03-12"/>Search being added to beta cluster in preparation for changing Lucene<section end="2013-03-12"/>

2013-03-19
<section begin="2013-03-19"/>Beta cluster db now being updated automatically! Added BZ issue to track moving automated tests to beta cluster from test2wiki. Search is up on beta but untested. https://bugzilla.wikimedia.org/show_bug.cgi?id=34250<section end="2013-03-19"/>

2013-03-monthly
<section begin="2013-03-monthly"/>"Phase 1" support on beta for Mobile is complete and Mobile is using the beta cluster for testing now. We added search to beta, including Mobile. Lucene instances have been set up to provide search suggestion and ... search capabilities, but it's a rough base which still needs to be improved. More automated tests are now targeting beta cluster, and targeting the test2wiki/production cluster is underway.

Jenkins is now upgrading the database schemas on an hourly basis and deploying changes to the MediaWiki configuration just after they have been merged. If you are curious, have a look at the Jenkins dashboard for the beta project.<section end="2013-03-monthly"/>

2013-04-02
<section begin="2013-04-02"/>Starting to point automated tests currently targeting test2wiki to beta labs also to shake out issues there and ultimately improve test coverage. Maintenance ongoing.<section end="2013-04-02"/>

2013-04-08
<section begin="2013-04-08"/>Beta as target for more automated browser tests in process. New utility on beta pays off: https://bugzilla.wikimedia.org/show_bug.cgi?id=47015<section end="2013-04-08"/>

2013-04-16
<section begin="2013-04-16"/>Pointed browser tests to beta labs, config of tests and beta to accommodate that is ongoing. <section end="2013-04-16"/>

2013-04-monthly
<section begin="2013-04-monthly"/>We started to point automated tests currently targeting test2wiki to Beta labs to shake out issues there and ultimately improve test coverage. This will help us with earlier detection of bugs introduced into master (such as ). Mark Bergsma and Antoine Musso refined the Varnish configuration for MobileFrontend, and further refined the configuration of the search functionality.<section end="2013-04-monthly"/>

2013-04-30
<section begin="2013-04-30"/>The Apaches and the bastion are now using an NFS server instead of GlusterFS. The pages are served much faster as a result (from 560 ms down to 260 ms).<section end="2013-04-30"/>

2013-05-06
<section begin="2013-05-06"/>Ariel Glenn and Antoine Musso started the work toward enabling SSL on the beta cluster. That will in turn let the Mobile and QA teams enhance their browser tests and will be very useful for the OAuth project.<section end="2013-05-06"/>

2013-05-07
<section begin="2013-05-07"/>Niklas Laxström enabled Universal Language Selector on the beta wiki.<section end="2013-05-07"/>

2013-05-13
<section begin="2013-05-13"/>Max Semenik started setting up EventLogging on beta, with support from Ori Livneh.<section end="2013-05-13"/>

2013-05-14
<section begin="2013-05-14"/>HTTPS has been enabled on beta, though the certificate does not match the domains yet.<section end="2013-05-14"/>

2013-05-16
<section begin="2013-05-16"/>Ariel Glenn has set up a Redis instance on beta. Antoine fixed the jobrunner, which was no longer processing any jobs because it could not access the MediaWiki files. Jobs are thus processing again, and from Redis!<section end="2013-05-16"/>

2013-05-26
<section begin="2013-05-26"/>Max Semenik and Ori Livneh have set up EventLogging on beta. The events are sent to the bits cache (Varnish) and logged in the new deployment-eventlogging instance.<section end="2013-05-26"/>

2013-05-30
<section begin="2013-05-30"/>Roan Kattouw set up Parsoid in labs, and configured VisualEditor on beta to use it. VisualEditor is now working on the beta wikis.<section end="2013-05-30"/>

2013-05-monthly
<section begin="2013-05-monthly"/>In May, Ariel Glenn helped set up missing bits of infrastructure on the beta cluster, adding a Redis instance (that holds the job information) and HTTPS support. HTTPS will let us write scenarios related to logging in on the wiki or via a mobile device; it will also let us test out OAuth.

udp2log archiving is now working reliably. Max Semenik has set up an EventLogging infrastructure on beta, and Niklas Laxström enabled Universal Language Selector. Job processing was malfunctioning but has been restored.

Since April 30, the MediaWiki instances have been using NFS, which is much faster than the previous GlusterFS share; page serving time went from 560 ms to 260 ms.

Roan Kattouw has deployed Parsoid and VisualEditor on beta. Just like in production, users can enable it in their preferences under 'Editing'. <section end="2013-05-monthly"/>

2013-06-15
<section begin="2013-06-15"/>On a PHP fatal error, beta will now display an error page instead of a useless blank page.<section end="2013-06-15"/>

2013-06-17
<section begin="2013-06-17"/>Math support has been added. The texvc should be updated by Jenkins whenever a change is merged in the Math extension.<section end="2013-06-17"/>

2013-06-18
<section begin="2013-06-18"/>MaxSem wrote a script to synchronize CSS articles from production on beta. This will solve a few issues for the Mobile QA team as well as for our Selenium tests.<section end="2013-06-18"/>

2013-06-19
<section begin="2013-06-19"/>Steinsplitter and Antoine fixed the AbuseFilter configuration to have a global list of filters on the labswiki. Filters should be configured there and will be used by all the wikis.<section end="2013-06-19"/>

2013-06-20
<section begin="2013-06-20"/>The PHP fatal errors caught by the wmerrors extension are now sent to the beta udp2log instance. The log file is /data/project/logs/fatal.log. That will greatly improve our troubleshooting.<section end="2013-06-20"/>

2013-06-monthly
<section begin="2013-06-monthly"/>Max Semenik wrote a script to synchronize CSS from production on beta. Steinsplitter and Antoine Musso fixed the AbuseFilter configuration to have a global list of filters on the labswiki. Filters should be configured there and will be used by all the wikis. The PHP fatal errors caught by the wmerrors extension are now sent to the beta udp2log instance. That will greatly improve our troubleshooting process.<section end="2013-06-monthly"/>

2013-07-26
<section begin="2013-07-26"/>Beta finally has a syslog receiver on deployment-bastion, thus solving (no syslog::server in beta). The logs can be accessed via either /home/wikipedia/syslog or /data/project/logs/syslog/. Thank you Leslie.<section end="2013-07-26"/>

2013-07-monthly
<section begin="2013-07-monthly"/>The Beta cluster continues to be a target for automated and manual testing. It also finally has a syslog receiver on deployment-bastion, thus solving (no syslog::server in beta). The logs can be accessed via either /home/wikipedia/syslog or /data/project/logs/syslog/. This is thanks to Leslie Carr.<section end="2013-07-monthly"/>

2013-09-18
<section begin="2013-09-18"/>The Flow MediaWiki extension has been enabled on the beta cluster. This will let the QA team write browser tests in a context close to production. <section end="2013-09-18"/>

2013-11-14
<section begin=2013-11-14/>wfDebugLog messages are being sent to an experimental logstash instance via udp2log.<section end=2013-11-14/>

2013-11-monthly
<section begin="2013-11-monthly"/>In November, the Beta cluster saw greatly improved support for testing Parsoid, the parsing engine behind VisualEditor. The Beta cluster also continues to provide a real-world simulation for the Flow project in advance of Flow's limited release scheduled for December. Beta continues to be the main test environment for MobileFrontend, CirrusSearch, and many other Wikimedia software projects.<section end="2013-11-monthly"/>

2013-12-monthly
<section begin="2013-12-monthly"/>Parsoid on the Beta cluster is now based on the  repository and is properly self-updating whenever a change is merged in that repository via a Jenkins job.

Beta labs played a key role in finding and fixing some significant errors that, in combination, were causing users to see 503 errors in production, particularly on large pages and for Mobile users. For one thing, some timeouts on the Varnish caches had been set too low. We had increased those for the text Varnish servers but had not done so for Mobile Varnish servers. A tricky bug was also causing parts of large pages to be parsed multiple times. Last, the browser tests that incurred the 503 errors should have been capable of ignoring them. Thanks to Beta labs, the Varnish server timeouts are now correct, the multiple-parsing bug is addressed and the browser tests for MobileFrontend are running correctly.<section end="2013-12-monthly"/>

2014-01-monthly
<section begin="2014-01-monthly"/>Beta is being used to test the Math extension rewrite. The Parsoid extension is now deploying continuously via a Jenkins job; its status can be found on the CI dashboard job "Parsoid update". The wikis now send updates to the irc.wikimedia.org server.<section end="2014-01-monthly"/>

2014-02-monthly
<section begin="2014-02-monthly"/>Not much happened on the beta cluster besides the usual maintenance and the platform being used to detect nasty bugs before they land on the production cluster. It is being used successfully for staging various features, bugfixes and extensions, as well as for browser tests tracking regressions. Next month will see the beta cluster migrating from the pmtpa datacenter to the eqiad datacenter.<section end="2014-02-monthly"/>