Wikimedia Engineering/Report/2012/November

Engineering metrics in November:
 * 112 unique committers contributed patchsets of code to MediaWiki.
 * The total number of unresolved commits went from about 440 to about 535.
 * About 45 shell requests were processed.
 * About 89 developers got access to Git and Wikimedia Labs.
 * Wikimedia Labs now hosts 145 projects, 792 users; to date 1322 instances have been created.

Major news in November includes:
 * Wikivoyage.org soft-launched (image transfer and small fixes still ongoing)
 * New HTML5 video player launched with support for WebM video, subtitles and multiple resolution derivatives
 * First version of analytics Hadoop cluster put into service
 * We opened up the process of product management to volunteers with the announcement of a search for Volunteer Product Managers

Note: As of last month, we're proposing a shorter and simpler version of this report for less technically savvy readers.

Work with us
Are you looking to work for Wikimedia? We have a lot of hiring coming up, and we really love talking to active community members about these roles.



Announcements

 * Quim Gil joined the Engineering Community Team of the Platform engineering group as "Technical Contributor Coordinator (IT Communications Manager)" (announcement).
 * Juliusz Gonera joined the Mobile team as Software Developer (announcement).

Technical Operations
Site Infrastructure
 * Mark Bergsma made a breakthrough in resolving an old and elusive instability issue in our Varnish servers, which occurs when they are under extreme load or experiencing hanging connections/packet load. The problem turned out to be the slow epoll thread. Under load, once the pipe buffer (64 KB) is full, the Varnish worker threads writing to it block, and the server's condition deteriorates rapidly. Mark fixed this issue by moving the reading of the sessions earlier in the epoll event loop, before the thread does anything else, thereby reducing the size of the pipe buffer. With this enhancement, Mark is confident he can further reduce the number of Varnish servers in our caching infrastructure.


 * Asher Feldman is happy to report that the memcached instances on the app servers in Tampa are no longer in use. This gives us back an extra 2GB of RAM on many of the app servers (which only have 8 or 12GB to begin with), which can go towards increasing PHP capacity. The change also improves the stability of the site by addressing some of the root causes of multiple site outages, and brings with it several client improvements, including consistent hashing, igbinary serialization, and better timeout handling. The total cache pool has increased from 140GB to 1392GB, which is currently enough to meet full parser cache requirements from RAM. Sessions are no longer stored in memcached at all but have been migrated to redis, which will provide replication to the stand-by datacenter. Performance is also noticeably better, as can be seen by comparing the max value in the 90th and 99th percentile times in the attached graph.
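
As a rough illustration of why consistent hashing matters for a cache pool like the one described above, here is a minimal sketch in Python. It is not the client code used in production: the `HashRing` class, the server names `mc1` to `mc4` and the choice of 100 virtual nodes per server are all assumptions made for this example. The point it shows is that removing one server from the ring only remaps the keys that server owned, instead of reshuffling the whole cache.

```python
import bisect
import hashlib


class HashRing:
    """Minimal consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, servers, replicas=100):
        self.replicas = replicas  # virtual nodes per server (assumed value)
        self._points = []         # sorted hash positions on the ring
        self._owner = {}          # hash position -> server name
        for server in servers:
            self.add(server)

    @staticmethod
    def _hash(value):
        # MD5 is used here only to spread keys evenly, not for security.
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def add(self, server):
        for i in range(self.replicas):
            point = self._hash(f"{server}#{i}")
            self._owner[point] = server
            bisect.insort(self._points, point)

    def remove(self, server):
        for i in range(self.replicas):
            point = self._hash(f"{server}#{i}")
            del self._owner[point]
            self._points.remove(point)

    def lookup(self, key):
        # The first ring position after the key's hash owns the key,
        # wrapping around to the start of the ring if necessary.
        idx = bisect.bisect(self._points, self._hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


# Hypothetical server names and keys, for demonstration only.
ring = HashRing(["mc1", "mc2", "mc3", "mc4"])
keys = [f"key:{i}" for i in range(10000)]

before = {k: ring.lookup(k) for k in keys}
ring.remove("mc4")
after = {k: ring.lookup(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys changed server")
```

With a plain `hash(key) % number_of_servers` scheme, dropping one of four servers would remap most keys; with the ring, only the keys previously held by the removed server move.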


 * In recent months, we've seen a high hardware failure rate with our batch of Swift servers. After discussion with our vendor, they agreed to replace all those servers with newer hardware. All the servers required to replace the Tampa Swift servers have just arrived. We are in the process of migrating data from the old servers to the new ones, but it will take time to drain traffic, remove the old hardware from production and slowly ramp up the new machines. Ariel Glenn's current plan is to add 2 servers per week.


 * After several months of testing and tweaking, Peter Youngmeister finally rolled out the new Apache-on-Precise build on all our Tampa app and imagescaler servers. This is the same (tested) image that we'll be using on the Ashburn app servers in the coming month.


 * Thanks to the efforts of Leslie Carr and Mark Bergsma, we are now a RIPE NCC member. With this membership, we may be eligible to receive a one-time allocation of a /22 of IPv4 address space from the last /8 of IPv4 address space. This is particularly important to us since we have run out of IPv4 addresses in Europe.
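
To put the size of a /22 in perspective, the short Python sketch below works out how many addresses such a block contains and how many /22 blocks fit in a /8. The network `10.0.0.0/22` is a private-range placeholder chosen for this example; it is not the block that would actually be allocated.

```python
import ipaddress

# Placeholder network from private address space (assumption for illustration).
block = ipaddress.ip_network("10.0.0.0/22")

# A /22 leaves 32 - 22 = 10 host bits.
total = block.num_addresses

# The same block split into /24 networks.
slash24s = list(block.subnets(new_prefix=24))

# Number of /22 blocks that fit in a single /8.
allocations_per_slash8 = 2 ** (22 - 8)

print(total, len(slash24s), allocations_per_slash8)
```

In other words, a /22 is equivalent to four /24 networks, and a single /8 can be divided into a limited number of such allocations.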


 * The SSL cluster was upgraded to Ubuntu Precise, which provided newer versions of nginx and openssl, closing out the CRIME vulnerability and giving us the possibility of using HTTP/1.1 to the back-end. Testing of HTTP/1.1 for proxying will occur in the future.

Fundraising
 * The fundraising season started. Jeff Green and Leslie Carr rolled out the new Ashburn Fundraising server cluster and it is currently handling all payments. Leslie applied and tested firewall rules for the new cluster. The Operations and Fundraising tech teams made many bug fixes and small improvements to cluster administration, covering configuration management, monitoring, and logging. Jeff built out the second payments messaging box (ActiveMQ) as a hot standby. A new wiki was deployed for the Fundraising email unsubscribe page, to segregate it from sensitive services (payments, CiviCRM). Specifications for new payments bastion hosts were started.

Data Dumps
 * Media bundles are back in business at your.org now that the network issues have been fixed. Work has started on upgrading the OS on the servers that produce the dumps, rebuilding the necessary packages and testing. The 'add/changes' experimental dumps have been running stably long enough that we've made them available on the gluster public data volume accessible to all Labs projects.

Wikimedia Labs


 * Andrew Bogott continues to work on some long-term OpenStack issues. There's a new project, Moniker, which should (eventually) allow us to properly integrate the Labs cloud with our DNS back-end and provide better stability and a bit more user control. He is also contributing more basic OpenStack improvements, which will eventually trickle into Labs.
 * Andrew has also been fiddling quite a bit with the usability of OpenStackManager, which is the GUI for labsconsole. The interface is now marginally easier to use and understand, and improvements are ongoing.

Others
 * Chris Johnson has relocated from Tampa to work in our Ashburn datacenter. Steven Bernardin is now the main Tampa data center engineer.

Mobile

 * The Mobile team (Jon Robson, Juliusz Gonera, Arthur Richards and Max Semenik) deployed several features to our beta and production mobile web infrastructure this month. To beta, we deployed experimental edit functionality, reformatted tables, random article support, simpler layout for cleanup templates, and watchlists. For production, we added log-in support.

Offline
Kiwix
 * A new project, Phpzim, was started with the support of Wikimedia CH. This project will create a PHP binding for zimlib, allowing any PHP developer to easily create and read ZIM files. This is the first building block of a bigger project to allow quick ZIM file generation in MediaWiki (and also other PHP CMSes). Work on ZIM Autobuild continues and Kiwix ZIM throughput is increasing slowly (4 files in November). A small testing stage for Kiwix 0.9rc2 will finally start in early December, followed by the release.

Wikidata
The Wikidata project is funded and executed by Wikimedia Deutschland.


 * The repo side of Wikidata has been launched on http://www.wikidata.org. It contains the results of Phase 1 (language links) and has already attracted a community to maintain the wiki. Meanwhile, the Wikidata team has continued work on Phase 2 of Wikidata (infoboxes) to add statements with values to the items in the Wikidata repo. The team improved the propagation of changes from the repo to the client and the messaging in Recent Changes. There is a constant exchange with WMF about the upcoming deployment cycle. Feedback and questions are welcome on the mailing list and on meta.

Future
The engineering management team continues to update the Software deployments page weekly, providing up-to-date information on the upcoming deployments to Wikimedia sites, as well as the engineering roadmap, listing ongoing and future Wikimedia engineering efforts.