Wikimedia Engineering/Report/2012/November

Engineering metrics in November:
 * 112 unique committers contributed patchsets of code to MediaWiki.
 * The total number of unresolved commits went from about 440 to about 535.
 * About 45 shell requests were processed.
 * About 89 developers got access to Git and Wikimedia Labs.
 * Wikimedia Labs now hosts 145 projects, 792 users; to date 1322 instances have been created.

Major news in November include:
 * Wikivoyage.org soft-launched (image transfer and small fixes still ongoing)
 * New HTML5 video player launched with support for WebM video, subtitles and multiple resolution derivatives
 * First version of analytics Hadoop cluster put into service
 * We opened up the process of product management to volunteers with the announcement of a search for Volunteer Product Managers

''Note: As of last month, we also offer a shorter and simpler version of this report for less technically savvy readers.''

Work with us
Are you looking to work for Wikimedia? We have a lot of hiring coming up, and we really love talking to active community members about these roles.



Announcements

 * Quim Gil joined the Engineering Community Team of the Platform engineering group as "Technical Contributor Coordinator (IT Communications Manager)" (announcement).
 * Juliusz Gonera joined the Mobile team as Software Developer (announcement).

Technical Operations
Site Infrastructure
 * Mark made a breakthrough in resolving an old and elusive Varnish instability issue which occurs when servers are under extreme load or experiencing hanging connections/packet loss. The problem turned out to be the slow epoll thread: when under load, once the pipe buffer (64 KB) is full, the writing Varnish worker threads block, and the server situation deteriorates rapidly. Mark fixed this issue by moving the reading of sessions earlier in the epoll event loop, before the thread does anything else, thereby keeping the pipe buffer from filling up. With this enhancement, Mark is confident he can further reduce the number of Varnish servers in our caching infrastructure.
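The shape of the fix can be sketched in a few lines of Python (a hypothetical illustration only; Varnish itself is written in C and the real code is far more involved): ready sessions are read off the pipe at the top of each poll iteration, before any other work, so the small pipe buffer between threads never fills up and blocks the writers.

```python
import os
import selectors

# Hypothetical sketch of the ordering fix: drain readable session
# pipes first in each event-loop iteration, before any slower
# bookkeeping, so the pipe buffer never fills and blocks writers.
sel = selectors.DefaultSelector()  # uses epoll on Linux
r, w = os.pipe()
os.set_blocking(r, False)
sel.register(r, selectors.EVENT_READ, data="session-pipe")

os.write(w, b"new-session")  # a worker thread queues a session

received = []
for key, events in sel.select(timeout=1):
    # Step 1: read queued sessions immediately
    if key.data == "session-pipe":
        received.append(os.read(key.fd, 65536))
    # Step 2: any slower per-event work would run only after the drain

sel.close()
os.close(r)
os.close(w)
```

The point is purely the ordering: reads happen before anything else in the loop body, keeping pipe occupancy near zero even under load.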


 * Asher is happy to report that the memcached instances on the Tampa app servers are no longer in use! This gives us back an extra 2 GB of RAM on many of the app servers (which only have 8 or 12 GB to begin with), which can go towards increasing PHP capacity. It also improves the stability of the site by addressing some of the root causes of multiple site outages, and brings with it multiple client improvements, including consistent hashing, igbinary serialization, and better timeout handling. The total cache pool has increased from 140 GB to 1392 GB, enough to currently meet full parser cache requirements from RAM. Sessions are no longer stored in memcached at all, but have been migrated to Redis, which will provide replication to the stand-by data center. In addition to the improvements listed above, performance is quite a bit better as well: compare the maximum values in the 90th and 99th percentile times in the linked graph. MemcachedPeclBagOStuff is the new client, while MWMemcached is the old. Huge thanks to Asher, Aaron and Tim!
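One of those client improvements, consistent hashing, can be sketched as follows. This is a minimal, hypothetical Python illustration, not the actual PHP client code: each server contributes many points on a hash ring, a key is served by the next point clockwise from its hash, and so adding or removing a server only remaps the small fraction of keys that fall on that server's arcs (with naive modulo hashing, roughly 75% of keys would move when going from 3 to 4 servers).

```python
import bisect
import hashlib

class HashRing:
    """Minimal consistent-hash ring (illustrative, not the real client)."""

    def __init__(self, servers, points=100):
        # Each server gets `points` positions on the ring.
        self.ring = sorted(
            (self._hash(f"{server}:{i}"), server)
            for server in servers
            for i in range(points)
        )

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def get_server(self, key):
        # Find the first ring point at or after the key's hash, wrapping.
        idx = bisect.bisect(self.ring, (self._hash(key), "")) % len(self.ring)
        return self.ring[idx][1]

# Hypothetical server names: adding one server moves only ~1/4 of keys.
ring_a = HashRing(["mc1", "mc2", "mc3"])
ring_b = HashRing(["mc1", "mc2", "mc3", "mc4"])

keys = [f"key-{n}" for n in range(1000)]
moved = sum(ring_a.get_server(k) != ring_b.get_server(k) for k in keys)
```

Here `moved` comes out near 250 of 1000 keys, versus around 750 under modulo hashing, which is why resizing the pool no longer invalidates most of the cache.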


 * In recent months we saw a high hardware failure rate in our batch of Swift servers. After discussion with our vendor, they agreed to replace all of those servers with newer hardware. All the servers needed to replace the Tampa Swift servers have just arrived, and we are in the process of migrating the data from the old servers to the new ones, though it will take some effort to drain traffic, remove the old servers from production, and slowly ramp up the new hardware. Ariel's current plan is to add 2 servers per week.


 * After several months of testing and tweaking, Peter finally rolled out the new Apache-on-Precise build on all our Tampa app/image scaler servers. This is the same (tested) image that we will be using when we spin up the Ashburn app servers in the coming month.


 * Thanks to the efforts of Leslie and Mark, we are now a RIPE NCC member. With this membership, we may be eligible to receive a one-time allocation of a /22 of IPv4 address space from the last /8 of IPv4 address space. This is particularly important to us since we have run out of IPv4 addresses in Europe.


 * The SSL cluster was upgraded to Ubuntu Precise, which provides newer versions of nginx and OpenSSL, closing out the CRIME vulnerability and giving us the possibility of using HTTP/1.1 to the backend. Testing of HTTP/1.1 proxying will occur in the future.

Fundraising
 * Fundraising season started. Jeff and Leslie rolled out the new Ashburn fundraising server cluster, which is currently handling all payments. Leslie applied and tested firewall rules for the new cluster. There were lots of bug fixes and small improvements to configuration management, monitoring, logging, and cluster administration by ops and fr-tech. Jeff built out the second payments messaging box (ActiveMQ) as a hot standby. A new wiki was deployed for the fundraising email unsubscribe page, to segregate it from sensitive services (payments, CiviCRM). New payments bastion hosts were spec'd.

Data Dumps
 * Media bundles are back in business now that the network issues have been fixed. Work has started on upgrading the OS on the servers that produce the dumps, rebuilding the necessary packages, and testing. The 'add/changes' experimental dumps have been running stably for long enough that we've made them available on the Gluster public data volume accessible to all Labs projects.

Wikimedia Labs


 * Andrew continues to work on some long-term OpenStack issues. There's a new project, Moniker, which should (eventually) allow us to properly integrate the Labs cloud with our DNS backend and provide better stability and a bit more user control.  He continues to work on other more basic OpenStack work which will eventually trickle into Labs.


 * Andrew has also been fiddling quite a bit with the usability of OpenStackManager, which is the GUI for labsconsole. The interface is now marginally easier to use and understand, and improvements are ongoing.

Others
 * Chris Johnson (see the August 10 announcement) has relocated from our Tampa data center to our Ashburn data center. Steven Bernardin is now the main Tampa data center engineer.

Mobile

 * The Mobile team of Jon, Juliusz, Arthur and Max deployed several features to our beta and production mobile web infrastructure this month. To beta, we deployed experimental edit functionality, reformatted tables, random article support, a simpler layout for cleanup templates, and watchlists. In production, we added login support.

Offline
Kiwix
A new project, Phpzim, was started with the help of Wikimedia CH. It will create a PHP binding for zimlib, allowing any PHP developer to easily create and read ZIM files. This is the first stone of a bigger project to allow quick ZIM file generation in MediaWiki (and other CMSes written in PHP). Work on ZIM Autobuild continues, and Kiwix ZIM throughput is slowly increasing (4 files in November). A small testing stage of 0.9rc2 will finally start in the next few days, followed by the release.

Wikidata
The Wikidata project is funded and executed by Wikimedia Deutschland.

The repo side of Wikidata has been launched on http://www.wikidata.org. It contains the results of phase 1 (language links) and has already attracted a community to maintain the wiki. Meanwhile, the Wikidata team has continued work on Phase 2 of Wikidata (Infoboxes) to add statements with values to the items in the Wikidata repo. The team improved the propagation of changes from the repo to the client and the messaging in Recent Changes. There is a constant exchange with WMF about the upcoming deployment cycle. Feedback and questions are welcome on the mailing list and on meta.

Future
The engineering management team continues to update the Software deployments page weekly, providing up-to-date information on the upcoming deployments to Wikimedia sites, as well as the engineering roadmap, listing ongoing and future Wikimedia engineering efforts.