Wikimedia Platform Engineering/Site performance and architecture/status
Last update on: 2014-05-monthly
An initial investigation has begun on the possibility of upgrading from PHP 5.3 to PHP 5.4. Benchmarks are very promising, but a security enhancement we are currently using with PHP 5.3 (Suhosin) is not yet available for PHP 5.4, so the team is debating whether to carry on without it, as well as estimating the performance penalty introduced by this patch. More improvements have been made to Ganglia and Graphite.
Aaron and Andrew Garrett are working on job queue improvements. Tim is working on Apache configuration cleanup. Antoine is working on normalizing Labs and production configurations.
Tim Starling investigated an LLVM PHP bytecode converter this month, which looked like a promising direction for performance optimization (slides here). The theoretical gain seems significant, but the actual performance he observed was disappointing, and we probably won't go in that direction. Asher Feldman has deployed an upgraded version of the parser cache server (db40), and the results have been impressive. Comparing 90th and 99th percentile cache response times averaged over several days (July 3-5) for the old parser cache server against the last 8 hours for the new, improved parser cache shows the 90th percentile response time dropping from 53.6ms to 7.17ms, and the 99th percentile response time dropping from 185.3ms to 17.1ms. This is relevant to every page request from logged-in users and cookied logged-out users, so it should have a meaningful impact on the user experience. Aaron Schulz and Andrew Garrett have been working on job queue improvements, Tim Starling on Apache configuration cleanup, and Antoine Musso on normalizing Labs and production configurations.
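To put the parser cache figures above in proportion, the following minimal sketch computes the speedup factor and the percentage reduction for each percentile. Only the four latency values come from the report; the function names are illustrative.

```python
# Latencies quoted in the report, in milliseconds.
BEFORE_MS = {"p90": 53.6, "p99": 185.3}
AFTER_MS = {"p90": 7.17, "p99": 17.1}


def speedup(old_ms: float, new_ms: float) -> float:
    """Return how many times smaller the new latency is than the old one."""
    return old_ms / new_ms


def reduction_pct(old_ms: float, new_ms: float) -> float:
    """Return the drop in latency as a percentage of the old value."""
    return 100.0 * (old_ms - new_ms) / old_ms


for name in ("p90", "p99"):
    old, new = BEFORE_MS[name], AFTER_MS[name]
    print(f"{name}: {speedup(old, new):.1f}x faster, "
          f"{reduction_pct(old, new):.1f}% lower")
```

By this arithmetic the 90th percentile improved by roughly a factor of 7.5 and the 99th percentile by roughly a factor of 10.8.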
In addition to the Lua work, Tim Starling did some investigation of parallel parsing, but that project may go on the back burner until after Parsoid goes into production. Tim Starling also wrote a new Redis-based client for session handling. This will be important for the Virginia Datacenter Migration.
Tim Starling committed changes to our implementation of libxml to use the PHP memory allocator rather than malloc (the standard C library function for allocating memory from the operating system). This will allow us to have per-page limits on the complexity of pages in a way that more closely mirrors their impact on our cluster.
The support needed for more complex data structures (lists, sets) in memcached (with atomic updates) is awaiting further review and testing. The coding is essentially done (Change-Id: Ic27088488f8532f149cb4b36e156516f22880134).
Various improvements to the job queue have been made to avoid wasting CPU time on duplicate jobs and redundant page cache purges. Changes have also been made to make it possible to edit heavily used templates without timeouts. The support needed for more complex data structures (lists, sets) in memcached (with atomic updates) is awaiting further review and testing. The coding is essentially done (changeset).
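The duplicate-job avoidance mentioned above can be illustrated with a small in-memory sketch. This is not MediaWiki's actual job queue code; the class name `DedupQueue`, the signature scheme, and the example job type are all assumptions made for illustration. The idea is simply to refuse to enqueue a job whose type and parameters match a job that is still pending.

```python
import hashlib
import json
from collections import deque


class DedupQueue:
    """Hypothetical FIFO queue that ignores jobs identical to a pending one."""

    def __init__(self):
        self._jobs = deque()    # pending jobs in arrival order
        self._pending = set()   # signatures of the pending jobs

    @staticmethod
    def _signature(job_type, params):
        # A stable hash of the job type plus its parameters identifies duplicates.
        payload = json.dumps([job_type, params], sort_keys=True)
        return hashlib.sha1(payload.encode("utf-8")).hexdigest()

    def push(self, job_type, params):
        """Enqueue the job; return False if an identical job is already pending."""
        sig = self._signature(job_type, params)
        if sig in self._pending:
            return False
        self._pending.add(sig)
        self._jobs.append((sig, job_type, params))
        return True

    def pop(self):
        """Return the oldest (job_type, params) pair, or None if the queue is empty."""
        if not self._jobs:
            return None
        sig, job_type, params = self._jobs.popleft()
        self._pending.discard(sig)
        return job_type, params

    def __len__(self):
        return len(self._jobs)


queue = DedupQueue()
queue.push("purgePage", {"title": "Template:Example"})
queue.push("purgePage", {"title": "Template:Example"})  # ignored as a duplicate
print(len(queue))
```

Once a job has been popped, an identical job can be enqueued again, so only work that is redundant at the time of the push is skipped.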
After an assessment by Asher Feldman, Patrick Reilly and Tim Starling, the RDB database patch was canceled. Instead, in the short term, a separate vertically partitioned data cluster will be provided as temporary storage until a horizontally scalable architecture can be finalized. Matthias Mullie is modifying the RDB-dependent ArticleFeedbackToolv5 to remove that dependency through an abstraction layer. When a sharded or horizontally scaled solution is provided, AFTv5's abstraction will be migrated to it.
An initial assessment of various non-MySQL alternatives for using Aaron Schulz's JobQueue core patch in 1.20 is being done for Echo. Because of the time it takes to exhaust the Echo queues, Echo is currently written to bypass the JobQueue through direct calls. Luke Welling is abstracting the JobQueue for Redis, ZeroMQ, and others.
A patch to allow moving the DB job queue to another cluster is under review. An experimental Redis-based job queue patch also exists in Gerrit. Code was merged to support more complex data structures (lists, sets) in memcached (with atomic updates).
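One common way to get atomic updates on a list stored in a key-value cache is a compare-and-swap retry loop. The sketch below shows that general technique only; it is not the merged MediaWiki code. `FakeCasStore` is a hypothetical in-memory stand-in whose `gets` and `cas` methods loosely mimic memcached's check-and-set semantics, and it is not a real client API.

```python
class FakeCasStore:
    """Hypothetical in-memory store with version-checked writes."""

    def __init__(self):
        self._data = {}  # key -> (value, version)

    def gets(self, key):
        """Return (value, version); a missing key yields (None, 0)."""
        return self._data.get(key, (None, 0))

    def cas(self, key, value, version):
        """Write only if the key is still at the given version."""
        _, current = self._data.get(key, (None, 0))
        if current != version:
            return False
        self._data[key] = (value, current + 1)
        return True


def atomic_append(store, key, item, max_attempts=10):
    """Append item to the list at key, retrying if another writer got in first."""
    for _ in range(max_attempts):
        value, version = store.gets(key)
        new_list = list(value or []) + [item]
        if store.cas(key, new_list, version):
            return new_list
    raise RuntimeError(f"could not update {key!r} after {max_attempts} attempts")


store = FakeCasStore()
atomic_append(store, "recent-editors", "alice")
atomic_append(store, "recent-editors", "bob")
print(store.gets("recent-editors")[0])
```

A writer holding a stale version has its `cas` call rejected and must re-read the list before trying again, which is what prevents lost updates.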
A patch to allow moving the job queue to another DB cluster has been merged, and another patch to support an alternative Redis-based queue is in review in Gerrit. Currently, job-related operations consume a significant portion of production database master wall time.
All job queues were migrated off the main DB clusters to JobQueueRedis. Improvements were made to the category update queries to reduce lock exceptions that users often encountered when deleting files. This works via a new transaction callback hook added to the core database class, which can be used to resolve similar problems.
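The report does not describe the hook's interface, so the following is only a sketch of the general pattern under stated assumptions: callbacks registered during a transaction run after it commits, so contended updates (such as category counts) do not hold row locks for the duration of the main transaction. The class `Database` and the method `on_transaction_idle` are hypothetical names, not the actual MediaWiki API.

```python
class Database:
    """Hypothetical connection wrapper that defers callbacks until commit."""

    def __init__(self):
        self._in_transaction = False
        self._callbacks = []
        self.log = []  # records the order of operations for demonstration

    def begin(self):
        self._in_transaction = True
        self.log.append("BEGIN")

    def on_transaction_idle(self, callback):
        """Run callback now if no transaction is open, otherwise after commit."""
        if self._in_transaction:
            self._callbacks.append(callback)
        else:
            callback()

    def commit(self):
        self.log.append("COMMIT")
        self._in_transaction = False
        # Detach the list first so callbacks can safely register new callbacks.
        callbacks, self._callbacks = self._callbacks, []
        for callback in callbacks:
            callback()

    def rollback(self):
        self.log.append("ROLLBACK")
        self._in_transaction = False
        self._callbacks = []  # deferred work is dropped with the transaction


db = Database()
db.begin()
db.on_transaction_idle(lambda: db.log.append("update category counts"))
db.log.append("delete file row")
db.commit()
print(db.log)
```

In this sketch the category update is registered first but only runs once the main transaction has committed.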
Ceph: This morning (Pacific time) we enabled multi-write to both Ceph and Swift (i.e., Ceph will constantly be as up to date as the main file store). Later this week (probably Thursday), around the same time (very early morning, Pacific time), we'll switch the 'master' store to Ceph. You shouldn't see any user-facing issues now; you might later this week, but probably not, based on what we saw this morning.
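The multi-write arrangement described above can be sketched as a wrapper that sends every write to all backends and serves reads from whichever one is designated the master. This is an illustrative model only: `MemoryStore` and `MultiWriteStore` are hypothetical classes, and the real file backends have far richer interfaces and failure handling.

```python
class MemoryStore:
    """Hypothetical stand-in for an object store such as Swift or Ceph."""

    def __init__(self, name):
        self.name = name
        self.objects = {}

    def put(self, key, data):
        self.objects[key] = data

    def get(self, key):
        return self.objects[key]


class MultiWriteStore:
    """Write to every backend; read from the current master only."""

    def __init__(self, stores, master_index=0):
        self._stores = list(stores)
        self._master_index = master_index

    @property
    def master(self):
        return self._stores[self._master_index]

    def put(self, key, data):
        # Every backend receives the write, so replicas stay current.
        for store in self._stores:
            store.put(key, data)

    def get(self, key):
        return self.master.get(key)

    def set_master(self, index):
        """Switch which backend serves reads; writes still go to all backends."""
        if not 0 <= index < len(self._stores):
            raise IndexError("no such backend")
        self._master_index = index


swift = MemoryStore("swift")
ceph = MemoryStore("ceph")
files = MultiWriteStore([swift, ceph])  # Swift is the master at first
files.put("Example.jpg", b"image bytes")
files.set_master(1)                     # later, promote Ceph to master
print(files.master.name, files.get("Example.jpg"))
```

Because both stores already hold the same objects, switching the master changes only where reads are served from, which is why the switch is not expected to be visible to users.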
We implemented logging, aggregation and graphing of the VisualEditor DOM load and save timing. We also rolled out mw.inspect, a library for inspecting static asset metrics. We configured stable URLs and improved cache headers for font resources, and rolled out localStorage module caching to test wikis and the beta cluster.
We ran a controlled experiment to test the impact of module storage on performance. We expect to publish our findings within a week. We puppetized Graphite and MediaWiki's profiling log aggregator and migrated them to our Ashburn data center. Finally, we started working on a replacement profiling log aggregator that will process and visualize profiling data from both client-side and server-side code.
The team wrapped up the Puppetization of Graphite and its migration to Ashburn, and configured Travis CI to run MediaWiki's test suite under HHVM on each commit to core. They also added an initial HHVM role for MediaWiki-Vagrant and rewrote MediaWiki's profiling data aggregator to be more performant. Prior to the rewrite, it was constantly saturated and would drop data; the rewrite reduced average CPU utilization by more than two thirds.
Aaron Schulz has been reviewing the Petition extension for deployment to the cluster, working with Peter Coombe to improve its performance. In addition, the reliability and speed of media uploads were increased by removing many failure cases on Commons. There were many other minor fixes over the course of the month.