Wikimedia Platform Engineering/MediaWiki Core Team/Quarterly review, January 2014

This is an outline for a review of the MediaWiki Core team which took place at WMF on January 21, 2014.
 * Notes for this review

Team
See Wikimedia MediaWiki Core Team page.

Changes since last review: (none)

Search (ElasticSearch/CirrusSearch) deployment
Nik Everett and Chad Horohoe have been working on this through the past quarter. We optimistically projected we'd be able to complete this deployment by the end of 2013, and disable our previous search technology (lsearchd) in January 2014. This rollout is going pretty well, but it's going to take a bit longer to get fully switched over. Several wikis are now using CirrusSearch as the primary search engine, constituting roughly 8% of our search traffic as of this writing. Also as of this writing, we are indexing English Wikipedia. If that indexing goes smoothly and the index size is manageable, we may attempt switching it next. We currently project that we'll be done rolling this out by the end of March with the ability to turn off lsearchd sometime shortly thereafter, subject to hardware availability and how smoothly our gradual rollout goes.

Some progress has been made on designing the UX for interwiki search, though the development and deployment schedule for this work is TBD.

Architecture formalization
We held fortnightly IRC meetings to discuss many in-progress RFCs. Additionally, much of the planning for the Architecture summit happened in the quarter. A number of RFCs have been reviewed, and authors have gotten the feedback they've needed to refine their proposals. A lot of organization of the RFCs has happened now, setting us up to get through the list of items much more quickly. We believe it is quite likely that the summit will result in at least one high quality proposal worthy of significant investment of development time from the MediaWiki Core group in the coming calendar year, and we're generally looking to the summit as a means of informing our work planning.

DevOps sprint
The main areas of focus for this sprint were:
 * multi-site awareness - as it turns out, most of the issues were somehow addressed in the leadup to this work.
 * git-deploy - not done. Will be part of focus for next quarter (in some fashion).
 * deployment script improvements (scap) - we believe we have marked improvements, but they are hard to quantify due to stubborn problems with a legacy search host in Tampa that is slated to be decommissioned soon.
 * See Aaron's email to Engineering for some of the details.
 * monitoring / reporting - the initial LogStash deployment in production is done (after extensive testing in wmflabs).
 * DSL for Apache URL rewriting configuration (to reduce mistakes in the apache rewrite rules)
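
To illustrate how a small DSL can catch rewrite-rule mistakes before they reach Apache, here is a minimal sketch. It is not the DSL built during the sprint: the function name `rewrite_rule`, the `ALLOWED_FLAGS` set, and the example paths are all hypothetical, and Python's `re` module is used only as an approximation of Apache's PCRE syntax check.

```python
import re

# Hypothetical whitelist of flags this sketch accepts; a real DSL would
# cover whatever subset of mod_rewrite flags the configuration needs.
ALLOWED_FLAGS = {"L", "QSA", "R=301", "R=302", "NC", "PT"}


def rewrite_rule(pattern, target, flags=()):
    """Return one Apache RewriteRule line, rejecting obvious mistakes."""
    # Catch malformed patterns (e.g. unbalanced parentheses) early.
    # Python's regex dialect is close to, but not identical to, PCRE.
    try:
        re.compile(pattern)
    except re.error as exc:
        raise ValueError(f"invalid pattern {pattern!r}: {exc}") from exc

    # Unquoted whitespace would split the directive into extra arguments.
    if any(ch.isspace() for ch in pattern + target):
        raise ValueError("pattern and target must not contain whitespace")

    unknown = [f for f in flags if f not in ALLOWED_FLAGS]
    if unknown:
        raise ValueError(f"unknown flags: {unknown}")

    line = f"RewriteRule {pattern} {target}"
    if flags:
        line += " [" + ",".join(flags) + "]"
    return line


if __name__ == "__main__":
    print(rewrite_rule(r"^/wiki/(.*)$", "/w/index.php?title=$1", ["L", "QSA"]))
```

Generating the directives from validated calls like this means a typo in a pattern or flag fails when the configuration is built rather than when Apache reloads it.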

See Retrospective for what we learned from how we managed the sprint.

Wikimania scholarships app
Bryan Davis rewrote large swaths of the Scholarship app to adapt it to new requirements, working with Chad Horohoe and Katie Filbert on code review and some coding. Jessie/Ellie worked with the Wikimania planning team on this.

Auth systems
Chris Steipp continued work in this area as a solo effort, focusing mainly on refining our password expiration protocol as well as improving our password hashing. OAuth was actually deployed \o/. Several performance issues with SULv2 were addressed.

Security auditing and response
Chris Steipp continued his work on security auditing. The review queue for this has stacked up, with reviews promised for Limn, TimedMediaHandler v2, Kraken, and GLAM upload, among others. During the quarter, reviews for Flow and the Wikimedia Scholarship application were prioritized above Limn and Kraken, and completed.

Performance
Ori's work on tooling:
 * udpprofiler improvements
 * Graphite, gdash, udpprofiler migrated from Tampa
 * user-perceived latency measurement and aggregation http://ganglia.wikimedia.org/latest/?r=year&cs=&ce=&tab=v&vn=Navigation+Timing&hide-hf=false
 * static asset monitoring in ganglia: http://ganglia.wikimedia.org/latest/?r=year&cs=&ce=&tab=v&vn=Static+assets&hide-hf=false
 * Bugs caught using mediawiki.inspect module: (see tracking bug 55683)

Aaron's backend work:
 * File handling optimizations for Swift and LocalRepo
 * Avoid extra parsing in prepareContentForEdit
 * Several JobQueue management optimizations
 * Backend application caching changes (e.g. new MapCacheLRU class)
 * Database access improvements (e.g. SqlBagOStuff transaction avoidance, LogPager improvements)
 * Setup.php initialization and profiling improvements
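
For readers unfamiliar with the idea behind an in-process LRU map cache such as the MapCacheLRU class mentioned above, the following is a minimal sketch of the general technique. It is written in Python rather than PHP and is not MediaWiki's implementation; the class name `LRUMap` and its method names are assumptions made for illustration.

```python
from collections import OrderedDict


class LRUMap:
    """Hypothetical bounded map that evicts the least recently used key."""

    def __init__(self, max_keys):
        if max_keys <= 0:
            raise ValueError("max_keys must be positive")
        self._max_keys = max_keys
        # OrderedDict keeps keys in order; the oldest entry sits at the front.
        self._data = OrderedDict()

    def get(self, key, default=None):
        if key not in self._data:
            return default
        # A read counts as a use, so move the key to the "most recent" end.
        self._data.move_to_end(key)
        return self._data[key]

    def set(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        # Evict from the "least recent" end once the bound is exceeded.
        if len(self._data) > self._max_keys:
            self._data.popitem(last=False)

    def __contains__(self, key):
        return key in self._data

    def __len__(self):
        return len(self._data)


if __name__ == "__main__":
    cache = LRUMap(max_keys=2)
    cache.set("a", 1)
    cache.set("b", 2)
    cache.get("a")      # "a" is now more recently used than "b"
    cache.set("c", 3)   # exceeds the bound, so "b" is evicted
    print("b" in cache, "a" in cache, len(cache))
```

Bounding the map this way lets a long-running request keep hot values in memory without unbounded growth.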

PDF rendering

 * Brad Jorsch helped with the initial sprint, and has mainly been working in an advisory role on this.

HipHop VM Deployment
Things that are already done:


 * Initial HHVM role in Vagrant
 * Collaborated with P. Trajan from Facebook on getting HHVM packages signed
 * Travis CI running the MediaWiki core test suite under HHVM on each commit: https://travis-ci.org/wikimedia/mediawiki-core/ ; also running the test suite of deployment branches with extensions
 * Bug triage

Next quarter: As stated in the previous review, HHVM is maturing quite quickly, and it has the potential to vastly improve performance. However, although the set of features supported by HHVM is nominally sufficient for running MediaWiki in production, a lot of work must be done to verify that the software components essential to our MediaWiki deployment work properly when running under HHVM. This work has a very long tail, and we hope to engage other teams so that they help us see it through. To do this, some infrastructure for collaboration needs to be in place. The big-ticket compatibility issue that we already know we have is the need to patch or rewrite several PHP extensions (wikidiff2, wmerrors, LuaSandbox) so that they run under HHVM. Planned work:
 * Get a production service running on HHVM. (one of: job queue, l10n update, image scalers, etc.)
 * Port LuaSandbox
 * Packages and Puppet manifests for provisioning HHVM on Ubuntu Precise.
 * Jenkins job that tests patches to core and extensions using HHVM.
 * Jenkins job that runs the full suite of unit tests against HHVM.

Tim Starling, Chad Horohoe, Ori Livneh, Aaron Schulz and Antoine Musso plan to play a role in this. The goal for this coming quarter is to have at least one production service migrated over to HipHop VM.

Search
Chad and Nik plan to have all sites using CirrusSearch (and thus, ElasticSearch) as the primary search engine by the end of the quarter. We may have an initial run at the interwiki search UI if time allows.

Architecture/RFC Review
Tim will be working part time on this, to help teams and individual contributors to develop specifications for implementing product requirements in a manner consistent with the design, performance, architecture, stability, etc. requirements of MediaWiki and the Wikimedia production cluster.

Deployment-related Development
LogStash is a new logging framework which should make it much easier to view and query system logs for purposes of debugging. Several team members have been and will continue to be involved in work on this as it nears initial rollout: Bryan Davis, Ori Livneh, Aaron Schulz and Antoine Musso.

Bryan Davis will work on requirements for a migration from scap to another tool with input from the Dev and Deploy process review happening on the 22nd.

PDF rendering
Brad is continuing to work in a supporting role.

Performance Infrastructure
For next quarter: get front-end performance data piping into the same profiling data aggregator as back-end performance data, and provide a unified view for looking at latency across the stack.

Security
Password storage update, Security reviews

Admin tools development
Dan Garry will be scoping this project.

SecurePoll cleanup
Dan Garry will be scoping this project, with Brad Jorsch doing development work if it can be scoped in time.

SUL finalisation
Dan is scoping this, and working with James on what's involved.

Central CSS discussion
Dan is working on scoping this.

Allocations
This is our planned allocation for January through March of 2014:
 * Tim Starling: HHVM, Architecture/RFC Review, other review
 * Bryan Davis: git-deploy, LogStash
 * Nik Everett: Search, Search in Wikidata
 * Chad Horohoe: HHVM, Search
 * Brad Jorsch: SecurePoll cleanup, PDF rendering, API Maintenance, Scribunto maintenance
 * Ori Livneh: Performance Infrastructure, HHVM, git-deploy, LogStash
 * Aaron Schulz: HHVM, git-deploy, LogStash, Password storage update, l10n cache
 * Chris Steipp: Password storage update, Security reviews
 * Antoine Musso: HHVM, LogStash, JobQueue, Zuul upgrade
 * Sam Reed: Deployments
 * Dan Garry: Admin tools development (scoping out), SecurePoll (scoping out), SUL finalisation, OAuth improvements, Search