Wikimedia Platform Engineering/MediaWiki Core Team/Quarterly review, April 2014

This is an outline for a review of the MediaWiki Core team which will take place at WMF on April 15, 2014.
 * Notes for this review

Team
See Wikimedia MediaWiki Core Team page.

Changes since last review: (none)

HipHop VM Deployment
Plans for the quarter with status:
 * Get a production service running on HHVM. (one of: job queue, l10n update, image scalers, etc.)
 * Status: Getting HHVM serving up Beta. (Done)
 * Port LuaSandbox
 * Status: In progress.
 * Tim made very substantial contributions to HHVM's Zend compatibility layer, which include an implementation of PHP's Thread-Safe Resource Manager (TSRM) for HHVM and an overhaul of the compatibility layer's basic architecture to use the HHVM-Native Interface (HNI) rather than IDL files. Tim's work makes it substantially easier to port any Zend extension to HHVM, not just Scribunto.
 * Port wikidiff2
 * Status: Done (thanks Max Semenik!)
 * Port wmerrors
 * Status: Not needed
 * Port FastStringSearch
 * Status: Done by Aaron Schulz
 * Packages and Puppet manifests for provisioning HHVM on Ubuntu Precise.
 * Status: Not done. Work is in progress, but we'll want to rely on some combination of upstream work with Faidon and possibly others in TechOps to accomplish this.
 * Jenkins job that tests patches to core and extensions using HHVM.
 * Status: Done
 * Jenkins job that runs the full suite of unit tests against HHVM.
 * Status: Done

Search
Plan:
 * Have all sites using CirrusSearch (and thus, ElasticSearch) as the primary search engine by the end of quarter
 * Status: Partial. Non-Wikipedias switched over April 2.  (done by review?)
 * We may have an initial run at the interwiki search UI if time allows.
 * Status: Partial

Architecture/RFC Review
Tim worked part time on this. We had loftier goals for this (to help teams and individual contributors develop specifications for implementing product requirements in a manner consistent with the design, performance, architecture, and stability requirements of MediaWiki and the Wikimedia production cluster). Much of the work was simply keeping the RFC review process going, and Tim was absorbed into HHVM porting work.

Deployment-related Development
LogStash is a new logging framework which should make it much easier to view and query system logs for purposes of debugging.
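The benefit can be illustrated with a minimal sketch (Python, not MediaWiki's actual logging code): emitting each log record as one structured JSON line is what lets a pipeline such as Logstash index and query logs without fragile text parsing. The channel name below is hypothetical.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line, a shape
    that log shippers such as Logstash can index directly."""
    def format(self, record):
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "channel": record.name,
            "message": record.getMessage(),
        })

# Hypothetical channel name; MediaWiki's real channels differ.
logger = logging.getLogger("mediawiki.exception")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.error("Database connection lost")
```

Once every record carries the same fields, "all errors on this channel in the last hour" becomes a simple indexed query rather than a grep across servers.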

Plan:
 * Several team members have been and will continue to be involved in work on this as it nears initial rollout: Bryan Davis, Ori Livneh, Aaron Schulz and Antoine Musso.
 * Status: limited rollout
 * Bryan Davis to work on requirements for a migration from scap to another tool, with input from the Dev and Deploy process review happening on the 22nd.
 * Status: Bryan and Ori rewrote scap in Python, and have made some usability improvements to it. This has given Bryan a great deal of insight into how the overall system works.

PDF rendering
Brad continuing to work in a supporting role.

Performance Infrastructure
For next quarter: get front-end performance data piping into the same profiling data aggregator as back-end performance data, and provide a unified view for looking at latency across the stack.

Security
Password storage update, Security reviews

Admin tools development
Dan Garry will be scoping this project

SecurePoll cleanup
Dan Garry will be scoping this project, with Brad Jorsch doing development work if it can be scoped on time.

SUL finalisation
Dan is scoping this, and working with James on what's involved.

Central CSS discussion
Dan working on scoping

Past Quarter Allocations
This is our planned allocation for January through March of 2014:
 * Tim Starling: HHVM, Architecture/RFC Review, other review
 * Bryan Davis: scap/git-deploy, LogStash
 * Nik Everett: Search
 * Chad Horohoe: HHVM, Search
 * Brad Jorsch: SecurePoll cleanup, PDF rendering, API Maintenance, Scribunto maintenance
 * Ori Livneh: Performance Infrastructure, HHVM, git-deploy, LogStash
 * Aaron Schulz: HHVM, git-deploy, LogStash, Password storage update, l10n cache
 * Chris Steipp: Password storage update, Security reviews
 * Antoine Musso: HHVM, LogStash, JobQueue, Zuul upgrade
 * Sam Reed: Deployments
 * Dan Garry: Admin tools development (scoping out), SecurePoll (scoping out), SUL finalisation, OAuth improvements, Search

Focus project: moving HHVM into production
We think that pushing HHVM toward full deployment is our best bet at achieving the greatest impact on end-users as a group. Historically, our strategy with respect to site performance was to rely heavily on a caching layer, based on the assumption that only a tiny minority of users actually contribute content or personalize the interface, and that other users, representing the vast majority of visits to our projects, can be served by recycling generic views. This approach has allowed us to successfully scale to our current traffic levels, which is a significant feat. The cost of this approach has been twofold: it has concealed the user experience of logged-in editors, which is substantially degraded in comparison to that of anonymous users, and it has put a wall in front of our developers, frustrating most attempts to make the site more interactive or personalized for a larger portion of our traffic. If we want to graduate features out of Beta and make them accessible to new users, we need to modernize our application server stack.

However, there are substantial risks to pushing this early. Debian packaging is still in a nascent state. The standard advice to developers is still to compile the very latest version from master, because incompatibility bugs are still being fixed at a rapid pace. Rolling out HHVM would require extending our deployment system so that it builds a byte-code cache and synchronizes it to each application server. The mechanics of how we'll handle code changes, server restarts, etc. are still fuzzy. HHVM is not just a faster PHP: it's a totally new runtime that makes many aspects of PHP that were previously set in stone highly configurable. It has completely new monitoring and profiling capabilities. It's a different beast, and it'll take us some time to become adequately familiar with it. So we can't promise that we only have one more quarter of work to do on this.

There are a couple of external dependencies that, if met, would accelerate our work:
 * A speedy Ubuntu 14.04 deployment
 * Assisting upstream with proper HHVM packaging

We have pretty good momentum right now. Both HHVM's upstream at Facebook and our own TechOps team have been very supportive of our work and are eager to see us deploy this. We want to keep going.

SUL finalisation

 * Legoktm should be able to do the engineering work; Dan can tie up the other loose ends

Scap

 * Get to “100%” moved to Python

Revision storage revamp
This would involve:
 * Specifying an API for storage of revisions; anything implementing that API could be trivially used with MediaWiki.
 * Refactoring core so that revision storage is an implementation of that API.
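As a rough illustration of what such an API might look like, here is a hypothetical Python sketch (the real interface would be in PHP and considerably richer); the class and field names are invented:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional

@dataclass
class Revision:
    rev_id: int
    page_id: int
    text: str
    # Metadata fields (author, timestamp, ...) omitted for brevity.

class RevisionStore(ABC):
    """Hypothetical storage API: anything implementing it
    (SQL, an external service, ...) could back MediaWiki."""
    @abstractmethod
    def get(self, rev_id: int) -> Optional[Revision]: ...

    @abstractmethod
    def put(self, rev: Revision) -> None: ...

class InMemoryRevisionStore(RevisionStore):
    """Stand-in for the simple SQL-based implementation a
    developer installation might ship with."""
    def __init__(self):
        self._revs = {}

    def get(self, rev_id):
        return self._revs.get(rev_id)

    def put(self, rev):
        self._revs[rev.rev_id] = rev
```

The point of the abstraction is that callers never learn which backend they are talking to; swapping SQL for an external service becomes a configuration change rather than a refactor.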

There are two parts to revision storage: revision metadata and revision text. For revision text storage, we have External Storage, which was very innovative in its day, but is showing its age. Due to the design, reads and writes are largely focused on a single node (where the most recent revisions are), so adding nodes doesn't necessarily improve performance. The compression scripts may be fine, but we don't know for sure because we're a bit afraid of running them to find out. Thus, we're not really getting the benefit of the compression offered by the system.

A likely solution for storing revision text is Rashomon. Metadata could also be stored in Rashomon, but we may still need a copy of the metadata in our SQL database for fast and simple querying. Regardless of our plans for Rashomon, there is significant work involved in MediaWiki to abstract away the many built-in assumptions our code makes about retrieving revision metadata directly from a database.

This project would be done in service of a broader push toward a service-oriented architecture. We would use it to set an example for how we foresee other aspects of MediaWiki being turned into modular services, and would likely use it as an opportunity to further establish the value-object pattern exemplified in TitleValue. As part of this work, we would also provide a simple SQL-based implementation for developer installations of MediaWiki. We would also like to work on the infrastructure for providing proper authentication tokens for API access. Possible solutions include OAuth 2, Kerberos, or a custom solution built on JSON Web Tokens; the options for non-interactive applications are not as clear as they are for interactive web applications (where basic OAuth works pretty well).
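To make the JSON Web Token option concrete, here is a minimal HS256 sketch in Python using only the standard library. The payload fields and secret are invented for illustration; a real deployment would also handle expiry claims and key rotation.

```python
import base64
import hashlib
import hmac
import json

def _b64url(data: bytes) -> str:
    """Base64url-encode without padding, as the JWT format requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def sign_token(payload: dict, secret: bytes) -> str:
    """Build a signed HS256 JSON Web Token: header.payload.signature."""
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    body = _b64url(json.dumps(payload).encode())
    signing_input = f"{header}.{body}".encode()
    sig = _b64url(hmac.new(secret, signing_input, hashlib.sha256).digest())
    return f"{header}.{body}.{sig}"

def verify_token(token: str, secret: bytes) -> bool:
    """Recompute the signature and compare in constant time."""
    header, body, sig = token.split(".")
    signing_input = f"{header}.{body}".encode()
    expected = _b64url(hmac.new(secret, signing_input, hashlib.sha256).digest())
    return hmac.compare_digest(sig, expected)
```

Because the token is self-validating, an API server can authenticate a non-interactive client without a session store or an interactive authorization step, which is exactly the gap OAuth handles less cleanly.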

This project also offers us an opportunity to make revision storage a nicely abstracted interface in core, which could be quite complementary to Rashomon, giving Gabriel some clean API points to plug it in. It would likewise establish the template for how MediaWiki can be incrementally refactored into cleanly separated services.

After some investigation, we decided to pass on this project. What made us consider it was indications that our current revision table for English Wikipedia is becoming very unwieldy, and that it's really time for a sharded implementation. However, Sean Pringle made some adjustments, and we're comfortable that we have enough headroom that this project isn't as urgent as we first thought.

Job queue work

 * Bug 46770 “Rewrite jobs-loop.sh in a proper programming language”? [TS]
 * Bug 46770 would be nice. So would more monitoring. I tend to agree with Rob that it’s probably not worth a full quarter (nor do I agree that we should scrap the whole thing in favor of $someNewThing) [CH]
 * Antoine: Isn’t our time better invested in overhauling the whole job queue system?
 * But WHY? Like Rob said...there doesn’t seem to be a compelling case for replacing the whole thing...just making some improvements around the edges. [CH]
 * Not really, after Aaron already spent so much time overhauling it over the last 18 months [TS]
 * +10 [CH]
 * We could use better monitoring of the queue processing, and nicer priority systems.
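As a sketch of what a "nicer priority system" could mean (hypothetical Python, not the actual JobQueue code): jobs carry a numeric priority, and queue depth falls out as an obvious metric to export for monitoring.

```python
import heapq
import itertools

class PriorityJobQueue:
    """Minimal priority queue: lower number means run sooner.
    The counter keeps insertion order stable among jobs of
    equal priority, so the heap never compares job payloads."""
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()

    def push(self, job, priority=10):
        heapq.heappush(self._heap, (priority, next(self._counter), job))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        # Queue depth: the first number a monitoring system would graph.
        return len(self._heap)
```

A runner loop would simply `pop()` the next job; urgent work (e.g. cache purges) gets a low priority number and jumps ahead of bulk maintenance jobs without a separate queue.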

OpenID Connect (goes hand-in-hand with Phabricator)

 * Continuous integration fully tied to Gerrit stream-events / Gerrit comment. Need to either adapt Zuul or rethink the way we do CI.
 * https://secure.phabricator.com/book/phabricator/article/herald/ Herald -> gerrit stream events adapter should be an easy hack.
 * Or we can just get rid of Zuul… (We should, but the ability to kludge things temporarily during a migration would be good.)