Wikimedia Platform Engineering/MediaWiki Core Team/Quarterly review, April 2014

This is an outline for a review of the MediaWiki Core team which will take place at WMF on April 15, 2014.
 * Notes for this review

Team
See Wikimedia MediaWiki Core Team page.

Changes since last review: (none)

HipHop VM Deployment
Plans for the quarter with status:
 * Get a production service running on HHVM. (one of: job queue, l10n update, image scalers, etc.)
 * Status: Getting HHVM serving the Beta cluster (done by review?)
 * Port LuaSandbox
 * Status: In progress (done by review?). Tim has improved the Zend plugin compatibility layer enough that it's up to the job.
 * Port wikidiff2
 * Status: Done (thanks Max Semenik!)
 * Port wmerrors
 * Status: Not needed
 * Port other extensions
 * Status: Done by review?
 * Packages and Puppet manifests for provisioning HHVM on Ubuntu Precise.
 * Status: Not done. Work is in progress, but we'll want to rely on some combination of upstream work with Faidon and possibly others in TechOps to accomplish this.
 * Jenkins job that tests patches to core and extensions using HHVM.
 * Status: Done
 * Jenkins job that runs the full suite of unit tests against HHVM.
 * Status: Done

Search
Plan:
 * Have all sites using CirrusSearch (and thus, ElasticSearch) as the primary search engine by the end of quarter
 * Status: Partial. Non-Wikipedias switched over April 2.  (done by review?)
 * We may have an initial run at the interwiki search UI if time allows.
 * Status: Partial

Architecture/RFC Review
Tim worked part time on this. We had loftier goals here: to help teams and individual contributors develop specifications for implementing product requirements in a manner consistent with the design, performance, architecture, stability, and other requirements of MediaWiki and the Wikimedia production cluster. In practice, much of the work was simply keeping the RFC review process going, and Tim was absorbed into HHVM porting work.

Deployment-related Development
LogStash is a new logging framework that should make it much easier to view and query system logs when debugging.
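As a rough illustration of the kind of structured events LogStash aggregates, here is a minimal sketch in Python. The field names (`@timestamp`, `channel`) are assumptions modelled on Logstash's generic JSON event format, not the schema the team actually deployed:

```python
import json
import datetime

def format_log_event(channel, message, **extra):
    """Serialize one log event as a JSON line, roughly in the shape
    Logstash's JSON codec can ingest. Field names are illustrative."""
    event = {
        "@timestamp": datetime.datetime.utcnow().isoformat() + "Z",
        "channel": channel,
        "message": message,
    }
    event.update(extra)  # arbitrary structured context, e.g. host, wiki
    return json.dumps(event, sort_keys=True)

# Example: an event from a hypothetical "exception" channel.
line = format_log_event("exception", "DB connection lost", host="mw1017")
```

The point of shipping events as structured JSON rather than flat text is that Logstash (and Elasticsearch behind it) can then filter and query on individual fields.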

Plan:
 * Several team members have been and will continue to be involved in work on this as it nears initial rollout: Bryan Davis, Ori Livneh, Aaron Schulz and Antoine Musso.
 * Status: limited rollout
 * Bryan Davis will work on requirements for a migration from scap to another tool, with input from the Dev and Deploy process review happening on the 22nd.
 * Status: Bryan and Ori rewrote scap in Python and have made some usability improvements to it. This has given Bryan a great deal of insight into how the overall system works.

PDF rendering
Brad is continuing to work in a supporting role.

Performance Infrastructure
For next quarter: get front-end performance data piping into the same profiling data aggregator as back-end performance data, and provide a unified view for looking at latency across the stack.

Security
Password storage update, Security reviews

Admin tools development
Dan Garry will be scoping this project.

SecurePoll cleanup
Dan Garry will be scoping this project, with Brad Jorsch doing development work if it can be scoped on time.

SUL finalisation
Dan is scoping this, and working with James on what's involved.

Central CSS discussion
Dan is working on scoping this.

Past Quarter Allocations
This is our planned allocation for January through March of 2014:
 * Tim Starling: HHVM, Architecture/RFC Review, other review
 * Bryan Davis: scap/git-deploy, LogStash
 * Nik Everett: Search
 * Chad Horohoe: HHVM, Search
 * Brad Jorsch: SecurePoll cleanup, PDF rendering, API Maintenance, Scribunto maintenance
 * Ori Livneh: Performance Infrastructure, HHVM, git-deploy, LogStash
 * Aaron Schulz: HHVM, git-deploy, LogStash, Password storage update, l10n cache
 * Chris Steipp: Password storage update, Security reviews
 * Antoine Musso: HHVM, LogStash, JobQueue, Zuul upgrade
 * Sam Reed: Deployments
 * Dan Garry: Admin tools development (scoping out), SecurePoll (scoping out), SUL finalisation, OAuth improvements, Search

Focus project: moving HHVM into production
Our recommended option is to keep rolling with HHVM. We have good momentum right now, and the big performance gains HHVM promises hold a lot of allure. The core HHVM developers have been very supportive of our work and are eager to see us deploy it.

However, there are substantial risks to pushing this early. Debian packaging is still in a nascent state. The standard advice to developers is still to compile the very latest version from master, because incompatibility bugs are still being fixed at a rapid pace. Based on Tim's work, it's pretty clear that the Zend plugin compatibility layer isn't getting much use just yet. We can't promise that we have only one more quarter of work to do on this.

Regardless of our deployment plans for HHVM, we strongly believe that we need to make the milestone of having a feature complete, largely functional version of the software running on the Beta cluster (including, for example, Lua sandbox). Maintaining this environment over the course of the quarter will take some extra effort, but will be a worthwhile investment to make sure that, when we are able to pick this work up again, we have a functional environment.

There are a couple of outside dependencies that, if met, would accelerate our work:
 * A speedy Ubuntu 14.04 deployment
 * Assisting upstream with proper HHVM packaging

Here's some of the work that still needs to happen:
 * Getting it running in vagrant
 * Getting it running in Jenkins
 * FastStringSearch porting
 * Updating the opcode cache

In the final calculation, we believe that pushing HHVM toward deployment is our best bet for achieving the greatest end-user impact as a group. While our benchmarks are still a bit crude, every indication is that many areas of editor-facing functionality will see noticeable performance improvements.

SUL finalization

 * Legoktm should be able to do the engineering work; Dan can handle the other loose ends.

Scap

 * Get to “100%” moved to Python

Revision storage revamp
This would involve:
 * Specifying an API for storage of revisions; anything implementing that API could be trivially used with MediaWiki.
 * Refactor core so that revision storage is an implementation of that API.
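To make the idea of "an API anything could implement" concrete, here is a minimal sketch. MediaWiki itself is PHP; Python is used here only for brevity, and all names (`RevisionStore`, `get_revision`, `store_revision`) are hypothetical, not a proposed interface:

```python
from abc import ABC, abstractmethod

class RevisionStore(ABC):
    """Hypothetical storage API: any backend implementing it
    (SQL, an external service, etc.) could back revision storage."""

    @abstractmethod
    def get_revision(self, rev_id):
        """Return (metadata, text) for a revision, or None if absent."""

    @abstractmethod
    def store_revision(self, rev_id, metadata, text):
        """Persist a revision's metadata and text."""

class InMemoryRevisionStore(RevisionStore):
    """Trivial reference implementation, standing in for the simple
    SQL-based implementation a developer install would use."""

    def __init__(self):
        self._revs = {}

    def get_revision(self, rev_id):
        return self._revs.get(rev_id)

    def store_revision(self, rev_id, metadata, text):
        self._revs[rev_id] = (metadata, text)
```

The value is in the seam: core code would depend only on the abstract interface, so swapping the backend would not require touching callers.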

There are two parts to revision storage: revision metadata and revision text. For revision text storage we have External Storage, which was very innovative in its day but is showing its age. Due to its design, reads and writes are largely concentrated on a single node (the one holding the most recent revisions), so adding nodes doesn't necessarily improve performance. The compression scripts may be fine, but we don't know for sure because we're a bit afraid of running them to find out. Thus, we're not really getting the benefit of the compression the system offers.

A likely solution for storing revision text is Rashomon. Metadata could be stored in Rashomon as well, but we may still need a copy of it in our SQL database for fast and simple querying. Regardless of our plans for Rashomon, there is significant work involved in MediaWiki to abstract away the many built-in assumptions our code makes about retrieving revision metadata directly from a database.

This project would be done in service of a broader push toward a service-oriented architecture. We would use it to set an example for how we foresee other aspects of MediaWiki being turned into modular services, and would likely use it as an opportunity to further establish the value-object pattern exemplified in TitleValue. As part of this work, we would also provide a simple SQL-based implementation for developer installations of MediaWiki. We would also like to work on the infrastructure for providing proper authentication tokens for API access. Possible solutions include OAuth 2, Kerberos, or a custom solution built on JSON Web Tokens; the options for non-interactive applications are not as clear as for interactive web applications (where basic OAuth works pretty well).
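For the token option, a JSON Web Token is just a signed JSON payload. A minimal HS256 sketch using only the Python standard library, to show the mechanics; this is illustrative only, and a real deployment would use a vetted library and include expiry and audience claims:

```python
import base64
import hashlib
import hmac
import json

def _b64url(data):
    # JWT uses base64url without padding.
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")

def sign_jwt(claims, secret):
    """Build a compact HS256 JWT: header.payload.signature."""
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url(json.dumps(claims, sort_keys=True).encode())
    signing_input = (header + "." + payload).encode("ascii")
    sig = hmac.new(secret, signing_input, hashlib.sha256).digest()
    return header + "." + payload + "." + _b64url(sig)

def verify_jwt(token, secret):
    """Return the claims if the signature checks out, else None."""
    header, payload, sig = token.split(".")
    signing_input = (header + "." + payload).encode("ascii")
    expected = hmac.new(secret, signing_input, hashlib.sha256).digest()
    if not hmac.compare_digest(_b64url(expected), sig):
        return None
    padded = payload + "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))
```

The appeal for non-interactive clients (bots, services) is that a token is self-contained: the API server can verify it with a shared secret, with no browser redirect dance.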

This project also offers us an opportunity to make revision storage a cleanly abstracted interface in core, and could be quite complementary to Rashomon, giving Gabriel some clean API points to plug it into. It would also establish the template for how MediaWiki can be incrementally refactored into cleanly separated services.

After some investigation, we decided to pass on this project. What made us consider it was indications that the current revision table for English Wikipedia is becoming very unwieldy, and that it's really time for a sharded implementation. However, Sean Pringle made some adjustments, and we're comfortable that we have enough headroom that this project isn't as urgent as we first thought.

Job queue work

 * Bug 46770 “Rewrite jobs-loop.sh in a proper programming language”? [TS]
 * Bug 46770 would be nice. So would more monitoring. I tend to agree with Rob that it’s probably not worth a full quarter (nor do I agree that we should scrap the whole thing in favor of $someNewThing) [CH]
 * Antoine: Isn’t our time better invested in overhauling the whole job queue system?
 * But WHY? Like Rob said...there doesn’t seem to be a compelling case for replacing the whole thing...just making some improvements around the edges. [CH]
 * Not really, after Aaron already spent so much time overhauling it over the last 18 months [TS]
 * +10 [CH]
 * We could use better monitoring of the queue processing and nicer priority systems.

OpenID connect (goes hand-in-hand with Phabricator)

 * Continuous integration fully tied to Gerrit stream-events / Gerrit comment. Need to either adapt Zuul or rethink the way we do CI.
 * https://secure.phabricator.com/book/phabricator/article/herald/ Herald -> gerrit stream events adapter should be an easy hack.
 * Or we can just get rid of Zuul… (We should, but the ability to kludge things temporarily during a migration would be good.)