Wikimedia Platform Engineering/MediaWiki Core Team/Quarterly review, April 2014

This is an outline for a review of the MediaWiki Core team which will take place at WMF on April 15, 2014.
 * Notes for this review

Team
See Wikimedia MediaWiki Core Team page.

Changes since last review: (none)

HipHop VM Deployment
Plans for the previous quarter with status:
 * Get a production service running on HHVM. (one of: job queue, l10n update, image scalers, etc.)
 * Status: Got HHVM serving the Beta cluster. (Done)
 * Port LuaSandbox
 * Status: In progress.
 * Tim made very substantial contributions to HHVM's Zend compatibility layer, including an implementation of PHP's Thread-Safe Resource Manager (TSRM) for HHVM and an overhaul of the compatibility layer's basic architecture to use HHVM-Native Interface (HNI) rather than IDL files. Tim's work makes it substantially easier to port any Zend extension to HHVM, not just Scribunto.
 * Port wikidiff2
 * Status: Done (thanks Max Semenik!)
 * Port wmerrors
 * Status: Not needed
 * Port FastStringSearch
 * Status: Done by Aaron Schulz
 * Packages and Puppet manifests for provisioning HHVM on Ubuntu Precise.
 * Status: Not done. Work is in progress, but we'll want to rely on some combination of upstream work with Faidon and possibly others in TechOps to accomplish this.
 * Jenkins job that tests patches to core and extensions using HHVM.
 * Status: Done
 * Jenkins job that runs the full suite of unit tests against HHVM.
 * Status: Done

Search
Plan:
 * Have all sites using CirrusSearch (and thus, ElasticSearch) as the primary search engine by the end of the quarter.
 * Status: Partial. Non-Wikipedias switched over April 2. We've been approaching deployment cautiously due to hardware concerns. Nik has also done work with upstream to make ElasticSearch more efficient on our current hardware.
 * We may have an initial run at the interwiki search UI if time allows.
 * Status: Partial. Initial groundwork has been done and a basic form of interwiki search is live on the beta cluster.

Architecture/RFC Review
Tim worked part time on this. We had loftier goals for this: to help teams and individual contributors develop specifications for implementing product requirements in a manner consistent with the design, performance, architecture and stability requirements of MediaWiki and the Wikimedia production cluster. In practice, much of the work was just keeping the RFC review process going, and Tim was absorbed into HHVM porting work.

Deployment-related Development
LogStash is a new logging framework which should make it much easier to view and query system logs for purposes of debugging.

Plan:
 * Several team members have been and will continue to be involved in work on this as it nears initial rollout: Bryan Davis, Ori Livneh, Aaron Schulz and Antoine Musso.
 * Status: limited rollout
 * Bryan Davis to work on requirements for a migration from scap to another tool, with input from the Dev and Deploy process review happening on the 22nd.
 * Status: Bryan and Ori rewrote scap in Python, and have made some usability improvements to it. This has given Bryan a great deal of insight into how the overall system works.

PDF rendering
Brad did a little work in a supporting role. This is largely done.

Performance Infrastructure
For next quarter: get front-end performance data piping into the same profiling data aggregator as back-end performance data, and provide a unified view for looking at latency across the stack.
 * Status: not done. Ori has taken a leadership role on HHVM and has been supporting many other groups in the organization, which hasn't left him a lot of time for this work.

Security
Password storage update, Security reviews

Admin tools development
Dan Garry assessed the requested work for admin tools and found that much of the work is either blocked on SUL finalisation or would be made significantly easier in a post-finalisation world. This work was therefore postponed until the SUL finalisation could be completed.

SecurePoll Redesign
The existing SecurePoll software was analysed. It was agreed that the biggest flaw of the software was that the mechanism for creating polls involved hand-writing XML files and importing them using maintenance scripts, and that this had caused serious, unrecoverable errors in polls in the recent past. The structure of the WMF Board elections and the English Wikipedia Arbitration Committee polls was analysed, and a user interface will be created which supports creating polls with sufficient options to meet this use case.

SUL finalisation
After it was found that the admin tools development work was blocked on SUL finalisation, Dan Garry assessed the finalisation and came up with a provisional timeline for it. Kunal Mehta (Legoktm) will be doing the necessary engineering work after he finishes his current work with the Flow API, and Chris Steipp will be in a supporting role doing security and general code review.

Central CSS discussion
Extension:GlobalCssJs was deployed to the beta cluster with the user-specific global CSS/JS enabled and the site-wide global CSS/JS disabled.

Past Quarter Allocations
Our people allocations for January through March of 2014:
 * Tim Starling: HHVM, Architecture/RFC Review, other review
 * Bryan Davis: scap/git-deploy, LogStash, (added) Beta migration to eqiad, (added) Deployments
 * Nik Everett: Search
 * Chad Horohoe: HHVM, Search
 * Brad Jorsch: SecurePoll cleanup, PDF rendering, API Maintenance, Scribunto maintenance
 * Ori Livneh: Performance Infrastructure, HHVM, git-deploy, LogStash
 * Aaron Schulz: HHVM, git-deploy, LogStash, Password storage update, l10n cache
 * Chris Steipp: Password storage update, Security reviews
 * Antoine Musso: HHVM, LogStash, JobQueue, Zuul upgrade, (added) Beta migration to eqiad
 * Sam Reed: Deployments
 * Dan Garry: Admin tools development (scoping out), SecurePoll (scoping out), SUL finalisation, OAuth improvements, Search

Upcoming quarter
HHVM will be our flagship project. HHVM increases performance across the board on MediaWiki, and may eventually allow for the removal of the caching layer, thereby opening the doors for making the site more interactive and performant. In parallel with HHVM, the Search team will be focussing on getting CirrusSearch deployed on all Wikimedia wikis, and work will commence on the SUL finalisation.

The above work will continue alongside our team's ongoing responsibilities.

HHVM
We think that pushing HHVM toward full deployment is our best bet at achieving the greatest impact on end-users as a group. Historically, our strategy with respect to site performance was to rely heavily on a caching layer, based on the assumption that it is only a tiny minority of users that actually contribute content or personalize the interface, and that other users, representing the vast majority of visits to our projects, can be served by recycling generic views. This approach has allowed us to successfully scale to our current traffic levels, which is a significant feat. The cost of this approach has been twofold: it has concealed the user experience of logged-in editors, which is substantially degraded in comparison to anonymous users, and it has put a wall in front of our developers, frustrating most attempts to make the site more interactive or personalized for a bigger portion of our traffic. If we want to graduate features out of Beta and make them accessible to new users, we need to modernize our application server stack.

However, there are substantial risks to pushing this early. Debian packaging is still in a nascent state. The standard advice to developers is still to compile the very latest version from master because incompatibility bugs are still being fixed at a rapid pace. Rolling out HHVM would require extending our deployment system so that it builds a byte-code cache and synchronizes it to each application server. The mechanics of how we'll handle code changes, server restarts, etc. are still fuzzy. HHVM is not just a faster PHP: it's a totally new runtime that makes many aspects of PHP that were previously set in stone highly configurable. It has completely new monitoring and profiling capabilities. It's a different beast, and it'll take us some time to become adequately familiar with it. So we can't promise that we only have one more quarter of work to do on this.

There are a couple of external dependencies that, if met, would accelerate our work:
 * A speedy Ubuntu 14.04 deployment
 * Assisting upstream with proper HHVM packaging

We have pretty good momentum right now. Both HHVM's upstream at Facebook and our own TechOps team have been very supportive of our work and are eager to see us deploy this. We want to keep going.

CirrusSearch Deployment
The work here consists of bringing CirrusSearch to feature parity with LuceneSearch, so the cautious pace of deployment is mainly due to hardware concerns. We hope to have CirrusSearch live everywhere by the end of the quarter.

Search page redesign
The goal of this project is to modernise the search results page and ensure we have a sensible user interface on top of which we can expose new features made possible by CirrusSearch, such as interwiki search.

The main focus of the Search team this quarter is the deployment of CirrusSearch. Any spare cycles will be spent on the search page redesign project.

SUL finalisation
In the previous quarter, it was found that admin tools development was either blocked by SUL finalisation or would be made significantly easier in a post-finalisation world. However, the SUL finalisation would result in tools like local RenameUser being disabled, so there is engineering work to do before the finalisation can happen. This work has been scoped and a set of business requirements was developed.

Kunal Mehta (Legoktm) has agreed to do the necessary engineering work after he has finished his work on the Flow API. Chris Steipp will do general and security code reviews. Dan Garry will work on the product aspects of the finalisation, namely deciding the rules by which clashes are resolved, and also making sure that all the appropriate messages are properly translated and that every user who is affected by the finalisation is contacted in some form.

Scap

 * Finish moving scap to Python (get to “100%”)

SecurePoll Redesign
The work has been scoped and mockups were produced by Brandon. Development continues.

Architecture/RFC Review
Tim's time should free up a little bit after LuaSandbox is ported over to HHVM, so he's planning on spending more time on leading the community on architectural improvements. One thing that the team will try as a secondary task (to HHVM) is making initial steps toward moving our site configuration out of directly-accessed globals (see Talk:Architecture Summit 2014/Configuration).

Revision storage revamp
This would involve:
 * Specifying an API for storage of revisions; anything implementing that API could be trivially used with MediaWiki.
 * Refactor core so that revision storage is an implementation of that API.
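
As a rough illustration of the two steps above, a storage API with one interchangeable backend might be sketched as follows. This is only a sketch in Python for brevity (MediaWiki itself is PHP), and every name in it (RevisionRecord, RevisionStore, InMemoryRevisionStore) is hypothetical rather than an existing MediaWiki class.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class RevisionRecord:
    """Hypothetical value object holding one revision's metadata and text."""
    rev_id: int
    page_id: int
    user: str
    timestamp: str
    text: str


class RevisionStore(ABC):
    """Hypothetical storage API; callers depend only on this interface."""

    @abstractmethod
    def save(self, revision: RevisionRecord) -> None:
        """Persist a revision."""

    @abstractmethod
    def load(self, rev_id: int) -> RevisionRecord:
        """Return the revision with the given id, or raise LookupError."""


class InMemoryRevisionStore(RevisionStore):
    """Trivial backend, standing in for a real SQL or external-store backend."""

    def __init__(self) -> None:
        self._revisions: Dict[int, RevisionRecord] = {}

    def save(self, revision: RevisionRecord) -> None:
        self._revisions[revision.rev_id] = revision

    def load(self, rev_id: int) -> RevisionRecord:
        try:
            return self._revisions[rev_id]
        except KeyError:
            raise LookupError(f"no revision with id {rev_id}") from None
```

Code written against RevisionStore would not need to change if the in-memory backend were replaced by one backed by a database or a separate service, which is the property the refactoring is after.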

There are two parts to revision storage: revision metadata and revision text. For revision text storage, we have External Storage, which was very innovative in its day, but is showing its age. Due to the design, reads and writes are largely focused on a single node (where the most recent revisions are), so adding nodes doesn't necessarily improve performance. The compression scripts may be fine, but we don't know for sure because we're a bit afraid of running them to find out. Thus, we're not really getting the benefit of the compression offered by the system.

A likely solution for storing revision text is Rashamon. Metadata could also be stored in Rashamon, but we may still need a copy of the metadata in our SQL database for fast and simple querying. Regardless of our plans for Rashamon, there is significant work involved in MediaWiki to abstract away the many built-in assumptions that our code has about retrieving revision metadata directly from a database.

This project would be done in service of a broader push toward a service-oriented architecture. We would use this project to set an example for how we foresee other aspects of MediaWiki being turned into modular services, and would likely use this as an opportunity to further establish use of the value-object pattern exemplified in TitleValue. As part of this work, we would also provide a simple SQL-based implementation for developer installations of MediaWiki. We would also like to work on the infrastructure for providing proper authentication tokens for API access. Possible solutions for this include OAuth 2, Kerberos, or a custom solution built on JSON Web Tokens; the solutions for non-interactive applications are not as clear as they are for interactive web applications (where basic OAuth works pretty well).
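
To make the JSON Web Token option a little more concrete, here is a minimal sketch of issuing and verifying an HS256-signed token using only the Python standard library. It is illustrative only: the function names and the shared-secret setup are assumptions, it does not check expiry or the header's algorithm field, and it is not a proposal for how MediaWiki would actually do this.

```python
import base64
import hashlib
import hmac
import json


def _b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWTs require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    """Decode base64url text, restoring the padding that JWTs omit."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def issue_token(claims: dict, secret: bytes) -> str:
    """Return a compact token: header.payload.signature (hypothetical helper)."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":"), sort_keys=True).encode("utf-8"))
        for part in (header, claims)
    )
    signature = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return signing_input + "." + _b64url(signature)


def verify_token(token: str, secret: bytes) -> dict:
    """Return the claims if the signature matches, else raise ValueError."""
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
    except ValueError:
        raise ValueError("malformed token") from None
    signing_input = (header_b64 + "." + payload_b64).encode("ascii")
    expected = hmac.new(secret, signing_input, hashlib.sha256).digest()
    # Constant-time comparison avoids leaking how many bytes matched.
    if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
        raise ValueError("bad signature")
    return json.loads(_b64url_decode(payload_b64))
```

A non-interactive client would present such a token with each API request, and the server would call verify_token before trusting the claims.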

This project also offers us an opportunity to make revision storage a nicely abstracted interface in core, and could be quite complementary to Rashamon, giving Gabriel some clean API points to plug it in. It also offers us the ability to establish the template for how MediaWiki can be incrementally refactored into cleanly-separated services.

After some investigation, we decided to pass on this project. We had considered it because of indications that our current revision table for English Wikipedia is becoming very unwieldy, and that it's really time for a sharded implementation. However, Sean Pringle made some adjustments, and we're comfortable that we have enough headway that this project isn't as urgent as we first thought.

Job queue work
There have been repeated calls in the past for a "proper" job queue implementation for Wikimedia production use, where "proper" means "not relying on cron". Some recent discussion on the operations@ mailing list led the team to at least consider this as a possible near-term project.

The tracking bug that comes closest to describing this project is Bug 46770 “Rewrite jobs-loop.sh in a proper programming language”. That is probably an oversimplification, though. In addition to removing shell scripting, we would at least like better monitoring of the system and a more sophisticated system for assigning priorities to jobs. Aaron Schulz has already made the system much better with many incremental improvements over the past 18 months that, taken together, represent a pretty massive improvement to the features and reliability of the system. Hence, we've decided not to make further work on this a priority until an urgent need is identified.
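
For readers unfamiliar with the idea, assigning priorities to jobs can be sketched with a small heap-based queue. This is a generic Python illustration, not a description of MediaWiki's JobQueue; the class name, the convention that lower numbers run first, and the job names in the usage note are all assumptions made for the example.

```python
import heapq
import itertools
from typing import Any, List, Tuple


class PriorityJobQueue:
    """Hypothetical queue: lower priority numbers are popped first."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, Any]] = []
        # Monotonic counter keeps jobs of equal priority in FIFO order.
        self._counter = itertools.count()

    def push(self, job: Any, priority: int = 10) -> None:
        heapq.heappush(self._heap, (priority, next(self._counter), job))

    def pop(self) -> Any:
        if not self._heap:
            raise IndexError("pop from empty job queue")
        _priority, _sequence, job = heapq.heappop(self._heap)
        return job

    def __len__(self) -> int:
        return len(self._heap)
```

With this sketch, a job pushed as push("sendEmail", priority=1) would be popped before one pushed earlier as push("refreshLinks", priority=10), while jobs sharing a priority come out in the order they were queued.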

Push Phabricator to production
The Engineering Community Team is currently working toward consensus around standardizing project management, and ended up with the possibility of a much bigger project: replacing many of our tools (Gerrit, Bugzilla, Trello, Mingle, maybe even RT) with Phabricator. This would need a substantial investment in engineering resources, but it's not entirely clear where this would come from. Platform Engineering may need help from other groups in order to accomplish this. It is not likely that members of MediaWiki Core would be substantially available for such a project in the coming quarter.

OpenID Connect
One thing we've long wanted to do as complementary work with our OAuth work is to implement some form of standardized federated login. We briefly considered taking on OpenID Connect, which integrates well with OAuth, as a project. Doing work in this area would make Phabricator integration with our Wikimedia project logins very simple.

Admin tools development
Many admin tools are related to the concept of a global account. As such, developing them is easier when one can develop these tools assuming that every account has the same name on every wiki. To reduce wasted effort, admin tools development has been postponed until the SUL finalisation is complete.

File upload pipeline
The Multimedia team will, at some point, need to embark on substantial work on our file upload pipeline. Aaron Schulz in particular has deep expertise in this area. At this point, the Multimedia team hasn't requested help in this area, so it's premature for us to get involved, but this may come up again in July.

Localisation cache write performance improvement
One thing that makes our current deployments unreliable is the fact that we have many workarounds to make it possible to avoid rewriting the l10n cache. Ideally, we would use "scap" no matter how small the change we're making to the site, but that's not practical due to l10n cache rebuild and distribution often taking over 10 minutes. Some of us have some nascent ideas about how to make this faster, but we haven't yet agreed on a solution we want to pursue.

Current Quarter Allocations
Our people allocations for April through June of 2014:
 * Tim Starling: HHVM, Architecture/RFC Review, other review
 * Bryan Davis: scap, LogStash, HHVM
 * Nik Everett: Search
 * Chad Horohoe: Search, HHVM
 * Brad Jorsch: SecurePoll cleanup, API Maintenance, Scribunto Maintenance
 * Ori Livneh: Performance Infrastructure, HHVM
 * Aaron Schulz: HHVM
 * Chris Steipp: Security reviews, SUL finalisation
 * Antoine Musso: CI for HHVM, scap pythonization
 * Sam Reed: Deployments
 * Dan Garry: Search, SecurePoll, SUL finalisation

Ongoing responsibilities

 * Deployments
 * Core deployments
 * External team deployments (e.g. Wikidata)
 * MediaWiki operations (performance, debugging, ops team support)
 * Code review
 * API maintenance and code review (Brad: 30%)
 * Security issue response
 * Test infrastructure (Beta cluster and continuous integration)
 * Git/Gerrit improvement
 * Shell bugs