Wikimedia Platform Engineering/MediaWiki Core Team/Backlog

This page contains the backlog for the MediaWiki Core team.

Items designated as high priority are either in active development, or are being considered for development during either the current or next quarter. Items are designated as medium priority if they lack the necessary details to be worked on or if they are not planned for the current or next quarter. Items are designated as low priority if they are recognised as good ideas, but are not currently being considered in the next few quarters.

For more details of the process behind this backlog, see the process subpage.

In progress

 * HHVM
 * CirrusSearch
 * SUL finalization
 * Library infrastructure for MediaWiki
 * API/Architecture work, API/Architecture work/Planning

Central code repo
Gadgets, Lua, templates

A/B testing framework

 * Port PlanOut to PHP: https://facebook.github.io/planout/
 * See ACM paper: "Designing and Deploying Online Field Experiments": https://www.facebook.com/download/255785951270811/planout.pdf (PDF)
 * Port BetaFeatures to core, or provide some other mechanism for feature flags in core.
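A rough sketch of the core idea behind both items above: hashing an experiment name together with a unit ID gives a stable, repeatable assignment without storing per-user state. This is not PlanOut's API or BetaFeatures code; the function names, the key format, and the use of SHA-1 here are illustrative assumptions.

```python
import hashlib
from typing import List


def _bucket(salt: str, unit_id: str, buckets: int) -> int:
    """Hash salt + unit ID into a stable bucket in [0, buckets)."""
    key = f"{salt}.{unit_id}".encode("utf-8")
    digest = hashlib.sha1(key).hexdigest()
    # Use a prefix of the digest as a large integer, then reduce it.
    return int(digest[:15], 16) % buckets


def assign_variant(experiment: str, unit_id: str, variants: List[str]) -> str:
    """Deterministically map a unit (e.g. a user ID) to one variant."""
    return variants[_bucket(experiment, unit_id, len(variants))]


def feature_enabled(flag: str, unit_id: str, rollout_percent: int) -> bool:
    """Hypothetical feature flag: enabled for a stable share of units."""
    return _bucket(flag, unit_id, 100) < rollout_percent
```

Because the assignment depends only on the experiment name and the unit ID, the same user sees the same variant on every request and on every server.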

File handling and thumbnailing
The Multimedia team will, at some point, need to embark on substantial work on our file upload pipeline, how we name files, and how we handle thumbnailing. Aaron Schulz in particular has deep expertise in this area. At this point, the Multimedia team hasn't requested help in this area, so it's premature for us to get involved, but this may come up again in July.
 * Next step: scoping by Aaron and/or Gilles
 * Include version in thumbnail URL (17577)

Setup.php improvements

 * Setup.php speed improvements and service registry

Next step: Aaron to explain what this means :-) (Maybe: speed up MediaWiki initialization? We've talked about having a configuration backend that is a good fit for the complexity and performance requirements of WMF's deployment)
 * Design work on service registry (e.g. advancing Andrew Green's dependency injection RFC)

Authn/z service to interact with other services
As we move towards independent services (SOA), we need a system for identifying users and their rights across services, which involves building infrastructure for providing proper authentication tokens for API access. Possible solutions for this include OAuth 2, Kerberos, or a custom solution built on JSON Web Tokens; the solutions for non-interactive applications are not as clear as they are for interactive web applications (where basic OAuth works pretty well).


 * Requests for comment/SOA Authentication
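To make the JSON Web Token option more concrete, here is a minimal sketch of issuing and checking an HS256-signed token with only the Python standard library. The claim names `sub` and `exp` come from the JWT specification; the `rights` claim, the function names, and the shared-secret setup are assumptions for illustration, not a proposed design.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import List, Optional


def _b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWTs do."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    """Restore the stripped padding, then decode."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(secret: bytes, signing_input: str) -> bytes:
    return hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()


def issue_token(secret: bytes, user: str, rights: List[str], ttl: int,
                now: Optional[int] = None) -> str:
    """Build header.claims.signature for a user and their rights."""
    now = int(time.time()) if now is None else now
    header = {"alg": "HS256", "typ": "JWT"}
    claims = {"sub": user, "rights": rights, "exp": now + ttl}
    parts = [_b64url(json.dumps(p, separators=(",", ":")).encode("utf-8"))
             for p in (header, claims)]
    signing_input = ".".join(parts)
    return signing_input + "." + _b64url(_sign(secret, signing_input))


def verify_token(secret: bytes, token: str,
                 now: Optional[int] = None) -> Optional[dict]:
    """Return the claims if the signature and expiry check out, else None."""
    now = int(time.time()) if now is None else now
    try:
        header_b64, claims_b64, sig_b64 = token.split(".")
        expected = _sign(secret, header_b64 + "." + claims_b64)
        if not hmac.compare_digest(expected, _b64url_decode(sig_b64)):
            return None
        header = json.loads(_b64url_decode(header_b64))
        claims = json.loads(_b64url_decode(claims_b64))
    except ValueError:
        # Covers malformed structure, bad base64 and bad JSON.
        return None
    if header.get("alg") != "HS256" or claims.get("exp", 0) <= now:
        return None
    return claims
```

A receiving service that holds the same secret can check a caller's identity and rights without a round trip to the issuing service, which is the property that makes this approach attractive for non-interactive clients.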

Performance metrics
A lot of data is emitted and logged all over our infrastructure, but a lot of work is still needed to specify what the numbers represent, in a way that is both mathematically credible and convincing and understandable to people. Aggregation sometimes happens at every one of these layers: the application emitting metrics, statsd, graphite (carbon), and graphite (whisper). So it's very hard to say exactly what the numbers represent. MediaWiki profiling data is a good example: Profiler.php aggregates, mwprof aggregates that, asher's Python script aggregates that, carbon aggregates that, and whisper aggregates that.

So we need:
 * Infrastructure that collects and summarizes data in transparent and credible ways
 * A small set of well-defined metrics that capture site health and that people trust and understand
 * Regular reporting
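A small illustration of why stacked aggregation makes the numbers hard to interpret: averaging per-batch averages gives a different answer from averaging the raw samples whenever the batches differ in size. The sample values and function names below are made up for the example; carrying (count, total) pairs through each layer is one possible way to keep the result exact, not a description of what statsd or graphite do.

```python
from statistics import mean
from typing import List, Tuple


def mean_of_means(batches: List[List[float]]) -> float:
    """What you get when each layer averages the previous layer's averages."""
    return mean(mean(batch) for batch in batches)


def true_mean(batches: List[List[float]]) -> float:
    """The mean over every raw sample."""
    samples = [value for batch in batches for value in batch]
    return sum(samples) / len(samples)


def summarize(batch: List[float]) -> Tuple[int, float]:
    """A summary that can be merged without losing information."""
    return len(batch), sum(batch)


def merge_summaries(summaries: List[Tuple[int, float]]) -> float:
    """Combine (count, total) pairs into an exact overall mean."""
    count = sum(c for c, _ in summaries)
    total = sum(t for _, t in summaries)
    return total / count


# Four fast requests reported by one host, one slow request by another.
batches = [[100, 100, 100, 100], [1000]]
print(mean_of_means(batches))                            # 550
print(true_mean(batches))                                # 280.0
print(merge_summaries([summarize(b) for b in batches]))  # 280.0
```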

PyBal work
Three discrete subtasks to prioritize:


 * 1) Fix idleconnection monitor so that PyBal is monitoring HHVM, and not just Apache.
 * 2) Implement an API so hosts can be depooled and repooled programmatically as part of the deployment process.
 * 3) Polish PyBal for third-party use.

Elasticsearch category intersection

 * Elasticsearch category intersection with simple JS UI embedded on CategoryPage, presented to the user as "filtering" or "refining" a category.
 * Needs scoping with Chad and Nik
 * Possibly medium priority
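As a sketch of what the intersection might look like on the search side, the function below builds an Elasticsearch request body that matches only pages carrying every requested category. It uses the `bool` query's `filter` clause with one `term` clause per category, as found in current Elasticsearch releases. The field name `category` is an assumption about the index mapping, not something confirmed for CirrusSearch.

```python
from typing import Dict, List


def category_intersection_query(categories: List[str], size: int = 50) -> Dict:
    """Build a request body matching pages that are in ALL given categories."""
    return {
        "size": size,
        "query": {
            "bool": {
                # Every filter clause must match, which yields an intersection.
                # "category" is a hypothetical field name.
                "filter": [{"term": {"category": name}} for name in categories]
            }
        },
    }
```

Refining a category in the UI would then amount to appending one more name to the list and re-running the query.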

Elasticsearch search quality

 * Tuning "fat fingering" and other stuff (talk to Howie for more)

Dedicated cluster for private/closed wikis

 * Dedicated cluster for private/closed wikis (e.g. officewiki)
 * Need to work with Ops on priority
 * Need to establish which piece of this would be owned by MediaWiki-Core
 * Look at existing things like the fundraising cluster and zero config wiki

Localisation cache do-over

 * We have a distributed data store cobbled together with shell scripts, rsync, etc. It's incredibly opaque (to me, at least).
 * There are some security issues with it (I don't remember the details..)
 * It's the biggest bottleneck on the deployment process. But truly novel data (i.e., new messages/translations) represents a tiny sliver of the byte payload, which consists primarily of unmodified messages that already exist on each deployment target.

One thing that makes our current deployments unreliable is the fact that we have many workarounds to make it possible to avoid rewriting the l10n cache. Ideally, we would use "scap" no matter how small the change we're making to the site, but that's not practical due to l10n cache rebuild and distribution often taking over 10 minutes. Some of us have some nascent ideas about how to make this faster, but we haven't yet agreed on a solution we want to pursue.
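One of the possible directions, shipping only what changed, can be sketched as follows. Each target reports a digest per message key, and the deploy host sends only the messages whose digest differs, plus a list of keys to drop. This is an illustration of the general idea under assumed data shapes (a flat key-to-text mapping), not the design of scap or of any agreed solution.

```python
import hashlib
from typing import Dict, List, Tuple


def digest_messages(messages: Dict[str, str]) -> Dict[str, str]:
    """Map each message key to a digest of its text."""
    return {key: hashlib.sha1(text.encode("utf-8")).hexdigest()
            for key, text in messages.items()}


def compute_delta(target_digests: Dict[str, str],
                  new_messages: Dict[str, str]) -> Tuple[Dict[str, str], List[str]]:
    """Return (messages to send, keys to delete) for one deployment target."""
    new_digests = digest_messages(new_messages)
    changed = {key: text for key, text in new_messages.items()
               if target_digests.get(key) != new_digests[key]}
    removed = sorted(key for key in target_digests if key not in new_messages)
    return changed, removed


def apply_delta(old_messages: Dict[str, str], changed: Dict[str, str],
                removed: List[str]) -> Dict[str, str]:
    """What the target does with the delta it receives."""
    result = {k: v for k, v in old_messages.items() if k not in removed}
    result.update(changed)
    return result
```

With this shape, the payload size tracks the number of new or modified messages rather than the size of the whole cache.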

Next step: RFC from Ori and/or Bryan Davis

Revision storage revamp
This would involve:
 * Specifying an API for storage of revisions; anything implementing that API could be trivially used with MediaWiki.
 * Refactor core so that revision storage is an implementation of that API.
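The two steps above can be sketched as an abstract storage interface plus one concrete backend. The method names, the table layout, and the SQLite backend are all assumptions chosen to keep the example self-contained; they do not reflect MediaWiki's actual schema or any agreed API.

```python
import sqlite3
from abc import ABC, abstractmethod
from typing import Optional


class RevisionStore(ABC):
    """Hypothetical storage API that core would program against."""

    @abstractmethod
    def save(self, page_id: int, text: str) -> int:
        """Store a new revision and return its ID."""

    @abstractmethod
    def load(self, rev_id: int) -> str:
        """Return the text of a revision, or raise KeyError."""

    @abstractmethod
    def latest(self, page_id: int) -> Optional[int]:
        """Return the newest revision ID for a page, or None."""


class SqliteRevisionStore(RevisionStore):
    """A simple SQL-backed implementation of the interface."""

    def __init__(self, path: str = ":memory:") -> None:
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS revision ("
            "rev_id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "page_id INTEGER NOT NULL, "
            "text TEXT NOT NULL)"
        )

    def save(self, page_id: int, text: str) -> int:
        with self._db:  # commits on success, rolls back on error
            cursor = self._db.execute(
                "INSERT INTO revision (page_id, text) VALUES (?, ?)",
                (page_id, text),
            )
        return cursor.lastrowid

    def load(self, rev_id: int) -> str:
        row = self._db.execute(
            "SELECT text FROM revision WHERE rev_id = ?", (rev_id,)
        ).fetchone()
        if row is None:
            raise KeyError(rev_id)
        return row[0]

    def latest(self, page_id: int) -> Optional[int]:
        row = self._db.execute(
            "SELECT MAX(rev_id) FROM revision WHERE page_id = ?", (page_id,)
        ).fetchone()
        return row[0]  # MAX() over no rows yields NULL, i.e. None
```

Callers that depend only on `RevisionStore` would not need to change if a different backend were plugged in behind the same three methods.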

There are two parts to revision storage: revision metadata and revision text. For revision text storage, we have External Storage, which was very innovative in its day, but is showing its age. Due to the design, reads and writes are largely focused on a single node (where the most recent revisions are), so adding nodes doesn't necessarily improve performance. The compression scripts may be fine, but we don't know for sure because we're a bit afraid of running them to find out. Thus, we're not really getting the benefit of the compression offered by the system.

A likely solution for storing revision text is Rashamon. Metadata could also be stored in Rashamon, but we may still need a copy of the metadata in our SQL database for fast and simple querying. Regardless of our plans for Rashamon, there is significant work involved in MediaWiki to abstract away the many built-in assumptions that our code has about retrieving revision metadata directly from a database.

This project would be done in service of a broader push toward a service-oriented architecture. We would use this project to set an example for how we foresee other aspects of MediaWiki being turned into modular services, and would likely use this as an opportunity to further establish use of the value-object pattern exemplified in TitleValue. As part of this work, we would also provide a simple SQL-based implementation for developer installations of MediaWiki. We would also like to work on the infrastructure for providing proper authentication tokens for API access (see "Authn/z service to interact with other services" above). Possible solutions for this include OAuth 2, Kerberos, or a custom solution built on JSON Web Tokens; the solutions for non-interactive applications are not as clear as they are for interactive web applications (where basic OAuth works pretty well).

This project also offers us an opportunity to make revision storage a nicely abstracted interface in core, and could be quite complementary to Rashamon, giving Gabriel some clean API points to plug it in. It also offers us the ability to establish the template for how MediaWiki can be incrementally refactored into cleanly-separated services.

After some investigation, we decided to pass on this project. What made us consider this project were indications that our current revision table for English Wikipedia is becoming very unwieldy, and that it's really time for a sharded implementation. However, Sean Pringle made some adjustments, and we're comfortable that we have enough headroom that this project isn't as urgent as we first thought.


 * Should we ping Sean to check if this is still his opinion? (Yes.)

OpenID Connect
One thing we've long wanted to do as complementary work with our OAuth work is to implement some form of standardized federated login. We briefly considered taking on OpenID Connect, which integrates well with OAuth, as a project. Doing work in this area would make Phabricator integration with our Wikimedia project logins very simple.

See also: OpenID provider (request for OpenID for use in Tool Labs)

Admin tools
See Admin tools development

Next step: Dan to offer specific choices for the group to consider.

git-deploy / deploy-tooling

 * will be fleshed out within the wider "Deployment status and improvements" work (on point: Greg)
 * Dependency: Localisation cache

Moving VCL logic to MediaWiki

 * Separate Cache-Control header for proxy and client (48835)
 * IMO Ops need to reconcile themselves with the fact that there will always be compelling reasons to keep *some* application logic on the edge. We should use a Lua VMOD and expose some of Varnish's APIs to Lua and use that to replace complicated VCL with inline C.
 * You're saying they disagree with this? They have always seemed keen to move logic from app to VCL, to me. [Tim]
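For the first item in this list, HTTP already provides a way to give shared caches and browsers different lifetimes in a single header: `s-maxage` applies only to shared caches such as Varnish and takes precedence over `max-age` there, while `max-age` governs the client. A minimal sketch of an application-side helper follows; the function name and the example TTL values are assumptions.

```python
def cache_control(client_ttl: int, proxy_ttl: int, private: bool = False) -> str:
    """Build a Cache-Control value with separate proxy and client lifetimes."""
    if private:
        # Personalised responses must not be stored by shared caches at all.
        return f"private, max-age={client_ttl}"
    return f"public, s-maxage={proxy_ttl}, max-age={client_ttl}"


# Let the proxy layer keep a page for 31 days while browsers always revalidate.
print(cache_control(client_ttl=0, proxy_ttl=31 * 24 * 3600))
# public, s-maxage=2678400, max-age=0
```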

LogStash

 * buuugs

User attribution

 * Infrastructure for "claim an edit" feature
 * Making user renaming simple and fast

Installation consolidation
Aligning MediaWiki developer and third party installation methods. Easy install of more complicated MediaWiki installs (e.g. Parsoid, PDF rendering, Math, etc). Possible use of Vagrant-Composer?
 * https://developers.google.com/compute/docs/containers


 * Composer package types:
 * https://github.com/composer/installers
 * there exists a very nominal 'mediawiki-extension' type: https://github.com/composer/installers/blob/master/src/Composer/Installers/MediaWikiInstaller.php
 * https://bugzilla.wikimedia.org/show_bug.cgi?id=65188#c3


 * Assess viability of deploying MediaWiki using Containers to both the production cluster and various cloud platforms
 * ...as a way of improving our internal deployment process
 * ...as a way of making MediaWiki substantially easier to install by third-parties

Configuration management

 * Allowing Stewards to set certain things in the UI (e.g. per-wiki logos)
 * Cleaner command line methods to make configuration changes
 * Get rid of configuration globals
 * Requests for comment/Extension registration


 * https://gerrit.wikimedia.org/r/#/c/109850/

Better Captcha infrastructure
possibly as part of thumbnailing revamp
 * Periodically, someone needs to manually generate a new batch of unused captchas.
 * Need more from Aaron

OAuth UX refinement
The OAuth user interface needs refinement to improve the user experience. The user experience on mobile also needs to be looked into, as the mobile workflow was basically ignored in the initial release of OAuth.

Structured license metadata
It would be good if MediaWiki had a way to include structured license/source data in the wiki, because the assumption that a page has a single source is slowly but increasingly inaccurate, and Legal is therefore concerned about continuing to assume there is just one license and one set of authors.

Edit conflict and diffs
action=edit and diff (action=historysubmit) are where 80% of editors' time is spent, yet almost no work has been done in years to ensure they do their basic job, i.e. comparing revisions and merging contributions on top of a page. Countless editor-hours could be saved, and converted into additional happiness or more editing, with some work on diffing/wikidiff2/diff3 and related issues.


 * https://bugzilla.wikimedia.org/showdependencytree.cgi?id=70163&hide_resolved=1