Wikimedia Platform Engineering/MediaWiki Core Team/Quarterly review, January 2014

This will be an outline for a review of the MediaWiki Core team which will take place at WMF on January 21, 2014.

As of right now, we're sorting through items on the Wikimedia MediaWiki Core Team/Ideas list.

 * Notes for this review

Team
See Wikimedia MediaWiki Core Team page.

Changes since last review: (none)

Search (ElasticSearch/CirrusSearch) deployment
Nik Everett and Chad Horohoe have been working on this through the past quarter. We optimistically projected we'd be able to complete this deployment by the end of 2013, and disable our previous search technology (lsearchd) in January 2014. This rollout is going pretty well, but it's going to take a bit longer to get fully switched over. Several wikis are now using CirrusSearch as the primary search engine, constituting roughly 8% of our search traffic as of this writing. Also as of this writing, we are indexing English Wikipedia. If that indexing goes smoothly and the index size is manageable, we may attempt switching it next. We currently project that we'll be done rolling this out by the end of March with the ability to turn off lsearchd sometime shortly thereafter, subject to hardware availability and how smoothly our gradual rollout goes.

Some progress has been made on designing the UX for interwiki search, though the development and deployment schedule for this work is TBD.

Architecture formalization
We held fortnightly IRC meetings to discuss many in-progress RFCs. Additionally, much of the planning for the Architecture summit happened in the quarter.
 * Number of RFCs reviewed, discussions held, + ideally some statement of what trajectory is emerging

DevOps sprint
The main areas of focus for this sprint were:
 * git-deploy - not done. Will be part of focus for next quarter (in some fashion).
 * monitoring / reporting - LogStash initial deployment in production is done (after extensive testing in wmflabs).
 * deployment script improvements (scap) - we believe we have marked improvements, but they are hard to quantify due to stubborn problems with a legacy search host in Tampa that is slated to be decommissioned soon.
 * multi-site awareness - as it turns out, most of the issues were somehow addressed in the lead-up to this work.
 * DSL for Apache URL rewriting configuration (to reduce mistakes in the Apache rewrite rules)
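
As a rough illustration of how a DSL could reduce mistakes in rewrite rules, here is a minimal Python sketch. Everything in it (the `Rewrite` class, the `ALLOWED_FLAGS` set, the example rule) is hypothetical and is not the design the team worked on; it only shows the idea of validating a rule before emitting a standard `RewriteRule` line. Note that Python's `re` syntax is only an approximation of the regex dialect Apache uses, so this check would catch some mistakes, not all.

```python
import re
from dataclasses import dataclass

# Hypothetical subset of flags this sketch accepts; a real tool would need more.
ALLOWED_FLAGS = {"L", "R=301", "R=302", "QSA", "NC", "PT"}


@dataclass(frozen=True)
class Rewrite:
    pattern: str            # regex matched against the URL path
    target: str             # substitution, may use $1, $2, ...
    flags: tuple = ("L",)

    def validate(self) -> None:
        # Raises re.error if the pattern is not a valid (Python) regex.
        compiled = re.compile(self.pattern)
        # Every $N in the target must refer to a group the pattern defines.
        for ref in re.findall(r"\$(\d)", self.target):
            if int(ref) > compiled.groups:
                raise ValueError(
                    f"target uses ${ref} but pattern has "
                    f"{compiled.groups} group(s)"
                )
        unknown = set(self.flags) - ALLOWED_FLAGS
        if unknown:
            raise ValueError(f"unknown flag(s): {sorted(unknown)}")

    def render(self) -> str:
        # Only emit a rule that passed validation.
        self.validate()
        return f"RewriteRule {self.pattern} {self.target} [{','.join(self.flags)}]"
```

For example, `Rewrite(r"^/wiki/(.*)$", "/w/index.php?title=$1", ("L", "QSA")).render()` yields `RewriteRule ^/wiki/(.*)$ /w/index.php?title=$1 [L,QSA]`, while a target that references `$1` against a pattern with no capture group is rejected before anything is written out.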

See Retrospective for what we learned from how we managed the sprint.

Wikimania scholarships app
Bryan Davis rewrote large swaths of the Scholarship app to adapt it to new requirements, working with Chad Horohoe and Katie Filbert on code review and some coding. Jessie/Ellie worked with the Wikimania planning team on this.

Auth systems
Chris Steipp continued work in this area as a solo effort, focusing mainly on refining our password expiration protocol as well as improving our password hashing. OAuth was actually deployed \o/. Several performance issues with SULv2 were addressed.

Security auditing and response
Chris Steipp continued his work on security auditing. The review queue for this has stacked up, with reviews promised for Limn, TimedMediaHandler v2, Kraken, and GLAM upload, among others. During the quarter, reviews for Flow and the Wikimedia Scholarship application were prioritized above Limn and Kraken, and completed.

Performance
Since this is Ori's first quarter in his new role, his priority will be to understand and articulate the current state of site performance, with a focus on performance blind spots, i.e., performance bottlenecks that fall outside the purview of current Foundation engineering projects and existing expertise. He plans to bring this work to bear by building a set of tools and visualizations that make these bottlenecks visible and by building tools that help MediaWiki developers and gadget and template authors profile, understand, and optimize the performance impact of their work. Examples of this are the mediawiki.inspect module, which lets savvy users scrutinize the static asset payload of a MediaWiki page in their browser debug console, and the instrumentation of real user monitoring in Ganglia and Graphite for page views and for VisualEditor.

Ori has also been working on augmenting client-side asset caching by using the Web Storage API to cache ResourceLoader modules (change If2ad2d80d). He would like to see this work through to completion and deployment in this quarter.

Ori also plans to continue working with the TechOps team on improving the state of profiling and monitoring tools on the cluster. A realistic goal for this quarter is to finish migrating Graphite from Tampa to Ashburn and to improve the usability of the interface provided by MediaWiki core for logging performance metrics to a remote host for aggregation and analysis. (Status: done; profiler done.)

Finally, Ori is interested in working with Dan Garry and with folks in Analytics, Fundraising and Features to begin the work of correlating site performance with editor engagement metrics and other product goals, with a view toward being able to relate the value of performance and infrastructure work to the Foundation's mission. This is a long-term project, obviously. A specific way in which it could be advanced this quarter is to ensure that engagement data collected from new users (edits attempted / saved, etc.) is annotated with latency measurements.
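
To make the idea of annotating engagement data with latency concrete, here is a small Python sketch. The `EditEvent` record, its field names, and the one-second bucketing are all assumptions made for illustration; they do not describe any existing schema or the analysis the teams would actually run.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class EditEvent:
    user_id: int
    saved: bool          # False: the edit was attempted but never saved
    latency_ms: float    # hypothetical latency annotation on the event


def save_rate_by_latency(events, bucket_ms=1000):
    """Group events into latency buckets and return the save rate per bucket."""
    attempts = defaultdict(int)
    saves = defaultdict(int)
    for ev in events:
        # Floor the latency to the start of its bucket, e.g. 2500 ms -> 2000.
        bucket = int(ev.latency_ms // bucket_ms) * bucket_ms
        attempts[bucket] += 1
        saves[bucket] += int(ev.saved)
    return {b: saves[b] / attempts[b] for b in sorted(attempts)}
```

With latency attached to each event, a question such as "do slower page loads go with fewer saved edits?" becomes a simple aggregation. Any correlation found this way would still need careful interpretation before it is tied to product goals.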

Bugs caught using mw.inspect: https://bugzilla.wikimedia.org/show_bug.cgi?id=55683


 * Job queue improvements (get details from Aaron)

PDF rendering

 * Brad Jorsch helped with the initial sprint, and has mainly been working in an advisory role on this.

HipHop VM Deployment
Things that are already done:


 * Initial HHVM role in Vagrant
 * Collaborated with P. Tarjan from Facebook on getting HHVM packages signed
 * Travis CI running the MediaWiki core test suite under HHVM on each commit: https://travis-ci.org/wikimedia/mediawiki-core/ ; also running the test suite of deployment branches with extensions
 * Bug triage

Next quarter: As stated in the previous review, HHVM is maturing quite quickly, and it has the potential to vastly improve performance. However, although the set of features supported by HHVM is nominally sufficient for running MediaWiki in production, a lot of work must be done to verify that the software components essential to our MediaWiki deployment work properly when running under HHVM. The big-ticket compatibility issue that we already know about is the need to patch or rewrite several PHP extensions (wikidiff2, wmerrors, LuaSandbox) so that they run under HHVM. This work has a very long tail, and we hope to engage other teams so that they help us see it through. To do this, some infrastructure for collaboration needs to be in place:
 * Get a production service running on HHVM. (one of: job queue, l10n update, image scalers, etc.)
 * Port LuaSandbox
 * Packages and Puppet manifests for provisioning HHVM on Ubuntu Precise.
 * Jenkins job that tests patches to core and extensions using HHVM.
 * Jenkins job that runs the full suite of unit tests against HHVM.

Tim Starling, Chad Horohoe, Ori Livneh, Aaron Schulz and Antoine Musso plan to play a role in this. The goal for this coming quarter is to have at least one production service migrated over to HipHop VM.

Search
Chad and Nik plan to have all sites using CirrusSearch (and thus, ElasticSearch) as the primary search engine by the end of the quarter. We may have an initial run at the interwiki search UI if time allows.

Architecture/RFC Review
Tim will be working part-time on this.
 * Help teams and individual contributors to develop specifications for implementing product requirements in a manner consistent with the design, performance, architecture, stability, etc. requirements of MediaWiki and the Wikimedia production cluster.

Deployment-related Development
LogStash is a new logging framework which should make it much easier to view and query system logs for purposes of debugging. Several team members have been and will continue to be involved in work on this as it nears initial rollout: Bryan Davis, Ori Livneh, Aaron Schulz and Antoine Musso.

Bryan Davis will work on requirements for a migration from scap to another tool with input from the Dev and Deploy process review happening on the 22nd.

PDF rendering

 * Brad continuing to work in a supporting role.

Performance Infrastructure
For next quarter: get front-end performance data piping into the same profiling data aggregator as back-end performance data, and provide a unified view for looking at latency across the stack.
 * udpprofiler improvements
 * Graphite, gdash, udpprofiler migrated from Tampa
 * user-perceived latency measurement and aggregation: http://ganglia.wikimedia.org/latest/?r=year&cs=&ce=&tab=v&vn=Navigation+Timing&hide-hf=false
 * static asset monitoring in Ganglia: http://ganglia.wikimedia.org/latest/?r=year&cs=&ce=&tab=v&vn=Static+assets&hide-hf=false
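
A unified view needs front-end and back-end timings to reach the same aggregator under comparable names. The sketch below assumes Graphite's plaintext line format (`<metric path> <value> <timestamp>`); the metric names and both helper functions are hypothetical, and nothing is sent over the network here.

```python
import time


def format_metric(path, value, timestamp=None):
    """Return one line in Graphite's plaintext format."""
    if timestamp is None:
        timestamp = int(time.time())
    if not path or any(ch.isspace() for ch in path):
        raise ValueError("metric path must be non-empty and free of whitespace")
    return f"{path} {value} {int(timestamp)}\n"


def unified_batch(frontend_ms, backend_ms, timestamp):
    """Put front-end and back-end timings under one hypothetical 'latency' tree."""
    lines = [
        format_metric(f"latency.frontend.{name}", ms, timestamp)
        for name, ms in sorted(frontend_ms.items())
    ]
    lines += [
        format_metric(f"latency.backend.{name}", ms, timestamp)
        for name, ms in sorted(backend_ms.items())
    ]
    return "".join(lines)
```

Keeping both kinds of measurement under a shared prefix is one simple way to let a single dashboard query compare them. The actual naming scheme is a design decision this sketch does not try to settle.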

Aaron's work:
 * Optimized copyFileBackend to use MD5 from listing if given (e.g. Swift) (https://gerrit.wikimedia.org/r/#/c/106647/)
 * Avoided extra parsing in prepareContentForEdit (https://gerrit.wikimedia.org/r/#/c/95519/)
 * Merged redis queue periodic tasks into recyclePruneAndUndelayJobs (https://gerrit.wikimedia.org/r/#/c/104475/)
 * Made Title cache use MapCacheLRU (https://gerrit.wikimedia.org/r/#/c/99043/)
 * Reduced use of FORCE INDEX in LogPager (https://gerrit.wikimedia.org/r/#/c/85097/)
 * Added some constants to speed up Setup.php (https://gerrit.wikimedia.org/r/#/c/103330/)
 * Added more Setup.php profiling (https://gerrit.wikimedia.org/r/#/c/101794/)
 * Optimized LocalRepo::findFiles (https://gerrit.wikimedia.org/r/#/c/97993/)
 * Added MapCacheLRU class, a simpler cousin to ProcessCacheLRU (https://gerrit.wikimedia.org/r/#/c/87650/)
 * Improved partitioning scheme for refreshLinks jobs (https://gerrit.wikimedia.org/r/#/c/96199/)
 * Avoided more parsing in refreshLinksJobs (https://gerrit.wikimedia.org/r/#/c/98071/)
 * Made SqlBagOStuff fully avoid transactions when possible (https://gerrit.wikimedia.org/r/#/c/96924/)
 * Reduced isQueueDeprioritized process cache time to 1 second (https://gerrit.wikimedia.org/r/#/c/96405/)

(Ask Aaron to summarize the list above into a handful of concise bullet points)
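
For readers unfamiliar with the least-recently-used caching idea behind the MapCacheLRU item above, here is a minimal Python sketch of the general technique. It is not a port of the PHP class in the linked change; the `LruMap` name and its methods are illustrative only.

```python
from collections import OrderedDict


class LruMap:
    """Bounded key/value map that evicts the least recently used key."""

    def __init__(self, max_keys):
        if max_keys <= 0:
            raise ValueError("max_keys must be positive")
        self._max_keys = max_keys
        self._data = OrderedDict()  # oldest entry first, newest last

    def set(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)      # writing counts as a use
        self._data[key] = value
        if len(self._data) > self._max_keys:
            self._data.popitem(last=False)   # drop the least recently used key

    def get(self, key, default=None):
        if key not in self._data:
            return default
        self._data.move_to_end(key)          # reading counts as a use
        return self._data[key]

    def has(self, key):
        return key in self._data
```

A bounded map like this keeps hot entries (for example, recently used titles) in process memory without letting the cache grow without limit during long-running jobs.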

Security
Password storage update, Security reviews

Admin tools development
Dan Garry will be scoping this project.

SecurePoll cleanup
Dan Garry will be scoping this project, with Brad Jorsch doing development work if it can be scoped on time.

SUL finalisation
Dan is scoping this, and working with James on what's involved.

Central CSS discussion
Dan is working on scoping this.

Allocations
This is our planned allocation for January through March of 2014:
 * Tim Starling: HHVM, Architecture/RFC Review, other review
 * Bryan Davis: git-deploy, LogStash
 * Nik Everett: Search, Search in Wikidata
 * Chad Horohoe: HHVM, Search
 * Brad Jorsch: SecurePoll cleanup, PDF rendering, API Maintenance, Scribunto maintenance
 * Ori Livneh: Performance Infrastructure, HHVM, git-deploy, LogStash
 * Aaron Schulz: HHVM, git-deploy, LogStash, Password storage update, l10n cache
 * Chris Steipp: Password storage update, Security reviews
 * Antoine Musso: HHVM, LogStash, JobQueue, Zuul upgrade
 * Sam Reed: Deployments
 * Dan Garry: Admin tools development (scoping out), SecurePoll (scoping out), SUL finalisation