Wikimedia Platform Engineering/MediaWiki Core Team/Quarterly review, July 2013/Summary

MediaWiki Core quarterly review, July 15 2013

 * Agenda
 * Slide deck
 * In attendance:
 * RobLa, Ken, Aaron S., Rachel, Philippe, Greg G., Brad, Tim, Chris M., Antoine, Mark B., Faidon, Toby, Tomasz, Gabriel, Erik M., Chad, Sumana, Chris S., Nik Everett
 * Leading the meeting: Tim

Intro
Team & Ongoing responsibilities: https://www.mediawiki.org/wiki/Wikimedia_Platform_Engineering#MediaWiki_Core

Done/stopped since the last review

 * Lua scripting
 * Job queue
 * Score extension
 * Admin tools development

Ongoing

 * Auth systems
 * have to redesign central login because browsers changed how they handle auth; took the opportunity to improve user experience
 * OpenID: will be an OpenID *provider* to start in the near future, not consumer. Internal: consumers only (Labs etc.), maybe rolling out to more consumers as we get bugs ironed out
 * Search engine
 * Nik recommends ElasticSearch
 * continuing through whole of next quarter
 * Git/Gerrit
 * actually nothing major coming in the next quarter except for maintenance & firefighting
 * Test infrastructure
 * also in maintenance/bugfixing mode for the next quarter
 * Multimedia/media storage
 * bugfixes in Score & Ceph/Swift
 * multimedia team will get spun up and there will be more planning & review in a future quarterly review by them
 * We may be going Ceph instead of Swift; we're making that determination this week, and right now we are still using Swift in production for all purposes
 * Swift production issues: a major bug - any failover is very slow.
 * 2% of files exist in containers but not in the back end?
 * Ceph is better designed & reliable, but still buggy. We've opened over 35-40 bugs (?)
 * Ceph thing is a blocker to OSM tileserver? No, we decided to go with local storage.

Next quarter
 * Architecture/RfC process
 * Deployment improvements
 * Replace scap with something like git-deploy
 * Datacenter switchover: maybe improving MW configuration - dealing with memcache, which slave DBs it contacts, etc.
 * PHP errors & exceptions monitoring - icinga should notify us when rate gets too high
 * Caching improvements
 * We've always served the page with 200 response code for certain noncanonical title encodings (brackets encoded with %28, for instance). Fix some caching errors that arise because we do not purge every variant title that could serve the content: redirect to the canonical encoding instead.
 * Commons images - when they are updated, update pages that use them. Especially useful when aspect ratio changes
 * HipHop https://lists.wikimedia.org/mailman/listinfo/hiphop
 * The Lua extension work is quite extensive
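
To make the noncanonical title encoding item under caching improvements concrete, here is a minimal Python sketch of the "redirect to canonical encoding" idea. It is not MediaWiki's actual code: the function names and the `SAFE_CHARS` set are assumptions made for illustration, and it only looks at the URL path, not the query string.

```python
from urllib.parse import quote, unquote

# Characters left unescaped in the canonical form. This set is an assumption
# for illustration only; the real rules live in MediaWiki's own URL code.
SAFE_CHARS = "/:()!*,;@$"


def canonical_title_path(raw_path: str) -> str:
    """Decode every percent-escape, then re-encode with one fixed policy."""
    return quote(unquote(raw_path), safe=SAFE_CHARS)


def respond(raw_path: str):
    """Return (status, location).

    A noncanonical variant gets a 301 to the canonical path instead of a
    200, so only one URL per page is ever cached and needs purging.
    """
    canonical = canonical_title_path(raw_path)
    if raw_path != canonical:
        return 301, canonical
    return 200, None
```

With this policy, a request for `/wiki/Foo_%28bar%29` is redirected to `/wiki/Foo_(bar)`, while a request that already uses the canonical form is served directly.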

Timelines
We'd hoped that by now we'd be done with auth systems work. Next few weeks: several deployments, then a review of OpenID extension.
 * Chris S.: We had a lot of UX work that made sense to do; this delayed things. This week: SUL into production, hopefully!
 * OAuth deploying to Labs this week - plans to go into prod week of the 29th
 * OpenID review on Chris's backlog; need to help Wikinaut get a few more features done before enabling in production. Very volunteer-driven.
 * Want to get done within next month or 2. Low involvement/work by WMF
 * HipHop & admin tools: "time allowing"?
 * Deployment stuff - critical and Ops is waiting on it. HipHop would slip to defer to that. Basically the same set of people working on both of those two projects.
 * Admin tools sprint -- if we have a Product Manager in Platform to figure that out, Chris Steipp + James Forrester would work on that, make sure stewards are on board. We have good volunteer contributions on those projects.
 * What would topic of sprint be? Depends on a lot of things. Top request from stewards is currently a global CheckUser tool; depends on larger CheckUser vision for the next few years. Maybe something on CAPTCHAs, or spam abuse stuff. List of roughly prioritized projects: https://www.mediawiki.org/wiki/Admin_tools_development/Roadmap
 * Steipp has been doing lots of OAuth/SUL stuff; detracts from day-to-day security work he needs to do. Right now, trying to give Steipp breathing room to do security work. Then, later, hopefully push on core project work.
 * Admin tools sprint vs HipHop - a hard tradeoff. But HipHop probably wins in this quarter because:
 * lack of product management resources :( -- although Sumana has a product advisory volunteer
 * Facebook folks really keen on getting us up & running on HipHop; they are dedicating resources to this issue. Strike while iron is hot.  They have fixed blocker bugs for us.
 * Sumana: makes sense, especially because we have volunteer work on admin tools (including comaintainers of AbuseFilter etc.), but really only we can do the messy work on HipHop; therefore deferring WMF work on admin tools
 * Goals for the deployment sprint:
 * Need more definition - Greg is working on it.
 * https://www.mediawiki.org/wiki/Site_performance_and_architecture#Deployment_sprint


 * We want to improve deployment speed & robustness. We got lots of speed advantages from just improving scap.
 * Git-deploy helps in 2 ways. 1) Scanning directories, transferring data is more compact with git than rsync. 2) Network transfer done first, then checkout. So code updates are more atomic.
 * Deployment practices: we always push the latest state of master even when we are just pushing a small change. Increases risk.
 * VE/hotfix issue: right now the methods for deploying minor updates to extensions are awkward and little-known. It's worth looking at how we do this -- whether with scap, git-deploy, submodules, etc. -- to find something easier to use.
 * Important to set some goals to ensure we get what we want out of this sprint
 * Our default position: we're moving to git-deploy. We had some issues last time to solve somehow, especially with atomicity of deployments of copies. Working around git + ensuring we file a bug upstream with git itself & track it.
 * Git itself: Git has never managed large binary files well. Never will, probably. Other utilities have been built as workarounds, e.g. http://git-annex.branchable.com/
 * git fetch -- when it fails due to network issue or whatever, object store gets corrupted, hard to get out. So dealing with or preventing corrupted state sucks.
 * Sumana suggests: should we have some training in this sprint?
 * Greg: there needs to be more training, + better tools to make things more straightforward.
 * Could we have a small team responsible for all deployments across all teams? - Mark Bergsma raises question.
 * Can we instead make it more of a pushbutton operation? more robust.... need to find warts & code them out of the system
 * Now we have a weekly deployment train; encourage more people to do that, more eyeballs. But VE-style teams will want to deploy fixes nearly every day; how do we enable that without adding site risk? ACTION ITEM: document current ugly awkward method (possibly mentioning beta cluster), checking out an uncommitted change or whatever for a hotfix, and/or have tools on the deployment host that automate these sorts of processes, and/or make a policy re hotfixes. Owner: Greg G.
 * More of this in release mgmt/QA review.
 * We've done ~5 weekly deployments. Adjust to this cycle and eventually speed it up again
 * The current method is documented in https://wikitech.wikimedia.org/wiki/How_to_deploy_code
 * But what about Solr or reportcard etc.? Put into requirements that we cover non-MediaWiki applications/deployments.
 * ACTION ITEM: Greg to consider adding some training components to this sprint
 * ACTION ITEM: Greg to help consider Mark B's question "Could we have a small team responsible for all deployments across all teams?"
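
The "network transfer done first, then checkout" point in the git-deploy notes above can be sketched as a two-phase rollout. This is an illustrative Python sketch, not how git-deploy itself is implemented: `two_phase_deploy` and the injected `run` callable (which would run one command on one host, for example over SSH) are hypothetical names; only `git fetch` and `git checkout --force` are real git commands.

```python
from typing import Callable, Sequence

# A runner executes one command on one host and returns True on success.
# How it reaches the host (SSH, salt, ...) is left open on purpose.
Runner = Callable[[str, Sequence[str]], bool]


def two_phase_deploy(hosts: Sequence[str], rev: str, run: Runner) -> bool:
    """Fetch on every host first, then switch every host to `rev`."""
    # Phase 1: slow network transfer. No working tree changes yet, so a
    # failure here leaves every host serving the old code.
    for host in hosts:
        if not run(host, ["git", "fetch", "origin"]):
            return False  # abort before any host changes its checkout

    # Phase 2: fast local checkout. The window in which hosts disagree
    # about the deployed revision is as short as possible.
    ok = True
    for host in hosts:
        ok = run(host, ["git", "checkout", "--force", rev]) and ok
    return ok
```

Keeping the runner injectable means the ordering logic can be exercised with a fake runner before it is pointed at real hosts.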

Deferred or reorganized work
 * Release mgmt & QA
 * scap replacement & monitoring/beta cluster/deployments - Chris M. asks: deployments are high-touch right now.... can the example of beta labs help with automating that?
 * Tim: beta labs has been running a single set of code on NFS - not really a deployment process in the way the main site does. Implementing separate code trees on sep. servers in beta labs would give testbed for changing deployment systems?
 * Antoine: in Jan/Feb, we had git-deploy installed to test it out before roll out in production.... had trouble with LDAP. Should be possible to use scap, but that requires a lot of tweaks in prod files since a lot of things are hardcoded.
 * Chris M wants more monitoring for beta. It falls over, stalls on deployments, times out.
 * Mobile has beta as important part of their deployment cycle
 * beta is important & validates our strategy. More resources needed to make it better.
 * RobLa: we can weave some beta work into deployment sprint. Whenever we have an ongoing project that impacts deployment, we might want to use beta (past examples: wikivoyage, git-deploy..).
 * Central code repo
 * Lua, gadgets, and everything.
 * No huge emergency reason to do it now.
 * Would benefit from a product manager. UI elements to this. Legal wrangling - licensing. Managing code metadata, different from content metadata.
 * Want to get to it by June 2014. May be Jan-March 2014.
 * ECT wants this to help nurture the gadget/front end community
 * Analytics might be another internal customer -- Limn stuff -- but Toby says it's not a huge issue for them at the moment
 * Configuration database (duke nukem forever)
 * embarrassing lack that forms newbies' first impressions
 * and more importantly -- want to be able to handle some things via MW permissions that right now we handle via shell bugs
 * Customer: LCA.
 * No known timeline; waiting till after search
 * Gerrit improvements (further BZ integration)
 * we will help by upgrading to new versions.
 * Chad's been working for 13+ months on Gerrit stuff. He's one of our more experienced MW developers. Getting Chad's focus back on MW is a win
 * Would want to push on BZ stuff, new diffing and inline editing interface, trying to help with benchmarking new Lucene stuff that's already in pipeline. But others are working on this as well so we can still see minor improvements coming.
 * Lucene: tagging -- arbitrary keywords
 * timeline depends on contractor budget, developer engagement work; waiting till after search

Questions
bug 48835 - client cache control header?
 * part of performance sprint - ostensibly to be this quarter
 * look in BZ whiteboard for each ticket :)