Wikimedia Platform Engineering/MediaWiki Core Team/Quarterly review, July 2013/Summary

MediaWiki Core quarterly review, July 15 2013

 * Agenda
 * Slide deck
 * In attendance:
 * RobLa, Ken, Aaron S., Rachel, Philippe, Greg G., Brad, Tim, Chris M., Antoine, Mark B., Faidon, Toby, Tomasz, Gabriel, Erik M., Chad, Sumana, Chris S., Nik Everett
 * Leading the meeting: Tim

Intro
Team & Ongoing responsibilities

Done/stopped since the last review

 * Lua scripting
 * Job queue
 * Score extension
 * Admin tools development

Ongoing

 * Auth systems
 * had to redesign central login because browsers changed how they do auth; took the opportunity to improve user experience
 * OpenID: in the near future we will start as an OpenID *provider*, not a consumer. Consumers will be internal only at first (Labs etc.), maybe rolling out to more consumers as we get bugs ironed out
 * Search engine
 * Nik recommends ElasticSearch
 * continuing through whole of next quarter
 * Git/Gerrit
 * actually nothing major coming in the next quarter except for maintenance & firefighting
 * Test infrastructure
 * also in maintenance/bugfixing mode for the next quarter
 * Multimedia/media storage
 * bugfixes in Score & Ceph/Swift
 * multimedia team will get spun up and there will be more planning & review in a future quarterly review by them
 * We may be going Ceph instead of Swift; we're making that determination this week, and right now we are still using Swift in production for all purposes
 * Swift production issues: a major bug - any failover is very slow.
 * 2% of files exist in containers but not in the back end?
 * Ceph is better-designed & reliable, but still buggy. We've opened over 35-40 bugs (?)
 * Is Ceph a blocker for the OSM tileserver? No, we decided to go with local storage.

Next quarter

 * Architecture/RfC process
 * Deployment improvements
 * Replace scap with something like git-deploy
 * Datacenter switchover. Maybe improving MW configuration: dealing with memcache, which slave DBs it contacts, etc.
 * PHP errors & exceptions monitoring - icinga should notify us when rate gets too high
 * Caching improvements
 * For certain noncanonical title encodings (brackets encoded as %28, for instance), we've always served the page with a 200 response code. Because we do not purge every variant title that could serve the content, this causes caching errors. Fix: redirect to the canonical encoding.
 * Commons images - when they are updated, update pages that use them. Especially useful when aspect ratio changes
 * HipHop https://lists.wikimedia.org/mailman/listinfo/hiphop
 * The Lua extension work is quite extensive
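
A minimal sketch of the canonical-encoding redirect mentioned under caching improvements, written in Python rather than MediaWiki's PHP. The function names and the SAFE_CHARS set are illustrative assumptions, not MediaWiki's actual rules. The point is only that every variant encoding of a title collapses to one URL, so a single purge covers it.

```python
from typing import Optional, Tuple
from urllib.parse import quote, unquote

# Characters left unescaped in the canonical form. This set is an
# assumption for illustration, not MediaWiki's real list.
SAFE_CHARS = "/:()"


def canonical_path(path: str) -> str:
    """Decode all percent-escapes, then re-encode with one fixed rule."""
    return quote(unquote(path), safe=SAFE_CHARS)


def respond(path: str) -> Tuple[int, Optional[str]]:
    """Return (status, redirect target) for a requested path.

    Noncanonical encodings get a 301 to the canonical URL instead of
    a 200, so only the canonical URL ever ends up in the cache.
    """
    canonical = canonical_path(path)
    if path != canonical:
        return 301, canonical
    return 200, None
```

With these assumptions, a request for /wiki/Foo_%28bar%29 is redirected to /wiki/Foo_(bar), and only the latter needs purging when the page changes. A real implementation would also have to decide how to treat escapes such as %2F, which this sketch decodes blindly.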

Timelines
We'd hoped that by now we'd be done with auth systems work. Next few weeks: several deployments, then a review of OpenID extension.
 * Chris S.: We had a lot of UX work that made sense to do; this delayed things. This week: SUL into production, hopefully!
 * OAuth deploying to Labs this week; planned to go into prod the week of the 29th
 * OpenID review on Chris's backlog; need to help Wikinaut get a few more features done before enabling in production. Very volunteer-driven.
 * Want to get this done within the next month or 2. Low involvement/work by WMF
 * HipHop & admin tools: "time allowing"?
 * Deployment stuff - critical and Ops is waiting on it. HipHop would slip in deference to that, since basically the same set of people is working on both projects.
 * Admin tools sprint -- if we have a Product Manager in Platform to figure that out, Chris Steipp + James Forrester would work on that, make sure stewards are on board. We have good volunteer contributions on those projects.
 * What would topic of sprint be? depends on a lot of things. Top request from stewards is currently a global CheckUser tool; depends on larger CheckUser vision for the next few years. Maybe something on CAPTCHAs, or spam abuse stuff.  List of roughly prioritized projects: https://www.mediawiki.org/wiki/Admin_tools_development/Roadmap
 * Steipp has been doing lots of OAuth/SUL stuff, which detracts from the day-to-day security work he needs to do. Right now, trying to give Steipp breathing room to do security work. Then, later, hopefully push on core project work.
 * Admin tools sprint vs HipHop - a hard tradeoff. But HipHop probably wins in this quarter because:
 * lack of product management resources :( -- although Sumana has a product advisory volunteer
 * Facebook folks really keen on getting us up & running on HipHop; they are dedicating resources to this issue. Strike while iron is hot.  They have fixed blocker bugs for us.
 * Sumana: makes sense, especially because we have volunteer work on admin tools (including comaintainers of AbuseFilter etc.), but really only we can do the messy work on HipHop; therefore deferring WMF work on admin tools
 * Goals for the deployment sprint:
 * Need more definition - Greg is working on it.
 * https://www.mediawiki.org/wiki/Site_performance_and_architecture#Deployment_sprint


 * We want to improve deployment speed & robustness. We got lots of speed advantages from just improving scap.
 * Git-deploy helps in 2 ways: 1) scanning directories and transferring data is more compact with git than with rsync; 2) the network transfer is done first, then the checkout, so code updates are more atomic.
 * Deployment practices: we always push the latest state of master even when we are just pushing a small change. Increases risk.
 * VE/hotfix issue: right now the methods for deploying minor updates to extensions are awkward and little-known. It's worth looking at how we do this -- whether with scap, git-deploy, submodules, etc. -- to find something easier to use.
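
The "network transfer first, then checkout" point can be sketched as a two-phase rollout. This is a rough illustration in Python, not git-deploy's actual implementation: the host list, the ssh-based runner, and the function names are all hypothetical. Only the git subcommands (fetch, checkout --detach) are real.

```python
import subprocess
from typing import Callable, List, Sequence

# A runner executes a command on a host and reports success.
Runner = Callable[[str, Sequence[str]], bool]


def ssh_runner(host: str, command: Sequence[str]) -> bool:
    """Hypothetical runner: run the command over ssh, True on exit 0."""
    return subprocess.run(["ssh", host, *command]).returncode == 0


def two_phase_deploy(hosts: Sequence[str], repo_dir: str, ref: str,
                     run: Runner = ssh_runner) -> List[str]:
    """Fetch everywhere first; check out only if every fetch succeeded.

    Returns the hosts that failed (an empty list means success).
    """
    fetch = ["git", "-C", repo_dir, "fetch", "--tags", "origin"]
    checkout = ["git", "-C", repo_dir, "checkout", "--detach", ref]

    # Phase 1: slow, failure-prone network transfer. No working tree
    # changes yet, so a failure here leaves the live code untouched.
    failed = []
    for host in hosts:
        if not run(host, fetch):
            failed.append(host)
    if failed:
        return failed

    # Phase 2: quick local checkout, so hosts switch versions within
    # a much shorter window than a full transfer would allow.
    for host in hosts:
        if not run(host, checkout):
            failed.append(host)
    return failed
```

Passing the runner as a parameter keeps the sketch testable without real hosts; in practice the second phase would also need a rollback story, which is out of scope here.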


 * Important to set some goals to ensure we get what we want out of this sprint
 * Our default position: we're moving to git-deploy. We had some issues last time to solve somehow, especially with atomicity of deployments of copies.  Working around git + ensuring we file a bug upstream with git itself & track it.
 * Git itself: Git has never managed large binary files well. Never will, probably. Other utilities have been built as workarounds. e.g. http://git-annex.branchable.com/
 * git fetch: when it fails due to a network issue or whatever, the object store gets corrupted and it is hard to recover. So dealing with or preventing the corrupted state is painful.
 * Sumana suggests: should we have some training in this sprint?
 * Greg: there needs to be more training, + better tools to make things more straightforward.
 * Could we have a small team responsible for all deployments across all teams? - Mark Bergsma raises question.
 * Can we instead make it more of a pushbutton operation? That would be more robust. Need to find warts & code them out of the system
 * Now we have a weekly deployment train; encourage more people to do that, more eyeballs. But VE-style teams will want to deploy fixes nearly every day; how do we enable that without adding site risk?
 * More of this in release mgmt/QA review.
 * We've done ~5 weekly deployments. Adjust to this cycle and eventually speed it up again
 * ACTION ITEM: document the current awkward hotfix method (checking out an uncommitted change or whatever, possibly mentioning beta cluster), and/or have tools on the deployment host that automate these sorts of processes, and/or make a policy re hotfixes. Owner: Greg G.
 * The current method is documented in https://wikitech.wikimedia.org/wiki/How_to_deploy_code
 * But what about Solr or reportcard etc.? Put into requirements that we cover non-MediaWiki applications/deployments.
 * ACTION ITEM: Greg to consider adding some training components to this sprint
 * ACTION ITEM: Greg to help consider Mark B's question "Could we have a small team responsible for all deployments across all teams?"

Deferred or reorganized work

 * Release mgmt & QA
 * scap replacement & monitoring/beta cluster/deployments - Chris M. asks: deployments are high-touch right now.... can the example of beta labs help with automating that?
 * Tim: beta labs has been running a single set of code on NFS - not really a deployment process in the way the main site does it. Would implementing separate code trees on separate servers in beta labs give us a testbed for changing deployment systems?
 * Antoine: in Jan/Feb, we had git-deploy installed to test it out before rollout in production, but had trouble with LDAP. Should be possible to use scap, but that requires a lot of tweaks in prod files since a lot of things are hardcoded.
 * Chris M wants more monitoring for beta. It falls over, stalls on deployments, times out.
 * Mobile has beta as important part of their deployment cycle
 * beta is important & validates our strategy. More resources needed to make it better.
 * RobLa: we can weave some beta work into deployment sprint. Whenever we have an ongoing project that impacts deployment, we might want to use beta (past examples: wikivoyage, git-deploy..).
 * Central code repo
 * Lua, gadgets, and everything.
 * No huge emergency reason to do it now.
 * Would benefit from a product manager. UI elements to this. Legal wrangling - licensing. Managing code metadata, different from content metadata.
 * Want to get to it by June 2014. Maybe Jan-March 2014.
 * ECT wants this to help nurture the gadget/front end community
 * Analytics might be another internal customer -- Limn stuff -- but Toby says it's not a huge issue for them at the moment
 * Configuration database (duke nukem forever)
 * embarrassing lack that forms newbies' first impressions
 * and more importantly -- want to be able to handle some things via MW permissions that right now we handle via shell bugs
 * Customer: LCA.
 * No known timeline; waiting till after search
 * Gerrit improvements (further BZ integration)
 * we will help by upgrading to new versions.
 * Chad's been working for 13+ months on Gerrit stuff. He's one of our more experienced MW developers. Getting Chad's focus back on MW is a win
 * Would want to push on BZ stuff, new diffing and inline editing interface, trying to help with benchmarking new Lucene stuff that's already in pipeline. But others are working on this as well so we can still see minor improvements coming.
 * Lucene: tagging -- arbitrary keywords
 * timeline depends on contractor budget, developer engagement work; waiting till after search

Questions
bug 48835 - client cache control header?
 * part of performance sprint - ostensibly to be this quarter
 * look in BZ whiteboard for each ticket :)