Wikimedia Platform Engineering/MediaWiki Core Team/Quarterly review, January 2014/Notes

Team: https://www.mediawiki.org/wiki/Wikimedia_MediaWiki_Core_Team/Quarterly_review,_January_2014#Team

Ongoing:
 * Deployments
 * MW Operations
 * Code Review
 * Security
 * Test infrastructure
 * Git/Gerrit
 * Less this quarter
 * Shell bugs

Previous Quarter
https://www.mediawiki.org/wiki/Wikimedia_MediaWiki_Core_Team/Quarterly_review,_January_2014#Previous_quarter

CirrusSearch

 * Chad/Nik/Dan/Andrew(s)
 * Much more of a real project/deployment now.
 * Took about a month off to fix up the obvious failures.
 * Deployed to ~70% of pages, or 85% of all updates
 * mostly kept up in real-time
 * Serving 8% of all search traffic
 * pretty much all wikis will have CirrusSearch as an opt-in BetaFeature within the next 4 weeks

Deployment Tooling

 * multi-site awareness & git-deploy fell down in priority
 * scap improvements
 * specifically speed of deploy (re localization updates/generation)
 * hovering around 10 minutes per scap
 * as opposed to ~30 min today
 * Logstash in production with basic logging info (via udplog)
 * much easier to monitor things without having to grep log files, and to see trends
 * access right now: the wmf LDAP group. Will probably have to stay that way for foreseeable future - there's PII in there
 * will probably be restricted still in the future due to IP addresses and other private information contained in the production logs

Scholarship App

 * Made the scholarship application process app (eg: the form submission and review of applications) much much better :)
 * ~180 applications submitted as of the morning of the 21st, no indication of users unable to apply
 * Review process has not started in earnest yet

Auth Systems

 * OAuth was deployed (and is being used)!
 * refining password expiration protocol and password hashing
 * prompted by the potential data breach in October
 * SULv2 performance improvements
 * cut down the effect of anonymous users on cluster resources (eg: via hitting the backend apaches)

Security Auditing and Response

 * Code review of a bunch of projects
 * GLAM
 * Flow
 * Scholarship App
 * (delayed) Limn/Kraken
 * (delayed) TimedMediaHandler v2
 * Security Releases (1.21.3 and 1.22.1)
 * first one with the outside contractors (M&M)

Performance Monitoring

 * lots of stuff, see slides ;)

Architecture Formalization

 * https://www.mediawiki.org/wiki/Requests_for_comment
 * RFC IRC meetings
 * RFC process seems more clear/transparent
 * How does the room feel?
 * (silence == bang up job)
 * This feeds directly into the MW Core Team's planning

PDF support

 * Brad on loan

Search

 * "neat, not cool"
 * ENWIKI being indexed
 * goal of being done (rolling out) by end of March
 * Working on an interwiki search UI (with Design/Brandon)
 * Waiting on Rack D (in eqiad) buildout for more search machines
 * Rack D -- ops is roughly saying "end of February-ish"
 * Beta Features is a good feedback channel - may be increasing # of testers, and makes it easy for the people who want to test it anyway to provide feedback.
 * 183 beta users on Commons

HHVM
https://www.mediawiki.org/wiki/Wikimedia_MediaWiki_Core_Team/Quarterly_review,_January_2014#HipHop_VM_Deployment
 * Goal of production service running on HHVM by end of quarter
 * job queue?, l10n updates?, image scalers?
 * Need to port LuaSandbox
 * packages/puppet/automated testing
 * Great working relationship with the upstream team at Facebook
 * (discussion of unit tests, some not passing because of intl disabling)
 * fastcgi
 * MediaWiki implementation ... works without ....
 * goal: get it running correctly without throwing errors, not optimizing specific services.
 * find a self-contained service to convert. e.g., jobqueue or l10n updates
 * factor of 5 times faster with HHVM vs standard PHP? Anecdotally it's faster but we need real benchmarks.
 * The last time we did benchmarks was factor of 5 with HPHP
 * Is HPHP faster or slower than HHVM? Very recently FB blogged that HHVM is faster?
 * http://www.hhvm.com/blog/2027/faster-and-cheaper-the-evolution-of-the-hhvm-jit
 * persistent connections to eg Redis are doable right now, and we can do the same thing with HHVM
 * What level of involvement and where will Ops be involved with this?
 * packages and puppet
 * Blockers to HHVM
 * packages are crappy (redone by Faidon?)
 * monitoring is different for HHVM
 * Ops hasn't put in the time to really know what all will need to change
 * People aren't really using the mailing list - the activity is in the GitHub project & Freenode channel
 * https://lists.wikimedia.org/mailman/listinfo/hiphop
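
The open question above (is HHVM really a factor of 5 faster, or is that only anecdotal?) could be approached with a small timing harness. The following is a minimal sketch in Python, not an existing tool: the names `time_calls` and `speedup` are made up for illustration, and the workload passed in (for example, a request for the same page against each runtime) is assumed to be supplied by the caller. Warm-up runs are discarded because a JIT runtime is expected to be slow on the first few calls, and medians are compared to reduce the influence of outliers.

```python
import statistics
import time
from typing import Callable, List, Sequence


def time_calls(fn: Callable[[], object], runs: int = 20, warmup: int = 3) -> List[float]:
    """Return wall-clock durations in seconds for `runs` calls of `fn`.

    The first `warmup` calls are executed but not recorded.
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return samples


def speedup(baseline: Sequence[float], candidate: Sequence[float]) -> float:
    """Median baseline time divided by median candidate time.

    A value above 1 means the candidate is faster than the baseline.
    """
    return statistics.median(baseline) / statistics.median(candidate)
```

In use, `baseline` would hold samples from the standard PHP setup and `candidate` samples from HHVM for the same workload; the resulting ratio is the kind of number the "real value proposition" TODO asks for.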

Deployment Tooling
https://www.mediawiki.org/wiki/Wikimedia_MediaWiki_Core_Team/Quarterly_review,_January_2014#Deployment-related_Development
 * Scaling back a bit
 * getting some preliminary work in place
 * Bryan Davis will be doing a fresh-eyes review of the current system
 * which will inform future work, eg: making extension deployment process less brittle
 * BTW: Completing the search deployment will remove lsearchd which is a blocker on scap renovation
 * Logstash: working with Ops to add more log sources (including Ops specific)
 * Ops Request: need a deployment system that is usable beyond MW itself
 * Interaction between packaging and deployment (using packages to deploy?)
 * Ops offering time to work on Graphite
 * To discuss a bunch of this tomorrow in Deployment process meeting

Performance

 * front end has been neglected, eg 2 separate requests for geoip which was not easily caught without this type of review (and similar code base)
 * making performance/latency visible so that teams/developers can see impact
 * Ori thinks we can probably get our pageload time down another 300ms or so
 * Performance Test Environment?
 * "it's on the roadmap"
 * blocked on unittests not actually making web requests
 * performance monitoring will follow the virtualized test environment
 * but all test infrastructure currently in place is virtualized and thus not reliable for data comparisons
 * Labs does not have reliable performance characteristics
 * Timely (eg: weekly or so) performance reports mailed to Ops and Engineering lists

Security

 * Password storage update to finally replace our password storage algorithm
 * most patches are mostly ready or merged, we're just waiting/reviewing to make sure we do it right the first time
 * continuous reviews and training
 * Training focus on team leads/project leads
 * form tbd
 * beginning with WMF individuals
 * can potentially leverage those materials, experiences to help ECT goal in April-June to train volunteers in security https://www.mediawiki.org/wiki/Wikimedia_Engineering/2013-14_Goals#Engineering_Community
 * Staffing:
 * There's an Ops Security opening
 * FrontEnd security engineer position, waiting for the internal candidate to become free
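
The notes do not say which algorithm the password storage update uses. As a rough illustration of the general idea (a salted, iterated hash stored in a self-describing format, so old hashes can be detected and upgraded later), here is a minimal Python sketch. PBKDF2, the parameter values, the `pbkdf2:...` string layout, and all function names are assumptions for the example, not the MediaWiki implementation.

```python
import hashlib
import hmac
import os

# Assumed parameters, chosen for illustration only.
ALGO = "sha256"
ITERATIONS = 100_000
SALT_BYTES = 16


def hash_password(password: str, iterations: int = ITERATIONS) -> str:
    """Hash a password with a fresh random salt; parameters are stored alongside."""
    salt = os.urandom(SALT_BYTES)
    digest = hashlib.pbkdf2_hmac(ALGO, password.encode("utf-8"), salt, iterations)
    return f"pbkdf2:{ALGO}:{iterations}:{salt.hex()}:{digest.hex()}"


def verify_password(password: str, stored: str) -> bool:
    """Recompute the hash using the stored parameters and compare in constant time."""
    parts = stored.split(":")
    if len(parts) != 5 or parts[0] != "pbkdf2":
        return False
    _, algo, iterations, salt_hex, digest_hex = parts
    candidate = hashlib.pbkdf2_hmac(
        algo, password.encode("utf-8"), bytes.fromhex(salt_hex), int(iterations)
    )
    return hmac.compare_digest(candidate, bytes.fromhex(digest_hex))


def needs_rehash(stored: str, iterations: int = ITERATIONS) -> bool:
    """True if the stored hash uses a different scheme or weaker parameters."""
    parts = stored.split(":")
    if len(parts) != 5 or parts[0] != "pbkdf2":
        return True
    return parts[1] != ALGO or int(parts[2]) < iterations
```

Storing the scheme and parameters with each hash is what makes a gradual migration possible: on a successful login, a hash for which `needs_rehash` returns true can be replaced with a freshly computed one.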

Other
+2 maintainership
 * PDF Rendering (Brad)
 * Product management scoping (Dan)
 * admin tools dev
 * Chris is engineering point person (to delegate)
 * securepoll cleanup (Brad maybe if he has time)
 * next election is in about 2 years, there are some hints of early ones (maybe hrwiki)
 * SUL finalization
 * main engineering point person is Chris
 * central CSS discussion
 * any problems?
 * punt to a later/larger conversation

TODOS

 * Sumana & Ken: follow up on possibility of "has signed an NDA" LDAP group
 * Bryan & Chris: look into 2FA or similar for Logstash Authentication for users
 * Chad & Nik: Get Brandon a link to a JSON API
 * More benchmarks for HHVM & MediaWiki - characterise & pinpoint & quantify benefits of HHVM so we have a real value proposition for rest of org
 * Mark B to look into this: To collect frontend performance data, would be great to have a varnish kafka topic running on bits varnishes acting as aggregation point, asks Ori. Not urgent
 * Separate eventlogger load-balancing IP? suggests Faidon
 * Look into provisioning baremetal performance testing infra?
 * maybe just an additional job runner for testing HHVM
 * Faidon & Gabriel: Look into provisioning hardware for the large users of Labs, eg Parsoid
 * Describe what to do in the event of a users/admin settings leak
 * script it?
 * Chad: figure out why we still have AdminSettings lingering around. I killed that years ago.
 * Chris S & Sumana: talk about upcoming training, brainstorm approaches