Wikimedia Platform Engineering/MediaWiki Core Team/Quarterly review, January 2014/Notes

From mediawiki.org

Team: https://www.mediawiki.org/wiki/Wikimedia_MediaWiki_Core_Team/Quarterly_review,_January_2014#Team

Ongoing

  • Deployments
  • MW Operations
  • Code Review
  • Security
  • Test infrastructure
  • Git/Gerrit
    • Less this quarter
  • Shell bugs

Previous Quarter

https://www.mediawiki.org/wiki/Wikimedia_MediaWiki_Core_Team/Quarterly_review,_January_2014#Previous_quarter

CirrusSearch

  • Chad/Nik/Dan/Andrew(s)
  • Now much more of a real project/deployment.
  • Took about a month off to fix up the obvious failures.
  • Deployed to ~70% of pages, or 85% of all updates
    • mostly kept up in real-time
  • Serving 8% of all search traffic
  • pretty much all wikis will have CirrusSearch as an opt-in BetaFeature within the next 4 weeks

Deployment Tooling

  • multi-site awareness & git-deploy fell down in priority
  • scap improvements
    • specifically speed of deploy (re localization updates/generation)
    • hovering around 10 minutes per scap
      • as opposed to ~30 min today
  • Logstash in production with basic logging info (via udplog)
    • much easier to monitor things and see trends without having to grep log files
    • access right now: the wmf LDAP group. Will probably have to stay that way for foreseeable future - there's PII in there
    • will probably be restricted still in the future due to IP addresses and other private information contained in the production logs

Scholarship App

  • Made the scholarship application app (eg: the form submission and review of applications) much much better :)
  • ~180 applications submitted as of the morning of the 21st, no indication of users unable to apply
  • Review process has not started in earnest yet

Auth Systems

  • OAuth was deployed (and is being used)!
  • refining password expiration protocol and password hashing
    • prompted by the potential data breach in October
  • SULv2 performance improvements
    • cut down the effect of anonymous users on cluster resources (eg: via hitting the backend apaches)

Security Auditing and Response

  • Code review of a bunch of projects
    • GLAM
    • Flow
    • Scholarship App
    • (delayed) Limn/Kraken
    • (delayed) TimedMediaHandler v2
  • Security Releases (1.21.3 and 1.22.1)
    • first one with the outside contractors (M&M)

Performance Monitoring

  • lots of stuff, see slides ;)

Architecture Formalization

PDF support

  • Brad on loan

Next Quarter

Search

  • "neat, not cool"
  • ENWIKI being indexed
  • goal of being done (rolling out) by end of March
  • Working on an interwiki search UI (with Design/Brandon)
  • Waiting on Rack D (in eqiad) buildout for more search machines
    • Rack D -- Ops is roughly saying "end of February-ish"
  • Beta Features is a good feedback channel - may be increasing # of testers, and makes it easy for the people who want to test it anyway to provide feedback.
    • 183 beta users on Commons

HHVM

https://www.mediawiki.org/wiki/Wikimedia_MediaWiki_Core_Team/Quarterly_review,_January_2014#HipHop_VM_Deployment

  • Goal of production service running on HHVM by end of quarter
    • job queue?, l10n updates?, image scalers?
  • Need to port LuaSandbox
  • packages/puppet/automated testing
  • Great working relationship with the upstream team at Facebook
  • (discussion of unit tests; some not passing because of intl being disabled)
  • FastCGI
  • MediaWiki implementation ... works without ....
  • goal: get it running correctly without throwing errors, not optimizing specific services.
    • find a self-contained service to convert. e.g., jobqueue or l10n updates
  • factor of 5 times faster with HHVM vs standard PHP? Anecdotally it's faster but we need real benchmarks.
  • persistent connections to eg Redis are doable right now, and we can do the same thing with HHVM
  • What level of involvement and where will Ops be involved with this?
    • packages and puppet
  • Blockers to HHVM
    • packages are crappy (redone by Faidon?)
    • monitoring is different for HHVM
    • Ops hasn't put in the time to really know everything that will need to change
  • People aren't really using the mailing list - it's in the active GitHub project & Freenode channel

Deployment Tooling

https://www.mediawiki.org/wiki/Wikimedia_MediaWiki_Core_Team/Quarterly_review,_January_2014#Deployment-related_Development

  • Scaling back a bit
  • getting some preliminary work in place
  • Bryan Davis will be doing a fresh-eyes review of the current system
    • which will inform future work, eg: making extension deployment process less brittle
  • BTW: Completing the search deployment will remove lsearchd which is a blocker on scap renovation
  • Logstash: working with Ops to add more log sources (including Ops specific)
  • Ops Request: need a deployment system that is usable beyond MW itself
  • Interaction between packaging and deployment (using packages to deploy?)
  • Ops offering time to work on Graphite
  • To discuss a bunch of this tomorrow in Deployment process meeting

Performance

  • front end has been neglected, eg 2 separate requests for geoip which was not easily caught without this type of review (and similar code base)
  • making performance/latency visible so that teams/developers can see impact
  • Ori thinks we can probably get our pageload time down another 300ms or so
  • Performance Test Environment?
    • "it's on the roadmap"
    • blocked on unit tests not actually making web requests
    • performance monitoring will follow the virtualized test environment
    • but all test infrastructure currently in place is virtualized and thus not reliable for data comparisons
    • Labs does not have reliable performance characteristics
  • Timely (eg: weekly or so) performance reports mailed to Ops and Engineering lists

Security

  • Password storage update to finally replace our password storage algorithm
    • most patches are ready or merged, we're just waiting/reviewing to make sure we do it right the first time
  • continuous reviews and training
  • Training focus on team leads/project leads
  • Staffing:
    • There's an Ops Security opening
    • FrontEnd security engineer position, waiting for the internal candidate to become free

Other

  • PDF Rendering (Brad)
  • Product management scoping (Dan)
    • admin tools dev
      • Chris is engineering point person (to delegate)
    • securepoll cleanup (Brad maybe if he has time)
      • next election is in about 2 years, there are some hints of early ones (maybe hrwiki)
    • SUL finalization
      • main engineering point person is Chris
    • central CSS discussion

+2 maintainership

  • any problems?
  • punt to a later/larger conversation

TODOS

  • Sumana & Ken: follow up on possibility of "has signed an NDA" LDAP group
  • Bryan & Chris: look into 2FA or similar for Logstash Authentication for users
  • Chad & Nik: Get Brandon a link to a JSON API
  • More benchmarks for HHVM & MediaWiki - characterise & pinpoint & quantify benefits of HHVM so we have a real value proposition for rest of org
  • Mark B to look into this: to collect frontend performance data, it would be great to have a varnish kafka topic running on bits varnishes acting as aggregation point, asks Ori. Not urgent
    • Separate eventlogger load-balancing IP? suggests Faidon
  • Look into provisioning baremetal performance testing infra?
    • maybe just an additional job runner for testing HHVM
  • Faidon & Gabriel: Look into provisioning hardware for the large users of Labs, eg Parsoid
  • Describe what to do in the event of a users/admin settings leak
    • script it?
    • Chad: figure out why we still have AdminSettings lingering around. I killed that years ago.
  • Chris S & Sumana: talk about upcoming training, brainstorm approaches