Beta Cluster/2014-15-Q3/20141110-meeting

https://www.mediawiki.org/wiki/Wikimedia_Engineering/2014-15_Goals/Q3#Beta_Cluster

The purpose of this meeting is to determine what cross-team support we should plan for this work. Yuvi is the obvious crossover person from Ops and Antoine is the Deputy of Beta Cluster. Mark and I just fill out paperwork.

Attendees: Andrew B., YuviPanda, Mark B, Robla, Greg, Antoine, Damon

HHVM fcgi restarts during scap runs cause 503s (and failed tests)
https://bugzilla.wikimedia.org/show_bug.cgi?id=72366
 * scap restarted HHVM, causing 503s; bd808 reverted the change (a depool-then-restart sketch follows this list)
 * Ori started improving unit test code coverage of PyBal; he is also looking at eventually adding PyBal to the Beta Cluster
 * the Beta Cluster has no LVS, though
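The usual fix for restart-induced 503s is to take the server out of rotation before restarting. A minimal sketch of that depool-then-restart pattern, assuming a hypothetical flag file that the load balancer's health check polls (the path, service name, and sleep times are illustrative, not the real scap/PyBal tooling):

<syntaxhighlight lang="python">
import os
import subprocess
import time

HEALTHCHECK = '/var/lib/pybal/pooled'  # hypothetical flag polled by the LB

def restart_hhvm_gracefully():
    os.remove(HEALTHCHECK)            # health checks start failing -> server is depooled
    time.sleep(30)                    # wait for the LB to notice and in-flight requests to drain
    subprocess.check_call(['service', 'hhvm', 'restart'])
    time.sleep(5)                     # let HHVM warm up before taking traffic again
    open(HEALTHCHECK, 'w').close()    # health checks pass again -> server is repooled

if __name__ == '__main__':
    restart_hhvm_gracefully()
</syntaxhighlight>

Without an LVS/PyBal layer on the Beta Cluster, there is nothing to depool from, which is why the bug bites there first.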

Dev code pipeline/Nightly
Instead of a second Beta Cluster (overkill in maintenance cost), use multiversion to run two versions in parallel (a small routing sketch follows below). What is our definition of ✅?
 * Phabricator tasks have been filed
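For reference, multiversion keys the MediaWiki version off the requested wiki, so a nightly branch can run next to the stable one on a single cluster. A rough sketch of the idea (the wikiversions.json layout, wiki names, and paths are illustrative, not the exact production format):

<syntaxhighlight lang="python">
import json

def load_wikiversions(path='wikiversions.json'):
    # e.g. {"en_betawiki": "php-stable", "nightly_testwiki": "php-nightly"}
    with open(path) as f:
        return json.load(f)

def docroot_for(dbname, versions, base='/srv/mediawiki'):
    # Each request is dispatched to the MediaWiki checkout mapped to its
    # wiki, so two versions share one cluster instead of needing a second one.
    return '%s/%s' % (base, versions[dbname])

if __name__ == '__main__':
    versions = load_wikiversions()
    print(docroot_for('nightly_testwiki', versions))
</syntaxhighlight>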

Prod/BC reconciliation

 * start with a diff between prod and BC (see the sketch after this list)
 * have Antoine, Andrew, and Yuvi flesh out the list of things we need to fix
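As a starting point for that diff, something as small as comparing inventories dumped from one prod app server and one BC instance would do. A sketch, assuming hypothetical one-"name version"-per-line dumps (e.g. from `dpkg-query -W`):

<syntaxhighlight lang="python">
def read_inventory(path):
    # One "name version" pair per line, e.g. from `dpkg-query -W`.
    with open(path) as f:
        return dict(line.split(None, 1) for line in f if line.strip())

def diff_inventories(prod, beta):
    only_prod = sorted(set(prod) - set(beta))
    only_beta = sorted(set(beta) - set(prod))
    drift = sorted(name for name in set(prod) & set(beta)
                   if prod[name] != beta[name])
    return only_prod, only_beta, drift

if __name__ == '__main__':
    prod = read_inventory('prod-packages.txt')  # hypothetical dump from prod
    beta = read_inventory('beta-packages.txt')  # hypothetical dump from BC
    for label, names in zip(('prod only', 'beta only', 'version drift'),
                            diff_inventories(prod, beta)):
        print('%s: %s' % (label, ', '.join(names) or '(none)'))
</syntaxhighlight>

The same shape works for puppet classes, config files, or PHP extensions; packages are just the easiest first pass.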

Truly responsive repair of Beta Cluster
Vision: have bugs auto-filed whenever anything screws up on the Beta Cluster (a sketch follows the list below). The Beta Cluster is a shared resource and needs a sheriff to babysit it: a code sheriff team to monitor the shared resource, report traces/bugs, and grab people to fix things up, following a sheriff-like process (https://wiki.mozilla.org/Sheriffing ). Goal: data-driven infrastructure/development planning for Beta. For example, gather enough data to know whether we would benefit from a move to non-virtualized hardware for app servers.
 * result: build trust across the org
 * someone on call whose responsibility it is to shepherd a fix for any breakage (across all of Engineering, prod/beta)
 * advertise Logstash more; it is really useful for devs to look at and helps with babysitting
 * lots of developers fix issues by themselves, filing bugs against their own software and not bothering anyone (but fixing the issue nonetheless)
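A sketch of what the auto-filing piece could look like: when a monitor detects a breakage, file a task through Phabricator's Conduit API. maniphest.createtask is a real Conduit method; the host, token, and message format here are hypothetical:

<syntaxhighlight lang="python">
import requests

PHAB_API = 'https://phabricator.example.org/api/maniphest.createtask'
API_TOKEN = 'api-xxxxxxxxxxxxxxxxxxxx'  # hypothetical bot token

def file_beta_breakage(summary, trace):
    # Auto-file a task so nothing silently rots on the shared resource.
    resp = requests.post(PHAB_API, data={
        'api.token': API_TOKEN,
        'title': '[BetaCluster] %s' % summary,
        'description': 'Auto-filed by the beta sheriff bot.\n\n%s' % trace,
    })
    resp.raise_for_status()
    result = resp.json()
    if result.get('error_code'):
        raise RuntimeError(result['error_info'])
    return result['result']['id']
</syntaxhighlight>

The sheriff's job then becomes triaging the auto-filed queue and grabbing owners, not discovering breakage by hand.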

Monitoring

 * Diamond collecting metrics on each instance (cpu/disk usage etc)
 * reported to a central Graphite
 * JS frontend replacing ganglia https://tools.wmflabs.org/nagf/?project=deployment-prep
 * Moving to Shinken, which will give us more Labs-specific checks
 * not sure how to give BC the same set of checks as production, due to a limitation of Labs (tech details: no puppet resource collection)
 * Labs is more complicated than production :)
 * Beta cluster could use its own Shinken instance (complicates things for other instances)
 * In production the monitoring checks are mostly active ones (the monitoring server connects to the instance to execute some commands)
 * Slowly moving to passive checks via Graphite (see the sketch after this list)
 * puppet failures are now broadcast
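A passive check in that spirit just reads an already-collected metric out of Graphite's render API and alarms on a threshold, instead of connecting to the instance at all. A sketch (the /render endpoint and its JSON format are standard Graphite; the host, metric name, and threshold are hypothetical):

<syntaxhighlight lang="python">
import requests

GRAPHITE = 'https://graphite.example.org/render'

def latest_value(target, from_='-10min'):
    # Graphite returns [{"target": ..., "datapoints": [[value, ts], ...]}, ...]
    series = requests.get(GRAPHITE, params={
        'target': target, 'from': from_, 'format': 'json'}).json()
    points = [v for v, _ts in series[0]['datapoints'] if v is not None]
    return points[-1] if points else None

if __name__ == '__main__':
    cpu = latest_value('deployment-prep.deployment-mediawiki01.cpu.total.user')
    if cpu is None or cpu > 90:          # hypothetical threshold
        print('CRITICAL: cpu=%s' % cpu)
    else:
        print('OK: cpu=%s' % cpu)
</syntaxhighlight>

Since Diamond already ships the data to Graphite, this sidesteps the no-puppet-collection limitation: the check only needs to reach Graphite, not the instance.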

Puppet related

 * Beta lags behind and is sometimes broken by operations/puppet changes. Most issues are figured out within hours thanks to monitoring.
 * Hiera is definitely helping and will improve things
 * puppet compiler: run it for every changeset on the Beta Cluster? (it runs on request now for prod; a catalog-diff sketch follows this list)
 * it needs to be async / overridable by ops so it does not block them (sometimes the check does not make sense)
 * ops convention is to +2 / merge, test on one prod machine, then generalize
 * maybe another step could be introduced to test it on beta
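One cheap shape for that extra step: compile the catalog for a beta node before and after the change and diff the resources, posted as a non-blocking report rather than a gate. A sketch assuming Puppet 3-era tooling (`puppet master --compile <node>` emits a JSON catalog); the node selection and branch handling are left out as hypothetical:

<syntaxhighlight lang="python">
import json
import subprocess

def compile_catalog(node):
    # `puppet master --compile` prints log noise before the JSON document.
    out = subprocess.check_output(
        ['puppet', 'master', '--compile', node]).decode('utf-8')
    return json.loads(out[out.index('{'):])

def resource_set(catalog):
    # Puppet 3 wraps the catalog body in a "data" key.
    return set('%s[%s]' % (r['type'], r['title'])
               for r in catalog['data']['resources'])

def catalog_diff(before, after):
    old, new = resource_set(before), resource_set(after)
    return sorted(new - old), sorted(old - new)
</syntaxhighlight>

Because it only reports, ops can merge past it when the result does not apply to beta, which keeps their +2/merge workflow unblocked.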

Next?

 * Test cluster on bare metal? Could be done in the context of a performance cluster
 * Can't really replicate the whole cluster stack that way, though
 * OpenStack could in theory provision baremetal
 * Swift?
 * Wikimedia labs infra is reliable