Beta Cluster/2014-15-Q3/20141110-meeting

https://www.mediawiki.org/wiki/Wikimedia_Engineering/2014-15_Goals/Q3#Beta_Cluster

The purpose of this meeting is to determine what cross-team support we should plan on for this work. Yuvi is the obvious crossover person from Ops and Antoine is the Deputy of Beta Cluster. Mark and I just fill out paperwork.

Attendees: Andrew B., YuviPanda, Mark B, Robla, Greg, Antoine, Damon

HHVM fcgi restart during scap runs causes 503s (and failed tests)

https://bugzilla.wikimedia.org/show_bug.cgi?id=72366

scap restarted HHVM, causing 503s. bd808 reverted the change
Ori started improving the unit test coverage of pybal; he is also looking at adding pybal to Beta Cluster eventually
Beta Cluster has no LVS, though

Dev code pipeline/Nightly

Instead of a 2nd Beta Cluster (overkill maintenance cost), use multiversion to run two versions in parallel

Phabricator tasks have been filed

What is going to be our definition of Done?

Prod/BC reconciliation

start with a diff between prod and BC
have Antoine, Andrew, and Yuvi flesh out the list of things we need to fix
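
A minimal sketch of what such a diff could look like, assuming the settings of each environment can be exported as a flat name/value mapping. The function name `diff_settings` and the sample keys and values are hypothetical and do not reflect real prod or BC configuration.

```python
def diff_settings(prod, beta):
    """Compare two flat {name: value} mappings and report where they diverge."""
    prod_keys, beta_keys = set(prod), set(beta)
    return {
        # Present in production but missing on Beta Cluster.
        "only_in_prod": sorted(prod_keys - beta_keys),
        # Present on Beta Cluster but missing in production.
        "only_in_beta": sorted(beta_keys - prod_keys),
        # Present in both, with different values: name -> (prod, beta).
        "different": {
            key: (prod[key], beta[key])
            for key in sorted(prod_keys & beta_keys)
            if prod[key] != beta[key]
        },
    }


# Hypothetical sample data, for illustration only.
prod = {"app_server": "v1", "load_balancer": "enabled", "cache": "v3"}
beta = {"app_server": "v2", "cache": "v3", "monitoring": "enabled"}

report = diff_settings(prod, beta)
for section, entries in report.items():
    print(section, entries)
```

The three buckets map onto the follow-up work: entries in `only_in_prod` and `different` are candidates for the list of things to fix.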

Truly responsive repair of Beta Cluster

Vision: have bugs auto-filed whenever anything breaks on Beta Cluster. Beta Cluster is a shared resource and needs a sheriff to babysit it: a code sheriff team that monitors the shared resource, reports traces/bugs, and grabs people to fix them.

Result: build trust across the org

Sheriff-like process (https://wiki.mozilla.org/Sheriffing)

someone on call whose responsibility it is to shepherd a fix for any breakage (across all of Engineering, prod/beta)
advertise logstash more: it is really useful for devs to look at and helps them babysit
lots of developers fix issues by themselves, filing bugs against their own software and not bothering anyone (but fixing the issue nonetheless).

Goal: data-driven infrastructure/development planning for Beta. For example, gather enough data to know if we would benefit from a move to non-virtualized hardware for app servers.

Monitoring

Diamond collecting metrics on each instance (cpu/disk usage etc)
reported to a central Graphite
JS frontend replacing ganglia https://tools.wmflabs.org/nagf/?project=deployment-prep
Moving to Shinken, will give us more labs-specific checks
- not sure how to give BC the same set of checks as production due to a limitation of labs (tech details: no puppet collection)
Labs is more complicated than production :)
Beta cluster could use its own Shinken instance (complicates things for other instances)
In production the monitoring checks are mostly active ones (connecting to the instance to execute some commands)
Slowly moving to passive checks via Graphite
puppet failures are now broadcast
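
To illustrate the difference between pushing metrics and checking them passively, here is a rough sketch. It assumes Graphite's plaintext line format (`path value timestamp`) and the `[value, timestamp]` datapoint layout of its JSON render output. The helper names, the metric path, and the threshold are hypothetical, and this is not how Diamond or Shinken are actually wired up on Beta Cluster.

```python
import time


def graphite_line(path, value, timestamp=None):
    """Format one metric in Graphite's plaintext protocol: 'path value timestamp'."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return f"{path} {value} {ts}\n"


def passive_check(datapoints, threshold):
    """Evaluate already-collected datapoints instead of connecting to the instance.

    datapoints: [[value, timestamp], ...]; a value of None means no data
    was recorded for that interval.
    """
    values = [value for value, _ in datapoints if value is not None]
    if not values:
        return "UNKNOWN"
    # Judge on the most recent value that was actually recorded.
    return "CRITICAL" if values[-1] > threshold else "OK"


# Hypothetical metric path and values, for illustration only.
print(graphite_line("deployment-prep.instance1.cpu.user", 12.5, 1415577600), end="")
print(passive_check([[10, 1415577540], [None, 1415577600]], threshold=50))
```

An active check has to reach each instance, while a passive check like this only needs read access to the central metrics store.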

Puppet related

beta lags behind and is sometimes broken by operations/puppet changes. Most breakages are figured out within the next few hours thanks to monitoring.
hiera() is definitely helping and will improve
puppet compiler: run it for every changeset on beta cluster? (it runs on request now for prod)
- needs to be async / overridable by ops so it does not block them (sometimes it does not make any sense)
ops convention is to +2 / merge, test on one prod machine then generalize
- maybe another step could be introduced to test it on beta

Next?

Test cluster on baremetal? Could do in the context of a performance cluster
Can't really do it to replicate the whole cluster stack, though
OpenStack could in theory provision baremetal
Swift?
Wikimedia labs infra is reliable