Beta Cluster/2014-15-Q3/20141110-meeting

From mediawiki.org

https://www.mediawiki.org/wiki/Wikimedia_Engineering/2014-15_Goals/Q3#Beta_Cluster

The purpose of this meeting is to determine what cross-team support we should plan on for this work. Yuvi is the obvious crossover person from Ops and Antoine is the Deputy of Beta Cluster. Mark and I just fill out paperwork.

Attendees: Andrew B., YuviPanda, Mark B, Robla, Greg, Antoine, Damon

HHVM fcgi restart during scap runs causes 503s (and failed tests)

https://bugzilla.wikimedia.org/show_bug.cgi?id=72366

  • scap restarted HHVM, causing 503s; bd808 reverted the change
  • Ori started improving the unit test code coverage of pybal; he is also looking at adding pybal to Beta Cluster eventually
  • Beta cluster has no LVS though

Dev code pipeline/Nightly

Instead of a second Beta Cluster (overkill maintenance cost), use multiversion to run two versions in parallel

  • Phabricator tasks have been filed

What is going to be our definition of "Yes, Done"?

Prod/BC reconciliation

  • start with a diff between prod and BC
  • have Antoine, Andrew, and Yuvi flesh out the list of things we need to fix
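
As a starting point for the diff, a minimal Python sketch could compare two flat key/value snapshots of settings, one per environment. The snapshot format, the function name `diff_settings`, and the sample keys and values are all assumptions made for illustration; nothing here reflects how prod or BC configuration is actually stored.

```python
def diff_settings(prod, beta):
    """Compare two flat dicts of settings and group the differences."""
    only_prod = sorted(set(prod) - set(beta))
    only_beta = sorted(set(beta) - set(prod))
    # Keys present on both sides whose values disagree.
    changed = {
        key: (prod[key], beta[key])
        for key in sorted(set(prod) & set(beta))
        if prod[key] != beta[key]
    }
    return {"only_prod": only_prod, "only_beta": only_beta, "changed": changed}


# Hypothetical snapshots; the keys and values are made up.
prod = {"cache_backend": "memcached", "lvs": "enabled", "php_engine": "hhvm"}
beta = {"cache_backend": "memcached", "php_engine": "hhvm", "debug": "on"}

report = diff_settings(prod, beta)
for section, items in report.items():
    print(section, items)
```

Each entry in `only_prod`, `only_beta`, or `changed` is a candidate for the list of things to fix, or to document as an intentional difference.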

Truly responsive repair of Beta Cluster

Vision: have bugs auto-filed whenever anything breaks on Beta Cluster. Beta Cluster is a shared resource and needs a sheriff to babysit it: a code sheriff team that monitors the shared resource, reports traces/bugs, and grabs people to fix things.

  • result: build trust across the org
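
One way to think about automatically filed bugs is to group error messages by a fingerprint, so that a recurring failure produces one bug candidate rather than one per occurrence. The sketch below is only an illustration under that assumption: the normalization rule (collapsing digits), the helper names `fingerprint` and `group_for_filing`, and the sample messages are hypothetical, and no bug tracker API is called.

```python
import hashlib
import re


def fingerprint(message):
    """Collapse digits so the same failure on different hosts/ids matches."""
    normalized = re.sub(r"\d+", "N", message.strip().lower())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()[:12]


def group_for_filing(messages):
    """Map fingerprint -> {"example": first message seen, "count": occurrences}."""
    groups = {}
    for message in messages:
        key = fingerprint(message)
        entry = groups.setdefault(key, {"example": message, "count": 0})
        entry["count"] += 1
    return groups


# Hypothetical error messages, not real log output.
errors = [
    "503 from app server 12 after restart",
    "503 from app server 7 after restart",
    "puppet run failed on instance 3",
]
candidates = group_for_filing(errors)
for key, entry in candidates.items():
    print(key, entry["count"], entry["example"])
```

A sheriff would still review each candidate before a bug is filed; the grouping only reduces noise.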

Sheriff-like process (https://wiki.mozilla.org/Sheriffing)

  • someone on call whose responsibility it is to shepherd a fix for any breakage (across all of Engineering, prod/beta)
  • advertise logstash more: it is really useful for devs to look at and helps them babysit the cluster
  • lots of developers fix issues by themselves, filing bugs against their own software and not bothering anyone (but fixing the issue nonetheless).

Goal: data-driven infrastructure/development planning for Beta. For example, gather enough data to know if we would benefit from a move to non-virtualized hardware for app servers.

Monitoring

  • Diamond collecting metrics on each instance (cpu/disk usage etc)
  • reported to a central Graphite
  • JS frontend replacing ganglia https://tools.wmflabs.org/nagf/?project=deployment-prep
  • Moving to Shinken, will give us more labs-specific checks
    • not sure how to give BC the same set of checks as production, due to a limitation of labs (tech details: no puppet collection)
  • Labs is more complicated than production :)
  • Beta cluster could use its own Shinken instance (complicates things for other instances)
  • In production the monitoring checks are mostly active ones (connecting to an instance to execute some commands)
  • Slowly moving to passive checks via Graphite
  • puppet failures are now broadcast
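
To illustrate what a passive check via Graphite might look like, here is a minimal Python sketch that evaluates the latest datapoint of each series against thresholds. It assumes input shaped like the JSON returned by Graphite's render API (a list of objects with `target` and `datapoints`, where each datapoint is a `[value, timestamp]` pair). The function name `check_series`, the metric names, and the thresholds are hypothetical, and no network request is made.

```python
import json


def check_series(render_json, warn, crit):
    """Return {target: (status, latest value)} from Graphite render-style JSON."""
    results = {}
    for series in json.loads(render_json):
        # Datapoints are [value, timestamp] pairs; the value may be null.
        values = [value for value, _ts in series["datapoints"] if value is not None]
        if not values:
            results[series["target"]] = ("UNKNOWN", None)
            continue
        latest = values[-1]
        if latest >= crit:
            status = "CRITICAL"
        elif latest >= warn:
            status = "WARNING"
        else:
            status = "OK"
        results[series["target"]] = (status, latest)
    return results


# Hypothetical sample data; the metric names are made up.
sample = json.dumps([
    {"target": "beta.app01.cpu.user",
     "datapoints": [[12.0, 1415600000], [95.5, 1415600060], [None, 1415600120]]},
    {"target": "beta.app02.cpu.user",
     "datapoints": [[None, 1415600000]]},
])

statuses = check_series(sample, warn=80, crit=90)
for target, (status, value) in statuses.items():
    print(target, status, value)
```

Unlike an active check, nothing connects to the monitored instance: the check only reads metrics the instance has already reported.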

Puppet related

  • Beta lags behind and is sometimes broken by operations/puppet changes. Most breakages are figured out within the next few hours thanks to monitoring.
  • hiera() definitely helping and will improve
  • puppet compiler: run it for every changeset on beta cluster? (it runs on request now for prod)
    • needs to be async / overridable by ops so it does not block them (sometimes running it does not make any sense)
  • ops convention is to +2 / merge, test on one prod machine then generalize
    • maybe another step could be introduced to test it on beta

Next?

  • Test cluster on baremetal? Could do in the context of a performance cluster
  • Can't really replicate the whole cluster stack that way, though
  • OpenStack could in theory provision baremetal
  • Swift?
  • Wikimedia labs infra is reliable