Wikimedia Cloud Services team/Onboarding Chico/Sessions


https://www.mediawiki.org/wiki/Wikimedia_Cloud_Services_team/Onboarding_Chico/Sessions

Chico Questions

next session

Flapping alerts in shinken T161898

Diamond

  • Is there a reason for collecting fewer metrics about puppet?
    • We have an adapted minimalpuppetagent.py that collects a lot less than the original puppetagent.py (it also adds a _check_sudo method)
    • Maybe puppetagent.changes.total, puppetagent.events.failure, puppetagent.events.success, and servers.hostname.puppetagent.events.total could be useful as well?
    • https://diamond.readthedocs.io/en/latest/collectors/PuppetAgentCollector/
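A minimal sketch of how the extra metrics proposed above could be derived, assuming the collector already has Puppet's last run summary parsed into a dictionary with `changes` and `events` sections. The helper name `extra_puppet_metrics` and the `WANTED` table are hypothetical and are not part of Diamond or of minimalpuppetagent.py; in a real collector each resulting pair would presumably be passed to the collector's publish call.

```python
# Hypothetical helper, not taken from Diamond or minimalpuppetagent.py.
# Sections and keys to export, matching the metrics suggested in the notes.
WANTED = {
    "changes": ("total",),
    "events": ("failure", "success", "total"),
}


def extra_puppet_metrics(summary, prefix="puppetagent"):
    """Flatten selected sections of a parsed run summary into metric names."""
    metrics = {}
    for section, keys in WANTED.items():
        values = summary.get(section) or {}
        for key in keys:
            # Skip keys the summary does not report instead of inventing zeros.
            if key in values:
                metrics[f"{prefix}.{section}.{key}"] = values[key]
    return metrics


# Example input shaped like a parsed run summary (values are made up).
summary = {
    "changes": {"total": 2},
    "events": {"failure": 0, "success": 2, "total": 2},
    "time": {"total": 12.5},
}

for name, value in sorted(extra_puppet_metrics(summary).items()):
    print(name, value)
```

Sections not listed in `WANTED` (such as `time` here) are ignored, which keeps the collector minimal while adding only the four metrics under discussion.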

Puppet errors analysis

Bastion alerts

  • Create alerts specific to bastion


Using puppet on VPS

  • Docs

https://wikitech.wikimedia.org/wiki/Help:Standalone_puppetmaster


PAWS

  • How to test changes?
    • Staging env
  • tools-beta


K8s logging retention

  • We can force pods to restart after 30 days, but it sounds like a terrible idea
    • Revisit after tools-beta

Other tasks

  • Is there something else I should be looking into?

2018-02-13

How is monitoring configured?

https://etherpad.wikimedia.org/p/chicoandchase

https://graphite-labs.wikimedia.org/render/?width=674&height=377&_salt=1518534146.333&target=tools.tools-bastion-03.cpu.total.idle&from=-30d

  • tc, ifb
    • tc can only manipulate send queues
  • iotop
    • iotop -ao

https://graphite-labs.wikimedia.org/render/?width=674&height=377&_salt=1518535047.584&target=tools.tools-bastion-03.nfsiostat.labstore.ops&target=tools.tools-bastion-03.nfsiostat.labstore.ops_per_sec&target=tools.tools-bastion-03.nfsiostat.labstore1003.ops&target=tools.tools-bastion-03.nfsiostat.labstore1003.ops_per_sec

https://graphite-labs.wikimedia.org/render/?width=674&height=377&_salt=1518535115.171&target=tools.tools-bastion-03.nfsiostat.mounts.data_project.write.kilobytes&target=tools.tools-bastion-03.nfsiostat.mounts.data_project.read.kilobytes&from=-30d

https://graphite-labs.wikimedia.org/render/?width=674&height=377&_salt=1518535177.867&target=tools.tools-bastion-03.nfsiostat.mounts.mnt_nfs_labstore-secondary-tools-project.read.kilobytes&target=tools.tools-bastion-03.nfsiostat.mounts.mnt_nfs_labstore-secondary-tools-project.write.kilobytes&from=-30d
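The render URLs above all follow the same pattern: a width, a height, a time range, and one `target` parameter per metric. A small sketch of building such URLs programmatically, assuming the same graphite-labs endpoint; the function name `render_url` and its defaults are hypothetical, and the `_salt` parameter from the URLs above is left out here on the assumption that it is not needed to select the data.

```python
from urllib.parse import urlencode

GRAPHITE_RENDER = "https://graphite-labs.wikimedia.org/render/"


def render_url(targets, since="-30d", width=674, height=377):
    """Build a Graphite render URL with one `target` parameter per metric."""
    params = [("width", width), ("height", height), ("from", since)]
    # Repeating the `target` key is how several series end up on one graph.
    params += [("target", target) for target in targets]
    return GRAPHITE_RENDER + "?" + urlencode(params)


host = "tools.tools-bastion-03"
url = render_url([
    f"{host}.nfsiostat.mounts.data_project.read.kilobytes",
    f"{host}.nfsiostat.mounts.data_project.write.kilobytes",
])
print(url)
```

Passing a different `since` value (for example `"-7d"`) narrows the window without touching the metric list.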

WMCS Phabricator etiquette

  • Do we have documentation about how to triage tasks and move them around projects and workboards?
    • TBD

Cloud VPS / Horizon stuff

  • I am still unfamiliar with the interfaces and common questions, maybe I should create a temp project and go through docs.
    • make a task for a chicotestproject T187213
  • Where are things configured?
    • Wikitech
    • operations-puppet repo
    • Horizon

https://wikitech.wikimedia.org/wiki/Hiera:tools

~/git/wmf/puppet cpettet@cair> ls hieradata/labs/tools

toolsadmin.wikimedia.org

Wikitech docs

  • Portal namespace for user facing docs
  • /admin subpage for WMCS team

Other tasks

  • Is there something else I should be looking into?
    • Let's start slow and I'll try to integrate you into my sort of normal workflow
    • Flapping alerts in shinken
      • host* as a way to get the % of hosts in failure
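The notes do not say whether the host* wildcard would be evaluated in Shinken or in Graphite, so the following only illustrates the arithmetic behind "% of hosts in failure": match host names against a wildcard and report the failing share. The function name `percent_failing` and the `tools-exec-*` host names are hypothetical.

```python
from fnmatch import fnmatch


def percent_failing(statuses, pattern):
    """Return the percentage of hosts matching `pattern` that are failing."""
    matched = [failed for host, failed in statuses.items() if fnmatch(host, pattern)]
    if not matched:
        # No host matched the wildcard; report 0 rather than dividing by zero.
        return 0.0
    return 100.0 * sum(matched) / len(matched)


# True means the host is currently in failure (sample data, made up).
statuses = {
    "tools-exec-1401": True,
    "tools-exec-1402": False,
    "tools-exec-1403": False,
    "tools-exec-1404": True,
    "tools-bastion-03": True,
}

print(percent_failing(statuses, "tools-exec-*"))  # 50.0
```

Alerting on a percentage across a group, rather than on each host individually, is one way a single flapping host would stop triggering its own alert.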