Wikimedia Cloud Services team/Onboarding Hieu/Sessions

2019-12-12
modules/puppetmaster/files/production.hiera.yaml

$data = lookup(whatever)

$data = "other" <--- fails!! (Puppet does not allow reassigning a variable in the same scope)

modules/puppetmaster/files/labs.hiera.yaml

https://gerrit.wikimedia.org/r/admin/projects/cloud/instance-puppet

modules/profile/manifests/wmcs/monitoring.pp

2019-12-05
https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/554844

https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/554853 (use var from hiera)

hieradata/labs.yaml

git grep profile::mediawiki::scap_client

git grep profile::mediawiki::common

https://integration.wikimedia.org/ci/view/operations/job/operations-puppet-catalog-compiler/

Catalog ran successfully: https://puppet-compiler.wmflabs.org/compiler1003/19814/mwmaint1002.eqiad.wmnet/

mwmaint1002.eqiad.wmnet

hieradata/hosts/cloudservices1003.yaml:
 prometheus_nodes:
   - labmon1001.eqiad.wmnet
   - cloudmetrics1002.eqiad.wmnet
   - prometheus1003.eqiad.wmnet
   - prometheus1004.eqiad.wmnet
 * 1) prometheus-pdns-exporter is scraped by the labmons
 * 2) prometheus-node-exporter is scraped by the prod servers

2019-11-28
https://office.wikimedia.org/wiki/Pwstore

cumin1001.eqiad.wmnet

aborrero@cumin1001:~ $ sudo cookbook sre.hosts.downtime -r "hieu reimaging server" --hours 1 labmon1002*

aborrero@cumin1001:~ $ sudo install_console labmon1002.mgmt.eqiad.wmnet
 * 1) or ssh root@labmon1002.mgmt.eqiad.wmnet

cd pw ../pwstore/pwd ed management

hpiLO-> vsp

Virtual Serial Port Active: COM2

Starting virtual serial port. Press 'ESC (' to return to the CLI Session.

Debian GNU/Linux 8 labmon1002 ttyS1

labmon1002 login:
 * merge patches (dns and puppet)

https://gerrit.wikimedia.org/r/c/operations/puppet/+/553441

https://gerrit.wikimedia.org/r/c/operations/dns/+/553467

ssh ns1.wikimedia.org

aborrero@authdns2001:~ $ sudo authdns-update

aborrero@cumin1001:~ $ sudo cumin A:installserver run-puppet-agent
 * run puppet on install servers


 * run the script:

phamhi@cumin1001:~$ sudo -i wmf-auto-reimage-host --rename cloudmetrics1002.eqiad.wmnet --rename-mgmt cloudmetrics1002.mgmt.eqiad.wmnet -p T224585 labmon1002.eqiad.wmnet labmon1002.mgmt.eqiad.wmnet
15:59:02 | labmon1002.eqiad.wmnet | REIMAGE START | To monitor the full log and cumin output:
sudo tail -F /var/log/wmf-auto-reimage/201911281559_phamhi_182318_labmon1002_eqiad_wmnet.log
sudo tail -F /var/log/wmf-auto-reimage/201911281559_phamhi_182318_labmon1002_eqiad_wmnet_cumin.out
IPMI Password:
15:59:12 | labmon1002.eqiad.wmnet | Validated host
15:59:13 | labmon1002.eqiad.wmnet | Downtimed on Icinga
15:59:18 | labmon1002.eqiad.wmnet | Removed from Puppet
15:59:18 | labmon1002.eqiad.wmnet | Removed from Debmonitor
15:59:18 | labmon1002.eqiad.wmnet | Set Boot Device to pxe
15:59:18 | labmon1002.eqiad.wmnet | Power cycling
15:59:18 | labmon1002.eqiad.wmnet | Chassis Power Control: Cycle
16:03:27 | cloudmetrics1002.eqiad.wmnet | Still waiting for reboot after 5.0 minutes
16:03:29 | cloudmetrics1002.eqiad.wmnet | Uptime checked
16:03:29 | cloudmetrics1002.eqiad.wmnet | Host up (Debian installer)
16:08:08 | cloudmetrics1002.eqiad.wmnet | Still waiting for reboot after 5.0 minutes
16:13:23 | cloudmetrics1002.eqiad.wmnet | Still waiting for reboot after 10.0 minutes
16:13:25 | cloudmetrics1002.eqiad.wmnet | Uptime checked
16:13:25 | cloudmetrics1002.eqiad.wmnet | Host up
16:13:33 | cloudmetrics1002.eqiad.wmnet | Puppet CSR generated, fingerprint is: 06:02:32:2F:0E:80:B8:CA:8E:74:34:9B:63:EA:94:41:EF:B3:0E:B3:DF:D1:4B:84:F4:B3:73:66:B9:78:16:D5
16:13:33 | cloudmetrics1002.eqiad.wmnet | Polling until a Puppet sign request appears
16:13:37 | cloudmetrics1002.eqiad.wmnet | Signed Puppet cert
16:13:39 | cloudmetrics1002.eqiad.wmnet | Validated host
16:13:39 | cloudmetrics1002.eqiad.wmnet | Scheduled delayed downtime on Icinga
16:13:41 | cloudmetrics1002.eqiad.wmnet | Started first puppet run (sit back, relax, and enjoy the wait)
START - Cookbook sre.hosts.downtime
Forcing a Puppet run on the Icinga server
Running Puppet with args --quiet --attempts 30 on 1 hosts: icinga1001.wikimedia.org
Downtiming 1 hosts and all their services for 2:00:00: cloudmetrics1002.eqiad.wmnet
Scheduling downtime on Icinga server icinga1001.wikimedia.org for hosts: cloudmetrics1002.eqiad.wmnet
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
16:20:24 | cloudmetrics1002.eqiad.wmnet | First Puppet run completed
16:20:25 | cloudmetrics1002.eqiad.wmnet | WARNING: unable to verify that BIOS boot parameters are back to normal, got:
Boot parameter version: 1
Boot parameter 5 is valid/unlocked
Boot parameter data: 0004000000
Boot Flags :
- Boot Flag Invalid
- Options apply to only next boot
- BIOS PC Compatible (legacy) boot
- Boot Device Selector : Force PXE
- Console Redirection control : System Default
- BIOS verbosity : Console redirection occurs per BIOS configuration setting (default)
- BIOS Mux Control Override : BIOS uses recommended setting of the mux at the end of POST

16:20:50 | cumin1001.eqiad.wmnet | Puppet run completed
16:20:51 | cloudmetrics1002.eqiad.wmnet | Rebooted host
16:24:00 | cloudmetrics1002.eqiad.wmnet | Uptime checked
16:24:00 | cloudmetrics1002.eqiad.wmnet | Host up
16:24:00 | cloudmetrics1002.eqiad.wmnet | Polling the completion of a Puppet run
16:26:04 | cloudmetrics1002.eqiad.wmnet | Puppet run checked
16:26:04 | cloudmetrics1002.eqiad.wmnet | Reimage completed
16:26:04 | cloudmetrics1002.eqiad.wmnet | REIMAGE END | retcode=0
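
The total duration of a reimage can be read off the REIMAGE START and REIMAGE END lines of a log like the one above. A minimal sketch, assuming both timestamps fall on the same day and that fields are separated by " | " as in this output. The function name reimage_duration is made up for illustration and is not part of wmf-auto-reimage-host.

```shell
# Hypothetical helper: print the seconds between REIMAGE START and REIMAGE END.
# Reads a reimage log on stdin; assumes HH:MM:SS is the first " | " field.
reimage_duration() {
    awk -F' [|] ' '
        # convert HH:MM:SS to seconds since midnight
        function secs(t,    p) { split(t, p, ":"); return p[1] * 3600 + p[2] * 60 + p[3] }
        /REIMAGE START/ { t_start = secs($1) }
        /REIMAGE END/   { t_end = secs($1) }
        END             { print t_end - t_start }
    '
}

# Usage with the first and last timestamped lines of the run above.
reimage_duration <<'EOF'
15:59:02 | labmon1002.eqiad.wmnet | REIMAGE START | To monitor the full log and cumin output:
16:26:04 | cloudmetrics1002.eqiad.wmnet | REIMAGE END | retcode=0
EOF
# prints 1622 (27 minutes and 2 seconds)
```

On a real run the same function could be fed the log file named in the "To monitor the full log" lines, provided that file uses the same line format (not verified here).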

hieradata/hosts/labmon1002.yaml
 * AFTER operations: cleanup DNS entries
 * AFTER operations: cleanup stale file in puppet:

2019-11-27
 * files within puppet repo with the keyword "labmon"

hieradata/hosts/cloudservices1003.yaml
hieradata/hosts/cloudservices2002-dev.yaml
hieradata/hosts/cloudservices1004.yaml

hieradata/hosts/cloudcontrol1003.yaml
hieradata/hosts/cloudcontrol2001-dev.yaml
hieradata/hosts/cloudcontrol2003-dev.yaml
hieradata/hosts/cloudcontrol1004.yaml

hieradata/labs/cloudinfra/host/cloud-puppetmaster-01.yaml
hieradata/labs/cloudinfra/host/cloud-puppetmaster-02.yaml
hieradata/labs/cloudinfra/host/cloud-puppetmaster-03.yaml
hieradata/labs/cloudinfra/host/cloud-puppetmaster-04.yaml

modules/install_server/files/dhcpd/linux-host-entries.ttyS1-115200
hieradata/role/common/wmcs/monitoring.yaml
hieradata/common/profile/openstack/eqiad1.yaml

modules/install_server/files/autoinstall/preseed.cfg
modules/install_server/files/autoinstall/netboot.cfg

modules/profile/templates/cumin/aliases.yaml.erb
manifests/site.pp

labmon migration
https://phabricator.wikimedia.org/T224585

https://gerrit.wikimedia.org/r/c/operations/puppet/+/552107

labmon1001 (primary) labmon1002 (backup)

(conftool ) https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/role/common/cache/text.yaml
 * how to switch active to standby

things to be backed up and restored
 * /var/lib/grafana (dashboard data is not in puppet)


 * labmon name change -> cloudmetrics1001 & cloudmetrics1002


 * disable puppet agent on labmon
 * do 1002 (standby) first
 * shutdown 1002
 * change puppet to change hostname
 * turn back on 1002 to ensure hostname change is correct
 * (remember to update netbox)


 * reimage to buster
 * make 1002 the primary

https://gerrit.wikimedia.org/r/admin/projects/operations/dns (another puppet repository for DNS)

2019-11-14
https://wikitech.wikimedia.org/wiki/Incident_documentation

https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProposals/Systems_and_Service_Continuity

https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Neutron#/media/File:Example_of_NICs_in_Neutron.png

https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Neutron#/media/File:WMCS_eqiad1_network_topology.png

https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Neutron

network bonding/network teaming? multiple network switches

https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Neutron#/media/File:Example_of_NICs_in_Neutron.png

2019-11-05
https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProposals/Systems_and_Service_Continuity


 * Services we offer / stuff we have / dependency assessment
 * IaaS (CloudVPS)
 * PaaS (Toolforge)
 * DaaS (Wiki-replicas, toolsdb)
 * Others (LDAP, etc)


 * For each service we offer, what is the current status from the availability and continuity point of view. Identify SPOF.
 * IaaS
 * hardware level (NICs, switches, RAID storage, racks, disk backups? etc)
 * software level (openstack services in HA, which are not, provisioning/bootstrap, puppet etc)
 * PaaS
 * hardware level (this uses our own IaaS as hardware)
 * software level (grid, k8s, docker registry, services, NFS, and other Toolforge key components, puppet, etc)
 * DaaS
 * hardware level (this uses both our own IaaS as hardware and physical hardware)
 * software level (simple cold-standby setups, dbproxies, puppet, etc)
 * Others


 * For each service we offer, things to improve in both short term and long term. Do we need them? Are they cost-effective?
 * IaaS
 * hardware level:
 * storage (ceph)
 * NIC redundancy
 * Racking scheme (not everything in row B eqiad)
 * etc
 * software level:
 * glance in HA
 * neutron DVR (distributed virtual routing)
 * automatic bootstrapping / provisioning
 * etc
 * PaaS
 * hardware level:
 * automatic provisioning / bootstrapping
 * offline backups?
 * software level:
 * anything?
 * DaaS
 * hardware level:
 * etc
 * software level:
 * etc
 * Others

2019-10-27

 * tools-webservice
 * labmon migration
 * documentation (https://wikitech.wikimedia.org/wiki/Systems_and_Service_Continuity)
 * https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProposals reallocating page!
 * not ready for review yet
 * https://phabricator.wikimedia.org/T218461
 * cloud-cumin-01.cloudinfra.eqiad.wmflabs
 * https://tools.wmflabs.org/openstack-browser/project/cloudinfra

$ sudo cumin "project:tools" "apt-cache policy toollabs-webservice"

sudo cumin "O{project:tools name:tools-sgebastion-08}" "apt-cache policy toollabs-webservice"

aborrero@cloud-cumin-01:~$ sudo cumin "project:tools" "dpkg -s toollabs-webservice 2>/dev/null | grep install || true"

aborrero@cloud-cumin-01:~$ sudo cumin "project:tools" "dpkg -s toollabs-webservice 2>/dev/null | grep install || true && apt-get install toollabs-webservice -s"

Real installation:

aborrero@cloud-cumin-01:~$ sudo cumin "project:tools" "dpkg -s toollabs-webservice 2>/dev/null | grep install || true && apt-get install toollabs-webservice"
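
A note on the shape of the command above: in POSIX shell, && and || have equal precedence and group left to right, so "check || true && install" is read as "(check || true) && install", and the install step runs whether or not the check matched. The sketch below shows this with stand-in functions (is_installed and do_install are hypothetical placeholders, not real tools), plus one possible variant that installs only when the check succeeds. Whether that variant matches what the session intended is an assumption.

```shell
# Stand-ins for "dpkg -s ... | grep install" and "apt-get install ...".
is_installed() { return 1; }            # pretend the package is NOT installed
do_install()   { echo "install ran"; }

# Same shape as the cumin command: groups as (check || true) && install.
same_shape() { is_installed || true && do_install; }

# Variant: install only if the check succeeds, but never report failure.
only_if_installed() { { is_installed && do_install; } || true; }

same_shape          # prints "install ran"
only_if_installed   # prints nothing
```

The trailing "|| true" in the variant keeps the exit status at zero, so hosts without the package would not show up as failures in the cumin summary.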

2019-10-10

 * status of things:
 * working on reliability documentation
 * labmon project externally blocked https://phabricator.wikimedia.org/T224585

wmcs_puppet_tree_clean() {
    # discard local changes in the puppet checkout, then re-sync with upstream
    cd /var/lib/git/operations/puppet
    sudo git clean -fd
    sudo git checkout -f
    cd -
    sudo git-sync-upstream
}
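
The helper above runs every step regardless of whether the previous one worked. In particular, if the cd fails, git clean -fd would run in whatever directory the shell happens to be in. A hedged variant that stops at the first failure, assuming the same path and commands as above. The _safe suffix and the optional directory argument are additions for illustration only.

```shell
# Hypothetical variant of wmcs_puppet_tree_clean that stops at the first failure.
# The optional first argument (checkout path) is an addition for illustration.
wmcs_puppet_tree_clean_safe() {
    repo="${1:-/var/lib/git/operations/puppet}"
    (
        # the subshell keeps the caller's working directory untouched
        cd "$repo" || exit 1
        sudo git clean -fd && sudo git checkout -f
    ) && sudo git-sync-upstream
}
```

Because the cd happens inside a subshell, the "cd -" step of the original is no longer needed.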

https://wikitech.wikimedia.org/wiki/User:Arturo_Borrero_Gonzalez#wmf-export-puppet-patch.sh

2019-09-26

 * kubernetes ingress etc
 * Q2 goal labmon https://phabricator.wikimedia.org/T224585
 * some explanations of the servers
 * some puppet tree pointers

2019-08-08

 * multiple LDAP accounts: https://phabricator.wikimedia.org/T230126
 * https://wikitech.wikimedia.org/wiki/LDAP


 * not in the LDAP group?
 * cloud-wide root https://gerrit.wikimedia.org/r/admin/projects/labs/private
 * generate a patch to add a new SSH key (cloud VPS root)

https://wikitech.wikimedia.org/wiki/LDAP

https://gerrit.wikimedia.org/r/c/operations/puppet/+/519398

+2 verified +2 code-review
 * puppet workflow:

then the merge button will appear -> git-gerrit (not yet in infra)

https://gerrit.wikimedia.org/r/c/operations/puppet/+/519398

puppetmaster1001.eqiad.wmnet

sudo puppet-merge (fetch change from gerrit to puppet master)

hpham@puppetmaster1001:~$ sudo puppet-merge
Checking for pending merges in /labs/private
Fetching new commits from https://gerrit.wikimedia.org/r/labs/private
No changes to merge.
Fetching new commits from https://gerrit.wikimedia.org/r/operations/puppet
No changes to merge.

https://github.com/wikimedia/puppet/ (mirror)

https://github.com/wikimedia/puppet/tree/production/modules/role/manifests

https://wikitech.wikimedia.org/wiki/Puppet_coding

manifests hold the code; hiera supplies the configuration data

https://github.com/wikimedia/puppet/blob/production/manifests/site.pp


 * Possible initial tasks:
 * Set up tools-buster repository in aptly to allow toolforge servers to be installed on buster https://phabricator.wikimedia.org/T229237
 * WMCS: migrate python2 scripts to python3 https://phabricator.wikimedia.org/T229920
 * Migrate labmon* to Stretch (or Buster, better yet!) https://phabricator.wikimedia.org/T224585


 * Commit first patch to puppet

sudo easy_install pip

sudo pip install -U setuptools

pip install --user git-review

export PATH=$PATH:$HOME/Library/Python/2.7/bin

git clone "ssh://phamhi@gerrit.wikimedia.org:29418/operations/puppet" && scp -p -P 29418 phamhi@gerrit.wikimedia.org:hooks/commit-msg "puppet/.git/hooks/"
 * 1) clone with commit-msg hook
 * 2) https://gerrit.wikimedia.org/r/admin/projects/operations/puppet

git config --global --add gitreview.username "phamhi"

git config --global --add gitreview.email "hpham@wikimedia.org"

git review -s
 * 1) Creating a git remote called 'gerrit' that maps to:
 * 2)        ssh://phamhi@gerrit.wikimedia.org:29418/operations/puppet.git


 * 1) make the change

git commit -a # add comment

git review
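
The commit-msg hook copied in the clone step is what adds the Change-Id trailer that Gerrit uses to track a change across patch sets. A small sketch for checking that the trailer is present before running git review. The function name has_change_id is hypothetical, and the pattern assumes the usual form of an "I" followed by 40 hex digits.

```shell
# Hypothetical helper: succeed if the commit message on stdin has a Change-Id trailer.
has_change_id() {
    grep -Eq '^Change-Id: I[0-9a-f]{40}$'
}

# Usage inside the puppet checkout (needs git and at least one commit):
#   git log -1 --format=%B | has_change_id && echo "Change-Id present"
```

If the check fails, the hook was probably not installed when the commit was made; amending the commit after installing the hook should add the trailer.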