Wikimedia Cloud Services team/Onboarding Hieu/Sessions

To be persisted into: https://www.mediawiki.org/wiki/Wikimedia_Cloud_Services_team/Onboarding_Hieu/Sessions

2019-11-28
https://office.wikimedia.org/wiki/Pwstore

cumin1001.eqiad.wmnet

aborrero@cumin1001:~ $ sudo cookbook sre.hosts.downtime -r "hieu reimaging server" --hours 1 labmon1002*

aborrero@cumin1001:~ $ sudo install_console labmon1002.mgmt.eqiad.wmnet
 * or ssh root@labmon1002.mgmt.eqiad.wmnet

cd pw ../pwstore/pwd ed management

hpiLO-> vsp

Virtual Serial Port Active: COM2

Starting virtual serial port. Press 'ESC (' to return to the CLI Session.

Debian GNU/Linux 8 labmon1002 ttyS1

labmon1002 login:
 * merge patches (dns and puppet)

https://gerrit.wikimedia.org/r/c/operations/puppet/+/553441 https://gerrit.wikimedia.org/r/c/operations/dns/+/553467

ssh ns1.wikimedia.org

aborrero@authdns2001:~ $ sudo authdns-update

 * run puppet on install servers

aborrero@cumin1001:~ $ sudo cumin A:installserver run-puppet-agent

 * run the script:

aborrero@cumin1001:~ $ sudo -i wmf-auto-reimage-host --rename cloudmetrics1002.eqiad.wmnet --rename-mgmt cloudmetrics1002.mgmt.eqiad.wmnet -p T224585 labmon1002.eqiad.wmnet labmon1002.mgmt.eqiad.wmnet

phamhi@cumin1001:~$ sudo -i wmf-auto-reimage-host --rename cloudmetrics1002.eqiad.wmnet --rename-mgmt cloudmetrics1002.mgmt.eqiad.wmnet -p T224585 labmon1002.eqiad.wmnet
13:55:52 | labmon1002.eqiad.wmnet | REIMAGE START | To monitor the full log and cumin output:
sudo tail -F /var/log/wmf-auto-reimage/201911281355_phamhi_138148_labmon1002_eqiad_wmnet.log
sudo tail -F /var/log/wmf-auto-reimage/201911281355_phamhi_138148_labmon1002_eqiad_wmnet_cumin.out
IPMI Password:
Address lookup for cloudmetrics1002.mgmt.eqiad.wmnet failed
Could not open socket!
Error: Unable to establish IPMI v2 / RMCP+ session
13:56:00 | labmon1002.eqiad.wmnet | Unable to run wmf-auto-reimage-host: Remote IPMI failed for mgmt 'cloudmetrics1002.mgmt.eqiad.wmnet': Command '['ipmitool', '-I', 'lanplus', '-H', 'cloudmetrics1002.mgmt.eqiad.wmnet', '-U', 'root', '-E', 'chassis', 'power', 'status']' returned non-zero exit status 1
13:56:00 | labmon1002.eqiad.wmnet | REIMAGE END | retcode=2

phamhi@cumin1001:~$ sudo -i wmf-auto-reimage-host --rename cloudmetrics1002.eqiad.wmnet --rename-mgmt cloudmetrics1002.mgmt.eqiad.wmnet -p T224585 labmon1002.eqiad.wmnet labmon1002.mgmt.eqiad.wmnet
13:58:28 | labmon1002.eqiad.wmnet | REIMAGE START | To monitor the full log and cumin output:
sudo tail -F /var/log/wmf-auto-reimage/201911281358_phamhi_138804_labmon1002_eqiad_wmnet.log
sudo tail -F /var/log/wmf-auto-reimage/201911281358_phamhi_138804_labmon1002_eqiad_wmnet_cumin.out
IPMI Password:
Address lookup for cloudmetrics1002.mgmt.eqiad.wmnet failed
Could not open socket!
Error: Unable to establish IPMI v2 / RMCP+ session
13:58:38 | labmon1002.eqiad.wmnet | Unable to run wmf-auto-reimage-host: Remote IPMI failed for mgmt 'cloudmetrics1002.mgmt.eqiad.wmnet': Command '['ipmitool', '-I', 'lanplus', '-H', 'cloudmetrics1002.mgmt.eqiad.wmnet', '-U', 'root', '-E', 'chassis', 'power', 'status']' returned non-zero exit status 1
13:58:38 | labmon1002.eqiad.wmnet | REIMAGE END | retcode=2
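Both attempts stop at the same point: the log reports "Address lookup for cloudmetrics1002.mgmt.eqiad.wmnet failed", so the new mgmt name did not resolve from cumin1001 at that time. A minimal sketch of a pre-check that could be run before the reimage follows. The function name all_resolve and the RESOLVE_CMD variable are made up for this sketch, and using getent as the resolver is an assumption, not part of the reimage tooling.

```shell
# Resolver command; defaults to getent, overridable for testing (assumption of this sketch).
RESOLVE_CMD=${RESOLVE_CMD:-getent hosts}

# Returns 0 when every given hostname resolves; prints the ones that do not.
all_resolve() {
    missing=0
    for host in "$@"; do
        if ! $RESOLVE_CMD "$host" >/dev/null 2>&1; then
            echo "does not resolve: $host"
            missing=1
        fi
    done
    return "$missing"
}

# Example: check the new names before starting the reimage.
# all_resolve cloudmetrics1002.eqiad.wmnet cloudmetrics1002.mgmt.eqiad.wmnet && echo "ok to reimage"
```

If the check prints a name, the DNS change is probably not deployed yet (or not yet visible from that host), and the reimage would likely fail the same way as above.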

 * AFTER operations: clean up DNS entries
 * AFTER operations: clean up stale file in puppet:

hieradata/hosts/labmon1002.yaml

2019-11-27
 * files within puppet repo with the keyword "labmon"

hieradata/hosts/cloudservices1003.yaml
hieradata/hosts/cloudservices2002-dev.yaml
hieradata/hosts/cloudservices1004.yaml

hieradata/hosts/cloudcontrol1003.yaml
hieradata/hosts/cloudcontrol2001-dev.yaml
hieradata/hosts/cloudcontrol2003-dev.yaml
hieradata/hosts/cloudcontrol1004.yaml

hieradata/labs/cloudinfra/host/cloud-puppetmaster-01.yaml
hieradata/labs/cloudinfra/host/cloud-puppetmaster-02.yaml
hieradata/labs/cloudinfra/host/cloud-puppetmaster-03.yaml
hieradata/labs/cloudinfra/host/cloud-puppetmaster-04.yaml

modules/install_server/files/dhcpd/linux-host-entries.ttyS1-115200
hieradata/role/common/wmcs/monitoring.yaml
hieradata/common/profile/openstack/eqiad1.yaml

modules/install_server/files/autoinstall/preseed.cfg
modules/install_server/files/autoinstall/netboot.cfg

modules/profile/templates/cumin/aliases.yaml.erb
manifests/site.pp
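A list like the one above can be regenerated with a recursive grep. This is only a sketch: the helper name files_mentioning is made up here, and the notes do not say which command was actually used. Inside a git checkout, "git grep -l labmon" gives a similar result and skips the .git directory.

```shell
# Lists files under a directory whose contents mention a keyword, one path per line.
files_mentioning() {
    keyword=$1
    dir=$2
    grep -rl -- "$keyword" "$dir" | sort
}

# Example, run from the root of a puppet checkout:
# files_mentioning labmon .
```

Re-running the search after the rename patch is merged is a cheap way to spot references that were missed.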

labmon migration
https://phabricator.wikimedia.org/T224585
https://gerrit.wikimedia.org/r/c/operations/puppet/+/552107

labmon1001 (primary), labmon1002 (backup)

 * how to switch active to standby

(conftool) https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/role/common/cache/text.yaml

things to be backed up and restored
 * /var/lib/grafana (dashboard data is not in puppet)
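Since the dashboard data lives only on disk, one way to carry it across the reimage is a tarball taken before and unpacked after. The sketch below assumes plain tar is acceptable. The helper names backup_dir and restore_dir and any archive path are hypothetical, and whether grafana should be stopped first is not covered in these notes.

```shell
# Archives a directory into a gzip-compressed tarball, with paths relative to its parent.
backup_dir() {
    src=$1      # directory to back up, e.g. /var/lib/grafana
    archive=$2  # destination tarball (hypothetical path chosen by the operator)
    tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")"
}

# Unpacks such a tarball under the given parent directory.
restore_dir() {
    archive=$1
    parent=$2   # e.g. /var/lib
    tar -xzf "$archive" -C "$parent"
}
```

The archive has to be copied off the host before the reimage, since the reimage wipes the local disk.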


 * labmon name change -> cloudmetrics1001 & cloudmetrics1002


 * disable puppet agent on labmon
 * do 1002 (standby) first
 * shut down 1002
 * update puppet to change the hostname
 * turn 1002 back on to check that the hostname change is correct
 * (remember to update netbox)


 * reimage to buster
 * make 1002 the primary

https://gerrit.wikimedia.org/r/admin/projects/operations/dns (another puppet repository for DNS)

2019-11-14
https://wikitech.wikimedia.org/wiki/Incident_documentation

https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProposals/Systems_and_Service_Continuity

https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Neutron#/media/File:Example_of_NICs_in_Neutron.png

https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Neutron#/media/File:WMCS_eqiad1_network_topology.png

https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Neutron

network bonding/network teaming? multiple network switches

https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Neutron#/media/File:Example_of_NICs_in_Neutron.png

2019-11-05
https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProposals/Systems_and_Service_Continuity


 * Services we offer / stuff we have / dependency assessment
 * IaaS (CloudVPS)
 * PaaS (Toolforge)
 * DaaS (Wiki-replicas, toolsdb)
 * Others (LDAP, etc)


 * For each service we offer, what is the current status from the availability and continuity point of view. Identify SPOF.
 * IaaS
 * hardware level (NICs, switches, RAID storage, racks, disk backups? etc)
 * software level (openstack services in HA, which are not, provisioning/bootstrap, puppet etc)
 * PaaS
 * hardware level (this uses our own IaaS as hardware)
 * software level (grid, k8s, docker registry, services, NFS, and other Toolforge key components, puppet, etc)
 * DaaS
 * hardware level (this uses both our own IaaS as hardware and physical hardware)
 * software level (simple cold-standby setups, dbproxies, puppet, etc)
 * Others


 * For each service we offer, things to improve in both short term and long term. Do we need them? Are they cost-effective?
 * IaaS
 * hardware level:
 * storage (ceph)
 * NIC redundancy
 * Racking scheme (not everything in row B eqiad)
 * etc
 * software level:
 * glance in HA
 * neutron DVR (distributed virtual routing)
 * automatic bootstrapping / provisioning
 * etc
 * PaaS
 * hardware level:
 * automatic provisioning / bootstrapping
 * offline backups?
 * software level:
 * anything?
 * DaaS
 * hardware level:
 * etc
 * software level:
 * etc
 * Others

2019-10-27

 * tools-webservice
 * labmon migration
 * documentation (https://wikitech.wikimedia.org/wiki/Systems_and_Service_Continuity)
 * https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProposals reallocating page!
 * not ready for review yet
 * https://phabricator.wikimedia.org/T218461
 * cloud-cumin-01.cloudinfra.eqiad.wmflabs
 * https://tools.wmflabs.org/openstack-browser/project/cloudinfra

$ sudo cumin "project:tools" "apt-cache policy toollabs-webservice"

sudo cumin "O{project:tools name:tools-sgebastion-08}" "apt-cache policy toollabs-webservice"

aborrero@cloud-cumin-01:~$ sudo cumin "project:tools" "dpkg -s toollabs-webservice 2>/dev/null | grep install || true"

aborrero@cloud-cumin-01:~$ sudo cumin "project:tools" "dpkg -s toollabs-webservice 2>/dev/null | grep install || true && apt-get install toollabs-webservice -s"

Real installation:

aborrero@cloud-cumin-01:~$ sudo cumin "project:tools" "dpkg -s toollabs-webservice 2>/dev/null | grep install || true && apt-get install toollabs-webservice"
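One thing to keep in mind about the one-liners above: in the shell, "||" and "&&" have equal precedence and group left to right, so "check || true && install" runs the install step whether or not the check matched. The sketch below demonstrates this with stand-in functions. The names is_installed and do_install are stubs invented for this example, not real commands, and the guarded form is only a suggestion in case the intent was to touch hosts that already have the package.

```shell
# Stand-ins for the real commands (hypothetical stubs so the sketch runs anywhere).
is_installed() { return 1; }          # pretend the package is NOT installed
do_install()   { echo "install ran"; }

# Same shape as the one-liner in the notes: (check || true) && install
chained=$(is_installed || true && do_install)

# Guarded form: install only when the check succeeds
guarded=$(if is_installed; then do_install; fi)

echo "chained: ${chained:-<nothing>}"
echo "guarded: ${guarded:-<nothing>}"
```

With the stub reporting "not installed", the chained form still runs the install step, while the guarded form does nothing.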

2019-10-10

 * status of things:
 * working on reliability documentation
 * labmon project externally blocked https://phabricator.wikimedia.org/T224585

wmcs_puppet_tree_clean() {
    cd /var/lib/git/operations/puppet
    sudo git clean -fd
    sudo git checkout -f
    cd -
    sudo git-sync-upstream
}

https://wikitech.wikimedia.org/wiki/User:Arturo_Borrero_Gonzalez#wmf-export-puppet-patch.sh

2019-09-26

 * kubernetes ingress etc
 * Q2 goal labmon https://phabricator.wikimedia.org/T224585
 * some explanations of the servers
 * some puppet tree pointers

2019-08-08

 * multiple LDAP accounts: https://phabricator.wikimedia.org/T230126
 * https://wikitech.wikimedia.org/wiki/LDAP


 * not in the LDAP group?
 * cloud-wide root https://gerrit.wikimedia.org/r/admin/projects/labs/private
 * generate a patch to add a new SSH key (cloud VPS root)

https://wikitech.wikimedia.org/wiki/LDAP
https://gerrit.wikimedia.org/r/c/operations/puppet/+/519398

 * puppet workflow:

+2 verified, +2 code-review

then the merge button will appear -> git-gerrit (not yet in infra)

https://gerrit.wikimedia.org/r/c/operations/puppet/+/519398

puppetmaster1001.eqiad.wmnet

sudo puppet-merge (fetch change from gerrit to puppet master)

hpham@puppetmaster1001:~$ sudo puppet-merge
Checking for pending merges in /labs/private
Fetching new commits from https://gerrit.wikimedia.org/r/labs/private
No changes to merge.
Fetching new commits from https://gerrit.wikimedia.org/r/operations/puppet
No changes to merge.

https://github.com/wikimedia/puppet/ (mirror)

https://github.com/wikimedia/puppet/tree/production/modules/role/manifests

https://wikitech.wikimedia.org/wiki/Puppet_coding

manifests hold the code; hiera supplies the configuration data

https://github.com/wikimedia/puppet/blob/production/manifests/site.pp


 * Possible initial tasks:
 * Set up tools-buster repository in aptly to allow toolforge servers to be installed on buster https://phabricator.wikimedia.org/T229237
 * WMCS: migrate python2 scripts to python3 https://phabricator.wikimedia.org/T229920
 * Migrate labmon* to Stretch (or Buster, better yet!) https://phabricator.wikimedia.org/T224585


 * Commit first patch to puppet

sudo easy_install pip
sudo pip install -U setuptools
pip install --user git-review
export PATH=$PATH:$HOME/Library/Python/2.7/bin

 * clone with commit-msg hook (see https://gerrit.wikimedia.org/r/admin/projects/operations/puppet)

git clone "ssh://phamhi@gerrit.wikimedia.org:29418/operations/puppet" && scp -p -P 29418 phamhi@gerrit.wikimedia.org:hooks/commit-msg "puppet/.git/hooks/"
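Gerrit normally expects a Change-Id footer in each commit message, and the commit-msg hook copied by the scp step is what adds it. A quick check that the hook actually landed in the clone can save a rejected push later. This is a sketch only: the helper name has_commit_msg_hook is made up, and the "puppet" directory is the one created by the clone command above.

```shell
# Returns 0 when the Gerrit commit-msg hook exists and is executable in the given clone.
has_commit_msg_hook() {
    repo_dir=$1
    [ -x "$repo_dir/.git/hooks/commit-msg" ]
}

# Usage: check the clone made above.
if has_commit_msg_hook puppet; then
    echo "commit-msg hook installed"
else
    echo "commit-msg hook missing: re-run the scp step"
fi
```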

git config --global --add gitreview.username "phamhi"
git config --global --add gitreview.email "hpham@wikimedia.org"

git review -s
Creating a git remote called 'gerrit' that maps to:
       ssh://phamhi@gerrit.wikimedia.org:29418/operations/puppet.git


 * make the change

git commit -a # add a commit message

git review