Wikimedia Cloud Services team/Onboarding Hieu/Sessions

From mediawiki.org

2019-12-12

modules/puppetmaster/files/production.hiera.yaml


$data = lookup(whatever)
$data = "other" <--- fails!! (a Puppet variable cannot be reassigned in the same scope)

modules/puppetmaster/files/labs.hiera.yaml

https://gerrit.wikimedia.org/r/admin/projects/cloud/instance-puppet


modules/profile/manifests/wmcs/monitoring.pp


2019-12-05

gerrit:554844

gerrit:554853 (use var from hiera)

hieradata/labs.yaml



git grep profile::mediawiki::scap_client
git grep profile::mediawiki::common

https://integration.wikimedia.org/ci/view/operations/job/operations-puppet-catalog-compiler/

Catalog ran successfully: https://puppet-compiler.wmflabs.org/compiler1003/19814/mwmaint1002.eqiad.wmnet/

mwmaint1002.eqiad.wmnet

hieradata/hosts/cloudservices1003.yaml:

# prometheus-pdns-exporter is scraped by labmons
# prometheus-node-exporter by prod servers
prometheus_nodes:
   - labmon1001.eqiad.wmnet
   - cloudmetrics1002.eqiad.wmnet
   - prometheus1003.eqiad.wmnet
   - prometheus1004.eqiad.wmnet


2019-11-28

https://office.wikimedia.org/wiki/Pwstore

cumin1001.eqiad.wmnet

aborrero@cumin1001:~ $ sudo cookbook sre.hosts.downtime -r "hieu reimaging server" --hours 1 labmon1002*

aborrero@cumin1001:~ $ sudo install_console labmon1002.mgmt.eqiad.wmnet

# or: ssh root@labmon1002.mgmt.eqiad.wmnet



cd pw
../pwstore/pwd ed management


</>hpiLO-> vsp

Virtual Serial Port Active: COM2

Starting virtual serial port. Press 'ESC (' to return to the CLI Session.


Debian GNU/Linux 8 labmon1002 ttyS1

labmon1002 login:

  • merge patches (dns and puppet)

https://gerrit.wikimedia.org/r/c/operations/puppet/+/553441
https://gerrit.wikimedia.org/r/c/operations/dns/+/553467

ssh ns1.wikimedia.org
aborrero@authdns2001:~ $ sudo authdns-update

  • run puppet on install servers

aborrero@cumin1001:~ $ sudo cumin A:installserver run-puppet-agent

  • run the script:
phamhi@cumin1001:~$ sudo -i wmf-auto-reimage-host --rename cloudmetrics1002.eqiad.wmnet --rename-mgmt cloudmetrics1002.mgmt.eqiad.wmnet  -p T224585 labmon1002.eqiad.wmnet labmon1002.mgmt.eqiad.wmnet
15:59:02 | labmon1002.eqiad.wmnet | REIMAGE START | To monitor the full log and cumin output:
sudo tail -F /var/log/wmf-auto-reimage/201911281559_phamhi_182318_labmon1002_eqiad_wmnet.log
sudo tail -F /var/log/wmf-auto-reimage/201911281559_phamhi_182318_labmon1002_eqiad_wmnet_cumin.out
IPMI Password: 
15:59:12 | labmon1002.eqiad.wmnet | Validated host
15:59:13 | labmon1002.eqiad.wmnet | Downtimed on Icinga
15:59:18 | labmon1002.eqiad.wmnet | Removed from Puppet
15:59:18 | labmon1002.eqiad.wmnet | Removed from Debmonitor
15:59:18 | labmon1002.eqiad.wmnet | Set Boot Device to pxe
15:59:18 | labmon1002.eqiad.wmnet | Power cycling
15:59:18 | labmon1002.eqiad.wmnet | Chassis Power Control: Cycle
16:03:27 | cloudmetrics1002.eqiad.wmnet | Still waiting for reboot after 5.0 minutes
16:03:29 | cloudmetrics1002.eqiad.wmnet | Uptime checked
16:03:29 | cloudmetrics1002.eqiad.wmnet | Host up (Debian installer)
16:08:08 | cloudmetrics1002.eqiad.wmnet | Still waiting for reboot after 5.0 minutes
16:13:23 | cloudmetrics1002.eqiad.wmnet | Still waiting for reboot after 10.0 minutes
16:13:25 | cloudmetrics1002.eqiad.wmnet | Uptime checked
16:13:25 | cloudmetrics1002.eqiad.wmnet | Host up
16:13:33 | cloudmetrics1002.eqiad.wmnet | Puppet CSR generated, fingerprint is: 06:02:32:2F:0E:80:B8:CA:8E:74:34:9B:63:EA:94:41:EF:B3:0E:B3:DF:D1:4B:84:F4:B3:73:66:B9:78:16:D5
16:13:33 | cloudmetrics1002.eqiad.wmnet | Polling until a Puppet sign request appears
16:13:37 | cloudmetrics1002.eqiad.wmnet | Signed Puppet cert
16:13:39 | cloudmetrics1002.eqiad.wmnet | Validated host
16:13:39 | cloudmetrics1002.eqiad.wmnet | Scheduled delayed downtime on Icinga
16:13:41 | cloudmetrics1002.eqiad.wmnet | Started first puppet run (sit back, relax, and enjoy the wait)
START - Cookbook sre.hosts.downtime
Forcing a Puppet run on the Icinga server
Running Puppet with args --quiet --attempts 30 on 1 hosts: icinga1001.wikimedia.org
Downtiming 1 hosts and all their services for 2:00:00: cloudmetrics1002.eqiad.wmnet
Scheduling downtime on Icinga server icinga1001.wikimedia.org for hosts: cloudmetrics1002.eqiad.wmnet
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
16:20:24 | cloudmetrics1002.eqiad.wmnet | First Puppet run completed
16:20:25 | cloudmetrics1002.eqiad.wmnet | WARNING: unable to verify that BIOS boot parameters are back to normal, got:
Boot parameter version: 1
Boot parameter 5 is valid/unlocked
Boot parameter data: 0004000000
 Boot Flags :
   - Boot Flag Invalid
   - Options apply to only next boot
   - BIOS PC Compatible (legacy) boot 
   - Boot Device Selector : Force PXE
   - Console Redirection control : System Default
   - BIOS verbosity : Console redirection occurs per BIOS configuration setting (default)
   - BIOS Mux Control Override : BIOS uses recommended setting of the mux at the end of POST

16:20:50 | cumin1001.eqiad.wmnet | Puppet run completed
16:20:51 | cloudmetrics1002.eqiad.wmnet | Rebooted host
16:24:00 | cloudmetrics1002.eqiad.wmnet | Uptime checked
16:24:00 | cloudmetrics1002.eqiad.wmnet | Host up
16:24:00 | cloudmetrics1002.eqiad.wmnet | Polling the completion of a Puppet run
16:26:04 | cloudmetrics1002.eqiad.wmnet | Puppet run checked
16:26:04 | cloudmetrics1002.eqiad.wmnet | Reimage completed
16:26:04 | cloudmetrics1002.eqiad.wmnet | REIMAGE END | retcode=0
  • AFTER operations: clean up DNS entries
  • AFTER operations: clean up the stale file in puppet:
   hieradata/hosts/labmon1002.yaml
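
The total duration of the reimage can be read off the REIMAGE START and REIMAGE END timestamps in the log above. A minimal sketch of that arithmetic follows; the helper names `to_seconds` and `elapsed_seconds` are made up for illustration, and both timestamps are assumed to fall on the same day.

```shell
# Hypothetical helper: convert HH:MM:SS to seconds since midnight.
# awk is used so that values such as "09" are not parsed as octal.
to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Hypothetical helper: seconds between two same-day timestamps.
elapsed_seconds() {
    echo $(( $(to_seconds "$2") - $(to_seconds "$1") ))
}

# Timestamps copied from the REIMAGE START and REIMAGE END lines of the log.
elapsed_seconds 15:59:02 16:26:04   # prints 1622, i.e. about 27 minutes
```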
   

2019-11-27

  • files within puppet repo with the keyword "labmon"

hieradata/hosts/cloudservices1003.yaml
hieradata/hosts/cloudservices2002-dev.yaml
hieradata/hosts/cloudservices1004.yaml

hieradata/hosts/cloudcontrol1003.yaml
hieradata/hosts/cloudcontrol2001-dev.yaml
hieradata/hosts/cloudcontrol2003-dev.yaml
hieradata/hosts/cloudcontrol1004.yaml

hieradata/labs/cloudinfra/host/cloud-puppetmaster-01.yaml
hieradata/labs/cloudinfra/host/cloud-puppetmaster-02.yaml
hieradata/labs/cloudinfra/host/cloud-puppetmaster-03.yaml
hieradata/labs/cloudinfra/host/cloud-puppetmaster-04.yaml

modules/install_server/files/dhcpd/linux-host-entries.ttyS1-115200
hieradata/role/common/wmcs/monitoring.yaml
hieradata/common/profile/openstack/eqiad1.yaml

modules/install_server/files/autoinstall/preseed.cfg
modules/install_server/files/autoinstall/netboot.cfg

modules/profile/templates/cumin/aliases.yaml.erb
manifests/site.pp
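
The notes do not record which command produced the file list above. One way to regenerate such a list is sketched below. The helper name `files_mentioning` is made up for illustration, and the sketch uses plain `find` and `grep` so it also works outside a git checkout. Inside the puppet checkout, `git grep -l labmon` should give an equivalent list of tracked files.

```shell
# Hypothetical helper: list files under a directory that mention a keyword.
# Paths are printed relative to that directory and sorted for stable output.
files_mentioning() {
    keyword=$1
    dir=${2:-.}
    # Skip the .git directory, then let grep print only the names (-l)
    # of files that contain the keyword, ignoring case (-i).
    ( cd "$dir" &&
      find . -path ./.git -prune -o -type f -exec grep -li -e "$keyword" {} + |
      sed 's|^\./||' | sort )
}

# Example (run from anywhere, pointing at a checkout of the puppet repo):
#   files_mentioning labmon /path/to/operations/puppet
```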


2019-11-21

labmon migration

https://phabricator.wikimedia.org/T224585
https://gerrit.wikimedia.org/r/c/operations/puppet/+/552107

labmon1001 (primary) labmon1002 (backup)

  • how to switch active to standby

(conftool) https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/role/common/cache/text.yaml

things to be backed up and restored

  • /var/lib/grafana (dashboard data is not in puppet)
    • labmon name change -> cloudmetrics1001 & cloudmetrics1002
  • disable puppet agent on labmon
  • do 1002 (standby) first
    • shutdown 1002
    • update puppet with the new hostname
    • turn 1002 back on to check that the hostname change is correct
    • (remember to update netbox)
    • reimage to buster
    • make 1002 the primary

https://gerrit.wikimedia.org/r/admin/projects/operations/dns (a separate repository, for DNS)


2019-11-14

https://wikitech.wikimedia.org/wiki/Incident_documentation

https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProposals/Systems_and_Service_Continuity

https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Neutron#/media/File:Example_of_NICs_in_Neutron.png

https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Neutron#/media/File:WMCS_eqiad1_network_topology.png

https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Neutron

network bonding/network teaming? multiple network switches

2019-11-05

https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProposals/Systems_and_Service_Continuity

  • Services we offer / stuff we have / dependency assessment
    • IaaS (CloudVPS)
    • PaaS (Toolforge)
    • DaaS (Wiki-replicas, toolsdb)
    • Others (LDAP, etc)
  • For each service we offer, what is the current status from the availability and continuity point of view? Identify SPOFs.
    • IaaS
      • hardware level (NICs, switches, RAID storage, racks, disk backups? etc)
      • software level (openstack services in HA, which are not, provisioning/bootstrap, puppet etc)
    • PaaS
      • hardware level (this uses our own IaaS as hardware)
      • software level (grid, k8s, docker registry, services, NFS, and other Toolforge key components, puppet, etc)
    • DaaS
      • hardware level (this uses both our own IaaS as hardware and physical hardware)
      • software level (simple cold-standby setups, dbproxies, puppet, etc)
    • Others
  • For each service we offer, things to improve in both short term and long term. Do we need them? Are they cost-effective?
    • IaaS
      • hardware level:
        • storage (ceph)
        • NIC redundancy
        • Racking scheme (not everything in row B eqiad)
        • etc
      • software level:
        • glance in HA
        • neutron DVR (distributed virtual routing)
        • automatic bootstrapping / provisioning
        • etc
    • PaaS
      • hardware level:
        • automatic provisioning / bootstrapping
        • offline backups?
      • software level:
        • anything?
    • DaaS
      • hardware level:
        • etc
      • software level:
        • etc
    • Others

2019-10-27


$ sudo cumin "project:tools" "apt-cache policy toollabs-webservice"

sudo cumin "O{project:tools name:tools-sgebastion-08}" "apt-cache policy toollabs-webservice"


aborrero@cloud-cumin-01:~$ sudo cumin "project:tools" "dpkg -s toollabs-webservice 2>/dev/null | grep install || true"

aborrero@cloud-cumin-01:~$ sudo cumin "project:tools" "dpkg -s toollabs-webservice 2>/dev/null | grep install || true && apt-get install toollabs-webservice -s"

Real installation:

aborrero@cloud-cumin-01:~$ sudo cumin "project:tools" "dpkg -s toollabs-webservice 2>/dev/null | grep install || true && apt-get install toollabs-webservice"
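
One detail of the one-liners above: in the shell, `||` and `&&` have equal precedence and are evaluated left to right, so `CHECK || true && ACTION` runs ACTION whether or not CHECK succeeded. The sketch below shows this with stand-in functions (all names are made up for illustration). The `guarded` variant is only relevant if the intent was to run the action solely on hosts where the check passes, which is an assumption.

```shell
# Hypothetical stand-ins for "package is installed" / "package is missing".
pkg_present() { return 0; }
pkg_missing() { return 1; }

# Same shape as the cumin one-liner: CHECK || true && ACTION.
# Parsed as (CHECK || true) && ACTION, so ACTION always runs.
as_written() { "$1" || true && echo "action ran"; }

# Grouped so that ACTION only runs when CHECK succeeds,
# while the overall exit status is still 0 on every host.
guarded() { { "$1" && echo "action ran"; } || true; }

as_written pkg_missing   # prints "action ran"
guarded pkg_missing      # prints nothing
guarded pkg_present      # prints "action ran"
```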


2019-10-10

wmcs_puppet_tree_clean() {
       # discard local changes in the puppet checkout, then re-sync with upstream
       cd /var/lib/git/operations/puppet || return 1
       sudo git clean -fd
       sudo git checkout -f
       cd -
       sudo git-sync-upstream
}

https://wikitech.wikimedia.org/wiki/User:Arturo_Borrero_Gonzalez#wmf-export-puppet-patch.sh

2019-09-26


2019-08-08


https://wikitech.wikimedia.org/wiki/LDAP
https://gerrit.wikimedia.org/r/c/operations/puppet/+/519398


  • puppet workflow:
    • +2 verified
    • +2 code-review
    • then the merge button will appear -> git-gerrit (not yet in infra)

https://gerrit.wikimedia.org/r/c/operations/puppet/+/519398

puppetmaster1001.eqiad.wmnet

sudo puppet-merge (fetch change from gerrit to puppet master)

hpham@puppetmaster1001:~$ sudo puppet-merge
Checking for pending merges in /labs/private
Fetching new commits from https://gerrit.wikimedia.org/r/labs/private
No changes to merge.
Fetching new commits from https://gerrit.wikimedia.org/r/operations/puppet
No changes to merge.

https://github.com/wikimedia/puppet/ (mirror)

https://github.com/wikimedia/puppet/tree/production/modules/role/manifests

https://wikitech.wikimedia.org/wiki/Puppet_coding


manifests (code) - hiera pulls in the configuration data

https://github.com/wikimedia/puppet/blob/production/manifests/site.pp



  • Commit first patch to puppet

sudo easy_install pip
sudo pip install -U setuptools

pip install --user git-review

export PATH=$PATH:$HOME/Library/Python/2.7/bin

# clone with commit-msg hook
# https://gerrit.wikimedia.org/r/admin/projects/operations/puppet

git clone "ssh://phamhi@gerrit.wikimedia.org:29418/operations/puppet" && scp -p -P 29418 phamhi@gerrit.wikimedia.org:hooks/commit-msg "puppet/.git/hooks/"

git config --global --add gitreview.username "phamhi"
git config --global --add gitreview.email "hpham@wikimedia.org"

git review -s

# Creating a git remote called 'gerrit' that maps to:
# ssh://phamhi@gerrit.wikimedia.org:29418/operations/puppet.git

# make the change

git commit -a # add comment

git review
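
The commit-msg hook copied during the clone step is what adds a `Change-Id:` footer to each commit message, which Gerrit uses to track a change across patch sets. A quick way to check that the hook ran before calling `git review` is sketched below. The helper name `has_change_id` is made up for illustration and is not part of git-review.

```shell
# Hypothetical helper: read a commit message on stdin and succeed
# only if it contains a Gerrit-style Change-Id footer
# ("I" followed by 40 lowercase hex digits).
has_change_id() {
    grep -Eq '^Change-Id: I[0-9a-f]{40}$'
}

# Example, run inside the clone after committing:
#   git log -1 --format=%B | has_change_id && echo "Change-Id present"
```

If the footer is missing, the hook was probably not installed in `.git/hooks/` when the commit was made. Amending the commit after installing the hook should add it.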