Wikimedia Cloud Services team/Onboarding Hieu/Sessions


To be persisted into: https://www.mediawiki.org/wiki/Wikimedia_Cloud_Services_team/Onboarding_Hieu/Sessions

2019-11-28

https://office.wikimedia.org/wiki/Pwstore

cumin1001.eqiad.wmnet

aborrero@cumin1001:~ $ sudo cookbook sre.hosts.downtime -r "hieu reimaging server" --hours 1 labmon1002*

aborrero@cumin1001:~ $ sudo install_console labmon1002.mgmt.eqiad.wmnet

# or: ssh root@labmon1002.mgmt.eqiad.wmnet



cd pw ../pwstore/pwd ed management


</>hpiLO-> vsp

Virtual Serial Port Active: COM2

Starting virtual serial port. Press 'ESC (' to return to the CLI Session.


Debian GNU/Linux 8 labmon1002 ttyS1

labmon1002 login:

  • merge patches (dns and puppet)

https://gerrit.wikimedia.org/r/c/operations/puppet/+/553441 https://gerrit.wikimedia.org/r/c/operations/dns/+/553467

ssh ns1.wikimedia.org

aborrero@authdns2001:~ $ sudo authdns-update

  • run puppet on install servers

aborrero@cumin1001:~ $ sudo cumin A:installserver run-puppet-agent

  • run the script:

aborrero@cumin1001:~ $ sudo -i wmf-auto-reimage-host --rename cloudmetrics1002.eqiad.wmnet --rename-mgmt cloudmetrics1002.mgmt.eqiad.wmnet -p T224585 labmon1002.eqiad.wmnet labmon1002.mgmt.eqiad.wmnet


phamhi@cumin1001:~$ sudo -i wmf-auto-reimage-host --rename cloudmetrics1002.eqiad.wmnet --rename-mgmt cloudmetrics1002.mgmt.eqiad.wmnet -p T224585 labmon1002.eqiad.wmnet
13:55:52 | labmon1002.eqiad.wmnet | REIMAGE START | To monitor the full log and cumin output:
sudo tail -F /var/log/wmf-auto-reimage/201911281355_phamhi_138148_labmon1002_eqiad_wmnet.log
sudo tail -F /var/log/wmf-auto-reimage/201911281355_phamhi_138148_labmon1002_eqiad_wmnet_cumin.out
IPMI Password:
Address lookup for cloudmetrics1002.mgmt.eqiad.wmnet failed
Could not open socket!
Error: Unable to establish IPMI v2 / RMCP+ session
13:56:00 | labmon1002.eqiad.wmnet | Unable to run wmf-auto-reimage-host: Remote IPMI failed for mgmt 'cloudmetrics1002.mgmt.eqiad.wmnet': Command '['ipmitool', '-I', 'lanplus', '-H', 'cloudmetrics1002.mgmt.eqiad.wmnet', '-U', 'root', '-E', 'chassis', 'power', 'status']' returned non-zero exit status 1
13:56:00 | labmon1002.eqiad.wmnet | REIMAGE END | retcode=2


phamhi@cumin1001:~$ sudo -i wmf-auto-reimage-host --rename cloudmetrics1002.eqiad.wmnet --rename-mgmt cloudmetrics1002.mgmt.eqiad.wmnet -p T224585 labmon1002.eqiad.wmnet labmon1002.mgmt.eqiad.wmnet
13:58:28 | labmon1002.eqiad.wmnet | REIMAGE START | To monitor the full log and cumin output:
sudo tail -F /var/log/wmf-auto-reimage/201911281358_phamhi_138804_labmon1002_eqiad_wmnet.log
sudo tail -F /var/log/wmf-auto-reimage/201911281358_phamhi_138804_labmon1002_eqiad_wmnet_cumin.out
IPMI Password:
Address lookup for cloudmetrics1002.mgmt.eqiad.wmnet failed
Could not open socket!
Error: Unable to establish IPMI v2 / RMCP+ session
13:58:38 | labmon1002.eqiad.wmnet | Unable to run wmf-auto-reimage-host: Remote IPMI failed for mgmt 'cloudmetrics1002.mgmt.eqiad.wmnet': Command '['ipmitool', '-I', 'lanplus', '-H', 'cloudmetrics1002.mgmt.eqiad.wmnet', '-U', 'root', '-E', 'chassis', 'power', 'status']' returned non-zero exit status 1
13:58:38 | labmon1002.eqiad.wmnet | REIMAGE END | retcode=2
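Both attempts stop at the same point: the address lookup for the new management name fails before ipmitool can open a session. A minimal pre-flight sketch for checking name resolution before launching the reimage, assuming `getent` is available on the host where the command is run. `names_resolve` is a hypothetical helper and not part of wmf-auto-reimage-host.

```shell
# Hypothetical helper: report which of the given host names do not resolve
# through the system resolver. Succeeds only if every name resolves.
names_resolve() (
    failed=0
    for name in "$@"; do
        if ! getent hosts "$name" >/dev/null 2>&1; then
            echo "does not resolve: $name" >&2
            failed=1
        fi
    done
    [ "$failed" -eq 0 ]
)

# Example (not run here):
#   names_resolve cloudmetrics1002.eqiad.wmnet cloudmetrics1002.mgmt.eqiad.wmnet \
#       && echo "names resolve, DNS looks ready"
```

The function body runs in a subshell, so its variables do not leak into the calling shell.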



  • AFTER operations: clean up DNS entries
  • AFTER operations: clean up the stale file in puppet: hieradata/hosts/labmon1002.yaml

2019-11-27

  • files within puppet repo with the keyword "labmon"
    • hieradata/hosts/cloudservices1003.yaml
    • hieradata/hosts/cloudservices2002-dev.yaml
    • hieradata/hosts/cloudservices1004.yaml
    • hieradata/hosts/cloudcontrol1003.yaml
    • hieradata/hosts/cloudcontrol2001-dev.yaml
    • hieradata/hosts/cloudcontrol2003-dev.yaml
    • hieradata/hosts/cloudcontrol1004.yaml
    • hieradata/labs/cloudinfra/host/cloud-puppetmaster-01.yaml
    • hieradata/labs/cloudinfra/host/cloud-puppetmaster-02.yaml
    • hieradata/labs/cloudinfra/host/cloud-puppetmaster-03.yaml
    • hieradata/labs/cloudinfra/host/cloud-puppetmaster-04.yaml
    • modules/install_server/files/dhcpd/linux-host-entries.ttyS1-115200
    • hieradata/role/common/wmcs/monitoring.yaml
    • hieradata/common/profile/openstack/eqiad1.yaml
    • modules/install_server/files/autoinstall/preseed.cfg
    • modules/install_server/files/autoinstall/netboot.cfg
    • modules/profile/templates/cumin/aliases.yaml.erb
    • manifests/site.pp
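A list like the one above can be regenerated from a checkout of the repository. This is a sketch only, assuming a grep that supports `-r` and `--exclude-dir` (GNU grep does). `files_mentioning` is a hypothetical helper name.

```shell
# Hypothetical helper: list files under a directory that contain a keyword,
# one relative path per line, sorted, skipping the .git directory.
files_mentioning() (
    keyword=$1
    dir=${2:-.}
    cd "$dir" || exit 1
    grep -rl --exclude-dir=.git -e "$keyword" . | sed 's|^\./||' | sort
)

# Example (not run here), from the top of a puppet checkout:
#   files_mentioning labmon .
```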


2019-11-21

labmon migration

https://phabricator.wikimedia.org/T224585

https://gerrit.wikimedia.org/r/c/operations/puppet/+/552107

labmon1001 (primary), labmon1002 (backup)

  • how to switch active to standby

(conftool) https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/role/common/cache/text.yaml

things to be backed up and restored

  • /var/lib/grafana (dashboard data is not in puppet)
    • labmon name change -> cloudmetrics1001 & cloudmetrics1002
  • disable puppet agent on labmon
  • do 1002 (standby) first
    • shutdown 1002
    • change puppet to change hostname
    • turn back on 1002 to ensure hostname change is correct
    • (remember to update netbox)
    • reimage to buster
    • make 1002 the primary
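Since /var/lib/grafana is not managed by puppet, it has to be saved before the reimage and restored afterwards. A minimal sketch using tar; both function names are hypothetical, and details such as stopping the service first or where the archive is stored are left out.

```shell
# Hypothetical helper: archive a directory into a gzip-compressed tarball.
backup_dir() (
    src=$1        # directory to save, e.g. /var/lib/grafana
    archive=$2    # destination tarball; keep it off the host being reimaged
    tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")"
)

# Hypothetical counterpart: unpack the tarball under a parent directory.
restore_dir() (
    archive=$1
    parent=$2     # e.g. /var/lib
    tar -xzf "$archive" -C "$parent"
)

# Example (not run here):
#   backup_dir /var/lib/grafana /some/safe/place/grafana.tar.gz
#   restore_dir /some/safe/place/grafana.tar.gz /var/lib
```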

https://gerrit.wikimedia.org/r/admin/projects/operations/dns (another puppet repository for DNS)


2019-11-14

https://wikitech.wikimedia.org/wiki/Incident_documentation

https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProposals/Systems_and_Service_Continuity

https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Neutron#/media/File:Example_of_NICs_in_Neutron.png

https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Neutron#/media/File:WMCS_eqiad1_network_topology.png

https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Neutron

network bonding/network teaming? multiple network switches

https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Neutron#/media/File:Example_of_NICs_in_Neutron.png

2019-11-05

https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProposals/Systems_and_Service_Continuity

  • Services we offer / stuff we have / dependency assessment
    • IaaS (CloudVPS)
    • PaaS (Toolforge)
    • DaaS (Wiki-replicas, toolsdb)
    • Others (LDAP, etc)
  • For each service we offer, what is the current status from the availability and continuity point of view. Identify SPOF.
    • IaaS
      • hardware level (NICs, switches, RAID storage, racks, disk backups? etc)
      • software level (openstack services in HA, which are not, provisioning/bootstrap, puppet etc)
    • PaaS
      • hardware level (this uses our own IaaS as hardware)
      • software level (grid, k8s, docker registry, services, NFS, and other Toolforge key components, puppet, etc)
    • DaaS
      • hardware level (this uses both our own IaaS as hardware and physical hardware)
      • software level (simple cold-standby setups, dbproxies, puppet, etc)
    • Others
  • For each service we offer, things to improve in both short term and long term. Do we need them? Are they cost-effective?
    • IaaS
      • hardware level:
        • storage (ceph)
        • NIC redundancy
        • Racking scheme (not everything in row B eqiad)
        • etc
      • software level:
        • glance in HA
        • neutron DVR (distributed virtual routing)
        • automatic bootstrapping / provisioning
        • etc
    • PaaS
      • hardware level:
        • automatic provisioning / bootstrapping
        • offline backups?
      • software level:
        • anything?
    • DaaS
      • hardware level:
        • etc
      • software level:
        • etc
    • Others

2019-10-27


$ sudo cumin "project:tools" "apt-cache policy toollabs-webservice"

sudo cumin "O{project:tools name:tools-sgebastion-08}" "apt-cache policy toollabs-webservice"


aborrero@cloud-cumin-01:~$ sudo cumin "project:tools" "dpkg -s toollabs-webservice 2>/dev/null | grep install || true"

aborrero@cloud-cumin-01:~$ sudo cumin "project:tools" "dpkg -s toollabs-webservice 2>/dev/null | grep install || true && apt-get install toollabs-webservice -s"

Real installation:

aborrero@cloud-cumin-01:~$ sudo cumin "project:tools" "dpkg -s toollabs-webservice 2>/dev/null | grep install || true && apt-get install toollabs-webservice"
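One detail of these commands is worth checking. In POSIX shells `&&` and `||` have equal precedence and group left to right, so `A || true && B` is read as `(A || true) && B`, and B runs whether or not the grep matched. If the intent is to act only on hosts where the package is already installed, it has to be written differently. A small sketch of the difference, using hypothetical stand-in functions in place of dpkg and apt-get:

```shell
# Stand-ins (hypothetical) for the real commands, so this is safe to run anywhere.
pkg_installed() { return "$1"; }     # 0 = "grep matched", 1 = "no match"
install_pkg()   { echo "install ran"; }

# Same shape as the cumin command: groups as (A || true) && B.
as_written() { pkg_installed "$1" || true && install_pkg; }

# Runs B only when A succeeded, and still exits 0 when A failed.
only_if_installed() {
    if pkg_installed "$1"; then
        install_pkg
    fi
}
```

With these definitions, `as_written 1` still prints "install ran", while `only_if_installed 1` prints nothing and returns 0.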


2019-10-10

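wmcs_puppet_tree_clean() {
       # stop if the checkout is missing, so git clean never runs elsewhere
       cd /var/lib/git/operations/puppet || return 1
       # drop untracked files and directories, then discard local modifications
       sudo git clean -fd
       sudo git checkout -f
       cd -
       # then re-sync the checkout with upstream
       sudo git-sync-upstream
}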
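Because `git clean -fd` deletes untracked files for good, it can help to preview the result first. A minimal sketch using git's dry-run flag `-n`; `preview_clean` is a hypothetical helper name.

```shell
# Hypothetical helper: show what `git clean -fd` would delete in the checkout
# given as $1, without deleting anything.
preview_clean() {
    git -C "$1" clean -nd
}

# Example (not run here):
#   preview_clean /var/lib/git/operations/puppet
```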

https://wikitech.wikimedia.org/wiki/User:Arturo_Borrero_Gonzalez#wmf-export-puppet-patch.sh

2019-09-26


2019-08-08


https://wikitech.wikimedia.org/wiki/LDAP

https://gerrit.wikimedia.org/r/c/operations/puppet/+/519398


  • puppet workflow:
    • +2 Verified
    • +2 Code-Review
    • then the merge button will appear -> git-gerrit (not yet in infra)

https://gerrit.wikimedia.org/r/c/operations/puppet/+/519398

puppetmaster1001.eqiad.wmnet

sudo puppet-merge (fetch change from gerrit to puppet master)

hpham@puppetmaster1001:~$ sudo puppet-merge
Checking for pending merges in /labs/private
Fetching new commits from https://gerrit.wikimedia.org/r/labs/private
No changes to merge.
Fetching new commits from https://gerrit.wikimedia.org/r/operations/puppet
No changes to merge.

https://github.com/wikimedia/puppet/ (mirror)

lo https://github.com/wikimedia/puppet/tree/production/modules/role/manifests

https://wikitech.wikimedia.org/wiki/Puppet_coding


manifests hold the code; hiera supplies the configuration data

https://github.com/wikimedia/puppet/blob/production/manifests/site.pp



  • Commit first patch to puppet

sudo easy_install pip

sudo pip install -U setuptools

pip install --user git-review

export PATH=$PATH:$HOME/Library/Python/2.7/bin

# clone with commit-msg hook
# https://gerrit.wikimedia.org/r/admin/projects/operations/puppet

git clone "ssh://phamhi@gerrit.wikimedia.org:29418/operations/puppet" && scp -p -P 29418 phamhi@gerrit.wikimedia.org:hooks/commit-msg "puppet/.git/hooks/"

git config --global --add gitreview.username "phamhi"

git config --global --add gitreview.email "hpham@wikimedia.org"

git review -s

# Creating a git remote called 'gerrit' that maps to:
#   ssh://phamhi@gerrit.wikimedia.org:29418/operations/puppet.git

# make the change

git commit -a # add comment

git review
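The commit-msg hook copied in the clone step is what ties a local commit to a Gerrit change: it appends a `Change-Id:` trailer to each commit message. A minimal sketch for checking that a message carries one, assuming the usual form of `I` followed by 40 hex digits. `has_change_id` is a hypothetical helper.

```shell
# Hypothetical helper: succeed if the commit message on stdin contains a
# Gerrit-style Change-Id trailer ("Change-Id: I" followed by 40 hex digits).
has_change_id() {
    grep -Eq '^Change-Id: I[0-9a-f]{40}$'
}

# Example (not run here): check the most recent commit before `git review`.
#   git log -1 --format=%B | has_change_id || echo "no Change-Id; is the hook installed?"
```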