Wikimedia Cloud Services team/Onboarding Hieu/Sessions

To be persisted into: https://www.mediawiki.org/wiki/Wikimedia_Cloud_Services_team/Onboarding_Hieu/Sessions

labmon migration
https://phabricator.wikimedia.org/T224585 https://gerrit.wikimedia.org/r/c/operations/puppet/+/552107

labmon1001 (primary), labmon1002 (backup)


 * how to switch active to standby

(conftool) https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/role/common/cache/text.yaml

things to be backed up and restored
 * /var/lib/grafana (dashboard data is not in puppet)
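A minimal sketch of how the Grafana data directory could be archived before the reimage and unpacked afterwards. The function names backup_dir and restore_dir, and the destination path in the usage note, are hypothetical and not existing WMCS tooling; only /var/lib/grafana comes from the notes above.

```shell
# Hypothetical helper: archive a directory into a timestamped tarball.
# Usage: backup_dir <source-dir> <destination-dir>
# Prints the path of the archive it created.
backup_dir() {
    local src="$1" dest="$2"
    local name archive
    name="$(basename "$src")"
    archive="${dest}/${name}-$(date +%Y%m%d-%H%M%S).tar.gz"
    mkdir -p "$dest" || return 1
    # -C makes the paths inside the archive relative to the parent of $src
    tar -czf "$archive" -C "$(dirname "$src")" "$name" || return 1
    echo "$archive"
}

# Hypothetical helper: unpack an archive made by backup_dir.
# Usage: restore_dir <archive> <parent-dir>
restore_dir() {
    local archive="$1" parent="$2"
    mkdir -p "$parent" && tar -xzf "$archive" -C "$parent"
}
```

For example, running backup_dir /var/lib/grafana /root/backup as root (the destination is an assumed path) prints the archive path, which can then be copied off the host before the reimage.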


 * disable puppet agent on labmon
 * do backup (1002) first
 * shutdown 1002
 * change puppet to change hostname first
 * labmon name change -> cloudmon or cloudWHATEVER (decide this with the team; the easiest option is probably plain 'cloudmon').
 * maybe cloudmon1003 and cloudmon1004
 * turn back on 1002 to ensure hostname change is correct
 * (remember to update netbox)


 * reimage to buster
 * make 1002 the primary
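The first steps of the checklist can be rehearsed with a dry-run wrapper before touching the host. This is only a sketch: run, prepare_standby, resume_puppet and the DRY_RUN variable are made-up names, and the commands are the generic puppet agent and shutdown invocations, not a confirmed WMCS procedure.

```shell
# Print commands instead of executing them unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "DRY-RUN: $*"
    else
        "$@"
    fi
}

# Disable the puppet agent (with a reason) and power the host off.
# The backup step is assumed to have been done before the shutdown.
prepare_standby() {
    local reason="$1"
    run sudo puppet agent --disable "$reason"
    run sudo shutdown -h now
}

# After the host is back with its new name: re-enable puppet and do one run.
resume_puppet() {
    run sudo puppet agent --enable
    run sudo puppet agent --test
}
```

With the default DRY_RUN=1, prepare_standby "T224585 migration" only prints the two commands it would run; setting DRY_RUN=0 executes them.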

https://gerrit.wikimedia.org/r/admin/projects/operations/dns (another puppet repository for DNS)

2019-11-14
https://wikitech.wikimedia.org/wiki/Incident_documentation

https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProposals/Systems_and_Service_Continuity

https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Neutron#/media/File:Example_of_NICs_in_Neutron.png

https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Neutron#/media/File:WMCS_eqiad1_network_topology.png

https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Neutron

network bonding/network teaming? multiple network switches

https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Neutron#/media/File:Example_of_NICs_in_Neutron.png

2019-11-05
https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProposals/Systems_and_Service_Continuity


 * Services we offer / stuff we have / dependency assessment
 * IaaS (CloudVPS)
 * PaaS (Toolforge)
 * DaaS (Wiki-replicas, toolsdb)
 * Others (LDAP, etc)


 * For each service we offer, what is the current status from the availability and continuity point of view. Identify SPOF.
 * IaaS
 * hardware level (NICs, switches, RAID storage, racks, disk backups? etc)
 * software level (openstack services in HA, which are not, provisioning/bootstrap, puppet etc)
 * PaaS
 * hardware level (this uses our own IaaS as hardware)
 * software level (grid, k8s, docker registry, services, NFS, and other Toolforge key components, puppet, etc)
 * DaaS
 * hardware level (this uses both our own IaaS as hardware and physical hardware)
 * software level (simple cold-standby setups, dbproxies, puppet, etc)
 * Others


 * For each service we offer, things to improve in both short term and long term. Do we need them? Are they cost-effective?
 * IaaS
 * hardware level:
 * storage (ceph)
 * NIC redundancy
 * Racking scheme (not everything in row B eqiad)
 * etc
 * software level:
 * glance in HA
 * neutron DVR (distributed virtual routing)
 * automatic bootstrapping / provisioning
 * etc
 * PaaS
 * hardware level:
 * automatic provisioning / bootstrapping
 * offline backups?
 * software level:
 * anything?
 * DaaS
 * hardware level:
 * etc
 * software level:
 * etc
 * Others

2019-10-27

 * tools-webservice
 * labmon migration
 * documentation (https://wikitech.wikimedia.org/wiki/Systems_and_Service_Continuity)
 * https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProposals reallocating page!
 * not ready for review yet
 * https://phabricator.wikimedia.org/T218461
 * cloud-cumin-01.cloudinfra.eqiad.wmflabs
 * https://tools.wmflabs.org/openstack-browser/project/cloudinfra

$ sudo cumin "project:tools" "apt-cache policy toollabs-webservice"

sudo cumin "O{project:tools name:tools-sgebastion-08}" "apt-cache policy toollabs-webservice"
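The two invocations above differ only in the query string: the first selects every instance in a project, the second narrows the selection to one host by name. A small hypothetical helper (cumin_query is not part of cumin) that builds either form, using only the syntax shown in these notes:

```shell
# Hypothetical helper: build a cumin query string.
# Usage: cumin_query <project> [<instance-name>]
cumin_query() {
    local project="$1" name="${2:-}"
    if [ -n "$name" ]; then
        # Project plus instance name, in the O{...} form used above
        echo "O{project:${project} name:${name}}"
    else
        # Whole project, in the short form used above
        echo "project:${project}"
    fi
}
```

It could be used as: sudo cumin "$(cumin_query tools tools-sgebastion-08)" "apt-cache policy toollabs-webservice"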

aborrero@cloud-cumin-01:~$ sudo cumin "project:tools" "dpkg -s toollabs-webservice 2>/dev/null | grep install || true"
aborrero@cloud-cumin-01:~$ sudo cumin "project:tools" "dpkg -s toollabs-webservice 2>/dev/null | grep install || true && apt-get install toollabs-webservice -s"

Real installation:

aborrero@cloud-cumin-01:~$ sudo cumin "project:tools" "dpkg -s toollabs-webservice 2>/dev/null | grep install || true && apt-get install toollabs-webservice"
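In shell, || and && have equal precedence and associate left to right, so "A || true && B" runs B whether or not A succeeded. In the command above that means apt-get install runs on every host in the project. If the intent is to upgrade only where the package is already present, a guard along these lines could be used instead. is_installed_status and upgrade_if_installed are hypothetical names; dpkg-query -W -f='${Status}' and apt-get install --only-upgrade are standard Debian tooling.

```shell
# Succeeds only when the dpkg status string says the package is installed.
is_installed_status() {
    case "$1" in
        "install ok installed") return 0 ;;
        *) return 1 ;;
    esac
}

# Upgrade a package only on hosts where it is already installed.
upgrade_if_installed() {
    local pkg="$1" status
    # dpkg-query exits non-zero for unknown packages; treat that as "not installed"
    status="$(dpkg-query -W -f='${Status}' "$pkg" 2>/dev/null)" || status=""
    if is_installed_status "$status"; then
        apt-get install --only-upgrade -y "$pkg"
    else
        echo "$pkg not installed, skipping"
    fi
}
```

The precedence behaviour is easy to check locally: false || true && echo ran prints "ran".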

2019-10-10

 * status of things:
 * working on reliability documentation
 * labmon project externally blocked https://phabricator.wikimedia.org/T224585

from modules/graphite/manifests/web.pp

# graphite >= 1.0 is in backports (>= stretch)
package { 'graphite-web':
    ensure          => 'present',
    install_options => ['-t', "${::lsbdistcodename}-backports"],
}

# django 1.9 compat, remove once the jessie -> stretch migration is completed
$syncdb_command = $::lsbdistcodename ? {
    stretch => '/usr/bin/graphite-manage migrate --run-syncdb --noinput',
    default => '/usr/bin/graphite-manage syncdb --noinput',
}
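The selector above picks the database sync command from the distribution codename. For readers not yet familiar with Puppet selectors, the same decision can be written as a shell case statement. This is an illustration only; the function name syncdb_command is made up and nothing like it exists in the puppet tree.

```shell
# Shell equivalent of the Puppet selector: choose the command by codename.
# Usage: syncdb_command <codename>
syncdb_command() {
    case "$1" in
        stretch) echo "/usr/bin/graphite-manage migrate --run-syncdb --noinput" ;;
        *)       echo "/usr/bin/graphite-manage syncdb --noinput" ;;
    esac
}
```

On a Debian host it could be called as syncdb_command "$(lsb_release -sc)".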

# reset the local puppet checkout and sync it with upstream
wmcs_puppet_tree_clean() {
    cd /var/lib/git/operations/puppet
    sudo git clean -fd
    sudo git checkout -f
    cd -
    sudo git-sync-upstream
}

https://wikitech.wikimedia.org/wiki/User:Arturo_Borrero_Gonzalez#wmf-export-puppet-patch.sh

2019-09-26

 * kubernetes ingress etc
 * Q2 goal labmon https://phabricator.wikimedia.org/T224585
 * some explanations of the servers
 * some puppet tree pointers

2019-08-08

 * multiple LDAP accounts: https://phabricator.wikimedia.org/T230126
 * https://wikitech.wikimedia.org/wiki/LDAP


 * not in the LDAP group?
 * cloud-wide root https://gerrit.wikimedia.org/r/admin/projects/labs/private
 * generate a patch to add a new SSH key (cloud VPS root)

https://wikitech.wikimedia.org/wiki/LDAP https://gerrit.wikimedia.org/r/c/operations/puppet/+/519398

+2 verified +2 code-review
 * puppet workflow:

then the merge button will appear -> git-gerrit (not yet in infra)

https://gerrit.wikimedia.org/r/c/operations/puppet/+/519398

puppetmaster1001.eqiad.wmnet

sudo puppet-merge (fetch change from gerrit to puppet master)

hpham@puppetmaster1001:~$ sudo puppet-merge
Checking for pending merges in /labs/private
Fetching new commits from https://gerrit.wikimedia.org/r/labs/private
No changes to merge.
Fetching new commits from https://gerrit.wikimedia.org/r/operations/puppet
No changes to merge.

https://github.com/wikimedia/puppet/ (mirror)

https://github.com/wikimedia/puppet/tree/production/modules/role/manifests

https://wikitech.wikimedia.org/wiki/Puppet_coding

manifests hold the code; hiera supplies the configuration data

https://github.com/wikimedia/puppet/blob/production/manifests/site.pp


 * Possible initial tasks:
 * Set up tools-buster repository in aptly to allow toolforge servers to be installed on buster https://phabricator.wikimedia.org/T229237
 * WMCS: migrate python2 scripts to python3 https://phabricator.wikimedia.org/T229920
 * Migrate labmon* to Stretch (or Buster, better yet!) https://phabricator.wikimedia.org/T224585


 * Commit first patch to puppet

sudo easy_install pip
sudo pip install -U setuptools
pip install --user git-review
export PATH=$PATH:$HOME/Library/Python/2.7/bin

git clone "ssh://phamhi@gerrit.wikimedia.org:29418/operations/puppet" && scp -p -P 29418 phamhi@gerrit.wikimedia.org:hooks/commit-msg "puppet/.git/hooks/"
# clone with commit-msg hook
# https://gerrit.wikimedia.org/r/admin/projects/operations/puppet

git config --global --add gitreview.username "phamhi"
git config --global --add gitreview.email "hpham@wikimedia.org"

git review -s
# Creating a git remote called 'gerrit' that maps to:
#        ssh://phamhi@gerrit.wikimedia.org:29418/operations/puppet.git


# make the change

git commit -a # add comment

git review
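Gerrit normally expects a Change-Id footer in every commit message, and the commit-msg hook copied in the clone step is what adds it. A small sketch for checking a checkout before the first git review: the helper names gerrit_clone_url and has_commit_msg_hook are hypothetical, while the host, port and hook path are the ones used in the commands above.

```shell
# Hypothetical helper: build the SSH clone URL for a Gerrit user and repository.
# Usage: gerrit_clone_url <gerrit-user> <repository>
gerrit_clone_url() {
    local user="$1" repo="$2"
    echo "ssh://${user}@gerrit.wikimedia.org:29418/${repo}"
}

# Hypothetical helper: succeed only if the checkout has an executable commit-msg hook.
# Usage: has_commit_msg_hook <checkout-dir>
has_commit_msg_hook() {
    [ -x "$1/.git/hooks/commit-msg" ]
}
```

For example: has_commit_msg_hook puppet || echo "commit-msg hook missing, copy it again with scp"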