User:ABorrero (WMF)/Notes/Onboarding notes

= Timeline =

A timeline of how the onboarding process went.

week 1
Basically following what was planned at Wikimedia_Cloud_Services_team/Onboarding_Arturo. Lots of paperwork. Lots of meetings using Google Hangouts. Lots of new stuff, technologies, names and people. This was overwhelming. Try to be patient.

Registering on at least 4 wikis and creating a profile on each of them:


 * https://meta.wikimedia.org <-- Community & movement Wiki
 * https://wikitech.wikimedia.org <-- Cloud team Wiki, CloudVPS frontend
 * https://mediawiki.org <-- General Wiki about technology at WMF
 * https://office.wikimedia.org <-- WMF intranet

Important meetings:


 * WMCS weekly team meeting
 * TechOPs weekly meeting
 * Quarter goals meetings
 * Chase meetings to sync and learn
 * Bryan (as my manager) 1:1 meetings
 * Meetings with other people about several other things (like GPG key signing)

Setting accounts and access for other services:


 * Webmail, calendar, etc <-- Google services actually
 * https://phabricator.wikimedia.org <-- tasks, tickets and projects management
 * https://gerrit.wikimedia.org <-- code review
 * pwstore <-- internal tool for password management
 * SSH keys <-- to authenticate to SSH servers
 * IRC channels <-- probably better use https://irccloud.com
 * Mailing lists <-- several WMF mailing lists

Important learnings this week:


 * Infra
 * WMF projects, organization and structure

Got my first task assigned: https://phabricator.wikimedia.org/T179024

week 2
Follow-up with meetings and learnings.

Continue with task: https://phabricator.wikimedia.org/T179024 <-- closed

Create these wiki notes.

Created a CloudVPS project and a virtual machine inside: ssh aborrero-test-vm1.aborrero-test.eqiad.wmflabs

week 3

 * Play with puppet-compiler and puppet-standalone (testing the unattended upgrades patches)
 * Cultural orientation meetings
 * Unattended upgrades https://phabricator.wikimedia.org/T177920 https://phabricator.wikimedia.org/T180254
 * Wiki replicas https://phabricator.wikimedia.org/T173647

week 4
TODO:
 * task https://phabricator.wikimedia.org/T154150
 * task https://phabricator.wikimedia.org/T180513

Done:
 * document puppet-compiler and puppet-standalone learnings https://wikitech.wikimedia.org/wiki/Help:Puppet-compiler
 * wiki replicas https://phabricator.wikimedia.org/T173647

Docs for wiki-replicas automation:
 * https://forge.puppet.com/puppetlabs/mysql
 * http://bitfieldconsulting.com/puppet-and-mysql-create-databases-and-users

= Infra =

The Cloud team has 2 main projects:
 * CloudVPS (OpenStack)
 * Toolforge

There are also several other important things:
 * Puppet deployment
 * Networking: management networks, physical network, bastions
 * Datacenters and physical deployments
 * NFS servers for shared storage and data

CloudVPS
This is the main hosting infra in the Wikimedia movement, both for internal use and for volunteers and anyone who adds value to our movement. It is basically an old OpenStack deployment. Work is ongoing to move to OpenStack Liberty.

The wikitech frontend is a mediawiki plugin to perform tasks that nowadays can be done via Horizon.

There should be docs both for external users and for us (admins), for example:
 * https://wikitech.wikimedia.org/wiki/Help:Access

workflow 1: server lists
To list the instances of a project:

 * enter labcontrol1001.wikimedia.org
 * get root and source /root/novaenv.sh
 * run, for example: OS_TENANT_ID=tools openstack server list
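
The steps of this workflow could be wrapped in a small helper, run as root on labcontrol1001 after sourcing /root/novaenv.sh. This is only a sketch: the function name `list_servers` and the `DRY_RUN` switch are made up for illustration, and the only command taken from these notes is `OS_TENANT_ID=<project> openstack server list`.

```shell
# Hypothetical helper: list the instances of several projects in one go.
# Only "OS_TENANT_ID=<project> openstack server list" comes from the notes;
# the function name and the DRY_RUN switch are illustrative assumptions.
list_servers() {
    for project in "$@"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            # Default, safe mode: print the command instead of running it.
            echo "OS_TENANT_ID=$project openstack server list"
        else
            OS_TENANT_ID="$project" openstack server list
        fi
    done
}

# Dry run: show what would be executed for two projects.
list_servers tools aborrero-test
```

Setting `DRY_RUN=0` in the environment would make the helper actually call the openstack client.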

workflow 2: quotas
About knowing and managing quotas:

 root@labcontrol1001:~# source /root/novaenv.sh
 root@labcontrol1001:~# openstack quota show aborrero-test
 +----------------------+---------------+
 | Field                | Value         |
 +----------------------+---------------+
 | cores                | 8             |
 | fixed-ips            | 200           |
 | floating_ips         | 0             |
 | injected-file-size   | 10240         |
 | injected-files       | 5             |
 | injected-path-size   | 255           |
 | instances            | 8             |
 | key-pairs            | 100           |
 | project              | aborrero-test |
 | properties           | 128           |
 | ram                  | 16384         |
 | secgroup-rules       | 20            |
 | secgroups            | 10            |
 | server_group_members | 10            |
 | server_groups        | 10            |
 +----------------------+---------------+

Upstream docs: https://docs.openstack.org/nova/pike/admin/quotas.html
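
When only one quota value is needed (for example in a script), the table printed by `openstack quota show` can be filtered with awk. The sketch below assumes the usual `| Field | Value |` table layout; the function name `quota_field` is made up for illustration.

```shell
# Hypothetical helper: pull one value out of the "openstack quota show" table.
# It reads the ASCII table on stdin, so it can be tried without a cloud.
quota_field() {
    awk -F'|' -v key="$1" '
        NF >= 3 {
            name = $2; value = $3
            gsub(/^[ \t]+|[ \t]+$/, "", name)    # trim the field name
            gsub(/^[ \t]+|[ \t]+$/, "", value)   # trim the value
            if (name == key) print value
        }'
}

# Sample rows copied from the quota output in these notes.
sample='+-----------+-------+
| cores     | 8     |
| instances | 8     |
| ram       | 16384 |
+-----------+-------+'

printf '%s\n' "$sample" | quota_field ram
```

On labcontrol1001 the same function would be fed from the real command, e.g. `openstack quota show aborrero-test | quota_field cores`.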

workflow 3: wiki db replicas
If a new wiki is deployed in production, we should create a replica for Cloud VPS users to work with that database instead of the production one. We replicate the database but offer just a SQL view of the data, without private data.

Steps:
 * 1) DBAs set up the database and sanitize private data
 * 2) we run maintain-views and maintain-meta_p on labsdb servers
 * 3) we run wikireplica_dns
 * 4) check with the sql command that it works

More docs and examples:
 * https://phabricator.wikimedia.org/T173647
 * https://wikitech.wikimedia.org/wiki/Add_a_wiki#Cloud_Services
 * https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Wiki_Replica_DNS

current deployment
All servers are in the same subnet.


 * labvirtXXXX <-- servers for openstack virtualization, compute
 * labnetXXXX <-- servers implementing nova-network
 * labsdbXXXX <-- servers hosting wiki database replicas (without private data)
 * labservicesXXXX <-- DNS servers

Toolforge
System deployed inside CloudVPS (Openstack) as the tools tenant.

It runs 2 backends: gridengine, kubernetes

Two tools-related projects maintained in part by the Cloud Services team are quarry and paws. (Quarry is actually not hosted in Toolforge currently. It has its own project.)

Composition and naming scheme
The tools cluster is composed of:


 * tools-worker* <-- kubernetes node
 * tools-exec* <-- gridengine
 * 2 etcd clusters (1 kubernetes datastore for state, 1 flannel network overlay)

The kubernetes cluster has a flat network topology, allowing each node (i.e. worker) to connect directly to any other. This is achieved by using flannel.

Managing nodes
In case some operations require it (like testing a patch or doing maintenance), tools-exec* nodes can be depool'ed/repool'ed.


 * Jump to login.tools.wmflabs.org.
 * Leave a message in the Server Admin Log: https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL (on IRC: !log tools depool node X for whatever)
 * Run exec-manage depool tools-exec*.tools.eqiad.wmflabs
 * Wait for jobs to end: exec-manage status tools-exec*.tools.eqiad.wmflabs.
 * Jump to the node and use it. Beware that puppet runs every 30 minutes and may overwrite your files.
 * Once finished, back to login.tools.wmflabs.org and run exec-manage repool tools-exec*.tools.eqiad.wmflabs and leave another SAL message.
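
The depool/repool cycle above can be sketched as a small wrapper to be used on login.tools.wmflabs.org. The `exec-manage` subcommands (depool, status, repool) are the ones from these notes; the function names, the `DRY_RUN` switch and the placeholder node name `tools-exec-NNNN` are assumptions for illustration. The SAL messages still have to be left by hand.

```shell
# Hypothetical wrapper around the depool/repool steps of a tools-exec node.
# By default it only prints the commands (DRY_RUN=1); set DRY_RUN=0 to run.
maybe_run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

depool_cycle() {
    node="$1"
    maybe_run exec-manage depool "$node"
    # Check status repeatedly by hand until no jobs are left on the node.
    maybe_run exec-manage status "$node"
    echo "# use $node now; puppet may overwrite local changes"
    maybe_run exec-manage repool "$node"
}

# Placeholder node name, not a real instance.
depool_cycle tools-exec-NNNN.tools.eqiad.wmflabs
```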

Access

 * SSH bastions: login.tools.wmflabs.org
 * Web interface:

Puppet
The puppet deployment is used for almost everything related to bare infrastructure.

There are several puppet repositories, the main one being operations/puppet.git.

Main documentation: https://wikitech.wikimedia.org/wiki/Puppet_coding

workflows
Description of several workflows.

generic patching workflow
 * Set up SSH keys, gerrit and phabricator, LDAP groups
 * Clone repository, for example: git clone ssh://aborrero@gerrit.wikimedia.org:29418/operations/puppet.git
 * Set up git-review https://www.mediawiki.org/wiki/Gerrit/git-review
 * Develop patch, test it somewhere
 * Push patch and await review. Update patch and push again if required.
 * In gerrit, use Verified+2 and Submit buttons.
 * Jump to puppetmaster1001.eqiad.wmnet and run sudo puppet-merge.

If the patch affects the tools project, then additionally:


 * If required, jump to tools-clushmaster-01.eqiad.wmflabs and run clush -w @all 'sudo puppet agent --test'
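
The local, git side of the patching workflow can be sketched as follows. The Gerrit host and port come from the clone command in these notes, and `git review` is the command provided by git-review; the function names and the branch name `my-topic` are made up for illustration. The script only prints the commands, it does not run them.

```shell
# Hypothetical helper: build the Gerrit clone URL used in the workflow above.
# Host and port (gerrit.wikimedia.org:29418) are taken from the notes.
gerrit_clone_url() {
    user="$1"; repo="$2"
    printf 'ssh://%s@gerrit.wikimedia.org:29418/%s.git\n' "$user" "$repo"
}

# Print the commands of one patch cycle without running them.
patch_cycle() {
    echo "git clone $(gerrit_clone_url "$1" "$2")"
    echo "git checkout -b my-topic   # illustrative branch name"
    echo "git commit -a              # develop the patch"
    echo "git review                 # push to Gerrit and await review"
    echo "git commit -a --amend      # update the same change after feedback"
    echo "git review                 # push the new patch set"
}

patch_cycle aborrero operations/puppet
```

Amending the existing commit, rather than adding a new one, is what keeps the update attached to the same Gerrit change.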

Advanced patching
There are 2 main approaches:
 * 1) Setting a puppet standalone master/agent to test patches and how they affect the final machine. https://wikitech.wikimedia.org/wiki/Help:Standalone_puppetmaster
 * 2) Running puppet-compile by hand to see final generated changes before deploying. https://phabricator.wikimedia.org/T97081#3681225

testing a patch
In order to test a patch, it would be necessary to have a real machine at hand.

In the tools project, get a tools-exec* node and depool/repool it (see specific docs in the tools section).

Other tests may require compiling the puppet catalog by hand before deploying it to agents.

physical servers
Physical servers are installed using Puppet as well.

We use a combination of DHCP+PXE+Debian installer preseed to get them installed automatically.

In case a server needs to be reached via ILO, there are specific docs for this: https://wikitech.wikimedia.org/wiki/Platform-specific_documentation

deployment
Some bits about the puppet deployment. Every project has its own puppet master.

For example:


 * integration project: integration-puppetmaster01.integration.eqiad.wmflabs
 * tools project: tools-puppetmaster-01.tools.eqiad.wmflabs

Each puppet master knows the facts for the servers/instances in its project.

DNS
There is a git repository for DNS: operations/dns.git. The workflow is similar to the one followed for operations/puppet.git (i.e. gerrit review and so on)

Namespaces and schemes:
 * *..wmnet <-- physical private network, not directly accessible from the public internet.
 * *.wmflabs.org <-- public vlans, accessible from the public internet, proxied by nginx or similar. Things inside openstack: instances, projects and so on. This will eventually be renamed to wmcloud.
 * *..wmflabs <-- virtual network inside openstack. Private network.
 * *.wikimedia.org <-- general production

Example naming:
 * silver.eqiad.wmnet <-- private name
 * silver.wikimedia.org <-- publicly accessible name
 * login.tools.wmflabs.org <-- access proxy (bastion) for the Toolforge Cloud VPS project.
 * vm1.aborrero-test.eqiad.wmflabs <-- private address for vm1 inside the aborrero-test Cloud VPS project in eqiad. Private address which requires SSH proxy/bastion.
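
The namespace list above maps naturally onto a shell `case` on the hostname suffix. The sketch below is only an illustration: the function name `host_scope` is made up, and the labels are shortened versions of the descriptions in these notes.

```shell
# Hypothetical helper: classify a hostname by the namespaces listed above.
# The more specific *.wmflabs.org pattern is tested before *.wmflabs.
host_scope() {
    case "$1" in
        *.wmflabs.org)   echo "Cloud VPS public name" ;;
        *.wmflabs)       echo "Cloud VPS private network" ;;
        *.wmnet)         echo "physical private network" ;;
        *.wikimedia.org) echo "general production" ;;
        *)               echo "unknown" ;;
    esac
}

# Classify the example names from these notes.
for h in silver.eqiad.wmnet silver.wikimedia.org login.tools.wmflabs.org \
         vm1.aborrero-test.eqiad.wmflabs; do
    printf '%s -> %s\n' "$h" "$(host_scope "$h")"
done
```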

NFS
NFS servers are used to store shared data.

There are 2 main servers right now:
 * labstore-secondary (actually, the primary)
 * labstore1003

Cloud VPS and Tools both use the NFS backends.

Building blocks
They are 2-node clusters using DRBD+LVM and a floating IP (using proxy ARP). They use manual failover to avoid split-brain situations.

Each node has a quota to prevent users from overloading the servers. These quotas are implemented with tc controllers (similar to QoS). In the past, overloading a server made the whole NFS infra rather slow, which left all clients unable to access data.

Data in NFS
Several kinds of data are usually stored in the NFS backends:


 * home directories
 * scratch spaces
 * wiki dumps (read only)
 * project specific data

Networking
Some bits about the WMF networks.

SSH bastions
We use bastion hosts as gateways to jump to backend servers. This is done by proxying commands and requires a specific config in ~/.ssh/config.

Info: https://wikitech.wikimedia.org/wiki/Production_shell_access
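
As a rough idea of what such a config looks like, the sketch below prints a stanza that routes a group of backend hosts through a bastion. It assumes OpenSSH 7.3 or newer (for `ProxyJump`); the exact stanza recommended for WMF hosts is the one on the linked page and may differ. The function name, the host pattern and the use of `aborrero` as shell user are illustrative assumptions.

```shell
# Hypothetical generator for a ~/.ssh/config stanza that reaches backend
# hosts through a bastion. ProxyJump needs OpenSSH 7.3 or newer.
bastion_stanza() {
    pattern="$1"; bastion="$2"; user="$3"
    printf 'Host %s\n' "$pattern"
    printf '    User %s\n' "$user"
    printf '    ProxyJump %s@%s\n' "$user" "$bastion"
}

# Example: private tools instances via the tools bastion.
bastion_stanza '*.tools.eqiad.wmflabs' login.tools.wmflabs.org aborrero
```

The bastion name must not match the `Host` pattern itself, otherwise ssh would try to jump through the bastion to reach the bastion.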

Datacenters
Machines and services are usually spread across several datacenters.

Naming scheme is usually:
 * machine1001.wikimedia.org <-- datacenter 1
 * machine2001.wikimedia.org <-- datacenter 2
 * machine3001.wikimedia.org <-- datacenter 3
 * machine4001.wikimedia.org <-- datacenter 4
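
Under this scheme the datacenter can be read from the first digit of the four-digit number in the host name. A minimal sketch, assuming names that end in exactly four digits before the domain (the function name `datacenter_of` is made up):

```shell
# Hypothetical helper: read the datacenter digit from a hostname that follows
# the machineNNNN scheme (first digit of the trailing 4-digit number).
datacenter_of() {
    short="${1%%.*}"                                       # drop the domain
    num="$(printf '%s' "$short" | sed 's/^[^0-9]*//')"     # strip the name part
    case "$num" in
        [0-9][0-9][0-9][0-9]) echo "${num%???}" ;;         # keep first digit
        *) echo "unknown" ;;
    esac
}

datacenter_of labcontrol1001.wikimedia.org
datacenter_of machine2001.wikimedia.org
```

Names that do not follow the scheme, such as silver.eqiad.wmnet, are reported as unknown.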

L2/L3 design bits
No perimeter firewalls; host-based firewalls on each server. Subnets per rack row.

Monitoring
The metrics stack is Grafana/Graphite/Diamond, with Nagios for alerts.

Links:
 * https://graphite-labs.wikimedia.org/
 * https://grafana.wikimedia.org/

Diamond collectors run on every machine and send metrics to the graphite server.
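
To get a feeling for what ends up in graphite, here is how a single sample looks in Graphite's plaintext protocol (`<metric.path> <value> <unix-timestamp>`, one per line). This is only an illustration of the wire format; how Diamond is configured here is not covered in these notes, and the function name and metric path are made up.

```shell
# Hypothetical helper: format one sample in Graphite's plaintext protocol,
# "<metric.path> <value> <unix-timestamp>", one sample per line.
graphite_line() {
    path="$1"; value="$2"
    ts="${3:-$(date +%s)}"      # default to the current time
    printf '%s %s %s\n' "$path" "$value" "$ts"
}

# Made-up metric path and value, with a fixed timestamp.
graphite_line servers.vm1.loadavg.01 0.42 1510000000
```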

Wiki replicas
General architecture diagram:



Several layers of load distribution by using DNS namespaces (web, analytics) and by using haproxy in the db clusters. The old cluster (labsdb1001, labsdb1002, labsdb1003) is being replaced by the new one (labsdb1009, labsdb1010, labsdb1011) which among other things mimics what is in production and implements proper load balancing.

Learn more about the databases in general and how they share data:
 * https://tendril.wikimedia.org/tree