User:ABorrero (WMF)/Notes/Onboarding notes
A timeline of how the onboarding process went.
Basically following what was planned at Wikimedia_Cloud_Services_team/Onboarding_Arturo. Lots of paperwork. Lots of meetings using Google Hangouts. Lots of new stuff, technologies, names and people. This was overwhelming. Try to be patient.
Registering on at least 4 wikis and creating a profile on each of them:
- https://meta.wikimedia.org <-- Community & movement Wiki
- https://wikitech.wikimedia.org <-- Cloud team Wiki, CloudVPS frontend
- https://mediawiki.org <-- General Wiki about technology at WMF
- https://office.wikimedia.org <-- WMF intranet
- WMCS weekly team meeting
- TechOPs weekly meeting
- Quarter goals meetings
- Chase meetings to sync and learn
- Bryan (as my manager) 1:1 meetings
- Meetings with other people for several other things (like GPG key signing)
Setting up accounts and access for other services:
- Webmail, calendar, etc <-- Google services actually
- https://phabricator.wikimedia.org <-- tasks, tickets and projects management
- https://gerrit.wikimedia.org <-- code review
- pwstore <-- internal tool for password management
- SSH keys <-- to authenticate against SSH servers
- IRC channels <-- probably better to use https://irccloud.com
- Mailing lists <-- several WMF mailing lists
Important learnings this week:
- WMF projects, organization and structure
Got my first task assigned: https://phabricator.wikimedia.org/T179024
Follow-up with meetings and learnings.
Continue with task: https://phabricator.wikimedia.org/T179024 <-- closed
Create these wiki notes.
Created a CloudVPS project and a virtual machine inside:
- Play with puppet-compiler and puppet-standalone (testing the unattended upgrades patches)
- Cultural orientation meetings
- Unattended upgrades https://phabricator.wikimedia.org/T177920 https://phabricator.wikimedia.org/T180254
- Wiki replicas https://phabricator.wikimedia.org/T173647
- document puppet-compiler and puppet-standalone learnings https://wikitech.wikimedia.org/wiki/Help:Puppet-compiler
- wiki replicas https://phabricator.wikimedia.org/T173647
Docs for wiki-replicas automation:
Cloud has 2 main projects:
- CloudVPS (Openstack)
There are also several other important things:
- Puppet deployment
- Networking: management networks, physical network, bastions
- Datacenters and physical deployments
- NFS servers for shared storage and data
This is the main hosting infra in the Wikimedia movement, both for internal use and for volunteers and anyone else who adds value to the movement. It is basically an old OpenStack deployment. Work is ongoing to move to OpenStack Liberty.
The wikitech frontend is a mediawiki plugin to perform tasks that nowadays can be done via Horizon.
There should be docs both for external users and for us (admins), for example:
workflow 1: server lists
To list the instances of a project:
- log in to labcontrol1001.wikimedia.org
- become root and source /root/novaenv.sh
- run, for example:
OS_TENANT_ID=tools openstack server list
workflow 2: quotas
To inspect and manage quotas:
root@labcontrol1001:~# source /root/novaenv.sh
root@labcontrol1001:~# openstack quota show aborrero-test
+----------------------+---------------+
| Field                | Value         |
+----------------------+---------------+
| cores                | 8             |
| fixed-ips            | 200           |
| floating_ips         | 0             |
| injected-file-size   | 10240         |
| injected-files       | 5             |
| injected-path-size   | 255           |
| instances            | 8             |
| key-pairs            | 100           |
| project              | aborrero-test |
| properties           | 128           |
| ram                  | 16384         |
| secgroup-rules       | 20            |
| secgroups            | 10            |
| server_group_members | 10            |
| server_groups        | 10            |
+----------------------+---------------+
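If the quota values need to be checked from a script, the table printed by openstack quota show can be turned into a dictionary. This is only a minimal sketch: the function name parse_openstack_table is made up for these notes, and the sample is a shortened copy of the output above. The openstack client can also print JSON directly with -f json, which is probably the better option when available.

```python
def parse_openstack_table(text):
    """Parse a two-column ASCII table (Field / Value) into a dict."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        # Data rows start with '|'; border rows start with '+'.
        if not line.startswith("|"):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        # Skip the header row and anything that is not a Field/Value pair.
        if len(cells) != 2 or cells[0] == "Field":
            continue
        field, value = cells
        result[field] = int(value) if value.isdigit() else value
    return result


# Shortened copy of the output shown in these notes.
SAMPLE = """\
+-----------+---------------+
| Field     | Value         |
+-----------+---------------+
| cores     | 8             |
| instances | 8             |
| project   | aborrero-test |
| ram       | 16384         |
+-----------+---------------+
"""

quota = parse_openstack_table(SAMPLE)
print(quota["project"], quota["cores"], quota["ram"])
```

Numeric values are converted to integers so they can be compared against current usage; everything else is kept as a string.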
Upstream docs: https://docs.openstack.org/nova/pike/admin/quotas.html
workflow 3: wiki db replicas
If a new wiki is deployed in production, we should create a replica for Cloud VPS users to work with that database instead of the production one. We replicate the database but offer just a SQL view of the data, without private data.
- DBAs set up the database and sanitize private data
- we run maintain-views and maintain-meta_p on labsdb servers
- we run wikireplica_dns
- check with the sql command that it works
More docs and examples:
All servers are in the same subnet.
- labvirtXXXX <-- servers for openstack virtualization, compute
- labnetXXXX <-- servers implementing nova-network
- labdbXXXX <-- servers hosting wiki database replicas (without private data)
- labservicesXXXX <-- DNS servers
System deployed inside CloudVPS (Openstack) as the tools tenant.
It runs 2 backends: gridengine, kubernetes
Composition and naming scheme
The tools cluster is composed of:
- tools-worker* <-- kubernetes node
- tools-exec* <-- gridengine
- 2 etcd clusters (1 kubernetes datastore for state, 1 flannel network overlay)
The kubernetes cluster has a flat network topology, implemented with flannel, which allows every node (i.e. worker) to connect directly to every other node.
Managing exec nodes
In case some operations require it (like testing a patch or doing maintenance), tools-exec* nodes can be depool'ed/repool'ed.
- Jump to login.tools.wmflabs.org.
- Leave a message to Server Admin Log: https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL (on IRC: !log tools depool node X for whatever)
- Run exec-manage depool tools-exec*.tools.eqiad.wmflabs
- Wait for jobs to end: exec-manage status tools-exec*.tools.eqiad.wmflabs.
- Jump to the node and use it. Beware that puppet runs every 30 minutes and may overwrite your files.
- Once finished, go back to login.tools.wmflabs.org, run exec-manage repool tools-exec*.tools.eqiad.wmflabs and leave another SAL message.
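The depool/repool steps above can be written down as a checklist generator, so that no SAL message is forgotten. This is only a sketch: the function name exec_node_maintenance_plan and the node name used in the example are hypothetical, and the code only prints the steps, it does not run anything.

```python
def exec_node_maintenance_plan(node, reason):
    """Return the ordered (where, what) steps to depool and repool an exec node."""
    login = "login.tools.wmflabs.org"
    return [
        ("irc", f"!log tools depool node {node} for {reason}"),
        (login, f"exec-manage depool {node}"),
        (login, f"exec-manage status {node}"),  # repeat until jobs have ended
        (node, "do the maintenance; puppet runs every 30 minutes"),
        (login, f"exec-manage repool {node}"),
        ("irc", f"!log tools repool node {node} after {reason}"),
    ]


# Hypothetical node name, for illustration only.
plan = exec_node_maintenance_plan(
    "tools-exec-1401.tools.eqiad.wmflabs", "testing a puppet patch"
)
for where, what in plan:
    print(f"[{where}] {what}")
```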
Managing worker nodes
In case some operations require it (like testing a patch or doing maintenance), tools-worker* nodes can be cordoned/uncordoned.
- Jump to tools-k8s-master-01.tools.eqiad.wmflabs.
- Leave a message to Server Admin Log: https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL (on IRC: !log tools cordon node X for whatever)
- Run kubectl cordon tools-worker*.tools.eqiad.wmflabs
- Review status: kubectl get nodes. Drain if necessary: kubectl drain tools-worker*.tools.eqiad.wmflabs
- Jump to the node and use it. Beware that puppet runs every 30 minutes and may overwrite your files.
- Once finished, run kubectl uncordon tools-worker*.tools.eqiad.wmflabs and leave another SAL message. Review status again.
To know which pods are scheduled in which nodes, run:
aborrero@tools-k8s-master-01:~$ sudo kubectl get pods --all-namespaces -o wide | grep tools-worker-1001
grantmetrics        grantmetrics-1330309696-9rzri       1/1   Running   0   16d   192.168.168.2   tools-worker-1001.tools.eqiad.wmflabs
lziad               p4wikibot-657229038-22rxw           1/1   Running   0   13d   192.168.168.5   tools-worker-1001.tools.eqiad.wmflabs
openstack-browser   openstack-browser-148894442-vhs63   1/1   Running   0   6d    192.168.168.6   tools-worker-1001.tools.eqiad.wmflabs
versions            versions-1535803801-j8v7s           1/1   Running   0   22d   192.168.168.4   tools-worker-1001.tools.eqiad.wmflabs
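Before draining a node it can help to see every node's pods at once instead of grepping one node at a time. The sketch below groups the output of kubectl get pods --all-namespaces -o wide by node. The function name pods_by_node is made up for these notes, and the column layout (namespace, name, ready, status, restarts, age, IP, node) is assumed from the sample output above; other kubectl versions may print different columns.

```python
from collections import defaultdict


def pods_by_node(output):
    """Group (namespace, pod) pairs by the node they are scheduled on."""
    nodes = defaultdict(list)
    for line in output.splitlines():
        fields = line.split()
        # Skip blank lines, short lines and the header row.
        if len(fields) < 8 or fields[0] == "NAMESPACE":
            continue
        namespace, pod, node = fields[0], fields[1], fields[7]
        nodes[node].append((namespace, pod))
    return dict(nodes)


# Two rows copied from the sample output in these notes.
SAMPLE = """\
grantmetrics grantmetrics-1330309696-9rzri 1/1 Running 0 16d 192.168.168.2 tools-worker-1001.tools.eqiad.wmflabs
versions versions-1535803801-j8v7s 1/1 Running 0 22d 192.168.168.4 tools-worker-1001.tools.eqiad.wmflabs
"""

grouped = pods_by_node(SAMPLE)
for node, pods in grouped.items():
    print(node, len(pods))
```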
- SSH bastions: login.tools.wmflabs.org
- Web interface:
The puppet deployment is used for almost everything related to bare infrastructure.
There are several puppet repositories, the main one being operations/puppet.git.
Main documentation: https://wikitech.wikimedia.org/wiki/Puppet_coding
Description of several workflows.
generic patching workflow
- Set up SSH keys, gerrit and phabricator, LDAP groups
- Clone repository, for example:
git clone ssh://email@example.com:29418/operations/puppet.git
- Set up git-review https://www.mediawiki.org/wiki/Gerrit/git-review
- Develop patch, test it somewhere
- Push patch and await review. Update patch and push again if required.
- In gerrit, use Verified+2 and Submit buttons.
- Jump to puppetmaster1001.eqiad.wmnet and run sudo puppet-merge.
If the patch affects the tools project, then additionally:
- If required, jump to tools-clushmaster-01.eqiad.wmflabs and run clush -w @all 'sudo puppet agent --test'
There are 2 main approaches:
- Setting a puppet standalone master/agent to test patches and how they affect the final machine. https://wikitech.wikimedia.org/wiki/Help:Standalone_puppetmaster
- Running puppet-compile by hand to see final generated changes before deploying. https://phabricator.wikimedia.org/T97081#3681225
testing a patch
Testing a patch requires a real machine at hand.
In the tools project, get a tools-exec* node and depool/repool it (see specific docs in the tools section).
Other tests may require compiling the puppet catalog by hand before deploying it to agents.
Physical servers are being installed using Puppet as well.
We use a combination of DHCP+PXE+Debian installer preseed to get it installed automatically.
In case a server needs to be reached via ILO, there are specific docs for this: https://wikitech.wikimedia.org/wiki/Platform-specific_documentation
Some bits about the puppet deployment. Every project has its own puppet master.
- integration project: integration-puppetmaster01.integration.eqiad.wmflabs
- tools project: tools-puppetmaster-01.tools.eqiad.wmflabs
Each puppet master knows the facts for the servers/instances in its project.
There is a git repository for DNS: operations/dns.git. The workflow is similar to the one followed for operations/puppet.git (i.e. gerrit review and so on)
Namespaces and schemes:
- *.<dc>.wmnet <-- physical private network, not directly accessible from the public internet.
- *.wmflabs.org <-- public vlans, accessible from the public internet, proxied by nginx or similar. Things inside openstack: instances, projects and so on. This will eventually be renamed to wmcloud.
- *.<dc>.wmflabs <-- virtual network inside openstack. Private network.
- *.wikimedia.org <-- general production
- silver.eqiad.wmnet <-- private name
- silver.wikimedia.org <-- publicly accessible name
- login.tools.wmflabs.org <-- access proxy (bastion) for the Toolforge Cloud VPS project.
- vm1.aborrero-test.eqiad.wmflabs <-- private address for vm1 inside the aborrero-test Cloud VPS project in eqiad. Private address which requires SSH proxy/bastion.
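The namespace rules above can be condensed into a small lookup. This is a minimal sketch: the function name classify_hostname and the labels it returns are made up for these notes, and it only covers the four suffixes listed here.

```python
def classify_hostname(fqdn):
    """Map a hostname to the network namespace it belongs to."""
    name = fqdn.lower().rstrip(".")
    if name.endswith(".wmnet"):
        return "production-private"  # *.<dc>.wmnet, physical private network
    if name.endswith(".wmflabs.org"):
        return "cloud-public"        # public names for things inside openstack
    if name.endswith(".wmflabs"):
        return "cloud-private"       # *.<dc>.wmflabs, needs an SSH bastion
    if name.endswith(".wikimedia.org"):
        return "production-public"   # general production
    return "unknown"


for host in (
    "silver.eqiad.wmnet",
    "silver.wikimedia.org",
    "login.tools.wmflabs.org",
    "vm1.aborrero-test.eqiad.wmflabs",
):
    print(host, "->", classify_hostname(host))
```

The suffixes do not overlap (.wmflabs.org never ends in .wmflabs), so the order of the checks does not matter.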
NFS servers are used to store shared data.
There are 2 main servers right now:
- labstore-secondary (actually, the primary)
Cloud VPS and Tools both use the NFS backends.
They are 2-node clusters using DRBD+LVM and a floating IP (using proxy ARP). Failover is manual, to avoid split-brain situations.
Each node has a quota to keep users from overloading the servers. These quotas are enforced by tc controllers (similar to QoS). In the past, overloading a server slowed down the whole NFS infra, which left all clients unable to access data.
Data in NFS
Several kinds of data are usually stored in the NFS backends:
- home directories
- scratch spaces
- wiki dumps (read only)
- project specific data
Some bits about the WMF networks.
We use bastion hosts as gateways to jump to backend servers. This is done by proxying commands and requires a specific config in ~/.ssh/config.
Machines and services are usually spread across several datacenters.
Naming scheme is usually:
- machine1001.wikimedia.org <-- datacenter 1
- machine2001.wikimedia.org <-- datacenter 2
- machine3001.wikimedia.org <-- datacenter 3
- machine4001.wikimedia.org <-- datacenter 4
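Following this scheme, the datacenter can be read from the first digit of the numeric suffix. The sketch below assumes a four-digit suffix, as in the examples above; the function name datacenter_index is made up for these notes.

```python
import re


def datacenter_index(hostname):
    """Return the datacenter number encoded in a hostname, or None."""
    short = hostname.split(".", 1)[0]
    # Exactly four trailing digits; the first one is the datacenter.
    match = re.search(r"(?<!\d)(\d)\d{3}$", short)
    return int(match.group(1)) if match else None


for host in ("machine1001.wikimedia.org", "machine4001.wikimedia.org", "silver.eqiad.wmnet"):
    print(host, "->", datacenter_index(host))
```

Hostnames without a four-digit suffix (for example silver) return None rather than a guess.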
L2/L3 design bits
No perimeter firewalls; host-based firewalls on each server. Subnets per rack row.
The metrics stack is Grafana/Graphite/Diamond and, for alerts, Nagios.
Diamond collectors run on every machine and send metrics to the graphite server.
Know more about general databases and how they share data: