SQL/XML Dumps/Puppet for dumps maintainers

Introduction to puppet
This is a companion document to the slides from the Puppet for dumps maintainers presentation on December 9, 2020. The slides and speaker notes from the presentation cover much more ground, including linting, testing, and facter, but they don't look at code examples in any detail.

Basic puppet syntax, classes
Let's have a thorough look at the file modules/dumps/manifests/generation/server/dirs.pp.

Files of puppet code are called "manifests", so we'll call them that too.

{{CodeCommentary|type=code|lang=puppet|start=8|icon=hand|pos=2|link=https://github.com/wikimedia/puppet/blob/production/modules/dumps/manifests/server_dirs.pp|content=
) {
    class {'dumps::server_dirs':
        datadir         => $datadir,
        xmldumpsdir     => $xmldumpsdir,
        miscdatasetsdir => $miscdatasetsdir,
        user            => $user,
        group           => $group,
    }

    # Directories where dumps of any type are generated
    # This list is not for one-off directories, nor for
    # directories with incoming rsyncs of datasets
    $cirrussearchdir             = "${miscdatasetsdir}/cirrussearch"
    $xlationdir                  = "${miscdatasetsdir}/contenttranslation"
    $categoriesrdfdir            = "${miscdatasetsdir}/categoriesrdf"
    $categoriesrdfdailydir       = "${miscdatasetsdir}/categoriesrdf/daily"
    $globalblocksdir             = "${miscdatasetsdir}/globalblocks"
    $medialistsdir               = "${miscdatasetsdir}/imageinfo"
    $incrsdir                    = "${miscdatasetsdir}/incr"
}}

{{CodeCommentary|type=code|lang=puppet|start=41|icon=eyes|pos=2|content=
    # subdirs for various generated dumps
    file { [ $cirrussearchdir, $xlationdir, $categoriesrdfdir,
        $categoriesrdfdailydir, $globalblocksdir, $medialistsdir,
        $incrsdir, $mediatitlesdir, $pagetitlesdir, $shorturlsdir,
        $machinevisiondir ]:

        ensure => 'directory',
        mode   => '0755',
        owner  => $user,
        group  => $group,
    }

    # needed for wikidata weekly crons
    file { [ $otherwikibasedir, $otherwikibasewikidatadir, $otherwikidatadir ]:
        ensure => 'directory',
        mode   => '0755',
        owner  => $user,
        group  => $group,
    }
}
}}
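
A resource declared with an array of titles, as in the code above, is shorthand for declaring one resource per title with the same attributes. As a minimal sketch (the two paths are made up for illustration and are not taken from the real manifest), the array form and the commented-out long form below should produce the same result:

{{CodeCommentary|type=code|lang=puppet|start=1|content=
# Hypothetical paths, for illustration only
$exampledir_a = '/tmp/example_a'
$exampledir_b = '/tmp/example_b'

# Array form: one declaration, two directories
file { [ $exampledir_a, $exampledir_b ]:
    ensure => 'directory',
    mode   => '0755',
}

# Long form: the same thing, one resource at a time.
# Use one form or the other; declaring both in the same
# catalog would be a duplicate declaration error.
# file { $exampledir_a:
#     ensure => 'directory',
#     mode   => '0755',
# }
# file { $exampledir_b:
#     ensure => 'directory',
#     mode   => '0755',
# }
}}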

Not too bad, was it? Next let's look at resources a bit more.

Resources in puppet
We'll have a look at the cron jobs for one of the "other" dumps, and in particular the job that generates lists of titles of articles in the main space for each wiki once a day.

{{CodeCommentary|type=code|lang=puppet|start=1|icon=eyes|pos=2|content=
class snapshot::cron::pagetitles(
    $user      = undef,
    $filesonly = false,
) {
    $cronsdir = $snapshot::dumps::dirs::cronsdir
    $repodir = $snapshot::dumps::dirs::repodir
    $confsdir = $snapshot::dumps::dirs::confsdir
}}

{{CodeCommentary|type=code|lang=puppet|start=8|icon=eyes|pos=4|content=
    if !$filesonly {
        cron { 'pagetitles-ns0':
            ensure      => 'present',
            environment => 'MAILTO=ops-dumps@wikimedia.org',
            user        => $user,
            command     => "cd ${repodir}; python3 onallwikis.py --configfile ${confsdir}/wikidump.conf.dumps:monitor  --filenameformat '{w}-{d}-all-titles-in-ns-0.gz' --outdir '${cronsdir}/pagetitles/{d}' --query \"'select page_title from page where page_namespace=0;'\"",
            minute      => '10',
            hour        => '8',
        }
}}

{{CodeCommentary|type=code|lang=puppet|start=17|icon=eyes|pos=2|content=
        cron { 'pagetitles-ns6':
            ensure      => 'present',
            environment => 'MAILTO=ops-dumps@wikimedia.org',
            user        => $user,
            command     => "cd ${repodir}; python3 onallwikis.py --configfile ${confsdir}/wikidump.conf.dumps:monitor  --filenameformat '{w}-{d}-all-media-titles.gz' --outdir '${cronsdir}/mediatitles/{d}' --query \"'select page_title from page where page_namespace=6;'\"",
            minute      => '50',
            hour        => '8',
        }
    }
}
}}
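
The class takes two parameters, so whatever declares it decides which user the jobs run as and whether the cron entries are created at all. A sketch of such a declaration follows. The user name is an assumption made for illustration, and the real callers may pass different values.

{{CodeCommentary|type=code|lang=puppet|start=1|content=
# Sketch only: assumes snapshot::dumps::dirs has already been declared,
# since the class reads its directory variables from there.
class { 'snapshot::cron::pagetitles':
    user      => 'dumpsgen',   # assumed user name
    filesonly => false,        # false: the cron entries are declared
}
}}

With filesonly set to true, the if block is skipped and neither of the two cron entries shown above is declared.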

Roles, profiles and classes
Each server gets one role. Each role is built from profiles (just another collection of puppet manifests). And each profile is built from classes from various modules, and sometimes from other profiles.
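
As a rough sketch of how the three layers fit together, here is a toy example. Every name in it is hypothetical, and in a real repo each class would live in its own file in its own module rather than side by side in one manifest.

{{CodeCommentary|type=code|lang=puppet|start=1|content=
# All names below are hypothetical, for illustration only.

# A class from an ordinary module: does one concrete piece of work.
class examplemodule::motd(
    $message = 'hello',
) {
    file { '/tmp/example_motd':
        ensure  => 'present',
        content => "${message}\n",
    }
}

# A profile: wires module classes together and supplies their parameters.
class profile::example::greeter {
    class { 'examplemodule::motd':
        message => 'managed by puppet',
    }
}

# A role: only includes profiles. A server gets exactly one of these.
class role::example::server {
    include profile::example::greeter
}
}}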

Roles
Let's have a look. We'll start with the dumper role, which is applied to snapshot hosts that just run XML/SQL dumps and nothing else.

{{CodeCommentary|type=code|lang=puppet|start=1|content=
class role::dumps::generation::worker::dumper {
    include ::profile::standard
    include ::profile::base::firewall
}}

{{CodeCommentary|type=code|lang=puppet|start=7|content=
    system::role { 'dumps::generation::worker::dumper':
        description => 'dumper of XML/SQL wiki content',
    }
}
}}

Profiles
Let's have a closer look at the common profile, since it's used in all the roles.

What does every dump worker need? Well, it needs MediaWiki to be set up, of course, and a MediaWiki installation has a few requirements of its own. We don't want the full-fledged installation that goes on every app server, since we're not actually serving any web requests from these hosts, so we choose a few "lower-level" MediaWiki profiles and pull them in.

{{CodeCommentary|type=code|lang=puppet|start=1|icon=eyes|pos=21|content=
class profile::dumps::generation::worker::common(
    $dumps_nfs_server = lookup('dumps_nfs_server'),
    $cron_nfs_server = lookup('dumps_cron_nfs_server'),
    $managed_subdirs = lookup('dumps_managed_subdirs'),
    $datadir_mount_type = lookup('dumps_datadir_mount_type'),
    $extra_mountopts = lookup('profile::dumps::generation::worker::common::nfs_extra_mountopts'),
    $php = lookup('profile::dumps::generation::worker::common::php'),
    $dumps_misc_cronrunner = lookup('profile::dumps::generation::worker::common::dumps_misc_cronrunner'),
) {
    # mw packages and dependencies
    require profile::mediawiki::scap_proxy
    require profile::mediawiki::common
    require profile::mediawiki::nutcracker
    class { 'profile::mediawiki::mcrouter_wancache': prometheus_exporter => false }
    require profile::services_proxy::envoy

    $xmldumpsmount = '/mnt/dumpsdata'

    class { '::dumpsuser': }
}}

{{CodeCommentary|type=code|lang=puppet|start=48|content=
    class { '::snapshot::dumps': php => $php}

    # scap3 deployment of dump scripts
    scap::target { 'dumps/dumps':
        deploy_user => 'dumpsgen',
        manage_user => false,
        key_name    => 'dumpsdeploy',
    }
    ssh::userkey { 'dumpsgen':
        content => secret('keyholder/dumpsdeploy.pub'),
    }
}
}}
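
The profile above pulls in other classes in two different ways (require and a resource-like class declaration), and the role earlier used a third (include). As a rough sketch of the difference, using hypothetical class names that do not exist in the repo:

{{CodeCommentary|type=code|lang=puppet|start=1|content=
# Hypothetical classes, for illustration only.
class example::base { }

class example::tuned(
    $verbose = true,
) { }

class example::consumer {
    # include: make sure the class is in the catalog. No ordering
    # is implied, and it is safe to repeat from many places.
    include example::base

    # require: like include, but example::base is also applied
    # before the resources of this class.
    require example::base

    # Resource-like declaration: the form that lets us pass
    # parameters directly. It may appear only once per catalog.
    class { 'example::tuned':
        verbose => false,
    }
}

include example::consumer
}}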

Puppet repo layout
We've talked about roles and profiles and classes, but where does all of this live? Why, in our puppet repo, of course. It's available for public checkout and has "production" as its main branch, so you'll want to make sure that's the one checked out.

Our top-level manifest, which declares all the hosts and assigns roles to them, is in manifests/site.pp. Everything else is a module living somewhere in the modules directory.
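
As a sketch of the general shape of an entry in site.pp: a node definition matches host names and declares the role for them. The host name pattern below is invented, and the real file may use a helper function rather than a plain include, so treat this as an approximation only.

{{CodeCommentary|type=code|lang=puppet|start=1|content=
# Hypothetical host names; the role class is the dumper role shown earlier.
node /^examplesnapshot\d+\.example\.org$/ {
    include role::dumps::generation::worker::dumper
}
}}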

If you look at any directory under modules, you'll see the same layout, with the following three subdirectories: files, manifests, and templates. Look for yourself.


 * Files are content for file resources that don't get any variable interpolation. They just get plopped right onto the server as is, in the location and with the permissions you specify.
 * Manifests are puppet code, and we've seen some examples of that already.
 * Templates are content for file resources with little bits of Ruby code in them. That code is evaluated to fill in values from variables and so on before the file is written out. We'll look at an example later.
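
To make the difference between files and templates concrete, here is a minimal sketch of the two kinds of file resource. The module name, file names and target paths are all hypothetical. The first resource would be served verbatim from the module's files directory, the second rendered from its templates directory.

{{CodeCommentary|type=code|lang=puppet|start=1|content=
# Hypothetical module and file names, for illustration only.

# Served as is from modules/examplemodule/files/motd.txt
file { '/tmp/example_motd':
    ensure => 'present',
    mode   => '0644',
    source => 'puppet:///modules/examplemodule/motd.txt',
}

# Rendered from modules/examplemodule/templates/greeting.conf.erb,
# with variables such as $greeting filled in before the file is written
$greeting = 'hello'
file { '/tmp/example_greeting.conf':
    ensure  => 'present',
    mode    => '0644',
    content => template('examplemodule/greeting.conf.erb'),
}
}}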

Now you may be wondering where profiles live. Remember that profiles are a sort of thing we made up; they are a nice convention but the name "profile" doesn't have any special meaning to puppet. There is just a module that we maintain called "profile", and it has manifests, files and templates like any other module.

The same is true of roles. There is a "role" module with files, manifests and templates in it. It's how we use these modules that makes them special.

Puppet Templates: basic syntax
The easiest way to understand how templates work is to look at an example. So, here we go. This is the template used to generate configuration files for the "misc" (not XML/SQL) dumps.
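
For readers who haven't met ERB before, here is a minimal sketch of the syntax, written with puppet's inline_template function so that it fits in a single manifest. It is not the real dumps template; the variable names and settings are invented. Text outside the tags is copied through as is, the <%= %> tag inserts the value of an expression, the <% %> tag runs Ruby code without producing output, and puppet variables from the current scope show up inside the template as Ruby instance variables with an @ prefix.

{{CodeCommentary|type=code|lang=puppet|start=1|content=
# Hypothetical values, for illustration only.
$dumpsdir = '/tmp/example_dumps'
$keep     = 3

# The keep= line is only emitted when $keep is greater than zero.
$rendered = inline_template("[output]\npublic=<%= @dumpsdir %>/public\n<% if @keep > 0 %>keep=<%= @keep %>\n<% end %>")

# Print the rendered text during catalog compilation
notice($rendered)
}}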

Puppet Templates: usage
Now that we have a template, how do we use it? This was alluded to briefly above, but let's look at an actual invocation.

{{CodeCommentary|type=code|lang=puppet|start=1|icon=eyes|pos=1|content=
define snapshot::cron::configfile(
    $configvals = undef,
    ) {
    $confsdir = $snapshot::dumps::dirs::confsdir

}}

{{CodeCommentary|type=code|lang=puppet|start=6|icon=eyes|pos=7|content=
    file { "${confsdir}/${title}":
        ensure  => 'present',
        path    => "${confsdir}/${title}",
        mode    => '0755',
        owner   => 'root',
        group   => 'root',
        content => template('snapshot/wikidump.conf.other.erb'),
    }
}
}}
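
Because this is a defined type rather than a class, it can be declared more than once, each time with a different title, and the title becomes the file name under the configuration directory. Here is a sketch of one such declaration. The title and the contents of the configvals hash are made up; the real keys depend on what the template expects.

{{CodeCommentary|type=code|lang=puppet|start=1|content=
# Hypothetical title and values, for illustration only.
# Assumes snapshot::dumps::dirs has already been declared.
snapshot::cron::configfile { 'wikidump.conf.example':
    configvals => {
        'global' => {
            'dblist' => '/tmp/example/all.dblist',
        },
    },
}
}}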