SQL/XML Dumps/Puppet for dumps maintainers

Introduction to puppet
This is a companion document to the slides from the Puppet for dumps maintainers presentation on December 9, 2020.

Basic puppet syntax, classes
Let's have a thorough look at the file modules/dumps/manifests/generation/server/dirs.pp.

Files of puppet code are called "manifests", so we'll call them that too.

{{CodeCommentary|type=code|lang=python|start=8|icon=hand|pos=2|link=https://github.com/wikimedia/puppet/blob/production/modules/dumps/manifests/server_dirs.pp|content=
) {
    class {'dumps::server_dirs':
        datadir         => $datadir,
        xmldumpsdir     => $xmldumpsdir,
        miscdatasetsdir => $miscdatasetsdir,
        user            => $user,
        group           => $group,
    }

    # Directories where dumps of any type are generated
    # This list is not for one-off directories, nor for
    # directories with incoming rsyncs of datasets
    $cirrussearchdir             = "${miscdatasetsdir}/cirrussearch"
    $xlationdir                  = "${miscdatasetsdir}/contenttranslation"
    $categoriesrdfdir            = "${miscdatasetsdir}/categoriesrdf"
    $categoriesrdfdailydir       = "${miscdatasetsdir}/categoriesrdf/daily"
    $globalblocksdir             = "${miscdatasetsdir}/globalblocks"
    $medialistsdir               = "${miscdatasetsdir}/imageinfo"
    $incrsdir                    = "${miscdatasetsdir}/incr"
}}

{{CodeCommentary|type=code|lang=python|start=41|icon=eyes|pos=2|content=
​    # subdirs for various generated dumps
    file { [ $cirrussearchdir, $xlationdir, $categoriesrdfdir,
        $categoriesrdfdailydir, $globalblocksdir, $medialistsdir,
        $incrsdir, $mediatitlesdir, $pagetitlesdir, $shorturlsdir,
        $machinevisiondir ]:

        ensure => 'directory',
        mode   => '0755',
        owner  => $user,
        group  => $group,
    }

    # needed for wikidata weekly crons
    file { [ $otherwikibasedir, $otherwikibasewikidatadir, $otherwikidatadir ]:
        ensure => 'directory',
        mode   => '0755',
        owner  => $user,
        group  => $group,
    }
}
}}
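To see what the array-of-titles form of the file resource is doing, here is a minimal sketch you could try with puppet apply. The class name example::dirs, its parameters and the paths under /tmp are all made up for illustration; only the shape of the file resource is borrowed from the manifest above.

```puppet
# Hypothetical class, not part of the Wikimedia puppet repo.
class example::dirs (
    String $user    = 'root',
    String $group   = 'root',
    String $basedir = '/tmp/example',
) {
    $firstdir  = "${basedir}/first"
    $seconddir = "${basedir}/second"

    # The file resource does not create missing parent directories,
    # so the parent is managed here as well.
    file { $basedir:
        ensure => 'directory',
        mode   => '0755',
        owner  => $user,
        group  => $group,
    }

    # One declaration with an array of titles creates one file
    # resource per title, all sharing the same attributes. It has the
    # same effect as writing a separate file { $firstdir: ... } and
    # file { $seconddir: ... } with identical attribute lists.
    file { [ $firstdir, $seconddir ]:
        ensure => 'directory',
        mode   => '0755',
        owner  => $user,
        group  => $group,
    }
}

include example::dirs
```

The array form keeps the list of directories in one place, which is handy when a new kind of dump needs its own output directory.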

Not too bad, was it? Next let's look at resources a bit more.

Resources in puppet
We'll have a look at the cron jobs for one of the "other" dumps, and in particular the job that generates lists of titles of articles in the main namespace for each wiki once a day.

{{CodeCommentary|type=code|lang=python|start=1|icon=eyes|pos=2|content=
class snapshot::cron::pagetitles(
    $user      = undef,
    $filesonly = false,
) {
    $cronsdir = $snapshot::dumps::dirs::cronsdir
    $repodir = $snapshot::dumps::dirs::repodir
    $confsdir = $snapshot::dumps::dirs::confsdir
}}

{{CodeCommentary|type=code|lang=python|start=8|icon=eyes|pos=4|content=
​    if !$filesonly {
        cron { 'pagetitles-ns0':
            ensure      => 'present',
            environment => 'MAILTO=ops-dumps@wikimedia.org',
            user        => $user,
            command     => "cd ${repodir}; python3 onallwikis.py --configfile ${confsdir}/wikidump.conf.dumps:monitor  --filenameformat '{w}-{d}-all-titles-in-ns-0.gz' --outdir '${cronsdir}/pagetitles/{d}' --query \"'select page_title from page where page_namespace=0;'\"",
            minute      => '10',
            hour        => '8',
        }
}}

{{CodeCommentary|type=code|lang=python|start=17|icon=eyes|pos=2|content=
​        cron { 'pagetitles-ns6':
            ensure      => 'present',
            environment => 'MAILTO=ops-dumps@wikimedia.org',
            user        => $user,
            command     => "cd ${repodir}; python3 onallwikis.py --configfile ${confsdir}/wikidump.conf.dumps:monitor  --filenameformat '{w}-{d}-all-media-titles.gz' --outdir '${cronsdir}/mediatitles/{d}' --query \"'select page_title from page where page_namespace=6;'\"",
            minute      => '50',
            hour        => '8',
        }
    }
}
}}
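The same pattern, cut down to the bare minimum, might look like the sketch below. Everything in it is hypothetical (the class name example::cron::hello, the job name and the echo command); it only illustrates how a boolean parameter like $filesonly can switch a cron resource on or off, and it assumes the cron resource type is available on the host.

```puppet
# Hypothetical class, not part of the Wikimedia puppet repo.
class example::cron::hello (
    $user      = undef,
    $filesonly = false,
) {
    # With filesonly => true, no cron job is declared at all.
    if !$filesonly {
        cron { 'example-hello':
            ensure  => 'present',
            user    => $user,
            command => '/bin/echo hello',
            # Attributes left out (monthday, month, weekday) are
            # unrestricted, so this runs daily at 08:10.
            minute  => '10',
            hour    => '8',
        }
    }
}

# Resource-like declaration, passing a value for one parameter and
# leaving $filesonly at its default.
class { 'example::cron::hello':
    user => 'root',
}
```

Setting ensure to 'absent' instead of 'present' is the usual way to have puppet remove a job it previously installed; simply deleting the resource from the manifest leaves the crontab entry in place.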

Roles, profiles and classes
Each server gets one role. Each role is built from profiles, which are themselves just collections of puppet manifests. Each profile in turn is built from classes from various modules, and sometimes from other profiles.
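As a rough sketch of that layering, with every name below invented for illustration (none of these classes or hosts exist in the Wikimedia repo, and the node block is plain generic puppet rather than the way hosts are actually mapped to roles there):

```puppet
# A class from some module: does the actual work.
class example_module::service (
    Integer $port = 8080,
) {
    notify { "example service would listen on port ${port}": }
}

# A profile: wraps module classes and feeds them their settings.
class profile::example::service (
    Integer $port = 8080,
) {
    class { 'example_module::service':
        port => $port,
    }
}

# A role: nothing but a list of profiles.
class role::example::server {
    include profile::example::service
}

# A server: gets exactly one role.
node 'example1001.example.org' {
    include role::example::server
}
```

Keeping roles free of logic means that reading a role tells you at a glance what a host is for, while the details live in the profiles and modules.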

Roles
Let's have a look. We'll start with the dumper role, which is applied to snapshot hosts that just run XML/SQL dumps and nothing else.

{{CodeCommentary|type=code|lang=python|start=1|content=
class role::dumps::generation::worker::dumper {
    include ::profile::standard
    include ::profile::base::firewall
}}

{{CodeCommentary|type=code|lang=python|start=7|content=
​    system::role { 'dumps::generation::worker::dumper':
        description => 'dumper of XML/SQL wiki content',
    }
}
}}

Profiles
Let's have a closer look at the common profile, since it's used in all the roles.

{{CodeCommentary|type=code|lang=python|start=1|icon=eyes|pos=21|content=
class profile::dumps::generation::worker::common(
    $dumps_nfs_server = lookup('dumps_nfs_server'),
    $cron_nfs_server = lookup('dumps_cron_nfs_server'),
    $managed_subdirs = lookup('dumps_managed_subdirs'),
    $datadir_mount_type = lookup('dumps_datadir_mount_type'),
    $extra_mountopts = lookup('profile::dumps::generation::worker::common::nfs_extra_mountopts'),
    $php = lookup('profile::dumps::generation::worker::common::php'),
    $dumps_misc_cronrunner = lookup('profile::dumps::generation::worker::common::dumps_misc_cronrunner'),
) {
    # mw packages and dependencies
    require profile::mediawiki::scap_proxy
    require profile::mediawiki::common
    require profile::mediawiki::nutcracker
    class { 'profile::mediawiki::mcrouter_wancache':
        prometheus_exporter => false
    }
    require profile::services_proxy::envoy

    $xmldumpsmount = '/mnt/dumpsdata'

    class { '::dumpsuser': }
}}

{{CodeCommentary|type=code|lang=python|start=22|content=
​    if ($dumps_misc_cronrunner) {
        $nfs_server = $cron_nfs_server
    }
    else {
        $nfs_server = $dumps_nfs_server
    }
    snapshot::dumps::datamount { 'dumpsdatamount':
        mountpoint      => $xmldumpsmount,
        mount_type      => $datadir_mount_type,
        extra_mountopts => $extra_mountopts,
        server          => $nfs_server,
        managed_subdirs => $managed_subdirs,
        user            => 'dumpsgen',
        group           => 'dumpsgen',
    }

    # dataset server config files,
    # stages files, dblists, html templates
    class { '::snapshot::dumps::dirs':
        user               => 'dumpsgen',
        xmldumpsmount      => $xmldumpsmount,
        xmldumpspublicdir  => "${xmldumpsmount}/xmldatadumps/public",
        xmldumpsprivatedir => "${xmldumpsmount}/xmldatadumps/private",
        dumpstempdir       => "${xmldumpsmount}/xmldatadumps/temp",
        cronsdir           => "${xmldumpsmount}/otherdumps",
        apachedir          => '/srv/mediawiki',
    }
    class { '::snapshot::dumps': php => $php}

    # scap3 deployment of dump scripts
    scap::target { 'dumps/dumps':
        deploy_user => 'dumpsgen',
        manage_user => false,
        key_name    => 'dumpsdeploy',
    }
    ssh::userkey { 'dumpsgen':
        content => secret('keyholder/dumpsdeploy.pub'),
    }
}
}}
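The profile above gets all of its settings through lookup() and then picks an NFS server with an if/else. Here is a small self-contained sketch of both ideas. The class name, the lookup keys and the host names are invented, and the default_value options are only there so the sketch runs without any hiera data behind it; the real profile has no such defaults. The selector expression is an alternative way of writing the if/else, not what the real manifest does.

```puppet
# Hypothetical profile, not part of the Wikimedia puppet repo.
class profile::example::worker (
    String  $dumps_nfs_server      = lookup('example_dumps_nfs_server', { 'default_value' => 'nfs-a.example.org' }),
    String  $cron_nfs_server       = lookup('example_cron_nfs_server', { 'default_value' => 'nfs-b.example.org' }),
    Boolean $dumps_misc_cronrunner = lookup('example_misc_cronrunner', { 'default_value' => false }),
) {
    # A selector yields a value, so the choice fits in one assignment.
    $nfs_server = $dumps_misc_cronrunner ? {
        true    => $cron_nfs_server,
        default => $dumps_nfs_server,
    }

    notify { "would mount dumps data from ${nfs_server}": }
}

include profile::example::worker
```

Without a default_value (or a matching key in hiera), lookup() fails the catalog compilation, which is often what you want for settings that must be defined per host.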