Wikimedia Platform Engineering/MediaWiki Core Team/Check-ins/20131112

who: Brad, CSteipp, Nik, Antoine, Ori, Rob, Tim, Greg

RFC process

DevOps sprint
 * Graphite Puppet module merged; provisioned on tungsten (professor replacement). Load is distributed over four carbon-cache instances, which should resolve the issue observed on professor (a single instance bound to one core by the Python GIL). Still need to copy the whisper files and set up udpprofiler or a replacement.
 * rsync running with --compress (faster, we hope; CDB files compress well) and --delay-updates (more atomic)
 * Updating the branch & config repositories on tin should !log the post-update commit SHA1 to the SAL; Sam noticed a bug, though.
 * Ryan working on Trebuchet patches (in progress), including the per-Apache-generated l10n cache

Performance work
 * Mostly Graphite (see DevOps item #1)
 * Designed an experiment with Aaron Halfaker to evaluate the impact of module storage on page load time. (Assign 0.1% of visitors to the experiment, divided equally into control and experiment groups. Page load timing will be logged for both groups, but module storage will only be enabled for the experiment group.)
 * Schema: https://meta.wikimedia.org/wiki/Schema:ModuleStorage; change 94840
 * More ULS performance troubleshooting (bug 56856)
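The sampling scheme above (0.1% of visitors, split evenly into control and experiment) could be sketched as follows. Hash-based assignment is an assumption for illustration; the actual EventLogging bucketing may work differently.

```python
import hashlib

SAMPLE_RATE = 0.001  # 0.1% of visitors enter the experiment
BUCKETS = 10000      # resolution of the hash-based draw


def assign_group(token):
    """Deterministically assign a visitor token to a group.

    Returns 'control', 'experiment', or None (visitor not sampled).
    The same token always maps to the same group, so a visitor's
    assignment is stable across page views.
    """
    h = int(hashlib.sha256(token.encode()).hexdigest(), 16)
    if (h % BUCKETS) >= SAMPLE_RATE * BUCKETS:
        return None  # the other 99.9% of visitors
    # Sampled visitors are split evenly between the two groups;
    # module storage would be enabled only for 'experiment'.
    return "experiment" if (h // BUCKETS) % 2 else "control"
```

Because assignment derives from the token alone, no server-side state is needed to keep a visitor in the same group.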

Search
 * Having trouble with runJobs.php when run from the web process
 * Must switch how we calculate page weight (the SQL query is too slow)
 * Deploying (now) to nlwiki
 * Noise in the logs lately. Specific fixes merged; general fixes scheduled.
 * Bug 56968: we think we’re doing two parses on updates in the web process
 * Product is working on an overall direction for search; Design is coming up with mockups for how it could work.

Zuul upgrade
 * Wednesday 20th Nov around 8-9am UTC
 * Will use Gearman as a backend to trigger jobs in Jenkins
 * Plus bug and performance fixes
 * Antoine to schedule it properly in the deployment calendar.

PDF generation replacement

Bug escalation
 * Bug 56840: Special:AllPages is too slow!
 * Bug 56882: memcached-serious log flooded with TIMED RETRY errors

Errors & deployments (tabling this as a discussion item if there’s time)
 * Proposal: block deployments when errors or fatals appear in prod until those errors go away
 * Be deliberately naive, taking severity at face value