Analytics/Server Admin Log

2019-12-16

 * 11:14 joal: Clean spark-shell drivers on cluster before kerberos
 * 10:46 elukey: stop airflow-* on an-airflow1001
 * 10:41 elukey: stop jupyterhub on notebook100[3,4] as prep step for kerberos
 * 10:38 elukey: kill Nuria's spark shell application masters in Yarn
 * 10:17 elukey: stop hadoop-related timers on stat1007
 * 10:04 joal: Killing user-app eating all cluster (application_1573208467349_190044)
 * 09:05 joal: Rerun webrequest-load-wf-text-2019-12-14-18 with updated error-checking parameters (all false positive)
 * 08:49 elukey: re-run webrequest-load 2019-12-14-13 and 2019-12-15-12 with higher mapreduce limits (modified version of refinery on hdfs /user/elukey with https://gerrit.wikimedia.org/r/#/c/analytics/refinery/+/557794/)
 * 07:22 elukey: stop camus timers as prep step for maintenance (if we'll do it)

2019-12-13

 * 07:42 elukey: execute reset-failed for monitor_refine_mediawiki_job_events on an-coord1001

2019-12-12

 * 18:46 elukey: rsync timers deployed on labstore100[6,7]
 * 15:23 elukey: execute systemctl reset-failed monitor_refine_mediawiki_job_events after Andrew's comment on alerts@
 * 12:59 elukey: roll restart hadoop workers to pick up the new settings (removed prefer ipv4 false after T240255)
 * 12:40 elukey: enable timers on an-coord1001 after maintenance
 * 12:39 elukey: restart hive and oozie on an-coord1001 to pick up ipv6 settings
 * 11:14 elukey: stop timers on an-coord1001 as prep step for hive/oozie restart

2019-12-11

 * 07:07 elukey: kill/re-run pageview 2019-12-10-17, stuck in whitelist check for hours (https://hue.wikimedia.org/jobbrowser/jobs/job_1573208467349_171800 for more info)

2019-12-10

 * 14:34 elukey: shutdown of stat1004 to check if it can hold a GPU
 * 14:08 jbond42: rolling restart of varnishkafka-webrequest and varnishkafka-eventlogging

2019-12-05

 * 09:35 elukey: enable timers on an-coord1001 after maintenance
 * 09:34 elukey: stop oozie/hive-*; restart mariadb; restart oozie/hive-* on an-coord1001 to pick up explicit_defaults_for_timestamp - T236180
 * 09:06 elukey: temporarily stop timers on an-coord1001 to ease the restart of mariadb on an-coord1001

2019-12-04

 * 20:57 milimetric: finished refinery-deploy-to-hdfs from stat1004 but something's broken on stat1007 in the /srv/deployment/analytics/refinery repo
 * 20:08 milimetric: deployed refinery source
 * 11:36 elukey: restart mariadb on analytics1030 (hadoop test coordinator) to test explicit_defaults_for_timestamp - T236180

2019-12-03

 * 09:48 joal: Kill restart mediawiki-history-load-coord after sqoop re-import of missing tables

2019-12-02

 * 20:27 joal: restart cassandra bundle
 * 20:17 joal: Deploying refinery to hdfs - Last for today!
 * 20:00 joal: Deploy refinery using scap to fix today's deploy (last)
 * 19:20 joal: Manually kill cassandra-coord-mediarequest-per-referer-hourly from bundle as it shouldn't exist
 * 19:07 joal: restart cassandra bundle after redeployed patch
 * 18:40 joal: Deploy refinery onto hdfs
 * 18:39 joal: Deploy refinery using scap for fixes
 * 16:43 joal: Restarting cassandra bundle after deploy
 * 11:40 joal: Restart mediawiki-geoeditors-monthly-coord
 * 11:39 joal: Drop wmf.geoeditors_daily table and create wmf.editors_daily, moving underlying data and recreating partitions
 * 11:35 joal: Kill mediawiki-geoeditors-monthly-coord before updating the job
 * 10:30 joal: Manually sqoop tables not yet done because of late deploy (content_models, content, slots, slot_roles, wbt_entity_usage)
 * 10:21 joal: Create new tables for newly sqooped data in hive wmf_raw database
 * 09:43 joal: Deploying refinery onto HDFS
 * 09:22 joal: Deploy refinery using scap

2019-11-27

 * 09:16 elukey: apply systemd user limits to stat1005
 * 07:10 elukey: apply systemd user limits to stat1006,stat1007 and notebook100*

2019-11-26

 * 17:19 elukey: add systemd user limits to stat1004

2019-11-25

 * 13:27 elukey: set global read_only=1 on db1108's log database

2019-11-21

 * 20:07 mforns: deploying refinery to add pageview whitelist changes and stop alerts
 * 15:50 mforns: deployed refinery (with v0.0.107)
 * 15:10 mforns: deployed refinery-source v0.0.107
 * 06:59 elukey: restart hdfs-cleaner on an-coord1001

2019-11-19

 * 19:00 elukey: regenerate TLS cert for yarn.wikimedia.org (containing SANs for all analytics UIs) to add datasets.w.o SAN (site was failing due to ATS not being able to contact thorium)
 * 13:54 joal: Deleting 600 more log-folders from analytics user (cassandra backfilling logs) -- T238648
 * 13:46 joal: Deleting old parquet wikitext data (new data is stored in Avro) -- T238648
 * 13:46 joal: Deleting 100 heavier log-folders from analytics user (cassandra backfilling logs) -- T238648
 * 07:51 elukey: restart hdfs-cleaner on an-coord1001

2019-11-18

 * 20:03 joal: Rerun failed mediawiki_wikitext_history oozie job (2019-10)

2019-11-16

 * 09:44 elukey: systemctl restart hadoop-* on analytics1077 after oom killer

2019-11-15

 * 17:05 elukey: restart hdfs-cleaner (failed due to tmp hive files not present when deleting)

2019-11-14

 * 16:09 elukey: roll restart presto-server on an-presto* to pick up new openjdk
 * 08:47 joal: starting hdfs-cleaner manually after a failure earlier this night
 * 08:37 fdans: initiating backfilling of daily top mediarequests from the mediacounts database - May 2018 to May 2019

2019-11-12

 * 16:52 elukey: forced a purge in Varnish for the stats.wikimedia.org front page to pick up the new deprecation banner
 * 15:42 fdans: manually overwriting index.html in Wikistats 1 to apply patch https://gerrit.wikimedia.org/r/#/c/analytics/wikistats/+/550338/

2019-11-08

 * 12:34 elukey: roll restart cassandra on aqs to pick up new openjdk upgrades
 * 12:22 elukey: restart oozie and hive daemons on an-coord1001
 * 09:05 elukey: roll restart druid daemons on druid public to pick up the new jvm
 * 08:34 elukey: roll restart druid daemons on druid analytics to pick up the new jvm
 * 08:34 elukey: restart kafka on kafka-jumbo1001 to test openjdk

2019-11-07

 * 17:33 elukey: restart zookeeper on druid nodes for jvm upgrades
 * 17:33 elukey: restart all jvms on hadoop test workers
 * 15:41 elukey: roll restart all jvms on Hadoop Analytics Workers to pick up the new jvm
 * 12:18 joal: Deleting stat1007:/srv/reportupdater/output/metrics/reference-previews/baseline.tsv as asked by awight

2019-11-06

 * 22:36 milimetric: successfully restarted webrequest bundle, webrequest druid daily, and webrequest druid hourly
 * 21:21 milimetric: restarting webrequest load bundle and druid loading jobs
 * 21:21 milimetric: refinery deployed, hdfs cleaner and tls ready to be restarted
 * 20:16 joal: Kill-rerun pageview-hourly-wf-2019-11-6-13 for being stuck in whitelist-check
 * 18:13 joal: restart refinery-import-page-current-dumps.service to test after yesterday's failure

2019-11-05

 * 21:06 joal: restarting oozie jobs after spark 2.4.4 upgrade
 * 21:04 ottomata: re-enabling refine jobs after spark 2.4.4 upgrade
 * 20:57 joal: Starting denormalize-check one month in advance to enforce a running job with new spark
 * 20:37 ottomata: roll restarting hadoop-yarn-nodemanagers to pick up spark 2.4.4 shuffle lib
 * 20:21 ottomata: install spark 2.4.4-bin-hadoop2.6-1 cluster wide using debdeploy - T222253
 * 20:18 joal: Deploying refinery onto HDFS
 * 20:12 ottomata: stopped refine jobs for Spark 2.4 upgrade - T222253
 * 20:09 joal: Deploying refinery using scap with missing patch
 * 20:00 joal: Deploying refinery using scap
 * 18:49 joal: Make Jenkins release refinery-source v0.0.105 to archiva
 * 17:12 ottomata: 2019-11-05T17:11:50.239 INFO HDFSCleaner Deleted 872360 files and directories in tmp
 * 17:01 ottomata: first run of HDFSCleaner on /tmp, should delete files older than 31 days
 * 11:00 fdans: testing load of top metric from mediarequests with corrected quotemarks escaping

2019-11-04

 * 23:28 milimetric: deployed refinery
 * 14:58 joal: restarting AQS using scap after snapshot bump (2019-10)

2019-10-31

 * 19:45 fdans: (actually no, no need)
 * 19:43 fdans: (changing jar version first)
 * 19:43 fdans: restarting mediawiki-history-wikitext
 * 19:42 fdans: refinery deployment complete
 * 19:17 fdans: updating jar symlinks to 0.0.104
 * 17:59 fdans: deploying refinery
 * 17:49 fdans: deploying refinery-source 0.0.104
 * 16:36 elukey: restart oozie and hive-server2 on an-coord1001 to pick up new TLS mapreduce settings
 * 15:31 joal: Rerun webrequest jobs for hour 2019-10-31T14:00 after failure
 * 14:53 elukey: enabled encrypted shuffle option in all Hadoop Analytics Yarn Node Managers
 * 10:17 elukey: deploy TLS certificates for MapReduce Shufflers on Hadoop worker nodes (no-op change, no yarn-site config)

2019-10-30

 * 15:00 ottomata: disabling eventlogging-consumer mysql on eventlog1002
 * 08:31 joal: Rerun failed cassandra-daily-coord-local_group_default_T_mediarequest_per_file days: 2019-10-26, 2019-10-23 and 2019-10-22
 * 06:30 elukey: re-run cassandra-coord-pageview-per-article-daily 29/10/2019

2019-10-29

 * 08:51 fdans: starting backfilling for per file mediarequests for 7 days from Sep 15 2015
 * 07:09 elukey: roll restart java daemons on analytics1042, druid1003 and aqs1004 to pick up new openjdk upgrades

2019-10-28

 * 10:10 fdans: mediarequest per file backfilling suspended
 * 09:14 elukey: manual re-run of cassandra-coord-pageview-per-article-daily - 26/10/2019 - as attempt to see if the error is reproducible or not (timeout while inserting into cassandra)

2019-10-24

 * 13:54 fdans: running top mediarequest backfill from 2015-01-02 to 2019-05-01

2019-10-23

 * 18:59 milimetric: refinery deployment re-done to fix my mistake
 * 18:37 mforns: refinery deployment done!
 * 18:31 mforns: deploying refinery with refinery-deploy-to-hdfs up to 1110d59c3983bcff4986bce1baf885f05ee06ba5
 * 18:21 mforns: deploying refinery with scap up to 1110d59c3983bcff4986bce1baf885f05ee06ba5

2019-10-22

 * 15:47 fdans: start backfilling of mediarequests per file from 2015-01-02 to 2019-05-17 after ok vetting of 2015-01-01

2019-10-18

 * 14:45 fdans: backfilling 2015-1-1 for mediarequests per file, proceeding with all days until 2019-05-17 if successful

2019-10-17

 * 18:01 elukey: update librdkafka on eventlog1002 and restart eventlogging
 * 10:26 elukey: rollback eventlogging back to Python 2, some errors (unseen in tests) logged by the processors
 * 10:18 elukey: move eventlogging to python 3

2019-10-16

 * 20:27 ottomata: upgrading to spark 2.4.4 in analytics test cluster
 * 20:20 joal: Kill-restart mediawiki-history-dumps-coord to pick up changes
 * 20:16 joal: Deployed refinery onto HDFS
 * 20:08 joal: Deployed refinery using scap
 * 19:45 joal: Refinery-source v0.0.103 released to refinery
 * 19:29 joal: Ask jenkins to release refinery-source v0.0.103 to archiva
 * 19:19 joal: AQS deployed with mediarequest-top endpoint
 * 18:45 joal: Manually create mediarequest-top cassandra keyspace and tables, and add fake test data into it

2019-10-15

 * 13:15 elukey: re-enable timers on an-coord1001
 * 12:57 fdans: resumed backfilling of mediarequests per referer daily
 * 12:46 elukey: moved hadoop cluster to new zookeeper cluster
 * 11:25 elukey: stop all systemd timers on an-coord1001 as prep step for hadoop maintenance
 * 10:42 fdans: backfilling January 1st 2015 for mediarequests per referer daily, proceeding with all days until May 2019 if successful

2019-10-14

 * 18:13 joal: Manually add ban.wikipedia.org to pageview whitelist (T234768)
 * 14:28 elukey: matomo upgraded to 3.11 on matomo1001

2019-10-11

 * 12:51 elukey: deployed eventlogging python3 version in deployment-prep
 * 07:09 elukey: drop test_wmf_netflow from druid analytics and restart turnilo
 * 06:24 elukey: remove /tmp/hive-staging_hive_(2017|2018)* data from HDFS instead of /tmp/* to avoid causing hive failures (it needs to write temporary data for the current running jobs)
 * 06:04 elukey: delete content of /tmp/* on HDFS

2019-10-10

 * 09:13 joal: rerun failed pageview hour after manual job killing (pageview-hourly-wf-2019-10-9-19)
 * 09:13 joal: Kill stuck oozie launcher in yarn (application_1569878150519_43184)

2019-10-09

 * 20:52 milimetric: deploy of refinery and refinery-source 0.0.102 finally seems to have finished
 * 19:55 milimetric: refinery ... probably? deployed with errors like "No such file or directory (2)\nrsync error"
 * 17:11 elukey: restart druid-broker on druid100[5-6] - not serving data correctly

2019-10-08

 * 09:22 elukey: delete druid old test datasource from the analytics cluster - test_kafka_event_centralnoticeimpression

2019-10-07

 * 17:46 ottomata: powercycling stat1007
 * 06:08 elukey: upgrade python-kafka on eventlog1002 to 1.4.7-1 (manually via dpkg -i)

2019-10-05

 * 18:18 elukey: kill/restart mediawiki-history-reduced oozie coord to pick up the new druid_loader.py version on HDFS
 * 06:49 elukey: force umount/remount of /mnt/hdfs on an-coord1001 - processes stuck in D state, fuser proc consuming a ton of memory

2019-10-04

 * 16:27 ottomata: manually rsyncing mediawiki_history 2019-08 snapshot to labstore1006

2019-10-03

 * 14:17 elukey: stop the Hadoop test cluster to migrate it to the new kerberos cluster
 * 13:26 elukey: re-run refinery-download-project-namespace-map (modified with recent fixes for encoding and python3)
 * 09:48 elukey: ran apt-get autoremove -y on all Hadoop workers to remove old Python 2 deps
 * 08:43 elukey: apply 5% threshold to the HDFS balancer - T231828
 * 07:48 elukey: restart druid-broker on druid1003 (used by superset)
 * 07:47 elukey: restart superset to test if a stale status might cause data not to be shown

2019-10-02

 * 21:21 nuria: restarting superset
 * 16:18 elukey: kill duplicate of oozie pageview-druid-hourly coord and start the wrongly killed oozie pageview-hourly-coord (causing jobs to wait for data)
 * 13:12 elukey: remove python-request from all the hadoop workers (shouldn't be needed anymore)
 * 13:08 elukey: kill/start oozie webrequest druid daily/hourly coords to pick up new druid_loader.py version
 * 13:04 elukey: kill/start oozie virtualpageview druid daily/monthly coords to pick up new druid_loader.py version
 * 12:54 elukey: kill/start oozie unique devices per family druid daily/daily_agg_mon/monthly coords to pick up new druid_loader.py version
 * 10:24 elukey: restart unique dev per domain druid daily_agg_monthly/daily/monthly coords to pick up new hdfs version of druid_loader.py
 * 10:15 elukey: re-run unique devices druid daily 28/09/2019 - failed but possibly no alert was fired to analytics-alerts@
 * 09:48 elukey: restart pageview druid hourly/daily/monthly coords to pick up new hdfs version of druid_loader.py
 * 09:45 elukey: restart mw geoeditors druid coord to pick up new hdfs version of druid_loader.py
 * 09:41 elukey: restart edit druid hourly coord to pick up new hdfs version of druid_loader.py
 * 09:38 elukey: restart banner activity druid daily/monthly coords to pick up new hdfs version of druid_loader.py
 * 08:31 elukey: kill/restart mw check denormalize with hive2_jdbc parameter

2019-09-30

 * 21:05 ottomata: rolling restart of hdfs namenode and hdfs resourcemanager to take presto proxy user settings
 * 05:26 elukey: re-run manually pageview-druid-hourly 29/09T22:00

2019-09-27

 * 06:44 elukey: clean up files older than 30d in /var/log/{oozie,hive} on an-coord1001

2019-09-26

 * 18:42 mforns: finished deploying refinery using scap (together with refinery-source 0.0.101)
 * 18:27 mforns: deploying refinery using scap (together with refinery-source 0.0.101)
 * 17:33 elukey: run apt-get autoremove on stat* and notebook* to clean up old python2 deps
 * 15:01 mforns: deploying analytics/aqs using scap
 * 13:04 elukey: removing python2 packages from the analytics hosts (not from eventlog1002)
 * 11:13 mforns: deployed analytics-refinery-source v0.0.101 using Jenkins
 * 05:47 elukey: upload the new version of the pageview whitelist - https://gerrit.wikimedia.org/r/539225

2019-09-25

 * 13:37 elukey: move the Hadoop test cluster to the Analytics Zookeeper cluster
 * 08:37 elukey: add netflow realtime ingestion alert for Druid
 * 06:02 elukey: set python3 for all report updater jobs on stat1006/7

2019-09-24

 * 14:46 ottomata: temporarily disabled camus-mediawiki_analytics_events systemd timer on an-coord1001 - T233718
 * 13:18 joal: Manually repairing wmf.mediawiki_wikitext_history
 * 06:07 elukey: update Druid Kafka supervisor for netflow to index new dimensions

2019-09-23

 * 20:56 ottomata: created new camus job for high volume mediawiki analytics events: mediawiki_analytics_events
 * 16:46 elukey: deploy refinery again (no hdfs, no source) to deploy the latest python fixes
 * 09:25 elukey: temporarily disable *drop* timers on an-coord1001 to verify refinery python change with the team
 * 08:24 elukey: deploy refinery to apply all the python2 -> python3 fixes
 * 07:44 elukey: restart manually refine_mediawiki_events on an-coord1001 with --since 48 to force the refinement after camus backfilled the missing data
 * 07:41 elukey: manually applied https://gerrit.wikimedia.org/r/#/c/analytics/refinery/+/538235/ on an-coord1001
 * 06:21 elukey_: restart camus mediawiki_events on an-coord1001 with increased mapreduce heap size

2019-09-21

 * 09:00 fdans: resumed per file mediarequests backfilling coordinator

2019-09-20

 * 17:04 elukey: restart yarn/hdfs daemons on analytics1045
 * 17:01 elukey: remove /var/lib/hadoop/j from analytics1045 due to a broken disk

2019-09-19

 * 13:31 joal: Kill-restart webrequest-load bundle to fix queue issue
 * 10:37 elukey: manually rollback /srv/deployment/analytics/refinery/bin/refinery-drop-hive-partitions to "#!/usr/bin/env python" on stat1007
 * 09:16 fdans: starting load to cassandra of mediarequests per file daily

2019-09-18

 * 19:23 joal: Deploy AQS using scap - Try 3
 * 18:59 joal: Deploy AQS using scap - Try 2
 * 18:53 joal: Deploy AQS using scap
 * 18:16 joal: Start mediawiki-history-dumps oozie job starting with August 2019
 * 18:10 joal: Kill-restart webrequest-load oozie job to pick-up new ua-parser
 * 18:09 joal: Restart eventlogging with new ua-parser (ottomata did)
 * 16:46 elukey: manually restarted the refinery-drop-older-than jobs
 * 16:45 elukey: manually set "#!/usr/bin/env python" for refinery-drop-older-than on an-coord1001 to restore functionality (minor bug encountered)
 * 13:41 joal: Deploy refinery to hdfs
 * 13:35 joal: Deploying refinery using scap
 * 12:54 elukey: re-run webrequest-load upload/text for hour 11 due to transient hive server socket failures
 * 12:39 joal: Release refinery-source v0.0.100 to archiva

2019-09-17

 * 08:19 elukey: manually decommed analytics1032 for hdfs/yarn on the Hadoop testing cluster - T233080
 * 07:50 joal: Manually released com.github.ua-parser/uap-java 1.4.4-core0.6.9~1-wmf to archiva

2019-09-16

 * 12:41 elukey: rebooting the hadoop test cluster with the new spicerack cookbook as test
 * 10:04 elukey: disable puppet on an-coord1001 and manually forcing python3 for camus - T204735
 * 07:25 joal: Delete matomo error with URL http://Wikipedia/screen/Explore

2019-09-13

 * 16:57 joal: Reset ua-parser/uap-java wmf branch to up-to-date master using push force

2019-09-12

 * 09:35 elukey: drop old database 'superset' from analytics-meta (an-coord1001) after a precautionary backup

2019-09-11

 * 18:42 nuria: deployment of v0.0.99 to cluster succeeded, letting it bake for a bit
 * 18:14 nuria: deployment of v0.0.99 of refinery that includes quite a bit of cleanup
 * 08:33 elukey: stat1005 upgraded with ROCm 2.7.1

2019-09-10

 * 21:34 ottomata: restarting archiva service on archiva1001
 * 18:57 joal: Manually fixed dewiki wikitext for snapshot=2019-07 (snapshot is now full and complete despite oozie error)

2019-09-04

 * 15:55 joal: Deploy refinery using scap (fix for SLAs)
 * 08:46 joal: Fix mediacounts-archive SLA and kill-restart job

2019-09-03

 * 19:21 fdans: finished restart of all hosts, 2019-08 snapshot deployed
 * 19:18 fdans: restarting service on aqs1004
 * 15:20 fdans: creating test Cassandra keyspace "local_group_default_T_request_per_file_TEST"
 * 13:43 joal: Deploying refinery using scap for fixes
 * 13:30 joal: Kill-restart mediawiki-history jobs (denormalize, check_denormalize, reduced, metrics, wikitext)
 * 13:26 joal: Kill-restart geoeditors jobs (monthly, yearly and druid)
 * 13:21 joal: Kill-restart edit jobs (hourly and druid)
 * 13:06 joal: Kill-restart unique_devices per_project_family jobs (daily, monthly, druid daily, druid daily aggregated monthly, and druid monthly)
 * 13:00 joal: Kill-restart unique_devices per_domain jobs (daily, monthly, druid daily, druid daily aggregated monthly, and druid monthly)
 * 12:43 joal: Kill-restart mobile-apps jobs (app-session, uniques daily and monthly)
 * 12:34 joal: Kill-restart virtualpageview druid jobs (daily and monthly)
 * 12:29 joal: Kill-restart wikidata jobs (article_placeholder, coeditors, specialentitydata)
 * 12:28 joal: Fix interlanguage for naming convention
 * 12:24 joal: Kill-restart interlanguage job
 * 12:20 joal: Kill restart browser-general and clickstream
 * 12:17 joal: Manually create success-files for banner_activity monthly to start
 * 12:11 joal: Kill-restart banner-activity jobs (daily and monthly)
 * 12:08 joal: Restart mediarequest job after hotfix (renaming) and needed ops (table change and data move)
 * 11:43 joal: Kill-restart virtualpageview_hourly
 * 11:42 joal: Kill mediarequest-hourly (more ops to do before restarting)
 * 11:39 joal: Kill-restart mediacount jobs (load and archive)
 * 11:33 joal: Kill-restart pageview-druid jobs (hourly, daily, monthly)
 * 11:29 joal: Kill restart projectview_geo job
 * 11:28 joal: Kill restart projectview_hourly job
 * 11:25 joal: Kill-restart webrequest-druid jobs (hourly and daily)
 * 11:16 joal: Kill-restart pageview_hourly, aqs_hourly and apis jobs
 * 11:10 joal: Fixing data-quality bundle and coord for restart
 * 11:05 joal: Kill-restart data-quality bundle
 * 11:01 joal: Kill-restart cassandra bundle (beginning of month)
 * 10:56 joal: Hotfixing webrequest-load job to prevent redeploying
 * 10:50 joal: Kill/restart webrequest bundle
 * 08:21 joal: Kill-restart mediawiki-load and geoeditors-load jobs after corrective deploy
 * 08:10 joal: Deploy refinery onto HDFS

2019-08-27

 * 20:46 ottomata: rolling back to jupyterlab version 0.32.1, 1.0.x is not compatible with Stretch's version of nodejs - T230724
 * 16:02 mforns: restarted turnilo to apply changes to config

2019-08-26

 * 19:06 ottomata: update spark2 package to -4 version with support for python3.7 across cluster. T229347
 * 11:30 joal: Remove tracking failure for http://Wikipedia/screen/Explore in matomo

2019-08-23

 * 12:24 joal: Rerunning refine for eventlogging-analytics for 2019-08-23T03:00
 * 09:38 elukey: restart hive-server2 on an-coord1001 to pick up new settings - T209536
 * 08:06 joal: Launch mediawiki-history-dump test from marcel folder

2019-08-22

 * 15:13 elukey: remove reading_depth druid load job from an-coord1001
 * 14:20 joal: Start mediarequests oozie coordinator from 2019-08-14T12:00
 * 13:41 joal: Deploying refinery onto hdfs
 * 13:18 joal: Deploying refinery with scap
 * 12:48 joal: Releasing refinery v0.0.98 on archiva from jenkins after correction
 * 12:10 joal: Release refinery-source v0.0.98 to archiva (correction)
 * 12:10 joal: Release refinery-source v0.0.98 to jenkins
 * 09:34 elukey: clean up loop_* workflows in the oozie db (oozie stuck for some reason, most of the coords not processing anything for hours)
 * 08:35 joal: Restart webrequest bundle
 * 08:32 joal: Manually kill all leftover workflows from mediawiki-history-dumps
 * 08:29 joal: Kill webrequest-load bundle
 * 08:18 moritzm: restarted oozie on an-coord1001
 * 07:32 joal: Rerun webrequest-load for text and upload, hours 21 and 22
 * 07:28 joal: Suspend/resume stalled coordinators in hue

2019-08-21

 * 14:27 elukey: swap turnilo backend in varnish from analytics-tool1002 to an-tool1007

2019-08-20

 * 06:57 elukey: drop wmf_netflow from Analytics druid and restart the job with more dimensions

2019-08-14

 * 17:58 fdans: backfilling mediarequests from 2019-5-16 to 2019-8-14

2019-08-13

 * 08:47 elukey: kill/restart mediawiki geoeditors|history load, history wikitext to pick up changes to the repair workflow (hive2 actions)
 * 08:40 elukey: kill/restart mediawiki geoeditors druid/monthly to pick up hive2 actions
 * 08:40 elukey: kill/restart mediawiki history metrics/reduced to pick up hive2 actions
 * 06:23 elukey: kill/restart oozie coord unique_devices per_project_family daily due to missing jdbc_url in coordinator.properties (for hive2 actions)

2019-08-12

 * 22:00 mforns: restarted projectview geo coordinator in oozie
 * 21:55 mforns: restarted all Unique Devices coordinators (except cassandra ones) in oozie
 * 21:28 mforns: restarted all Virtualpageview coordinators in oozie
 * 21:14 mforns: restarted Webrequest druid coordinators in oozie
 * 21:06 mforns: restarted Webrequest bundle in oozie
 * 19:32 mforns: Finished deployment of analytics-refinery up to 5418d3be5f65f7325324d0c15c51b3ca722dde1c
 * 18:33 mforns: Starting deployment of analytics-refinery up to 5418d3be5f65f7325324d0c15c51b3ca722dde1c

2019-08-09

 * 06:36 elukey: restart oozie coords to pick up new hive2 actions (edit hourly, pageview druid daily/hourly/monthly, mobile apps uniques daily/monthly)

2019-08-08

 * 17:47 fdans: refinery deploy successful
 * 17:33 fdans: scap deploy of refinery done
 * 16:40 fdans: deploying refinery
 * 16:38 fdans: updating jars
 * 16:25 fdans: releasing refinery-source 0.0.97 to Maven
 * 16:15 fdans: restarting oozie coordinator pageview-druid-monthly-coord
 * 16:14 fdans: restarting oozie coordinator pageview-druid-daily-coord
 * 16:13 fdans: restarting oozie coordinator pageview-druid-hourly-coord
 * 16:09 fdans: restarting oozie coordinator mobile_apps-uniques-daily-coord
 * 16:08 fdans: restarting oozie coordinator mobile_apps-uniques-monthly-coord
 * 16:02 fdans: restarting edit_hourly

2019-08-02

 * 14:28 mforns: restarting oozie bundle for cassandra and oozie coordinator for edit_hourly
 * 14:27 mforns: finished deploying refinery
 * 13:57 mforns: deploying refinery up to b50a93955952ed863d5ef7703a91ab59f5d979cf (rollback of cassandra and edit_hourly hive2 actions to unbreak production)
 * 13:26 elukey: kill/start edit hourly oozie coordinator as attempt to fix a recurrent failure
 * 08:52 elukey: manually created /tmp/hive/operation_logs on an-coord1001

2019-07-31

 * 18:06 mforns: deployed Wikistats2 version 2.6.5
 * 16:04 mforns: finished deployment of analytics-refinery up to eb2d9b005b26f6dddab2b59f1ba591f1758ec99f
 * 15:37 mforns: starting deployment of analytics-refinery up to eb2d9b005b26f6dddab2b59f1ba591f1758ec99f
 * 12:58 elukey: roll restart zookeeper on druid clusters with spicerack cookbook
 * 08:06 elukey: increase heap size on HDFS Namenodes (an-master100[12]) to 16G

2019-07-30

 * 08:14 mforns: restarted hive-server2
 * 08:14 mforns: restarted hive-metastore
 * 07:59 mforns: restarted oozie in an-coord1001.eqiad.wmnet

2019-07-29

 * 15:34 elukey: roll restart kafka brokers on Jumbo with spicerack
 * 13:01 elukey: roll restart druid jvms on druid100[4-6] via spicerack cookbook
 * 08:56 elukey: roll restart druid jvms on druid100[1-3] via spicerack cookbook
 * 06:31 elukey: restart the hadoop workers' jvms via spicerack cookbook

2019-07-26

 * 07:42 elukey: restart mediacounts-load hourly coordinator after refinery deployment to hdfs
 * 07:33 elukey: restart browser-general daily coordinator to pick up hive2 settings
 * 07:31 elukey: restart banner_impressions daily coordinator to pick up hive2 settings
 * 07:27 elukey: restart aqs coordinator to pick up hive2 settings
 * 07:18 elukey: deploy last version of refinery to HDFS
 * 06:28 elukey: restart aqs coordinator with hive2 actions

2019-07-25

 * 20:47 nuria: restarting banner_activity/druid/daily
 * 20:26 nuria: restarting browser-general oozie job
 * 16:58 elukey: restart the hdfs datanode on an-worker1080 to pick up new Ipv6 settings

2019-07-24

 * 22:53 nuria: uploading of refinery-0.0.95 to archiva failed, resetting archiva pw
 * 21:17 nuria: deployment of refinery 0.0.95 aborted
 * 16:40 ottomata: removed all non reportupdater-queries job repositories from /srv/reportupdater/jobs/ - T222739
 * 07:55 elukey: restart pageview-hourly oozie coordinator to pick up new hive2 action settings

2019-07-23

 * 09:23 elukey: restart projectview-hourly-coordinator with correct config - T228731

2019-07-22

 * 17:33 nuria: finished deploying refinery (no refinery source deploy, just bumping up jars)

2019-07-18

 * 18:34 ottomata: backfilling MobileWikiAppDailyStats data since June 7 to populate missing fields (e.g. appinstallid) in refined data. - T226219
 * 14:28 nuria: deployed refinery v0.0.40

2019-07-17

 * 18:10 nuria: starting build of new refinery-source 0.0.94

2019-07-15

 * 16:46 elukey: add ipv6 aaaa/ptr records for an-worker* hosts (still didn't have them)
 * 14:32 elukey: resize /srv on an-coord1001 to 103G (-10G) to allow lvm backups
 * 14:31 elukey: restart hive/oozie/mariadb on an-coord1001 after maintenance
 * 14:23 elukey: temporarily stop oozie/hive/mariadb for maintenance
 * 13:46 elukey: stop all timers on an-coord1001 as prep step for maintenance
 * 06:50 elukey: run msck repair table mediawiki_wikitext_history in beeline

2019-07-11

 * 21:28 ottomata: rerunning eventlogging_to_druid_readingdepth_hourly
 * 21:26 ottomata: rerunning eventlogging_to_druid_navigationtiming_hourly
 * 21:18 ottomata: rerunning /usr/local/bin/eventlogging_to_druid_prefupdate_hourly
 * 20:31 ottomata: resized /srv on an-coord1001 from 60G to 115G - T227132
 * 16:07 elukey: sudo chown -R analytics:analytics /srv/geoip/archive/ on stat1007
 * 15:47 elukey: chown -R analytics:analytics /wmf/data/archive/geoip on HDFS

2019-07-09

 * 18:58 nuria: re-refining ExternalGuidance events for July 2019
 * 14:47 ottomata: moved all mediawiki_page_* event tables to schema aware refine job
 * 13:26 elukey: enable base::firewall on stat1007

2019-07-08

 * 07:03 elukey: add base::firewall to stat1004

2019-07-05

 * 08:16 elukey: forced manual run of refinery-druid-drop-public-snapshots.service on an-coord1001

2019-07-04

 * 08:00 joal: Kill mediawiki-history-reduced coordinator and restart it with manually patched version

2019-07-03

 * 21:57 nuria: deployed wikistats2 https://gerrit.wikimedia.org/r/#/c/analytics/wikistats2/+/520632/

2019-07-02

 * 10:10 elukey: reset-failed refinery-sqoop-mediawiki-private.service
 * 01:40 milimetric: deployed refinery, restarted refinery-mediawiki-sqoop-private

2019-07-01

 * 20:29 ottomata: removed old refinery deploy caches from an-coord1001 to free up disk space
 * 20:19 milimetric: syncing to hdfs on minor refinery deploy to remove hiwikisource from sqoop lists
 * 10:26 elukey: removed Hive tables and Database from Superset - T223919
 * 06:37 joal: Move newly computed snapshot for 2019-05 in place of original one for new checker run to normally succeed

2019-06-28

 * 20:57 joal: Restart mediawiki-history-denormalize-coord, mediawiki-history-check_denormalize-coord, mediawiki-history-reduced-coord
 * 20:04 joal: drop-recreate mediawiki_history, mediawiki_page_history and mediawiki_user_history tables in hive
 * 18:59 joal: Restart Webrequest bundle
 * 18:53 joal: Kill data-quality-hourly bundle
 * 18:52 joal: Kill webrequest bundle
 * 18:33 joal: Deploy refinery with scap
 * 18:15 joal: Deploy refinery to HDFS
 * 17:43 elukey: deleted /srv/home/nathante/.local/share/Trash/* to free space on notebook1004
 * 17:12 joal: Deploying refinery with scap
 * 17:12 joal: Refinery-source v0.0.93 released to archiva

2019-06-26

 * 07:23 joal: manually rerun webrequest-druid-hourly-wf-2019-6-26-5 (second failure, druid reboot this time)
 * 07:04 joal: Manually rerun pageview-hourly-wf-2019-6-26-5, aqs-hourly-wf-2019-6-26-5 and webrequest-druid-hourly-wf-2019-6-26-5
 * 06:22 elukey: stop camus and other timers on an-coord1001 (prep step for reboot)

2019-06-25

 * 14:00 ottomata: killing mediawiki-load-bundle - T226436

2019-06-20

 * 17:18 milimetric: deployed wikistats 2

2019-06-19

 * 19:57 joal: Killing mediawiki-history-wikitext job because of failures due to userId parsing (same as previous month)
 * 13:41 ottomata: renaming event.mediawiki_page_restrictions_change to event.mediawiki_page_restrictions_change_T226051 - T226051

2019-06-17

 * 18:34 elukey: run hdfs fsck / on an-master1001
 * 15:52 elukey: re-run webrequest-load-wf-upload-2019-6-17-14 and webrequest-load-wf-upload-2019-6-17-13, failed due to reboots
 * 14:34 elukey: manual run of mediawiki-history-drop-snapshot.service to test new debug log
 * 14:16 elukey: re-run webrequest-load-wf-text-2019-6-17-12 manually, failed due to reboots
 * 13:45 elukey: re-run webrequest-load-wf-upload-2019-6-17-12, failed due to reboots

2019-06-16

 * 09:53 elukey: hdfs dfs -chmod o-rw /wmf/data/raw/netflow
 * 09:52 elukey: hdfs dfs -chown -R analytics:analytics-privatedata-users /wmf/data/raw/netflow
 * 08:09 elukey: manually restart refinery-druid-drop-public-snapshots.service with new unit settings (-t druid1004.eqiad.wmnet vs -t druid1004.eqiad.wmnet:8081)

2019-06-14

 * 13:22 joal: Restarting AQS using `scap deploy --service-restart`

2019-06-13

 * 18:18 fdans: deployment complete
 * 17:42 fdans: deploying refinery
 * 17:40 fdans: updating refinery jar symlinks
 * 17:20 fdans: Releasing new version of refinery source (v0.0.92)

2019-06-11

 * 07:38 fdans: reset fail alert for refinery-import-page-history-dumps

2019-06-10

 * 18:12 joal: Restart pageview, pageview-druid-hourly/daily/monthly oozie jobs for them to run in production queue
 * 18:05 joal: Kill/Restart webrequest bundle and move it to production queue
 * 17:54 ottomata: rolling restart of AQS service using scap deploy for new mediawiki_history_snapshot

2019-06-08

 * 08:17 joal: Manually re-run patched refine_eventlogging_analytics on an-coord1001 with flags "--ignore_failure_flag=true --since 48"
 * 08:12 elukey: remove org.wikimedia.analytics.refinery.job.refine.filter_out_non_wiki_hostname from refine's transform functions temporarily to unblock T225342
 * 07:37 elukey: manual run of monitor_refine_eventlogging_analytics
 * 07:28 joal: Manually run refine_eventlogging_analytics on an-coord1001 with flag --ignore_failure_flag=true

2019-06-07

 * 17:42 joal: Drop currently unused /wmf/data/wmf/webrequest_subset folder
 * 17:29 elukey: chown -R analytics:analytics-privatedata-users + chmod o-rw /wmf/data/wmf/netflow on HDFS
 * 17:18 mforns: restarted turnilo to clear deleted datasource
 * 17:17 elukey: restart turnilo to remove the old netflow datasource's settings
 * 17:01 mforns: restarted turnilo to clear deleted datasource
 * 16:18 joal: rerun webrequest-load-wf-text-2019-6-7-14 after failure
 * 09:59 joal: Kill wikitext-history job to prevent more resource consumption because of failures

2019-06-06

 * 09:52 elukey: chown report updater output dirs on stat1007 to analytics:wikidev (was hdfs:wikidev) to unblock creation of new data
 * 09:45 elukey: re-run refine_sanitize_eventlogging_analytics_immediate with since = 900 in the .properties file
 * 06:38 elukey: re-run refine_sanitize_eventlogging_analytics_immediate with since = 48 in the .properties file (manually added)
 * 05:36 elukey: chown analytics:analytics /wmf/data/event_sanitized/{CentralNoticeTiming,LayoutJank,EventTiming,ElementTiming} (new directories created with yarn:analytics)

2019-06-05

 * 20:59 mforns: finished deployment of analytics/refinery up to 0660e70153dec892ae20bee7119a72cc17e8ec87
 * 20:20 mforns: starting deployment of analytics/refinery up to 0660e70153dec892ae20bee7119a72cc17e8ec87
 * 18:20 mforns: finished deployment of analytics/refinery/source v0.0.91
 * 18:00 mforns: starting deployment of analytics/refinery/source v0.0.91
 * 10:07 elukey: attempt to re-run webrequest-load-wf-text-2019-6-4-20 via Hue (temporary errors in the logs)

2019-06-04

 * 08:03 elukey: restart hive-server2 on an-coord1001 to pick up new GC/Heap settings
 * 06:57 elukey: restart hive metastore on an-coord1001 to apply new GC/heap settings

2019-06-03

 * 06:51 elukey: add the server field to the webrequest event format in varnishkafka + roll restart of all the varnishkafkas (via puppet) - T224236

2019-06-02

 * 07:04 elukey: manually restart refinery-import-page-history-dumps.service with some debug info to check what file breaks
 * 04:50 joal: Restart mediawiki-history-wikitext (dumps conversion) oozie job
 * 04:12 joal: Restart load-cassandra oozie bundle to use analytics user

2019-06-01

 * 08:03 elukey: manually restart refinery-sqoop-whole-mediawiki.service after failure

2019-05-27

 * 19:42 elukey: chown analytics:analytics /wmf/data/event/mediawiki_job_userOptionsUpdate on HDFS

2019-05-22

 * 21:29 joal: Manually refine webrequest_upload_2019_05_22_12 removing 19 rows having user-agents causing UAParser issue
 * 20:44 joal: Manually refine webrequest_text_2019_05_22_12 removing 19 rows having user-agents causing UAParser issue
 * 17:27 joal: Manually rerun webrequest-load-wf-upload-2019-5-22-12 with higher error-threshold as dataloss-error is confirmed false positive

2019-05-21

 * 06:28 elukey: chown analytics:analytics /user/hdfs/salts/eventlogging_sanitization on HDFS

2019-05-20

 * 17:17 elukey: chown -R analytics:analytics /tmp/DataFrameToDruid on HDFS
 * 16:39 joal: Manually run webrequest-load-wf-upload-2019-5-20-11 with higher error threshold as errors were false positives
 * 15:28 joal: Rerunning timed-out webrequest-load-coord-text and webrequest-load-coord-upload (2019-05-20T09:00)
 * 14:41 elukey: chown analytics:analytics /wmf/data/event_sanitized on HDFS
 * 12:02 elukey: chown analytics:analytics /wmf/data/event on HDFS
 * 12:00 elukey: chown analytics:analytics /wmf/data/wmf/event on HDFS
 * 10:21 elukey: chown -R analytics:analytics /wmf/data/raw/ dirs (except the webrequest one that has different perms)
 * 10:07 elukey: chown analytics:analytics /wmf/camus dirs (except the webrequest dir)
 * 08:49 elukey: move report updater HDFS jobs to the analytics user

2019-05-18

 * 11:25 elukey: delete analytics-store config from Superset

2019-05-17

 * 07:46 elukey: restart mediawiki history and denormalize coordinators with the new analytics user (left mediawiki-history-wikitext-coord aside for further investigation)
 * 07:22 elukey: chown -R analytics:analytics /wmf/data/wmf/mediawiki

2019-05-16

 * 20:08 joal: Manually fixing banner job
 * 19:53 joal: Restarting banner_activity-druid-monthly-coord after chuu chuu
 * 16:43 elukey: chown analytics:analytics /wmf/camus/webrequest-00 on HDFS
 * 16:36 elukey: restart the webrequest-load-bundle after the previous chown of the webrequest raw data
 * 16:23 elukey: chown -R analytics /wmf/data/raw/webrequest - step missed earlier on in the migration
 * 14:09 elukey: restart the webrequest-druid-hourly-coord coordinator with the analytics user
 * 14:08 elukey: restart the webrequest-druid-daily-coord coordinator with the analytics user
 * 13:57 elukey: start webrequest-load-bundle from hour 12:00
 * 13:27 elukey: chown -R analytics:analytics /user/hive/warehouse/wmf_raw.db on HDFS
 * 13:23 elukey: chown -R analytics:analytics /wmf/data/raw/webrequests_faulty_hosts on HDFS
 * 13:08 elukey: chown -R analytics:analytics /wmf/data/raw/webrequests_data_loss on HDFS
 * 12:57 elukey: chown -R analytics:analytics-privatedata-users /wmf/data/wmf/webrequest on HDFS
 * 12:53 elukey: kill the webrequest-load-bundle in hue - prep step to migrate the webrequest bundle to the analytics user
 * 12:49 elukey: kill webrequest-load-coord-upload from hue - prep step to migrate the webrequest bundle to the analytics user

2019-05-15

 * 21:00 fdans: refinery deployed successfully
 * 20:43 fdans: deploying refinery
 * 20:31 fdans: updating symlinks for jars
 * 20:11 fdans: deploying refinery source
 * 18:02 fdans: rerunning refine for VirtualPageviewHourly @ 9am
 * 10:34 elukey: superset upgraded to 0.32

2019-05-14

 * 15:33 mforns: restart turnilo to clear deleted datasource

2019-05-12

 * 15:33 elukey: rollback python-kafka on eventlog1002 to 1.4.1-1~stretch1
 * 12:14 elukey: restart eventlogging on eventlog1002

2019-05-10

 * 15:53 elukey: kill mediacounts-archive coordinator, chown analytics:analytics /wmf/data/archive/mediacounts + restart the coord with the analytics user
 * 14:53 ottomata: restarted eventlogging with python-kafka-1.4.3
 * 14:17 ottomata: downgrading python-kafka from 1.4.6-1~stretch1 to 1.4.3-2~wmf0 on eventlog1002 - T221848
 * 06:30 elukey: refine with higher loss threshold webrequest upload 2019-5-8-18

2019-05-09

 * 16:38 elukey: restart hive-server2 on an-coord1001 due to OOMs
 * 16:13 elukey: killed application_1555511316215_77583 from Yarn CLI
 * 11:04 elukey: kill oozie mediawiki geoeditors coords (3 in total) + chown -R analytics /wmf/data/wmf/mediawiki_private (raw data already chowned with analytics:analytics) + restart of the coords with the analytics user
 * 09:17 elukey: restart oozie wikidata coordinators (3 in total) with the analytics user

2019-05-08

 * 15:39 mforns_: deploying refinery up to 698f2137aa965b07548ae7565aafaa784628b13c and together with refinery-source 0.0.89
 * 14:55 elukey: kill last_access_uniques-daily-asiacell-coord from hue (coord not used anymore)
 * 14:05 mforns_: deployed refinery-source up to ad74c41b05d5f838df6febb379e883855abb203d
 * 13:09 mforns_: started deployment train
 * 11:15 elukey: kill projectview coords (2 in total) + chown analytics:analytics /wmf/data/wmf/projectview and /wmf/data/archive/projectview + restart coords with the analytics user
 * 09:50 elukey: kill virtualpageviews coords (3 in total) + chown analytics:analytics /wmf/data/wmf/virtualpageview + restart of the coords with user analytics
 * 09:00 elukey: kill unique_devices coords (10 in total) + chown analytics:analytics /wmf/data/wmf/unique_devices + restart of 10 coords with user analytics
 * 06:41 elukey: chown /tmp/mobile_apps to analytics:analytics

2019-05-07

 * 14:37 elukey: kill pageview oozie coord (4 in total) + chown analytics:analytics /wmf/data/wmf/pageview /wmf/data/archive/pageview + restart of the coordinators with the analytics user
 * 11:46 elukey: kill mobile apps coordinators + chown analytics:analytics /wmf/data/archive/mobile_apps, /wmf/data/wmf/mobile_apps + restart of the coordinators with user analytics
 * 11:23 joal: Updating /wmf/data/raw/mediawiki_private/tables to be owned by analytics:analytics
 * 11:19 joal: Updating /wmf/data/raw/mediawiki/xmldumps to be owned by analytics:analytics
 * 11:19 joal: Updating /wmf/data/raw/mediawiki/project_namespace_map to be owned by analytics:analytics
 * 11:19 joal: Updating /wmf/data/raw/mediawiki/tables to be owned by analytics:analytics
 * 09:38 elukey: kill clickstream-coord, chown /wmf/data/archive/clickstream to analytics:analytics, restart the job with the analytics user override
 * 09:24 elukey: kill ores-revision-scores-public-coord via hue (not used anymore)
 * 07:46 elukey: temporary override of oozie/util/druid/load/workflow.xml in HDFS's refinery to allow the analytics user to push data to druid from oozie

2019-05-06

 * 14:22 elukey: kill apis-coord and relaunch it with user analytics
 * 13:51 elukey: kill mediacounts-load-coord, chown analytics:analytics /wmf/data/wmf/mediacounts, restart coordinator with user 'analytics'
 * 12:52 elukey: kill interlanguage-coord, chown analytics:analytics /wmf/data/wmf/interlanguage, restart coordinator with user 'analytics'
 * 12:15 elukey: kill browser-general-coord, chown analytics:analytics /wmf/data/wmf/browser, restart coordinator with user 'analytics'
 * 09:32 joal: manually touching success files to start banner_activity-druid-monthly-coord between 2018-06-01/2018-12-31
 * 09:28 joal: Launch new banner_activity-druid-monthly-coord between 2018-06-01/2018-12-31 to cover for timedout past actions
 * 09:13 elukey: kill banner impression coordinators, chown /wmf/data/wmf/banner_impressions to analytics:analytics and start coordinators again
 * 07:42 elukey: chown -R /wmf/data/wmf/aqs/* to analytics:analytics (was: analytics:hdfs)

2019-05-05

 * 07:31 joal: Manually launch druid indexation of mediawiki_history_reduced_2019_04
 * 07:29 joal: Manually add 2019-04 hive partition to mediawiki_history_reduced after automated job failed (expected failure after refactor)

2019-05-03

 * 07:14 joal: Restarting mediawiki-history-check_denormalize-coord with missing parameter (patch provided to prevent the coord to start without it)

2019-05-02

 * 13:30 joal: Restarting AQS oozie job with -Duser=analytics parameter
 * 13:10 joal: Kill oozie aqs-hourly-coord
 * 08:57 elukey: manual start of refinery-sqoop-mediawiki-production.service

2019-05-01

 * 20:02 ottomata: sudo systemctl stop refinery-sqoop-mediawiki-production
 * 19:58 ottomata: sudo systemctl disable refinery-sqoop-mediawiki-production
 * 18:25 joal: restarted oozie jobs mediawiki-history-denormalize-coord, mediawiki-history-check_denormalize-coord and mediawiki-history-reduced-coord
 * 18:22 joal: Confirming that sqoop-private (cu_changes) will run automatically tonight (2nd of the month at 00:00) - nothing needed
 * 18:08 joal: Manually killed sqoop-production (comment and actor) to have it done after the current manual labs run
 * 17:59 joal: Starting a manual run of sqoop
 * 17:55 joal: deploying refinery onto HDFS
 * 17:40 joal: deploy refinery using scap after failed attempt
 * 17:37 joal: Recreate wmf.mediawiki_history (+page and user) and wmf.mediawiki_history_archive (with old data)
 * 17:12 joal: Moving existing mediawiki-history to /wmf/data/wmf/mediawiki/archive folder
 * 17:08 joal: Killing oozie jobs for new deploy: mediawiki-history-denormalize-coord, mediawiki-history-check_denormalize-coord, mediawiki-history-reduced-coord
 * 17:02 joal: Deploying refinery using scap
 * 16:48 joal: Kill oozie mediawiki-history-druid-coord for true (replaced by edit_hourly job)
 * 16:08 joal: refinery-source v0.0.88 released on archiva

2019-04-30

 * 13:25 ottomata: restarting eventlogging processes to upgrade to python-kafka 1.4.6 - T221848

2019-04-29

 * 08:22 joal: Deploying refinery using scap (analytics-deploy user test)

2019-04-25

 * 14:19 mforns: Restarted Turnilo to clear deleted datasource

2019-04-24

 * 15:00 elukey: set innodb_file_format=Barracuda and innodb_large_prefix=1 on mariadb on an-coord1001 to allow bigger indexes for Superset db upgrades
 * 07:43 fdans: refinery uploaded to hdfs and webrequest bundle restarted
 * 07:06 fdans: restarted webrequest bundle
 * 06:24 elukey: kill of application_1555511316215_18282 on Hadoop due to excessive resource usage

2019-04-23

 * 13:49 elukey: delete tbayer_popups from druid analytics - T220575
 * 09:13 fdans: refinery deployed successfully
 * 08:28 fdans: deploying refinery
 * 08:26 fdans: refinery source v0.0.87 released and symlinks updated
 * 07:04 fdans: releasing refinery source v0.0.86 for what I hope is the last time

2019-04-18

 * 18:55 fdans: updated jars
 * 18:53 fdans: Release of v0.0.86 in maven succeeded
 * 15:22 fdans: restarting release of version 0.0.86 of refinery source to maven
 * 14:29 fdans: releasing version 0.0.86 of refinery source to maven

2019-04-17

 * 09:06 elukey: restart eventlogging on eventlog1002 due to errors in processors and consumer lag accumulated after the last Kafka Jumbo roll restart

2019-04-13

 * 09:21 elukey: re-run failed webrequest-text 2019-04-13-07 job - temporary failure between Hive and HDFS

2019-04-12

 * 10:12 elukey: matomo upgraded to 3.9.1 to fix some security vulns

2019-04-10

 * 14:48 elukey: restart turnilo to pick up the new nodejs runtime
 * 13:58 joal: Deploying AQS

2019-04-09

 * 18:40 ottomata: chowning files in analytics.wm.org/datasets/archive/public-datasets/ as stats:wikidev
 * 15:00 fdans: backfilling data between previous backfill end and start of puppetized job for PrefUpdate
 * 13:53 mforns: restarted turnilo to clear deleted datasource

2019-04-08

 * 14:50 fdans: backfilling prefupdate schema into druid from Jan 1 2019 until Apr 1 2019

2019-04-04

 * 21:20 mforns: Restarted turnilo to clear deleted datasource

2019-04-03

 * 19:16 elukey: failover from namenode on 1002 (currently active after the outage) to 1001 (standby)
 * 18:07 joal: mediawiki-history-checker manual rerun successful
 * 15:22 elukey: execute kafka preferred-replica-election on kafka-jumbo

2019-04-02

 * 17:54 mforns: restarted turnilo to clear deleted datasource
 * 17:29 milimetric: revision/pagelinks failed wikis rerun successfully, now forcing comment/actor rerun
 * 15:02 mforns: Rerunning webrequest-load-coord for 2019-04-01T22
 * 14:59 elukey: re-run of webrequest upload 2019-04-01-14 with higher data loss threshold
 * 10:14 elukey: restart eventlogging's mysql consumers on eventlog1002 - T219842
 * 06:18 joal: Deleted (in hdfs bin) actor and comment table data because it has been sqooped too early - manual rerun will be started once labs sqoop is done

2019-04-01

 * 06:02 elukey: kill + re-run of pageviews hourly 30-03 hour 7 - seems stuck in heart beat after reduce completed

2019-03-29

 * 12:29 mforns: Restarted Turnilo to refresh deleted test datasource
 * 12:11 mforns: Restarted Turnilo to refresh deleted test datasource
 * 11:52 mforns: Restarted Turnilo to refresh deleted test datasource
 * 11:10 mforns: Restarted Turnilo to refresh deleted test datasource

2019-03-28

 * 19:04 joal: Manually rerun webrequest-load-wf-upload-2019-3-28-8 with higher error threshold (a lot of false positives!)

2019-03-27

 * 21:13 milimetric: done deploying refinery, will now restart monthly geoeditors coordinator

2019-03-18

 * 11:08 elukey: restart hue on analytics-tool1001 to pick up some new changes (should be a no-op)

2019-03-14

 * 17:43 mforns: Deploying AQS using scap (node10 upgrade)

2019-03-13

 * 22:58 nuria: mediawiki-check denormalized restarted 0147256-181112144035577-oozie-oozi-C
 * 22:48 nuria: killed oozie job 0131427-181112144035577-oozie-oozi-C to correct e-mail address

2019-03-12

 * 16:06 joal: Rerun webrequest-load-wf-text-2019-3-12-11 after error

2019-03-08

 * 20:48 joal: Rerun webrequest-load-wf-upload-2019-3-8-19 after hive outage
 * 14:52 joal: deployed wikistats2 2.5.5

2019-03-07

 * 14:50 joal: Restart mediawiki-history after having corrected data
 * 13:52 joal: manually killing mediawiki-history-denormalize-wf-2019-02 instead of letting it fail another 3 attempts
 * 10:40 joal: Manually fixed sqoop issues

2019-03-06

 * 18:13 joal: Refinery deployed onto hadoop
 * 18:08 joal: Refinery deployed using scap

2019-03-04

 * 16:17 elukey: disable all report updater jobs via puppet (ensure => absent) due to dbstore1002 decom

2019-02-28

 * 17:16 milimetric: restarted mediawiki/history/load job: https://hue.wikimedia.org/oozie/list_oozie_coordinator/0131840-181112144035577-oozie-oozi-C/
 * 14:40 milimetric: refinery deployed with new sqoop logic and updated history/load job
 * 09:57 fdans: restarting mediawiki-history-wikitext coordinator
 * 09:56 fdans: restarting mediawiki-history-check_denormalize
 * 09:48 fdans: restarting mediawiki-history-denormalize coordinator

2019-02-27

 * 17:42 elukey: re-run webrequest-load-wf-upload-2019-2-27-16 (failed due to a shutdown of analytics1071 for hw maintenance)

2019-02-24

 * 10:24 elukey: restart check webrequest service on an-coord1001 (failed due to /mnt/hdfs being unavail)

2019-02-20

 * 18:17 fdans: deploying refinery
 * 16:03 ottomata: removing spark 1 from Analytics cluster - T212134

2019-02-19

 * 09:47 mforns: deployed refinery (without refinery-source) until commit 0d7ec1989852d4dd5b1497463fd9509e4f5bdb87

2019-02-15

 * 18:18 nuria: restarted turnilo in analytics-tool1002

2019-02-14

 * 09:07 joal: rerun mediawiki-history-wikitext-wf-2019-01
 * 09:06 joal: Re-run webrequest-load-wf-text-2019-2-14-6

2019-02-13

 * 19:46 mforns: Deploying refinery with scap
 * 19:12 mforns: Deployed refinery-source v0.0.85 using jenkins

2019-02-12

 * 09:39 elukey: systemctl disable/stop mediawiki-geoeditors-drop-month.timer on an-coord1001

2019-02-11

 * 10:01 elukey: restart superset to pick up new config.py changes
 * 08:38 elukey: restart superset to pick up new settings in config.py

2019-02-10

 * 10:52 elukey: re-run webrequest upload webrequest-load-wf-upload-2019-2-10-0
 * 10:52 elukey: killed oozie job related to webrequest-load-wf-upload-2019-2-10-0, seemed stuck in generate_sequence_statistics (not really clear why)

2019-02-08

 * 13:45 joal: wikistats2 snapshot updated to 2019-01

2019-02-06

 * 19:28 milimetric: deployed refinery
 * 18:41 joal: Killing-restarting mediawiki-history related oozie jobs

2019-02-04

 * 20:30 joal: Confirm that last week dataloss warnings were false alarms (upload -> 2019-1-28-15, 2019-1-28-16, 2019-2-1-1, 2019-2-1-4, 2019-2-1-13 -- text -> 2019-2-1-13, 2019-2-1-15)
 * 14:47 joal: Rerun webrequest-load-coord-text for 2019-02-04T04:00:00

2019-01-24

 * 11:49 mforns: Restarted Turnilo to remove deleted datasource

2019-01-23

 * 15:24 elukey: added lea-wmde and goransm to Superset

2019-01-22

 * 20:30 milimetric: updated hive tables in wmf_raw for actor/comment refactor
 * 19:00 milimetric: deployed refinery with refinery-source
 * 15:15 mforns: Restarted turnilo to clear deleted datasource
 * 08:59 elukey: clean up reportupdater_discovery-stats-interactive from stat1006 - old job not cleaned up

2019-01-21

 * 09:34 elukey: removed ./jobs/limn-language-data/interlanguage/.reportupdater.pid in /srv/reportupdater on stat1007 to force the first run of the timer

2019-01-17

 * 13:57 elukey: re-run pageview-hourly-wf-2019-1-12-14's coordinator

2019-01-15

 * 14:56 fdans: "rolling back to stable superset"
 * 14:40 fdans: deploying superset 0.26.3-wikimedia1
 * 14:36 elukey: stop superset to allow a clean mysqldump

2019-01-14

 * 17:48 nuria: restarting tUrnilo to pick up new config.. sigh
 * 17:47 nuria: restarting tornilo to pick up new config
 * 16:55 elukey: restart turnilo to pick up new changes
 * 16:40 ottomata: running refine eventlogging analytics for dec 17 2018 12:00 - 16:00 - T213602
 * 15:26 elukey: reimage stat1005 - T205846

2019-01-09

 * 14:13 elukey: shutdown all the hdfs datanode daemons on the decom nodes (analytics1028->41)

2019-01-08

 * 08:09 elukey: manual stop of hdfs balancer to ease the under replicated blocks healing (worker nodes already decently balanced)
 * 07:24 elukey: decommission analytics10[39-41] from Analytics Hadoop

2019-01-07

 * 22:02 mforns: Finished restarting oozie jobs after refinery deployment
 * 21:24 mforns: Finished deployment of refinery using scap and refinery-deploy-to-hdfs, proceeding to restart oozie jobs
 * 21:05 mforns: Starting deployment of refinery using scap and refinery-deploy-to-hdfs
 * 19:48 ottomata: merging change to make rsync server modules pull only - T205157, T205152
 * 19:46 mforns: Deployed refinery-source using jenkins
 * 17:21 joal: Manually repair hive table and add _PARTITIONED flag to project_namespace_map
 * 17:03 elukey: re-enabled eventlogging mysql consumers
 * 16:02 elukey: stop eventlogging mysql consumers on eventlog1002 - db1107 down
 * 15:36 mforns: Restarted turnilo to clear deleted test datasource
 * 11:26 elukey: move hue/oozie/hive password handling from auto-load to role lookup in the puppet private repo
 * 09:19 joal: Deploying refinery onto HDFS so that refinery-job-0.0.82.jar is present on HDFS (needed to run mediawiki-history successfully)
 * 08:49 joal: Rerun failed mediawiki-denormalize job with updated spark conf
 * 07:29 elukey: decom analytics103[7,8] from Analytics Hadoop

2019-01-06

 * 07:58 elukey: manually stopped the hdfs-balancer to ease the decom process (the hdfs nodes are already nicely balanced)
 * 07:55 elukey: decom analytics103[5,6] from Analytics Hadoop

2019-01-05

 * 07:37 elukey: decommission analytics1033/34 from the Hadoop cluster

2019-01-04

 * 09:42 joal: Kill banner test kafka-druid ingestion job
 * 08:16 elukey: restart eventlogging daemons on eventlog1002 to pick up openssl updates
 * 07:39 elukey: decommission analytics1031/32 from the Hadoop analytics cluster
 * 07:37 elukey: manually stopped hdfs-balancer (cluster already balanced, only one host left with some blocks to get) to ease the decom of two more nodes

2019-01-03

 * 11:26 elukey: manually started the hdfs-balancer (failed earlier on due to the presence of a lock file)

2019-01-02

 * 18:03 elukey: decom analytics10(29|30) from HDFS/Yarn
 * 10:31 elukey: killed all hdfs-balancer processes (one running since ages ago in 2018)
 * 09:16 elukey: decom analytics1028 from hdfs/yarn

2018-12-22

 * 18:47 elukey: manual clean up of old log files on an-coord1001 (disk space issues)

2018-12-21

 * 22:55 mforns: Restarted Turnilo to clear a deleted test datasource
 * 18:44 mforns: Restarted Turnilo to clear a deleted test datasource

2018-12-19

 * 00:09 mforns: restarted Turnilo to clear deleted datasource

2018-12-18

 * 23:50 mforns: restarted Turnilo to clear deleted datasource
 * 20:34 ottomata: bounced eventlogging-processor to pick up change to send invalid rawEvents as json string
 * 19:36 ottomata: re-running refine_eventlogging_backfill again for days in december - T211833
 * 17:37 mforns: restarted Turnilo to clear deleted datasource
 * 15:34 mforns: restarted Turnilo to clear deleted datasource
 * 10:19 mforns: restarted Turnilo to clear deleted datasource

2018-12-17

 * 23:52 mforns: restarted Turnilo to clear deleted datasource
 * 22:50 mforns: restarted Turnilo to clear deleted datasource
 * 22:31 mforns: restarted Turnilo to clear deleted datasource
 * 15:23 ottomata: re-running refine_eventlogging_analytics with --ignore_done_flag (backfilling didn't complete properly on friday) - T211833

2018-12-14

 * 20:45 ottomata: re-refining all hive EventLogging tables since 2018-11-29T17:00:00. - T211833
 * 20:35 ottomata: removing EventLogging Hive _REFINED flag files since 2018-11-29T17:00:00 to allow for re-refinement of data - 2018-11-29T17:00:00
 * 19:50 ottomata: starting refinery release deploy process for refinery 0.0.82 to fix T211833

2018-12-13

 * 21:11 ottomata: superset is back up at version 0.26.3
 * 20:57 ottomata: stopped superset on analytics-tool1003 for revert to previous version (luca will revert the db backup)
 * 15:25 mforns: restarted turnilo to clean a deleted test datasource

2018-12-12

 * 22:49 mforns: restarted turnilo to clear deleted test datasource
 * 20:07 mforns: restarted turnilo to clear deleted test datasource
 * 17:10 mforns: restarted turnilo to clear deleted test datasource

2018-12-11

 * 16:03 mforns: restarted Turnilo to clear deleted datasource
 * 15:27 mforns: restarted Turnilo to clear deleted datasource
 * 14:58 joal: Restart clickstream job after having repaired hive mediawiki-tables partitions

2018-12-10

 * 19:37 joal: Manually deleting old druid-public snapshots that were not following datasource naming convention (- instead of _)
 * 14:51 milimetric: trying the labsdb/analytics-store combination sqoop, live logs in /home/milimetric/sqoop-[private-]log.log on stat1004

2018-12-07

 * 08:10 joal: manually create /wmf/data/raw/mediawiki/tables/change_tag/snapshot=2018-11/_SUCCESS on hdfs to unlock mw-history-load and therefore mw-history-reduced

2018-12-06

 * 15:08 elukey: turnilo migrated to nodejs 10

2018-12-05

 * 14:53 elukey: restart hdfs namenodes and yarn rm to update rack awareness config (prep for new nodes)
 * 11:58 fdans: backfilling in progress, killing uniques coordinators within bundle, will restart bundle on Jan 1st
 * 11:34 fdans: backfill test successful. Starting job to backfill family uniques since mar 2017
 * 10:03 fdans: backfilling test for unique project families - start_time=2016-01-01T00:00Z stop_time=2016-02-01T00:00Z
 * 09:13 elukey: matomo read only + upgrade to matomo 3.7.0 on matomo1001
 * 07:43 elukey: restart middlemanager/broker/historical on druid-public to pick up new log4j settings

2018-12-04

 * 18:26 ottomata: reenabled refinement of mediawiki_revision_score
 * 17:50 joal: Deploying aqs using scap for offset and underestimate values in unique-devices endpoints
 * 17:12 elukey: cleanup logs on /var/log/druid on druid100[1-3] after change in log4j settings
 * 15:25 elukey: rolling restart of broker/historical/middlemanager on druid100[1-3] to pick up new logging settings
 * 15:01 joal: Update test values for uniques in cassandra before deploy
 * 14:56 elukey: restart druid broker and historical on druid1001
 * 12:16 joal: Drop cassandra test keyspace "local_group_default_T_unique_devices_TEST"
 * 10:55 fdans: deploying AQS to expose offset and underestimate numbers on unique devices

2018-12-03

 * 20:05 ottomata: dropping and recreating hive event.mediawiki_revision_score table and data - T210465
 * 18:11 mforns: rerun webrequest upload load job for 2018-12-01T14:00

2018-12-01

 * 08:50 fdans: bundle restarted successfully
 * 08:39 fdans: killing current cassandra bundle

2018-11-30

 * 12:45 joal: Update hive wmf_raw mediawiki schemas (namespace bigint -> int)

2018-11-29

 * 18:33 mforns: Finished refinery deployment using scap and refinery-deploy-to-hdfs
 * 17:41 mforns: Starting refinery deployment using scap and refinery-deploy-to-hdfs
 * 17:37 mforns: Deployed refinery-source using jenkins

2018-11-26

 * 15:47 ottomata: moved old raw revision-score data to hdfs in /user/otto/revision_score_old_schema_raw - T210013
 * 15:41 ottomata: stopped producing revision-score events with old schema; merged and deployed new schema; petr to deploy change to produce events with new schema soon.  https://phabricator.wikimedia.org/T210013
 * 15:27 fdans: monthly and daily jobs for uniques killed, replaced with backfilling jobs until Dec 1st

2018-11-22

 * 13:42 elukey: allow the research user to create/alter/etc.. tables on staging@db1108

2018-11-21

 * 19:49 milimetric: deploying AQS
 * 13:06 fdans: launching backfilling jobs for daily and monthly uniques from beginning of time until Nov 20
 * 13:05 fdans: test backfill on 13 Nov daily uniques successful
 * 12:54 fdans: testing backfill of daily uniques in production for 2018-11-13

2018-11-20

 * 14:02 elukey: restart hive-server2 to pick up new settings - T209536
 * 11:44 elukey: re-run pageview-hourly-wf-2018-11-20-9

2018-11-19

 * 13:59 joal: failing deployment on aqs to include a new patch
 * 13:41 joal: Deploying aqs using scap
 * 13:27 fdans: deploying aqs to add new fields to uniques dataset (T167539)

2018-11-18

 * 08:44 elukey: re-run webrequest-load-wf-text-2018-11-17-23 via Hue
 * 08:37 elukey: restart yarn on analytics1039 - not clear why the process failed (nothing in the logs, no other disks failed)

2018-11-15

 * 14:51 fdans: testing load of new uniques fields in test keyspace in cassandra
 * 14:07 elukey: re-run mediacounts-load-wf-2018-11-15-8 - died due to issues on an1039 (happened this morning, broken disk)

2018-11-12

 * 19:30 ottomata: running oozie-setup sharelib create and then spark2_oozie_sharelib_install
 * 15:40 fdans: Restarting per project family unique generation jobs (daily and monthly)
 * 13:18 joal: Suspend discovery 0060527-180705103628398-oozie-oozi-C coordinator for it not to block upgrade

2018-11-05

 * 10:20 joal: Create hive tables wmf.webrequest_subset and wmf.webrequest_subset_tags
 * 10:02 joal: Start mediawiki-history-wikitext job
 * 09:58 joal: create wmf.mediawiki_wikitext_history table
 * 09:46 joal: Alter wmf.pageview_whitelist renaming insertion_ts field to insertion_dt for convention
 * 09:43 joal: restart mediawiki-load oozie bundle to pick new deploy
 * 09:39 joal: Restart mediawiki-history-load oozie job to pick new deploy
 * 09:37 joal: Create table wmf_raw.mediawiki_change_tag
 * 09:24 joal: deploying refinery onto HDFS
 * 09:04 joal: Deploy refinery from scap
 * 08:55 joal: Refinery-source released on archiva

2018-10-30

 * 16:55 mforns: Finished AQS deployment using scap
 * 16:45 mforns: Starting AQS deployment using scap
 * 15:34 ottomata: kafka topics --alter --topic eventlogging_VirtualPageView --partitions 12

2018-10-29

 * 22:55 ottomata: groceryheist killed a long running hive query that is now allowing backlogged production yarn jobs to finally execute
 * 16:37 ottomata: reassigning eventlogging_ReadingDepth partition 0 from 1002,1004,1006 to 1003,1001,1005 to move preferred leadership from 1002 to 1003
 * 14:27 ottomata: ran kafka-preferred-replica-election on kafka jumbo-eqiad cluster (this successfully rebalanced webrequest_text partition leadership) T207768
 * 10:23 joal: Kill yarn application application_1540747790951_1429 to prevent more cluster errors (eating too many resources)
 * 08:56 elukey: bounce yarn resource managers to pick up new zookeeper session timeout settings

2018-10-28

 * 17:30 elukey: restart yarn resource manager on an-master1002 to force failover to an-master1001

2018-10-26

 * 11:49 joal: Rerun failed oozie jobs (pageview and projectview)
 * 06:18 elukey: add AAAA DNS records for aqs and matomo1001
 * 05:55 elukey: reportupdater hadoop migrated to stat1007

2018-10-25

 * 21:06 ottomata: bouncing eventlogging-processor client side* to pick up mysql whitelist change for ContentTranslationAbuseFilter (https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/469419/)
 * 18:14 joal: Manually resume the bunch of suspended jobs (mostly from ebernhardson and chelsyx - our apologies for not noticing earlier)
 * 18:13 joal: Manually copy /etc/hive/conf/hive-site.xml to hdfs:///user/hive and set permissions to 644 to allow all users to run oozie jobs
 * 15:36 elukey: shutdown aqs1006 to replace one broken disk
 * 14:28 elukey: upgrade druid on druid100[4-6] to Druid 0.12.3
 * 14:24 elukey: added AAAA DNS records to all the druid nodes
 * 10:36 joal: Resuming oozie webrequest and pageview druid hourly indexation jobs
 * 10:35 elukey: upgraded Druid on druid100[1-3] to 0.12.3-1
 * 09:16 elukey: upgrade turnilo to 1.8.1
 * 08:56 elukey: restart hive-server on an-coord1001 to pick up new prometheus settings
 * 08:10 joal: Suspend webrequest-druid-hourly and pageview-druid-hourly oozie jobs
 * 07:52 joal: Manually add za.wikimedia to pageview-whitelist (patch merged: https://gerrit.wikimedia.org/r/469557)

2018-10-23

 * 16:25 ottomata: altering topic eventlogging_ReadingDepth to increase partitions from 1 to 12
 * 06:42 elukey: restart yarn and hdfs daemon on analytics1068 to pick up correct config (the host was down since before we swapped the Hadoop masters due to hw failures)

2018-10-22

 * 17:24 elukey: upgraded camus jar version in an-coord1001's crontab (via puppet)
 * 17:21 elukey: deploy refinery to hdfs (via stat1005)
 * 17:12 elukey: deploy refinery (new version of camus)
 * 15:09 mforns: Finished deployment of refinery using scap and refinery-deploy-to-hdfs
 * 14:51 mforns: Starting deployment of refinery using scap and refinery-deploy-to-hdfs
 * 14:50 mforns: Finished deployment of refinery-source using jenkins
 * 14:24 mforns: Starting deployment of refinery-source using jenkins

2018-10-16

 * 12:32 joal: rerun pageview-hourly-wf-2018-10-15-17

2018-10-15

 * 19:45 mforns: Finished refinery deployment with scap and refinery-deploy-to-hdfs
 * 19:10 mforns: Started refinery deployment with scap and refinery-deploy-to-hdfs
 * 19:09 mforns: Finished refinery-source deployment
 * 18:42 mforns: Started refinery-source deployment
 * 15:20 mforns: Finished refinery deployment with scap and refinery-deploy-to-hdfs
 * 14:52 mforns: Started refinery deployment with scap
 * 14:47 mforns: Finished refinery-source deployment
 * 14:19 mforns: Started refinery-source deployment
 * 14:05 elukey: swapped cobalt's ip with gerrit.wikimedia.org's one in analytics-in(4|6) firewall filters on the eqiad routers for https://phabricator.wikimedia.org/T206331#4666622. This should not cause git pulls to fail but let me know in case it does.

2018-10-14

 * 09:15 elukey: restart yarn resource manager on an-coord1002 (failover happened due to jvm issues)
 * 09:15 elukey: restart apps-session-metrics with spark 2.3.1 oozie libs (modified the coordinator.properties file manually on disk)

2018-10-12

 * 07:32 elukey: cleaned up all september files from eventlog1002's srv el archive to free some space (disk alerts)

2018-10-11

 * 14:20 elukey: reboot eventlog1002 for kernel upgrades

2018-10-10

 * 19:27 joal: Restart webrequest-load oozie bundle
 * 18:23 joal: kill Webrequest-load bundle
 * 18:04 joal: Kill webrequest-load-coord-upload
 * 07:23 elukey: add ipv6 mapped addresses (and DNS PTRs) to analytics-tools*
 * 07:23 joal: Full restart of browser-general oozie job
 * 07:19 joal: patch mediacount-archive job in prod
 * 07:16 joal: Full restart of mediacount-archive oozie job
 * 05:54 elukey: re-run failed mediacounts and browser-general coordinators with hive-site -> hdfs://analytics-hadoop/user/hive/hive-site.xml

2018-10-09

 * 18:24 ottomata: adding Accept header to all varnishkafka generated webrequest logs
 * 15:10 joal: restart Mediawiki-history-reduced
 * 15:08 joal: restart wikidata-coeditors oozie job
 * 15:08 joal: restart wikidata-specialentites oozie job
 * 15:00 joal: restart wikidata-article-placeholder oozie job
 * 14:57 joal: restart mediawiki-history denormalize oozie job
 * 14:56 joal: Restart check_denormalize oozie job
 * 14:53 joal: Restart clickstream oozie job to pick new spark-lib
 * 13:56 ottomata: bouncing oozie server on an-coord1001
 * 13:46 joal: Restarting oozie-api job
 * 13:36 joal: fully restart projectview_geo oozie job
 * 13:26 joal: Full restart of aqs oozie job
 * 13:25 joal: full restart of projectview_hourly
 * 13:14 joal: rerun failed aqs-hourly jobs
 * 12:48 elukey: re-run all the failed projectview-hourly-coord and aqs-hourly-coord workflows (restarting them via hue)
 * 12:47 elukey: re-run apis-wf-2018-10-9-8
 * 10:01 joal: Restart failed oozie jobs (webrequest, virtual-pageviews, mwh-reduced)
 * 07:14 elukey: stopped all crons on analytics1003 as prep step for migration to an-coord1001

2018-10-08

 * 16:28 elukey: restart eventlogging on eventlog1002 for python security upgrades
 * 10:26 elukey: swapped db settings from analytics1003 to an-coord1001 on both Druid clusters (restarted coordinators and overlords)
 * 07:35 joal: Manually run download-project-namespace-map with proxy

2018-10-06

 * 18:10 elukey: restart Yarn Resource Manager on an-master1002 to force an-master1001 to take the active role back (failed over due to a zk conn issue)

2018-10-05

 * 10:32 elukey: piwik/matomo out of maintenance
 * 10:17 elukey: set piwik/matomo in maintenance mode on matomo1001

2018-10-04

 * 20:33 mforns: Finished deployment of refinery
 * 19:52 mforns: Started deployment of refinery
 * 19:50 mforns: Finished deployment of refinery-source
 * 19:22 mforns: Started deployment of refinery-source
 * 17:20 elukey: bounce druid-brokers on druid100[4-6] after network maintenance

2018-10-01

 * 12:56 fdans: reverting to last version of wikistats

2018-09-27

 * 06:44 elukey: rolling restart of Druid coordinators and historicals on the Druid public cluster to pick up new Hadoop masters (one at a time, very gently)

2018-09-26

 * 20:39 elukey: rolling restart of all the druid historicals on Druid private/analytics
 * 20:00 ottomata: rolling restart of druid coordinators to hopefully pick up hadoop master config change
 * 17:49 joal: Deploy AQS from scap
 * 08:22 elukey: start mysql consumers on eventlog1002 after maintenance
 * 07:51 elukey: stop mysql consumers on eventlog1002 as prep step for db maintenance

2018-09-25

 * 20:21 joal: Webrequest warnings for upload-2018-09-25-13 were all false positives
 * 17:36 ottomata: stopping refine jobs and deploying refinery source 0.0.75 - T203804
 * 12:37 joal: Rerun webrequest-load-wf-text-2018-9-25-6 and webrequest-load-wf-text-2018-9-25-7 after SLA failure due to hadoop master swaps
 * 11:55 joal: Rerun webrequest-load-wf-upload-2018-9-25-6 after failed SLA during hadoop master swap
 * 11:53 joal: rerun as you prefer dcausse :)
 * 08:02 joal: Killing discovery transfer job to drain cluster before master replacement (application_1536592725821_38136)
 * 06:24 elukey: stop camus crons on an1003 and report updater on stat1005 as prep step for cluster shutdown

2018-09-20

 * 16:04 joal: webrequest-load-check_sequence_statistics-wf-text-2018-9-19-20 have been checked as false-positive

2018-09-15

 * 12:22 joal: Restart webrequest-druid-[hourly|daily] coordinators
 * 12:20 joal: Kill wikidata-wdqs coordinator
 * 12:11 joal: Killing and restarting webrequest-load-bundle
 * 12:00 joal: Deploying refinery onto hadoop :)
 * 11:39 joal: Deploying refinery with scap

2018-09-12

 * 17:34 ottomata: deploying new version of refinery-source, and then refinery for properties based RefineMonitor job - https://phabricator.wikimedia.org/T203804
 * 13:11 ottomata: otto@deploy1001 Started deploy [eventlogging/analytics@5c6fab6]: Support loading plugins in eventlogging-processor - T203596
 * 06:21 elukey: re-run webrequest-load-wf-text-2018-9-12-4, failed due to sql exceptions/timeouts to the database

2018-09-10

 * 16:26 ottomata: restarting eventlogging-processors to pick up blacklist of WebClientError schema for MySQL - T203814
 * 12:49 elukey: disable camus as prep step for analytics100[1-3] reboots
 * 07:54 joal: Manually restarting mediawiki-reduced oozie with manual addition of missing parameter

2018-09-07

 * 18:18 joal: Manually download namespaces for 2018-08
 * 17:32 joal: Manually rerun download-project-namespace-map on analytics1003 after cron's failure

2018-09-06

 * 13:03 fdans: restarted virtualpageview_hourly coordinator

2018-09-05

 * 18:18 ottomata: restarted eventlogging processors blacklisting CentralNoticeImpression - T203592
 * 16:56 ottomata: restarting eventlogging processors to blacklist CitationUsage - T191086
 * 14:42 elukey: deploying refinery (pageview whitelist and cron script change)
 * 13:40 ottomata: reimaging thorium to debian stretch (this will cause an announced {stats,analytics}.wm.org downtime!) - T192641
 * 13:21 fdans: restarting webrequest load bundle, start time 11:00Z
 * 09:12 elukey: re-run webrequest-druid-hourly-wf-2018-9-5-7 - failed due to rebooting druid1001
 * 07:02 elukey: restart oozie on analytics1003 to pick up new smtp settings
 * 06:37 elukey: re-run webrequest-load-wf-misc-2018-9-5-2 and webrequest-load-wf-upload-2018-9-4-19 via Hue
 * 06:35 elukey: upload new pageview whitelist to hdfs

2018-09-04

 * 19:05 joal: Restart cassandra-hourly-wf-local_group_default_T_pageviews_per_project_v2-2018-9-4-14
 * 16:37 fdans: restarting webrequest-load bundle
 * 16:07 fdans: beginning refinery deployment
 * 14:28 fdans: deployed refinery source using jenkins

2018-09-03

 * 08:07 joal: Delete /wmf/data/raw/mediawiki_private/tables/cu_changes/month=2018-05 folder and relaunch mediawiki-geoeditors-load-wf-2018-08
 * 06:05 elukey: re-run virtualpageview-hourly-wf-2018-9-2-1 via Hue (failed oozie job to inspect: 0082016-180705103628398-oozie-oozi-W)

2018-08-31

 * 14:31 elukey: re-run webrequest-load-wf-upload-2018-8-31-11, failed due to hadoop workers reboots
 * 10:05 elukey: re-run webrequest-load-wf-upload-2018-8-31-7, failed due to hadoop workers reboots
 * 09:15 elukey: re-run webrequest-load-wf-upload-2018-8-31-[7,8], failed due to hadoop workers reboots
 * 07:26 elukey: re-run webrequest-load-wf-text-2018-8-31-4, failed due to hadoop workers reboots
 * 06:20 elukey: re-run webrequest-load-wf-text-2018-8-31-4, failed for hadoop workers reboots

2018-08-30

 * 15:59 elukey: rerun of pageview-druid-hourly-wf-2018-8-30-13, hadoop worker reboots in progress
 * 15:23 elukey: re-run webrequest-load-wf-upload-2018-8-30-13, failed due to hadoop worker reboots
 * 14:49 elukey: re-run webrequest-load-wf-text-2018-8-30-12, failed due to worker nodes reboots

2018-08-29

 * 15:27 joal: Deploy AQS with scap
 * 13:59 ottomata: upgrading spark2 package with pyarrow dependency and default pyspark to python3
 * 11:39 joal: Restart mediawiki-history-reduced oozie job after deploy
 * 10:41 elukey: nuked /srv/deployment/analytics/refinery on stat1005 after errors with archiva/git-fat (stat1005 is the canary)
 * 10:34 joal: Deploying refinery onto HDFS
 * 08:53 joal: Deploying refinery using scap
 * 08:30 joal: Deploying refinery using jenkins

2018-08-28

 * 14:19 joal: Restart Workflow pageview-druid-hourly-wf-2018-8-28-11
 * 13:00 joal: Restart mediawiki-history and mediawiki-history-druid jobs
 * 10:59 joal: deploying refinery onto HDFS
 * 10:44 joal: Deploying refinery from scap
 * 10:22 joal: Refinery-source v0.0.71 deployed onto archiva
 * 08:14 joal: Restart virtualpageview-hourly-wf-2018-8-27-21

2018-08-20

 * 21:28 fdans: restarting virtualpageview-hourly-coord
 * 21:21 fdans: refinery deployment succeeded
 * 20:55 fdans: deploying analytics refinery
 * 18:18 ottomata: restarting eventlogging client side processes using librdkafka1 0.11.x - https://phabricator.wikimedia.org/T200769

2018-08-14

 * 23:07 fdans: starting deployment of refinery via scap
 * 19:35 mforns: Finished deployment of refinery using scap and refinery-deploy-to-hdfs
 * 18:57 mforns: Starting deployment of refinery using scap and refinery-deploy-to-hdfs
 * 18:44 mforns: Finished deployment of refinery using scap and refinery-deploy-to-hdfs
 * 18:03 mforns: Starting deployment of refinery using scap and refinery-deploy-to-hdfs
 * 17:29 mforns: Finished deployment of refinery-source using jenkins
 * 15:51 ottomata: removed /srv/geowiki from stat1006
 * 14:39 mforns: Starting deployment of refinery-source using jenkins

2018-08-13

 * 14:59 ottomata: deploying refinery-0.0.69 and refinery changes for T198908

2018-08-10

 * 14:52 ottomata: restarting eventlogging-consumer@mysql-eventbus consuming from kafka jumbo-eqiad - T201420

2018-08-09

 * 17:49 mforns: finished refinery deploy using scap and refinery-deploy-to-hdfs
 * 17:24 mforns: starting refinery deploy using scap
 * 17:23 mforns: finished refinery-source deploy using jenkins
 * 16:46 mforns: starting refinery-source deploy using jenkins

2018-08-08

 * 21:29 joal: Webrequest data-loss warnings for upload and text for hour 2018-08-08-18 contained only false positives (possibly related to a network glitch?)

2018-08-07

 * 13:24 joal: Update AQS druid datasource to 2018-07 snapshot

2018-08-06

 * 19:11 ottomata: upgrading from spark 2.3.0 -> spark 2.3.1 everywhere
 * 12:21 joal: Warning in webrequest-upload-2018-8-1-13 contained only false positives

2018-08-02

 * 17:33 milimetric: deployed refinery, relaunching geoeditors sqoop

2018-08-01

 * 16:23 mforns: deploying refinery using scap
 * 10:06 elukey: restart all the yarn nodemanagers after minor max memory allocation change
 * 09:19 elukey: restart webrequest-load-wf-text-2018-8-1-7 (died due to yarn restarts)
 * 06:59 elukey: restart eventlogging on eventlog1002 to pick up new logging settings

2018-07-28

 * 17:29 elukey: restart eventlogging on eventlog1002 after tons of disconnects (still not clear what happened)

2018-07-27

 * 15:18 joal: Deploying AQS with scap
 * 11:42 joal: Restart mediawiki-history-denormalize oozie job after deploy

2018-07-26

 * 17:09 joal: Restart mediawiki-history-reduced job after deploy
 * 14:03 joal: Restart webrequest-bundle load job to pick new pageview definition
 * 13:59 joal: Start wikidata-coeditors job
 * 07:57 joal: Deploying refinery with scap - 2nd try

2018-07-25

 * 15:42 joal: Deploying refinery onto HDFS
 * 14:38 joal: Release refinery v0.0.67 to archiva

2018-07-24

 * 11:46 joal: Checked that oozie webrequest upload warning for hour 2018-07-24-07 contains only false positives

2018-07-18

 * 08:48 elukey: re-run hour 7 of webrequest upload/text via Hue (failed due to a hadoop node restart)

2018-07-10

 * 10:43 elukey: restart map reduce history server on an1001 as attempt to see if related with yarn.w.o unresponsiveness
 * 10:03 elukey: bounce yarn RM on an100[12], some socket errors after the ip6 interface rollout
 * 08:20 joal: Update AQS druid backend datasource to 2018-06

2018-07-05

 * 10:36 elukey: restart oozie on analytics1003 - connection timeouts from thorium after mariadb maintenance
 * 10:34 elukey: restart hive metastore on an1003, errors after mariadb maintenance this morning
 * 07:44 elukey: all jobs re-enabled
 * 06:26 elukey: stop camus to allow mariadb restart on analytics1003

2018-07-02

 * 14:56 elukey: resume cassandra bundle via hue
 * 13:27 elukey: suspend cassandra bundle via Hue to ease the reimage of aqs1004
 * 09:12 joal: Rerun mediawiki-geoeditors-load-wf-2018-06 after having fixed the wmf_raw.mediawiki_private_cu_changes table issue
 * 07:12 joal: Restart cassandra bundle

2018-06-28

 * 14:46 elukey: upgrade piwik 3.2.1 to matomo (new name/package) 3.5.1
 * 11:27 joal: Change mediawiki-reduced table format to be parquet and restart mediawiki-reduced oozie job
 * 11:19 joal: Restart druid uniques daily-monthly-aggregated indexation jobs
 * 11:19 joal: Start backfilling job cassandra pageviews-top-countries ceiled-values
 * 10:20 joal: Deploying refinery to HDFS
 * 10:09 joal: Deploying refinery using scap
 * 09:03 joal: deploying AQS pageviews-bycountry ceiled value glue code
 * 07:41 fdans: testing load of 2 months of per country pageviews with the new ceiled value
 * 06:10 elukey: move /srv/kafka to a dedicated 60G partition on deployment-jumbo hosts in deployment-prep

2018-06-27

 * 21:51 elukey: piwik maintenance completed
 * 13:08 elukey: piwik upgraded to 3.2.1 on bohrium + started the db migration procedure (will last 2/3h probably)
 * 12:57 elukey: set Piwik in maintenance mode as prep step for backup + upgrade

2018-06-20

 * 19:54 ottomata: removed Kafka MirrorMaker from kafka10(12|13|14)

2018-06-18

 * 11:57 joal: Restart oozie webrequest refine jobs
 * 11:19 joal: Launch oozie webrequest refine jobs for the failing hour 2018-06-14-11
 * 10:18 joal: Deployed refinery on hdfs
 * 10:18 joal: Deployed refinery with scap

2018-06-15

 * 09:00 joal: Deleting corrupted file hdfs://analytics-hadoop/user/joal/wmf/data/raw/webrequest/webrequest_upload/hourly/2018/06/14/11/webrequest_upload.1004.10.1214791.15490650727.1528974000000._COPYING_ to prevent webrequest refine jobs from failing. No data will be lost as the correct file exists.

2018-06-14

 * 19:29 joal: try rerunning webrequest-load-wf-upload-2018-6-14-11
 * 13:14 elukey: re-run failed webrequest-upload/text jobs (namenodes restarted)

2018-06-11

 * 13:56 ottomata: bouncing eventlogging processes to apply kafka event time producing

2018-06-08

 * 11:45 joal: Launching manual sqooping of revision and archive table to recover from failure

2018-06-01

 * 08:37 joal: Restart every druid loading oozie job (except mediawiki reduced) to pick new configuration
 * 08:33 joal: Restart mediawiki-history-denormalize oozie job after deploy
 * 08:24 joal: Deploy refinery on HDFS
 * 08:08 joal: Deploying refinery using scap
 * 07:53 joal: Releasing refinery-source v0.0.65 to archiva
 * 07:05 joal: Rerun virtualpageview-druid-monthly-wf-2018-5

2018-05-31

 * 17:01 ottomata: dropping and deleting MobileWikiAppiOS* tables and data per request from chelsyx
 * 10:51 elukey: stopped Pivot on thorium
 * 07:27 joal: Restart webrequest-load-bundle with default oozie_launcher_memory value (should be 2048 set by workflows)
 * 05:33 elukey: re-run failed webrequest-load upload|misc jobs via Hue
 * 01:02 ottomata: bouncing main-eqiad -> jumbo-eqiad mirror maker

2018-05-30

 * 17:49 joal: Rerun webrequest-load-wf-misc-2018-5-30-16
 * 13:15 elukey: re-run webrequest-load-wf-upload-2018-5-30-11 - died after worker node reboots
 * 06:14 elukey: re-run failed webrequest-load jobs
 * 06:11 elukey: temporary point Turnilo to druid1002 to allow druid1001's reimage
 * 05:50 elukey: restart mirror maker on kafka10[12-23] - failures to consume after rebalance

2018-05-29

 * 17:02 elukey: re-run webrequest-load-text 29th May 2018 12:00:00
 * 15:03 joal: rerun webrequest-load-wf-upload-2018-5-29-13
 * 10:30 elukey: roll restart of druid-middlemanagers on druid* to pick up the new runtime settings (no more references to hadoop-client-cdh)
 * 10:04 elukey: re-run pageview-druid-hourly-wf-2018-5-29-7
 * 07:05 elukey: re-run webrequest-load-wf-text-2018-5-29-1

2018-05-28

 * 18:51 elukey: rerun webrequest-load-wf-upload-2018-5-28-14
 * 18:16 elukey: restart kafka mirror maker on kafka1012->14 - failed after the last round of kafka restarts
 * 12:55 elukey: re-run webrequest-load-wf-misc-2018-5-28-10
 * 05:50 elukey: re-run webrequest-load-wf-misc-2018-5-27-22, webrequest-load-wf-text-2018-5-28-2, webrequest-load-wf-upload-2018-5-28-3

2018-05-27

 * 07:25 joal: Rerun webrequest-load-wf-upload-2018-5-25-23
 * 07:25 joal: rerun webrequest-load-wf-misc-2018-5-26-16 and webrequest-load-wf-misc-2018-5-27-0

2018-05-25

 * 06:53 elukey: re-run webrequest-load-wf-upload-2018-5-24-23 and webrequest-load-wf-text-2018-5-25-4

2018-05-24

 * 17:20 ottomata: dropped and deleted raw and refined eventlogging tables and data for MobileWikiAppiOSUserHistory, MobileWikiAppiOSLoginAction, MobileWikiAppiOSSettingAction, MobileWikiAppiOSReadingLists, MobileWikiAppiOSSessions
 * 16:45 joal: Rerun webrequest-druid-daily-wf-2018-5-23 to correct corrupted data
 * 08:17 elukey: increase webrequest replication to 2 in druid analytics (via coordinator's UI)
 * 08:16 joal: rerun webrequest-load-wf-misc-2018-5-24-6

2018-05-23

 * 14:25 ottomata: redirecting pivot -> turnilo.wikimedia.org - https://phabricator.wikimedia.org/T194427
 * 07:35 elukey: upgrading the Druid labs cluster to Debian Stretch
 * 06:14 elukey: re-run webrequest-load-wf-misc-2018-5-23-2 via Hue

2018-05-22

 * 15:38 elukey: re-run webrequest-druid-hourly-wf-2018-5-22-12 - failed due to Druid cluster upgrade in progress
 * 14:08 elukey: upgrade druid on druid100[1-3] to 0.11
 * 13:37 elukey: killed banner impression data job (application_1523429574968_110796) and removed its related respawn cron on an1003
 * 09:43 elukey: upload to HDFS a new pageview whitelist to include fdc.wikimedia - https://gerrit.wikimedia.org/r/434370
 * 06:54 elukey: upload Fran's pageview whitelist change to HDFS - related code change: https://gerrit.wikimedia.org/r/#/c/434370/ (also includes mai.wikimedia)
 * 06:45 elukey: add nyc.wikimedia to the pageview whitelist on HDFS - related code change: https://gerrit.wikimedia.org/r/434440

2018-05-21

 * 21:24 ottomata: granted User:CN=kafka_fundraising_client read permissions for group fundraising* on kafka-jumbo (for kafkatee webrequest consumption: kafka acls --add --allow-principal User:CN=kafka_fundraising_client --consumer --topic '*' --group 'fundraising*')
 * 19:16 ottomata: restarted eventlogging file log consumers with new consumer groups beginning at end of topic
 * 18:46 ottomata: restarting eventlogging with python-ua-parser 0.8.0
 * 16:46 fdans: deploying refinery
 * 14:40 fdans: Deploying refinery-source v0.0.64 using Jenkins
 * 01:20 ottomata: bouncing main -> jumbo MirrorMaker with increased max.request.size - T189464

2018-05-16

 * 17:45 milimetric: refinery deploy is done
 * 16:21 milimetric: deploying refinery

2018-05-15

 * 16:35 milimetric: finished deploying refinery, cron for dropping old mediawiki snapshots should now be good
 * 16:20 milimetric: deploying refinery to fix that partition drop cron
 * 16:01 joal: Deploy AQS using scap
 * 15:55 ottomata: bouncing main -> analytics MirrorMaker
 * 10:38 joal: Kill-Restart mediawiki-history-reduced oozie coordinator to pick up deployed changes
 * 09:37 joal: Deploy refinery onto HDFS
 * 09:36 joal: Deployed refinery using scap

2018-05-14

 * 23:54 ottomata: bouncing main -> jumbo MirrorMaker with larger max.request.size
 * 22:39 ottomata: bouncing main-eqiad -> jumbo mirror maker after committing new offset for eqiad.mediawiki.job.RecordLintJob
 * 17:27 ottomata: enabling main-eqiad job topics -> jumbo mirroring
 * 14:49 milimetric: deployment of refinery done
 * 14:07 milimetric: deploying refinery to enable dropping cu_changes data

2018-05-11

 * 14:25 elukey: restarted hadoop namenodes/resourcemanagers to apply openjdk security upgrades

2018-05-10

 * 14:11 elukey: re-enabled camus after analytics1003's maintenance
 * 13:08 elukey: disabled all camus jobs to drain the cluster and allow hive/oozie restarts for jvm upgrades

2018-05-09

 * 16:56 ottomata: disabled 0.9 MirrorMaker on kafka102[023], enabled 1.x MirrorMaker on kafka-jumbo*
 * 14:41 milimetric: finished deploying refinery with proper geoeditors druid indexing template
 * 13:59 ottomata: beginning upgrade of Kafka main-eqiad cluster from 0.9.0.1 to 1.1.0 - T167039
 * 13:49 milimetric: deploying refinery again, forgot to index a new metric in the new datasource, sorry
 * 13:23 mforns: re-run webrequest-load-wf-misc-2018-5-9-12 via hue
 * 13:13 milimetric: deployed refinery
 * 12:58 milimetric: deploying very simple change just to rename druid datasource
 * 12:48 elukey: re-run webrequest-load-wf-text-2018-5-8-17 via hue

2018-05-08

 * 20:35 milimetric: refinery deploy complete
 * 20:18 milimetric: deploying geoeditors for real now
 * 20:12 milimetric: aborting deployment, will deploy data truncation script too
 * 20:08 milimetric: deploying refinery to relaunch geoeditors job
 * 17:57 joal: Move recomputed 2018-03 history snapshot in place of old one (T194075)
 * 15:38 joal: Try again (last time) to rerun mediawiki-history-druid-wf-2018-04
 * 15:06 ottomata: beginning Kafka upgrade of main-codfw: T167039
 * 08:01 elukey: removed cassandra-metrics-collector (graphite) from aqs nodes
 * 07:42 joal: Rerun mediawiki-history-druid-wf-2018-04 in a non-sync way with mediawiki-reduced
 * 06:41 elukey: rolling restart of druid-historicals on druid100[456] due to half of the segments not available

2018-05-07

 * 12:05 joal: Rerun mediawiki-history-reduced-wf-2018-04
 * 09:18 elukey: re-run webrequest-load-wf-text-2018-5-7-7 - failed due to reimages

2018-05-04

 * 10:11 elukey: d-[123] Druid cluster upgraded to 0.11 in labs (project analytics)

2018-05-03

 * 20:29 milimetric: fixed wikimetrics issues, working fine again
 * 19:19 milimetric: wikimetrics is partly broken until I can figure out what’s going on

2018-05-02

 * 17:33 joal: Rerun webrequest-load-wf-text-2018-5-2-15
 * 16:41 joal: Manually silence pageview-whitelist alarm overwriting /wmf/refinery/current/static_data/pageview/whitelist/whitelist.tsv
 * 16:27 joal: 2018-05-02T14 webrequest dataloss warnings have been checked and are false positives
 * 16:17 joal: Restart oozie mediawiki-history-denormalize job after deploy
 * 16:14 ottomata: bounced eventlogging-consumer@mysql-m4-master-00 after kafka jumbo 1.1.0 upgrade
 * 16:05 joal: Restart oozie webrequest bundle after deploy
 * 15:20 joal: Deploying refinery to hadoop
 * 14:45 joal: Deploying refinery using Scap
 * 14:16 joal: Refinery-source version 0.0.63 finally released to Archiva!
 * 13:49 ottomata: beginning upgrade of kafka-jumbo brokers from 1.0.0 -> 1.1.0 : T193495
 * 13:20 elukey: restart druid broker on druid100[1-3] to enable druid.sql.enable: true

2018-05-01

 * 15:33 elukey: restart historical on druid1003 - exceptions in the logs
 * 15:22 elukey: restart druid-historical on druid1002 - Caused by: java.lang.IllegalArgumentException: Could not resolve type id 'hdfs' into a subtype of
 * 11:44 joal: False positive only in webrequest-load-check_sequence_statistics-wf-upload-2018-5-1-6
 * 07:14 joal: Rerun webrequest-druid-daily-wf-2018-4-30
 * 06:24 elukey: roll restart of all middlemanagers on druid100[123] - realtime tasks piled up for hours

2018-04-30

 * 23:04 ottomata: blacklisting change-prop and job queue topics from main-eqiad -> analytics (eqiad)
 * 22:55 ottomata: bouncing kafka main-eqiad -> eqiad (analytics) mirror maker
 * 19:34 joal: Retry releasing refinery-source to archiva
 * 18:43 joal: Releasing refinery-source
 * 15:53 joal: Resume webrequest-druid-hourly-coord and pageview-druid-hourly-coord
 * 14:23 joal: Suspend webrequest-druid-hourly-coord and pageview-druid-hourly-coord before druid upgrade
 * 14:23 elukey: disabled cron/check on analytics1003 to respawn banner impressions if needed
 * 14:21 joal: Kill BannerImpressionStream job before upgrading druid

2018-04-25

 * 14:39 elukey: re-enable camus after maintenance
 * 14:37 elukey: restart hive-server2 on analytics1003 to pick up settings in https://gerrit.wikimedia.org/r/428919
 * 13:40 elukey: stop camus on an1003 as prep step to gracefully restart hive server
 * 12:24 joal: Only false positive for Data Loss Warning - Workflow webrequest-load-check_sequence_statistics-wf-upload-2018-4-25-10

2018-04-24

 * 16:30 elukey: restart hadoop hdfs journalnode on analytics1035/52 to pick up prometheus jmx settings
 * 14:41 elukey: restart hadoop hdfs journalnode on analytics1028 to pick up jmx settings
 * 12:08 elukey: restart webrequest-load-wf-text-2018-4-24-9 via Hue (failed due to reimages)
 * 06:57 joal: correct reindexation job: https://hue.wikimedia.org/oozie/list_oozie_coordinator/0033859-180330093100664-oozie-oozi-C/
 * 06:55 joal: Reindexation job: https://hue.wikimedia.org/oozie/list_oozie_coordinator/0033855-180330093100664-oozie-oozi-C/
 * 06:54 joal: Manually reindexing all of mediawiki-history for snapshot 2018-03 after having messed it with job testing

2018-04-23

 * 20:41 milimetric: deployed a version of wikistats with all but reading metrics disabled to stop showing bad data
 * 19:34 elukey: deploy https://gerrit.wikimedia.org/r/428331 for Pivot
 * 14:10 ottomata: switching main -> analytics MirrorMaker to --new.consumer (temporarily stopping puppet on kafka101[234]) https://phabricator.wikimedia.org/T192387
 * 13:54 elukey: reimage analytics1067 to debian stretch

2018-04-20

 * 18:23 joal: Drop/recreate wmf.mediawiki_user_history and wmf.mediawiki_page_history for T188669
 * 14:17 elukey: d-[1,2,3] hosts in the analytics labs project upgraded to druid 0.10
 * 11:37 fdans: manually uploaded refinery whitelist to hdfs
 * 11:33 elukey: reimage analytics1068 to Debian stretch

2018-04-19

 * 20:39 milimetric: launched virtual pageviews job, it has id 0026169-180330093100664-oozie-oozi-C
 * 20:36 milimetric: Synced latest refinery version to HDFS
 * 17:35 fdans: refinery deployment - sync to hdfs finished
 * 16:27 elukey: analytics1069 reimaged to Debian stretch
 * 15:40 fdans: deploying refinery
 * 14:30 elukey: disabled druid1001's middlemanager, restarted 1002's
 * 14:19 elukey: add 60G /srv partition to hadoop-coordinator-1 in analytics labs
 * 14:04 elukey: disabled druid1002's worker as prep step for restart - jvms with an old version running realtime indexation

2018-04-16

 * 10:04 joal: Restart metrics job after table update
 * 09:54 joal: Update wmf.mediawiki_metrics table for T190058
 * 08:41 joal: Restart Mediawiki-history job after new patches
 * 08:35 joal: Restarting wikidata-articleplaceholder oozie job after last week's failures
 * 08:29 joal: Deploying refinery onto HDFS
 * 08:22 joal: Deploying refinery from tin
 * 08:03 joal: Correction - Deploying refinery-source v0.0.62 using Jenkins !
 * 08:03 joal: Deploying refinery source v0.0.62 from tin

2018-04-12

 * 20:34 ottomata: replacing references to dataset1001.wikimedia.org:: with /srv/dumps in stat1005:~ezachte/wikistats/dammit.lt/bash: for f in $(sudo grep -l dataset1001.wikimedia.org *); do sudo sed -i 's@dataset1001.wikimedia.org::@/srv/dumps/@g' $f; done T189283

2018-04-11

 * 16:48 elukey: restart hadoop namenodes to pick up HDFS trash settings

2018-04-10

 * 22:43 joal: Deploying refinery with scap
 * 22:42 joal: Refinery-source 0.0.61 deployed on archiva
 * 20:43 ottomata: bouncing main -> jumbo mirrormakers to blacklist job topics until we have time to investigate more
 * 20:38 ottomata: restarted event* camus and refine cron jobs, puppet is reenabled on analytics1003
 * 20:14 ottomata: restart mirrormakers main -> jumbo (AGAIN)
 * 19:26 ottomata: restarted camus-webrequest and camus-mediawiki (avro) camus jobs
 * 18:18 ottomata: restarting all hadoop nodemanagers, 3 at a time to pick up spark2-yarn-shuffle.jar T159962
 * 18:06 joal: Deploy refinery to HDFS
 * 17:46 joal: Refinery source 0.0.60 deployed to archiva
 * 15:42 ottomata: disable puppet on analytics1003 and stop camus crons in preparation for spark 2 upgrade
 * 14:25 ottomata: bouncing all main -> jumbo mirror makers, they look stuck!
 * 09:00 elukey: restart eventlogging mysql consumers on eventlog1002 to pick up new DNS changes for m4-master - T188991

2018-04-09

 * 07:15 elukey: upgrade kafka burrow on kafkamon*

2018-04-06

 * 17:14 joal: Launch manual mediawiki-history-reduced job to test memory setting (and index new data) -- mediawiki-history-reduced-wf-2018-03
 * 13:39 joal: Rerun mediawiki-history-druid-wf-2018-03

2018-04-05

 * 19:24 ottomata: upgrading spark2 to spark 2.3
 * 13:43 mforns: created success files in /wmf/data/raw/mediawiki/tables/ /snapshot=2018-03 for in revision, logging, pagelinks
 * 13:38 mforns: copied sqooped data for mediawiki history from /user/mforns over to /wmf/data/raw/mediawiki/tables/ for enwiki, table: revision

2018-04-04

 * 21:07 mforns: copied sqooped data for mediawiki history from /user/mforns over to /wmf/data/raw/mediawiki/tables/ for wikidatawiki and commonswiki, tables: revision, logging and pagelinks
 * 16:06 elukey: killed banner-impression related jvms on an1003 to finish openjdk-8 upgrades (they should be brought back via cron)

2018-04-03

 * 20:11 ottomata: bouncing main -> jumbo mirrormaker to apply batch.size = 65536
 * 19:32 ottomata: bouncing main -> jumbo MirrorMaker unsetting session.timeout.ms, this has a restriction on the broker in 0.9 :(
 * 19:22 ottomata: bouncing main -> jumbo MirrorMaker setting session.timeout.ms = 125000
 * 18:46 ottomata: restart main -> jumbo MirrorMaker with request.timeout.ms = 2 minutes
 * 15:26 elukey: manually run hdfs balancer on an1003 (tmux session)
 * 15:25 elukey: killed a jvm belonging to hdfs-balancer stuck from March 9th
 * 13:48 ottomata: re-enable job queue topic mirroring from main -> eqiad

2018-04-02

 * 22:28 ottomata: bounce mirror maker to pick up client_id config changes
 * 20:55 ottomata: deployed multi-instance mirrormaker for main -> jumbo. 4 per host == 12 total processes
 * 11:25 joal: Repair cu_changes hive table after successful sqoop import and add _PARTITIONED file for oozie jobs to launch
 * 08:33 joal: rerun wikidata-specialentitydata_metrics-wf-2018-4-1

2018-03-30

 * 13:48 elukey: restart overlord+middlemanager on druid100[23] to avoid consistency issues
 * 13:41 elukey: restart overlord+middlemanager on druid1001 after failures in real time indexing (overlord leader)
 * 09:44 elukey: re-enable camus
 * 08:26 elukey: stopped camus to drain the cluster - prep for easy restart of analytics1003's jvm daemons

2018-03-29

 * 20:55 milimetric: accidentally killed mediawiki-geowiki-monthly-coord, and then restarted it
 * 20:12 ottomata: blacklisted mediawiki.job topics from main -> jumbo MirrorMaker again, don't want to page over the weekend while this still is not stable. T189464
 * 07:30 joal: Manually repairing hive mediawiki_private_cu_changes table after manual sqooping of 2018-01 data, and add _PARTITIONNED file to the folder

2018-03-28

 * 19:39 ottomata: bouncing main -> jumbo mirrormaker to apply increase in consumer num.streams
 * 19:21 milimetric: synced refinery to hdfs (only python changes but just so we have latest)
 * 19:20 joal: Start Geowiki jobs (monthly and druid) starting 2018-01
 * 18:36 joal: Making hdfs://analytics-hadoop/wmf/data/wmf/mediawiki_private accessible only by analytics-privatedata-users group (and hdfs obviously)
 * 18:02 joal: Kill-Restart mobile_apps-session_metrics (bundle killed, coord started)
 * 18:00 joal: Kill-Restart mediawiki-history-reduced-coord after deploy
 * 17:44 joal: Deploying refinery onto hadoop
 * 17:29 joal: Deploy refinery using scap
 * 16:32 ottomata: bouncing main -> jumbo mirror makers to increase heap size to 2G
 * 14:16 ottomata: re-enabling replication of mediawiki job topics from main -> jumbo

2018-03-27

 * 14:03 elukey: consolidate all the zookeeper definition in one 'main-eqiad' one in Horizon -> Project-Analytics
 * 11:16 elukey: kill banner impression job to force a respawn (still using an old jvm)

2018-03-26

 * 15:12 elukey: restart eventlogging mysql consumers after maintenance
 * 14:26 ottomata: restarting jumbo -> eqiad mirror makers with prometheus instead of jmx
 * 13:28 ottomata: restarting kafka mirror maker main -> jumbo using new consumer
 * 13:09 fdans: stopped 2 mysql consumers as precaution for T174386

2018-03-24

 * 08:13 joal: kill failing query swamping the cluster (application_1520532368078_47226)

2018-03-23

 * 16:44 elukey: invalidated 2018-03-12/13 for iOS data in piwik to force a re-run of the archiver

2018-03-20

 * 10:10 elukey: removed old mysql/ssh/ganglia analytics vlan firewall rules (https://phabricator.wikimedia.org/T189408#4055749)

2018-03-19

 * 09:38 elukey: restart hadoop daemons on analytics1070 for openjdk upgrades (canary)

2018-03-16

 * 20:23 ottomata: bouncing main -> jumbo mirror makers to apply change-prop topic blacklist
 * 14:44 ottomata: restarting eventlogging mysql eventbus consumer to consume from analytics instead of jumbo
 * 14:38 elukey: temporarily point pivot to druid1002 as prep step for druid1001's reboot
 * 14:37 elukey: disable druid1001's middlemanager as prep step for reboot
 * 14:24 elukey: changed superset druid private config from druid1002 to druid1003
 * 13:43 elukey: disable druid1002's middle manager via API as prep step for reboot
 * 09:57 elukey: restart eventlogging-consumer@mysql-m4/eventbus on eventlog1002 to force the DNS resolution of m4-master (changed from dbproxy1009 -> dbproxy1004)

2018-03-15

 * 22:13 ottomata: bounced jumbo mirror makers
 * 19:10 ottomata: bouncing main -> jumbo mirror maker
 * 14:50 joal: Restart clickstream-coord to pick new config including fawiki
 * 14:29 elukey: disabled druid1003's middlemanager as prep step for reboot
 * 14:07 ottomata: bouncing kafka jumbo -> eqiad mirrormaker

2018-03-14

 * 15:27 ottomata: bouncing main -> jumbo mirror maker instances
 * 14:45 ottomata: beginning migration of eventlogging analytics from Kafka analytics to Kafka jumbo: T183297

2018-03-13

 * 20:47 ottomata: restarting eventlogging processors to pick up VirtualPageView blacklist from eventlogging-valid-mixed topic
 * 15:13 ottomata: bounce main -> analytics mirror maker instances
 * 15:07 ottomata: bouncing MirrorMaker on kafka1020 (main -> jumbo) to re-apply acks=all
 * 14:55 ottomata: bouncing MirrorMaker on kafka1022 to re-apply acks=all (main -> jumbo)
 * 14:32 ottomata: bouncing MirrorMaker on kafka1023 (main -> jumbo) to re-apply acks=all
 * 14:22 ottomata: bouncing mirrormaker for main -> analytics on kafka101[234] to apply roundrobin

2018-03-12

 * 19:39 ottomata: deployed new Refine jobs (eventlogging, eventbus, etc.) with deduplication and geocoding and casting
 * 18:17 ottomata: bouncing kafka mm eqiad -> jumbo with acks=1
 * 18:10 ottomata: bouncing kafka mirrormaker for main-eqiad -> jumbo with buffer.memory=128M
 * 17:34 joal: Restart mediawiki-history-reduced oozie job to add a dependency
 * 16:55 joal: Restart mobile_apps_session_metrics
 * 16:52 joal: Deploying refinery on HDFS for mobile_apps patch
 * 16:26 joal: Deploying refinery again to provide patch for mobile_apps_session_metric job
 * 15:09 joal: Deploy refinery onto hdfs
 * 15:07 joal: Deploy refinery from scap
 * 14:32 elukey: restart druid-broker on druid1004 - no /var/log/druid/broker.log after 2018-03-10T22:38:52 (java.io.IOException: Too many open files)
 * 08:50 elukey: fixed eventlog1002's ipv6 (https://gerrit.wikimedia.org/r/#/c/418714/)

2018-03-10

 * 09:07 joal: Rerun clickstream-wf-2018-2
 * 00:32 milimetric: finished sqooping pagelinks for missing dbs, hdfs -put a SUCCESS flag in the 2018-02 snapshot, jobs should run unless Hue is still lying to itself

2018-03-09

 * 17:29 joal: Rerun mediawiki-history-reduced job after having manually repaired wmf_raw.mediawiki_project_namespace_map

2018-03-08

 * 18:05 ottomata: bouncing ResourceManagers
 * 08:54 elukey: re-enable camus after reboots
 * 07:15 elukey: disable Camus on an1003 to allow the cluster to drain - prep step for an100[123] reboot

2018-03-07

 * 07:15 elukey: manually re-run wikidata-articleplaceholder_metrics-wf-2018-3-6

2018-03-06

 * 20:44 ottomata: reverted change to point mediawiki monolog kafka producers at kafka jumbo-eqiad until deployment train is done T188136
 * 20:35 ottomata: pointing mediawiki monolog kafka producers at kafka jumbo-eqiad cluster: T188136
 * 19:06 elukey: cleaned up id=0 rows on db1108 (log database) for T188991
 * 10:19 elukey: restart webrequest-load-wf-upload-2018-3-6-7 (failed due to reboots)
 * 10:08 elukey: re-starting mysql consumers on eventlog1001
 * 09:41 elukey: stop eventlogging's mysql consumers for db1107 (el master) kernel updates

2018-03-05

 * 18:22 elukey: restart webrequest-load-wf-upload-2018-3-5-16 via Hue (failed due to reboots)
 * 18:21 elukey: restart webrequest-load-wf-text-2018-3-5-16 via Hue (failed due to reboots)
 * 15:00 mforns: rerun mediacounts-load-wf-2018-3-5-9
 * 10:57 joal: Relaunch Mediawiki-history job manually from spark2 to see if the new version helps
 * 10:57 joal: Killing failing Mediawiki-History job for 2018-03

2018-03-02

 * 15:33 mforns: rerun webrequest-load-wf-text-2018-3-2-12

2018-03-01

 * 14:59 elukey: shutdown deployment-eventlog02 in favor of deployment-eventlog05 in deployment-prep (Ubuntu -> Debian EL migration)
 * 09:45 elukey: rerun webrequest-load-wf-text-2018-3-1-6 manually, failed due to analytics1030's reboot

2018-02-28

 * 22:09 milimetric: re-deployed refinery for a small docs fix in the sqoop script
 * 17:55 milimetric: Refinery synced to HDFS, deploy completed
 * 17:40 milimetric: deploying Refinery
 * 08:38 joal: rerun cassandra-hourly-wf-local_group_default_T_pageviews_per_project_v2-2018-2-27-15

2018-02-27

 * 19:12 ottomata: updating spark2-* CLIs to spark 2.2.1: T185581

2018-02-21

 * 20:48 ottomata: now running 2 camus webrequest jobs, one consuming from jumbo (no data yet), the other from analytics. these should be fine to run in parallel.
 * 07:21 elukey: reboot db1108 (analytics-slave.eqiad.wmnet) for mariadb+kernel updates

2018-02-19

 * 17:14 elukey: deployed eventlogging - https://gerrit.wikimedia.org/r/#/c/405687/
 * 07:35 elukey: re-run wikidata-specialentitydata_metrics-wf-2018-2-17 via Hue

2018-02-16

 * 15:41 elukey: add analytics1057 back in the Hadoop worker pool after disk swap
 * 10:55 elukey: increased topic partitions for netflow to 3

2018-02-15

 * 13:54 milimetric: deployment of refinery and refinery-source done
 * 12:52 joal: Killing webrequest-load bundle (next restart should be at hour 12:00)
 * 08:18 elukey: removed jmxtrans and java 7 from analytics1003 and re-launched refinery-drop-mediawiki-snapshots
 * 07:51 elukey: removed default-java packages from analytics1003 and re-launched refinery-drop-mediawiki-snapshots

2018-02-14

 * 13:44 elukey: rollback java 8 upgrade for archiva - issues with Analytics builds
 * 13:35 elukey: installed openjdk-8 on meitnerium, manually upgraded java-update-alternatives to java8, restarted archiva
 * 13:14 elukey: removed java 7 packages from analytics100[12]
 * 12:43 elukey: jmxtrans removed from all the Hadoop workers
 * 12:43 elukey: openjdk-7-* packages removed from all the Hadoop workers

2018-02-13

 * 11:42 elukey: force kill of yarn nodemanager + other containers on analytics1057 (node failed, unit masked, processes still around)

2018-02-12

 * 23:16 elukey: re-run webrequest-load-wf-upload-2018-2-12-21 via Hue (node managers failure)
 * 23:13 elukey: manual restart of Yarn Node Managers on analytics1058/31
 * 23:09 elukey: cleaned up tmp files on all analytics hadoop worker nodes, job filling up tmp
 * 17:18 elukey: home dirs on stat1004 moved to /srv/home (/home symlinks to it)
 * 17:15 ottomata: restarting eventlogging-processors to blacklist Print schema in eventlogging-valid-mixed (MySQL)
 * 14:46 ottomata: deploying eventlogging for T186833 with EventCapsule in code and IP NO_DB_PROPERTIES

2018-02-09

 * 12:19 joal: Rerun wikidata-articleplaceholder_metrics-wf-2018-2-8

2018-02-08

 * 16:23 elukey: stop archiva on meitnerium to swap /var/lib/archiva from the root partition to a new separate one

2018-02-07

 * 13:55 joal: Manually restarted druid indexation after weird failure of mediawiki-history-reduced-wf-2018-01
 * 13:49 elukey: restart overlord/middlemanager on druid1005

2018-02-06

 * 19:40 joal: Manually restarted druid indexation after weird failure of mediawiki-history-reduced-wf-2018-01
 * 15:36 elukey: drain + shutdown of analytics1038 to replace faulty BBU
 * 09:58 elukey: applied https://gerrit.wikimedia.org/r/c/405687/ manually on deployment-eventlog02 for testing

2018-02-05

 * 15:51 elukey: live hacked deployment-eventlog02's /srv/deployment/eventlogging/analytics/eventlogging/handlers.py to add poll(0) to the confluent kafka producer - T185291
 * 11:03 elukey: restart eventlogging/forwarder legacy-zmq on eventlog1001 due to slow memory leak over time (cached memory down to zero)

2018-02-02

 * 17:09 joal: Webrequest upload 2018-02-02 hours 9 and 11 dataloss warnings have been checked - they are false positives
 * 09:56 joal: Rerun unique_devices-per_project_family-monthly-wf-2018-1 after failure

2018-02-01

 * 17:00 ottomata: killing stuck JsonRefine eventlogging analytics job application_1515441536446_52892, not sure why this is stuck.
 * 14:06 joal: Dataloss alerts for upload 2018-02-01 hours 1, 2, 3 and 5 were false positives
 * 12:17 joal: Restart cassandra monthly bundle after January deploy

2018-01-23

 * 20:10 ottomata: hdfs dfs -chmod 775 /wmf/data/archive/mediacounts/daily/2018 for T185419
 * 09:26 joal: Dataloss warning for upload and text 2018-01-23:06 is confirmed to be a false positive

2018-01-22

 * 17:36 joal: Kill-Restart clickstream oozie job after deploy
 * 17:12 joal: deploying refinery onto HDFS
 * 17:12 joal: Refinery deployed from scap

2018-01-18

 * 19:11 joal: Kill-Restart coord_pageviews_top_bycountry_monthly oozie job from 2015-05
 * 19:10 joal: Add fake data to cassandra to silence alarms (Thanks again ema)
 * 18:56 joal: Truncating table "local_group_default_T_top_bycountry"."data" in cassandra before reload
 * 15:21 mforns: refinery deployment using scap and then deploying onto hdfs finished
 * 15:07 mforns: starting refinery deployment
 * 12:43 elukey: piwik on bohrium re-enabled
 * 12:40 elukey: set piwik in readonly mode and stopped mysql on bohrium (prep step for reboot)
 * 09:38 elukey: reboot thorium (analytics webserver) for security upgrade - This maintenance will cause temporary unavailability of the Analytics websites
 * 09:37 elukey: resumed druid hourly index jobs via hue and restored pivot's configuration
 * 09:21 elukey: reboot druid1001 for kernel upgrades
 * 09:00 elukey: suspended hourly druid batch index jobs via Hue
 * 08:58 elukey: temporarily set druid1002 in superset's druid cluster config (via UI)
 * 08:53 elukey: temporarily point pivot's configuration to druid1002 (druid1001 needs to be rebooted)
 * 08:52 elukey: disable druid1001's middlemanager as prep step for reboot
 * 07:11 elukey: re-run webrequest-load-wf-misc-2018-1-18-3 via Hue

2018-01-17

 * 17:33 elukey: killed the banner impression spark job (application_1515441536446_27293) again to force it to respawn (real time indexers not present)
 * 17:29 elukey: restarted all druid overlords on druid100[123] (weird race condition messages about who was the leader for some task)
 * 16:24 elukey: re-run all the pageview-druid-hourly failed jobs via Hue
 * 14:42 elukey: restart druid middlemanager on druid1003 as an attempt to unblock realtime streaming
 * 14:21 elukey: forced kill of banner impression data streaming job to get it restarted
 * 11:44 elukey: re-run pageview-druid-hourly-wf-2018-1-17-9 and pageview-druid-hourly-wf-2018-1-17-8 (failed due to druid1002's middlemanager being in a weird state after reboot)
 * 11:44 elukey: restart druid middlemanager on druid1002
 * 10:38 elukey: stopped all crons on hadoop-coordinator-1
 * 10:37 elukey: re-run webrequest-druid-hourly-wf-2018-1-17-8 (failed due to druid1002's reboot)
 * 10:22 elukey: reboot druid1002 for kernel upgrades
 * 09:53 elukey: disable druid middlemanager on druid1002 as prep step for reboot
 * 09:46 elukey: rebooted analytics1003
 * 09:46 elukey: removed upstart config for brrd on eventlog1001 (failing and spamming syslog, old leftover?)
 * 08:53 elukey: disabled camus as prep step for analytics1003 reboot

2018-01-15

 * 13:39 elukey: stop eventlogging and reboot eventlog1001 for kernel updates
 * 09:58 elukey: rolling reboots of aqs hosts (1005->1009) for kernel updates
 * 09:11 elukey: reboot aqs1004 for kernel updates

2018-01-12

 * 13:03 joal: Rerun webrequest-load-wf-text-2018-1-12-9
 * 13:02 joal: Rerun webrequest-load-wf-upload-2018-1-12-9
 * 10:33 elukey: reboot analytics1066->69 for kernel updates
 * 09:07 elukey: reboot analytics1063->65 for kernel updates

2018-01-11

 * 22:35 ottomata: restarting kafka-jumbo brokers to apply https://gerrit.wikimedia.org/r/403774
 * 22:04 ottomata: restarting kafka-jumbo brokers to apply https://gerrit.wikimedia.org/r/#/c/403762/
 * 20:57 ottomata: restarting kafka-jumbo brokers to apply https://gerrit.wikimedia.org/r/#/c/403753/
 * 17:37 joal: Kill manual banner-streaming job to see it restarted by cron
 * 17:11 ottomata: restart kafka on kafka-jumbo1003
 * 17:08 ottomata: restart kafka on kafka-jumbo1001...something is not right with my certpath change yesterday
 * 14:46 joal: Deploy refinery onto HDFS
 * 14:33 joal: Deploy refinery with Scap
 * 14:07 joal: Manually restarting banner streaming job to prevent alerting
 * 13:23 joal: Killing banner-streaming job to have it auto-restarted from cron
 * 11:45 elukey: re-run webrequest-load-wf-text-2018-1-11-8 (failed due to reboots)
 * 11:39 joal: rerun mediacounts-load-wf-2018-1-11-8
 * 10:48 joal: Restarting banner-streaming job after hadoop nodes reboot
 * 10:01 elukey: reboot analytics1059-61 for kernel updates
 * 09:34 elukey: reboot analytics1055->1058 for kernel updates
 * 09:04 elukey: reboot analytics1051->1054 for kernel updates

2018-01-10

 * 16:56 elukey: reboot analytics1048->50 for kernel updates
 * 16:23 ottomata: restarting kafka jumbo brokers to apply java.security certpath restrictions
 * 11:51 elukey: re-run webrequest-load-wf-upload-2018-1-10-10 (failed due to reboots)
 * 11:27 elukey: re-run webrequest-load-wf-text-2018-1-10-10 (failed due to reboots)
 * 11:26 elukey: reboot analytics1044->47 for kernel updates
 * 11:03 elukey: reboot analytics1040->43 for kernel updates

2018-01-09

 * 16:53 joal: Rerun pageview-druid-hourly-wf-2018-1-9-13
 * 15:33 elukey: stop mysql on dbstore1002 as prep step for shutdown (stop all slaves, mysql stop)
 * 15:10 elukey: reboot analytics1028 (hadoop worker and hdfs journal node) for kernel updates
 * 15:00 elukey: reboot kafka-jumbo1006 for kernel updates
 * 14:41 elukey: reboot kafka-jumbo1005 for kernel updates
 * 14:33 elukey: reboot kafka1023 for kernel updates
 * 14:04 elukey: reboot kafka1022 for kernel updates
 * 13:51 elukey: reboot kafka-jumbo1003 for kernel updates
 * 10:08 elukey: reboot kafka-jumbo1002 for kernel updates
 * 09:35 elukey: reboot kafka1014 for kernel updates

2018-01-08

 * 19:07 milimetric: Deployed refinery and synced to hdfs
 * 15:23 elukey: reboot kafka1013 for kernel updates
 * 13:40 elukey: reboot analytics10[36-39] for kernel updates
 * 12:59 elukey: reboot kafka1012 for kernel updates
 * 12:43 joal: Deploy AQS from tin
 * 12:36 fdans: Deploying AQS
 * 12:33 joal: Update fake-data in cassandra adding top-by-country needed row
 * 11:07 elukey: re-run webrequest-load-wf-text-2018-1-8-8 (failed after some reboots due to kernel updates)
 * 10:07 elukey: drain + reboot analytics1029,1031->1034 for kernel updates

2018-01-07

 * 09:01 elukey: re-enabled puppet on db110[78] - eventlogging_sync restarted on db1108 (analytics-slave)

2018-01-06

 * 08:09 elukey: re-enable eventlogging mysql consumers after database maintenance

2018-01-05

 * 13:18 fdans: deploying AQS

2018-01-04

 * 19:54 joal: Deploying refinery onto hadoop
 * 19:45 joal: Deploy refinery using scap
 * 19:38 joal: Deploy refinery-source using jenkins
 * 16:01 ottomata: killing json_refine_eventlogging_analytics job that started yesterday and has not completed (has no executors running?) application_1512469367986_81514. I think the cluster is just too busy? mw-history job running...
 * 10:34 elukey: re-run mediacounts-archive-wf-2018-01-03

2018-01-03

 * 15:00 ottomata: restarting kafka-jumbo brokers to enable tls version and cipher suite restrictions

2018-01-02

 * 11:13 joal: Kill and restart cassandra loading oozie bundle to pick new pageview_top_bycountry job
 * 08:22 elukey: restart druid coordinators to pick up new jvm settings (freeing up 6GB of used memory)