Analytics/Server Admin Log

2020-07-03

 * 19:20 joal: restart failed webrequest-load job webrequest-load-wf-text-2020-7-3-17 with higher thresholds - error due to burst of requests in ulsfo
 * 19:13 joal: restart mediawiki-history-denormalize oozie job using 0.0.115 refinery-job jar
 * 19:05 joal: kill manual execution of mediawiki-history to save an-coord1001 (too big of a spark-driver)
 * 18:53 joal: restart webrequest-load-wf-text-2020-7-3-17 after hive server failure
 * 18:52 joal: restart data_quality_stats-wf-event.navigationtiming-useragent_entropy-hourly-2020-7-3-15 after hive server failure
 * 18:51 joal: restart virtualpageview-hourly-wf-2020-7-3-15 after hive-server failure
 * 16:41 joal: Rerun mediawiki-history-check_denormalize-wf-2020-06 after having cleaned up wrong files and restarted a job without deterministic skewed join

2020-07-02

 * 18:16 joal: Launch a manual instance of mediawiki-history-denormalize to release data despite oozie failing
 * 16:17 joal: rerun mediawiki-history-denormalize-wf-2020-06 after oozie sharelib bump through manual restart
 * 12:41 joal: retry mediawiki-history-denormalize-wf-2020-06
 * 07:26 elukey: start a tmux on an-launcher1002 with 'sudo -u analytics /usr/local/bin/kerberos-run-command analytics /usr/local/bin/refinery-sqoop-mediawiki-production'
 * 07:20 elukey: execute systemctl reset-failed refinery-sqoop-whole-mediawiki.service to clear our alarms on launcher1002

2020-07-01

 * 19:04 joal: Kill/restart webrequest-load-bundle for mobile-pageview update
 * 18:59 joal: kill/restart pageview-druid jobs (hourly, daily, monthly) for in_content_namespace field update
 * 18:57 joal: kill/restart mediawiki-wikitext-history-coord and mediawiki-wikitext-current-coord for bz2 codec update
 * 18:55 joal: kill/restart mediawiki-history-denormalize-coord after skewed-join strategy update
 * 18:52 joal: Kill/Restart unique_devices-per_project_family-monthly-coord after fix
 * 18:41 joal: deploy refinery to HDFS
 * 18:28 joal: Deploy refinery using scap after hotfix
 * 18:20 joal: Deploy refinery using scap
 * 16:58 joal: trying to release refinery-source 0.0.129 to archiva, version 3
 * 16:51 elukey: remove /etc/maven/settings.xml from all analytics nodes that have it

2020-06-30

 * 18:28 joal: trying to release refinery-source to archiva from jenkins (second time)
 * 16:30 joal: Release refinery-source v0.0.129 using jenkins
 * 16:30 joal: Deploy refien
 * 16:05 elukey: re-enable timers on an-launcher1002 after archiva maintenance
 * 15:23 elukey: stop timers on an-launcher1002 to ease debugging for refinery deploy
 * 13:12 elukey: restart nodemanager on analytics1068 after GC overhead and OOMs
 * 09:32 joal: Kill/Restart mediawiki-wikitext-history job now that the current month one is done (bz2 fix)

2020-06-29

 * 13:09 elukey: archiva.wikimedia.org migrated to archiva1002

2020-06-25

 * 17:20 elukey: move RU jobs/timers from an-launcher1001 to an-launcher1002
 * 16:07 elukey: move all timers but RU from an-launcher1001 to 1002 (puppet disabled on 1001, all timers completed)
 * 12:13 elukey: reimage notebook1003/4 to debian buster as fresh start
 * 09:28 joal: Kill-restart pageview-hourly to read from pageview_actor
 * 09:25 joal: Kill-restart pageview_actor jobs (current+backfill) after deploy
 * 09:14 joal: Deploy refinery to HDFS
 * 08:56 joal: deploying refinery using scap to fix pageview_actor_hourly
 * 08:02 joal: Start backfilling pageview_actor_hourly job with new patch (expected to solve heisenbug)
 * 07:40 joal: Dropping refinery-camus jars from archiva up to 0.0.115
 * 07:04 joal: rerun failed pageview_actor_hourly

2020-06-24

 * 19:36 joal: Cleaning refinery-spark from archiva (up to 0.0.115)
 * 19:28 joal: Cleaning refinery-tools from archiva (up to 0.0.115)
 * 19:16 joal: Restarting unique-devices jobs to use pageview_actor_hourly instead of webrequest (4 jobs)
 * 19:08 joal: Start pageview_actor_hourly oozie job
 * 19:06 joal: Create pageview_actor_hourly after deploy to start new jobs
 * 18:57 joal: Clean archiva refinery-camus except 0.0.90
 * 18:54 joal: Deploying refinery onto HDFS
 * 18:47 joal: clean archiva from refinery-hive (up to 0.0.115)
 * 18:47 joal: Deploying refinery using scap
 * 18:15 joal: launching a new jenkins release after cleanup
 * 17:43 joal: Resetting refinery-source to v0.0.128 for clean release after jenkins-archiva password fix
 * 16:20 joal: Releasing refinery-source 0.0.128 to archiva
 * 06:50 elukey: truncate /srv/reportupdater/log/reportupdater-ee-beta-features from 43G to 1G on an-launcher1001 (disk space issues)

2020-06-22

 * 18:50 joal: Manually update pageview whitelist adding shnwiktionary

2020-06-20

 * 07:41 elukey: powercycle an-worker1093 - bug soft lock up CPU showed in mgmt console
 * 07:37 elukey: powercycle an-worker1091 - bug soft lock up CPU showed in mgmt console

2020-06-17

 * 19:59 milimetric: deployed quick fix for data stats job
 * 18:04 elukey: decommission matomo1001
 * 16:57 ottomata: produce searchsatisfaction events on group0 wikis via eventgate - T249261
 * 07:17 joal: Deleting mediawiki-history-text (avro) for 2020-01 and 2020-02 (we still have 2020-03 and 2020-04) - Expected free space: 160Tb
 * 06:40 elukey: reboot krb1001 for kernel upgrades
 * 06:24 elukey: reboot an-master100[1,2] for kernel upgrades
 * 06:03 elukey: reboot an-conf100[1-3] for kernel upgrades
 * 05:45 elukey: reboot stat1007/8 for kernel upgrades

2020-06-16

 * 19:58 ottomata: evolving event.SearchSatisfaction Hive table using /analytics/legacy/searchsatisfaction/latest schema
 * 19:41 ottomata: bumping Refine refinery jar version to 0.0.127 - T238230
 * 19:17 ottomata: deploying refinery source 0.0.127 for eventlogging -> eventgate migration - T249261
 * 16:02 elukey: reboot kafka-jumbo1008 for kernel upgrades
 * 15:33 milimetric: refinery deployed and synced to hdfs, with refinery-source at 0.0.126
 * 15:20 elukey: reboot kafka-jumbo1007 for kernel upgrades
 * 15:13 elukey: re-enabling timers on launcher after maintenance
 * 15:06 elukey: reboot an-coord1001 for kernel upgrades
 * 14:27 elukey: stop timers on an-launcher1001, prep before rebooting an-coord1001
 * 14:23 elukey: reboot druid100[7,8] for kernel upgrades
 * 11:51 elukey: re-run webrequest-druid-hourly-coord 16/06T10
 * 11:36 elukey: reboot an-druid100[1,2] for kernel upgrades

2020-06-15

 * 09:37 elukey: restart refinery-druid-drop-public-snapshots.service after change in vlan firewall rules (added druid100[7,8] to term druid)

2020-06-11

 * 15:01 mforns: started refinery deploy for v0.0.126
 * 14:58 mforns: deployed refinery-source v0.0.126
 * 13:57 ottomata: removed accidentally added page_restrictions column(s) on Hive table event.mediawiki_user_blocks_change after an incorrect schema change was merged (no data was ever set in this column)

2020-06-09

 * 07:32 elukey: upgrade ROCm to 3.3 on stat1005

2020-06-08

 * 15:42 elukey: remove access to notebook100[3,4] - T249752
 * 14:07 elukey: move matomo cron archiver to systemd timer archiver (with nagios alarming)
 * 14:02 elukey: re-enable timers on an-coord1001
 * 14:01 elukey: restart hive/oozie on an-coord1001 for openjdk upgrades
 * 13:42 elukey: roll restart kafka jumbo brokers for openjdk upgrades
 * 13:26 elukey: stop timers on an-launcher to drain jobs and restart hive/oozie for openjdk upgrades

2020-06-05

 * 17:56 elukey: roll restart presto server on an-presto* to pick up new openjdk upgrades
 * 16:45 elukey: upgrade turnilo to 1.24.0
 * 13:26 elukey: reimage druid1006 to debian buster
 * 09:26 elukey: roll restart cassandra on AQS to pick up openjdk upgrades

2020-06-04

 * 19:12 elukey: roll restart of aqs to pick up new druid settings
 * 18:39 mforns: deployed wikistats2 2.7.5
 * 13:33 elukey: re-enable netflow hive2druid jobs after https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/602356/
 * 10:56 elukey: depooled and reimage druid1004 to Debian Buster (Druid public cluster)
 * 07:31 elukey: stop netflow hive2druid timers to do some experiments
 * 06:13 elukey: kill application_1589903254658_75731 (druid indexation for netflow still running since 12h ago)
 * 05:36 elukey: restart druid middlemanager on druid1002 - strange protobuf warnings, netflow hive2druid indexation job stuck for hours
 * 05:13 elukey: reimage druid1003 to Buster

2020-06-03

 * 17:10 elukey: restart RU jobs after adding memory to an-launcher1001
 * 16:57 elukey: reboot an-launcher1001 to get new memory
 * 16:01 elukey: stop timers on an-launcher, prep for reboot
 * 09:35 elukey: re-run webrequest-druid-hourly-coord 03/06T7 (failed due to druid1002 moving to buster)
 * 08:50 elukey: reimage druid1002 to Buster

2020-06-01

 * 14:54 elukey: stop all timers on an-launcher1001, prep step for reboot
 * 12:54 elukey: /user/dedcode/.Trash/* -skipTrash
 * 06:53 elukey: re-run virtualpageview-hourly-wf-2020-5-31-19
 * 06:28 elukey: temporary stop of all RU jobs on an-launcher1001 to privilege camus and others
 * 06:03 elukey: kill all airflow-related processes on an-launcher1001 - host killing tasks due to OOM

2020-05-30

 * 08:15 elukey: manual reset-failed of monitor_refine_mediawiki_job_events_failure_flags

2020-05-29

 * 13:19 elukey: re-run druid webrequest hourly 29/05T11 (failed due to a host reimage in progress)
 * 12:19 elukey: reimage druid1001 to Debian Buster
 * 10:05 elukey: move el2druid config from druid1001 to an-druid1001

2020-05-28

 * 18:31 milimetric: after deployment, restarted four oozie jobs with new SLAs and fixed datasets definitions
 * 06:40 elukey: slowly restarting all RU units on an-launcher1001
 * 06:32 elukey: delete old RU pid files with timestamp May 27 19:00 (scap deployment failed to an-launcher due to disk issues) except ./jobs/reportupdater-queries/pingback/.reportupdater.pid that was working fine

2020-05-27

 * 19:53 joal: Start pageview-complete dump oozie job after deploy
 * 19:24 joal: Deploy refinery onto hdfs
 * 19:22 joal: restart failed services on an-launcher1001
 * 19:06 joal: Deploy refinery using scap to an-launcher1001 only
 * 18:41 joal: Deploying refinery with scap
 * 13:42 ottomata: increased Kafka topic retention in jumbo-eqiad to 31 days for (eqiad|codfw).mediawiki.revision-create - T253753
 * 07:09 joal: Rerun webrequest-druid-hourly-wf-2020-5-26-17
 * 07:04 elukey: matomo upgraded to 3.13.5 on matomo1001
 * 06:17 elukey: superset upgraded to 0.36
 * 05:52 elukey: attempt to upgrade Superset to 0.36 - downtime expected

2020-05-24

 * 10:04 elukey: re-run virtualpageview-hourly 23/05T15 - failed due to a sporadic kerberos/hive issue

2020-05-22

 * 09:11 elukey: superset upgrade attempt to 0.36 failed due to a db upgrade error (not seen in staging), rollback to 0.35.2
 * 08:15 elukey: superset down for maintenance
 * 07:09 elukey: add druid100[7,8] to the LVS druid-public-brokers service (serving AQS's traffic)

2020-05-21

 * 17:24 elukey: add druid100[7,8] to the druid public cluster (not serving load balancer traffic for the moment, only joining the cluster) - T252771
 * 16:44 elukey: roll restart druid historical nodes on druid100[4-6] (public cluster) to pick up new settings - T252771
 * 14:02 elukey: restart druid kafka supervisor for wmf_netflow after maintenance
 * 13:53 elukey: restart druid-historical on an-druid100[1,2] to pick up new settings
 * 13:17 elukey: kill wmf_netflow druid supervisor for maintenance
 * 13:13 elukey: stop druid-daemons on druid100[1-3] (one at a time) to move the druid partition from /srv/druid to /srv (didn't think about it before) - T252771
 * 09:16 elukey: move Druid Analytics SQL in Superset to druid://an-druid1001.eqiad.wmnet:8082/druid/v2/sql/
 * 09:05 elukey: move turnilo to an-druid1001 (beefier host)
 * 08:15 elukey: roll restart of all druid historicals in the analytics cluster to pick up new settings

2020-05-20

 * 13:55 milimetric: deployed refinery with refinery-source v0.0.125

2020-05-19

 * 15:28 elukey: restart hadoop master daemons on an-master100[1,2] for openjdk upgrades
 * 06:29 elukey: roll restart zookeeper on druid100[4-6] for openjdk upgrades
 * 06:18 elukey: roll restart zookeeper on druid100[1-3] for openjdk upgrades

2020-05-18

 * 14:02 elukey: roll restart of hadoop daemons on the prod cluster for openjdk upgrades
 * 13:30 elukey: roll restart hadoop daemons on the test cluster for openjdk upgrades
 * 10:33 elukey: add an-druid100[1,2] to the Druid Analytics cluster

2020-05-15

 * 13:23 elukey: roll restart of the Druid analytics cluster to pick up new openjdk + /srv completed
 * 13:15 elukey: turnilo back to druid1001
 * 13:03 elukey: move turnilo config to druid1002 to ease druid maintenance
 * 12:31 elukey: move superset config to druid1002 (was druid1003) to ease maintenance
 * 09:08 elukey: restart druid brokers on Analytics Public

2020-05-14

 * 18:41 ottomata: fixed TLS authentication for Kafka mirror maker on jumbo - T250250
 * 12:49 joal: Release 2020-04 mediawiki_history_reduced to public druid for AQS (elukey did it :-P)
 * 09:53 elukey: upgrade matomo to 3.13.3
 * 09:50 elukey: set matomo in maintenance mode as prep step for upgrade

2020-05-13

 * 21:36 elukey: powercycle analytics1055
 * 13:46 elukey: upgrade spark2 on all stat100x hosts - T250161
 * 06:47 elukey: upgrade spark2 on stat1004 - canary host - T250161

2020-05-11

 * 10:17 elukey: re-run webrequest-load-wf-text-2020-5-11-9
 * 06:06 elukey: restart wikimedia-discovery-golden on stat1007 - apparently killed by no memory left to allocate on the system
 * 05:14 elukey: force re-run of monitor_refine_event_failure_flags after fixing a refine failed hour

2020-05-10

 * 07:44 joal: Rerun webrequest-load-wf-upload-2020-5-10-1

2020-05-08

 * 21:06 ottomata: running preferred replica election for kafka-jumbo to get preferred leaders back after reboot of broker earlier today - T252203
 * 15:36 ottomata: starting kafka broker on kafka-jumbo1006, same issue on other brokers when they are leaders of offending partitions - T252203
 * 15:27 ottomata: stopping kafka broker on kafka-jumbo1006 to investigate camus import failures - T252203
 * 15:16 ottomata: restarted turnilo after applying nuria and mforns changes

2020-05-07

 * 17:39 ottomata: deploying fix to refinery bin/camus CamusPartitionChecker when using dynamic stream configs
 * 16:49 joal: Restart and babysit mediawiki-history-denormalize-wf-2020-04
 * 16:37 elukey: roll restart of all the nodemanagers on the hadoop cluster to pick up new jvm settings
 * 13:53 elukey: move stat1007 to role::statistics::explorer (adding jupyterhub)
 * 11:00 joal: Moving application_1583418280867_334532 to the nice queue
 * 10:58 joal: Rerun wikidata-articleplaceholder_metrics-wf-2020-5-6
 * 07:45 elukey: re-run mediawiki-history-denormalize
 * 07:43 elukey: kill application_1583418280867_333560 after a chat with David, the job is consuming ~2TB of RAM
 * 07:32 elukey: re-run mediawiki history load
 * 07:18 elukey: execute yarn application -movetoqueue application_1583418280867_332862 -queue root.nice
 * 07:06 elukey: restart mediawiki-history-load via hue
 * 06:41 elukey: restart oozie on an-coord1001
 * 05:46 elukey: re-run mediarequest-hourly-wf-2020-5-6-19
 * 05:35 elukey: re-run two failed hours for webrequest load text (07/05T05) and upload (06/05T23)
 * 05:33 elukey: restart hadoop yarn nodemanager on analytics1071

2020-05-06

 * 12:49 elukey: restart oozie on an-coord1001 to pick up the new shlib retention changes
 * 12:28 mforns: re-run pageview-druid-hourly-coord for 2020-05-06T06:00:00 after oozie shared lib update
 * 11:30 elukey: use /run/user as kerberos credential cache for stat1005
 * 09:25 elukey: re-run projectview coordinator for 2020-5-6-5 after oozie shared lib update
 * 09:24 elukey: re-run virtualpageview coordinator for 2020-5-6-5 after oozie shared lib update
 * 09:13 elukey: re-run apis coordinator for 2020-5-6-7 after oozie shared lib update
 * 09:11 elukey: re-run learning features actor coordinator for 2020-5-6-7 after oozie shared lib update
 * 09:10 elukey: re-run aqs-hourly coordinator for 2020-5-6-7 after oozie shared lib update
 * 09:09 elukey: re-run mediacounts coordinator for 2020-5-6-7 after oozie shared lib update
 * 09:08 elukey: re-run mediarequest coordinator for 2020-5-6-7 after oozie shared lib update
 * 09:08 elukey: re-run data quality coordinators for 2020-5-6-5/6 after oozie shared lib update
 * 09:05 elukey: re-run pageview-hourly coordinator 2020-5-6-6 after oozie shared lib update
 * 09:04 elukey: execute oozie admin -sharelibupdate on an-coord1001
 * 06:05 elukey: execute hdfs dfs -chown -R analytics-search:analytics-search-users /wmf/data/discovery/search_satisfaction/daily/year=2019

2020-05-05

 * 19:49 mforns: Finished re-deploying refinery using scap, then re-deploying onto hdfs
 * 18:47 mforns: Finished deploying refinery using scap, then deploying onto hdfs
 * 18:13 mforns: Deploying refinery using scap, then deploying onto hdfs
 * 18:02 mforns: Deployed refinery-source using the awesome new jenkins jobs :]
 * 13:15 joal: Dropping unavailable mediawiki-history-reduced datasources from superset

2020-05-04

 * 17:08 joal: Restart refinery-sqoop-mediawiki-private.service on an-launcher1001
 * 17:03 elukey: restart refinery-drop-webrequest-refined-partitions after manual chown
 * 17:03 joal: Restart refinery-sqoop-whole-mediawiki.service on an-launcher1001
 * 17:02 elukey: chown analytics (was: hdfs) /wmf/data/wmf/webrequest/webrequest_source=text/year=2019/month=12/day=14/hour={13,18}
 * 16:44 joal: Deploy refinery again using scap (trying to fix sqoop)
 * 15:39 joal: restart refinery-sqoop-whole-mediawiki.service
 * 15:37 joal: restart refinery-sqoop-mediawiki-private.service
 * 14:50 joal: Deploy refinery using scap to fix sqoop
 * 13:43 elukey: restart refinery-sqoop-whole-mediawiki to test failure exit codes
 * 06:50 elukey: upgrade druid-exporter on all druid nodes

2020-05-03

 * 19:36 joal: Rerun mobile_apps-session_metrics-wf-7-2020-4-26

2020-05-02

 * 10:54 joal: Rerun predictions-actor-hourly-wf-2020-5-2-0

2020-05-01

 * 16:59 elukey: test prometheus-druid-exporter 0.8 on druid1001 (deb packages not yet uploaded, just built and manually installed)

2020-04-30

 * 10:36 elukey: run superset init to add missing perms on an-tool1005 and analytics-tool1004 - T249681
 * 07:14 elukey: correct X-Forwarded-Proto for superset (http -> https) and restart it

2020-04-29

 * 18:55 joal: Kill-restart cassandra-daily-coord-local_group_default_T_pageviews_per_article_flat
 * 18:46 joal: Kill-restart pageview-hourly job
 * 18:45 joal: No restart needed for pageview-druid jobs
 * 18:36 joal: kill restart pageview-druid jobs (hourly, daily, monthly) to add new dimension
 * 18:29 joal: Kill-restart data-quality-stats-hourly bundle
 * 17:57 joal: Deploy refinery on HDFS
 * 17:45 elukey: roll restart Presto workers to pick up the new jvm settings (110G heap size)
 * 16:06 joal: Deploying refinery using scap
 * 15:57 joal: Deploying AQS using scap
 * 14:26 elukey: enable TLS consumer/producers for kafka main -> jumbo mirror maker - T250250
 * 13:48 joal: Releasing refinery 0.0.123 onto archiva with Jenkins
 * 08:47 elukey: roll restart zookeeper on an-conf* to pick up new openjdk11 updates (affects hadoop)

2020-04-27

 * 13:02 elukey: superset 0.36.0 deployed to an-tool1005

2020-04-26

 * 18:14 elukey: restart nodemanager on analytics1054 - failed due to heap pressure
 * 18:14 elukey: re-run webrequest-load-coord-text 26/04/2020T16 via Hue

2020-04-23

 * 13:57 elukey: launch again data quality stats bundle with https://gerrit.wikimedia.org/r/#/c/analytics/refinery/+/592008/ applied locally

2020-04-22

 * 06:46 elukey: kill dataquality hourly bundle again, traffic_by_country keeps failing
 * 06:11 elukey: start data quality bundle hourly with --user=analytics
 * 05:45 elukey: add a separate refinery scap target for the Hadoop test cluster and redeploy to check new settings

2020-04-21

 * 23:17 milimetric: restarted webrequest bundle, babysitting that first before going on
 * 23:00 milimetric: forgot a small jar version update, finished deploying now
 * 21:38 milimetric: deployed twice because analytics1030 failed with "OSError {}" but seems ok after the second deploy
 * 14:27 elukey: add motd to notebook100[3,4] to alert about host deprecation (in favor of stat100x)
 * 11:51 elukey: manually add SUCCESS flags under /wmf/data/wmf/banner_activity/daily/year=2020/month=1 and /wmf/data/wmf/banner_activity/daily/year=2019/month=12 to unblock druid banner monthly indexations

2020-04-20

 * 14:38 ottomata: restarting eventlogging-processor with updated python3-ua-parser for parsing KaiOS user agents
 * 10:28 elukey: drop /srv/log/mw-log/archive/api from stat1007 (freeing 1.3TB of space!)

2020-04-18

 * 21:40 elukey: force hdfs-balancer as an attempt to redistribute hdfs blocks more evenly to worker nodes (hoping to free the busiest ones)
 * 21:32 elukey: drop /user/analytics-privatedata/.Trash/* from hdfs to free some space (~100G used)
 * 21:25 elukey: drop /var/log/hadoop-yarn/apps/analytics-search/* from hdfs to free space (~8T replicated used)
 * 21:21 elukey: drop /user/{analytics|hdfs}/.Trash/* from hdfs to free space (~100T used)
 * 21:12 elukey: drop /var/log/hadoop-yarn/apps/analytics from hdfs to free space (15.1T replicated)

2020-04-17

 * 13:45 elukey: lock down /srv/log/mw-log/archive/ on stat1007 to analytics-privatedata-users access only
 * 10:26 elukey: re-created default venv for notebooks on notebook100[3,4] (forgot to git pull before re-creating it the last time)

2020-04-16

 * 05:34 elukey: restart hadoop-yarn-nodemanager on an-worker108[4,5] - failed after GC OOM events (heavy spark jobs)

2020-04-15

 * 14:03 elukey: update Superset Alpha role perms with what is stated in T249923#6058862
 * 09:35 elukey: restart jupyterhub too as follow up
 * 09:35 elukey: execute "create_virtualenv.sh ../venv" on stat1006, notebook1003, notebook1004 to apply new settings to Spark kernels (re-creating them)
 * 09:09 elukey: restart druid brokers on druid100[4-6] - stuck after datasource deletion

2020-04-11

 * 09:19 elukey: set hive-security: read-only for the Presto hive connector and roll restart the cluster

2020-04-10

 * 16:31 elukey: enable TLS from kafkatee to Kafka on analytics1030 (test instance)
 * 15:45 elukey: migrate data_purge timers from an-coord1001 to an-launcher1001
 * 09:11 elukey: move druid_load jobs from an-coord1001 to an-launcher1001
 * 08:08 elukey: move project_namespace_map from an-coord1001 to an-launcher1001
 * 07:38 elukey: move hdfs-cleaner from an-coord1001 to an-launcher1001

2020-04-09

 * 20:54 elukey: re-run webrequest upload/text hour 15:00 from Hue (stuck due to missing _IMPORTED flag, caused by an-launcher1001 migration. Andrew fixed it re-running manually the Camus checker)
 * 16:00 elukey: move camus timers from an-coord1001 to an-launcher1001
 * 15:20 elukey: absent spark refine timers on an-coord1001 and move them to an-launcher1001

2020-04-07

 * 09:17 elukey: enable refine for TwoColConflictExit (EL schema)

2020-04-06

 * 13:23 elukey: upgraded stat1008 to AMD ROCm 3.3 (enables tensorflow 2.x)
 * 12:33 joal: Bump AQS druid backend to 2020-03
 * 11:50 elukey: deploy new druid datasource in Druid public
 * 06:29 elukey: allow all analytics-privatedata-users to use the GPUs on stat1005/8

2020-04-04

 * 06:52 elukey: restart refinery-import-page-history-dumps

2020-04-03

 * 09:57 elukey: remove TwoColConflictExit from eventlogging's refine blacklist

2020-04-02

 * 19:31 joal: restart pageview-hourly job after manual patch
 * 19:29 joal: Manually patching last deploy to fix virtualpageview job - code merged
 * 17:48 joal: Kill/restart virtualpageview-hourly-coord after deploy
 * 16:55 joal: Deploy refinery onto HDFS
 * 16:30 joal: Deploy refinery using scap
 * 16:12 elukey: re-enable timers on an-coord1001 after maintenance
 * 15:52 elukey: restart hive server2/metastore with G1 settings
 * 14:05 elukey: temporary stop timers on an-coord1001 to facilitate hive daemons restarts
 * 13:47 hashar: test 1 2 3
 * 13:30 joal: Releasing refinery-source v0.0.121 using new jenkins-docker :)
 * 08:23 elukey: kill/restart netflow realtime druid indexation with a new dimension (peer_ip_src) - T246186

2020-04-01

 * 21:19 joal: restart pageview-hourly-wf-2020-4-1-15
 * 18:24 joal: Kill learning-features-actor-hourly as new version to come
 * 18:23 joal: Restart unique_devices-per_project_family-monthly-wf-2020-3 and aqs-hourly-wf-2020-4-1-15 after hive failure
 * 18:21 joal: restart webrequest-load-wf-upload-2020-4-1-16 and webrequest-load-wf-text-2020-4-1-16 after hive failure
 * 18:14 joal: Kill groceryheist job taking half the cluster
 * 18:06 ottomata: restarted hive-server2
 * 10:07 jbond42: updating icu packages

2020-03-31

 * 12:57 jbond42: updating icu on presto-analytics-canary and hadoop-worker-canary

2020-03-30

 * 07:27 elukey: run /usr/local/bin/refine_sanitize_eventlogging_analytics_immediate --ignore_failure_flag=true --since=72 --verbose --table_whitelist_regex="ResourceTiming" refine_sanitize_eventlogging_analytics_immediate to fix _REFINE_FAILED events
 * 07:16 elukey: run eventlogging refine manually for schemas "EditorActivation|EditorJourney|HomepageVisit|VisualEditorFeatureUse|WikibaseTermboxInteraction|UploadWizardErrorFlowEvent|MobileWikiAppiOSReadingLists|ContentTranslationCTA|QuickSurveysResponses|MobileWikiAppiOSSessions" to fix _REFINE_FAILED events

2020-03-29

 * 08:44 elukey: blacklist TwoColConflictExit from Eventlogging Refine to avoid alarm spam

2020-03-28

 * 16:54 elukey: restart yarn nodemanager on analytics1071 - network errors in the logs

2020-03-27

 * 08:09 elukey: deployed new kernels for https://gerrit.wikimedia.org/r/580083 on stat1004

2020-03-26

 * 09:09 elukey: re-running manually webrequest-load upload 26/03/2020T08 - kerberos failures

2020-03-25

 * 08:14 elukey: restart presto-server on an-coord1001 to remove jmx catalog config

2020-03-24

 * 15:46 elukey: restart all cron.service processes on stat/notebook (killing long lingering processes) to move the unit under user.slice

2020-03-21

 * 14:17 joal: Restart wikidata_item_page_link job with manual fix - review to be confirmed
 * 14:06 joal: Kill buggy wikidata_item_page_link job

2020-03-18

 * 19:39 fdans: refinery deployed
 * 18:52 fdans: deploying refinery
 * 18:51 fdans: refinery source 0.0.119 jars generated and symlinked
 * 18:17 fdans: beginning deploy of refinery-source 0.0.119

2020-03-17

 * 17:25 elukey: deploy superset to enable Presto and Kerberos (Pyhive 0.6.2.)

2020-03-16

 * 19:43 joal: Kill-restart wikidata-articleplaceholder_metrics-coord to fix yarn queue
 * 18:30 mforns: Deployed refinery using scap, then deployed onto hdfs
 * 17:05 elukey: roll restart of hadoop namenodes to get the new GC setting (MaxGCPauseMillis 400 -> 1000)

2020-03-13

 * 12:18 joal: Restart cassandra-daily-wf-local_group_default_T_pageviews_per_article_flat-2020-3-12

2020-03-12

 * 22:53 mforns: Deployed refinery using scap, then deployed onto hdfs
 * 22:22 mforns: deployed refinery-source using jenkins
 * 11:09 elukey: roll restart kerberos kdcs to pick up new ticket lifetime settings (10h -> 48h)
 * 08:27 elukey: re-running refine eventlogging with --since 12 (very conservative but just in case)

2020-03-11

 * 14:49 elukey: add xmldumps mountpoints on stat1004 and stat1005

2020-03-10

 * 15:20 elukey: remove the analytics user keytab from stat100[4,5]
 * 15:06 elukey: move stat1006 to role::statistics::explorer
 * 09:24 elukey: removed /etc/mysql/conf.d/stats-research-client.cnf from all stat boxes (all file used for RU, now on an-launcher1001)

2020-03-09

 * 07:27 elukey: deploy jupyterhub on notebook100[3,4] (manual venv re-creation) to allow the use of the user.slice - T247055
 * 07:26 elukey: upgrade nodejs from 6->10 on stat1* and notebook1*

2020-03-08

 * 17:58 elukey: restart hadoop-yarn-nodemanager on an-worker1087

2020-03-06

 * 14:58 joal: AQS new druid snapshot released (2020-02)
 * 10:06 elukey: roll restart Presto daemons for openjdk upgrades
 * 09:45 elukey: roll restart of cassandra on AQS to pick up new openjdk upgrades

2020-03-05

 * 19:45 elukey: deleted dangling 'reports' symlink on stat100[6,7] in /srv/published
 * 19:39 elukey: mv /srv/reportupdater to /srv/reportupdater-backup05032020 on stat100[6,7]
 * 16:34 mforns: restart turnilo to refresh deleted datasources
 * 14:16 elukey: restart hdfs/yarn master daemons to pick up new core-site changes for Superset
 * 06:48 elukey: restart yarn on analytics1074 (GC overhead, traces of network errors with datanodes)

2020-03-04

 * 08:41 joal: Kill-restart mediawiki-history-reduced-coord
 * 08:38 joal: Kill-restart mediawiki-history-dumps-coord

2020-03-03

 * 21:19 joal: Kill-restart actor jobs
 * 21:17 joal: kill-restart mediawiki-history-check_denormalize-coord
 * 21:16 joal: Kill-restart mediawiki-history job
 * 21:10 joal: Kill Wikidataplaceholder failing coord
 * 21:08 joal: Kill restart wikidata-specialentitydata_metrics-coord
 * 21:07 joal: Start Wikidataplaceholder job
 * 21:06 joal: Kill/restart edit_hourly job
 * 21:04 joal: Start wikidata_item_page_link coordinator
 * 20:46 joal: Deploy refinery onto HDFS
 * 20:34 joal: Deploy refinery using scap
 * 20:28 joal: Add new jars to refinery using Jenkins
 * 20:01 joal: Release refinery-source v0.0.117 with Jenkins
 * 16:37 mforns: restarted turnilo to refresh deleted test datasource
 * 11:56 joal: Kill actor-hourly oozie test jobs (precision of previous message)
 * 11:55 joal: Kill actor-hourly tests
 * 10:50 elukey: restarted kafka jumbo (kafka + mirror maker) for openjdk upgrades
 * 09:22 joal: Rerunning failed mediawiki-history jobs for 2020-02 after mediawiki-history-denormalize issue
 * 09:16 joal: Manually restarting mediawiki-history-denormalize with new patch to try
 * 08:36 elukey: roll restart kafka-jumbo for openjdk upgrades
 * 08:34 elukey: re-enable timers on an-coord1001 after maintenance
 * 08:30 joal: Correct previous message: Kill mediawiki-history (not mediawiki-history-reduced) as it is failing
 * 08:30 joal: Kill mediawiki-history-reduced as it is failing
 * 08:22 elukey: hive metastore/server2 now running without zookeeper settings and without DBTokenStore (in memory one used instead, the default)
 * 08:19 elukey: restart oozie/hive daemons on an-coord1001 for openjdk upgrades
 * 06:41 elukey: roll restart druid daemons for openjdk upgrades
 * 06:39 elukey: stop timers on an-coord1001 to facilitate daemon restarts (hive/oozie)

2020-03-02

 * 19:58 joal: Remove faulty _REFINED file at /wmf/data/event/mediawiki_revision_score/datacenter=eqiad/year=2020/month=3/day=2/hour=10/_REFINED
 * 15:38 elukey: apply new settings to all stat/notebooks
 * 15:31 elukey: setting new user.slice global memory/cpu settings on notebook1003
 * 15:25 elukey: setting new user-slice global memory/cpu settings on stat1007

2020-02-28

 * 19:10 milimetric: deployed 0.0.116 and restarted webrequest load bundle at 2020-02-28T14
 * 14:49 joal: Drop test keyspaces in cassandra cluster

2020-02-27

 * 21:16 milimetric: tried to deploy AQS but it failed with the same integration test on mediarequests, sending email

2020-02-26

 * 15:06 ottomata: dropped and re-added backfilled partitions on event.CentralNoticeImpression table to propagate schema alter on main table - T244771
 * 09:50 joal: Force delete old api/cirrus events from HDFS trash to free some space

2020-02-24

 * 18:20 elukey: move report updater jobs from stat1007 to an-launcher1001

2020-02-22

 * 14:21 elukey: restart hadoop-yarn-nodemanager on analytics1044 - broken disk, apply hiera overrides to exclude it
 * 14:11 elukey: restart hadoop-yarn-nodemanager on analytics1073 - process died, logs saved in /home/elukey

2020-02-21

 * 16:04 ottomata: altered event.CentralNoticeImpression table column event.campaignStatuses to type string, will backfill data - T244771
 * 11:49 elukey: restart varnishkafka on various cp30xx nodes
 * 11:41 elukey: restart varnishkafka on cp3057 (stuck in timeouts to kafka, analytics alarms raised)
 * 08:19 fdans: deploying refinery
 * 00:11 joal: Rerun failed wikidata-json_entity-weekly-coord instances after having created the missing hive table

2020-02-20

 * 16:57 fdans: refinery source jars updated
 * 16:39 fdans: deploying refinery source 0.0.114
 * 15:16 fdans: deploying AQS

2020-02-19

 * 16:58 ottomata: Deployed refinery using scap, then deployed onto hdfs

2020-02-17

 * 18:29 elukey: reboot turnilo and superset's hosts for kernel upgrades
 * 18:25 elukey: restart kafka on kafka-jumbo1001 to pick up new openjdk updates
 * 18:22 elukey: restart cassandra on aqs1004 to pick up new openjdk updates
 * 17:59 elukey: restart druid daemons on druid1003 to pick up new openjdk updates
 * 17:58 elukey: restart cassandra on aqs1004 to pick up new openjdk updates
 * 17:56 elukey: restart hadoop daemons on analytics1042 to pick up new openjdk updates

2020-02-15

 * 12:07 elukey: re-run failed pageview druid hour
 * 12:05 elukey: re-run failed virtualpageview hours

2020-02-12

 * 14:33 elukey: restart hue on analytics-tool1001
 * 13:36 joal: Kill-restart webrequest bundle to see if it mitigates the error

2020-02-10

 * 15:26 elukey: kill application_1576512674871_246621 (consuming too much memory)
 * 14:31 elukey: kill application_1576512674871_246419 (eating a ton of ram on the cluster)

2020-02-08

 * 09:35 elukey: created /wmf/data/raw/wikidata/dumps/all_ttl on hdfs
 * 09:35 elukey: created /wmf/data/raw/wikidata/dumps/all_json on hdfs

2020-02-05

 * 21:14 joal: Kill data_quality_stats-hourly-bundle and data_quality_stats-daily-bundle
 * 21:11 joal: Kill-restart mediawiki-history-dumps-coord, drop existing data, and restart at 2019-11
 * 21:06 joal: Kill-restart mediawiki-wikitext-history-coord and mediawiki-wikitext-current-coord
 * 20:51 joal: Deploy refinery using scap
 * 20:29 joal: Refinery-source released in archiva by jenkins
 * 20:20 joal: Deploy hdfs-tools 0.0.5 using scap

2020-02-03

 * 11:20 elukey: restart oozie on an-coord1001
 * 10:11 elukey: enable all timers on an-coord1001 after spark encryption/auth settings
 * 09:32 elukey: roll restart yarn node managers again to pick up spark encryption/authentication settings
 * 08:34 elukey: stop timers on an-coord1001 to drain the cluster and ease the deploy of spark encryption settings
 * 07:58 elukey: roll restart hadoop yarn node managers to pick up new libcrypto.so link (shouldn't be necessary but just in case)
 * 07:24 elukey: create /usr/lib/x86_64-linux-gnu/libcrypto.so on all the analytics nodes via puppet

2020-01-27

 * 05:38 elukey: re-run webrequest text 2020-01-26T20/21 with higher dataloss thresholds (false positives)
 * 02:49 elukey: re-run refine eventlogging manually to clear out refine failed events

2020-01-26

 * 17:58 elukey: re-run failed refine job for MobileWebUIActionsTracking 2020-01-26T12
 * 17:32 elukey: restart varnishkafka on cp3056/cp3064 due to network issues on the hosts

2020-01-23

 * 17:48 milimetric: launching a sqoop for imagelinks (will be slow because tuning sess)

2020-01-20

 * 12:19 elukey: restart zookeeper on an-conf100X to pick up openjdk-11 updates

2020-01-18

 * 10:06 elukey: re-run all failed entropy jobs via Hue (StopWatch issue)

2020-01-16

 * 20:52 mforns: deployed refinery accompanying source v0.0.112
 * 17:00 mforns: deployed refinery-source v0.0.112
 * 15:17 elukey: upgrade superset to 0.35.2
 * 15:14 elukey: stop superset as prep step for upgrade

2020-01-15

 * 10:44 elukey: remove flume-ng and spark-python/core packages from an-coord1001,analytics1030,analytics-tool1001,analytics1039 - T242754
 * 10:39 elukey: remove flume-ng from all stat/notebooks - T242754
 * 10:37 elukey: remove spark-core flume-ng from all the hadoop workers - T242754
 * 08:44 elukey: move aqs to the new rsyslog-logstash pipeline

2020-01-14

 * 20:12 milimetric: deployed aqs with new service-runner version 2.7.3

2020-01-13

 * 21:45 milimetric: webrequest restarted
 * 21:32 milimetric: killing webrequest bundle for restart
 * 15:00 joal: Deploy hdfs-tools 0.0.3 using scap
 * 14:24 joal: Releasing hdfs-tools 0.0.3 to archiva
 * 12:54 elukey: restart hue to re-apply user hive limits (again)

2020-01-10

 * 14:30 elukey: restart oozie with new settings to instruct it to pick up spark-defaults.conf settings from /etc/spark2/conf
 * 07:38 elukey: re-run virtualpageviews-druid-daily 09/01/2020 via Hue
 * 07:37 elukey: systemctl restart drop-el-unsanitized-events on an-coord1001

2020-01-09

 * 11:17 moritzm: installing cyrus-sasl security updates
 * 11:10 elukey: remove old accounts (user: absent) from Superset
 * 10:30 elukey: revert hue's hive query limit and restart hue - T242306
 * 07:45 elukey: re-run failed data-quality-stats-event.navigationtiming-useragent_entropy-hourly-coord 2020/01/09T00
 * 07:33 elukey: kill test_elukey_webrequest_sampled_128 from druid
 * 07:30 elukey: restart turnilo after updating the webrequest_sampled_128's config

2020-01-08

 * 20:44 joal: Restart webrequest-load-bundle to update queue to production
 * 20:17 joal: rerun edit-hourly-wf-2019-12 after having updated the underlying table
 * 20:06 joal: Prepare and start learning-features-actor-hourly-coord
 * 19:56 joal: kill wikidata-articleplaceholder_metrics-coord as it is buggy
 * 19:56 joal: Kill-restart edit-hourly-coord and edit-hourly-druid-coord
 * 19:48 joal: Kill-restart wikidata-articleplaceholder_metrics-coord
 * 19:44 joal: Kill-restart mediawiki-history-load-coord, mediawiki-history-denormalize-coord, mediawiki-history-check_denormalize-coord, mediawiki-history-metrics-coord, mediawiki-history-reduced-coord, mediawiki-history-dumps-coord
 * 19:42 joal: Kill-restart mediawiki-history-load-coord,
 * 19:29 joal: Kill-restart webrequest-druid-daily-coord and webrequest-druid-hourly-coord after deploy
 * 19:16 joal: Deploy refinery on HDFS
 * 19:04 joal: Deploy refinery using scap
 * 18:30 joal: Releasing refinery-0.0.110 to archiva using Jenkins
 * 18:11 joal: AQS deployed with new druid datasource (2019-12)
 * 17:52 joal: Rerun webrequest-load-wf-text-2020-1-8-15 with updated thresholds after frontend issue

2020-01-07

 * 17:54 elukey: apt-get remove python3.5 on stat1005
 * 15:16 elukey: re-enable timers on an-coord1001 after hive restart
 * 15:03 elukey: restart hive (server+metastore) on an-coord1001 to apply delegation token settings
 * 14:36 elukey: stop timers on an-coord1001 as prep step to restart hive
 * 14:05 elukey: apply max cpu cores usage (via systemd cgroups) on stat/notebook
 * 07:59 elukey: restart hue (again) with correct principal settings
 * 07:42 elukey: restart Hue after applying a new kerberos setting (hue_principal, was not specified before)

2020-01-06

 * 16:45 joal: Manually sqoop missing tables (content,content_models,slot_roles,slots,wbc_entity_usage)

2020-01-02

 * 18:32 elukey: restart hue with new hive query limits