Analytics/Server Admin Log

2022-03-17

 * 17:10 ottomata: restart webrequest and pageview_actor data purge - https://gerrit.wikimedia.org/r/c/operations/puppet/+/771389
 * 14:07 btullis: shutdown analytics1063 and analytics1067 with 120 minutes of downtime T303151
 * 06:46 elukey: kill remaining hanging processes for ppche*lko and accra*ze on an-test-client1001 to allow user offboarding (puppet broken)

2022-03-16

 * 19:14 ottomata: deploying refinery to hadoop-test cluster with new gobblin-wmf-core jar
 * 18:00 razzi: sudo cookbook sre.hosts.downtime -D 3 -r 'Setting up karapace for the first time' karapace1001.eqiad.wmnet
 * 17:57 btullis: restarted mediawiki-history-drop-snapshot service on an-launcher1002
 * 16:03 aqu: analytics/refinery - scap deploy "Migrate session_length/daily from Oozie to Airflow"
 * 10:26 btullis: rerunning failed mediawiki_structured_task_article_link_suggestion_interaction refine job

2022-03-15

 * 22:16 razzi: upload karapace_2.1.3-py3.7-1_amd64.deb to apt.wikimedia.org
 * 19:58 razzi: upload karapace_2.1.3-py3.7-0_amd64.deb to apt.wikimedia.org
 * 17:24 ottomata: also change stats uid and gid to 918 on an-web1001 - T291384
 * 14:35 ottomata: change stats uid and gid on all stat boxes to 918 - T291384
 * 13:59 ottomata: roll restarting kafka jumbo brokers to set max.incremental.fetch.session.cache.slots=2000 - T303324

2022-03-14

 * 21:05 razzi: `sudo kill -9 15674` to stop unresponsive hive query

2022-03-09

 * 21:05 ottomata: fix group ownership of cchen.db/new_editors/cohort=2021-12 after reverting T291664 - sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chgrp -R analytics-privatedata-users /user/hive/warehouse/cchen.db/new_editors/cohort=2021-12
 * 18:33 ottomata: fix group ownership of wmf_product.db/new_editors/cohort=2021-12 after reverting T291664 - sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chgrp -R analytics-privatedata-users /user/hive/warehouse/wmf_product.db/new_editors/cohort=2021-12
 * 18:32 ottomata: fix group ownership of wmf_product.db/global_markets_pageviews/year=2022/month=2 after reverting T291664 - sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chgrp -R analytics-privatedata-users /user/hive/warehouse/wmf_product.db/global_markets_pageviews/year=2022/month=2
 * 18:19 btullis: btullis@ganeti1024:~$ sudo gnt-instance start karapace1001.eqiad.wmnet (T301562)
 * 16:16 ottomata: fix group ownership of wmf_product.db/pageviews_corrected/year=2022/month=2 after reverting T291664 - sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chgrp -R analytics-privatedata-users /user/hive/warehouse/wmf_product.db/pageviews_corrected/year=2022/month=2

2022-03-08

 * 13:31 ottomata: restarted webrequest-load oozie bundle as 0073173-220113112502223-oozie-oozi-B starting at 2022-03-08T12:00Z
 * 13:09 ottomata: killing and rerunning webrequest-load-text-wf for webrequest_source=text/year=2022/month=3/day=7/hour=17, it was stuck in add_partition task as SUSPENDED, not sure why.
 * 12:47 btullis: roll-restarting druid-analytics T300626
 * 12:08 btullis: roll-restarting druid-public. T300626
 * 11:21 btullis: roll-restarting druid-test T300626
 * 11:00 btullis: roll-restarting aqs T300626
 * 10:57 btullis: restarted archiva T300626

2022-03-07

 * 19:14 ottomata: sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chgrp -R analytics-privatedata-users /wmf/data/wmf/*/hourly/year=2022/month=3/day=7 to make sure perms are fixed after revert of T291664
 * 19:13 ottomata: sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chgrp -R analytics-privatedata-users /wmf/data/wmf/virtualpageview/hourly/year=2022/month=3/day=7 - revert of T291664
 * 18:45 ottomata: sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chgrp -R analytics-privatedata-users /wmf/data/wmf/mediacounts/year=2022/month=3/day=7
 * 18:37 ottomata: sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chgrp -R analytics-privatedata-users /wmf/data/wmf/webrequest/webrequest_source=text/year=2022/month=3/day=7 - after reverting - T291664
 * 18:34 ottomata: restarting hive-server2 on an-coord1001 to revert hive.warehouse.subdir.inherit.perms change - T291664
 * 14:44 btullis: failing back hive services to an-coord1001
 * 13:09 aqu_: About to deploy analytics/refinery - Migrate wikidata/item_page_link/weekly from Oozie to Airflow
 * 12:45 aqu_: About to deploy airflow-dags/analytics - Migrates wikidata/item_page_link
 * 12:10 btullis: restarted hive-server2 process on an-coord1001
 * 11:52 btullis: obtaining heap dump: `hive@an-coord1001:/srv/hive-tmp$ jmap -dump:format=b,file=hive_server2_heap_T303168.bin 16971`
 * 11:51 btullis: obtaining summary of heap objects and sizes: `hive@an-coord1001:/srv/hive-tmp$ jmap -histo:live 16971 > hive-object-storage-and-sizes.T303168.txt`
 * 11:38 btullis: failing over hive to an-coord1001 T303168

2022-03-05

 * 10:03 elukey: restart hadoop-yarn-nodemanager on an-worker1132 (unhealthy node, reason Linux Container Executor reached unrecoverable exception)

2022-03-04

 * 17:46 mforns: deployed Airflow to analytics instance to fix skein logs problem
 * 15:50 mforns: deployed airflow in an-test-client1001 to test skein log fix
 * 05:19 milimetric: rerunning monthly edit hourly druid oozie coordinator

2022-03-03

 * 17:48 ottomata: roll restart aqs to pick up new MW history snapshot

2022-03-01

 * 18:38 SandraEbele: sandra testing
 * 18:34 razzi: demo irc logging to data eng team members
 * 10:19 btullis: btullis@an-coord1002:/srv$ sudo rm -rf an-coord1001-backup/ (#T302777)
 * 09:48 elukey: elukey@stat1004:~$ sudo kill `pgrep -u zpapierski` (offboarded user, puppet broken on the host)

2022-02-28

 * 16:00 milimetric: refinery done deploying and syncing, new sqoop list is up
 * 15:01 milimetric: deploying new wikis to sqoop list ahead of sqoop job starting in a few hours

2022-02-25

 * 17:00 milimetric: rerunning webrequest-load-wf-text-2022-2-25-15 after confirming all reported loss was false positive

2022-02-23

 * 23:00 razzi: sudo maintain-views --table flaggedrevs --databases fiwiki on clouddb1014.eqiad.wmnet and clouddb1018.eqiad.wmnet for T302233

2022-02-22

 * 10:37 btullis: re-enabled puppet on an-launcher1002, having absented the network_internal druid load job
 * 09:30 aqu: Deploying analytics/refinery on hadoop-test only.
 * 07:38 elukey: systemctl reset-failed mediawiki-history-drop-snapshot on an-launcher1002 (opened since a week ago)
 * 07:30 elukey: kill remaining processes of rhuang-ctr on stat1004 and an-test-client1001 (user offboarded, but still holding jupyter notebooks etc..). Puppet was broken trying to remove the user.

2022-02-21

 * 17:55 elukey: kill remaining processes of rhuang-ctr on various stat nodes (user offboarded, but still holding jupyter notebooks etc..). Puppet was broken trying to remove the user.
 * 16:58 mforns: Deployed refinery using scap, then deployed onto hdfs (aqs hourly airflow queries)

2022-02-19

 * 12:21 elukey: stop puppet on an-launcher1002, stop timers for eventlogging_to_druid_network_flows_internal_{hourly,daily} since no data is coming to the Kafka topic (expected due to some work for the Marseille DC) and it keeps alarming

2022-02-17

 * 16:18 mforns: deployed wikistats2

2022-02-16

 * 14:13 mforns: deployed airflow-dags to analytics instance

2022-02-15

 * 17:20 ottomata: split anaconda-wmf into 2 packages: anaconda-wmf-base and anaconda-wmf. anaconda-wmf-base is installed on workers, anaconda-wmf on clients. The size of the package on workers is now much smaller. Installing throughout the cluster. relevant: T292699

2022-02-14

 * 17:38 razzi: razzi@an-test-client1001:~$ sudo systemctl reset-failed airflow-scheduler@analytics-test.service
 * 16:08 razzi: sudo cookbook sre.ganeti.makevm --vcpus 4 --memory 8 --disk 50 eqiad_B datahubsearch1002 for T301383

2022-02-12

 * 08:50 elukey: truncate /var/log/auth.log to 1g on krb1001 to free space on root partition (original log saved under /srv)

2022-02-11

 * 15:06 ottomata: set hive.warehouse.subdir.inherit.perms = false - T291664

2022-02-10

 * 18:54 ottomata: setting up research airflow-dags scap deployment, recreating airflow database and starting from scratch (fab okayed this) - T295380
 * 16:48 ottomata: deploying airflow analytics with lots of recent changes to airflow-dags repository

2022-02-09

 * 17:41 joal: Deploy refinery onto HDFS
 * 17:05 joal: Deploying refinery with scap
 * 16:39 joal: Release refinery-source v0.1.25 to archiva

2022-02-08

 * 07:27 elukey: restart hadoop-yarn-nodemanager on an-worker1115 (container executor reached unrecoverable exception, doesn't talk with the Yarn RM anymore)

2022-02-07

 * 18:43 ottomata: manually installing airflow_2.1.4-py3.7-2_amd64.deb on an-test-client1001
 * 14:38 ottomata: merged Set spark maxPartitionBytes to hadoop dfs block size - T300299
 * 12:17 btullis: depooled aqs1009
 * 11:59 btullis: depooled aqs1008
 * 11:41 btullis: depooled aqs1007
 * 11:03 btullis: depooled aqs1006
 * 10:22 btullis: depooling aqs1005

2022-02-04

 * 16:05 elukey: unmask prometheus-mysqld-exporter.service and clean up the old @analytics + wmf_auto_restart units (service+timer) not used anymore on an-coord100[12]
 * 12:55 joal: Rerun cassandra-daily-wf-local_group_default_T_pageviews_per_article_flat-2022-2-3
 * 07:12 elukey: `GRANT PROCESS, REPLICATION CLIENT ON *.* TO `prometheus`@`localhost` IDENTIFIED VIA unix_socket WITH MAX_USER_CONNECTIONS 5` on an-test-coord1001 to allow the prometheus exporter to gather metrics
 * 07:09 elukey: cleanup wmf_auto_restart_prometheus-mysqld-exporter@analytics-meta on an-test-coord1001 and unmasked wmf_auto_restart_prometheus-mysqld-exporter (now used)
 * 07:03 elukey: clean up wmf_auto_restart_prometheus-mysqld-exporter@matomo on matomo1002 (not used anymore, listed as failed)

2022-02-03

 * 19:35 joal: Rerun virtualpageview-druid-monthly-wf-2022-1
 * 19:32 btullis: re-running the failed refine_event job as per email.
 * 19:27 joal: Rerun virtualpageview-druid-daily-wf-2022-1-16
 * 19:12 joal: Kill druid indexation stuck task on Druid (from 2022-01-17T02:31)
 * 19:09 joal: Kill druid-loading stuck yarn applications (3 HiveToDruid, 2 oozie launchers)
 * 10:04 btullis: pooling the remaining aqs_next nodes.
 * 07:01 elukey: kill leftover processes of decommed user on an-test-client1001

2022-02-01

 * 20:05 btullis: btullis@an-launcher1002:~$ sudo systemctl restart refinery-sqoop-whole-mediawiki.service
 * 19:01 joal: Deploying refinery with scap
 * 18:36 joal: Rerun virtualpageview-druid-daily-wf-2022-1-16
 * 18:34 joal: rerun webrequest-druid-hourly-wf-2022-2-1-12
 * 17:43 btullis: btullis@an-launcher1002:~$ sudo systemctl start refinery-sqoop-whole-mediawiki.service
 * 17:29 btullis: about to deploy analytics/refinery
 * 12:28 elukey: kill processes related to offboarded user on stat1006 to unblock puppet
 * 11:09 btullis: btullis@an-test-coord1001:~$ sudo apt-get -f install

2022-01-31

 * 14:51 btullis: btullis@an-launcher1002:~$ sudo systemctl start mediawiki-history-drop-snapshot.service
 * 14:03 btullis: btullis@an-launcher1002:~$ sudo systemctl start mediawiki-history-drop-snapshot.service

2022-01-27

 * 08:15 joal: Rerun failed cassandra-daily-wf-local_group_default_T_pageviews_per_article_flat-2022-1-26

2022-01-26

 * 15:54 joal: Add new CH-UA fields to wmf_raw.webrequest and wmf.webrequest
 * 15:44 joal: Kill-restart webrequest oozie job after deploy
 * 15:40 joal: Kill-restart edit-hourly oozie job after deploy
 * 15:27 joal: Deploy refinery to HDFS
 * 15:10 elukey: elukey@cp4036:~$ sudo systemctl restart varnishkafka-eventlogging
 * 15:10 elukey: elukey@cp4036:~$ sudo systemctl restart varnishkafka-statsv
 * 15:06 elukey: elukey@cp4035:~$ sudo systemctl restart varnishkafka-eventlogging.service - metrics showing messages stuck for a poll
 * 14:56 elukey: elukey@cp4035:~$ sudo systemctl restart varnishkafka-webrequest.service - metrics showing messages stuck for a poll
 * 14:52 joal: Deploy refinery with scap
 * 10:07 btullis: btullis@cumin1001:~$ sudo cumin 'O:cache::upload or O:cache::text' 'disable-puppet btullis-T296064-T299401'

2022-01-25

 * 19:46 ottomata: removing hdfs druid deep storage from test cluster
 * 19:37 ottomata: resetting test cluster druid via druid reset-cluster https://druid.apache.org/docs/latest/operations/reset-cluster.html - T299930
 * 14:30 ottomata: stopping services on an-test-coord1001 - T299930
 * 14:29 ottomata: stopping druid* on an-test-druid1001 - T299930
 * 11:30 btullis: pooled aqs1011 T298516
 * 11:29 btullis: btullis@puppetmaster1001:~$ sudo -i confctl select name=aqs1011.eqiad.wmnet set/pooled=yes

2022-01-24

 * 21:18 btullis: btullis@deploy1002:/srv/deployment/analytics/refinery$ scap deploy -e hadoop-test -l an-test-coord1001.eqiad.wmnet
 * 20:35 btullis: rebooting an-test-coord1001 after recreating the /srv/file system.
 * 20:28 btullis: root@an-test-coord1001:~# mke2fs -t ext4 -j -m 0.5 /dev/vg0/srv
 * 19:53 btullis: power cycled an-test-coord1001 from racadm
 * 19:50 btullis: rebooting an-test-coord1001
 * 19:19 ottomata: kill mysqld on an-test-coord1001 - 19:19:04 [@an-test-coord1001:/etc] $ sudo kill 42433
 * 19:02 razzi: razzi@an-test-coord1001:~$ sudo systemctl stop presto-server
 * 18:23 razzi: downtime an-test-coord1001 while attempting to fix /srv partition
 * 11:48 elukey: roll restart of kafka test brokers to pick up the new keystore/tls-certs (1y of validity)

2022-01-22

 * 08:36 elukey: `apt-get clean` on an-test-coord1001 to free some space

2022-01-21

 * 01:03 milimetric: rerunning the eventlogging_to_druid_network_flows_internal-sanitization_daily timer that failed to get logs

2022-01-20

 * 11:58 btullis: re-enabled puppet on all hive nodes, deploying the updated log4j configuration for parquet
 * 11:36 btullis: temporarily disabling puppet on servers with hive installed T297734
 * 07:49 joal: Rerun failed webrequest jobs (text and upload, 2022-01-19T19:00)

2022-01-19

 * 15:44 ottomata: installing anaconda-wmf_2020.02~wmf6_amd64.deb on all analytics cluster nodes - T292699
 * 14:00 ottomata: installing anaconda-wmf_2020.02~wmf6_amd64.deb on stat1004 - T292699

2022-01-17

 * 07:19 elukey: launch webrequest bundle from 2022-01-16T01:00 (first hour missing for text) - 0003712-220113112502223-oozie-oozi-B
 * 07:17 elukey: kill webrequest bundle, text coordinator failed (logs/info/etc.. https://hue.wikimedia.org/hue/jobbrowser/#!id=0024621-210701181527401-oozie-oozi-B)
 * 07:13 elukey: umount/mount /mnt/hdfs on an-coord1001 to pick up java upgrades

2022-01-16

 * 16:43 elukey: `elukey@an-launcher1002:~$ sudo systemctl reset-failed eventlogging_to_druid_network_internal_flows-sanitization_daily.service eventlogging_to_druid_network_internal_flows_daily.service eventlogging_to_druid_network_internal_flows_hourly.service`

2022-01-13

 * 12:41 joal: rerun failed instances of webrequest-load-coord
 * 11:59 btullis: stopped eventlogging service on eventlog1003 with 1 hour's downtime.
 * 11:52 btullis: Upgrading hive packages on stat1005
 * 11:26 btullis: restarted hive-metastore and hive-server2 on an-coord1001 after running puppet.
 * 11:23 btullis: btullis@an-coord1001:~$ sudo apt install hive hive-hcatalog hive-jdbc hive-metastore hive-server2 oozie oozie-client
 * 11:18 btullis: btullis@an-coord1002:~$ sudo systemctl restart hive-metastore hive-server2
 * 09:53 btullis: DNS change deployed, failing over hive to an-coord1002.
 * 09:42 btullis: btullis@an-coord1002:~$ sudo apt install hive hive-hcatalog hive-jdbc hive-metastore hive-server2 oozie-client
 * 08:45 joal: Kill-restart wikidata-json_entity-weekly-coord after deploy

2022-01-12

 * 21:13 joal: Deploying refinery to HDFS
 * 20:46 joal: Deploying refinery with scap
 * 20:35 joal: refinery-source v0.1.24 released on archiva
 * 11:25 elukey: move kafka-jumbo nodes to fixed kafka uid/gid
 * 07:46 elukey: `systemctl reset-failed product-analytics-movement-metrics.service` on stat1007

2022-01-10

 * 13:56 btullis: Upgrading oozie packages on an-test-coord1001 to test new log4j versions

2022-01-08

 * 10:51 elukey: start hive-server2 on an-coord1002 - failed to connect to the metastore due to restart
 * 10:41 elukey: restart hive daemons on an-coord1002 (after my last upgrade/rollback of packages the prometheus agent settings were not picked up, so no metrics)

2022-01-07

 * 20:16 ottomata: altering hive table MobileWikiAppiOSUserHistory field event.device_level_enabled to string - T298721
 * 17:29 btullis: deployed updated hive packages to an-test-worker100[1-3] and an-test-ui1001
 * 14:52 btullis: root@aqs1014:~# jmap -dump:live,format=b,file=/srv/cassandra-b/tmp/aqs1014-b-dump202201071450.hprof 4468

2022-01-06

 * 18:02 btullis: btullis@aqs1010:~$ sudo systemctl restart cassandra-a.service
 * 12:22 btullis: restarting cassandra-a service on aqs1004.eqiad.wmnet in order to troubleshoot logging.
 * 11:24 btullis: restarting cassandra-a service on aqs1010.eqiad.wmnet in order to troubleshoot logging.
 * 08:12 joal: Rerun failed webrequest-load-wf-text-2022-1-6-7
 * 07:58 joal: Rerun refine_event_sanitized_analytics_immediate missing hours after errors from the past days
 * 07:39 joal: Rerun failed refine_eventlogging_analytics for mobilewikiappiosuserhistory schema, hours 2022-01-05T2[123]:00:00 and 2022-01-06T00:00:00, dropping malformed rows as discussed with schema owner

2022-01-05

 * 19:16 joal: Rerun failed refine_eventlogging_analytics for mobilewikiappiosuserhistory schema, hours 2022-01-04T1[5789]:00:00, dropping malformed rows as discussed with schema owner
 * 11:37 btullis: Upgrading hive on an-test-client1001 in order to test log4j upgrade
 * 11:35 btullis: Upgrading hive packages on an-test-coord1001 to test log4j changes.

2022-01-04

 * 10:39 elukey: restart cassandra-a on aqs1010 (heap size used in full, high GC)
 * 10:20 elukey: restart cassandra-a on aqs1015 (heap size used in full, high GC)

2022-01-03

 * 18:26 joal: rerun cassandra-daily-wf-local_group_default_T_mediarequest_per_file-2022-1-1
 * 16:08 joal: Kill cassandra3-local_group_default_T_mediarequest_per_file-daily-2022-1-1
 * 11:26 elukey: restart cassandra-b on aqs1015 (instance not responding, probably thrashing)
 * 11:16 elukey: restart cassandra-b on aqs1010 (stuck thrashing)
 * 10:34 elukey: depool aqs1010 (`sudo -i depool` on the node) to allow investigation of the cassandra-b instance
 * 10:22 elukey: powercycle an-worker1114 (CPU soft lockup errors in mgmt console)
 * 10:20 elukey: powercycle an-worker1120 (CPU soft lockup errors in mgmt console)

2021-12-22

 * 19:13 milimetric: Additional context on the last delete message: the deletion is on an-launcher1002, which has filled up
 * 19:12 milimetric: Marcel and I are deleting files from /tmp older than 60 days
 * 15:55 mforns: finished refinery deployment for anomaly detection queries
 * 14:54 mforns: starting refinery deployment for anomaly detection queries

2021-12-20

 * 18:59 mforns: finished deployment of refinery, adding anomaly detection hql for airflow job
 * 18:39 mforns: started to deploy refinery, adding anomaly detection hql for airflow job

2021-12-17

 * 12:32 btullis: Upgraded druid packages, with pool/depool on druid1004
 * 11:20 btullis: btullis@an-test-druid1001:~$ sudo apt-get install druid-broker druid-common druid-coordinator druid-historical druid-middlemanager druid-overlord
 * 11:18 btullis: updating reprepro with new druid packages for buster-wikimedia to pick up new log4j jar files

2021-12-16

 * 11:01 btullis: btullis@an-test-druid1001:~$ sudo apt-get install druid-broker druid-common druid-coordinator druid-historical druid-middlemanager druid-overlord
 * 11:01 btullis: upgrading druid on the test cluster with new packages to test log4j changes.

2021-12-15

 * 08:51 joal: Rerun failed cassandra-daily-wf-local_group_default_T_mediarequest_per_file-2021-12-13 after cluster restart
 * 07:20 elukey: elukey@stat1007:~$ sudo systemctl reset-failed product-analytics-movement-metrics

2021-12-14

 * 19:02 milimetric: finished deploying the weekly train as per etherpad
 * 18:04 joal: Rerun failed cassandra-daily-wf-local_group_default_T_pageviews_per_article_flat-2021-12-13 after cluster reboot
 * 17:51 btullis: rebooting aqs1015
 * 17:25 btullis: rebooting aqs1013
 * 17:19 btullis: rebooting aqs1012
 * 16:00 btullis: rebooting aqs1011
 * 15:53 btullis: rebooting aqs1010
 * 15:00 btullis: btullis@aqs1010:~$ sudo nodetool-a repair --full system_auth
 * 14:59 btullis: cassandra@cqlsh> ALTER KEYSPACE "system_auth" WITH REPLICATION = { 'class': 'SimpleStrategy', 'replication_factor': '12' } ; on aqs1010-a
 * 14:25 btullis: btullis@aqs1011:$ sudo systemctl start cassandra-b.service
 * 12:44 joal: Rerun failed cassandra-hourly-wf-local_group_default_T_pageviews_per_project_v2-2021-12-14-10
 * 12:42 joal: Kill late spark cassandra loading job

2021-12-11

 * 10:06 elukey: kill process 2560 on stat1005 to allow puppet to clean up the related user (offboarded)
 * 10:04 elukey: kill process 2831 on stat1008 to allow puppet to clean up the related user (offboarded)

2021-12-09

 * 11:08 btullis: roll restarting druid historical daemons on analytics cluster T297148
 * 10:46 btullis: roll restarting druid brokers on analytics cluster

2021-12-07

 * 20:09 ottomata: deploy wikistats2 with doc updates

2021-12-03

 * 17:36 razzi: restart aqs-next to pick up new mediawiki snapshot: `razzi@cumin1001:~$ sudo cumin A:aqs-next 'systemctl restart aqs'`
 * 17:36 razzi: restart aqs to pick up new mediawiki snapshot: `razzi@cumin1001:~$ sudo cookbook sre.aqs.roll-restart aqs`
 * 07:33 elukey: move kafka-test to fixed uid/gid

2021-12-02

 * 20:05 ottomata: restarting pageview-druid-daily-coord (killing 0062888-210701181527401-oozie-oozi-C) - I can't seem to rerun a particular hour, so just starting again from that hour.
 * 17:57 elukey: drop "EventLogging MySQL" datasource from Superset (not valid anymore)
 * 17:26 joal: Kill paragon job to prevent more nodemanagers from OOMing

2021-12-01

 * 20:40 razzi: deploy refinery for T296089 patch https://gerrit.wikimedia.org/r/c/analytics/refinery/+/742672

2021-11-27

 * 09:56 elukey: powercycle analytics1071, soft lockup stacktraces in the tty

2021-11-24

 * 17:30 mforns: Deployed refinery using scap, then deployed onto hdfs
 * 12:31 btullis: btullis@an-launcher1002:~$ sudo systemctl reset-failed monitor_refine_event_sanitized_analytics_delayed.service
 * 07:10 elukey: drop /tmp/blockmgr-20fe4b2b-31fb-4a85-b5b1-bebe254120f8 on stat1006 to free space on the root partition

2021-11-23

 * 11:56 btullis: roll-restarting the cassandra services on the aqs cluster. (Not the aqs_next cluster)
 * 11:49 btullis: btullis@an-coord1001:~$ sudo systemctl restart presto-server.service
 * 11:49 btullis: btullis@an-coord1001:~$ sudo systemctl restart oozie.service

2021-11-22

 * 12:18 btullis: failed back the hive services to an-coord1001 via CNAME change
 * 11:36 btullis: btullis@an-coord1001:~$ sudo systemctl restart hive-server2 hive-metastore
 * 10:44 btullis: deploying DNS change to switch hive to the standby server.
 * 10:18 btullis: btullis@an-coord1002:~$ sudo systemctl restart hive-server2 hive-metastore

2021-11-18

 * 17:26 elukey: varnishkafka-webrequest on cp3050 is running with /etc/ssl/localcerts/wmf_trusted_root_CAs.pem
 * 10:03 elukey: restart prometheus-druid-exporter on Druid Analytics to clear unnecessary metrics
 * 07:32 elukey: restart prometheus-druid-exporter on Druid Public to see metrics difference

2021-11-17

 * 16:01 btullis: roll-restarting kafka-test brokers
 * 12:12 btullis: roll-restarting the presto analytics workers
 * 11:44 btullis: btullis@archiva1002:~$ sudo systemctl restart archiva.service
 * 07:29 elukey: `apt-get clean` on an-tool1005 to free space in the root partition
 * 07:28 elukey: `sudo pkill -U jmixter` on stat100[5,8] to allow puppet to run and remove the offboarded user

2021-11-16

 * 19:40 joal: Deploying refinery to HDFS
 * 19:15 joal: Deploying refinery with scap
 * 18:23 joal: Releasing refinery-source v0.1.21
 * 11:32 btullis: btullis@cumin1001:~$ sudo cookbook sre.druid.roll-restart-workers public
 * 10:20 btullis: roll-restarting hadoop masters

2021-11-15

 * 16:37 joal: Rerun failed mediawiki-wikitext-history-wf-2021-10

2021-11-11

 * 06:56 elukey: `systemctl start prometheus-mysqld-exporter@analytics_meta` on db1108

2021-11-10

 * 18:20 btullis: btullis@an-launcher1002:~$ sudo systemctl reset-failed monitor_refine_event_sanitized_analytics_delayed.service
 * 10:19 btullis: btullis@an-launcher1002:~$ sudo systemctl reset-failed monitor_refine_event_sanitized_analytics_delayed

2021-11-09

 * 16:52 razzi: restart presto server on an-coord1001 to apply change for T292087
 * 16:30 razzi: set superset presto version to 0.246 in ui
 * 16:30 razzi: set superset presto timeout to 170s: { "connect_args": { "session_props": { "query_max_run_time": "170s" } } } for T294771
 * 12:23 btullis: btullis@an-launcher1002:~$ sudo systemctl reset-failed monitor_refine_event_sanitized_analytics_delayed
 * 07:23 elukey: `apt-get clean` on stat1006 to free some space (root partition full)

2021-11-08

 * 19:51 ottomata: an-coord1002: drop user 'admin'@'localhost'; start slave; to fix broken replication - T284150
 * 19:44 razzi: create admin user on an-coord1001 for T284150
 * 18:07 razzi: run `create user 'admin'@'localhost' identified by ; grant all privileges on *.* to admin;` to allow milimetric to access mysql on an-coord1002 for T284150

2021-11-04

 * 16:39 razzi: add "can sql json on superset" permission to Alpha role on superset.wikimedia.org
 * 16:14 razzi: drop and restore superset_staging database to test permissions as they are in production

2021-11-03

 * 17:07 razzi: razzi@an-tool1010:~$ sudo systemctl stop superset
 * 16:57 razzi: dump mysql in preparation for superset upgrade
 * 02:23 milimetric: deployed refinery with regular train

2021-10-29

 * 23:04 btullis: deleted all remaining old cassandra snapshots on aqs100x servers.
 * 22:58 btullis: deleted old snapshots from aqs1006 and aqs1009
 * 17:45 razzi: set presto_analytics_hive extra parameter engine_params.connect_args.session_props.query_max_run_time to 55s on superset.wikimedia.org
 * 10:39 elukey: roll restart of kafka-test to pick up new truststore (root PKI added)

2021-10-28

 * 19:13 ottomata: re-enable hdfs-cleaner for /wmf/gobblin

2021-10-26

 * 09:01 btullis: reverted hive services back to an-coord1001.

2021-10-25

 * 16:03 btullis: btullis@an-coord1001:~$ sudo systemctl restart hive-server2 hive-metastore
 * 13:02 btullis: btullis@an-coord1002:~$ sudo systemctl restart hive-server2 hive-metastore
 * 12:51 btullis: btullis@aqs1007:~$ sudo nodetool-a clearsnapshot

2021-10-21

 * 14:05 ottomata: rerun refine_eventlogging_analytics refine_eventlogging_legacy and refine_event with -ignore-done-flag=true --since=2021-10-21T01:00:00 --until=2021-10-21T04:00:00 for backfill of missing data after gobblin problems
 * 13:39 btullis: btullis@an-launcher1002:~$ sudo systemctl restart gobblin-event_default
 * 10:35 joal: Re-refine netflow data after gobblin pulled data fix
 * 08:41 joal: Rerun webrequest-load jobs for hour 2021-10-21T02:00

2021-10-20

 * 18:11 razzi: Deployed refinery using scap, then deployed onto hdfs
 * 16:36 razzi: deploy refinery change for https://phabricator.wikimedia.org/T287084
 * 07:15 joal: rerun webrequest-load-wf-upload-2021-10-20-1 after node issue
 * 06:27 elukey: reboot analytics1066 - OS showing CPU soft lockups, tons of defunct processes (including node manager) and high CPU usage

2021-10-19

 * 07:14 joal: Rerun cassandra-daily-wf-local_group_default_T_mediarequest_top_files-2021-10-17

2021-10-18

 * 19:29 joal: Rerun cassandra-daily-wf-local_group_default_T_top_pageviews-2021-10-17
 * 18:36 joal: Rerun cassandra-daily-wf-local_group_default_T_unique_devices-2021-10-17
 * 16:22 joal: rerun cassandra-daily-wf-local_group_default_T_top_percountry-2021-10-17
 * 16:16 joal: Rerun cassandra-daily-wf-local_group_default_T_mediarequest_per_referer-2021-10-17
 * 15:17 joal: Rerun failed instances from cassandra-hourly-coord-local_group_default_T_pageviews_per_project_v2
 * 14:49 elukey: restart hadoop-yarn-nodemanager on an-worker1119 and an-worker1103 (Java OOM in the logs)
 * 12:09 btullis: root@aqs1013:/srv/cassandra-b/tmp# systemctl restart cassandra-b.service
 * 12:09 btullis: root@aqs1012:/srv/cassandra-b/tmp# systemctl restart cassandra-b.service
 * 09:25 btullis: btullis@cumin1001:~$ sudo transfer.py aqs1013.eqiad.wmnet:/srv/cassandra-b/tmp/local_group_default_T_pageviews_per_article_flat an-presto1001.eqiad.wmnet:/srv/cassandra_migration/aqs1013-b/
 * 09:17 btullis: btullis@cumin1001:~$ sudo transfer.py aqs1012.eqiad.wmnet:/srv/cassandra-b/tmp/local_group_default_T_pageviews_per_article_flat an-presto1001.eqiad.wmnet:/srv/cassandra_migration/aqs1012-b/
 * 09:16 btullis: btullis@cumin1001:~$ sudo transfer.py aqs1012.eqiad.wmnet:/srv/cassandra-b/tmp/local_group_default_T_pageviews_per_article_flat an-presto1001.eqiad.wmnet:/srv/cassandra_migration/cassandra_migration/aqs1012-b/

2021-10-15

 * 08:33 btullis: btullis@aqs1007:~$ sudo nodetool-b clearsnapshot

2021-10-13

 * 19:49 mforns: re-ran cassandra-daily-coord-local_group_default_T_pageviews_per_article_flat for 2021-10-12 successfully
 * 17:58 ottomata: deleting files on stat1008 in /tmp older than 10 days and larger than 20M sudo find /tmp -mtime +10 -size +20M | xargs sudo rm -rfv
 * 17:54 ottomata: removed /tmp/spark-* files belonging to aikochou on stat1008

2021-10-12

 * 15:43 btullis: btullis@aqs1008:~$ sudo nodetool-b clearsnapshot
 * 13:17 btullis: btullis@analytics1069:~$ sudo shutdown -h now
 * 13:15 btullis: btullis@analytics1069:~$ sudo systemctl stop hadoop-hdfs-*
 * 13:14 btullis: btullis@analytics1069:~$ sudo systemctl stop hadoop-yarn-nodemanager.service
 * 07:26 joal: Rerun cassandra-daily-wf-local_group_default_T_pageviews_per_article_flat-2021-10-11

2021-10-11

 * 07:37 joal: rerun refine_event for `event`.`mediawiki_content_translation_event` year=2021/month=10/day=10/hour=16

2021-10-10

 * 18:07 joal: Rerun webrequest-load-wf-text-2021-10-10-10 - failed due to network issue

2021-10-06

 * 14:30 elukey: upgrade stat1005 to ROCm 4.2.0
 * 13:20 btullis: btullis@aqs1004:~$ sudo nodetool-a clearsnapshot
 * 10:20 elukey: upgrade ROCm to 4.2 on stat1008

2021-10-05

 * 11:28 elukey: failover analytics-hive back to an-coord1001 after maintenance

2021-10-04

 * 16:56 elukey: restart java daemons on an-coord1001 (standby)
 * 13:43 elukey: failover analytics-hive to an-coord1002 (to restart java daemons on 1001)
 * 07:43 joal: Kill-restart mediawiki-history-reduced job after deploy (more resources)
 * 07:32 joal: Deploy refinery to hdfs
 * 07:10 joal: Deploy refinery for mediawiki-history-reduced hotfix
 * 06:56 joal: Kill-restart pageview-monthly_dump-coord to apply fix for SLA

2021-10-01

 * 15:11 btullis: sudo -u analytics kerberos-run-command analytics /usr/local/bin/refine_eventlogging_legacy --ignore_failure_flag=true --table_include_regex='editoractivation' --since='2021-09-29T22:00:00.000Z' --until='2021-09-30T23:00:00.000Z'

2021-09-30

 * 19:55 ottomata: not changing stats uid to 499; it already exists as another system user
 * 19:54 ottomata: changing stats uid and gid on an-launcher1002 and stat1005 to 499
 * 09:32 btullis: btullis@an-launcher1002:~$ sudo -u analytics kerberos-run-command analytics /usr/local/bin/refine_netflow --ignore_failure_flag=true --since=2021-09-28T11:00:00 --until 2021-09-28T12:00:00

2021-09-29

 * 09:16 elukey: restart hive-* units on an-coord1002 for openjdk upgrades (standby node)

2021-09-28

 * 13:14 btullis: Deployed refinery using scap, then deployed onto hdfs
 * 12:34 btullis: deploying refinery
 * 09:55 btullis: btullis@cumin1001:~$ sudo cumin --mode async 'aqs100*.eqiad.wmnet' 'nodetool-a snapshot -t T291472 local_group_default_T_pageviews_per_article_flat' 'nodetool-b snapshot -t T291472 local_group_default_T_pageviews_per_article_flat'
 * 09:36 elukey: restart java daemons on an-test-coord1001 to pick up new openjdk

2021-09-27

 * 11:18 btullis: btullis@stat1005:~$ sudo apt purge usrmerge
 * 11:11 btullis: btullis@stat1005:~$ sudo apt install usrmerge

2021-09-24

 * 22:33 razzi: restart an-test-coord presto coordinator service to experiment with web-ui.authentication.type=fixed
 * 15:06 btullis: btullis@cumin1001:~$ sudo cumin --mode async 'aqs100[4,7].eqiad.wmnet' 'nodetool-a snapshot -t T291469' 'nodetool-b snapshot -t T291469'
 * 14:47 btullis: btullis@aqs1007:~$ sudo nodetool-a repair --full local_group_default_T_mediarequest_per_file data
 * 11:02 btullis: btullis@an-master1001:~$ sudo systemctl restart hadoop-mapreduce-historyserver
 * 10:47 btullis: btullis@an-master1002:~$ sudo systemctl restart hadoop-hdfs-namenode
 * 10:47 btullis: btullis@an-master1002:~$ sudo systemctl restart hadoop-hdfs-zkfc
 * 10:35 btullis: btullis@an-master1001:~$ sudo -u hdfs kerberos-run-command hdfs /usr/bin/hdfs haadmin -failover an-master1002-eqiad-wmnet an-master1001-eqiad-wmnet
 * 10:07 btullis: btullis@an-launcher1002:~$ sudo -u analytics kerberos-run-command analytics /usr/local/bin/refine_eventlogging_legacy --ignore_failure_flag=true --table_include_regex='centralnoticeimpression' --since='2021-09-23T04:00:00.000Z' --until='2021-09-24T05:00:00.000Z'

2021-09-22

 * 17:23 razzi: razzi@an-test-coord1001:/etc/presto$ sudo systemctl restart presto-server
 * 17:05 joal: Kill-restart oozie jobs after deploy (mediawiki-history-denormalize-coord, mediawiki-history-check_denormalize-coord, mediawiki-history-dumps-coord, mediawiki-history-reduced-coord)
 * 11:54 joal: release refinery-source v0.1.18 to archiva with Jenkins

2021-09-20

 * 08:12 elukey: remove old /reportcard (password protected, old files from 2012) httpd settings for stats.wikimedia.org

2021-09-18

 * 06:48 joal: Rerun webrequest-load-wf-text-2021-9-18-0 for errors after last night's production issue

2021-09-17

 * 16:03 btullis: Cleared all snapshots on aqs100[47] to reclaim space with nodetool-[ab] clearsnapshot (T249755)
 * 15:15 btullis: btullis@aqs1004:~$ sudo nodetool-a repair --full && sudo nodetool-b repair --full (T249755)
 * 10:18 btullis: btullis@an-web1001:~$ sudo find /srv/published-rsynced -user systemd-coredump -exec chown stats {} \;
 * 09:47 milimetric: deployed refinery to sync sanitize allowlist, deleting event_sanitized data per decision in the task
 * 08:21 elukey: disable mod_cgi/mod_cgid on an-web1001 (and remove cgi-perl related httpd configs/settings)

2021-09-16

 * 19:25 ottomata: pointing analytics-web cname at new an-web1001, this moves stats and analytics .wm.org from thorium to an-web1001 - T285355
 * 18:30 joal: Create HDFS home folder for user 'analytics-research'
 * 07:03 elukey: stop jupyter-kaywong-singleuser.service on stat1005 to allow puppet to clean up

2021-09-15

 * 16:26 joal: Deploying refinery

2021-09-13

 * 18:25 razzi: (I stopped replication earlier but forgot to !log)
 * 18:24 razzi: razzi@dbstore1007:~$ for socket in /run/mysqld/*; do sudo mysql --socket=$socket -e "START SLAVE"; done - reenable replication for T290841
 * 18:19 razzi: razzi@dbstore1007:~$ sudo systemctl restart mariadb@s4.service for T290841
 * 18:13 razzi: razzi@dbstore1007:~$ sudo systemctl restart mariadb@s3.service for T290841
 * 18:05 razzi: sudo systemctl restart mariadb@s2.service

2021-09-07

 * 11:41 joal: Restarting cassandra hourly loading job after C2 snapshot taken and C3 tables truncated
 * 11:37 joal: Re-Add test rows in cassandra3 cluster after tables got truncated
 * 10:25 hnowlan: truncating data tables on aqs_next cluster
 * 10:12 joal: Kill cassandra hourly loading job for cluster-migration first step

2021-09-03

 * 11:43 joal: Deploying refinery to hotfix mediarequest cassandra3 loading jobs (second)
 * 09:57 joal: Deploy AQS on new AQS servers
 * 09:45 joal: Kill-restart mediarequest-top cassandra loading jobs after deploy
 * 09:12 joal: Rerun mediawiki-history-denormalize-wf-2021-08 after failure
 * 09:07 joal: Deploying refinery to hotfix mediarequest cassandra3 loading jobs

2021-09-01

 * 16:44 mforns: finished one-off deployment of refinery to fix cassandra3 loading
 * 15:57 joal: Kill cassandra loading jobs and restart them after deploy
 * 15:55 mforns: starting one-off deployment of refinery to fix cassandra3 loading
 * 13:15 joal: Restart cassandra jobs to load cassandra3 with spark
 * 08:21 joal: Rerun webrequest-load-wf-upload-2021-9-1-0

2021-08-31

 * 23:25 mforns: finished deployment of refinery (regular weekly train v0.1.17) successfully, only an-test-coord1001.eqiad.wmnet failed
 * 22:41 mforns: starting deployment of refinery (regular weekly train v0.1.17)
 * 22:27 mforns: Deployed refinery-source using jenkins
 * 10:30 hnowlan: sudo cookbook sre.aqs.roll-restart aqs-next

2021-08-30

 * 06:53 elukey: drop an-airflow1001's old airflow logs to fix root partition almost filled up

2021-08-26

 * 06:22 elukey: root@an-launcher1002:/var/lib/puppet/clientbucket# find -type d -empty -delete
 * 06:21 elukey: root@an-launcher1002:/var/lib/puppet/clientbucket# find -type f -delete -mtime +60
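
Note on the two clientbucket cleanup commands above: find evaluates its expression left to right, so an action such as -delete that appears before a test such as -mtime runs before that test is consulted. A minimal sketch of the ordering that limits deletion to files older than 60 days, run against a throwaway directory (the directory and file names are hypothetical; GNU touch and find are assumed):

```shell
# Throwaway directory with one old and one fresh file (hypothetical names).
demo_dir=$(mktemp -d)
touch -d '90 days ago' "$demo_dir/old.log"   # GNU touch: set mtime 90 days back
touch "$demo_dir/new.log"

# Tests first, action last: -delete only runs for files that passed -mtime +60.
find "$demo_dir" -type f -mtime +60 -delete

# List what survived, then clean up the demo directory.
remaining=$(ls "$demo_dir")
echo "$remaining"
rm -rf "$demo_dir"
```

Replacing -delete with -print first is a cheap way to preview what a given expression would match before removing anything.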

2021-08-25

 * 13:40 joal: Kill restart pageview-monthly_dump job and 2 backfilling jobs
 * 13:34 joal: Deploy refinery onto HDFS
 * 13:09 joal: Deploying refinery using scap

2021-08-24

 * 10:30 btullis: btullis@an-launcher1002:~$ sudo systemctl start hdfs-balancer.service

2021-08-20

 * 08:46 btullis: btullis@druid1001:~$ sudo systemctl stop druid-broker druid-coordinator druid-historical druid-middlemanager druid-overlord

2021-08-19

 * 19:05 razzi: razzi@deploy1002:/srv/deployment/analytics/aqs/deploy$ scap deploy "Deploy aqs 9c062f2"
 * 19:02 razzi: note that the aqs-deploy repo's commit message DOES NOT include the changes of aqs in its changes list (though it has the correct SHA in the first line)
 * 18:26 razzi: Beginning aqs deploy process
 * 17:55 razzi: razzi@labstore1007:~$ sudo systemctl start analytics-dumps-fetch-geoeditors_dumps.service
 * 17:53 razzi: sudo systemctl start analytics-dumps-fetch-geoeditors_dumps.service on labstore1006

2021-08-18

 * 17:37 btullis: on an-coord1001: MariaDB [superset_production]> update clusters set broker_host='an-druid1001.eqiad.wmnet' where cluster_name='analytics-eqiad';
 * 15:08 joal: Restart oozie jobs loading druid to use new druid-host
 * 08:55 joal: Deploying refinery with scap

2021-08-13

 * 16:46 elukey: cleanup /srv/discovery on stat1007 after https://gerrit.wikimedia.org/r/c/operations/puppet/+/712422
 * 15:16 milimetric: reran the other three failed jobs successfully
 * 14:52 milimetric: rerunning webrequest-druid-hourly-wf-2021-8-13-13 because of failure to connect to Hive metastore

2021-08-12

 * 14:46 btullis: btullis@druid1002:/etc/zookeeper/conf$ sudo systemctl disable druid-broker druid-coordinator druid-historical druid-middlemanager druid-overlord
 * 14:45 btullis: btullis@druid1002:/etc/zookeeper/conf$ sudo systemctl stop druid-broker druid-coordinator druid-historical druid-middlemanager druid-overlord

2021-08-11

 * 19:43 btullis: btullis@druid1003:~$ sudo systemctl stop druid-overlord && sudo systemctl disable druid-overlord
 * 19:41 btullis: btullis@druid1003:~$ sudo systemctl stop druid-historical && sudo systemctl disable druid-historical
 * 19:40 btullis: btullis@druid1003:~$ sudo systemctl stop druid-coordinator && sudo systemctl disable druid-coordinator
 * 19:37 btullis: btullis@druid1003:~$ sudo systemctl stop druid-broker && sudo systemctl disable druid-broker
 * 19:30 btullis: btullis@druid1003:~$ curl -X POST http://druid1003.eqiad.wmnet:8091/druid/worker/v1/disable
 * 12:13 btullis: migration of zookeeper from druid1002 to an-druid1002 complete, with quorum and two synced followers. Re-enabling puppet on all druid nodes.
 * 09:48 btullis: suspended the following oozie jobs in hue: webrequest-druid-hourly-coord, pageview-druid-hourly-coord, edit-hourly-druid-coord
 * 09:45 btullis: btullis@an-launcher1002:~$ sudo systemctl disable eventlogging_to_druid_editattemptstep_hourly.timer eventlogging_to_druid_navigationtiming_hourly.timer eventlogging_to_druid_netflow_hourly.timer eventlogging_to_druid_prefupdate_hourly.timer
 * 09:21 elukey: run "sudo find /var/log/airflow -type f -mtime +15 -delete" on an-airflow1001 to free space (root partition almost full)

2021-08-10

 * 17:27 razzi: resume the following schedules in hue: edit-hourly-druid-coord, pageview-druid-hourly-coord, webrequest-druid-hourly-coord
 * 17:10 razzi: sudo cookbook sre.druid.roll-restart-workers analytics (errored out)
 * 09:04 btullis: btullis@an-launcher1002:~$ sudo systemctl restart eventlogging_to_druid_prefupdate_hourly.service
 * 09:04 btullis: btullis@an-launcher1002:~$ sudo systemctl restart eventlogging_to_druid_netflow_daily.service

2021-08-09

 * 10:45 btullis_: btullis@an-druid1003:/var/log/druid$ sudo chown -R druid:druid /srv/druid /var/log/druid
 * 10:25 btullis_: btullis@an-druid1003:~$ sudo puppet agent -tv

2021-08-04

 * 09:12 btullis: btullis@an-coord1001:~$ sudo systemctl start hive-metastore.service hive-server2.service
 * 09:12 btullis: btullis@an-coord1001:~$ sudo systemctl stop hive-server2.service hive-metastore.service
 * 09:00 btullis: sudo systemctl start hive-metastore && sudo systemctl start hive-server2
 * 09:00 btullis: btullis@an-coord1002:~$ sudo systemctl stop hive-server2 && sudo systemctl stop hive-metastore

2021-08-03

 * 19:23 ottomata: bump Refine to refinery version 0.1.16 to pick up normalized_host transform - now all event tables will have a new normalized_host field - T251320
 * 19:02 ottomata: Deployed refinery using scap, then deployed onto hdfs
 * 14:57 ottomata: rerunning webrequest refine for upload 08-03T01:00 - 0042643-210701181527401-oozie-oozi-W

2021-08-02

 * 18:49 razzi: sudo cookbook sre.druid.roll-restart-workers analytics
 * 17:57 razzi: sudo cookbook sre.druid.roll-restart-workers public

2021-07-30

 * 22:22 razzi: razzi@cumin1001:~$ sudo cookbook sre.druid.roll-restart-workers test

2021-07-29

 * 18:12 razzi: sudo cookbook sre.aqs.roll-restart aqs

2021-07-28

 * 10:46 btullis: btullis@an-test-coord1001:/etc/hive/conf$ sudo systemctl start hive-metastore.service hive-server2.service
 * 10:46 btullis: btullis@an-test-coord1001:/etc/hive/conf$ sudo systemctl stop hive-server2.service hive-metastore.service

2021-07-26

 * 20:54 razzi: reran the failed workflow of cassandra-daily-wf-local_group_default_T_pageviews_per_article_flat-2021-7-25

2021-07-22

 * 18:38 ottomata: deploy refinery to an-launcher1002 for bin/gobblin job lock change

2021-07-20

 * 20:30 joal: rerun webrequest timed-out instances
 * 18:58 mforns: starting refinery deployment
 * 18:40 razzi: razzi@an-launcher1002:~$ sudo puppet agent --enable
 * 18:39 razzi: razzi@an-master1001:/var/log/hadoop-hdfs$ sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues
 * 18:37 razzi: razzi@an-master1002:~$ sudo -i puppet agent --enable
 * 18:34 razzi: razzi@an-master1002:~$ sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues
 * 18:32 razzi: razzi@an-master1002:~$ sudo systemctl start hadoop-yarn-resourcemanager.service
 * 18:31 razzi: razzi@an-master1002:~$ sudo systemctl stop hadoop-yarn-resourcemanager.service
 * 18:22 razzi: sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1002-eqiad-wmnet an-master1001-eqiad-wmnet
 * 18:21 razzi: re-enable yarn queues by merging puppet patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/705732
 * 17:27 razzi: razzi@cumin1001:~$ sudo -i wmf-auto-reimage-host -p T278423 an-master1001.eqiad.wmnet
 * 17:17 razzi: stop all hadoop processes on an-master1001
 * 16:52 razzi: starting hadoop processes on an-master1001 since they didn't failover cleanly
 * 16:31 razzi: sudo bash gid_script.bash on an-master1001
 * 16:29 razzi: razzi@alert1001:~$ sudo icinga-downtime -h an-master1001 -d 7200 -r "an-master1001 debian upgrade"
 * 16:25 razzi: razzi@an-master1001:~$ sudo systemctl stop hadoop-mapreduce-historyserver
 * 16:25 razzi: sudo systemctl stop hadoop-hdfs-zkfc.service on an-master1001 again
 * 16:25 razzi: sudo systemctl stop hadoop-yarn-resourcemanager on an-master1001 again
 * 16:23 razzi: sudo systemctl stop hadoop-hdfs-namenode on an-master1001
 * 16:19 razzi: razzi@an-master1001:~$ sudo systemctl stop hadoop-hdfs-zkfc
 * 16:19 razzi: razzi@an-master1001:~$ sudo systemctl stop hadoop-yarn-resourcemanager
 * 16:18 razzi: sudo systemctl stop hadoop-hdfs-namenode
 * 16:10 razzi: razzi@cumin1001:~$ sudo transfer.py an-master1002.eqiad.wmnet:/home/razzi/hdfs-namenode-snapshot-buster-reimage-$(date --iso-8601).tar.gz stat1004.eqiad.wmnet:/home/razzi/hdfs-namenode-fsimage
 * 16:03 razzi: root@an-master1002:/srv/hadoop/name# tar -czf /home/razzi/hdfs-namenode-snapshot-buster-reimage-$(date --iso-8601).tar.gz current
 * 15:57 razzi: sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -saveNamespace
 * 15:52 razzi: sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode enter
 * 15:37 razzi: kill yarn applications: for jobId in $(yarn application -list | awk 'NR > 2 { print $1 } '); do yarn application -kill $jobId; done
 * 15:08 razzi: sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues
 * 14:52 razzi: sudo systemctl stop 'gobblin-*.timer'
 * 14:51 razzi: sudo systemctl stop analytics-reportupdater-logs-rsync.timer
 * 14:47 razzi: Disable jobs on an-launcher1002 (see https://phabricator.wikimedia.org/T278423#7190372)
 * 14:46 razzi: razzi@an-launcher1002:~$ sudo puppet agent --disable 'razzi: upgrade hadoop masters to debian buster'
 * 08:32 mforns: restarted webrequest bundle (messed up a coord when trying to rerun some failed hours)
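
Note on the 15:37 entry above: the awk filter `NR > 2 { print $1 }` skips the first two lines of the listing and keeps the first column, which is how the loop obtains application IDs. A dry-run sketch of that parsing, assuming a simplified listing with two header lines (the sample output, IDs and job names are hypothetical, and the kill command is only echoed):

```shell
# Hypothetical, simplified stand-in for `yarn application -list` output.
sample_listing() {
  printf '%s\n' \
    'Total number of applications (application-types: [], states: [RUNNING]):2' \
    'Application-Id  Application-Name  State' \
    'application_0000000000000_0001  example_job_a  RUNNING' \
    'application_0000000000000_0002  example_job_b  RUNNING'
}

# Same filter as in the log entry: drop the two header lines, keep column 1.
app_ids=$(sample_listing | awk 'NR > 2 { print $1 }')

# Dry run: print the command that would be issued for each application.
for app_id in $app_ids; do
  echo "would run: yarn application -kill $app_id"
done
```

If the real listing carries a different number of leading lines, the `NR > 2` threshold would need adjusting, so checking the dry-run output first is worthwhile.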

2021-07-17

 * 08:54 elukey: run 'sudo find -type f -name '*.log*' -mtime +30 -delete' on an-coord1001:/var/log/hive to free space (root partition almost filled up) - T279304

2021-07-15

 * 16:44 ottomata: deploying refinery and refinery-source 0.1.15 for refine job fixes - T271232
 * 13:39 joal: Kill refine_event application_1623774792907_154469 to let manual run finish
 * 13:35 joal: Kill currently running refine job (application_1623774792907_154014)
 * 11:20 joal: Kill stuck refine application

2021-07-14

 * 17:39 razzi: sudo cookbook sre.druid.roll-restart-workers public for https://phabricator.wikimedia.org/T283067
 * 00:34 razzi: razzi@an-test-druid1001:~$ sudo systemctl restart zookeeper
 * 00:33 razzi: razzi@an-test-druid1001:~$ sudo systemctl restart druid-coordinator
 * 00:33 razzi: razzi@an-test-druid1001:~$ sudo systemctl restart druid-broker
 * 00:28 razzi: razzi@an-test-druid1001:~$ sudo systemctl restart druid-middlemanager
 * 00:24 razzi: razzi@an-test-druid1001:~$ sudo systemctl restart druid-overlord
 * 00:24 razzi: razzi@an-test-druid1001:~$ sudo systemctl restart druid-historical

2021-07-13

 * 19:29 joal: move /wmf/data/raw/eventlogging --> /wmf/data/raw/eventlogging_camus and drop /wmf/data/raw/eventlogging_legacy/*/year=2021/month=07/day=13/hour=14
 * 19:02 razzi: razzi@cumin1001:~$ sudo cookbook sre.hadoop.roll-restart-workers analytics
 * 13:03 joal: remove /wmf/gobblin/locks/event_default.lock to unlock gobblin event job

2021-07-12

 * 18:37 joal: Move /wmf/data/raw/event to /wmf/data/raw/event_camus and /wmf/data/raw/event_gobblin to /wmf/data/raw/event
 * 18:36 joal: Delete /year=2021/month=07/day=12/hour=14 of gobblin imported events
 * 18:17 ottomata: stopped puppet and refines and imports for event data on an-launcher1002 in prep for gobblin finalization for event_default job
 * 12:31 joal: Rerun failed webrequest hour after having checked that loss was entirely false-positive

2021-07-09

 * 03:21 joal: Rerun webrequest descendent jobs for 2021-07-08T10:00 problem

2021-07-08

 * 17:22 joal: Deploy refinery to HDFS
 * 16:57 joal: Kill-restart webrequest oozie job after gobblin time-format change
 * 16:44 joal: Deploying refinery to an-launcher and hadoop-test
 * 16:05 joal: Manually add /wmf/data/raw/webrequest/webrequest_text/year=2021/month=7/day=8/hour=9/_IMPORTED

2021-07-07

 * 17:03 joal: Deploy refinery to HDFS
 * 16:52 joal: Deploy refinery to an-launcher1002
 * 16:05 joal: Deploy refinery to test-cluster
 * 13:30 joal: kill-restart webrequest using gobblin data
 * 13:12 ottomata: deploying refinery to an-launcher1002 for webrequest gobblin migration
 * 13:09 joal: Move data for webrequest camus-gobblin migration
 * 13:03 ottomata: disabled camus-webrequest and gobblin-webrequest timer on an-launcher1002 in prep for migration

2021-07-06

 * 17:33 joal: Deploy refinery onto HDFS
 * 16:41 joal: Deploy refinery for gobblin
 * 16:03 joal: Kill webrequest_test oozie job
 * 15:55 joal: Drop and recreate wmf_raw.webrequest table on analytics-test-hadoop
 * 15:52 joal: Moved camus and gobblin data for webrequest on analytics-test-hadoop
 * 15:48 ottomata: deploying refinery to test cluster for webrequest_test gobblin job
 * 14:16 ottomata: restarted aqs for july mw history snapshot deploy
 * 13:29 joal: Run first manual empty job for webrequest_test on analytics-test-hadoop
 * 13:29 joal: Clean gobblin state_store and data before starting webrequest_test on analytics-test-hadoop

2021-07-03

 * 19:57 joal: rerun learning-features-actor-hourly-wf-2021-7-2-11

2021-07-02

 * 13:47 joal: Reset failed timer refinery-sqoop-mediawiki-private.service
 * 12:21 joal: Replacing failed data with successful data generated when testing https://gerrit.wikimedia.org/r/702877 - wmf_raw.mediawiki_private_cu_changes
 * 00:04 razzi: razzi@an-coord1002:~$ sudo mount -a
 * 00:04 razzi: razzi@an-coord1002:~$ sudo umount /mnt/hdfs
 * 00:03 razzi: razzi@an-coord1002:~$ sudo systemctl restart hive-metastore.service
 * 00:02 razzi: razzi@an-coord1002:~$ sudo systemctl restart hive-server2.service

2021-07-01

 * 18:56 razzi: razzi@authdns1001:~$ sudo authdns-update
 * 18:19 razzi: razzi@an-coord1001:~$ sudo mount -a
 * 18:18 razzi: razzi@an-coord1001:~$ sudo umount /mnt/hdfs
 * 18:17 razzi: razzi@an-coord1001:~$ sudo systemctl restart presto-server.service
 * 18:16 razzi: razzi@an-coord1001:~$ sudo systemctl restart hive-metastore.service
 * 18:16 razzi: sudo systemctl restart hive-server2.service
 * 18:15 razzi: sudo systemctl restart oozie on an-coord1001 for https://phabricator.wikimedia.org/T283067
 * 16:38 razzi: sudo authdns-update on ns0.wikimedia.org to apply https://gerrit.wikimedia.org/r/c/operations/dns/+/702689

2021-06-30

 * 18:19 razzi: unmount and remount /mnt/hdfs on an-test-client1001 for java security update

2021-06-29

 * 22:55 razzi: sudo systemctl restart hive-server2 on an-test-coord1001.eqiad.wmnet for T283067
 * 22:53 razzi: sudo systemctl restart hive-metastore on an-test-coord1001.eqiad.wmnet for T283067
 * 22:52 razzi: sudo systemctl restart presto-server on an-test-coord1001.eqiad.wmnet for T283067
 * 22:51 razzi: sudo systemctl restart oozie on an-test-coord1001.eqiad.wmnet for T283067
 * 13:31 ottomata: deploying refinery for weekly train

2021-06-28

 * 17:00 elukey: apt-get reinstall llvm-gpu on stat100[5-8] - T285495

2021-06-25

 * 08:01 elukey: reboot an-worker1101 to unblock stuck GPU
 * 07:57 elukey: execute "sudo /opt/rocm/bin/rocm-smi --gpureset -d 1" on an-worker1101 as an attempt to unblock the GPU

2021-06-24

 * 06:38 elukey: drop hieradata/role/common/analytics_cluster/superset.yaml from puppet private repo (unused config, all the values duplicated in the new hiera config)
 * 06:34 elukey: rename superset hiera role configs in puppet private repo (to match the role change done recently) + superset restart

2021-06-23

 * 17:56 ottomata: enable canary events for NavigationTiming extension streams - https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/699789
 * 15:30 elukey: drop /reportupdater-queries on an-launcher1002 after https://gerrit.wikimedia.org/r/c/operations/puppet/+/701130

2021-06-22

 * 14:46 XioNoX: remove decom hosts from the analytics firewall filter on cr2-eqiad - T279429
 * 14:37 XioNoX: start updating analytics firewall rules to capirca generated ones on cr2-eqiad - T279429
 * 14:28 XioNoX: remove decom hosts from the analytics firewall filter on cr1-eqiad - T279429
 * 14:12 XioNoX: start updating analytics firewall rules to capirca generated ones on cr1-eqiad - T279429

2021-06-21

 * 13:35 razzi: sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1001-eqiad-wmnet an-master1002-eqiad-wmnet

2021-06-18

 * 06:37 elukey: execute "sudo find -type f -name '*.log*' -mtime +30 -delete" on an-coord1001 to free space in the root partition

2021-06-15

 * 17:46 razzi: remove hdfs namenode backup on stat1004
 * 17:45 razzi: enable puppet on an-launcher
 * 17:45 razzi: sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues
 * 16:55 razzi: sudo -i wmf-auto-reimage-host -p T278423 an-master1002.eqiad.wmnet
 * 16:53 razzi: run uid script on an-master1002
 * 16:33 elukey: restart hadoop-yarn-resourcemanager on an-master1001
 * 16:16 razzi: sudo systemctl stop 'hadoop-*' on an-master1002
 * 16:14 razzi: sudo systemctl stop hadoop-* on an-master1001, then realize I meant to do this on an-master1002, so start hadoop-*
 * 16:11 razzi: downtime an-master1002
 * 15:55 razzi: sudo transfer.py an-master1001.eqiad.wmnet:/srv/hadoop/backup/hdfs-namenode-snapshot-buster-reimage-2021-06-15.tar.gz stat1004.eqiad.wmnet:/home/razzi/hdfs-namenode-fsimage
 * 15:42 razzi: tar -czf /srv/hadoop/backup/hdfs-namenode-snapshot-buster-reimage-$(date --iso-8601).tar.gz current on an-master1001
 * 15:38 razzi: backup /srv/hadoop/name/current to /home/razzi/hdfs-namenode-snapshot-buster-reimage-2021-06-15.tar.gz on an-master1001
 * 15:33 razzi: sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -saveNamespace
 * 15:27 razzi: sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode enter
 * 15:25 razzi: kill running yarn applications via for loop
 * 15:11 razzi: sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues
 * 15:09 razzi: disable puppet on an-masters
 * 15:08 razzi: run puppet on an-masters to update capacity-scheduler.xml
 * 15:02 razzi: disable puppet on an-masters
 * 15:01 razzi: sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues to stop queues
 * 14:35 razzi: disable jobs that use hadoop on an-launcher1002 following https://phabricator.wikimedia.org/T278423#7094641
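
The namenode checkpoint steps logged above between 15:27 and 15:42 (enter safe mode, save the namespace, archive the `current` directory) can be summarized as a dry-run sketch. The commands and the backup path are taken from the entries above; the `run` and `checkpoint_namenode` helpers are hypothetical, nothing is executed, and the tar step is assumed to run from /srv/hadoop/name:

```shell
# Dry-run helper: print each command instead of executing it.
run() { printf '+ %s\n' "$*"; }

# Same YYYY-MM-DD value that `date --iso-8601` prints on GNU systems.
backup_date=$(date +%F)

checkpoint_namenode() {
  # Block writes so the namespace cannot change during the checkpoint.
  run sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode enter
  # Flush the in-memory namespace to a fresh fsimage on disk.
  run sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -saveNamespace
  # Archive the namenode metadata directory (run from /srv/hadoop/name).
  run tar -czf "/srv/hadoop/backup/hdfs-namenode-snapshot-buster-reimage-${backup_date}.tar.gz" current
}

plan=$(checkpoint_namenode)
echo "$plan"
```

Entering safe mode before saving the namespace matters because `-saveNamespace` is only accepted while the namenode is in safe mode.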

2021-06-14

 * 18:45 ottomata: remove packages from hadoop common nodes: sudo cumin 'R:Class = profile::analytics::cluster::packages::common' 'apt-get -y remove python3-pandas python3-pycountry python3-numpy python3-tz' - T275786
 * 18:43 ottomata: remove packages from stat nodes: sudo cumin 'stat*' apt-get -y remove subversion mercurial tofrodos libwww-perl libcgi-pm-perl libjson-perl libtext-csv-xs-perl libproj-dev libboost-regex-dev libboost-system-dev libgoogle-glog-dev libboost-iostreams-dev libgdal-dev
 * 07:18 joal: Rerun cassandra-daily-wf-local_group_default_T_pageviews_per_article_flat-2021-6-11

2021-06-10

 * 21:17 razzi: sudo systemctl restart monitor_refine_eventlogging_analytics
 * 18:17 razzi: sudo systemctl restart hadoop-mapreduce-historyserver
 * 17:24 razzi: sudo systemctl restart hadoop-hdfs-namenode on an-master1002
 * 17:24 razzi: sudo systemctl restart hadoop-hdfs-zkfc on an-master1002
 * 17:12 razzi: sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1002-eqiad-wmnet an-master1001-eqiad-wmnet
 * 16:25 razzi: rolling restart hadoop masters to pick up https://gerrit.wikimedia.org/r/c/operations/puppet/+/698194
 * 14:07 ottomata: altered event.wmdebannerevent event.eventRate field to change type from BIGINT to DOUBLE - T282562

2021-06-08

 * 16:56 elukey: move away from dbstore1004 in favor of dbstore1007 in analytics CNAME/SRV records (will affect analytics-mysql and sqoop)
 * 13:42 ottomata: roll restart an-conf zookeepers - T283067
 * 13:22 ottomata: roll restarting analytics presto-servers - T283067
 * 06:08 elukey: restart yarn nodemanager on analytics1075 to clear the un-healthy state after some days of downtime (one-off issue but let's keep an eye on it)

2021-06-07

 * 18:14 ottomata: rolling restart of kafka jumbo brokers - T283067
 * 17:53 ottomata: rolling restart of kafka jumbo mirror makers - T283067
 * 17:07 ottomata: remove packages from an cluster nodes: sudo apt-get -y remove r-cran-rmysql python3-matplotlib python3-sklearn python3-enchant python3-nltk gfortran liblapack-dev libopenblas-dev - T275786
 * 16:50 ottomata: restarting mysqld analytics-meta replica on db1108 to apply config change - T272973

2021-06-04

 * 17:42 razzi: sudo cookbook sre.aqs.roll-restart aqs to deploy new mediawiki history snapshot

2021-06-03

 * 22:32 razzi: sudo manage_principals.py create jdl --email_address=jlinehan@wikimedia.org
 * 22:32 razzi: sudo manage_principals.py create phuedx --email_address=phuedx@wikimedia.org
 * 15:46 ottomata: add airflow_2.1.0-py3.7-1_amd64.deb to apt.wm.org
 * 15:20 ottomata: created airflow_analytics database and user on an-coord1001 analytics-meta instance - T272973

2021-06-02

 * 18:09 ottomata: remove .deb packages from stat boxes: python3-mysqldb python3-boto python3-ua-parser python3-netaddr python3-pymysql python3-protobuf python3-unidecode python3-oauth2client python3-oauthlib python3-requests-oauthlib python3-ua-parser - T275786

2021-05-31

 * 06:56 joal: Rerun cassandra-daily-wf-local_group_default_T_pageviews_per_article_flat-2021-5-29

2021-05-27

 * 14:37 elukey: removed Luca's and Tobias' emails from analytics-alerts@
 * 07:01 elukey: roll restart hdfs namenodes to pick up new GC/heap settings - https://gerrit.wikimedia.org/r/c/operations/puppet/+/695933

2021-05-26

 * 19:14 ottomata: deploying refinery and refinery source 0.1.13
 * 17:29 ottomata: killing and restarting oozie cassandra loader jobs coord_unique_devices_daily and coord_pageview_top_percountry_daily after revert of oozie job to load to cassandra 3
 * 14:18 ottomata: deploying refinery...
 * 14:17 ottomata: Deployed refinery-source using jenkins

2021-05-25

 * 18:16 razzi: sudo systemctl start all failed units from `systemctl list-units --state=failed` on an-launcher1002
 * 18:14 razzi: sudo systemctl start eventlogging_to_druid_navigationtiming_hourly.service
 * 18:01 razzi: manually edit /etc/hadoop/conf/capacity-scheduler.xml to make queues running and sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues
 * 17:52 razzi: sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues on an-master1001 and an-master1002
 * 17:28 razzi: sudo systemctl restart refine_eventlogging_legacy
 * 17:28 razzi: sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues to enable submitting jobs once again
 * 17:08 razzi: re-enabled puppet on an-masters and an-launcher
 * 17:04 razzi: sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode leave
 * 17:03 razzi: sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1002-eqiad-wmnet an-master1001-eqiad-wmnet
 * 16:43 razzi: sudo systemctl restart hadoop-hdfs-namenode on an-master1001
 * 16:38 razzi: sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -saveNamespace
 * 16:35 razzi: sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode enter
 * 16:28 razzi: sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1002-eqiad-wmnet an-master1001-eqiad-wmnet
 * 16:23 razzi: sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode leave
 * 16:06 razzi: sudo systemctl restart hadoop-hdfs-namenode
 * 15:52 razzi: checkpoint hdfs with sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -saveNamespace
 * 15:51 razzi: enable safe mode on an-master1001 with sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode enter
 * 15:36 razzi: disable puppet on an-master1001.eqiad.wmnet and an-master1002.eqiad.wmnet again
 * 15:35 razzi: re-enable puppet on an-masters, run puppet, and sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues
 * 15:32 razzi: disable puppet on an-master1001.eqiad.wmnet and an-master1002.eqiad.wmnet
 * 14:39 razzi: stop puppet on an-launcher and stop hadoop-related timers
 * 01:09 razzi: sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1002-eqiad-wmnet an-master1001-eqiad-wmnet
 * 01:07 razzi: sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1001-eqiad-wmnet an-master1002-eqiad-wmnet
 * 00:34 razzi: sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1001-eqiad-wmnet an-master1002-eqiad-wmnet

2021-05-24

 * 18:05 ottomata: resume failing cassandra 3 oozie loading jobs, they are also loading to cassandra 2: cassandra-daily-coord-local_group_default_T_top_percountry (0011318-210426062240701-oozie-oozi-C), cassandra-daily-coord-local_group_default_T_unique_devices (0011324-210426062240701-oozie-oozi-C)
 * 18:04 ottomata: suspend failing cassandra 3 oozie loading jobs: cassandra-daily-coord-local_group_default_T_top_percountry (0011318-210426062240701-oozie-oozi-C), cassandra-daily-coord-local_group_default_T_unique_devices (0011324-210426062240701-oozie-oozi-C)
 * 15:19 ottomata: rm -rf /tmp/analytics/* on an-launcher1002 - T283126

2021-05-20

 * 06:05 elukey: kill christinedk's jupyter process on stat1007 (offboarded user) to allow puppet to run

2021-05-19

 * 16:31 razzi: restart turnilo for T279380

2021-05-18

 * 20:22 razzi: restart oozie virtualpageview hourly, virtualpageview druid daily, virtualpageview druid monthly
 * 18:57 razzi: deployed refinery via scap, then deployed to hdfs
 * 18:46 ottomata: removing extraneous python-kafka and python-confluent-kafka deb packages from analytics cluster - T275786
 * 12:40 joal: Add monitoring data in cassandra-3
 * 06:50 joal: run manual unique-devices cassandra job for one day with debug logging
 * 02:20 ottomata: manually running drop_event with --verbose flag

2021-05-17

 * 11:09 joal: Restart cassandra-daily-wf-local_group_default_T_unique_devices-2021-5-4 for testing after host generating failures has been moved out of cluster
 * 10:41 joal: Restart cassandra-daily-wf-local_group_default_T_unique_devices-2021-5-4 for testing after drop/create of keyspace
 * 10:28 joal: Restart cassandra-daily-wf-local_group_default_T_unique_devices-2021-5-4 for testing
 * 09:45 joal: Rerun of cassandra-daily-wf-local_group_default_T_pageviews_per_article_flat-2021-5-15

2021-05-13

 * 11:41 hnowlan: running truncate "local_group_default_T_pageviews_per_article_flat".data; on aqs1012

2021-05-12

 * 15:17 ottomata: dropped event.mediawiki_job_* tables and data directories with mforns - T273789 & T281605
 * 13:56 ottomata: removing refine_mediawiki_job Refine jobs - T281605

2021-05-11

 * 21:00 mforns: finished repeated refinery deployment (matching source v0.1.11) - missed unmerged change
 * 19:59 mforns: repeating refinery deployment (matching source v0.1.11) - missed unmerged change
 * 19:53 mforns: finished refinery deployment (matching source v0.1.11)
 * 18:41 mforns: starting refinery deployment (matching source v0.1.11)
 * 17:26 mforns: deployed refinery-source v0.1.11

2021-05-06

 * 21:27 razzi: sudo manage_principals.py reset-password nahidunlimited --email_address=nsultan@wikimedia.org
 * 13:29 elukey: roll restart of hadoop yarn nodemanagers to pick up TasksMax=26214
 * 12:39 elukey: restart Yarn RMs to apply the dominant resource calculator setting - T281792
 * 12:15 hnowlan: changed eventlogging CNAME to point to eventlog1003
 * 09:19 hnowlan: starting decommission of eventlog1002

2021-05-05

 * 17:36 razzi: create principal for sihe: sudo manage_principals.py create sihe --email_address=silvan.heintze@wikimedia.de
 * 12:22 joal: Reset monitor_refine_eventlogging_legacy after manual rerun of failed job
 * 12:02 joal: rerun cassandra-daily-wf-local_group_default_T_top_percountry-2021-5-4

2021-05-04

 * 20:31 joal: Kill-restart 16 cassandra jobs
 * 20:29 joal: Kill-restart referer-daily job
 * 20:12 joal: Deploy refinery onto HDFS
 * 19:46 joal: Deploying refinery using scap
 * 19:34 joal: refinery v0.1.10 released to Archiva

2021-05-03

 * 14:23 ottomata: stopping all venv based jupyter singleuser servers - T262847
 * 13:59 ottomata: dropped all obsolete (upper cased location) event_sanitized.*_T280813 tables created for T280813
 * 10:43 joal: Add _SUCCESS flag to /wmf/data/raw/mediawiki_private/tables/cu_changes/month=2021-04 after having manually sqooped missing tables
 * 09:57 joal: restart refinery-sqoop-mediawiki-private timer after patch
 * 09:56 joal: Reset refinery-sqoop-mediawiki-private timer
 * 09:38 joal: Drop already sqooped data to restart jobs
 * 08:53 joal: Deploy refinery for sqoop hotfix
 * 08:33 elukey: clean up libmariadb-java from hadoop workers and clients
 * 07:46 joal: Kill prod sqoop job to restart after fix

2021-04-30

 * 07:04 elukey: hue restarted using the database 'hue' instead of 'hue_next'
 * 06:56 elukey: stop hue to allow database rename (hue_next -> hue)

2021-04-29

 * 15:55 razzi: restart hadoop-yarn-nodemanager and hadoop-hdfs-datanode on an-worker1100 for hadoop to recognize new disk /dev/sdl
 * 15:38 ottomata: enabling event_sanitized_main jobs - T273789
 * 14:57 elukey: run mysql_upgrade on an-coord1001 to complete the buster upgrade - T278424
 * 14:44 hnowlan: restored all eventlogging jobs to eventlog1003
 * 14:21 hnowlan: bump eventlog1003 CPUs to 6
 * 13:53 joal: Rerun failed pageview-hourly-wf-2021-4-29-11 and pageview-hourly-wf-2021-4-29-12
 * 13:09 joal: Rerun failed pageview-hourly-wf-2021-4-29-11
 * 12:35 hnowlan: restarting 2 processors on eventlog1002
 * 12:02 hnowlan: stopping processors on eventlog1002 to migrate to eventlog1003
 * 11:50 elukey: manual stop of one of the eventlog processors on eventlog1002 to see if 1003 takes it over
 * 02:59 milimetric: deployed hotfix for referrer job

2021-04-28

 * 17:46 hnowlan: eventlog1003 joined to groups successfully
 * 17:36 razzi: sudo mkdir /srv/log/eventlogging and sudo chown eventlogging:eventlogging /srv/log/eventlogging to work around missing directory puppet error (to be puppetized later)
 * 17:31 razzi: remove deployment cache on eventlog1003: sudo rm -fr /srv/deployment/eventlogging/analytics-cache/
 * 17:26 razzi: manually change /srv/deployment/eventlogging/analytics/.git/DEPLOY_HEAD to deployment1002 on deployment1002 to fix puppet scap error
 * 16:53 hnowlan: stopping deployment-eventlog05 in deployment-prep
 * 14:42 milimetric: deployed refinery with 0.1.9 jars and synced to hdfs
 * 14:30 elukey: chown -R analytics-deploy:analytics-deploy /srv/deployment/analytics on an-coord1001
 * 12:50 ottomata: applied data_purge jobs in analytics test cluster; old data will now be dropped there - T273789

2021-04-27

 * 08:33 elukey: run mysql_upgrade for analytics-meta on an-coord1002 (should be part of the upgrade process) - T278424
 * 07:11 elukey: restart yarn resource managers to pick up yarn label settings

2021-04-26

 * 08:01 elukey: restart hadoop-mapreduce-historyserver on an-master1001 after changes to the yarn ui user
 * 07:36 elukey: re-enable timers after setting the capacity scheduler
 * 07:31 elukey: restart hadoop RM on an-master* to pick up capacity scheduler changes
 * 06:44 elukey: stop timers on an-launcher1002 again as prep step for capacity scheduler changes
 * 06:32 elukey: roll restart of hadoop-yarn-nodemanagers to pick up new log4j settings - T276906
 * 06:25 elukey: re-enable timers
 * 06:20 elukey: reboot an-coord1001 to pick up kernel security settings
 * 05:57 elukey: stop timers on an-launcher1002 to allow a reboot of an-coord1001

2021-04-24

 * 08:03 joal: Rerun failed webrequest-druid-hourly-wf-2021-4-23-13

2021-04-23

 * 14:23 elukey: roll restart an-master100[1,2] daemons to pick up new log4j settings - T276906
 * 10:30 elukey: restart hadoop daemons (NM, DN, JN) on an-worker1080 to further test the new log4j config - T276906
 * 09:12 elukey: change default log4j hadoop config to include rolling gzip appender

2021-04-21

 * 21:30 ottomata: temporarily disabling sanitize_eventlogging_analytics_delayed jobs until T280813 is completed (probably tomorrow)
 * 20:04 ottomata: renaming event_sanitized hive table directories to lower case and repairing table partition paths - T280813
 * 09:28 elukey: roll restart druid-overlord on druid* after an-coord1001 maintenance
 * 09:09 elukey: upgrade hue on an-tool1009 to 4.9.0-2
 * 08:31 elukey: re-enable timers on an-launcher1002 and airflow on an-airflow1001 after maintenance on an-coord1001
 * 07:08 elukey: reimage an-coord1001 after partition reshape (/var/lib/mysql folded in /srv)
 * 06:51 elukey: stop airflow on an-airflow1001
 * 06:49 elukey: stop all services on an-coord1001 as prep step for reimage
 * 06:45 elukey: PURGE BINARY LOGS BEFORE '2021-04-14 00:00:00'; on an-coord1001 to free some space before the reimage
 * 06:00 elukey: stop timers on an-launcher1002 as prep step for an-coord1001 reimage

2021-04-20

 * 15:51 elukey: move analytics-hive.eqiad.wmnet back to an-coord1001 (test on an-coord1002 successful)
 * 15:38 ottomata: deployed refinery to hdfs
 * 13:59 ottomata: deploying refinery and refinery source 0.1.6 for weekly train
 * 13:37 ottomata: deployed aqs
 * 13:16 elukey: failover analytics-hive to an-coord1002 to test the host (running on buster)
 * 12:40 elukey: PURGE BINARY LOGS BEFORE '2021-04-12 00:00:00'; on an-coord1001 - T280367

2021-04-19

 * 16:45 ottomata: make RefineMonitor use analytics keytab - this should be a no-op
 * 16:07 razzi: run kafka preferred-replica-election on jumbo cluster (kafka-jumbo1002)
 * 06:50 elukey: move /var/lib/hadoop/name partition under /srv/hadoop/name on an-master1001 - T265126
 * 05:45 elukey: cleanup Lex's jupyter notebooks on stat1007 to allow puppet to clean up

2021-04-18

 * 07:25 elukey: run "PURGE BINARY LOGS BEFORE '2021-04-11 00:00:00';" on an-coord1001 to free some space - T280367

2021-04-16

 * 15:14 elukey: execute PURGE BINARY LOGS BEFORE '2021-04-09 00:00:00'; on an-coord1001 to free space for /var/lib/mysql - T280367
 * 15:13 elukey: execute PURGE BINARY LOGS BEFORE '2021-04-09 00:00:00';
 * 07:54 elukey: drop all the cloudera packages from our repositories

2021-04-15

 * 21:13 razzi: rebalance kafka partitions for webrequest_text partition 23
 * 14:56 elukey: deploy refinery via scap - weekly train
 * 09:50 elukey: rollback hue on an-tool1009 to 4.8, it seems that 4.9 still has issues
 * 06:32 elukey: move hue.wikimedia.org to an-tool1009 (from analytics-tool1001)
 * 01:36 razzi: rebalance kafka partitions for webrequest_text partitions 21,22

2021-04-14

 * 14:05 elukey: run build/env/bin/hue migrate on an-tool1009 after the hue upgrade
 * 13:10 elukey: rollback hue-next to 4.8 - issues not present in staging
 * 13:00 elukey: upgrade Hue to 4.9 on an-tool1009 - hue-next.wikimedia.org
 * 10:02 elukey: roll restart yarn nodemanagers on hadoop prod (attempt to see if they entered in a weird state, graceful restart)
 * 09:54 elukey: kill long running mediawiki-job refine erroring out application_1615988861843_166906
 * 09:46 elukey: kill application_1615988861843_163186 for the same reason
 * 09:43 elukey: kill application_1615988861843_164387 to see if any improvement to socket consumption is made
 * 09:14 elukey: run "sudo kill `pgrep -f sqoop`" on an-launcher1002 to clean up old test processes still running

2021-04-13

 * 16:17 razzi: rebalance kafka partitions for webrequest_text partitions 19, 20
 * 13:18 ottomata: Refine now uses refinery-job 0.1.4; RefineFailuresChecker has been removed and its function rolled into RefineMonitor -
 * 10:23 hnowlan: deploying aqs with updated cassandra libraries to aqs1004 while depooled
 * 06:17 elukey: kill application application_1615988861843_158645 to free space on analytics1070
 * 06:10 elukey: kill application_1615988861843_158592 on analytics1061 to allow space to recover (truncate of course in D state)
 * 06:05 elukey: truncate logs for application_1615988861843_158592 on analytics1061 - one partition full

2021-04-12

 * 14:21 ottomata: stop using http proxies for produce_canary_events_job - T274951

2021-04-08

 * 16:33 elukey: reboot an-worker1100 again to check if all the disks come up correctly
 * 15:43 razzi: rebalance kafka partitions for webrequest_text partitions 17, 18
 * 15:35 elukey: reboot an-worker1100 to see if it helps with the strange BBU behavior in T279475
 * 14:07 elukey: drop /var/spool/rsyslog from stat1008 - corrupted files due to root partition filled up caused a SEGV for rsyslog
 * 11:14 hnowlan: created aqs user and loaded full schemas into analytics wmcs cassandra
 * 08:35 elukey: apt-get clean on stat1008 to free some space
 * 07:44 elukey: restart hadoop hdfs masters on an-master100[1,2] to apply the new log4j settings for the audit log
 * 06:44 elukey: re-deployed refinery to hadoop-test after fixing permissions on an-test-coord1001

2021-04-07

 * 23:03 ottomata: installing anaconda-wmf-2020.02~wmf5 on remaining nodes - T279480
 * 22:51 ottomata: installing anaconda-wmf-2020.02~wmf5 on stat boxes - T279480
 * 22:47 mforns: finished refinery deployment up to 1dbbd3dfa996d2e970eb1cbc0a63d53040d4e3a3
 * 22:39 mforns: deployment of refinery via scap to hadoop-test failed with Permission denied: '/srv/deployment/analytics/refinery-cache/.config' (deployment to production went fine)
 * 21:44 mforns: starting refinery deploy up to 1dbbd3dfa996d2e970eb1cbc0a63d53040d4e3a3
 * 21:26 mforns: deployed refinery-source v0.1.4
 * 21:25 razzi: sudo apt-get install --reinstall anaconda-wmf on stat1008
 * 20:15 razzi: rebalance kafka partitions for webrequest_text partitions 15, 16
 * 19:53 ottomata: upgrade anaconda-wmf everywhere to 2020.02~wmf4 with fixes for T279480
 * 14:03 hnowlan: setting profile::aqs::git_deploy: true in aqs-test1001 hiera config

2021-04-06

 * 22:34 razzi: rebalance kafka partitions for webrequest_text_13,14
 * 09:37 elukey: reimage an-coord1002 to Debian Buster

2021-04-05

 * 16:07 razzi: remove old hive logs on an-coord1001: sudo rm /var/log/hive/hive-*.log.2021-02-*
 * 14:54 razzi: remove empty /var/log/sqoop on an-launcher1002 (logs go in /var/log/refinery); sudo rmdir /var/log/sqoop
 * 14:51 razzi: rebalance kafka partitions for webrequest_text partitions 11, 12

2021-04-02

 * 16:28 razzi: rebalance kafka partitions for webrequest_text partitions 9,10
 * 16:19 elukey: all the Hadoop test cluster on Debian Buster
 * 07:28 elukey: manual fix for an-worker1080's interface in netbox (xe-4/0/11), moved by mistake to public-1b

2021-04-01

 * 20:27 razzi: restore superset_production from backup superset_production_1617306805.sql
 * 20:14 razzi: manually run bash /srv/deployment/analytics/superset/deploy/create_virtualenv.sh as analytics_deploy on an-tool1010, since somehow it didn't run with scap
 * 20:01 razzi: sudo chown -R analytics_deploy:analytics_deploy /srv/deployment/analytics/superset/venv since it's owned by root and needs to be removed upon deployment
 * 19:54 razzi: dump superset production to an-coord1001.eqiad.wmnet:/home/razzi/superset_production_1617306805.sql just in case
 * 16:50 razzi: rebalance kafka partitions for webrequest_text partitions 7 and 8

2021-03-31

 * 14:18 hnowlan: starting copy of large tables from aqs1007 to aqs1011

2021-03-30

 * 20:25 joal: Kill-Restart data_quality_stats-hourly-bundle after deploy
 * 20:19 joal: Deploying refinery onto HDFS
 * 19:57 joal: Deploying refinery using scap
 * 19:57 joal: Refinery-source released to archiva and new jars committed to refinery (v0.1.3)
 * 17:07 razzi: rebalance kafka partitions for webrequest_text partitions 5 and 6
 * 12:35 hnowlan: Depooling aqs1004 for another transfer of local_group_default_T_pageviews_per_article_flat
 * 12:30 elukey: restart reportupdater-codemirror on an-launcher1002 for T275757
 * 11:30 elukey: ERRATA: upgrade to 2.3.6-2
 * 11:29 elukey: upgrade hive client packages to 2.3.6-1 on an-launcher1002 (already applied to all stat100x)

2021-03-25

 * 15:58 elukey: disable vmemory checks in Yarn nodemanagers on Hadoop
 * 13:53 elukey: systemctl restart performance-asotranking on stat1007 for T276121
 * 08:14 elukey: upgrade hive packages on stat100x to 2.3.6-2 - T276121
 * 08:12 elukey: upgrade hive packages in thirdparty/bigtop15 to 2.3.6-2 for buster-wikimedia

2021-03-24

 * 18:49 elukey: systemctl restart refinery-import-* failed jobs (/mnt/hdfs errors due to me umounting the mountpoint)
 * 18:43 elukey: kill fuse hdfs mount process on an-launcher1002, re-mounted /mnt/hdfs, too many processes in D state
 * 15:46 razzi: rebalance kafka partitions for webrequest_text partitions 3 and 4
 * 05:40 razzi: sudo chown analytics /var/log/refinery/sqoop-mediawiki.log.1 on an-launcher1002 and restart logrotate

2021-03-22

 * 18:12 elukey: drop /srv/.hardsync* to clean up hardlinks not needed
 * 18:07 elukey: run rm -rfv .hardsync.*/archive/public-datasets/* on thorium:/srv to clean up files to drop (didn't work)
 * 18:01 elukey: drop /srv/.hardsync*trash* on thorium - old hardlinks that should have been trashed
 * 15:52 razzi: rebalance kafka partitions for webrequest_text partition 2
 * 09:28 elukey: move the yarn scheduler in hadoop test to capacity

2021-03-19

 * 15:44 razzi: rebalance kafka partitions for webrequest_text partition 1

2021-03-18

 * 19:30 razzi: rename /usr/lib/python2.7/dist-packages/cqlshlib/copyutil.so back
 * 19:29 razzi: temporarily rename /usr/lib/python2.7/dist-packages/cqlshlib/copyutil.so on aqs1004 to fix https://issues.apache.org/jira/browse/CASSANDRA-11574
 * 19:02 ottomata: hdfs dfs -chgrp -R analytics-privatedata-users /wmf/camus - T275396
 * 16:47 razzi: rebalance kafka partitions for webrequest_text partition 0
 * 06:32 elukey: force a manual run of create_virtualenv.sh on an-tool1010 - superset down

2021-03-17

 * 20:45 razzi: release wikistats 2.9.0
 * 20:15 ottomata: install anaconda-wmf 2020.02~wmf3 on analytics cluster clients and workers - T262847
 * 18:10 ottomata: started oozie/cassandra/coord_pageview_top_percountry_daily
 * 15:21 razzi: rebalance kafka partitions for webrequest_upload partitions 22 and 23
 * 13:54 razzi: sudo cookbook sre.hosts.reboot-single an-conf1001.eqiad.wmnet
 * 13:47 razzi: sudo cookbook sre.hosts.reboot-single an-conf1003.eqiad.wmnet
 * 13:41 razzi: sudo cookbook sre.hosts.reboot-single an-conf1002.eqiad.wmnet
 * 13:39 ottomata: deploying refinery for weekly train
 * 13:28 ottomata: deploy aqs as part of train - T207171, T263697
 * 01:28 razzi: rebalance kafka partitions for webrequest_upload partition 21

2021-03-16

 * 14:43 razzi: rebalance kafka partitions for webrequest_upload partition 20
 * 03:17 razzi: rebalance kafka partitions for webrequest_upload partition 19

2021-03-15

 * 16:53 razzi: rebalance kafka partitions for webrequest_upload partition 18
 * 08:25 elukey: stop/start hdfs-balancer on an-launcher1002 with bw 200MB
 * 07:48 joal: Manually start mediawiki-history-drop-snapshot.service to check the run succeeds
 * 07:47 joal: Drop hive wmf.mediawiki_wikitext_history snapshot partitions (2020-08, 2020-09, 2020-10, 2020-11)

2021-03-14

 * 20:49 joal: Manually clean some data ( mediawiki-history-drop-snapshot.service seems not working)
 * 20:46 joal: Force a run of mediawiki-history-drop-snapshot.service to clean up some data

2021-03-12

 * 17:20 elukey: kill duplicate mediawiki-wikitext-history coordinator failing and sending emails to alerts@
 * 07:21 elukey: re-run monitor_refine_event_failure_flags

2021-03-11

 * 22:31 razzi: rebalance kafka partitions for webrequest_upload partition 17
 * 20:20 razzi: disable maintenance mode for matomo1002
 * 20:08 razzi: starting reboot of matomo1002 for kernel upgrade
 * 18:52 razzi: systemctl restart hadoop-hdfs-datanode on analytics1059
 * 18:50 razzi: systemctl restart hadoop-yarn-nodemanager on analytics1059
 * 18:35 razzi: apt-get install parted on analytics1059
 * 15:34 razzi: rebalance kafka partitions for webrequest_upload partition 17
 * 10:52 elukey: drop /home/bsitzmann on all stat100x hosts - T273712
 * 08:25 elukey: drop database dedcode cascade in hive - T276748
 * 08:15 elukey: hdfs dfs -rmr /user/dedcode on an-launcher1002 (data in trash for a month) - T276748

2021-03-10

 * 23:15 razzi: rebalance kafka partitions for webrequest_upload partition 16
 * 18:44 mforns: finished deployment of refinery (session length oozie job)
 * 18:16 mforns: starting deployment of refinery (session length oozie job)
 * 16:54 razzi: rebalance kafka partitions for webrequest_upload partition 15
 * 07:05 elukey: all hadoop worker nodes on Buster
 * 06:28 elukey: force the re-run of refine_eventlogging_legacy - failed due to worker reimage in progress
 * 06:17 elukey: reimage an-worker1111 to buster

2021-03-09

 * 22:00 razzi: rebalance kafka partitions for webrequest_upload partition 14
 * 20:42 elukey: reimaged an-worker1091 to buster
 * 18:26 elukey: reimage an-worker1087 to buster
 * 16:40 elukey: reimage analytics1077 to buster
 * 15:36 razzi: rebalance kafka partitions for webrequest_upload partition 13
 * 15:18 elukey: reimage analytics1072 (hadoop hdfs journal node) to buster
 * 14:29 elukey: drain + reimage an-worker1090/89 to Buster
 * 13:26 elukey: reimage an-worker1102 and an-worker1080 (hdfs journal node) to Buster
 * 12:59 elukey: drain + reimage an-worker1103 to Buster
 * 09:14 elukey: drain + reimage analytics1076 and an-worker1112 to Buster
 * 07:01 elukey: drain + reimage an-worker109[4,5] to Buster

2021-03-08

 * 23:22 razzi: rebalance kafka partitions for webrequest_upload partition 12
 * 18:49 razzi: rebalance kafka partitions for webrequest_upload partition 11
 * 18:11 elukey: drain + reimage an-worker11[15,16] to Buster
 * 17:12 elukey: drain + reimage an-worker11[13,14] to Buster
 * 16:17 elukey: drain + reimage an-worker1109/1110 to Buster
 * 14:54 elukey: drain + reimage an-worker110[7,8] to Buster
 * 14:52 ottomata: altered topics (eqiad|codfw).mediawiki.client.session_tick to have 2 partitions - T276502
 * 13:51 elukey: drain + reimage an-worker110[4,5] to Buster
 * 10:41 elukey: drain + reimage an-worker1104/1089 to Debian Buster
 * 09:19 elukey: drain + reimage an-worker108[3,4] to Buster
 * 08:20 elukey: drain + reimage an-worker108[1,2] to Buster
 * 07:23 elukey: drain + reimage analytics107[4,5] to Buster

2021-03-07

 * 08:00 elukey: "megacli -LDSetProp -ForcedWB -Immediate -Lall -aAll" on analytics1066
 * 07:49 elukey: umount /var/lib/hadoop/data/e on analytics1059 and restart hadoop daemons to exclude failed disk - T276696

2021-03-05

 * 18:30 razzi: run again sudo -i wmf-auto-reimage-host -p T269211 clouddb1021.eqiad.wmnet --new
 * 18:18 razzi: sudo cookbook sre.dns.netbox -t T269211 "Move clouddb1021 to private vlan"
 * 18:17 razzi: re-run interface_automation.ProvisionServerNetwork with private vlan
 * 18:16 razzi: delete non-mgmt interface for clouddb1021
 * 17:07 razzi: sudo -i wmf-auto-reimage-host -p T269211 clouddb1021.eqiad.wmnet --new
 * 16:54 razzi: sudo cookbook sre.dns.netbox -t T269211 "Reimage and rename labsdb1012 to clouddb1021"
 * 16:52 razzi: run script at https://netbox.wikimedia.org/extras/scripts/interface_automation.ProvisionServerNetwork/
 * 16:47 razzi: edit https://netbox.wikimedia.org/dcim/devices/2078/ device name from labsdb1012 to clouddb1021
 * 16:30 razzi: delete non-mgmt interfaces for labsdb1012 at https://netbox.wikimedia.org/dcim/devices/2078/interfaces/
 * 16:28 razzi: rename https://netbox.wikimedia.org/ipam/ip-addresses/734/ DNS name from labsdb1012.mgmt.eqiad.wmnet to clouddb1021.mgmt.eqiad.wmnet
 * 16:08 razzi: sudo cookbook sre.hosts.decommission labsdb1012.eqiad.wmnet -t T269211
 * 15:52 razzi: stop mariadb on labsdb1012
 * 15:39 razzi: rebalance kafka partitions for webrequest_upload partition 10
 * 15:07 elukey: drain + reimage analytics1073 and an-worker1086 to Debian Buster
 * 13:36 elukey: roll restart HDFS Namenodes for the Hadoop cluster to pick up new Xmx settings (https://gerrit.wikimedia.org/r/c/operations/puppet/+/668659)
 * 10:20 elukey: force run of refinery-druid-drop-public-snapshots to check Druid public's performance
 * 10:06 elukey: failover HDFS Namenode from 1002 to 1001 (high GC pauses triggered the HDFS zkfc daemon on 1001 and the failover to 1002)
 * 08:32 elukey: drain + reimage an-worker107[8,9] to Debian Buster (one Journal node included)
 * 07:22 elukey: drain + reimage analytics107[0-1] to debian buster
 * 07:13 elukey: add analytics1066 back with /dev/sdb removed
 * 07:01 elukey: stop hadoop daemons on analytics1066 - disk errors on /dev/sdb after reimage

2021-03-04

 * 21:19 razzi: rebalance kafka partitions for webrequest_upload partition 9
 * 16:27 elukey: drain + reimage analytics106[8,9] to Debian Buster (one is a journalnode)
 * 15:12 elukey: drain + reimage analytics106[6,7] to Debian Buster
 * 14:21 elukey: drain + reimage analytics1065 to Debian Buster
 * 13:32 elukey: drain + reimage analytics10[63,64] to Debian Buster
 * 12:48 elukey: drain + reimage analytics10[61,62] to Debian Buster
 * 10:40 elukey: drain + reimage analytics1059/1060 to Debian Buster
 * 09:32 elukey: reboot an-worker[1097-1101] (GPU workers) to pick up the new kernel (5.10)
 * 09:02 elukey: kill/start mediawiki-geoeditors-monthly to apply backtick change (hive script)
 * 08:48 elukey: deploy refinery to hdfs
 * 08:34 elukey: deploy refinery to fix https://gerrit.wikimedia.org/r/c/analytics/refinery/+/668111
 * 07:38 elukey: reboot an-worker1096 to pick up 5.10 kernel

2021-03-03

 * 17:10 elukey: update druid datasource on aqs (roll restart of aqs on aqs100*)
 * 17:06 razzi: rebalance kafka partitions for webrequest_upload partition 8
 * 14:20 elukey: reimage an-worker1099,1100,1101 (GPU worker nodes) to Debian Buster
 * 10:16 elukey: add an-worker113[2,5-8] to the Analytics Hadoop cluster

2021-03-02

 * 23:15 mforns: finished deployment of refinery to hdfs
 * 21:59 mforns: starting refinery deployment using scap
 * 21:48 mforns: deployed refinery-source v0.1.2
 * 17:26 razzi: rebalance kafka partitions for webrequest_upload partition 7
 * 13:42 elukey: Add an-worker11[19,20-28,30,31] to Analytics Hadoop
 * 10:21 elukey: roll restart druid historicals on druid public to pick up new cache settings (enable segment caching)
 * 10:14 elukey: roll restart druid brokers on druid public to pick up new cache settings (no segment caching, only query caching)
 * 08:01 elukey: manual start of performance-asotranking on stat1007 (requested by Gilles) - T276121

2021-03-01

 * 21:24 razzi: rebalance kafka partitions for webrequest_upload partition 6
 * 18:14 razzi: restart timer that wasn't running on an-worker1101: sudo systemctl restart prometheus-debian-version-textfile.timer
 * 17:40 elukey: reimage an-worker1098 (GPU worker node) to Buster
 * 14:48 elukey: reimage an-worker1097 (gpu node) to debian buster
 * 11:55 elukey: roll restart druid broker on druid-analytics (again) to enable query cache settings (missing config due to typo)
 * 11:34 elukey: roll restart historical daemons (again) on druid-analytics to remove stale config and enable (finally) segment caching.
 * 11:02 elukey: roll restart druid-broker and druid-historical daemons on druid-analytics to pick up new cache settings (disable segment caching on broker and enable it on historicals)
 * 09:12 elukey: restart hadoop daemons on an-worker1112 to pick up the new disk
 * 09:11 elukey: remount /dev/sdl on an-worker1112 (wasn't able to make it fail)

2021-02-26

 * 16:03 razzi: rebalance kafka partitions for webrequest_upload partition 4
 * 12:33 elukey: reimaged an-worker1096 (GPU node) to Debian buster (preserving datanode dirs)
 * 09:52 elukey: reimaged analytics1058 to debian buster (preserving datanode partitions)
 * 07:50 elukey: attempt to reimage analytics1058 (part of the cluster, not a new worker node) to Buster
 * 07:29 elukey: added journalnode partition to all hadoop workers not having it in the Analytics cluster
 * 07:01 elukey: reboot an-worker1099 to clear out kernel soft lockup errors
 * 06:59 elukey: restart datanode on an-worker1099 - soft lockup kernel errors

2021-02-25

 * 17:04 razzi: rebalance kafka partitions for webrequest_upload_3
 * 13:36 elukey: drop /srv/backup/wikistats from thorium
 * 13:35 elukey: drop /srv/backup/backup_wikistats_1 from thorium
 * 11:14 elukey: add an-worker111[7,8] to Analytics Hadoop (were previously backup worker nodes)
 * 08:50 elukey: move analytics-privatedata/search/product to fixed gid/uid on all buster nodes (including airflow/stat100x/launcher)

2021-02-24

 * 19:16 ottomata: service hadoop-yarn-nodemanager start on an-worker1112
 * 16:03 milimetric: deployed refinery
 * 14:09 elukey: roll restart druid brokers on druid public to pick up caffeine cache settings
 * 14:03 elukey: roll restart druid brokers on druid analytics to pick up caffeine cache settings
 * 11:08 elukey: restart druid-broker on an-druid1001 (used by Turnilo) with caffeine cache
 * 09:01 elukey: roll restart druid brokers on druid public - locked
 * 07:47 elukey: change gid/uid for druid + roll restart of all druid nodes

2021-02-23

 * 21:20 ottomata: started nodemanager on an-worker1112
 * 21:15 razzi: rebalance kafka partitions for webrequest_upload partition 2
 * 19:31 elukey: roll out new uid/gid for mapred/druid/analytics/yarn/hdfs for all buster nodes (no op for stretch)
 * 17:47 elukey: change uid/gid for yarn/mapred/analytics/hdfs/druid on stat100x, an-presto100x
 * 15:57 elukey: an-launcher1002's timers restored
 * 15:28 elukey: stop timers on an-launcher1002 to change gid/uid for yarn/hdfs/mapred/analytics/druid and to reboot for kernel updates
 * 15:23 elukey: deploy new uid/gid scheme for yarn/mapred/analytics/hdfs/druid on an-tool100[8,9]
 * 15:22 elukey: deploy new uid/gid scheme for yarn/mapred/analytics/hdfs/druid on an-airflow1001, an-test* buster nodes
 * 15:05 klausman: an-master1001 ~ $ sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chgrp analytics-privatedata-users /wmf/data/raw/webrequest/webrequest_text/hourly/2021/02/22/01/webrequest*
 * 14:51 elukey: drop /srv/backup-1007 on stat1008 to free space

2021-02-22

 * 19:27 ottomata: restart oozie on an-coord1001 to pick up new spark share lib without hadoop jars - T274384
 * 14:38 ottomata: upgrade spark2 on analytics cluster to 2.4.4-bin-hadoop2.6-5~wmf0 (hadoop jars removed) - T274384
 * 14:12 ottomata: upgrade spark2 on an-coord1001 to 2.4.4-bin-hadoop2.6-5~wmf0 (hadoop jars removed), will remove and auto-re add spark-2.4.4-assembly.zip in hdfs after running puppet here
 * 14:07 ottomata: upgrade spark2 on stat1004 to 2.4.4-bin-hadoop2.6-5~wmf0 (hadoop jars removed)
 * 09:01 elukey: reboot stat1005/stat1008 for kernel upgrades

2021-02-19

 * 15:53 elukey: restart oozie again to test another setting for role/admins
 * 15:43 ottomata: installing spark 2.4.4 without hadoop jars on analytics test cluster - T274384
 * 15:31 elukey: restart oozie to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/665352
 * 14:34 joal: rerun mobile_apps-uniques-daily-wf-2021-2-18
 * 09:16 elukey: stop and decom the hadoop backup cluster

2021-02-18

 * 18:38 razzi: rebalance kafka partition for webrequest_upload partition 1
 * 17:27 elukey: an-coord1002 back in service with raid1 configured
 * 15:48 elukey: stop hive/mysql on an-coord1002 as precautionary step to rebuild the md array
 * 13:10 elukey: failover analytics-hive to an-coord1001 after maintenance (DNS change)
 * 11:32 elukey: restart hive daemons on an-coord1001 to pick up new parquet settings
 * 10:07 elukey: hive failover to an-coord1002 to apply new hive settings to an-coord1001
 * 10:00 elukey: restart hive daemons on an-coord1002 (standby coord) to pick up new default parquet file format change
 * 09:46 elukey: upgrade presto to 0.246-wmf on an-coord1001, an-presto*, stat100x

2021-02-17

 * 17:44 razzi: rebalance kafka partitions for webrequest_upload partition 0
 * 16:14 razzi: rebalance kafka partitions for eqiad.mediawiki.api-request
 * 07:04 elukey: reboot stat1004/stat1006/stat1007 for kernel upgrades

2021-02-16

 * 22:31 razzi: rebalance kafka partitions for codfw.mediawiki.api-request
 * 17:44 razzi: rebalance kafka partitions for netflow
 * 17:42 razzi: rebalance kafka partitions for atskafka_test_webrequest_text
 * 07:32 elukey: restart hadoop daemons on an-worker1099 after reconfiguring a new disk
 * 06:58 elukey: restart hdfs/yarn daemons on an-worker1097 to exclude a failed disk

2021-02-15

 * 20:38 mforns: running hdfs fsck to troubleshoot corrupt blocks
 * 17:28 elukey: restart hdfs namenodes on the main cluster to pick up new racking changes (worker nodes from the backup cluster)

2021-02-14

 * 09:38 joal: Restart and backfill mediacount and mediarequest, and backfill mediarequest-AQS and mediacount archive
 * 09:38 joal: deploy refinery onto hdfs
 * 09:14 joal: Deploy hotfix for mediarequest and mediacount

2021-02-12

 * 19:19 milimetric: deployed refinery with query syntax fix for the last broken cassandra job and an updated EL whitelist
 * 18:34 razzi: rebalance kafka partitions for atskafka_test_webrequest_text
 * 18:31 razzi: rebalance kafka partitions for __consumer_offsets
 * 17:48 joal: Rerun wikidata-articleplaceholder_metrics-wf-2021-2-10
 * 17:47 joal: Rerun wikidata-specialentitydata_metrics-wf-2021-2-10
 * 17:43 joal: Rerun wikidata-json_entity-weekly-wf-2021-02-01
 * 17:08 elukey: reboot presto workers for kernel upgrade
 * 16:32 mforns: finished deployment of analytics-refinery
 * 15:26 mforns: started deployment of analytics-refinery
 * 15:16 elukey: roll restart druid broker on druid-public to pick up new settings
 * 07:54 elukey: roll restart of druid brokers on druid-public - locked after scheduled datasource deletion
 * 07:47 elukey: force a manual run of refinery-druid-drop-public-snapshots on an-launcher1002 (3d before its natural start) - controlled execution to see how druid + 3xdataset replication reacts

2021-02-11

 * 14:26 joal: Restart oozie API job after spark sharelib fix (start: 2021-02-10T18:00)
 * 14:20 joal: Rerun failed clickstream instance 2021-01 after sharelib fix
 * 14:16 joal: Restart oozie after having fixed the spark-2.4.4 sharelib
 * 14:12 joal: Fix oozie sharelib for spark-2.4.4 by copying oozie-sharelib-spark-4.3.0.jar onto the spark folder
 * 02:19 milimetric: deployed again to fix old spelling error :) referererererer
 * 00:05 milimetric: deployed refinery and synced to hdfs, restarting cassandra jobs gently

2021-02-10

 * 21:46 razzi: rebalance kafka partitions for eqiad.mediawiki.cirrussearch-request
 * 21:10 razzi: rebalance kafka partitions for codfw.mediawiki.cirrussearch-request
 * 19:11 elukey: drop /user/oozie/share + chmod o+rx -R /user/oozie/share + restart oozie
 * 17:56 razzi: rebalance kafka partitions for eventlogging-client-side
 * 01:07 milimetric: deployed refinery with some fixes after BigTop upgrade, will restart three coordinators right now

2021-02-09

 * 22:04 razzi: rebalance kafka partitions for eqiad.resource-purge
 * 20:51 joal: Rerun webrequest-load-coord-[text|upload] for 2021-02-09T07:00 after data was imported to camus
 * 20:50 razzi: rebalance kafka partitions for codfw.resource-purge
 * 20:31 joal: Rerun webrequest-load-coord-[text|upload] for 2021-02-09T06:00 after data was imported to camus
 * 16:30 elukey: restart datanode on an-worker1100
 * 16:14 ottomata: restart datanode on analytics1059 with 16g heap
 * 16:08 ottomata: restart datanode on an-worker1080 with 16g heap
 * 15:58 ottomata: restart datanode on analytics1058
 * 15:55 ottomata: restart datanode on an-worker1115
 * 15:38 elukey: restart namenode on an-master1002
 * 15:01 elukey: restart an-worker1104 with 16g heap size to allow bootstrap
 * 15:01 elukey: restart an-worker1103 with 16g heap size to allow bootstrap
 * 14:57 elukey: restart an-worker1102 with 16g heap size to allow bootstrap
 * 14:54 elukey: restart an-worker1090 with 16g heap size to allow bootstrap
 * 14:50 elukey: restart analytics1072 with 16g heap size to allow bootstrap
 * 14:50 elukey: restart analytics1069 with 16g heap size to allow bootstrap
 * 14:08 elukey: restart analytics1069's datanode with bigger heap size
 * 13:39 elukey: restart hdfs-datanode on analytics10[65,69] - failed to bootstrap due to issues reading datanode dirs
 * 13:38 elukey: restart hdfs-datanode on an-worker1080 (test canary - not showing up in block report)
 * 10:04 elukey: stop mysql replication an-coord1001 -> an-coord1002, an-coord1001 -> db1108
 * 08:29 elukey: leave hdfs safemode to let distcp do its job
 * 08:25 elukey: set hdfs safemode on for the Analytics cluster
 * 08:19 elukey: umount /mnt/hdfs from all nodes using it
 * 08:16 joal: Kill flink yarn app
 * 08:08 elukey: stop jupyterhub on stat100x
 * 08:07 elukey: stop hive on an-coord100[1,2] - prep step for bigtop upgrade
 * 08:05 elukey: stop oozie an-coord1001 - prep step for bigtop upgrade
 * 08:03 elukey: stop presto-server on an-presto100x and an-coord1001 - prep step for bigtop upgrade
 * 07:28 elukey: roll out new apt bigtop changes across all hadoop-related nodes
 * 07:19 joal: Killing yarn users applications
 * 07:12 elukey: stop airflow on an-airflow1001 (prep step for bigtop)
 * 07:09 elukey: stop namenode on an-worker1124 (backup cluster), create two new partitions for backup and namenode, restart namenode
 * 06:14 elukey: disable timers on labstore nodes (prep step for bigtop)
 * 06:11 elukey: disable systemd timers on an-launcher1002 (prep step for bigtop)

2021-02-08

 * 22:29 elukey: the previous entry was related to the Hadoop backup cluster
 * 22:29 elukey: hdfs master failover an-worker1118 -> an-worker1124, created dedicated partition for /var/lib/hadoop/name (root partition filled up), restarted namenode on 1118 (now recovering edit logs)
 * 18:44 razzi: rebalance kafka partitions for eventlogging_VirtualPageView
 * 15:12 ottomata: set kafka topic retention to 31 days for (eqiad|codfw.rdf-streaming-updater.mutation) in kafka main-eqiad and main-codfw - T269619

2021-02-05

 * 20:31 razzi: rebalance kafka partitions for eventlogging_SearchSatisfaction
 * 19:11 razzi: rebalance kafka partitions for eqiad.mediawiki.client.session_tick
 * 18:38 razzi: rebalance kafka partitions for codfw.mediawiki.client.session_tick
 * 17:53 razzi: rebalance kafka partitions for codfw.resource_change
 * 17:53 razzi: rebalance kafka partitions for eqiad.resource_change
 * 11:31 elukey: restart turnilo to pick up changes to the config (two new attributes to webrequest_128)

2021-02-04

 * 19:27 razzi: rebalance kafka partitions for eqiad.mediawiki.job.wikibase-addUsagesForPage
 * 19:27 razzi: rebalance kafka partitions for codfw.mediawiki.job.wikibase-addUsagesForPage
 * 19:22 razzi: rebalance kafka partitions for eventlogging_MobileWikiAppLinkPreview
 * 17:04 elukey: restart presto coordinator on an-coord1001 to pick up logging settings (log to http-request.log)
 * 17:02 elukey: roll restart presto on an-presto* to finally get http-request.log
 * 11:28 elukey: move aqs druid snapshot config to 2021-01
 * 09:01 elukey: restart superset and disable memcached caching
 * 08:08 elukey: move an-worker1117 from Hadoop Analytics to Hadoop Backup

2021-02-03

 * 21:38 razzi: rebalance kafka partitions for eventlogging_MobileWikiAppLinkPreview
 * 20:04 razzi: rebalance kafka partitions for eqiad.mediawiki.job.RecordLintJob
 * 20:03 razzi: rebalance kafka partitions for codfw.mediawiki.job.RecordLintJob
 * 18:28 razzi: rebalance kafka partitions for eqiad.mediawiki.job.refreshLinks
 * 18:28 razzi: rebalance kafka partitions for codfw.mediawiki.job.refreshLinks
 * 17:52 razzi: rebalance kafka partitions for eqiad.wdqs-internal.sparql-query
 * 17:50 razzi: rebalance kafka partitions for codfw.wdqs-internal.sparql-query
 * 14:48 elukey: sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chmod -R o+rx /wmf/data/wmf/mediawiki/history_reduced
 * 14:45 elukey: sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chmod o+rx /wmf/data/wmf/mediawiki
 * 14:40 elukey: kill + restart webrequest-druid-{hourly,daily} to pick up new changes after refinery deployment
 * 14:30 elukey: kill + relaunch webrequest_load to pick up new changes after refinery deployment
 * 14:28 elukey: relaunch edit-hourly-druid-coord 02-2021 after chmods
 * 14:25 elukey: sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chmod -R o+rx /wmf/data/wmf/edit
 * 14:24 elukey: sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chmod o+rx /wmf/data/wmf
 * 10:57 elukey: deploy refinery to hdfs
 * 10:36 elukey: released Refinery Source 0.1.0
 * 08:54 elukey: drop v0.1.x tags from Refinery source upstream repo
 * 08:48 elukey: drop refinery source artifacts v0.1.2 from Archiva

2021-02-02

 * 20:39 razzi: rebalance kafka partitions for eqiad.mediawiki.job.htmlCacheUpdate
 * 20:39 razzi: rebalance kafka partitions for codfw.mediawiki.job.htmlCacheUpdate
 * 19:29 ottomata: manually altered event.codemirrorusage to fix incompatible type change: https://phabricator.wikimedia.org/T269986#6797385
 * 19:28 elukey: change archiva-ci password in pwstore, archiva and jenkins
 * 17:53 razzi: rebalance kafka partitions for eqiad.wdqs-external.sparql-query
 * 17:17 razzi: rebalance kafka partitions for eventlogging_CentralNoticeImpression
 * 16:39 razzi: rebalance kafka partitions for eventlogging_InukaPageView
 * 08:42 elukey: decommission an-worker1117 from the Hadoop cluster, to move it under the Backup cluster

2021-02-01

 * 21:27 razzi: rebalance kafka partitions for eqiad.mediawiki.job.cdnPurge
 * 21:27 razzi: rebalance kafka partitions for codfw.mediawiki.job.cdnPurge
 * 20:51 razzi: rebalance kafka partitions for eventlogging_PaintTiming
 * 19:01 razzi: rebalance kafka partitions for eventlogging_LayoutShift
 * 18:58 razzi: rebalance kafka partitions for eqiad.mediawiki.job.recentChangesUpdate
 * 18:58 razzi: rebalance kafka partitions for codfw.mediawiki.job.recentChangesUpdate
 * 18:23 razzi: rebalance kafka partitions for codfw.mediawiki.recentchange
 * 18:09 razzi: rebalance kafka partitions for eqiad.resource_change

2021-01-29

 * 20:23 razzi: rebalance kafka partitions for eventlogging_NavigationTiming
 * 19:30 razzi: rebalance kafka partitions for eqiad.mediawiki.revision-score
 * 19:29 razzi: rebalance kafka partitions for codfw.mediawiki.revision-score
 * 19:14 razzi: rebalance kafka partitions for eventlogging_CpuBenchmark
 * 19:11 razzi: rebalance kafka partitions for eqiad.mediawiki.page-links-change
 * 19:10 razzi: rebalance kafka partitions for codfw.mediawiki.page-links-change
 * 14:33 elukey: rollback presto upgrade, workers seem unable to announce themselves to the query coordinator
 * 14:08 elukey: upgrade presto to 0.246 (from 0.226) on an-presto1001 - worker node
 * 14:02 elukey: upgrade presto to 0.246 (from 0.226) on an-coord1001 - query coordinator
 * 07:44 joal: Copy /wmf/data/event_sanitized to backup cluster (T272846)

2021-01-28

 * 22:23 razzi: rebalance kafka partitions for eqiad.mediawiki.page-links-change
 * 22:22 razzi: rebalance kafka partitions for codfw.mediawiki.page-links-change
 * 22:01 razzi: rebalance kafka partitions for eventlogging_QuickSurveyInitiation
 * 21:13 razzi: rebalance kafka partitions for topic eventlogging_EditAttemptStep
 * 19:49 mforns: finished deployment of refinery (for v0.0.146)
 * 18:57 mforns: starting deployment of refinery (for v0.0.146)
 * 18:54 mforns: deployed refinery-source v0.0.146 using Jenkins
 * 18:45 razzi: rebalance kafka partitions for topic eqiad.mediawiki.job.ORESFetchScoreJob
 * 18:42 razzi: rebalance kafka partitions for topic codfw.mediawiki.job.ORESFetchScoreJob
 * 18:22 razzi: rebalance kafka partitions for topic codfw.mediawiki.job.wikibase-InjectRCRecords
 * 17:26 razzi: rebalance kafka partitions for topic eqiad.mediawiki.revision-tags-change
 * 17:26 razzi: rebalance kafka partitions for topic codfw.mediawiki.revision-tags-change
 * 16:32 razzi: rebalance kafka partitions for topic eventlogging_CodeMirrorUsage
 * 16:16 elukey: manual failover of hdfs namenode active/master from an-master1002 to an-master1001

2021-01-27

 * 13:02 joal: Copy /wmf/data/event to backup cluster (30Tb) - T272846
 * 11:15 elukey: add client_port and debug fields to X-Analytics in webrequest varnishkafka streams

2021-01-26

 * 16:39 razzi: reboot kafka-test1006 for kernel upgrade
 * 09:37 elukey: reboot dbstore1005 for kernel upgrades
 * 09:35 joal: Copy /wmf/data/discovery to backup cluster (21Tb) - T272846
 * 09:31 elukey: reboot dbstore1003 for kernel upgrades
 * 09:15 elukey: reboot dbstore1004 for kernel upgrades
 * 09:07 joal: Copy /wmf/refinery to backup cluster (1.1Tb) - T272846
 * 09:01 joal: Copy /wmf/discovery to backup cluster (120Gb) - T272846
 * 08:42 joal: Copy /wmf/camus to backup cluster (120Gb) - T272846

2021-01-25

 * 20:42 razzi: rebalance kafka partitions for eqiad.mediawiki.page-properties-change.json
 * 20:41 razzi: rebalance kafka partitions for codfw.mediawiki.page-properties-change
 * 18:58 razzi: rebalance kafka partitions for eventlogging_ExternalGuidance
 * 18:53 razzi: rebalance kafka partitions for eqiad.mediawiki.job.ChangeDeletionNotification
 * 17:13 joal: Copy /user to backup cluster (92Tb) - T272846
 * 16:23 elukey: drain+restart cassandra on aqs1004 to pick up the new openjdk (canary)
 * 16:21 elukey: restart yarn and hdfs daemon on analytics1058 (canary node for new openjdk)
 * 12:25 joal: Copy /wmf/data/archive to backup cluster (32Tb) - T272846
 * 10:20 elukey: restart memcached on an-tool1010 to flush superset's cache
 * 10:18 elukey: restart superset to remove druid datasources support - T263972
 * 09:57 joal: Changing ownership of archive WMF files to analytics:analytics-privatedata-users after update of oozie jobs

2021-01-22

 * 17:38 mforns: finished refinery deploy to HDFS
 * 17:28 mforns: restarted refine_event and refine_eventlogging_legacy in an-launcher1002
 * 17:11 mforns: starting refinery deploy using scap
 * 17:09 mforns: bumped up refinery-source jar version to 0.0.145 in puppet for Refine and DruidLoad jobs
 * 16:44 mforns: Deployed refinery-source v0.0.145 using jenkins
 * 09:48 joal: Raise druid-public default replication-factor from 2 to 3

2021-01-21

 * 18:54 razzi: rebooting nodes for druid public cluster via cookbook
 * 16:49 ottomata: installed libsnappy-dev and python3-snappy on webperf1001
 * 15:17 joal: Kill mediawiki-wikitext-history-wf-2020-12 as it was stuck and failed
 * 11:19 elukey: block UA with 'python-requests.*' hitting AQS via Varnish

2021-01-20

 * 21:48 milimetric: refinery deployed, synced to hdfs, ready to restart 53 oozie jobs, will do so slowly over the next few hours
 * 18:11 joal: Release refinery-source v0.0.144 to archiva with Jenkins

2021-01-15

 * 09:21 elukey: roll restart druid brokers on druid public - stuck after datasource drop

2021-01-11

 * 07:26 elukey: execute 'sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chmod o+rx /wmf/data/archive/mediawiki' on launcher to fix dir perms

2021-01-09

 * 15:11 elukey: restart timers 'analytics-*' on labstore100[6,7] to apply new permission settings
 * 08:31 elukey: restart the failed hdfs rsync timers on labstore100[6,7] to kick off the remaining jobs
 * 08:30 elukey: execute hdfs chmod o+x of /wmf/data/archive/projectview /wmf/data/archive/projectview/legacy /wmf/data/archive/pageview/legacy to unblock hdfs rsyncs
 * 08:24 elukey: execute "sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chmod o+rx /wmf/data/archive/pageview" to unblock labstore hdfs rsyncs
 * 08:21 elukey: execute "sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chmod o+rx /wmf/data/archive/geoeditors" to unblock labstore hdfs rsync

2021-01-08

 * 18:54 joal: Restart jobs for permissions-fix (clickstream, mediacounts-archive, geoeditors-public_monthly, geoeditors-yearly, mobile_app-uniques-[daily|monthly], pageview-daily_dump, pageview-hourly, projectview-geo, unique_devices-[per_domain|per_project_family]-[daily|monthly])
 * 18:14 joal: Restart projectview-hourly job (permissions test)
 * 18:03 joal: Deploy refinery onto HDFS
 * 17:50 joal: deploy refinery with scap
 * 10:01 elukey: restart varnishkafka-webrequest on cp5001 - timeouts to kafka-jumbo1001, librdkafka does not seem to recover very well
 * 08:46 elukey: force restart of check_webrequest_partitions.service on an-launcher1002
 * 08:44 elukey: force restart of monitor_refine_eventlogging_legacy_failure_flags.service
 * 08:18 elukey: raise default max executor heap size for Spark refine to 4G

2021-01-07

 * 18:22 elukey: chown -R /tmp/analytics analytics:analytics-privatedata-users (tmp dir for data quality stats tables)
 * 18:21 elukey: "sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chown -R analytics:analytics-privatedata-users /wmf/data/wmf/data_quality_stats"
 * 18:10 elukey: temporarily disable hdfs-cleaner.timer to prevent /tmp/DataFrameToDruid from being dropped
 * 18:08 elukey: chown -R /tmp/DataFrameToDruid analytics:druid (was: analytics:hdfs) on hdfs to temporarily unblock Hive2Druid jobs
 * 16:31 elukey: remove /etc/mysql/conf.d/research-client.cnf from stat100x nodes
 * 15:40 elukey: deprecate the 'researchers' posix group for good
 * 11:24 elukey: execute "sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chmod -R o-rwx /wmf/data/event_sanitized" to fix some file permissions as well
 * 10:36 elukey: execute "sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chmod -R o-rwx /wmf/data/event" on an-master1001 to fix some file permissions (an-launcher executed timers during the past hours without the new umask) - T270629
 * 09:37 elukey: forced re-run of monitor_refine_event_failure_flags.service on an-launcher1002 to clear alerts
 * 08:26 joal: Rerunning 4 failed refine jobs (mediawiki_cirrussearch_request, day=6/hour=20|21, day=7/hour=0|2)
 * 08:14 elukey: re-enable puppet on an-launcher1002 to apply new refine memory settings
 * 07:59 elukey: re-enabling all oozie jobs previously suspended
 * 07:54 elukey: restart oozie on an-coord1001

2021-01-06

 * 20:42 ottomata: starting remaining refine systemd timers
 * 20:19 ottomata: restarted eventlogging_to_druid timers
 * 20:19 ottomata: restarted drop systemd timers
 * 20:18 ottomata: restarted reportupdater timers
 * 20:14 ottomata: re-starting camus systemd timers
 * 16:45 razzi: restart yarn nodemanagers
 * 16:08 razzi: manually failover hdfs haadmin from an-master1002 to an-master1001
 * 15:53 ottomata: stopping analytics systemd timers on an-launcher1002

2021-01-05

 * 21:32 ottomata: bumped mediawiki history snapshot version in AQS
 * 20:45 ottomata: Refine changes: event tables now have is_wmf_domain, canary events are removed, and corrupt records will result in a better monitoring email
 * 20:43 razzi: deploy aqs as part of train
 * 19:17 razzi: deploying refinery for weekly train
 * 09:29 joal: Manually reload unique-devices monthly in cassandra to fix T271170

2021-01-04

 * 22:22 razzi: reboot an-test-coord1001 to upgrade kernel
 * 14:24 elukey: deprecate the analytics-users group

2021-01-03

 * 14:11 milimetric: reset-failed refinery-sqoop-whole-mediawiki.service
 * 14:10 milimetric: manual sqoop finished, logs on an-launcher1002 at /var/log/refinery/sqoop-mediawiki.log and /var/log/refinery/sqoop-mediawiki-production.log

2021-01-01

 * 14:54 milimetric: deployed refinery hotfix for sqoop problem, after testing on three small wikis

2020-12-29

 * 09:18 elukey: restart hue to pick up analytics-hive endpoint settings

2020-12-23

 * 15:53 ottomata: point analytics-hive.eqiad.wmnet back at an-coord1001 - T268028 T270768

2020-12-22

 * 19:35 elukey: restart hive daemons on an-coord1001 to pick up new settings
 * 18:13 elukey: failover analytics-hive.eqiad.wmnet to an-coord1002 (to allow maintenance on an-coord1001)
 * 18:07 elukey: restart hive server on an-coord1002 (current standby - no traffic) to pick up the new config (use the local metastore as opposed to the one analytics-hive points to)
 * 17:00 mforns: Deployed refinery as part of weekly train (v0.0.142)
 * 16:42 mforns: Deployed refinery-source v0.0.142
 * 15:00 razzi: stopping superset server on analytics-tool1004
 * 10:36 elukey: restart presto coordinator to pick up analytics-hive settings
 * 10:25 elukey: failover analytics-hive.eqiad.wmnet to an-coord1001
 * 09:56 elukey: restart hive daemons on an-coord1001 to pick up analytics-hive settings
 * 07:27 elukey: reboot stat100[4-8] (analytics hadoop clients) for kernel upgrades
 * 07:23 elukey: move all analytics clients (spark refine, stat100x, hive-site.xml on hdfs, etc..) to analytics-hive.eqiad.wmnet

2020-12-18

 * 14:10 elukey: restore stat1004 to its previous settings for kerberos credential cache

2020-12-17

 * 14:54 klausman: Updated all stat100x machines to now sport kafkacat 1.6.0, backported from Bullseye
 * 11:04 elukey: wipe/reimage the hadoop test cluster to start clean for CDH (and then test the upgrade to bigtop 1.5)

2020-12-16

 * 21:07 joal: Kill-restart virtualpageview-hourly-coord and projectview-geo-coord with manually updated jar versions (old versions in conf)
 * 19:35 joal: Kill-restart all oozie jobs belonging to analytics except mediawiki-wikitext-history-coord
 * 18:52 joal: Kill-restart cassandra loading oozie jobs
 * 18:37 joal: Kill-restart wikidata-entity, wikidata-item_page_link and mobile_apps-session_metrics oozie jobs
 * 18:31 joal: Kill-rerun data-quality bundles
 * 16:17 razzi: dropping and re-creating superset staging database
 * 08:13 joal: Manually push updated pageview whitelist to HDFS

2020-12-15

 * 20:24 joal: Kill restart webrequest_load oozie job after deploy
 * 19:43 joal: Deploy refinery onto HDFS
 * 19:14 joal: Scap deploy refinery
 * 18:26 joal: Release refinery-source v0.0.141

2020-12-14

 * 19:09 razzi: restart hadoop-yarn-resourcemanager on an-master1002 to promote an-master1001 to active again
 * 19:08 razzi: restarted hadoop-yarn-resourcemanager on an-master1001 again by mistake
 * 19:02 razzi: restart hadoop-yarn-resourcemanager on an-master1002
 * 18:54 razzi: restart hadoop-yarn-resourcemanager on an-master1001
 * 18:43 razzi: applying yarn config change via `sudo cumin "A:hadoop-worker" "systemctl restart hadoop-yarn-nodemanager" -b 10`
 * 14:58 elukey: stat1004's krb credential cache moved under /run (shared between notebooks and ssh/bash) - T255262
 * 07:55 elukey: roll restart yarn daemons to pick up https://gerrit.wikimedia.org/r/c/operations/puppet/+/649126

2020-12-11

 * 19:30 ottomata: now ingesting Growth EventLogging schemas using event platform refine job; they are exclude-listed from eventlogging-processor. - T267333
 * 07:04 elukey: roll restart presto cluster to pick up new jvm xmx settings
 * 06:57 elukey: restart presto on an-presto1003 since all the memory on the host was occupied, and puppet failed to run

2020-12-10

 * 12:29 joal: Drop-Recreate-Repair wmf_raw.mediawiki_image table

2020-12-09

 * 20:34 elukey: execute on mysql:an-coord1002 "set GLOBAL replicate_wild_ignore_table='superset_staging.%'" to avoid replication for superset_staging from an-coord1002
 * 07:12 elukey: re-enable timers after maintenance
 * 07:07 elukey: restart hive-server2 on an-coord1002 for consistency
 * 07:05 elukey: restart hive metastore and server2 on an-coord1001 to pick up settings for DBTokenStore
 * 06:50 elukey: stop timers on an-launcher1002 as prep step to restart hive

2020-12-07

 * 18:51 joal: Test mediawiki-wikitext-history new sizing settings
 * 18:43 razzi: kill testing flink job: sudo -u hdfs yarn application -kill application_1605880843685_61049
 * 18:42 razzi: truncate /var/lib/hadoop/data/h/yarn/logs/application_1605880843685_61049/container_e27_1605880843685_61049_01_000002/taskmanager.log on an-worker1011

2020-12-03

 * 22:34 milimetric: updated mw history snapshot on AQS
 * 07:09 elukey: manual reset-failed refinery-sqoop-whole-mediawiki.service on an-launcher1002 (job launched manually)

2020-12-02

 * 21:37 joal: Manually create _SUCCESS flags for banner history monthly jobs to kick off (they'll be deleted by the purge tomorrow morning)
 * 21:16 joal: Rerun timed out jobs after oozie config got updated (mediawiki-geoeditors-yearly-coord and banner_activity-druid-monthly-coord)
 * 20:49 ottomata: deployed eventgate-analytics-external with refactored stream config, hopefully this will work around the canary events alarm bug - T266573
 * 18:20 mforns: finished netflow migration wmf->event
 * 17:50 mforns: starting netflow migration wmf->event
 * 17:50 joal: Manually start refinery-sqoop-production on an-launcher1002 to cover for coupled runs failure
 * 16:50 mforns: restarted turnilo to clear deleted datasource
 * 16:47 milimetric: faked _SUCCESS flag for image table to allow daisy-chained mediawiki history load dependent coordinators to keep running
 * 07:49 elukey: restart oozie to pick up new settings for T264358

2020-12-01

 * 19:43 razzi: deploy refinery with refinery-source v0.0.140
 * 10:50 elukey: restart oozie to pick up new logging settings
 * 09:03 elukey: clean up old hive metastore/server logs on an-coord1001 to free space

2020-11-30

 * 17:51 joal: Deploy refinery onto hdfs
 * 17:49 joal: Kill-restart mediawiki-history-load job after refactor (1 coordinator per table) and tables addition
 * 17:32 joal: Kill-restart mediawiki-history-reduced job for druid-public datasource number of shards update
 * 17:32 joal: Deploy refinery using scap for naming hotfix
 * 15:29 ottomata: migrated EventLogging schemas SpecialMuteSubmit and SpecialInvestigate to EventGate - T268517
 * 14:56 joal: Deploying refinery onto hdfs
 * 14:49 joal: Create new hive tables for newly sqooped data
 * 14:45 joal: Deploy refinery using scap
 * 09:08 elukey: force execution of refinery-drop-pageview-actor-hourly-partitions on an-launcher1002 (after args fixup from Joseph)

2020-11-27

 * 14:51 elukey: roll restart zookeeper on druid* nodes for openjdk upgrades
 * 10:29 elukey: restart eventlogging_to_druid_editattemptstep_hourly on an-launcher1002 (failed) to see if the hive metastore works
 * 10:27 elukey: restart oozie and presto-server on an-coord1001 for openjdk upgrades
 * 10:27 elukey: restart hive server and metastore on an-coord1001 - openjdk upgrades + problem with high GC caused by a job
 * 08:05 elukey: roll restart druid public cluster for openjdk upgrades

2020-11-26

 * 13:52 elukey: roll restart druid daemons on druid analytics to pick up new openjdk upgrades
 * 13:08 elukey: force umount/mount of all /mnt/hdfs mountpoints to pick up openjdk upgrades
 * 09:07 elukey: force purging https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/user/Diego_Maradona/daily/2020110500/2020112500 from caches
 * 08:40 elukey: roll restart cassandra on aqs10* for openjdk upgrades

2020-11-25

 * 19:04 joal: Killing job application_1605880843685_18336 as it consumes too much resources
 * 18:40 elukey: restart turnilo to pick up new netflow config changes
 * 16:46 elukey: move analytics1066 to C3
 * 16:11 elukey: move analytics1065 to C3
 * 15:38 elukey: move stat1004 to A5

2020-11-24

 * 19:33 elukey: kill and restart webrequest_load bundle to pick up analytics-hive.eqiad.wmnet settings
 * 19:05 elukey: deploy refinery to hdfs (even if not really needed)
 * 18:47 elukey: deploy analytics refinery as part of the regular weekly train
 * 15:38 elukey: move druid1005 from rack B7 to B6
 * 14:59 elukey: move analytics1072 from rack B2 to B3
 * 09:16 elukey: drop principals and keytabs for analytics10[42-57] - T267932

2020-11-21

 * 08:10 elukey: remove big stderrlog file in /var/lib/hadoop/data/d/yarn/logs/application_1605880843685_1450 on an-worker1110
 * 08:05 elukey: remove big stderrlog file in /var/lib/hadoop/data/e/yarn/logs/application_1605880843685_1450 on an-worker1105

2020-11-20

 * 21:09 razzi: truncate /var/lib/hadoop/data/u/yarn/logs/application_1605880843685_0581/container_e27_1605880843685_0581_01_000171/stderr logfile on an-worker1098

2020-11-19

 * 16:35 elukey: roll restart hadoop workers for openjdk upgrades
 * 07:07 elukey: roll restart java daemons on Hadoop test for openjdk upgrades
 * 06:50 elukey: restart refinery-import-siteinfo-dumps.service on an-launcher1002

2020-11-18

 * 09:22 elukey: set dns_canonicalize_hostname = false to all kerberos clients

2020-11-17

 * 23:09 mforns: restarted browser general oozie job
 * 23:00 mforns: finished deploying refinery (regular weekly deployment train)
 * 22:36 mforns: deploying refinery (regular weekly deployment train)
 * 15:11 elukey: drop backup@localhost user from an-coord1001's mariadb meta instance (not used anymore)
 * 15:09 elukey: drop 'dump' user from an-coord1001's analytics meta (related to dbprov hosts, previous attempts before db1108)
 * 14:57 elukey: shutdown stat1008 for ram expansion
 * 11:28 elukey: set analytics meta instance on an-coord1002 as replica of an-coord1001

2020-11-16

 * 10:41 klausman: about to update stat1008 to new kernel and rocm
 * 09:13 joal: Rerun webrequest-refine for hours 0 to 6 of day 2020-11-16 - This will prevent webrequest-druid-daily from getting loaded with incoherent data due to bucketing change
 * 08:45 joal: Correct webrequest job directly on HDFS and restart webrequest bundle oozie job
 * 08:43 joal: Kill webrequest bundle to correct typo
 * 08:31 joal: Restart webrequest bundle oozie job with update
 * 08:25 joal: Deploying refinery onto HDFS
 * 08:13 joal: Deploying refinery with scap
 * 08:01 joal: Repair wmf.webrequest hive table partitions
 * 08:01 joal: Recreate wmf.webrequest hive table with new partitioning
 * 08:00 joal: Drop webrequest table
 * 07:55 joal: Kill webrequest-bundle oozie job for table update

2020-11-15

 * 08:27 elukey: truncate -s 10g /var/lib/hadoop/data/n/yarn/logs/application_1601916545561_173219/container_e25_1601916545561_173219_01_000177/stderr on an-worker1100
 * 08:21 elukey: sudo truncate -s 10g /var/lib/hadoop/data/c/yarn/logs/application_1601916545561_173219/container_e25_1601916545561_173219_01_000019/stderr on an-worker1098

2020-11-10

 * 19:32 joal: Deploy wikistats2 v2.8.2
 * 18:16 joal: Releasing refinery-source v0.0.139 to archiva
 * 14:48 mforns: restarted data quality stats daily bundle with new metric
 * 13:30 elukey: add hive-server2 to an-coord1002
 * 07:40 elukey: upgrade hue to hue_4.8.0-2 on an-tool1009

2020-11-09

 * 18:34 elukey: drop hdfs-balancer multi-gb log file from an-launcher1002
 * 18:33 elukey: manually start logrotate.timer apt.timer etc.. on an-launcher1002 - stopped since the last time that I have disabled timers
 * 17:48 razzi: reboot an-coord1002 to see if it updates kernel cpu instructions

2020-11-08

 * 06:31 elukey: truncate huge log file on an-worker1103 for app id application_1601916545561_147041

2020-11-06

 * 19:00 mforns: launched backfilling of data quality stats for os_family_entropy_by_access_method

2020-11-05

 * 18:32 razzi: shutting down kafka-jumbo1005 to allow dcops to upgrade NIC
 * 17:47 razzi: shutting down kafka-jumbo1004 to allow dcops to upgrade NIC
 * 16:57 razzi: shutting down kafka-jumbo1003 to allow dcops to upgrade NIC
 * 16:25 razzi: shutting down kafka-jumbo1002 to allow dcops to upgrade NIC
 * 14:55 elukey: shutdown kafka-jumbo1001 to swap NICs (1g -> 10g)
 * 06:30 elukey: truncate application_1601916545561_129457's taskmanager.log (~600G) on an-worker1113 due to partition 'e' full
 * 02:05 milimetric: deployed refinery pointing to refinery-source v0.0.138

2020-11-04

 * 09:20 elukey: upgrade hue to 4.8.0 on hue-next

2020-11-03

 * 16:52 elukey: mv /srv/analytics.wikimedia.org/published/datasets/archive/public-datasets to /srv/backup/public-datasets on thorium - T265971
 * 15:52 elukey: re-enable timers after maintenance
 * 14:02 elukey: stop timers on an-launcher1002 to drain the cluster (an-coord1001 maintenance prep-step)
 * 13:02 elukey: force a restart of performance-asoranking.service on stat1007 after fix for pandas' sort - T266985
 * 07:26 elukey: re-run cassandra-daily-coord-local_group_default_T_pageviews_per_article_flat failed hour via hue

2020-11-02

 * 21:15 ottomata: evolved Hive table event.contenttranslationabusefilter to match migrated event platform schema - T259163
 * 13:40 elukey: roll restart zookeeper on an-conf* to pick up new openjdk upgrades
 * 12:40 elukey: forced re-creation of base jupyterhub venvs on stat1007

2020-10-30

 * 17:01 elukey: kafka preferred-replica-election on jumbo1001

2020-10-29

 * 14:25 elukey: restart zookeeper on an-conf1001 for openjdk upgrades

2020-10-27

 * 17:38 ottomata: restrict Fuzz Faster U Fool user agents from submitting eventlogging legacy systemd data - T266130

2020-10-22

 * 14:05 ottomata: bump camus version to wmf12 for all camus jobs. should be no-op now. - T251609
 * 13:56 ottomata: camus-eventgate-main_events now uses EventStreamConfig to discover topics to ingest, but still uses regex to find topics to monitor - T251609
 * 13:04 ottomata: camus-eventgate-analytics_events now uses EventStreamConfig to discover topics to ingest and canary topics to monitor - T251609
 * 13:03 elukey: restart turnilo to pick up new wmf_netflow settings
 * 11:51 ottomata: camus-eventgate-analytics-external now uses EventStreamConfig to discover topics to ingest and canary topics to monitor
 * 07:03 elukey: decom analytics1057 from the Hadoop cluster
 * 06:54 elukey: restart httpd on matomo1002, errors while connecting
 * 06:31 elukey: restart turnilo to apply new settings for wmf_netflow
 * 06:06 elukey: execute "sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chown -R analytics /wmf/data/archive/geoip" on an-launcher1002 - permission issues for 'analytics' and /wmf/data/archive/geoip
 * 02:37 ottomata: re-run webrequest-load-wf-{text,upload}-2020-10-21-{19,20} oozie jobs after they timed out waiting for data due to camus misconfiguration (fixed in https://gerrit.wikimedia.org/r/c/operations/puppet/+/635678)

2020-10-21

 * 20:12 razzi: stop nginx on analytics-tool1001.eqiad.wmnet to switch to envoy (hue-next)
 * 20:10 razzi: stop nginx on analytics-tool1001.eqiad.wmnet to switch to envoy (hue)
 * 20:07 razzi: stop nginx on analytics-tool1007.eqiad.wmnet to switch to envoy (turnilo)
 * 20:05 razzi: stop nginx on analytics-tool1004.eqiad.wmnet to switch to envoy (superset)
 * 20:02 razzi: stop nginx on matomo1002.eqiad.wmnet to switch to envoy
 * 10:41 elukey: decommission analytics1052 from the hadoop cluster
 * 10:26 elukey: move journalnode from analytics1052 (to be decommed) to an-worker1080

2020-10-20

 * 20:59 mforns: Deploying refinery with refinery-deploy-to-hdfs (for 0.0.137)
 * 20:24 mforns: Deploying refinery with scap for v0.0.137
 * 20:00 mforns: Deployed refinery-source v0.0.137
 * 15:00 ottomata: disabling sending EventLogging events to eventlogging-valid-mixed topic - T265651
 * 13:34 elukey: upgrade superset's presto TLS config after the above changes
 * 13:33 elukey: move presto to puppet host TLS certificates
 * 10:29 klausman: rocm38 install on an-worker1101 successful, rebooting to make sure everything is in place
 * 06:41 elukey: decom analytics1056 from the hadoop cluster

2020-10-19

 * 14:40 ottomata: restarted eventlogging-processor with filter to skip events already migrated to event platform - T262304
 * 10:09 elukey: add pps/bps measures to wmf_netflow in turnilo
 * 07:27 elukey: decom analytics1055 from the hadoop cluster
 * 06:47 elukey: turnilo upgraded to 1.27.0

2020-10-18

 * 07:01 elukey: decom analytics1054 from hadoop

2020-10-17

 * 06:08 elukey: decom analytics1053 from the hadoop cluster

2020-10-15

 * 17:57 razzi: taking yarn.wikimedia.org offline momentarily to test new tls configuration: T240439
 * 14:51 elukey: roll restart druid-historical daemons on druid1004-1008 to pick up new conn pooling changes
 * 07:03 elukey: restart oozie to pick up the analytics team's admin list
 * 06:09 elukey: decommission analytics1050 from the hadoop cluster

2020-10-14

 * 17:39 joal: Rerun refine for mediawiki_api_request failed hour
 * 15:59 elukey: drain + reboot an-worker1100 to pick up GPU settings
 * 15:29 elukey: drain + reboot an-worker110[1,2] to pick up GPU settings
 * 14:56 elukey: drain + reboot an-worker109[8,9] to pick up GPU settings
 * 05:48 elukey: decom analytics1049 from the Hadoop cluster

2020-10-13

 * 12:38 elukey: drop /srv/backup/mysql from an-master1002 (not used anymore)
 * 08:59 klausman: Regenned the jupyterhub venvs on stat1004
 * 07:56 klausman: re-imaging stat1004 to Buster
 * 06:20 elukey: decom analytics1048 from the Hadoop cluster

2020-10-12

 * 11:36 joal: Clean druid test-datasources
 * 11:32 elukey: remove analytics-meta lvm backup settings from an-coord1001
 * 11:23 elukey: remove analytics-meta lvm backup settings from an-master1002
 * 07:02 elukey: reduce hdfs block replication factor on Hadoop test to 2
 * 05:37 elukey: decom analytics1047 from the Hadoop cluster

2020-10-11

 * 08:33 elukey: drop some old namenode backups under /srv on an-master1002 to free some space
 * 08:24 elukey: decommission analytics1046 from the hadoop cluster
 * 08:12 elukey: clean up logs on an-launcher1002 (disk space full)

2020-10-10

 * 12:01 elukey: decommission analytics1045 from the Hadoop cluster

2020-10-09

 * 13:17 elukey: execute "cumin 'stat100[5,8]* or an-worker109[6-9]* or an-worker110[0,1]*' 'apt-get install -y linux-headers-amd64'"
 * 11:15 elukey: bootstrap the Analytics Hadoop test cluster
 * 09:47 elukey: roll restart of hadoop-yarn-nodemanager on all hadoop workers to pick up new settings
 * 07:58 elukey: decom analytics1044 from Hadoop
 * 07:04 elukey: failover from an-master1002 to 1001 for HDFS namenode (the namenode failed over hours ago, no logs to check)

2020-10-08

 * 18:08 razzi: restart oozie server on an-coord1001 for reverting T262660
 * 17:42 razzi: restart oozie server on an-coord1001 for T262660
 * 17:19 elukey: removed /var/lib/puppet/clientbucket/6/f/a/c/d/9/8/d/6facd98d16886787ab9656eef07d631e/content on an-launcher1002 (29G, last modified Aug 4th)
 * 15:45 elukey: executed git pull on /srv/jupyterhub/deploy and run again create_virtualenv.sh on stat1007 (pyspark kernels may not run correctly due to a missing feature)
 * 15:43 elukey: executed git pull on /srv/jupyterhub/deploy and run again create_virtualenv.sh on stat1006 (pyspark kernels not running due to a missing feature)
 * 13:13 elukey: roll restart of druid overlords and coordinators on druid public to pick up new TLS settings
 * 12:51 elukey: roll restart of druid overlords and coordinators on druid analytics to pick up new TLS settings
 * 10:35 elukey: force the re-creation of default jupyterhub venvs on stat1006 after reimage
 * 08:47 klausman: Starting re-image of stat1006 to Buster
 * 07:14 elukey: decom analytics1043 from the Hadoop cluster
 * 06:46 elukey: move the hdfs balancer from an-coord1001 to an-launcher1002

2020-10-07

 * 08:45 elukey: decom analytics1042 from hadoop

2020-10-06

 * 13:14 elukey: cleaned up /srv/jupyter/venv and re-created it to allow jupyterhub to start cleanly on stat1007
 * 12:56 joal: Restart oozie to pick up new spark settings
 * 12:47 elukey: force re-creation of the base virtualenv for jupyter on stat1007 after the reimage
 * 12:20 elukey: update HDFS Namenode GC/Heap settings on an-master100[1,2]
 * 12:19 elukey: increase spark shuffle io retry logic (10 tries every 10s)
 * 09:08 elukey: add an-worker1114 to the hadoop cluster
 * 09:04 klausman: Starting reimaging of stat1007
 * 07:32 elukey: bootstrap an-worker111[13] as hadoop workers

2020-10-05

 * 19:14 mforns: restarted oozie coord unique_devices-per_domain-monthly after deployment
 * 19:05 mforns: finished deploying refinery to unblock deletion of raw mediawiki_job and raw netflow data
 * 18:45 mforns: deploying refinery to unblock deletion of raw mediawiki_job and raw netflow data
 * 18:20 elukey: manual creation of /opt/rocm -> /opt/rocm-3.3.0 on stat1008 to avoid failures in finding the lib dir
 * 17:11 elukey: bootstrap an-worker[1115-1117] as hadoop workers
 * 14:52 milimetric: disabling drop-el-unsanitized-events timer until https://gerrit.wikimedia.org/r/c/analytics/refinery/+/631804/ is deployed
 * 14:41 elukey: shutdown stat1005 and stat1008 for ram expansion (1005 again)
 * 14:25 elukey: shutdown an-master1001 for ram expansion
 * 13:54 elukey: shutdown stat1005 for ram upgrade
 * 13:31 elukey: shutdown an-master1002 for ram expansion (64 -> 128G)
 * 12:35 elukey: execute "PURGE BINARY LOGS BEFORE '2020-09-28 00:00:00';" on an-coord1001's mysql to free space - T264081
 * 10:31 elukey: bootstrap an-worker111[0,2] as hadoop workers
 * 06:33 elukey: reboot stat1005 to resolve weird GPU state (scheduled last week)

2020-10-03

 * 10:35 joal: Manually run mediawiki-history-denormalize after fail-rerun problem (second time)

2020-10-02

 * 16:43 joal: Rerun mediawiki-history-denormalize-wf-2020-09 after failed instance
 * 14:23 elukey: live patch refinery-drop-older-than on stat1007 to unblock timer (patch https://gerrit.wikimedia.org/r/6317800)
 * 13:00 elukey: add an-worker110[6-9] to the Hadoop cluster
 * 06:49 elukey: add an-worker110[0-2] to the hadoop cluster
 * 06:33 joal: Manually sqoop page_props and user_properties to unlock mediawiki-history-load oozie job

2020-10-01

 * 19:07 fdans: deploying wikistats
 * 19:06 fdans: restarted banner_activity-druid-daily-coord from Sep 26
 * 18:59 fdans: restarting mediawiki-history-load-coord
 * 18:57 fdans: creating hive table wmf_raw.mediawiki_page_props
 * 18:56 fdans: creating hive table wmf_raw.mediawiki_user_properties
 * 17:40 elukey: remove + re-create /srv/deployment/analytics/refinery* on stat100[46] (perm issues after reimage)
 * 17:32 elukey: remove + re-create /srv/deployment/analytics/refinery on stat1007 (perm issues after reimage)
 * 17:18 fdans: deploying refinery
 * 14:51 elukey: bootstrap an-worker109[8-9] as hadoop workers (with GPU)
 * 13:35 elukey: bootstrap an-worker1097 (GPU node) as hadoop worker
 * 13:15 elukey: restart performance-asoranking on stat1007
 * 13:15 elukey: execute "sudo chown analytics-privatedata:analytics-privatedata-users /srv/published-datasets/performance/autonomoussystems/*" on stat1007 to fix a perm issue after reimage
 * 10:30 elukey: add an-worker1103 to the hadoop cluster
 * 07:15 elukey: restart hdfs namenodes on an-master100[1,2] to pick up new hadoop workers settings
 * 06:04 elukey: execute "sudo chown -R analytics-privatedata:analytics-privatedata-users /srv/geoip/archive" on stat1007 - T264152
 * 05:58 elukey: execute "sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chown -R analytics-privatedata /wmf/data/archive/geoip" - T264152

2020-09-30

 * 07:29 elukey: execute "alter table superset_production.alerts drop key ix_alerts_active;" on db1108's analytics-meta instance to fix replication after Superset upgrade - T262162
 * 07:04 elukey: superset upgraded to 0.37.2 on analytics-tool1004 - T262162
 * 05:47 elukey: "PURGE BINARY LOGS BEFORE '2020-09-22 00:00:00';" on an-coord1001's mariadb - T264081

2020-09-28

 * 18:37 elukey: execute "PURGE BINARY LOGS BEFORE '2020-09-20 00:00:00';" on an-coord1001's mariadb as attempt to recover space
 * 18:37 elukey: execute "PURGE BINARY LOGS BEFORE '2020-09-15 00:00:00';" on an-coord1001's mariadb as attempt to recover space
 * 15:09 elukey: execute set global max_connections=200 on an-coord1001's mariadb (hue reporting too many conns, but in reality the fault is from superset)
 * 10:02 elukey: force /srv/jupyterhub/deploy/create_virtual_env.sh on stat1007 after the reimage
 * 07:58 elukey: starting the process to decom the old hadoop test cluster

2020-09-27

 * 06:53 elukey: manually ran /usr/bin/find /srv/backup/hadoop/namenode -mtime +14 -delete on an-master1002 to free space on the /srv partition

2020-09-25

 * 16:25 elukey: systemctl reset-failed monitor_refine_eventlogging_legacy_failure_flags.service on an-launcher1002 to clear alerts
 * 15:52 elukey: restart hdfs namenodes to correct rack settings of the new host
 * 15:42 elukey: add an-worker1096 (GPU worker) to the hadoop cluster
 * 08:57 elukey: restart daemons on analytics1052 (journalnode) to verify new TLS setting simplification (no truststore config in ssl-server.xml, not needed)
 * 07:18 elukey: restart datanode on analytics1044 after new datanode partition settings (one partition was missing, caught by https://gerrit.wikimedia.org/r/c/operations/puppet/+/629647)

2020-09-24

 * 13:24 elukey: moved the hadoop cluster to puppet TLS certificates
 * 13:20 elukey: re-enable timers on an-launcher1002 after maintenance
 * 09:51 elukey: stop all timers on an-launcher1002 to ease maintenance
 * 09:41 elukey: force re-creation of jupyterhub's default venv on stat1006 after reimage
 * 07:29 klausman: Starting reimaging of stat1006
 * 06:48 elukey: on an-launcher1002: sudo -u hdfs kerberos-run-command hdfs hdfs dfs -rm -r -skipTrash /var/log/hadoop-yarn/apps/mirrys/logs/*
 * 06:45 elukey: on an-launcher1002: sudo -u hdfs kerberos-run-command hdfs hdfs dfs -rm -r -skipTrash /var/log/hadoop-yarn/apps/analytics-privatedata/logs/*
 * 06:39 elukey: manually ran "/usr/bin/find /srv/backup/hadoop/namenode -mtime +15 -delete" on an-master1002 to free some space in the backup partition

2020-09-23

 * 07:29 elukey: re-enable timers on an-launcher1002 - maintenance postponed
 * 06:06 elukey: stop timers on an-launcher1002 as prep step before maintenance

2020-09-22

 * 06:29 elukey: re-run webrequest-load-text 21/09T21 - failed due to sporadic hive/kerberos issue (SQLException: Could not open client transport with JDBC Uri: jdbc:hive2://an-coord1001.eqiad.wmnet:10000/default;principal=hive/an-coord1001.eqiad.wmnet@WIKIMEDIA: Peer indicated failure: Failure to initialize security context)

2020-09-21

 * 18:00 elukey: execute sudo -u hdfs kerberos-run-command hdfs hdfs dfs -rm -r -skipTrash /var/log/hadoop-yarn/apps/mgerlach/logs/* to free ~30TB of space on HDFS (Replicated)
 * 17:44 elukey: restart yarn resource managers on an-master100[1,2] to pick up settings for https://gerrit.wikimedia.org/r/c/operations/puppet/+/628887
 * 16:59 joal: Manually add _SUCCESS file to events to hourly-partition of page_move events so that wikidata-item_page_link job starts
 * 16:21 joal: Kill restart wikidata-item_page_link-weekly-coord to not wait on missing data
 * 15:45 joal: Restart wikidata-json_entity-weekly coordinator after wrong kill in new hue UI
 * 15:42 joal: manually killing wikidata-json_entity-weekly-wf-2020-08-31 - Raw data is missing from dumps folder (json dumps)

2020-09-18

 * 15:05 elukey: systemctl reset-failed monitor_refine_eventlogging_legacy_failure_flags.service on an-launcher1002 to clear icinga alarms
 * 10:38 elukey: force ./create_virtualenv.sh in /srv/jupyterhub/deploy to update the jupyter's default venv

2020-09-17

 * 10:12 klausman: started backup of stat1004's /srv to stat1008

2020-09-16

 * 19:12 joal: Manually kill webrequest-hour oozie job that started before the restart could happen (waiting for previous hour to be finished)
 * 19:00 joal: Kill-restart data-quality-hourly bundle after deploy
 * 18:57 joal: Kill-restart webrequest after deploy
 * 18:44 joal: Kill restart mediawiki-history-reduced job after deploy
 * 17:59 joal: Deploy refinery onto HDFS
 * 17:46 joal: Deploy refinery using scap
 * 15:27 elukey: update the TLS backend certificate for Analytics UIs (unified one) to include hue-next.w.o as SAN
 * 12:11 klausman: stat1008 updated to use rock/rocm DKMS driver and back in operation
 * 11:28 klausman: starting to upgrade to rock-dkms driver on stat1008
 * 08:11 elukey: superset 0.37.1 deployed to an-tool1005 (staging env)

2020-09-15

 * 13:43 elukey: re-enable timers on an-launcher1002 after maintenance to an-coord1001
 * 13:43 elukey: restart of hive/oozie/presto daemons on an-coord1001
 * 12:30 elukey: stop timers on an-launcher1002 to drain the cluster and restart an-coord1001's daemons (hive/oozie/presto)
 * 06:48 elukey: run systemctl reset-failed monitor_refine_eventlogging_legacy_failure_flags.service on an-launcher1002

2020-09-14

 * 14:36 milimetric: deployed eventstreams with new KafkaSSE version on staging, eqiad, codfw

2020-09-11

 * 15:41 milimetric: restarted data quality stats bundles
 * 01:32 milimetric: deployed small fix for hql of editors_bycountry load job
 * 00:46 milimetric: deployed refinery source 0.0.136, refinery, and synced to HDFS

2020-09-09

 * 10:11 klausman: Rebooting stat1005 for clearing GPU status and testing new DKMS driver (T260442)
 * 07:25 elukey: restart varnishkafka-webrequest on cp5010 and cp5012, delivery reports errors happening since yesterday's network outage

2020-09-04

 * 18:11 milimetric: aqs deploy went well! Geoeditors endpoint is live internally, data load job was successful, will submit pull request for public endpoint.
 * 06:54 joal: Manually restart mediawiki-history-drop-snapshot after hive-partitions/hdfs-folders mismatch fix
 * 06:08 elukey: reset-failed mediawiki-history-drop-snapshot on an-launcher1002 to clear icinga errors
 * 01:52 milimetric: aborted aqs deploy due to cassandra error

2020-09-03

 * 19:15 milimetric: finished deploying refinery and refinery-source, restarting jobs now
 * 13:59 milimetric: edit-hourly-druid-wf-2020-08 fails consistently
 * 13:56 joal: Kill-restart mediawiki-history-reduced oozie job into production queue
 * 13:56 joal: rerun edit-hourly-druid-wf-2020-08 after failed attempt

2020-09-02

 * 18:24 milimetric: restarting mediawiki history denormalize coordinator in production queue, due to failed 2020-08 run
 * 08:37 elukey: run kafka preferred-replica-election on jumbo after jumbo1003's reimage to buster

2020-08-31

 * 13:43 elukey: run kafka preferred-replica-election on Jumbo after jumbo1001's reimage
 * 07:13 elukey: run kafka preferred-replica-election on Jumbo after jumbo1005's reimage

2020-08-28

 * 14:25 mforns: deployed pageview whitelist with new wiki: ja.wikivoyage
 * 14:18 elukey: run kafka preferred-replica-election on jumbo after the reimage of jumbo1006
 * 07:21 joal: Manually add ja.wikivoyage to pageview allowlist to prevent alerts

2020-08-27

 * 19:05 mforns: finished refinery deploy (ref v0.0.134)
 * 18:41 mforns: starting refinery deploy (ref v0.0.134)
 * 18:30 mforns: deployed refinery-source v0.0.134
 * 13:29 elukey: restart jvm daemons on analytics1042, aqs1004, kafka-jumbo1001 to pick up new openjdk upgrades (canaries)

2020-08-25

 * 15:47 elukey: restart mariadb@analytics_meta on db1108 to apply a replication filter (exclude superset_staging database from replication)
 * 06:35 elukey: restart mediawiki-history-drop-snapshot on an-launcher1002 to check that it works

2020-08-24

 * 06:50 joal: Dropping wikitext-history snapshots 2020-04 and 2020-05 keeping two (2020-06 and 2020-07) to free space in hdfs

2020-08-23

 * 19:34 nuria: deleted 1.2 TB from hdfs://analytics-hadoop/user/analytics/.Trash/200811000000
 * 19:31 nuria: deleted 1.2 TB from hdfs://analytics-hadoop/user/nuria/.Trash/*
 * 19:26 nuria: deleted 300G from hdfs://analytics-hadoop/user/analytics/.Trash/200814000000
 * 19:25 nuria: deleted 1.2 TB from hdfs://analytics-hadoop/user/analytics/.Trash/200808000000

2020-08-20

 * 16:49 joal: Kill restart webrequest-load bundle to move it to production queue

2020-08-14

 * 09:13 fdans: restarting refine to apply T257860

2020-08-13

 * 16:13 fdans: restarting webrequest bundle
 * 14:44 fdans: deploying refinery
 * 14:13 fdans: updating refinery source symlinks

2020-08-11

 * 17:36 ottomata: refine with refinery-source 0.0.132 and merge_with_hive_schema_before_read=true - T255818
 * 14:52 ottomata: scap deploy refinery to an-launcher1002 to get camus wrapper script changes

2020-08-06

 * 14:47 fdans: deploying refinery
 * 08:07 elukey: roll restart druid-brokers (on both clusters) to pick up new changes for monitoring

2020-08-05

 * 13:04 elukey: restart yarn resource managers on an-master100[12] to pick up new Yarn settings - https://gerrit.wikimedia.org/r/c/operations/puppet/+/618529
 * 13:03 elukey: set yarn_scheduler_minimum_allocation_mb = 1 (was zero) to Hadoop to workaround a Flink 1.1 issue (namely it doesn't work if the value is <= 0)
 * 09:32 elukey: set ticket max renewable lifetime to 7d on all kerberos clients (was zero, the default)

2020-08-04

 * 08:30 elukey: resume druid-related oozie coordinator jobs via Hue (after druid upgrade)
 * 08:28 elukey: started netflow kafka supervisor on Druid Analytics (after upgrade)
 * 08:19 elukey: restore systemd timers for druid jobs on an-launcher1002 (after druid upgrade)
 * 07:33 elukey: stop systemd timers related to druid on an-launcher1002
 * 07:29 elukey: stop kafka supervisor for netflow on Druid Analytics (prep step for druid upgrade)
 * 07:00 elukey: suspend all druid-related coordinators in Hue as prep step for upgrade

2020-08-03

 * 09:53 elukey: move all druid-related systemd timers to spark client mode - T254493
 * 08:07 elukey: roll restart aqs on aqs* to pick up new druid settings

2020-08-01

 * 13:22 joal: Rerun cassandra-monthly-wf-local_group_default_T_unique_devices-2020-7 to load missing data (email with bug description sent to list)

2020-07-31

 * 14:46 mforns: restarted webrequest oozie bundle
 * 14:46 mforns: restarted mediawiki history reduced oozie job
 * 09:00 elukey: SET GLOBAL expire_logs_days=14; on matomo1002's mysql
 * 09:00 elukey: SET GLOBAL expire_logs_days=14; on an-coord1001's mysql
 * 06:32 elukey: roll restart of druid brokers on druid100[4-8] to pick up new changes

2020-07-30

 * 19:14 mforns: finished refinery deploy (for v0.0.132)
 * 18:48 mforns: starting refinery deploy (for v0.0.132)
 * 18:27 mforns: deployed refinery-source v0.0.132

2020-07-29

 * 14:37 mforns: quick deployment of pageview white-list

2020-07-28

 * 17:52 ottomata: stopped writing eventlogging data log files on eventlog1002 and stopped syncing them to stat100[67] - T259030
 * 14:29 elukey: stop client-side-events-log.service on eventlog1002 to avoid /srv to fill up
 * 09:48 elukey: re-enable eventlogging file consumers on eventlog1002
 * 09:10 elukey: temporarily stop eventlogging file consumers on eventlog1002 to copy some data over to stat1005 (/srv partition full)
 * 08:03 elukey: Superset migrated to CAS
 * 06:42 elukey: re-run webrequest-load hour 2020-7-28-3

2020-07-27

 * 17:15 elukey: restart eventlogging on eventlog1002 to update the event whitelist (exclude MobileWebUIClickTracking)
 * 08:19 elukey: reset-failed the monitor_refine_failures for eventlogging on an-launcher1002
 * 06:44 elukey: truncate big log file on an-launcher1002 that is filling up the /srv partition

2020-07-22

 * 15:05 joal: manually drop /user/analytics/.Trash/200714000000/wmf/data/wmf/pageview/actor to free some space
 * 15:03 joal: Manually drop /wmf/data/wmf/mediawiki/wikitext/history/snapshot=2020-03 to free some space
 * 15:01 elukey: hdfs dfs -rm -r -skipTrash /var/log/hadoop-yarn/apps/analytics-privatedata/logs
 * 14:49 elukey: hdfs dfs -rm -r -skipTrash /var/log/hadoop-yarn/apps/analytics/logs/*
 * 08:09 elukey: turnilo.wikimedia.org migrated to CAS

2020-07-21

 * 18:30 mforns: finished re-deploying refinery to unbreak unique devices per domain monthly
 * 18:05 mforns: re-deploying refinery to unbreak unique devices per domain monthly
 * 17:34 mforns: restarted unique_devices-per_domain-daily-coord
 * 15:09 elukey: yarn.wikimedia.org migrated earlier on to CAS auth
 * 14:58 ottomata: Refine - reverted change to not merge hive schema + event schema before reading - T255818
 * 13:36 ottomata: Refine no longer merges with Hive table schema when reading (except for refine_eventlogging_analytics job) - T255818

2020-07-20

 * 19:56 joal: kill-restart cassandra unique-devices loading daily and monthly after deploy (2020-07-20 and 2020-07-01)
 * 19:55 joal: kill-restart mediawiki-history-denormalize after deploy (2020-07-01)
 * 19:55 joal: kill-restart webrequest after deploy (2020-07-20T18:00)
 * 19:19 mforns: finished refinery deployment (for v0.0.131)
 * 19:02 mforns: starting refinery deployment (for v0.0.131)
 * 19:02 mforns: deployed refinery-source v0.0.131
 * 18:16 joal: Rerun cassandra-daily-coord-local_group_default_T_unique_devices from 2020-07-15 to 2020-07-19 (both included)
 * 14:50 elukey: restart superset to pick up TLS to mysql settings
 * 14:18 elukey: re-enable timers on an-launcher1002
 * 14:01 elukey: resume pageview-daily_dump-coord via Hue to ease the draining + mariadb restart
 * 14:00 elukey: restart mariadb on an-coord1001 with TLS settings
 * 13:43 elukey: suspend pageview-daily_dump-coord via Hue to ease the draining + mariadb restart
 * 12:55 elukey: stop timers on an-launcher1002 to ease the mariadb restart on an-coord1001 (second attempt)
 * 09:10 elukey: start timers on an-launcher1002 (no mysql restart happened, long jobs not completing, will postpone)
 * 07:16 joal: Restart mobile_apps-session_metrics-wf-7-2020-7-12 after heisenbug kerberos failure
 * 06:58 elukey: stop timers on an-launcher1002 to ease the mariadb restart on an-coord1001

2020-07-17

 * 12:34 elukey: deprecate pivot.wikimedia.org (to ease CAS work)

2020-07-15

 * 17:58 joal: Backfill cassandra unique-devices for per-project-family starting 2019-07
 * 08:18 elukey: move piwik to CAS (idp.wikimedia.org)

2020-07-14

 * 15:50 elukey: upgrade spark2 on all stat100x hosts
 * 15:07 elukey: upgrade spark2 to 2.4.4-bin-hadoop2.6-3 on stat1004
 * 14:55 elukey: re-create jupyterhub's venv on stat1005/8 after https://gerrit.wikimedia.org/r/612484
 * 14:45 elukey: re-create jupyterhub's base kernel directory on stat1005 (trying to debug some problems)
 * 07:27 joal: Restart forgotten unique-devices per-project-family jobs after yesterday deploy

2020-07-13

 * 20:17 milimetric: deployed weekly train with two oozie job bugfixes and rename to pageview_actor table
 * 19:42 joal: Deploy refinery with scap
 * 19:24 joal: Drop pageview_actor_hourly and replace it by pageview_actor
 * 18:26 joal: Kill pageview_actor_hourly and unique_devices_per_project_family jobs to copy backfilled data
 * 12:35 joal: Start backfilling of wdqs_internal (external had been done, not internal :S)

2020-07-10

 * 17:10 nuria: updating the EL whitelist, refinery deploy (but not source)
 * 16:01 milimetric: deployed, EL whitelist is updated

2020-07-09

 * 18:52 elukey: upgrade spark2 to 2.4.4-bin-hadoop2.6-3 on stat1008

2020-07-07

 * 10:12 elukey: decom archiva1001

2020-07-06

 * 08:09 elukey: roll restart aqs on aqs100[4-9] to pick up new druid settings
 * 07:51 elukey: enable binlog on matomo's database on matomo1002

2020-07-04

 * 10:52 joal: Rerun mediawiki-geoeditors-monthly-wf-2020-06 after heisenbug (patch provided for long-term fix)

2020-07-03

 * 19:20 joal: restart failed webrequest-load job webrequest-load-wf-text-2020-7-3-17 with higher thresholds - error due to burst of requests in ulsfo
 * 19:13 joal: restart mediawiki-history-denormalize oozie job using 0.0.115 refinery-job jar
 * 19:05 joal: kill manual execution of mediawiki-history to save an-coord1001 (too big of a spark-driver)
 * 18:53 joal: restart webrequest-load-wf-text-2020-7-3-17 after hive server failure
 * 18:52 joal: restart data_quality_stats-wf-event.navigationtiming-useragent_entropy-hourly-2020-7-3-15 after hive server failure
 * 18:51 joal: restart virtualpageview-hourly-wf-2020-7-3-15 after hive-server failure
 * 16:41 joal: Rerun mediawiki-history-check_denormalize-wf-2020-06 after having cleaned up wrong files and restarted a job without deterministic skewed join

2020-07-02

 * 18:16 joal: Launch a manual instance of mediawiki-history-denormalize to release data despite oozie failing
 * 16:17 joal: rerun mediawiki-history-denormalize-wf-2020-06 after oozie sharelib bump through manual restart
 * 12:41 joal: retry mediawiki-history-denormalize-wf-2020-06
 * 07:26 elukey: start a tmux on an-launcher1002 with 'sudo -u analytics /usr/local/bin/kerberos-run-command analytics /usr/local/bin/refinery-sqoop-mediawiki-production'
 * 07:20 elukey: execute systemctl reset-failed refinery-sqoop-whole-mediawiki.service to clear our alarms on launcher1002

2020-07-01

 * 19:04 joal: Kill/restart webrequest-load-bundle for mobile-pageview update
 * 18:59 joal: kill/restart pageview-druid jobs (hourly, daily, monthly) for in_content_namespace field update
 * 18:57 joal: kill/restart mediawiki-wikitext-history-coord and mediawiki-wikitext-current-coord for bz2 codec update
 * 18:55 joal: kill/restart mediawiki-history-denormalize-coord after skewed-join strategy update
 * 18:52 joal: Kill/Restart unique_devices-per_project_family-monthly-coord after fix
 * 18:41 joal: deploy refinery to HDFS
 * 18:28 joal: Deploy refinery using scap after hotfix
 * 18:20 joal: Deploy refinery using scap
 * 16:58 joal: trying to release refinery-source 0.0.129 to archiva, version 3
 * 16:51 elukey: remove /etc/maven/settings.xml from all analytics nodes that have it

2020-06-30

 * 18:28 joal: trying to release refinery-source to archiva from jenkins (second time)
 * 16:30 joal: Release refinery-source v0.0.129 using jenkins
 * 16:30 joal: Deploy refien
 * 16:05 elukey: re-enable timers on an-launcher1002 after archiva maintenance
 * 15:23 elukey: stop timers on an-launcher1002 to ease debugging for refinery deploy
 * 13:12 elukey: restart nodemanager on analytics1068 after GC overhead and OOMs
 * 09:32 joal: Kill/Restart mediawiki-wikitext-history job now that the current month one is done (bz2 fix)

2020-06-29

 * 13:09 elukey: archiva.wikimedia.org migrated to archiva1002

2020-06-25

 * 17:20 elukey: move RU jobs/timers from an-launcher1001 to an-launcher1002
 * 16:07 elukey: move all timers but RU from an-launcher1001 to 1002 (puppet disabled on 1001, all timers completed)
 * 12:13 elukey: reimage notebook1003/4 to debian buster as fresh start
 * 09:28 joal: Kill-restart pageview-hourly to read from pageview_actor
 * 09:25 joal: Kill-restart pageview_actor jobs (current+backfill) after deploy
 * 09:14 joal: Deploy refinery to HDFS
 * 08:56 joal: deploying refinery using scap to fix pageview_actor_hourly
 * 08:02 joal: Start backfilling pageview_actor_hourly job with new patch (expected to solve heisenbug)
 * 07:40 joal: Dropping refinery-camus jars from archiva up to 0.0.115
 * 07:04 joal: rerun failed pageview_actor_hourly

2020-06-24

 * 19:36 joal: Cleaning refinery-spark from archiva (up to 0.0.115)
 * 19:28 joal: Cleaning refinery-tools from archiva (up to 0.0.115)
 * 19:16 joal: Restarting unique-devices jobs to use pageview_actor_hourly instead of webrequest (4 jobs)
 * 19:08 joal: Start pageview_actor_hourly oozie job
 * 19:06 joal: Create pageview_actor_hourly after deploy to start new jobs
 * 18:57 joal: Clean archiva refinery-camus except 0.0.90
 * 18:54 joal: Deploying refinery onto HDFS
 * 18:47 joal: clean archiva from refinery-hive (up to 0.0.115)
 * 18:47 joal: Deploying refinery using scap
 * 18:15 joal: launching a new jenkins release after cleanup
 * 17:43 joal: Resetting refinery-source to v0.0.128 for clean release after jenkins-archiva password fix
 * 16:20 joal: Releasing refinery-source 0.0.128 to archiva
 * 06:50 elukey: truncate /srv/reportupdater/log/reportupdater-ee-beta-features from 43G to 1G on an-launcher1001 (disk space issues)

2020-06-22

 * 18:50 joal: Manually update pageview whitelist adding shnwiktionary

2020-06-20

 * 07:41 elukey: powercycle an-worker1093 - bug soft lock up CPU showed in mgmt console
 * 07:37 elukey: powercycle an-worker1091 - bug soft lock up CPU showed in mgmt console

2020-06-17

 * 19:59 milimetric: deployed quick fix for data stats job
 * 18:04 elukey: decommission matomo1001
 * 16:57 ottomata: produce searchsatisfaction events on group0 wikis via eventgate - T249261
 * 07:17 joal: Deleting mediawiki-history-text (avro) for 2020-01 and 2020-02 (we still have 2020-03 and 2020-04) - Expected free space: 160Tb
 * 06:40 elukey: reboot krb1001 for kernel upgrades
 * 06:24 elukey: reboot an-master100[1,2] for kernel upgrades
 * 06:03 elukey: reboot an-conf100[1-3] for kernel upgrades
 * 05:45 elukey: reboot stat1007/8 for kernel upgrades

2020-06-16

 * 19:58 ottomata: evolving event.SearchSatisfaction Hive table using /analytics/legacy/searchsatisfaction/latest schema
 * 19:41 ottomata: bumping Refine refinery jar version to 0.0.127 - T238230
 * 19:17 ottomata: deploying refinery source 0.0.127 for eventlogging -> eventgate migration - T249261
 * 16:02 elukey: reboot kafka-jumbo1008 for kernel upgrades
 * 15:33 milimetric: refinery deployed and synced to hdfs, with refinery-source at 0.0.126
 * 15:20 elukey: reboot kafka-jumbo1007 for kernel upgrades
 * 15:13 elukey: re-enabling timers on launcher after maintenance
 * 15:06 elukey: reboot an-coord1001 for kernel upgrades
 * 14:27 elukey: stop timers on an-launcher1001, prep before rebooting an-coord1001
 * 14:23 elukey: reboot druid100[7,8] for kernel upgrades
 * 11:51 elukey: re-run webrequest-druid-hourly-coord 16/06T10
 * 11:36 elukey: reboot an-druid100[1,2] for kernel upgrades

2020-06-15

 * 09:37 elukey: restart refinery-druid-drop-public-snapshots.service after change in vlan firewall rules (added druid100[7,8] to term druid)

2020-06-11

 * 15:01 mforns: started refinery deploy for v0.0.126
 * 14:58 mforns: deployed refinery-source v0.0.126
 * 13:57 ottomata: removed accidentally added page_restrictions column(s) on Hive table event.mediawiki_user_blocks_change after an incorrect schema change was merged (no data was ever set in this column)

2020-06-09

 * 07:32 elukey: upgrade ROCm to 3.3 on stat1005

2020-06-08

 * 15:42 elukey: remove access to notebook100[3,4] - T249752
 * 14:07 elukey: move matomo cron archiver to systemd timer archiver (with nagios alarming)
 * 14:02 elukey: re-enable timers on an-coord1001
 * 14:01 elukey: restart hive/oozie on an-coord1001 for openjdk upgrades
 * 13:42 elukey: roll restart kafka jumbo brokers for openjdk upgrades
 * 13:26 elukey: stop timers on an-launcher to drain jobs and restart hive/oozie for openjdk upgrades

2020-06-05

 * 17:56 elukey: roll restart presto server on an-presto* to pick up new openjdk upgrades
 * 16:45 elukey: upgrade turnilo to 1.24.0
 * 13:26 elukey: reimage druid1006 to debian buster
 * 09:26 elukey: roll restart cassandra on AQS to pick up openjdk upgrades

2020-06-04

 * 19:12 elukey: roll restart of aqs to pick up new druid settings
 * 18:39 mforns: deployed wikistats2 2.7.5
 * 13:33 elukey: re-enable netflow hive2druid jobs after https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/602356/
 * 10:56 elukey: depooled and reimage druid1004 to Debian Buster (Druid public cluster)
 * 07:31 elukey: stop netflow hive2druid timers to do some experiments
 * 06:13 elukey: kill application_1589903254658_75731 (druid indexation for netflow still running since 12h ago)
 * 05:36 elukey: restart druid middlemanager on druid1002 - strange protobuf warnings, netflow hive2druid indexation job stuck for hours
 * 05:13 elukey: reimage druid1003 to Buster

2020-06-03

 * 17:10 elukey: restart RU jobs after adding memory to an-launcher1001
 * 16:57 elukey: reboot an-launcher1001 to get new memory
 * 16:01 elukey: stop timers on an-launcher, prep for reboot
 * 09:35 elukey: re-run webrequest-druid-hourly-coord 03/06T7 (failed due to druid1002 moving to buster)
 * 08:50 elukey: reimage druid1002 to Buster

2020-06-01

 * 14:54 elukey: stop all timers on an-launcher1001, prep step for reboot
 * 12:54 elukey: /user/dedcode/.Trash/* -skipTrash
 * 06:53 elukey: re-run virtualpageview-hourly-wf-2020-5-31-19
 * 06:28 elukey: temporary stop of all RU jobs on an-launcher1001 to privilege camus and others
 * 06:03 elukey: kill all airflow-related processes on an-launcher1001 - host killing tasks due to OOM

2020-05-30

 * 08:15 elukey: manual reset-failed of monitor_refine_mediawiki_job_events_failure_flags

2020-05-29

 * 13:19 elukey: re-run druid webrequest hourly 29/05T11 (failed due to a host reimage in progress)
 * 12:19 elukey: reimage druid1001 to Debian Buster
 * 10:05 elukey: move el2druid config from druid1001 to an-druid1001

2020-05-28

 * 18:31 milimetric: after deployment, restarted four oozie jobs with new SLAs and fixed datasets definitions
 * 06:40 elukey: slowly restarting all RU units on an-launcher1001
 * 06:32 elukey: delete old RU pid files with timestamp May 27 19:00 (scap deployment failed to an-launcher due to disk issues) except ./jobs/reportupdater-queries/pingback/.reportupdater.pid that was working fine

2020-05-27

 * 19:53 joal: Start pageview-complete dump oozie job after deploy
 * 19:24 joal: Deploy refinery onto hdfs
 * 19:22 joal: restart failed services on an-launcher1001
 * 19:06 joal: Deploy refinery using scap to an-launcher1001 only
 * 18:41 joal: Deploying refinery with scap
 * 13:42 ottomata: increased Kafka topic retention in jumbo-eqiad to 31 days for (eqiad|codfw).mediawiki.revision-create - T253753
 * 07:09 joal: Rerun webrequest-druid-hourly-wf-2020-5-26-17
 * 07:04 elukey: matomo upgraded to 3.13.5 on matomo1001
 * 06:17 elukey: superset upgraded to 0.36
 * 05:52 elukey: attempt to upgrade Superset to 0.36 - downtime expected

2020-05-24

 * 10:04 elukey: re-run virtualpageview-hourly 23/05T15 - failed due to a sporadic kerberos/hive issue

2020-05-22

 * 09:11 elukey: superset upgrade attempt to 0.36 failed due to a db upgrade error (not seen in staging), rollback to 0.35.2
 * 08:15 elukey: superset down for maintenance
 * 07:09 elukey: add druid100[7,8] to the LVS druid-public-brokers service (serving AQS's traffic)

2020-05-21

 * 17:24 elukey: add druid100[7,8] to the druid public cluster (not serving load balancer traffic for the moment, only joining the cluster) - T252771
 * 16:44 elukey: roll restart druid historical nodes on druid100[4-6] (public cluster) to pick up new settings - T252771
 * 14:02 elukey: restart druid kafka supervisor for wmf_netflow after maintenance
 * 13:53 elukey: restart druid-historical on an-druid100[1,2] to pick up new settings
 * 13:17 elukey: kill wmf_netflow druid supervisor for maintenance
 * 13:13 elukey: stop druid-daemons on druid100[1-3] (one at a time) to move the druid partition from /srv/druid to /srv (didn't think about it before) - T252771
 * 09:16 elukey: move Druid Analytics SQL in Superset to druid://an-druid1001.eqiad.wmnet:8082/druid/v2/sql/
 * 09:05 elukey: move turnilo to an-druid1001 (beefier host)
 * 08:15 elukey: roll restart of all druid historicals in the analytics cluster to pick up new settings

2020-05-20

 * 13:55 milimetric: deployed refinery with refinery-source v0.0.125

2020-05-19

 * 15:28 elukey: restart hadoop master daemons on an-master100[1,2] for openjdk upgrades
 * 06:29 elukey: roll restart zookeeper on druid100[4-6] for openjdk upgrades
 * 06:18 elukey: roll restart zookeeper on druid100[1-3] for openjdk upgrades

2020-05-18

 * 14:02 elukey: roll restart of hadoop daemons on the prod cluster for openjdk upgrades
 * 13:30 elukey: roll restart hadoop daemons on the test cluster for openjdk upgrades
 * 10:33 elukey: add an-druid100[1,2] to the Druid Analytics cluster

2020-05-15

 * 13:23 elukey: roll restart of the Druid analytics cluster to pick up new openjdk + /srv completed
 * 13:15 elukey: turnilo back to druid1001
 * 13:03 elukey: move turnilo config to druid1002 to ease druid maintenance
 * 12:31 elukey: move superset config to druid1002 (was druid1003) to ease maintenance
 * 09:08 elukey: restart druid brokers on Analytics Public

2020-05-14

 * 18:41 ottomata: fixed TLS authentication for Kafka mirror maker on jumbo - T250250
 * 12:49 joal: Release 2020-04 mediawiki_history_reduced to public druid for AQS (elukey did it :-P)
 * 09:53 elukey: upgrade matomo to 3.13.3
 * 09:50 elukey: set matomo in maintenance mode as prep step for upgrade

2020-05-13

 * 21:36 elukey: powercycle analytics1055
 * 13:46 elukey: upgrade spark2 on all stat100x hosts - T250161
 * 06:47 elukey: upgrade spark2 on stat1004 - canary host - T250161

2020-05-11

 * 10:17 elukey: re-run webrequest-load-wf-text-2020-5-11-9
 * 06:06 elukey: restart wikimedia-discovery-golden on stat1007 - apparently killed by no memory left to allocate on the system
 * 05:14 elukey: force re-run of monitor_refine_event_failure_flags after fixing a refine failed hour

2020-05-10

 * 07:44 joal: Rerun webrequest-load-wf-upload-2020-5-10-1

2020-05-08

 * 21:06 ottomata: running preferred replica election for kafka-jumbo to get preferred leaders back after reboot of broker earlier today - T252203
 * 15:36 ottomata: starting kafka broker on kafka-jumbo1006, same issue on other brokers when they are leaders of offending partitions - T252203
 * 15:27 ottomata: stopping kafka broker on kafka-jumbo1006 to investigate camus import failures - T252203
 * 15:16 ottomata: restarted turnilo after applying nuria and mforns changes

2020-05-07

 * 17:39 ottomata: deploying fix to refinery bin/camus CamusPartitionChecker when using dynamic stream configs
 * 16:49 joal: Restart and babysit mediawiki-history-denormalize-wf-2020-04
 * 16:37 elukey: roll restart of all the nodemanagers on the hadoop cluster to pick up new jvm settings
 * 13:53 elukey: move stat1007 to role::statistics::explorer (adding jupyterhub)
 * 11:00 joal: Moving application_1583418280867_334532 to the nice queue
 * 10:58 joal: Rerun wikidata-articleplaceholder_metrics-wf-2020-5-6
 * 07:45 elukey: re-run mediawiki-history-denormalize
 * 07:43 elukey: kill application_1583418280867_333560 after a chat with David, the job is consuming ~2TB of RAM
 * 07:32 elukey: re-run mediawiki history load
 * 07:18 elukey: execute yarn application -movetoqueue application_1583418280867_332862 -queue root.nice
 * 07:06 elukey: restart mediawiki-history-load via hue
 * 06:41 elukey: restart oozie on an-coord1001
 * 05:46 elukey: re-run mediarequest-hourly-wf-2020-5-6-19
 * 05:35 elukey: re-run two failed hours for webrequest load text (07/05T05) and upload (06/05T23)
 * 05:33 elukey: restart hadoop yarn nodemanager on analytics1071

2020-05-06

 * 12:49 elukey: restart oozie on an-coord1001 to pick up the new shlib retention changes
 * 12:28 mforns: re-run pageview-druid-hourly-coord for 2020-05-06T06:00:00 after oozie shared lib update
 * 11:30 elukey: use /run/user as kerberos credential cache for stat1005
 * 09:25 elukey: re-run projectview coordinator for 2020-5-6-5 after oozie shared lib update
 * 09:24 elukey: re-run virtualpageview coordinator for 2020-5-6-5 after oozie shared lib update
 * 09:13 elukey: re-run apis coordinator for 2020-5-6-7 after oozie shared lib update
 * 09:11 elukey: re-run learning features actor coordinator for 2020-5-6-7 after oozie shared lib update
 * 09:10 elukey: re-run aqs-hourly coordinator for 2020-5-6-7 after oozie shared lib update
 * 09:09 elukey: re-run mediacounts coordinator for 2020-5-6-7 after oozie shared lib update
 * 09:08 elukey: re-run mediarequest coordinator for 2020-5-6-7 after oozie shared lib update
 * 09:08 elukey: re-run data quality coordinators for 2020-5-6-5/6 after oozie shared lib update
 * 09:05 elukey: re-run pageview-hourly coordinator 2020-5-6-6 after oozie shared lib update
 * 09:04 elukey: execute oozie admin -sharelibupdate on an-coord1001
 * 06:05 elukey: execute hdfs dfs -chown -R analytics-search:analytics-search-users /wmf/data/discovery/search_satisfaction/daily/year=2019

2020-05-05

 * 19:49 mforns: Finished re-deploying refinery using scap, then re-deploying onto hdfs
 * 18:47 mforns: Finished deploying refinery using scap, then deploying onto hdfs
 * 18:13 mforns: Deploying refinery using scap, then deploying onto hdfs
 * 18:02 mforns: Deployed refinery-source using the awesome new jenkins jobs :]
 * 13:15 joal: Dropping unavailable mediawiki-history-reduced datasources from superset

2020-05-04

 * 17:08 joal: Restart refinery-sqoop-mediawiki-private.service on an-launcher1001
 * 17:03 elukey: restart refinery-drop-webrequest-refined-partitions after manual chown
 * 17:03 joal: Restart refinery-sqoop-whole-mediawiki.service on an-launcher1001
 * 17:02 elukey: chown analytics (was: hdfs) /wmf/data/wmf/webrequest/webrequest_source=text/year=2019/month=12/day=14/hour={13,18}
 * 16:44 joal: Deploy refinery again using scap (trying to fix sqoop)
 * 15:39 joal: restart refinery-sqoop-whole-mediawiki.service
 * 15:37 joal: restart refinery-sqoop-mediawiki-private.service
 * 14:50 joal: Deploy refinery using scap to fix sqoop
 * 13:43 elukey: restart refinery-sqoop-whole-mediawiki to test failure exit codes
 * 06:50 elukey: upgrade druid-exporter on all druid nodes

2020-05-03

 * 19:36 joal: Rerun mobile_apps-session_metrics-wf-7-2020-4-26

2020-05-02

 * 10:54 joal: Rerun predictions-actor-hourly-wf-2020-5-2-0

2020-05-01

 * 16:59 elukey: test prometheus-druid-exporter 0.8 on druid1001 (deb packages not yet uploaded, just built and manually installed)

2020-04-30

 * 10:36 elukey: run superset init to add missing perms on an-tool1005 and analytics-tool1004 - T249681
 * 07:14 elukey: correct X-Forwarded-Proto for superset (http -> https) and restart it

2020-04-29

 * 18:55 joal: Kill-restart cassandra-daily-coord-local_group_default_T_pageviews_per_article_flat
 * 18:46 joal: Kill-restart pageview-hourly job
 * 18:45 joal: No restart needed for pageview-druid jobs
 * 18:36 joal: kill restart pageview-druid jobs (hourly, daily, monthly) to add new dimension
 * 18:29 joal: Kill-restart data-quality-stats-hourly bundle
 * 17:57 joal: Deploy refinery on HDFS
 * 17:45 elukey: roll restart Presto workers to pick up the new jvm settings (110G heap size)
 * 16:06 joal: Deploying refinery using scap
 * 15:57 joal: Deploying AQS using scap
 * 14:26 elukey: enable TLS consumer/producers for kafka main -> jumbo mirror maker - T250250
 * 13:48 joal: Releasing refinery 0.0.123 onto archiva with Jenkins
 * 08:47 elukey: roll restart zookeeper on an-conf* to pick up new openjdk11 updates (affects hadoop)

2020-04-27

 * 13:02 elukey: superset 0.36.0 deployed to an-tool1005

2020-04-26

 * 18:14 elukey: restart nodemanager on analytics1054 - failed due to heap pressure
 * 18:14 elukey: re-run webrequest-load-coord-text 26/04/2020T16 via Hue

2020-04-23

 * 13:57 elukey: launch again data quality stats bundle with https://gerrit.wikimedia.org/r/#/c/analytics/refinery/+/592008/ applied locally

2020-04-22

 * 06:46 elukey: kill dataquality hourly bundle again, traffic_by_country keeps failing
 * 06:11 elukey: start data quality bundle hourly with --user=analytics
 * 05:45 elukey: add a separate refinery scap target for the Hadoop test cluster and redeploy to check new settings

2020-04-21

 * 23:17 milimetric: restarted webrequest bundle, babysitting that first before going on
 * 23:00 milimetric: forgot a small jar version update, finished deploying now
 * 21:38 milimetric: deployed twice because analytics1030 failed with "OSError {}" but seems ok after the second deploy
 * 14:27 elukey: add motd to notebook100[3,4] to alert about host deprecation (in favor of stat100x)
 * 11:51 elukey: manually add SUCCESS flags under /wmf/data/wmf/banner_activity/daily/year=2020/month=1 and /wmf/data/wmf/banner_activity/daily/year=2019/month=12 to unblock druid banner monthly indexations

2020-04-20

 * 14:38 ottomata: restarting eventlogging-processor with updated python3-ua-parser for parsing KaiOS user agents
 * 10:28 elukey: drop /srv/log/mw-log/archive/api from stat1007 (freeing 1.3TB of space!)

2020-04-18

 * 21:40 elukey: force hdfs-balancer as an attempt to redistribute hdfs blocks more evenly to worker nodes (hoping to free the busiest ones)
 * 21:32 elukey: drop /user/analytics-privatedata/.Trash/* from hdfs to free some space (~100G used)
 * 21:25 elukey: drop /var/log/hadoop-yarn/apps/analytics-search/* from hdfs to free space (~8T replicated used)
 * 21:21 elukey: drop /user/{analytics|hdfs}/.Trash/* from hdfs to free space (~100T used)
 * 21:12 elukey: drop /var/log/hadoop-yarn/apps/analytics from hdfs to free space (15.1T replicated)

2020-04-17

 * 13:45 elukey: lock down /srv/log/mw-log/archive/ on stat1007 to analytics-privatedata-users access only
 * 10:26 elukey: re-created default venv for notebooks on notebook100[3,4] (forgot to git pull before re-creating it the last time)

2020-04-16

 * 05:34 elukey: restart hadoop-yarn-nodemanager on an-worker108[4,5] - failed after GC OOM events (heavy spark jobs)

2020-04-15

 * 14:03 elukey: update Superset Alpha role perms with what is stated in T249923#6058862
 * 09:35 elukey: restart jupyterhub too as follow up
 * 09:35 elukey: execute "create_virtualenv.sh ../venv" on stat1006, notebook1003, notebook1004 to apply new settings to Spark kernels (re-creating them)
 * 09:09 elukey: restart druid brokers on druid100[4-6] - stuck after datasource deletion

2020-04-11

 * 09:19 elukey: set hive-security: read-only for the Presto hive connector and roll restart the cluster

2020-04-10

 * 16:31 elukey: enable TLS from kafkatee to Kafka on analytics1030 (test instance)
 * 15:45 elukey: migrate data_purge timers from an-coord1001 to an-launcher1001
 * 09:11 elukey: move druid_load jobs from an-coord1001 to an-launcher1001
 * 08:08 elukey: move project_namespace_map from an-coord1001 to an-launcher1001
 * 07:38 elukey: move hdfs-cleaner from an-coord1001 to an-launcher1001

2020-04-09

 * 20:54 elukey: re-run webrequest upload/text hour 15:00 from Hue (stuck due to missing _IMPORTED flag, caused by an-launcher1001 migration. Andrew fixed it by manually re-running the Camus checker)
 * 16:00 elukey: move camus timers from an-coord1001 to an-launcher1001
 * 15:20 elukey: absent spark refine timers on an-coord1001 and move them to an-launcher1001

2020-04-07

 * 09:17 elukey: enable refine for TwoColConflictExit (EL schema)

2020-04-06

 * 13:23 elukey: upgraded stat1008 to AMD ROCm 3.3 (enables tensorflow 2.x)
 * 12:33 joal: Bump AQS druid backend to 2020-03
 * 11:50 elukey: deploy new druid datasource in Druid public
 * 06:29 elukey: allow all analytics-privatedata-users to use the GPUs on stat1005/8

2020-04-04

 * 06:52 elukey: restart refinery-import-page-history-dumps

2020-04-03

 * 09:57 elukey: remove TwoColConflictExit from eventlogging's refine blacklist

2020-04-02

 * 19:31 joal: restart pageview-hourly job after manual patch
 * 19:29 joal: Manually patching last deploy to fix virtualpageview job - code merged
 * 17:48 joal: Kill/restart virtualpageview-hourly-coord after deploy
 * 16:55 joal: Deploy refinery onto HDFS
 * 16:30 joal: Deploy refinery using scap
 * 16:12 elukey: re-enable timers on an-coord1001 after maintenance
 * 15:52 elukey: restart hive server2/metastore with G1 settings
 * 14:05 elukey: temporary stop timers on an-coord1001 to facilitate hive daemons restarts
 * 13:47 hashar: test 1 2 3
 * 13:30 joal: Releasing refinery-source v0.0.121 using new jenkins-docker :)
 * 08:23 elukey: kill/restart netflow realtime druid indexation with a new dimension (peer_ip_src) - T246186

2020-04-01

 * 21:19 joal: restart pageview-hourly-wf-2020-4-1-15
 * 18:24 joal: Kill learning-features-actor-hourly as new version to come
 * 18:23 joal: Restart unique_devices-per_project_family-monthly-wf-2020-3 and aqs-hourly-wf-2020-4-1-15 after hive failure
 * 18:21 joal: restart webrequest-load-wf-upload-2020-4-1-16 and webrequest-load-wf-text-2020-4-1-16 after hive failure
 * 18:14 joal: Kill groceryheist job taking half the cluster
 * 18:06 ottomata: restarted hive-server2
 * 10:07 jbond42: updating icu packages

2020-03-31

 * 12:57 jbond42: updating icu on presto-analytics-canary and hadoop-worker-canary

2020-03-30

 * 07:27 elukey: run /usr/local/bin/refine_sanitize_eventlogging_analytics_immediate --ignore_failure_flag=true --since=72 --verbose --table_whitelist_regex="ResourceTiming" refine_sanitize_eventlogging_analytics_immediate to fix _REFINE_FAILED events
 * 07:16 elukey: run eventlogging refine manually for schemas "EditorActivation|EditorJourney|HomepageVisit|VisualEditorFeatureUse|WikibaseTermboxInteraction|UploadWizardErrorFlowEvent|MobileWikiAppiOSReadingLists|ContentTranslationCTA|QuickSurveysResponses|MobileWikiAppiOSSessions" to fix _REFINE_FAILED events

2020-03-29

 * 08:44 elukey: blacklist TwoColConflictExit from Eventlogging Refine to avoid alarm spam

2020-03-28

 * 16:54 elukey: restart yarn nodemanager on analytics1071 - network errors in the logs

2020-03-27

 * 08:09 elukey: deployed new kernels for https://gerrit.wikimedia.org/r/580083 on stat1004

2020-03-26

 * 09:09 elukey: re-running manually webrequest-load upload 26/03/2020T08 - kerberos failures

2020-03-25

 * 08:14 elukey: restart presto-server on an-coord1001 to remove jmx catalog config

2020-03-24

 * 15:46 elukey: restart all cron.service processes on stat/notebook (killing long lingering processes) to move the unit under user.slice

2020-03-21

 * 14:17 joal: Restart wikidata_item_page_link job with manual fix - review to be confirmed
 * 14:06 joal: Kill buggy wikidata_item_page_link job

2020-03-18

 * 19:39 fdans: refinery deployed
 * 18:52 fdans: deploying refinery
 * 18:51 fdans: refinery source 0.0.119 jars generated and symlinked
 * 18:17 fdans: beginning deploy of refinery-source 0.0.119

2020-03-17

 * 17:25 elukey: deploy superset to enable Presto and Kerberos (PyHive 0.6.2)

2020-03-16

 * 19:43 joal: Kill-restart wikidata-articleplaceholder_metrics-coord to fix yarn queue
 * 18:30 mforns: Deployed refinery using scap, then deployed onto hdfs
 * 17:05 elukey: roll restart of hadoop namenodes to get the new GC setting (MaxGCPauseMillis 400 -> 1000)

2020-03-13

 * 12:18 joal: Restart cassandra-daily-wf-local_group_default_T_pageviews_per_article_flat-2020-3-12

2020-03-12

 * 22:53 mforns: Deployed refinery using scap, then deployed onto hdfs
 * 22:22 mforns: deployed refinery-source using jenkins
 * 11:09 elukey: roll restart kerberos kdcs to pick up new ticket lifetime settings (10h -> 48h)
 * 08:27 elukey: re-running refine eventlogging with --since 12 (very conservative but just in case)

2020-03-11

 * 14:49 elukey: add xmldumps mountpoints on stat1004 and stat1005

2020-03-10

 * 15:20 elukey: remove the analytics user keytab from stat100[4,5]
 * 15:06 elukey: move stat1006 to role::statistics::explorer
 * 09:24 elukey: removed /etc/mysql/conf.d/stats-research-client.cnf from all stat boxes (all file used for RU, now on an-launcher1001)

2020-03-09

 * 07:27 elukey: deploy jupyterhub on notebook100[3,4] (manual venv re-creation) to allow the use of the user.slice - T247055
 * 07:26 elukey: upgrade nodejs from 6->10 on stat1* and notebook1*

2020-03-08

 * 17:58 elukey: restart hadoop-yarn-nodemanager on an-worker1087

2020-03-06

 * 14:58 joal: AQS new druid snapshot released (2020-02)
 * 10:06 elukey: roll restart Presto daemons for openjdk upgrades
 * 09:45 elukey: roll restart of cassandra on AQS to pick up new openjdk upgrades

2020-03-05

 * 19:45 elukey: deleted dangling 'reports' symlink on stat100[6,7] in /srv/published
 * 19:39 elukey: mv /srv/reportupdater to /srv/reportupdater-backup05032020 on stat100[6,7]
 * 16:34 mforns: restart turnilo to refresh deleted datasources
 * 14:16 elukey: restart hdfs/yarn master daemons to pick up new core-site changes for Superset
 * 06:48 elukey: restart yarn on analytics1074 (GC overhead, traces of network errors with datanodes)

2020-03-04

 * 08:41 joal: Kill-restart mediawiki-history-reduced-coord
 * 08:38 joal: Kill-restart mediawiki-history-dumps-coord

2020-03-03

 * 21:19 joal: Kill-restart actor jobs
 * 21:17 joal: kill-restart mediawiki-history-check_denormalize-coord
 * 21:16 joal: Kill-restart mediawiki-history job
 * 21:10 joal: Kill Wikidataplaceholder failing coord
 * 21:08 joal: Kill restart wikidata-specialentitydata_metrics-coord
 * 21:07 joal: Start Wikidataplaceholder job
 * 21:06 joal: Kill/restart edit_hourly job
 * 21:04 joal: Start wikidata_item_page_link coordinator
 * 20:46 joal: Deploy refinery onto HDFS
 * 20:34 joal: Deploy refinery using scap
 * 20:28 joal: Add new jars to refinery using Jenkins
 * 20:01 joal: Release refinery-source v0.0.117 with Jenkins
 * 16:37 mforns: restarted turnilo to refresh deleted test datasource
 * 11:56 joal: Kill actor-hourly oozie test jobs (precision of previous message)
 * 11:55 joal: Kill actor-hourly tests
 * 10:50 elukey: restarted kafka jumbo (kafka + mirror maker) for openjdk upgrades
 * 09:22 joal: Rerunning failed mediawiki-history jobs for 2020-02 after mediawiki-history-denormalize issue
 * 09:16 joal: Manually restarting mediawiki-history-denormalize with new patch to try
 * 08:36 elukey: roll restart kafka-jumbo for openjdk upgrades
 * 08:34 elukey: re-enable timers on an-coord1001 after maintenance
 * 08:30 joal: Correct previous message: Kill mediawiki-history (not mediawiki-history-reduced) as it is failing
 * 08:30 joal: Kill mediawiki-history-reduced as it is failing
 * 08:22 elukey: hive metastore/server2 now running without zookeeper settings and without DBTokenStore (in memory one used instead, the default)
 * 08:19 elukey: restart oozie/hive daemons on an-coord1001 for openjdk upgrades
 * 06:41 elukey: roll restart druid daemons for openjdk upgrades
 * 06:39 elukey: stop timers on an-coord1001 to facilitate daemon restarts (hive/oozie)

2020-03-02

 * 19:58 joal: Remove faulty _REFINED file at /wmf/data/event/mediawiki_revision_score/datacenter=eqiad/year=2020/month=3/day=2/hour=10/_REFINED
 * 15:38 elukey: apply new settings to all stat/notebooks
 * 15:31 elukey: setting new user.slice global memory/cpu settings on notebook1003
 * 15:25 elukey: setting new user.slice global memory/cpu settings on stat1007

2020-02-28

 * 19:10 milimetric: deployed 0.0.116 and restarted webrequest load bundle at 2020-02-28T14
 * 14:49 joal: Drop test keyspaces in cassandra cluster

2020-02-27

 * 21:16 milimetric: tried to deploy AQS but it failed with the same integration test on mediarequests, sending email

2020-02-26

 * 15:06 ottomata: dropped and re-added backfilled partitions on event.CentralNoticeImpression table to propagate schema alter on main table - T244771
 * 09:50 joal: Force delete old api/cirrus events from HDFS trash to free some space

2020-02-24

 * 18:20 elukey: move report updater jobs from stat1007 to an-launcher1001

2020-02-22

 * 14:21 elukey: restart hadoop-yarn-nodemanager on analytics1044 - broken disk, apply hiera overrides to exclude it
 * 14:11 elukey: restart hadoop-yarn-nodemanager on analytics1073 - process died, logs saved in /home/elukey

2020-02-21

 * 16:04 ottomata: altered event.CentralNoticeImpression table column event.campaignStatuses to type string, will backfill data - T244771
 * 11:49 elukey: restart varnishkafka on various cp30xx nodes
 * 11:41 elukey: restart varnishkafka on cp3057 (stuck in timeouts to kafka, analytics alarms raised)
 * 08:19 fdans: deploying refinery
 * 00:11 joal: Rerun failed wikidata-json_entity-weekly-coord instances after having created the missing hive table

2020-02-20

 * 16:57 fdans: refinery source jars updated
 * 16:39 fdans: deploying refinery source 0.0.114
 * 15:16 fdans: deploying AQS

2020-02-19

 * 16:58 ottomata: Deployed refinery using scap, then deployed onto hdfs

2020-02-17

 * 18:29 elukey: reboot turnilo and superset's hosts for kernel upgrades
 * 18:25 elukey: restart kafka on kafka-jumbo1001 to pick up new openjdk updates
 * 18:22 elukey: restart cassandra on aqs1004 to pick up new openjdk updates
 * 17:59 elukey: restart druid daemons on druid1003 to pick up new openjdk updates
 * 17:58 elukey: restart cassandra on aqs1004 to pick up new openjdk updates
 * 17:56 elukey: restart hadoop daemons on analytics1042 to pick up new openjdk updates

2020-02-15

 * 12:07 elukey: re-run failed pageview druid hour
 * 12:05 elukey: re-run failed virtualpageview hours

2020-02-12

 * 14:33 elukey: restart hue on analytics-tool1001
 * 13:36 joal: Kill-restart webrequest bundle to see if it mitigates the error

2020-02-10

 * 15:26 elukey: kill application_1576512674871_246621 (consuming too much memory)
 * 14:31 elukey: kill application_1576512674871_246419 (eating a ton of ram on the cluster)

2020-02-08

 * 09:35 elukey: created /wmf/data/raw/wikidata/dumps/all_ttl on hdfs
 * 09:35 elukey: created /wmf/data/raw/wikidata/dumps/all_json on hdfs

2020-02-05

 * 21:14 joal: Kill data_quality_stats-hourly-bundle and data_quality_stats-daily-bundle
 * 21:11 joal: Kill-restart mediawiki-history-dumps-coord, drop existing data, and restart at 2019-11
 * 21:06 joal: Kill-restart mediawiki-wikitext-history-coord and mediawiki-wikitext-current-coord
 * 20:51 joal: Deploy refinery using scap
 * 20:29 joal: Refinery-source released in archiva by jenkins
 * 20:20 joal: Deploy hdfs-tools 0.0.5 using scap

2020-02-03

 * 11:20 elukey: restart oozie on an-coord1001
 * 10:11 elukey: enable all timers on an-coord1001 after spark encryption/auth settings
 * 09:32 elukey: roll restart yarn node managers again to pick up spark encryption/authentication settings
 * 08:34 elukey: stop timers on an-coord1001 to drain the cluster and ease the deploy of spark encryption settings
 * 07:58 elukey: roll restart hadoop yarn node managers to pick up new libcrypto.so link (shouldn't be necessary but just in case)
 * 07:24 elukey: create /usr/lib/x86_64-linux-gnu/libcrypto.so on all the analytics nodes via puppet

2020-01-27

 * 05:38 elukey: re-run webrequest text 2020-01-26T20/21 with higher dataloss thresholds (false positives)
 * 02:49 elukey: re-run refine eventlogging manually to clear out refine failed events

2020-01-26

 * 17:58 elukey: re-run failed refine job for MobileWebUIActionsTracking 2020-01-26T12
 * 17:32 elukey: restart varnishkafka on cp3056/cp3064 due to network issues on the hosts

2020-01-23

 * 17:48 milimetric: launching a sqoop for imagelinks (will be slow because tuning sess)

2020-01-20

 * 12:19 elukey: restart zookeeper on an-conf100X to pick up openjdk-11 updates

2020-01-18

 * 10:06 elukey: re-run all failed entropy jobs via Hue (StopWatch issue)

2020-01-16

 * 20:52 mforns: deployed refinery accompanying source v0.0.112
 * 17:00 mforns: deployed refinery-source v0.0.112
 * 15:17 elukey: upgrade superset to 0.35.2
 * 15:14 elukey: stop superset as prep step for upgrade

2020-01-15

 * 10:44 elukey: remove flume-ng and spark-python/core packages from an-coord1001,analytics1030,analytics-tool1001,analytics1039 - T242754
 * 10:39 elukey: remove flume-ng from all stat/notebooks - T242754
 * 10:37 elukey: remove spark-core flume-ng from all the hadoop workers - T242754
 * 08:44 elukey: move aqs to the new rsyslog-logstash pipeline

2020-01-14

 * 20:12 milimetric: deployed aqs with new service-runner version 2.7.3

2020-01-13

 * 21:45 milimetric: webrequest restarted
 * 21:32 milimetric: killing webrequest bundle for restart
 * 15:00 joal: Deploy hdfs-tools 0.0.3 using scap
 * 14:24 joal: Releasing hdfs-tools 0.0.3 to archiva
 * 12:54 elukey: restart hue to re-apply user hive limits (again)

2020-01-10

 * 14:30 elukey: restart oozie with new settings to instruct it to pick up spark-defaults.conf settings from /etc/spark2/conf
 * 07:38 elukey: re-run virtualpageviews-druid-daily 09/01/2020 via Hue
 * 07:37 elukey: systemctl restart drop-el-unsanitized-events on an-coord1001

2020-01-09

 * 11:17 moritzm: installing cyrus-sasl security updates
 * 11:10 elukey: remove old accounts (user: absent) from Superset
 * 10:30 elukey: revert hue's hive query limit and restart hue - T242306
 * 07:45 elukey: re-run failed data-quality-stats-event.navigationtiming-useragent_entropy-hourly-coord 2020/01/09T00
 * 07:33 elukey: kill test_elukey_webrequest_sampled_128 from druid
 * 07:30 elukey: restart turnilo after updating the webrequest_sampled_128's config

2020-01-08

 * 20:44 joal: Restart webrequest-load-bundle to update queue to production
 * 20:17 joal: rerun edit-hourly-wf-2019-12 after having updated the underlying table
 * 20:06 joal: Prepare and start learning-features-actor-hourly-coord
 * 19:56 joal: kill wikidata-articleplaceholder_metrics-coord as it is buggy
 * 19:56 joal: Kill-restart edit-hourly-coord and edit-hourly-druid-coord
 * 19:48 joal: Kill-restart wikidata-articleplaceholder_metrics-coord
 * 19:44 joal: Kill-restart mediawiki-history-load-coord, mediawiki-history-denormalize-coord, mediawiki-history-check_denormalize-coord, mediawiki-history-metrics-coord, mediawiki-history-reduced-coord, mediawiki-history-dumps-coord
 * 19:42 joal: Kill-restart mediawiki-history-load-coord,
 * 19:29 joal: Kill-restart webrequest-druid-daily-coord and webrequest-druid-hourly-coord after deploy
 * 19:16 joal: Deploy refinery on HDFS
 * 19:04 joal: Deploy refinery using scap
 * 18:30 joal: Releasing refinery-0.0.110 to archiva using Jenkins
 * 18:11 joal: AQS deployed with new druid datasource (2019-12)
 * 17:52 joal: Rerun webrequest-load-wf-text-2020-1-8-15 with updated thresholds after frontend issue

2020-01-07

 * 17:54 elukey: apt-get remove python3.5 on stat1005
 * 15:16 elukey: re-enable timers on an-coord1001 after hive restart
 * 15:03 elukey: restart hive (server+metastore) on an-coord1001 to apply delegation token settings
 * 14:36 elukey: stop timers on an-coord1001 as prep step to restart hive
 * 14:05 elukey: apply max cpu cores usage (via systemd cgroups) on stat/notebook
 * 07:59 elukey: restart hue (again) with correct principal settings
 * 07:42 elukey: restart Hue after applying a new kerberos setting (hue_principal, was not specified before)

2020-01-06

 * 16:45 joal: Manually sqoop missing tables (content,content_models,slot_roles,slots,wbc_entity_usage)

2020-01-02

 * 18:32 elukey: restart hue with new hive query limits