Analytics/Server Admin Log/Archive/2021

From mediawiki.org

2021-12-22

  • 19:13 milimetric: Additional context on the last delete message: this is on an-launcher1002, whose disk filled up
  • 19:12 milimetric: Marcel and I are deleting files from /tmp older than 60 days
  • 15:55 mforns: finished refinery deployment for anomaly detection queries
  • 14:54 mforns: starting refinery deployment for anomaly detection queries
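
The /tmp cleanup described in the first two entries above comes down to a single `find` invocation. A minimal sketch, demonstrated on a scratch directory rather than the real `/tmp` (the exact flags used on an-launcher1002 were not logged):

```shell
# simulate the cleanup on a scratch directory instead of /tmp
tmpdir=$(mktemp -d)
touch -d '90 days ago' "$tmpdir/stale.out"   # pretend this file is 90 days old
touch "$tmpdir/fresh.out"                    # created just now

# delete regular files whose mtime is more than 60 days in the past
find "$tmpdir" -type f -mtime +60 -delete

ls "$tmpdir"    # only fresh.out remains
```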

2021-12-20

  • 18:59 mforns: finished deployment of refinery, adding anomaly detection hql for airflow job
  • 18:39 mforns: started to deploy refinery, adding anomaly detection hql for airflow job

2021-12-17

  • 12:32 btullis: Upgraded druid packages, with pool/depool on druid1004
  • 11:20 btullis: btullis@an-test-druid1001:~$ sudo apt-get install druid-broker druid-common druid-coordinator druid-historical druid-middlemanager druid-overlord
  • 11:18 btullis: updating reprepro with new druid packages for buster-wikimedia to pick up new log4j jar files

2021-12-16

  • 11:01 btullis: btullis@an-test-druid1001:~$ sudo apt-get install druid-broker druid-common druid-coordinator druid-historical druid-middlemanager druid-overlord
  • 11:01 btullis: upgrading druid on the test cluster with new packages to test log4j changes.

2021-12-15

  • 08:51 joal: Rerun failed cassandra-daily-wf-local_group_default_T_mediarequest_per_file-2021-12-13 after cluster restart
  • 07:20 elukey: elukey@stat1007:~$ sudo systemctl reset-failed product-analytics-movement-metrics

2021-12-14

  • 19:02 milimetric: finished deploying the weekly train as per etherpad
  • 18:04 joal: Rerun failed cassandra-daily-wf-local_group_default_T_pageviews_per_article_flat-2021-12-13 after cluster reboot
  • 17:51 btullis: rebooting aqs1015
  • 17:25 btullis: rebooting aqs1013
  • 17:19 btullis: rebooting aqs1012
  • 16:00 btullis: rebooting aqs1011
  • 15:53 btullis: rebooting aqs1010
  • 15:00 btullis: btullis@aqs1010:~$ sudo nodetool-a repair --full system_auth
  • 14:59 btullis: cassandra@cqlsh> ALTER KEYSPACE "system_auth" WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': '12'}; on aqs1010-a
  • 14:25 btullis: btullis@aqs1011:$ sudo systemctl start cassandra-b.service
  • 12:44 joal: Rerun failed cassandra-hourly-wf-local_group_default_T_pageviews_per_project_v2-2021-12-14-10
  • 12:42 joal: Kill late spark cassandra loading job

2021-12-11

  • 10:06 elukey: kill process 2560 on stat1005 to allow puppet to clean up the related user (offboarded)
  • 10:04 elukey: kill process 2831 on stat1008 to allow puppet to clean up the related user (offboarded)

2021-12-09

  • 11:08 btullis: roll restarting druid historical daemons on analytics cluster T297148
  • 10:46 btullis: roll restarting druid brokers on analytics cluster

2021-12-07

  • 20:09 ottomata: deploy wikistats2 with doc updates

2021-12-03

  • 17:36 razzi: restart aqs-next to pick up new mediawiki snapshot: `razzi@cumin1001:~$ sudo cumin A:aqs-next 'systemctl restart aqs'`
  • 17:36 razzi: restart aqs to pick up new mediawiki snapshot: `razzi@cumin1001:~$ sudo cookbook sre.aqs.roll-restart aqs`
  • 07:33 elukey: move kafka-test to fixed uid/gid

2021-12-02

  • 20:05 ottomata: restarting pageview-druid-daily-coord (killing 0062888-210701181527401-oozie-oozi-C) - I can't seem to rerun a particular hour, so just starting again from that hour.
  • 17:57 elukey: drop "EventLogging MySQL" datasource from Superset (not valid anymore)
  • 17:26 joal: Kill paragon job to prevent more nodemanagers from OOMing

2021-12-01

2021-11-27

  • 09:56 elukey: powercycle analytics1071, soft lockup stacktraces in the tty

2021-11-24

  • 17:30 mforns: Deployed refinery using scap, then deployed onto hdfs
  • 12:31 btullis: btullis@an-launcher1002:~$ sudo systemctl reset-failed monitor_refine_event_sanitized_analytics_delayed.service
  • 07:10 elukey: drop /tmp/blockmgr-20fe4b2b-31fb-4a85-b5b1-bebe254120f8 on stat1006 to free space on the root partition

2021-11-23

  • 11:56 btullis: roll-restarting the cassandra services on the aqs cluster. (Not the aqs_next cluster)
  • 11:49 btullis: btullis@an-coord1001:~$ sudo systemctl restart presto-server.service
  • 11:49 btullis: btullis@an-coord1001:~$ sudo systemctl restart oozie.service

2021-11-22

  • 12:18 btullis: failed back the hive services to an-coord1001 via CNAME change
  • 11:36 btullis: btullis@an-coord1001:~$ sudo systemctl restart hive-server2 hive-metastore
  • 10:44 btullis: deploying DNS change to switch hive to the standby server.
  • 10:18 btullis: btullis@an-coord1002:~$ sudo systemctl restart hive-server2 hive-metastore

2021-11-18

  • 17:26 elukey: varnishkafka-webrequest on cp3050 is running with /etc/ssl/localcerts/wmf_trusted_root_CAs.pem
  • 10:03 elukey: restart prometheus-druid-exporter on Druid Analytics to clear unnecessary metrics
  • 07:32 elukey: restart prometheus-druid-exporter on Druid Public to see metrics difference

2021-11-17

  • 16:01 btullis: roll-restarting kafka-test brokers
  • 12:12 btullis: roll-restarting the presto analytics workers
  • 11:44 btullis: btullis@archiva1002:~$ sudo systemctl restart archiva.service
  • 07:29 elukey: `apt-get clean` on an-tool1005 to free space in the root partition
  • 07:28 elukey: `sudo pkill -U jmixter` on stat100[5,8] to allow puppet to run and remove the offboarded user

2021-11-16

  • 19:40 joal: Deploying refinery to HDFS
  • 19:15 joal: Deploying refinery with scap
  • 18:23 joal: Releasing refinery-source v0.1.21
  • 11:32 btullis: btullis@cumin1001:~$ sudo cookbook sre.druid.roll-restart-workers public
  • 10:20 btullis: roll-restarting hadoop masters

2021-11-15

  • 16:37 joal: Rerun failed mediawiki-wikitext-history-wf-2021-10

2021-11-11

  • 06:56 elukey: `systemctl start prometheus-mysqld-exporter@analytics_meta` on db1108

2021-11-10

  • 18:20 btullis: btullis@an-launcher1002:~$ sudo systemctl reset-failed monitor_refine_event_sanitized_analytics_delayed.service
  • 10:19 btullis: btullis@an-launcher1002:~$ sudo systemctl reset-failed monitor_refine_event_sanitized_analytics_delayed

2021-11-09

  • 16:52 razzi: restart presto server on an-coord1001 to apply change for T292087
  • 16:30 razzi: set superset presto version to 0.246 in ui
  • 16:30 razzi: set superset presto timeout to 170s: {"connect_args":{"session_props":{"query_max_run_time":"170s"}}} for T294771
  • 12:23 btullis: btullis@an-launcher1002:~$ sudo systemctl reset-failed monitor_refine_event_sanitized_analytics_delayed
  • 07:23 elukey: `apt-get clean` on stat1006 to free some space (root partition full)

2021-11-08

  • 19:51 ottomata: an-coord1002: drop user 'admin'@'localhost'; start slave; to fix broken replication - T284150
  • 19:44 razzi: create admin user on an-coord1001 for T284150
  • 18:07 razzi: run `create user 'admin'@'localhost' identified by <password>; grant all privileges on *.* to admin;` to allow milimetric to access mysql on an-coord1002 for T284150

2021-11-04

  • 16:39 razzi: add "can sql json on superset" permission to Alpha role on superset.wikimedia.org
  • 16:14 razzi: drop and restore superset_staging database to test permissions as they are in production

2021-11-03

  • 17:07 razzi: razzi@an-tool1010:~$ sudo systemctl stop superset
  • 16:57 razzi: dump mysql in preparation for superset upgrade
  • 02:23 milimetric: deployed refinery with regular train

2021-10-29

  • 23:04 btullis: deleted all remaining old cassandra snapshots on aqs100x servers.
  • 22:58 btullis: deleted old snapshots from aqs1006 and aqs1009
  • 17:45 razzi: set presto_analytics_hive extra parameter engine_params.connect_args.session_props.query_max_run_time to 55s on superset.wikimedia.org
  • 10:39 elukey: roll restart of kafka-test to pick up new truststore (root PKI added)

2021-10-28

  • 19:13 ottomata: re-enable hdfs-cleaner for /wmf/gobblin

2021-10-26

  • 09:01 btullis: reverted hive services back to an-coord1001.

2021-10-25

  • 16:03 btullis: btullis@an-coord1001:~$ sudo systemctl restart hive-server2 hive-metastore
  • 13:02 btullis: btullis@an-coord1002:~$ sudo systemctl restart hive-server2 hive-metastore
  • 12:51 btullis: btullis@aqs1007:~$ sudo nodetool-a clearsnapshot

2021-10-21

  • 14:05 ottomata: rerun refine_eventlogging_analytics refine_eventlogging_legacy and refine_event with -ignore-done-flag=true --since=2021-10-21T01:00:00 --until=2021-10-21T04:00:00 for backfill of missing data after gobblin problems
  • 13:39 btullis: btullis@an-launcher1002:~$ sudo systemctl restart gobblin-event_default
  • 10:35 joal: Re-refine netflow data after gobblin pulled data fix
  • 08:41 joal: Rerun webrequest-load jobs for hour 2021-10-21T02:00

2021-10-20

  • 18:11 razzi: Deployed refinery using scap, then deployed onto hdfs
  • 16:36 razzi: deploy refinery change for https://phabricator.wikimedia.org/T287084
  • 07:15 joal: rerun webrequest-load-wf-upload-2021-10-20-1 after node issue
  • 06:27 elukey: reboot analytics1066 - OS showing CPU soft lockups, tons of defunct processes (including node manager) and high CPU usage

2021-10-19

  • 07:14 joal: Rerun cassandra-daily-wf-local_group_default_T_mediarequest_top_files-2021-10-17

2021-10-18

  • 19:29 joal: Rerun cassandra-daily-wf-local_group_default_T_top_pageviews-2021-10-17
  • 18:36 joal: Rerun cassandra-daily-wf-local_group_default_T_unique_devices-2021-10-17
  • 16:22 joal: rerun cassandra-daily-wf-local_group_default_T_top_percountry-2021-10-17
  • 16:16 joal: Rerun cassandra-daily-wf-local_group_default_T_mediarequest_per_referer-2021-10-17
  • 15:17 joal: Rerun failed instances from cassandra-hourly-coord-local_group_default_T_pageviews_per_project_v2
  • 14:49 elukey: restart hadoop-yarn-nodemanager on an-worker1119 and an-worker1103 (Java OOM in the logs)
  • 12:09 btullis: root@aqs1013:/srv/cassandra-b/tmp# systemctl restart cassandra-b.service
  • 12:09 btullis: root@aqs1012:/srv/cassandra-b/tmp# systemctl restart cassandra-b.service
  • 09:25 btullis: btullis@cumin1001:~$ sudo transfer.py aqs1013.eqiad.wmnet:/srv/cassandra-b/tmp/local_group_default_T_pageviews_per_article_flat an-presto1001.eqiad.wmnet:/srv/cassandra_migration/aqs1013-b/
  • 09:17 btullis: btullis@cumin1001:~$ sudo transfer.py aqs1012.eqiad.wmnet:/srv/cassandra-b/tmp/local_group_default_T_pageviews_per_article_flat an-presto1001.eqiad.wmnet:/srv/cassandra_migration/aqs1012-b/
  • 09:16 btullis: btullis@cumin1001:~$ sudo transfer.py aqs1012.eqiad.wmnet:/srv/cassandra-b/tmp/local_group_default_T_pageviews_per_article_flat an-presto1001.eqiad.wmnet:/srv/cassandra_migration/cassandra_migration/aqs1012-b/

2021-10-15

  • 08:33 btullis: btullis@aqs1007:~$ sudo nodetool-b clearsnapshot

2021-10-13

  • 19:49 mforns: re-ran cassandra-daily-coord-local_group_default_T_pageviews_per_article_flat for 2021-10-12 successfully
  • 17:58 ottomata: deleting files on stat1008 in /tmp older than 10 days and larger than 20M sudo find /tmp -mtime +10 -size +20M | xargs sudo rm -rfv
  • 17:54 ottomata: removed /tmp/spark-* files belonging to aikochou on stat1008
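
The `find | xargs rm` pipeline in the 17:58 entry above breaks on file names containing whitespace, which do occur under `/tmp`. A NUL-delimited variant is safer; this is a sketch against a scratch directory, not the command actually run:

```shell
tmpdir=$(mktemp -d)
dd if=/dev/zero of="$tmpdir/spark temp.bin" bs=1M count=25 status=none
touch -d '20 days ago' "$tmpdir/spark temp.bin"   # make it look 20 days old

# NUL-delimited pairing (-print0 / -0) survives spaces in names;
# -r skips running rm entirely when nothing matched
find "$tmpdir" -type f -mtime +10 -size +20M -print0 | xargs -0 -r rm -fv
```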

2021-10-12

  • 15:43 btullis: btullis@aqs1008:~$ sudo nodetool-b clearsnapshot
  • 13:17 btullis: btullis@analytics1069:~$ sudo shutdown -h now
  • 13:15 btullis: btullis@analytics1069:~$ sudo systemctl stop hadoop-hdfs-*
  • 13:14 btullis: btullis@analytics1069:~$ sudo systemctl stop hadoop-yarn-nodemanager.service
  • 07:26 joal: Rerun cassandra-daily-wf-local_group_default_T_pageviews_per_article_flat-2021-10-11

2021-10-11

  • 07:37 joal: rerun refine_event for `event`.`mediawiki_content_translation_event` year=2021/month=10/day=10/hour=16

2021-10-10

  • 18:07 joal: Rerun webrequest-load-wf-text-2021-10-10-10 - failed due to network issue

2021-10-06

  • 14:30 elukey: upgrade stat1005 to ROCm 4.2.0
  • 13:20 btullis: btullis@aqs1004:~$ sudo nodetool-a clearsnapshot
  • 10:20 elukey: upgrade ROCm to 4.2 on stat1008

2021-10-05

  • 11:28 elukey: failover analytics-hive back to an-coord1001 after maintenance

2021-10-04

  • 16:56 elukey: restart java daemons on an-coord1001 (standby)
  • 13:43 elukey: failover analytics-hive to an-coord1002 (to restart java daemons on 1001)
  • 07:43 joal: Kill-restart mediawiki-history-reduced job after deploy (more resources)
  • 07:32 joal: Deploy refinery to hdfs
  • 07:10 joal: Deploy refinery for mediawiki-history-reduced hotfix
  • 06:56 joal: Kill-restart pageview-monthly_dump-coord to apply fix for SLA

2021-10-01

  • 15:11 btullis: sudo -u analytics kerberos-run-command analytics /usr/local/bin/refine_eventlogging_legacy --ignore_failure_flag=true --table_include_regex='editoractivation' --since='2021-09-29T22:00:00.000Z' --until='2021-09-30T23:00:00.000Z'

2021-09-30

  • 19:55 ottomata: not changing stats uid to 499; that uid already belongs to another system user
  • 19:54 ottomata: changing stats uid and gid on an-launcher1002 and stat1005 to 499
  • 09:32 btullis: btullis@an-launcher1002:~$ sudo -u analytics kerberos-run-command analytics /usr/local/bin/refine_netflow --ignore_failure_flag=true --since=2021-09-28T11:00:00 --until 2021-09-28T12:00:00
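
The aborted uid change above could have been caught before any `chown` with a `getent` pre-check. A sketch (the `uid_free` helper is illustrative, not a command that was run):

```shell
# succeed only if no account currently owns the given uid
uid_free() { ! getent passwd "$1" > /dev/null; }

if uid_free 499; then
    echo "uid 499 is free"
else
    echo "uid 499 is already taken"   # the collision hit here
fi

uid_free 0 || echo "uid 0 is taken"   # root always owns uid 0
```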

2021-09-29

  • 09:16 elukey: restart hive-* units on an-coord1002 for openjdk upgrades (standby node)

2021-09-28

  • 13:14 btullis: Deployed refinery using scap, then deployed onto hdfs
  • 12:34 btullis: deploying refinery
  • 09:55 btullis: btullis@cumin1001:~$ sudo cumin --mode async 'aqs100*.eqiad.wmnet' 'nodetool-a snapshot -t T291472 local_group_default_T_pageviews_per_article_flat' 'nodetool-b snapshot -t T291472 local_group_default_T_pageviews_per_article_flat'
  • 09:36 elukey: restart java daemons on an-test-coord1001 to pick up new openjdk

2021-09-27

  • 11:18 btullis: btullis@stat1005:~$ sudo apt purge usrmerge
  • 11:11 btullis: btullis@stat1005:~$ sudo apt install usrmerge

2021-09-24

  • 22:33 razzi: restart an-test-coord presto coordinator service to experiment with web-ui.authentication.type=fixed
  • 15:06 btullis: btullis@cumin1001:~$ sudo cumin --mode async 'aqs100[4,7].eqiad.wmnet' 'nodetool-a snapshot -t T291469' 'nodetool-b snapshot -t T291469'
  • 14:47 btullis: btullis@aqs1007:~$ sudo nodetool-a repair --full local_group_default_T_mediarequest_per_file data
  • 11:02 btullis: btullis@an-master1001:~$ sudo systemctl restart hadoop-mapreduce-historyserver
  • 10:47 btullis: btullis@an-master1002:~$ sudo systemctl restart hadoop-hdfs-namenode
  • 10:47 btullis: btullis@an-master1002:~$ sudo systemctl restart hadoop-hdfs-zkfc
  • 10:35 btullis: btullis@an-master1001:~$ sudo -u hdfs kerberos-run-command hdfs /usr/bin/hdfs haadmin -failover an-master1002-eqiad-wmnet an-master1001-eqiad-wmnet
  • 10:07 btullis: btullis@an-launcher1002:~$ sudo -u analytics kerberos-run-command analytics /usr/local/bin/refine_eventlogging_legacy --ignore_failure_flag=true --table_include_regex='centralnoticeimpression' --since='2021-09-23T04:00:00.000Z' --until='2021-09-24T05:00:00.000Z'

2021-09-22

  • 17:23 razzi: razzi@an-test-coord1001:/etc/presto$ sudo systemctl restart presto-server
  • 17:05 joal: Kill-restart oozie jobs after deploy (mediawiki-history-denormalize-coord, mediawiki-history-check_denormalize-coord, mediawiki-history-dumps-coord, mediawiki-history-reduced-coord)
  • 11:54 joal: release refinery-source v0.1.18 to archiva with Jenkins

2021-09-20

  • 08:12 elukey: remove old /reportcard (password protected, old files from 2012) httpd settings for stats.wikimedia.org

2021-09-18

  • 06:48 joal: Rerun webrequest-load-wf-text-2021-9-18-0 for errors after last night's production issue

2021-09-17

  • 16:03 btullis: Cleared all snapshots on aqs100[47] to reclaim space with nodetool-[ab] clearsnapshot (T249755)
  • 15:15 btullis: btullis@aqs1004:~$ sudo nodetool-a repair --full && sudo nodetool-b repair --full (T249755)
  • 10:18 btullis: btullis@an-web1001:~$ sudo find /srv/published-rsynced -user systemd-coredump -exec chown stats {} \;
  • 09:47 milimetric: deployed refinery to sync sanitize allowlist, deleting event_sanitized data per decision in the task
  • 08:21 elukey: disable mod_cgi/mod_cgid on an-web1001 (and remove cgi-perl related httpd configs/settings)

2021-09-16

  • 19:25 ottomata: pointing analytics-web cname at new an-web1001, this moves stats and analytics .wm.org from thorium to an-web1001 - T285355
  • 18:30 joal: Create HDFS home folder for user 'analytics-research'
  • 07:03 elukey: stop jupyter-kaywong-singleuser.service on stat1005 to allow puppet to clean up

2021-09-15

  • 16:26 joal: Deploying refinery

2021-09-13

  • 18:25 razzi: (I stopped replication earlier but forgot to !log)
  • 18:24 razzi: razzi@dbstore1007:~$ for socket in /run/mysqld/*; do sudo mysql --socket=$socket -e "START SLAVE"; done - reenable replication for T290841
  • 18:19 razzi: razzi@dbstore1007:~$ sudo systemctl restart mariadb@s4.service for T290841
  • 18:13 razzi: razzi@dbstore1007:~$ sudo systemctl restart mariadb@s3.service for T290841
  • 18:05 razzi: sudo systemctl restart mariadb@s2.service

2021-09-07

  • 11:41 joal: Restarting cassandra hourly loading job after C2 snapshot taken and C3 tables truncated
  • 11:37 joal: Re-Add test rows in cassandra3 cluster after tables got truncated
  • 10:25 hnowlan: truncating data tables on aqs_next cluster
  • 10:12 joal: Kill cassandra-hourly loading job for cluster-migration first step

2021-09-03

  • 11:43 joal: Deploying refinery to hotfix mediarequest cassandra3 loading jobs (second)
  • 09:57 joal: Deploy AQS on new AQS servers
  • 09:45 joal: Kill-restart mediarequest-top cassandra loading jobs after deploy
  • 09:12 joal: Rerun mediawiki-history-denormalize-wf-2021-08 after failure
  • 09:07 joal: Deploying refinery to hotfix mediarequest cassandra3 loading jobs

2021-09-01

  • 16:44 mforns: finished one-off deployment of refinery to fix cassandra3 loading
  • 15:57 joal: Kill cassandra loading jobs and restart them after deploy
  • 15:55 mforns: starting one-off deployment of refinery to fix cassandra3 loading
  • 13:15 joal: Restart cassandra jobs to load cassandra3 with spark
  • 08:21 joal: Rerun webrequest-load-wf-upload-2021-9-1-0

2021-08-31

  • 23:25 mforns: finished deployment of refinery (regular weekly train v0.1.17) successfully, only an-test-coord1001.eqiad.wmnet failed
  • 22:41 mforns: starting deployment of refinery (regular weekly train v0.1.17)
  • 22:27 mforns: Deployed refinery-source using jenkins
  • 10:30 hnowlan: sudo cookbook sre.aqs.roll-restart aqs-next

2021-08-30

  • 06:53 elukey: drop an-airflow1001's old airflow logs to fix root partition almost filled up

2021-08-26

  • 06:22 elukey: root@an-launcher1002:/var/lib/puppet/clientbucket# find -type d -empty -delete
  • 06:21 elukey: root@an-launcher1002:/var/lib/puppet/clientbucket# find -type f -delete -mtime +60
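
Note that predicate order matters in the second command above: GNU find evaluates its expression left to right, so `-delete` placed before `-mtime +60` removes every file it visits and the age test is never consulted. A demonstration on a scratch directory:

```shell
tmpdir=$(mktemp -d)
touch "$tmpdir/recent"

# -delete acts as soon as it is reached; -mtime +60 is evaluated too late
find "$tmpdir" -type f -delete -mtime +60
ls -A "$tmpdir"   # empty: even the brand-new file was deleted

# safe form: tests first, action last
find "$tmpdir" -type f -mtime +60 -delete
```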

2021-08-25

  • 13:40 joal: Kill-restart pageview-monthly_dump job and 2 backfilling jobs
  • 13:34 joal: Deploy refinery onto HDFS
  • 13:09 joal: Deploying refinery using scap

2021-08-24

  • 10:30 btullis: btullis@an-launcher1002:~$ sudo systemctl start hdfs-balancer.service

2021-08-20

  • 08:46 btullis: btullis@druid1001:~$ sudo systemctl stop druid-broker druid-coordinator druid-historical druid-middlemanager druid-overlord

2021-08-19

  • 19:05 razzi: razzi@deploy1002:/srv/deployment/analytics/aqs/deploy$ scap deploy "Deploy aqs 9c062f2"
  • 19:02 razzi: note that the aqs-deploy repo's commit message DOES NOT include the changes of aqs in its changes list (though it has the correct SHA in the first line)
  • 18:26 razzi: Beginning aqs deploy process
  • 17:55 razzi: razzi@labstore1007:~$ sudo systemctl start analytics-dumps-fetch-geoeditors_dumps.service
  • 17:53 razzi: sudo systemctl start analytics-dumps-fetch-geoeditors_dumps.service on labstore1006

2021-08-18

  • 17:37 btullis: on an-coord1001: MariaDB [superset_production]> update clusters set broker_host='an-druid1001.eqiad.wmnet' where cluster_name='analytics-eqiad';
  • 15:08 joal: Restart oozie jobs loading druid to use new druid-host
  • 08:55 joal: Deploying refinery with scap

2021-08-13

  • 16:46 elukey: cleanup /srv/discovery on stat1007 after https://gerrit.wikimedia.org/r/c/operations/puppet/+/712422
  • 15:16 milimetric: reran the other three failed jobs successfully
  • 14:52 milimetric: rerunning webrequest-druid-hourly-wf-2021-8-13-13 because of failure to connect to Hive metastore

2021-08-12

  • 14:46 btullis: btullis@druid1002:/etc/zookeeper/conf$ sudo systemctl disable druid-broker druid-coordinator druid-historical druid-middlemanager druid-overlord
  • 14:45 btullis: btullis@druid1002:/etc/zookeeper/conf$ sudo systemctl stop druid-broker druid-coordinator druid-historical druid-middlemanager druid-overlord

2021-08-11

  • 19:43 btullis: btullis@druid1003:~$ sudo systemctl stop druid-overlord && sudo systemctl disable druid-overlord
  • 19:41 btullis: btullis@druid1003:~$ sudo systemctl stop druid-historical && sudo systemctl disable druid-historical
  • 19:40 btullis: btullis@druid1003:~$ sudo systemctl stop druid-coordinator && sudo systemctl disable druid-coordinator
  • 19:37 btullis: btullis@druid1003:~$ sudo systemctl stop druid-broker && sudo systemctl disable druid-broker
  • 19:30 btullis: btullis@druid1003:~$ curl -X POST http://druid1003.eqiad.wmnet:8091/druid/worker/v1/disable
  • 12:13 btullis: migration of zookeeper from druid1002 to an-druid1002 complete, with quorum and two synced followers. Re-enabling puppet on all druid nodes.
  • 09:48 btullis: suspended the following oozie jobs in hue: webrequest-druid-hourly-coord, pageview-druid-hourly-coord, edit-hourly-druid-coord
  • 09:45 btullis: btullis@an-launcher1002:~$ sudo systemctl disable eventlogging_to_druid_editattemptstep_hourly.timer eventlogging_to_druid_navigationtiming_hourly.timer eventlogging_to_druid_netflow_hourly.timer eventlogging_to_druid_prefupdate_hourly.timer
  • 09:21 elukey: run "sudo find /var/log/airflow -type f -mtime +15 -delete" on an-airflow1001 to free space (root partition almost full)

2021-08-10

  • 17:27 razzi: resume the following schedules in hue: edit-hourly-druid-coord, pageview-druid-hourly-coord, webrequest-druid-hourly-coord
  • 17:10 razzi: sudo cookbook sre.druid.roll-restart-workers analytics (errored out)
  • 09:04 btullis: btullis@an-launcher1002:~$ sudo systemctl restart eventlogging_to_druid_prefupdate_hourly.service
  • 09:04 btullis: btullis@an-launcher1002:~$ sudo systemctl restart eventlogging_to_druid_netflow_daily.service

2021-08-09

  • 10:45 btullis_: btullis@an-druid1003:/var/log/druid$ sudo chown -R druid:druid /srv/druid /var/log/druid
  • 10:25 btullis_: btullis@an-druid1003:~$ sudo puppet agent -tv

2021-08-04

  • 09:12 btullis: btullis@an-coord1001:~$ sudo systemctl start hive-metastore.service hive-server2.service
  • 09:12 btullis: btullis@an-coord1001:~$ sudo systemctl stop hive-server2.service hive-metastore.service
  • 09:00 btullis: sudo systemctl start hive-metastore && sudo systemctl start hive-server2
  • 09:00 btullis: btullis@an-coord1002:~$ sudo systemctl stop hive-server2 && sudo systemctl stop hive-metastore

2021-08-03

  • 19:23 ottomata: bump Refine to refinery version 0.1.16 to pick up normalized_host transform - now all event tables will have a new normalized_host field - T251320
  • 19:02 ottomata: Deployed refinery using scap, then deployed onto hdfs
  • 14:57 ottomata: rerunning webrequest refine for upload 08-03T01:00 - 0042643-210701181527401-oozie-oozi-W

2021-08-02

  • 18:49 razzi: sudo cookbook sre.druid.roll-restart-workers analytics
  • 17:57 razzi: sudo cookbook sre.druid.roll-restart-workers public

2021-07-30

  • 22:22 razzi: razzi@cumin1001:~$ sudo cookbook sre.druid.roll-restart-workers test

2021-07-29

  • 18:12 razzi: sudo cookbook sre.aqs.roll-restart aqs

2021-07-28

  • 10:46 btullis: btullis@an-test-coord1001:/etc/hive/conf$ sudo systemctl start hive-metastore.service hive-server2.service
  • 10:46 btullis: btullis@an-test-coord1001:/etc/hive/conf$ sudo systemctl stop hive-server2.service hive-metastore.service

2021-07-26

  • 20:54 razzi: reran the failed workflow of cassandra-daily-wf-local_group_default_T_pageviews_per_article_flat-2021-7-25

2021-07-22

  • 18:38 ottomata: deploy refinery to an-launcher1002 for bin/gobblin job lock change

2021-07-20

  • 20:30 joal: rerun webrequest timed-out instances
  • 18:58 mforns: starting refinery deployment
  • 18:40 razzi: razzi@an-launcher1002:~$ sudo puppet agent --enable
  • 18:39 razzi: razzi@an-master1001:/var/log/hadoop-hdfs$ sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues
  • 18:37 razzi: razzi@an-master1002:~$ sudo -i puppet agent --enable
  • 18:34 razzi: razzi@an-master1002:~$ sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues
  • 18:32 razzi: razzi@an-master1002:~$ sudo systemctl start hadoop-yarn-resourcemanager.service
  • 18:31 razzi: razzi@an-master1002:~$ sudo systemctl stop hadoop-yarn-resourcemanager.service
  • 18:22 razzi: sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1002-eqiad-wmnet an-master1001-eqiad-wmnet
  • 18:21 razzi: re-enable yarn queues by merging puppet patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/705732
  • 17:27 razzi: razzi@cumin1001:~$ sudo -i wmf-auto-reimage-host -p T278423 an-master1001.eqiad.wmnet
  • 17:17 razzi: stop all hadoop processes on an-master1001
  • 16:52 razzi: starting hadoop processes on an-master1001 since they didn't fail over cleanly
  • 16:31 razzi: sudo bash gid_script.bash on an-master1001
  • 16:29 razzi: razzi@alert1001:~$ sudo icinga-downtime -h an-master1001 -d 7200 -r "an-master1001 debian upgrade"
  • 16:25 razzi: razzi@an-master1001:~$ sudo systemctl stop hadoop-mapreduce-historyserver
  • 16:25 razzi: sudo systemctl stop hadoop-hdfs-zkfc.service on an-master1001 again
  • 16:25 razzi: sudo systemctl stop hadoop-yarn-resourcemanager on an-master1001 again
  • 16:23 razzi: sudo systemctl stop hadoop-hdfs-namenode on an-master1001
  • 16:19 razzi: razzi@an-master1001:~$ sudo systemctl stop hadoop-hdfs-zkfc
  • 16:19 razzi: razzi@an-master1001:~$ sudo systemctl stop hadoop-yarn-resourcemanager
  • 16:18 razzi: sudo systemctl stop hadoop-hdfs-namenode
  • 16:10 razzi: razzi@cumin1001:~$ sudo transfer.py an-master1002.eqiad.wmnet:/home/razzi/hdfs-namenode-snapshot-buster-reimage-$(date --iso-8601).tar.gz stat1004.eqiad.wmnet:/home/razzi/hdfs-namenode-fsimage
  • 16:03 razzi: root@an-master1002:/srv/hadoop/name# tar -czf /home/razzi/hdfs-namenode-snapshot-buster-reimage-$(date --iso-8601).tar.gz current
  • 15:57 razzi: sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -saveNamespace
  • 15:52 razzi: sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode enter
  • 15:37 razzi: kill yarn applications: for jobId in $(yarn application -list | awk 'NR > 2 { print $1 }'); do yarn application -kill $jobId; done
  • 15:08 razzi: sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues
  • 14:52 razzi: sudo systemctl stop 'gobblin-*.timer'
  • 14:51 razzi: sudo systemctl stop analytics-reportupdater-logs-rsync.timer
  • 14:47 razzi: Disable jobs on an-launcher1002 (see https://phabricator.wikimedia.org/T278423#7190372)
  • 14:46 razzi: razzi@an-launcher1002:~$ sudo puppet agent --disable 'razzi: upgrade hadoop masters to debian buster'
  • 08:32 mforns: restarted webrequest bundle (messed up a coord when trying to rerun some failed hours)

2021-07-17

  • 08:54 elukey: run 'sudo find -type f -name '*.log*' -mtime +30 -delete' on an-coord1001:/var/log/hive to free space (root partition almost filled up) - T279304

2021-07-15

  • 16:44 ottomata: deploying refinery and refinery-source 0.1.15 for refine job fixes - T271232
  • 13:39 joal: Kill refine_event application_1623774792907_154469 to let manual run finish
  • 13:35 joal: Kill currently running refine job (application_1623774792907_154014)
  • 11:20 joal: Kill stuck refine application

2021-07-14

  • 17:39 razzi: sudo cookbook sre.druid.roll-restart-workers public for https://phabricator.wikimedia.org/T283067
  • 00:34 razzi: razzi@an-test-druid1001:~$ sudo systemctl restart zookeeper
  • 00:33 razzi: razzi@an-test-druid1001:~$ sudo systemctl restart druid-coordinator
  • 00:33 razzi: razzi@an-test-druid1001:~$ sudo systemctl restart druid-broker
  • 00:28 razzi: razzi@an-test-druid1001:~$ sudo systemctl restart druid-middlemanager
  • 00:24 razzi: razzi@an-test-druid1001:~$ sudo systemctl restart druid-overlord
  • 00:24 razzi: razzi@an-test-druid1001:~$ sudo systemctl restart druid-historical

2021-07-13

  • 19:29 joal: move /wmf/data/raw/eventlogging --> /wmf/data/raw/eventlogging_camus and drop /wmf/data/raw/eventlogging_legacy/*/year=2021/month=07/day=13/hour=14
  • 19:02 razzi: razzi@cumin1001:~$ sudo cookbook sre.hadoop.roll-restart-workers analytics
  • 13:03 joal: remove /wmf/gobblin/locks/event_default.lock to unlock gobblin event job

2021-07-12

  • 18:37 joal: Move /wmf/data/raw/event to /wmf/data/raw/event_camus and /wmf/data/raw/event_gobblin to /wmf/data/raw/event
  • 18:36 joal: Delete /year=2021/month=07/day=12/hour=14 of gobblin imported events
  • 18:17 ottomata: stopped puppet and refines and imports for event data on an-launcher1002 in prep for gobblin finalization for event_default job
  • 12:31 joal: Rerun failed webrequest hour after having checked that loss was entirely false-positive

2021-07-09

  • 03:21 joal: Rerun webrequest descendant jobs for 2021-07-08T10:00 problem

2021-07-08

  • 17:22 joal: Deploy refinery to HDFS
  • 16:57 joal: Kill-restart webrequest oozie job after gobblin time-format change
  • 16:44 joal: Deploying refinery to an-launcher and hadoop-test
  • 16:05 joal: Manually add /wmf/data/raw/webrequest/webrequest_text/year=2021/month=7/day=8/hour=9/_IMPORTED

2021-07-07

  • 17:03 joal: Deploy refinery to HDFS
  • 16:52 joal: Deploy refinery to an-launcher1002
  • 16:05 joal: Deploy refinery to test-cluster
  • 13:30 joal: kill-restart webrequest using gobblin data
  • 13:12 ottomata: deploying refinery to an-launcher1002 for webrequest gobblin migration
  • 13:09 joal: Move data for webrequest camus-gobblin migration
  • 13:03 ottomata: disabled camus-webrequest and gobblin-webrequest timer on an-launcher1002 in prep for migration

2021-07-06

  • 17:33 joal: Deploy refinery onto HDFS
  • 16:41 joal: Deploy refinery for gobblin
  • 16:03 joal: Kill webrequest_test oozie job
  • 15:55 joal: Drop and recreate wmf_raw.webrequest table on analytics-test-hadoop
  • 15:52 joal: Moved camus and gobblin data for webrequest on analytics-test-hadoop
  • 15:48 ottomata: deploying refinery to test cluster for webrequest_test gobblin job
  • 14:16 ottomata: restarted aqs for July mw history snapshot deploy
  • 13:29 joal: Run first manual empty job for webrequest_test on analytics-test-hadoop
  • 13:29 joal: Clean gobblin state_store and data before starting webrequest_test on analytics-test-hadoop

2021-07-03

  • 19:57 joal: rerun learning-features-actor-hourly-wf-2021-7-2-11

2021-07-02

  • 13:47 joal: Reset failed timer refinery-sqoop-mediawiki-private.service
  • 12:21 joal: Replacing failed data with successful data generated when testing https://gerrit.wikimedia.org/r/702877 - wmf_raw.mediawiki_private_cu_changes
  • 00:04 razzi: razzi@an-coord1002:~$ sudo mount -a
  • 00:04 razzi: razzi@an-coord1002:~$ sudo umount /mnt/hdfs
  • 00:03 razzi: razzi@an-coord1002:~$ sudo systemctl restart hive-metastore.service
  • 00:02 razzi: razzi@an-coord1002:~$ sudo systemctl restart hive-server2.service

2021-07-01

  • 18:56 razzi: razzi@authdns1001:~$ sudo authdns-update
  • 18:19 razzi: razzi@an-coord1001:~$ sudo mount -a
  • 18:18 razzi: razzi@an-coord1001:~$ sudo umount /mnt/hdfs
  • 18:17 razzi: razzi@an-coord1001:~$ sudo systemctl restart presto-server.service
  • 18:16 razzi: razzi@an-coord1001:~$ sudo systemctl restart hive-metastore.service
  • 18:16 razzi: sudo systemctl restart hive-server2.service
  • 18:15 razzi: sudo systemctl restart oozie on an-coord1001 for https://phabricator.wikimedia.org/T283067
  • 16:38 razzi: sudo authdns-update on ns0.wikimedia.org to apply https://gerrit.wikimedia.org/r/c/operations/dns/+/702689

2021-06-30

  • 18:19 razzi: unmount and remount /mnt/hdfs on an-test-client1001 for java security update

2021-06-29

  • 22:55 razzi: sudo systemctl restart hive-server2 on an-test-coord1001.eqiad.wmnet for T283067
  • 22:53 razzi: sudo systemctl restart hive-metastore on an-test-coord1001.eqiad.wmnet for T283067
  • 22:52 razzi: sudo systemctl restart presto-server on an-test-coord1001.eqiad.wmnet for T283067
  • 22:51 razzi: sudo systemctl restart oozie on an-test-coord1001.eqiad.wmnet for T283067
  • 13:31 ottomata: deploying refinery for weekly train

2021-06-28

  • 17:00 elukey: apt-get reinstall llvm-gpu on stat100[5-8] - T285495

2021-06-25

  • 08:01 elukey: reboot an-worker1101 to unblock stuck GPU
  • 07:57 elukey: execute "sudo /opt/rocm/bin/rocm-smi --gpureset -d 1" on an-worker1101 as attempt to unblock the GPU

2021-06-24

  • 06:38 elukey: drop hieradata/role/common/analytics_cluster/superset.yaml from puppet private repo (unused config, all the values duplicated in the new hiera config)
  • 06:34 elukey: rename superset hiera role configs in puppet private repo (to match the role change done recently) + superset restart

2021-06-23

2021-06-22

  • 14:46 XioNoX: remove decom hosts from the analytics firewall filter on cr2-eqiad - T279429
  • 14:37 XioNoX: start updating analytics firewall rules to capirca generated ones on cr2-eqiad - T279429
  • 14:28 XioNoX: remove decom hosts from the analytics firewall filter on cr1-eqiad - T279429
  • 14:12 XioNoX: start updating analytics firewall rules to capirca generated ones on cr1-eqiad - T279429

2021-06-21

  • 13:35 razzi: sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1001-eqiad-wmnet an-master1002-eqiad-wmnet

2021-06-18

  • 06:37 elukey: execute "sudo find -type f -name '*.log*' -mtime +30 -delete" on an-coord1001 to free space in the root partition
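The `find -mtime +30 -delete` cleanup above selects only files last modified more than 30 days ago. A minimal runnable sketch of that selection behavior, using a throwaway temp directory instead of an-coord1001's root partition:

```shell
#!/bin/sh
# Illustrative sketch only (not the production command): shows how
# `find -mtime +N -delete` keeps recent files and removes old ones.
set -e
tmpdir=$(mktemp -d)
touch "$tmpdir/recent.log"
touch -d '40 days ago' "$tmpdir/old.log"   # GNU touch: backdate mtime
# Delete only *.log* files older than 30 days, mirroring the cleanup run.
find "$tmpdir" -type f -name '*.log*' -mtime +30 -delete
ls "$tmpdir"                               # only recent.log remains
rm -r "$tmpdir"
```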

2021-06-15

  • 17:46 razzi: remove hdfs namenode backup on stat1004
  • 17:45 razzi: enable puppet on an-launcher
  • 17:45 razzi: sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues
  • 16:55 razzi: sudo -i wmf-auto-reimage-host -p T278423 an-master1002.eqiad.wmnet
  • 16:53 razzi: run uid script on an-master1002
  • 16:33 elukey: restart hadoop-yarn-resourcemanager on an-master1001
  • 16:16 razzi: sudo systemctl stop 'hadoop-*' on an-master1002
  • 16:14 razzi: sudo systemctl stop hadoop-* on an-master1001, then realize I meant to do this on an-master1002, so start hadoop-*
  • 16:11 razzi: downtime an-master1002
  • 15:55 razzi: sudo transfer.py an-master1001.eqiad.wmnet:/srv/hadoop/backup/hdfs-namenode-snapshot-buster-reimage-2021-06-15.tar.gz stat1004.eqiad.wmnet:/home/razzi/hdfs-namenode-fsimage
  • 15:42 razzi: tar -czf /srv/hadoop/backup/hdfs-namenode-snapshot-buster-reimage-$(date --iso-8601).tar.gz current on an-master1001
  • 15:38 razzi: backup /srv/hadoop/name/current to /home/razzi/hdfs-namenode-snapshot-buster-reimage-2021-06-15.tar.gz on an-master1001
  • 15:33 razzi: sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -saveNamespace
  • 15:27 razzi: sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode enter
  • 15:25 razzi: kill running yarn applications via for loop
  • 15:11 razzi: sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues
  • 15:09 razzi: disable puppet on an-masters
  • 15:08 razzi: run puppet on an-masters to update capacity-scheduler.xml
  • 15:02 razzi: disable puppet on an-masters
  • 15:01 razzi: sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues to stop queues
  • 14:35 razzi: disable jobs that use hadoop on an-launcher1002 following https://phabricator.wikimedia.org/T278423#7094641
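The 15:42 backup step above creates a date-stamped tarball of the namenode's fsimage directory. A runnable sketch of that naming convention, using a throwaway directory in place of /srv/hadoop/name/current:

```shell
#!/bin/sh
# Sketch of the date-stamped fsimage backup step; paths are fake.
set -e
workdir=$(mktemp -d)
mkdir "$workdir/current"
echo 'fsimage placeholder' > "$workdir/current/fsimage_0000000000"
cd "$workdir"
# Same convention as the an-master1001 backup:
# hdfs-namenode-snapshot-buster-reimage-YYYY-MM-DD.tar.gz
tar -czf "hdfs-namenode-snapshot-buster-reimage-$(date --iso-8601).tar.gz" current
ls hdfs-namenode-snapshot-buster-reimage-*.tar.gz
cd /
rm -r "$workdir"
```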

2021-06-14

  • 18:45 ottomata: remove packages from hadoop common nodes: sudo cumin 'R:Class = profile::analytics::cluster::packages::common' 'apt-get -y remove python3-pandas python3-pycountry python3-numpy python3-tz' - T275786
  • 18:43 ottomata: remove packages from stat nodes: sudo cumin 'stat*' apt-get -y remove subversion mercurial tofrodos libwww-perl libcgi-pm-perl libjson-perl libtext-csv-xs-perl libproj-dev libboost-regex-dev libboost-system-dev libgoogle-glog-dev libboost-iostreams-dev libgdal-dev
  • 07:18 joal: Rerun cassandra-daily-wf-local_group_default_T_pageviews_per_article_flat-2021-6-11

2021-06-10

  • 21:17 razzi: sudo systemctl restart monitor_refine_eventlogging_analytics
  • 18:17 razzi: sudo systemctl restart hadoop-mapreduce-historyserver
  • 17:24 razzi: sudo systemctl restart hadoop-hdfs-namenode on an-master1002
  • 17:24 razzi: sudo systemctl restart hadoop-hdfs-zkfc on an-master1002
  • 17:12 razzi: sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1002-eqiad-wmnet an-master1001-eqiad-wmnet
  • 16:25 razzi: rolling restart hadoop masters to pick up https://gerrit.wikimedia.org/r/c/operations/puppet/+/698194
  • 14:07 ottomata: altered event.wmdebannerevent event.eventRate field to change type from BIGINT to DOUBLE - T282562

2021-06-08

  • 16:56 elukey: move away from dbstore1004 in favor of dbstore1007 in analytics CNAME/SRV records (will affect analytics-mysql and sqoop)
  • 13:42 ottomata: roll restart an-conf zookeepers - T283067
  • 13:22 ottomata: roll restarting analytics presto-servers - T283067
  • 06:08 elukey: restart yarn nodemanager on analytics1075 to clear the un-healthy state after some days of downtime (one-off issue but let's keep an eye on it)

2021-06-07

  • 18:14 ottomata: rolling restart of kafka jumbo brokers - T283067
  • 17:53 ottomata: rolling restart of kafka jumbo mirror makers - T283067
  • 17:07 ottomata: remove packages from an cluster nodes: sudo apt-get -y remove r-cran-rmysql python3-matplotlib python3-sklearn python3-enchant python3-nltk gfortran liblapack-dev libopenblas-dev - T275786
  • 16:50 ottomata: restarting mysqld analytics-meta replica on db1108 to apply config change - T272973

2021-06-04

  • 17:42 razzi: sudo cookbook sre.aqs.roll-restart aqs to deploy new mediawiki history snapshot

2021-06-03

  • 22:32 razzi: sudo manage_principals.py create jdl --email_address=jlinehan@wikimedia.org
  • 22:32 razzi: sudo manage_principals.py create phuedx --email_address=phuedx@wikimedia.org
  • 15:46 ottomata: add airflow_2.1.0-py3.7-1_amd64.deb to apt.wm.org
  • 15:20 ottomata: created airflow_analytics database and user on an-coord1001 analytics-meta instance - T272973

2021-06-02

  • 18:09 ottomata: remove .deb packages from stat boxes: python3-mysqldb python3-boto python3-ua-parser python3-netaddr python3-pymysql python3-protobuf python3-unidecode python3-oauth2client python3-oauthlib python3-requests-oauthlib python3-ua-parser - T275786

2021-05-31

  • 06:56 joal: Rerun cassandra-daily-wf-local_group_default_T_pageviews_per_article_flat-2021-5-29

2021-05-27

2021-05-26

  • 19:14 ottomata: deploying refinery and refinery source 0.1.13
  • 17:29 ottomata: killing and restarting oozie cassandra loader jobs coord_unique_devices_daily and coord_pageview_top_percountry_daily after revert of oozie job to load to cassandra 3
  • 14:18 ottomata: deploying refinery...
  • 14:17 ottomata: Deployed refinery-source using jenkins

2021-05-25

  • 18:16 razzi: sudo systemctl start all failed units from `systemctl list-units --state=failed` on an-launcher1002
  • 18:14 razzi: sudo systemctl start eventlogging_to_druid_navigationtiming_hourly.service
  • 18:01 razzi: manually edit /etc/hadoop/conf/capacity-scheduler.xml to make queues running and sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues
  • 17:52 razzi: sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues on an-master1001 and an-master1002
  • 17:28 razzi: sudo systemctl restart refine_eventlogging_legacy
  • 17:28 razzi: sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues to enable submitting jobs once again
  • 17:08 razzi: re-enabled puppet on an-masters and an-launcher
  • 17:04 razzi: sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode leave
  • 17:03 razzi: sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1002-eqiad-wmnet an-master1001-eqiad-wmnet
  • 16:43 razzi: sudo systemctl restart hadoop-hdfs-namenode on an-master1001
  • 16:38 razzi: sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -saveNamespace
  • 16:35 razzi: sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode enter
  • 16:28 razzi: sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1002-eqiad-wmnet an-master1001-eqiad-wmnet
  • 16:23 razzi: sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode leave
  • 16:06 razzi: sudo systemctl restart hadoop-hdfs-namenode
  • 15:52 razzi: checkpoint hdfs with sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -saveNamespace
  • 15:51 razzi: enable safe mode on an-master1001 with sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode enter
  • 15:36 razzi: disable puppet on an-master1001.eqiad.wmnet and an-master1002.eqiad.wmnet again
  • 15:35 razzi: re-enable puppet on an-masters, run puppet, and sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues
  • 15:32 razzi: disable puppet on an-master1001.eqiad.wmnet and an-master1002.eqiad.wmnet
  • 14:39 razzi: stop puppet on an-launcher and stop hadoop-related timers
  • 01:09 razzi: sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1002-eqiad-wmnet an-master1001-eqiad-wmnet
  • 01:07 razzi: sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1001-eqiad-wmnet an-master1002-eqiad-wmnet
  • 00:34 razzi: sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1001-eqiad-wmnet an-master1002-eqiad-wmnet
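The safemode / checkpoint / restart / failover cycle above follows a fixed order. A dry-run sketch of that order (commands are echoed, not executed, since `hdfs` and `kerberos-run-command` only exist on the cluster; hostnames are taken from the entries above):

```shell
#!/bin/sh
# Dry-run sketch of the checkpoint-and-failover order used above.
# `run` only echoes; on a real master each line would be executed
# via kerberos-run-command as the hdfs superuser.
run() { echo "would run: $*"; }

run hdfs dfsadmin -safemode enter      # block namespace changes
run hdfs dfsadmin -saveNamespace       # checkpoint the fsimage
run systemctl restart hadoop-hdfs-namenode
run hdfs dfsadmin -safemode leave      # resume normal operation
run hdfs haadmin -failover an-master1002-eqiad-wmnet an-master1001-eqiad-wmnet
```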

2021-05-24

  • 18:05 ottomata: resume failing cassandra 3 oozie loading jobs, they are also loading to cassandra 2: cassandra-daily-coord-local_group_default_T_top_percountry (0011318-210426062240701-oozie-oozi-C), cassandra-daily-coord-local_group_default_T_unique_devices (0011324-210426062240701-oozie-oozi-C)
  • 18:04 ottomata: suspend failing cassandra 3 oozie loading jobs: cassandra-daily-coord-local_group_default_T_top_percountry (0011318-210426062240701-oozie-oozi-C), cassandra-daily-coord-local_group_default_T_unique_devices (0011324-210426062240701-oozie-oozi-C)
  • 15:19 ottomata: rm -rf /tmp/analytics/* on an-launcher1002 - T283126

2021-05-20

  • 06:05 elukey: kill christinedk's jupyter process on stat1007 (offboarded user) to allow puppet to run

2021-05-19

  • 16:31 razzi: restart turnilo for T279380

2021-05-18

  • 20:22 razzi: restart oozie virtualpageview hourly, virtualpageview druid daily, virtualpageview druid monthly
  • 18:57 razzi: deployed refinery via scap, then deployed to hdfs
  • 18:46 ottomata: removing extraneous python-kafka and python-confluent-kafka deb packages from analytics cluster - T275786
  • 12:40 joal: Add monitoring data in cassandra-3
  • 06:50 joal: run manual unique-devices cassandra job for one day with debug logging
  • 02:20 ottomata: manually running drop_event with --verbose flag

2021-05-17

  • 11:09 joal: Restart cassandra-daily-wf-local_group_default_T_unique_devices-2021-5-4 for testing after host generating failures has been moved out of cluster
  • 10:41 joal: Restart cassandra-daily-wf-local_group_default_T_unique_devices-2021-5-4 for testing after drop/create of keyspace
  • 10:28 joal: Restart cassandra-daily-wf-local_group_default_T_unique_devices-2021-5-4 for testing
  • 09:45 joal: Rerun of cassandra-daily-wf-local_group_default_T_pageviews_per_article_flat-2021-5-15

2021-05-13

  • 11:41 hnowlan: running truncate "local_group_default_T_pageviews_per_article_flat".data; on aqs1012

2021-05-12

  • 15:17 ottomata: dropped event.mediawiki_job_* tables and data directories with mforns - T273789 & T281605
  • 13:56 ottomata: removing refine_mediawiki_job Refine jobs - T281605

2021-05-11

  • 21:00 mforns: finished repeated refinery deployment (matching source v0.1.11) - missed unmerged change
  • 19:59 mforns: repeating refinery deployment (matching source v0.1.11) - missed unmerged change
  • 19:53 mforns: finished refinery deployment (matching source v0.1.11)
  • 18:41 mforns: starting refinery deployment (matching source v0.1.11)
  • 17:26 mforns: deployed refinery-source v0.1.11

2021-05-06

  • 21:27 razzi: sudo manage_principals.py reset-password nahidunlimited --email_address=nsultan@wikimedia.org
  • 13:29 elukey: roll restart of hadoop yarn nodemanagers to pick up TasksMax=26214
  • 12:39 elukey: restart Yarn RMs to apply the dominant resource calculator setting - T281792
  • 12:15 hnowlan: changed eventlogging CNAME to point to eventlog1003
  • 09:19 hnowlan: starting decommission of eventlog1002

2021-05-05

  • 17:36 razzi: create principal for sihe: sudo manage_principals.py create sihe --email_address=silvan.heintze@wikimedia.de
  • 12:22 joal: Reset monitor_refine_eventlogging_legacy after manual rerun of failed job
  • 12:02 joal: rerun cassandra-daily-wf-local_group_default_T_top_percountry-2021-5-4

2021-05-04

  • 20:31 joal: Kill-restart 16 cassandra jobs
  • 20:29 joal: Kill-restart referer-daily job
  • 20:12 joal: Deploy refinery onto HDFS
  • 19:46 joal: Deploying refinery using scap
  • 19:34 joal: refinery v0.1.10 released to Archiva

2021-05-03

  • 14:23 ottomata: stopping all venv based jupyter singleuser servers - T262847
  • 13:59 ottomata: dropped all obsolete (upper cased location) event_sanitized.*_T280813 tables created for T280813
  • 10:43 joal: Add _SUCCESS flag to /wmf/data/raw/mediawiki_private/tables/cu_changes/month=2021-04 after having manually sqooped missing tables
  • 09:57 joal: restart refinery-sqoop-mediawiki-private timer after patch
  • 09:56 joal: Reset refinery-sqoop-mediawiki-private timer
  • 09:38 joal: Drop already sqooped data to restart jobs
  • 08:53 joal: Deploy refinery for sqoop hotfix
  • 08:33 elukey: clean up libmariadb-java from hadoop workers and clients
  • 07:46 joal: Kill prod sqoop job to restart after fix

2021-04-30

  • 07:04 elukey: hue restarted using the database 'hue' instead of 'hue_next'
  • 06:56 elukey: stop hue to allow database rename (hue_next -> hue)

2021-04-29

  • 15:55 razzi: restart hadoop-yarn-nodemanager and hadoop-hdfs-datanode on an-worker1100 for hadoop to recognize new disk /dev/sdl
  • 15:38 ottomata: enabling event_sanitized_main jobs - T273789
  • 14:57 elukey: run mysql_upgrade on an-coord1001 to complete the buster upgrade - T278424
  • 14:44 hnowlan: restored all eventlogging jobs to eventlog1003
  • 14:21 hnowlan: bump eventlog1003 CPUs to 6
  • 13:53 joal: Rerun failed pageview-hourly-wf-2021-4-29-11 and pageview-hourly-wf-2021-4-29-12
  • 13:09 joal: Rerun failed pageview-hourly-wf-2021-4-29-11
  • 12:35 hnowlan: restarting 2 processors on eventlog1002
  • 12:02 hnowlan: stopping processors on eventlog1002 to migrate to eventlog1003
  • 11:50 elukey: manual stop of one of the eventlog processors on eventlog1002 to see if 1003 takes it over
  • 02:59 milimetric: deployed hotfix for referrer job

2021-04-28

  • 17:46 hnowlan: eventlog1003 joined to groups successfully
  • 17:36 razzi: sudo mkdir /srv/log/eventlogging and sudo chown eventlogging:eventlogging /srv/log/eventlogging to workaround missing directory puppet error (to be puppetized later)
  • 17:31 razzi: remove deployment cache on eventlogging1003: sudo rm -fr /srv/deployment/eventlogging/analytics-cache/
  • 17:26 razzi: manually change /srv/deployment/eventlogging/analytics/.git/DEPLOY_HEAD to deployment1002 on deployment1002 to fix puppet scap error
  • 16:53 hnowlan: stopping deployment-eventlog05 in deployment-prep
  • 14:42 milimetric: deployed refinery with 0.1.9 jars and synced to hdfs
  • 14:30 elukey: chown -R analytics-deploy:analytics-deploy /srv/deployment/analytics on an-coord1001
  • 12:50 ottomata: applied data_purge jobs in analytics test cluster; old data will now be dropped there - T273789

2021-04-27

  • 08:33 elukey: run mysql_upgrade for analytics-meta on an-coord1002 (should be part of the upgrade process) - T278424
  • 07:11 elukey: restart yarn resource managers to pick up yarn label settings

2021-04-26

  • 08:01 elukey: restart hadoop-mapreduce-historyserver on an-master1001 after changes to the yarn ui user
  • 07:36 elukey: re-enable timers after setting the capacity scheduler
  • 07:31 elukey: restart hadoop RM on an-master* to pick up capacity scheduler changes
  • 06:44 elukey: stop timers on an-launcher1002 again as prep step for capacity scheduler changes
  • 06:32 elukey: roll restart of hadoop-yarn-nodemanagers to pick up new log4j settings - T276906
  • 06:25 elukey: re-enable timers
  • 06:20 elukey: reboot an-coord1001 to pick up kernel security settings
  • 05:57 elukey: stop timers on an-launcher1002 to allow a reboot of an-coord1001

2021-04-24

  • 08:03 joal: Rerun failed webrequest-druid-hourly-wf-2021-4-23-13

2021-04-23

  • 14:23 elukey: roll restart an-master100[1,2] daemons to pick up new log4j settings - T276906
  • 10:30 elukey: restart hadoop daemons (NM, DN, JN) on an-worker1080 to further test the new log4j config - T276906
  • 09:12 elukey: change default log4j hadoop config to include rolling gzip appender

2021-04-21

  • 21:30 ottomata: temporarily disabling sanitize_eventlogging_analytics_delayed jobs until T280813 is completed (probably tomorrow)
  • 20:04 ottomata: renaming event_sanitized hive table directories to lower case and repairing table partition paths - T280813
  • 09:28 elukey: roll restart druid-overlord on druid* after an-coord1001 maintenance
  • 09:09 elukey: upgrade hue on an-tool1009 to 4.9.0-2
  • 08:31 elukey: re-enable timers on an-launcher1002 and airflow on an-airflow1001 after maintenance on an-coord1001
  • 07:08 elukey: reimage an-coord1001 after partition reshape (/var/lib/mysql folded in /srv)
  • 06:51 elukey: stop airflow on an-airflow1001
  • 06:49 elukey: stop all services on an-coord1001 as prep step for reimage
  • 06:45 elukey: PURGE BINARY LOGS BEFORE '2021-04-14 00:00:00'; on an-coord1001 to free some space before the reimage
  • 06:00 elukey: stop timers on an-launcher1002 as prep step for an-coord1001 reimage

2021-04-20

  • 15:51 elukey: move analytics-hive.eqiad.wmnet back to an-coord1001 (test on an-coord1002 successful)
  • 15:38 ottomata: deployed refinery to hdfs
  • 13:59 ottomata: deploying refinery and refinery source 0.1.6 for weekly train
  • 13:37 ottomata: deployed aqs
  • 13:16 elukey: failover analytics-hive to an-coord1002 to test the host (running on buster)
  • 12:40 elukey: PURGE BINARY LOGS BEFORE '2021-04-12 00:00:00'; on an-coord1001 - T280367

2021-04-19

  • 16:45 ottomata: make RefineMonitor use analytics keytab - this should be a no-op
  • 16:07 razzi: run kafka preferred-replica-election on jumbo cluster (kafka-jumbo1002)
  • 06:50 elukey: move /var/lib/hadoop/name partition under /srv/hadoop/name on an-master1001 - T265126
  • 05:45 elukey: cleanup Lex's jupyter notebooks on stat1007 to allow puppet to clean up

2021-04-18

  • 07:25 elukey: run "PURGE BINARY LOGS BEFORE '2021-04-11 00:00:00';" on an-coord1001 to free some space - T280367

2021-04-16

  • 15:14 elukey: execute PURGE BINARY LOGS BEFORE '2021-04-09 00:00:00'; on an-coord1001 to free space for /var/lib/mysql - T280367
  • 15:13 elukey: execute PURGE BINARY LOGS BEFORE '2021-04-09 00:00:00';
  • 07:54 elukey: drop all the cloudera packages from our repositories
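The repeated `PURGE BINARY LOGS BEFORE '...'` runs above hard-code a cutoff roughly one week in the past. A sketch of computing that cutoff instead of typing it; the `sudo mysql` invocation is shown only as a comment, and the 7-day window is an assumption drawn from the dates in the entries above:

```shell
#!/bin/sh
# Sketch: build the PURGE BINARY LOGS statement with a computed cutoff.
set -e
cutoff=$(date -d '7 days ago' +%F)   # e.g. 2021-04-09 when run on 2021-04-16
sql="PURGE BINARY LOGS BEFORE '$cutoff 00:00:00';"
echo "$sql"
# On an-coord1001 this would be fed to the local mysql instance, e.g.:
#   echo "$sql" | sudo mysql
```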

2021-04-15

  • 21:13 razzi: rebalance kafka partitions for webrequest_text partition 23
  • 14:56 elukey: deploy refinery via scap - weekly train
  • 09:50 elukey: rollback hue on an-tool1009 to 4.8, it seems that 4.9 still has issues
  • 06:32 elukey: move hue.wikimedia.org to an-tool1009 (from analytics-tool1001)
  • 01:36 razzi: rebalance kafka partitions for webrequest_text partitions 21,22

2021-04-14

  • 14:05 elukey: run build/env/bin/hue migrate on an-tool1009 after the hue upgrade
  • 13:10 elukey: rollback hue-next to 4.8 - issues not present in staging
  • 13:00 elukey: upgrade Hue to 4.9 on an-tool1009 - hue-next.wikimedia.org
  • 10:02 elukey: roll restart yarn nodemanagers on hadoop prod (attempt to see if they entered in a weird state, graceful restart)
  • 09:54 elukey: kill long running mediawiki-job refine erroring out application_1615988861843_166906
  • 09:46 elukey: kill application_1615988861843_163186 for the same reason
  • 09:43 elukey: kill application_1615988861843_164387 to see if any improvement to socket consumption is made
  • 09:14 elukey: run "sudo kill `pgrep -f sqoop`" on an-launcher1002 to clean up old test processes still running

2021-04-13

  • 16:17 razzi: rebalance kafka partitions for webrequest_text partitions 19, 20
  • 13:18 ottomata: Refine now uses refinery-job 0.1.4; RefineFailuresChecker has been removed and its function rolled into RefineMonitor -
  • 10:23 hnowlan: deploying aqs with updated cassandra libraries to aqs1004 while depooled
  • 06:17 elukey: kill application application_1615988861843_158645 to free space on analytics1070
  • 06:10 elukey: kill application_1615988861843_158592 on analytics1061 to allow space to recover (truncate of course in D state)
  • 06:05 elukey: truncate logs for application_1615988861843_158592 on analytics1061 - one partition full
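The 06:05 entry truncates an application's log files because the partition is full: truncating the open file frees the space immediately, while deleting it would not help as long as the writer still holds the file open. A runnable sketch with a throwaway file:

```shell
#!/bin/sh
# Sketch of emergency log truncation; the path is a temp file, not a
# real YARN container log.
set -e
logfile=$(mktemp)
head -c 1048576 /dev/zero > "$logfile"   # pretend 1 MiB of container logs
truncate -s 0 "$logfile"                 # size drops to zero immediately
stat -c %s "$logfile"                    # prints 0
rm "$logfile"
```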

2021-04-12

  • 14:21 ottomata: stop using http proxies for produce_canary_events_job - T274951

2021-04-08

  • 16:33 elukey: reboot an-worker1100 again to check if all the disks come up correctly
  • 15:43 razzi: rebalance kafka partitions for webrequest_text partitions 17, 18
  • 15:35 elukey: reboot an-worker1100 to see if it helps with the strange BBU behavior in T279475
  • 14:07 elukey: drop /var/spool/rsyslog from stat1008 - corrupted files due to root partition filled up caused a SEGV for rsyslog
  • 11:14 hnowlan: created aqs user and loaded full schemas into analytics wmcs cassandra
  • 08:35 elukey: apt-get clean on stat1008 to free some space
  • 07:44 elukey: restart hadoop hdfs masters on an-master100[1,2] to apply the new log4j settings for the audit log
  • 06:44 elukey: re-deployed refinery to hadoop-test after fixing permissions on an-test-coord1001

2021-04-07

  • 23:03 ottomata: installing anaconda-wmf-2020.02~wmf5 on remaining nodes - T279480
  • 22:51 ottomata: installing anaconda-wmf-2020.02~wmf5 on stat boxes - T279480
  • 22:47 mforns: finished refinery deployment up to 1dbbd3dfa996d2e970eb1cbc0a63d53040d4e3a3
  • 22:39 mforns: deployment of refinery via scap to hadoop-test failed with Permission denied: '/srv/deployment/analytics/refinery-cache/.config' (deployment to production went fine)
  • 21:44 mforns: starting refinery deploy up to 1dbbd3dfa996d2e970eb1cbc0a63d53040d4e3a3
  • 21:26 mforns: deployed refinery-source v0.1.4
  • 21:25 razzi: sudo apt-get install --reinstall anaconda-wmf on stat1008
  • 20:15 razzi: rebalance kafka partitions for webrequest_text partitions 15, 16
  • 19:53 ottomata: upgrade anaconda-wmf everywhere to 2020.02~wmf4 with fixes for T279480
  • 14:03 hnowlan: setting profile::aqs::git_deploy: true in aqs-test1001 hiera config

2021-04-06

  • 22:34 razzi: rebalance kafka partitions for webrequest_text_13,14
  • 09:37 elukey: reimage an-coord1002 to Debian Buster

2021-04-05

  • 16:07 razzi: remove old hive logs on an-coord1001: sudo rm /var/log/hive/hive-*.log.2021-02-*
  • 14:54 razzi: remove empty /var/log/sqoop on an-launcher1002 (logs go in /var/log/refinery); sudo rmdir /var/log/sqoop
  • 14:51 razzi: rebalance kafka partitions for webrequest_text partitions 11, 12

2021-04-02

  • 16:28 razzi: rebalance kafka partitions for webrequest_text partitions 9,10
  • 16:19 elukey: all the Hadoop test cluster on Debian Buster
  • 07:28 elukey: manual fix for an-worker1080's interface in netbox (xe-4/0/11), moved by mistake to public-1b

2021-04-01

  • 20:27 razzi: restore superset_production from backup superset_production_1617306805.sql
  • 20:14 razzi: manually run bash /srv/deployment/analytics/superset/deploy/create_virtualenv.sh as analytics_deploy on an-tool1010, since somehow it didn't run with scap
  • 20:01 razzi: sudo chown -R analytics_deploy:analytics_deploy /srv/deployment/analytics/superset/venv since it's owned by root and needs to be removed upon deployment
  • 19:54 razzi: dump superset production to an-coord1001.eqiad.wmnet:/home/razzi/superset_production_1617306805.sql just in case
  • 16:50 razzi: rebalance kafka partitions for webrequest_text partitions 7 and 8

2021-03-31

  • 14:18 hnowlan: starting copy of large tables from aqs1007 to aqs1011

2021-03-30

  • 20:25 joal: Kill-Restart data_quality_stats-hourly-bundle after deploy
  • 20:19 joal: Deploying refinery onto HDFS
  • 19:57 joal: Deploying refinery using scap
  • 19:57 joal: Refinery-source released to archiva and new jars commited to refinery (v0.1.3)
  • 17:07 razzi: rebalance kafka partitions for webrequest_text partitions 5 and 6
  • 12:35 hnowlan: Depooling aqs1004 for another transfer of local_group_default_T_pageviews_per_article_flat
  • 12:30 elukey: restart reportupdater-codemirror on an-launcher1002 for T275757
  • 11:30 elukey: ERRATA: upgrade to 2.3.6-2
  • 11:29 elukey: upgrade hive client packages to 2.3.6-1 on an-launcher1002 (already applied to all stat100x)

2021-03-25

  • 15:58 elukey: disable vmemory checks in Yarn nodemanagers on Hadoop
  • 13:53 elukey: systemctl restart performance-asotranking on stat1007 for T276121
  • 08:14 elukey: upgrade hive packages on stat100x to 2.3.6-2 - T276121
  • 08:12 elukey: upgrade hive packages in thirdparty/bigtop15 to 2.3.6-2 for buster-wikimedia

2021-03-24

  • 18:49 elukey: systemctl restart refinery-import-* failed jobs (/mnt/hdfs errors due to me umounting the mountpoint)
  • 18:43 elukey: kill fuse hdfs mount process on an-launcher1002, re-mounted /mnt/hdfs, too many processes in D state
  • 15:46 razzi: rebalance kafka partitions for webrequest_text partitions 3 and 4
  • 05:40 razzi: sudo chown analytics /var/log/refinery/sqoop-mediawiki.log.1 on an-launcher1002 and restart logrotate

2021-03-22

  • 18:12 elukey: drop /srv/.hardsync* to clean up hardlinks not needed
  • 18:07 elukey: run rm -rfv .hardsync.*/archive/public-datasets/* on thorium:/srv to clean up files to drop (didn't work)
  • 18:01 elukey: drop /srv/.hardsync*trash* on thorium - old hardlinks that should have been trashed
  • 15:52 razzi: rebalance kafka partitions for webrequest_text partition 2
  • 09:28 elukey: move the yarn scheduler in hadoop test to capacity

2021-03-19

  • 15:44 razzi: rebalance kafka partitions for webrequest_text partition 1

2021-03-18

  • 19:30 razzi: rename /usr/lib/python2.7/dist-packages/cqlshlib/copyutil.so back
  • 19:29 razzi: temporarily rename /usr/lib/python2.7/dist-packages/cqlshlib/copyutil.so on aqs1004 to fix https://issues.apache.org/jira/browse/CASSANDRA-11574
  • 19:02 ottomata: hdfs dfs -chgrp -R analytics-privatedata-users /wmf/camus - T275396
  • 16:47 razzi: rebalance kafka partitions for webrequest_text partition 0
  • 06:32 elukey: force a manual run of create_virtualenv.sh on an-tool1010 - superset down
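The per-partition "rebalance kafka partitions" entries logged through March and April are done with Kafka's partition reassignment tooling. A sketch of building a one-partition reassignment file like those used for webrequest_text; the broker ids 1001-1003 are made up for illustration, and in practice the file would be passed to `kafka-reassign-partitions.sh --execute`:

```shell
#!/bin/sh
# Sketch: write a one-partition reassignment JSON. Broker ids are
# hypothetical; execution via kafka-reassign-partitions.sh is not shown.
set -e
topic=webrequest_text
partition=0
replicas='[1001,1002,1003]'
cat > /tmp/reassign-$topic-$partition.json <<EOF
{"version":1,"partitions":[
  {"topic":"$topic","partition":$partition,"replicas":$replicas}
]}
EOF
cat /tmp/reassign-$topic-$partition.json
```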

2021-03-17

  • 20:45 razzi: release wikistats 2.9.0
  • 20:15 ottomata: install anaconda-wmf 2020.02~wmf3 on analytics cluster clients and workers - T262847
  • 18:10 ottomata: started oozie/cassandra/coord_pageview_top_percountry_daily
  • 15:21 razzi: rebalance kafka partitions for webrequest_upload partitions 22 and 23
  • 13:54 razzi: sudo cookbook sre.hosts.reboot-single an-conf1001.eqiad.wmnet
  • 13:47 razzi: sudo cookbook sre.hosts.reboot-single an-conf1003.eqiad.wmnet
  • 13:41 razzi: sudo cookbook sre.hosts.reboot-single an-conf1002.eqiad.wmnet
  • 13:39 ottomata: deploying refinery for weekly train
  • 13:28 ottomata: deploy aqs as part of train - T207171, T263697
  • 01:28 razzi: rebalance kafka partitions for webrequest_upload partition 21

2021-03-16

  • 14:43 razzi: rebalance kafka partitions for webrequest_upload partition 20
  • 03:17 razzi: rebalance kafka partitions for webrequest_upload partition 19

2021-03-15

  • 16:53 razzi: rebalance kafka partitions for webrequest_upload partition 18
  • 08:25 elukey: stop/start hdfs-balancer on an-launcher1002 with bw 200MB
  • 07:48 joal: Manually start mediawiki-history-drop-snapshot.service to check the run succeeds
  • 07:47 joal: Drop hive wmf.mediawiki_wikitext_history snapshot partitions (2020-08, 2020-09, 2020-10, 2020-11)

2021-03-14

  • 20:49 joal: Manually clean some data (mediawiki-history-drop-snapshot.service seems not to be working)
  • 20:46 joal: Force a run of mediawiki-history-drop-snapshot.service to clean up some data

2021-03-12

  • 17:20 elukey: kill duplicate mediawiki-wikitext-history coordinator failing and sending emails to alerts@
  • 07:21 elukey: re-run monitor_refine_event_failure_flags

2021-03-11

  • 22:31 razzi: rebalance kafka partitions for webrequest_upload partition 17
  • 20:20 razzi: disable maintenance mode for matomo1002
  • 20:08 razzi: starting reboot of matomo1002 for kernel upgrade
  • 18:52 razzi: systemctl restart hadoop-hdfs-datanode on analytics1059
  • 18:50 razzi: systemctl restart hadoop-yarn-nodemanager on analytics1059
  • 18:35 razzi: apt-get install parted on analytics1059
  • 15:34 razzi: rebalance kafka partitions for webrequest_upload partition 17
  • 10:52 elukey: drop /home/bsitzmann on all stat100x hosts - T273712
  • 08:25 elukey: drop database dedcode cascade in hive - T276748
  • 08:15 elukey: hdfs dfs -rmr /user/dedcode on an-launcher1002 (data in trash for a month) - T276748

2021-03-10

  • 23:15 razzi: rebalance kafka partitions for webrequest_upload partition 16
  • 18:44 mforns: finished deployment of refinery (session length oozie job)
  • 18:16 mforns: starting deployment of refinery (session length oozie job)
  • 16:54 razzi: rebalance kafka partitions for webrequest_upload partition 15
  • 07:05 elukey: all hadoop worker nodes on Buster
  • 06:28 elukey: force the re-run of refine_eventlogging_legacy - failed due to worker reimage in progress
  • 06:17 elukey: reimage an-worker1111 to buster

2021-03-09

  • 22:00 razzi: rebalance kafka partitions for webrequest_upload partition 14
  • 20:42 elukey: reimaged an-worker1091 to buster
  • 18:26 elukey: reimage an-worker1087 to buster
  • 16:40 elukey: reimage analytics1077 to buster
  • 15:36 razzi: rebalance kafka partitions for webrequest_upload partition 13
  • 15:18 elukey: reimage analytics1072 (hadoop hdfs journal node) to buster
  • 14:29 elukey: drain + reimage an-worker1090/89 to Buster
  • 13:26 elukey: reimage an-worker1102 and an-worker1080 (hdfs journal node) to Buster
  • 12:59 elukey: drain + reimage an-worker1103 to Buster
  • 09:14 elukey: drain + reimage analytics1076 and an-worker1112 to Buster
  • 07:01 elukey: drain + reimage an-worker109[4,5] to Buster

2021-03-08

  • 23:22 razzi: rebalance kafka partitions for webrequest_upload partition 12
  • 18:49 razzi: rebalance kafka partitions for webrequest_upload partition 11
  • 18:11 elukey: drain + reimage an-worker11[15,16] to Buster
  • 17:12 elukey: drain + reimage an-worker11[13,14] to Buster
  • 16:17 elukey: drain + reimage an-worker1109/1110 to Buster
  • 14:54 elukey: drain + reimage an-worker110[7,8] to Buster
  • 14:52 ottomata: altered topics (eqiad|codfw).mediawiki.client.session_tick to have 2 partitions - T276502
  • 13:51 elukey: drain + reimage an-worker110[4,5] to Buster
  • 10:41 elukey: drain + reimage an-worker1104/1089 to Debian Buster
  • 09:19 elukey: drain + reimage an-worker108[3,4] to Buster
  • 08:20 elukey: drain + reimage an-worker108[1,2] to Buster
  • 07:23 elukey: drain + reimage analytics107[4,5] to Buster

2021-03-07[edit]

  • 08:00 elukey: "megacli -LDSetProp -ForcedWB -Immediate -Lall -aAll" on analytics1066
  • 07:49 elukey: umount /var/lib/hadoop/data/e on analytics1059 and restart hadoop daemons to exclude failed disk - T276696
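
The entry above follows a pattern that recurs in this archive (see also the analytics1059 and an-worker1097 entries elsewhere): unmount the failed disk so the datanode no longer sees it, then restart the Hadoop daemons. A dry-run sketch with the mount point from T276696:

```shell
# Dry-run helper: print commands rather than touching a live worker.
run() { echo "+ $*"; }

# Exclude a failed data disk from a Hadoop worker. Whether the datanode then
# tolerates the missing volume depends on dfs.datanode.failed.volumes.tolerated,
# which is not recorded in the log.
exclude_disk() {
  local mount="$1"
  run umount "$mount"
  run systemctl restart hadoop-hdfs-datanode
  run systemctl restart hadoop-yarn-nodemanager
}

exclude_disk /var/lib/hadoop/data/e
```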

2021-03-05[edit]

  • 18:30 razzi: run again sudo -i wmf-auto-reimage-host -p T269211 clouddb1021.eqiad.wmnet --new
  • 18:18 razzi: sudo cookbook sre.dns.netbox -t T269211 "Move clouddb1021 to private vlan"
  • 18:17 razzi: re-run interface_automation.ProvisionServerNetwork with private vlan
  • 18:16 razzi: delete non-mgmt interface for clouddb1021
  • 17:07 razzi: sudo -i wmf-auto-reimage-host -p T269211 clouddb1021.eqiad.wmnet --new
  • 16:54 razzi: sudo cookbook sre.dns.netbox -t T269211 "Reimage and rename labsdb1012 to clouddb1021"
  • 16:52 razzi: run script at https://netbox.wikimedia.org/extras/scripts/interface_automation.ProvisionServerNetwork/
  • 16:47 razzi: edit https://netbox.wikimedia.org/dcim/devices/2078/ device name from labsdb1012 to clouddb1021
  • 16:30 razzi: delete non-mgmt interfaces for labsdb1012 at https://netbox.wikimedia.org/dcim/devices/2078/interfaces/
  • 16:28 razzi: rename https://netbox.wikimedia.org/ipam/ip-addresses/734/ DNS name from labsdb1012.mgmt.eqiad.wmnet to clouddb1021.mgmt.eqiad.wmnet
  • 16:08 razzi: sudo cookbook sre.hosts.decommission labsdb1012.eqiad.wmnet -t T269211
  • 15:52 razzi: stop mariadb on labsdb1012
  • 15:39 razzi: rebalance kafka partitions for webrequest_upload partition 10
  • 15:07 elukey: drain + reimage analytics1073 and an-worker1086 to Debian Buster
  • 13:36 elukey: roll restart HDFS Namenodes for the Hadoop cluster to pick up new Xmx settings (https://gerrit.wikimedia.org/r/c/operations/puppet/+/668659)
  • 10:20 elukey: force run of refinery-druid-drop-public-snapshots to check Druid public's performances
  • 10:06 elukey: failover HDFS Namenode from 1002 to 1001 (high GC pauses triggered the HDFS zkfc daemon on 1001 and the failover to 1002)
  • 08:32 elukey: drain + reimage an-worker107[8,9] to Debian Buster (one Journal node included)
  • 07:22 elukey: drain + reimage analytics107[0-1] to debian buster
  • 07:13 elukey: add analytics1066 back with /dev/sdb removed
  • 07:01 elukey: stop hadoop daemons on analytics1066 - disk errors on /dev/sdb after reimage
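
The labsdb1012 → clouddb1021 rename at the top of this section combines Netbox UI work with a few cookbook invocations. The command-line portion as a dry-run sketch (the commands are taken verbatim from the entries above; the Netbox steps happen in the web UI and appear only as comments):

```shell
run() { echo "+ $*"; }

rename_host() {
  local old="$1" new="$2" task="$3"
  run systemctl stop mariadb   # stop services before decommissioning
  run sudo cookbook sre.hosts.decommission "${old}.eqiad.wmnet" -t "$task"
  # In Netbox: rename the mgmt DNS record and the device, delete non-mgmt
  # interfaces, re-run interface_automation.ProvisionServerNetwork.
  run sudo cookbook sre.dns.netbox -t "$task" "Reimage and rename ${old} to ${new}"
  run sudo -i wmf-auto-reimage-host -p "$task" "${new}.eqiad.wmnet" --new
}

rename_host labsdb1012 clouddb1021 T269211
```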

2021-03-04[edit]

  • 21:19 razzi: rebalance kafka partitions for webrequest_upload partition 9
  • 16:27 elukey: drain + reimage analytics106[8,9] to Debian Buster (one is a journalnode)
  • 15:12 elukey: drain + reimage analytics106[6,7] to Debian Buster
  • 14:21 elukey: drain + reimage analytics1065 to Debian Buster
  • 13:32 elukey: drain + reimage analytics10[63,64] to Debian Buster
  • 12:48 elukey: drain + reimage analytics10[61,62] to Debian Buster
  • 10:40 elukey: drain + reimage analytics1059/1060 to Debian Buster
  • 09:32 elukey: reboot an-worker[1097-1101] (GPU workers) to pick up the new kernel (5.10)
  • 09:02 elukey: kill/start mediawiki-geoeditors-monthly to apply backtick change (hive script)
  • 08:48 elukey: deploy refinery to hdfs
  • 08:34 elukey: deploy refinery to fix https://gerrit.wikimedia.org/r/c/analytics/refinery/+/668111
  • 07:38 elukey: reboot an-worker1096 to pick up 5.10 kernel

2021-03-03[edit]

  • 17:10 elukey: update druid datasource on aqs (roll restart of aqs on aqs100*)
  • 17:06 razzi: rebalance kafka partitions for webrequest_upload partition 8
  • 14:20 elukey: reimage an-worker1099,1100,1101 (GPU worker nodes) to Debian Buster
  • 10:16 elukey: add an-worker113[2,5-8] to the Analytics Hadoop cluster

2021-03-02[edit]

  • 23:15 mforns: finished deployment of refinery to hdfs
  • 21:59 mforns: starting refinery deployment using scap
  • 21:48 mforns: deployed refinery-source v0.1.2
  • 17:26 razzi: rebalance kafka partitions for webrequest_upload partition 7
  • 13:42 elukey: Add an-worker11[19,20-28,30,31] to Analytics Hadoop
  • 10:21 elukey: roll restart druid historicals on druid public to pick up new cache settings (enable segment caching)
  • 10:14 elukey: roll restart druid brokers on druid public to pick up new cache settings (no segment caching, only query caching)
  • 08:01 elukey: manual start of performance-asotranking on stat1007 (requested by Gilles) - T276121

2021-03-01[edit]

  • 21:24 razzi: rebalance kafka partitions for webrequest_upload partition 6
  • 18:14 razzi: restart timer that wasn't running on an-worker1101: sudo systemctl restart prometheus-debian-version-textfile.timer
  • 17:40 elukey: reimage an-worker1098 (GPU worker node) to Buster
  • 14:48 elukey: reimage an-worker1097 (gpu node) to debian buster
  • 11:55 elukey: roll restart druid broker on druid-analytics (again) to enable query cache settings (missing config due to typo)
  • 11:34 elukey: roll restart historical daemons (again) on druid-analytics to remove stale config and enable (finally) segment caching.
  • 11:02 elukey: roll restart druid-broker and druid-historical daemons on druid-analytics to pick up new cache settings (disable segment caching on broker and enable it on historicals)
  • 09:12 elukey: restart hadoop daemons on an-worker1112 to pick up the new disk
  • 09:11 elukey: remount /dev/sdl on an-worker1112 (wasn't able to make it fail)

2021-02-26[edit]

  • 16:03 razzi: rebalance kafka partitions for webrequest_upload partition 4
  • 12:33 elukey: reimaged an-worker1096 (GPU node) to Debian buster (preserving datanode dirs)
  • 09:52 elukey: reimaged analytics1058 to debian buster (preserving datanode partitions)
  • 07:50 elukey: attempt to reimage analytics1058 (part of the cluster, not a new worker node) to Buster
  • 07:29 elukey: added journalnode partition to all hadoop workers not having it in the Analytics cluster
  • 07:01 elukey: reboot an-worker1099 to clear out kernel soft lockup errors
  • 06:59 elukey: restart datanode on an-worker1099 - soft lockup kernel errors

2021-02-25[edit]

  • 17:04 razzi: rebalance kafka partitions for webrequest_upload partition 3
  • 13:36 elukey: drop /srv/backup/wikistats from thorium
  • 13:35 elukey: drop /srv/backup/backup_wikistats_1 from thorium
  • 11:14 elukey: add an-worker111[7,8] to Analytics Hadoop (were previously backup worker nodes)
  • 08:50 elukey: move analytics-privatedata/search/product to fixed gid/uid on all buster nodes (including airflow/stat100x/launcher)

2021-02-24[edit]

  • 19:16 ottomata: service hadoop-yarn-nodemanager start on an-worker1112
  • 16:03 milimetric: deployed refinery
  • 14:09 elukey: roll restart druid brokers on druid public to pick up caffeine cache settings
  • 14:03 elukey: roll restart druid brokers on druid analytics to pick up caffeine cache settings
  • 11:08 elukey: restart druid-broker on an-druid1001 (used by Turnilo) with caffeine cache
  • 09:01 elukey: roll restart druid brokers on druid public - locked
  • 07:47 elukey: change gid/uid for druid + roll restart of all druid nodes
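
A "roll restart" in these entries means restarting one daemon at a time so the Druid cluster keeps answering queries throughout. A dry-run sketch; the host list and the settle time between hosts are assumptions, not from the log:

```shell
run() { echo "+ $*"; }

roll_restart() {
  local service="$1"; shift
  for host in "$@"; do
    run ssh "$host" sudo systemctl restart "$service"
    run sleep 60   # give the daemon time to rejoin before the next host
  done
}

roll_restart druid-broker druid1004.eqiad.wmnet druid1005.eqiad.wmnet druid1006.eqiad.wmnet
```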

2021-02-23[edit]

  • 21:20 ottomata: started nodemanager on an-worker1112
  • 21:15 razzi: rebalance kafka partitions for webrequest_upload partition 2
  • 19:31 elukey: roll out new uid/gid for mapred/druid/analytics/yarn/hdfs for all buster nodes (no op for stretch)
  • 17:47 elukey: change uid/gid for yarn/mapred/analytics/hdfs/druid on stat100x, an-presto100x
  • 15:57 elukey: an-launcher1002's timers restored
  • 15:28 elukey: stop timers on an-launcher1002 to change gid/uid for yarn/hdfs/mapred/analytics/druid and to reboot for kernel updates
  • 15:23 elukey: deploy new uid/gid scheme for yarn/mapred/analytics/hdfs/druid on an-tool100[8,9]
  • 15:22 elukey: deploy new uid/gid scheme for yarn/mapred/analytics/hdfs/druid on an-airflow1001, an-test* buster nodes
  • 15:05 klausman: an-master1001 ~ $ sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chgrp analytics-privatedata-users /wmf/data/raw/webrequest/webrequest_text/hourly/2021/02/22/01/webrequest*
  • 14:51 elukey: drop /srv/backup-1007 on stat1008 to free space
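
The uid/gid entries above move the yarn/mapred/analytics/hdfs/druid accounts to fixed ids across Buster hosts. A dry-run sketch of the per-host steps; the numeric ids below are placeholders, not the real WMF ones:

```shell
run() { echo "+ $*"; }

# Move an account to a fixed uid/gid, then re-own files still carrying the
# old ids. (-xdev limits find to one filesystem; in practice this would be
# repeated per data mount.)
migrate_ids() {
  local name="$1" old_uid="$2" new_uid="$3" old_gid="$4" new_gid="$5"
  run groupmod -g "$new_gid" "$name"
  run usermod -u "$new_uid" -g "$new_gid" "$name"
  run find / -xdev -uid "$old_uid" -exec chown -h "$new_uid" {} +
  run find / -xdev -gid "$old_gid" -exec chgrp -h "$new_gid" {} +
}

migrate_ids druid 499 907 499 907
```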

2021-02-22[edit]

  • 19:27 ottomata: restart oozie on an-coord1001 to pick up new spark share lib without hadoop jars - T274384
  • 14:38 ottomata: upgrade spark2 on analytics cluster to 2.4.4-bin-hadoop2.6-5~wmf0 (hadoop jars removed) - T274384
  • 14:12 ottomata: upgrade spark2 on an-coord1001 to 2.4.4-bin-hadoop2.6-5~wmf0 (hadoop jars removed), will remove and auto-re add spark-2.4.4-assembly.zip in hdfs after running puppet here
  • 14:07 ottomata: upgrade spark2 on stat1004 to 2.4.4-bin-hadoop2.6-5~wmf0 (hadoop jars removed)
  • 09:01 elukey: reboot stat1005/stat1008 for kernel upgrades

2021-02-19[edit]

  • 15:53 elukey: restart oozie again to test another setting for role/admins
  • 15:43 ottomata: installing spark 2.4.4 without hadoop jars on analytics test cluster - T274384
  • 15:31 elukey: restart oozie to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/665352
  • 14:34 joal: rerun mobile_apps-uniques-daily-wf-2021-2-18
  • 09:16 elukey: stop and decom the hadoop backup cluster

2021-02-18[edit]

  • 18:38 razzi: rebalance kafka partitions for webrequest_upload partition 1
  • 17:27 elukey: an-coord1002 back in service with raid1 configured
  • 15:48 elukey: stop hive/mysql on an-coord1002 as precautionary step to rebuild the md array
  • 13:10 elukey: failover analytics-hive to an-coord1001 after maintenance (DNS change)
  • 11:32 elukey: restart hive daemons on an-coord1001 to pick up new parquet settings
  • 10:07 elukey: hive failover to an-coord1002 to apply new hive settings to an-coord1001
  • 10:00 elukey: restart hive daemons on an-coord1002 (standby coord) to pick up new default parquet file format change
  • 09:46 elukey: upgrade presto to 0.246-wmf on an-coord1001, an-presto*, stat100x

2021-02-17[edit]

  • 17:44 razzi: rebalance kafka partitions for webrequest_upload partition 0
  • 16:14 razzi: rebalance kafka partitions for eqiad.mediawiki.api-request
  • 07:04 elukey: reboot stat1004/stat1006/stat1007 for kernel upgrades
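
The many "rebalance kafka partitions" entries move replicas one partition at a time, which bounds the replication traffic per step. The log does not record the exact tooling; a plausible sketch using the stock kafka-reassign-partitions workflow, with placeholder broker IDs and ZooKeeper address:

```shell
run() { echo "+ $*"; }

# Build a one-partition reassignment plan and (dry-run) hand it to the
# standard tool. Broker IDs 1001-1003 and the ZooKeeper address are
# placeholders; only the one-partition-at-a-time approach is from the log.
reassign_one() {
  local topic="$1" partition="$2" replicas="$3"
  cat > /tmp/reassign.json <<EOF
{"version": 1,
 "partitions": [
   {"topic": "${topic}", "partition": ${partition}, "replicas": [${replicas}]}
 ]}
EOF
  run kafka-reassign-partitions.sh --zookeeper zk-placeholder:2181 \
      --reassignment-json-file /tmp/reassign.json --execute
}

reassign_one webrequest_upload 0 "1001,1002,1003"
```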

2021-02-16[edit]

  • 22:31 razzi: rebalance kafka partitions for codfw.mediawiki.api-request
  • 17:44 razzi: rebalance kafka partitions for netflow
  • 17:42 razzi: rebalance kafka partitions for atskafka_test_webrequest_text
  • 07:32 elukey: restart hadoop daemons on an-worker1099 after reconfiguring a new disk
  • 06:58 elukey: restart hdfs/yarn daemons on an-worker1097 to exclude a failed disk

2021-02-15[edit]

  • 20:38 mforns: running hdfs fsck to troubleshoot corrupt blocks
  • 17:28 elukey: restart hdfs namenodes on the main cluster to pick up new racking changes (worker nodes from the backup cluster)

2021-02-14[edit]

  • 09:38 joal: Restart and backfill mediacount and mediarequest, and backfill mediarequest-AQS and mediacount archive
  • 09:38 joal: deploy refinery onto hdfs
  • 09:14 joal: Deploy hotfix for mediarequest and mediacount

2021-02-12[edit]

  • 19:19 milimetric: deployed refinery with query syntax fix for the last broken cassandra job and an updated EL whitelist
  • 18:34 razzi: rebalance kafka partitions for atskafka_test_webrequest_text
  • 18:31 razzi: rebalance kafka partitions for __consumer_offsets
  • 17:48 joal: Rerun wikidata-articleplaceholder_metrics-wf-2021-2-10
  • 17:47 joal: Rerun wikidata-specialentitydata_metrics-wf-2021-2-10
  • 17:43 joal: Rerun wikidata-json_entity-weekly-wf-2021-02-01
  • 17:08 elukey: reboot presto workers for kernel upgrade
  • 16:32 mforns: finished deployment of analytics-refinery
  • 15:26 mforns: started deployment of analytics-refinery
  • 15:16 elukey: roll restart druid broker on druid-public to pick up new settings
  • 07:54 elukey: roll restart of druid brokers on druid-public - locked after scheduled datasource deletion
  • 07:47 elukey: force a manual run of refinery-druid-drop-public-snapshots on an-launcher1002 (3d before its natural start) - controlled execution to see how druid + 3xdataset replication reacts

2021-02-11[edit]

  • 14:26 joal: Restart oozie API job after spark sharelib fix (start: 2021-02-10T18:00)
  • 14:20 joal: Rerun failed clickstream instance 2021-01 after sharelib fix
  • 14:16 joal: Restart oozie after having fixed the spark-2.4.4 sharelib
  • 14:12 joal: Fix oozie sharelib for spark-2.4.4 by copying oozie-sharelib-spark-4.3.0.jar onto the spark folder
  • 02:19 milimetric: deployed again to fix old spelling error :) referererererer
  • 00:05 milimetric: deployed refinery and synced to hdfs, restarting cassandra jobs gently

2021-02-10[edit]

  • 21:46 razzi: rebalance kafka partitions for eqiad.mediawiki.cirrussearch-request
  • 21:10 razzi: rebalance kafka partitions for codfw.mediawiki.cirrussearch-request
  • 19:11 elukey: drop /user/oozie/share + chmod -R o+rx /user/oozie/share + restart oozie
  • 17:56 razzi: rebalance kafka partitions for eventlogging-client-side
  • 01:07 milimetric: deployed refinery with some fixes after BigTop upgrade, will restart three coordinators right now

2021-02-09[edit]

  • 22:04 razzi: rebalance kafka partitions for eqiad.resource-purge
  • 20:51 joal: Rerun webrequest-load-coord-[text|upload] for 2021-02-09T07:00 after data was imported by Camus
  • 20:50 razzi: rebalance kafka partitions for codfw.resource-purge
  • 20:31 joal: Rerun webrequest-load-coord-[text|upload] for 2021-02-09T06:00 after data was imported by Camus
  • 16:30 elukey: restart datanode on an-worker1100
  • 16:14 ottomata: restart datanode on analytics1059 with 16g heap
  • 16:08 ottomata: restart datanode on an-worker1080 with 16g heap
  • 15:58 ottomata: restart datanode on analytics1058
  • 15:55 ottomata: restart datanode on an-worker1115
  • 15:38 elukey: restart namenode on an-master1002
  • 15:01 elukey: restart an-worker1104 with 16g heap size to allow bootstrap
  • 15:01 elukey: restart an-worker1103 with 16g heap size to allow bootstrap
  • 14:57 elukey: restart an-worker1102 with 16g heap size to allow bootstrap
  • 14:54 elukey: restart an-worker1090 with 16g heap size to allow bootstrap
  • 14:50 elukey: restart analytics1072 with 16g heap size to allow bootstrap
  • 14:50 elukey: restart analytics1069 with 16g heap size to allow bootstrap
  • 14:08 elukey: restart analytics1069's datanode with bigger heap size
  • 13:39 elukey: restart hdfs-datanode on analytics10[65,69] - failed to bootstrap due to issues reading datanode dirs
  • 13:38 elukey: restart hdfs-datanode on an-worker1080 (test canary - not showing up in block report)
  • 10:04 elukey: stop mysql replication an-coord1001 -> an-coord1002, an-coord1001 -> db1108
  • 08:29 elukey: leave hdfs safemode to let distcp do its job
  • 08:25 elukey: set hdfs safemode on for the Analytics cluster
  • 08:19 elukey: umount /mnt/hdfs from all nodes using it
  • 08:16 joal: Kill flink yarn app
  • 08:08 elukey: stop jupyterhub on stat100x
  • 08:07 elukey: stop hive on an-coord100[1,2] - prep step for bigtop upgrade
  • 08:05 elukey: stop oozie an-coord1001 - prep step for bigtop upgrade
  • 08:03 elukey: stop presto-server on an-presto100x and an-coord1001 - prep step for bigtop upgrade
  • 07:28 elukey: roll out new apt bigtop changes across all hadoop-related nodes
  • 07:19 joal: Killing yarn users applications
  • 07:12 elukey: stop airflow on an-airflow1001 (prep step for bigtop)
  • 07:09 elukey: stop namenode on an-worker1124 (backup cluster), create two new partitions for backup and namenode, restart namenode
  • 06:14 elukey: disable timers on labstore nodes (prep step for bigtop)
  • 06:11 elukey: disable systemd timers on an-launcher1002 (prep step for bigtop)
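
Read bottom-up (the log is newest-first), the 2021-02-09 entries trace the shutdown sequence that preceded the BigTop upgrade: stop scheduled ingestion first, then user workloads and services, then freeze HDFS. As a dry-run checklist; the glob for the timers and the placeholder application id are assumptions, while the safemode command is standard HDFS administration:

```shell
run() { echo "+ $*"; }

pre_upgrade() {
  # 1. Stop scheduled work so nothing new lands in the cluster.
  run systemctl stop 'analytics-*.timer'   # an-launcher1002, labstore100[6,7]
  run systemctl stop airflow-scheduler     # an-airflow1001
  # 2. Stop user workloads and query/coordination services.
  run yarn application -kill '<remaining-app-ids>'
  run systemctl stop presto-server oozie hive-server2 hive-metastore jupyterhub
  # 3. Freeze HDFS: unmount fuse mounts, enter safemode for the distcp backup.
  run umount /mnt/hdfs
  run sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode enter
  # ...distcp to the backup cluster runs here, then: -safemode leave
}

pre_upgrade
```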

2021-02-08[edit]

  • 22:29 elukey: the previous entry was related to the Hadoop backup cluster
  • 22:29 elukey: hdfs master failover an-worker1118 -> an-worker1124, created dedicated partition for /var/lib/hadoop/name (root partition filled up), restarted namenode on 1118 (now recovering edit logs)
  • 18:44 razzi: rebalance kafka partitions for eventlogging_VirtualPageView
  • 15:12 ottomata: set kafka topic retention to 31 days for (eqiad|codfw.rdf-streaming-updater.mutation) in kafka main-eqiad and main-codfw - T269619
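
Kafka topic retention is configured in milliseconds; the 31-day setting in the entry above works out to 2,678,400,000 ms. A sketch using the stock kafka-configs tool (the ZooKeeper address is a placeholder; T269619 does not record the exact invocation):

```shell
run() { echo "+ $*"; }

days=31
retention_ms=$(( days * 24 * 60 * 60 * 1000 ))   # 31 days in milliseconds

for topic in eqiad.rdf-streaming-updater.mutation codfw.rdf-streaming-updater.mutation; do
  run kafka-configs.sh --zookeeper zk-placeholder:2181 --alter \
      --entity-type topics --entity-name "$topic" \
      --add-config "retention.ms=${retention_ms}"
done
```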

2021-02-05[edit]

  • 20:31 razzi: rebalance kafka partitions for eventlogging_SearchSatisfaction
  • 19:11 razzi: rebalance kafka partitions for eqiad.mediawiki.client.session_tick
  • 18:38 razzi: rebalance kafka partitions for codfw.mediawiki.client.session_tick
  • 17:53 razzi: rebalance kafka partitions for codfw.resource_change
  • 17:53 razzi: rebalance kafka partitions for eqiad.resource_change
  • 11:31 elukey: restart turnilo to pick up changes to the config (two new attributes to webrequest_128)

2021-02-04[edit]

  • 19:27 razzi: rebalance kafka partitions for eqiad.mediawiki.job.wikibase-addUsagesForPage
  • 19:27 razzi: rebalance kafka partitions for codfw.mediawiki.job.wikibase-addUsagesForPage
  • 19:22 razzi: rebalance kafka partitions for eventlogging_MobileWikiAppLinkPreview
  • 17:04 elukey: restart presto coordinator on an-coord1001 to pick up logging settings (log to http-request.log)
  • 17:02 elukey: roll restart presto on an-presto* to finally get http-request.log
  • 11:28 elukey: move aqs druid snapshot config to 2021-01
  • 09:01 elukey: restart superset and disable memcached caching
  • 08:08 elukey: move an-worker1117 from Hadoop Analytics to Hadoop Backup

2021-02-03[edit]

  • 21:38 razzi: rebalance kafka partitions for eventlogging_MobileWikiAppLinkPreview
  • 20:04 razzi: rebalance kafka partitions for eqiad.mediawiki.job.RecordLintJob
  • 20:03 razzi: rebalance kafka partitions for codfw.mediawiki.job.RecordLintJob
  • 18:28 razzi: rebalance kafka partitions for eqiad.mediawiki.job.refreshLinks
  • 18:28 razzi: rebalance kafka partitions for codfw.mediawiki.job.refreshLinks
  • 17:52 razzi: rebalance kafka partitions for eqiad.wdqs-internal.sparql-query
  • 17:50 razzi: rebalance kafka partitions for codfw.wdqs-internal.sparql-query
  • 14:48 elukey: sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chmod -R o+rx /wmf/data/wmf/mediawiki/history_reduced
  • 14:45 elukey: sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chmod o+rx /wmf/data/wmf/mediawiki
  • 14:40 elukey: kill + restart webrequest-druid-{hourly,daily} to pick up new changes after refinery deployment
  • 14:30 elukey: kill + relaunch webrequest_load to pick up new changes after refinery deployment
  • 14:28 elukey: relaunch edit-hourly-druid-coord 02-2021 after chmods
  • 14:25 elukey: sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chmod -R o+rx /wmf/data/wmf/edit
  • 14:24 elukey: sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chmod o+rx /wmf/data/wmf
  • 10:57 elukey: deploy refinery to hdfs
  • 10:36 elukey: released Refinery Source 0.1.0
  • 08:54 elukey: drop v0.1.x tags from Refinery source upstream repo
  • 08:48 elukey: drop refinery source artifacts v0.1.2 from Archiva

2021-02-02[edit]

  • 20:39 razzi: rebalance kafka partitions for eqiad.mediawiki.job.htmlCacheUpdate
  • 20:39 razzi: rebalance kafka partitions for codfw.mediawiki.job.htmlCacheUpdate
  • 19:29 ottomata: manually altered event.codemirrorusage to fix incompatible type change: https://phabricator.wikimedia.org/T269986#6797385
  • 19:28 elukey: change archiva-ci password in pwstore, archiva and jenkins
  • 17:53 razzi: rebalance kafka partitions for eqiad.wdqs-external.sparql-query
  • 17:17 razzi: rebalance kafka partitions for eventlogging_CentralNoticeImpression
  • 16:39 razzi: rebalance kafka partitions for eventlogging_InukaPageView
  • 08:42 elukey: decommission an-worker1117 from the Hadoop cluster, to move it under the Backup cluster

2021-02-01[edit]

  • 21:27 razzi: rebalance kafka partitions for eqiad.mediawiki.job.cdnPurge
  • 21:27 razzi: rebalance kafka partitions for codfw.mediawiki.job.cdnPurge
  • 20:51 razzi: rebalance kafka partitions for eventlogging_PaintTiming
  • 19:01 razzi: rebalance kafka partitions for eventlogging_LayoutShift
  • 18:58 razzi: rebalance kafka partitions for eqiad.mediawiki.job.recentChangesUpdate
  • 18:58 razzi: rebalance kafka partitions for codfw.mediawiki.job.recentChangesUpdate
  • 18:23 razzi: rebalance kafka partitions for codfw.mediawiki.recentchange
  • 18:09 razzi: rebalance kafka partitions for eqiad.resource_change

2021-01-29[edit]

  • 20:23 razzi: rebalance kafka partitions for eventlogging_NavigationTiming
  • 19:30 razzi: rebalance kafka partitions for eqiad.mediawiki.revision-score
  • 19:29 razzi: rebalance kafka partitions for codfw.mediawiki.revision-score
  • 19:14 razzi: rebalance kafka partitions for eventlogging_CpuBenchmark
  • 19:11 razzi: rebalance kafka partitions for eqiad.mediawiki.page-links-change
  • 19:10 razzi: rebalance kafka partitions for codfw.mediawiki.page-links-change
  • 14:33 elukey: rollback presto upgrade, workers seem unable to announce themselves to the query coordinator
  • 14:08 elukey: upgrade presto to 0.246 (from 0.226) on an-presto1001 - worker node
  • 14:02 elukey: upgrade presto to 0.246 (from 0.226) on an-coord1001 - query coordinator
  • 07:44 joal: Copy /wmf/data/event_sanitized to backup cluster (T272846)

2021-01-28[edit]

  • 22:23 razzi: rebalance kafka partitions for eqiad.mediawiki.page-links-change
  • 22:22 razzi: rebalance kafka partitions for codfw.mediawiki.page-links-change
  • 22:01 razzi: rebalance kafka partitions for eventlogging_QuickSurveyInitiation
  • 21:13 razzi: rebalance kafka partitions for topic eventlogging_EditAttemptStep
  • 19:49 mforns: finished deployment of refinery (for v0.0.146)
  • 18:57 mforns: starting deployment of refinery (for v0.0.146)
  • 18:54 mforns: deployed refinery-source v0.0.146 using Jenkins
  • 18:45 razzi: rebalance kafka partitions for topic eqiad.mediawiki.job.ORESFetchScoreJob
  • 18:42 razzi: rebalance kafka partitions for topic codfw.mediawiki.job.ORESFetchScoreJob
  • 18:22 razzi: rebalance kafka partitions for topic codfw.mediawiki.job.wikibase-InjectRCRecords
  • 17:26 razzi: rebalance kafka partitions for topic eqiad.mediawiki.revision-tags-change
  • 17:26 razzi: rebalance kafka partitions for topic codfw.mediawiki.revision-tags-change
  • 16:32 razzi: rebalance kafka partitions for topic eventlogging_CodeMirrorUsage
  • 16:16 elukey: manual failover of hdfs namenode active/master from an-master1002 to an-master1001

2021-01-27[edit]

  • 13:02 joal: Copy /wmf/data/event to backup cluster (30Tb) - T272846
  • 11:15 elukey: add client_port and debug fields to X-Analytics in webrequest varnishkafka streams

2021-01-26[edit]

  • 16:39 razzi: reboot kafka-test1006 for kernel upgrade
  • 09:37 elukey: reboot dbstore1005 for kernel upgrades
  • 09:35 joal: Copy /wmf/data/discovery to backup cluster (21Tb) - T272846
  • 09:31 elukey: reboot dbstore1003 for kernel upgrades
  • 09:15 elukey: reboot dbstore1004 for kernel upgrades
  • 09:07 joal: Copy /wmf/refinery to backup cluster (1.1Tb) - T272846
  • 09:01 joal: Copy /wmf/discovery to backup cluster (120Gb) - T272846
  • 08:42 joal: Copy /wmf/camus to backup cluster (120Gb) - T272846

2021-01-25[edit]

  • 20:42 razzi: rebalance kafka partitions for eqiad.mediawiki.page-properties-change.json
  • 20:41 razzi: rebalance kafka partitions for codfw.mediawiki.page-properties-change
  • 18:58 razzi: rebalance kafka partitions for eventlogging_ExternalGuidance
  • 18:53 razzi: rebalance kafka partitions for eqiad.mediawiki.job.ChangeDeletionNotification
  • 17:13 joal: Copy /user to backup cluster (92Tb) - T272846
  • 16:23 elukey: drain+restart cassandra on aqs1004 to pick up the new openjdk (canary)
  • 16:21 elukey: restart yarn and hdfs daemon on analytics1058 (canary node for new openjdk)
  • 12:25 joal: Copy /wmf/data/archive to backup cluster (32Tb) - T272846
  • 10:20 elukey: restart memcached on an-tool1010 to flush superset's cache
  • 10:18 elukey: restart superset to remove druid datasources support - T263972
  • 09:57 joal: Changing ownership of archive WMF files to analytics:analytics-privatedata-users after update of oozie jobs

2021-01-22[edit]

  • 17:38 mforns: finished refinery deploy to HDFS
  • 17:28 mforns: restarted refine_event and refine_eventlogging_legacy in an-launcher1002
  • 17:11 mforns: starting refinery deploy using scap
  • 17:09 mforns: bumped up refinery-source jar version to 0.0.145 in puppet for Refine and DruidLoad jobs
  • 16:44 mforns: Deployed refinery-source v0.0.145 using jenkins
  • 09:48 joal: Raise druid-public default replication-factor from 2 to 3

2021-01-21[edit]

  • 18:54 razzi: rebooting nodes for druid public cluster via cookbook
  • 16:49 ottomata: installed libsnappy-dev and python3-snappy on webperf1001
  • 15:17 joal: Kill mediawiki-wikitext-history-wf-2020-12 as it was stuck and failed
  • 11:19 elukey: block UA with 'python-requests.*' hitting AQS via Varnish

2021-01-20[edit]

  • 21:48 milimetric: refinery deployed, synced to hdfs, ready to restart 53 oozie jobs, will do so slowly over the next few hours
  • 18:11 joal: Release refinery-source v0.0.144 to archiva with Jenkins

2021-01-15[edit]

  • 09:21 elukey: roll restart druid brokers on druid public - stuck after datasource drop

2021-01-11[edit]

  • 07:26 elukey: execute 'sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chmod o+rx /wmf/data/archive/mediawiki' on launcher to fix dir perms

2021-01-09[edit]

  • 15:11 elukey: restart timers 'analytics-*' on labstore100[6,7] to apply new permission settings
  • 08:31 elukey: restart the failed hdfs rsync timers on labstore100[6,7] to kick off the remaining jobs
  • 08:30 elukey: execute hdfs chmod o+x of /wmf/data/archive/projectview /wmf/data/archive/projectview/legacy /wmf/data/archive/pageview/legacy to unblock hdfs rsyncs
  • 08:24 elukey: execute "sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chmod o+rx /wmf/data/archive/pageview" to unblock labstore hdfs rsyncs
  • 08:21 elukey: execute "sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chmod o+rx /wmf/data/archive/geoeditors" to unblock labstore hdfs rsync
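
The chmod entries above open each directory level so the labstore rsync jobs can traverse down to the data. A dry-run sketch that walks from a leaf directory up to the root; note the log mixes o+x and o+rx per level, while this sketch applies o+rx throughout for simplicity:

```shell
run() { echo "+ $*"; }

open_path() {
  local path="$1"
  # Apply o+rx to the directory and every ancestor, leaf first.
  while [ "$path" != "/" ]; do
    run sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chmod o+rx "$path"
    path=$(dirname "$path")
  done
}

open_path /wmf/data/archive/geoeditors
```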

2021-01-08[edit]

  • 18:54 joal: Restart jobs for permissions-fix (clickstream, mediacounts-archive, geoeditors-public_monthly, geoeditors-yearly, mobile_app-uniques-[daily|monthly], pageview-daily_dump, pageview-hourly, projectview-geo, unique_devices-[per_domain|per_project_family]-[daily|monthly])
  • 18:14 joal: Restart projectview-hourly job (permissions test)
  • 18:03 joal: Deploy refinery onto HDFS
  • 17:50 joal: deploy refinery with scap
  • 10:01 elukey: restart varnishkafka-webrequest on cp5001 - timeouts to kafka-jumbo1001, librdkafka seems not recovering very well
  • 08:46 elukey: force restart of check_webrequest_partitions.service on an-launcher1002
  • 08:44 elukey: force restart of monitor_refine_eventlogging_legacy_failure_flags.service
  • 08:18 elukey: raise default max executor heap size for Spark refine to 4G
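
Raising Refine's executor heap to 4G maps to Spark's standard memory settings. A hypothetical spark-submit sketch; the class and jar names are placeholders, not the real Refine invocation:

```shell
run() { echo "+ $*"; }

submit_refine() {
  run spark-submit --master yarn \
      --conf spark.executor.memory=4g \
      --conf spark.executor.memoryOverhead=512m \
      --class org.example.Refine refinery-job-placeholder.jar
}

submit_refine
```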

2021-01-07[edit]

  • 18:22 elukey: chown -R /tmp/analytics analytics:analytics-privatedata-users (tmp dir for data quality stats tables)
  • 18:21 elukey: "sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chown -R analytics:analytics-privatedata-users /wmf/data/wmf/data_quality_stats"
  • 18:10 elukey: temporarily disable hdfs-cleaner.timer to prevent /tmp/DataFrameToDruid from being dropped
  • 18:08 elukey: chown -R /tmp/DataFrameToDruid analytics:druid (was: analytics:hdfs) on hdfs to temporarily unblock Hive2Druid jobs
  • 16:31 elukey: remove /etc/mysql/conf.d/research-client.cnf from stat100x nodes
  • 15:40 elukey: deprecate the 'researchers' posix group for good
  • 11:24 elukey: execute "sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chmod -R o-rwx /wmf/data/event_sanitized" to fix some file permissions as well
  • 10:36 elukey: execute "sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chmod -R o-rwx /wmf/data/event" on an-master1001 to fix some file permissions (an-launcher executed timers during the past hours without the new umask) - T270629
  • 09:37 elukey: forced re-run of monitor_refine_event_failure_flags.service on an-launcher1002 to clear alerts
  • 08:26 joal: Rerunning 4 failed refine jobs (mediawiki_cirrussearch_request, day=6/hour=20|21, day=7/hour=0|2)
  • 08:14 elukey: re-enable puppet on an-launcher1002 to apply new refine memory settings
  • 07:59 elukey: re-enabling all oozie jobs previously suspended
  • 07:54 elukey: restart oozie on an-coord1001

2021-01-06[edit]

  • 20:42 ottomata: starting remaining refine systemd timers
  • 20:19 ottomata: restarted eventlogging_to_druid timers
  • 20:19 ottomata: restarted drop systemd timers
  • 20:18 ottomata: restarted reportupdater timers
  • 20:14 ottomata: re-starting camus systemd timers
  • 16:45 razzi: restart yarn nodemanagers
  • 16:08 razzi: manually failover hdfs haadmin from an-master1002 to an-master1001
  • 15:53 ottomata: stopping analytics systemd timers on an-launcher1002

2021-01-05[edit]

  • 21:32 ottomata: bumped mediawiki history snapshot version in AQS
  • 20:45 ottomata: Refine changes: event tables now have is_wmf_domain, canary events are removed, and corrupt records will result in a better monitoring email
  • 20:43 razzi: deploy aqs as part of train
  • 19:17 razzi: deploying refinery for weekly train
  • 09:29 joal: Manually reload unique-devices monthly in cassandra to fix T271170

2021-01-04[edit]

  • 22:22 razzi: reboot an-test-coord1001 to upgrade kernel
  • 14:24 elukey: deprecate the analytics-users group

2021-01-03[edit]

  • 14:11 milimetric: reset-failed refinery-sqoop-whole-mediawiki.service
  • 14:10 milimetric: manual sqoop finished, logs on an-launcher1002 at /var/log/refinery/sqoop-mediawiki.log and /var/log/refinery/sqoop-mediawiki-production.log
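
`systemctl reset-failed` (also used for product-analytics-movement-metrics on stat1007 higher up in this archive) clears a unit's failed state so monitoring stops alerting; it does not restart anything. A small sketch that resets every failed unit matching a prefix, fed by `systemctl list-units` (dry-run; the prefix is an example):

```shell
run() { echo "+ $*"; }

# Reads "unit load active sub" lines (as printed by
# `systemctl list-units --state=failed --plain --no-legend`) on stdin and
# resets the units whose name starts with the given prefix.
reset_failed_matching() {
  local prefix="$1"
  while read -r unit _; do
    case "$unit" in
      "$prefix"*) run systemctl reset-failed "$unit" ;;
    esac
  done
}

# On a real host:
#   systemctl list-units --state=failed --plain --no-legend | reset_failed_matching refinery-
```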

2021-01-01[edit]

  • 14:54 milimetric: deployed refinery hotfix for sqoop problem, after testing on three small wikis