Analytics/Server Admin Log

2023-02-16

 * 21:10 SandraEbele: restarted oozie webrequest load bundle.
 * 21:09 SandraEbele: Added new field referer_data to wmf.webrequest table using the alter table statement
 * 21:07 SandraEbele: successfully deployed analytics refinery
 * 18:46 SandraEbele: started deploying analytics refinery
 * 18:37 SandraEbele: killed webrequest bundle oozie jobs to deploy refinery changes.
 * 16:55 SandraEbele: Deployed refinery-source change to remove Github.io from Mediasites definition of referers.

2023-02-13

 * 21:40 xcollazo: deploying section_topics v0.5.0 on platform_eng Airflow instance
 * 21:39 ottomata: enabled rc1.mediawiki.page_change stream on group0 and group1 wikis
 * 14:15 btullis: roll-restarting all eventgate pods
 * 14:06 nfraison: Reimage an-test-presto1001 to upgrade to bullseye T329361
 * 10:46 nfraison: restarting presto-worker on an-presto[1001-1015].eqiad.wmnet to pick up new gc logging settings T329054
 * 10:15 btullis: Reimage an-test-worker1001 to upgrade to bullseye T329363
 * 09:59 nfraison: restarting presto-coordinator on an-coord1001 to pick up new gc logging settings T329054
 * 09:57 nfraison: re-enabled puppet agent on an-presto[1001-1015].eqiad.wmnet and an-coord1001.eqiad.wmnet
 * 09:08 aqu: Rerun killed Oozie pageview-hourly-coord of 2023-02-11 with sudo -u analytics kerberos-run-command analytics oozie job --oozie $OOZIE_URL -rerun 0019103-210107075406929-oozie-oozi-C -date 2023-02-11T14:00Z::2023-02-11T16:00Z
 * 09:04 nfraison: restarting presto-coordinator on an-test-coord1001 to pick up new gc logging settings T329054
 * 08:59 nfraison: restarting presto-worker on an-test-presto1001 to pick up new gc logging settings T329054
 * 08:52 nfraison: disabled puppet agent on an-presto[1001-1015].eqiad.wmnet and an-coord1001.eqiad.wmnet to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/888214 on test cluster first only
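
Several entries above use bracketed host ranges such as an-presto[1001-1015].eqiad.wmnet. As a minimal sketch of which hosts such a pattern covers, the hypothetical helper below (`expand_host_range` is an illustrative name, not part of any tool mentioned in this log) expands a single numeric `[start-end]` range. It is an assumption that the range is inclusive and zero-padded to the width of the start value; other bracket forms, such as an-worker109[89], are not handled.

```python
import re


def expand_host_range(pattern: str) -> list[str]:
    """Expand one numeric '[start-end]' range in a host pattern (hypothetical helper)."""
    match = re.fullmatch(r"(.*?)\[(\d+)-(\d+)\](.*)", pattern)
    if match is None:
        # No numeric range present: return the pattern unchanged.
        return [pattern]
    prefix, start, end, suffix = match.groups()
    width = len(start)  # keep zero padding, e.g. '05' stays two digits wide
    return [
        f"{prefix}{number:0{width}d}{suffix}"
        for number in range(int(start), int(end) + 1)
    ]


hosts = expand_host_range("an-presto[1001-1015].eqiad.wmnet")
print(len(hosts), hosts[0], hosts[-1])
```

A pattern without brackets, such as an-coord1001.eqiad.wmnet, comes back as a one-element list.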

2023-02-10

 * 23:22 mforns: unpaused all airflow dags and cleared all failed tasks after the incident
 * 22:30 btullis: starting the hadoop-yarn-resourcemanager on an-master1001 and failing back to it.
 * 22:25 btullis: stopping hadoop-yarn-resourcemanager service in an-master1001 to fail over automatically to an-master1002
 * 21:21 mforns: restarted airflow@analytics.service in an-launcher1002

2023-02-09

 * 17:32 mforns: deployed airflow
 * 12:01 btullis: Shutting down an-worker109[89] and dse-k8s-worker1002 for another GPU move - T318696
 * 10:36 joal: Start airflow webrequest_actor jobs
 * 10:26 joal: Deploy analytics-airflow
 * 10:25 joal: Setup airflow start-date variables for new dags
 * 10:10 joal: Merge airflow code for learning/actor -> webrequest_actor move
 * 10:01 joal: Move data and update hive tables from learning/actor convention to webrequest_actor convention
 * 09:59 joal: Kill oozie pageview-learning jobs

2023-02-08

 * 19:26 milimetric: finished deploying refinery-source 0.2.11, refinery, and synced to hdfs
 * 12:04 btullis: shut down an-worker109[67] and dse-k8s-worker1001 ready for GPU swap.

2023-02-03

 * 15:23 milimetric: deployed airflow-dags/analytics to disable skein log collection from the SparkSubmitOperator.
 * 10:11 steve_munene: roll-restart aqs to update mediawiki_history_snapshot to 2023-01

2023-02-02

 * 12:26 btullis: deploying the updated build of superset to production T328047
 * 09:56 btullis: correction: beginning a rolling reboot of all aqs servers for T325132
 * 09:52 btullis: beginning a rolling reboot of all aqs servers for T326945
 * 08:44 steve_munene: Deployed refinery using scap, then deployed onto hdfs
 * 08:26 steve_munene: refinery-deploy-to-hdfs run4

2023-02-01

 * 10:51 steve_munene: Deploying refinery for ops week

2023-01-30

 * 16:41 btullis: started an-presto1006-1015 again, but disabled the presto service on them once again T323783 and T325809

2023-01-27

 * 11:41 steve_munene: datahub helmfile apply on main for T327884
 * 11:17 btullis: shut down an-worker1087 to await RAID BBU replacement
 * 11:03 steve_munene: datahub: apply on main for T327884

2023-01-26

 * 10:42 joal: deploying airflow analytics for GDI dags
 * 10:36 joal: drop/recreate wmf_raw.mediawiki_private_cu_changes hive table to have new fields
 * 10:01 joal: deploy refinery onto hdfs
 * 09:48 joal: deploying refinery using scap (no refinery-source deploy)
 * 09:43 joal: Rerun failed 'cassandra_daily_load.load_mediarequest_per_file_to_cassandra 2023-01-25T00:00:00+00:00' task

2023-01-25

 * 16:54 steve_munene: Restarting presto-server.service on presto coordinator an-coord1001 for T323783
 * 16:53 btullis: kicked off a rolling reboot of kafka-jumbo as part of T325132
 * 15:14 btullis: rebooting an-conf1003 for new kernel
 * 14:54 btullis: started a rolling-reboot of the hadoop workers via `sre.hadoop.reboot-workers` cookbook.

2023-01-23

 * 13:06 btullis: restarted webrequest_sampled_supervisor realtime druid indexation job
 * 10:04 btullis: proceeding to upgrade an-tool1010 to bullseye for superset 1.5.3 upgrade T323458

2023-01-19

 * 10:25 btullis: enabled dashboard native filtering in superset https://gerrit.wikimedia.org/r/c/operations/puppet/+/881510 for T318299

2023-01-17

 * 20:54 xcollazo: dropping old partitions from image_suggestions Hive tables as per https://phabricator.wikimedia.org/T325837
 * 16:50 btullis: shutdown an-worker1086 for RAID BBU replacement

2023-01-16

 * 08:46 elukey: powercycle an-worker1125 - soft lockup traces registered in the tty, host frozen

2023-01-10

 * 17:33 btullis: chassis power reset on an-worker1032 (T326459)
 * 15:58 SandraEbele: backfilling refine_event_sanitized_analytics_immediate on an-launcher1002 `sudo -u analytics kerberos-run-command analytics /usr/local/bin/refine_event_sanitized_analytics_immediate --ignore_failure_flag=true --since=2023-01-07T17:00:00 --until=2023-01-08T10:00:00`
 * 15:55 SandraEbele: reran failed pageview-druid-hourly-coord oozie job for 2023-1-10-10.
 * 11:36 btullis: roll-rebooting the analytics druid cluster to pick up new kernel
 * 10:24 btullis: roll-rebooting the druid-public cluster to pick up new kernel

2023-01-09

 * 17:09 aqu: Relaunching refine_event after partial backfilling `sudo systemctl start refine_event.service` (an-launcher1002)
 * 14:48 SandraEbele: reran webrequest failed jobs `sudo -u analytics kerberos-run-command analytics oozie job --oozie $OOZIE_URL -Dstart_time=2023-01-08T07:00Z -Dstop_time=2023-01-08T14:59Z -Dwebrequest_source=text -Derror_incomplete_data_threshold=100 -Dwarning_incomplete_data_threshold=100 -Derror_data_loss_threshold=100 -Dwarning_data_loss_threshold=100 -submit -config /home/ebysans/webrequest_text_coordinator.properties`
 * 10:21 aqu: backfilling with refine_event on an-launcher1002 `sudo -u analytics kerberos-run-command analytics /usr/local/bin/refine_event --ignore_failure_flag=true --since=2023-01-07T16:00:00 --until=2023-01-09T09:00:00 --verbose`
 * 09:48 aqu: killing refine_event yarn application `sudo -u analytics yarn application -kill application_1663082229270_682638`
 * 09:39 aqu: Manually kill the Spark process on an-launcher1002 `sudo -u analytics kill -9 28538`
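
The backfill entries above pass a `--since`/`--until` window. As a rough sketch of how many hourly partitions such a window spans, the hypothetical helper below (`hourly_partitions` is an illustrative name, not part of refine_event) lists the hours in the window. It is an assumption that the window is half-open, that is, `--since` is included and `--until` is excluded; the real tool may treat the bounds differently.

```python
from datetime import datetime, timedelta


def hourly_partitions(since: str, until: str) -> list[datetime]:
    """List hourly timestamps in [since, until) (hypothetical helper, half-open by assumption)."""
    start = datetime.fromisoformat(since)
    end = datetime.fromisoformat(until)
    hours = []
    current = start
    while current < end:
        hours.append(current)
        current += timedelta(hours=1)
    return hours


# Window taken from the 10:21 refine_event backfill entry above.
window = hourly_partitions("2023-01-07T16:00:00", "2023-01-09T09:00:00")
print(len(window), window[0].isoformat(), window[-1].isoformat())
```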

2023-01-06

 * 12:29 steve_munene: roll-restarting aqs servers to bump mediawiki_history_snapshot to 2022-12

2023-01-04

 * 17:14 xcollazo: Dropped all temporary differential privacy tables with the 'DROP DATABASE tumult_temp_*' pattern.

2023-01-03

 * 11:08 btullis: restarted hive-server2 and hive-metastore services on an-coord1001 after failover to standby server
 * 10:39 btullis: fail over hive services to an-coord1002 with change to the DNS CNAME for analytics-hive.eqiad.wmnet
 * 10:20 btullis: restart hive-server2 and hive-metastore services on an-coord1002 prior to failover

2022-12-25

 * 19:52 btullis: reran the `refine_eventlogging_legacy` job
 * 16:56 btullis: restarted `monitor_refine_event` service on an-launcher1002 after successful refine run
 * 16:55 btullis: reran refine_event for 'mediawiki_api_request|mediawiki_cirrussearch_request' at 16:40

2022-12-22

 * 11:01 btullis: powering up an-presto10[05-15] but presto-server will be disabled.

2022-12-21

 * 14:42 elukey: `apt-get clean` on an-launcher1002 to free some space
 * 01:17 xcollazo: Deleted unused tables analytics_platform_eng.imagerec and analytics_platform_eng.imagerec_prod.

2022-12-19

 * 13:45 btullis: restart presto-server on an-coord1001 to increase heap from 4GB to 16 GB T325331
 * 12:11 aqu: systemctl start hadoop-namenode-backup-hdfs.service on an-master1002 at 11am UTC
 * 09:36 aqu: Deployed analytics/refinery using scap, then deployed onto HDFS.
 * 09:17 aqu: About to deploy analytics/refinery (bug fix in HDFS usage pipeline)

2022-12-16

 * 15:36 xcollazo: deploying 'Fix subtle bug on image_suggestions when resolving varprop.' on platform_eng Airflow instance.

2022-12-15

 * 22:28 btullis: run `sudo apt clean` on an-coord1001
 * 19:08 xcollazo: Deploying Spark3 upgrade of image_suggestions job to the platform_eng Airflow instance.
 * 10:03 joal: Restart failed airflow tasks

2022-12-13

 * 21:35 aqu: Deploying analytics/refinery (HDFS FSImage conversion to XML script)

2022-12-09

 * 08:38 joal: Kill refine_eventlogging_legacy stuck job (application_1663082229270_510052)

2022-12-08

 * 13:55 joal: rerun webrequest failed jobs for hour 2022-12-08-T11:00Z with updated workflow (no dataloss checks)
 * 12:23 joal: rerun webrequest failed jobs for hour 2022-12-08-T11:00Z

2022-12-07

 * 17:57 aqu: Adding raw hdfs fsimage dir in HDFS (an-launcher1002)
 * 17:47 aqu: Adding hdfs/usage folder dataset in HDFS
 * 16:24 aqu: Deploying analytics/refinery (HDFS usage scripts)
 * 15:13 btullis: roll-restarting AQS to pick up new mediawiki_history_reduce snapshot
 * 14:06 btullis: rebuilding an-tool1005 as bullseye to test superset 1.5.2 upgrade
 * 09:10 btullis: reboot an-worker1108 as it was spinning with soft CPU lockups

2022-12-06

 * 12:47 btullis: sudo systemctl restart wmf_auto_restart_prometheus-mysqld-exporter.service on matomo1002
 * 11:53 btullis: attempting to unmount and remount `/mnt/hdfs` on stat1004

2022-12-05

 * 11:45 steve_munene: restarting presto-server.service on an-presto1007 T323783

2022-11-30

 * 16:45 btullis: roll-restarting presto workers again for T321960 and T321231
 * 16:20 btullis: roll-restarting presto workers for T321960 and T321231
 * 16:19 btullis: restarting presto-server on an-coord1001 for T321960 and T321231
 * 13:39 btullis: pushing out conda-analytics to all remaining servers `btullis@cumin1001:~$ sudo debdeploy deploy -u 2022-11-30-conda-analytics.yaml -Q P:analytics::conda_analytics`
 * 13:02 btullis: deploying conda-analytics 0.0.12 to stat boxes for T321088
 * 12:29 btullis: repooling eqiad for eventstreams for T324074
 * 11:59 btullis: depooling eqiad for eventstreams for T324074
 * 11:34 btullis: repooling codfw for eventstreams for T324074
 * 11:32 btullis: destroying the eventstreams deployment in codfw and reapplying for T324074
 * 11:11 btullis: depooling codfw for eventstreams for T324074

2022-11-29

 * 17:12 ottomata: deploying refinery, then restarting druid webrequest daily and hourly loading oozie jobs
 * 17:08 btullis: booted all of the an-worker nodes that had been switched off.
 * 15:04 btullis: shutting down an-worker1093
 * 15:03 btullis: shutting down an-worker1089
 * 15:02 btullis: shutting down an-worker1085
 * 15:00 btullis: shutting down an-worker1083
 * 14:58 btullis: shutting down an-worker1079
 * 14:55 btullis: shutting down an-worker1090

2022-11-28

 * 12:00 btullis: restarted presto-server on an-coord1001 to test T321960

2022-11-25

 * 15:29 btullis: reset the bmc on an-coord1002
 * 11:24 elukey: restart turnilo on an-tool1007 to pick up new settings for webrequest_sampled_live
 * 10:07 elukey: refresh the webrequest-sampled-live druid supervisor after https://gerrit.wikimedia.org/r/c/analytics/refinery/+/859463

2022-11-24

 * 16:21 SandraEbele: restarted webrequest-druid-daily-coord as part of weekly deployment train.
 * 16:15 SandraEbele: killed webrequest-druid-daily-coord for restart as part of weekly deployment train.
 * 16:13 SandraEbele: successfully restarted webrequest-druid-hourly-coord as part of weekly deployment train.
 * 16:11 SandraEbele: killed webrequest-druid-hourly-coord for restart as part of weekly deployment train.
 * 15:30 SandraEbele: Started deployment of refinery as part of weekly deployment train

2022-11-23

 * 15:38 btullis: roll-restarting kafka-jumbo brokers to pick up new certificates. T323697

2022-11-18

 * 18:56 mforns: re-ran refine_event_sanitized_analytics_immediate from 2022-11-17T13 to 2022-11-18T18 to fix the issues caused by a bug (allow-list typo) deployed yesterday.

2022-11-17

 * 17:14 mforns: restarted mediawiki-denormalize-coord as part of weekly deployment train
 * 16:07 mforns: finished refinery deployment
 * 15:53 mforns: started refinery deployment for weekly train (accompanying refinery-source 0.2.9)
 * 14:52 btullis: deploying updated hadoop packages to druid-public
 * 14:51 btullis: deploying updated hadoop packages to druid-analytics
 * 14:37 btullis: deploying updated hadoop packages to hue and yarn webservers
 * 14:34 btullis: deploying updated hadoop packages to analytics-presto hosts

2022-11-16

 * 21:40 mforns: deployed airflow up to e08e32e83b519dee214b7177bbe0fd3ac5a0be3c
 * 20:37 mforns: deployed refinery-source 0.2.9 as part of weekly deployment train
 * 09:11 elukey: update the webrequest sampled live supervisor on Druid Analytics after https://gerrit.wikimedia.org/r/857408

2022-11-15

 * 14:24 elukey: started webrequest_sampled supervisor on Druid Analytics - T314981
 * 11:50 elukey: `elukey@kafka-jumbo1001:~$ kafka topics --create --topic webrequest_sampled --partitions 3 --replication-factor 3` - T314981

2022-11-07

 * 06:24 aqu: sudo systemctl reset-failed monitor_refine_eventlogging_legacy.service
 * 06:00 aqu: Rerunning on an-launcher1002 sudo -u analytics kerberos-run-command analytics refine_eventlogging_legacy --ignore_failure_flag=true --table_include_regex='homepagemodule' --since='2022-11-04T15:00:00.000Z' --until='2022-11-05T16:00:00.000Z'

2022-11-04

 * 10:14 btullis: btullis@clouddumps1002:/srv/dumps/xmldatadumps/public/other/pageview_complete/2022/2022-11$ sudo systemctl restart analytics-dumps-fetch-pageview_complete_dumps.service
 * 10:14 btullis: btullis@clouddumps1002:/srv/dumps/xmldatadumps/public/other/pageview_complete/2022/2022-11$ sudo chown dumpsgen:dumpsgen pageviews-20221102-automated.bz2

2022-11-03

 * 08:55 joal: Add _SUCCESS file to manually computed pageview-actor data for 2022-11-02T11:00

2022-10-27

 * 17:24 mforns: re-running webrequest-load-wf-text-2022-10-27-10 with lower thresholds

2022-10-25

 * 17:28 mforns: deployed refinery to the test cluster

2022-10-24

 * 16:19 btullis: `chown analytics-deploy /srv/deployment/analytics` on clouddumps100[1-2]
 * 15:30 mforns: finished deploying refinery as part of the weekly train
 * 15:30 mforns: deployed airflow-dags as part of weekly train
 * 15:12 mforns: starting refinery regular weekly deploy
 * 07:32 elukey: `elukey@stat1005:~$ sudo systemctl reset-failed session-c4122.scope session-c4123.scope session-c4124.scope session-c4447.scope session-c4450.scope session-c4449.scope session-c4638.scope jupyter-echetty-singleuser.service`
 * 07:30 elukey: `elukey@stat1004:~$ sudo systemctl reset-failed jupyter-ntsako-singleuser.service`

2022-10-23

 * 13:31 elukey: clean logs with 10d+ on an-airflow1001 to free some space
 * 13:26 elukey: clean logs with 15d+ on an-airflow1001 to free some space

2022-10-22

 * 08:17 joal: rerun webrequest-load-wf-text-2022-10-22-3 oozie job with higher error threshold

2022-10-21

 * 16:55 btullis: restarting hive-server2 service on an-coord1001
 * 16:49 btullis: restarting hue on an-tool1009
 * 15:18 joal: restart hive-server2 service
 * 07:32 joal: restart failed oozie jobs
 * 07:28 joal: Restart HiveServer2 on an-coord1001 (I didn't even know I could do this)
 * 06:53 joal: killing old mjolnir jobs
 * 06:50 joal: Kill rerun stuck oozie job
 * 06:37 joal: Kill skein test jobs in yarn

2022-10-19

 * 17:14 btullis: reset the BMC on analytics1075

2022-10-17

 * 18:17 mforns: deleted Airflow DAGs for backfilling of Cassandra loading of unique devices

2022-10-15

 * 09:24 joal: Rerun failed refine_eventlogging_analytics job
 * 09:00 joal: Rerun pageview-hourly-wf-2022-10-14-23

2022-10-13

 * 13:43 mforns: cleared airflow job wikidata_dump_to_hive_weekly

2022-10-12

 * 15:26 ottomata: remove materialized .json files from schemas/event/primary - this should be a no-op as no clients should actually be using the json files. - T315674

2022-10-11

 * 15:44 ottomata: remove materialized .json files from schemas/event/secondary - this should be a no-op as no clients should actually be using the json files. - T315674
 * 15:04 btullis: reset the BMC on an-worker1086 with `sudo bmc-device --cold-reset`
 * 06:44 elukey: kill leftover process of jmads on stat1005 to allow user cleanup via puppet
 * 06:43 elukey: kill leftover process of nokafor on stat1004 to allow user cleanup via puppet
 * 06:37 elukey: kill leftover process of bmansurov on stat1007 to allow user cleanup via puppet
 * 06:34 elukey: kill leftover process of bmansurov on an-airflow1002 to allow user cleanup via puppet

2022-10-10

 * 15:36 mforns: reran geoeditors_public_monthly airflow DAG for Sept 2022, after fix
 * 15:34 mforns: deployed airflow to fix geoeditors_public_monthly DAG
 * 15:31 mforns: started unique devices daily back-filling in cassandra from 1st of July to end of Sept

2022-10-08

 * 11:48 joal: rerun webrequest-load-wf-text-2022-10-7-20

2022-10-07

 * 09:26 elukey: delete calico pods in CrashLoop on dse (probably due to the incorrect docker settings)
 * 07:54 elukey: re-initialize docker on dse-k8s-worker1004 - wrong storage type set (devicemapper instead of overlay2)
 * 07:49 elukey: re-initialize docker on dse-k8s-worker100[5-8] - wrong storage type set (devicemapper instead of overlay2)

2022-10-06

 * 19:51 SandraEbele: Started airflow projectview_hourly_dag
 * 19:51 SandraEbele: Killed Oozie projectview-hourly job
 * 19:40 SandraEbele: Deployed airflow to fix projectview_hourly_dag
 * 13:48 btullis: decommission aqs1007 (also forgot to log aqs1006)
 * 12:15 btullis: decommissioning aqs1005
 * 11:23 btullis: decommissioning aqs1004

2022-10-05

 * 16:48 btullis: forcibly and lazily unmounted legacy labstore hosts from an-launcher1002 and removed their /etc/fstab entries
 * 15:27 SandraEbele: deployed refinery source
 * 14:33 mforns: finished refinery deploy - regular weekly train
 * 14:05 mforns: starting refinery deploy - regular weekly train
 * 13:49 SandraEbele: Started Airflow projectview_geo job
 * 13:48 SandraEbele: killed Oozie projectview-geo-coord job
 * 13:21 SandraEbele: deploying fix for projectview dags on airflow.

2022-10-04

 * 09:53 btullis: deployed eventgate-logging-external to eqiad (a few minutes ago)
 * 09:45 btullis: deploying new eventgate-logging-external service to codfw
 * 09:44 btullis: deploying new eventgate-logging-external service to staging

2022-10-02

 * 08:13 elukey: apt-get clean on an-airflow1001 to free some space on the root partition

2022-09-30

 * 08:41 btullis: restarted hive-server2 and hive-metastore services on an-coord1002 (standby) server

2022-09-29

 * 12:34 joal: Rerun failed oozie webrequest-load-wf-text-2022-9-29-9
 * 06:38 joal: Try to rerun airflow unique_devices_daily.compute_per_project_family_metrics.2022-09-15
 * 06:37 joal: Rerun airflow unique_devices_daily (schedule: @daily)

2022-09-28

 * 19:50 mforns: killed oozie's unique_devices-per_domain-daily-coord because we migrated it to airflow
 * 19:49 mforns: killed oozie's unique_devices-per_project_family-daily-coord because we migrated it to airflow
 * 19:48 mforns: killed oozie's unique_devices-per_project_family-monthly-coord because we migrated it to airflow
 * 19:48 mforns: killed oozie's unique_devices-per_domain-monthly-coord because we migrated it to airflow
 * 18:22 mforns: deployed airflow to fix unique_devices jobs
 * 15:29 SandraEbele: started airflow projectview_geo job
 * 15:01 btullis: roll-restarting druid-analytics
 * 15:00 SandraEbele: deploying Airflow for hdfsarchiver operator fix
 * 14:02 btullis: roll-restarting druid-public
 * 09:22 btullis: started cookbook sre.kafka.roll-restart-brokers jumbo-eqiad

2022-09-27

 * 15:05 mforns: re-ran wikidata_metrics_to_graphite_daily failed airflow tasks
 * 15:03 mforns: re-ran cassandra_daily_load failed airflow tasks
 * 14:59 mforns: re-ran apis_metrics_to_graphite_hourly
 * 14:56 mforns: deployed Airflow (fixed)
 * 14:23 mforns: rolled back Airflow
 * 14:23 mforns: deployed Airflow for 3 fixes

2022-09-26

 * 20:07 xcollazo: Kill oozie geoeditors jobs for load, public monthly, and yearly after Airflow migration.
 * 16:13 joal: rerunning failed webrequest-text-2022-09-26-15
 * 13:48 aqu: Deploying airflow-dags on analytics & analytics_test
 * 11:03 btullis: failing back hive to an-coord1001 using DNS https://gerrit.wikimedia.org/r/c/operations/dns/+/832294
 * 09:41 btullis: rebooted matomo1002 at the VM level to pick up new disk
 * 09:40 btullis: merged the spark3 patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/834500
 * 06:36 elukey: clean up my old home dir on matomo1002, ran `apt-get clean` + some other clean up steps on matomo1002 to free space on the root partition

2022-09-23

 * 19:11 mforns: deployed Airflow analytics for a quick fix

2022-09-22

 * 22:26 joal: Kill oozie cassandra monthly loading jobs as we migrate them to airflow
 * 22:20 joal: Deploy airflow for cassandra-loading patch
 * 20:53 joal: Deploy analytics airflow-dags to try to fix cassandra loading jobs

2022-09-21

 * 19:25 joal: Kill oozie daily cassandra loading jobs as we move them to airflow
 * 19:18 ottomata: kill aarora process 30421 run_embedding_training.sh on stat1005
 * 19:13 joal: Deployed refinery for HQL patch (Njideka)
 * 19:11 ottomata: kill aarora process 14584 on stat1005 - using 2500% cpu

2022-09-20

 * 20:10 mforns: finished refinery deployment (weekly train)
 * 19:55 mforns: starting refinery deployment (weekly train)
 * 15:45 joal: kill oozie hourly cassandra loading job (1 job) in favor of the airflow one

2022-09-19

 * 22:28 milimetric: Wikistats: improved build a little and deployed fix to T312717

2022-09-15

 * 08:43 aqu: about to deploy analytics/refinery
 * 05:14 aqu: sudo -u analytics kerberos-run-command analytics refine_eventlogging_legacy --table_include_regex='wikipediaportal' --since='2022-09-13T23:00:00.000Z' --until='2022-09-15T00:00:00.000Z'

2022-09-14

 * 17:11 aqu: Sep 14 15:23:34 UTC sudo systemctl start check_webrequest_partitions.service
 * 12:56 aqu: ~1h ago sudo systemctl start refinery-sqoop-mediawiki-production-daily.service ; sudo systemctl start refinery-import-siteinfo-dumps.service ; sudo systemctl start refinery-import-page-current-dumps.service ; sudo systemctl start refinery-import-page-history-dumps.service
 * 11:34 btullis: remounted all remaining /mnt/hdfs mount points, except stat1005 which is busy
 * 11:12 btullis: remounted /mnt/hdfs on an-coord100[1-2]
 * 11:09 btullis: remounted /mnt/hdfs on an-airflow1001
 * 09:14 joal: Restart oozie virtualpageview job
 * 09:10 btullis: re-mounted /mnt/hdfs on an-launcher1002.
 * 07:11 joal: restart webrequest oozie bundle

2022-09-13

 * 17:22 joal: rerun refine_eventlogging_legacy
 * 17:14 joal: rerun refine_event
 * 17:14 joal: rerun refine_netflow
 * 16:53 joal: Rerun refine_eventlogging_analytics
 * 16:45 joal: Kill-rerun suspended oozie jobs (virtual-pageview and predictions-actor)
 * 16:34 joal: rerun failed webrequest oozie jobs
 * 16:30 btullis: restarting hive-server2 and hive-metastore on an-coord1001 (currently standby)
 * 16:29 btullis: restarting oozie on an-coord1001
 * 16:10 joal: Rerun failed oozie webrequest jobs
 * 15:57 btullis: rolling out updated hadoop packages to an-airflow1003
 * 15:55 btullis: rolling out upgraded hadoop client packages to stat servers.
 * 15:51 btullis: restarting eventlogging_to_druid_network_flows_internal_hourly.service eventlogging_to_druid_prefupdate_hourly.service refine_event_sanitized_analytics_immediate.service refine_event_sanitized_main_immediate.service
 * 15:49 btullis: restarting eventlogging_to_druid_navigationtiming_hourly.service on an-launcher1002
 * 15:46 btullis: restarting eventlogging_to_druid_editattemptstep_hourly.service on an-launcher1002
 * 15:44 btullis: cancel that last message. Upgrading hadoop packages on an-launcher instead. They were inadvertently omitted last time.
 * 15:39 btullis: Going to downgrade hadoop on all hadoop-worker nodes to 2.10.1
 * 15:21 btullis: failed over hive to an-coord1002 via DNS https://gerrit.wikimedia.org/r/c/operations/dns/+/831906
 * 15:20 btullis: restarted yarn service on an-master1002 to make the active host an-master1001 again.
 * 15:11 btullis: restart hive-server2 and hive-metastore service on an-coord1002 to pick up new version of hadoop
 * 14:55 btullis: rolling out updated hadoop packages to analytics-airflow (cumin alias) hosts
 * 14:42 btullis: sudo systemctl restart analytics-reportupdater-logs-rsync.service on an-launcher1002
 * 13:21 joal: Manual launch of refinery-drop-mediawiki-snapshots with new tables in patch https://gerrit.wikimedia.org/r/831866
 * 10:51 btullis: attempting failback operation on hadoop namenodes
 * 09:42 btullis: roll-restarting the hadoop masters via the cookbook

2022-09-12

 * 08:37 btullis: cold-reset BMC device on analytics1073

2022-09-08

 * 17:32 joal: make ops reboot stat1008

2022-09-07

 * 13:36 joal: rerun failed airflow tasks

2022-09-06

 * 22:18 milimetric: restarted webrequest druid daily and hourly jobs
 * 22:18 milimetric: restarted referrer daily coordinator
 * 22:18 milimetric: restarted webrequest load bundle
 * 21:57 milimetric: finished cleaning up bad state and re-deploying refinery
 * 21:45 milimetric: cleared logs earlier than September 1st from an-launcher1002:/srv/airflow-analytics/logs/scheduler
 * 18:49 milimetric: finished refinery-source 0.2.6 deploy, waiting 5 minutes and starting refinery deploy
 * 18:28 milimetric: weekly deployment train starting
 * 09:55 btullis: merged and deployed https://gerrit.wikimedia.org/r/c/operations/puppet/+/821695

2022-09-04

 * 12:49 elukey: pkill remaining processes of user effeietsanders on stat1008 to unblock puppet

2022-09-02

 * 08:25 joal: Restart mediawiki_history_denormalize job manually

2022-08-30

 * 17:49 joal: Deploying refinery onto HDFS
 * 17:11 joal: deploy refinery using scap
 * 17:11 joal: release refinery-source v0.2.5 to archiva

2022-08-29

 * 16:44 mforns: killed mediawiki-history-dumps oozie after migration to airflow
 * 08:04 joal: Rerun refine_eventlogging_legacy failed hours
 * 07:54 joal: rerun pageview-hourly-wf-2022-8-28-15 oozie workflow

2022-08-22

 * 16:25 btullis: btullis@an-airflow1004:~$ sudo systemctl reset-failed ifup@ens13.service

2022-08-19

 * 08:45 btullis: restarted archiva to pick up new JRE

2022-08-18

 * 19:57 ottomata: apply yarn production queue changes to allow analytics-research and analytics-platform-eng users to submit jobs to production queue - T312858
 * 14:04 btullis: re-running refine_eventlogging_legacy for helppanel
 * 09:51 btullis: restarted monitor-refine-event on an-launcher1002

2022-08-17

 * 13:19 mforns: deployed airflow for https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/117

2022-08-16

 * 18:49 ottomata: complete refinery deploy that was unfinished from last week. an-launcher1002 and hdfs already have this version (6e47e0e712528c8816b7fd7456b8745e4dbc5c72) deployed.
 * 16:02 btullis: deploying airflow-dags

2022-08-15

 * 19:26 ottomata: test

2022-08-10

 * 18:04 ottomata: Deployed refinery using scap, then deployed onto hdfs
 * 17:03 ottomata: stopping puppet and drop data timers on an-launcher1002 and an-test-coord1001 to deploy drop script changes - T270433
 * 13:42 btullis: failed hive back to an-coord1001 via DNS change.
 * 11:47 btullis: btullis@an-coord1001:~$ sudo systemctl restart hive-server2.service hive-metastore.service

2022-08-08

 * 11:43 btullis: rebooting an-worker1102 due to kernel soft lockups

2022-08-05

 * 16:05 milimetric: force scap deploying refinery
 * 16:01 ottomata: removing airflow logs older than 7 days on an-launcher1002

2022-08-04

 * 18:31 ottomata: dropping mediawiki_web_ui_interactions hive tables and data - T314151
 * 18:19 milimetric: scap deploying refinery host by host after Ben cleaned up the repos with "git checkout master"
 * 18:11 btullis: btullis@deploy1002:/srv/deployment/analytics/refinery$ scap deploy -l stat1008.eqiad.wmnet "Regular analytics weekly train [analytics/refinery@$(git rev-parse --short HEAD)]"
 * 18:05 btullis: we are re-deploying refinery to an-launcher1002 with the command above
 * 18:04 btullis: btullis@deploy1002:/srv/deployment/analytics/refinery$ scap deploy -l an-launcher1002.eqiad.wmnet "Regular analytics weekly train [analytics/refinery@$(git rev-parse --short HEAD)]"
 * 18:02 btullis: analytics-deploy@an-launcher1002:/srv/deployment/analytics/refinery$ git checkout master
 * 15:59 SandraEbele: Deploying analytics refinery using scap.

2022-08-02

 * 12:54 btullis: sudo systemctl reset-failed on stat1008 to remove failed debmonitor alerts

2022-07-28

 * 20:05 SandraEbele: killing Oozie projectview-hourly and projectview-geo jobs to deploy corresponding jobs on airflow.

2022-07-24

 * 21:10 btullis: swapping disks on archiva1002
 * 20:36 btullis: rebooting archiva1002 to pick up new disk
 * 15:36 btullis: btullis@ganeti1027:~$ sudo gnt-instance modify --disk add:size=200g archiva1002.wikimedia.org

2022-07-22

 * 21:19 ottomata: restarted airflow-scheduler@platform_eng on an-airflow1003 for marco and cormac

2022-07-19

 * 10:05 elukey: reboot an-worker1127 - hdfs datanode caused CPU stalls

2022-07-13

 * 14:19 aqu: Deployed refinery using scap, then deployed onto hdfs (prod + test)
 * 06:16 aqu: analytics/refinery deployment

2022-07-07

 * 13:38 btullis: restart refine_eventlogging_legacy_test.service on an-test-coord1001
 * 09:56 btullis: restarted oozie on an-test-coord1001
 * 09:23 btullis: rebooted dbstore1007
 * 09:21 btullis: rebooted dbstore1005
 * 09:02 btullis: restarting dbstore1003 as per announced maintenance window

2022-07-06

 * 18:09 ottomata: enabling iceberg hive catalog connector on analytics_cluster presto
 * 17:57 ottomata: upgrading presto to 0.273.3 in analytics cluster - T311525
 * 09:50 btullis: roll-restarting hadoop workers on the test cluster.
 * 09:46 btullis: restarting refinery-drop-webrequest-raw-partitions.service on an-test-coord1001
 * 09:44 btullis: restarting refinery-drop-webrequest-refined-partitions.service on an-test-coord1001
 * 09:42 btullis: restarted drop_event.service on an-test-coord1001
 * 09:35 btullis: restarting hive-server2 and hive-metastore on an-test-coord1001

2022-07-05

 * 11:01 btullis: sudo cookbook sre.hadoop.roll-restart-masters test

2022-07-04

 * 16:14 btullis: systemctl restart airflow-scheduler@research.service (on an-airflow1002)
 * 08:04 elukey: kill leftover processes of user `mewoph` on stat100x to allow puppet runs

2022-06-29

 * 17:27 mforns: killed mediawiki-history-load bundle in Hue, and started corresponding mediawiki_history_load DAG in Airflow
 * 13:12 mforns: re-deployed refinery with scap and refinery-deploy-to-hdfs
 * 11:51 btullis: btullis@an-master1001:~$ sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1002-eqiad-wmnet an-master1001-eqiad-wmnet

2022-06-28

 * 20:57 mforns: refinery deploy failed and I rolled back successfully, will try and repeat tomorrow when other people are present :]
 * 20:19 mforns: starting refinery deployment for refinery-source v0.2.2
 * 20:19 mforns: starting refinery deployment
 * 17:25 ottomata: installing presto 0.273.3 on an-test-coord1001 and an-test-presto1001
 * 12:48 milimetric: deploying airflow-dags/analytics to work on the metadata ingestion jobs

2022-06-27

 * 20:33 btullis: systemctl reset-failed jupyter-aarora-singleuser and jupyter-seddon-singleuser on stat1005
 * 20:16 btullis: checking and restarting prometheus-mysqld-exporter on an-coord1001
 * 15:25 btullis: upgraded conda-base-env on an-test-client1001 from 0.0.1 to 0.0.4

2022-06-24

 * 15:14 ottomata: backfilled eventlogging data lost during failed gobblin job - T311263

2022-06-23

 * 13:48 btullis: started the namenode service on an-master1001 after failback failure
 * 13:41 btullis: The failback didn't work again.
 * 13:39 btullis: attempting failback of namenode service from an-master1002 to an-master1001
 * 13:07 btullis: restarted hadoop-hdfs-namenode service on an-master1001
 * 11:25 joal: kill oozie mediawiki-geoeditors-monthly-coord in favor of airflow job
 * 08:52 joal: Deploy airflow

2022-06-22

 * 20:55 aqu: `scap deploy -f analytics/refinery` because of a crash during `git-fat pull`
 * 19:30 aqu: Deploying analytics/refinery

2022-06-21

 * 14:56 aqu: RefineSanitize from an-launcher1002: sudo -u analytics kerberos-run-command analytics spark2-submit --class org.wikimedia.analytics.refinery.job.refine.RefineSanitize --master yarn --deploy-mode client /srv/deployment/analytics/refinery/artifacts/org/wikimedia/analytics/refinery/refinery-job-0.1.15.jar  --config_file         /home/aqu/refine.properties  --since               "2022-06-19T09:52:00+0000"  --until
 * 13:33 aqu: sudo systemctl start monitor_refine_event_sanitized_main_immediate.service on an-launcher1002
 * 10:47 btullis: proceeding with the hadoop.roll-restart-masters cookbook

2022-06-20

 * 07:14 SandraEbele: Started Airflow 3 Wikidata metrics jobs (Articleplaceholder, Reliability and SpecialEntityData metrics).
 * 07:12 SandraEbele: Started Airflow3 Wikidata metrics jobs (Articleplaceholder, Relia)
 * 07:11 SandraEbele: killed Oozie wikidata-articleplaceholder_metrics-coord, wikidata-reliability_metrics-coord, and wikidata-specialentitydata_metrics-coord jobs.

2022-06-17

 * 12:35 SandraEbele: deployed daily airflow dag for 3 Wikidata metrics.
 * 08:36 btullis: power cycled an-worker1109 as it was stuck with CPU soft lockups

2022-06-16

 * 06:49 joal: Rerun webrequest-load-wf-upload-2022-6-15-22 after weird oozie failure

2022-06-15

 * 14:48 btullis: deploying datahub 0.8.38

2022-06-14

 * 10:48 joal: unpause renamed dags
 * 10:44 joal: Deploy Airflow
 * 10:12 btullis: manually failing back hdfs-namenode to an-master1001 after fixing typo
 * 09:36 joal: deploy refinery onto HDFS
 * 08:48 btullis: roll-restarting hadoop masters T310293
 * 08:40 joal: Deploying using scap again after failure cleanup on an-launcher1002
 * 07:45 joal: deploy refinery using scap

2022-06-13

 * 14:00 btullis: restarting presto service on an-coord1001
 * 13:20 btullis: btullis@datahubsearch1001:~$ sudo systemctl reset-failed ifup@ens13.service T273026
 * 13:09 btullis: restarting oozie service on an-coord1001
 * 12:59 btullis: having failed over hive to an-coord1002 10 minutes ago, I'm restarting hive services on an-coord1001
 * 12:26 btullis: restarting hive-server2 and hive-metastore on an-coord1002
 * 09:54 joal: rerun failed refine for network_flows_internal
 * 09:54 joal: Rerun failed refine for mediawiki_talk_page_edit events
 * 09:51 joal: Manually rerun webrequest_text load for hour 2022-06-13T03:00
 * 07:18 joal: Manually rerun webrequest_text load for hour 2022-06-12T08:00

2022-06-10

 * 17:00 ottomata: applied change to airflow instances to bump scheduler parsing_processes = # of cpu processors
 * 08:58 btullis: cookbook sre.hadoop.roll-restart-workers analytics

2022-06-09

 * 17:17 joal: Rerun refine for failed datasets
 * 14:15 btullis: manually failing back HDFS namenode from an-master1002 to an-master1001
 * 13:15 btullis: roll-restarting the hadoop masters to pick up new JRE

2022-06-08

 * 18:06 joal: Restart airflow after deploy for dag reprocessing
 * 18:02 joal: deploying Airflow dags
 * 13:45 btullis: deploying refinery

2022-06-07

 * 13:45 btullis: deploying updated eventgate images to all remaining deployments.
 * 11:33 btullis: deployed an updated version of eventgate to eventgate-analytics-external to address the timing mis-calculation.
 * 10:51 btullis: restart the eventlogging_to_druid_netflow-sanitization_daily service on an-launcher1002

2022-06-06

 * 13:45 btullis: restarting archiva service for new JRE
 * 06:31 elukey: restart memcached on an-tool1005 to pick up puppet settings and clear an alert in icinga

2022-06-05

 * 03:14 milimetric: rerunning mw history since the last failure just looked like a fluke

2022-06-04

 * 11:41 joal: Manually launch refinery-sqoop-mediawiki-production after manual fix of refinery-sqoop-mediawiki
 * 11:39 joal: Manually sqoop enwiki:user and commonswiki:user and add _SUCCESS flag for following job to kick off

2022-06-02

 * 15:50 mforns: deployed wikistats 2.9.5
 * 14:02 joal: Start browser_general_daily on airflow
 * 13:19 joal: Drop and recreate wmf_raw.mediawiki_page table (field removal)
 * 12:44 joal: Remove wrongly formatted interlanguage data
 * 12:36 joal: Kill interlanguage-daily oozie job after successful move to airflow
 * 12:15 joal: Deploy interlanguage fix to airflow
 * 09:56 joal: Relaunch sqoop after having deployed a corrective patch
 * 09:46 joal: Manually mark interlanguage historical tasks failed in airflow
 * 08:54 joal: Deploy airflow with spark3 jobs
 * 08:47 joal: Merging 2 airflow spark3 jobs now that their refinery counterpart is deployed
 * 08:07 joal: Deploy refinery onto HDFS
 * 07:26 joal: Deploy refinery using scap

2022-06-01

 * 21:04 milimetric: trying to rerun sqoop from a screen on an-launcher
 * 20:09 SandraEbele: Successfully deployed refinery using scap, then deployed onto hdfs.
 * 18:51 SandraEbele: About to deploy analytics/refinery (regular weekly train)
 * 08:39 elukey: powercycle an-worker1094 - OEM event registered in `racadm getsel`, host frozen

2022-05-31

 * 18:48 ottomata: sudo -u hdfs hdfs dfsadmin -safemode leave on an-master1001
 * 18:12 ottomata: sudo service hadoop-hdfs-namenode start on an-master1002
 * 18:10 ottomata: sudo -u hdfs hdfs dfsadmin -safemode enter
 * 17:47 btullis: starting namenode services on an-master1001
 * 17:44 btullis: restarting the datanodes on all five of the affected hadoop workers.
 * 17:43 btullis: restarting journalnode service on each of the five hadoop workers with journals.
 * 17:41 btullis: resizing each journalnode with resize2fs
 * 17:38 btullis: sudo lvresize -L+20G analytics1069-vg/journalnode
 * 17:38 btullis: increasing each of the hadoop journalnodes by 20 GB
 * 17:33 ottomata: stop journalnodes and datanodes on 5 hadoop journalnode hosts
 * 17:30 btullis: stopped the hdfs-namenode service on an-master100[1-2]
 * 15:36 milimetric: dropped razzi databases and deleted HDFS directories (in trash)
 * 06:26 elukey: `elukey@an-master1001:~$ sudo systemctl reset-failed hadoop-clean-fairscheduler-event-logs.service`

2022-05-30

 * 20:19 SandraEbele: Restarted oozie job pageview-druid-daily-coord
 * 11:28 joal: deploy airflow spark3 aqs_hourly

2022-05-25

 * 21:09 joal: Resume aqs_hourly job in airflow test
 * 20:33 joal: Pausing aqs_hourly job in airflow test until we fix the spark3 issue
 * 06:20 elukey: `elukey@an-tool1011:~$ sudo systemctl reset-failed ifup@ens13.service` - T273026

2022-05-24

 * 19:54 SandraEbele: Deployed refinery using scap, then deployed onto hdfs successfully.
 * 18:34 SandraEbele: Deploying refinery, regular weekly deployment
 * 13:18 joal: Release refinery-source v0.2.0 to archiva
 * 10:21 btullis: restarted hadoop-yarn-nodemanager on an-worker1139

2022-05-23

 * 18:27 mforns: killed mobile_apps-session_metrics-coord (Airflow job is taking over)

2022-05-21

 * 15:52 joal: Kill yarn app application_1651744501826_83884 in order to prevent the HDFS alerts

2022-05-19

 * 16:59 ottomata: deploying airflow-dags analytics with new artifact names, first clearing artifacts cache dir - T307115

2022-05-18

 * 10:57 btullis: upgrading datahub to version 0.8.34

2022-05-17

 * 21:32 razzi: sudo systemctl reset-failed ifup@ens13.service on an-tool1007
 * 08:54 btullis: booted an-tool1007 from network to begin buster upgrade

2022-05-12

 * 14:49 razzi: undo the 2 previous confctl changes to repool dbproxy1019 to wikireplicas-b only
 * 14:35 razzi: razzi@cumin1001:~$ sudo confctl select service=wikireplicas-a,name=dbproxy1019.eqiad.wmnet set/pooled=yes # for T298940

2022-05-11

 * 18:20 razzi: disregard the above log; wrote out the command but then saw there was a warning for cr2-eqiad
 * 18:15 razzi: razzi@lvs1019:~$ systemctl stop pybal.service to apply change https://gerrit.wikimedia.org/r/c/operations/puppet/+/779915
 * 18:06 razzi: razzi@lvs1020:~$ systemctl stop pybal.service to apply change https://gerrit.wikimedia.org/r/c/operations/puppet/+/779915
 * 13:29 mforns: restarted oozie jobs after deployment: mediarequest_top_files, pageview_top_articles, unique_devices_per_domain_monthly, unique_devices_per_project_family_monthly

2022-05-10

 * 20:32 mforns: finished refinery deploy (regular weekly train)
 * 19:34 mforns: starting refinery deploy (regular weekly train)

2022-05-09

 * 15:06 SandraEbele: killed ‘apis-coord’ oozie job and started corresponding airflow job ‘apis_metrics_to_graphite’

2022-05-06

 * 09:11 joal: kill cassandra-monthly-wf-local_group_default_T_mediarequest_top_files-2022-4 again
 * 08:44 joal: Rerun cassandra-monthly-wf-local_group_default_T_mediarequest_top_files-2022-4 with SRE watching network
 * 08:29 joal: kill cassandra-monthly-wf-local_group_default_T_mediarequest_top_files-2022-4 as it was probably saturating network

2022-05-05

 * 18:53 btullis: restarting airflow-scheduler@platform_eng.service on an-airflow1003
 * 18:53 btullis: restarted airflow-scheduler@research.service on an-airflow1002
 * 18:49 btullis: restarting airflow-scheduler@analytics service on an-launcher1002
 * 12:26 aqu: Regular analytics weekly train [analytics/refinery@cc4b2bd]
 * 09:53 btullis: roll-restarting hadoop masters to pick up new heap size
 * 09:16 btullis: re-enabling gobblin jobs now
 * 09:15 btullis: restarting failed eventlogging_to_druid_ services on an-launcher1002
 * 09:00 btullis: restarting an-coord1001
 * 08:53 btullis: stopping oozie on an-coord1001

2022-05-04

 * 08:47 btullis: rebooting an-coord1002 to pick up new kernel

2022-05-03

 * 18:24 razzi: remove /etc/apache2/sites-available/50-superset-wikimedia-org.conf from an-tool1005 (superset staging) since it was removed from puppet but has no ensure: absent

2022-04-27

 * 19:37 ottomata: restarting airflow services on all airflow instances after installing updated airflow debian package

2022-04-26

 * 19:02 aqu: About to deploy analytics/refinery: Weekly deployment train + Artifacts to 0.1.27
 * 12:02 joal: Rerun cassandra-daily-wf-local_group_default_T_mediarequest_per_file-2022-4-23

2022-04-25

 * 20:09 ottomata: dropping event.ios_notification_interaction hive table and data for backwards incompatible schema change in T290920
 * 11:51 btullis: failing back hdfs active role to an-master1001
 * 11:49 btullis: restarted hadoop-yarn-resourcemanager on an-master1002 to force the active role back to an-master1001
 * 11:01 btullis: rebooting an-master1001
 * 10:25 btullis: restarting the `check_webrequest_partitions` service on an-launcher1002
 * 09:39 btullis: failover to an-master1002 successful at 3rd attempt
 * 09:30 btullis: 2nd attempt to switch HDFS services to an-master1002
 * 09:13 btullis: switching HDFS services to an-master1002
 * 08:53 btullis: rebooting an-master1002 - T304938

2022-04-23

 * 09:38 elukey: `apt-get clean` on an-airflow1001 to free some space

2022-04-21

 * 22:26 mforns: killed browser_general oozie job and started corresponding airflow job

2022-04-13

 * 16:40 razzi: reboot an-launcher1002 for security updates

2022-04-12

 * 22:12 milimetric: deployed and synced refinery-source 0.1.26 to hdfs

2022-04-11

 * 12:35 aqu: About to deploy analytics/refinery "Migrate mediarequest hourly from Oozie to Airflow" (replace previous msg)
 * 12:35 aqu: About to deploy refinery/source "Migrate mediarequest hourly from Oozie to Airflow"

2022-04-06

 * 20:53 razzi: roll restart aqs to deploy new mediawiki history snapshot
 * 15:51 mforns: deployed airflow to analytics (big refactor)
 * 15:23 mforns: deployed Airflow to analytics_test (big refactor)
 * 09:18 btullis: restarted eventlogging_to_druid_netflow_hourly on an-launcher1002

2022-04-05

 * 20:41 razzi: deploying refinery for https://gerrit.wikimedia.org/r/c/analytics/refinery/+/776269/
 * 15:54 razzi: razzi@cumin1001:~$ sudo cookbook sre.hosts.reimage --os bullseye -t T299481 dbstore1005
 * 15:10 razzi: razzi@cumin1001:~$ sudo cookbook sre.hosts.reimage --os bullseye -t T299481 dbstore1003
 * 15:02 razzi: set dbstore1003.eqiad.wmnet to downtime for upgrade T299481
 * 15:01 razzi: set dbstore1003.eqiad.wmnet to downtime for upgrade

2022-04-01

 * 09:05 btullis: restarted varnishkafka-eventlogging.service on cp3050 T300246

2022-03-29

 * 20:08 joal: rerun cassandra editors_bycountry_monthly for month 2022-02
 * 20:08 mforns: restarted webrequest bundle
 * 19:57 mforns: restarted mediawiki-geoeditors-public_monthly-coord
 * 19:56 mforns: finished refinery deployment (regular weekly train) scap and hdfs
 * 19:53 joal: Add new columns to wmf.webrequest (high entropy CH-UA)
 * 19:16 joal: Drop/recreate wmf_raw.webrequest for schema change (high-entropy CH-UA)
 * 19:13 mforns: starting refinery deployment (regular weekly train)
 * 19:11 joal: kill webrequest-load oozie bundle for webrequest schema change
 * 17:13 razzi: razzi@cumin1001:~$ sudo cookbook sre.hosts.downtime an-tool1005.eqiad.wmnet -D 1 -r 'Testing deploy of superset 1.4.2 to staging'
 * 15:38 ntsako: Stopped geoeditor Airflow DAGs to check on data quality
 * 14:13 btullis: correction: restarted hadoop-yarn-nodemanager.service on an-worker1128
 * 14:13 btullis: restarted hadoop-yarn-nodemanager.service on an-worker1238

2022-03-24

 * 11:15 btullis: roll-restarting kafka-jumbo brokers T300626

2022-03-21

 * 18:10 razzi: sudo systemctl restart jupyter-bearloga-singleuser on stat1008

2022-03-17

 * 17:10 ottomata: restart webrequest and pageview_actor data purge - https://gerrit.wikimedia.org/r/c/operations/puppet/+/771389
 * 14:07 btullis: shutdown analytics1063 and analytics1067 with 120 minutes of downtime T303151
 * 06:46 elukey: kill remaining hanging processes for ppche*lko and accra*ze on an-test-client1001 to allow users offboard (puppet broken)

2022-03-16

 * 19:14 ottomata: deploying refinery to hadoop-test cluster with new gobblin-wmf-core jar
 * 18:00 razzi: sudo cookbook sre.hosts.downtime -D 3 -r 'Setting up karapace for the first time' karapace1001.eqiad.wmnet
 * 17:57 btullis: restarted mediawiki-history-drop-snapshot service on an-launcher1002
 * 16:03 aqu: analytics/refinery - scap deploy "Migrate session_length/daily from Oozie to Airflow"
 * 10:26 btullis: rerunning failed mediawiki_structured_task_article_link_suggestion_interaction refine job

2022-03-15

 * 22:16 razzi: upload karapace_2.1.3-py3.7-1_amd64.deb to apt.wikimedia.org
 * 19:58 razzi: upload karapace_2.1.3-py3.7-0_amd64.deb to apt.wikimedia.org
 * 17:24 ottomata: also change stats uid and gid to 918 on an-web1001 - T291384
 * 14:35 ottomata: change stats uid and gid on all stat boxes to 918 - T291384
 * 13:59 ottomata: roll restarting kafka jumbo brokers to set max.incremental.fetch.session.cache.slots=2000 - T303324

2022-03-14

 * 21:05 razzi: `sudo kill -9 15674` to stop unresponsive hive query

2022-03-09

 * 21:05 ottomata: fix group ownership of cchen.db/new_editors/cohort=2021-12 after reverting T291664 - sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chgrp -R analytics-privatedata-users /user/hive/warehouse/cchen.db/new_editors/cohort=2021-12
 * 18:33 ottomata: fix group ownership of wmf_product.db/new_editors/cohort=2021-12 after reverting T291664 - sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chgrp -R analytics-privatedata-users /user/hive/warehouse/wmf_product.db/new_editors/cohort=2021-12
 * 18:32 ottomata: fix group ownership of wmf_product.db/global_markets_pageviews/year=2022/month=2 after reverting T291664 - sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chgrp -R analytics-privatedata-users /user/hive/warehouse/wmf_product.db/global_markets_pageviews/year=2022/month=2
 * 18:19 btullis: btullis@ganeti1024:~$ sudo gnt-instance start karapace1001.eqiad.wmnet (T301562)
 * 16:16 ottomata: fix group ownership of wmf_product.db/pageviews_corrected/year=2022/month=2 after reverting T291664 - sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chgrp -R analytics-privatedata-users /user/hive/warehouse/wmf_product.db/pageviews_corrected/year=2022/month=2

2022-03-08

 * 13:31 ottomata: restarted webrequest-load oozie bundle as 0073173-220113112502223-oozie-oozi-B starting at 2022-03-08T12:00Z
 * 13:09 ottomata: killing and rerunning webrequest-load-text-wf for webrequest_source=text/year=2022/month=3/day=7/hour=17, it was stuck in add_partition task as SUSPENDED, not sure why.
 * 12:47 btullis: roll-restarting druid-analytics T300626
 * 12:08 btullis: roll-restarting druid-public. T300626
 * 11:21 btullis: roll-restarting druid-test T300626
 * 11:00 btullis: roll-restarting aqs T300626
 * 10:57 btullis: restarted archiva T300626

2022-03-07

 * 19:14 ottomata: sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chgrp -R analytics-privatedata-users /wmf/data/wmf/*/hourly/year=2022/month=3/day=7 to make sure perms are fixed after revert of T291664
 * 19:13 ottomata: sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chgrp -R analytics-privatedata-users /wmf/data/wmf/virtualpageview/hourly/year=2022/month=3/day=7 - revert of T291664
 * 18:45 ottomata: sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chgrp -R analytics-privatedata-users /wmf/data/wmf/mediacounts/year=2022/month=3/day=7
 * 18:37 ottomata: sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chgrp -R analytics-privatedata-users /wmf/data/wmf/webrequest/webrequest_source=text/year=2022/month=3/day=7 - after reverting - T291664
 * 18:34 ottomata: restarting hive-server2 on an-coord1001 to revert hive.warehouse.subdir.inherit.perms change - T291664
 * 14:44 btullis: failing back hive services to an-coord1001
 * 13:09 aqu_: About to deploy analytics/refinery - Migrate wikidata/item_page_link/weekly from Oozie to Airflow
 * 12:45 aqu_: About to deploy airflow-dags/analytics - Migrates wikidata/item_page_link
 * 12:10 btullis: restarted hive-server2 process on an-coord1001
 * 11:52 btullis: obtaining heap dump: `hive@an-coord1001:/srv/hive-tmp$ jmap -dump:format=b,file=hive_server2_heap_T303168.bin 16971`
 * 11:51 btullis: obtaining summary of heap objects and sizes: `hive@an-coord1001:/srv/hive-tmp$ jmap -histo:live 16971 > hive-object-storage-and-sizes.T303168.txt`
 * 11:38 btullis: failing over hive to an-coord1001 T303168

2022-03-05

 * 10:03 elukey: restart hadoop-yarn-nodemanager on an-worker1132 (unhealthy node, reason Linux Container Executor reached unrecoverable exception)

2022-03-04

 * 17:46 mforns: deployed Airflow to analytics instance to fix skein logs problem
 * 15:50 mforns: deployed airflow in an-test-client1001 to test skein log fix
 * 05:19 milimetric: rerunning monthly edit hourly druid oozie coordinator

2022-03-03

 * 17:48 ottomata: roll restart aqs to pick up new MW history snapshot

2022-03-01

 * 18:38 SandraEbele: sandra testing
 * 18:34 razzi: demo irc logging to data eng team members
 * 10:19 btullis: btullis@an-coord1002:/srv$ sudo rm -rf an-coord1001-backup/ (#T302777)
 * 09:48 elukey: elukey@stat1004:~$ sudo kill `pgrep -u zpapierski` (offboarded user, puppet broken on the host)

2022-02-28

 * 16:00 milimetric: refinery done deploying and syncing, new sqoop list is up
 * 15:01 milimetric: deploying new wikis to sqoop list ahead of sqoop job starting in a few hours

2022-02-25

 * 17:00 milimetric: rerunning webrequest-load-wf-text-2022-2-25-15 after confirming all false positive loss

2022-02-23

 * 23:00 razzi: sudo maintain-views --table flaggedrevs --databases fiwiki on clouddb1014.eqiad.wmnet and clouddb1018.eqiad.wmnet for T302233

2022-02-22

 * 10:37 btullis: re-enabled puppet on an-launcher1002, having absented the network_internal druid load job
 * 09:30 aqu: Deploying analytics/refinery on hadoop-test only.
 * 07:38 elukey: systemctl reset-failed mediawiki-history-drop-snapshot on an-launcher1002 (opened since a week ago)
 * 07:30 elukey: kill remaining processes of rhuang-ctr on stat1004 and an-test-client1001 (user offboarded, but still holding jupyter notebooks etc..). Puppet was broken trying to remove the user.

2022-02-21

 * 17:55 elukey: kill remaining processes of rhuang-ctr on various stat nodes (user offboarded, but still holding jupyter notebooks etc..). Puppet was broken trying to remove the user.
 * 16:58 mforns: Deployed refinery using scap, then deployed onto hdfs (aqs hourly airflow queries)

2022-02-19

 * 12:21 elukey: stop puppet on an-launcher1002, stop timers for eventlogging_to_druid_network_flows_internal_ { hourly,daily } since no data is coming to the Kafka topic (expected due to some work for the Marseille DC) and it keeps alarming

2022-02-17

 * 16:18 mforns: deployed wikistats2

2022-02-16

 * 14:13 mforns: deployed airflow-dags to analytics instance

2022-02-15

 * 17:20 ottomata: split anaconda-wmf into 2 packages: anaconda-wmf-base and anaconda-wmf. anaconda-wmf-base is installed on workers, anaconda-wmf on clients. The size of the package on workers is now much smaller. Installing throughout the cluster. relevant: T292699

2022-02-14

 * 17:38 razzi: razzi@an-test-client1001:~$ sudo systemctl reset-failed airflow-scheduler@analytics-test.service
 * 16:08 razzi: sudo cookbook sre.ganeti.makevm --vcpus 4 --memory 8 --disk 50 eqiad_B datahubsearch1002 for T301383

2022-02-12

 * 08:50 elukey: truncate /var/log/auth.log to 1g on krb1001 to free space on root partition (original log saved under /srv)

2022-02-11

 * 15:06 ottomata: set hive.warehouse.subdir.inherit.perms = false - T291664

2022-02-10

 * 18:54 ottomata: setting up research airflow-dags scap deployment, recreating airflow database and starting from scratch (fab okayed this) - T295380
 * 16:48 ottomata: deploying airflow analytics with lots of recent changes to airflow-dags repository

2022-02-09

 * 17:41 joal: Deploy refinery onto HDFS
 * 17:05 joal: Deploying refinery with scap
 * 16:39 joal: Release refinery-source v0.1.25 to archiva

2022-02-08

 * 07:27 elukey: restart hadoop-yarn-nodemanager on an-worker1115 (container executor reached unrecoverable exception, doesn't talk with the Yarn RM anymore)

2022-02-07

 * 18:43 ottomata: manually installing airflow_2.1.4-py3.7-2_amd64.deb on an-test-client1001
 * 14:38 ottomata: merged Set spark maxPartitionBytes to hadoop dfs block size - T300299
 * 12:17 btullis: depooled aqs1009
 * 11:59 btullis: depooled aqs1008
 * 11:41 btullis: depooled aqs1007
 * 11:03 btullis: depooled aqs1006
 * 10:22 btullis: depooling aqs1005

2022-02-04

 * 16:05 elukey: unmask prometheus-mysqld-exporter.service and clean up the old @analytics + wmf_auto_restart units (service+timer) not used anymore on an-coord100[12]
 * 12:55 joal: Rerun cassandra-daily-wf-local_group_default_T_pageviews_per_article_flat-2022-2-3
 * 07:12 elukey: `GRANT PROCESS, REPLICATION CLIENT ON *.* TO `prometheus`@`localhost` IDENTIFIED VIA unix_socket WITH MAX_USER_CONNECTIONS 5` on an-test-coord1001 to allow the prometheus exporter to gather metrics
 * 07:09 elukey: cleanup wmf_auto_restart_prometheus-mysqld-exporter@analytics-meta on an-test-coord1001 and unmasked wmf_auto_restart_prometheus-mysqld-exporter (now used)
 * 07:03 elukey: clean up wmf_auto_restart_prometheus-mysqld-exporter@matomo on matomo1002 (not used anymore, listed as failed)

2022-02-03

 * 19:35 joal: Rerun virtualpageview-druid-monthly-wf-2022-1
 * 19:32 btullis: re-running the failed refine_event job as per email.
 * 19:27 joal: Rerun virtualpageview-druid-daily-wf-2022-1-16
 * 19:12 joal: Kill druid indexation stuck task on Druid (from 2022-01-17T02:31)
 * 19:09 joal: Kill druid-loading stuck yarn applications (3 HiveToDruid, 2 oozie launchers)
 * 10:04 btullis: pooling the remaining aqs_next nodes.
 * 07:01 elukey: kill leftover processes of decommed user on an-test-client1001

2022-02-01

 * 20:05 btullis: btullis@an-launcher1002:~$ sudo systemctl restart refinery-sqoop-whole-mediawiki.service
 * 19:01 joal: Deploying refinery with scap
 * 18:36 joal: Rerun virtualpageview-druid-daily-wf-2022-1-16
 * 18:34 joal: rerun webrequest-druid-hourly-wf-2022-2-1-12
 * 17:43 btullis: btullis@an-launcher1002:~$ sudo systemctl start refinery-sqoop-whole-mediawiki.service
 * 17:29 btullis: about to deploy analytics/refinery
 * 12:28 elukey: kill processes related to offboarded user on stat1006 to unblock puppet
 * 11:09 btullis: btullis@an-test-coord1001:~$ sudo apt-get -f install

2022-01-31

 * 14:51 btullis: btullis@an-launcher1002:~$ sudo systemctl start mediawiki-history-drop-snapshot.service
 * 14:03 btullis: btullis@an-launcher1002:~$ sudo systemctl start mediawiki-history-drop-snapshot.service

2022-01-27

 * 08:15 joal: Rerun failed cassandra-daily-wf-local_group_default_T_pageviews_per_article_flat-2022-1-26

2022-01-26

 * 15:54 joal: Add new CH-UA fields to wmf_raw.webrequest and wmf.webrequest
 * 15:44 joal: Kill-restart webrequest oozie job after deploy
 * 15:40 joal: Kill-restart edit-hourly oozie job after deploy
 * 15:27 joal: Deploy refinery to HDFS
 * 15:10 elukey: elukey@cp4036:~$ sudo systemctl restart varnishkafka-eventlogging
 * 15:10 elukey: elukey@cp4036:~$ sudo systemctl restart varnishkafka-statsv
 * 15:06 elukey: elukey@cp4035:~$ sudo systemctl restart varnishkafka-eventlogging.service - metrics showing messages stuck for a poll
 * 14:56 elukey: elukey@cp4035:~$ sudo systemctl restart varnishkafka-webrequest.service - metrics showing messages stuck for a poll
 * 14:52 joal: Deploy refinery with scap
 * 10:07 btullis: btullis@cumin1001:~$ sudo cumin 'O:cache::upload or O:cache::text' 'disable-puppet btullis-T296064-T299401'

2022-01-25

 * 19:46 ottomata: removing hdfs druid deep storage from test cluster
 * 19:37 ottomata: resetting test cluster druid via druid reset-cluster https://druid.apache.org/docs/latest/operations/reset-cluster.html - T299930
 * 14:30 ottomata: stopping services on an-test-coord1001 - T299930
 * 14:29 ottomata: stopping druid* on an-test-druid1001 - T299930
 * 11:30 btullis: pooled aqs1011 T298516
 * 11:29 btullis: btullis@puppetmaster1001:~$ sudo -i confctl select name=aqs1011.eqiad.wmnet set/pooled=yes

2022-01-24

 * 21:18 btullis: btullis@deploy1002:/srv/deployment/analytics/refinery$ scap deploy -e hadoop-test -l an-test-coord1001.eqiad.wmnet
 * 20:35 btullis: rebooting an-test-coord1001 after recreating the /srv/file system.
 * 20:28 btullis: root@an-test-coord1001:~# mke2fs -t ext4 -j -m 0.5 /dev/vg0/srv
 * 19:53 btullis: power cycled an-test-coord1001 from racadm
 * 19:50 btullis: rebooting an-test-coord1001
 * 19:19 ottomata: kill mysqld on an-test-coord1001 - 19:19:04 [@an-test-coord1001:/etc] $ sudo kill 42433
 * 19:02 razzi: razzi@an-test-coord1001:~$ sudo systemctl stop presto-server
 * 18:23 razzi: downtime an-test-coord1001 while attempting to fix /srv partition
 * 11:48 elukey: roll restart of kafka test brokers to pick up the new keystore/tls-certs (1y of validity)

2022-01-22

 * 08:36 elukey: `apt-get clean` on an-test-coord1001 to free some space

2022-01-21

 * 01:03 milimetric: rerunning the eventlogging_to_druid_network_flows_internal-sanitization_daily timer that failed to get logs

2022-01-20

 * 11:58 btullis: re-enabled puppet on all hive nodes, deploying the updated log4j configuration for parquet
 * 11:36 btullis: temporarily disabling puppet on servers with hive installed T297734
 * 07:49 joal: Rerun failed webrequest jobs (text and upload, 2022-01-19T19:00)

2022-01-19

 * 15:44 ottomata: installing anaconda-wmf_2020.02~wmf6_amd64.deb on all analytics cluster nodes. -  T292699
 * 14:00 ottomata: installing anaconda-wmf_2020.02~wmf6_amd64.deb on stat1004 - T292699

2022-01-17

 * 07:19 elukey: launch webrequest bundle from 2022-01-16T01:00 (first hour missing for text) - 0003712-220113112502223-oozie-oozi-B
 * 07:17 elukey: kill webrequest bundle, text coordinator failed (logs/info/etc.. https://hue.wikimedia.org/hue/jobbrowser/#!id=0024621-210701181527401-oozie-oozi-B)
 * 07:13 elukey: umount/mount /mnt/hdfs on an-coord1001 to pick up java upgrades

2022-01-16

 * 16:43 elukey: `elukey@an-launcher1002:~$ sudo systemctl reset-failed eventlogging_to_druid_network_internal_flows-sanitization_daily.service eventlogging_to_druid_network_internal_flows_daily.service eventlogging_to_druid_network_internal_flows_hourly.service`

2022-01-13

 * 12:41 joal: rerun failed instances of webrequest-load-coord
 * 11:59 btullis: stopped eventlogging service on eventlog1003 with 1 hour's downtime.
 * 11:52 btullis: Upgrading hive packages on stat1005
 * 11:26 btullis: restarted hive-metastore and hive-server2 on an-coord1001 after running puppet.
 * 11:23 btullis: btullis@an-coord1001:~$ sudo apt install hive hive-hcatalog hive-jdbc hive-metastore hive-server2 oozie oozie-client
 * 11:18 btullis: btullis@an-coord1002:~$ sudo systemctl restart hive-metastore hive-server2
 * 09:53 btullis: DNS change deployed, failing over hive to an-coord1002.
 * 09:42 btullis: btullis@an-coord1002:~$ sudo apt install hive hive-hcatalog hive-jdbc hive-metastore hive-server2 oozie-client
 * 08:45 joal: Kill-restart wikidata-json_entity-weekly-coord after deploy

2022-01-12

 * 21:13 joal: Deploying refinery to HDFS
 * 20:46 joal: Deploying refinery with scap
 * 20:35 joal: refinery-source v0.1.24 released on archiva
 * 11:25 elukey: move kafka-jumbo nodes to fixed kafka uid/gid
 * 07:46 elukey: `systemctl reset-failed product-analytics-movement-metrics.service` on stat1007

2022-01-10

 * 13:56 btullis: Upgrading oozie packages on an-test-coord1001 to test new log4j versions

2022-01-08

 * 10:51 elukey: start hive-server2 on an-coord1002 - failed to connect to the metastore due to restart
 * 10:41 elukey: restart hive daemons on an-coord1002 (after my last upgrade/rollback of packages the prometheus agent settings were not picked up, so no metrics)

2022-01-07

 * 20:16 ottomata: altering hive table MobileWikiAppiOSUserHistory field event.device_level_enabled to string - T298721
 * 17:29 btullis: deployed updated hive packages to an-test-worker100[1-3] and an-test-ui1001
 * 14:52 btullis: root@aqs1014:~# jmap -dump:live,format=b,file=/srv/cassandra-b/tmp/aqs1014-b-dump202201071450.hprof 4468

2022-01-06

 * 18:02 btullis: btullis@aqs1010:~$ sudo systemctl restart cassandra-a.service
 * 12:22 btullis: restarting cassandra-a service on aqs1004.eqiad.wmnet in order to troubleshoot logging.
 * 11:24 btullis: restarting cassandra-a service on aqs1010.eqiad.wmnet in order to troubleshoot logging.
 * 08:12 joal: Rerun failed webrequest-load-wf-text-2022-1-6-7
 * 07:58 joal: Rerun refine_event_sanitized_analytics_immediate missing hours after errors from the past days
 * 07:39 joal: Rerun failed refine_eventlogging_analytics for mobilewikiappiosuserhistory schema, hours 2022-01-05T2[123]:00:00 and 2022-01-06T00:00:00, dropping malformed rows as discussed with schema owner

2022-01-05

 * 19:16 joal: Rerun failed refine_eventlogging_analytics for mobilewikiappiosuserhistory schema, hours 2022-01-04T1[5789]:00:00, dropping malformed rows as discussed with schema owner
 * 11:37 btullis: Upgrading hive on an-test-client1001 in order to test log4j upgrade
 * 11:35 btullis: Upgrading hive packages on an-test-coord1001 to test log4j changes.

2022-01-04

 * 10:39 elukey: restart cassandra-a on aqs1010 (heap size used in full, high GC)
 * 10:20 elukey: restart cassandra-a on aqs1015 (heap size used in full, high GC)

2022-01-03

 * 18:26 joal: rerun cassandra-daily-wf-local_group_default_T_mediarequest_per_file-2022-1-1
 * 16:08 joal: Kill cassandra3-local_group_default_T_mediarequest_per_file-daily-2022-1-1
 * 11:26 elukey: restart cassandra-b on aqs1015 (instance not responding, probably thrashing)
 * 11:16 elukey: restart cassandra-b on aqs1010 (stuck thrashing)
 * 10:34 elukey: depool aqs1010 (`sudo -i depool` on the node) to allow investigation of the cassandra-b instance
 * 10:22 elukey: powercycle an-worker1114 (CPU soft lockup errors in mgmt console)
 * 10:20 elukey: powercycle an-worker1120 (CPU soft lockup errors in mgmt console)