Analytics/Server Admin Log

2018-03-06

 * 19:06 elukey: cleaned up id=0 rows on db1108 (log database) for T188991
 * 10:19 elukey: restart webrequest-load-wf-upload-2018-3-6-7 (failed due to reboots)
 * 10:08 elukey: re-starting mysql consumers on eventlog1001
 * 09:41 elukey: stop eventlogging's mysql consumers for db1107 (el master) kernel updates
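
Note: the workflow names in this log (for example webrequest-load-wf-upload-2018-3-6-7) appear to end in year-month-day-hour fields, sometimes preceded by a source such as upload or text. A minimal Python sketch for splitting such a name, assuming that convention holds. The helper name parse_workflow_name and the regular expression are hypothetical and not part of refinery or Oozie:

```python
import re
from datetime import datetime

# Assumed layout: <job>-wf-[<source>-]<year>-<month>[-<day>[-<hour>]]
_WF_RE = re.compile(
    r"^(?P<job>.+?)-wf-"
    r"(?:(?P<source>[A-Za-z_][A-Za-z0-9_]*)-)?"
    r"(?P<year>\d{4})-(?P<month>\d{1,2})"
    r"(?:-(?P<day>\d{1,2}))?(?:-(?P<hour>\d{1,2}))?$"
)


def parse_workflow_name(name):
    """Split a workflow name into job, optional source, and start time."""
    m = _WF_RE.match(name)
    if m is None:
        raise ValueError("unrecognised workflow name: %r" % name)
    # Monthly and daily jobs omit trailing fields; default to day 1, hour 0.
    start = datetime(
        int(m.group("year")),
        int(m.group("month")),
        int(m.group("day") or 1),
        int(m.group("hour") or 0),
    )
    return {"job": m.group("job"), "source": m.group("source"), "start": start}
```

With this sketch, parse_workflow_name("webrequest-load-wf-upload-2018-3-6-7") yields the job webrequest-load, the source upload, and a start time of 2018-03-06 07:00.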

2018-03-05

 * 18:22 elukey: restart webrequest-load-wf-upload-2018-3-5-16 via Hue (failed due to reboots)
 * 18:21 elukey: restart webrequest-load-wf-text-2018-3-5-16 via Hue (failed due to reboots)
 * 15:00 mforns: rerun mediacounts-load-wf-2018-3-5-9
 * 10:57 joal: Relaunch Mediawiki-history job manually from spark2 to see if the new version helps
 * 10:57 joal: Killing failing Mediawiki-History job for 2018-03

2018-03-02

 * 15:33 mforns: rerun webrequest-load-wf-text-2018-3-2-12

2018-03-01

 * 14:59 elukey: shutdown deployment-eventlog02 in favor of deployment-eventlog05 in deployment-prep (Ubuntu -> Debian EL migration)
 * 09:45 elukey: rerun webrequest-load-wf-text-2018-3-1-6 manually, failed due to analytics1030's reboot

2018-02-28

 * 22:09 milimetric: re-deployed refinery for a small docs fix in the sqoop script
 * 17:55 milimetric: Refinery synced to HDFS, deploy completed
 * 17:40 milimetric: deploying Refinery
 * 08:38 joal: rerun cassandra-hourly-wf-local_group_default_T_pageviews_per_project_v2-2018-2-27-15

2018-02-27

 * 19:12 ottomata: updating spark2-* CLIs to spark 2.2.1: T185581

2018-02-21

 * 20:48 ottomata: now running 2 camus webrequest jobs, one consuming from jumbo (no data yet), the other from analytics. these should be fine to run in parallel.
 * 07:21 elukey: reboot db1108 (analytics-slave.eqiad.wmnet) for mariadb+kernel updates

2018-02-19

 * 17:14 elukey: deployed eventlogging - https://gerrit.wikimedia.org/r/#/c/405687/
 * 07:35 elukey: re-run wikidata-specialentitydata_metrics-wf-2018-2-17 via Hue

2018-02-16

 * 15:41 elukey: add analytics1057 back in the Hadoop worker pool after disk swap
 * 10:55 elukey: increased topic partitions for netflow to 3

2018-02-15

 * 13:54 milimetric: deployment of refinery and refinery-source done
 * 12:52 joal: Killing webrequest-load bundle (next restart should be at hour 12:00)
 * 08:18 elukey: removed jmxtrans and java 7 from analytics1003 and re-launched refinery-drop-mediawiki-snapshots
 * 07:51 elukey: removed default-java packages from analytics1003 and re-launched refinery-drop-mediawiki-snapshots

2018-02-14

 * 13:44 elukey: rollback java 8 upgrade for archiva - issues with Analytics builds
 * 13:35 elukey: installed openjdk-8 on meitnerium, manually upgraded java-update-alternatives to java8, restarted archiva
 * 13:14 elukey: removed java 7 packages from analytics100[12]
 * 12:43 elukey: jmxtrans removed from all the Hadoop workers
 * 12:43 elukey: openjdk-7-* packages removed from all the Hadoop workers

2018-02-13

 * 11:42 elukey: force kill of yarn nodemanager + other containers on analytics1057 (node failed, unit masked, processes still around)

2018-02-12

 * 23:16 elukey: re-run webrequest-load-wf-upload-2018-2-12-21 via Hue (node managers failure)
 * 23:13 elukey: manual restart of Yarn Node Managers on analytics1058/31
 * 23:09 elukey: cleaned up tmp files on all analytics hadoop worker nodes, job filling up tmp
 * 17:18 elukey: home dirs on stat1004 moved to /srv/home (/home symlinks to it)
 * 17:15 ottomata: restarting eventlogging-processors to blacklist Print schema in eventlogging-valid-mixed (MySQL)
 * 14:46 ottomata: deploying eventlogging for T186833 with EventCapsule in code and IP NO_DB_PROPERTIES

2018-02-09

 * 12:19 joal: Rerun wikidata-articleplaceholder_metrics-wf-2018-2-8

2018-02-08

 * 16:23 elukey: stop archiva on meitnerium to swap /var/lib/archiva from the root partition to a new separate one

2018-02-07

 * 13:55 joal: Manually restarted druid indexation after weird failure of mediawiki-history-reduced-wf-2018-01
 * 13:49 elukey: restart overlord/middlemanager on druid1005

2018-02-06

 * 19:40 joal: Manually restarted druid indexation after weird failure of mediawiki-history-reduced-wf-2018-01
 * 15:36 elukey: drain + shutdown of analytics1038 to replace faulty BBU
 * 09:58 elukey: applied https://gerrit.wikimedia.org/r/c/405687/ manually on deployment-eventlog02 for testing

2018-02-05

 * 15:51 elukey: live hacked deployment-eventlog02's /srv/deployment/eventlogging/analytics/eventlogging/handlers.py to add poll(0) to the confluent kafka producer - T185291
 * 11:03 elukey: restart eventlogging/forwarder legacy-zmq on eventlog1001 due to slow memory leak over time (cached memory down to zero)

2018-02-02

 * 17:09 joal: Webrequest upload 2018-02-02 hours 9 and 11 dataloss warnings have been checked - they are false positives
 * 09:56 joal: Rerun unique_devices-per_project_family-monthly-wf-2018-1 after failure

2018-02-01

 * 17:00 ottomata: killing stuck JsonRefine eventlogging analytics job application_1515441536446_52892, not sure why this is stuck.
 * 14:06 joal: Dataloss alerts for upload 2018-02-01 hours 1, 2, 3 and 5 were false positives
 * 12:17 joal: Restart cassandra monthly bundle after January deploy

2018-01-23

 * 20:10 ottomata: hdfs dfs -chmod 775 /wmf/data/archive/mediacounts/daily/2018 for T185419
 * 09:26 joal: Dataloss warning for upload and text 2018-01-23:06 is confirmed to be a false positive

2018-01-22

 * 17:36 joal: Kill-Restart clickstream oozie job after deploy
 * 17:12 joal: deploying refinery onto HDFS
 * 17:12 joal: Refinery deployed from scap

2018-01-18

 * 19:11 joal: Kill-Restart coord_pageviews_top_bycountry_monthly oozie job from 2015-05
 * 19:10 joal: Add fake data to cassandra to silence alarms (Thanks again ema)
 * 18:56 joal: Truncating table "local_group_default_T_top_bycountry"."data" in cassandra before reload
 * 15:21 mforns: refinery deployment using scap and then deploying onto hdfs finished
 * 15:07 mforns: starting refinery deployment
 * 12:43 elukey: piwik on bohrium re-enabled
 * 12:40 elukey: set piwik in readonly mode and stopped mysql on bohrium (prep step for reboot)
 * 09:38 elukey: reboot thorium (analytics webserver) for security upgrade - This maintenance will cause temporary unavailability of the Analytics websites
 * 09:37 elukey: resumed druid hourly index jobs via hue and restored pivot's configuration
 * 09:21 elukey: reboot druid1001 for kernel upgrades
 * 09:00 elukey: suspended hourly druid batch index jobs via Hue
 * 08:58 elukey: temporarily set druid1002 in superset's druid cluster config (via UI)
 * 08:53 elukey: temporarily point pivot's configuration to druid1002 (druid1001 needs to be rebooted)
 * 08:52 elukey: disable druid1001's middlemanager as prep step for reboot
 * 07:11 elukey: re-run webrequest-load-wf-misc-2018-1-18-3 via Hue

2018-01-17

 * 17:33 elukey: killed the banner impression spark job (application_1515441536446_27293) again to force it to respawn (real time indexers not present)
 * 17:29 elukey: restarted all druid overlords on druid100[123] (weird race condition messages about who was the leader for some task)
 * 16:24 elukey: re-run all the pageview-druid-hourly failed jobs via Hue
 * 14:42 elukey: restart druid middlemanager on druid1003 as attempt to unblock realtime streaming
 * 14:21 elukey: forced kill of banner impression data streaming job to get it restarted
 * 11:44 elukey: re-run pageview-druid-hourly-wf-2018-1-17-9 and pageview-druid-hourly-wf-2018-1-17-8 (failed due to druid1002's middlemanager being in a weird state after reboot)
 * 11:44 elukey: restart druid middlemanager on druid1002
 * 10:38 elukey: stopped all crons on hadoop-coordinator-1
 * 10:37 elukey: re-run webrequest-druid-hourly-wf-2018-1-17-8 (failed due to druid1002's reboot)
 * 10:22 elukey: reboot druid1002 for kernel upgrades
 * 09:53 elukey: disable druid middlemanager on druid1002 as prep step for reboot
 * 09:46 elukey: rebooted analytics1003
 * 09:46 elukey: removed upstart config for brrd on eventlog1001 (failing and spamming syslog, old leftover?)
 * 08:53 elukey: disabled camus as prep step for analytics1003 reboot

2018-01-15

 * 13:39 elukey: stop eventlogging and reboot eventlog1001 for kernel updates
 * 09:58 elukey: rolling reboots of aqs hosts (1005->1009) for kernel updates
 * 09:11 elukey: reboot aqs1004 for kernel updates

2018-01-12

 * 13:03 joal: Rerun webrequest-load-wf-text-2018-1-12-9
 * 13:02 joal: Rerun webrequest-load-wf-upload-2018-1-12-9
 * 10:33 elukey: reboot analytics1066->69 for kernel updates
 * 09:07 elukey: reboot analytics1063->65 for kernel updates

2018-01-11

 * 22:35 ottomata: restarting kafka-jumbo brokers to apply https://gerrit.wikimedia.org/r/403774
 * 22:04 ottomata: restarting kafka-jumbo brokers to apply https://gerrit.wikimedia.org/r/#/c/403762/
 * 20:57 ottomata: restarting kafka-jumbo brokers to apply https://gerrit.wikimedia.org/r/#/c/403753/
 * 17:37 joal: Kill manual banner-streaming job to see it restarted by cron
 * 17:11 ottomata: restart kafka on kafka-jumbo1003
 * 17:08 ottomata: restart kafka on kafka-jumbo1001...something is not right with my certpath change yesterday
 * 14:46 joal: Deploy refinery onto HDFS
 * 14:33 joal: Deploy refinery with Scap
 * 14:07 joal: Manually restarting banner streaming job to prevent alerting
 * 13:23 joal: Killing banner-streaming job to have it auto-restarted from cron
 * 11:45 elukey: re-run webrequest-load-wf-text-2018-1-11-8 (failed due to reboots)
 * 11:39 joal: rerun mediacounts-load-wf-2018-1-11-8
 * 10:48 joal: Restarting banner-streaming job after hadoop nodes reboot
 * 10:01 elukey: reboot analytics1059-61 for kernel updates
 * 09:34 elukey: reboot analytics1055->1058 for kernel updates
 * 09:04 elukey: reboot analytics1051->1054 for kernel updates

2018-01-10

 * 16:56 elukey: reboot analytics1048->50 for kernel updates
 * 16:23 ottomata: restarting kafka jumbo brokers to apply java.security certpath restrictions
 * 11:51 elukey: re-run webrequest-load-wf-upload-2018-1-10-10 (failed due to reboots)
 * 11:27 elukey: re-run webrequest-load-wf-text-2018-1-10-10 (failed due to reboots)
 * 11:26 elukey: reboot analytics1044->47 for kernel updates
 * 11:03 elukey: reboot analytics1040->43 for kernel updates

2018-01-09

 * 16:53 joal: Rerun pageview-druid-hourly-wf-2018-1-9-13
 * 15:33 elukey: stop mysql on dbstore1002 as prep step for shutdown (stop all slaves, mysql stop)
 * 15:10 elukey: reboot analytics1028 (hadoop worker and hdfs journal node) for kernel updates
 * 15:00 elukey: reboot kafka-jumbo1006 for kernel updates
 * 14:41 elukey: reboot kafka-jumbo1005 for kernel updates
 * 14:33 elukey: reboot kafka1023 for kernel updates
 * 14:04 elukey: reboot kafka1022 for kernel updates
 * 13:51 elukey: reboot kafka-jumbo1003 for kernel updates
 * 10:08 elukey: reboot kafka-jumbo1002 for kernel updates
 * 09:35 elukey: reboot kafka1014 for kernel updates

2018-01-08

 * 19:07 milimetric: Deployed refinery and synced to hdfs
 * 15:23 elukey: reboot kafka1013 for kernel updates
 * 13:40 elukey: reboot analytics10[36-39] for kernel updates
 * 12:59 elukey: reboot kafka1012 for kernel updates
 * 12:43 joal: Deploy AQS from tin
 * 12:36 fdans: Deploying AQS
 * 12:33 joal: Update fake-data in cassandra adding top-by-country needed row
 * 11:07 elukey: re-run webrequest-load-wf-text-2018-1-8-8 (failed after some reboots due to kernel updates)
 * 10:07 elukey: drain + reboot analytics1029,1031->1034 for kernel updates

2018-01-07

 * 09:01 elukey: re-enabled puppet on db110[78] - eventlogging_sync restarted on db1108 (analytics-slave)

2018-01-06

 * 08:09 elukey: re-enable eventlogging mysql consumers after database maintenance

2018-01-05

 * 13:18 fdans: deploying AQS

2018-01-04

 * 19:54 joal: Deploying refinery onto hadoop
 * 19:45 joal: Deploy refinery using scap
 * 19:38 joal: Deploy refinery-source using jenkins
 * 16:01 ottomata: killing json_refine_eventlogging_analytics job that started yesterday and has not completed (has no executors running?) application_1512469367986_81514. I think the cluster is just too busy? mw-history job running...
 * 10:34 elukey: re-run mediacounts-archive-wf-2018-01-03

2018-01-03

 * 15:00 ottomata: restarting kafka-jumbo brokers to enable tls version and cipher suite restrictions

2018-01-02

 * 11:13 joal: Kill and restart cassandra loading oozie bundle to pick new pageview_top_bycountry job
 * 08:22 elukey: restart druid coordinators to pick up new jvm settings (freeing up 6GB of used memory)

2017-12-21

 * 15:54 joal: Start backfilling monthly pageview-by-country
 * 15:45 joal: deploy refinery onto HDFS
 * 15:38 joal: Deploying refinery with Scap

2017-12-20

 * 15:40 ottomata: removing some old webrequest data from hdfs
 * 14:56 ottomata: dropping some old wmf.webrequest partitions and data

2017-12-19

 * 17:28 elukey: re-enabled superset
 * 17:16 joal: Initializing new cassandra keyspace for pageviews/top-by-country
 * 16:52 elukey: temporarily stop superset to test druid's performance
 * 16:34 elukey: manually started eventlogging cleaner on db1107 to purge/sanitize data up to 90 days ago (tmux is running for user eventlogcleaner)
 * 14:10 elukey: temporarily changed JVM Heap settings for the druid broker on druid1001 - Xmx25g Xms10g (run puppet and restart the daemon to rollback)

2017-12-12

 * 15:54 milimetric: sieging aqs1004 with 100.000 transactions

2017-12-11

 * 14:07 elukey: disable druid middlemanager on druid1003 to drain + restart to pick up new logging settings
 * 12:59 elukey: disabled druid middlemanager on druid1002 to drain+restart with new logging config

2017-12-08

 * 11:15 joal: Start mediawiki-history oozie jobs new-version
 * 10:49 joal: Update wmf.mediawiki_history as explained in email (rename current table to old, create new one)

2017-12-07

 * 21:09 joal: Start clickstream oozie job
 * 20:45 joal: Kill restbase oozie job and restart apis replacing one
 * 20:12 joal: Trying to deploy refinery again
 * 19:30 joal: Deploying refinery now that -source is deployed
 * 18:39 milimetric: Deployed refinery-source using jenkins
 * 15:03 elukey: restart webrequest-misc load job (Dec 7 2017 06:00:00)
 * 12:24 elukey: camus re-enabled after analytics1003 reboot
 * 08:31 elukey: stop camus on an1003 as prep step for reboot

2017-12-06

 * 14:55 elukey: restart hue to pick up the new oozie server
 * 14:55 elukey: oozie server accidentally restarted due to a puppet change (the service auto-restarts)
 * 11:22 elukey: disabled temporarily druid's middlemanager on druid1001 to test the Real Time monitor setting

2017-12-05

 * 10:35 elukey: re-enable webrequest bundle and camus after reboots
 * 10:31 elukey: disabled druid middlemanager on druid1003 with curl -X POST http://druid1003.eqiad.wmnet:8091/druid/worker/v1/disable
 * 10:03 elukey: stop camus as precautionary measure before Hadoop masters reboot
 * 09:57 elukey: suspend webrequest load bundle as extra precaution before Hadoop masters reboot
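
Note: the 10:31 entry above disables a middlemanager with an HTTP POST to /druid/worker/v1/disable. A minimal Python sketch that builds the same request with the standard library, for illustration only. The function name build_worker_request is hypothetical, the default port is taken from the curl command above, and the enable action is assumed from Druid's worker API rather than from this log:

```python
from urllib.request import Request


def build_worker_request(host, action="disable", port=8091):
    """Build (but do not send) a POST to a Druid middlemanager worker endpoint."""
    if action not in ("disable", "enable"):
        raise ValueError("action must be 'disable' or 'enable'")
    url = "http://%s:%d/druid/worker/v1/%s" % (host, port, action)
    # Empty body; the method is set explicitly so the request is a POST.
    return Request(url, data=b"", method="POST")


# To actually send it (needs network access to the host):
#   from urllib.request import urlopen
#   with urlopen(build_worker_request("druid1003.eqiad.wmnet")) as resp:
#       print(resp.status, resp.read())
```

Keeping request construction separate from sending makes the URL easy to check before touching a production host.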

2017-12-04

 * 16:29 elukey: restart webrequest-load-wf-upload-2017-12-4-12 (failed due to hadoop reboots)
 * 16:12 elukey: restart webrequest-load-wf-upload-2017-12-4-13 (failed due to hadoop reboots)
 * 15:09 joal: Rerun webrequest-load-wf-upload-2017-12-4-12 and webrequest-load-wf-upload-2017-12-4-13
 * 15:08 joal: Rerunning 15:47:35 whatuuuup mforns
 * 14:17 elukey: re-run pageview-druid-hourly-wf-2017-12-4-11 in Hue (failed due to reboots)
 * 12:04 elukey: re-run webrequest-load-wf-upload-2017-12-4-8 (failed due to reboots)
 * 12:04 elukey: re-run webrequest-load-check_sequence_statistics-wf-upload-2017-12-4-7 (failed due to reboots)

2017-12-02

 * 11:47 joal: Rerun unique_devices-per_project_family-monthly-wf-2017-11

2017-12-01

 * 15:20 elukey: rerun webrequest-druid-hourly-wf-2017-12-1-8 after an unexpected Druid Overlord inconsistency
 * 15:09 elukey: rerun pageview-druid-hourly-wf-2017-12-1-8 after an unexpected Druid Overlord inconsistency
 * 13:07 elukey: re-run aqs-hourly-wf-2017-12-1-8 (failed due to Hadoop reboots)
 * 12:42 elukey: temporarily switch pivot's config to druid1002 (to reboot druid1001)
 * 12:37 elukey: re-run webrequest-load-wf-upload-2017-12-1-10 and webrequest-load-wf-upload-2017-12-1-7 (failed due to Hadoop reboots)
 * 12:36 elukey: re-run webrequest-load-wf-text-2017-12-1-10 and webrequest-load-wf-text-2017-12-1-9 (failed due to Hadoop reboots)
 * 12:35 elukey: re-run pageview-hourly-wf-2017-12-1-8 (failed due to Hadoop reboots)
 * 12:34 elukey: re-run webrequest-druid-hourly-wf-2017-12-1-8 (failed due to Hadoop reboots)

2017-11-30

 * 18:20 elukey: re-run webrequest-load-wf-upload-2017-11-30-16 (failed due to hadoop reboots)
 * 18:19 elukey: re-run webrequest-load-wf-text-2017-11-30-14 (failed due to hadoop reboots)
 * 16:21 joal: wikidata-wdqs_extract-wf-2017-11-30-15
 * 15:50 elukey: restart hue on thorium - timeouts and 500s
 * 14:58 joal: Update druid overlord config to equalDistribution dynamically

2017-11-29

 * 21:46 joal: rerun pageview-druid-hourly-wf-2017-11-29-18 and pageview-druid-hourly-wf-2017-11-29-19
 * 21:19 joal: rerun webrequest-druid-hourly-wf-2017-11-29-18

2017-11-28

 * 14:41 ottomata: restarting eventlogging on eventlog1001 for https://gerrit.wikimedia.org/r/#/c/393613/
 * 09:08 elukey: log database on dbstore1002 dropped for good

2017-11-22

 * 16:09 ottomata: restarting eventlogging services on eventlog1001

2017-11-20

 * 18:28 elukey: deployed prometheus-druid-exporter (still not released in apt) on druid1004 for testing
 * 15:45 ottomata: deploying fixes to EL EventCapsule discrepancies: https://phabricator.wikimedia.org/T179625#3755242

2017-11-16

 * 15:25 milimetric: deployed refinery and running interlanguage links dataset now

2017-11-15

 * 14:22 addshore: addshore@stat1005:/srv/analytics-wmde$ sudo -u analytics-wmde rm -rf /srv/analytics-wmde/r-library
 * 14:22 addshore: addshore@stat1005:/srv/analytics-wmde$ sudo -u analytics-wmde rm -rf /srv/analytics-wmde/installRlib

2017-11-14

 * 09:45 elukey: executed chmod g+rx /home/ezachte/wikistats_data/dumps to unblock Joseph (should be safe)

2017-11-13

 * 21:20 addshore: addshore@stat1005:/srv/analytics-wmde/wdcm/src$ sudo -u analytics-wmde Rscript ./_installProduction_analytics-wmde.R
 * 21:20 addshore: test
 * 14:44 joal: Resuming all druid loading jobs after fixing restart issues
 * 14:18 joal: Suspending pageview-druid-hourly-coord again trying to fix druid loading
 * 14:10 joal: Unsuspend pageview-druid-hourly-coord
 * 13:08 joal: Suspend webrequest druid loading waiting for elukey
 * 13:05 joal: Rerun webrequest-druid-hourly-wf-2017-11-13-11
 * 11:15 elukey: suspend pageview-druid-hourly-coord to allow an easier druid daemon reload (new prometheus jvm agent)

2017-11-08

 * 15:16 ottomata: deploying eventlogging analytics change for eventcapsule schema fixes, will be no-op until we deploy puppet changes too
 * 11:28 elukey: resumed cassandra-coord-pageview-per-project-hourly after maintenance to aqs hosts
 * 10:04 elukey: suspended cassandra-coord-pageview-per-project-hourly as prep step to reboot aqs nodes - T179943

2017-11-06

 * 15:37 milimetric: found geowiki was hitting the wrong databases, updated it to always hit analytics-store

2017-11-03

 * 10:55 joal: Kill mediawiki-history oozie job to prevent computing october snapshot before fixing reconstruction process

2017-11-02

 * 08:54 elukey: relaunched failed pageview-druid-hourly jobs - Druid indexation check failures in the logs (01 Nov 2017 21:00:00 and 01 Nov 2017 19:00:00)

2017-11-01

 * 20:06 ottomata: rerunning pageview-druid-hourly-wf-2017-11-1-18
 * 19:05 ottomata: deploying refinery with refinery/source 0.0.54 for JsonRefine job T162610
 * 18:40 ottomata: rerunning unique_devices-per_project_family-druid-monthly-wf-2017-10

2017-10-30

 * 10:12 elukey: added Francisco to the analytics-alerts@ mailing list

2017-10-27

 * 07:40 elukey: re-run wikidata-articleplaceholder_metrics-wf-2017-10-26
 * 07:36 elukey: stop & mask hadoop-httpfs.service on analytics1001 after https://gerrit.wikimedia.org/r/#/c/386684/

2017-10-26

 * 16:58 ottomata: now mirroring main Kafka cluster topics to jumbo Kafka cluster, with MirrorMaker instances running on analytics-eqiad broker nodes. https://phabricator.wikimedia.org/T177216

2017-10-25

 * 13:32 elukey: restart yarn nodemanager and hdfs datanode on analytics1030 to apply new JVM settings

2017-10-24

 * 20:29 nuria_: started unique_devices-per_project_family-druid-daily-coord 0102816-170829140538136-oozie-oozi-C
 * 20:24 nuria_: restarted job unique_devices-per_project_family-druid-monthly-coord 0102799-170829140538136-oozie-oozi-C
 * 20:23 nuria_: restarted job uniques-monthly-per-domain-druid 0102785-170829140538136-oozie-oozi-C
 * 19:44 nuria_: killing druid coordinators uniques-monthly and per-project-family: 0066771-170829140538136-oozie-oozi-C,0066767-170829140538136-oozie-oozi-C,0010139-170621131133576-oozie-oozi-C
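
Note: the Oozie IDs in the entries above (for example 0102816-170829140538136-oozie-oozi-C) look like they follow the usual Oozie layout: sequence number, server start time, system id, job type letter. A minimal Python sketch for decoding one, assuming that layout and assuming the 15-digit field is yyMMddHHmmssSSS in the server's local time. The function name parse_oozie_id is hypothetical:

```python
import re
from datetime import datetime

# Assumed mapping of the trailing letter to the Oozie job type.
_KINDS = {"W": "workflow", "C": "coordinator", "B": "bundle"}
_ID_RE = re.compile(r"^(\d+)-(\d{15})-(.+)-([WCB])$")


def parse_oozie_id(job_id):
    """Decode an Oozie job ID into sequence, server start time, system id, kind."""
    m = _ID_RE.match(job_id)
    if m is None:
        raise ValueError("unrecognised Oozie job id: %r" % job_id)
    seq, ts, system, kind = m.groups()
    # Slice the timestamp by hand: yy MM dd HH mm ss SSS (milliseconds).
    started = datetime(
        2000 + int(ts[0:2]), int(ts[2:4]), int(ts[4:6]),
        int(ts[6:8]), int(ts[8:10]), int(ts[10:12]),
        int(ts[12:15]) * 1000,
    )
    return {
        "sequence": int(seq),
        "server_started": started,
        "system": system,
        "kind": _KINDS[kind],
    }
```

Under these assumptions, IDs that share the same middle field were issued by the same Oozie server run, which is a quick way to tell older coordinators from ones created after a server restart.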

2017-10-23

 * 18:50 joal: Deploying AQS after fix
 * 13:30 joal: deploy AQS from tin

2017-10-19

 * 20:04 mforns: Deployed refinery using scap, then deployed onto hdfs
 * 11:44 joal: deploying AQS in beta

2017-10-16

 * 17:32 mforns: restarted EventLogging for changes in blacklist to take effect
 * 16:27 joal: Re-Deploy AQS after monitoring fix
 * 16:14 joal: Deploy AQS with new code

2017-10-13

 * 16:49 ottomata: deployed refinery to use rand for webrequest sampling

2017-10-12

 * 15:40 elukey: run kafka preferred-replica-election to allow kafka1013 to re-join the topic leaders
 * 14:48 elukey: disable httpfs access on analytics1001

2017-10-09

 * 18:28 ottomata: resuming oozie druid indexing jobs, 1004-1006 are offline
 * 16:34 ottomata: stopping druid services on druid1006
 * 16:05 ottomata: pausing all druid oozie coordinators in preparation for druid public separation
 * 12:47 joal: Kill restart oozie job loading mediawiki-history into druid
 * 12:14 joal: Kill-Restart oozie jobs loading banner data into druid
 * 12:04 joal: Deploy refinery onto HDFS
 * 11:47 joal: Deploying refinery from scap
 * 08:53 joal: Rerunning wikidata-articleplaceholder_metrics-wf-2017-10-7 after failure

2017-10-06

 * 11:10 elukey: restart all druid daemons to pick up new logging changes
 * 11:08 joal: Rerun pageview-druid-hourly-wf-2017-10-6-9
 * 09:31 elukey: restart all the druid daemons on druid1005 to apply the new logging rules
 * 08:49 elukey: restarted all the druid broker daemons to pick up the new logging changes

2017-10-05

 * 13:48 milimetric: restarted banner_activity-druid-monthly for September again

2017-10-04

 * 18:39 ottomata: druid-analytics.svc.eqiad.wmnet:8082 should only be accessible to analytics networks
 * 17:32 ottomata: deploying new LVS service for druid-analytics-broker

2017-10-03

 * 14:50 milimetric: restarted failed workflow 0057215-170829140538136-oozie-oozi-W (druid monthly banner activity)

2017-09-28

 * 10:02 elukey: re-enabled camus after maintenance
 * 09:51 elukey: restart mapreduce history server on an1001 to apply new heap settings (Xmx/s to 4g)

2017-09-27

 * 15:18 joal: Kill/restart stuck jobs
 * 14:45 elukey: rolling restart of all the Yarn nodemanager daemons on analytics1028-1068 (ease heap consumption pressure, seamless restart)
 * 13:40 elukey: manual failover of HDFS namenode from an1002 to an1001
 * 13:17 elukey: manual failover of HDFS namenode from an1001 to an1002 to test 6G max heap size
 * 13:14 elukey: restart mapreduce history server on analytics1001 after crash (java.lang.OutOfMemoryError: GC overhead limit exceeded)

2017-09-26

 * 14:49 joal: restart mobile_apps session_metrics bundle
 * 11:01 joal: Restart mediawiki-history-denormalize and mediawiki-history-druid jobs after deploy
 * 10:58 joal: Restart webrequest load job after deploy
 * 10:35 joal: Deploying refinery onto HDFS
 * 10:25 joal: Deploy Refinery with scap
 * 09:33 joal: Releasing refinery-source v0.0.53 with Jenkins

2017-09-25

 * 08:41 joal: Rerun mobile_apps-session_metrics-wf-2017-9-17 after failure

2017-09-19

 * 19:24 joal: Rerun pageview-druid-hourly-wf-2017-9-19-17 failed during druid restart
 * 19:19 ottomata: restarting druid broker and historical processes with druid.processing.numMergeBuffers=10

2017-09-14

 * 17:35 ottomata: restarting eventlogging processor(s) with MySQL blacklist of PageCreation schema

2017-09-13

 * 16:08 ottomata: restarting druid-brokers with increase in query cache size
 * 11:15 joal: Kill-Restart mediawiki-history-denormalize-coord and launch new coords mediawiki-history-load and mediawiki-history-reduced
 * 11:11 joal: Kill-Restart oozie pageview druid loading jobs (hourly, daily, monthly)
 * 11:03 joal: Deploy refinery onto HDFS
 * 10:57 joal: Deploy refinery from scap
 * 10:08 joal: Deploying refinery-source using Jenkins

2017-09-07

 * 08:41 joal: Rerun Workflow banner_activity-druid-daily-wf-2017-9-6

2017-09-04

 * 08:55 joal: Kill - Restart mediawiki-history-druid-coord to pick last update

2017-09-01

 * 18:30 joal: Rerun Workflow webrequest-load-wf-misc-2017-9-1-16 after very weird failure
 * 10:06 elukey: killed root rsyncs on thorium, disabled puppet
 * 01:31 ottomata: restarted hue (a few minutes ago) not totally sure why it died

2017-08-30

 * 15:54 elukey: re-added analytics1055 among the hdfs/yarn worker after maintenance
 * 14:07 elukey: restart java daemons on druid100[456] for jvm security updates
 * 09:07 elukey: restart all jvm daemons on druid100[123] for security updates
 * 09:07 elukey: restart pageview-druid-hourly-wf-2017-8-30-7 in Hue after druid1001 daemons restart

2017-08-29

 * 19:36 ottomata: restarting all kafka brokers and mirror maker processes to apply https://gerrit.wikimedia.org/r/#/c/374610/
 * 12:46 elukey: suspend oozie jobs from Hue to allow an easier restart of oozie/hive daemons

2017-08-28

 * 13:57 elukey: restart kafka* on kafka1012 for openjdk security updates (canary)
 * 10:34 elukey: restart yarn and hdfs on analytics1030 for jvm updates (canary)

2017-08-23

 * 19:50 joal: restart oozie webrequest-load bundle after bug correction
 * 19:46 joal: Deploy refinery onto hdfs
 * 19:41 joal: Deploying refinery from tin
 * 19:36 joal: Deployed werbrequest-source using jenkins for bug correction
 * 19:26 joal: Alter wmf.webrequest and wmf.wdqs_extract tables to correct bug
 * 19:25 joal: Kill oozie webrequest-load bundle for redeploy after bug correction
 * 11:04 joal: Update wmf.wdqs_extract table for normalized_host update
 * 10:12 joal: Restart oozie webrequest-load bundle after deploy and updates
 * 10:09 joal: Alter webrequest table before restarting oozie load bundle
 * 10:06 joal: Deploying refinery onto hdfs
 * 09:59 joal: Deploying refinery
 * 09:59 joal: Kill oozie webrequest-load bundle for restart after deploy
 * 08:25 joal: Deploying refinery-source v0.0.50 using jenkins

2017-08-22

 * 19:52 joal: Drop / recreate wmf.mediawiki_history table for naming correction
 * 13:57 ottomata: sudo -u hdfs hdfs dfs -rm /tmp/druid-indexing/classpath/guava.jar (guava 11.0.2 is conflicting with guava 16.0.1 from druid-hdfs-storage-cdh extension). Not sure how guava 11.0.2 got there, but let's see if it doesn't come back
 * 08:27 joal: Rerun druid loading jobs after night failures

2017-08-21

 * 13:46 ottomata: adding index on (database, rev_timestamp) on mediawiki_page_create_2 table on db1047: T170990
 * 13:26 ottomata: adding index on (database, rev_timestamp) on mediawiki_page_create_2 table on dbstore1002: T170990

2017-08-14

 * 16:40 elukey: analytics1034 back in service after swapping the eth cable - T172633

2017-08-10

 * 20:06 milimetric: stopped Wikimetrics web and queue on wikimetrics-01.eqiad.wmflabs because the queue ran into errors connecting to the database (max 10 connections limit reached)
 * 08:59 elukey: updated librdkafka1 to 0.9.4.1 on eventlog1001

2017-08-08

 * 18:39 elukey: restart projectview-hourly-wf-2017-8-8-14, pageview-druid-hourly-wf-2017-8-8-14, pageview-hourly-wf-2017-8-8-14 via Hue (analytics1055 disk failure)
 * 14:20 elukey: restart varnishkafka statsv/eventlogging instances to pick up https://gerrit.wikimedia.org/r/#/c/370637/ (kafka protocol explicitly set to 0.9.0.1)

2017-08-06

 * 11:03 elukey: stop yarn on analytics1034 to reload the tg3 driver - T172633

2017-08-03

 * 16:15 ottomata: druid cluster restarted with 0.9.2 mysql-metadata-storage extension, un-suspending oozie druid jobs
 * 14:11 ottomata: pausing oozie druid jobs and doing a cluster upgrade/restart again to make sure updated version of mysql-metadata-storage jar is properly loaded
 * 09:56 elukey: set piwik in maintenance mode to allow mysql updates
 * 08:08 elukey: restarted Druid jobs failed overnight (druid_loader.py error) and due to Hive metastore restart
 * 08:03 elukey: restart hive-metastore to pick up new JVM Xms settings

2017-08-02

 * 14:34 ottomata: beginning druid upgrade to 0.9.2 (take 2 :) )
 * 14:23 elukey: restart hive-server to pick up JVM Xms4g change
 * 14:22 ottomata: suspending druid oozie jobs

2017-08-01

 * 17:24 ottomata: beginning druid upgrade to 0.9.2 http://druid.io/docs/0.9.2/operations/rolling-updates.html
 * 17:10 ottomata: pausing all druid oozie coordinators
 * 12:49 elukey: restart hive daemons on analytics1003 to pick up new jvm settings (bigger Xmx, JMX ports)
 * 10:05 elukey: suspended again webrequest-load-bundle as prep step to restart the hive daemons
 * 07:58 elukey: suspended webrequest-load-bundle as prep step to restart the hive daemons
 * 07:03 elukey: restarted mobile_apps-session_metrics-coord-global-30days failed job via Hue

2017-07-31

 * 13:45 elukey: suspended webrequest-load-bundle as prep step to restart hive metastore/server
 * 10:34 elukey: restart hive-server on an1003 - beeline not connecting, thrift errors

2017-07-28

 * 07:55 elukey: update nodejs to 6.11 on aqs1004 (testing prod node after beta qa)
 * 07:54 elukey: re-run webrequest-load-wf-upload-2017-7-28-6 from Hue (was playing with eth0 issues on an1034)
 * 02:08 ottomata: stat1002: disabled puppet, umounted /tmp, /home and /a, poweroff

2017-07-26

 * 21:01 mforns: Deployed refinery using scap, then deployed onto hdfs
 * 18:57 mforns: Deployed refinery-source using jenkins

2017-07-25

 * 15:24 elukey: restart cassandra loading after maintenance via hue
 * 13:06 elukey: stop cassandra load bundle, restarting AQS for jvm updates
 * 12:13 elukey: executed sudo apt-get remove openjdk-8-jre openjdk-8-jre-headless on druid nodes

2017-07-24

 * 14:24 ottomata: restarted mysql-eventbus eventlogging consumer with new consumer group

2017-07-20

 * 20:31 nuria_: restarting eventlogging on eventlog1001
 * 20:30 nuria_: deploying eventlogging c1c2c39411ccd002ff8cea197bc535155213f5fb and restarting
 * 18:18 ottomata: deleted instance deployment-eventlogging03 in favor of new instance deployment-eventlog02
 * 17:14 ottomata: killed tranquility instances tranq-banners and tranq-netflow running on druid1003 in joal's screen sessions

2017-07-18

 * 13:04 ottomata: adding unique index on meta_id and index on meta_dt to mediawiki_page_{create,delete,move,undelete}_1 on db1046 MySQL eventlogging master

2017-07-17

 * 16:27 elukey: set innodb_flush_log_at_trx_commit on bohrium to 2 and sync_binlog=300 to reduce iowait - T164073
 * 14:31 elukey: set innodb_flush_log_at_trx_commit on bohrium to 1 (default value)- T164073

2017-07-12

 * 13:48 fdans: updated pageview whitelist with din.wikipedia

2017-07-11

 * 05:24 elukey: drop _Edit_11448630_old from dbstore1002

2017-07-10

 * 16:14 nuria_: deploying eventlogging 5e16da16e3f5ce287829390a76b9f5b0c7715ee5

2017-07-08

 * 07:55 elukey: re-run wikidata-specialentitydata_metrics-wf-2017-7-7 in Hue (failed Spark job)

2017-07-06

 * 10:37 elukey: taking mysqldump for Piwik and storing it on stat1002:/a/backup/bohrium/mysqldump_20170706.sql

2017-07-04

 * 11:21 joal: Redeploying refinery with scap
 * 11:10 joal: Restart unique_devices-per_project_family-monthly-coord after correction deployed
 * 11:03 joal: Deploying refinery onto hdfs
 * 10:57 joal: Deploying refinery with scap

2017-07-03

 * 16:40 joal: Manually launch sqoop imports for enwiki revision, and wikidatawiki revision and logging tables, snapshot=2017-06

2017-07-01

 * 21:33 joal: Restart cassandra bundle at beginning of the month

2017-06-29

 * 11:39 joal: Update tables and archived data and kill/start jobs for unique-devices per project-family
 * 11:34 joal: Kill and restart druid webrequest sampled oozie jobs after deploy
 * 11:18 joal: Update tables and restart mediawiki_history oozie jobs after deploy
 * 10:58 elukey: deploy refinery to HDFS
 * 10:57 elukey: fixed archiva whitelist in the analytics VLAN (VM changed IP)
 * 07:03 joal: Deploying refinery with scap (after yesterday's failure)

2017-06-28

 * 18:17 joal: Deploying refinery with scap
 * 16:25 joal: Building / Deploying refinery-source from jenkins to archiva (v0.0.48)
 * 15:42 elukey: analytics1030 back to the worker nodes after maintenance

2017-06-27

 * 16:26 milimetric: quarry Rebooted all the boxes in an attempt to fix performance problems
 * 10:05 elukey: added https://wiki.apache.org/commons/VfsProblems to stat1004
 * 07:14 joal: Rerun wikidata-articleplaceholder_metrics-wf-2017-6-26

2017-06-24

 * 10:31 elukey: re-run webrequest-load-coord-misc's failed job in hue

2017-06-23

 * 07:32 elukey: uploaded new pageview whitelist following https://wikitech.wikimedia.org/wiki/Analytics/Team/Oncall#Find_and_fix_pageview_whitelist_exceptions for kbp.wikipedia

2017-06-21

 * 20:23 joal: Disable puppet agent and restart kafka with 48h retention in deployment-kafka01
 * 13:59 elukey: eventlogging restarted after reboot
 * 13:54 elukey: stop eventlogging and reboot eventlog1001
 * 13:15 elukey: reboot analytics1003 for kernel update
 * 11:08 elukey: stop camus on an1003

2017-06-20

 * 19:24 ottomata: beginning to consume select eventbus event using eventlogging mysql consumer and inserting into eventlogging analytics mysql db
 * 18:01 joal: Rerun webrequest-load-wf-text-2017-6-20-12 after oozie failure
 * 16:23 joal: Restarted tranquility for banners and netflow on druid1003
 * 16:18 joal: Rererun pageview-druid-hourly-wf-2017-6-20-14 (failed due to druid reboots)
 * 16:04 elukey: re-run pageview-druid-hourly-wf-2017-6-20-14 (failed due to druid reboots)
 * 14:46 elukey: re-run failed webrequest-load-text/upload jobs due to reboots
 * 13:29 elukey: restart webrequest-load-coord-text and webrequest-load-coord-upload failed jobs due to reboots
 * 13:14 elukey: re-run wikidata-wdqs_extract-wf-2017-6-20-11 (failed for connection issues, likely due to reboots)
 * 11:54 joal: Deleting old unique_devices data (renamed to unique_devices_per_domain)
 * 10:27 elukey: reboot kafka1012, analytics1028, aqs1004 for kernel upgrades (canary hosts)
 * 08:51 elukey: manually added the user 'hdfs' to the 'hive' group to be able to run refinery-drop-webrequest-partitions
 * 08:49 elukey: manually running /srv/deployment/analytics/refinery/bin/refinery-drop-webrequest-partitions on an1003 to free hdfs space

2017-06-19

 * 12:10 elukey: disable BBU auto learn on all the hadoop workers

2017-06-13

 * 10:10 elukey: merged big zookeeper refactoring https://gerrit.wikimedia.org/r/#/c/354449 - Druid's Hadoop client config now correctly points to conf1* and not druid1*

2017-06-12

 * 17:21 joal: Last deploy of the day for uniques patch
 * 13:26 joal: redeploying refinery after bug patch
 * 11:32 joal: Change production last_access_uniques dataset to unique_devices/per_domain
 * 11:11 joal: Deploy refinery onto HDFS
 * 11:03 joal: Regular weekly deploy of refinery (mostly unique_devices patches)
 * 10:54 joal: Refinery-source deployed to archiva

2017-06-08

 * 16:41 nuria_: deploying refinery to cluster
 * 13:44 elukey: AQS cluster in beta wiped and re-bootstrapped due to T167222
 * 12:54 elukey: run megacli -LDSetProp ADRA -LALL -aALL on analytics1047 to set ReadAheadAdaptive on analytics[1042-1046,1048-1057].eqiad.wmnet - T166140
 * 12:16 elukey: run megacli -LDSetProp ADRA -LALL -aALL on analytics1047 to set ReadAheadAdaptive - T166140
 * 10:35 elukey: executed megacli -LDSetProp NoCachedBadBBU -LALL -aALL on analytics1049/45
 * 10:28 elukey: executed megacli -LDSetProp NoCachedBadBBU -LALL -aALL on analytics1032 as test - T166140
 * 07:25 elukey: kill maps webrequest load coordinator as temporary measure to avoid oozie spamming
 * 07:21 elukey: suspended cache maps as temporary measure to avoid oozie spamming

2017-06-07

 * 06:50 elukey: restarted mediacounts-archive-wf-2017-06-06 in Hue (Java OOMs)

2017-06-06

 * 15:44 ottomata: restarting eventlogging mysql consumer to allow is_mediawiki events through is_not_bot filter
 * 15:24 ottomata: restarting eventlogging processor to bring in is_mediawiki ua classification

2017-06-02

 * 14:48 mforns: Restarted webrequest-load-bundle after deploy
 * 08:41 joal: Restarted last_access_uniques-monthly-coord after bug correction and deploy
 * 04:42 elukey: removed some old scap revs for the Analytics refinery on stat1002 to free space

2017-06-01

 * 14:29 mforns: Deployed refinery using scap, then deployed onto hdfs
 * 12:47 mforns: Deployed refinery-source v0.0.46 using jenkins

2017-05-29

 * 09:45 joal: Restarted wikidata-articleplaceholder_metrics-wf-2017-5-27

2017-05-26

 * 12:54 elukey: restarted master Hadoop daemons for jvm upgrade
 * 12:39 elukey: re-added analytics1030 to the hadoop workers

2017-05-25

 * 10:11 elukey: removed /usr/share/druid/extensions/druid-hdfs-storage-cdh/druid-hdfs-storage-0.10.0.jar from all druid nodes
 * 07:23 joal: Restart pageview-druid-hourly-wf-2017-5-24-19

2017-05-24

 * 21:09 ottomata: pausing all druid oozie coordinators until hadoop loading is fixed
 * 15:27 joal: Resume pageview-druid-hourly-coord after druid upgrade
 * 13:07 joal: Suspend pageview-druid-hourly-coord for druid upgrade
 * 09:06 joal: Restart oozie mediawiki_history denormalize/metrics job after bug fixing deploy
 * 09:04 joal: Restart oozie last_access_uniques daily/monthly job after bug fixing deploy
 * 09:01 joal: Restart oozie restbase job after bug fixing deploy
 * 08:52 joal: Deploy refinery to HDFS
 * 08:48 joal: Deploying refine

2017-05-23

 * 14:32 joal: Start 1-off oozie jobs adding underestimate and offset values in historical archived uniques datasets
 * 13:50 joal: Restarted oozie last_access_uniques jobs (daily + monthly) after deploy
 * 13:47 joal: Restarted oozie druid hourly pageview job after deploy
 * 13:46 joal: Restarted oozie druid uniques job after deploy
 * 12:56 joal: Deploying refinery to HDFS
 * 12:10 joal: Start refinery deployment

2017-05-16

 * 22:54 elukey: disabled puppet and hadoop daemons again on analytics1030 (still need hw maintenance but motherboard replaced)
 * 22:54 elukey: analytics1040 back to the hadoop worker nodes after maintenance

2017-05-09

 * 10:13 elukey: re-run manually 2017-05-08T18 for misc due to job errors (failed oozie id 0020276-170424154741156-oozie-oozi-W)

2017-05-05

 * 13:08 elukey: restart Pivot on thorium after banner_activity_minutely_sanitization_test cleanup
 * 12:02 elukey: removed /etc/cron.d/piwik-archive on bohrium, now puppet creates it for user www-data

2017-05-04

 * 16:26 elukey: set daily cron archiver (rather than every hour) for Piwik on bohrium
 * 10:09 joal: Rerun full druid loading for daily uniques - 0012911-170424154741156-oozie-oozi-C
 * 09:17 joal: Deploy refinery onto hdfs
 * 09:13 joal: Deploy refinery from naos :)

2017-05-03

 * 08:32 elukey: added "adapter=MYSQLI" to config.ini to enable LOAD FILE capabilities on piwik (restarted apache2)
 * 08:20 elukey: GRANT FILE on *.* to piwik@localhost executed on bohrium (https://piwik.org/faq/troubleshooting/#faq_194)
 * 08:16 elukey: removed 2>&1 from the Piwik cron archive script
 * 08:12 elukey: set Piwik archive cron on bohrium to run every 3600s (rather than 14400)

2017-05-02

 * 12:57 elukey: set long_query_time=5 to mysql on bohrium
 * 12:54 elukey: enabled mysql slow query log on bohrium (/var/log/mysql/slow-query.log)
 * 09:53 joal: Restart mediawiki history jobs to pick up new snapshot format

2017-04-28

 * 08:45 joal: Restart Workflow webrequest-load-wf-maps-2017-4-28-1

2017-04-27

 * 16:40 joal: Manually push (again) pageview whitelist
 * 16:37 joal: restart Workflow aqs-hourly-wf-2017-4-27-14 and Workflow pageview-hourly-wf-2017-4-27-14
 * 12:06 joal: Manually push pageview whitelist to silence oozie alerts
 * 09:47 elukey: re-enabled tracking in piwik after maintenance
 * 09:44 elukey: disabled tracking in piwik to allow mysql upgrade

2017-04-26

 * 18:51 elukey: resumed oozie the complainer on Hue
 * 10:12 elukey: restarted webrequest-load-(text|upload|misc|maps) failed jobs (Hadoop workers maintenance)
 * 09:53 elukey: restart mediacounts-load-wf-2017-4-26-7 (failed due to maintenance to the hadoop cluster)
 * 09:51 elukey: restart aqs-hourly-wf-2017-4-26-8 (failed because an1036's hdfs daemon went down for maintenance)

2017-04-25

 * 10:33 joal: restart failed mediacounts-archive-coord : Workflow mediacounts-archive-wf-2017-04-24

2017-04-24

 * 15:54 elukey: re-enabled oozie bundles webrequest-load and transfer_to_es
 * 13:22 elukey: disable camus cron on an1003
 * 13:08 elukey: suspended transfer_to_es bundle
 * 13:07 elukey: suspended webrequest-load-bundle

2017-04-21

 * 13:51 elukey: set innodb_flush_log_at_trx_commit = 0 and sync_binlog = 300 on bohrium's mysql
 * 10:35 elukey: restart pivot for nodejs security upgrades

2017-04-20

 * 17:32 joal: Start daily uniques druid loading job (from 2015-12-17)
 * 17:26 joal: Restart druid pageview loading [daily-monthly]
 * 17:06 joal: Restart wikidata-specialentitydata_metrics-coord and wikidata-articleplaceholder_metrics-coord
 * 16:59 joal: Restart mobile_apps-uniques-[daily|monthly]-coord
 * 15:49 milimetric: deployed Refinery
 * 07:41 elukey: Restart mediacounts-archive-wf-2017-04-19 in Hue (Java Heap space issue)

2017-04-12

 * 09:53 elukey: stop Clickhouse on druid100[123]

2017-04-07

 * 08:30 joal: Insert fake test data in aqs pagecounts endpoint to set monitoring back to non-alarm state

2017-04-05

 * 16:03 elukey: restarted webrequest-load-wf-text-2017-4-5-14
 * 15:13 elukey: removed /etc/cron.daily/blogreport from eventlog1001 (manual backup in /home/elukey)
 * 13:24 ottomata: deployed slightly improved eventlogging_sync.sh script on db1047 and dbstore1002

2017-04-04

 * 19:50 ottomata: beginning jessie reimage for analytics105[56]
 * 18:13 ottomata: starting jessie upgrade of analytics105[34]
 * 17:26 joal: Restart mediawiki-history-denormalize-wf-2017-04
 * 16:17 joal: Restart webrequest-load-wf-text-2017-4-4-14
 * 08:14 elukey: restarted webrequest-load-wf-text-2017-4-4-6

2017-04-03

 * 18:24 nuria: starting replication back on Eventlogging 1002/1047/1046
 * 18:15 ottomata: dropping EL tables with really old data
 * 12:53 elukey: restart webrequest-load-wf-upload-2017-4-3-11
 * 11:49 elukey: manual run of sudo -u stats /a/refinery-source/guard/run_all_guards.sh --rebuild-jar
 * 11:38 joal: Restart corrected mediawiki-history oozie job
 * 11:30 joal: Deploying refinery to HDFS
 * 11:25 joal: Deploying refinery
 * 10:35 joal: Deploying refinery-source to archiva

2017-04-01

 * 07:26 joal: Kill old cassandra bundles and restart new one for project_v2 production codeb

2017-03-30

 * 18:02 elukey: an1039 back up again after thermal paste applied
 * 17:54 ottomata: stopping hadoop services on analytics1046 for jessie upgrade

2017-03-29

 * 17:19 nuria: restarted EL on eventlog1001 with new changeset and tables renamed
 * 17:13 nuria: deploying eventlogging latest: 28740773cea545215ea610c8c3e1a3ba36ef5a6a (UA changes)

2017-03-28

 * 14:30 elukey: analytics1028 back serving traffic - T159632

2017-03-27

 * 16:05 joal: Relaunch corrected denormalize oozie job
 * 14:06 elukey: restart hadoop-yarn-nodemanager on analytics1044
 * 13:03 elukey: fixed permissions (hdfs:hdfs -> root:root for /var/lib/hadoop/data)
 * 11:37 joal: Start manual sqoop for failed wikis (dawiki, cebwiki, srwiki)
 * 07:34 elukey: re-run mediacounts-load-wf-2017-3-24-14 from hue

2017-03-23

 * 20:17 ottomata: moved all analytics cluster cron jobs (camus and other) from analytics1027 to analytics1003: T159527
 * 20:14 ottomata: earlier today: upgraded from cdh5.2 to cdh5.10 on analytics1030, somehow we missed it! :o
 * 11:30 joal: Restart webrequest-load-wf-maps-2017-3-23-10

2017-03-21

 * 15:33 ottomata: restarting eventlogging client side processors with ImageMetrics blacklist change
 * 09:55 joal: Reset hdfs folders and hive tables and partitions for productionisation of mediawiki history
 * 09:52 joal: Restart webrequest-load bundle to pick up new pageview definition (2017-03-21T09:00Z)
 * 05:57 joal: Restart cassandra-hourly-wf-local_group_default_T_pageviews_per_project-2017-3-20-23
 * 05:54 joal: Deploying refinery

2017-03-20

 * 17:47 joal: Deploy refinery-source to archiva
 * 13:21 elukey: restarted pageview-hourly-wf-2017-3-20-11
 * 07:14 elukey: restarted webrequest-load-wf-misc-2017-3-20-3

2017-03-18

 * 19:04 joal: restart mediacounts-load-wf-2017-3-18-15 and mediacounts-load-wf-2017-3-18-16
 * 12:39 joal: Restart webrequest-load-wf-maps-2017-3-18-11
 * 08:13 elukey: restarted 18 Mar 2017 03:00:00 webrequest-load-maps

2017-03-17

 * 14:24 elukey: analytics1044 back in the cluster
 * 12:27 elukey: restarted webrequest-load-wf-text-2017-3-17-10
 * 10:51 joal: restarted mediacounts-archive-wf-2017-03-16

2017-03-16

 * 10:51 fdans: deploying aqs to production

2017-03-15

 * 16:07 elukey: Wiped AQS Beta cassandra cluster

2017-03-14

 * 18:43 nuria: rolling back prior eventlogging deployment, userAgent column is restricted to 191 chars, needs to be bigger or UAs are truncated
 * 14:14 elukey: analytics1043 back in service
 * 12:53 elukey: restarted webrequest-load-wf-upload-2017-3-14-11
 * 12:53 elukey: restarted webrequest-load-wf-text-2017-3-14-11
 * 06:53 elukey: re-run mediacounts-archive-wf-2017-03-13 from Hue (OOMs in the stderr)

2017-03-13

 * 14:36 elukey: analytics1042 back among the Hadoop workers
 * 08:53 elukey: set innodb_buffer_pool_size=2048 for mysql on bohrium (Piwik)

2017-03-12

 * 22:24 elukey: restarted webrequest-load-text 12 Mar 2017 16:00:00 and 17:00:00
 * 22:24 elukey: stopped yarn nodemanager on an1028
 * 22:17 elukey: restarted webrequest-load Maps 12 Mar 2017 14:00:00
 * 07:11 elukey: re-set SET GLOBAL max_connections=300 on bohrium's mysql (got lost after the restart)

2017-03-10

 * 15:39 elukey: applied innodb_buffer_pool_size = 512M and restarted mysql on bohrium
 * 10:54 elukey: executed set global innodb_flush_log_at_trx_commit=2; on bohrium as test

2017-03-09

 * 14:17 joal: restart failed webrequest load [upload maps misc] 2017-03-09T09:00Z
 * 11:04 elukey: an1041 yarn nodemanager back running
 * 10:31 elukey: analytics1041 yarn nodemanager stopped, chowning to yarn:yarn all the perms in /var/lib/hadoop/data/X/yarn dirs
 * 10:09 elukey: restarted yarn-nodemanager on analytics1040
 * 09:52 elukey: restarted Mar 2017 02:00:00 webrequest-load-text (second time)
 * 08:57 elukey: re-running webrequest-load-text failed jobs too via Hue
 * 08:43 elukey: re-run via Hue the failed upload-load job
 * 08:39 elukey: re-run all the failed misc webrequest-load oozie jobs (total of four)
 * 08:28 elukey: re-run 186-09 Mar 2017 00:00:00 (webrequest-load-maps) on Hue

2017-03-07

 * 15:20 joal: deploying aqs in prod
 * 14:44 joal: Deploy AQS on beta
 * 12:52 elukey: analytics1040 back in service
 * 12:50 elukey: restarted webrequest-load-wf-text-2017-3-7-9 from Hue (oozie id: 0010151-170228165458841-oozie-oozi-W mapred that failed: job_1488294419903_24496)

2017-03-03

 * 11:29 joal: Restart 3 oozie spark jobs
 * 11:02 joal: Deploying refinery after having broken stat1002 :(
 * 10:32 joal: deploying refinery
 * 09:43 joal: Deploying refinery-source v0.0.42 using jenkins

2017-03-02

 * 18:22 ottomata: deleting and recreating oozie share lib
 * 18:15 joal: Restarting webrequest load for text 2017-03-02T15:00Z
 * 14:27 joal: restart mediacounts job starting 2017-03-01T11:00Z

2017-03-01

 * 14:41 joal: Deploying refinery onto hdfs (before restarting jobs)
 * 14:38 joal: Restart all hdfs oozie jobs with 2048M launcher memory (using script)
 * 10:16 joal: Kill and restart webrequest-load-maps coordinator checking for new oozie_loader_memory parameter (starting from 2017-02-28T18:00 - 2g launcher memory)
 * 09:39 joal: Kill and restart webrequest-load-maps coordinator checking for new oozie_loader_memory parameter (starting from 2017-02-28T18:00)
 * 07:17 elukey: restarted manually the browser-general-coord failed jobs
 * 07:13 elukey: restarted manually the pageview-hourly-coord failed jobs
 * 07:09 elukey: restarted manually the pageview-druid-monthly-coord (february job failed)
 * 07:06 elukey: restarted manually via Hue UI the webrequest-load-coord-misc failed jobs
 * 06:59 elukey: restarted manually via Hue UI the webrequest-load-coord-maps failed jobs

2017-02-28

 * 18:03 joal: restart pageview oozie job for 2017-02-28T12:00
 * 17:53 elukey: restarted via Hue Feb 2017 14:00:00 webrequest-load-coord-misc/maps
 * 14:02 joal: Suspend mediawiki-load jobs as well (forgot about those)
 * 13:31 joal: Suspend webrequest-load bundle for CDH upgrade
 * 13:30 elukey: stopping camus as prep step for the CDH upgrade

2017-02-23

 * 12:18 joal: Restart cassandra-coord-pageview-per-project-hourly 2017-02-23T07, 08, 09 to recover from cassandra issue - Worked !
 * 11:19 joal: Restart cassandra-coord-pageview-per-project-hourly 2017-02-23T07 and 08 to recover from cassandra issue

2017-02-22

 * 08:06 elukey: restart Hue on an1027 for openssl upgrades

2017-02-16

 * 13:22 elukey: updated firewall rules for Analytics VLAN

2017-02-15

 * 13:55 elukey: disabled apache mod_deflate on bohrium (piwik test)
 * 09:01 elukey: restarted Piwik with bulk_requests_use_transaction=0 to try to fix the SQL deadlock issue (https://github.com/piwik/piwik/issues/6398#issuecomment-91093146)

2017-02-13

 * 21:38 elukey: Restarted webrequest-load-coord-upload 19:00 - failed and Hue returning 500s

2017-02-11

 * 00:13 joal: Restarted webrequest-load-wf-text-2017-2-10-20

2017-02-10

 * 09:53 elukey: re-enabled oozie bundles after maintenance
 * 09:51 elukey: restarted Hive-* and oozie on analytics1003
 * 09:40 elukey: suspending oozie bundles to allow oozie/hive maintenance

2017-02-09

 * 13:02 mforns: Restarted webrequest-load-bundle and pageview-hourly-coord
 * 12:46 mforns: Deployed refinery using scap, then deployed onto hdfs
 * 12:00 elukey: added Marcel as superuser in Hue
 * 11:56 elukey: stopped webrequest-load-bundle from hue
 * 11:06 mforns: Deployed refinery-source using jenkins
 * 10:48 elukey: restarting druid daemons for Java upgrades
 * 10:05 elukey: re-enabled oozie bundles after maintenance
 * 10:04 elukey: performed master failover from an1001 to an1002 (and vice-versa) for java upgrades
 * 10:04 elukey: restarted oozie, hive-server and metastore for java upgrades
 * 09:49 elukey: suspended oozie bundles temporarily to allow graceful restarts

2017-02-08

 * 18:05 ottomata: restarting pivot
 * 17:52 ottomata: restarting pivot
 * 15:35 elukey: restarted all the failed oozie cassandra load jobs

2017-02-07

 * 20:24 joal: Resubmit cassandra-coord-pageview-per-project-hourly for 2017-02-07T18:00
 * 14:36 elukey: restarted webrequest-load-wf-text-2017-2-7-13

2017-02-04

 * 13:18 joal: Restarted mediacounts-archive job for day 2017-02-03 (had failed)

2017-02-02

 * 12:07 joal: Restarted daily and monthly pageview druid loading jobs
 * 12:03 joal: Deployed refinery to correct bug introduced in https://gerrit.wikimedia.org/r/#/c/335067/
 * 10:13 joal: Killed-Restarted last access uniques monthly jobs to pick up new config - 0097552-161121120201437-oozie-oozi-C

2017-02-01

 * 19:01 joal: Killed-Restarted Mobile apps Uniques monthly jobs to pick up new config - 0096638-161121120201437-oozie-oozi-C
 * 18:47 joal: Deploy refinery for uniques monthly patches
 * 17:27 joal: Restarting 2 webrequest-load text jobs that failed during NM restart (2016-02-01T11:00 and T13:00)
 * 13:12 elukey: restarted pageview-druid-monthly-coord and pageview-druid-daily-coord oozie coordinators after deployment
 * 12:17 elukey: deployed Refinery via scap and then executed the hdfs copies on stat1002

2017-01-31

 * 16:11 elukey: started Cassandra nodetool cleanup for aqs1007-a
 * 16:04 elukey: started Cassandra nodetool cleanup for aqs1004-b
 * 08:31 elukey: started Cassandra nodetool cleanup for aqs1004-a

2017-01-26

 * 19:20 joal: Restart webrequest-load-coord-text 2017-01-26T15:00 after cluster shake
 * 19:18 elukey: restored an1001 as RM and HDFS master

2017-01-24

 * 21:30 ottomata: restarted hadoop-mapreduce-historyserver on analytics1001. it died due to OOM

2017-01-22

 * 13:27 joal: Rerun pageview-druid-daily-wf-2017-1-20 trying to see if it fixes automagically

2017-01-19

 * 15:51 joal: Launched 0080172-161121120201437-oozie-oozi-B to recover from missing webrequest-load 2017-01-18 19:00 with a correct setup this time
 * 15:39 joal: Launched 0080149-161121120201437-oozie-oozi-B to recover from missing webrequest-load 2017-01-18 19:00

2017-01-17

 * 11:16 joal: Remove mediawiki-history-beta datasource from druid
 * 09:51 elukey: restarted mediacounts-archive-wf-2017-01-16

2017-01-11

 * 19:23 joal: Start mediawiki history reconstruction job on newly sqooped data
 * 18:25 joal: Replace /wmf/data/raw/mediawiki/tables/ with newly sqooped data

2017-01-10

 * 15:30 joal: Restart 0024519-160420145651441-oozie-oozi-C for day 2017-01-09 to see if it fails again

2017-01-06

 * 20:35 joal: Launched 0063574-161121120201437-oozie-oozi-C to cover for upload-2017-01-06-[16-17]
 * 19:04 elukey: started 0063446-161121120201437-oozie-oozi-C to re-run upload-2017-1-6-17

2016-12-22

 * 15:28 elukey: changed firewall rules to allow only $ANALYTICS_NETWORKS (rather than the broader $INTERNAL) for the Yarn UI http service (an1001) and the hive metastore (an1003)

2016-12-19

 * 21:27 nuria: deployed analytics refinery, restarted webrequest load and pageview_hourly jobs
 * 20:11 nuria: deployed analytics/refinery to cluster (2nd try)

2016-12-13

 * 11:12 elukey: deleted /srv/stat1001 on stat1004

2016-12-09

 * 14:32 joal: restarted eventlogging mysql consumer after DB restart
 * 13:57 joal: Stopped EventLogging Mysql consumer for database restart

2016-12-08

 * 18:37 ottomata: preferred-replica-election on analytics kafka cluster to bring 1012 back as leader for its partitions
 * 18:15 ottomata: restarting broker on kafka1012 to repro T152674

2016-12-07

 * 21:59 ottomata: restarting eventlogging again to pick up puppet changes to use kafka-confluent writer
 * 19:39 ottomata: restarting analytics eventlogging to test out confluent kafka producer for processors

2016-12-05

 * 11:02 joal: Killing wikidata-articleplaceholder_metrics job and restarting it starting Nov. 1st for code update
 * 10:43 joal: Deploy refinery onto hdfs
 * 10:35 joal: deploying refinery

2016-12-02

 * 09:43 joal: Restarted yesterday's failed oozie webrequest-load jobs (upload, text, misc, hours 21, 22, 23)

2016-12-01

 * 20:27 ottomata: bouncing kafka broker on kafka1018 to test config changes to eventlogging analytics kafka clients
 * 20:25 ottomata: restarting eventlogging analytics processes again to pick up api_version change for consumers too
 * 19:45 ottomata: restarting eventlogging analytics processes to pick up api_version kafka arg
 * 08:02 elukey: added fi.wikivoyage to the pageview whitelist manually

2016-11-30

 * 21:32 milimetric: restarted webrequest/load oozie bundle
 * 21:17 milimetric: Deployed refinery using scap, then deployed onto hdfs
 * 20:52 milimetric: Deployed refinery-source using jenkins

2016-11-25

 * 09:16 elukey: resumed oozie bundles and camus crontab after maintenance
 * 08:49 elukey: stopping oozie and camus as prep-step for Yarn/HDFS master failover (remaining hosts with old openjdk)

2016-11-12

 * 19:23 joal: Launch 0028421-161020124223818-oozie-oozi-B to cover for webrequest-load hours 19-20 missing on 2016-11-10

2016-11-10

 * 19:59 nuria: deployed v0.0.37 of refinery to hdfs
 * 18:22 nuria: deployed v0.0.37 of refinery-source https://gerrit.wikimedia.org/r/#/c/320797/

2016-11-08

 * 12:33 joal: Deploying refinery for patching pageview whitelist

2016-11-07

 * 09:45 elukey: started 0022558-161020124223818-oozie-oozi-C to rerun wf-text-2016-11-7-07
 * 08:00 elukey: started 0022441-161020124223818-oozie-oozi-C to rerun wf-text-2016-11-7-04 -> 06
 * 04:53 joal: started 0022249-161020124223818-oozie-oozi-C to rerun wf-text-2016-11-7-00 -> 03

2016-11-06

 * 19:50 joal: started 0021806-161020124223818-oozie-oozi-C to rerun wf-text-2016-11-6-16
 * 17:39 elukey: started 0021694-161020124223818-oozie-oozi-C to rerun wf-text-2016-11-6-15
 * 09:27 joal: started 0021136-161020124223818-oozie-oozi-C to re-run wf-text-2016-11-6-01 -> 07

2016-11-05

 * 18:05 joal: started 0020254-161020124223818-oozie-oozi-C to re-run wf-text-2016-11-5-10
 * 08:47 joal: started 0019693-161020124223818-oozie-oozi-C to re-run wf-text-2016-11-5-00 -> wf-upload-2016-11-5-07
 * 08:45 joal: started 0019686-161020124223818-oozie-oozi-C to re-run wf-text-2016-11-4-19 -> wf-upload-2016-11-4-20

2016-11-04

 * 08:45 elukey: started 0018557-161020124223818-oozie-oozi-C to re-run wf-upload-2016-11-4-6
 * 08:45 elukey: started 0018549-161020124223818-oozie-oozi-C to re-run wf-upload-2016-11-4-2 -> wf-upload-2016-11-4-4

2016-11-02

 * 19:43 ottomata: manually stopped an old wikistats_git pageviews cron in spetrea's crontab on stat1003. no output from it since 2013, and spetrea doesn't really have an account

2016-11-01

 * 17:52 joal: Deploying refinery
 * 14:45 joal: Restart webrequest load job to apply
 * 14:33 joal: deploying refinery onto the cluster
 * 14:00 ottomata: restarting pivot

2016-10-31

 * 17:09 ottomata: bouncing eventlogging
 * 17:00 ottomata: kafka preferred replica election on main-eqiad kafka cluster to promote kafka1003 as leader for its preferred partitions
 * 14:49 ottomata: adding kafka1003 in as replicas for active main-eqiad topics
 * 14:12 ottomata: adding kafka1003 as kafka broker in main-eqiad cluster
 * 14:00 joal: deploy refinery

2016-10-28

 * 13:04 elukey: oozie firewall rules changed - now only the analytics network is allowed
 * 00:19 bd808: Testing logging to mw.o SAL via stashbot

2016-09-23

 * 09:06 elukey: reboot eventlog2001.codfw.wmnet for kernel upgrades
 * 08:45 elukey: upgrading varnishkafka to 1.0.12-1 in cache:misc
 * 08:32 elukey: upgrading varnishkafka to 1.0.12-1 in cache:maps

2016-09-22

 * 15:30 elukey: analytics1001 is back Yarn/HDFS master
 * 13:16 elukey: previous comment was meant to be read as "set a permanent read only = false"
 * 13:16 elukey: set read_only = false (on startup) for the analytics1003's mariadb instance
 * 13:12 elukey: restarted oozie jobs for 2016-9-22-6
 * 12:50 elukey: varnishkafka 1.0.12 installed in cache:upload ulsfo and eqiad
 * 11:04 elukey: re-enabling oozie and camus after cluster reboots
 * 10:57 elukey: rebooted analytics1001
 * 10:55 elukey: Failover from analytics1001 to analytics1002 as prep step for 1001's reboot
 * 10:28 elukey: setting global read_only = 0 to analytics1003 mariadb instance
 * 10:04 elukey: rebooted analytics1003 (oozie, hive-metastore and hive-server2 daemons affected)
 * 09:51 elukey: executed aptitude remove apache2 on analytics1027 (we use nginx in front of hue, apache steals port 8888 from hue and it does not start)
 * 09:49 elukey: suspended all oozie bundles as prep step to reboot analytics1003
 * 09:39 elukey: rebooted analytics1027
 * 09:14 elukey: varnishkafka 1.0.12 installed in cache:upload codfw
 * 08:52 elukey: varnishkafka 1.0.12 installed in cache:upload esams
 * 06:45 elukey: stopped camus on analytics1027 and suspended webrequest-load-bundle via Hue (prep step for reboots)

2016-09-21

 * 17:43 elukey: installed varnishkafka 1.0.12-1 on cp3034.esams
 * 06:25 elukey: removed aqs100[123] from live traffic

2016-09-20

 * 17:03 elukey: aqs100[56] added to LVS and serving live traffic
 * 16:22 elukey: restarting cassandra on aqs1005
 * 07:41 elukey: restart cassandra on aqs100[456] for T130861 - only aqs1004 is taking live traffic

2016-09-16

 * 09:24 elukey: added aqs100[456] to conftool-data (not pooled but the load balancer is doing health checks)

2016-09-14

 * 16:07 elukey: cassandra on aqs100[123] restarted for T130861

2016-09-12

 * 18:54 ottomata: reenabled camus with new version of camus checker jar
 * 18:41 ottomata: disabled camus crons on analytics1027
 * 09:48 elukey: restarted pivot on a tmux session on stat1002 since it died

2016-09-09

 * 08:32 elukey: executed apt-get clean on analytics1032 to free space

2016-09-08

 * 15:37 ottomata: deploying refinery with v0.0.35 of refinery source
 * 09:54 elukey: removed duplicates from the hdfs crontab on analytics1027

2016-09-05

 * 13:23 elukey: removed the unused analytics-root group from puppet

2016-08-31

 * 09:18 elukey: deleted /var/www/limn-public-data/caching on stat1001 to free space
 * 09:10 elukey: Moved stat1003:/srv/reportupdater/output/caching to /home/elukey/caching as temporary measure to free space on stat1001
 * 07:54 elukey: removed /home/home dir from stat1001 to free space
 * 07:52 elukey: removed /home/home/home dir from stat1001 to free space

2016-08-30

 * 17:45 joal: Drop pageviews test datasource in druid

2016-08-26

 * 13:52 elukey: re-enabling camus and oozie
 * 13:48 elukey: restarted hadoop-hdfs-namenode on analytics1002 (1001 back to active)
 * 13:45 elukey: restarted yarn-resourcemanager on analytics1002 (1001 back to active)
 * 13:33 elukey: restarted hadoop-hdfs-namenode on analytics1001
 * 13:30 elukey: restarted yarn-resourcemanager on analytics1001
 * 13:09 elukey: oozie, hive-server and hive-metastore restarted for security upgrades
 * 11:32 elukey: stopped camus on analytics1027
 * 11:31 elukey: suspended all the oozie bundles via Hue

2016-08-12

 * 14:40 elukey: created the 'aqsloader' user on aqs100[456] cassandra instances following https://wikitech.wikimedia.org/wiki/User:Elukey/Analytics/AQS_Tasks
 * 14:09 joal: Deploy refinery on hadoop
 * 13:51 joal: Deploy refinery from tin

2016-08-10

 * 15:41 joal: Loading 2016-07 in new aqs

2016-08-09

 * 17:48 ottomata: restarting eventlogging with kafka-python 1.3.1 (and bugfix), will be testing kafka broker restarts again today
 * 13:12 elukey: deploying the aqs cassandra user to aqs100[123] (not using it in aqs-restbase yet)
 * 13:10 elukey: deploying the aqs cassandra user to aqs100[456] (not using it in aqs-restbase yet)

2016-08-08

 * 18:54 ottomata: restarting eventlogging with processors retries=6&retry_backoff_ms=200. if this works better, will puppetize.
 * 18:30 ottomata: restarting kafka broker on kafka1013 to test eventlogging leader rebalances
 * 15:13 ottomata: deploying eventlogging/analytics - kafka-python 1.3.0 for both consumers and producers
 * 14:13 joal: Loading 2016-06 in clean new aqs
 * 14:10 joal: Adding test data onto newly wiped aqs cluster
 * 14:06 joal: Updating cassandra compaction to deflate on newly wiped cluster

2016-08-05

 * 15:39 joal: Restart oozie jobs for druid loading from production refinery instead of joal
 * 14:31 joal: Retrying deploying refinery from scap
 * 13:51 joal: Stopping pagecounts-[raw|all-sites] oozie jobs (load and archive)
 * 13:07 joal: Deploying refinery using scap
 * 12:59 joal: Rolled back refinery interactive deploy
 * 12:54 joal: Deploy refinery using brand new scap deploy !
 * 07:42 elukey: ran apt-get clean on analytics1027 to free space

2016-08-04

 * 19:50 ottomata: now running kafka-python 1.2.5 for eventlogging-service-eventbus in codfw, removed downtime for kafka200[12]
 * 17:36 elukey: added the analytics-deploy key to the Keyholder for the Analytics Refinery scap3 migration (also updated https://wikitech.wikimedia.org/wiki/Keyholder)
 * 17:29 elukey: deploying the refinery with scap3 for the first time on all nodes

2016-07-29

 * 01:55 milimetric: limn1 disk full, no idea how to clean it because /public refuses to list its files or listen to me when I try to delete it

2016-07-28

 * 17:37 ottomata: powercycling analytics1032

2016-07-26

 * 10:13 joal: Re-deploying refinery after bug fix
 * 09:26 joal: Deploying refinery
 * 08:41 joal: Deploying refinery-source using Jenkins

2016-07-25

 * 18:31 ottomata: upgrading kafka to 0.9 in main-codfw, first kafka2001 then 2002

2016-07-20

 * 19:40 joal: Relaunch 2016-07-19 cassandra per-article-daily oozie job
 * 15:45 elukey: executed https://phabricator.wikimedia.org/P3520 on aqs100[456] for both a/b cassandra instances
 * 15:33 elukey: raising compaction throughput to 256 on aqs100[456]

2016-07-18

 * 17:16 joal: Change compression from lz4 to deflate on aqs100[456]
 * 17:16 joal: Change compression from lz4 to deflate
 * 08:59 joal: deploy restbase on aqs100[23]
 * 08:36 elukey: re-executed cassandra-daily-wf-local_group_default_T_pageviews_per_article_flat-2016-7-16 (failed oozie job)

2016-06-08

 * 08:45 elukey: removed temporary retention override for kafka webrequest_text topic (T136690)
 * 08:17 elukey: lowering down webrequest_text kafka topic retention time from 7 days to 4 days to free disk space

2016-06-07

 * 17:51 ottomata: restarting broker on kafka1020
 * 10:10 elukey: hue restarted on analytics1027 for security upgrades

2016-06-06

 * 19:16 ottomata: restarting kafka broker on kafka1020 to test python consumption client

2016-06-04

 * 09:47 elukey: removed temporary Analytics Kafka upload retention override (T136690)
 * 09:38 elukey: Lowering down temporarily the Analytics kafka upload retention time to 24h to free space (T136690)

2016-06-03

 * 08:38 elukey: event logging restarted on eventlog1001
 * 08:34 elukey: rebooting kafka1012 for kernel upgrades.

2016-06-02

 * 19:53 ottomata: stopping kafka broker and restarting kafka1014

2016-06-01

 * 18:16 ottomata: stopping kafka broker on kafka1018 and rebooting node
 * 11:55 elukey: restarted EL on eventlog1001
 * 11:51 elukey: rebooting kafka1022 for kernel upgrades
 * 08:26 elukey: deleted very old kafka.log files in /var/log/kafka to free root space
 * 07:54 elukey: EL restarted on eventlog1001
 * 07:47 elukey: stopping kafka on kafka1020.eqiad and rebooting the host for Linux 4.4 upgrades

2016-05-27

 * 11:28 elukey: restarted jmxtrans on kafka10* hosts
 * 11:26 elukey: restarted jmxtrans on kafka1013
 * 11:21 elukey: executed kafka preferred-replica-election on kafka1013

2016-05-25

 * 14:24 joal: deploying aqs from tin
 * 14:16 joal: Deploying aqs into aqs_deploy

2016-05-24

 * 19:25 nuria_: deploying latest master to dashiki 08cc9a2545bcc0a183a3c00c18e81f21326a41b
 * 12:56 elukey: EL restarted after kafka1013 node stop (kernel upgrades)
 * 12:50 elukey: stopping kafka on kafka1013 and rebooting the host for kernel upgrade

2016-05-23

 * 17:28 elukey: re-run from Hue webrequest-load-wf-(text|upload)-2016-5-23-13. The failures were likely caused by my global Yarn restart on the cluster.
 * 17:20 elukey: oozie bundles re-enabled
 * 14:58 elukey: suspended all the oozie bundles as prep step for https://gerrit.wikimedia.org/r/#/c/290252 (yes I know super paranoid mode on)
 * 06:42 elukey: Removed Kafka temp. override for webrequest_upload retention.ms after freeing some disk space.
 * 06:37 elukey: Set kafka retention.ms=172800000 for the topic webrequest_upload to free some disk space on kafka1022

2016-05-20

 * 12:50 elukey: aqs100[123] restarted for openjdk upgrades
 * 08:53 elukey: cassandra upgraded to 2.1.13 on aqs1003
 * 08:30 elukey: aqs1002 migrated to cassandra 2.1.13

2016-05-02

 * 18:30 joal: manually touch _SUCCESS file in hdfs://analytics-hadoop/wmf/data/raw/webrequest/webrequest_text/hourly/2016/05/02/14/ to launch refine process despite load job failure
 * 17:38 elukey: removed out of service banner from dashiki dashboards
 * 17:33 elukey: reverted Varnish config to return 503s for datasets and stats
 * 12:14 elukey: deployed Varnish change to force HTTP 503 for datasets.wikimedia.org, stats.wikimedia.org, metrics.wikimedia.org as prep-step for OS reimage.
 * 12:05 elukey: enabled maintenance banner to dashiki based dashboards via https://meta.wikimedia.org/wiki/Dashiki:OutOfService
 * 11:21 elukey: deployed last version of Event Logging. Service also restarted.

2016-04-30

 * 13:42 elukey: disabled puppet on analytics1047 and scheduled downtime for the host, IO errors in the dmesg for /dev/sdd. Stopped also Hadoop daemons to remove it from the cluster temporarily (not sure how to do it properly, will write docs).

2016-04-28

 * 10:44 joal: deployed aqs on all three nodes (Thanks elukey !!!!)
 * 09:03 joal: Deploying aqs on aqs1001
 * 08:14 elukey: restarting kafka on kafka{1012,1014,1022,1020,2001,2002} for Java upgrades. EL will be restarted as well (sigh)

2016-04-27

 * 15:47 elukey: restarted event logging on eventlogging1001
 * 14:01 elukey: restarted Event Logging on eventlogging1001
 * 13:53 elukey: restarted kafka on kafka1018.eqiad.wmnet for Java upgrades

2016-04-25

 * 19:55 nuria_: deployed new vitalsigns code to https://vital-signs.wmflabs.org
 * 17:43 nuria_: deployed new vitalsigns code to https://vital-signs.wmflabs.org

2016-04-22

 * 09:23 moritzm: installing irqbalance bugfix updates (preventing massive logspam on some systems)

2016-04-20

 * 16:06 elukey: camus re-enabled on analytics1027
 * 13:54 elukey: puppet stopped on analytics1027 together with Camus (via crontab -e)
 * 10:41 elukey: started rsync of /srv from stat1001 to stat1004 (/srv/stat1001)

2016-04-19

 * 08:33 joal: deployed new refinery on hadoop
 * 08:21 joal: deploying refinery from tin

2016-04-18

 * 10:11 elukey: execute sudo eventloggingctl restart on eventlogging1001

2016-04-13

 * 16:35 ottomata: rebuilding raid1 array on aqs1001 after hot swapping sdh
 * 15:00 joal: restarting failed jobs
 * 14:38 ottomata: restarting hadoop-yarn-nodemanager on all hadoop worker nodes one by one to apply increase in heap size

2016-04-11

 * 11:52 joal: Restart refine job after deploy
 * 10:30 joal: Deploying refinery on HDFS
 * 10:21 joal: deploying refinery from tin
 * 09:13 joal: Releasing refinery-source v0.0.30 to archiva

2016-04-08

 * 10:09 joal: deploying aqs from tin on aqs1003
 * 10:08 joal: deploying aqs from tin on aqs1002
 * 10:03 joal: deploying aqs from tin on aqs1001

2016-04-07

 * 22:58 nuria_: deployed browser-reports master branch to labs
 * 19:34 ottomata: restarting eventlogging so it runs out of the scap deploy in eventlogging/analytics
 * 10:21 elukey: nodejs-legacy upgraded too on all aqs nodes
 * 09:43 elukey: aqs1002.eqiad.wmnet re-pooled, aqs1003.eqiad.wmnet de-pooled/re-pooled too (nodejs upgrade)
 * 09:30 elukey: aqs1002.eqiad.wmnet de-pooled via confctl. Nodejs upgrade will follow.
 * 09:18 elukey: re-added aqs1001.eqiad.wmnet to LVS pool via confctl
 * 08:59 elukey: removed aqs1001.eqiad.wmnet from LVS pool via confd for nodejs upgrade

2016-04-06

 * 14:04 elukey: ran nodetool repair system_auth on aqs1002.eqiad/aqs1003.eqiad
 * 13:59 elukey: ran nodetool repair system_auth on aqs1001.eqiad
 * 11:45 elukey: started nodetool repair on aqs1002 after running "ALTER KEYSPACE system_auth WITH replication = { 'class': 'SimpleStrategy', 'replication_factor': 3 };"

2016-04-04

 * 15:45 elukey: aqs1001 re-added to the aqs pool (nodejs NOT upgraded)
 * 14:46 elukey: de-pooled aqs1001.eqiad from the confd pool for nodejs upgrade
 * 10:42 elukey: re-pooled aqs1001.eqiad (no node upgrade, need more info about restbase)
 * 09:53 elukey: de-pooled aqs1001.eqiad.wmnet as pre-step for nodejs upgrade

2016-04-01

 * 13:38 joal: Deploying aqs in aqs1003 from tin
 * 13:35 joal: Deploying aqs in aqs1002 from tin
 * 13:23 joal: Deploying aqs in aqs1001 from tin

2016-03-31

 * 20:01 ottomata: stopping eventlogging, uninstalling globally installed eventlogging python code, running puppet, restarting eventlogging from /srv/deployment/eventlogging/eventlogging
 * 19:45 ottomata: merging puppet change to run eventlogging code out of deploy repo

2016-03-30

 * 18:06 ottomata: repooling aqs1001
 * 18:00 ottomata: depooling aqs1001

2016-03-29

 * 13:27 joal: Update CirrusSearchRequestSet schema in hive

2016-03-24

 * 18:29 elukey: camus and puppet re-enabled on analytics1027
 * 18:27 ottomata: resuming suspended webrequest load and refine jobs
 * 17:57 elukey: enabled Hadoop Master Node automatic failover on analytics1001/1002 (this time without fireworks).
 * 17:09 ottomata: temporarily suspending oozie webrequest refine jobs
 * 16:18 ottomata: suspending webrequest load job temporarily
 * 16:15 elukey: disabled camus and puppet on analytics1027
 * 13:16 elukey: camus and puppet re-enabled on analytics1027
 * 09:56 elukey: Camus stopped on analytics1027 (puppet disabled too)
 * 09:52 elukey: puppet disabled on analytics1001/1002 as pre-step to enable HDFS HA failover.

2016-01-21

 * 16:35 ottomata: stopped eventlogging mysql consumers for long downtime: https://phabricator.wikimedia.org/T120187
 * 16:20 ottomata: started eventlogging mysql consumers
 * 15:59 ottomata: stopping eventlogging mysql consumers for https://phabricator.wikimedia.org/T123546

2016-01-20

 * 18:30 mforns: deployed EL in production with removal of queue
 * 17:37 mforns: restarted EventLogging because of Kafka consumption lag

2016-01-19

 * 20:08 mforns: deployed eventlogging to deployment-eventlogging03 with removal of mysql consumer batch

2016-01-18

 * 14:49 ottomata: restarting eventlogging to un-blacklist MobileWebSectionUsage
 * 01:07 ottomata: restarted eventlogging again. A single raw client side processor consumer seemed stuck (according to burrow).  seeing offset commit errors in logs.

2016-01-17

 * 08:26 ottomata: restarting eventlogging to see if it'll help burrow reported kafka consumer lag

2016-01-14

 * 22:29 YuviPanda: wikimetrics
 * 19:55 ottomata: restarted eventlogging_sync script to insert batches of 1000

2016-01-13

 * 20:01 ottomata: dropped MobileWebSectionUsage_14321266 and MobileWebSectionUsage_15038458 from analytics-store eventlogging slave db
 * 19:24 ottomata: restarting eventlogging to apply blacklist of MobileWebSectionUsage schemas

2015-12-30

 * 15:23 ottomata: killing oozie legacy_tsv job 0102159-150605005438095-oozie-oozi-B to restart it without mobile, 5xx-mobile and zero outputs

2015-11-10

 * 03:14 ottomata: restarted eventlogging

2015-11-09

 * 14:40 ottomata: restarting eventlogging to see if it is ok after enabling firewall rules on kafka1014

2015-11-06

 * 15:51 joal: Change replication factor to 2 in cassandra per_article_flat keyspace
 * 15:47 ottomata: deploying aqs

2015-11-05

 * 18:24 ottomata: deploying aqs

2015-10-29

 * 10:35 joal: Gzipped already archived pageview files
 * 10:34 joal: restarted pageview job to archive gzipped files
 * 10:34 joal: refinery deployed

2015-10-28

 * 19:16 joal: Downsizing cassandra replication from 3 to 2 on per_article_flat keyspace
 * 19:07 joal: Restart load job (based on IMPORTED flag)
 * 15:48 joal: Deploying refinery
 * 15:40 joal: deploying refinery-source v0.0.22

2015-10-27

 * 19:06 ottomata: deploying aqs
 * 18:24 joal: deploying refinery
 * 16:46 joal: Releasing refinery-source v0.0.21
 * 10:34 joal: manual aggregator launch after small bug correction

2015-10-26

 * 18:42 joal: refine bundle, pageview_hourly and projectview_hourly coord restarted
 * 18:41 joal: refinery deployed on HDFS
 * 14:33 joal: truncating "local_group_default_T_pageviews_per_article".data on aqs
 * 09:58 joal: Restart cassandra on aqs1001

2015-10-22

 * 20:24 ottomata: deploying aqs
 * 09:51 joal: restart cassandra on aqs1003

2015-10-21

 * 22:53 milimetric: deployed EventLogging and tried to backfill data lost on 2015.10.14 but failed
 * 18:24 joal: Stopped per article loading in cassandra
 * 13:39 ottomata: deploying aqs

2015-10-20

 * 10:12 joal: restart cassandra on aqs1002

2015-10-19

 * 18:35 ottomata: restarting eventlogging with change to parse schema names out of errored events

2015-10-16

 * 20:38 joal: restarted cassandra on aqs100[1,2,3]

2015-10-15

 * 12:17 joal: Refinery deploy needed before restart --> Deploying
 * 12:12 joal: Restarting daily and monthly mobile unique coordinators with new patch
 * 12:12 joal: Rerunning daily mobile unique jobs for days 2015-08-[03,04,11,12,12,14,17], 2015-09-16
 * 12:10 joal: Stopped daily and monthly mobile unique coordinators

2015-10-14

 * 15:22 ottomata: restarting lagging eventlogging mysql consumer

2015-10-09

 * 19:26 ottomata: releasing refinery 0.20
 * 15:19 ottomata: moved camus property files out of refinery repository and into puppet. Camus properties now live on an27 at /etc/camus.d, and camus log files are in /var/log/camus
 * 14:54 joal: Cassandra restarted on aqs1003
 * 09:15 joal: Restart cassandra on aqs1002

2015-10-08

 * 17:38 joal: Backfilling load from hadoop to cassandra from beginning of october

2015-10-07

 * 16:32 joal: Started cassandra load jobs from 2015-10-01

2015-10-01

 * 16:27 valhallasw`cloud: testing again
 * 16:13 valhallasw`cloud: test

2015-09-29

 * 10:51 joal: cluster back to normal state. Some errors are still not explained, need to be careful.

2015-09-28

 * 14:56 joal: backfilling various load jobs having failed at earlier stages than check_sequence_statistics
 * 13:03 joal: Errors on cluster, some refine jobs have failed, investigating.

2015-08-19

 * 18:20 ottomata: does this log work?

March 25

 * 22:09 qchris: starting HDFS balance for unhealthy node analytics1016.eqiad.wmnet with healthy nodes analytics1037.eqiad.wmnet,analytics1040.eqiad.wmnet

February 25

 * 16:07 ottomata: hello?

February 7

 * 02:10 qchris: Ran kafka leader re-election as analytics1021 dropped out of its partition leader role.
 * 01:32 qchris: name nodes died with error "Java heap space" and did not come back up. Bumping heap allowed to resurrect them (See ).

February 4

 * 23:22 qchris: Manual failover of Hadoop namenode from analytics1001 to analytics1002, as analytics1001 had Heap space errors
 * 07:49 qchris: Manual failover of Hadoop namenode from analytics1002 to analytics1001, as analytics1002 had Heap space errors

January 30

 * 20:21 ottomata: deployed refinery 0.0.4
 * 19:37 ottomata: released refinery 0.0.4

January 25

 * 21:53 qchris: Marked raw text webrequest partition for 2015-01-24T00/1H ok (See )

January 23

 * 22:46 qchris: Marked raw upload webrequest partition for 2015-01-16T12/1H ok (The partition only needed deduping)
 * 22:23 qchris: Marked raw upload webrequest partition for 2015-01-16T01/1H ok (The partition only needed deduping)
 * 22:11 qchris: Marked raw upload webrequest partition for 2015-01-15T17/1H ok (The partition only needed deduping)
 * 22:04 qchris: Marked raw text webrequest partition for 2015-01-15T15/1H ok (The partition only needed deduping)
 * 22:01 qchris: Marked raw mobile webrequest partition for 2015-01-16T01/1H ok (The partition only needed deduping)

January 15

 * 08:25 qchris: Ran kafka leader re-election to bring analytics1021 back into the set of leaders

January 10

 * 16:55 qchris: Dropped wmf.webstats tables, as announced on https://lists.wikimedia.org/pipermail/analytics/2015-January/003019.html

January 6

 * 12:15 qchris: Marked raw mobile+text webrequest partitions for 2015-01-05T17/1H ok (See )

January 4

 * 12:06 qchris: Marked raw mobile and upload webrequest partition for 2015-01-03T10/1H ok (See )

January 2

 * 21:21 qchris: Ran kafka leader re-election to bring analytics1021 back into the set of leaders
 * 21:07 qchris: Marked raw bits, text, and upload webrequest partition for 2014-12-11T14/1H ok (See )
 * 19:05 qchris: Marked raw text+upload webrequest partitions for 2014-12-26T06/1H ok (See )
 * 15:51 qchris: Marked raw text webrequest partition for 2014-12-11T20/1H ok (See )
 * 12:39 qchris: Marked raw mobile webrequest partition for 2014-12-29T17/1H ok (See )
 * 11:21 qchris: Marked raw text webrequest partition for 2014-12-30T20/1H ok (See )

January 1

 * 20:26 qchris: Marked raw webrequest partitions for 2014-12-10T14/2H ok (See )