Analytics/Server Admin Log

2018-04-23

  • 14:10 ottomata: switching main -> analytics MirrorMaker to --new.consumer (temporarily stopping puppet on kafka101[234]) https://phabricator.wikimedia.org/T192387
  • 13:54 elukey: reimage analytics1067 to Debian stretch
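
A minimal sketch of the sequence behind the 14:10 entry above, assuming the usual puppet agent workflow; the systemd unit name is illustrative only:

    # on each of kafka1012-1014: keep puppet from reverting the change mid-switch
    sudo puppet agent --disable 'switching main -> analytics MirrorMaker to new.consumer (T192387)'
    # restart the local MirrorMaker instance so it comes back with --new.consumer
    sudo systemctl restart kafka-mirror-maker@main-to-analytics    # hypothetical unit name
    # once all three hosts are switched and healthy
    sudo puppet agent --enable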

2018-04-20

  • 18:23 joal: Drop/recreate wmf.mediawiki_user_history and wmf.mediawiki_page_history for T188669
  • 14:17 elukey: d-[1,2,3] hosts in the analytics labs project upgraded to druid 0.10
  • 11:37 fdans: manually uploaded refinery whitelist to hdfs
  • 11:33 elukey: reimage analytics1068 to Debian stretch

2018-04-19

  • 20:39 milimetric: launched virtual pageviews job, it has id 0026169-180330093100664-oozie-oozi-C
  • 20:36 milimetric: Synced latest refinery version to HDFS
  • 17:35 fdans: refinery deployment - sync to hdfs finished
  • 16:27 elukey: analytics1069 reimaged to Debian stretch
  • 15:40 fdans: deploying refinery
  • 14:30 elukey: disabled druid1001's middlemanager, restarted 1002's
  • 14:19 elukey: add 60G /srv partition to hadoop-coordinator-1 in analytics labs
  • 14:04 elukey: disabled druid1002's worker as prep step for restart - jvms with an old version running realtime indexation

2018-04-16

  • 10:04 joal: Restart metrics job after table update
  • 09:54 joal: Update wmf.mediawiki_metrics table for T190058
  • 08:41 joal: Restart Mediawiki-history job after new patches
  • 08:35 joal: Restarting wikidata-articleplaceholder oozie job after last week's failures
  • 08:29 joal: Deploying refinery onto HDFS
  • 08:22 joal: Deploying refinery from tin
  • 08:03 joal: Correction - Deploying refinery-source v0.0.62 using Jenkins !
  • 08:03 joal: Deploying refinery source v0.0.62 from tin

2018-04-12

  • 20:34 ottomata: replacing references to dataset1001.wikimedia.org:: with /srv/dumps in stat1005:~ezachte/wikistats/dammit.lt/bash: for f in $(sudo grep -l dataset1001.wikimedia.org *); do sudo sed -i 's@dataset1001.wikimedia.org::@/srv/dumps/@g' $f; done T189283
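
The 20:34 one-liner above, reformatted for readability (same commands, run with sudo from stat1005 in ~ezachte/wikistats/dammit.lt):

    # rewrite rsync-style references to dataset1001.wikimedia.org:: so they point
    # at the locally mounted /srv/dumps/ instead (T189283)
    for f in $(sudo grep -l dataset1001.wikimedia.org *); do
        sudo sed -i 's@dataset1001.wikimedia.org::@/srv/dumps/@g' "$f"
    done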

2018-04-11

  • 16:48 elukey: restart hadoop namenodes to pick up HDFS trash settings
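
HDFS trash behaviour is governed by fs.trash.interval and fs.trash.checkpoint.interval in core-site.xml; once the NameNodes are back, the active values can be checked with, for example:

    # print the trash settings the running configuration resolves to
    hdfs getconf -confKey fs.trash.interval
    hdfs getconf -confKey fs.trash.checkpoint.interval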

2018-04-10

  • 22:43 joal: Deploying refinery with scap
  • 22:42 joal: Refinery-source 0.0.61 deployed on archiva
  • 20:43 ottomata: bouncing main -> jumbo mirrormakers to blacklist job topics until we have time to investigate more
  • 20:38 ottomata: restarted event* camus and refine cron jobs, puppet is reenabled on analytics1003
  • 20:14 ottomata: restart mirrormakers main -> jumbo (AGAIN)
  • 19:26 ottomata: restarted camus-webrequest and camus-mediawiki (avro) camus jobs
  • 18:18 ottomata: restarting all hadoop nodemanagers, 3 at a time to pick up spark2-yarn-shuffle.jar T159962
  • 18:06 joal: Deploy refinery to HDFS
  • 17:46 joal: Refinery source 0.0.60 deployed to archiva
  • 15:42 ottomata: disable puppet on analytics1003 and stop camus crons in preparation for spark 2 upgrade
  • 14:25 ottomata: bouncing all main -> jumbo mirror makers, they look stuck!
  • 09:00 elukey: restart eventlogging mysql consumers on eventlog1002 to pick up new DNS changes for m4-master - T188991
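
For the 18:18 entry above, a hedged sketch of one rolling-restart batch; the service name follows standard Hadoop packaging and is an assumption:

    # on each of the three workers in the current batch
    sudo systemctl restart hadoop-yarn-nodemanager    # unit name assumed
    # before starting the next batch, check that the nodes re-registered with the ResourceManager
    yarn node -list -states RUNNING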

2018-04-09

  • 07:15 elukey: upgrade kafka burrow on kafkamon*

2018-04-06

  • 17:14 joal: Launch manual mediawiki-history-reduced job to test memory setting (and index new data) -- mediawiki-history-reduced-wf-2018-03
  • 13:39 joal: Rerun mediawiki-history-druid-wf-2018-03
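
Reruns like these are usually triggered from Hue; the equivalent oozie CLI call looks roughly as follows (the oozie URL, coordinator id, and action number are placeholders):

    # rerun one failed action of a coordinator
    oozie job -oozie http://localhost:11000/oozie \
        -rerun 0000000-000000000000000-oozie-oozi-C -action 3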

2018-04-05

  • 19:24 ottomata: upgrading spark2 to spark 2.3
  • 13:43 mforns: created success files in /wmf/data/raw/mediawiki/tables/<table>/snapshot=2018-03 for <table> in revision, logging, pagelinks
  • 13:38 mforns: copied sqooped data for mediawiki history from /user/mforns over to /wmf/data/raw/mediawiki/tables/ for enwiki, table: revision
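
A hedged sketch of the 13:38/13:39 entries above (the layout under /user/mforns and the _SUCCESS file name are assumptions; the entry only says "success files"):

    # copy the manually sqooped enwiki revision data into the canonical raw location
    hdfs dfs -cp /user/mforns/mediawiki/tables/revision/snapshot=2018-03/wiki_db=enwiki \
        /wmf/data/raw/mediawiki/tables/revision/snapshot=2018-03/
    # create the empty flag files downstream jobs wait for
    for t in revision logging pagelinks; do
        hdfs dfs -touchz "/wmf/data/raw/mediawiki/tables/$t/snapshot=2018-03/_SUCCESS"
    done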

2018-04-04

  • 21:07 mforns: copied sqooped data for mediawiki history from /user/mforns over to /wmf/data/raw/mediawiki/tables/ for wikidatawiki and commonswiki, tables: revision, logging and pagelinks
  • 16:06 elukey: killed banner-impression related jvms on an1003 to finish openjdk-8 upgrades (they should be brought back via cron)

2018-04-03

  • 20:11 ottomata: bouncing main -> jumbo mirrormaker to apply batch.size = 65536
  • 19:32 ottomata: bouncing main -> jumbo MirrorMaker unsetting session.timeout.ms, this has a restriction on the broker in 0.9 :(
  • 19:22 ottomata: bouncing main -> jumbo MirrorMaker setting session.timeout.ms = 125000
  • 18:46 ottomata: restart main -> jumbo MirrorMaker with request.timeout.ms = 2 minutes
  • 15:26 elukey: manually run hdfs balancer on an1003 (tmux session)
  • 15:25 elukey: killed a jvm belonging to hdfs-balancer stuck from march 9th
  • 13:48 ottomata: re-enable job queue topic mirroring from main -> eqiad
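
The 18:46-20:11 MirrorMaker tuning above maps onto standard Kafka client properties; an illustrative properties-file fragment with the values from the entries (in practice these are managed via puppet):

    # producer side of the main -> jumbo MirrorMaker (illustrative fragment)
    request.timeout.ms=120000
    batch.size=65536
    # consumer side: session.timeout.ms=125000 was tried and then unset again,
    # since the 0.9 main-eqiad brokers enforce group.max.session.timeout.ms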

2018-04-02

  • 22:28 ottomata: bounce mirror maker to pick up client_id config changes
  • 20:55 ottomata: deployed multi-instance mirrormaker for main -> jumbo. 4 per host == 12 total processes
  • 11:25 joal: Repair cu_changes hive table after successful sqoop import and add _PARTITIONED file for oozie jobs to launch
  • 08:33 joal: rerun wikidata-specialentitydata_metrics-wf-2018-4-1

2018-03-30

  • 13:48 elukey: restart overlord+middlemanager on druid100[23] to avoid consistency issues
  • 13:41 elukey: restart overlord+middlemanager on druid1001 after failures in real time indexing (overlord leader)
  • 09:44 elukey: re-enable camus
  • 08:26 elukey: stopped camus to drain the cluster - prep for easy restart of analytics1003's jvm daemons

2018-03-29

  • 20:55 milimetric: accidentally killed mediawiki-geowiki-monthly-coord, and then restarted it
  • 20:12 ottomata: blacklisted mediawiki.job topics from main -> jumbo MirrorMaker again, don't want to page over the weekend while this still is not stable. T189464
  • 07:30 joal: Manually repairing hive mediawiki_private_cu_changes table after manual sqooping of 2018-01 data, and adding _PARTITIONED file to the folder
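
A sketch of what the manual repair above plausibly involved; MSCK REPAIR is one way to register manually imported partitions, and the database name and flag-file directory are assumptions:

    # register the newly sqooped 2018-01 partition(s) with the Hive metastore
    hive -e 'MSCK REPAIR TABLE wmf_raw.mediawiki_private_cu_changes;'
    # then drop the flag file the oozie jobs poll for
    hdfs dfs -touchz "$CU_CHANGES_MONTH_DIR/_PARTITIONED"    # directory not recorded in the entry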

2018-03-28

  • 19:39 ottomata: bouncing main -> jumbo mirrormaker to apply increase in consumer num.streams
  • 19:21 milimetric: synced refinery to hdfs (only python changes but just so we have latest)
  • 19:20 joal: Start Geowiki jobs (monthly and druid) starting 2018-01
  • 18:36 joal: Making hdfs://analytics-hadoop/wmf/data/wmf/mediawiki_private accessible only by analytics-privatedata-users group (and hdfs obviously)
  • 18:02 joal: Kill-Restart mobile_apps-session_metrics (bundle killed, coord started)
  • 18:00 joal: Kill-Restart mediawiki-history-reduced-coord after deploy
  • 17:44 joal: Deploying refinery onto hadoop
  • 17:29 joal: Deploy refinery using scap
  • 16:32 ottomata: bouncing main -> jumbo mirror makers to increase heap size to 2G
  • 14:16 ottomata: re-enabling replication of mediawiki job topics from main -> jumbo

2018-03-27

  • 14:03 elukey: consolidate all the zookeeper definitions into a single 'main-eqiad' one in Horizon -> Project-Analytics
  • 11:16 elukey: kill banner impression job to force a respawn (still using an old jvm)

2018-03-26

  • 15:12 elukey: restart eventlogging mysql consumers after maintenance
  • 14:26 ottomata: restarting jumbo -> eqiad mirror makers with prometheus instead of jmx
  • 13:28 ottomata: restarting kafka mirror maker main -> jumbo using new consumer
  • 13:09 fdans: stopped 2 mysql consumers as precaution for T174386

2018-03-24

  • 08:13 joal: kill failing query swamping the cluster (application_1520532368078_47226)

2018-03-23

  • 16:44 elukey: invalidated 2018-03-12/13 for iOS data in piwik to force a re-run of the archiver

2018-03-20

2018-03-19

  • 09:38 elukey: restart hadoop daemons on analytics1070 for openjdk upgrades (canary)

2018-03-16

  • 20:23 ottomata: bouncing main -> jumbo mirror makers to apply change-prop topic blacklist
  • 14:44 ottomata: restarting eventlogging mysql eventbus consumer to consume from analytics instead of jumbo
  • 14:38 elukey: temporary point pivot to druid1002 as prep step for druid1001's reboot
  • 14:37 elukey: disable druid1001's middlemanager as prep step for reboot
  • 14:24 elukey: changed superset druid private config from druid1002 to druid1003
  • 13:43 elukey: disable druid1002's middle manager via API as prep step for reboot
  • 09:57 elukey: restart eventlogging-consumer@mysql-m4/eventbus on eventlog1002 to force the DNS resolution of m4-master (changed from dbproxy1009 -> dbproxy1004)
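
The "disable via API" steps above correspond to the Druid middleManager worker endpoint; a hedged example for the 13:43 entry (default worker port assumed):

    # tell the overlord to stop assigning new tasks to this middlemanager
    curl -X POST http://druid1002.eqiad.wmnet:8091/druid/worker/v1/disable
    # after the reboot, re-enable it
    curl -X POST http://druid1002.eqiad.wmnet:8091/druid/worker/v1/enable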

2018-03-15

  • 22:13 ottomata: bounced jumbo mirror makers
  • 19:10 ottomata: bouncing main -> jumbo mirror maker
  • 14:50 joal: Restart clickstream-coord to pick new config including fawiki
  • 14:29 elukey: disabled druid1003's middlemanager as prep step for reboot
  • 14:07 ottomata: bouncing kafka jumbo -> eqiad mirrormaker

2018-03-14

  • 15:27 ottomata: bouncing main -> jumbo mirror maker instances
  • 14:45 ottomata: beginning migration of eventlogging analytics from Kafka analytics to Kafka jumbo: T183297

2018-03-13

  • 20:47 ottomata: restarting eventlogging processors to pick up VirtualPageView blacklist from eventlogging-valid-mixed topic
  • 15:13 ottomata: bounce main -> analytics mirror maker instances
  • 15:07 ottomata: bouncing MirrorMaker on kafka1020 (main -> jumbo) to re-apply acks=all
  • 14:55 ottomata: bouncing MirrorMaker on kafka1022 to re-apply acks=all (main -> jumbo)
  • 14:32 ottomata: bouncing MirrorMaker on kafka1023 (main -> jumbo) to re-apply acks=all
  • 14:22 ottomata: bouncing mirrormaker for main -> analytics on kafka101[234] to apply roundrobin

2018-03-12

  • 19:39 ottomata: deployed new Refine jobs (eventlogging, eventbus, etc.) with deduplication and geocoding and casting
  • 18:17 ottomata: bouncing kafka mm eqiad -> jumbo with acks=1
  • 18:10 ottomata: bouncing kafka mirrormaker for main-eqiad -> jumbo with buffer.memory=128M
  • 17:34 joal: Restart mediawiki-history-reduced oozie job to add a dependency
  • 16:55 joal: Restart mobile_apps_session_metrics
  • 16:52 joal: Deploying refinery on HDFS for mobile_apps patch
  • 16:26 joal: Deploying refinery again to provide patch for mobile_apps_session_metric job
  • 15:09 joal: Deploy refinery onto hdfs
  • 15:07 joal: Deploy refinery from scap
  • 14:32 elukey: restart druid-broker on druid1004 - no /var/log/druid/broker.log after 2018-03-10T22:38:52 (java.io.IOException: Too many open files)
  • 08:50 elukey: fixed eventlog1002's IPv6 (https://gerrit.wikimedia.org/r/#/c/418714/)

2018-03-10

  • 09:07 joal: Rerun clickstream-wf-2018-2
  • 00:32 milimetric: finished sqooping pagelinks for missing dbs, hdfs -put a SUCCESS flag in the 2018-02 snapshot, jobs should run unless Hue is still lying to itself

2018-03-09

  • 17:29 joal: Rerun mediawiki-history-reduced job after having manually repaired wmf_raw.mediawiki_project_namespace_map

2018-03-08

  • 18:05 ottomata: bouncing ResourceManagers
  • 08:54 elukey: re-enable camus after reboots
  • 07:15 elukey: disable Camus on an1003 to allow the cluster to drain - prep step for an100[123] reboot

2018-03-07

  • 07:15 elukey: manually re-run wikidata-articleplaceholder_metrics-wf-2018-3-6

2018-03-06

  • 20:44 ottomata: reverted change to point mediawiki monolog kafka producers at kafka jumbo-eqiad until deployment train is done T188136
  • 20:35 ottomata: pointing mediawiki monolog kafka producers at kafka jumbo-eqiad cluster: T188136
  • 19:06 elukey: cleaned up id=0 rows on db1108 (log database) for T188991
  • 10:19 elukey: restart webrequest-load-wf-upload-2018-3-6-7 (failed due to reboots)
  • 10:08 elukey: re-starting mysql consumers on eventlog1001
  • 09:41 elukey: stop eventlogging's mysql consumers for db1107 (el master) kernel updates

2018-03-05

  • 18:22 elukey: restart webrequest-load-wf-upload-2018-3-5-16 via Hue (failed due to reboots)
  • 18:21 elukey: restart webrequest-load-wf-text-2018-3-5-16 via Hue (failed due to reboots)
  • 15:00 mforns: rerun mediacounts-load-wf-2018-3-5-9
  • 10:57 joal: Relaunch Mediawiki-history job manually from spark2 to see if new versions helps
  • 10:57 joal: Killing failing Mediawiki-History job for 2018-03

2018-03-02

  • 15:33 mforns: rerun webrequest-load-wf-text-2018-3-2-12

2018-03-01

  • 14:59 elukey: shutdown deployment-eventlog02 in favor of deployment-eventlog05 in deployment-prep (Ubuntu -> Debian EL migration)
  • 09:45 elukey: rerun webrequest-load-wf-text-2018-3-1-6 manually, failed due to analytics1030's reboot

2018-02-28

  • 22:09 milimetric: re-deployed refinery for a small docs fix in the sqoop script
  • 17:55 milimetric: Refinery synced to HDFS, deploy completed
  • 17:40 milimetric: deploying Refinery
  • 08:38 joal: rerun cassandra-hourly-wf-local_group_default_T_pageviews_per_project_v2-2018-2-27-15

2018-02-27

  • 19:12 ottomata: updating spark2-* CLIs to spark 2.2.1: T185581

2018-02-21

  • 20:48 ottomata: now running 2 camus webrequest jobs, one consuming from jumbo (no data yet), the other from analytics. these should be fine to run in parallel.
  • 07:21 elukey: reboot db1108 (analytics-slave.eqiad.wmnet) for mariadb+kernel updates

2018-02-19

2018-02-16

  • 15:41 elukey: add analytics1057 back in the Hadoop worker pool after disk swap
  • 10:55 elukey: increased topic partitions for netflow to 3
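
A hedged sketch of the partition bump above (stock kafka-topics.sh shown; the zookeeper connect string and exact topic name are assumptions):

    kafka-topics.sh --zookeeper "$ZOOKEEPER_CONNECT" \
        --alter --topic netflow --partitions 3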

2018-02-15

  • 13:54 milimetric: deployment of refinery and refinery-source done
  • 12:52 joal: Killing webrequest-load bundle (next restart should be at hour 12:00)
  • 08:18 elukey: removed jmxtrans and java 7 from analytics1003 and re-launched refinery-drop-mediawiki-snapshots
  • 07:51 elukey: removed default-java packages from analytics1003 and re-launched refinery-drop-mediawiki-snapshots

2018-02-14

  • 13:44 elukey: rollback java 8 upgrade for archiva - issues with Analytics builds
  • 13:35 elukey: installed openjdk-8 on meitnerium, manually upgraded java-update-alternatives to java8, restarted archiva
  • 13:14 elukey: removed java 7 packages from analytics100[12]
  • 12:43 elukey: jmxtrans removed from all the Hadoop workers
  • 12:43 elukey: openjdk-7-* packages removed from all the Hadoop workers

2018-02-13

  • 11:42 elukey: force kill of yarn nodemanager + other containers on analytics1057 (node failed, unit masked, processes still around)

2018-02-12

  • 23:16 elukey: re-run webrequest-load-wf-upload-2018-2-12-21 via Hue (node managers failure)
  • 23:13 elukey: manual restart of Yarn Node Managers on analytics1058/31
  • 23:09 elukey: cleaned up tmp files on all analytics hadoop worker nodes (a job was filling up tmp)
  • 17:18 elukey: home dirs on stat1004 moved to /srv/home (/home symlinks to it)
  • 17:15 ottomata: restarting eventlogging-processors to blacklist Print schema in eventlogging-valid-mixed (MySQL)
  • 14:46 ottomata: deploying eventlogging for T186833 with EventCapsule in code and IP NO_DB_PROPERTIES
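
For the 17:18 entry above, a minimal sketch of the relocation; the actual procedure and rsync flags used were not recorded here:

    # copy home directories to the larger /srv partition, then swap in a symlink
    rsync -a /home/ /srv/home/
    mv /home /home.migrated
    ln -s /srv/home /home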

2018-02-09

  • 12:19 joal: Rerun wikidata-articleplaceholder_metrics-wf-2018-2-8

2018-02-08

  • 16:23 elukey: stop archiva on meitnerium to swap /var/lib/archiva from the root partition to a new separate one

2018-02-07

  • 13:55 joal: Manually restarted druid indexation after weird failure of mediawiki-history-reduced-wf-2018-01
  • 13:49 elukey: restart overlord/middlemanager on druid1005

2018-02-06

  • 19:40 joal: Manually restarted druid indexation after weird failure of mediawiki-history-reduced-wf-2018-01
  • 15:36 elukey: drain + shutdown of analytics1038 to replace faulty BBU
  • 09:58 elukey: applied https://gerrit.wikimedia.org/r/c/405687/ manually on deployment-eventlog02 for testing

2018-02-05

  • 15:51 elukey: live hacked deployment-eventlog02's /srv/deployment/eventlogging/analytics/eventlogging/handlers.py to add poll(0) to the confluent kafka producer - T185291
  • 11:03 elukey: restart eventlogging/forwarder legacy-zmq on eventlog1001 due to slow memory leak over time (cached memory down to zero)

2018-02-02

  • 17:09 joal: Webrequest upload 2018-02-02 hours 9 and 11 dataloss warnings have been checked - they are false positives
  • 09:56 joal: Rerun unique_devices-per_project_family-monthly-wf-2018-1 after failure

2018-02-01

  • 17:00 ottomata: killing stuck JsonRefine eventlogging analytics job application_1515441536446_52892, not sure why this is stuck.
  • 14:06 joal: Dataloss alerts for upload 2018-02-01 hours 1, 2, 3 and 5 were false positives
  • 12:17 joal: Restart cassandra monthly bundle after January deploy

2018-01-23

  • 20:10 ottomata: hdfs dfs -chmod 775 /wmf/data/archive/mediacounts/daily/2018 for T185419
  • 09:26 joal: Dataloss warning for upload and text 2018-01-23:06 is confirmed to be a false positive

2018-01-22

  • 17:36 joal: Kill-Restart clickstream oozie job after deploy
  • 17:12 joal: deploying refinery onto HDFS
  • 17:12 joal: Refinery deployed from scap

2018-01-18

  • 19:11 joal: Kill-Restart coord_pageviews_top_bycountry_monthly oozie job from 2015-05
  • 19:10 joal: Add fake data to cassandra to silence alarms (Thanks again ema)
  • 18:56 joal: Truncating table "local_group_default_T_top_bycountry"."data" in cassandra before reload
  • 15:21 mforns: refinery deployment using scap and then deploying onto hdfs finished
  • 15:07 mforns: starting refinery deployment
  • 12:43 elukey: piwik on bohrium re-enabled
  • 12:40 elukey: set piwik in readonly mode and stopped mysql on bohrium (prep step for reboot)
  • 09:38 elukey: reboot thorium (analytics webserver) for security upgrade - This maintenance will cause temporary unavailability of the Analytics websites
  • 09:37 elukey: resumed druid hourly index jobs via hue and restored pivot's configuration
  • 09:21 elukey: reboot druid1001 for kernel upgrades
  • 09:00 elukey: suspended hourly druid batch index jobs via Hue
  • 08:58 elukey: temporarily set druid1002 in superset's druid cluster config (via UI)
  • 08:53 elukey: temporarily point pivot's configuration to druid1002 (druid1001 needs to be rebooted)
  • 08:52 elukey: disable druid1001's middlemanager as prep step for reboot
  • 07:11 elukey: re-run webrequest-load-wf-misc-2018-1-18-3 via Hue
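
The 18:56 truncate above is plain CQL; a hedged example via cqlsh (connection host and credentials omitted):

    cqlsh -e 'TRUNCATE "local_group_default_T_top_bycountry".data;'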

2018-01-17

  • 17:33 elukey: killed the banner impression spark job (application_1515441536446_27293) again to force it to respawn (real time indexers not present)
  • 17:29 elukey: restarted all druid overlords on druid100[123] (weird race condition messages about who was the leader for some task)
  • 16:24 elukey: re-run all the pageview-druid-hourly failed jobs via Hue
  • 14:42 elukey: restart druid middlemanager on druid1003 as attempt to unblock realtime streaming
  • 14:21 elukey: forced kill of banner impression data streaming job to get it restarted
  • 11:44 elukey: re-run pageview-druid-hourly-wf-2018-1-17-9 and pageview-druid-hourly-wf-2018-1-17-8 (failed due to druid1002's middlemanager being in a weird state after reboot)
  • 11:44 elukey: restart druid middlemanager on druid1002
  • 10:38 elukey: stopped all crons on hadoop-coordinator-1
  • 10:37 elukey: re-run webrequest-druid-hourly-wf-2018-1-17-8 (failed due to druid1002's reboot)
  • 10:22 elukey: reboot druid1002 for kernel upgrades
  • 09:53 elukey: disable druid middlemanager on druid1002 as prep step for reboot
  • 09:46 elukey: rebooted analytics1003
  • 09:46 elukey: removed upstart config for brrd on eventlog1001 (failing and spamming syslog, old leftover?)
  • 08:53 elukey: disabled camus as prep step for analytics1003 reboot

2018-01-15

  • 13:39 elukey: stop eventlogging and reboot eventlog1001 for kernel updates
  • 09:58 elukey: rolling reboots of aqs hosts (1005->1009) for kernel updates
  • 09:11 elukey: reboot aqs1004 for kernel updates

2018-01-12

  • 13:03 joal: Rerun webrequest-load-wf-text-2018-1-12-9
  • 13:02 joal: Rerun webrequest-load-wf-upload-2018-1-12-9
  • 10:33 elukey: reboot analytics1066->69 for kernel updates
  • 09:07 elukey: reboot analytics1063->65 for kernel updates

2018-01-11

  • 22:35 ottomata: restarting kafka-jumbo brokers to apply https://gerrit.wikimedia.org/r/403774
  • 22:04 ottomata: restarting kafka-jumbo brokers to apply https://gerrit.wikimedia.org/r/#/c/403762/
  • 20:57 ottomata: restarting kafka-jumbo brokers to apply https://gerrit.wikimedia.org/r/#/c/403753/
  • 17:37 joal: Kill manual banner-streaming job to see it restarted by cron
  • 17:11 ottomata: restart kafka on kafka-jumbo1003
  • 17:08 ottomata: restart kafka on kafka-jumbo1001...something is not right with my certpath change yesterday
  • 14:46 joal: Deploy refinery onto HDFS
  • 14:33 joal: Deploy refinery with Scap
  • 14:07 joal: Manually restarting banner streaming job to prevent alerting
  • 13:23 joal: Killing banner-streaming job to have it auto-restarted from cron
  • 11:45 elukey: re-run webrequest-load-wf-text-2018-1-11-8 (failed due to reboots)
  • 11:39 joal: rerun mediacounts-load-wf-2018-1-11-8
  • 10:48 joal: Restarting banner-streaming job after hadoop nodes reboot
  • 10:01 elukey: reboot analytics1059-61 for kernel updates
  • 09:34 elukey: reboot analytics1055->1058 for kernel updates
  • 09:04 elukey: reboot analytics1051->1054 for kernel updates

2018-01-10

  • 16:56 elukey: reboot analytics1048->50 for kernel updates
  • 16:23 ottomata: restarting kafka jumbo brokers to apply java.security certpath restrictions
  • 11:51 elukey: re-run webrequest-load-wf-upload-2018-1-10-10 (failed due to reboots)
  • 11:27 elukey: re-run webrequest-load-wf-text-2018-1-10-10 (failed due to reboots)
  • 11:26 elukey: reboot analytics1044->47 for kernel updates
  • 11:03 elukey: reboot analytics1040->43 for kernel updates

2018-01-09

  • 16:53 joal: Rerun pageview-druid-hourly-wf-2018-1-9-13
  • 15:33 elukey: stop mysql on dbstore1002 as prep step for shutdown (stop all slaves, mysql stop)
  • 15:10 elukey: reboot analytics1028 (hadoop worker and hdfs journal node) for kernel updates
  • 15:00 elukey: reboot kafka-jumbo1006 for kernel updates
  • 14:41 elukey: reboot kafka-jumbo1005 for kernel updates
  • 14:33 elukey: reboot kafka1023 for kernel updates
  • 14:04 elukey: reboot kafka1022 for kernel updates
  • 13:51 elukey: reboot kafka-jumbo1003 for kernel updates
  • 10:08 elukey: reboot kafka-jumbo1002 for kernel updates
  • 09:35 elukey: reboot kafka1014 for kernel updates

2018-01-08

  • 19:07 milimetric: Deployed refinery and synced to hdfs
  • 15:23 elukey: reboot kafka1013 for kernel updates
  • 13:40 elukey: reboot analytics10[36-39] for kernel updates
  • 12:59 elukey: reboot kafka1012 for kernel updates
  • 12:43 joal: Deploy AQS from tin
  • 12:36 fdans: Deploying AQS
  • 12:33 joal: Update fake-data in cassandra, adding the needed top-by-country row
  • 11:07 elukey: re-run webrequest-load-wf-text-2018-1-8-8 (failed after some reboots due to kernel updates)
  • 10:07 elukey: drain + reboot analytics1029,1031->1034 for kernel updates

2018-01-07

  • 09:01 elukey: re-enabled puppet on db110[78] - eventlogging_sync restarted on db1108 (analytics-slave)

2018-01-06

  • 08:09 elukey: re-enable eventlogging mysql consumers after database maintenance

2018-01-05

  • 13:18 fdans: deploying AQS

2018-01-04

  • 19:54 joal: Deploying refinery onto hadoop
  • 19:45 joal: Deploy refinery using scap
  • 19:38 joal: Deploy refinery-source using jenkins
  • 16:01 ottomata: killing json_refine_eventlogging_analytics job that started yesterday and has not completed (has no executors running?) application_1512469367986_81514. I think the cluster is just too busy? mw-history job running...
  • 10:34 elukey: re-run mediacounts-archive-wf-2018-01-03

2018-01-03

  • 15:00 ottomata: restarting kafka-jumbo brokers to enable tls version and cipher suite restrictions

2018-01-02

  • 11:13 joal: Kill and restart cassandra loading oozie bundle to pick new pageview_top_bycountry job
  • 08:22 elukey: restart druid coordinators to pick up new jvm settings (freeing up 6GB of used memory)