Analytics/Meetings/June webrequest data loss post mortem

From mediawiki.org

Notes

  1. Why didn't we notice that two Kafka producers stopped working?
    1. Ganglia continued to report the same ProduceRequestsPerSecond value even after the udp2log Kafka producer stopped working.
    2. WebRequestLoss checks reported 0% lost, even though we are sure we lost data.
      1. Why didn't WebRequestLoss report loss?
        1. Because WebRequestLoss is not monitoring the data stream that had loss
          1. Which of the data streams are monitored by WebRequestLoss?
            1. Just the geocoded mobile stream
    3. Kafka monitoring also didn't work as expected -- the theory is that the threshold is too low.
    4. In short, we didn't notice because the three alert-triggering monitoring tools did not catch this particular scenario.
      1. Is there a single monitoring system that would catch all scenarios in a basic way?
        1. Yes: WebRequestLoss, discussed above.
    5. Why did the Kafka producers stop working?
      1. We do not know.
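
The Ganglia observation above (an unchanging ProduceRequestsPerSecond value while the producer was down) suggests one simple kind of check: flag a rate metric whose recent samples are all identical. The following is only a minimal sketch of that idea, assuming a healthy producer's rate fluctuates from sample to sample. The function name `is_flatlined` and the `window` parameter are hypothetical and are not part of Ganglia, Kafka, or WebRequestLoss.

```python
from typing import Sequence


def is_flatlined(samples: Sequence[float], window: int = 5) -> bool:
    """Return True if the last `window` samples are all identical.

    Assumption (not from the notes): a live rate metric varies between
    samples, so a run of identical values hints that a stale last value
    is being repeated rather than freshly measured.
    """
    if window < 2:
        raise ValueError("window must be at least 2")
    if len(samples) < window:
        # Not enough history to decide either way.
        return False
    recent = samples[-window:]
    return all(value == recent[0] for value in recent)


# Illustrative, made-up sample series (not real measurements).
healthy = [10.2, 11.0, 9.7, 10.1, 9.9]
frozen = [10.2, 11.0, 9.7, 9.7, 9.7, 9.7, 9.7]

print(is_flatlined(healthy))
print(is_flatlined(frozen))
```

A check like this would complement, not replace, a threshold alert: a threshold catches a rate that drops, while a flatline check catches a value that stops being updated.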


Action Items

  1. Investigate turning off the un-anonymized stream, OR turn on monitoring for this stream - ?
  2. Investigate failure in Kafka monitoring - AO
  3. Longer term: how can we abstract the data format from the tools - ?
  4. Longer term: think about use of IP in data - DvL