Analytics/Kraken/status


Last update on: 2014-03-monthly


2013-09-monthly

  • Migrated to OpenJDK 7
  • We upgraded the Hadoop cluster to Cloudera's CDH 4.3.1
  • We have been experimenting with Camus to import data from a Kafka broker and write it to HDFS, and with Hive schemas that allow easy querying of the imported webrequest data (see the sketch after this list).
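
As a rough illustration of what such a Hive schema could look like, the sketch below creates an external table over data Camus has written to HDFS and runs the DDL through the Hive command-line client. The column list, the HDFS path, and the assumption of tab-delimited text are placeholders, not the actual webrequest schema.

  import subprocess

  # Hypothetical external table over webrequest data imported by Camus.
  # Columns, HDFS location and the tab-delimited format are illustrative only.
  ddl = """
  CREATE EXTERNAL TABLE IF NOT EXISTS webrequest (
    hostname    STRING,
    dt          STRING,
    ip          STRING,
    http_status STRING,
    uri_path    STRING,
    user_agent  STRING
  )
  PARTITIONED BY (year INT, month INT, day INT, hour INT)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
  LOCATION 'hdfs:///wmf/data/raw/webrequest';
  """

  # 'hive -e' executes the quoted statement with the Hive CLI.
  subprocess.run(['hive', '-e', ddl], check=True)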

2013-10-monthly

The team designed and implemented multi-data-center configuration support for Kafka (the message bus). A bug involving buffer-space allocation was exposed and fixed in the Varnish module, and infrastructure work was done on automated data ingestion and partitioning. Product Analyst Oliver Keyes did some exploratory work with Hive and provided feedback to the Development team on ease of use and use cases.
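
As an example of the kind of exploratory query this enables, assuming a partitioned webrequest table along the lines sketched above, the snippet below counts requests by HTTP status for a single day; the partition values are placeholders.

  import subprocess

  # Count requests by HTTP status for one (hypothetical) day's partitions.
  query = """
  SELECT http_status, COUNT(*) AS requests
  FROM webrequest
  WHERE year = 2013 AND month = 10 AND day = 15
  GROUP BY http_status
  ORDER BY requests DESC;
  """

  result = subprocess.run(['hive', '-e', query],
                          check=True, capture_output=True, text=True)
  print(result.stdout)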

2013-11-monthly

We continued to make progress on event delivery via Kafka, identifying and testing solutions for issues with delivery from the Amsterdam data center. We also tested fixes for Ganglia logging issues.

2013-12-monthly

In late December, the Analytics team partnered with Operations to enable log delivery over Kafka (distributed message bus). All logs from the edge caches serving mobile traffic are now delivered via Kafka into a data warehouse on our Hadoop infrastructure. We're seeing 3–4K messages per second, with a maximum of 8K/sec over Christmas. This is a significant step towards our goal of building an infrastructure that can be used to analyze all of our page views.
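
A minimal way to sanity-check rates like these is to consume from the Kafka topic and count messages over a time window. The sketch below uses the third-party kafka-python client; the topic and broker names are placeholders, not the actual production names.

  import time
  from kafka import KafkaConsumer   # third-party kafka-python client

  # Placeholder topic and broker; the real names are not given in this report.
  consumer = KafkaConsumer('webrequest_mobile',
                           bootstrap_servers='kafka-broker:9092',
                           auto_offset_reset='latest')

  window_start, count = time.time(), 0
  for message in consumer:
      count += 1
      now = time.time()
      if now - window_start >= 10:              # report every ~10 seconds
          print('%.0f messages/sec' % (count / (now - window_start)))
          window_start, count = now, 0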

2014-01-monthly

The team has been monitoring the mobile stream and adding load to Kafka, which exposed some scaling issues; these have since been resolved. In addition, we have been working with the Operations team on designing and implementing a Java deployment system for use with Hadoop and other systems. Finally, we have started using the data in the warehouse to analyze mobile browser distribution and session length.
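
As a sketch of the browser-distribution part of that work (not the team's actual methodology), the snippet below does a crude user-agent breakdown over a hypothetical tab-separated sample exported from the warehouse; the file name and column name are assumptions.

  import csv
  from collections import Counter

  def browser_family(ua):
      # Crude substring matching, ordered so that e.g. Chrome is matched
      # before Safari (Chrome user agents also contain "Safari").
      for family in ('Opera Mini', 'Opera', 'Chrome', 'Firefox', 'Safari', 'MSIE'):
          if family in ua:
              return family
      return 'Other'

  counts = Counter()
  with open('mobile_webrequest_sample.tsv', newline='') as f:   # hypothetical export
      for row in csv.DictReader(f, delimiter='\t'):
          counts[browser_family(row.get('user_agent', ''))] += 1

  for family, n in counts.most_common():
      print('%s\t%d' % (family, n))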

2014-02-monthly

We continue to make progress on the Hadoop/Kafka roll-out. We've encountered some cross-data-center latency issues with Varnish-Kafka that we are currently debugging. We are also testing the Kafka-tee component, which provides backwards compatibility for udp2log subscribers. Finally, we are finishing a report for the Mobile team on browser breakdowns using Kafka-provided data on Hadoop.
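
Kafka-tee itself is a separate utility, but the idea behind it can be sketched in a few lines: consume log lines from Kafka and re-emit them as UDP datagrams so that existing udp2log consumers keep working. This is an illustration only; the topic, broker, and subscriber addresses are placeholders.

  import socket
  from kafka import KafkaConsumer   # third-party kafka-python client

  SUBSCRIBERS = [('stat-host.example', 8420)]     # hypothetical udp2log-style consumers

  consumer = KafkaConsumer('webrequest_mobile',   # placeholder topic and broker
                           bootstrap_servers='kafka-broker:9092',
                           auto_offset_reset='latest')
  sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

  for message in consumer:
      for host, port in SUBSCRIBERS:
          # Re-emit each log line as a UDP datagram, the way udp2log delivered them.
          sock.sendto(message.value, (host, port))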

2014-03-monthly

This month we reached a milestone in our ability to deploy Java applications at the Foundation: we stood up an Archiva build-artifact repository. This enables us to deploy Java libraries and applications consistently, and it will be used initially for Hadoop and Search.

The first Analytics use case for this system will be Camus, LinkedIn's open-source application for loading Kafka data into Hadoop. Once this is productized, we'll be able to load log data from our servers into Hadoop regularly for processing and analysis.
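
A sketch of what a regular Camus run could look like, assuming Camus's jar and a properties file (Kafka brokers, topics, HDFS output path) are already deployed; the paths below are placeholders, and in practice a run like this would be driven by a scheduler such as cron.

  import subprocess

  CAMUS_JAR = '/srv/deployment/analytics/camus.jar'        # hypothetical path
  CAMUS_PROPERTIES = '/etc/camus/webrequest.properties'    # hypothetical path

  # Camus runs as a Hadoop job; its driver class reads all Kafka and HDFS
  # settings from the properties file passed with -P.
  subprocess.run(['hadoop', 'jar', CAMUS_JAR,
                  'com.linkedin.camus.etl.kafka.CamusJob',
                  '-P', CAMUS_PROPERTIES],
                 check=True)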