Analytics/Archive/Infrastructure/status

Last update on: 2013-01-monthly

2012-05-22
First 10 Cisco boxes are available. Just puppetized a generic Java installation.

2012-06-06
Hadoop is up and running (under CDH3). http://analytics1001.wikimedia.org:50070/dfshealth.jsp 

2012-06-08
Testing and benchmarking different Hadoop parameters. Using the TestDFSIO and TeraSort benchmarks. Learning!

2012-05-monthly
Cluster planning continues smoothly: David Schoonover has begun writing architecture and dataflow documentation. Cluster setup began in earnest mid-month when the Operations team delivered 10 machines from a 2011 Cisco hardware grant. Andrew Otto and David set up the systems, user environments, and software dependencies. They began testing Cassandra, Hadoop, and HBase to evaluate which best meets the storage needs of the cluster's batch and stream processing systems.

2012-06-03
At the Berlin Hackathon, Diederik van Liere and Dave Schoonover gathered community input on analytics plans, and gave a few ad hoc presentations about the upcoming changes to the data-processing workflow.

2012-06-29
Andrew Otto has performed several preliminary benchmarks on a 10-node CDH3 cluster. We plan to do more benchmarking with CDH4 and DataStax Enterprise.

Focus has recently switched to building, testing and deploying Facebook's Scribe as an eventual replacement for udp2log. We are investigating the use of Scribe initially for Lucene search query logging.

2012-06-monthly
Andrew Otto has performed several preliminary benchmarks on a 10-node CDH3 cluster. We plan to do more benchmarking with CDH4 and DataStax Enterprise.

Focus has recently switched to building, testing and deploying Facebook's Scribe as an eventual replacement for udp2log. We are investigating the use of Scribe initially for Lucene search query logging.

2012-07-monthly
Researching and evaluating udp2log replacements for getting data into the Kraken cluster. Document in progress here: Analytics/Distributed_Logging_Solutions.

2012-08-monthly
<section begin="2012-08-monthly"/>Otto and Dave concluded their research into distributed logging systems to replace udp2log for Kraken; we'd like to believe the proposal is technical yet readable, with a rough charm and boundless love, much like all engineering.<section end="2012-08-monthly"/>

2012-09-monthly
<section begin="2012-09-monthly"/>After experimenting with DataStax Enterprise, the team reached consensus that it needs fully open-source software, and ruled DataStax Enterprise out in favor of CDH4. They also experimented with Pig Latin and are developing a script to geocode and count requests per country for the Fundraising team. The Analytics Cisco machines were reinstalled with Ubuntu Precise.<section end="2012-09-monthly"/>

2012-10-monthly
<section begin="2012-10-monthly"/>The Analytics team has been working on: configuring and puppetizing CDH4 (Hue, Sqoop, Oozie, ZooKeeper, Hive, Pig), configuring and puppetizing Kafka, benchmarking performance, drafting metadata schemas, setting up Ganglia monitoring, setting up a prototype pixel service endpoint, and running ad hoc data queries for Fundraising.<section end="2012-10-monthly"/>

2012-11-monthly
<section begin="2012-11-monthly"/>The Analytics team has received all of the hardware purchased in the spring. The Hadoop nodes have been moved to their final homes. Evan Rosen from the Global Development team is helping us test this setup with real use cases for his team. Kafka has been puppetized and installed. It is currently consuming all Banner Impression- and Wikipedia Zero-related logs. As a proof of concept, the Zero logs are being fed daily into Hadoop for analysis by the Global Development team. Debs for Storm have been built. Storm has been puppetized and is running on several of the Cisco nodes.<section end="2012-11-monthly"/>

2012-12-monthly
<section begin="2012-12-monthly"/>LDAP Hue/Hadoop authentication works, but group file access still needs to be worked out. We've puppetized an Apache proxy for internal Kraken and Hadoop web services, as well as udp2log Kafka production and Kafka Hadoop consumption. The event.gif log stream is being consumed into Hadoop. We're attempting to use udp2log to import logs into Kafka and Hadoop without packet loss, and we're backing up Hadoop service data files (e.g. for Hue, Oozie, and Hive) to HDFS.<section end="2012-12-monthly"/>

2013-01-monthly
<section begin="2013-01-monthly"/>In January, we built automated data collection and analysis workflows for mobile page views by country and device, and we wrote Pig User-Defined Functions for device detection using the dClass library. We also migrated all our analytics repositories to the official Wikimedia GitHub account. We set up Riemann for real-time cluster monitoring, and resolved the packet loss issue that occurred when storing data.<section end="2013-01-monthly"/>