Analytics/Archive/Infrastructure/status

Last update on: 2012-07-monthly

2012-05-22
First 10 Cisco boxes are available. Just puppetized a generic Java installation.

2012-06-06
Hadoop is up and running (under CDH3). http://analytics1001.wikimedia.org:50070/dfshealth.jsp 

2012-06-08
Testing and benchmarking different Hadoop parameters, using the TestDFSIO and TeraSort benchmarks. Learning!

2012-05-monthly
Cluster planning continues smoothly: David Schoonover has begun writing architecture and dataflow documentation. Cluster setup began in earnest mid-month when the Operations team delivered 10 machines from a 2011 Cisco hardware grant. Andrew Otto and David set up the systems, user environments, and software dependencies. They began testing Cassandra, Hadoop, and HBase to evaluate which best meets the storage needs of the cluster's batch and stream processing systems.

2012-06-03
At the Berlin Hackathon, Diederik van Liere and Dave Schoonover gathered community input on analytics plans, and gave a few ad hoc presentations about the upcoming changes to the data-processing workflow.

2012-06-29
Andrew Otto has performed several preliminary benchmarks on a 10-node CDH3 cluster. We plan to do more benchmarking with CDH4 and DataStax Enterprise.

Focus has recently switched to building, testing, and deploying Facebook's Scribe as an eventual replacement for udp2log. We are investigating the use of Scribe initially for Lucene search query logging.

2012-06-monthly
Andrew Otto has performed several preliminary benchmarks on a 10-node CDH3 cluster. We plan to do more benchmarking with CDH4 and DataStax Enterprise.

Focus has recently switched to building, testing, and deploying Facebook's Scribe as an eventual replacement for udp2log. We are investigating the use of Scribe initially for Lucene search query logging.

2012-07-monthly
Researching and evaluating udp2log replacements for getting data into the Kraken cluster. A document is in progress at Analytics/Distributed_Logging_Solutions.