Analytics/Archive/Infrastructure/status

This page is archived! Find up-to-date documentation at https://wikitech.wikimedia.org/wiki/Analytics

Last update on: 2013-08-monthly

2012-05-22

First 10 Cisco boxes are available. Just puppetized a generic Java installation.

2012-06-06

Hadoop is up and running (under CDH3). http://analytics1001.wikimedia.org:50070/dfshealth.jsp

2012-06-08

Testing and benchmarking different Hadoop parameters, using the TestDFSIO and TeraSort benchmarks. Learning!

2012-05-monthly

Cluster planning continues smoothly: David Schoonover has begun writing architecture and dataflow documentation. Cluster setup began in earnest mid-month when the Operations team delivered 10 machines from a 2011 Cisco hardware grant. Andrew Otto and David set up the systems, user environments, and software dependencies. They began testing Cassandra, Hadoop, and HBase to evaluate which best meets the storage needs of the cluster's batch and stream processing systems.

2012-06-03

At the Berlin Hackathon, Diederik van Liere and Dave Schoonover gathered community input on analytics plans, and gave a few ad hoc presentations about the upcoming changes to the data-processing workflow.

2012-06-29

Andrew Otto has performed several preliminary benchmarks on a 10-node CDH3 cluster. We plan to do more benchmarking with CDH4 and DataStax Enterprise.

Focus has recently switched to building, testing and deploying Facebook's Scribe as an eventual replacement for udp2log. We are investigating the use of Scribe initially for Lucene search query logging.

2012-06-monthly

Andrew Otto has performed several preliminary benchmarks on a 10-node CDH3 cluster. We plan to do more benchmarking with CDH4 and DataStax Enterprise.

Focus has recently switched to building, testing and deploying Facebook's Scribe as an eventual replacement for udp2log. We are investigating the use of Scribe initially for Lucene search query logging.

2012-07-monthly

Researching and evaluating udp2log replacements for getting data into the Kraken cluster. Document in progress here: Analytics/Distributed_Logging_Solutions.

2012-08-monthly

Otto and Dave concluded their research into distributed logging systems to replace udp2log for Kraken; we'd like to believe the proposal is technical-yet-readable, with a rough charm and boundless love — much like all engineering.

2012-09-monthly

After experimenting with DataStax Enterprise, the team reached consensus that they need to use fully open-source software and ruled it out in favor of CDH4. They also experimented with Pig Latin and are developing a script to geocode and count requests per country for the Fundraising team. The Analytics Cisco machines were reinstalled with Ubuntu Precise.

2012-10-monthly

The Analytics team has been working on: configuring and puppetizing CDH4 (Hue, Sqoop, Oozie, Zookeeper, Hive, Pig), configuring and puppetizing Kafka, benchmarking performance, drafting metadata schemas, setting up Ganglia monitoring, setting up a prototype pixel service endpoint, and running ad-hoc data queries for Fundraising.

2012-11-monthly

The Analytics team has received all of the hardware purchased back in the Spring. The Hadoop nodes have been moved onto their final homes. Evan Rosen from the Global Development team is helping us test this setup with real use cases for his team. Kafka has been puppetized and installed. It is currently consuming all of Banner Impression- and Wikipedia Zero-related logs. As a proof of concept, the Zero logs are being fed daily into Hadoop for analysis by the Global Development team. Debs for Storm have been built. Storm has been puppetized and is running on several of the Cisco nodes.

2012-12-monthly

LDAP Hue/Hadoop authentication works, but group file access still needs to be worked out. We've puppetized an Apache proxy for internal Kraken and Hadoop web services, as well as udp2log-to-Kafka production and Kafka-to-Hadoop consumption. The event.gif log stream is being consumed into Hadoop. We're attempting to use udp2log to import logs into Kafka and Hadoop without packet loss, and we are backing up Hadoop service data files (e.g. for Hue, Oozie and Hive) to HDFS.

2013-01-monthly

In January, we built automated data collection and analysis workflows for mobile page views by country and by device. In the process, we wrote Pig User-Defined Functions for device detection using the dClass library, and we root-caused and resolved an apparent packet loss issue. We also migrated all our analytics repositories to the official Wikimedia GitHub account. We experimented with a setup of Riemann for real-time cluster monitoring.

2013-02-monthly

We did two reviews of Kraken: one for security and one for overall architecture. We're incorporating the feedback, which includes merging our puppet modules into the operations puppet repository and the test puppet in Labs. Work has started to create dashboards for mobile pageviews, Wikipedia Zero and the mobile alpha and beta sites.

2013-04-monthly

We've improved the functionality of Limn, our visualization tool, to allow users to create and edit charts via the UI. We can also automatically deploy new instances of Limn, so it's faster and easier to set up dashboards. In addition to current users, we expect this to be very helpful for the Program Evaluation team as they start to develop their own analytics.

We're also now importing 1:1000 traffic streams, enabling us to migrate reports from our legacy analytics platform, WikiStats, onto our big data cluster, Kraken. In the future, this will make it easier for us to publish data and visualize reports using our newer infrastructure.

We have implemented secure login to the User Metrics API via SSL. We've also introduced a new metric called pages_created, allowing us to count the number of pages created by a specific editor.

We improved the accuracy of the udp2log monitoring and upgraded the machines to Ubuntu Precise in order to make the system more robust.

2013-05-monthly

We continued our efforts to increase monitoring coverage of the different webrequest dataflows. On the udp2log side, we added monitoring per DC/server role. Every month, we work on improving the robustness and security of the analytics-related servers that we run: we moved the multicast relay from Oxygen to Gadolinium, we upgraded Oxygen to Ubuntu Precise, and we moved all the Limn-based dashboards from the Kripke labs instance to the Limn0 labs instance. Continuous integration for webstatscollector, wikistats and udp-filters now works. The puppet module for Hadoop has been merged into the Operations repository; this is a big step forward in moving Kraken from beta to production status. Magnus Edenhill demonstrated varnishkafka based on Kafka 0.8; on a local machine varnishkafka was able to process 140k msgs/s and we are planning to do production testing in mid-June. Last, we separated the Kraken machines from the other production servers by installing network ACLs.

2013-06-monthly

We made significant progress with our preparations for replacing udp2log with Kafka in our logging infrastructure. The C library librdkafka now has support for the 0.8 protocol, a first version of varnishkafka is ready that will replace varnishncsa, the Apache Kafka project released its first beta of Kafka 0.8, and we have a Debianized and Puppetized version. We keep on adding new metrics and alerts to monitor all the different parts of the webrequest dataflows into Kraken. We expect to keep making improvements in the coming months, until we have a fully reliable data pipeline into Kraken. We also continued our efforts of moving Kraken out of beta: we puppetized Zookeeper, JMXtrans, and the Hadoop client nodes for Hive, Pig and Sqoop. We started reinstalling the Hadoop Datanode workers with a fully puppetized Hadoop installation; so far, we have replaced three nodes, and we'll replace the other seven in the coming weeks. Last, we enabled Jenkins continuous integration for the Grantmaking & Evaluation dashboards.

2013-07-monthly

Kraken:

We kicked off a reliability project with Ops with the end goal of stabilizing Hadoop and the logging infrastructure. Teams have been in discussions on architecture and planning, and should have a path forward in the next 2 weeks. We identified a consultant who will perform a system audit to aid the project.

We continue adding new metrics and alerts to monitor all the different parts of the webrequest dataflows into Kraken. We expect to keep making improvements in the coming months until we have a fully reliable data pipeline into Kraken.

We puppetized Hue, Hive, and Oozie. We also have a working setup of the Hadoop cluster in Labs for testing purposes. All Puppet work is open sourced.

Logging Infrastructure:

We started this month by designing a canary event monitoring system. A canary event is an artificial event that is injected at the start of the data workflow and that we monitor to see whether it reaches its final destination; that way we can ensure that the dataflows are functioning.
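The canary idea can be illustrated with a minimal Python sketch. It is not the system the team designed: the CanaryMonitor class, its method names, the event fields and the 60-second timeout are all hypothetical, chosen only to show the inject, observe and alert steps.

```python
import time
import uuid
from dataclasses import dataclass, field


@dataclass
class CanaryMonitor:
    """Tracks canary events from injection to arrival (hypothetical helper)."""

    timeout_s: float = 60.0  # assumed deadline; not taken from the source
    pending: dict = field(default_factory=dict)  # canary id -> injection time

    def inject(self, send, now=None):
        """Build a canary event, hand it to `send`, and record when it left."""
        now = time.time() if now is None else now
        canary_id = str(uuid.uuid4())
        self.pending[canary_id] = now
        send({"type": "canary", "id": canary_id, "sent_at": now})
        return canary_id

    def observe(self, event, now=None):
        """Call for each event seen at the final destination.

        Returns the end-to-end latency in seconds for a known canary,
        or None for ordinary events and unknown canaries.
        """
        if event.get("type") != "canary":
            return None
        sent_at = self.pending.pop(event.get("id"), None)
        if sent_at is None:
            return None
        now = time.time() if now is None else now
        return now - sent_at

    def overdue(self, now=None):
        """Return ids of canaries that have not arrived within the timeout."""
        now = time.time() if now is None else now
        return [cid for cid, sent in self.pending.items()
                if now - sent > self.timeout_s]
```

In use, `inject` would be given whatever function writes into the start of the pipeline, `observe` would be called by the consumer at the end, and a periodic check of `overdue()` would raise an alert when a canary goes missing.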

We are investigating what data format to use for sending the webrequest messages from Varnish to the Hadoop cluster. The formats that we are scrutinizing are JSON, Protobuf and Avro, and we are also looking at compression algorithms such as Snappy.
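One way to compare such options is to measure the encoded size of a batch of messages before and after compression. The sketch below is only an illustration under stated assumptions: it uses JSON with zlib from the Python standard library as a stand-in for Snappy (which is not in the standard library), and the record fields in the usage note are invented, not the actual webrequest schema.

```python
import json
import zlib


def encoded_sizes(records):
    """Return (raw_bytes, compressed_bytes) for newline-delimited JSON.

    zlib is a stand-in compressor here; a real evaluation would swap in
    the codec under test and also time encoding and decoding.
    """
    raw = "\n".join(json.dumps(r, sort_keys=True) for r in records).encode("utf-8")
    compressed = zlib.compress(raw)
    return len(raw), len(compressed)
```

For example, calling `encoded_sizes` on a list of a thousand dictionaries shaped like `{"host": "cp1001", "status": 200, "uri": "/wiki/Main_Page"}` (hypothetical field names) gives the two byte counts, whose ratio indicates how much a batch of repetitive log lines shrinks under compression.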

2013-08-monthly

We continue to pursue the initiatives listed in our planning document. We've had one analyst accept a job offer (welcome Aaron!) and are in discussions with a software engineer. We continue to have a solid pipeline and are spending a lot of time interviewing. Wikimetrics is on target for an early September release and we've made good progress against our Hadoop infrastructure goals. In cooperation with Ops, we've completed our reinstall of the Hadoop cluster and ran several days of reliability testing over the Labor Day weekend. We are currently investigating replacing the Oracle JDK with OpenJDK to be in line with our goal of using open source whenever possible. Our project to replace udp2log with Kafka is making steady progress. Varnishkafka, which will replace varnishncsa, has been debianized and the first performance tests of compressing the message sets are very encouraging. We created a test environment in Labs to test Kafka failover modes and we have been prototyping with Camus to consume the data from a broker and write it to HDFS. We are currently thinking about how to set up Kafka in a multi-datacenter environment. The Zookeepers have been reinstalled through Puppet as well.