Analytics/Archive/Data Processing/status

Last update on: 2014-07-monthly

2014-05-monthly
Work this month focused on capacity, deployment, and the new CDH 5 (Hadoop) version. These initiatives should be resolved in June. A permissions issue caused the page view dumps to stall for a weekend. The system was fixed promptly and no data was lost.

2014-06-monthly
The team has now integrated Data Processing into its Development Process. New Stories/Features have been identified and tasked. Also, experimentation with Cloudera Hadoop 5 is complete and we are ready to upgrade the cluster in July.

2014-07-monthly
New nodes were added to the cluster this month and all machines were upgraded to run CDH5. The team decided not to preserve any data on the cluster during the upgrade and started fresh. The team hosted a Tech Talk on our Hadoop installation (see video and slides). Duplicate monitoring has also been implemented in Hadoop to monitor the incoming Varnish logs.



2014-08-monthly
The team continued monitoring analytics systems and responding to issues when (non-critical) alarms went off. Packet losses and Kafka issues were diagnosed and handled.

Hadoop worker nodes now automatically set memory limits according to what is available. Previously all workers had the same fixed limit. This allows for better resource utilization.
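
As a rough illustration of the idea above, a per-node limit could be derived from each machine's total memory instead of being hard-coded. This is only a sketch: the function name, the amount reserved for the operating system and daemons, and the node sizes are all hypothetical and are not taken from the actual cluster configuration.

```python
def worker_memory_limit_mb(total_mb: int,
                           reserved_mb: int = 4096,
                           min_limit_mb: int = 1024) -> int:
    """Return a memory limit (in MB) for Hadoop work on one node.

    total_mb     -- physical memory on the node
    reserved_mb  -- assumed headroom for the OS and daemons (hypothetical value)
    min_limit_mb -- floor so that very small nodes still get a usable limit
    """
    return max(total_mb - reserved_mb, min_limit_mb)


# Hypothetical worker nodes with different amounts of RAM (in MB).
nodes = {"worker-a": 65536, "worker-b": 49152}

# Each node gets its own limit instead of one fixed value for all workers.
limits = {name: worker_memory_limit_mb(total) for name, total in nodes.items()}
```

With a single fixed limit, the larger node would leave memory unused or the smaller one would be overcommitted; computing the limit per node avoids both.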

Logstash is now available at https://logstash.wikimedia.org (Wikitech account required). Logs from Hadoop are piped there for easier search and diagnosis of Hadoop jobs.

Some uses of udp2log were migrated to kafkatee, which is not prone to packet losses. In particular, Webstatscollector was switched over and its error rates dropped drastically. Eventually, the “collecting” part of Webstatscollector will be implemented in Hadoop, a much more scalable environment for such work.