User:BDavis (WMF)/Projects/Logging, metrics and monitoring

Logging, metrics and monitoring (LMM) is an evolving idea for a team, sub-team, or individual in the Wikimedia Foundation whose mission would be to maintain and enhance the operational logging and monitoring infrastructure used by the Foundation's production servers.

Problem
Determining the health of the Wikimedia cluster at any given time is a complex problem. A number of tools have been developed or deployed to provide insight into the systems in use on the production cluster:


 * ELK stack
 * Graphite
 * Ganglia
 * Grafana
 * Tendril
 * Icinga
 * Ishmael
 * Dbtree

These tools and others all help in various ways, but they largely lack day-to-day owners. The sheer number of tools also compounds confusion about where to look, how to sign up for notifications, and what to do if something looks wrong. Many (most? all?) of these tools were originally introduced as experiments, and many do not have solid plans for scaling and adapting to infrastructure changes. Conspicuously missing are tools or services that provide automatic anomaly detection or unified dashboards of widely recognized health signals.

Proposal
Create a small team responsible for managing our existing tools, evaluating and selecting new tools, planning for scaling of the tools, promoting the proper use of the tools to Wikimedia Foundation engineers and management, and integrating these tools with MediaWiki and other software deployed on the cluster.