User:BDavis (WMF)/Projects/Structured logging

This is a collection of ideas about structured logging in MediaWiki and related software projects.

Problem
Logging in MediaWiki (and at WMF in general?) is optimized for human consumption. This works well for local development and testing, and it can scale to managing a small or mid-sized wiki, depending on how many people read the logs and how skilled the operations staff are with grep, cut, sed, awk and other text processing tools.

Unfortunately, what works well for a single developer doesn't work as well for analyzing the log output of a large production site. That takes:
 * better tools for a wider audience
 * aggregation
 * de-duplication
 * cross-system correlation
 * alerting
 * reporting

This is not a wholly new idea. Let's look at what's out there and see if we can find a solution or at least borrow the best bits.

Graylog Extended Log Format (GELF)
From http://www.graylog2.org/about/gelf: "The Graylog Extended Log Format (GELF) avoids the shortcomings of classic plain syslog: - Limited to length of 1024 byte - Not much space for payloads like backtraces - Unstructured. You can only build a long message string and define priority, severity etc."

Structured log messages are sent via UDP as gzip-compressed JSON, with chunking support for messages too large for a single datagram.

 * client libraries in PHP, Ruby, Java, Python, etc.
 * native input for Graylog, plugin for Logstash
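
The transport described above can be sketched in a few lines of Python. This is a minimal sketch, not a reference client: it assumes the GELF chunk layout as I understand it (two magic bytes 0x1e 0x0f, an 8-byte message id, a 1-byte sequence number and a 1-byte sequence count, with at most 128 chunks per message). The function names, the 1420-byte default chunk size and the host argument are illustrative assumptions; 12201 is the port Graylog conventionally listens on for GELF over UDP.

```python
import gzip
import json
import os
import socket
import struct

CHUNK_MAGIC = b"\x1e\x0f"  # marks a datagram as one chunk of a larger message
HEADER_SIZE = 12           # 2 magic + 8 message id + 1 sequence number + 1 count
MAX_CHUNKS = 128           # assumed protocol limit on chunks per message


def encode_gelf(message, chunk_size=1420):
    """Return the list of UDP payloads needed to carry one message.

    chunk_size is an assumed datagram budget, not a value from the spec.
    """
    payload = gzip.compress(json.dumps(message).encode("utf-8"))
    if len(payload) <= chunk_size:
        return [payload]  # small enough to send as a single datagram

    body_size = chunk_size - HEADER_SIZE
    pieces = [payload[i:i + body_size] for i in range(0, len(payload), body_size)]
    if len(pieces) > MAX_CHUNKS:
        raise ValueError("message needs more than %d chunks" % MAX_CHUNKS)

    message_id = os.urandom(8)  # the same id ties all chunks of a message together
    return [
        CHUNK_MAGIC + message_id + struct.pack("BB", seq, len(pieces)) + piece
        for seq, piece in enumerate(pieces)
    ]


def send_gelf(message, host, port=12201):
    """Send one message over UDP; host is whatever collector you run."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        for datagram in encode_gelf(message):
            sock.sendto(datagram, (host, port))
    finally:
        sock.close()
```

The receiver is expected to collect chunks sharing a message id, order them by sequence number, concatenate the bodies and then decompress the result.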

Structure

 * version: GELF spec version ("1.0")
 * host: hostname sending the message
 * short_message: short descriptive message
 * full_message: long message (e.g. backtrace, env vars)
 * timestamp: UNIX microsecond timestamp
 * level: numeric syslog level
 * facility: event generator label
 * line: source file line
 * file: source file
 * _[additional field]: sender-specified field; any field name prefixed with an underscore
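
To make the field list concrete, here is a minimal Python sketch that builds a GELF-style message as a plain dict. The helper name gelf_message, the default facility "mediawiki" and the extra field wiki are made up for illustration. time.time() is used as a stand-in for the timestamp, giving seconds since the epoch as a float with sub-second precision.

```python
import json
import socket
import time


def gelf_message(short_message, level=6, facility="mediawiki",
                 full_message=None, file=None, line=None, **extra):
    """Build a dict carrying the GELF fields listed above.

    level is a numeric syslog level (6 = informational, 3 = error).
    """
    message = {
        "version": "1.0",
        "host": socket.gethostname(),
        "short_message": short_message,
        "timestamp": time.time(),
        "level": level,
        "facility": facility,
    }
    # Optional fields are left out entirely when not supplied.
    if full_message is not None:
        message["full_message"] = full_message
    if file is not None:
        message["file"] = file
    if line is not None:
        message["line"] = line
    # Sender-specified fields get the underscore prefix.
    for key, value in extra.items():
        message["_" + key] = value
    return message


if __name__ == "__main__":
    event = gelf_message("Database connection failed", level=3,
                         file="Database.php", line=42, wiki="enwiki")
    print(json.dumps(event, indent=2))
```

Because the sender-specified fields are ordinary keys, a downstream tool can filter or aggregate on them without parsing the message text.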

Metlog
From https://wiki.mozilla.org/Services/Sagrada/Metlog: "The Metlog project is part of Project Sagrada, providing a service for applications to capture and inject arbitrary data into a back end storage suitable for out-of-band analytics and processing."

3-tier architecture: generator (e.g. an app), router (Logstash), endpoints (back end event destinations: statsd, Sentry, HDFS, Esper, OpenTSDB, ...)

 * Python and Node.js client libraries for formatting and emitting messages
 * Logstash plugins for input processing and output to back ends
 * JSON-serialized messages

Structure

 * timestamp: time the message was generated
 * logger: message generator (e.g. app name)
 * type: type of message payload (application defined; correlated with payload and fields)
 * severity: RFC 5424 numeric code (syslog level)
 * payload: message contents
 * fields: arbitrary key/value pairs
 * env_version: message envelope version (0.8)
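
The envelope above maps directly onto a JSON object. The following Python sketch shows one way a generator might build it. The helper name metlog_message, the ISO 8601 timestamp format and the example "timer" message are assumptions for illustration, not taken from the Metlog client libraries.

```python
import json
from datetime import datetime, timezone


def metlog_message(logger, type_, payload, severity=6, fields=None):
    """Build a dict with the Metlog envelope fields listed above."""
    return {
        # Assumption: ISO 8601 in UTC; check the format your router expects.
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "logger": logger,
        "type": type_,
        "severity": severity,            # RFC 5424 numeric code
        "payload": payload,
        "fields": dict(fields or {}),    # arbitrary key/value pairs
        "env_version": "0.8",
    }


if __name__ == "__main__":
    # Hypothetical timing event: how "type" and "fields" relate is up to the app.
    event = metlog_message("mediawiki", "timer", "42",
                           fields={"name": "db.query", "rate": 1.0})
    print(json.dumps(event))
```

The router can then dispatch on type alone, for example sending "timer" messages to statsd and error reports to Sentry, without inspecting the payload.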

Common Event Expression (CEE)
From http://cee.mitre.org/: "Common Event Expression (CEE™) improves the audit process and the ability of users to effectively interpret and analyze event log and audit data. This is accomplished by defining an extensible unified event structure, which users and developers can leverage to describe, encode, and exchange their CEE Event Records."


 * Project dead due to funding loss, but some interesting artifacts left behind.
 * http://cee.mitre.org/language/1.0-beta1/core-profile.html
 * Lumberjack (https://fedorahosted.org/lumberjack/) seems to be an equally dead implementation.
 * CEE/Lumberjack provide taxonomies for the types of arbitrary key/value pairs described in Metlog and other basic structured log formats.

Other ideas

 * Mapped diagnostic context (MDC)
 * I used to have an academic paper describing this; I can't find it now
 * Available in Java, Ruby, lots of others.
 * I wrote one for PHP: https://github.com/bd808/moar-log/blob/master/src/Moar/Log/MDC.php
 * Timers and counters
 * https://github.com/bd808/moar-metrics
 * statsd everywhere
 * Separate events from emitters
 * "optimistic logging" - aggregate log events in RAM, only emit if a "severe" message is seen
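
The "optimistic logging" idea can be sketched with Python's standard logging module. This is only an illustration of the technique: the class name OptimisticHandler, the default capacity and the choice of ERROR as the "severe" threshold are assumptions. The standard library's logging.handlers.MemoryHandler is similar, but it also flushes whenever its buffer fills up, so this sketch uses a bounded buffer that silently drops the oldest records instead.

```python
import logging
from collections import deque


class OptimisticHandler(logging.Handler):
    """Hold records in RAM and forward them only when a severe one arrives."""

    def __init__(self, target, capacity=100, trigger_level=logging.ERROR):
        super().__init__()
        self.target = target                  # handler that really emits
        self.trigger_level = trigger_level    # what counts as "severe"
        self.buffer = deque(maxlen=capacity)  # oldest records fall off the end

    def emit(self, record):
        self.buffer.append(record)
        if record.levelno >= self.trigger_level:
            # Something went wrong: replay the buffered context in order,
            # ending with the severe record itself.
            for buffered in self.buffer:
                self.target.handle(buffered)
            self.buffer.clear()

    def discard(self):
        """Drop buffered records, e.g. at the end of a successful request."""
        self.buffer.clear()


if __name__ == "__main__":
    log = logging.getLogger("optimistic-demo")
    log.setLevel(logging.DEBUG)
    handler = OptimisticHandler(logging.StreamHandler(), capacity=50)
    log.addHandler(handler)

    log.debug("request started")   # held in RAM
    log.info("cache miss")         # held in RAM
    log.error("database timeout")  # triggers output of all three records
```

The trade-off is that nothing is recorded for requests that succeed, so this suits debug-level context better than events needed for reporting.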