User:BDavis (WMF)/Projects/Structured logging

This is a collection of ideas about Structured logging in MediaWiki and related software projects. For the purposes of this discussion "logging" refers to  and it's related family of functions and not audit logs stored in the database by Special:Log or similar systems.

Problem
Logging in MediaWiki (and WMF in general?) is optimized for human consumption. This works well for local development and testing and can scale to management of a small to mid size wiki depending on the number of eyes applied to the logs and the acumen of the operational support personnel with grep, cut, sed, awk and other text processing tools.

Unfortunately what works well for a single developer doesn't work as well for analyzing the log output of a large production site.
 * need better tools for wider audience
 * aggregation
 * de-dup
 * cross system correlation
 * alerting
 * reporting

This is not a wholly new idea. Let's look at what's out there and see if we can find a solution or at least borrow the best bits.

Current logging

 * wfDebug( $text, $logonly = false )
 * Logs developer provided free-form text + optional global prefix string
 * Possibly has time-elapsed since start of request and real memory usage inserted between prefix and message
 * Delegates to


 * wfDebugMem( $exact = false )
 * Uses  to log "Memory usage: N (kilo)?bytes"


 * wfDebugLog( $logGroup, $text, $public = true)
 * Logs either to a custom log sink defined in  or via
 * Default
 * Prepends "[$logGroup] " to message
 * Custom sink
 * May log only a fraction of occurrences via  sampling
 * Prepends  to message
 * Delegates to  to actually write to sink


 * wfLogDBError( $text )
 * Enabled/disabled with  sink location
 * Logs "$date\t$host\t$wiki\$text" via  to    sink
 * Date format is  with possible custom timezone specified by


 * wfErrorLog( $text, $file )
 * Writes  to either a local file or a UDP packet depending on the value of
 * UDP
 * If  ends with a string following the host name/IP it will be used as a prefix to
 * The final message with optional prefix added will be trimmed to 65507 bytes and a trailing newline may be added
 * FILE
 * will be appended to file unless the resulting file size would be >= 0x7fffffff bytes (~2G)


 * wfLogProfilingData
 * Delegates to  using   sink
 * Creates a tab delimited log message including timestamp, elapsed request time, requesting IPs, and request URL followed by a newline and the profiler output.
 * Date is from

Graylog Extended Log Format (GELF)
From http://www.graylog2.org/about/gelf: "The Graylog Extended Log Format (GELF) avoids the shortcomings of classic plain syslog: - Limited to length of 1024 byte - Not much space for payloads like backtraces - Unstructured. You can only build a long message string and define priority, severity etc."

Structured logs messages sent via UDP as gzip'd json with chunking support.


 * client libraries in PHP, Ruby, Java, Python, etc
 * native input for graylog, plugin for logstash


 * version
 * GELF spec version ("1.0")


 * host
 * hostname sending message


 * short_message
 * short descriptive message


 * full_message
 * long message (eg backtrace, env vars)


 * timestamp
 * UNIX microsecond timestamp


 * level
 * numeric syslog level


 * facility
 * event generator label


 * line
 * source file line


 * file
 * source file


 * _
 * sender specified field

Metlog
From https://wiki.mozilla.org/Services/Sagrada/Metlog: "'The Metlog project is part of Project Sagrada, providing a service for applications to capture and inject arbitrary data into a back end storage suitable for out-of-band analytics and processing.'"

3-tier architecture: generator (eg an app), router (logstash), endpoints (back end event destinations: statsd, Sentry, HDFS, Esper, OpenTSDB, ...)


 * python and node.js client libs for formatting and emitting messages.
 * logstash plugins for input processing and output to backends
 * json serialized messages


 * timestamp
 * time message generated


 * logger
 * message generator (eg app name)


 * type
 * type of message payload
 * (application defined; correlated with payload and fields)


 * severity
 * rfc5424 numeric code (syslog level)


 * payload
 * message contents


 * fields
 * arbitrary key/value pairs


 * env_version
 * message envelope version (0.8)

Common Event Expression (CEE)
From http://cee.mitre.org/: "'Common Event Expression (CEE™) improves the audit process and the ability of users to effectively interpret and analyze event log and audit data. This is accomplished by defining an extensible unified event structure, which users and developers can leverage to describe, encode, and exchange their CEE Event Records.'"


 * Project dead due to funding loss, but some interesting artifacts left behind.
 * http://cee.mitre.org/language/1.0-beta1/core-profile.html
 * Lumberjack (https://fedorahosted.org/lumberjack/) seems to be an equally dead implementation.
 * CEE/Lumberjack provide taxonomies for the types of arbitrary key/value pairs described in Metlog and other basic structured log formats.

Other ideas

 * Mapped diagnostic context (MDC)
 * I used to have an academic paper describing this; I can't find it now
 * Available in Java, ruby, lots of others.
 * I wrote one for PHP: https://github.com/bd808/moar-log/blob/master/src/Moar/Log/MDC.php
 * Timers and counters
 * https://github.com/bd808/moar-metrics
 * statsd everywhere
 * Separate events from emitters
 * "optimistic logging" - aggregate log events in ram, only emit if a "severe" message is seen

Serialization Format
Rather than dive down a rabbit hole of trying to find a universal spec for log file formats let's just keep things simple. PHP loves dictionaries (well they call them arrays but whatever; key=value collections) and has a pretty fast json formatter. So the simplest thing that will work reasonably well would be to keep log events internally as PHP arrays and serialize them as json objects. This will be relatively easy to recreate on other internally developed applications as well with the possible exception of apps written in low level languages such as C that don't have ready made key=value data structures.

Data collected
Here's a list of the data points that I think we should definitely have:
 * timestamp
 * Local system time that event occurred either as UNIX epoch timestamp or ISO 8601 formatted string


 * host
 * FQDN of system where event occurred


 * source
 * Name of application generating events; correlates to APP-NAME of RFC 5424


 * pid
 * Unix process id, thread id, thread name or other process identifier


 * severity
 * Message severity (RFC 5424 levels)


 * channel
 * Log channel. Often the function/module/class creating message (similar to  groups)


 * message
 * Message body

Additionally I would suggest that we add a semi-structured "context" component to logs. This would be a collection of key=value pairs that the developers determine to be useful for debugging. In the past I have found it useful to have two different methods available to add such data. The first is as an optional argument to the logging method itself and the second is a global collection patterned after the Log4J Mapped Diagnostic Context (MDC).

The local collection is useful for obvious reasons such as attaching class/method state data to the log output and deferring stringification of resources in the event that runtime configuration is ignoring messages of the provided level.


 * file
 * Source file triggering message


 * line
 * Source line triggering message


 * errcode
 * Numeric or string identifier for the error


 * exception
 * Live exception object to be stringified by the log event emitter


 * args
 * key=value map of method arguments

The global collection is very useful for attaching global application state data to all log messages that may be emitted. Examples of data that could be included:
 * vhost
 * Apache vhost processing request


 * ip
 * Requesting ip address


 * user
 * Authenticated user identity


 * req
 * Request ID; UUID or similar token that can be used to correlate all log messages connected to a given request



There are obvious privacy concerns with some of these data fields. Careful thought will need to be given to designing a system such that wikis can choose whether or not to log any of these fields.

Not collected (rejected fields)

 * ua
 * User agent

sess
 * Application session ID (or a proxy token that can be used to correlate events connected to the session)

API
FIXME: Propose a PHP API for logging and a migration plan for existing wfErrorLog calls