User:BDavis (WMF)/Projects/Structured logging

This is a collection of ideas about Structured logging in MediaWiki and related software projects. For the purposes of this discussion "logging" refers to wfErrorLog() and it's related family of functions and not audit logs stored in the database by Special:Log or similar systems.

Problem[edit]

Logging in MediaWiki (and WMF in general?) is optimized for human consumption. This works well for local development and testing and can scale to management of a small to mid size wiki depending on the number of eyes applied to the logs and the acumen of the operational support personnel with grep, cut, sed, awk and other text processing tools.

Unfortunately what works well for a single developer doesn't work as well for analyzing the log output of a large production site.

need better tools for wider audience
aggregation
de-dup
cross system correlation
alerting
reporting

This is not a wholly new idea. Let's look at what's out there and see if we can find a solution or at least borrow the best bits.

Current logging[edit]

wfDebug( $text, $logonly = false ): Logs developer provided free-form text + optional global prefix string; Possibly has time-elapsed since start of request and real memory usage inserted between prefix and message; Delegates to wfErrorLog()

wfDebugMem( $exact = false ): Uses wfDebug() to log "Memory usage: N (kilo)?bytes"

wfDebugLog( $logGroup, $text, $public = true)

Logs either to a custom log sink defined in $wgDebugLogGroups or via wfDebug()

Default

Prepends "[$logGroup] " to message

Custom sink

May log only a fraction of occurrences via mt_rand() sampling

Prepends wfTimestamp( TS_DB ) wfWikiID() wfHostname(): to message

Delegates to wfErrorLog() to actually write to sink

wfLogDBError( $text ): Enabled/disabled with $wgDBerrorLog sink location; Logs "$date\t$host\t$wiki\$text" via wfErrorLog() to $wgDBerrorLog sink; Date format is 'D M j G:i:s T Y' with possible custom timezone specified by $wgDBerrorLogTZ

wfErrorLog( $text, $file )

Writes $text to either a local file or a UDP packet depending on the value of $file

UDP

If $file ends with a string following the host name/IP it will be used as a prefix to $text

The final message with optional prefix added will be trimmed to 65507 bytes and a trailing newline may be added

FILE

$text will be appended to file unless the resulting file size would be >= 0x7fffffff bytes (~2G)

wfLogProfilingData(): Delegates to wfErrorLog() using $wgDebugLogFile sink; Creates a tab delimited log message including timestamp, elapsed request time, requesting IPs, and request URL followed by a newline and the profiler output.; Date is from gmdate( 'YmdHis' )

Structured logging standards[edit]

Graylog Extended Log Format (GELF)[edit]

From http://www.graylog2.org/about/gelf:

"The Graylog Extended Log Format (GELF) avoids the shortcomings of classic plain syslog:
- Limited to length of 1024 byte - Not much space for payloads like backtraces
- Unstructured. You can only build a long message string and define priority, severity etc."

Structured logs messages sent via UDP as gzip'd json with chunking support.

client libraries in PHP, Ruby, Java, Python, etc
native input for graylog, plugin for logstash

version: GELF spec version ("1.0")
host: hostname sending message
short_message: short descriptive message
full_message: long message (eg backtrace, env vars)
timestamp: UNIX microsecond timestamp
level: numeric syslog level
facility: event generator label
line: source file line
file: source file
_<whatever>: sender specified field

Metlog[edit]

From https://wiki.mozilla.org/Services/Sagrada/Metlog:

"The Metlog project is part of Project Sagrada, providing a service for applications to capture and inject arbitrary data into a back end storage suitable for out-of-band analytics and processing."

3-tier architecture: generator (eg an app), router (logstash), endpoints (back end event destinations: statsd, Sentry, HDFS, Esper, OpenTSDB, ...)

python and node.js client libs for formatting and emitting messages.
logstash plugins for input processing and output to backends
json serialized messages

timestamp: time message generated
logger: message generator (eg app name)
type: type of message payload; (application defined; correlated with payload and fields)
severity: rfc5424 numeric code (syslog level)
payload: message contents
fields: arbitrary key/value pairs
env_version: message envelope version (0.8)

Common Event Expression (CEE)[edit]

From http://cee.mitre.org/:

"Common Event Expression (CEE™) improves the audit process and the ability of users to effectively interpret and analyze event log and audit data. This is accomplished by defining an extensible unified event structure, which users and developers can leverage to describe, encode, and exchange their CEE Event Records."

Project dead due to funding loss, but some interesting artifacts left behind.
- http://cee.mitre.org/language/1.0-beta1/core-profile.html
Lumberjack (https://fedorahosted.org/lumberjack/) seems to be an equally dead implementation.
CEE/Lumberjack provide taxonomies for the types of arbitrary key/value pairs described in Metlog and other basic structured log formats.

Other ideas[edit]

Mapped diagnostic context (MDC)
- I used to have an academic paper describing this; I can't find it now
- Available in Java, ruby, lots of others.
- I wrote one for PHP: https://github.com/bd808/moar-log/blob/master/src/Moar/Log/MDC.php
Timers and counters
- https://github.com/bd808/moar-metrics
statsd everywhere
Separate events from emitters
"optimistic logging" - aggregate log events in ram, only emit if a "severe" message is seen

Proposal[edit]

Serialization Format[edit]

Rather than dive down a rabbit hole of trying to find a universal spec for log file formats let's just keep things simple. PHP loves dictionaries (well they call them arrays but whatever; key=value collections) and has a pretty fast json formatter. So the simplest thing that will work reasonably well would be to keep log events internally as PHP arrays and serialize them as json objects. This will be relatively easy to recreate on other internally developed applications as well with the possible exception of apps written in low level languages such as C that don't have ready made key=value data structures.

Data collected[edit]

Here's a list of the data points that I think we should definitely have:

timestamp: Local system time that event occurred either as UNIX epoch timestamp or ISO 8601 formatted string; date( 'c' )
host: FQDN of system where event occurred; php_uname( 'n' )
source: Name of application generating events; correlates to APP-NAME of RFC 5424; 'Mediawiki'
pid: Unix process id, thread id, thread name or other process identifier; getmypid()
severity: Message severity (RFC 5424 levels); 'WARN'
channel: Log channel. Often the function/module/class creating message (similar to wgDebugLogGroups groups); get_class( $this )
message: Message body; "Help! I'm trapped in a logger factory!"

Additionally I would suggest that we add a semi-structured "context" component to logs. This would be a collection of key=value pairs that the developers determine to be useful for debugging. In the past I have found it useful to have two different methods available to add such data. The first is as an optional argument to the logging method itself and the second is a global collection patterned after the Log4J Mapped Diagnostic Context (MDC).

The local collection is useful for obvious reasons such as attaching class/method state data to the log output and deferring stringification of resources in the event that runtime configuration is ignoring messages of the provided level.

file: Source file triggering message
line: Source line triggering message
errcode: Numeric or string identifier for the error
exception: Live exception object to be stringified by the log event emitter
args: key=value map of method arguments

The global collection is very useful for attaching global application state data to all log messages that may be emitted. Examples of data that could be included:

vhost: Apache vhost processing request; $_SERVER['HTTP_HOST']
ip: Requesting ip address; $_SERVER['REMOTE_ADDR']
user: Authenticated user identity
req: Request ID; UUID or similar token that can be used to correlate all log messages connected to a given request

There are obvious privacy concerns with some of these data fields. Careful thought will need to be given to designing a system such that wikis can choose whether or not to log any of these fields.

Not collected (rejected fields)[edit]

ua: User agent

sess

Application session ID (or a proxy token that can be used to correlate events connected to the session)

API[edit]

FIXME: Propose a PHP API for logging and a migration plan for existing wfErrorLog() calls