
Talk:Requests for comment/Structured logging

From mediawiki.org

Tracking components


Here's a related bug I filed back in 2008 about using wfDebugLog instead of wfDebug so that a system admin can determine more easily where log messages are coming from. https://bugzilla.wikimedia.org/show_bug.cgi?id=14601. Something to think about. Cneubauer (talk) 15:16, 4 December 2013 (UTC)

Thanks for pointing this out. I've added it to the "See also" section of the RFC. BDavis (WMF) (talk) 16:23, 4 December 2013 (UTC)
And since Dec 2012 ( https://gerrit.wikimedia.org/r/37442 ) the log bucket is prepended to the message when it falls into the main $wgDebugLogFile. So we could potentially replace all the wfDebug() calls right now, though having wfDebugLog( 'default', 'some message' ); is not ideal. Antoine "hashar" Musso (talk) 20:53, 4 December 2013 (UTC)

IP address and other personal identifying information


Is there any technical reason to keep actual IPs as data points of structured logging entries? The Internet Archive replaced them with hashes generated with a daily key.[1] Especially for operations logs, hashes should be more than enough (if a day is too frequent, make it a week or whatever).

Privacy in this case is not just a philosophical matter but also a very practical one: the less private information a log contains, the easier it is to distribute it, have more eyeballs checking it and identify problems quickly. Nemo 17:57, 4 December 2013 (UTC)

This is definitely a point worth discussing. I happen to personally fall into the camp of persons who don't believe that IP addresses constitute private information, but I'm just one voice in the discussion. Having an IP address is useful for diagnosing some bugs, but for most use cases I can think of, an opaque token that corresponds to the IP (or even the /24 block) would be just as useful. I think it would be nice to end up with a core standard that takes sharing log data with the community into consideration, but my primary goal is to enable better log analysis for the operational wikis run by the Foundation. BDavis (WMF) (talk) 18:16, 4 December 2013 (UTC)
The Wikimedia Foundation and the community it represents do consider the IP address to be private. https://wikimediafoundation.org/wiki/Privacy_policy#IP_and_other_technical_information
We do need IP addresses (client and the X-Forwarded-For header) in logs; they are very valuable, and obfuscating them would make debugging harder. The same goes for cookies, which we obviously also need in logs.
We could redact them whenever logs are published, and might even have some utility handle that for us. But honestly I think it is out of the scope of this document. Antoine "hashar" Musso (talk) 20:50, 4 December 2013 (UTC)
Having structure in the log events themselves would be a great help for the scrubbing/filtering operations needed to redact logs for wider distribution. This is similar to the discussions that have taken place[citation needed] about filtering exception stack traces in logs so that they can be shown. BDavis (WMF) (talk) 21:00, 4 December 2013 (UTC)
It may be nice to somehow flag private data at the top level of the log structure. I'd like to avoid having a consumer filter out top level fields like ip, then assume the context/message is safe to make public. Or have to grep through those unstructured bits trying to redact particular messages. CSteipp (talk) 23:13, 4 December 2013 (UTC)
I think the spec could accommodate a "private" collection of information, but it would ultimately be up to code review to ensure that private info isn't placed in non-private containers. BDavis (WMF) (talk) 23:24, 4 December 2013 (UTC)
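As a sketch of what such a "private" collection and the matching scrubbing step could look like (field names here are purely illustrative, not part of the RFC):

```python
import json

def redact(event):
    # Drop the entire "private" collection before wider distribution.
    # Keeping all sensitive values under one top-level key means a
    # consumer never has to grep unstructured text to scrub a message.
    return {k: v for k, v in event.items() if k != "private"}

event = {
    "channel": "auth",
    "message": "login failed",
    "private": {"ip": "192.0.2.7", "xff": "198.51.100.4"},
}
print(json.dumps(redact(event)))
```

The trade-off is exactly the one noted above: the scheme only works if code review keeps private values out of the free-form message/context fields.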
Thanks Hashar. Is this true of all logs, or are there some where unobfuscated IP addresses are not required for debugging? Also, do IPs become useless for debugging after a short period of time? If not, it would be useful to encrypt them so that you retain at least some (permanent) information, as opposed to being forced to purge everything after 3 months or so per the privacy policy (as with CheckUser data). Nemo 09:41, 9 December 2013 (UTC)
EFF best practices may contain some useful tips, I think: https://www.eff.org/wp/osp (for instance around "Other key solutions to wipe PII from logs"). Most of that is probably not relevant or already in progress (e.g. EventLogging/UserAgentSanitization), but dropping the link here can't harm I think. :) Nemo 21:14, 5 February 2014 (UTC)
Quoting Adam Wight and Bawolff here to be able to find their notes again later:

++the EFF for more ideas, they are actively doing great work on so-called perfect forward secrecy.

There are simple things we could do to achieve a better balance between privacy and sockpantsing, such as cryptolog [2], in which IP addresses are hashed using a salt that changes every day. In theory, nobody can reverse the function to reveal the IP, but you can still correlate all of an address's edits for the day, week, or whatever, making CheckUser possible. IP range blocking obviously needs to happen up-front, before the IP is mangled. I have no suggestions, but maybe browser and preferences fingerprinting would be more effective anyway, since: tor.

The cryptolog approach - This has the property that there's a specific time where all anon identifiers suddenly change (e.g. Midnight every day in the setup cryptolog uses). Having an arbitrary point in time where suddenly identifiers shift is probably an unwanted property. (Although maybe it doesn't matter that much in practice? Someone who actually deals with abuse on wiki would be better able to answer that).


Nemo 14:58, 30 July 2014 (UTC)
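For reference, the cryptolog-style scheme quoted above can be sketched in a few lines. The secret, the salt derivation, and the token length are illustrative choices of this sketch, not anything cryptolog itself prescribes:

```python
import hashlib
import hmac

# Hypothetical long-lived secret; a real deployment would use a random
# key and destroy the per-day salts once the day is over.
SECRET = b"example-secret-do-not-use"

def daily_salt(day):
    # Derive a per-day salt from the UTC date string, e.g. "2014-07-30".
    return hmac.new(SECRET, day.encode(), hashlib.sha256).digest()

def anonymize_ip(ip, salt):
    # Same IP + same day => same token, so same-day correlation (and
    # thus CheckUser-style queries within the window) still works;
    # across days the tokens are unlinkable without the salts.
    return hmac.new(salt, ip.encode(), hashlib.sha256).hexdigest()[:16]

salt = daily_salt("2014-07-30")
print(anonymize_ip("192.0.2.7", salt))
```

This also makes Bawolff's point concrete: at the moment the salt rotates, every anonymous identifier changes at once.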

We are considering a similar approach for Parsoid

[edit]

The logging library we are considering is https://github.com/trentm/node-bunyan, which also logs JSON messages. Example messages:
<pre>
{"name":"myapp","hostname":"banana.local","pid":40161,"level":30,"msg":"hi","time":"2013-01-04T18:46:23.851Z","v":0}
{"name":"myapp","hostname":"banana.local","pid":40161,"level":40,"lang":"fr","msg":"au revoir","time":"2013-01-04T18:46:23.853Z","v":0}
</pre>
Gabriel Wicke (GWicke) (talk) 22:06, 4 December 2013 (UTC)

We can also easily log to GELF. Together with the available transport infrastructure that looks like a very good option to me. Gabriel Wicke (GWicke) (talk) 23:23, 4 December 2013 (UTC)
It looks like the current proposed data points mostly map into GELF.
{| class="wikitable"
! Proposal !! GELF 1.1 !! Notes
|-
| timestamp || timestamp ||
|-
| host || host ||
|-
| source || _source || facility is deprecated in 1.1
|-
| pid || _pid ||
|-
| severity || level || via string->number transform
|-
| channel || _channel ||
|-
| message || full_message ||
|-
| file || file ||
|-
| line || line ||
|-
| errorcode || short_message ||
|-
| exception || ? || GELF doesn't have collection support
|-
| args || ? || GELF doesn't have collection support
|-
| vhost || _vhost ||
|-
| ip || _ip ||
|-
| user || _user ||
|-
| req || _req ||
|}
One thing I don't like about GELF is that it doesn't support fields that are collections. This leads to clumsy representations for things like argument lists or structured stack traces. I do really like the support for chunking and compression in the GELF transport protocol.
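As a sketch, the mapping in the table could be a small transform like the following. The syslog-style level numbers and the JSON-string flattening of args are assumptions of this sketch; the flattening is exactly the clumsiness that GELF's lack of collection support forces:

```python
import json

# Syslog-style numeric levels, as GELF expects (assumed mapping).
SEVERITY_TO_LEVEL = {"debug": 7, "info": 6, "notice": 5, "warning": 4, "error": 3}

def to_gelf(event):
    gelf = {
        "version": "1.1",
        "host": event["host"],
        "timestamp": event["timestamp"],
        "level": SEVERITY_TO_LEVEL[event["severity"]],
        "short_message": event.get("errorcode", event["message"]),
        "full_message": event["message"],
        "_channel": event["channel"],
    }
    # GELF additional fields must start with "_". Collections have no
    # native GELF representation, so args gets flattened to a string.
    if "args" in event:
        gelf["_args"] = json.dumps(event["args"])
    for key in ("source", "pid", "vhost", "ip", "user", "req"):
        if key in event:
            gelf["_" + key] = event[key]
    return gelf
```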
One of my use cases for structured output is feeding a Logstash server with a Kibana frontend. Logstash will happily index and search anything that can be represented in JSON via its ElasticSearch backend. With this combination you can do some pretty neat things, like find all errors in the last N days with stack traces that mention file X or have argument foo = 'bar'. This can be done with stack traces as big blobs of text, but if you get the JSON document structure right it becomes much easier and less prone to false positives/negatives. It may well turn out that defining the "one log event output format to rule them all" is too hard to spend much time thinking about. BDavis (WMF) (talk) 18:31, 6 December 2013 (UTC)
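To illustrate the blob-vs-structure difference: emitting one object per stack frame (the shape below is made up for illustration, not a proposed schema) lets the search backend match on file or function fields directly instead of regexing a text blob:

```python
import json
import traceback

def structured_trace(exc):
    # One object per frame: a query like trace.file:X or
    # trace.function:Y can then match precisely.
    frames = [
        {"file": f.filename, "line": f.lineno, "function": f.name}
        for f in traceback.extract_tb(exc.__traceback__)
    ]
    return {"class": type(exc).__name__, "message": str(exc), "trace": frames}

try:
    json.loads("{not json")
except ValueError as exc:
    print(json.dumps(structured_trace(exc), indent=2))
```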

Location of Logger registry


One of the big questions here is where the loggers will be stored. One option is in ContextSource, or maybe just a global singleton. Not sure which would be most appropriate here. Parent5446 (talk) 23:03, 4 December 2013 (UTC)

log4j


log4j and associated logging facilities might be useful to look at while developing this. Ssastry (talk) 23:13, 4 December 2013 (UTC)

The monolog product that has been mentioned was obviously heavily influenced by log4j and its descendants. Separating the concerns of gathering log event information, formatting the information, and outputting the formatted record to storage should definitely be considered.
One feature of log4j that monolog is missing is the idea of a hierarchy of loggers. I'd like to see something like that in MediaWiki eventually, but pushing for it in the initial implementation may be overreaching. BDavis (WMF) (talk) 23:28, 4 December 2013 (UTC)

Third party libraries


The RFC mentions that Monolog might be used, together with the question of whether it should be used directly or behind a facade. One should not bind to a specific logging library. If MW defines its own interface, which might be very similar to that of the logging library, such binding can be avoided. One simple adapter would then need to be created for each logging library that needs to be supported. This way other libraries can be used in the future, and changes to the library interfaces will not require changes to all of MW. Jeroen De Dauw (talk) 10:03, 5 December 2013 (UTC)

If we go the route of using a third party library to implement the logging changes I would +1 using the PSR-3 Psr\Log\LoggerInterface as the expected API between MW and the logger implementation. BDavis (WMF) (talk) 16:58, 6 December 2013 (UTC)
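The facade-plus-adapter idea can be sketched language-agnostically (Python here for brevity; the real PSR-3 interface is PHP and additionally defines per-level methods and {placeholder} interpolation, which this toy version only hints at):

```python
from abc import ABC, abstractmethod

class LoggerInterface(ABC):
    """Minimal facade in the spirit of PSR-3: application code depends
    only on this, never on a concrete logging library."""

    @abstractmethod
    def log(self, level, message, context=None):
        ...

    def info(self, message, context=None):
        self.log("info", message, context)

    def error(self, message, context=None):
        self.log("error", message, context)

class MemoryAdapter(LoggerInterface):
    """Toy in-memory adapter; a Monolog or syslog adapter would have
    the same shape, just a different log() body."""

    def __init__(self):
        self.records = []

    def log(self, level, message, context=None):
        # PSR-3 style {placeholder} interpolation from the context dict.
        rendered = message.format(**(context or {}))
        self.records.append((level, rendered))

log = MemoryAdapter()
log.info("user {user} logged in", {"user": "alice"})
print(log.records)
```

Swapping backends then means writing one small adapter class; callers never change.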

Proof of concept implementation


I've opened Gerrit change 112699 with a proof of concept implementation that leverages PSR-3 and Monolog. I'm sure it's not ready to merge yet but hopefully it will get some attention from interested parties to move this discussion forward. BDavis (WMF) (talk) 17:10, 11 February 2014 (UTC)

Google Protocol Buffers


Curious to see how our approach here compares to https://developers.google.com/protocol-buffers/ .

PHP: https://github.com/allegro/php-protobuf/

thanks Tfinc (talk) 21:45, 12 February 2014 (UTC)

I think protocol buffers are solving a slightly different problem. They allow for specification of a rigid message structure with efficient binary representation such as would typically be used for some sort of interprocess (and possibly inter-language) message passing. For event logging I think that having a rigid message structure is unwanted. Not all events will look alike and thus not all events will contain the exact same fields. We are looking to specify a "core" set of fields but would like the ability to invent additional ones as appropriate. I think there are also advantages to using plain text rather than binary formats. Plain text is more easily debugged and manipulated with a variety of tools which supports the primary use case of recording events so that developers and operational support teams can debug runtime issues. BDavis (WMF) (talk) 22:11, 12 February 2014 (UTC)
Protocol Buffers requires defining the whole format with all the keys beforehand and outputs binary. The RFC seems to suggest simple JSON which is a good format for data with variable keys, especially if it's supposed to be consumed by a variety of things. Even 3rd party logging services like Loggly accept JSON. Daniel Friesen (Dantman) (talk) 22:11, 12 February 2014 (UTC)

Feedback needed


See http://lists.wikimedia.org/pipermail/wikitech-l/2014-April/075921.html . Sumana Harihareswara, Engineering Community Manager (talk) 15:57, 17 April 2014 (UTC)