Talk:Requests for comment/Structured logging

Jump to: navigation, search

About this board

SSastry (WMF) (talkcontribs)

log4j

log4j and associated logging facilities might be useful to look at while developing this.

This post was posted by SSastry (WMF), but signed as Ssastry.

BDavis (WMF) (talkcontribs)

The monolog product that has been mentioned was obviously heavily influenced by log4j and it's descendants. Separating the concerns of gathering log event information, formatting the information and outputting the formatted record to storage should definitely be considered.

One feature of log4j that monolog is missing is the idea of a hierarchy of loggers. I'd like to see something like that in MediaWiki eventually but pushing for it in the initial implementation may be overreaching.

Reply to "log4j"

IP address and other personal identifying information

9
Nemo bis (talkcontribs)

Is there any technical reason to keep actual IPs as data points of structured logging entries? The Internet Archive replaced them with hashes generated with a daily key. Especially for operations logs, hashes should be more than enough (if a day is too frequent, make it a week or whatever).

Privacy in this case is not just a philosophical matter but also a very practical one: the less private information a log contains, the easier it is to distribute it, have more eyeballs checking it and identify problems quickly.

BDavis (WMF) (talkcontribs)

This is definitely a point worth discussing. I happen to personally fall into the camp of persons who don't believe that IP addresses constitute private information, but I'm just one voice in the discussion. Having an IP address is useful for diagnosing some bugs but for most use cases I can think of an opaque token that corresponds to the IP (or even /24 block) would be just as useful. I think it would be nice to end up with a core standard that takes sharing log data with the community into consideration but my primary goal is to enable better log analysis for the operational wikis run by the Foundation.

Hashar (talkcontribs)

Wikimedia Foundation and the community it represents does consider the IP address to be private. https://wikimediafoundation.org/wiki/Privacy_policy#IP_and_other_technical_information

We do need IP addresses (client and X-Forwarded-For header) in logs, they are very valuable. Obfuscating them would make debugging harder. Similar thing are the Cookies which we obviously need in logs.

We could redact them whenever they are published, might even have some utility to handle it for us/ But honestly I think it is out of the scope of this document.

BDavis (WMF) (talkcontribs)
CSteipp (WMF) (talkcontribs)

It may be nice to somehow flag private data at the top level of the log structure. I'd like to avoid having a consumer filter out top level fields like ip, then assume the context/message is safe to make public. Or have to grep through those unstructured bits trying to redact particular messages.

This post was posted by CSteipp (WMF), but signed as CSteipp.

BDavis (WMF) (talkcontribs)

I think the spec could accommodate a "private" collection of information, but it would ultimately be up to code review to ensure that private info isn't placed in non-private containers.

Nemo bis (talkcontribs)

Thanks Hashar. Is this true of all logs or are there some where unobfuscated IP addresses are not required for debugging? Also, do IPs get useless for debugging after a short period of time? If not, it would be useful to encrypt them so that you have at least some (permanent) information, as opposed to being forced to purge everything after 3 months or so per the privacy policy (as with CheckUser data).

Nemo bis (talkcontribs)
Nemo bis (talkcontribs)

Quoting Adam Wight and Bawolff here to be able to find their notes again later:

++the EFF for more ideas, they are actively doing great work on so-called

perfect forward secrecy.

There are simple things we could do to achieve a better balance between privacy and sockpantsing, such as cryptolog , in which IP addresses are hashed using a salt that changes every day. In theory, nobody can reverse the function to reveal the IP, but you can still correlate all of an address's edits for the day, week, or whatever, making CheckUser possible.

IP range blocking obviously needs to happen up-front, before the IP is mangled. I have no suggestions, but maybe browser and preferences fingerprinting would be more effective anyway, since: tor.

The cryptolog approach - This has the property that there's a specific time where all anon identifiers suddenly change (e.g. Midnight every day in the setup cryptolog uses). Having an arbitrary point in time where suddenly identifiers shift is probably an unwanted property. (Although maybe it doesn't matter that much in practice? Someone who actually deals with abuse on wiki would be better able to answer that).

Reply to "IP address and other personal identifying information"
Sumanah (talkcontribs)
Reply to "feedback needed"
Tfinc (talkcontribs)
BDavis (WMF) (talkcontribs)

I think protocol buffers are solving a slightly different problem. They allow for specification of a rigid message structure with efficient binary representation such as would typically be used for some sort of interprocess (and possibly inter-language) message passing. For event logging I think that having a rigid message structure is unwanted. Not all events will look alike and thus not all events will contain the exact same fields. We are looking to specify a "core" set of fields but would like the ability to invent additional ones as appropriate. I think there are also advantages to using plain text rather than binary formats. Plain text is more easily debugged and manipulated with a variety of tools which supports the primary use case of recording events so that developers and operational support teams can debug runtime issues.

Dantman (talkcontribs)

Protocol Buffers requires defining the whole format with all the keys beforehand and outputs binary. The RFC seems to suggest simple JSON which is a good format for data with variable keys, especially if it's supposed to be consumed by a variety of things. Even 3rd party logging services like Loggly accept JSON.

Reply to "Google Protocol Buffers"
BDavis (WMF) (talkcontribs)

I've opened Gerrit change 112699 with a proof of concept implementation that leverages PSR-3 and Monolog. I'm sure it's not ready to merge yet but hopefully it will get some attention from interested parties to move this discussion forward.

Reply to "Proof of concept implementation"

We are considering a similar approach for Parsoid

3
GWicke (talkcontribs)

The logging library we are considering is https://github.com/trentm/node-bunyan, which also logs JSON messages. Example messages:

{"name":"myapp","hostname":"banana.local","pid":40161,"level":30,"msg":"hi","time":"2013-01-04T18:46:23.851Z","v":0}
{"name":"myapp","hostname":"banana.local","pid":40161,"level":40,"lang":"fr","msg":"au revoir","time":"2013-01-04T18:46:23.853Z","v":0}
GWicke (talkcontribs)

We can also easily log to GELF. Together with the available transport infrastructure that looks like a very good option to me.

BDavis (WMF) (talkcontribs)

It looks like the current proposed data points mostly map into GELF.

Proposal GELF 1.1 Notes
timestamp timestamp
host host
source _source (facility is deprecated in 1.1)
pid _pid
severity level (via string->number transform)
channel _channel
message full_message
file file
line line
errorcode short_message
exception ? GELF doesn't have collection support
args ? GELF doesn't have collection support
vhost _vhost
ip _ip
user _user
req _req

One thing I don't like about GELF is that it doesn't support fields that are collections. This leads to clumsy representations for things like argument lists or structured stack traces. I do really like the support for chunking and compression in the GELF transport protocol.

One of my use cases for structured output is feeding a Logstash server with a Kibana frontend. Logstash will happily index and search anything that can be represented in JSON via it's ElasticSearch backend. With this combination you can do some pretty neat things like find all errors in the last N days with stack traces that mention file X or have argument foo = 'bar'. This can be done with stack traces as big blobs of text but if you get the JSON document structure right it becomes much easier and less prone to false positives/negatives. It may well turn out that defining the "one log event output format to rule them all" is too hard to spend much time thinking about.

Reply to "We are considering a similar approach for Parsoid"
Jeroen De Dauw (talkcontribs)

The RFC contains a mention that Monolog might be used, together with a question if it should be used directly or behind a facade. One should not bind to a specific logging library. If MW defines its own interface, which might be very similar to the one of the logging library, such binding can be avoided. One simple adepter would need to be created for each logging library that needs to be supported. This way other libraries can be used in the future, and changes to the library interfaces will not require changes to all of MW.

BDavis (WMF) (talkcontribs)

If we go the route of using a third party library to implement the logging changes I would +1 using the PSR-3 Psr\Log\LoggerInterface as the expected API between MW and the logger implementation.

Reply to "Third party libraries"
Parent5446 (talkcontribs)

One of the big questions here is where will the loggers be stored. One option is in ContextSource, or maybe just a global singleton. Not sure which would be most appropriate here.

Reply to "Location of Logger registry"
Cneubauer (talkcontribs)
BDavis (WMF) (talkcontribs)

Thanks for pointing this out. I've added it to the "See also" section of the RFC.

Hashar (talkcontribs)

And since Dec 2012 ( https://gerrit.wikimedia.org/r/37442 ) the log bucket is prepended to the message when it falls in the main $wgDebugLogFile. So we could potentially replace all the wfDebug() calls right now, though having wfDebugLog( 'default', 'some message' ); is not idea.

Reply to "Tracking components"
There are no older topics