Topic on Talk:Requests for comment/Structured logging

IP address and other personal identifying information

9
Nemo bis (talkcontribs)

Is there any technical reason to keep actual IPs as data points of structured logging entries? The Internet Archive replaced them with hashes generated with a daily key. Especially for operations logs, hashes should be more than enough (if a day is too frequent, make it a week or whatever).

Privacy in this case is not just a philosophical matter but also a very practical one: the less private information a log contains, the easier it is to distribute it, have more eyeballs checking it and identify problems quickly.

BDavis (WMF) (talkcontribs)

This is definitely a point worth discussing. I happen to personally fall into the camp of persons who don't believe that IP addresses constitute private information, but I'm just one voice in the discussion. Having an IP address is useful for diagnosing some bugs but for most use cases I can think of an opaque token that corresponds to the IP (or even /24 block) would be just as useful. I think it would be nice to end up with a core standard that takes sharing log data with the community into consideration but my primary goal is to enable better log analysis for the operational wikis run by the Foundation.

Hashar (talkcontribs)

Wikimedia Foundation and the community it represents does consider the IP address to be private. https://wikimediafoundation.org/wiki/Privacy_policy#IP_and_other_technical_information

We do need IP addresses (client and X-Forwarded-For header) in logs, they are very valuable. Obfuscating them would make debugging harder. Similar thing are the Cookies which we obviously need in logs.

We could redact them whenever they are published, might even have some utility to handle it for us/ But honestly I think it is out of the scope of this document.

BDavis (WMF) (talkcontribs)
CSteipp (WMF) (talkcontribs)

It may be nice to somehow flag private data at the top level of the log structure. I'd like to avoid having a consumer filter out top level fields like ip, then assume the context/message is safe to make public. Or have to grep through those unstructured bits trying to redact particular messages.

This post was posted by CSteipp (WMF), but signed as CSteipp.

BDavis (WMF) (talkcontribs)

I think the spec could accommodate a "private" collection of information, but it would ultimately be up to code review to ensure that private info isn't placed in non-private containers.

Nemo bis (talkcontribs)

Thanks Hashar. Is this true of all logs or are there some where unobfuscated IP addresses are not required for debugging? Also, do IPs get useless for debugging after a short period of time? If not, it would be useful to encrypt them so that you have at least some (permanent) information, as opposed to being forced to purge everything after 3 months or so per the privacy policy (as with CheckUser data).

Nemo bis (talkcontribs)
Nemo bis (talkcontribs)

Quoting Adam Wight and Bawolff here to be able to find their notes again later:

++the EFF for more ideas, they are actively doing great work on so-called

perfect forward secrecy.

There are simple things we could do to achieve a better balance between privacy and sockpantsing, such as cryptolog , in which IP addresses are hashed using a salt that changes every day. In theory, nobody can reverse the function to reveal the IP, but you can still correlate all of an address's edits for the day, week, or whatever, making CheckUser possible.

IP range blocking obviously needs to happen up-front, before the IP is mangled. I have no suggestions, but maybe browser and preferences fingerprinting would be more effective anyway, since: tor.

The cryptolog approach - This has the property that there's a specific time where all anon identifiers suddenly change (e.g. Midnight every day in the setup cryptolog uses). Having an arbitrary point in time where suddenly identifiers shift is probably an unwanted property. (Although maybe it doesn't matter that much in practice? Someone who actually deals with abuse on wiki would be better able to answer that).

Reply to "IP address and other personal identifying information"