EventLogging/UserAgentSanitization

Problem
User-agent strings often encode sufficiently many bits of information about the user's setup that they can be used to pinpoint an individual user, especially when used in conjunction with other seemingly-anonymous datapoints. At the same time, user-agent strings are indispensable for performance measurements and front-end error reporting. Historically, the approach of EventLogging was to leave user-agent logging and processing to developers and analysts.

The problem with this approach is that the anonymization of user-agent fields is not uniform: when it is done at all, it is done inconsistently. This is a barrier to releasing performance-related datasets that would greatly benefit from additional scrutiny. The solution is to centralize the processing of user agents at the time of data collection. This way we can be confident that all UAs logged via EventLogging are adequately anonymized.

Fingerprinting: Background
Uniquely identifying a visiting user is referred to as "fingerprinting". "Entropy" is the mathematical quantity that measures how close a fact comes to revealing somebody's identity uniquely. When we learn a new fact about a person, it reduces the entropy of their identity by a certain amount. Thus, the entropy of a user agent is the amount of identifying information the string contributes, which can be combined with other observable characteristics to uniquely identify a user.

Entropy is normally measured in bits. For example, Peter Eckersley's Panopticlick study for the EFF (https://www.eff.org/deeplinks/2010/01/tracking-by-user-agent) finds that the User-Agent header provides about 10.0 bits of entropy. Since 2^10 == 1024, that means only 1 in 1024 random browsers visiting a site is expected to share the same user-agent header. Our goal when anonymizing user agents is to reduce the information the user agent provides about the user while still keeping enough information available to do performance diagnostics.
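The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the bits-to-odds conversion; the function names are hypothetical and are not part of EventLogging.

```python
import math


def surprisal_bits(probability: float) -> float:
    """Bits of identifying information revealed by a fact that is
    true for the given fraction of browsers."""
    return -math.log2(probability)


def one_in_n(bits: float) -> float:
    """Convert bits of entropy into '1 in N browsers share this value'."""
    return 2.0 ** bits


# 10.0 bits corresponds to 1 in 1024 browsers.
print(one_in_n(10.0))

# Conversely, a value shared by 1 in 1024 browsers reveals 10 bits.
print(surprisal_bits(1 / 1024))
```

Every bit removed from the logged string doubles the expected number of browsers that share it.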

Anonymization
The steps taken to further reduce the entropy of a user agent are quite simple. First, we remove information pertaining to language. Second, we remove information regarding toolbars, extensions, plugins, Flash and Java when we find it. Disclaimer: these steps do not include fully parsing the user agent to identify the major and minor browser version, as that would make our logging solution too heavy-handed (it would need to depend on a storage solution such as Browserscope to identify the UA).

We leave mobile browsers as they are, since EFF's research finds that they are comparatively resistant to fingerprinting by user agent alone (perhaps because they are not as easily customizable as desktop browsers).
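As a rough illustration of the steps described above, the following Python sketch leaves mobile user agents untouched, drops language tokens, and drops tokens that look like toolbars, extensions, plugins, Flash or Java. It is not the EventLogging implementation: the function name and all three patterns are assumptions made for this example, and a real deployment would need a vetted list of markers.

```python
import re

# All patterns below are hypothetical examples, not a vetted list.
MOBILE_HINT = re.compile(r"Mobile|Android|iPhone|iPad|Opera Mini", re.IGNORECASE)
# A bare language tag such as "en", "en-US" or "pt_BR".
LANGUAGE_TOKEN = re.compile(r"^[a-z]{2}(?:[-_][A-Za-z]{2})?$")
# Tokens hinting at toolbars, extensions, plugins, Flash or Java.
ADDON_TOKEN = re.compile(r"toolbar|GTB\d|Java|Flash|plugin|extension", re.IGNORECASE)


def sanitize_user_agent(ua: str) -> str:
    """Return a lower-entropy version of a desktop user-agent string."""
    if MOBILE_HINT.search(ua):
        # Mobile browsers are left as they are.
        return ua

    pieces = []
    # Split into parenthesized comments and whitespace-separated products.
    for piece in re.findall(r"\([^()]*\)|\S+", ua):
        if piece.startswith("(") and piece.endswith(")"):
            tokens = [t.strip() for t in piece[1:-1].split(";")]
            kept = [
                t for t in tokens
                if t and not LANGUAGE_TOKEN.match(t) and not ADDON_TOKEN.search(t)
            ]
            if kept:
                pieces.append("(" + "; ".join(kept) + ")")
        elif not ADDON_TOKEN.search(piece):
            pieces.append(piece)
    return " ".join(pieces)


desktop = ("Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.13) "
           "Gecko/20101203 Firefox/3.6.13 GTB7.1")
print(sanitize_user_agent(desktop))
# Mozilla/5.0 (Windows; U; Windows NT 6.1; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13
```

The sketch relies only on pattern matching and never resolves the browser's major or minor version, in keeping with the disclaimer above about avoiding a full user-agent parser.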

Reading
https://www.eff.org/deeplinks/2010/01/primer-information-theory-and-privacy

http://w3c.github.io/fingerprinting-guidance/

http://panopticlick.eff.org/browser-uniqueness.pdf