EventLogging/UserAgentSanitization

Problem
User-agent strings often encode sufficiently many bits of information about the user's setup that they can be used to pinpoint an individual user, especially when used in conjunction with other seemingly-anonymous datapoints. At the same time, user-agent strings are indispensable for performance measurements, front-end error reporting and browser support analysis.

Historically, EventLogging left user-agent logging and processing up to the developers and analysts. The problem with this approach is that the sanitization of user-agent strings is not uniform: when it is done at all, it is often done inconsistently. This often ends up being a barrier to releasing performance-related datasets that would greatly benefit from additional scrutiny. The proposed solution is to centralize the processing of user agents at the time of data collection. This way we can be confident that all UAs logged via EventLogging are adequately sanitized.

Fingerprinting: Background
Entropy is the mathematical quantity that measures how close a fact comes to revealing somebody's identity uniquely. Uniquely identifying a visiting user is referred to as "fingerprinting". When we learn a new fact about a person, it reduces the entropy of their identity by a certain amount. Thus, the entropy of a user agent measures how much its observable characteristics, used in concert with others, contribute to uniquely identifying a user.

Entropy is normally measured in bits. For example, Peter Eckersley's Panopticlick study for the EFF found that the User-Agent header provides about 10.0 bits of entropy. Since 2^10 == 1024, that means only 1 in 1024 random browsers visiting a site is expected to share the same User-Agent header. Our goal when sanitizing user agents is to reduce the information the user agent provides about the user while still keeping enough information available to do performance diagnostics.
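The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the bits-to-odds relationship; the function names `surprisal_bits` and `one_in_n` are made up for this example and are not part of EventLogging.

```python
import math


def surprisal_bits(probability: float) -> float:
    """Bits of identifying information revealed by a fact that is
    true for the given fraction of browsers."""
    return -math.log2(probability)


def one_in_n(bits: float) -> float:
    """Convert bits of entropy into the "1 in N browsers" form."""
    return 2 ** bits


# A header value shared by 1 in 1024 browsers carries 10 bits.
print(surprisal_bits(1 / 1024))
# 10 bits of entropy means roughly 1 in 1024 browsers share the value.
print(one_in_n(10.0))
```

Bits from independent facts add up, which is why a user agent combined with other seemingly-anonymous datapoints can narrow the candidate set so quickly.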

Sanitization
The steps taken to further reduce the entropy of a user agent are quite simple:
 * 1) We remove information pertaining to language first. These are language tags like en-ES that are present in some user agents.
 * 2) We remove minor versions. For example: AppleWebkit/525.3.1 gets transformed to AppleWebkit/525
 * 3) We remove information regarding toolbars, extensions, plugins, builds, Flash and Java when we find it. Disclaimer: these steps do not include fully parsing the user agent to identify the major and minor browser version, as that would make our logging solution too heavy-handed (it would need to depend on a storage solution such as browser scope to identify the UA).
 * 4) To further sanitize the data, we should perhaps transform all sanitized UAs that haven't been encountered in the last N requests (for some appropriate value of N) into a generic "Other/unknown" bucket.

(These guidelines are very much a work in progress; our goal is to make our logging solution as light as possible.)
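Steps 1 to 3 above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the EventLogging implementation: the function name `sanitize_user_agent`, the regular expressions, and the `EXTRA_MARKERS` keyword list are all hypothetical, and the sketch assumes that language tags and toolbar/plugin tokens appear as ";"-separated items inside the parenthesized part of the UA string.

```python
import re

# A parenthesized comment such as "(Windows; U; en-US; rv:1.9.0.3)".
COMMENT = re.compile(r"\(([^()]*)\)")
# Step 1: a language tag such as "en-US" or "en-ES" (assumed shape).
LANGUAGE_TAG = re.compile(r"[a-z]{2}[-_][A-Za-z]{2}")
# Step 2: a dotted version number; only the major part is kept.
MINOR_VERSION = re.compile(r"(\d+)(?:\.\d+)+")
# Step 3: hypothetical, incomplete list of substrings marking extras.
EXTRA_MARKERS = ("toolbar", "plugin", "extension", "flash", "java", "build")


def _clean_comment(match: re.Match) -> str:
    """Drop language tags and extras from one parenthesized comment."""
    parts = [part.strip() for part in match.group(1).split(";")]
    kept = [
        part
        for part in parts
        if part
        and not LANGUAGE_TAG.fullmatch(part)
        and not any(marker in part.lower() for marker in EXTRA_MARKERS)
    ]
    return "(" + "; ".join(kept) + ")" if kept else ""


def sanitize_user_agent(ua: str) -> str:
    ua = COMMENT.sub(_clean_comment, ua)   # steps 1 and 3
    ua = MINOR_VERSION.sub(r"\1", ua)      # step 2
    return re.sub(r"\s{2,}", " ", ua).strip()


raw = ("Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.3) "
       "Gecko/2008092417 Firefox/3.0.3")
print(sanitize_user_agent(raw))
# Mozilla/5 (Windows; U; Windows NT 5; rv:1) Gecko/2008092417 Firefox/3
```

The sketch deliberately stays at the level of string rewriting and does not try to identify the browser, in keeping with the disclaimer in step 3. Matching extras by substring is crude (for instance, the "build" marker would also drop a device model that shares a token with a build identifier), so a real list of markers would need to be tuned against observed traffic.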

UA in EventLogging vs HTTP headers
It should be noted that the client sends the User-Agent header by default, regardless of what EventLogging explicitly logs in the EventCapsule: even if we put a sanitized version of the UA string into the event query string on the client side, the raw UA is still sent in full with every request. Since we cannot prevent the client from sending raw UA headers, the current proposal is to make this data unavailable to any downstream subscriber of EventLogging data: we apply the canonicalization/sanitization as part of the parsing that precedes validation and broadcasting to subscribers. All the relevant EventLogging data consumers are downstream of that step. The current implementation doesn't perform any further transformation. To further sanitize the data, we should transform all sanitized UAs that haven't been encountered in the last N requests (for some appropriate value of N) into a generic "Other/unknown" bucket.
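The "Other/unknown" bucketing proposed above is not implemented, so the following is only one possible sketch of it. The class name `RecentUserAgentFilter`, its `observe` method, and the choice of a sliding window backed by a deque and a counter are assumptions made for illustration.

```python
from collections import Counter, deque


class RecentUserAgentFilter:
    """Replace sanitized UAs not seen in the previous N requests
    with a generic bucket (hypothetical helper, not EventLogging code)."""

    def __init__(self, window_size: int, bucket: str = "Other/unknown"):
        self._window_size = window_size
        self._bucket = bucket
        self._window = deque()     # the last N sanitized UAs, oldest first
        self._counts = Counter()   # occurrences of each UA in the window

    def observe(self, sanitized_ua: str) -> str:
        # Check against the window *before* recording this request.
        seen_recently = self._counts[sanitized_ua] > 0

        self._window.append(sanitized_ua)
        self._counts[sanitized_ua] += 1
        if len(self._window) > self._window_size:
            oldest = self._window.popleft()
            self._counts[oldest] -= 1
            if self._counts[oldest] == 0:
                del self._counts[oldest]

        return sanitized_ua if seen_recently else self._bucket


ua_filter = RecentUserAgentFilter(window_size=3)
print(ua_filter.observe("Firefox/3"))  # Other/unknown (first sighting)
print(ua_filter.observe("Firefox/3"))  # Firefox/3 (seen in the window)
```

In this sketch the filter would run after sanitization and before validation and broadcasting, so subscribers would only ever see common sanitized UAs or the generic bucket. A side effect of this design is that the first occurrence of any UA in a window is always bucketed, which may or may not be acceptable depending on the chosen N.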