EventLogging/UserAgentSanitization

Problem
User-agent strings often encode sufficiently many bits of information about the user's setup that they can be used to pinpoint an individual user, especially when used in conjunction with other seemingly-anonymous datapoints. At the same time, user-agent strings are indispensable for performance measurements, front-end error reporting and browser support analysis.

Historically, EventLogging left user-agent logging and processing up to the developers and analysts. The problem with this approach is that sanitization of user-agent strings is not uniform: when it is done at all, it is often done inconsistently. This frequently becomes a barrier to releasing performance-related datasets that would greatly benefit from additional scrutiny. The proposed solution is to centralize the processing of user agents at the time of data collection. This way we can be confident that all UAs logged via EventLogging are adequately sanitized.

Use cases for User Agent collection
Why do we need user agent data?

To assess support needs. For example, we want to be sure we support every device with usage over x %, so we need to have device data reported with a precision of x.

To prioritize bug fixes. In order to know whether a bug that affects a certain browser is a "must-fix" or a "nice to have", we need browser makeup stats for both desktop and mobile.

To plan feature work. We need to know, for example, the percentage of tablets among the users of Wikipedia's mobile apps to plan application development, or the number of users with "newer" browsers to plan work for Visual Editor.

To contextualize performance numbers. For any kind of performance work we do, we need browser data available to target our efforts smartly. An example of a performance schema in EL: https://meta.wikimedia.org/wiki/Schema:NavigationTiming

Fingerprinting: background
Entropy is the mathematical quantity that measures how close a fact comes to revealing somebody's identity uniquely. Uniquely identifying a visiting user is referred to as "fingerprinting". When we learn a new fact about a person, it reduces the entropy of their identity by a certain amount. Thus, the entropy of a user agent measures how much its observable characteristics contribute, in concert with other facts, to uniquely identifying a user.

Entropy is normally measured in bits. For example, Peter Eckersley's Panopticlick study for the EFF finds that the User-Agent header provides about 10.0 bits of entropy. Since 2^10 == 1024, that means that, on average, only about 1 in 1024 random browsers visiting a site is expected to share the same user-agent header. Our goal when sanitizing user agents is to reduce the information the user agent provides about the user while still keeping enough information available to do performance diagnostics.
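The arithmetic above can be checked with a short sketch. The function names here are illustrative only and are not part of EventLogging; the sketch assumes the usual definitions of surprisal (the bits revealed by one observation) and Shannon entropy (the average over a population).

```python
import math
from collections import Counter


def surprisal_bits(probability: float) -> float:
    """Bits of identifying information revealed by an observation
    that is shared by the given fraction of the population."""
    return -math.log2(probability)


def shannon_entropy_bits(observations) -> float:
    """Average bits of information carried by a field, estimated
    from a sample of observed values (e.g. user-agent strings)."""
    counts = Counter(observations)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# A user agent shared by 1 in 1024 browsers reveals 10 bits.
print(surprisal_bits(1 / 1024))

# Four equally common values carry 2 bits on average.
print(shannon_entropy_bits(["a", "b", "c", "d"]))
```

Sanitization lowers these numbers by making many distinct raw strings collapse into the same value.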

Sanitization
The steps taken to further reduce the entropy of a user agent are quite simple:
 * 1) We remove information pertaining to language first. These are language tags like en-ES that are present in some user agents.
 * 2) We remove minor versions. For example: AppleWebkit/525.3.1 gets transformed to AppleWebkit/525
 * 3) We remove information regarding toolbars, extensions, plugins, builds, Flash and Java when we find it. Disclaimer: these steps might include fully processing the user agent to identify the device, OS (major and minor) and browser (major). Pre-processing the UA has the advantage of filtering the data so that we only store what we care about. It has, however, the disadvantage of coupling logging with a UA parsing solution. Also, since the data is pre-processed, the raw data is lost, so mistakes in the parsing library will not be easy to correct.
 * 4) To further sanitize the data, any solution needs to incorporate bucketing and place all UAs that haven't been encountered in the last N requests (for some appropriate value of N) into a generic "Other/unknown" bucket.
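Steps 1, 2 and 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the EventLogging implementation: the regular expressions, the function sanitize_user_agent and the class RecentUserAgentBucketer are hypothetical, language tags are only recognized as a non-leading item inside a parenthesized comment, and minor versions are only stripped from Name/version tokens. Step 3 is left out because, as the disclaimer notes, it usually calls for a full UA parser.

```python
import re
from collections import Counter, deque

OTHER_BUCKET = "Other/unknown"

# Step 1 (assumed pattern): a language tag such as "; en-US" inside a comment.
LANGUAGE_TAG = re.compile(r";\s*[a-z]{2}[-_][A-Za-z]{2}(?=[;)])")
# Step 2 (assumed pattern): "Name/1.2.3" product tokens, keeping the major version.
PRODUCT_MINOR_VERSION = re.compile(r"([A-Za-z]+/\d+)(?:\.\d+)+")


def sanitize_user_agent(ua: str) -> str:
    """Strip language tags and minor product versions from a UA string."""
    ua = LANGUAGE_TAG.sub("", ua)
    return PRODUCT_MINOR_VERSION.sub(r"\1", ua)


class RecentUserAgentBucketer:
    """Step 4: report a UA only if it was seen in the previous N requests."""

    def __init__(self, window_size: int):
        self._window_size = window_size
        self._window = deque()    # the last N sanitized UAs, oldest first
        self._counts = Counter()  # occurrences of each UA within the window

    def bucket(self, sanitized_ua: str) -> str:
        seen_recently = self._counts[sanitized_ua] > 0
        # Record the current request, then evict the oldest one if needed.
        self._window.append(sanitized_ua)
        self._counts[sanitized_ua] += 1
        if len(self._window) > self._window_size:
            oldest = self._window.popleft()
            self._counts[oldest] -= 1
            if self._counts[oldest] == 0:
                del self._counts[oldest]
        return sanitized_ua if seen_recently else OTHER_BUCKET


raw = ("Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) "
       "AppleWebKit/525.3.1 (KHTML, like Gecko) Safari/525.3.1")
print(sanitize_user_agent(raw))
```

With this bucketing rule the first occurrence of any UA always lands in "Other/unknown", so rare strings never appear verbatim in the stored data.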

Fingerprinting versus identifying a browser session
It is worth noting that a sanitized UA reduces the chance of identifying a user (fingerprinting), but it still leaves the ability to identify a browser session. A browser session does not pinpoint a user but rather a set of actions that were initiated from the same client. This works as follows: every record in EL has a hashed version of the client IP (it is really a SHA-1 HMAC with a throwaway salt) that looks something like this: "d56cd65b103df4d76b32ec839f8406f8d712b989". Every record with the same identifier was therefore initiated by the same client. The hashed IP does not reveal anything about the IP, and you cannot reconstruct the IP from the hash, but given 3 records you can decide that records 1 and 2 belong to the same browser session and that record 3 belongs to a different one. The salt eventually rotates, so you can only identify a browser session for a given number of days.
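The hashing scheme described above can be illustrated with Python's standard hmac module. This is a sketch only: the function name hash_client_ip, the salt length and the way the salt is generated are assumptions, not details taken from EventLogging.

```python
import hashlib
import hmac
import secrets


def hash_client_ip(ip: str, salt: bytes) -> str:
    """Return the hex SHA-1 HMAC of an IP address, keyed with the salt."""
    return hmac.new(salt, ip.encode("utf-8"), hashlib.sha1).hexdigest()


# A throwaway salt: random bytes that are discarded when the salt rotates.
salt = secrets.token_bytes(32)

record_1 = hash_client_ip("192.0.2.1", salt)
record_2 = hash_client_ip("192.0.2.1", salt)
record_3 = hash_client_ip("192.0.2.7", salt)

# Records 1 and 2 share an identifier; record 3 does not.
print(record_1 == record_2, record_1 == record_3)
```

Once the salt is rotated and the old one thrown away, the same IP maps to a new identifier, which is why sessions can only be linked within one salt period.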

In later versions of this system the IP is no longer stored, not even in a hashed format; the geolocation inferred from the IP is stored instead. We store city and country. Other information that can be inferred from the IP, such as coordinates, is neither harvested nor stored.

Aggregation
While sanitization reduces the amount of private information a User Agent contains, it still leaves open the concern that you can pinpoint a browser session. The solution to this problem is to report data in an aggregated fashion, discarding the original dataset and keeping only aggregates such as: "3% of users have an iPhone 5".
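As a rough illustration of such an aggregate, the sketch below reduces a list of per-record device labels to percentage shares. The function name aggregate_device_shares and the sample data are made up for this example.

```python
from collections import Counter


def aggregate_device_shares(devices):
    """Return the percentage share of each device label in the sample."""
    counts = Counter(devices)
    total = sum(counts.values())
    return {device: 100.0 * n / total for device, n in counts.items()}


# Made-up sample: 3 records from one device out of 100.
records = ["iPhone 5"] * 3 + ["Other"] * 97
shares = aggregate_device_shares(records)
print(shares["iPhone 5"])
```

Only the resulting dictionary would be kept; the per-record list is what gets discarded.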

Caveats
Aggregation requires that you know beforehand which reports you want to produce from the data. If the original records are discarded, your ability to explore the dataset is reduced.

UA in EventLogging vs HTTP headers
It should be noted that the user-agent header is sent by the client by default, regardless of what EventLogging explicitly logs in the EventCapsule: even if we put a sanitized version of the UA string into the event query string on the client side, the raw header still gets sent in full with every request. Since we cannot prevent the client from sending raw UA headers, the current proposal is to make this data unavailable to any downstream subscriber of EventLogging data: we apply the canonicalization/sanitization as part of the parsing that precedes validation and broadcasting to subscribers. All the relevant EventLogging data consumers would be downstream of that step. The current implementation doesn't perform any further transformation.
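The ordering described above (sanitize during parsing, then validate, then broadcast) can be sketched as follows. All names here, including process_event, the "userAgent" key and the callables passed in, are hypothetical placeholders and not the actual EventLogging code.

```python
from typing import Callable, Dict, List, Optional

Event = Dict[str, object]


def process_event(
    raw_event: Event,
    sanitize: Callable[[str], str],
    validate: Callable[[Event], bool],
    subscribers: List[Callable[[Event], None]],
) -> Optional[Event]:
    """Sanitize the UA first, so nothing downstream sees the raw string."""
    event = dict(raw_event)  # work on a copy, leave the input untouched
    event["userAgent"] = sanitize(str(event.get("userAgent", "")))
    if not validate(event):
        return None  # invalid events are never broadcast
    for subscriber in subscribers:
        subscriber(event)
    return event


received = []
process_event(
    {"userAgent": "Example/1.2.3", "event": {}},
    sanitize=lambda ua: ua.split(".")[0],  # stand-in sanitizer
    validate=lambda event: True,           # stand-in validator
    subscribers=[received.append],
)
print(received[0]["userAgent"])
```

Because the sanitizer runs before validation and before any subscriber is called, no consumer needs to be trusted to do its own sanitization.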