Talk:EventLogging/UserAgentSanitization

Reasoning for processing at time of data collection
The lead paragraph sets up the reasoning for cleaning up user agents for the release of public datasets and then, seemingly without justification, states that processing will occur before user agents are stored. Why would we process *before* storing the user agent? It seems to me that a more logical time to perform this processing is prior to release of a dataset. I like this alternative better because processing user agents is messy and new patterns in user agent strings are likely to break our processing strategy from time to time. If we only store post-processed data, we don't have the opportunity to process it again with updated processing strategies. --Halfak (WMF) (talk) 15:58, 8 January 2014 (UTC)
 * Why wouldn't it? Not storing private data at all means we don't have to worry about privacy issues for the data in question. --Nemo 16:02, 8 January 2014 (UTC)
 * I second Nemo's approach. QChris (talk) 14:10, 10 January 2014 (UTC)
 * We wouldn't, so as not to couple our data gathering with a browser parsing library. Any browser processing is subject to change going forward (new devices come along, better libraries appear, mistakes in identification are corrected), so we want to decouple logging from device, OS and browser identification. We want to hold a sanitized dataset that we can process in the future in any way we want. NRuiz (WMF) (talk) 13:28, 24 January 2014 (UTC)
 * Thanks for expressing a reason. I think one is needed when it's about storing private data. However, "so as not to couple our data gathering with a browser parsing library" is an implementation matter, not a goal. The fact that "new devices come along, better libraries, mistakes are corrected in identifications" is a reminder of the need for this system to be well engineered, but doesn't help identify actionable requirements and rationales. Something concrete would be, for instance, "we want to be sure we support every device with usage over x %, so we need to have sufficient precision for that". --Nemo 21:19, 26 January 2014 (UTC)
 * Let me comment further to clarify, as I think we are mixing two concerns: 1) performance and 2) anonymity. Processing at the time of logging is purely a performance/engineering concern. A processed user agent (regardless of when it is processed) can still be used to link to a browser session, just with lower probability than a "raw" user agent. Thus sanitizing the incoming user agent lowers the probability of identifying a browser session. Plain processing does the same: it reduces the probability further, but it does not eliminate it. But I very much agree: as you mentioned, our goal (especially for mobile) is to be able to study device/browser percentage combinations in order to provide support. NRuiz (WMF) (talk) 15:54, 27 January 2014 (UTC)
 * Discussion on this topic is going on in too many places, so forgive me if I'm confused. After the comment above, I've been unable to understand what we're disagreeing on, if anything. :) Part of the reason is that this page doesn't document the reasons and goals for having user agents at all, so it's hard to discuss what makes sense. I am commenting in a rather general way.
 * Premise. This is a subpage of EventLogging, but all the comments above would apply to any part of the infrastructure dealing with user agents... For a random user from the outside, like me, EventLogging is «the collection of anonymized, aggregate metrics on how users interact with MediaWiki» (one sentence is how far the public understanding of a MediaWiki extension can go, so we need to ensure everyone is on the same page at least for the basics). That is, very specific counts of events/things we need to make decisions about now or in the very near future, and nothing else: every head counts as one. (When data has problems and analysis doesn't suffice, you just tweak the experiment/schema and run it again.) If EventLogging stored something else, its definition should be changed. Users trust WMF to only collect data it absolutely needs for specific, stated purposes.
 * Goals. The only actual and legitimate use for UA that everyone knows about is things like Template:Browser matrix and the detailed table for it, coming from udp2log, not EventLogging. Even there we only really care about (and use) stats for a dozen or so user agents at any given time and we don't really care about time series, so there's a lot of room for long tail cutting and aggregation. EventLogging currently doesn't have a documented use for UA data, so I can only assume that it needs even less precision/information, and that EventLogging "consumers" will not log UA data in the first place if they don't have an immediate use for it.
 * Back to the point. Let's assume, as the page does, that EventLogging needs to deal with UA data. The premise/promise is that it will only collect it in aggregate form, for a reasonable definition of aggregate. So, by definition, EventLogging can only store processed/sanitised data, as Nuria agrees above, but we also can't accept a "perhaps" in point 4, because the long tail must be cut. Keeping counts for the top 50 most frequent UA, or for the UA seen in the last 100 requests (as suggested IIRC by Luis on bugzilla), is probably fine; if that's hard to implement or we can't decide on a number, perhaps it's easier to strip secondary data in earlier stages (we'll never do any special development for AppleWebkit/525 as opposed to AppleWebkit/524 or AppleWebkit/425, so just keep AppleWebkit).
 * Finally, as you say, there are legal requirements and there is performance; I interpret the latter as efficient use of resources and I think the two are coupled. If processing UA is too hard, or too expensive in terms of computational or development resources, or the topic is so complex that you can't agree on a solution, it's also fine to just conclude that it's a distraction from the main, actual goals of EventLogging and that it should be put aside, removing UA data collection altogether to focus on something else. --Nemo 22:36, 6 February 2014 (UTC)
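
For illustration only, the two options mentioned in the thread (stripping secondary data such as version numbers, and keeping counts only for the most frequent UA) could be sketched roughly as below. This is a minimal sketch, not EventLogging code: the function names `strip_versions` and `top_n_counts`, the regular expression, and the `"(other)"` bucket are all assumptions made for the example, and real user agent strings would need more careful handling.

```python
import re
from collections import Counter

# Assumed pattern: a product token, a slash, then a version starting with a digit,
# e.g. "AppleWebkit/525" or "Mozilla/5.0". Real UA strings are messier than this.
_PRODUCT_VERSION = re.compile(r"([A-Za-z][A-Za-z0-9_.-]*)/[0-9][0-9A-Za-z._-]*")


def strip_versions(user_agent):
    """Drop version numbers, so AppleWebkit/525 and AppleWebkit/524 both become AppleWebkit."""
    return _PRODUCT_VERSION.sub(r"\1", user_agent)


def top_n_counts(user_agents, n=50):
    """Count stripped user agents, keep the n most frequent, fold the long tail into one bucket."""
    counts = Counter(strip_versions(ua) for ua in user_agents)
    kept = dict(counts.most_common(n))
    tail = sum(counts.values()) - sum(kept.values())
    if tail:
        kept["(other)"] = tail  # hypothetical label for the cut long tail
    return kept


if __name__ == "__main__":
    sample = ["AppleWebkit/525", "AppleWebkit/524", "AppleWebkit/425", "Gecko/2010", "Presto/2.1"]
    print(top_n_counts(sample, n=1))
```

With this approach only the aggregated counts would need to be stored; whether a cutoff of 50 (or any other number) is appropriate is exactly the open question discussed above.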