Talk:EventLogging/UserAgentSanitization

Reasoning for processing at time of data collection
The lead paragraph sets up the reasoning for cleaning up user agents for the release of public datasets and then, seemingly without justification, states that processing will occur before user agents are stored. Why would we process *before* storing the user agent? It seems to me that a more logical time to perform this processing is prior to release of a dataset. I like this alternative better because processing user agents is messy and new patterns in user agent strings are likely to break our processing strategy from time to time. If we only store post-processed data, we don't have the opportunity to process it again with updated processing strategies. --Halfak (WMF) (talk) 15:58, 8 January 2014 (UTC)
 * Why wouldn't we? Not storing private data at all means we don't have to worry about privacy issues for the data in question. --Nemo 16:02, 8 January 2014 (UTC)
 * I second Nemo's approach. --QChris (talk) 14:10, 10 January 2014 (UTC)
 * We wouldn't, so as not to couple our data gathering with a browser parsing library. Any browser processing is subject to change going forward (new devices come along, better libraries appear, mistakes in identification are corrected), so we want to decouple logging from device, OS and browser identification. We want to hold a sanitized dataset that we can process in the future in any way we want. --NRuiz (WMF) (talk) 13:28, 24 January 2014 (UTC)
 * Thanks for expressing a reason. I think one is needed when it's about storing private data. However, "so as not to couple our data gathering with a browser parsing library" is an implementation matter, not a goal. The fact that "new devices come along, better libraries, mistakes are corrected in identifications" is a reminder of the need for this system to be well engineered, but it doesn't help identify actionable requirements and rationales. Something concrete would be, for instance, "we want to be sure we support every device with usage over x %, so we need to have sufficient precision for that". --Nemo 21:19, 26 January 2014 (UTC)
 * Let me comment further to clarify, as I think we are mixing two concerns: 1) performance and 2) anonymity. Processing at the time of logging is purely a performance/engineering concern. A processed user agent (regardless of when it is processed) can still be used to link to a browser session, just with lower probability than a "raw" user agent. Thus sanitizing the incoming user agent lowers the probability of identifying a browser session. Plain processing does the same: it reduces the probability further, but does not eliminate it. But yes, as you said, our goal (especially for mobile) is to be able to study the percentages of device/browser combinations in order to provide support. --NRuiz (WMF) (talk) 15:54, 27 January 2014 (UTC)
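
To make the distinction discussed above concrete, here is a minimal sketch of the two separate steps: sanitizing a user agent when it is logged, and identifying the browser later from the stored sanitized string. Everything in it is an assumption made for illustration: the function names `sanitize_user_agent` and `browser_family`, the three regular expressions, and the sample user agent are hypothetical and are not taken from EventLogging or from any parsing library.

```python
import re

# Hypothetical sanitization rules, chosen only for illustration.
# They drop tokens that tend to narrow a user agent down to few users.
_LOCALE = re.compile(r";\s*[a-z]{2}[-_][A-Za-z]{2}(?=[;)])")   # e.g. "; en-us"
_BUILD = re.compile(r"\s*Build/[^;)\s]+")                      # e.g. " Build/IMM76D"
_PATCH_VERSION = re.compile(r"(\d+\.\d+)(?:\.\d+)+")           # 4.0.4 -> 4.0


def sanitize_user_agent(raw):
    """Logging-time step: strip locale, build ID and patch-level versions."""
    ua = _LOCALE.sub("", raw)
    ua = _BUILD.sub("", ua)
    ua = _PATCH_VERSION.sub(r"\1", ua)
    return ua


def browser_family(sanitized):
    """Analysis-time step: a deliberately naive classifier.

    Because it runs on the stored sanitized string, it can be replaced
    by a better parser later without touching the logging code.
    """
    for family in ("Firefox", "Chrome", "Safari"):
        if family in sanitized:
            return family
    return "Other"


raw = ("Mozilla/5.0 (Linux; U; Android 4.0.4; en-us; GT-I9100 Build/IMM76D) "
       "AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30")
stored = sanitize_user_agent(raw)   # what would be written to the log
family = browser_family(stored)     # computed later, at analysis time
```

The sanitized string still carries the device model and OS version, so it lowers the chance of linking a record to a browser session without removing it, which is the point made in the last comment.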