Wikimedia Research/Showcase/Archive/2023/10

October 2023

Theme
Data Privacy

October 18, 2023 Video: YouTube

Wikipedia Reader Navigation: When Synthetic Data Is Enough
By Akhil Arora, EPFL
Every day millions of people read Wikipedia. When navigating the vast space of available topics using hyperlinks, readers describe trajectories on the article network. Understanding these navigation patterns is crucial to better serve readers' needs and address structural biases and knowledge gaps. However, systematic studies of navigation on Wikipedia are hindered by a lack of publicly available data due to the commitment to protect readers' privacy by not storing or sharing potentially sensitive data. In this paper, we ask: How well can Wikipedia readers' navigation be approximated by using publicly available resources, most notably the Wikipedia clickstream data? We systematically quantify the differences between real navigation sequences and synthetic sequences generated from the clickstream data, in 6 analyses across 8 Wikipedia language versions. Overall, we find that the differences between real and synthetic sequences are statistically significant, but with small effect sizes, often well below 10%. This constitutes quantitative evidence for the utility of the Wikipedia clickstream data as a public resource: clickstream data can closely capture reader navigation on Wikipedia and provides a sufficient approximation for most practical downstream applications relying on reader data. More broadly, this study provides an example of how clickstream-like data can generally enable research on user navigation on online platforms while protecting users' privacy.
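
To make the idea of synthetic sequences concrete, here is a minimal sketch of one way such sequences could be generated: a first-order random walk in which the next article is drawn in proportion to the click counts leaving the current article. This is an illustration under stated assumptions, not necessarily the procedure used in the paper. The rows in `TOY_CLICKSTREAM`, the function names, and the `max_length` cutoff are all hypothetical; the real clickstream dumps also contain external sources (such as search engines) that this sketch ignores.

```python
import random
from collections import defaultdict

# Hypothetical toy rows: (source article, target article, click count).
TOY_CLICKSTREAM = [
    ("Physics", "Energy", 60),
    ("Physics", "Matter", 40),
    ("Energy", "Matter", 10),
    ("Energy", "Heat", 30),
    ("Matter", "Atom", 25),
]


def build_transitions(rows):
    """Aggregate click counts into a source -> {target: count} mapping."""
    transitions = defaultdict(dict)
    for source, target, count in rows:
        transitions[source][target] = transitions[source].get(target, 0) + count
    return transitions


def sample_sequence(transitions, start, max_length, rng):
    """Sample one synthetic sequence as a first-order random walk.

    The walk stops at max_length articles or when the current
    article has no recorded outgoing clicks.
    """
    sequence = [start]
    current = start
    while len(sequence) < max_length and current in transitions:
        targets = list(transitions[current])
        weights = [transitions[current][t] for t in targets]
        current = rng.choices(targets, weights=weights, k=1)[0]
        sequence.append(current)
    return sequence


rng = random.Random(0)  # seeded so repeated runs give the same walks
transitions = build_transitions(TOY_CLICKSTREAM)
synthetic = [sample_sequence(transitions, "Physics", 4, rng) for _ in range(5)]
for seq in synthetic:
    print(" -> ".join(seq))
```

A first-order walk only remembers the current article, so it cannot reproduce longer-range dependencies in real sessions; differences of that kind between real and synthetic sequences are what the analyses described above quantify.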


How to tell the world about data you cannot show them: Differential privacy at the Wikimedia Foundation
slides
By Hal Triedman, Wikimedia Foundation
The Wikimedia Foundation (WMF), by virtue of its centrality on the internet, collects lots of data about platform activities. Some of that data is made public (e.g. global daily pageviews); other data types are not shared (or are pseudonymized prior to sharing), largely due to privacy concerns. Differential privacy is a statistical definition of privacy that has gained prominence in academia, but is still an emerging technology in industry. In this talk, I share the story of how we put differential privacy into production at the WMF, through looking at the case study of geolocated daily pageview counts.
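
As a rough illustration of the general idea, the sketch below adds Laplace noise to a small table of per-country pageview counts, which is one textbook way to satisfy differential privacy for counting queries. It is not the WMF production pipeline. The counts, the function names, the value of `epsilon`, and the assumption that each reader contributes at most one pageview to the whole table (so the sensitivity is 1) are all hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_counts(counts, epsilon, sensitivity, rng):
    """Release noisy counts using the Laplace mechanism.

    Assumes one reader changes the table by at most `sensitivity`
    in total (L1 norm). Rounding and clamping at zero are
    post-processing steps and do not weaken the privacy guarantee.
    """
    scale = sensitivity / epsilon
    released = {}
    for key, count in counts.items():
        noisy = count + laplace_noise(scale, rng)
        released[key] = max(0, round(noisy))
    return released


# Hypothetical daily pageviews of one article, keyed by country code.
true_counts = {"FR": 1200, "BR": 340, "KE": 12}

rng = random.Random(42)  # seeded only to make the demo repeatable
print(private_counts(true_counts, epsilon=1.0, sensitivity=1.0, rng=rng))
```

Smaller values of `epsilon` add more noise, which matters most for small counts such as the `KE` entry. A production system would also need a cryptographically secure noise source and protection against floating-point attacks, both of which this sketch leaves out.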