Wikimedia Research/Showcase


The Monthly Wikimedia Research Showcase is a public showcase of recent research by the Wikimedia Foundation's Research Team and guest presenters from the academic community. The showcase is hosted virtually on the third Wednesday of each month at 9:30 a.m. Pacific Time (18:30 CET) and is live-streamed on YouTube. The schedule may change; see the calendar below for a list of confirmed showcases.

How to attend

We live stream our research showcase every month on YouTube. The link will be in each showcase's details below and is also announced in advance via wiki-research-l, analytics-l, and @WikiResearch on Twitter. You can join the conversation and participate in Q&A after each presentation using the YouTube chat. We expect all presenters and attendees to abide by our Friendly Space Policy.

Upcoming Events

February 2024

Wednesday, February 21, 16:30 UTC: Find your local time here
Platform Governance and Policies

Wednesday, February 21, 2024 Video: YouTube

Sociotechnical Designs for Democratic and Pluralistic Governance of Social Media and AI
By Amy X. Zhang, University of Washington
Decisions about policies when using widely-deployed technologies, including social media and more recently, generative AI, are often made in a centralized and top-down fashion. Yet these systems are used by millions of people with a diverse set of preferences and norms. Who gets to decide what the rules are, what should the procedures be for deciding them, and must we all abide by the same ones? In this talk, I draw on theories and lessons from offline governance to reimagine how sociotechnical systems could be designed to provide greater agency and voice to everyday users and communities. This includes the design and development of: 1) personal moderation and curation controls that are usable and understandable to laypeople, 2) tools for authoring and carrying out governance to suit a community's needs and values, and 3) decision-making workflows for large-scale democratic alignment that are legitimate and consistent.

March 2024

Wednesday, March 20, 17:30 UTC: Find your local time here
Addressing Gender Gaps

Wednesday, March 20, 2024 Video: coming soon

By Mo Houtti

By Nicole Schwitter



January 2024

Wednesday, January 17, 17:30 UTC: Find your local time here
Connecting Actions with Policy

January 17, 2024 Video: YouTube

Presenting the report "Unreliable Guidelines"
By Amber Berson and Monika Sengul-Jones
The goal behind the report Unreliable Guidelines: Reliable Sources and Marginalized Communities in French, English and Spanish Wikipedias was to understand the effects of the set of reliable source guidelines and rules on the participation of and the content about marginalized communities on three Wikipedias. Two years following the release of their report, researchers Berson and Sengul-Jones reflect on the impact of their research as well as the actionable next steps.

Why Should This Article Be Deleted? Transparent Stance Detection in Multilingual Wikipedia Editor Discussions
By Lucie-Aimée Kaffee and Arnav Arora
The moderation of content on online platforms is usually non-transparent. On Wikipedia, however, this discussion is carried out publicly, and editors are encouraged to use the content moderation policies as explanations for making moderation decisions. Yet currently only a few comments explicitly mention those policies. To aid in this process of understanding how content is moderated, we construct a novel multilingual dataset of Wikipedia editor discussions along with their reasoning in three languages. We demonstrate that stance and corresponding reason (policy) can be predicted jointly with a high degree of accuracy, adding transparency to the decision-making process.


December 2023

Tuesday, December 12, 17:30 UTC: Find your local time here
A year of Generative AI: future directions for Wikimedia

December 12, 2023 Video: YouTube

Panel discussion
A year of Generative AI: future directions for Wikimedia
By User:Barkeep49, Maryana Pinchuk, and Robert West
This December marks the one-year anniversary of ChatGPT, with the resulting public interest in generative AI and growing research focus on practical uses of large language models. There has been much discussion about how these generative models might disrupt the Wikimedia projects, as well as prototyping to see where they might be useful. To discuss what we've learned in the past year and what opportunities ahead are being enabled by research, we bring together a panel of four: Isaac Johnson (Senior Research Scientist at the Wikimedia Foundation) will moderate, and panelists User:Barkeep49, Maryana Pinchuk, and Robert West will bring perspectives from the volunteer, product, and research communities.

November 2023


November 15, 2023 Video: YouTube

Contextualizing the bibliographic references of Wikipedia
By Wenceslao Arroyo-Machado, Universidad de Granada
This study aims to enhance the value of bibliographic references in Wikipedia articles by moving beyond citation counts alone, exploiting Wikipedia article features and engagement metrics, such as page views and talk page activity, to enrich the context of references and deepen the understanding of the relationship between science and society.
  • Papers:
Arroyo-Machado, W., Torres-Salinas, D., & Costas, R. (2022). Wikinformetrics: Construction and description of an open Wikipedia knowledge graph data set for informetric purposes. Quantitative Science Studies, 1-22.
Arroyo-Machado, W., Díaz-Faes, A. A., Herrera-Viedma, E., & Costas, R. (2023). From academic to media capital: To what extent does the scientific reputation of universities translate into Wikipedia attention?. arXiv preprint arXiv:2307.05366.
Arroyo-Machado, W., & Costas, R. (2023, April). Do popular research topics attract the most social attention? A first proposal based on OpenAlex and Wikipedia. In 27th International Conference on Science, Technology and Innovation Indicators (STI 2023).

Gender and country biases in Wikipedia citations to scholarly publications
By Chaoqun Ni, University of Wisconsin-Madison
Ensuring Wikipedia cites scholarly publications based on quality and relevancy without biases is critical to credible and fair knowledge dissemination. We investigate gender- and country-based biases in Wikipedia citation practices using linked data from the Web of Science and a Wikipedia citation dataset. Using coarsened exact matching, we show that publications by women are cited less by Wikipedia than expected and are less likely to be cited than those by men. Scholarly publications by authors affiliated with non-Anglosphere countries are also disadvantaged in getting cited by Wikipedia, compared with those by authors affiliated with Anglosphere countries. The level of gender- or country-based inequalities varies by research field, and the gender-country intersectional bias is prominent in math-intensive STEM fields. To ensure the credibility and equality of knowledge presentation, Wikipedia should consider strategies and guidelines to cite scholarly publications independent of the gender and country of authors.

October 2023

Data Privacy

October 18, 2023 Video: YouTube

Wikipedia Reader Navigation: When Synthetic Data Is Enough
By Akhil Arora, EPFL
Every day millions of people read Wikipedia. When navigating the vast space of available topics using hyperlinks, readers describe trajectories on the article network. Understanding these navigation patterns is crucial to better serve readers’ needs and address structural biases and knowledge gaps. However, systematic studies of navigation on Wikipedia are hindered by a lack of publicly available data due to the commitment to protect readers' privacy by not storing or sharing potentially sensitive data. In this paper, we ask: How well can Wikipedia readers' navigation be approximated by using publicly available resources, most notably the Wikipedia clickstream data? We systematically quantify the differences between real navigation sequences and synthetic sequences generated from the clickstream data, in 6 analyses across 8 Wikipedia language versions. Overall, we find that the differences between real and synthetic sequences are statistically significant, but with small effect sizes, often well below 10%. This constitutes quantitative evidence for the utility of the Wikipedia clickstream data as a public resource: clickstream data can closely capture reader navigation on Wikipedia and provides a sufficient approximation for most practical downstream applications relying on reader data. More broadly, this study provides an example for how clickstream-like data can generally enable research on user navigation on online platforms while protecting users’ privacy.
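One way such synthetic sequences can be generated, sketched below under assumptions (the article names, counts, and walk procedure are illustrative, not from the paper), is a weighted random walk over clickstream transition counts, choosing each next article in proportion to how often real readers clicked that link:

```python
import random

# Hypothetical clickstream counts: (source, target) -> number of observed clicks.
clickstream = {
    ("Cat", "Felidae"): 900, ("Cat", "Dog"): 100,
    ("Felidae", "Carnivora"): 500, ("Dog", "Wolf"): 300,
}

def synthetic_walk(start, steps, counts, rng=random.Random(0)):
    """Sample a reader-like navigation sequence by repeatedly choosing
    the next article in proportion to clickstream transition counts."""
    seq = [start]
    for _ in range(steps):
        out = [(tgt, n) for (src, tgt), n in counts.items() if src == seq[-1]]
        if not out:
            break  # no recorded outgoing clicks from this article
        targets, weights = zip(*out)
        seq.append(rng.choices(targets, weights=weights)[0])
    return seq

walk = synthetic_walk("Cat", 3, clickstream)
```

Comparing many such sampled walks against real (private) navigation traces is the kind of analysis the paper performs at scale across language editions.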

How to tell the world about data you cannot show them: Differential privacy at the Wikimedia Foundation
By Hal Triedman, Wikimedia Foundation
The Wikimedia Foundation (WMF), by virtue of its centrality on the internet, collects lots of data about platform activities. Some of that data is made public (e.g. global daily pageviews); other data types are not shared (or are pseudonymized prior to sharing), largely due to privacy concerns. Differential privacy is a statistical definition of privacy that has gained prominence in academia, but is still an emerging technology in industry. In this talk, I share the story of how we put differential privacy into production at the WMF, through looking at the case study of geolocated daily pageview counts.
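The core mechanism behind differentially private count releases can be sketched in a few lines. This is a minimal illustration of the Laplace mechanism, not the WMF's production pipeline; the cells and counts are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(42)

def laplace_count(true_count, epsilon, sensitivity=1.0, rng=rng):
    """Release a count with epsilon-differential privacy via the Laplace
    mechanism: add noise drawn from Laplace(0, sensitivity/epsilon)."""
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    # Counts cannot be negative; clamping is post-processing and does not
    # weaken the privacy guarantee.
    return max(0, round(true_count + noise))

# Hypothetical geolocated daily pageview counts by (country, page) cell.
true_counts = {("FR", "Paris"): 1423, ("FR", "Lyon"): 87, ("NZ", "Paris"): 3}
released = {cell: laplace_count(n, epsilon=1.0) for cell, n in true_counts.items()}
```

Smaller epsilon means more noise and stronger privacy; large counts survive noising nearly intact, while tiny counts (which are the most identifying) are effectively drowned out.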

September 2023

Rules on Wikipedia

September 20, 2023 Video: YouTube

Wikipedia Community Policies and Experiential Epistemology
Critical Information Literacy, Social Justice, and Inclusive Practices
By Zachary J. McDowell, University of Illinois at Chicago and Matthew Vetter, Indiana University of Pennsylvania
Drawing from a meta-analysis of research on learning outcomes in Wikipedia-based education, this presentation addresses Wikipedia community policies and practices through the Framework for Information Literacy for Higher Education from the Association of College and Research Libraries (ACRL). Wikipedia-based educational practices, which promote newcomers' active engagement in the encyclopedia, have been shown to support experiential learning in critical information literacy, communication and research outcomes, and social justice. Exploring the connections between participation in Wikipedia and transferable skills for information literacy in the context of the current new media landscape, this presentation grapples with new questions for the future of information literacies alongside the implications of large language models (LLMs), systemic biases, and the representation and inclusion of non-western and indigenous knowledge sources.
  • Papers:
McDowell, Z. J., & Vetter, M. A. (2022). Wikipedia as Open Educational Practice: Experiential Learning, Critical Information Literacy, and Social Justice. Social Media + Society, 8(1).
McDowell, Z. J., & Vetter, M. A. (2020). It Takes a Village to Combat a Fake News Army: Wikipedia’s Community and Policies for Information Literacy. Social Media + Society, 6(3).
McDowell, Z., & Vetter, M. (2022). Fast “Truths” and Slow Knowledge; Oracular Answers and Wikipedia’s Epistemology. Fast Capitalism, 19(1).

Variation and overlap in the peer production of community rules
the case of five Wikipedias
By Sohyeon Hwang, Northwestern University
In this talk, I present work analyzing the rules and rule-making on Wikipedia. The governance of many online communities relies on rules created by participants. However, prior work predominantly focuses on efforts within a single community or on a platform as a whole. Here we investigate the comparative and relational dimensions of online self-governance in a set of similar communities by looking at the five largest language editions of Wikipedia. Using exhaustive trace data spanning almost 20 years since their founding, we examine patterns in rule-making and overlaps in rule sets. Our findings show that language editions have similar trajectories of rule-making activity, replicating and extending a rich body of work that has focused on English-language Wikipedia alone. We also find that the language editions have increasingly unique rule sets, even as editing activity concentrates on rules shared between them. The results suggest that self-governing communities aligned in key ways may share a common core of rules and rule-making practices even as they develop and sustain institutional variations.

August 2023

No Showcase due to Wikimania

July 2023

16:30 UTC: Find your local time here
Improving knowledge integrity in Wikimedia projects

July 19, 2023 Video: YouTube

Assessment of Reference Quality on Wikipedia
By Aitolkyn Baigutanova, KAIST
In this talk, I will present our research on the reliability of Wikipedia through the lens of its references. I will primarily discuss our paper on the longitudinal assessment of reference quality on English Wikipedia, where we operationalize the notion of reference quality by defining reference need (RN), i.e., the percentage of sentences missing a citation, and reference risk (RR), i.e., the proportion of non-authoritative references. I will share our research findings on two key aspects: (1) the evolution of reference quality over a 10-year period and (2) factors that affect reference quality. We discover that the RN score has dropped by 20 percentage points, with more than half of verifiable statements now accompanied by references. The RR score has remained below 1% over the years as a result of the efforts of the community to eliminate unreliable references. As an extension of this work, we explore how community initiatives, such as the perennial source list, help with maintaining reference quality across multiple language editions of Wikipedia. We hope our work encourages more active discussions within Wikipedia communities to improve reference quality of the content.
  • Paper: Aitolkyn Baigutanova, Jaehyeon Myung, Diego Saez-Trumper, Ai-Jou Chou, Miriam Redi, Changwook Jung, and Meeyoung Cha. 2023. Longitudinal Assessment of Reference Quality on Wikipedia. In Proceedings of the ACM Web Conference 2023 (WWW '23). Association for Computing Machinery, New York, NY, USA, 2831–2839.
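The two metrics defined above are simple proportions. A toy sketch of how they could be computed, under assumptions (the data layout and the blocklist are illustrative; the paper's operationalization of citation need and source authority is more involved):

```python
def reference_need(sentences):
    """RN: fraction of sentences that lack a citation (lower is better)."""
    return sum(1 for s in sentences if not s["has_citation"]) / len(sentences)

def reference_risk(references, deprecated_domains):
    """RR: fraction of references to non-authoritative sources, approximated
    here by matching domains against a blocklist such as one derived from
    the perennial sources list."""
    return sum(1 for r in references if r["domain"] in deprecated_domains) / len(references)

# Hypothetical article data.
sentences = [{"has_citation": True}, {"has_citation": False},
             {"has_citation": True}, {"has_citation": True}]
refs = [{"domain": "nature.com"}, {"domain": "example-tabloid.com"}]

rn = reference_need(sentences)                       # 1 of 4 sentences uncited
rr = reference_risk(refs, {"example-tabloid.com"})   # 1 of 2 references risky
```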

Multilingual approaches to support knowledge integrity in Wikipedia
By Diego Saez-Trumper & Pablo Aragón, Wikimedia Foundation
Knowledge integrity in Wikipedia is key to ensure the quality and reliability of information. For that reason, editors devote a substantial amount of their time in patrolling tasks in order to detect low-quality or misleading content. In this talk we will cover recent multilingual approaches to support knowledge integrity. First, we will present a novel design of a system aimed at assisting the Wikipedia communities in addressing vandalism. This system was built by collecting a massive dataset of multiple languages and then applying advanced filtering and feature engineering techniques, including multilingual masked language modeling to build the training dataset from human-generated data. Second, we will showcase the Wikipedia Knowledge Integrity Risk Observatory, a dashboard that relies on a language-agnostic version of the former system to monitor high risk content in hundreds of Wikipedia language editions. We will conclude with a discussion of different challenges to be addressed in future work.
  • Papers:
Trokhymovych, M., Aslam, M., Chou, A. J., Baeza-Yates, R., & Saez-Trumper, D. (2023). Fair multilingual vandalism detection system for Wikipedia. arXiv e-prints, arXiv-2306.
Aragón, P., & Sáez-Trumper, D. (2021). A preliminary approach to knowledge integrity risk assessment in Wikipedia projects. arXiv preprint arXiv:2106.15940.

June 2023

16:30 UTC: Find your local time here
Wikimedia and LGBTQIA+

June 21, 2023 Video: YouTube

Multilingual Contextual Affective Analysis of LGBT People Portrayals in Wikipedia
By Chan Park, Carnegie Mellon University
Abstract: In this talk, I present our research on analyzing the portrayal of LGBT individuals in their biographies on Wikipedia, with a particular focus on subtle word connotations and cross-cultural comparisons. We aim to address two primary research questions: 1) How can we effectively measure the nuanced connotations of words in multilingual texts, which reflect sentiments, power dynamics, and agency? 2) How can we analyze the portrayal of a specific group, such as the LGBT community, and compare these portrayals across different languages? To answer these questions, we collect the Multilingual Contextualized Connotation Frames dataset, comprising 2,700 examples in English, Spanish, and Russian. We also develop a new multilingual model based on pre-trained multilingual language models. Additionally, we devise a matching algorithm to construct a comparison corpus for the target corpus, isolating the attribute of interest. Finally, we showcase how our developed models and constructed corpora enable us to conduct cross-cultural analysis of LGBT People Portrayals on Wikipedia. Our results reveal systematic differences in how the LGBT community is portrayed across languages, surfacing cultural differences in narratives and signs of social biases.

How do you represent my gender? Challenges and opportunities from the Wikidata Gender Diversity project
By Daniele Metilli, University College London
Abstract: Wikidata Gender Diversity (WiGeDi) is a one-year project funded through the Wikimedia Research Fund. The project is studying gender diversity in Wikidata, focusing on marginalized gender identities such as those of trans and non-binary people, and adopting a queer and intersectional feminist perspective. The project is organised in three strands — model, data, and community. First, we are looking at how the current Wikidata ontology model represents gender, and the extent to which this representation is inclusive of marginalized gender identities. We are analysing the data stored in the knowledge base to gather insights and identify possible gaps and biases. Finally, we are looking at how the community has handled the move towards the inclusion of a wider spectrum of gender identities by studying a corpus of user discussions through computational linguistics methods. This presentation will report on the current status of the Wikidata Gender Diversity project and the envisioned outcomes. We will discuss the main challenges that we are facing and the opportunities that our project will potentially enable, on Wikidata and beyond.

May 2023

No Showcase this month. Join us in the 10th edition of Wiki Workshop on May 11th starting 12:00 UTC instead.

April 2023

16:30 UTC: Find your local time here
Images on Wikipedia

April 19, 2023 Video: YouTube

A large scale study of reader interactions with images on Wikipedia
By Daniele Rama, University of Turin
Wikipedia is the largest source of free encyclopedic knowledge and one of the most visited sites on the Web. To increase reader understanding of the article, Wikipedia editors add images within the text of the article's body. However, despite their widespread usage on web platforms and the huge volume of visual content on Wikipedia, little is known about the importance of images in the context of free knowledge environments. To bridge this gap, we collect data about English Wikipedia reader interactions with images during one month and perform the first large-scale analysis of how interactions with images happen on Wikipedia. First, we quantify the overall engagement with images, finding that one in 29 pageviews results in a click on at least one image, one order of magnitude higher than interactions with other types of article content. Second, we study what factors are associated with image engagement and observe that clicks on images occur more often in shorter articles, articles about visual arts or transport, and biographies of less well-known people. Third, we look at interactions with Wikipedia article previews and find that images help support reader information needs when navigating through the site, especially for more popular pages. The findings in this study deepen our understanding of the role of images for free knowledge and provide a guide for Wikipedia editors and web user communities to enrich the world's largest source of encyclopedic knowledge.

Visual gender biases in Wikipedia: A systematic evaluation across the ten most spoken languages
By Pablo Beytia, Catholic University of Chile
The existing research suggests a significant gender gap in Wikipedia biographical articles, with a minimal representation of women and gender asymmetries in the textual content. However, the visual aspects of this gap (e.g., image volume and quality) have received little attention. This study examined asymmetries between women's and men's biographies, exploring written and visual content across the ten most widely spoken languages. The cross-lingual analysis reveals that (1) the most salient male biases appear when editors select which personalities should have a Wikipedia page, (2) the trends in written and visual content are dissimilar, (3) male biographies tend to have more images across languages, and (4) female biographies have better visual quality on average. The open database of this study provides eight indicators of gender asymmetries in ten occupational domains and ten languages. That information allows for a granular view of gender biases, as well as exploring more macroscopic phenomena, such as the similarity between Wikipedia versions according to their gender bias structures.
  • Papersː
Beytía, P., Agarwal, P., Redi, M., & Singh, V. K. (2022). Visual Gender Biases in Wikipedia: A Systematic Evaluation across the Ten Most Spoken Languages. Proceedings of the International AAAI Conference on Web and Social Media, 16(1), 43-54.
Beytía, P. & Wagner, C. (2022). Visibility layers: a framework for systematizing the gender gap in Wikipedia content. Internet Policy Review, 11(1).

March 2023

9:30am PDT / 12:30pm EDT / 16:30 UTC: Find your local time here
Gender and Equity

March 15, 2023 Video: YouTube

Men Are Elected, Women Are Married: Events Gender Bias on Wikipedia
By Jiao Sun, University of Southern California
Abstract: Human activities can be seen as sequences of events, which are crucial to understanding societies. Disproportional event distribution for different demographic groups can manifest and amplify social stereotypes, and potentially jeopardize the ability of members in some groups to pursue certain goals. Our study discovers that Wikipedia pages tend to intermingle personal life events with professional events for females but not for males, which calls for the awareness of the Wikipedia community to formalize guidelines and train the editors to mind the implicit biases that contributors carry.

Twitter reacts to absence of women on Wikipedia: a mixed-methods analysis of #VisibleWikiWomen campaign
By Sneh Gupta, Guru Gobind Singh Indraprastha University
Digital gender divide (DGD) is visible in access, participation, representation, and biases against women embedded in Wikipedia, the largest digital reservoir of co-created content. This article examined the content of #VisibleWikiWomen, a global digital advocacy campaign aimed at encouraging inclusion of women voices in the global technology conversation and improving digital sustainability of feminist data on Wikipedia. In a mixed-methods study, Sentiment Analysis followed by a Feminist Critical Discourse Analysis of the campaign tweets reveals how digital gender divide manifested in the public response. An overwhelming majority of tweets expressed positive sentiment towards the objective of the campaign. An inductive reading of the coded tweets (n = 1067) generated five themes: Feminist Activism, Invisibility & Marginalization of Women, Technology for Women Empowerment, Gendered Knowledge Inequity, and Power Dynamics in the Digital Sphere. Twitter discourse presented many agitated digital users calling out the epistemic injustice on Wikipedia that goes beyond the invisibility of women. Their tweets reveal that they want an equal social platform inclusive of women of color and varied identities currently absent in the Wikipedia universe. Extracting ideas, values, and themes from new media campaigns holds unparalleled potential in the diffusion of interventions and messages on a larger scale.

February 2023

9:30am PDT / 12:30pm EDT / 17:30 UTC: Find your local time here
The Free Knowledge Ecosystem

February 15, 2023 Video: YouTube

The evolution of humanitarian mapping in OpenStreetMap (OSM) and how it affects map completeness and inequalities in OSM
By Benjamin Herfort, Heidelberg Institute for Geoinformation Technology
Mapping efforts of communities in OpenStreetMap (OSM) over the previous decade have created a unique global geographic database, which is accessible to all with no licensing costs. The collaborative maps of OSM have been used to support humanitarian efforts around the world as well as to fill important data gaps for implementing major development frameworks such as the Sustainable Development Goals (SDGs). Besides the well-examined Global North-Global South bias in OSM, the OSM data as of 2023 shows a much more spatially diverse spread pattern than previously considered, which was shaped by regional, socio-economic, and demographic factors across several scales. Humanitarian mapping efforts of the previous decade have already made OSM more inclusive, contributing to diversifying and expanding the spatial footprint of the areas mapped. However, methods to quantify and account for the remaining biases in OSM's coverage are needed so that researchers and practitioners will be able to draw the right conclusions, e.g. about progress towards the SDGs in cities.

Dataset reuse: Toward translating principles to practice
By Laura Koesten, University of Vienna
The web provides access to millions of datasets. These data can have additional impact when used beyond the context for which they were originally created. But using a dataset beyond the context in which it originated remains challenging. Simply making data available does not mean it will be or can be easily used by others. At the same time, we have little empirical insight into what makes a dataset reusable and which of the existing guidelines and frameworks have an impact. In this talk, I will discuss our research on what makes data reusable in practice. This is informed by a synthesis of literature on the topic, our studies on how people evaluate and make sense of data, and a case study on datasets on GitHub. In the case study, we describe a corpus of more than 1.4 million data files from over 65,000 repositories. Building on reuse features from the literature, we use GitHub's engagement metrics as proxies for dataset reuse and devise an initial model, using deep neural networks, to predict a dataset's reusability. This demonstrates the practical gap between principles and actionable insights that might allow data publishers and tool designers to implement functionalities that facilitate reuse.

January 2023[edit]

9:30am PDT / 12:30pm EDT Find your local time here
Editor Retention

January 18, 2023 Video: YouTube

Learning to Predict the Departure Dynamics of Wikidata Editors
By Guangyuan Piao, Maynooth University
Wikidata as one of the largest open collaborative knowledge bases has drawn much attention from researchers and practitioners since its launch in 2012. As it is collaboratively developed and maintained by a community of a great number of volunteer editors, understanding and predicting the departure dynamics of those editors are crucial but have not been studied extensively in previous works. In this paper, we investigate the synergistic effect of two different types of features: statistical and pattern-based ones with DeepFM as our classification model which has not been explored in a similar context and problem for predicting whether a Wikidata editor will stay or leave the platform. Our experimental results show that using the two sets of features with DeepFM provides the best performance regarding AUROC (0.9561) and F1 score (0.8843), and achieves substantial improvement compared to using either of the sets of features and over a wide range of baselines.
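The paper's full model (DeepFM over statistical and pattern-based features) is beyond a short sketch, but the statistical side can be illustrated. The feature names and toy edit history below are hypothetical, chosen only to show the kind of activity summaries such a classifier might consume:

```python
from datetime import date

def activity_features(edit_dates, today):
    """Summarize an editor's edit history into simple statistical features
    of the kind used (alongside pattern-based features) to predict whether
    the editor will stay or leave."""
    gaps = [(b - a).days for a, b in zip(edit_dates, edit_dates[1:])]
    return {
        "n_edits": len(edit_dates),
        "days_since_last": (today - edit_dates[-1]).days,
        "mean_gap": sum(gaps) / len(gaps) if gaps else 0.0,
        "max_gap": max(gaps) if gaps else 0,
    }

# Hypothetical editor with three edits, evaluated on 2023-03-01.
feats = activity_features(
    [date(2023, 1, 1), date(2023, 1, 5), date(2023, 2, 1)],
    today=date(2023, 3, 1),
)
```

A long gap since the last edit relative to an editor's typical inter-edit gap is the intuitive signal such features encode.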


December 2022

9:30am PDT / 12:30pm EDT Find your local time here
A year in review from the WMF Research team: Tying our work to the research community

December 14, 2022 Video: YouTube

Research as a service
By The WMF Research team
The Wikimedia Research community is key to tackling the many strategic challenges of the Wikimedia movement. As we are ending the year, the Research team will reflect on why working with the community is important to us. We will share the initiatives, tools, and resources developed throughout 2022 to bring the community together, facilitate researchers’ contributions to the Wikimedia projects, and encourage a diversity of research questions.

November 2022

9:30am PDT / 12:30pm EDT Find your local time here
Libraries and Wikimedia knowledge

November 16, 2022 Video: YouTube

Wikipedia and Academic Libraries
By Laurie Bridges (Oregon State University)
In 2021 an open-access edited book, Wikipedia and Academic Libraries: A Global Project, was published, featuring 20 chapters from over 50 authors. In this presentation, Laurie Bridges, one of the co-editors, will discuss the process for creating and publishing an OA-edited book. Michael David Miller, one of the chapter authors, will discuss his chapter about contributions to local Québécois LGBTQ+ content in Francophone Wikipedia.

Liaison Librarian Contribution to Local Quebecois LGBTQ+ Content in Francophone Wikipedia
By Michael David Miller (McGill University)

Ethical Considerations of Including Gender Information in Open Knowledge Platforms
By Nerissa Lindsey (San Diego State University)
In recent years, galleries, libraries, archives, and museums (GLAMs) have sought to leverage open knowledge platforms such as Wikidata to highlight or provide more visibility for traditionally marginalized groups and their work, collections, or contributions. Efforts like Art + Feminism, local edit-a-thons, and, more recently, GLAM institution-led projects have promoted open knowledge initiatives to a broader audience of participants. One such open knowledge project, the Program for Cooperative Cataloging (PCC) Wikidata Pilot, has brought together over seventy GLAM organizations to contribute linked open data for individuals associated with their institutions, collections, or archives. However, these projects have brought up ethical concerns around including potentially sensitive personal demographic information, such as gender identity, sexual orientation, race, and ethnicity, in entries in an open knowledge base about living persons. GLAM institutions are thus in a position of balancing open access with ethical cataloging, which should include adhering to the personal preferences of the individuals whose data is being shared. People working in libraries and archives have been increasingly focusing their energies on issues of diversity, equity, and inclusion in their descriptive practices, including remediating legacy data and addressing biased language. Moving this work into a more public sphere and scaling up in volume creates potential risks to the individuals being described. While adding demographic information on living people to open knowledge bases has the potential to enhance, highlight, and celebrate diversity, it could also potentially be used to the detriment of the subjects through surveillance and targeting activities. 
In our research we investigated the changing role of metadata and open knowledge in addressing, or not addressing, issues of under- and misrepresentation, especially as they pertain to gender identity as described in the sex or gender property in Wikidata. We reported our findings from a survey investigating how organizations participating in open knowledge projects are addressing ethical concerns around including personal demographic information as part of their projects, including what, if any, policies they have implemented and what implications these activities may have for the living people being described.

October 2022

9:30am PDT / 12:30pm EDT / 16:30 UTC Find your local time here
Panel discussion celebrating Wikidata's 10th birthday!

October 19, 2022 Video: YouTube

By Denny Vrandečić (WMF) with panelists Lydia Pintscher (WMDE), Elena Simperl (King's College London), Katherine Thornton (Yale), and Markus Krötzsch (Technical University of Dresden).
October 2022 marks the tenth anniversary of the launch of Wikidata. In ten years, this project has become the largest community-driven free knowledge graph in the world, enabling a common knowledge base for Wikimedia projects. The language-independent nature of Wikidata has greatly improved the maintenance and consistency of knowledge across Wikipedia language editions, fostering knowledge equity in Wikimedia. In addition, since Wikidata is a collaborative project that can be read and edited by humans and machines alike, it is also widely used in third-party applications delivering knowledge as a service for all. The Wikimedia Research community has devoted significant effort and resources to studying the foundations, capabilities, and applications of Wikidata, from the complex requirements of representing real-world knowledge in a multilingual environment to the need to assess the quality of data and sources in Wikidata. To learn more about the state of the art of Wikidata research and its challenges in the era of AI/ML, we will celebrate this tenth anniversary with a panel that brings together established researchers and practitioners in the field.

September 2022

No Showcase this month. The Research team will meet for an in-person offsite in Prague September 19-22. We are very excited that we can finally see each other in person after almost three years of not being able to meet. If you are in Prague during that period, feel free to ping us; we would be happy to catch up in person if we can align schedules. Otherwise, see you all in October.

August 2022

No Showcase due to Wikimania!

July 2022

9:30am PDT / 12:30pm EDT / 18:30 CEST View your local time here
2022 Wikimedia Foundation Research of the Year Award Winners!

July 20, 2022 Video: YouTube

Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning
By Krishna Srinivasan (Google)
The milestone improvements brought about by deep representation learning and pre-training techniques have led to large performance gains across downstream NLP, IR, and vision tasks. Multimodal modeling techniques aim to leverage large, high-quality visio-linguistic datasets to learn complementary information across image and text modalities. In this talk, I introduce the Wikipedia-based Image Text (WIT) Dataset to better facilitate multimodal, multilingual learning. WIT is composed of a curated set of 37.5 million entity-rich image-text examples with 11.5 million unique images across 108 Wikipedia languages.

WIT’s unique advantages include: at the time of writing, WIT was the largest multimodal dataset by number of image-text examples, by a factor of three; WIT is massively multilingual (the first of its kind), covering 100+ languages; and WIT represents a more diverse set of concepts and real-world entities than previous datasets cover.

The WIT dataset is available for download and use under a Creative Commons license here:

I conclude the talk with future directions to expand and extend the WIT dataset. Link to paper:
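As a hedged illustration of working with the released data: WIT is distributed as gzipped TSV files, so a first exploratory step is filtering rows by language. The column names below are assumptions for illustration, not the guaranteed schema of the release.

```python
import csv
import io


def filter_rows(tsv_text, language):
    """Filter WIT-style TSV rows by language.

    `tsv_text` is the decoded text of one TSV shard; the 'language'
    column name is an illustrative assumption about the schema.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row for row in reader if row["language"] == language]
```

In practice one would stream each gzipped shard with `gzip.open` rather than loading it whole, but the filtering logic is the same.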

Assessing the Quality of Sources in Wikidata Across Languages
By Gabriel Amaral (King's College London)
Wikidata is one of the most important sources of structured data on the web, built by a worldwide community of volunteers. As a secondary source, its contents must be backed by credible references; this is particularly important as Wikidata explicitly encourages editors to add claims for which there is no broad consensus, as long as they are corroborated by references. Nevertheless, despite this essential link between content and references, Wikidata’s ability to systematically assess and assure the quality of its references remains limited. To this end, we carry out a mixed-methods study to determine the relevance, ease of access, and authoritativeness of Wikidata references, at scale and in different languages, using online crowdsourcing, descriptive statistics, and machine learning. The findings help us ascertain the quality of references in Wikidata, and identify common challenges in defining and capturing the quality of user-generated multilingual structured data on the web. Link to paper: Link to slides:
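Reference coverage of the kind studied here can be inspected at small scale through the public Wikidata Query Service, where statement nodes link to their references via `prov:wasDerivedFrom`. The sketch below builds such a query; the item QID and the reporting shape are illustrative, not the authors' pipeline.

```python
def reference_count_query(qid):
    """SPARQL for the Wikidata Query Service (query.wikidata.org):
    for one item, count its statement nodes and how many of them
    carry at least one reference via prov:wasDerivedFrom."""
    return f"""
SELECT (COUNT(DISTINCT ?statement) AS ?total)
       (COUNT(DISTINCT ?referenced) AS ?withReference)
WHERE {{
  wd:{qid} ?p ?statement .
  ?statement a wikibase:Statement .
  OPTIONAL {{
    ?statement prov:wasDerivedFrom ?r .
    BIND(?statement AS ?referenced)
  }}
}}"""
```

Submitting the query string to the SPARQL endpoint (with an appropriate `Accept` header for JSON results) gives the two counts; comparing them per item is one crude proxy for reference coverage.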

June 2022

(4:00am PDT / 7:00am EDT / 13:00 CEST)
Wikipedia's languages.

June 15, 2022 Video: YouTube

Quantifying knowledge synchronisation in the 21st century
By Jisung Yoon (Pohang University of Science and Technology)
Humans acquire and accumulate knowledge through language usage and eagerly exchange their knowledge for advancement. Although geographical barriers had previously limited communication, the emergence of information technology has opened new avenues for knowledge exchange. However, it is unclear which communication pathway is dominant in the 21st century. Here, we explore the dominant path of knowledge diffusion in the 21st century using Wikipedia, the largest communal dataset. We evaluate the similarity of shared knowledge between population groups, distinguished based on their language usage. When population groups are more engaged with each other, their knowledge structure is more similar, where engagement is indicated by socio-economic connections, such as cultural, linguistic, and historical features. Moreover, geographical proximity is no longer a critical requirement for knowledge dissemination. Furthermore, we integrate our data into a mechanistic model to better understand the underlying mechanism and suggest that the knowledge "Silk Road" of the 21st century is based online.

Relevant links: paper (preprint), slides
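The core measurement in the talk above, the similarity of shared knowledge between language communities, can be illustrated with a simple cosine similarity over topic-coverage vectors; this is a toy stand-in for the paper's actual similarity measure.

```python
import math


def cosine(u, v):
    """Cosine similarity between two topic-coverage vectors, where
    each component is (say) the share of articles a language edition
    devotes to one topic. A toy stand-in for the paper's measure."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Two editions covering topics in identical proportions score 1.0; editions with disjoint coverage score 0.0.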

The Language Geography of Wikipedia
By Martin Dittus
Every language is a system of being, doing, knowing, and imagining. With over 7,000 active languages in the world, how many languages are fully represented online? To answer this question, digital non-profit Whose Knowledge? initiated the first ever report on the State of the Internet's Languages. As part of this report, Martin Dittus and Mark Graham have investigated the languages of Wikipedia. Wikipedia began with a single English-language edition more than two decades ago, and now offers more than 300 language editions, which places it at the forefront of digital language support. However, this does not mean that speakers of these languages get access to the same content: Wikipedia’s language editions vary widely in scale. We further find that this inequality is also reflected in Wikipedia’s geographic coverage: not all places are captured in every language. Wikipedia's coverage often follows the global distribution of speakers of the respective language. Yet even when we account for the distribution of language populations, certain language communities are much more strongly represented on Wikipedia than others. As a consequence, we find that for many countries in Africa, Central and South America, and South Asia, most of the content about those countries is in a foreign language, often a European-colonial language. In other words, in many of these places, people may need to be able to speak a second (possibly foreign) language in order to access Wikipedia information about their own places. Why do we see these differences? And what can be done to improve things?

Relevant links: The Language Geography of Wikipedia, State of the Internet's Languages Report, Slides

May 2022

(9:30am PDT / 12:30pm EDT / 18:30 CEST)
Gaps and Biases in Wikipedia.

May 18, 2022 Video: YouTube

Ms. Categorized: Gender, notability, and inequality on Wikipedia
By Francesca Tripodi (University of North Carolina at Chapel Hill)
For the last five decades, sociologists have argued that gender is one of the most pervasive and insidious forms of inequality. Research demonstrates how these inequalities persist on Wikipedia, arguably the largest encyclopedic reference in existence. Roughly eighty percent of Wikipedia's editors are men, and pages about women and women's interests are underrepresented. English-language Wikipedia contains more than 1.5 million biographies of notable writers, inventors, and academics, but less than nineteen percent of these biographies are about women. To try to improve these statistics, activists host “edit-a-thons” to increase the visibility of notable women. While this strategy helps create biographies that would otherwise not exist, it fails to address a more inconspicuous form of gender exclusion. Drawing on ethnographic observations, interviews, and quantitative analysis of web-scraped metadata, this talk demonstrates that women’s biographies are more frequently considered non-notable and nominated for deletion than men’s biographies. This disproportionate rate is another dimension of gender inequality on Wikipedia, previously unexplored by social scientists, and provides broader insights into how women’s achievements are (under)valued in society.

Relevant paper: Ms. Categorized: Gender, notability, and inequality on Wikipedia - Francesca Tripodi, 2021

Controlled Analyses of Social Biases in Wikipedia Bios
By Yulia Tsvetkov (University of Washington)
Social biases on Wikipedia could greatly influence public opinion. Wikipedia is also a popular source of training data for NLP models, and subtle biases in Wikipedia narratives are liable to be amplified in downstream NLP models. In this talk I'll present two approaches to unveiling social biases in how people are described on Wikipedia, across demographic attributes and across languages. First, I'll present a methodology that isolates dimensions of interest (e.g., gender), from other attributes (e.g., occupation). This methodology allows us to quantify systemic differences in coverage of different genders and races, while controlling for confounding factors. Next, I'll show an NLP case study that uses this methodology in combination with people-centric sentiment analysis to identify disparities in Wikipedia bios of members of the LGBTQIA+ community across three languages: English, Russian, and Spanish. Our results surface cultural differences in narratives and signs of social biases. Practically, these methods can be used to automatically identify Wikipedia articles for further manual analysis—articles that might contain content gaps or an imbalanced representation of particular social groups.

Relevant papers: TheWebConf'22, ICWSM'21

April 2022

No showcase this month. See you at Wiki Workshop 2022 and Wiki-M3L.

March 2022

Patterns and dynamics of article quality

March 16, 2022 Video: YouTube

Quality monitoring in Wikipedia - A computational perspective
By Animesh Mukherjee (Indian Institute of Technology, Kharagpur)
In this talk, I shall summarize highlights from our five-year-long research program concerning Wikipedia. In particular, I shall dive deep into two of our recent works: the first attempts to understand early indications of which editors will soon go "missing" (aka missing editors) [1], while the second investigates how the quality of a Wikipedia article transitions over time and whether computational models can be built to understand the characteristics of future transitions [2]. In each case, I will present a suite of key results and the main insights we obtained from them.

Automatically Labeling Low Quality Content on Wikipedia by Leveraging Editing Behaviors
By Sumit Asthana (University of Michigan, Ann Arbor)
Wikipedia articles aim to be definitive sources of encyclopedic content. Yet only 0.6% of Wikipedia articles are rated high quality on Wikipedia's quality scale, owing to the insufficient number of editors relative to the enormous number of articles. Supervised machine learning (ML) quality-improvement approaches that can automatically identify and fix content issues rely on manual labels of the quality of individual Wikipedia sentences. However, current labeling approaches are tedious and produce noisy labels. In this talk, I will discuss an automated labeling approach that identifies the semantic category of historic Wikipedia edits (e.g., adding citations, clarifications) and uses the sentences as they stood before the edit as examples requiring that semantic improvement. Sentences from the highest-rated articles serve as examples that no longer need semantic improvement. I will discuss the performance of models trained with this labeling approach compared to models trained with existing labeling approaches, as well as the implications of such a large-scale semi-supervised labeling approach for capturing the editing practices of Wikipedia editors and helping them improve articles faster.
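As a toy contrast to the automated approach described above, a naive rule-based labeler over edit summaries might look like the sketch below. The keywords and category names are illustrative, not the talk's method or taxonomy.

```python
def label_edit(comment):
    """Toy rule-based labeler mapping an edit summary to a semantic
    category. Keyword lists are illustrative assumptions, not the
    taxonomy used in the talk."""
    text = comment.lower()
    if any(k in text for k in ("citat", "cite", "source", "ref")):
        return "citation"
    if any(k in text for k in ("clarif", "reword", "copyedit")):
        return "clarification"
    return "other"
```

Rules like these are brittle and noisy, which is precisely the motivation for learning labels from editing behavior instead.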

February 2022

Collective Attention in Wikipedia

February 16, 2022 Video: YouTube

Modeling Collective Anticipation and Response on Wikipedia
By Renaud Lambiotte (University of Oxford)
The dynamics of popularity in online media are driven by a combination of endogenous spreading mechanisms and response to exogenous shocks, including news and events. However, little is known about the dependence of temporal patterns of popularity on event-related information, e.g., which types of events trigger long-lasting activity. Here we propose a simple model that describes the dynamics around peaks of popularity by incorporating key features, i.e., the anticipatory growth and the decay of collective attention together with circadian rhythms. The proposed model allows us to develop a new method for predicting future page view activity and for clustering time series. To validate our methodology, we collect a corpus of page view data from Wikipedia associated with a range of planned events, i.e., events that we know in advance will happen on a fixed future date, such as elections and sporting events. Our methodology is superior to existing models in both prediction and clustering tasks. Furthermore, restricting to Wikipedia pages associated with association football, we observe that the specific realization of the event, in our case which team wins a match or the type of the match, has a significant effect on the response dynamics after the event. Our work demonstrates the importance of appropriately modeling all phases of collective attention, as well as the connection between temporal patterns of attention and characteristic underlying information of the events they represent.
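The model's three ingredients (anticipatory growth, post-event decay, and circadian rhythms) can be sketched as a toy attention curve; the functional forms and parameter values below are illustrative choices, not the paper's fitted model.

```python
import math


def attention(t, peak=100.0, rise=0.5, decay=1.2, circadian_amp=0.2):
    """Toy attention curve around a planned event at t = 0 (hours):
    exponential anticipatory growth before the event, power-law
    relaxation after it, modulated by a 24-hour circadian rhythm."""
    if t < 0:
        base = peak * math.exp(rise * t)      # anticipation builds toward the event
    else:
        base = peak / (1.0 + t) ** decay      # collective attention decays
    rhythm = 1.0 + circadian_amp * math.sin(2 * math.pi * t / 24.0)
    return base * rhythm
```

Fitting curves of this shape to observed page view series is what lets one compare anticipation-dominated and reaction-dominated events.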

Sudden Attention Shifts on Wikipedia During the COVID-19 Crisis
By Kristina Gligorić (EPFL)
We study how the COVID-19 pandemic, alongside the severe mobility restrictions that ensued, has impacted information access on Wikipedia, the world’s largest online encyclopedia. A longitudinal analysis that combines pageview statistics for 12 Wikipedia language editions with mobility reports published by Apple and Google reveals massive shifts in the volume and nature of information seeking patterns during the pandemic. Interestingly, while we observe a transient increase in Wikipedia’s pageview volume following mobility restrictions, the nature of information sought was impacted more permanently. These changes are most pronounced for language editions associated with countries where the most severe mobility restrictions were implemented. We also find that articles belonging to different topics behaved differently; e.g., attention towards entertainment-related topics is lingering and even increasing, while the interest in health- and biology-related topics was either small or transient. Our results highlight the utility of Wikipedia for studying how the pandemic is affecting people’s needs, interests, and concerns.
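Pageview data of the kind analyzed in both talks is publicly available through the Wikimedia Pageviews REST API. A minimal sketch of constructing a per-article request URL follows; the parameter values are illustrative.

```python
from urllib.parse import quote


def pageviews_url(project, article, start, end,
                  access="all-access", agent="user", granularity="daily"):
    """Build a request URL for the Wikimedia Pageviews REST API
    (per-article endpoint). Defaults are illustrative choices."""
    return ("https://wikimedia.org/api/rest_v1/metrics/pageviews/"
            f"per-article/{project}/{access}/{agent}/"
            f"{quote(article, safe='')}/{granularity}/{start}/{end}")
```

Fetching the resulting URL returns JSON with one item per day, which can then be aggregated per language edition for longitudinal analyses like the one above.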

January 2022

Beyond English Wikipedia

January 19, 2022 Video: YouTube

Comparing Language Communities - Characterizing Collaboration in the English, French and Spanish Language Editions of Wikipedia
By Taryn Bipat (Microsoft, formerly University of Washington)
Is Wikipedia a standardized platform with a common model of collaboration or is it a set of 312 active language editions with distinct collaborative models? In the last 20 years, researchers have extensively analyzed the complexities of group work that enable the creation of quality articles in the English Wikipedia, but most of our intellectual assumptions about collaborative practices on Wikipedia remain solely based on an Anglocentric perspective. This research extends the current Anglocentric body of literature in human-computer interaction (HCI) and computer-supported cooperative work (CSCW) through three studies that mutually help build an understanding of collaboration models in the English (EN), French (FR), and Spanish (ES) editions of Wikipedia. In the first study, I replicated a model by Viégas et al. (2007) based on editors' behaviors in the English Wikipedia. This model was used as a lens to examine collaborative activity in EN, FR, and ES. In the second study, I leveraged a collaboration model by Kriplean et al. (2007) that suggested editors used “power plays” – how groups of editors claim control over article content through the discourse of Wikipedia policy – in their talk page debates to justify their edits made on articles. In the third study, I interviewed editors from each language edition to build a typology of collaborative behavior and further understand the editor's perceptions of power and authority on Wikipedia.
* Related papers:

Understanding Wikipedia Practices Through Hindi, Urdu, and English Takes on an Evolving Regional Conflict
By Jacob Thebault-Spieker (Information School, University of Wisconsin – Madison)
Wikipedia is the product of thousands of editors working collaboratively to provide free and up-to-date encyclopedic information to the project’s users. This article asks to what degree Wikipedia articles in three languages — Hindi, Urdu, and English — achieve Wikipedia’s mission of making neutrally-presented, reliable information on a polarizing, controversial topic available to people around the globe. We chose the topic of the recent revocation of Article 370 of the Constitution of India, which, along with other recent events in and concerning the region of Jammu and Kashmir, has drawn attention to related articles on Wikipedia. This work focuses on the English Wikipedia, being the preeminent language edition of the project, as well as the Hindi and Urdu editions. Hindi and Urdu are the two standardized varieties of Hindustani, a lingua franca of Jammu and Kashmir. We analyzed page view and revision data for three Wikipedia articles to gauge popularity of the pages in our corpus, and responsiveness of editors to breaking news events and problematic edits. Additionally, we interviewed editors from all three language editions to learn about differences in editing processes and motivations, and we compared the text of the articles across languages as they appeared shortly after the revocation of Article 370. Across languages, we saw discrepancies in article tone, organization, and the information presented, as well as differences in how editors collaborate and communicate with one another. Nevertheless, in Hindi and Urdu, as well as English, editors predominantly try to adhere to the principle of neutral point of view (NPOV), and for the most part, the editors quash attempts by other editors to push political agendas.


December 2021

Online Education Landscapes

December 15, 2021 Video: YouTube

Latin American Youth and their Information Ecosystem - Finding, Evaluation, Creating, and Sharing Content Online
By Lionel Brossi and Ana María Castillo (Artificial Intelligence and Society Hub, University of Chile)
The increasing importance of the Internet as a core source of information in youth's lives, now underscored by the pandemic, gives new urgency to the need to better understand young people’s information habits and attitudes. Answers to questions like where young people go to look for information, what information they decide to trust, and how they share the information they find hold important implications for the knowledge they obtain, the beliefs they form, and the actions they take in areas ranging from personal health to professional employment to educational training. In this research showcase, we will summarize insights from focus group interviews in Latin America that offer a window into the experiences of young people themselves. Taken together, these perspectives might help us develop a more comprehensive understanding of how young people in Latin America use the Internet in general and interact with information from online sources in particular.

Characterizing the Online Learning Landscape - What and How People Learn Online
By Sean Kross (University of California San Diego)
Hundreds of millions of people learn something new online every day. Simultaneously, the study of online education has blossomed, with new systems, experiments, and observations creating and exploring previously undiscovered online learning environments. In this talk I will discuss our study, in which we endeavor to characterize the entire landscape of online learning experiences using a national survey of 2,260 U.S. adults balanced to match the demographics of the U.S. We examine the online learning resources they consult, and we analyze the subjects they pursue using those resources. Furthermore, we compare formal and informal online learning experiences on a larger scale than, to our knowledge, has ever been done before, to better understand which subjects people seek out for intensive study. We find that there is a core set of online learning experiences that are central to other experiences and shared among the majority of people who learn online.

November 2021

Content moderation

November 17, 2021 Video: YouTube

Is Deplatforming Censorship? What happened when controversial figures were deplatformed, with philosophical musings on the nature of free speech.
By Amy S. Bruckman (Georgia Institute of Technology)
When a controversial figure is deplatformed, what happens to their online influence? In this talk, first, I’ll present results from a study of the deplatforming from Twitter of three figures who repeatedly broke platform rules (Alex Jones, Milo Yiannopoulos, and Owen Benjamin). Second, I’ll discuss what happened when this study was on the front page of Reddit, and the range of angry reactions from people who say that they’re in favor of “free speech.” I’ll explore the nature of free speech, and why our current speech regulation framework is fundamentally broken. Finally, I’ll conclude with thoughts on the strength of Wikipedia’s model in contrast to other platforms, and highlight opportunities for improvement.

Effects of Algorithmic Flagging on Fairness. Quasi-experimental Evidence from Wikipedia
By Nathan TeBlunthuis (University of Washington / Northwestern University)
Online community moderators often rely on social signals such as whether or not a user has an account or a profile page as clues that users may cause problems. Reliance on these clues can lead to "overprofiling" bias when moderators focus on these signals but overlook the misbehavior of others. We propose that algorithmic flagging systems deployed to improve the efficiency of moderation work can also make moderation actions more fair to these users by reducing reliance on social signals and making norm violations by everyone else more visible. We analyze moderator behavior in Wikipedia as mediated by RCFilters, a system which displays social signals and algorithmic flags, and estimate the causal effect of being flagged on moderator actions. We show that algorithmically flagged edits are reverted more often, especially those by established editors with positive social signals, and that flagging decreases the likelihood that moderation actions will be undone. Our results suggest that algorithmic flagging systems can lead to increased fairness in some contexts but that the relationship is complex and contingent.
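The flags surfaced by RCFilters come from machine-learned edit-quality models that were served through the public ORES API. Below is a hedged sketch of building a scores request; the endpoint shape follows ORES's v3 API, but model names vary by wiki and should be treated as assumptions.

```python
def ores_scores_url(context, rev_ids, models=("damaging", "goodfaith")):
    """Build a request URL for the ORES v3 scores endpoint, the
    service behind RCFilters' edit-quality flags. The context is a
    wiki database name (e.g. 'enwiki'); model availability varies."""
    return ("https://ores.wikimedia.org/v3/scores/"
            f"{context}/?models={'|'.join(models)}"
            f"&revids={'|'.join(str(r) for r in rev_ids)}")
```

Fetching such a URL returns per-revision probability scores, which RCFilters thresholds into the visible flags studied in the talk.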

October 2021

Bridging knowledge gaps

October 27, 2021 Video: YouTube

Automatic approaches to bridge knowledge gaps in Wikimedia projects
By WMF Research Team
In order to advance knowledge equity as part of the Wikimedia Movement’s 2030 strategic direction, the Research team at the Wikimedia Foundation has been conducting research to “Address Knowledge Gaps” as one of its main programs. One core component of this program is developing technologies to bridge knowledge gaps. In this talk, we give an overview of how we approach this task using tools from machine learning in four different contexts: section alignment in content translation, link recommendation in structured editing, image recommendation for multimedia knowledge gaps, and the equity of the recommendations themselves. We will present how these models can assist contributors in addressing knowledge gaps. Finally, we will discuss the impact of these models in applications deployed across Wikimedia projects supporting different Product initiatives at the Wikimedia Foundation.
More information on the individual projects:
* Section alignment: meta:Research:Expanding_Wikipedia_articles_across_languages/Inter_language_approach#Section_Alignment
* Link recommendation: meta:Research:Link_recommendation_model_for_add-a-link_structured_task
* Image recommendation: meta:Research:Recommending_Images_to_Wikipedia_Articles
* Equity in recommendations: meta:Research:Prioritization_of_Wikipedia_Articles/Recommendation
Slide deck:
* Slides on figshare

September 2021

Socialization on Wikipedia

September 15, 2021 Video: YouTube

Unlocking the Wikipedia clubhouse to newcomers. Results from two studies.
By Rosta Farzan (School of Computing and Information, University of Pittsburgh)
It is no news to any of us that the success of online production communities such as Wikipedia relies heavily on a continuous stream of newcomers to replace the inevitable high turnover and to bring on board new ideas and workforce. However, these communities have been struggling to attract newcomers, especially from a diverse population of users, and to retain them. In this talk, I will present two different approaches to engaging new editors in Wikipedia: (1) newcomers joining through the Wiki Ed program, an online program in which college students edit Wikipedia articles as class assignments; and (2) newcomers joining through a Wikipedia Art+Feminism edit-a-thon. I will present how each approach incorporated techniques for engaging newcomers and how well each succeeded in attracting and retaining them.
* Bring on Board New Enthusiasts! A Case Study of Impact of Wikipedia Art + Feminism Edit-A-Thon Events on Newcomers, SocInfo 2016 (pdf author's copy)
* Successful Online Socialization: Lessons from the Wikipedia Education Program, CSCW 2020 (pdf author's copy)

The Effect of Receiving Appreciation on Wikipedias. A Community Co-Designed Field Experiment.
By J. Nathan Matias (Citizens and Technology Lab, Cornell University Departments of Communication and Information Science)
Can saying “thank you” make online communities stronger and more inclusive? Or does thanking others for their voluntary efforts have little effect? To answer these questions, the Citizens and Technology Lab (CAT Lab) organized 344 volunteers to send thanks to Wikipedia contributors across the Arabic, German, Polish, and Persian languages. We then observed the behavior of 15,558 newcomers and experienced contributors to Wikipedia. On average, we found that organizing volunteers to thank others increases two-week retention of both newcomer and experienced accounts. It also caused people to send more thanks to others. This study was a field experiment, a randomized trial that sent thanks to some people and not to others. Such experiments can help answer questions about the impact of community practices and platform design, but they can sometimes face community mistrust, especially when researchers conduct them without community consent. In this talk, learn more about CAT Lab's approach to community-led research and discuss open questions about best practices.
* The Diffusion and Influence of Gratitude Expressions in Large-Scale Cooperation: A Field Experiment in Four Knowledge Networks, paper preprint
* Volunteers Thanked Thousands of Wikipedia Editors to Learn the Effects of Receiving Thanks, blogpost (in EN, DE, AR, PL, FA)

August 2021

No showcase due to Wikimania 2021

July 2021

Effects of campaigns to close content gaps

July 21, 2021 Video: YouTube

Content Growth and Attention Contagion in Information Networks. Addressing Information Poverty on Wikipedia
By Kai Zhu (McGill University, Canada)
Open collaboration platforms have fundamentally changed the way that knowledge is produced, disseminated, and consumed. In these systems, contributions arise organically with little to no central governance. Although such decentralization provides many benefits, a lack of broad oversight and coordination can leave questions of information poverty and skewness to the mercy of the system’s natural dynamics. Unfortunately, we still lack a basic understanding of the dynamics at play in these systems and specifically, how contribution and attention interact and propagate through information networks. We leverage a large-scale natural experiment to study how exogenous content contributions to Wikipedia articles affect the attention that they attract and how that attention spills over to other articles in the network. Results reveal that exogenously added content leads to significant, substantial, and long-term increases in both content consumption and subsequent contributions. Furthermore, we find significant attention spillover to downstream hyperlinked articles. Through both analytical estimation and empirically informed simulation, we evaluate policies to harness this attention contagion to address the problem of information poverty and skewness. We find that harnessing attention contagion can lead to as much as a twofold increase in the total attention flow to clusters of disadvantaged articles. Our findings have important policy implications for open collaboration platforms and information networks.
Related papers:
* Content Growth and Attention Contagion in Information Networks: Addressing Information Poverty on Wikipedia. Information Systems Research (2020) (Link to pdf)
* Slides on figshare
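The spillover mechanism described above can be illustrated with a toy one-step diffusion over a hyperlink graph; the rate and update rule below are illustrative, not the paper's empirical model.

```python
def spillover(attention, adjacency, rate=0.1):
    """One step of a toy attention-diffusion process on a hyperlink
    graph: each article passes a fraction `rate` of its attention,
    split evenly, to the articles it links to. Purely illustrative."""
    out = dict(attention)
    for src, targets in adjacency.items():
        if not targets:
            continue
        amount = rate * attention.get(src, 0.0)
        out[src] = out.get(src, 0.0) - amount
        share = amount / len(targets)
        for dst in targets:
            out[dst] = out.get(dst, 0.0) + share
    return out
```

Iterating this step shows how a content boost to one article can raise attention on downstream linked articles, the contagion effect the paper measures empirically.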

Bridging Wikipedia’s Gender Gap. Quantifying and Assessing the Impact of Two Feminist Interventions
By Isabelle Langrock (University of Pennsylvania, USA)
Wikipedia has a well-known gender divide affecting its biographical content. This bias not only shapes social perceptions of knowledge, but it can also propagate beyond the platform as its contents are leveraged to correct misinformation, train machine-learning tools, and enhance search engine results. What happens when feminist movements intervene to try to close existing gaps? In this talk, we present a recent study of two popular feminist interventions designed to counteract digital knowledge inequality. Our findings show that the interventions are successful at adding content about women that would otherwise be missing, but they are less successful at addressing several structural biases that limit the visibility of women within Wikipedia. We argue for more granular and cumulative analysis of gender divides in collaborative environments and identify key areas of support that can further aid the feminist movements in closing Wikipedia’s gender gaps.
Related papers:
* The Gender Divide in Wikipedia: Quantifying and Assessing the Impact of Two Feminist Interventions (2021) (Link to pdf)
* Slides on figshare

June 2021

AI model governance

June 23, 2021 Video: YouTube

Bridging AI and HCI. Incorporating Human Values into the Development of AI Technologies
By Haiyi Zhu (Carnegie Mellon University)
The increasing accuracy and falling costs of AI have stimulated the increased use of AI technologies in mainstream user-facing applications and services. However, there is a disconnect between mathematically rigorous AI approaches and the human stakeholders’ needs, motivations, and values, as well as organizational and institutional realities, contexts, and constraints; this disconnect is likely to undermine practical initiatives and may sometimes lead to negative societal impacts. In this presentation, I will discuss my research on incorporating human stakeholders’ values and feedback into the creation process of AI technologies. I will describe a series of projects in the context of the Wikipedia community to illustrate my approach. I hope this presentation will contribute to the rich ongoing conversation concerning bridging HCI and AI and using HCI methods to address AI challenges.
* Slides on figshare

ML Governance. First Steps
By Andy Craze (Wikimedia Foundation, Machine Learning Team)
The WMF Machine Learning team is upgrading the Foundation's infrastructure to support the modern machine learning ecosystem. As part of this work, the team seeks to understand its ethical and legal responsibilities for developing and hosting predictive models within a global context. Drawing from previous WMF research related to ethical & human-centered machine learning, the team wishes to begin a series of conversations to discuss how we can deploy responsible systems that are inclusive to newcomers and non-experts, while upholding our commitment to free and open knowledge.
* Slides on figshare

May 2021[edit]

The value and importance of Wikipedia

May 19, 2021 Video: YouTube

The Importance of Wikipedia to Search Engines and Other Systems
By Nick Vincent (Northwestern University)
A growing body of work has highlighted the important role that Wikipedia’s volunteer-created content plays in helping search engines achieve their core goal of addressing the information needs of hundreds of millions of people. In this talk, I will discuss a recent study looking at how often, and where, Wikipedia links appear in search engine results. In this study, we found that Wikipedia links appeared prominently and frequently in Google, Bing, and DuckDuckGo results, though less often for searches from a mobile device. I will connect this study to past work looking at the value of Wikipedia links to other online platforms, and to ongoing discussions around Wikipedia's value as a training source for modern AI.
* Related paper: A Deeper Investigation of the Importance of Wikipedia Links to Search Engine Results. To Appear in CSCW 2021. (Link to pdf)
* Slides on figshare

On the Value of Wikipedia as a Gateway to the Web
By Tiziano Piccardi (EPFL)
By linking to external websites, Wikipedia can act as a gateway to the Web. However, little is known about the amount of traffic generated by Wikipedia's external links. We fill this gap with a detailed analysis of usage logs gathered from Wikipedia users' client devices. We discovered that in one month, English Wikipedia generated 43M clicks to external websites, with the highest click-through rate on the official links listed in infoboxes. Our analysis highlights that articles about businesses, educational institutions, and websites show the highest engagement, and that for some content, Wikipedia acts as a stepping stone to the intended destination. We conclude our analysis by quantifying the hypothetical economic value of the clicks received by external websites. We estimate that the respective website owners would need to pay a total of $7--13 million per month to obtain the same volume of traffic via sponsored search. These findings shed light on Wikipedia's role not only as an important source of information but also as a high-traffic gateway to the broader Web ecosystem.
Related papers:
* On the Value of Wikipedia as a Gateway to the Web. WWW 2021. (Link to pdf)
* Slides on figshare
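The abstract's headline figure reduces to simple arithmetic: multiply the monthly click volume by a sponsored-search cost-per-click (CPC). The sketch below illustrates this; the click count comes from the abstract, but the CPC range is an illustrative assumption, not a figure taken from the paper (the study derives per-category CPC estimates from sponsored-search data).

```python
# Back-of-the-envelope estimate of the hypothetical value of Wikipedia's
# external clicks. MONTHLY_CLICKS is taken from the abstract; the CPC
# bounds are assumptions chosen only to reproduce the $7--13M/month range.

MONTHLY_CLICKS = 43_000_000  # external clicks from English Wikipedia in one month

def sponsored_search_value(clicks: int, cpc_low: float, cpc_high: float):
    """Range a site owner would pay for the same traffic via sponsored search."""
    return clicks * cpc_low, clicks * cpc_high

low, high = sponsored_search_value(MONTHLY_CLICKS, cpc_low=0.16, cpc_high=0.30)
print(f"${low / 1e6:.1f}M to ${high / 1e6:.1f}M per month")
```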

April 2021[edit]

No showcase due to Wiki Workshop 2021

March 2021[edit]


March 17, 2021 Video: YouTube

The curious human
By Danielle S. Bassett (University of Pennsylvania)
The human mind is curious. It is strange, remarkable, and mystifying; it is eager, probing, questioning. Despite its pervasiveness and its relevance for our well-being, scientific studies of human curiosity that bridge both the organ of curiosity and the object of curiosity remain in their infancy. In this talk, I will integrate historical, philosophical, and psychological perspectives with techniques from applied mathematics and statistical physics to study individual and collective curiosity. In the former, I will evaluate how humans walk on the knowledge network of Wikipedia during unconstrained browsing. In doing so, we will capture idiosyncratic forms of curiosity that span multiple millennia, cultures, languages, and timescales. In the latter, I will consider the fruition of collective curiosity in the building of scientific knowledge as encoded in Wikipedia. Throughout, I will make a case for the position that individual and collective curiosity are both network building processes, providing a connective counterpoint to the common acquisitional account of curiosity in humans.
Related papers:
* Lydon-Staley, D. M., Zhou, D., Blevins, A. S., Zurn, P., & Bassett, D. S. (2019). Hunters, busybodies, and the knowledge network building associated with curiosity.
* Ju, H., Zhou, D., Blevins, A. S., Lydon-Staley, D. M., Kaplan, J., Tuma, J. R., & Bassett, D. S. (2020). The network structure of scientific revolutions.

February 2021[edit]


February 17, 2021 Video: YouTube

Shocking the Crowd - The Effect of Censorship Shocks on Chinese Wikipedia
By Daniel Romero (University of Michigan)
Collaborative crowdsourcing has become a popular approach to organizing work across the globe. Being global also means being vulnerable to shocks – unforeseen events that disrupt crowds – that originate from any country. In this study, we examine changes in collaborative behavior of editors of Chinese Wikipedia that arise due to the 2005 government censorship in mainland China. Using the exogenous variation in the fraction of editors blocked across different articles due to the censorship, we examine the impact of reduction in group size, which we denote as the shock level, on three collaborative behavior measures: volume of activity, centralization, and conflict. We find that activity and conflict drop on articles that face a shock, whereas centralization increases. The impact of a shock on activity increases with shock level, whereas the impact on centralization and conflict is higher for moderate shock levels than for very small or very high shock levels. These findings provide support for threat rigidity theory – originally introduced in the organizational theory literature – in the context of large-scale collaborative crowds.
* paper published at ICWSM 2017
* slides on figshare

Censorship's Effect on Incidental Exposure to Information - Evidence from Wikipedia
By Margaret Roberts (University of California San Diego)
The fast-growing body of research on internet censorship has examined the effects of censoring selective pieces of political information and the unintended consequences of censoring entertainment. However, we know very little about the broader consequences of coarse censorship, i.e. censorship that affects a large array of information such as an entire website or search engine. In this study, we use China's complete block of Chinese-language Wikipedia on May 19, 2015, to disaggregate the effects of coarse censorship on proactive consumption of information (information users seek out) and on incidental consumption of information (information users are not actively seeking but consume when they happen to come across it). We quantify the effects of censorship of Wikipedia not only on proactive information consumption but also on opportunities for exploration and incidental consumption of information. We find that users from mainland China were much more likely to consume information on Wikipedia about politics and history incidentally rather than proactively, suggesting that the effects of censorship on incidental information access may be politically significant.

January 2021[edit]

Macro-level organizational analysis of peer production communities

January 20, 2021 Video: YouTube

The importance of thinking big. Convergence, divergence, and interdependence among wikis and peer production communities
By Aaron Shaw (Northwestern University)
Designing and governing collaborative, peer production communities can benefit from large-scale, macro-level thinking that focuses on communities as the units of analysis. For example, understanding how and why seemingly comparable communities may follow convergent, divergent, and/or interdependent patterns of behavior can inform more parsimonious theoretical and empirical insights as well as more effective strategic action. This talk gives a sneak peek at research-in-progress by members of the Community Data Science Collective to illustrate these points. In particular, I focus on studies of (1) convergent trends of formalization in several large Wikipedias; (2) divergent editor engagement among three small Wikipedias; and (3) commensal patterns of ecological interdependence across communities. Together, the studies underscore the value and challenges of macro-level organizational analysis of peer production and social computing systems.


December 2020[edit]

Disinformation and reliability of sources in Wikipedia

December 16, 2020 Video: YouTube

Quality assessment of Wikipedia and its sources
By Włodzimierz Lewoniewski (Poznań University of Economics and Business, Poland)
Information in Wikipedia can be edited independently in over 300 languages, so the same subject is often described differently depending on the language edition. To compare information between editions, one usually needs to understand each of the languages considered. We work on solutions that help automate this process, leveraging machine learning and artificial intelligence algorithms. A crucial component is the assessment of article quality, so we need to know how to define and extract different quality measures. This presentation briefly introduces some of the recent activities of the Department of Information Systems at Poznań University of Economics and Business related to quality assessment of multilingual content in Wikipedia. In particular, we demonstrate some of the approaches for assessing the reliability of sources in Wikipedia articles. Such solutions can help enrich various language editions of Wikipedia and other knowledge bases with information of better quality.

Challenges on fighting Disinformation in Wikipedia
Who has the (ground-)truth?
By Diego Saez-Trumper (Research, Wikimedia Foundation)
Unlike the major social media websites, where the fight against disinformation mainly means preventing users from massively replicating fake content, fighting disinformation on Wikipedia requires tools that allow editors to apply the content policies of verifiability, no original research, and neutral point of view. Moreover, while other platforms try to apply automatic fact-checking techniques to verify content, the ground truth for such verification is typically Wikipedia itself; for obvious reasons, we can't follow the same pipeline for fact-checking content on Wikipedia. In this talk we will explain the ML approach we are developing to build tools that efficiently support Wikipedians in discovering suspicious content, and how we collaborate with external researchers on this task. We will also describe a group of datasets we are preparing to share with the research community in order to produce state-of-the-art algorithms to improve the verifiability of content on Wikipedia.

November 2020[edit]

Interpersonal communication between editors

November 18, 2020 Video: YouTube

Talk before you type - Interpersonal communication on Wikipedia
By Dr Anna Rader, Research Consultant
Formally, the work of Wikipedia’s community of volunteers is asynchronous and anarchic: around the world, editors labor individually and in disorganized ways on the collective project. Yet this work is also underscored by informal and vibrant interpersonal communication: in the lively exchanges of talk pages and the labor-sharing of editorial networks, anonymous strangers communicate their intentions and coordinate their efforts to maintain the world’s largest online encyclopaedia. This working paper offers an overview of academic research into editors’ communication networks and patterns, with a particular focus on the role of talk pages. It considers four communication dynamics of editor interaction: cooperation, deliberation, conflict and coordination; and reviews key recommendations for enhancing peer-to-peer communication within the Wikipedia community.
* Slides on figshare

All Talk - How Increasing Interpersonal Communication on Wikis May Not Enhance Productivity
By Sneha Narayan, Assistant Professor, Carleton College
What role does interpersonal communication play in sustaining production in online collaborative communities? This paper sheds light on that question by examining the impact of a communication feature called "message walls" that allows for faster and more intuitive interpersonal communication in a population of wikis on Wikia. Using panel data from a sample of 275 wiki communities that migrated to message walls and a method inspired by regression discontinuity designs, we analyze these transitions and estimate the impact of the system's introduction. Although the adoption of message walls was associated with increased communication among all editors and newcomers, it had little effect on productivity, and was further associated with a decrease in article contributions from new editors. Our results imply that design changes that make communication easier in a social computing system may not always translate to increased participation along other dimensions.

October 2020[edit]

No Showcase in October.

September 2020[edit]

Knowledge gaps

September 23, 2020 Video: YouTube

A first draft of the knowledge gaps taxonomy for Wikimedia projects
By WMF Research Team
In response to the Wikimedia Movement's 2030 strategic direction, the Research team at the Wikimedia Foundation is developing a framework to understand and measure knowledge gaps. The goal is to capture the multi-dimensional aspect of knowledge gaps and inform long-term decision making. The first milestone was to develop a taxonomy of knowledge gaps which offers a grouping and descriptions of the different Wikimedia knowledge gaps. The first draft of the taxonomy is now published and we seek your feedback to improve it. In this talk, we will give an overview of the first draft of the taxonomy of knowledge gaps in Wikimedia projects. Following that, we will host an extended Q&A in which we would like to get your feedback and discuss with you the taxonomy and knowledge gaps more generally.

August 2020[edit]

Readership and navigation

August 19, 2020 Video: YouTube

What matters to us most and why? Studying popularity and attention dynamics via Wikipedia navigation data.
By Taha Yasseri (University College Dublin), Patrick Gildersleve (Oxford Internet Institute)
While Wikipedia research initially focused largely on editorial behaviour and content, researchers soon realized the value of navigation data, both as a reflection of readers' interests and, more generally, as a proxy for the behaviour of online information seekers. In this talk we will report on various projects in which we used pageview statistics or reader navigation data to study: movies' financial success [1], electoral popularity [2], disaster-triggered collective attention [3] and collective memory [4], general navigation patterns and article typology [5], and attention patterns in relation to news breakouts.

Query for Architecture, Click through Military. Comparing the Roles of Search and Navigation on Wikipedia
By Dimitar Dimitrov (GESIS - Leibniz Institute for the Social Sciences)
As one of the richest sources of encyclopedic information on the Web, Wikipedia generates an enormous amount of traffic. In this paper, we study large-scale article access data of the English Wikipedia in order to compare articles with respect to the two main paradigms of information seeking, i.e., search by formulating a query, and navigation by following hyperlinks. To this end, we propose and employ two main metrics to characterize articles: (i) searchshare, the relative share of views an article receives via search, and (ii) resistance, the ability of an article to relay traffic to other Wikipedia articles. We demonstrate how articles in distinct topical categories differ substantially in terms of these properties. For example, architecture-related articles are often accessed through search and are simultaneously a "dead end" for traffic, whereas historical articles about military events are mainly navigated. We further link traffic differences to varying network, content, and editing activity features. Lastly, we measure the impact of the article properties by modeling access behavior on articles with a gradient boosting approach. The results of this paper constitute a step towards understanding human information seeking behavior on the Web.
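The two metrics above can be sketched with a few lines of code. This is a simplified illustration on hypothetical per-article traffic counts, not the paper's exact formulas: here searchshare is the fraction of an article's views arriving via search, and resistance is approximated as the fraction of views that do not continue on to another Wikipedia article.

```python
# Simplified sketch of the searchshare and resistance metrics from the talk.
# All counts are hypothetical; the paper's precise definitions may differ.

def searchshare(views_from_search: int, total_views: int) -> float:
    """Fraction of an article's views that arrived via a search engine query."""
    return views_from_search / total_views

def resistance(clicks_to_other_articles: int, total_views: int) -> float:
    """Fraction of views that do NOT continue to another Wikipedia article
    (high resistance = a traffic 'dead end')."""
    return 1 - clicks_to_other_articles / total_views

# An architecture-like article: mostly reached via search, traffic dead end.
arch = (searchshare(800, 1000), resistance(50, 1000))
# A military-history-like article: mostly navigated to, relays traffic onward.
mil = (searchshare(150, 1000), resistance(600, 1000))
print(arch, mil)
```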

July 2020[edit]

Medical knowledge on Wikipedia

July 15, 2020 Video: YouTube

Wikipedia for health information - Situating Wikipedia as a health information resource
By Denise Smith (McMaster University, Health Sciences Library & Western University, Faculty of Information & Media Studies)
Wikipedia is the most frequently accessed website for health information, but the various ways users engage with Wikipedia's health content have not been thoroughly investigated or reported. This talk will summarize the findings of a comprehensive literature review published in February 2020. It explores all the contexts reported in the academic literature in which Wikipedia's health content is used. The talk will focus on the findings reported in this paper and the potential impact of this study on health and medical librarianship, the practice of medicine, and medical or health education.
  • D.A. Smith (2020). "Situating Wikipedia as a health information resource in various contexts: A scoping review". PLoS ONE. doi: 10.1371/journal.pone.0228786

COVID-19 research in Wikipedia
By Giovanni Colavizza (University of Amsterdam, Netherlands)
Wikipedia is one of the main sources of free knowledge on the Web. During the first few months of the pandemic, over 4,500 new Wikipedia pages on COVID-19 were created, accumulating close to 250M pageviews by early April 2020. At the same time, an unprecedented number of scientific articles on COVID-19 and the ongoing pandemic were published online. Wikipedia's contents are based on reliable sources, primarily scientific literature. Given its public function, it is crucial for Wikipedia to rely on representative and reliable scientific results, especially in a time of crisis. We assess the coverage of COVID-19-related research in Wikipedia via citations. We find that Wikipedia editors are integrating new research at an unprecedented pace. While doing so, they are able to provide a largely representative coverage of COVID-19-related research. We show that all the main topics discussed in this literature are proportionally represented in Wikipedia, after accounting for article-level effects. We further use regression analyses to model citations from Wikipedia and show that, despite the pressure to keep up with novel results, Wikipedia editors rely on literature that is highly cited, widely shared on social media, and peer-reviewed.

June 2020[edit]

Credibility and Verifiability

June 17, 2020 Video: YouTube

Today’s News, Tomorrow’s Reference, and The Problem of Information Reliability - An Introduction to NewsQ
By Connie Moon Sehat, NewsQ, Hacks/Hackers
The effort to make Wikipedia more reliable is related to the larger challenges facing the information ecosystem overall. These challenges include the discovery of and accessibility to reliable news amid the transformation of news distribution through platform and social media products. Connie will present some of the challenges related to the ranking and recommendation of news that are addressed by the NewsQ Initiative, a collaboration between the Tow-Knight Center for Entrepreneurial Journalism at the Craig Newmark Graduate School of Journalism and Hacks/Hackers. In addition, she’ll share some of the ways that the project intersects with Wikipedia, such as supporting research around the US Perennial Sources list.

Related resources

Quantifying Engagement with Citations on Wikipedia
By Tiziano Piccardi, EPFL
Wikipedia, the free online encyclopedia that anyone can edit, is one of the most visited sites on the Web and a common source of information for many users. As an encyclopedia, Wikipedia is not a source of original information, but was conceived as a gateway to secondary sources: according to Wikipedia's guidelines, facts must be backed up by reliable sources that reflect the full spectrum of views on the topic. Although citations lie at the very heart of Wikipedia, little is known about how users interact with them. To close this gap, we built client-side instrumentation for logging all interactions with links leading from English Wikipedia articles to cited references for one month and conducted the first analysis of readers' interaction with citations on Wikipedia. We find that overall engagement with citations is low: about one in 300 page views results in a reference click (0.29% overall; 0.56% on desktop; 0.13% on mobile). Matched observational studies of the factors associated with reference clicking reveal that clicks occur more frequently on shorter pages and on pages of lower quality, suggesting that references are consulted more commonly when Wikipedia itself does not contain the information sought by the user. Moreover, we observe that recent content, open access sources, and references about life events (births, deaths, marriages, etc.) are particularly popular. Taken together, our findings open the door to a deeper understanding of Wikipedia's role in a global information economy where reliability is ever less certain, and source attribution ever more vital.
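The engagement figures quoted above reduce to a simple click-through rate: reference clicks divided by page views. The sketch below uses hypothetical raw counts chosen only so the resulting rate matches the 0.29% overall figure; note that 1/0.0029 is about 345, so the abstract's "one in 300" is a loose rounding.

```python
# Click-through rate as computed in the abstract: clicks / views.
# The counts are hypothetical; only the resulting 0.29% rate mirrors the talk.

def click_through_rate(reference_clicks: int, page_views: int) -> float:
    return reference_clicks / page_views

overall = click_through_rate(2_900, 1_000_000)
print(f"{overall:.2%}")                               # overall engagement rate
print(f"roughly 1 click per {round(1 / overall)} page views")
```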

May 2020[edit]

Human in the Loop Machine Learning

May 20, 2020 Video: YouTube

OpenCrowd -- A Human-AI Collaborative Approach for Finding Social Influencers via Open-Ended Answers Aggregation
By Jie Yang, Amazon (current), Delft University of Technology (starting soon)
Finding social influencers is a fundamental task in many online applications ranging from brand marketing to opinion mining. Existing methods heavily rely on the availability of expert labels, whose collection is usually a laborious process even for domain experts. Using open-ended questions, crowdsourcing provides a cost-effective way to find a large number of social influencers in a short time. Individual crowd workers, however, only possess fragmented knowledge that is often of low quality. To tackle those issues, we present OpenCrowd, a unified Bayesian framework that seamlessly incorporates machine learning and crowdsourcing for effectively finding social influencers. To infer a set of influencers, OpenCrowd bootstraps the learning process using a small number of expert labels and then jointly learns a feature-based answer quality model and the reliability of the workers. Model parameters and worker reliability are updated iteratively, allowing their learning processes to benefit from each other until an agreement on the quality of the answers is reached. We derive a principled optimization algorithm based on variational inference with efficient updating rules for learning OpenCrowd parameters. Experimental results on finding social influencers in different domains show that our approach substantially improves the state of the art by 11.5% AUC. Moreover, we empirically show that our approach is particularly useful in finding micro-influencers, who are very directly engaged with smaller audiences. Paper

Keeping Community in the Machine-Learning Loop
By C. Estelle Smith, MS, PhD Candidate, GroupLens Research Lab at the University of Minnesota
On Wikipedia, sophisticated algorithmic tools are used to assess the quality of edits and take corrective actions. However, algorithms can fail to solve the problems they were designed for if they conflict with the values of communities who use them. In this study, we take a Value-Sensitive Algorithm Design approach to understanding a community-created and -maintained machine learning-based algorithm called the Objective Revision Evaluation System (ORES)—a quality prediction system used in numerous Wikipedia applications and contexts. Five major values converged across stakeholder groups that ORES (and its dependent applications) should: (1) reduce the effort of community maintenance, (2) maintain human judgement as the final authority, (3) support differing peoples’ differing workflows, (4) encourage positive engagement with diverse editor groups, and (5) establish trustworthiness of people and algorithms within the community. We reveal tensions between these values and discuss implications for future research to improve algorithms like ORES. Paper

March 2020[edit]

Topic modeling

March 18, 2020 Video: YouTube

Big Data Analysis with Topic Models
Evaluation, Interaction, and Multilingual Extensions
By Jordan Boyd-Graber, University of Maryland
A common information need is to understand large, unstructured datasets: millions of e-mails during e-discovery, a decade worth of science correspondence, or a day's tweets. In the last decade, topic models have become a common tool for navigating such datasets even across languages. This talk investigates the foundational research that allows successful tools for these data exploration tasks: how to know when you have an effective model of the dataset; how to correct bad models; how to measure topic model effectiveness; and how to detect framing and spin using these techniques. After introducing topic models, I argue why traditional measures of topic model quality---borrowed from machine learning---are inconsistent with how topic models are actually used. In response, I describe interactive topic modeling, a technique that enables users to impart their insights and preferences to models in a principled, interactive way. I will then address measuring topic model effectiveness in real-world tasks.

Topic Classification for Wikipedia
By Isaac Johnson, Wikimedia Foundation
This talk will provide a high-level overview of how the Wikimedia Foundation is approaching the challenges of topic classification and topic modeling for Wikipedia. An overview will be given of the importance of being able to model topics to Wikipedia readers and editors as well as a description of some of the existing technologies (ORES articletopic API; Wikidata-based topic API) and future work in this space. (Presentation slides)

February 2020[edit]

February 19, 2020 Video: YouTube

Autonomous tools and the design of work
By Jeffrey V. Nickerson, Stevens Institute of Technology
Bots and other software tools that exhibit autonomy can appear in an organization to be more like employees than commodities. As a result, humans delegate to machines. Sometimes the machines turn and delegate part of the work back to humans. This talk will discuss how the design of human work is changing, drawing on a recent study of editors and bots in Wikipedia, as well as a study of game and chip designers. The Wikipedia bot ecosystem, and how bots evolve, will be discussed. Humans are working together with machines in complex configurations; this puts constraints on not only the machines but also the humans. Both software and human skills change as a result. Paper

When Humans and Machines Collaborate
Cross-lingual Label Editing in Wikidata
By Lucie-Aimée Kaffee, University of Southampton
The quality and maintainability of any knowledge graph are strongly influenced by the way it is created. In the case of Wikidata, the knowledge graph is created and maintained by a hybrid approach of human editing supported by automated tools. We analyse the editing of natural language data, i.e. labels. Labels are the entry point for humans to understand the information, and therefore need to be carefully maintained. Wikidata is a good example of a hybrid multilingual knowledge graph, as it has a large and active community of humans and bots working together covering over 300 languages. In this work, we analyse the different editor groups and how they interact with the different language data to understand the provenance of the current label data. This presentation is based on the paper “When Humans and Machines Collaborate: Cross-lingual Label Editing in Wikidata”, published in OpenSym 2019 in collaboration with Kemele M. Endris and Elena Simperl. Paper

January 2020[edit]

No Showcase in January.


December 2019[edit]

December 18, 2019 Video: YouTube

Making Knowledge Bases More Complete
By Fabian Suchanek, Télécom Paris, Institut Polytechnique de Paris
A Knowledge Base (KB) is a computer-readable collection of facts about the world (examples are Wikidata, DBpedia, and YAGO). The problem is that these KBs are often missing entities or facts. In this talk, I present some new methods to combat this incompleteness. I will also quickly talk about some other research projects we are currently pursuing, including a new version of YAGO. (presentation slides, related publications)

The Dynamics of Peer-Produced Political Information During the 2016 U.S. Presidential Campaign
By Brian Keegan, Ph.D., Assistant Professor, Department of Information Science, University of Colorado Boulder
Wikipedia plays a crucial role for online information seeking and its editors have a remarkable capacity to rapidly revise its content in response to current events. How did the production and consumption of political information on Wikipedia mirror the dynamics of the 2016 U.S. Presidential campaign? Drawing on system justification theory and methods for measuring the enthusiasm gap among voters, this paper quantitatively analyzes the candidates' biographical and related articles and their editors. Information production and consumption patterns match major events over the course of the campaign, but Trump-related articles show consistently higher levels of engagement than Clinton-related articles. Analysis of the editors' participation and backgrounds shows analogous shifts in the composition and durability of the collaborations around each candidate. The implications for using Wikipedia to monitor political engagement are discussed. (Presentation slides, Paper)

November 2019[edit]

November 20, 2019 Video: YouTube

Wikipedia Text Reuse: Within and Without
By Martin Potthast, Leipzig University
We study text reuse related to Wikipedia at scale by compiling the first corpus of text reuse cases within Wikipedia as well as without (i.e., reuse of Wikipedia text in a sample of the Common Crawl). To discover reuse beyond verbatim copy and paste, we employ state-of-the-art text reuse detection technology, scaling it for the first time to process the entire Wikipedia as part of a distributed retrieval pipeline. We further report on a pilot analysis of the 100 million reuse cases inside, and the 1.6 million reuse cases outside Wikipedia that we discovered. Text reuse inside Wikipedia gives rise to new tasks such as article template induction, fixing quality flaws, or complementing Wikipedia’s ontology. Text reuse outside Wikipedia yields a tangible metric for the emerging field of quantifying Wikipedia’s influence on the web. To foster future research into these tasks, and for reproducibility’s sake, the Wikipedia text reuse corpus and the retrieval pipeline are made freely available (paper, slides, related resources, demo)

Characterizing Wikipedia Reader Demographics and Interests
By Isaac Johnson, Wikimedia Foundation
Building on two past surveys on the motivation and needs of Wikipedia readers (Why We Read Wikipedia; Why the World Reads Wikipedia), we examine the relationship between Wikipedia reader demographics and their interests and needs. Specifically, we run surveys in thirteen different languages that ask readers three questions about their motivation for reading Wikipedia (motivation, needs, and familiarity) and five questions about their demographics (age, gender, education, locale, and native language). We link these survey results with the respondents' reading sessions -- i.e. sequence of Wikipedia page views -- to gain a more fine-grained understanding of how a reader's context relates to their activity on Wikipedia. We find that readers have a diversity of backgrounds but that the high-level needs of readers do not correlate strongly with individual demographics. We also find, however, that there are relationships between demographics and specific topic interests that are consistent across many cultures and languages. This work provides insights into the reach of various Wikipedia language editions and the relationship between content or contributor gaps and reader gaps. See the meta page for more details. Slides (figshare).

October 2019[edit]

October 16, 2019 Video: YouTube

Elections Without Fake
Deploying Real Systems to Counter Misinformation Campaigns
By Fabrício Benevenuto, Computer Science Department, Universidade Federal de Minas Gerais (UFMG), Brazil
The political debate and electoral dispute in the online space during the 2018 Brazilian elections were marked by an information war. In order to mitigate the misinformation problem, we created the project Elections Without Fake and developed technological solutions capable of curbing the abuse of misinformation campaigns in the online space. In particular, we created a system to monitor public groups on WhatsApp and a system to monitor ads on Facebook. Our systems proved fundamental for fact-checking and investigative journalism, and are currently being used by over 150 journalists from outlets with diverse editorial lines, as well as by various fact-checking agencies.

Protecting Wikipedia from Disinformation
Detecting Malicious Editors and Pages to Protect
By Francesca Spezzano, Computer Science Department, Boise State University
Wikipedia is based on the idea that anyone can make edits in order to create reliable and crowd-sourced content. Yet with the cover of internet anonymity, some users make changes to the online encyclopedia that do not align with Wikipedia’s intended uses. In this talk, we present different forms of disinformation on Wikipedia, including vandalism and spam, and introduce the mechanisms that Wikipedia implements to protect its integrity, such as blocking malicious editors and protecting pages. Next, we provide an overview of effective algorithms we have developed, based on user editing behavior, to detect malicious editors and pages to protect across multiple languages. (Slides on Figshare, related research papers[1][2][3])

September 2019[edit]

September 18, 2019 Video: YouTube

Citation Needed
A Taxonomy and Algorithmic Assessment of Wikipedia's Verifiability
By Miriam Redi, Research, Wikimedia Foundation
Among Wikipedia's core guiding principles, verifiability policies have a particularly important role. Verifiability requires that information included in a Wikipedia article be corroborated against reliable secondary sources. Because of the manual labor needed to curate and fact-check Wikipedia at scale, however, its contents do not always evenly comply with these policies. Citations (i.e. references to external sources) may not conform to verifiability requirements or may be missing altogether, potentially weakening the reliability of specific topic areas of the free encyclopedia. In this project, we aimed to provide an empirical characterization of the reasons why and how Wikipedia cites external sources to comply with its own verifiability guidelines. First, we constructed a taxonomy of reasons why inline citations are required by collecting labeled data from editors of multiple Wikipedia language editions. We then collected a large-scale crowdsourced dataset of Wikipedia sentences annotated with categories derived from this taxonomy. Finally, we designed and evaluated algorithmic models to determine if a statement requires a citation, and to predict the citation reason based on our taxonomy. We evaluated the robustness of such models across different classes of Wikipedia articles of varying quality, as well as on an additional dataset of claims annotated for fact-checking purposes. Slides on FigShare
Redi, M., Fetahu, B., Morgan, J., & Taraborelli, D. (2019, May). Citation Needed: A Taxonomy and Algorithmic Assessment of Wikipedia's Verifiability. In The World Wide Web Conference (pp. 1567-1578). ACM.

Patrolling on Wikipedia
By Jonathan T. Morgan, Research, Wikimedia Foundation
I will present initial findings from an ongoing research study of patrolling workflows on Wikimedia projects. Editors patrol recent pages and edits to ensure that Wikimedia projects maintain high quality as new content comes in. Patrollers revert vandalism and review newly-created articles and article drafts. Patrolling of new pages and edits is vital work. In addition to making sure that new content conforms to Wikipedia project policies, patrollers are the first line of defense against disinformation, copyright infringement, libel and slander, personal threats, and other forms of vandalism on Wikimedia projects. This research project is focused on understanding the needs, priorities, and workflows of editors who patrol new content on Wikimedia projects. The findings of this research can inform the development of better patrolling tools as well as non-technological interventions intended to support patrollers and the activity of patrolling.

July 2019[edit]

July 17, 2019 Video: YouTube

Characterizing Incivility on Wikipedia
By Elizabeth Whittaker, University of Michigan School of Information
In a society whose citizens have a variety of viewpoints, there is a question of how citizens can govern themselves in ways that allow these viewpoints to co-exist. Online deliberation has been posited as a problem solving mechanism in this context, and civility can be thought of as a mechanism that facilitates this deliberation. Civility can thus be thought of as a method of interaction that encourages collaboration, while incivility disrupts collaboration. However, it is important to note that the nature of online civility is shaped by its history and the technical architecture scaffolding it. Civility as a concept has been used both to promote equal deliberation and to exclude the marginalized from deliberation, so we should be careful to ensure that our conceptualizations of incivility reflect what we intend them to in order to avoid unintentionally reinforcing inequality.
To this end, we examined Wikipedia editors’ perceptions of interactions that disrupt collaboration through 15 semi-structured interviews. Wikipedia is a highly deliberative platform, as editors need to reach consensus about what will appear on the article page, a process that often involves deliberation to coordinate, and any disruption to this process should be apparent. We found that incivility on Wikipedia typically occurs in one of three ways: through weaponization of Wikipedia’s policies, weaponization of Wikipedia’s technical features, and through more typical vitriolic content. These methods of incivility were gendered, and had the practical effect of discouraging women from editing. We implicate this pattern as one of the underlying causes of Wikipedia’s gender gap.

Hidden Gems in the Wikipedia Discussions - The Wikipedians’ Rationales
By Lu Xiao, Syracuse University School of Information Studies
I will present a series of completed and ongoing studies that are aimed at understanding the role of the Wikipedians’ rationales in Wikipedia discussions. We define a rationale as one’s justification of her viewpoint and suggestions. Our studies demonstrate the potential of leveraging the Wikipedians’ rationales in discussions as resources for future decision-making and as resources for eliciting knowledge about the community’s norms, practices and policies. Viewed as rich digital traces in these environments, we consider them to be beneficial for the community members, such as helping newcomers familiarize themselves with the commonly accepted justificatory reasoning styles. We call for more research attention to the discussion content from this rationale study perspective.

June 2019[edit]

June 26, 2019 Video: YouTube

Trajectories of Blocked Community Members
Redemption, Recidivism and Departure
By Jonathan Chang, Cornell University
Community norm violations can impair constructive communication and collaboration online. As a defense mechanism, community moderators often address such transgressions by temporarily blocking the perpetrator. Such actions, however, come with the cost of potentially alienating community members. Given this tradeoff, it is essential to understand to what extent, and in which situations, this common moderation practice is effective in reinforcing community rules. In this work, we introduce a computational framework for studying the future behavior of blocked users on Wikipedia. After their block expires, they can take several distinct paths: they can reform and adhere to the rules, but they can also recidivate, or straight-out abandon the community. We reveal that these trajectories are tied to factors rooted both in the characteristics of the blocked individual and in whether they perceived the block to be fair and justified. Based on these insights, we formulate a series of prediction tasks aiming to determine which of these paths a user is likely to take after being blocked for their first offense, and demonstrate the feasibility of these new tasks. Overall, this work builds towards a more nuanced approach to moderation by highlighting the tradeoffs that are in play. For more information, see the full paper.

Automatic Detection of Online Abuse in Wikipedia (see project page)
By Lane Rasberry, University of Virginia
Please see the researchers' own video and slides! This presentation comes from the research coordinator and will consider the research administration more than the research process. Researchers analyzed all English Wikipedia blocks prior to 2018 using machine learning. With the insights gained, the researchers examined all English Wikipedia users who are not blocked against the identified characteristics of blocked users. The results were a ranked set of predictions of users who are not blocked, but who have a history of conduct similar to that of blocked users. This research and its process model a system for using computing to aid human moderators in identifying conduct on English Wikipedia that merits a block.

First Insights from Partial Blocks in Wikimedia Wikis
By Morten Warncke-Wang, Wikimedia Foundation
The Anti-Harassment Tools team at the Wikimedia Foundation released the partial block feature in early 2019. Where previously blocks on Wikimedia wikis were sitewide (users were blocked from editing an entire wiki), partial blocks makes it possible to block users from editing specific pages and/or namespaces. The Italian Wikipedia was the first wiki to start using this feature, and it has since been rolled out to other wikis as well. In this presentation, we will look at how this feature has been used in the first few months since release.

May 2019[edit]

No showcase

April 2019[edit]

April 17, 2019 Video: YouTube

Group Membership and Contributions to Public Information Goods
The Case of WikiProject
By Ark Fangzhou Zhang
We investigate the effects of group identity on contribution behavior on the English Wikipedia, the largest online encyclopedia, which gives free access to the public. Using an instrumental variable approach that exploits the variations in one’s exposure to WikiProjects, we find that joining a WikiProject has a significant impact on one’s level of contribution, with an average increase of 79 revisions or 8,672 characters per month. To uncover the potential mechanism underlying the treatment effect, we use the size of a WikiProject’s home page as a proxy for the number of recommendations from a project. The results show that users who join a WikiProject with more recommendations significantly increase their contribution to articles under the joined project, but not to articles under other projects.

Thanks for Stopping By
A Study of “Thanks” Usage on Wikimedia
By Swati Goel
The Thanks feature on Wikipedia, also known as "Thanks," is a tool with which editors can quickly and easily send one another positive feedback. The aim of this project is to better understand this feature: its scope, the characteristics of a typical "Thanks" interaction, and the effects of receiving a thank on individual editors. We study the motivational impacts of "Thanks" because maintaining editor engagement is a central problem for crowdsourced repositories of knowledge such as Wikimedia. Our main findings are that most editors have not been exposed to the Thanks feature (meaning they have never given nor received a thank), thanks are typically sent upwards (from less experienced to more experienced editors), and receiving a thank is correlated with having high levels of editor engagement. Though the prevalence of "Thanks" usage varies by editor experience, the impact of receiving a thank seems mostly consistent for all users. We empirically demonstrate that receiving a thank has a strong positive effect on short-term editor activity across the board and provide preliminary evidence that thanks could compound to have long-term effects as well. More information is available on the research project page.

March 2019[edit]

March 20, 2019 Video: YouTube

Learning How to Correct a Knowledge Base from the Edit History
By Thomas Pellissier Tanon (Télécom ParisTech), Camille Bourgaux (DI ENS, CNRS, ENS, PSL Univ. & Inria), Fabian Suchanek (Télécom ParisTech), WWW'19.
The curation of Wikidata (and other knowledge bases) is crucial to keep the data consistent, to fight vandalism and to correct good faith mistakes. However, manual curation of the data is costly. In this work, we propose to take advantage of the edit history of the knowledge base in order to learn how to correct constraint violations automatically. Our method is based on rule mining, and uses the edits that solved violations in the past to infer how to solve similar violations in the present. For example, our system is able to learn that the value "woman" of the sex or gender property should be replaced by "female". We provide a Wikidata game that suggests our corrections to the users in order to improve Wikidata. Both the evaluation of our method on past corrections, and the Wikidata game statistics show significant improvements over baselines.
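The flavor of this rule-mining approach can be sketched as follows: count how often past violation-fixing edits replaced one value of a property with another, and promote frequent, high-confidence replacements to correction rules. This is a hypothetical, much-simplified illustration; the edit triples and thresholds are invented, not taken from the paper.

```python
from collections import Counter

# Hypothetical history of edits that resolved constraint violations,
# recorded as (property, old_value, new_value) triples.
violation_fixes = [
    ("sex or gender", "woman", "female"),
    ("sex or gender", "woman", "female"),
    ("sex or gender", "woman", "female"),
    ("sex or gender", "woman", "non-binary"),
    ("country", "USA", "United States of America"),
    ("country", "USA", "United States of America"),
]

def mine_rules(fixes, min_support=2, min_confidence=0.6):
    """Promote frequent (property, old -> new) replacements to rules."""
    replacement_counts = Counter(fixes)
    pair_counts = Counter((p, old) for p, old, _ in fixes)
    rules = {}
    for (p, old, new), count in replacement_counts.items():
        confidence = count / pair_counts[(p, old)]
        if count >= min_support and confidence >= min_confidence:
            rules[(p, old)] = new
    return rules

rules = mine_rules(violation_fixes)
# Suggest a correction for a newly observed violation:
print(rules.get(("sex or gender", "woman")))  # -> female
```

The real system mines richer rules over the knowledge-base graph; here a rule is simply a value substitution with enough support and confidence in the edit history.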

An Approach for Determining Fine-grained Relations for Wikipedia Tables
By Besnik Fetahu
Wikipedia tables represent an important resource, where information is organized with respect to table schemas consisting of columns. In turn, each column may contain instance values that point to other Wikipedia articles or primitive values (e.g. numbers, strings, etc.). In this work, we focus on the problem of interlinking Wikipedia tables for two types of table relations: equivalent and subPartOf. Through such relations, we can further harness semantically related information by accessing related tables or facts therein. Determining the relation type of a table pair is not trivial, as it depends on the schemas, the values therein, and the semantic overlap of the cell values in the corresponding tables. We propose TableNet, an approach that constructs a knowledge graph of interlinked tables with subPartOf and equivalent relations. TableNet consists of two main steps: (i) for any source table, an efficient algorithm finds all candidate related tables with high coverage, and (ii) a neural approach, which takes into account the table schemas and the corresponding table data, determines with high accuracy the table relation for a table pair. We perform an extensive experimental evaluation on the entire Wikipedia with more than 3.2 million tables. We show that we retain more than 88% of relevant candidate table pairs for alignment and, consequently, align tables with subPartOf or equivalent relations with an accuracy of 90%. Comparisons with existing competitors show that TableNet has superior performance in terms of coverage and alignment accuracy.

February 2019[edit]

February 20, 2019 Video: YouTube

Diversity of Visual Encyclopedic Knowledge Across Wikipedia Language Editions
By Shiqing He (presenting, University of Michigan), Brent Hecht (presenting, Northwestern University), Allen Yilun Lin (Northwestern University), Eytan Adar (University of Michigan), ICWSM'18.
Across all Wikipedia language editions, millions of images augment text in critical ways. This visual encyclopedic knowledge is an important form of wikiwork for editors, a critical part of reader experience, an emerging resource for machine learning, and a lens into cultural differences. However, Wikipedia research--and cross-language edition Wikipedia research in particular--has thus far been limited to text. In this paper, we assess the diversity of visual encyclopedic knowledge across 25 language editions and compare our findings to those reported for textual content. Unlike text, translation in images is largely unnecessary. Additionally, the Wikimedia Foundation, through Wikimedia Commons, has taken steps to simplify cross-language image sharing. While we may expect that these factors would reduce image diversity, we find that cross-language image diversity rivals, and often exceeds, that found in text. We find that diversity varies between language pairs and content types, but that many images are unique to different language editions. Our findings have implications for readers (in what imagery they see), for editors (in deciding what images to use), for researchers (who study cultural variations), and for machine learning developers (who use Wikipedia for training models).

A Warm Welcome, Not a Cold Start
Eliciting New Editors' Interests via Questionnaires
By Ramtin Yazdanian (presenting, Ecole Polytechnique Federale de Lausanne)
Every day, thousands of users sign up as new Wikipedia contributors. Once joined, these users have to decide which articles to contribute to, which users to reach out to and learn from or collaborate with, etc. Any such task is a hard and potentially frustrating one given the sheer size of Wikipedia. Supporting newcomers in their first steps by recommending articles they would enjoy editing or editors they would enjoy collaborating with is thus a promising route toward converting them into long-term contributors. Standard recommender systems, however, rely on users' histories of previous interactions with the platform. As such, these systems cannot make high-quality recommendations to newcomers without any previous interactions -- the so-called cold-start problem. Our aim is to address the cold-start problem on Wikipedia by developing a method for automatically building short questionnaires that, when completed by a newly registered Wikipedia user, can be used for a variety of purposes, including article recommendations that can help new editors get started. Our questionnaires are constructed based on the text of Wikipedia articles as well as the history of contributions by the already onboarded Wikipedia editors. We have assessed the quality of our questionnaire-based recommendations in an offline evaluation using historical data, as well as an online evaluation with hundreds of real Wikipedia newcomers, concluding that our method provides cohesive, human-readable questions that perform well against several baselines. By addressing the cold-start problem, this work can help with the sustainable growth and maintenance of Wikipedia's diverse editor community. Slides

January 2019[edit]

January 16, 2019 Video: YouTube

Understanding participation in Wikipedia
Studies on the relationship between new editors’ motivations and activity
By Martina Balestra, New York University
Peer production communities like Wikipedia often struggle to retain contributors beyond their initial engagement. Theory suggests this may be related to their levels of motivation, though prior studies either center on contributors’ activity or use cross-sectional survey methods, and overlook accompanying changes in motivation. In this talk, I will present a series of studies aimed at filling this gap. We begin by looking at how Wikipedia editors’ early motivations influence the activities that they come to engage in, and how these motivations change over the first three months of participation in Wikipedia. We then look at the relationship between editing activity and intrinsic motivation specifically over time. We find that new editors’ early motivations are predictive of their future activity, but that these motivations tend to change with time. Moreover, newcomers’ intrinsic motivation is reinforced by the amount of activity they engage in over time: editors who had a high level of intrinsic motivation entered a virtuous cycle where the more they edited the more motivated they became, whereas those who initially had low intrinsic motivation entered a vicious cycle. Our findings shed new light on the importance of early experiences and reveal that the relationship between motivation and activity is more complex than previously understood.

Geography and knowledge. Reviving an old relationship with Wiki Atlas
By Anastasios Noulas, New York University
Wiki Atlas is an interactive cartography tool. The tool renders Wikipedia content in a 3-dimensional, web-based cartographic environment. The map acts as a medium that enables the discovery and exploration of articles in a manner that explicitly associates geography and information. At its current prototype form, a Wikipedia article is represented on the map as a 3D element whose height property is proportional to the number of views the article has on the website. This property enables the discovery of relevant content, in a manner that reflects the significance of the target element by means of collective attention by the site’s audience.

December 2018[edit]

12 December 2018 Video: YouTube

Why the World Reads Wikipedia
By Florian Lemmerich, RWTH Aachen University; Diego Sáez-Trumper, Wikimedia Foundation; Robert West, EPFL; and Leila Zia, Wikimedia Foundation
So far, little is known about why users across the world read Wikipedia's various language editions. To bridge this gap, we conducted a comparative study by combining a large-scale survey of Wikipedia readers across 14 language editions with a log-based analysis of user activity. For analysis, we proceeded in three steps: First, we analyzed the survey results to compare the prevalence of Wikipedia use cases across languages, discovering commonalities, but also substantial differences, among Wikipedia languages with respect to their usage. Second, we matched survey responses to the respondents' traces in Wikipedia's server logs to characterize behavioral patterns associated with specific use cases, finding that distinctive patterns consistently mark certain use cases across language editions. Third, we could show that certain Wikipedia use cases are more common in countries with certain socio-economic characteristics; e.g., in-depth reading of Wikipedia articles is substantially more common in countries with a low Human Development Index. The outcomes of this study provide a deeper understanding of Wikipedia readership in a wide range of languages, which is important for Wikipedia editors, developers, and the reusers of Wikipedia content.

November 2018[edit]

There was no showcase in November due to US holidays.

October 2018[edit]

17 October 2018 Video: YouTube

"Welcome" Changes? Descriptive and Injunctive Norms in a Wikipedia Sub-Community
By Jonathan T. Morgan, Wikimedia Foundation and Anna Filippova, GitHub
Open online communities rely on social norms for behavior regulation, group cohesion, and sustainability. Research on the role of social norms online has mainly focused on one source of influence at a time, making it difficult to separate different normative influences and understand their interactions. In this study, we use the Focus Theory to examine interactions between several sources of normative influence in a Wikipedia sub-community: local descriptive norms, local injunctive norms, and norms imported from similar sub-communities. We find that exposure to injunctive norms has a stronger effect than descriptive norms, that the likelihood of performing a behavior is higher when both injunctive and descriptive norms are congruent, and that conflicting social norms may negatively impact pro-normative behavior. We contextualize these findings through member interviews, and discuss their implications for both future research on normative influence in online groups and the design of systems that support open collaboration. (research paper, slides with notes)

The pipeline of online participation inequalities - The case of Wikipedia Editing
By Aaron Shaw, Northwestern University and Eszter Hargittai, University of Zurich
Participatory platforms like the Wikimedia projects have unique potential to facilitate more equitable knowledge production. However, digital inequalities such as the Wikipedia gender gap undermine this democratizing potential. In this talk, I present new research in which Eszter Hargittai and I conceptualize a "pipeline" of online participation and model distinct levels of awareness and behaviors necessary to become a contributor to the participatory web. We test the theory in the case of Wikipedia editing, using new survey data from a diverse, national sample of adult internet users in the U.S.
The results show that Wikipedia participation consistently reflects inequalities of education and internet experiences and skills. We find that the gender gap only emerges later in the pipeline whereas gaps along racial and socioeconomic lines explain variations earlier in the pipeline. Our findings underscore the multidimensionality of digital inequalities and suggest new pathways toward closing knowledge gaps by highlighting the importance of education and Internet skills.
We conclude that future research and interventions to overcome digital participation gaps should not focus exclusively on gender or class differences in content creation, but expand to address multiple aspects of digital inequality across pipelines of participation. In particular, when it comes to overcoming gender gaps in the case of Wikipedia, our results suggest that continued emphasis on recruiting female editors should include efforts to disseminate the knowledge that Wikipedia can be edited. Our findings support broader efforts to overcome knowledge- and skill-based barriers to entry among potential contributors to the open web.

September 2018[edit]

19 September 2018 Video: YouTube

The impact of news exposure on collective attention in the United States during the 2016 Zika epidemic
By Michele Tizzoni, André Panisson, Daniela Paolotti, Ciro Cattuto
In recent years, many studies have drawn attention to the important role of collective awareness and human behaviour during epidemic outbreaks. A number of modelling efforts have investigated the interaction between the disease transmission dynamics and human behaviour change mediated by news coverage and by information spreading in the population. Yet, given the scarcity of data on public awareness during an epidemic, few studies have relied on empirical data. Here, we use fine-grained, geo-referenced data from three online sources - Wikipedia, the GDELT Project and the Internet Archive - to quantify population-scale information seeking about the 2016 Zika virus epidemic in the U.S., explicitly linking such behavioural signals to epidemiological data. Geo-localized Wikipedia pageview data reveal that visiting patterns of Zika-related pages in Wikipedia were highly synchronized across the United States and largely explained by exposure to national television broadcast. Contrary to the assumption of some theoretical models, news volume and Wikipedia visiting patterns were not significantly correlated with the magnitude or the extent of the epidemic. Attention to Zika, in terms of Zika-related Wikipedia pageviews, was high at the beginning of the outbreak, when public health agencies raised an international alert and triggered media coverage, but subsequently exhibited an activity profile that suggests nonlinear dependencies and memory effects in the relationship between information seeking, media pressure, and disease dynamics. This calls for a new and more general modelling framework to describe the interaction between media exposure, public awareness, and disease dynamics during epidemic outbreaks.
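The kind of test described above - whether news volume and pageview patterns track each other - comes down to correlating two time series, for example with the Pearson coefficient. Below is a minimal sketch with made-up weekly numbers, not the study's data.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Made-up weekly series: national TV news segments vs. Zika-article pageviews.
news_volume = [2, 15, 40, 35, 20, 10, 5, 3]
pageviews = [100, 900, 2500, 2100, 1200, 600, 300, 200]

r = pearson(news_volume, pageviews)
print(f"r = {r:.2f}")  # strongly positive for this synthetic example
```

The study's point is that the analogous correlation between attention signals and epidemiological counts was weak, which simple linear measures like this make easy to check.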

Deliberation and resolution on Wikipedia: A case study of requests for comments
By Jane Im (University of Michigan) and Amy X. Zhang (MIT)
Resolving disputes in a timely manner is crucial for any online production group. We present an analysis of Requests for Comments (RfCs), one of the main vehicles on Wikipedia for formally resolving a policy or content dispute. We collected an exhaustive dataset of 7,316 RfCs on English Wikipedia over the course of 7 years and conducted a qualitative and quantitative analysis into what issues affect the RfC process. Our analysis was informed by 10 interviews with frequent RfC closers. We found that a major issue affecting the RfC process is the prevalence of RfCs that could have benefited from formal closure but that linger indefinitely without one, with factors including participants' interest and expertise impacting the likelihood of resolution. From these findings, we developed a model that predicts whether an RfC will go stale with 75.3% accuracy, a level that is approached as early as one week after dispute initiation. (RfC Dataset, CSCW paper)
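A bare-bones version of such a staleness predictor could score an RfC on early-activity features and threshold the score. This is an illustrative sketch with invented features and hand-picked weights, not the authors' model.

```python
from math import exp

def stale_probability(comments_first_week, participants, days_since_last_comment):
    """Logistic score from hand-picked illustrative weights: fewer early
    comments and participants, and longer silence, raise staleness risk."""
    z = (2.0
         - 0.3 * comments_first_week
         - 0.5 * participants
         + 0.4 * days_since_last_comment)
    return 1 / (1 + exp(-z))

active = stale_probability(comments_first_week=12, participants=6,
                           days_since_last_comment=1)
quiet = stale_probability(comments_first_week=1, participants=1,
                          days_since_last_comment=10)
print(f"active RfC: {active:.2f}, quiet RfC: {quiet:.2f}")
```

In the paper the weights would be learned from the labeled RfC dataset; the point of the sketch is that early-week activity signals alone already separate the two cases, which is why accuracy approaches its ceiling within a week of dispute initiation.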

August 2018[edit]

13 August 2018 Video: YouTube

Training an ML system to generate draft Wikipedia articles and Wikidata entries simultaneously
By John Bohannon and Vedant Dharnidharka, Primer
The automatic generation and updating of Wikipedia articles is usually approached as a multi-document summarization task: Given a set of source documents containing information about an entity, summarize the entity. Purely sequence-to-sequence neural models can pull that off, but getting enough data to train them is a challenge. Wikipedia articles and their reference documents can be used for training, as was recently done by a team at Google AI. But how do you find new source documents for new entities? And besides having humans read all of the source documents, how do you fact-check the output? What is needed is a self-updating knowledge base that learns jointly with a summarization model, keeping track of data provenance. Lucky for us, the world’s most comprehensive public encyclopedia is tightly coupled with Wikidata, the world’s most comprehensive public knowledge base. We have built a system called Quicksilver that uses them both.

July 2018[edit]

11 July 2018 Video: YouTube

Mind the (Language) Gap. Neural Generation of Multilingual Wikipedia Summaries from Wikidata for ArticlePlaceholders
By Lucie-Aimée Kaffee, Hady Elsahar, Pavlos Vougiouklis
While Wikipedia exists in 287 languages, its content is unevenly distributed among them. It is therefore of the utmost social and cultural interest to address languages whose native speakers have access only to an impoverished Wikipedia. In this work, we investigate the generation of summaries for Wikipedia articles in underserved languages, given structured data as an input.
In order to address the information bias towards widely spoken languages, we focus on an important support for such summaries: ArticlePlaceholders, which are dynamically generated content pages in underserved Wikipedia versions. They enable native speakers to access existing information in Wikidata, a structured Knowledge Base (KB). Our system provides a generative neural network architecture, which processes the triples of the KB as they are dynamically provided by the ArticlePlaceholder, and generate a comprehensible textual summary. This data-driven approach is tested with the goal of understanding how well it matches the communities' needs on two underserved languages on the Web: Arabic, a language with a big community with disproportionate access to knowledge online, and Esperanto.
With the help of the Arabic and Esperanto Wikipedians, we conduct an extended evaluation which exhibits not only the quality of the generated text but also the applicability of our end-system to any underserved Wikipedia version.

Token-level change tracking. Data, tools and insights
By Fabian Flöck
This talk first gives an overview of the WikiWho infrastructure, which provides tracking of changes to single tokens (~words) in articles of different Wikipedia language versions. It exposes APIs for accessing this data in near-real time, and is complemented by a published static dataset. Several insights are presented regarding provenance, partial reverts, token-level conflict and other metrics that only become available with such data. Lastly, the talk will cover several tools and scripts that are already using the API and will discuss their application scenarios, such as investigation of authorship, conflicted content and editor productivity.

June 2018[edit]

18 June 2018 Video: YouTube

Conversations Gone Awry. Detecting Early Signs of Conversational Failure
By Justine Zhang and Jonathan Chang, Cornell University
One of the main challenges online social systems face is the prevalence of antisocial behavior, such as harassment and personal attacks. In this work, we introduce the task of predicting from the very start of a conversation whether it will get out of hand. As opposed to detecting undesirable behavior after the fact, this task aims to enable early, actionable prediction at a time when the conversation might still be salvaged. To this end, we develop a framework for capturing pragmatic devices—such as politeness strategies and rhetorical prompts—used to start a conversation, and analyze their relation to its future trajectory. Applying this framework in a controlled setting, we demonstrate the feasibility of detecting early warning signs of antisocial behavior in online discussions.

Building a rich conversation corpus from Wikipedia Talk pages
We present a corpus of conversations that encompasses the complete history of interactions between contributors to English Wikipedia's Talk Pages. This captures a new view of these interactions by containing not only the final form of each conversation but also detailed information on all the actions that led to it: new comments, as well as modifications, deletions and restorations. This level of detail supports new research questions pertaining to the process (and challenges) of large-scale online collaboration. As an example, we present a small study of removed comments highlighting that contributors successfully take action on more toxic behavior than was previously estimated.

May 2018[edit]

08 May 2018 Video: YouTube

Case studies in the appropriation of ORES
By Aaron Halfaker, Wikimedia Foundation
ORES is an open, transparent, and auditable machine prediction platform for Wikipedians to help them do their work. It's currently used in 33 different Wikimedia projects to measure the quality of content, detect vandalism, recommend changes to articles, and identify good-faith newcomers. The primary way that Wikipedians use ORES' predictions is through the tools developed by volunteers. These JavaScript gadgets, MediaWiki extensions, and web-based tools make up a complex ecosystem of Wikipedian processes -- encoded into software. In this presentation, Aaron will walk through three key tools that Wikipedians have developed that make use of ORES, and he'll discuss how these novel process support technologies and the discussions around them have prompted Wikipedians to reflect on their work processes.

Exploring Wikimedia Donation Patterns
By Gary Hsieh, University of Washington
Every year, the Wikimedia Foundation relies on fundraising campaigns to help maintain the services it provides to millions of people worldwide. However, despite a large number of individuals who donate through these campaigns, these donors represent only a small percentage of Wikimedia users. In this work, we seek to advance our understanding of donors and their donation behaviors. Our findings offer insights to improve fundraising campaigns and to limit the burden of these campaigns on Wikipedia visitors.

April 2018[edit]

18 April 2018 Video: YouTube

The Critical Relationship of Volunteer-Created Wikipedia Content to Large-Scale Online Communities
By Nicholas Vincent, Northwestern University
The extensive Wikipedia literature has largely considered Wikipedia in isolation, outside of the context of its broader Internet ecosystem. Very recent research has demonstrated the significance of this limitation, identifying critical relationships between Google and Wikipedia that are highly relevant to many areas of Wikipedia-based research and practice. In this talk, I will present a study which extends this recent research beyond search engines to examine Wikipedia’s relationships with large-scale online communities, Stack Overflow and Reddit in particular. I will discuss evidence of consequential, albeit unidirectional relationships. Wikipedia provides substantial value to both communities, with Wikipedia content increasing visitation, engagement, and revenue, but we find little evidence that these websites contribute to Wikipedia in return. These findings highlight important connections between Wikipedia and its broader ecosystem that should be considered by researchers studying Wikipedia. More broadly, this talk will emphasize the key role that volunteer-created Wikipedia content plays in improving other websites, even contributing to revenue generation.

The Rise and Decline of an Open Collaboration System, a Closer Look
By Nate TeBlunthuis, University of Washington
Do patterns of growth and stabilization found in large peer production systems such as Wikipedia occur in other communities? This study assesses the generalizability of Halfaker et al.’s influential 2013 paper on “The Rise and Decline of an Open Collaboration System.” We replicate its tests of several theories related to newcomer retention and norm entrenchment using a dataset of hundreds of active peer production wikis from Wikia. We reproduce the subset of the findings from Halfaker and colleagues that we are able to test, comparing both the estimated signs and magnitudes of our models. Our results support the external validity of Halfaker et al.’s claims that quality control systems may limit the growth of peer production communities by deterring new contributors and that norms tend to become entrenched over time.

March 2018[edit]

21 March 2018 Video: YouTube

Using Wikipedia categories for research: opportunities, challenges, and solutions
By Tiziano Piccardi, EPFL
The category network in Wikipedia is used by editors as a way to label articles and organize them in a hierarchical structure. This manually created and curated network of 1.6 million nodes in English Wikipedia, generated by arranging the categories in a child-parent relation (e.g., Scientists-People, Cities-Human Settlement), allows researchers to infer valuable relations between concepts. A clean structure in this format would be a valuable resource for a variety of tools and applications, including automatic reasoning tools. Unfortunately, the Wikipedia category network contains some "noise", since in many cases the subcategory association does not define an is-a relation (Scientists is-a People vs. Billionaires is-a Wealth). While developing a model for recommending sections to add to existing Wikipedia articles, we devised a method to clean this network and to keep only the categories that have a high chance of being associated with their children by an is-a relation. The strategy is based on the concept of "pure" categories, and the algorithm uses the types of the attached articles to determine how homogeneous a category is. The approach does not rely on any linguistic feature and is therefore suitable for all Wikipedia languages. In this talk, we will give a high-level overview of the algorithm and discuss some of the possible applications of the generated network beyond article section recommendation.
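The notion of a "pure" category can be illustrated with a minimal sketch. The function names, the 90% threshold, and the toy type distributions below are illustrative assumptions, not the parameters actually used in the talk:

```python
from collections import Counter

def category_purity(article_types):
    """Fraction of a category's attached articles that share
    the most common type."""
    counts = Counter(article_types)
    return counts.most_common(1)[0][1] / len(article_types)

def keeps_is_a_link(article_types, threshold=0.9):
    """Keep a child-parent link only when the category is
    sufficiently homogeneous ('pure')."""
    return category_purity(article_types) >= threshold

# A category like "Scientists": nearly all attached articles share
# one type, so its is-a link survives the cleaning step.
scientists = ["Person"] * 95 + ["Award"] * 5

# A heterogeneous (noisy) category: no dominant article type,
# so its child-parent link is discarded.
mixed = ["Person"] * 40 + ["Event"] * 35 + ["Place"] * 25
```

Because the decision relies only on article types, not on the category's name or language, the same sketch applies to any Wikipedia language version.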

Beyond Automatic Translation: Aligning Wikipedia sections across multiple languages
By Diego Saez-Trumper
Sections are the building blocks of Wikipedia articles. For editors, they can be used as an entry point for creating and expanding articles. For readers, they enhance the readability of Wikipedia content. In this talk, we present ongoing research to align article sections across Wikipedia languages. We show that the available technology for automatic translation is not good enough for translating section titles. We then show a complementary approach to section alignment, using Wikidata and cross-lingual word embeddings. We will present some of the use cases of a methodology for aligning sections across languages, including improved section recommendation, especially in medium-sized and smaller languages, where the language itself may not contain enough signal about the structure of articles and signals can be inferred from other, larger Wikipedia languages.

February 2018[edit]

21 February 2018 Video: YouTube

Visual Enrichment of Collaborative Knowledge Bases
By Miriam Redi, Wikimedia Foundation
Images allow us to explain, enrich and complement knowledge without language barriers.[4] They can help illustrate the content of an item in a language-agnostic way to external data consumers. Images can be extremely helpful in multilingual collaborative knowledge bases such as Wikidata.
However, a large proportion of Wikidata items lack images. More than 3.6M Wikidata items are about humans (Q5), but only 17% of them have an image associated with them. Only 2.2M of 40 million Wikidata items have an image. A wider presence of images in such a rich, cross-lingual repository could enable a more complete representation of human knowledge.
In this talk, we will discuss challenges and opportunities faced when using machine learning and computer vision tools for the visual enrichment of collaborative knowledge bases. We will share research to help Wikidata contributors make Wikidata more “visual” by recommending high-quality Commons images to Wikidata items. We will show the first results on free-licence image quality scoring and recommendation and discuss future work in this direction.

Backlogs—backlogs everywhere
Using machine classification to clean up the new page backlog
By Aaron Halfaker, Wikimedia Foundation
If there's one insight that I've had about the functioning of Wikipedia and other wiki-based online communities, it's that eventually self-directed work breaks down and some form of organization becomes important for task routing.  In Wikipedia specifically, the notion of "backlogs" has become dominant.  There are backlogs of articles to create, articles to clean up, articles to assess, new editor contributions to review, manual of style rules to apply, etc.  To a community of people working on a backlog, the state of that backlog has deep effects on their emotional well-being.  A backlog that only grows is frustrating and exhausting.
Backlogs aren't inevitable, though, and there are many shapes that backlogs can take.  In my presentation, I'll tell a story about how English Wikipedia editors defined a process and set of roles that formed a backlog around new page creations.  I'll make the argument that this formalization of quality control practices has created a choke point and that alternatives exist. Finally, I'll present a vision for such an alternative using models that we have developed for ORES, the open machine prediction service my team maintains.

January 2018[edit]

17 January 2018 Video: YouTube

What motivates experts to contribute to public information goods? A field experiment at Wikipedia
By Yan Chen, University of Michigan
Wikipedia is among the most important information sources for the general public. Motivating domain experts to contribute to Wikipedia can improve the accuracy and completeness of its content. In a field experiment, we examine the incentives which might motivate scholars to contribute their expertise to Wikipedia. We vary the mentioning of likely citation, public acknowledgement and the number of views an article receives. We find that experts are significantly more interested in contributing when citation benefit is mentioned. Furthermore, cosine similarity between a Wikipedia article and the expert's paper abstract is the most significant factor leading to more and higher-quality contributions, indicating that better matching is a crucial factor in motivating contributions to public information goods. Other factors correlated with contribution include social distance and researcher reputation.

Wikihounding on Wikipedia
By Caroline Sinders, WMF
Wikihounding (a form of digital stalking on Wikipedia) is both qualitative and quantitative. What makes wikihounding different from mentoring? The context of the action, or the intention. Every interaction inside a digital space has a quantitative aspect to it: every comment, revert, etc., is a data point. By analyzing data points comparatively across wikihounding cases and reading some of the cases, we can create a baseline for the overlapping similarities among cases and study what makes up wikihounding. Wikihounding currently has a fairly loose definition. Wikihounding, as defined by the Harassment policy on en:wp, is: “the singling out of one or more editors, joining discussions on multiple pages or topics they may edit or multiple debates where they contribute, to repeatedly confront or inhibit their work. This is with an apparent aim of creating irritation, annoyance or distress to the other editor. Wikihounding usually involves following the target from place to place on Wikipedia.” This definition doesn't outline parameters around cases such as frequency of interaction, duration, or minimum reverts, nor is much known about what a standard or canonical case of wikihounding looks like. What is the average wikihounding case? This talk will cover the approaches that I and research team members Diego Saez-Trumper, Aaron Halfaker, and Jonathan Morgan are taking in starting this research project.

Note: If you'd like to learn more about this research, we have started to document it (the page is a work in progress).


December 2017[edit]

13 December 2017 Video: YouTube

The State of the Article Expansion Recommendation System
By Leila Zia
Only 1% of English Wikipedia articles are labeled with quality class Good or better, and 37% of the articles are stubs. We are building an article expansion recommendation system to change this in Wikipedia, across many languages. In this presentation, I will talk with you about our current thinking on the vision and direction of the research that can help us build such a recommendation system, and share more about one specific area of research we have heavily focused on in the past months: building a recommendation system that can help editors identify what sections to add to an already existing article. I present some of the challenges we faced, the methods we devised or used to overcome them, and the results of the first line of experiments on the quality of such recommendations (teaser: the results are really promising. The precision and recall at 10 are 80%.)
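For readers unfamiliar with the teaser metric, here is a minimal sketch of precision and recall at k for section recommendation. The section names are invented for illustration and are not from the actual evaluation:

```python
def precision_at_k(recommended, relevant, k=10):
    """Fraction of the top-k recommended sections that are relevant."""
    top_k = recommended[:k]
    return sum(1 for s in top_k if s in relevant) / k

def recall_at_k(recommended, relevant, k=10):
    """Fraction of the relevant sections recovered in the top-k."""
    top_k = recommended[:k]
    return sum(1 for s in relevant if s in top_k) / len(relevant)

# Hypothetical recommendations for an existing article, ranked.
recommended = ["History", "Geography", "Economy", "Culture", "Demographics",
               "Climate", "Education", "Sports", "Politics", "Media"]
# Hypothetical ground truth: sections editors actually added.
relevant = {"History", "Economy", "Climate", "Sports", "Notes"}
```

With this toy data, 4 of the top 10 recommendations are relevant (precision@10 = 0.4) and 4 of the 5 relevant sections appear in the top 10 (recall@10 = 0.8).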

November 2017[edit]

15 November 2017 Video: YouTube

Conversation Corpora, Emotional Robots, and Battles with Bias
By Lucas Dixon (Google/Jigsaw)
I'll talk about interesting experimental setups for doing large-scale analysis of conversations in Wikipedia, and what it even means to grapple with the concept of conversation when one is talking about revisions on talk pages. I'll also describe challenges with having good conversations at scale, some of the dreams one might have for AI in the space, and I'll dig into measuring unintended bias in machine learning and what one can do to make ML more inclusive. This talk will cover work from the WikiDetox project as well as ongoing research on the nature and impact of harassment in Wikipedia discussion spaces – part of a collaboration between Jigsaw, Cornell University, and the Wikimedia Foundation. The ML model training code, datasets, and the supporting tooling developed as part of this project are openly available. (slides)

October 2017[edit]

There was no showcase in October 2017. We attended WikidataCon in Berlin. We'll be back in November.

September 2017[edit]

September 20, 2017, 11:30am PDT Video: YouTube

A Glimpse into Babel
An Analysis of Multilinguality in Wikidata
By Lucie-Aimée Kaffee
Multilinguality is an important topic for knowledge bases, especially Wikidata, which was built to serve the multilingual requirements of an international community. Its labels are the way for humans to interact with the data. In this talk, we explore the current state of languages in Wikidata, especially in regard to its ontology and its relationship to Wikipedia. Furthermore, we set the multilinguality of Wikidata in the context of the real world by comparing it to the distribution of native speakers. We find an existing language maldistribution, which is less urgent in the ontology, and promising results for future improvements. An outlook on how users interact with languages on Wikidata will also be given.
See the paper[5]

Science is Shaped by Wikipedia
Evidence from a Randomized Control Trial
By Neil C. Thompson and Douglas Hanley
As the largest encyclopedia in the world, it is not surprising that Wikipedia reflects the state of scientific knowledge. However, Wikipedia is also one of the most accessed websites in the world, including by scientists, which suggests that it also has the potential to shape science. This paper shows that it does. Incorporating ideas into a Wikipedia article leads to those ideas being used more in the scientific literature. This paper documents this in two ways: correlationally across thousands of articles in Wikipedia and causally through a randomized experiment where we added new scientific content to Wikipedia. We find that fully a third of the correlational relationship is causal, implying that Wikipedia has a strong shaping effect on science. Our findings speak not only to the influence of Wikipedia, but more broadly to the influence of repositories of scientific knowledge. The results suggest that increased provision of information in accessible repositories is a very cost-effective way to advance science. We also find that such gains are equity-improving, disproportionately benefitting those without traditional access to scientific information.
See the paper[6]

August 2017[edit]

August 23, 2017, 11:30am PDT Video: YouTube

The Wikipedia Adventure
Field Evaluation of an Interactive Tutorial for New Users
By Sneha Narayan
Integrating new users into a community with complex norms presents a challenge for peer production projects like Wikipedia. We present The Wikipedia Adventure (TWA): an interactive tutorial that offers a structured and gamified introduction to Wikipedia. In addition to describing the design of the system, we present two empirical evaluations. First, we report on a survey of users, who responded very positively to the tutorial. Second, we report results from a large-scale invitation-based field experiment that tests whether using TWA increased newcomers' subsequent contributions to Wikipedia. We find no effect of either using the tutorial or of being invited to do so over a period of 180 days. We conclude that TWA produces a positive socialization experience for those who choose to use it, but that it does not alter patterns of newcomer activity. We reflect on the implications of these mixed results for the evaluation of similar social computing systems.
See the paper[7] and slides.[8]

The Gene Wiki
Using Wikipedia and Wikidata to organize biomedical knowledge
By Andrew Su
The Gene Wiki project began in 2007 with the goal of creating a collaboratively-written, community-reviewed, and continuously-updated review article for every human gene within Wikipedia. In 2013, shortly after the creation of the Wikidata project, the project expanded to include the organization and integration of structured biomedical data. This talk will focus on our current and future work, including efforts to encourage contributions from biomedical domain experts, to build custom applications that use Wikidata as the back-end knowledge base, and to promote CC0-licensing among biomedical knowledge resources.
Comments, feedback and contributions are welcome. See the slides[9]

July 2017[edit]

July 26, 2017, 11:30am PDT Video: YouTubecommons

Freedom versus Standardization: Structured Data Generation in a Peer Production Community
By Andrew Hall
In addition to encyclopedia articles and software, peer production communities produce structured data, e.g., Wikidata and OpenStreetMap’s metadata. Structured data from peer production communities has become increasingly important due to its use by computational applications, such as CartoCSS, MapBox, and Wikipedia infoboxes. However, this structured data is usable by applications only if it follows standards. We did an interview study focused on OpenStreetMap’s knowledge production processes to investigate how – and how successfully – this community creates and applies its data standards. Our study revealed a fundamental tension between the need to produce structured data in a standardized way and OpenStreetMap’s tradition of contributor freedom. We extracted six themes that manifested this tension and three overarching concepts, correctness, community, and code, which help make sense of and synthesize the themes. We also offer suggestions for improving OpenStreetMap’s knowledge production processes, including new data models, sociotechnical tools, and community practices.
See the paper[10] and slides[11].

June 2017[edit]

June 21, 2017, 11:30am PDT Video: YouTubecommons

Problematizing and Addressing the Article-as-Concept Assumption in Wikipedia
By Allen Yilun Lin
Wikipedia-based studies and systems frequently assume that each article describes a separate concept. However, in this paper, we show that this article-as-concept assumption is problematic due to editors’ tendency to split articles into parent articles and sub-articles when articles get too long for readers (e.g. “United States” and “American literature” in the English Wikipedia). In this paper, we present evidence that this issue can have significant impacts on Wikipedia-based studies and systems and introduce the sub-article matching problem. The goal of the sub-article matching problem is to automatically connect sub-articles to parent articles to help Wikipedia-based studies and systems retrieve complete information about a concept. We then describe the first system to address the sub-article matching problem. We show that, using a diverse feature set and standard machine learning techniques, our system can achieve good performance on most of our ground truth datasets, significantly outperforming baseline approaches.

Understanding Wikidata Queries
By Markus Kroetzsch
Wikimedia provides a public service that lets anyone answer complex questions over the sum of all knowledge stored in Wikidata. These questions are expressed in the query language SPARQL and range from the most simple fact retrievals ("What is the birthday of Douglas Adams?") to complex analytical queries ("Average lifespan of people by occupation"). The talk presents ongoing efforts to analyse the server logs of the millions of queries that are answered each month. It is an important but difficult challenge to draw meaningful conclusions from this dataset. One might hope to learn relevant information about the usage of the service and Wikidata in general, but at the same time one has to be careful not to be misled by the data. Indeed, the dataset turned out to be highly heterogeneous and unpredictable, with strongly varying usage patterns that make it difficult to draw conclusions about "normal" usage. The talk will give a status report, present preliminary results, and discuss possible next steps. (Project page on meta)
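The two example questions from the abstract can be written as SPARQL against the public Wikidata Query Service. The sketch below only builds the query strings (Q42 is Douglas Adams; P569, P570, and P106 are the real Wikidata properties for date of birth, date of death, and occupation); submitting them to the endpoint is left out:

```python
# Public endpoint of the Wikidata Query Service.
ENDPOINT = "https://query.wikidata.org/sparql"

# Simple fact retrieval: "What is the birthday of Douglas Adams?"
BIRTHDAY_QUERY = """
SELECT ?birthday WHERE {
  wd:Q42 wdt:P569 ?birthday .
}
"""

# Analytical query: "Average lifespan of people by occupation."
LIFESPAN_QUERY = """
SELECT ?occupation (AVG(YEAR(?died) - YEAR(?born)) AS ?avgLifespan) WHERE {
  ?person wdt:P106 ?occupation ;
          wdt:P569 ?born ;
          wdt:P570 ?died .
}
GROUP BY ?occupation
"""
```

Queries of exactly this range, from a single triple lookup to a grouped aggregate over millions of items, are what makes the server logs so heterogeneous and hard to summarize.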

May 2017[edit]

There was no showcase in May 2017. The team attended the Wikimedia Hackathon in Vienna and WikiCite. :)

April 2017[edit]

April 19, 2017 Video: YouTube

Using WikiBrain to visualize Wikipedia's neighborhoods
By Dr. Shilad Sen
While Wikipedia serves as the world's most widely used reference for humans, it also represents the most widely used body of knowledge for algorithms that must reason about the world. I will provide an overview of WikiBrain, a software project that serves as a platform for Wikipedia-based algorithms. I will also demo a brand new system built on WikiBrain that visualizes any dataset as a topographic map whose neighborhoods correspond to related Wikipedia articles. I hope to get feedback about which directions for these tools are most useful to the Wikipedia research community.
See also

March 2017[edit]

There was no showcase in March 2017.

February 2017[edit]

February 15, 2017 Video: YouTube

Wikipedia and the Urban-Rural Divide
By Isaac Johnson (GroupLens/University of Minnesota)
Wikipedia articles about places, OpenStreetMap features, and other forms of peer-produced content have become critical sources of geographic knowledge for humans and intelligent technologies. We explore the effectiveness of the peer production model across the rural/urban divide, a divide that has been shown to be an important factor in many online social systems. We find that in Wikipedia (as well as OpenStreetMap), peer-produced content about rural areas is of systematically lower quality, less likely to have been produced by contributors who focus on the local area, and more likely to have been generated by automated software agents (i.e. “bots”). We continue to explore and codify the systemic challenges inherent to characterizing rural phenomena through peer production as well as discuss potential solutions. (read more in this paper)

Wikipedia Navigation Vectors
By Ellery Wulczyn
In this project, we learned embeddings for Wikipedia articles and Wikidata items by applying Word2vec models to a corpus of reading sessions. Although Word2vec models were developed to learn word embeddings from a corpus of sentences, they can be applied to any kind of sequential data. The learned embeddings have the property that items with similar neighbors in the training corpus have similar representations (as measured by the cosine similarity, for example). Consequently, applying Word2vec to reading sessions results in article embeddings, where articles that tend to be read in close succession have similar representations. Since people usually generate sequences of semantically related articles while reading, these embeddings also capture semantic similarity between articles. (read more...)
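The similarity property described above can be sketched with toy vectors. The three-dimensional embeddings below are invented for illustration; the real navigation vectors have many more dimensions:

```python
from math import sqrt

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy article embeddings: articles that tend to be read in close
# succession (here, Berlin and Germany) end up with similar vectors.
embeddings = {
    "Berlin":  [1.0, 0.9, 0.1],
    "Germany": [0.9, 1.0, 0.0],
    "Banana":  [0.0, 0.1, 1.0],
}
```

With these vectors, "Berlin" is far closer to "Germany" than to "Banana", which is exactly the structure the navigation vectors are meant to capture at scale.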

January 2017[edit]

There was no showcase in January 2017.


December 2016[edit]

December 21, 2016 Video: YouTube

English Wikipedia Quality Dynamics and the Case of WikiProject Women Scientists
By Aaron Halfaker
With every productive edit, Wikipedia is steadily progressing towards higher and higher quality. In order to track quality improvements, Wikipedians have developed an article quality assessment rating scale that ranges from "Stub" at the bottom to "Featured Articles" at the top. While this quality scale has the promise of giving us insights into the dynamics of quality improvements in Wikipedia, it is hard to use due to the sporadic nature of manual re-assessments. By developing a highly accurate prediction model (based on work by Warncke-Wang et al.), we've developed a method to assess an article's quality at any point in history. Using this model, we explore general trends in quality in Wikipedia and compare these trends to those of an interesting cross-section: articles tagged by WikiProject Women Scientists. Results suggest that articles about women scientists were lower quality than the rest of the wiki until mid-2013, after which a dramatic shift occurred towards higher quality. This shift may correlate with (and even be caused by) this WikiProject's initiatives.
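One common way to turn such a model's per-class predictions into a single trackable number is a weighted sum over the ordinal assessment scale. This is a sketch; the probabilities below are made up, and the exact aggregation used in the talk may differ:

```python
# English Wikipedia assessment classes, ordered from lowest to highest.
WP10_SCALE = {"Stub": 0, "Start": 1, "C": 2, "B": 3, "GA": 4, "FA": 5}

def weighted_quality(class_probabilities):
    """Collapse predicted class probabilities into one score in [0, 5]."""
    return sum(WP10_SCALE[c] * p for c, p in class_probabilities.items())

# Hypothetical model output for an article that is most likely C-class.
prediction = {"Stub": 0.1, "Start": 0.2, "C": 0.4,
              "B": 0.2, "GA": 0.05, "FA": 0.05}
```

A continuous score like this can be computed for every revision, which is what makes it possible to plot quality trends over time despite sporadic manual re-assessments.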

Privacy, Anonymity, and Perceived Risk in Open Collaboration. A Study of Tor Users and Wikipedians
By Andrea Forte
In a recent qualitative study to be published at CSCW 2017, collaborators Rachel Greenstadt, Naz Andalibi, and I examined privacy practices and concerns among contributors to open collaboration projects. We collected interview data from people who use the anonymity network Tor who also contribute to online projects and from Wikipedia editors who are concerned about their privacy to better understand how privacy concerns impact participation in open collaboration projects. We found that risks perceived by contributors to open collaboration projects include threats of surveillance, violence, harassment, opportunity loss, reputation loss, and fear for loved ones. We explain participants’ operational and technical strategies for mitigating these risks and how these strategies affect their contributions. Finally, we discuss chilling effects associated with privacy loss, the need for open collaboration projects to go beyond attracting and educating participants to consider their privacy, and some of the social and technical approaches that could be explored to mitigate risk at a project or community level.

November 2016[edit]

November 16, 2016 Video: YouTube

Why We Read Wikipedia
By Leila Zia
Every day, millions of readers come to Wikipedia to satisfy a broad range of information needs; however, little is known about what these needs are. In this presentation, I share the results of research that aims to help us understand Wikipedia readers better. Based on an initial user study on English, Persian, and Spanish Wikipedia, we build a taxonomy of Wikipedia use-cases along several dimensions, capturing users’ motivations to visit Wikipedia, the depth of knowledge they are seeking, and their knowledge of the topic of interest prior to visiting Wikipedia. Then, we quantify the prevalence of these use-cases via a large-scale user survey conducted on English Wikipedia. Our analyses highlight the variety of factors driving users to Wikipedia, such as current events, media coverage of a topic, personal curiosity, work or school assignments, or boredom. Finally, we match survey responses to the respondents’ digital traces in Wikipedia’s server logs, enabling the discovery of behavioral patterns associated with specific use-cases. Our findings advance our understanding of reader motivations and behavior on Wikipedia and have potential implications for developers aiming to improve Wikipedia’s user experience, editors striving to cater to (a subset of) their readers’ needs, third-party services (such as search engines) providing access to Wikipedia content, and researchers aiming to build tools such as article recommendation engines.

October 2016[edit]

October 19, 2016 Video: YouTube

Human centered design for using and editing structured data in Wikipedia infoboxes
By Charlie Kritschmar UX Intern, Wikimedia Deutschland
Wikidata is a Wikimedia project which stores structured data to be used by other Wikimedia projects like Wikipedia. Currently, integrating its data in Wikipedia is difficult for users, since there is no predefined way to do so, and doing so requires some technical knowledge. To tackle these issues, human-centered design methods were applied to identify needs, from which solutions were generated and evaluated with the help of the community. The concept may serve as a basis that could be implemented in various Wikimedia projects in the future to make editing Wikidata from within another Wikimedia project more user-friendly and to improve the project's acceptance in the community.

Emergent Work in Wikipedia
By Ofer Arazy (University of Haifa)
Online production communities present an exciting opportunity for investigating novel organizational forms. Extant theoretical accounts of knowledge co-production point to organizational policies, norms, and communication as key mechanisms enabling the coordination of work. Yet, in practice, participants in initiatives such as Wikipedia are often occasional contributors who are unaware of community policies and do not communicate with other members. How then is work coordinated, and how does the organization maintain stability in the face of dynamics in individuals' task enactment? In this study we develop a conceptualization of emergent roles (the prototypical activity patterns that organically emerge from individuals' spontaneous actions) and investigate the temporal dynamics of emergent role behaviors. Conducting a multi-level, large-scale empirical study stretching over a decade, we tracked the co-production of a thousand Wikipedia articles, logging two hundred thousand distinct participants and seven hundred thousand co-production activities. Using a combination of manual tagging and machine learning, we annotated each activity type, and then clustered participants' activity profiles to arrive at seven prototypical emergent roles. Our analysis shows that participants' behavior is turbulent, with substantial flow in and out of co-production work and across roles. Our findings at the organizational level, however, show that work is organized around a highly stable set of emergent roles, despite the absence of traditional stabilizing mechanisms such as pre-defined work procedures or role expectations. We conceptualize this dualism in emergent work as "Turbulent Stability". Further analyses suggest that co-production is artifact-centric, with contributors mutually adjusting according to the artifact's changing needs. Our study advances the theoretical understanding of self-organizing knowledge co-production and particularly the nature of emergent roles.

September 2016[edit]

September 21, 2016 Video: YouTube

Finding News Citations for Wikipedia
By Besnik Fetahu (Leibniz University of Hannover)
Slides: [1]
An important editing policy in Wikipedia is to provide citations for added statements in Wikipedia pages, where statements can be arbitrary pieces of text, ranging from a sentence to a paragraph. In many cases citations are either outdated or missing altogether. In this work we address the problem of finding and updating news citations for statements in entity pages. We propose a two-stage supervised approach for this problem. In the first stage, we construct a classifier to determine whether statements need a news citation or other kinds of citations (web, book, journal, etc.). In the second stage, we develop a news citation algorithm for Wikipedia statements, which recommends appropriate citations from a given news collection. Apart from IR techniques that use the statement to query the news collection, we also formalize three properties of an appropriate citation, namely: (i) the citation should entail the Wikipedia statement, (ii) the statement should be central to the citation, and (iii) the citation should be from an authoritative source. We perform an extensive evaluation of both stages, using 20 million articles from a real-world news collection. Our results are quite promising, and show that we can perform this task with high precision and at scale.

Designing and Building Online Discussion Systems
By Amy X. Zhang (MIT)
Today, conversations are everywhere on the Internet and come in many different forms. However, there are still many problems with discussion interfaces today. In my talk, I will first give an overview of some of the problems with discussion systems, including difficulty dealing with large scales, which exacerbates additional problems with navigating deep threads containing lots of back-and-forth and getting an overall summary of a discussion. Other problems include dealing with moderation and harassment in discussion systems and gaining control over filtering, customization, and means of access. Then I will focus on a few projects I am working on in this space now. The first is Wikum, a system I developed to allow users to collaboratively generate a wiki-like summary from threaded discussion. The second, which I have just begun, is exploring the design space of presentation and navigation of threaded discussion. I will next discuss Murmur, a mailing list hybrid system we have built to implement and test ideas around filtering, customization, and flexibility of access, as well as combating harassment. Finally, I'll wrap up with what I am working on at Google Research this summer: developing a taxonomy to describe online forum discussion and using this information to extract meaningful content useful for search, summarization of discussions, and characterization of communities.

August 2016[edit]

August 17, 2016 Video: YouTube

Computational Fact Checking from Knowledge Networks
By Giovanni Luca Ciampaglia
Traditional fact checking by expert journalists cannot keep up with the enormous volume of information that is now generated online. Fact checking is often a tedious and repetitive task and even simple automation opportunities may result in significant improvements to human fact checkers. In this talk I will describe how we are trying to approximate the complexities of human fact checking by exploring a knowledge graph under a properly defined proximity measure. Framed as a network traversal problem, this approach is feasible with efficient computational techniques. We evaluate this approach by examining tens of thousands of claims related to history, entertainment, geography, and biographical information using the public knowledge graph extracted from Wikipedia by the DBPedia project, showing that the method does indeed assign higher confidence to true statements than to false ones. One advantage of this approach is that, together with a numerical evaluation, it also provides a sequence of statements that can be easily inspected by a human fact checker.
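The path-based intuition above can be sketched in a few lines: prefer short paths through low-degree (more specific) intermediate nodes, so that a claim connected only via generic hubs scores low. This is an illustrative toy version under my own assumptions; the published proximity measure and its DBpedia-scale implementation differ in their exact form, and all names below are hypothetical.

```python
import math

def proximity(graph, source, target):
    """Best-path proximity between two entities in a small knowledge graph.

    A direct edge scores 1.0; longer paths are penalized by the log-degree
    of their intermediate nodes, so paths through generic hub entities
    score lower. Toy sketch only, not the published measure.
    """
    def degree(v):
        return len(graph[v])

    best = 0.0
    stack = [(source, [source])]  # depth-first search over simple paths
    while stack:
        node, path = stack.pop()
        if node == target:
            inner = path[1:-1]  # intermediate nodes on this path
            score = 1.0 / (1.0 + sum(math.log(degree(v)) for v in inner))
            best = max(best, score)
            continue
        for nxt in graph[node]:
            if nxt not in path:  # keep paths simple (no revisits)
                stack.append((nxt, path + [nxt]))
    return best

# A direct edge scores 1.0; a longer path through other nodes scores less.
g = {"Rome": {"Italy"}, "Italy": {"Rome", "Europe"},
     "Europe": {"Italy", "France"}, "France": {"Europe"}}
print(proximity(g, "Rome", "Italy"))   # → 1.0
print(proximity(g, "Rome", "France"))  # ≈ 0.42
```

A true statement such as "Rome – Italy" tends to be supported by a short, specific path, which is the property the fact-checking method exploits.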

Deploying and maintaining AI in a socio-technical system. Lessons learned
By Aaron Halfaker
We should exercise great caution when deploying AI into our social spaces. The algorithms that make counter-vandalism in Wikipedia orders of magnitude more efficient also have the potential to perpetuate biases and silence whole classes of contributors. This presentation will describe the system efficiency characteristics that make AI so attractive for supporting quality control activities in Wikipedia. Then, Aaron will tell two stories of how the algorithms brought new, problematic biases to quality control processes in Wikipedia and how the Revision Scoring team learned about and addressed these issues in ORES, a production-level AI service for Wikimedia Wikis. He'll also make an overdue call to action toward leveraging human-review of AIs biases in the practice of AI development.

July 2016[edit]

July 20, 2016 Video: YouTube

Detecting Personal Attacks on Wikipedia
By Ellery Wulczyn, Nithum Thain
Ellery Wulczyn (WMF) and Nithum Thain (Jigsaw) will be speaking about their recent work on Project Detox, a research project to develop tools to detect and understand online personal attacks and harassment on Wikipedia. Their talk will cover the whole research pipeline to date, including data acquisition, machine learning model building, and some analytical insights as to the nature of personal attacks on Wikipedia talk pages.
Portal Research: Search Behaviors and the New Language-by-Article-Count Dropdown
By Daisy Chen
What part do the portal and on-wiki search mechanisms play in users' experiences finding information online? These findings reflect research participants' responses to a combination of generative and evaluative questions about their general online search behaviors, on-wiki search behaviors, interactions with the portal, and their thoughts about a partial re-design of the portal page, the new language by article count dropdown.

June 2016[edit]

There was no showcase in June 2016.

May 2016[edit]

There was no showcase in May 2016.

April 2016[edit]

There was no showcase in April 2016.

March 2016[edit]

March 16, 2016 Video: YouTube

Evolution of Privacy Loss in Wikipedia
By Marian-Andrei Rizoiu (Australian National University)
The cumulative effect of collective online participation has an important and adverse impact on individual privacy. As an online system evolves over time, new digital traces of individual behavior may uncover previously hidden statistical links between an individual's past actions and her private traits. To quantify this effect, we analyze the evolution of individual privacy loss by studying the edit history of Wikipedia over 13 years, including more than 117,523 different users performing 188,805,088 edits. We trace each Wikipedia contributor using apparently harmless features, such as the number of edits performed on predefined broad categories in a given time period (e.g. Mathematics, Culture or Nature). We show that even at this unspecific level of behavior description, it is possible to use off-the-shelf machine learning algorithms to uncover usually undisclosed personal traits, such as gender, religion or education. We provide empirical evidence that the prediction accuracy for almost all private traits consistently improves over time. Surprisingly, the prediction performance for users who stopped editing after a given time still improves. The activities performed by new users seem to have contributed more to this effect than additional activities from existing (but still active) users. Insights from this work should help users, system designers, and policy makers understand and make long-term design choices in online content creation systems.

February 2016[edit]

There was no showcase in February 2016.

January 2016[edit]

January 20, 2016 Video: YouTube

Anon productivity and productive efficiency in English Wikipedia
By Aaron Halfaker (Halfak/EpochFail)
Building from a call to action around measuring value-adding behavior in Wikipedia from Wikimania 2014, I'll show preliminary results of measuring editor productivity in English Wikipedia. From this analysis some surprising results have emerged: (1) IP editors contribute about 20% of good new content to Wikipedia articles, (2) the overall productivity of registered editors has been holding constant since 2007 -- despite declines in the community and labor hours invested in editing. (1) suggests that we should consider better supporting editing without an account and (2) suggests that Wikipedians are somehow contributing more efficiently than they used to.

Cooperation in a Peer Production Economy
Experimental Evidence from Wikipedia
By Jérôme Hergueux
Relying on the behavior of Wikipedia contributors in a (game-theoretic) social experiment, I will seek to engage the community in a reflection about ways to create a more inclusive Wikipedia. First, I will identify the underlying demographic and social determinants of anti-social behavior within Wikipedia -- an often cited driver of its declining retention rates. Second, I will study the relationship between Wikipedia administrators' trust in anonymous strangers and their policing activity patterns, asking what level of trust admins should exhibit in order to efficiently protect Wikipedia from malicious users while avoiding driving well-intentioned ones away from the project.


December 2015[edit]

There was no showcase in December 2015.

November 2015[edit]

November 18, 2015 Video: YouTube

Impact, Characteristics, and Detection of Wikipedia Hoaxes
By Srijan Kumar
False information on Wikipedia raises concerns about its credibility. One way in which false information may be presented on Wikipedia is in the form of hoax articles, i.e. articles containing fabricated facts about nonexistent entities or events. In this talk, we study false information on Wikipedia by focusing on the hoax articles that have been created throughout its history. First, we assess the real-world impact of hoax articles by measuring how long they survive before being debunked, how many pageviews they receive, and how heavily they are referred to by documents on the Web. We find that, while most hoaxes are detected quickly and have little impact on Wikipedia, a small number of hoaxes survive long and are well cited across the Web. Second, we characterize the nature of successful hoaxes by comparing them to legitimate articles and to failed hoaxes that were discovered shortly after being created. We find characteristic differences in terms of article structure and content, embeddedness into the rest of Wikipedia, and features of the editor who created the hoax. Third, we successfully apply our findings to address a series of classification tasks, most notably to determine whether a given article is a hoax. And finally, we describe and evaluate a task involving humans distinguishing hoaxes from non-hoaxes. We find that humans are not particularly good at the task and that our automated classifier outperforms them by a big margin.

Please see the latest version of the slides.

October 2015[edit]

October 21, 2015 Video: YouTube

The impact of the Wikipedia Teahouse on new editor retention
By Jonathan Morgan, Aaron Halfaker
New Wikipedia editors face a variety of social and technical barriers to participation. These barriers have been shown to cause even promising, highly-motivated newcomers to give up and leave Wikipedia shortly after joining.[12] The Wikipedia Teahouse was launched in 2012 to provide new editors with a space on Wikipedia where they could ask questions, introduce themselves, and learn the ropes of editing in a friendly and supportive environment, with the goal of increasing the percentage of good-faith newcomers who go on to become productive Wikipedians. Research has shown[13][14] that the Teahouse provided a positive experience for participants, and suggested[15] that participating in the Teahouse led to more editing activity and longer survival for new editors. The current study[16] examines the impact of Teahouse invitations on new editors' survival over a longer period of time (2-6 months), and presents findings on contextual factors within editors' first few sessions that are associated with overall survival rates, as well as editing patterns associated with an increased likelihood of visiting the Teahouse.

September 2015[edit]

September 16, 2015 Video: YouTube

Fun or Functional? The Misalignment Between Content Quality and Popularity in Wikipedia
By Morten Warncke-Wang
In peer production communities like Wikipedia, individual community members typically decide for themselves where to make contributions, often driven by factors such as “fun” or a belief that “information should be free”. However, the extent to which this bottom-up, interest-driven content production paradigm meets the need of consumers of this content is unclear. In this talk, I analyse four large Wikipedia language editions, finding extensive misalignment between production and consumption of quality content in all of them, and I show how this greatly impacts Wikipedia’s readers. I also examine misalignment in more detail by studying how it relates to specific topics, and to what extent high popularity is related to sudden changes in demand (i.e. “breaking news”). Finally, I discuss technologies and community practices that can help reduce misalignment in Wikipedia. See the paper[17].

Automated News Suggestions for Populating Wikipedia Entity Pages
By Besnik Fetahu
Wikipedia entity pages are a valuable source of information for direct consumption and for knowledge-base construction, update and maintenance. Facts in these entity pages are typically supported by references. Recent studies show that as much as 20% of the references are from online news sources. However, many entity pages are incomplete even if relevant information is already available in existing news articles. Even for the already present references, there is often a delay between the news article publication time and the reference time. In this work, we therefore look at Wikipedia through the lens of news and propose a novel news-article suggestion task to improve news coverage in Wikipedia and reduce the lag of newsworthy references. Our work finds direct application, as a precursor, to Wikipedia page generation and knowledge-base acceleration tasks that rely on relevant and high quality input sources. We propose a two-stage supervised approach for suggesting news articles to entity pages for a given state of Wikipedia. First, we suggest news articles to Wikipedia entities (article-entity placement) relying on a rich set of features which take into account the salience and relative authority of entities, and the novelty of news articles to entity pages. Second, we determine the exact section in the entity page for the input article (article-section placement) guided by class-based section templates. We perform an extensive evaluation of our approach based on ground-truth data that is extracted from external references in Wikipedia. We achieve a high precision value of up to 93% in the article-entity suggestion stage and up to 84% for the article-section placement. Finally, we compare our approach against competitive baselines and show significant improvements.

August 2015[edit]

The August showcase was canceled due to scheduling conflicts.

July 2015[edit]

July 29, 2015 Video: YouTube

VisualEditor's effect on newly registered users
By Aaron Halfaker
It's been nearly two years since we ran an initial study of VisualEditor's effect on newly registered editors. While most of the results of this study were positive (e.g. workload on Wikipedians did not increase), we still saw a significant decrease in newcomer productivity. In the meantime, the Editing team has made substantial improvements to performance and functionality. In this presentation, I'll report on the results of a new experiment designed to test the effects of enabling this improved VisualEditor software for newly registered users by default. I'll show what we learned from the experiment and discuss how some results have opened larger questions about what, exactly, is difficult about being a newcomer to English Wikipedia.

Wikipedia knowledge graph with DeepDive
By Juhana Kangaspunta and Thomas Palomares
Despite the tremendous amount of information present on Wikipedia, only a small fraction of it is structured. Most of the information is embedded in text and extracting it is a non-trivial challenge. In this project, we try to populate Wikidata, the structured-data companion project of Wikipedia, using the DeepDive tool to extract relations embedded in the text. We ultimately extracted more than 140,000 relations with more than 90% average precision. This report is structured as follows: first, we present DeepDive and the data that we use for this project. Second, we clarify the relations we have focused on so far and explain the implementation and pipeline, including our model, features and extractors. Finally, we detail our results with a thorough precision and recall analysis.

June 2015[edit]

The June showcase was canceled due to scheduling conflicts.

May 2015[edit]

May 13, 2015 Video: YouTube

The people's classifier: Towards an open model for algorithmic infrastructure
By Aaron Halfaker
Recent research has implicated Wikipedia's algorithmic infrastructure in perpetuating social issues. However, these same algorithmic tools are critical to maintaining the efficiency of open projects like Wikipedia at scale. Rather than simply critiquing algorithmic wiki-tools and calling for less algorithmic infrastructure, I'll propose a different strategy -- an open approach to building this algorithmic infrastructure. In this presentation, I'll demo a set of services that are designed to open up a critical part of Wikipedia's quality control infrastructure -- machine classifiers. I'll also discuss how this strategy unites critical/feminist HCI with more dominant narratives about efficiency and productivity.
Social transparency online
By Jennifer Marlow and Laura Dabbish
An emerging Internet trend is greater social transparency, such as the use of real names in social networking sites, feeds of friends' activities, traces of others' re-use of content, and visualizations of team interactions. There is a potential for this transparency to radically improve coordination, particularly in open collaboration settings like Wikipedia. In this talk, we will describe some of our research identifying how transparency influences collaborative performance in online work environments. First, we have been studying professional social networking communities. Social media allows individuals in these communities to create an interest network of people and digital artifacts, and get moment-by-moment updates about actions by those people or changes to those artifacts. It affords an unprecedented level of transparency about the actions of others over time. We will describe qualitative work examining how members of these communities use transparency to accomplish their goals. Second, we have been looking at the impact of making workflows transparent. In a series of field experiments we are investigating how socially transparent interfaces, and activity trace information in particular, influence perceptions of and behavior towards others, as well as evaluations of their work.

April 2015[edit]

April 30, 2015 Video: YouTube

Creating, remixing, and planning in open online communities
By Jeff Nickerson
Paradoxically, users in remixing communities don't remix very much. But an analysis of one remix community, Thingiverse, shows that those who actively remix end up producing work that is in turn more likely to be remixed. What does this suggest about Wikipedia editing? Wikipedia allows more types of contribution, because creating and editing pages are done in a planning context: plans are discussed in particular loci, including project talk pages. Plans on project talk pages lead to both creation and editing; some editors specialize in making article changes while others, who tend to have more experience, focus on planning rather than acting. Contributions can happen at the level of the article and also at a series of meta levels. Some patterns of behavior (with respect to creating versus editing, and acting versus planning) are likely to lead to more sustained engagement and to higher quality work. Experiments are proposed to test these conjectures.
Authority, power and culture on Wikipedia: The oral citations debate
By Heather Ford
In 2011, Wikimedia Foundation Advisory Board member, Achal Prabhala was funded by the WMF to run a project called 'People are knowledge' or the Oral citations project. The goal of the project was to respond to the dearth of published material about topics of relevance to communities in the developing world and, although the majority of articles in languages other than English remain intact, the English editions of these articles have had their oral citations removed. I ask why this happened, what the policy implications are for oral citations generally, and what steps can be taken in the future to respond to the problem that this project (and more recent versions of it) set out to solve. This talk comes out of an ethnographic project in which I have interviewed some of the actors involved in the original oral citations project, including the majority of editors of the surr article that I trace in a chapter of my PhD [2].

March 2015[edit]

March 25, 2015 Video: YouTube

User Session Identification Based on Strong Regularities in Inter-activity Time
By Aaron Halfaker
Session identification is a common strategy used to develop metrics for web analytics and behavioral analyses of user-facing systems. Past work has argued that session identification strategies based on an inactivity threshold are inherently arbitrary, or has advocated that thresholds be set at about 30 minutes. In this work, we demonstrate a strong regularity in the temporal rhythms of user-initiated events across several different domains of online activity (incl. video gaming, search, page views and volunteer contributions). We describe a methodology for identifying clusters of user activity and argue that the regularity with which these activity clusters appear implies a good rule-of-thumb inactivity threshold of about 1 hour. We conclude with implications that these temporal rhythms may have for system design based on our observations and theories of goal-directed human activity.
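The rule of thumb above can be sketched as a simple split of a user's event timestamps wherever the inter-activity gap exceeds one hour. This is a minimal illustration, not the paper's code; the function name and default cutoff are my own.

```python
def sessionize(timestamps, cutoff_seconds=3600):
    """Split Unix timestamps into sessions.

    A new session starts whenever the gap between consecutive events
    exceeds the inactivity cutoff (1 hour by default, the rule of
    thumb suggested in the talk).
    """
    sessions = []
    current = []
    for t in sorted(timestamps):
        if current and t - current[-1] > cutoff_seconds:
            sessions.append(current)  # gap too long: close this session
            current = []
        current.append(t)
    if current:
        sessions.append(current)
    return sessions

# Three events within an hour of each other, then a 2-hour gap.
events = [0, 600, 1800, 9000, 9300]
print(sessionize(events))  # → [[0, 600, 1800], [9000, 9300]]
```

The point of the paper is that the cutoff need not be arbitrary: the empirical distribution of inter-activity times suggests roughly where to place it.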
Mining Missing Hyperlinks from Human Navigation Traces
By Bob West
Wikipedia relies crucially on the links between articles, but important links are often missing. In most prior work, the problem of detecting missing links is addressed by constructing a model of the existing link structure and then predicting the missing links based on this model. In this work we propose a novel method that does not rely on such a model of the static structure of existing links, but rather starts from data capturing how these links are used by people. The approach is guided by the intuition that the ultimate purpose of hyperlinks is to aid navigation, so we argue that the objective should be to suggest links that are likely to be clicked by users. In a nutshell, our algorithm suggests an as yet non-existent link from S to T for addition if users who open S are much more likely than random to later also open T. We show that this simple algorithm yields good link suggestions when run on data from a human-computation navigation game. Finally, we show preliminary results indicating that the method also works "in the wild", i.e., on navigation data mined directly from Wikipedia's server logs.
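The "in a nutshell" criterion above can be sketched as a lift computation over navigation traces: suggest S→T when T follows S far more often than T's baseline popularity would predict. This is a simplified, illustrative reading; the `min_lift` threshold, the probability estimates, and all names are my own, not the algorithm's actual parameters.

```python
from collections import Counter

def suggest_links(traces, existing_links, min_lift=2.0):
    """Suggest new links S -> T from navigation traces.

    A pair is suggested when users who open S later open T much more
    often (by a factor of min_lift) than T's overall frequency would
    predict, and no link S -> T exists yet. Illustrative sketch only.
    """
    # How many traces visit each page at least once.
    page_counts = Counter(p for trace in traces for p in set(trace))
    n_traces = len(traces)

    # How many traces visit T at some point after visiting S.
    pair_counts = Counter()
    for trace in traces:
        seen = set()
        for i, s in enumerate(trace):
            for t in trace[i + 1:]:
                if (s, t) not in seen:
                    pair_counts[(s, t)] += 1
                    seen.add((s, t))

    suggestions = []
    for (s, t), c in pair_counts.items():
        if (s, t) in existing_links or s == t:
            continue
        p_t = page_counts[t] / n_traces    # baseline P(open T)
        p_t_given_s = c / page_counts[s]   # P(open T after opening S)
        if p_t > 0 and p_t_given_s / p_t >= min_lift:
            suggestions.append((s, t))
    return suggestions

# Readers starting at D disproportionately end up at E, with no D -> E link.
traces = [["A", "B", "C"], ["A", "B", "C"], ["D", "C"], ["D", "E"]]
print(suggest_links(traces, existing_links={("A", "B"), ("B", "C")}))  # → [('D', 'E')]
```

Real server-log data would of course require sessionizing and de-noising the traces first, which the toy example skips.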

February 2015[edit]

February 18, 2015 Video: YouTube

Presentation slides.
Global South User Survey 2014
By Haitham Shammaa
Users' trends in the Global South have significantly changed over the past two years, and given the increase in interest in Global South communities and their activities, we wanted this survey to focus on understanding the statistics and needs of our users (both readers and editors) in the regions listed in the WMF's New Global South Strategy. This survey aims to provide a better understanding of the specific needs of local user communities in the Global South, as well as to provide data that supports the product and program development decision-making process.
Presentation slides.
Ingesting Open Geodata: Observations from OpenStreetMap
By Alan McConchie
As Wikidata grapples with the challenges of ingesting external data sources such as Freebase, what lessons can we learn from other open knowledge projects that have had similar experiences? OpenStreetMap, often called "The Wikipedia of Maps", is a crowdsourced geospatial data project covering the entire world. Since the earliest years of the project, OSM has combined user contributions with existing data imported from external sources. Within the OSM community, these imports have been controversial; some core OSM contributors complain that imported data is lower quality than user-contributed data, or that it discourages the growth of local mapping communities. In this talk, I'll review the history of data imports in OSM, and describe how OSM's best-practices have evolved over time in response to these critiques.

January 2015[edit]

January 14, 2015 Video: YouTube

Functional roles and career paths in Wikipedia
Presentation slides
By Felipe Ortega
An understanding of participation dynamics within online production communities requires an examination of the roles assumed by participants. Recent studies have established that the organizational structure of such communities is not flat; rather, participants can take on a variety of well-defined functional roles. What is the nature of functional roles? How have they evolved? And how do participants assume these functions? Prior studies focused primarily on participants' activities, rather than functional roles. Further, extant conceptualizations of role transitions in production communities, such as the Reader to Leader framework, emphasize a single dimension: organizational power, overlooking distinctions between functions. In contrast, in this paper we empirically study the nature and structure of functional roles within Wikipedia, seeking to validate existing theoretical frameworks. The analysis sheds new light on the nature of functional roles, revealing the intricate "career paths" resulting from participants' role transitions.
Free Knowledge Beyond Wikipedia
A conversation facilitated by Benjamin Mako Hill
In some of my research with Leah Buechley, I've explored the way that increasing engagement and diversity in technology communities often means not just attacking systematic barriers to participation but also designing for new genres and types of engagement. I hope to facilitate a conversation about how WMF might engage new readers by supporting more non-encyclopedic production. I'd like to call out some examples from the new Wikimedia project proposals list, encourage folks to share entirely new ideas, and ask for ideas about how we could dramatically better support Wikipedia's sister projects.


December 2014[edit]

December 18, 2014 Video: YouTube

Mobile Madness: The Changing Face of Wikimedia Readers
Presentation slides
By Oliver Keyes
A dive into the data we have around readership that investigates the rising popularity of the mobile web, countries and projects that are racing ahead of the pack, and what changes in user behaviour we can expect to see as mobile grows.
Global Disease Monitoring and Forecasting with Wikipedia
By Reid Priedhorsky (Los Alamos National Laboratory)
Infectious disease is a leading threat to public health, economic stability, and other key social structures. Efforts to mitigate these impacts depend on accurate and timely monitoring to measure the risk and progress of disease. Traditional, biologically-focused monitoring techniques are accurate but costly and slow; in response, new techniques based on social internet data, such as social media and search queries, are emerging. These efforts are promising, but important challenges in the areas of scientific peer review, breadth of diseases and countries, and forecasting hamper their operational usefulness. We examine a freely available, open data source for this use: access logs from the online encyclopedia Wikipedia. Using linear models, language as a proxy for location, and a systematic yet simple article selection procedure, we tested 14 location-disease combinations and demonstrate that these data feasibly support an approach that overcomes these challenges. Specifically, our proof-of-concept yields models with r² up to 0.92, forecasting value up to the 28 days tested, and several pairs of models similar enough to suggest that transferring models from one location to another without re-training is feasible. Based on these preliminary results, we close with a research agenda designed to overcome these challenges and produce a disease monitoring and forecasting system that is significantly more effective, robust, and globally comprehensive than the current state of the art.

November 2014[edit]

November 14, 2014 Video: YouTube

Does Team Competition Increase Pro-Social Lending? Evidence from Online Microfinance.
Presentation slides
By Yan Chen
In the first half of the talk, I will present our empirical analysis of the effects of team competition on pro-social lending activity on Kiva, the first microlending website to match lenders with entrepreneurs in developing countries. Using naturally occurring field data, we find that lenders who join teams contribute 1.2 more loans per month than those who do not. Furthermore, teams differ in activity levels. To investigate this heterogeneity, we run a field experiment by posting forum messages. Compared to the control, we find that lenders from inactive teams make significantly more loans when exposed to a goal-setting message and that team coordination increases the magnitude of this effect.
In the second part of the talk, I will discuss a randomized field experiment we conducted in May 2014, in which we recommended teams to lenders on Kiva. We find that lenders are more likely to join teams in their local area. Moreover, after joining, lenders who join popular teams (those on the leaderboard) are more active in lending.

October 2014[edit]


October 15, 2014 Video: YouTube

Emotions under Discussion: Gender, Status and Communication in Wikipedia
By David Laniado: I will present a large-scale analysis of emotional expression and communication style of editors in Wikipedia discussions. The talk will focus especially on how emotion and dialogue differ depending on the status, gender, and communication network of the roughly 12,000 editors who have written at least 100 comments on the English Wikipedia's article talk pages. The analysis is based on three predefined lexicon-based methods for quantifying emotions: ANEW, LIWC, and SentiStrength. The results unveil significant differences in the emotional expression and communication style of editors according to their status and gender, and can help address issues such as the gender gap and editor stagnation.
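Lexicon-based methods like those named above score text by summing per-word valences from a predefined dictionary. The following is a toy illustration of that general technique only; the miniature lexicon and its weights are invented here and do not come from ANEW, LIWC, or SentiStrength.

```python
# Toy lexicon-based sentiment scoring: sum the valence of each known
# word in a talk-page comment. The lexicon below is entirely made up.

LEXICON = {"thanks": 2, "great": 3, "revert": -1, "vandal": -3}

def score(comment):
    """Return the summed valence of lexicon words in the comment."""
    words = comment.lower().split()
    return sum(LEXICON.get(w, 0) for w in words)

print(score("thanks for the great edit"))  # 5
print(score("revert this vandal"))         # -4
```

Real instruments differ in important ways (ANEW uses continuous valence/arousal ratings, LIWC counts category membership, SentiStrength scores positive and negative strength separately), but all reduce to dictionary lookups of this general shape.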
Wikipedia as a socio-technical system
By Aaron Halfaker: Wikipedia is a socio-technical system. In this presentation, I'll explain how the integration of human collective behavior ("social") and information technology ("technical") has led to phenomena that, while massively productive, are poorly understood due to a lack of precedent. Based on my work in this area, I'll describe five critical functions that healthy, Wikipedia-like socio-technical systems must serve in order to continue to function: allocation, regulation, quality control, community management and reflection. Finally, I'll conclude with an overview of three classes of new projects that should provide critical opportunities to both practically and academically understand the maintenance of Wikipedia's socio-technical fitness.

September 2014[edit]

September 17, 2014 The September showcase was canceled because of a conflict with other events scheduled by WMF. We will resume showcases in October.

August 2014[edit]

August 20, 2014 Video: YouTube

Everything You Know About Mobile Is WrW^Right: Editing and Reading Pattern Variation Between User Types
By Oliver Keyes: Using new geolocation tools, we look at reader and editor behaviour to understand how and when people access and contribute to our content. This is largely exploratory research, but has potential implications for our A/B testing and how we understand both cultural divides between reader and editor groups from different countries, and how we understand the differences between types of edit and the editors who make them.
Wikipedia Article Curation: Understanding Quality, Recommending Tasks
By Morten Warncke-Wang: In this talk we look at article curation in Wikipedia through the lens of task suggestions and article quality. The first part of the talk presents SuggestBot, the Wikipedia article recommender. SuggestBot connects contributors with articles similar to those they previously edited. In the second part of the talk, we discuss Wikipedia article quality using “actionable” features, features that contributors can easily act upon to improve article quality. We will first discuss these features’ ability to predict article quality, before coming back to SuggestBot and show how these predictions and actionable features can be used to improve the suggestions.
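"Actionable" features are ones a contributor can directly improve, such as adding references or section headings. The sketch below shows what extracting such features from wikitext might look like; the specific feature set and regular expressions are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative extraction of "actionable" quality features from raw
# wikitext: counts a contributor can act on directly. The feature set
# here is a simplified stand-in, not the talk's actual feature set.

import re

def actionable_features(wikitext):
    """Return counts of directly improvable article properties."""
    return {
        "length": len(wikitext),
        "references": len(re.findall(r"<ref", wikitext)),
        "headings": len(re.findall(r"^==+.+?==+\s*$", wikitext, re.M)),
        "images": len(re.findall(r"\[\[(File|Image):", wikitext)),
    }

text = "== History ==\nSome text.<ref>source</ref>\n[[File:Map.png]]"
print(actionable_features(text))
```

A quality model trained on features like these can then tell a contributor not just *that* an article is low quality, but *what* to add to improve it.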

July 2014[edit]

July 16, 2014 Video: Commons YouTube

Halfak's wiki research libraries (v0.0.1)
By Aaron Halfaker: Along with quantitative research comes data and analysis code. In this presentation, Aaron will introduce you to 4 python libraries that capture code he uses on a regular basis to get his wiki research done. MediaWiki Utilities is a general data processing library that includes connectors for the API and MySQL databases as well as an XML dump parser and revert detection. Wiki-Class is a machine learning library that is designed to train, test and deploy automatic quality assessment class detection for Wikipedia articles. MediaWiki-OAuth provides a simple interface for performing an OAuth handshake with a MediaWiki installation (e.g. Wikipedia). Deltas is an experimental text difference detection library that implements cutting-edge research to track changes to Wikipedia articles and attribute authorship of content.
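One of the techniques mentioned, revert detection, is commonly implemented by comparing content checksums: a revision that restores the exact text of a recent earlier revision is an "identity revert". The sketch below illustrates that idea in plain Python; it is not the MediaWiki Utilities API, and the revision data is invented.

```python
# Illustrative identity-revert detection over a revision history,
# using SHA-1 checksums of revision text. Not the library's API.

import hashlib

def detect_reverts(revisions, radius=3):
    """Return (reverting_rev, restored_rev) pairs.

    A revision is an identity revert if its content hash matches the
    hash of a revision seen within the last `radius` revisions.
    """
    recent = []  # (hash, rev_id) for the last `radius` revisions
    reverts = []
    for rev_id, text in revisions:
        h = hashlib.sha1(text.encode("utf-8")).hexdigest()
        for past_h, past_id in recent:
            if past_h == h:
                reverts.append((rev_id, past_id))
                break
        recent.append((h, rev_id))
        recent = recent[-radius:]
    return reverts

history = [
    (1, "stable text"),
    (2, "vandalism"),
    (3, "stable text"),  # restores rev 1's content, reverting rev 2
]
print(detect_reverts(history))  # [(3, 1)]
```

The `radius` window reflects the observation that most reverts restore a very recent revision; scanning the whole history would flag coincidental repetitions as reverts.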

Using Open Data and Stories to Broaden Crowd Content
By Nathan Matias: Nathan will share a series of research findings on gender diversity online and designs for collaborative content creation that foster learning and community. He will also demo a prototype system that could leverage open data to attract and support new Wikipedia contributors.

June 2014[edit]

June 18, 2014 Video: Commons YouTube

MoodBar -- lightweight socialization improves long-term editor retention
by Giovanni Luca Ciampaglia -- I will talk about MoodBar, an experimental feature deployed on the English Wikipedia from 2011 to 2013 to streamline the socialization of newcomers. I will present results from a natural experiment that measured the effect of MoodBar on the short-term engagement and long-term retention of newly registered users attempting to edit Wikipedia for the first time. Our results indicate that a mechanism to elicit lightweight feedback and to provide early mentoring to newcomers significantly improves their chances of becoming long-term contributors.
Active Editors' Survival Models
by Leila Zia -- I will talk about first results in building prediction models for active editors' survival. A sample of such prediction models, their performance, and the important variables in predicting survival will be presented.

May 2014[edit]

May 21, 2014 Video: Commons YouTube

A bird's eye view of editor activation
by Dario Taraborelli -- In this talk I will give a high-level overview of data on new editor activation, presenting longitudinal data from the largest Wikipedias, a comparison between desktop and mobile registrations and the relative activation rates of different cohorts of newbies.
Collaboration patterns in Articles for Creation
by Aaron Halfaker -- Wikipedia needs to attract and retain newcomers while also increasing the quality of its content. Yet new Wikipedia users are disproportionately affected by the quality assurance mechanisms designed to thwart spammers and promoters. English Wikipedia’s en:WP:Articles for Creation provides a protected space for newcomers to draft articles, which are reviewed against minimum quality guidelines before they are published. In this presentation, I'll describe a study of how this drafting process has affected the productivity of newcomers in Wikipedia. Using a mixed qualitative and quantitative approach, I'll show that the process's pre-publication review, which is intended to improve the success of newcomers, in fact decreases newcomer productivity in English Wikipedia, and I'll offer recommendations for system designers.

April 2014[edit]

April 16, 2014 Video: Commons YouTube

WikiProjects yesterday, today and tomorrow
slides (presenter notes)
by Jonathan Morgan -- in this talk I'll give an overview of some research[3][4] on English Wikipedia Wikiprojects: what kind of work they do, how they do it, and how they have changed over time.
Visualizing Wikipedia Communities using Gephi
by Haitham Shammaa -- I will introduce Gephi as a tool for generating visualized representations of Wikimedia project communities. Gephi is open-source network analysis and visualization software; it is used here to generate graphs that represent users and the interactions among them, based on how frequently they send messages to each other on their talk pages.

March 2014[edit]

March 19, 2014 Video: Commons YouTube

Metrics standardization
by Dario Taraborelli -- In this talk I'll present the most recent updates on our work on metrics standardization and give a teaser of the Editor Engagement Vital Signs project.
Wikipedia: maintaining production efficiency
by Aaron Halfaker -- In Halfaker et al. (2013) we present data that show that several changes the Wikipedia community made to manage quality and consistency in the face of a massive growth in participation have ironically crippled the very growth they were designed to manage. Specifically, the restrictiveness of the encyclopedia's primary quality control mechanism and the algorithmic tools used to reject contributions are implicated as key causes of decreased newcomer retention.

February 2014[edit]

February 26, 2014 Video: Commons YouTube

Mobile session times
by Oliver Keyes -- A prerequisite to many pieces of interesting reader research is being able to accurately identify the length of users' 'sessions'. I will explain one potential way of doing it, how I've applied it to mobile readers, and what research this opens up. (slides, read more)
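A standard way to identify sessions of the kind described above is an inactivity threshold: consecutive events from the same user separated by less than a cutoff belong to one session. The sketch below illustrates that approach; the 30-minute cutoff and event times are illustrative assumptions, not the talk's findings.

```python
# Illustrative inactivity-threshold sessionization: group a sorted list
# of one user's event timestamps (in seconds) into sessions, starting a
# new session whenever the gap reaches the cutoff. Cutoff is invented.

def sessionize(timestamps, cutoff=30 * 60):
    """Split event timestamps into sessions by inter-event gaps."""
    sessions = []
    current = []
    for t in sorted(timestamps):
        if current and t - current[-1] >= cutoff:
            sessions.append(current)
            current = []
        current.append(t)
    if current:
        sessions.append(current)
    return sessions

events = [0, 120, 300, 7200, 7260]  # seconds since first pageview
sessions = sessionize(events)
print([len(s) for s in sessions])  # [3, 2]
```

Session length can then be measured as the span from a session's first to last event, which is the quantity reader research of this kind typically builds on.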
Wikipedia article creation research
by Aaron Halfaker -- A brief overview of research examining trends in newcomer article creation across 10 languages with a focus on English and German Wikipedias. In wikis where anonymous users can create articles, their articles are less likely to be deleted than articles created by newly registered editors. An in-depth analysis of Articles for Creation (AfC) suggests that while AfC's process seems to result in the publication of high quality articles, it also dramatically reduces the rate at which good new articles are published. (slides, read more)

January 2014[edit]

January 15, 2014

IP reliability tracking
by Oliver Keyes
The Wikipedia Adventure, quantitative and qualitative results from the pilot
by Jake Orlowitz (User:Ocaasi) -- We made a seven-mission gamified interactive onboarding tutorial to teach people how to edit Wikipedia in one hour. The journey involves badges, barnstars, challenges, and simulated interaction throughout a realistic quest to edit the article Earth. Game dynamics were used to create a sense of understanding, belonging, deep value identification, and technical proficiency. The use of games in open source and free culture online communities has great potential to drive participation. This talk will share the inspiration for taking a gamified approach, a review of the design highlights, and a discussion of quantitative and qualitative data and survey analysis. (slides, read more)


December 2013[edit]

December 18, 2013

Metrics standardization
by Dario Taraborelli
On the nature of Anonymous Editors
by Aaron Halfaker -- A brief discussion & critique of the use of the term "anonymous" to refer to IP editors, and a presentation of research results suggesting that newly registered users who edit anonymously right before registering their account are highly productive. (slides, read more)
Overview of Program Evaluation (beta) Reports
by Jaime Anstee -- A brief overview of the first round of reporting for programs, including a summary of the target measures along with strategies and challenges in metric standardization. Overview outline


  4. Van Hook, Steven R. "Modes and models for transcending cultural differences in international classrooms". Journal of Research in International Education 10.1 (2011): 5-27. 
  5. Kaffee, Lucie-Aimée, et al. "A Glimpse into Babel: An Analysis of Multilinguality in Wikidata." Proceedings of the 13th International Symposium on Open Collaboration. ACM, 2017.
  6. Thompson, Neil and Hanley, Douglas, Science Is Shaped by Wikipedia: Evidence from a Randomized Control Trial (September 19, 2017). Available at SSRN:
  7. Sneha Narayan, Jake Orlowitz, Jonathan Morgan, Benjamin Mako Hill, and Aaron Shaw. 2017. The Wikipedia Adventure: Field Evaluation of an Interactive Tutorial for New Users. In Proceedings of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing (CSCW '17). ACM, New York, NY, USA, 1785-1799. DOI: PDF
  10. Andrew Hall, Sarah McRoberts, Jacob Thebault-Spieker, Yilun Lin, Shilad Sen, Brent Hecht, and Loren Terveen. "Freedom versus Standardization: Structured Data Generation in a Peer Production Community", CHI 2017. PDF
  12. meta:Research:The_Rise_and_Decline
  13. meta:Research:Teahouse/Phase_2_report
  14. meta:Research:Teahouse/Phase 2 report/Metrics
  16. meta:Research:Teahouse_long_term_new_editor_retention
  17. Warncke-Wang, M, Ranjan, V., Terveen, L., and Hecht, B. "Misalignment Between Supply and Demand of Quality Content in Peer Production Communities", ICWSM 2015. pdf See also: Signpost/Research Newsletter coverage