Wikimedia Research/Data releases

This page describes the process involved in a formal open data release by the Wikimedia Research team. While this process and these guidelines are not prescriptive for other teams at the Wikimedia Foundation, we encourage anyone involved in publishing static datasets for research purposes to follow them (and help us improve them).

Definition

 * We define a formal data release (or data publication) as the process of publishing a static dataset, along with metadata and a persistent identifier, through an open data repository.
 * Optional steps in a formal data release may include:
   * on-wiki documentation;
   * a companion "dataset paper";
   * a notebook exploring the dataset;
   * a blog post.
 * API releases and datasets not meant for research (and released primarily for operational purposes) typically do not fall within the scope of a formal data release.
 * These guidelines also apply to researchers entering a formal collaboration with Wikimedia Foundation staff, as part of the open data requirements of our Open Access policy.

Conduct a privacy review
Before releasing any data other than already public datasets, you must conduct a thorough privacy review: ask the appropriate teams (Legal and Security) to review the proposed dataset, as well as the aggregation and anonymization strategy (if applicable). All datasets published by the Wikimedia Foundation are subject to our privacy policy and data retention guidelines.

Determine the appropriate license
Open datasets published by the Wikimedia Foundation will typically use CC0 as a default license/dedication. Exceptions include cases where contributions from Wikimedia editors that require attribution are included. Please consult with the Foundation's Legal team to determine the appropriate licensing scheme.

Prepare the data for publication
Prepare the dataset for publication in a suitable open format. Typical formats for open datasets are tab-separated values (TSV), comma-separated values (CSV), JSON, and RDF.
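
As an illustration of this step, the following is a minimal Python sketch that serializes a few records as tab-separated values with a header row. The rows and column names (page_id, page_title, identifier_type) are hypothetical and chosen only for the example. The sketch uses the standard library csv module and is not a description of the tooling the Research team actually uses.

```python
import csv
import io

# Hypothetical example rows; the column names are illustrative only.
ROWS = [
    {"page_id": 12, "page_title": "Anarchism", "identifier_type": "doi"},
    {"page_id": 25, "page_title": "Autism", "identifier_type": "pmid"},
]


def write_tsv(rows, handle):
    """Write a list of dicts to a file-like object as TSV with a header row."""
    fieldnames = list(rows[0].keys())
    writer = csv.DictWriter(
        handle, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)


# Write to an in-memory buffer here; a real release would open a file instead.
buffer = io.StringIO()
write_tsv(ROWS, buffer)
tsv_text = buffer.getvalue()
print(tsv_text)
```

Keeping a header row in the file makes the schema easier to document later in the metadata entry.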

Upload the dataset to a server
For large datasets (and for redundancy), it is advisable to store a copy of the dataset on a Wikimedia-maintained public server. For example: https://analytics.wikimedia.org/datasets/archive/public-datasets/ Compress the data as appropriate before uploading it.
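
The compression step could look like the following Python sketch. It gzips a file using only the standard library and then checks that the compressed copy decompresses back to the original bytes. The file name example_dataset.tsv and the choice of gzip are assumptions made for illustration; the guidelines do not prescribe a compression format, and the upload itself is not shown.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def compress_file(source):
    """Gzip `source` next to the original and return the path of the .gz file."""
    target = source.with_name(source.name + ".gz")
    with source.open("rb") as src, gzip.open(target, "wb") as dst:
        # Stream the copy so large datasets are not loaded into memory at once.
        shutil.copyfileobj(src, dst)
    return target


# Demonstration on a small temporary file with a hypothetical name.
workdir = Path(tempfile.mkdtemp())
original = workdir / "example_dataset.tsv"
original.write_text("page_id\tpage_title\n12\tAnarchism\n", encoding="utf-8")

compressed = compress_file(original)

# Sanity check: the compressed copy round-trips to the original content.
with gzip.open(compressed, "rb") as fh:
    roundtrip_ok = fh.read() == original.read_bytes()
print(compressed.name, roundtrip_ok)
```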

Create a metadata entry for the dataset
To make the dataset persistently discoverable and citable, you should create a metadata entry in an open data repository. A metadata entry also preserves the provenance of the data and identifies the organization or individual responsible for creating and maintaining the dataset. Popular open data repositories include Zenodo, Figshare, OpenDryad, and Mendeley Data. The Wikimedia Research team has used Figshare for its data releases over the years. For an example of a well-documented dataset, see:

 * Halfaker, Aaron; Mansurov, Bahodir; Redi, Miriam; Taraborelli, Dario (2018): Citations with identifiers in Wikipedia. figshare. https://doi.org/10.6084/m9.figshare.1299540.v10

A well-documented dataset typically includes:
 * The names of the authors (if applicable)
 * A descriptive title
 * Documentation of the format and schema of the dataset
 * A persistent Digital Object Identifier (assigned by the repository upon publication)
 * A license statement
 * Additional references about the dataset
 * Keywords and categories describing the dataset (for discoverability)
 * A link to the server where the resources included in the dataset are hosted (if applicable)
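
The checklist above can be sketched as a small Python helper that flags gaps in a draft metadata record before it is submitted to a repository. The field names, the split between required and optional fields, and all values in the draft are hypothetical. They are assumptions of this sketch and do not reflect the schema of Figshare, Zenodo, or any other repository.

```python
# Checklist fields taken from the list above. Treating these five as required
# for a draft entry is an assumption of this sketch, not a stated policy.
REQUIRED_FIELDS = ["title", "schema_documentation", "license", "references", "keywords"]

# "If applicable" items, plus the DOI, which the repository assigns on publication.
OPTIONAL_FIELDS = ["authors", "doi", "data_url"]


def missing_fields(record):
    """Return the required checklist fields that are absent or empty."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]


# Hypothetical draft record; every value here is a placeholder.
draft = {
    "title": "Example dataset title",
    "authors": ["Example Author"],
    "schema_documentation": "One row per page; columns: page_id, page_title.",
    "license": "CC0",
    "references": [],  # left empty on purpose to show the check
    "keywords": ["wikipedia", "example"],
    "data_url": "https://analytics.wikimedia.org/datasets/archive/public-datasets/",
}

missing = missing_fields(draft)
print(missing)
```

A check like this is only a convenience for authors; the repository's own submission form remains the authority on which fields it accepts.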

Open data repositories allow creating "metadata only" entries (where the data is fully hosted on a different server) or "regular entries" (where the entry includes a copy of the data).

Additional documentation
Further documentation of the dataset may include any of the following:
 * on-wiki documentation: see for example Research:Wikipedia clickstream (as a companion to https://doi.org/10.6084/m9.figshare.1305770)
 * a dataset paper: see for example TokTrack: A Complete Token Provenance and Change Tracking Dataset for the English Wikipedia (as a companion to https://doi.org/10.5281/zenodo.789289)
 * a notebook exploring the dataset: see for example ClickStream - Getting Started - Explorations (as a companion to https://doi.org/10.6084/m9.figshare.1305770)
 * a blog post: see for example What are the ten most cited sources on Wikipedia? Let’s ask the data (as a companion to https://doi.org/10.6084/m9.figshare.1299540).