Research Data Proposals

A lot of data is already available for research. This page lists proposals for additional data that researchers would like to see provided, along with notes on a computing platform they would like access to for working with some of the very large datasets that the Wikimedia Foundation provides.

The list was generated at WikiSym 2010; see [1].

See also the notes from "What data to collect?", another session at WikiSym (incorporate these notes here?).

Computing platform needs

Processor and Disk Space for processing dumps

Researchers often lack the processing power and disk space to handle these huge dumps (5 TB uncompressed for English Wikipedia; 30 GB in 7z). They need a shared computing platform where the dumps can be processed without first being downloaded elsewhere.

The tools created by researchers would be made publicly available; this would be a requirement for obtaining a user account.

meta:Toolserver serves this purpose to a lesser degree now.

Perhaps a cloud provider could donate in-kind services for such a platform.

Researcher tool package

A package of tools could be created that researchers might use in common; many do similar sorts of analysis.

Some basic tools for processing will be added to svn.wikimedia.org/svnroot/mediawiki/trunk/tools/analysis/

Data

API

Special:Export

When Special:Export hits its limit, it simply stops adding revisions to the output; this truncation should be indicated in the XML file and reported on import.

Dump

Additional formats

Provide the SQL dumps in CSV format as well, for easier processing.
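Until such a format exists, researchers typically convert the SQL dumps themselves. The following is a minimal sketch of that conversion, assuming the dump consists of mysqldump-style `INSERT INTO ... VALUES (...),(...);` statements with backslash-escaped, single-quoted strings. The function names `parse_insert_rows` and `insert_to_csv` are hypothetical, and writing NULL as `\N` is an assumption borrowed from MySQL's own text export convention.

```python
import csv
import io

# Backslash escapes handled inside quoted strings (a subset; assumption).
ESCAPES = {"n": "\n", "r": "\r", "t": "\t", "0": "\0", "Z": "\x1a"}


def parse_insert_rows(statement):
    """Yield each row of one 'INSERT INTO ... VALUES (...),(...);' statement."""
    body = statement[statement.index(" VALUES ") + len(" VALUES "):]
    row, field, in_string, quoted, i = None, [], False, False, 0
    while i < len(body):
        ch = body[i]
        if in_string:
            if ch == "\\":                 # escaped character: take the next one
                i += 1
                field.append(ESCAPES.get(body[i], body[i]))
            elif ch == "'":                # closing quote
                in_string = False
            else:
                field.append(ch)
        elif ch == "'":                    # opening quote
            in_string = quoted = True
        elif ch == "(" and row is None:    # start of a new row
            row = []
        elif ch in ",)" and row is not None:
            text = "".join(field)
            # An unquoted NULL is a real NULL; a quoted 'NULL' is a string.
            row.append(None if (text == "NULL" and not quoted) else text)
            field, quoted = [], False
            if ch == ")":                  # end of row
                yield row
                row = None
        elif row is not None:              # unquoted value (number or NULL)
            field.append(ch)
        i += 1


def insert_to_csv(statement, out):
    """Write the rows of one INSERT statement to 'out' as CSV."""
    writer = csv.writer(out)
    for row in parse_insert_rows(statement):
        writer.writerow(["\\N" if value is None else value for value in row])


EXAMPLE = "INSERT INTO `page` VALUES (1,0,'Foo','a,b'),(2,1,'It\\'s (x)',NULL);"
buffer = io.StringIO()
insert_to_csv(EXAMPLE, buffer)
```

The sketch does not handle doubled-quote escaping (`''`) or statements split across several reads, which is part of why an official CSV export would be preferable.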

Samples

We could provide a few standard samples, for example all pages within the scope of a reasonably sized WikiProject (a few thousand pages), that researchers could run tests against to verify that their tools are working.

Standard large-ish samples could be provided for actual statistical analysis, so that researchers could for example run quality assessment metrics of their choice against the same dataset and compare with results of others.

A package of controversial articles (for example, those of a political nature), or of articles with few incoming links but many outgoing links, could be provided based on researchers' specifications. In this case, perhaps the researchers themselves would create the scripts to generate such subsets.

XML dumps could be segmented by namespace (one dump per namespace), by timeframe (all pages created in a given year), or by title (all pages whose titles start with a character in Unicode range X to Y).

It should be possible to filter the dumps for specific types of pages: disambiguation pages, redirects, and others.
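As a rough illustration of the segmentation and filtering described above, the sketch below streams an XML dump, groups page titles by namespace and optionally skips redirects. It assumes each `<page>` carries `<title>` and `<ns>` children and that redirects carry a `<redirect>` child; the function names and the tiny `SAMPLE` document are made up for the example.

```python
import io
import xml.etree.ElementTree as ET
from collections import defaultdict


def local(tag):
    """Strip an XML namespace prefix, e.g. '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (ns, title, is_redirect) for each <page>, without loading the whole file."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local(elem.tag) != "page":
            continue
        fields = {local(child.tag): child for child in elem}
        yield int(fields["ns"].text), fields["title"].text, "redirect" in fields
        elem.clear()  # drop the page's children once it has been consumed


def split_by_namespace(stream, skip_redirects=False):
    """Return {namespace number: [titles]} for the pages in the dump."""
    buckets = defaultdict(list)
    for ns, title, is_redirect in iter_pages(stream):
        if skip_redirects and is_redirect:
            continue
        buckets[ns].append(title)
    return dict(buckets)


# Hypothetical miniature dump used only to exercise the functions.
SAMPLE = b"""<mediawiki>
  <page><title>Foo</title><ns>0</ns><id>1</id></page>
  <page><title>Bar</title><ns>0</ns><id>2</id><redirect title="Foo" /></page>
  <page><title>Talk:Foo</title><ns>1</ns><id>3</id></page>
</mediawiki>"""

by_namespace = split_by_namespace(io.BytesIO(SAMPLE))
without_redirects = split_by_namespace(io.BytesIO(SAMPLE), skip_redirects=True)
```

Disambiguation pages cannot be recognised from the page structure alone in this sketch; detecting them would require inspecting the wikitext or a separate table, which is one reason a pre-filtered dump would help.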

Recent revision sample

The revisions generated in the past 24 hours could be made available on a daily basis. Automated jobs could then run against them and produce filtered output on the shared research platform. These scripts could be maintained by the research community.

A day-to-day revisions + log events (page moves, deletions, restorations, username changes, etc.) would be useful for maintaining an up-to-date mirror or collection of meta-data. This should be much easier to generate than an entire dump. --EpochFail 16:26, 23 July 2010 (UTC)

I can see this functionality as useful in ordinary Wikipedia operations, as well as research. DGG 00:36, 24 July 2010 (UTC)

This was done several years ago :-) (Talk about slow replies, but I don't watch this page.) https://dumps.wikimedia.org/other/incr/ -- ArielGlenn (talk) 09:36, 25 May 2019 (UTC)

Additional dumps

Tracked in Phabricator: Task T27602

All Commons images
- Done on archive.org by WikiTeam (with about 6 months embargo; needs mirrors)
Centralauth tables (private dump) for accurate user counts across projects
Commons images with selection/reduction, e.g. featured pictures, good pictures, all images used by en.wiki scaled down (or implement a web interface to generate custom "dumps" and see what users need).

Additional data

Software version and the versions of all enabled extensions, with an HTTP link to the corresponding revision in the source code repository
Timestamp of when the dump was taken
~~Aliases for namespaces~~ -- done (siteinfo-namespaces.json.gz)
Interwiki prefixes
~~Magic words~~ -- done (siteinfo-namespaces.json.gz)
~~Content language(s) of wiki~~ -- done (siteinfo-namespaces.json.gz)
Specific request that was used to generate the dump or the export
Image creators, other metadata about images
~~Include the target of the redirect for pages that are redirects—added by Diederik~~ -- done
~~Revision length (since it is now populated in the db)~~ -- done
Unique ID for files dumped based on the request, record this someplace (where?)... for all requests? For scheduled dumps only?
Data about the subjects of the pages, rather than metadata about the pages themselves
Dumps with pages that have certain markup removed (i.e. infoboxes and images) to be used by language researchers (for corpora)
Type of protection added to page (nimish) -- doesn't the page_restrictions table dump cover this?
~~Add page_namespace element to complement the title element~~ -- done, see <ns> element
Add page_title element (without namespace prefix) to complement the title element
Whether the dump contains multiple revisions or not
Total number of pages in the dump, total number of revisions
Full uncompressed size of the dump file in bytes and, if possible, lines
All the metadata that would go in the filename (en, wiktionary, 20110412, pages-articles v. pages-meta-history) since this info is lost if the file is renamed, or not put in the filenames by some 3rd party sites such as hitchwiki and wikivoyage or even our own "latest" dumps
HTML dumps of Wikipedia and other wiki projects

For more items to be added to XML dump, see: [2]
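The request above to store filename metadata inside the dump can be motivated with a small sketch. It assumes dump files are named `<wiki>-<date>-<type>.xml[.bz2|.7z|.gz]`, as in the "en, wiktionary, 20110412, pages-articles" example; the pattern and the `parse_dump_name` helper are illustrative assumptions, not an official naming specification.

```python
import re
from typing import NamedTuple, Optional


class DumpName(NamedTuple):
    wiki: str   # database name, e.g. "enwiktionary"
    date: str   # "YYYYMMDD", or "latest" for the unversioned copies
    kind: str   # e.g. "pages-articles" or "pages-meta-history"


# Assumed naming pattern; real dump sets contain more variants than this.
PATTERN = re.compile(
    r"^(?P<wiki>[a-z0-9_]+)-(?P<date>\d{8}|latest)-(?P<kind>[a-z0-9-]+)"
    r"\.xml(?:\.(?:bz2|7z|gz))?$"
)


def parse_dump_name(filename: str) -> Optional[DumpName]:
    """Recover dump metadata from a filename, or None if it was renamed."""
    match = PATTERN.match(filename)
    return DumpName(**match.groupdict()) if match else None


original = parse_dump_name("enwiktionary-20110412-pages-articles.xml.bz2")
latest = parse_dump_name("enwiki-latest-pages-meta-history.xml.7z")
renamed = parse_dump_name("wiktionary_dump.xml")
```

For the "latest" copy the actual dump date is unavailable, and for the renamed file nothing can be recovered at all, which is the information loss that embedding these fields in the dump itself would avoid.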

Wikidata

An XML dump of the current version of namespace 120 (Property); it should be very small and very useful.

NLP

XML dump marked up with parts of speech for linguistic / natural language processing.

Long term storage of dumps

Keep all dumps around forever; get a tape library for this.

Usage statistics

Views and navigation

Which pages are being viewed, relationship between viewer and editor activity
Navigation paths
History of most recent session for a given user (without identification of course)
Tracking the behavior of different classes of users, e.g. casual readers, researching readers, editors reading, vandal tracking
Pages visited frequently or rarely in a given timeframe

This can be done with Domas visits logs Emijrp 17:13, 29 November 2010 (UTC)

Logging all visits to a specific set of pages (currently all visits are based on a sample of 1 out of every 1000 requests)
Get old page view stats from Matthias Schindler—done and available for download

Were they uploaded to Internet Archive? w:en:User:Emijrp/Wikipedia Archive#Domas visits logs

More complete (or visible) documentation of Domas' page view stats (formerly at dammit.lt/wikistats and now at https://dumps.wikimedia.org/other/pagecounts-raw/) - done, see that page
Log access to media files, including thumbnails; log if referrer is internal or external
Document all the existing filters in production for statistics collection, and who has which data
An API that, given a page_id and a timeframe (with, say, day or week granularity) gives an estimate of the page views of that page.
URLs of external referrer pages. These would be very useful for research on pages that link to Wikipedia, since they provide real-world examples of such pages.
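The page view API proposed in the list above could, as a first approximation, be built on the existing hourly page view files. The sketch below assumes each line has the layout `project title count bytes`, and it looks pages up by title rather than page_id because that is how those files are keyed; the `estimate_views` name, its parameters and the sample records are hypothetical. The `scale` argument is only relevant when the input comes from sampled logs, such as the 1-in-1000 request sample mentioned above.

```python
from collections import Counter
from datetime import date


def estimate_views(records, project, title, start, end, scale=1):
    """Sum per-day view counts for one page within [start, end].

    records: iterable of (day, line) pairs, where 'line' follows the
             assumed 'project title count bytes' layout.
    scale:   multiplier for sampled input (e.g. 1000 for a 1-in-1000 sample).
    """
    per_day = Counter()
    for day, line in records:
        if not (start <= day <= end):
            continue
        parts = line.split()
        if len(parts) != 4 or parts[0] != project or parts[1] != title:
            continue  # malformed line, or a different project or page
        per_day[day] += int(parts[2]) * scale
    return dict(per_day)


# Made-up records standing in for lines read from hourly files.
RECORDS = [
    (date(2010, 7, 23), "en Main_Page 10 1000"),
    (date(2010, 7, 23), "en Main_Page 5 500"),
    (date(2010, 7, 23), "de Main_Page 3 300"),
    (date(2010, 7, 24), "en Main_Page 7 700"),
    (date(2010, 7, 25), "en Main_Page 9 900"),
]

daily = estimate_views(RECORDS, "en", "Main_Page", date(2010, 7, 23), date(2010, 7, 24))
scaled = estimate_views(RECORDS, "en", "Main_Page", date(2010, 7, 23), date(2010, 7, 24), scale=1000)
```

Week granularity would follow by grouping the returned days, and a real service would additionally need a page_id to title mapping that tracks page moves.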

Misc

Logging

Log search terms from Lucene (internal) search or from Google search (referrer)
- Which searches were "successful" and which failed
- Have a session id for searches
- "Did you find what you were looking for?" at the bottom of the search page

Surveys

A standard user survey once a year, covering the questions the research community wants answered and based on a proper sample.

Registry

A registry where researchers can record the samples or subsections of dumps they used

Chunked uploading

Get chunked uploading working