Research Data Proposals

A lot of data is available for research. This is a list of proposals for data that researchers want to see provided, as well as notes about a computing platform they would like to have access to in order to work with some of the very large datasets that the Wikimedia Foundation provides.

The list was generated at WikiSym 2010; see [1].

See also notes from what data to collect?, another session at WikiSym (incorporate these notes here?).

Computing platform needs

Processor and Disk Space for processing dumps

Researchers often don't have enough processing power and disk space to process these huge dumps (5 TB uncompressed for English Wikipedia; 30 GB in 7z). They need a shared computing platform where the dumps can be processed without having to download them elsewhere.

The tools created by researchers would be made publicly available; this would be a requirement for obtaining a user account.
meta:Toolserver serves this purpose to a lesser degree now.
Perhaps some cloud provider could donate in-kind services for such a platform.

Researcher tool package

A package of tools could be created that researchers might use in common; many do similar sorts of analysis.

Some basic tools for processing will be added to

Data

API

Special:Export

When Special:Export hits its limit, it simply stops adding revisions to the output; this truncation should be indicated in the XML file and reported on import.

Dump

Additional formats

SQL dumps in CSV format for processing.
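
Until such a format is offered, a researcher could convert the SQL dumps locally. The following is a minimal sketch, assuming the dump consists of mysqldump-style `INSERT INTO ... VALUES (...),(...);` lines with single-quoted, backslash-escaped strings. The function names (`parse_insert_rows`, `sql_to_csv`) are hypothetical, and the choice to turn an unquoted NULL into an empty CSV field is an assumption, not an agreed convention.

```python
import csv
import io

# Backslash escapes that may appear inside quoted strings; any other
# escaped character (\' \" \\) simply stands for itself.
ESCAPES = {"n": "\n", "r": "\r", "t": "\t", "0": "\0"}


def _finish(chars, quoted):
    """Turn the collected characters of one field into its CSV value."""
    text = "".join(chars)
    # Assumption: an unquoted NULL becomes an empty CSV field.
    return "" if (text == "NULL" and not quoted) else text


def parse_insert_rows(line):
    """Yield one list of strings per row tuple in an INSERT ... VALUES line."""
    marker = " VALUES "
    pos = line.find(marker)
    if not line.startswith("INSERT INTO") or pos == -1:
        return  # not a data line (CREATE TABLE, comments, ...)
    i, n = pos + len(marker), len(line)
    row, field = [], []
    in_row = in_string = quoted = False
    while i < n:
        ch = line[i]
        if in_string:
            if ch == "\\" and i + 1 < n:
                # Decode the escape and skip both characters.
                nxt = line[i + 1]
                field.append(ESCAPES.get(nxt, nxt))
                i += 2
                continue
            if ch == "'":
                in_string = False
            else:
                field.append(ch)
        elif ch == "'" and in_row:
            in_string = quoted = True
        elif ch == "(" and not in_row:
            in_row, row, field, quoted = True, [], [], False
        elif ch == "," and in_row:
            row.append(_finish(field, quoted))
            field, quoted = [], False
        elif ch == ")" and in_row:
            row.append(_finish(field, quoted))
            yield row
            in_row = False
        elif in_row:
            field.append(ch)
        i += 1


def sql_to_csv(lines, out):
    """Write every row found in the given SQL dump lines to `out` as CSV."""
    writer = csv.writer(out)
    for line in lines:
        writer.writerows(parse_insert_rows(line))


if __name__ == "__main__":
    demo = ["INSERT INTO `page` VALUES (1,0,'Main_Page',NULL),(2,1,'Talk','x');\n"]
    buffer = io.StringIO()
    sql_to_csv(demo, buffer)
    print(buffer.getvalue())
```

The parser works line by line, so it never needs to hold the whole dump in memory. It does not read the column names from the CREATE TABLE statement; a CSV header would have to be added separately.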

Samples

We could provide a few standard samples, for example all pages in the domain of a reasonably sized WikiProject (a few thousand pages), that researchers could run tests on to verify that their tools are working.

Standard large-ish samples could be provided for actual statistical analysis, so that researchers could for example run quality assessment metrics of their choice against the same dataset and compare with results of others.
A packet of controversial articles (for example, those of a political nature) or those with few incoming links but many outgoing links, could be provided, based on specifications of researchers. In this case, perhaps the researchers themselves would create scripts to generate such subsets.
XML dumps could be segmented by namespace (one dump per namespace), by timeframe (all pages created in a given year), or by title (all pages whose titles start with a character in Unicode range X to Y).
It should be possible to filter the dumps for specific types of pages: disambiguation pages, redirects, other.
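
Segmentation by namespace can be approximated on the researcher's side. The sketch below streams an XML dump and groups page titles by the value of each page's `<ns>` element. It assumes the dump carries that element (older exports may not, in which case pages are grouped under `None`). The function name `split_pages_by_namespace` is hypothetical, and a real tool would write each page to a per-namespace file instead of collecting titles in memory.

```python
import io
import xml.etree.ElementTree as ET
from collections import defaultdict


def local_name(tag):
    """Strip the '{namespace-uri}' prefix that ElementTree adds to tags."""
    return tag.rsplit("}", 1)[-1]


def split_pages_by_namespace(source):
    """Return {namespace number: [page titles]} for an XML dump.

    `source` is a file name or a binary file object.
    """
    groups = defaultdict(list)
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        ns, title = None, None
        for child in elem:
            name = local_name(child.tag)
            if name == "ns":
                ns = int(child.text)
            elif name == "title":
                title = child.text
        groups[ns].append(title)
        elem.clear()  # drop the page's children to keep memory use low
    return dict(groups)


if __name__ == "__main__":
    sample = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
    <page><title>Foo</title><ns>0</ns><id>1</id></page>
    <page><title>Talk:Foo</title><ns>1</ns><id>2</id></page>
    </mediawiki>"""
    print(split_pages_by_namespace(io.BytesIO(sample)))
```

The same loop could segment by title range or creation year by changing the grouping key.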

Recent revision sample

The revisions generated in the past 24 hours could be made available on a daily basis; automated jobs could then run against them and produce filtered output on the shared collaboration research platform. These scripts could be maintained by the research community.

A day-to-day revisions + log events (page moves, deletions, restorations, username changes, etc.) would be useful for maintaining an up-to-date mirror or collection of meta-data. This should be much easier to generate than an entire dump. --EpochFail 16:26, 23 July 2010 (UTC)
I can see this functionality as useful in ordinary Wikipedia operations, as well as research. DGG 00:36, 24 July 2010 (UTC)

Additional dumps

  • All Commons images
    • Done on by WikiTeam (with about 6 months embargo; needs mirrors)
  • Centralauth tables (private dump) for accurate user counts across projects
  • Commons images with selection/reduction: e.g. featured pictures, good pictures, all images used by scaled down; or implement a web interface to generate custom "dumps" and see what users need.

Additional data

  • Software version, what versions of which extensions are enabled, with http link to revision in source code repo
  • Timestamp of when the dump was taken
  • Aliases for namespaces
  • Interwiki prefixes
  • Magic words
  • Content language(s) of wiki
  • Specific request that was used to generate the dump or the export
  • Image creators, other metadata about images
  • Include the target of the redirect for pages that are redirects—added by Diederik -- done
  • Revision length (since it is now populated in the db) -- done
  • Unique ID for files dumped based on the request, record this someplace (where?)... for all requests? For scheduled dumps only?
  • Data about the subjects of the pages, rather than metadata about the pages themselves
  • Dumps with pages that have certain markup removed (i.e. infoboxes and images) to be used by language researchers (for corpora)
  • Type of protection added to page (nimish)
  • Add page_namespace element to complement the title element -- done, see <ns> element
  • Add page_title element (without namespace prefix) to complement the title element
  • Whether the dump contains multiple revisions or not
  • Total number of pages in the dump, total number of revisions
  • Full uncompressed size of the dump file in bytes and, if possible, lines
  • All the metadata that would go in the filename (en, wiktionary, 20110412, pages-articles v. pages-meta-history) since this info is lost if the file is renamed, or not put in the filenames by some 3rd party sites such as hitchwiki and wikivoyage or even our own "latest" dumps

For more items to be added to XML dump, see: [2]
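
The point about filename metadata can be illustrated with a small sketch. It assumes the common `<dbname>-<date>-<dump type>.xml[.bz2|.7z|.gz]` naming pattern and recovers the three parts with a regular expression; the function name `parse_dump_filename` is hypothetical, and multi-part dump files are not handled. Once a file is renamed, nothing can be recovered, which is why the request is to store this information inside the dump itself.

```python
import re

# Assumed pattern: dbname, then an 8-digit date or "latest", then the dump type.
DUMP_NAME = re.compile(
    r"(?P<dbname>[a-z0-9_]+)-"
    r"(?P<date>\d{8}|latest)-"
    r"(?P<dump_type>[a-z0-9-]+)"
    r"\.xml(?:\.(?P<compression>bz2|7z|gz))?"
)


def parse_dump_filename(filename):
    """Return the metadata encoded in a dump file name, or None if absent."""
    match = DUMP_NAME.fullmatch(filename)
    return match.groupdict() if match else None


if __name__ == "__main__":
    for name in (
        "enwiktionary-20110412-pages-articles.xml.bz2",
        "enwiki-latest-pages-meta-history.xml.7z",  # the date is already lost
        "dump.xml",                                 # renamed: nothing is left
    ):
        print(name, "->", parse_dump_filename(name))
```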

Wikidata

  • An XML dump of the current version of NS 120 (Property); it should be very small and very useful.

NLP

XML dump marked up with parts of speech for linguistic / natural language processing.

Long term storage of dumps

Keep all dumps around forever, get a tape library.

Usage statistics

Views and navigation

  • Which pages are being viewed, relationship between viewer and editor activity
  • Navigation paths
  • History of most recent session for a given user (without identification of course)
  • Tracking the behavior of different classes of users, e.g. casual readers, researching readers, editors reading, vandal tracking
  • Pages visited frequently or rarely in a given timeframe
This can be done with Domas visits logs Emijrp 17:13, 29 November 2010 (UTC)
  • Logging all visits to a specific set of pages (currently all visits are based on a sample of 1 out of every 1000 requests)
  • Get old page view stats from Matthias Schindler—done and available for download
Were they uploaded to Internet Archive? w:en:User:Emijrp/Wikipedia Archive#Domas visits logs
  • More complete (or visible) documentation of Domas' page view stats (formerly at and now at) - done, see that page
  • Log access to media files, including thumbnails; log if referrer is internal or external
  • Document all the existing filters in production for statistics collection, and who has which data
  • An API that, given a page_id and a timeframe (with, say, day or week granularity) gives an estimate of the page views of that page.
  • URLs of external referrer pages. This will be very useful to research on linking pages to Wikipedia, as it will provide real-world example data of pages linking to Wikipedia.
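
The requested page view estimate could be derived from the sampled logs mentioned above. The sketch below is only an illustration: it assumes the sampled log is available as (timestamp, page_id) records, which is a hypothetical format, and it scales each sampled hit by the 1-in-1000 sampling rate stated in the list. The function name `estimate_daily_views` is also hypothetical.

```python
from collections import Counter
from datetime import datetime

# From the text: visits are logged for 1 out of every 1000 requests.
SAMPLING_FACTOR = 1000


def estimate_daily_views(sampled_hits, page_id, start, end):
    """Estimate views per day for one page within [start, end).

    `sampled_hits` is an iterable of (datetime, page_id) pairs taken from
    the sampled log (hypothetical record format).
    """
    per_day = Counter()
    for timestamp, hit_page_id in sampled_hits:
        if hit_page_id == page_id and start <= timestamp < end:
            # Each sampled request stands for about SAMPLING_FACTOR real ones.
            per_day[timestamp.date()] += SAMPLING_FACTOR
    return dict(per_day)


if __name__ == "__main__":
    hits = [
        (datetime(2010, 7, 23, 9, 0), 42),
        (datetime(2010, 7, 23, 17, 30), 42),
        (datetime(2010, 7, 24, 8, 15), 42),
        (datetime(2010, 7, 24, 8, 20), 7),
    ]
    print(estimate_daily_views(hits, 42, datetime(2010, 7, 23), datetime(2010, 7, 25)))
```

With this sampling rate the estimate is coarse for rarely viewed pages, since a page with a few hundred daily views may not appear in the sample at all. That limitation is one reason the list also asks for complete logging of a specific set of pages.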

Misc

Logging

  • Log search terms from Lucene (internal) search or from Google search (referrer)
    • Which searches were "successful" and which failed
    • Have a session id for searches
    • "Did you find what you were looking for?" at the bottom of the search page

Surveys

Standard user survey once a year with various questions the research community wants to get information about, with a proper sample etc.

Registry

A registry where researchers can record the samples or subsections of dumps they used.

Chunked uploading

Get chunked uploading working