Research Data Proposals

This page lists data that researchers would like the Wikimedia Foundation to provide, along with notes on a shared computing platform they would like to use for working with some of the Foundation's very large datasets.

The list was generated at WikiSym 2010; see.

See also the notes from "what data to collect?", another session at WikiSym. (Incorporate these notes here?)

Processor and Disk Space for processing dumps
Researchers often lack the processor and disk capacity to process these huge dumps (5 TB uncompressed for English Wikipedia; 30 GB in 7z). They need a shared computing platform where the dumps can be processed without first being downloaded elsewhere.


 * The tools created by researchers would be made publicly available; this would be a requirement for obtaining a user account.


 * Toolserver serves this purpose to a lesser degree now.


 * Perhaps a cloud provider could donate in-kind services for such a platform.
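
As a rough illustration of why processing in place matters, the sketch below streams through a dump without ever holding it in memory or writing an uncompressed copy to disk. It is a minimal example, not an existing tool: the function names and the inline sample are made up for illustration, and it assumes the dump is readable as a file-like object (for a bz2-compressed dump, `bz2.open(path, "rb")` from the Python standard library would provide one; 7z needs a third-party reader).

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace prefix, e.g. '{ns}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def count_pages_and_revisions(stream):
    """Count <page> and <revision> elements while streaming through a dump."""
    pages = revisions = 0
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "revision":
            revisions += 1
        elif name == "page":
            pages += 1
            # Drop the page's text and children; this frees the bulk of the memory.
            elem.clear()
    return pages, revisions


# Tiny hypothetical sample standing in for a real dump file.
SAMPLE = b"""<mediawiki>
  <page><title>A</title><ns>0</ns>
    <revision><id>1</id></revision>
    <revision><id>2</id></revision>
  </page>
  <page><title>Talk:A</title><ns>1</ns>
    <revision><id>3</id></revision>
  </page>
</mediawiki>"""

print(count_pages_and_revisions(io.BytesIO(SAMPLE)))  # (2, 3)
```

Even with streaming, a full-history English Wikipedia dump still takes substantial processor time to read once, which is the case for a shared platform located next to the data.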

Researcher tool package
A package of tools could be created that researchers might use in common; many do similar sorts of analysis.
 * Some basic tools for processing will be added to svn.wikimedia.org/svnroot/mediawiki/trunk/tools/analysis/

Special:Export
When Special:Export hits its limit, it simply stops adding revisions to the output. The truncation should be indicated in the XML file and reported on import.

Additional formats
Provide the SQL dumps in CSV format as well, for easier processing.
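
Until such a format exists, researchers have to convert the SQL themselves. The sketch below shows roughly what that involves, assuming the dump consists of MySQL-style `INSERT INTO ... VALUES (...),(...);` statements. The function names and the sample statement are hypothetical, and the parser is deliberately simplified: NULL is emitted as the literal text NULL, and a backslash escape such as `\n` is reduced to the character that follows the backslash.

```python
import csv
import io


def rows_from_insert(statement):
    """Yield each row of an 'INSERT ... VALUES (...),(...);' statement as a list of strings."""
    body = statement[statement.index("VALUES") + len("VALUES"):]
    row, field = [], []
    in_row = in_string = False
    i = 0
    while i < len(body):
        ch = body[i]
        if in_string:
            if ch == "\\":
                # Simplification: keep the escaped character literally.
                field.append(body[i + 1])
                i += 1
            elif ch == "'":
                in_string = False
            else:
                field.append(ch)
        elif ch == "'":
            in_string = True
        elif ch == "(" and not in_row:
            in_row = True
        elif ch == "," and in_row:
            row.append("".join(field))
            field = []
        elif ch == ")" and in_row:
            row.append("".join(field))
            yield row
            row, field, in_row = [], [], False
        elif in_row:
            field.append(ch)
        i += 1


def to_csv(rows):
    """Serialize rows with the standard csv module, which handles quoting."""
    out = io.StringIO()
    csv.writer(out).writerows(rows)
    return out.getvalue()


# Hypothetical statement in the style of a page table dump.
STATEMENT = r"INSERT INTO `page` VALUES (1,0,'Main_Page'),(2,0,'It\'s, here');"

rows = list(rows_from_insert(STATEMENT))
print(to_csv(rows))
```

The second row shows why a naive split on commas is not enough: the title contains both an escaped quote and a comma, and only comes out intact because the parser tracks whether it is inside a string.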

Samples
We could provide a few standard samples that researchers could run tests on to verify that their tools work: for example, a few thousand pages covering everything within the scope of a reasonably sized WikiProject.


 * Standard large-ish samples could be provided for actual statistical analysis, so that researchers could for example run quality assessment metrics of their choice against the same dataset and compare with results of others.


 * A packet of controversial articles (for example, those of a political nature), or of articles with few incoming links but many outgoing links, could be provided based on researchers' specifications. In this case, the researchers themselves might write the scripts that generate such subsets.


 * XML dumps could be segmented by namespace (one dump per namespace), by timeframe (all pages created in a given year), or by title (all pages with titles starting in Unicode range X to Y).


 * Allow the dumps to be filtered for specific types of pages: disambiguation pages, redirects, and others.
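
The segmentation and filtering ideas above can be sketched as a single streaming pass over a dump. This is only an illustration under stated assumptions: it assumes each `<page>` carries an `<ns>` child and that redirect pages carry a `<redirect>` child, as in recent versions of the export schema; the function names and the sample data are hypothetical. Disambiguation pages are not handled, since identifying them requires inspecting templates or categories.

```python
import io
import xml.etree.ElementTree as ET
from collections import defaultdict


def local_name(tag):
    """Strip an XML namespace prefix, e.g. '{ns}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def titles_by_namespace(stream, skip_redirects=False):
    """Group page titles by namespace number, optionally dropping redirects."""
    groups = defaultdict(list)
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        # Index the page's direct children by tag name.
        fields = {local_name(child.tag): child for child in elem}
        is_redirect = "redirect" in fields
        if not (skip_redirects and is_redirect):
            groups[int(fields["ns"].text)].append(fields["title"].text)
        elem.clear()  # free the page once it has been classified
    return dict(groups)


# Hypothetical three-page sample.
SAMPLE = b"""<mediawiki>
  <page><title>A</title><ns>0</ns></page>
  <page><title>B</title><ns>0</ns><redirect title="A" /></page>
  <page><title>Talk:A</title><ns>1</ns></page>
</mediawiki>"""

print(titles_by_namespace(io.BytesIO(SAMPLE)))
# {0: ['A', 'B'], 1: ['Talk:A']}
print(titles_by_namespace(io.BytesIO(SAMPLE), skip_redirects=True))
# {0: ['A'], 1: ['Talk:A']}
```

A production version would write each namespace to its own output file instead of collecting titles in memory, but the classification step would look much the same.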

Recent revision sample
The revisions generated in the past 24 hours could be made available daily. Various automated jobs could then run against them and produce filtered output on the shared research platform. These scripts could be maintained by the research community.
 * Day-to-day revisions plus log events (page moves, deletions, restorations, username changes, etc.) would be useful for maintaining an up-to-date mirror or collection of metadata. This should be much easier to generate than an entire dump. --EpochFail 16:26, 23 July 2010 (UTC)
 * I can see this functionality as useful in ordinary Wikipedia operations, as well as research. DGG 00:36, 24 July 2010 (UTC)
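
A daily job of this kind mostly comes down to selecting revisions by timestamp. The sketch below is a minimal illustration, assuming revisions are available as records with a `timestamp` field in the ISO 8601 UTC form used in the XML dumps (for example `2010-07-23T16:26:00Z`); the record layout, field names and sample values are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def parse_ts(ts):
    """Parse a dump-style UTC timestamp such as '2010-07-23T16:26:00Z'."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def recent_revisions(revisions, now, window=timedelta(hours=24)):
    """Return the revisions whose timestamp falls in the window ending at `now`."""
    cutoff = now - window
    return [r for r in revisions if cutoff < parse_ts(r["timestamp"]) <= now]


# Hypothetical revision records.
REVISIONS = [
    {"rev_id": 101, "timestamp": "2010-07-23T16:26:00Z"},
    {"rev_id": 102, "timestamp": "2010-07-22T23:00:00Z"},
    {"rev_id": 103, "timestamp": "2010-07-24T00:00:00Z"},
]

now = datetime(2010, 7, 24, 0, 36, tzinfo=timezone.utc)
print([r["rev_id"] for r in recent_revisions(REVISIONS, now)])  # [101, 103]
```

Passing `now` explicitly, rather than reading the clock inside the function, keeps each daily run reproducible and makes it easy to regenerate the output for a past day.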

Additional dumps

 * Centralauth tables (private dump) for accurate user counts across projects

Additional data

 * Software version, what versions of which extensions are enabled
 * Timestamp of when the dump was taken
 * Aliases for namespaces
 * Content language of wiki
 * Specific request that was used to generate the dump or the export
 * Image creators, other metadata about images
 * Include the target of the redirect for pages that are redirects
 * Revision length (since it is now populated in the db)
 * Unique ID for files dumped based on the request, record this someplace (where?)... for all requests? For scheduled dumps only?
 * Data about the subjects of the pages, rather than metadata about the pages themselves
 * Dumps with pages that have certain markup removed (e.g. infoboxes and images) to be used by language researchers (for corpora)
 * Type of protection added to page (nimish)

For more items to be added to XML dump, see:

NLP
XML dump marked up with parts of speech for linguistic / natural language processing.

Long term storage of dumps
Keep all dumps around forever; get a tape library.

Views and navigation

 * Which pages are being viewed, relationship between viewer and editor activity
 * Navigation paths
 * History of most recent session for a given user (without identification of course)
 * Tracking the behavior of different classes of users, e.g. casual readers, researching readers, editors reading, vandal tracking
 * Pages visited frequently or rarely in a given timeframe
 * Logging all visits to a specific set of pages (currently, visit statistics are based on a sample of 1 out of every 1000 requests)
 * Get old page view stats from Matthias Schindler
 * More complete (or visible) documentation of Domas' page view stats (http://dammit.lt/wikistats)
 * Log access to media files, including thumbnails; log if referrer is internal or external
 * Document all the existing filters in production for statistics collection, and who has which data
 * An API that, given a page_id and a timeframe (with, say, day or week granularity) gives an estimate of the page views of that page.
 * URLs of external referrer pages. This will be very useful to research on linking pages to Wikipedia, as it will provide real-world example data of pages linking to Wikipedia.
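
To make the sampling point concrete, the sketch below shows how a page view estimate for a given page and timeframe could be derived from a 1-in-1000 sample, in the spirit of the API proposed above. It is only an illustration: the function name, the log layout (pairs of page_id and date) and the sample values are hypothetical, and the error estimate assumes each request is sampled independently at random, which may not match how the production sampling works.

```python
import math
from datetime import date


def estimate_views(sampled_hits, page_id, start, end, sample_every=1000):
    """Estimate total views of a page between start and end (inclusive).

    sampled_hits is an iterable of (page_id, day) pairs taken from a log
    that keeps one request out of every `sample_every`.
    """
    k = sum(1 for pid, day in sampled_hits if pid == page_id and start <= day <= end)
    estimate = k * sample_every
    # Rough standard error, assuming independent random sampling (binomial model).
    std_error = sample_every * math.sqrt(k * (1 - 1 / sample_every))
    return estimate, std_error


# Hypothetical sampled log entries.
HITS = [
    (12, date(2010, 7, 20)),
    (12, date(2010, 7, 21)),
    (12, date(2010, 7, 21)),
    (12, date(2010, 7, 30)),
    (7, date(2010, 7, 21)),
]

estimate, std_error = estimate_views(HITS, 12, date(2010, 7, 19), date(2010, 7, 25))
print(estimate)  # 3000
print(round(std_error))
```

With only three sampled hits the standard error is more than half the estimate, which is why rarely visited pages cannot be measured reliably from the sample and why full logging for a chosen set of pages is requested.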

Logging

 * Log search terms from Lucene (internal) search or from Google search (referrer)
 * Which searches were "successful" and which failed
 * Have a session id for searches
 * "Did you find what you were looking for?" at the bottom of the search page

Surveys
Run a standard user survey once a year, with a proper sample, covering the questions the research community wants answered.

Registry
A registry where researchers can record the samples or subsections of dumps they used.

Chunked uploading
Get chunked uploading working