Research Data Proposals

This is a list of proposals for data that researchers want to see provided, as well as notes about a computing platform they would like to have access to in order to work with some of the very large datasets that the WMF provides.

The list was generated at WikiSym 2010.

See also the notes from "What data to collect?", another session at WikiSym. (Should those notes be incorporated here?)

---

Computing platform needs:


 * Researchers often lack the processing power and disk space to work with these huge dumps (about 5 TB uncompressed for the English Wikipedia); they need a shared computing platform where the dumps could be processed without first being downloaded elsewhere.
 * The tools created by researchers would be made available publicly; this would be a requirement of obtaining a user account.
 * Toolserver.org serves this purpose to a lesser degree now.
 * A package of tools could be created that researchers might use in common; many do similar sorts of analysis.
 * Perhaps some cloud provider could donate in-kind services for such a platform.

XML data dump samples:


 * We could provide a few standard samples, for example all pages in a reasonably sized WikiProject's domain (a few thousand pages), that researchers could run their tools against to verify that they are working
 * Standard larger samples could be provided for actual statistical analysis, so that researchers could, for example, run quality assessment metrics of their choice against the same dataset and compare their results with others'
 * A packet of controversial articles (for example, those of a political nature), or of articles with few incoming links but many outgoing links, could be provided based on researchers' specifications. In that case, perhaps the researchers themselves would write the scripts to generate such subsets.
 * The revisions made in the past 24 hours could be published daily; various automated jobs could run against them on the shared research platform and produce filtered output. These scripts would be maintained by the research community.
 * XML dumps could be segmented by namespace (one dump per namespace), by timeframe (all pages created in a given year), or by title (all pages whose titles fall in Unicode range X to Y)
 * Ability to filter the dumps for specific types of pages: disambiguation pages, redirects, and so on
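A namespace or page-type filter of the kind proposed above can be sketched with a streaming parse, so that multi-gigabyte dumps never need to fit in memory. The element names below follow the MediaWiki export schema, but the inline sample is deliberately simplified (no schema namespace URI, no revisions), so treat it as an illustration rather than a drop-in tool:

```python
import io
import xml.etree.ElementTree as ET

# Tiny inline sample in the shape of the MediaWiki export schema
# (simplified: real dumps carry a namespace URI and many more fields).
SAMPLE = """<mediawiki>
  <page><title>Foo</title><ns>0</ns></page>
  <page><title>Talk:Foo</title><ns>1</ns></page>
  <page><title>Bar</title><ns>0</ns><redirect title="Foo"/></page>
</mediawiki>"""

def filter_pages(stream, namespaces=None, skip_redirects=False):
    """Stream pages from a dump, keeping only the requested namespaces."""
    kept = []
    for _, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "page":
            ns = int(elem.findtext("ns", "0"))
            is_redirect = elem.find("redirect") is not None
            if (namespaces is None or ns in namespaces) and not (
                skip_redirects and is_redirect
            ):
                kept.append(elem.findtext("title"))
            elem.clear()  # free memory; essential for multi-GB dumps
    return kept

print(filter_pages(io.StringIO(SAMPLE), namespaces={0}, skip_redirects=True))
# prints ['Foo']
```

The same loop could just as easily re-serialize the kept pages into a per-namespace output file instead of collecting titles.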

Usage statistics:


 * Which pages are being viewed, and the relationship between viewer and editor activity
 * Navigation paths
 * History of most recent session for a given user (without identification of course)
 * Tracking the behavior of different classes of users, e.g. casual readers, researching readers, editors reading, vandal tracking
 * Pages visited frequently or rarely in a given timeframe
 * Logging all visits to a specific set of pages (currently visit logging is based on a sample of 1 out of every 1000 requests)
 * Get old page view stats from Matthias Schindler
 * More complete (or visible) documentation of Domas' page view stats
 * Log access to media files, including thumbnails; log if referrer is internal or external
 * Document all the existing filters in production for statistics collection, and who has which data
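For reference, Domas' hourly page view files are plain text with one "project title count bytes" line per page; a minimal reader might look like the sketch below. The field semantics are assumed from the common description of those files and should be checked against the fuller documentation requested above:

```python
# Each line of an hourly pagecounts file is "project title count bytes",
# e.g. "en Main_Page 242332 4737756101". The meaning of each field here
# is an assumption; verify against the official stats documentation.
def parse_pagecounts(lines, project=None):
    """Yield (project, title, views) tuples from pagecounts-format lines."""
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip malformed or truncated lines
        proj, title, count, _bytes = parts
        if project is None or proj == project:
            yield proj, title, int(count)

sample = ["en Main_Page 242332 4737756101", "de Hauptseite 99999 123456"]
print(list(parse_pagecounts(sample, project="en")))
# prints [('en', 'Main_Page', 242332)]
```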

XML data additional information:


 * Software version, what versions of which extensions are enabled
 * Timestamp of when the dump was taken
 * Aliases for namespaces
 * Content language of wiki
 * Specific request that was used to generate the dump or the export
 * Image creators, other metadata about images
 * Include the target of the redirect for pages that are redirects
 * Revision length (since it is now populated in the db)
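Much of the metadata listed above would naturally extend the <siteinfo> header that dumps already carry, which today includes fields such as the site name, the generating MediaWiki version, and the namespace list. A sketch of reading those existing fields from a simplified, hypothetical header (real headers carry a schema namespace URI and more fields):

```python
import io
import xml.etree.ElementTree as ET

# Simplified stand-in for the <siteinfo> block at the top of a dump.
SITEINFO = """<mediawiki>
  <siteinfo>
    <sitename>Wikipedia</sitename>
    <generator>MediaWiki 1.16</generator>
    <namespaces>
      <namespace key="0"/>
      <namespace key="1">Talk</namespace>
    </namespaces>
  </siteinfo>
</mediawiki>"""

def read_siteinfo(stream):
    """Return the generator string and a {key: name} namespace map."""
    root = ET.parse(stream).getroot()
    si = root.find("siteinfo")
    namespaces = {
        int(n.get("key")): (n.text or "")  # main namespace has no name
        for n in si.find("namespaces")
    }
    return si.findtext("generator"), namespaces

print(read_siteinfo(io.StringIO(SITEINFO)))
# prints ('MediaWiki 1.16', {0: '', 1: 'Talk'})
```

Extension versions, the dump timestamp, namespace aliases, and the content language would all fit the same header without changing how per-page data is laid out.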

Misc other:


 * XML dump marked up with parts of speech for linguistic / natural language processing
 * Keep all dumps around forever; get a tape library
 * Log search terms from Lucene (internal) search or from Google search (referrer)
 * Which searches were "successful" and which failed
 * Have a session id for searches
 * "Did you find what you were looking for?" at the bottom of the search page
 * Standard user survey once a year with various questions the research community wants to get information about, with a proper sample etc.
 * SQL dumps in CSV format for processing
 * When Special:Export hits its limit, it simply stops adding revisions to the output; this truncation should be indicated in the XML file and flagged on import
 * Unique ID for files dumped based on the request, record this someplace (where?)... for all requests? For scheduled dumps only?
 * A registry where researchers record which samples or subsections of the dumps they used
 * Get chunked uploading working
 * XML dumps with data about the subjects of the pages, rather than metadata about the pages themselves
 * Dumps with pages that have certain markup removed (e.g. infoboxes and images) to be used by language researchers (for corpora)
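A first cut at the markup removal proposed in the last item could be regex-based, as in the sketch below. This is naive on purpose: deeply nested templates and the full link syntax really need a proper wikitext parser, and the patterns shown are assumptions for illustration, not an established tool:

```python
import re

# Strip non-nested {{...}} blocks and [[File:...]]/[[Image:...]] links.
# Repeating the template pass lets inner templates expose outer ones,
# but pathological nesting still calls for a real parser.
TEMPLATE = re.compile(r"\{\{[^{}]*\}\}")
IMAGE = re.compile(r"\[\[(?:File|Image):[^\[\]]*\]\]")

def strip_markup(text):
    """Remove templates and image links, leaving plain corpus text."""
    prev = None
    while prev != text:
        prev = text
        text = TEMPLATE.sub("", text)
    return IMAGE.sub("", text).strip()

sample = ("{{Infobox person|name=Ada}}Ada Lovelace was a mathematician. "
          "[[File:Ada.jpg|thumb]]")
print(strip_markup(sample))
# prints: Ada Lovelace was a mathematician.
```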

For more items to be added to the XML dumps, see:

Want more things in the XML dumps? Add them here.

 * type of protection added to page (nimish)

Some basic tools
Some basic tools for processing will be added to svn.wikimedia.org/svnroot/mediawiki/trunk/tools/analysis/