Extension:JsonConfig/Tabular

This page describes an implementation of the DataNamespace proposal using JsonConfig's Tabular content support.

Tabular content is machine-readable data similar to the CSV and TSV formats. It allows any user to create a page, e.g. "Data:List of interesting facts.tab" (demo), and keep it as a table rather than as wikitext. Tabular storage allows strings, numbers, booleans (true/false), and "localized strings" – strings that have a different value depending on the language. Eventually, it would be good to also implement Q number support, allowing direct links to Wikidata.

Additionally, tabular data can store metadata, such as localized description and data source. More metadata can be added as needed.

Tabular storage greatly simplifies storing data for lists, tables, and graphs. Graphs may access tabular data directly, and on-wiki tables and lists can be created with simple Lua scripts. This storage is fundamentally different from Wikidata, because it works with "blobs" (batches) of data, whereas Wikidata works with tiny "facts". Wikidata technology is simply not suited for large datasets such as the list of the most expensive paintings, the shoe size comparison table, or the data needed to plot a graph of Moscow subway growth.

After a long discussion, it seems Commons is the best fit for such data, and the Commons community overwhelmingly supports it. The Commons community already has good experience with international, multi-licensed content. Feel free to experiment with it at http://data.wmflabs.org/wiki/Data:Sample.tab. Note that you can view it in different languages, e.g. http://data.wmflabs.org/wiki/Data:Sample.tab?uselang=fr

Usage
All tabular data will be stored in the Data namespace on Commons, with a ".tab" page title suffix, e.g. Data:My list.tab

The data will be accessible from all other wikis by:
 * Graphs, which will be able to use the data directly via the "wikitabular:///My list.tab" graph protocol (no need for the Data namespace prefix)
   * Is wikitabular a good name? Alternatives: wikitabdata, wikitab, ...?
   * Should the page title in the URL include the ".tab" suffix?
 * Scribunto (Lua) modules, via the mw.data.getData( 'My list.tab' ) function. The data will be returned as the parsed JSON of the raw page content, so a Lua module will also be able to access all metadata fields. This function is not specific to tabular data. We might also want to introduce mw.data.getTabularData to get data with localized strings resolved for a specific language.

Unless a good reason is given, there will be no way to access data directly from wiki markup. Introducing a wikimarkup function to get a single cell's value seems overly complex and error-prone; such access should be done via Lua functions.
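
As a rough illustration of what "localized strings resolved for a specific language" could mean, here is a minimal sketch. It is written in Python purely for illustration and is not the Lua API. The function names, the representation of a localized string as a language-code mapping, and the fallback to English are all assumptions, not part of the proposal.

```python
from typing import Any, List


def resolve_localized(value: Any, lang: str, fallback: str = "en") -> Any:
    """Pick one translation from a localized-string mapping.

    Non-mapping values (strings, numbers, booleans) are returned unchanged.
    The fallback order (requested language, then `fallback`, then any
    available translation) is an assumption made for this sketch.
    """
    if isinstance(value, dict):
        if lang in value:
            return value[lang]
        if fallback in value:
            return value[fallback]
        # Neither language is present: return any translation, or None if empty.
        return next(iter(value.values()), None)
    return value


def resolve_rows(rows: List[list], lang: str, fallback: str = "en") -> List[list]:
    """Resolve every localized cell in a list of rows."""
    return [[resolve_localized(cell, lang, fallback) for cell in row] for row in rows]


# Hypothetical sample rows: string, number, boolean, localized string.
rows = [
    ["first", 12.5, True, {"en": "apple", "fr": "pomme"}],
    ["second", 3, False, {"en": "pear"}],
]
resolved = resolve_rows(rows, "fr")
```

With the sample rows above, the first row resolves to its French label, while the second row, which has no French translation, falls back to English.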

Example
This sample creates a table with 4 columns of 4 different types.
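
To make the four value types concrete, the following Python sketch builds a table-like structure and checks each cell against its column type. The "info" and "source" metadata fields come from the current proposal; the "headers", "types", and "rows" keys, the type names, and the sample values are hypothetical and may not match the actual page format.

```python
import json

# Hypothetical layout: only "info" and "source" are named in the proposal.
table = {
    "info": {"en": "Sample table", "fr": "Tableau d'exemple"},
    "source": "Made-up values for illustration",
    "headers": ["name", "count", "is_fruit", "label"],
    "types": ["string", "number", "boolean", "localized"],
    "rows": [
        ["apple", 3, True, {"en": "Apple", "fr": "Pomme"}],
        ["stone", 1.5, False, {"en": "Stone"}],
    ],
}


def cell_matches(value, type_name):
    """Return True if `value` is valid for the given column type."""
    if type_name == "string":
        return isinstance(value, str)
    if type_name == "number":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if type_name == "boolean":
        return isinstance(value, bool)
    if type_name == "localized":
        return (
            isinstance(value, dict)
            and len(value) > 0
            and all(isinstance(k, str) and isinstance(v, str) for k, v in value.items())
        )
    return False


def validate(tbl):
    """Collect a human-readable error for every cell that breaks its column type."""
    errors = []
    for r, row in enumerate(tbl["rows"]):
        if len(row) != len(tbl["types"]):
            errors.append(f"row {r}: expected {len(tbl['types'])} cells, got {len(row)}")
            continue
        for c, (value, type_name) in enumerate(zip(row, tbl["types"])):
            if not cell_matches(value, type_name):
                errors.append(f"row {r}, column {c}: not a {type_name}")
    return errors


errors = validate(table)
# The page content would be stored as JSON text.
page_text = json.dumps(table, ensure_ascii=False, indent=2)
```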

Future Plans

 * CSV and TSV copy/paste form to simplify transferring data to and from external spreadsheets.
 * An in-place spreadsheet table editor
 * Wikidata support, allowing direct links to localized Wikidata content
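
The planned copy/paste form could work roughly as in the sketch below, which converts pasted TSV text into typed rows and back. This is only an illustration in Python using the standard csv module; the function names and the type-guessing rules ("true"/"false" become booleans, numeric text becomes numbers) are assumptions.

```python
import csv
import io


def parse_cell(text):
    """Guess a cell's type from its text (assumed rules, for illustration only)."""
    if text == "true":
        return True
    if text == "false":
        return False
    try:
        return int(text)
    except ValueError:
        pass
    try:
        return float(text)
    except ValueError:
        return text


def tsv_to_rows(pasted):
    """Split pasted TSV into a header list and a list of typed rows."""
    reader = csv.reader(io.StringIO(pasted), delimiter="\t")
    lines = [line for line in reader if line]  # skip blank lines
    if not lines:
        return [], []
    headers, body = lines[0], lines[1:]
    return headers, [[parse_cell(cell) for cell in row] for row in body]


def rows_to_tsv(headers, rows):
    """Serialize headers and rows back to TSV for pasting into a spreadsheet."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(headers)
    for row in rows:
        writer.writerow(
            ["true" if cell is True else "false" if cell is False else cell for cell in row]
        )
    return out.getvalue()


pasted = "name\tcount\tok\napple\t3\ttrue\npear\t1.5\tfalse\n"
headers, rows = tsv_to_rows(pasted)
```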

TBD / Questions / Ideas

 * Licenses - if requested per the licensing discussion, how should the license be stored to make it machine-readable and avoid untranslatable and unparsable free text?
 * We may choose to deploy without license support (public domain only), and later add licensing capability. Yurik (talk) 18:20, 30 April 2016 (UTC)
 * What metadata is needed? Current proposal has "source" (string) and "info" (localized string), but we might need more.
 * Support for specifying data source(s). This is to avoid WW3 about which data is "right" (the truth). Some frequently used sources might need a shortcut, while other sources will be used less often.
 * Is it enough to have one source for the whole table, or should we introduce a new data type called "source" to allow per-row sourcing? Per-row sourcing could of course be added later. Ideally, we should also support multiple references, just like Wikidata. It would also be good to have multiple pairs of "source type" and "source value", similar to Wikidata's property->value structure. Yurik (talk) 18:20, 30 April 2016 (UTC)


 * Cross-datacenter cache invalidation - JsonConfig supports remote cache invalidation, but it uses a MediaWiki API call for that. Which server should Commons contact to notify it of a data change?

Internal cross-wiki usage
Cross-wiki data usage is based on the existing JsonConfig mechanisms that have been in production use (Wikipedia Zero) for the past few years. JsonConfig supports multiple content handlers and can easily be used for a cross-wiki shared data namespace.

The JCSingleton::getContent implementation gets the content for a given page title, even if that page is located on another wiki. It first checks whether the content is stored in memcached (JCCache::get). The memcached key is not wiki-specific, allowing different wikis to share the same content object. In case of a cache miss, the page is loaded locally (when the current wiki is the storage wiki for that title) or remotely via a query API call, and then cached.
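
The lookup described above is a cache-aside read. The following Python sketch models it with an in-memory dict standing in for memcached and a direct method call standing in for the remote query API call. The class names, method names, and key format are hypothetical and do not reflect JsonConfig's actual PHP code.

```python
class SharedCache:
    """Stand-in for memcached: one dict shared by every wiki."""

    def __init__(self):
        self.store = {}

    def get(self, key):
        return self.store.get(key)

    def set(self, key, value):
        self.store[key] = value


class Wiki:
    def __init__(self, name, cache, storage_wiki=None):
        self.name = name
        self.cache = cache
        self.pages = {}                   # pages stored on this wiki
        self.storage_wiki = storage_wiki  # None means this wiki is the storage wiki
        self.remote_calls = 0             # counts simulated API calls

    def load_local(self, title):
        return self.pages[title]

    def load_remote(self, title):
        # Stand-in for a query API call to the storage wiki.
        self.remote_calls += 1
        return self.storage_wiki.load_local(title)

    def get_content(self, title):
        # The key contains no wiki name, so all wikis share one cache entry.
        key = f"jsonconfig:{title}"
        content = self.cache.get(key)
        if content is None:
            if self.storage_wiki is None:
                content = self.load_local(title)
            else:
                content = self.load_remote(title)
            self.cache.set(key, content)
        return content


cache = SharedCache()
commons = Wiki("commons", cache)
commons.pages["My list.tab"] = {"rows": [[1, 2]]}
enwiki = Wiki("enwiki", cache, storage_wiki=commons)
frwiki = Wiki("frwiki", cache, storage_wiki=commons)

first = enwiki.get_content("My list.tab")   # cache miss: one remote call
second = frwiki.get_content("My list.tab")  # cache hit: no remote call
```

Because the key is shared, the second wiki is served from the entry the first wiki populated and never contacts the storage wiki.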

When the page changes (JCSingleton::onArticleChangeComplete), memcached is updated with the new value, and optionally an API call is made to a remote server to notify it that its cache should be updated. This could help with cross-datacenter cache invalidation.
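
The write path can be sketched in the same spirit. In this Python illustration a dict stands in for memcached and a plain callable stands in for the optional remote API notification. All names and the key format are hypothetical, and the remote side simply drops its stale entry, which is one possible reading of "the cache should be updated".

```python
class StorageWiki:
    def __init__(self, cache, notify=None):
        self.cache = cache    # dict standing in for the local memcached
        self.pages = {}
        self.notify = notify  # optional callable standing in for the remote API call

    def save(self, title, content):
        self.pages[title] = content
        self.on_change_complete(title, content)

    def on_change_complete(self, title, content):
        # Overwrite the cached value so that readers see the new content.
        self.cache[f"jsonconfig:{title}"] = content
        # Optionally tell a remote server that its cached copy is out of date.
        if self.notify is not None:
            self.notify(title)


local_cache = {}
remote_cache = {"jsonconfig:My list.tab": "stale"}


def invalidate_remote(title):
    """Simulated remote handler: drop the stale entry so it is reloaded on next read."""
    remote_cache.pop(f"jsonconfig:{title}", None)


wiki = StorageWiki(local_cache, notify=invalidate_remote)
wiki.save("My list.tab", "fresh")
```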

Configuration
JsonConfig uses a very flexible (if somewhat complicated) settings system. Both the Commons wiki and all other wikis will need this code block to set up cross-wiki shareable storage:

Commons wiki will need to specify that data should be stored locally:

Other wikis will need to set how to access remote data: