Extension:External Data

Description
External Data is a MediaWiki extension that allows for using values on a wiki page that were retrieved from either an external URL that contains data in XML, CSV or JSON formats, or a local wiki page that contains data in CSV format.

It defines three parser functions: #get_external_data, #external_value and #for_external_table. #get_external_data retrieves the data from a URL that holds XML, CSV or JSON data, and assigns it to variables on the page; #external_value displays the value of any such variable; and #for_external_table cycles through all the values retrieved for a set of variables, displaying the same "container" text for each one.

The values of XML and JSON files are determined simply from the content of tags, regardless of the tree structure. So, for example, if the tag " red " appeared anywhere in an XML file, the value "red" would become associated with the variable "color".

Currently, a CSV file must be literally a CSV file, i.e., delimited by commas instead of tabs or any other character, for parsing to work.

You can also set caching to be done on the data retrieved, and string replacement to hide API keys; see the "Usage" section, below, for how to do both of those.

Download
You can download the External Data code in either one of these two compressed files:


 * external_data_0.5.tar.gz
 * external_data_0.5.zip

You can also download the code directly via SVN from the MediaWiki source code repository, at http://svn.wikimedia.org/svnroot/mediawiki/trunk/extensions/ExternalData/. From a command line, you can call the following:

svn checkout http://svn.wikimedia.org/svnroot/mediawiki/trunk/extensions/ExternalData/

You can also view the code online here.

Installation
To install this extension, create an 'ExternalData' directory (either by extracting a compressed file or downloading via SVN), and place this directory within the main MediaWiki 'extensions' directory. Then, in the file 'LocalSettings.php', add the following line:

Authors
External Data was written by Yaron Koren, reachable at yaron57 -at- gmail.com; and Michael Dale.

Getting a single set of values
To get data from an external URL, call the following:

An explanation of the fields: The other parameters are divided into two types:
 * URL is the full URL of the XML, CSV or JSON file.
 * format specifies the format of the file: it should be one of either 'XML', 'CSV', 'CSV with header' or 'JSON'. The difference between 'CSV' and 'CSV with header' is that 'CSV' is simply a set of lines with values; while in 'CSV with header', the first line is a "header", holding a comma-separated list of the name of each column.
 * parameters containing a '==' are filters - they do additional filtering on the set of rows being returned. It is not necessary to use any filters; most APIs, it is expected, will provide their own filtering ability through the URL's query string.
 * parameters containing a '=' are mappings - they let you connect local variable names to external variable names. External variable names are the names of the values in the file (in the case of a header-less CSV file, the names are simply the indexes of the values ("1", "2", "3" etc.)), and local variable names are the names that are later passed in to #external_value.

More than one #get_external_data call can be used in a page. If this happens, though, make sure that every local variable name is unique.

If this call retrieved a single value for each variable specified, you can call the following:

As an example, this page contains the following text:

.
 * Germany borders the following countries and bodies of water:
 * Germany has population.

The page gets data from a URL at semanticweb.org, generated by the Semantic MediaWiki extension. That URL contains the following text:

Germany,"North Sea,Denmark,Baltic Sea,Poland,Czech Republic,Austria,Switzerland,France,Luxembourg,Belgium,Netherlands","82,411,000",3.5705e+11 m²

The page then uses #external_value to display the 'bordered countries' and 'population' values; although it uses the #arraymap function, defined by the Semantic Forms extension, to apply some transformations to the 'bordered countries' value (you can ignore this detail if you want).

Getting a table of data
A URL can also contain a "table" of data (many values per field), instead of just a single "row" (one value per field). If the data is in CSV or XML format (this is not yet supported for JSON), you can use #get_external_data to retrieve such a table, and then call the function #for_external_table to display it. For example, this URL at semanticweb.org contains information similar to that above, but for all the countries in Africa instead of just one country. Calling #get_external_data with this URL, with the same format as above, will set the local variables to contain arrays of data, rather than single values. You can then call #for_external_table, which has the following format:

...where "expression" is a string that contains one or more variable names, surrounded by triple brackets. This string is then displayed for each retrieved "row" of data.

For an example, this page contains a call to #get_external_data for the semanticweb.org URL mentioned above, followed by this call:

The call to #for_external_table produces a single row of a table, in wiki-text; it's surrounded by wiki-text to create the top and bottom of the table. The presence of " | " is a standard MediaWiki trick to display pipes from within parser functions; to get it to work, you just have to create a page called "Template:!" that contains a single pipe. There are much easier calls to #for_external_table that can be made, if you just want to display a line of text per data "row", but an HTML table is the standard approach.

Getting data from a non-API source
You may have a set of data that you want to be accessed throughout the wiki, but that unfortunately does not have a web-based API to access it. A simple example is a table contained in a spreadsheet; another is a database table. For this type of data, External Data provides a way to easily create an API of your own. To get this working, first place the data you want accessed in its own wiki page, in CSV format, with the headers as the top row of data (see here for an example). Then, the special page 'GetData' will provide an "instant API" for accessing either certain rows of that data, or the entire table. By adding "field-name=value" to the URL, you can limit the set of rows returned.

A URL for the 'GetData' page can then be used in a call to #get_external_data, just as any other data URL would be; the data will be returned as a CSV file with a header row, so the format argument of #get_external_data should be set to 'CSV with header'. See here for an example of such data being retrieved and displayed using #get_external_data and #for_external_table. In this way, you can use any table-based data within your wiki without the need for custom programming.

Data caching
You can configure External Data to cache the data contained in the URLs that it accesses, both for performance reasons and for the case that any of those external URLs become no longer accessible. To do this, you can run the SQL contained in the extension file 'ExternalData.sql' in your database, which will create the table 'ed_url_cache', then add the following to your LocalSettings.php file, after the inclusion of ExternalData:

String replacement in URLs
One or more of the URLs you use may contain a string that you would prefer to keep secret, like an API key. If that's the case, you can use the array $edgStringReplacements to specify a dummy string you can use in its place. For instance, let's say you want to access the URL "http://worlddata.com/api?country=Guatemala&key=123abcd", but you don't want anyone to know your API key. You can add the following to your LocalSettings.php file, after the inclusion of ExternalData:

Then, in your call to #get_external_data, you can replace the real URL with: "http://worlddata.com/api?country=Guatemala&key=WORLDDATA_KEY".

Version
External Data is currently at version 0.4. Below is the version history:


 * 0.1 - January 12, 2009 - initial version
 * 0.2 - January 14, 2009 - support for JSON data added
 * 0.3 - February 3, 2009 - optional database caching added; string replacement in URLs added
 * 0.4 - February 10, 2009 - 'GetData' special page added; #for_external_table function added; 'EDUtils' class added; internationalization added
 * 0.4.1 - February 13, 2009 - support for retrieval of XML table data; handling of XML tag attributes
 * 0.5 - March 2, 2009 - 'CSV with header' format added; 'GetData' changed to use this format; filtering added to #get_external_data; fix for field names containing blanks

Bugs and feature requests
Send any bug reports, requests or code patches to Yaron Koren, at yaron57 -at- gmail.com.