Extension:External Data

Warning: if you are using a version of this extension between 0.4 and 0.5.1, and your wiki blocks reading access for certain pages, please upgrade immediately - those previous versions ignored access-control settings.

Description
The External Data extension allows MediaWiki pages to retrieve, filter, and format data from one or more sources, provided each dataset is accessible via a URL and is in one of the formats CSV, XML or JSON. In addition to external URLs, data can also come from a regular wiki page or an uploaded file.

The extension defines three parser functions:
 * #get_external_data retrieves the data from a URL and assigns it to variables that can be accessed on the page.
 * #external_value displays the value of any such variable.
 * #for_external_table cycles through all the values retrieved for a set of variables, displaying the same "container" text for each one.

For data from XML sources, the variable names are determined by both tag and attribute names. For example, given the following XML text:
 *  red

the variable type would have the value Apple, and the variable color would have the value red.

Similarly, the following XML text would be interpreted as a table of values defining two variables named type and color:
 *  red
 *  brown
 *  brown

Currently, a CSV file must be literally a CSV file, i.e., delimited by commas. It may have a header row, but that is not required.

You can also set caching to be done on the data retrieved, and string replacement to hide API keys; see the "Usage" section, below, for how to do both of those.

Download
You can download the External Data code in either one of these two compressed files:


 * external_data_0.6.tar.gz
 * external_data_0.6.zip

You can also download the code directly via SVN from the MediaWiki source code repository, at http://svn.wikimedia.org/svnroot/mediawiki/trunk/extensions/ExternalData/. From a command line, you can call the following:

svn checkout http://svn.wikimedia.org/svnroot/mediawiki/trunk/extensions/ExternalData/

You can also view the code online here.

Installation
To install this extension, create an 'ExternalData' directory (either by extracting a compressed file or downloading via SVN), and place this directory within the main MediaWiki 'extensions' directory. Then, in the file 'LocalSettings.php', add the following line:

Authors
External Data was written by Yaron Koren, reachable at yaron57 -at- gmail.com; and Michael Dale.

Getting a single set of values
To get data from an external URL, call the following:

An explanation of the fields: The other parameters are divided into two types:
 * URL is the full URL of the XML, CSV or JSON file.
 * format specifies the format of the file: it should be one of either 'XML', 'CSV', 'CSV with header' or 'JSON'. The difference between 'CSV' and 'CSV with header' is that 'CSV' is simply a set of lines with values; while in 'CSV with header', the first line is a "header", holding a comma-separated list of the name of each column.
 * parameters containing a '==' are filters - they do additional filtering on the set of rows being returned. It is not necessary to use any filters; most APIs, it is expected, will provide their own filtering ability through the URL's query string.
 * parameters containing a '=' are mappings - they let you connect local variable names to external variable names. External variable names are the names of the values in the file (in the case of a header-less CSV file, the names are simply the indexes of the values ("1", "2", "3" etc.)), and local variable names are the names that are later passed in to #external_value.

More than one #get_external_data call can be used in a page. If this happens, though, make sure that every local variable name is unique.

If this call retrieved a single value for each variable specified, you can call the following:

As an example, this page contains the following text:

.
 * Germany borders the following countries and bodies of water:
 * Germany has population.

The page gets data from a URL at semanticweb.org, generated by the Semantic MediaWiki extension. That URL contains the following text:

Germany,"North Sea,Denmark,Baltic Sea,Poland,Czech Republic,Austria,Switzerland,France,Luxembourg,Belgium,Netherlands","82,411,000",3.5705e+11 m²

The page then uses #external_value to display the 'bordered countries' and 'population' values; although it uses the #arraymap function, defined by the Semantic Forms extension, to apply some transformations to the 'bordered countries' value (you can ignore this detail if you want).

Getting a table of data
A URL can also contain a "table" of data (many values per field), instead of just a single "row" (one value per field). In this case, you can use #get_external_data to retrieve such a table, and then call the function #for_external_table to display it. For example, this URL at semanticweb.org contains information similar to that above, but for all the countries in Africa instead of just one country. Calling #get_external_data with this URL, with the same format as above, will set the local variables to contain arrays of data, rather than single values. You can then call #for_external_table, which has the following format:

...where "expression" is a string that contains one or more variable names, surrounded by triple brackets. This string is then displayed for each retrieved "row" of data.

For an example, this page contains a call to #get_external_data for the semanticweb.org URL mentioned above, followed by this call:

The call to #for_external_table produces a single row of a table, in wiki-text; it's surrounded by wiki-text to create the top and bottom of the table. The presence of " | " is a standard MediaWiki trick to display pipes from within parser functions; to get it to work, you just have to create a page called "Template:!" that contains a single pipe. There are much easier calls to #for_external_table that can be made, if you just want to display a line of text per data "row", but an HTML table is the standard approach.

Getting data from a non-API source
If the data you wish to access is on a MediaWiki page or in an uploaded file, you can use the above methods to retrieve the data assuming the page or file only contains data in one of the supported formats:
 * for data on a wiki page, use "&action=raw" as part of the URL;
 * for data in an uploaded file, use the full path.

If the MediaWiki page with the data is on the same wiki, it is best to use the fullurl: parser function, e.g.

Similarly, for uploaded files, you can use the filepath: function, e.g.

For wiki pages that have additional information, the External Data extension provides a way to create an API of your own, at least for CSV data. To get this working, first place the data you want accessed in its own wiki page, in CSV format, with the headers as the top row of data (see here for an example). Then, the special page 'GetData' will provide an "instant API" for accessing either certain rows of that data, or the entire table. By adding "field-name=value" to the URL, you can limit the set of rows returned.

A URL for the 'GetData' page can then be used in a call to #get_external_data, just as any other data URL would be; the data will be returned as a CSV file with a header row, so the format argument of #get_external_data should be set to 'CSV with header'. See here for an example of such data being retrieved and displayed using #get_external_data and #for_external_table. In this way, you can use any table-based data within your wiki without the need for custom programming.

Data caching
You can configure External Data to cache the data contained in the URLs that it accesses, both for performance reasons and for the case that any of those external URLs become no longer accessible. To do this, you can run the SQL contained in the extension file 'ExternalData.sql' in your database, which will create the table 'ed_url_cache', then add the following to your LocalSettings.php file, after the inclusion of ExternalData:

String replacement in URLs
One or more of the URLs you use may contain a string that you would prefer to keep secret, like an API key. If that's the case, you can use the array $edgStringReplacements to specify a dummy string you can use in its place. For instance, let's say you want to access the URL "http://worlddata.com/api?country=Guatemala&key=123abcd", but you don't want anyone to know your API key. You can add the following to your LocalSettings.php file, after the inclusion of ExternalData:

Then, in your call to #get_external_data, you can replace the real URL with: "http://worlddata.com/api?country=Guatemala&key=WORLDDATA_KEY".

Common problems

 * If the call to #get_external_data or #for_external_tables isn't returning any data, and the page being accessed is large, it could be because the call to retrieve is getting timed out. You should set the $wgHTTPTimeout flag in your LocalSettings.php file (which represents a number of seconds) to some number greater than 3, its default value. You could call, for instance:




 * If the data being accessed has changed, but the wiki page accessing it still shows the old data, it is because that page is being cached by MediaWiki. There are several solutions to this: if you are an administrator, you can hit the "refresh" tab above the page, which will purge the cache. You can also easily disabling caching for the entire wiki; see here for how. Finally, if you wait long enough (typically no more than 24 hours), the page will get refreshed on its own and display the new data.

Version
External Data is currently at version 0.6. Below is the version history:


 * 0.1 - January 12, 2009 - initial version
 * 0.2 - January 14, 2009 - support for JSON data added
 * 0.3 - February 3, 2009 - optional database caching added; string replacement in URLs added
 * 0.4 - February 10, 2009 - 'GetData' special page added; #for_external_table function added; 'EDUtils' class added; internationalization added
 * 0.4.1 - February 13, 2009 - support for retrieval of XML table data; handling of XML tag attribute
 * 0.5 - March 2, 2009 - 'CSV with header' format added; 'GetData' changed to use this format; filtering added to #get_external_data; fix for field names containing blanks
 * 0.5.1 - March 16, 2009 - CSV values that contain newlines handled correctly
 * 0.5.2 - March 19, 2009 - security fix for wikis that contain read-protected pages
 * 0.6 - April 2, 2009 - support for JSON table data; data gets reset between one page and another when handling multiple pages

Bugs and feature requests
Send any bug reports, requests or code patches to Yaron Koren, at yaron57 -at- gmail.com.

Translating
Translation of External Data is done through translatewiki.net. The translation for this extension can be found here. To add language values or change existing ones, you should create an account on translatewiki.net, then request permission from the administrators to translate a certain language or languages on this page (this is a very simple process). Once you have permission for a given language, you can log in and add or edit whatever messages you want to in that language.