Extension:External Data/Web pages

The External Data extension can be used to extract data from either pages on the web, or local files, in a variety of formats, including CSV, JSON, XML, HTML, GFF and INI. For web pages, retrieval can either be done directly, or, if necessary, using the SOAP protocol.

#get_web_data - CSV, JSON, XML, etc.
To get data from a web page that holds structured data, call the parser function #get_web_data. It can take the following syntax:

An explanation of the parameters:


 * url - sets the full URL of the file being retrieved.
 * format - specifies the format of the data being retrieved: it should be one of 'CSV', 'CSV with header', 'GFF', 'JSON', 'XML', 'HTML' or 'text'. CSV, JSON and XML are standard data formats; GFF, or the Generic Feature Format, is a format for genomic data. The difference between 'CSV' and 'CSV with header' is that 'CSV' is simply a set of lines with values, while in 'CSV with header', the first line is a header row, holding a comma-separated list of the names of the columns. 'text' indicates that the contents of the file should be retrieved as-is.
 * delimiter -
 * in CSV format, specifies the delimiter between values in the data set. The default value is "," (a comma). To specify a tab delimiter, use "\t".
 * in INI format, specifies the delimiter between key and value; by default, "=".
 * regex - specifies a PHP regular expression that should be used to extract specific strings from the retrieved text; used with the "text" format.
 * data - holds the "mappings" that connect local variable names to external variable names. Each mapping (of the form local variable name=external variable name) is separated from the next by a comma. External variable names are the names of the values in the file (in the case of a header-less CSV file, the names are simply the indexes of the values: 1, 2, 3, etc.), and local variable names are the names that are later passed in to the display functions, such as #for_external_table.
 * means that all existing external variables are to be mapped to internal ones. Not applicable to formats and parser functions in which external data is defined based on data values.
 * Several special external variables can be set, holding:
 * the complete XML structure, for the XML format; can be used in Lua,
 * the complete JSON structure, for the JSON format; can be used in Lua,
 * information about the text cutout made with start line, etc. (four variables),
 * the complete text, for the text format,
 * the time (Unix timestamp) at which the data was fetched,
 * whether the data could not be fetched and a stale cache was used instead,
 * the number of attempts needed to fetch the data.
 * filters - sets filtering on the set of rows being returned. You can set any number of filters, separated by commas; each filter sets a specific value for a specific external variable. It is not necessary to use any filters; most APIs, it is expected, will provide their own filtering ability through the URL's query string.
 * start line, end line, header lines, footer lines - use these to cut out a fragment of data. Line numbers are one-based; negative values (-1 meaning the last line) are possible, as are percentages (0% to 100%). Use header lines and footer lines to carve out a valid CSV, JSON or XML. Note that if any of these is set, additional newlines will be injected into XML or JSON to guarantee that required tag/variable blocks begin and end at new lines, which will influence the required start line and end line settings. Four special external variables store, respectively, the beginning and the end of the main fragment (without header or footer), the number of lines returned, and the total number of lines in the file.
 * use xpath - an optional parameter that can be used with the "XML" or "HTML" formats, to indicate that "data" mappings should be done using XPath notation; see Using XPath, below.
 * default xmlns prefix - an optional parameter that can be used with "use xpath", which sets the default namespace prefix to be used.
 * use jsonpath - an optional parameter that can be used with the "JSON" format, to indicate that "data" mappings should be done using JSONPath notation; see Using JSONPath, below.
 * json offset - an optional parameter that represents the number of characters to ignore at the beginning of the data set being parsed. It is used with JSON values, in case the JSON being accessed has some kind of security string at the beginning.
 * allow trailing commas - if this is set, JSON files with commas before ] or } will be parsed, although the JSON specification does not allow trailing commas. This setting is useful when start line, end line, header lines and footer lines are set.
 * post data - an optional parameter that lets you send some set of data to the URL via POST, instead of via the query string.
 * cache seconds - an optional parameter that sets the number of seconds for which the values from this call should be cached; if it is less than $edgCacheExpireTime, if that is set, the latter will apply; and if the effective cache expiration time is zero, caching is forbidden.
 * use stale cache - an optional parameter that allows this function to use an expired cache entry if it cannot retrieve the real data.
 * suppress error - an optional parameter that prevents any error message from getting displayed if there is a problem retrieving the data.
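
As a rough illustration of how the "data" mappings and "filters" parameters described above relate to each other, here is a minimal Python sketch. It is not the extension's code: the helper names (parse_pairs, get_rows, get_data) and the sample CSV are hypothetical, and the exact-match filtering is an assumption based on the description above.

```python
import csv
import io

def parse_pairs(spec):
    """Split a string like 'a=b,c=d' into the dict {'a': 'b', 'c': 'd'}."""
    pairs = {}
    for item in spec.split(","):
        left, _, right = item.partition("=")
        pairs[left.strip()] = right.strip()
    return pairs

def get_rows(text, has_header):
    """Return each CSV line as a dict keyed by external variable name."""
    rows = list(csv.reader(io.StringIO(text)))
    if has_header:
        names, rows = rows[0], rows[1:]
    else:
        # Header-less CSV: the names are the 1-based column indexes.
        names = [str(i) for i in range(1, len(rows[0]) + 1)]
    return [dict(zip(names, row)) for row in rows]

def get_data(text, has_header, data, filters=""):
    """Apply filters, then collect values under their local variable names."""
    mappings = parse_pairs(data)  # local name -> external name
    wanted = parse_pairs(filters) if filters else {}
    result = {local: [] for local in mappings}
    for row in get_rows(text, has_header):
        # Assumption: a filter keeps rows whose value matches exactly.
        if all(row.get(name) == value for name, value in wanted.items()):
            for local, external in mappings.items():
                result[local].append(row.get(external))
    return result

SAMPLE = "name,color\nApple,red\nBanana,yellow\nCherry,red\n"
print(get_data(SAMPLE, True, "fruit=name", "color=red"))
```

With a header row, the mapping "fruit=name" reads the column called "name"; for header-less data the same call would use "fruit=1" instead.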

More than one call can be used in a page. If this happens, though, make sure that every local variable name is unique.

For data from XML sources, the variable names are determined by both tag and attribute names. For example, given the following XML text:

the variable type would have the value Apple, and the variable color would have the value red.
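
The idea that both tag names and attribute names become variable names can be sketched in Python as follows. This is only an illustration, not the extension's parser; the XML snippet is a hypothetical one chosen to be consistent with the values mentioned above, and xml_variables is an invented helper name.

```python
import xml.etree.ElementTree as ET

def xml_variables(xml_text):
    """Collect values keyed by attribute name and by tag name."""
    values = {}
    for element in ET.fromstring(xml_text).iter():
        # Every attribute contributes a variable named after the attribute.
        for name, value in element.attrib.items():
            values.setdefault(name, []).append(value)
        # Every tag with text content contributes a variable named after the tag.
        text = (element.text or "").strip()
        if text:
            values.setdefault(element.tag, []).append(text)
    return values

# Hypothetical sample, not taken from the original page.
SAMPLE_XML = '<fruit type="Apple"><color>red</color></fruit>'
print(xml_variables(SAMPLE_XML))
```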

Similarly, the following XML text would be interpreted as a table of values defining two variables named type and color :

A CSV file must be literally a CSV file, i.e., delimited by commas. A call for a headerless CSV file might look like this:



while a call to CSV with a header row might look like:



where the header contains, which is retrieved as   in the wiki.

You can also set caching to be done on the data retrieved, and string replacement to hide API keys; see the "Data caching" and "String replacement in URLs" sections, below, for how to do both of those.

Getting data from a non-API text file
If the data you wish to access is on a MediaWiki page or in an uploaded file, you can use the above methods to retrieve the data assuming the page or file only contains data in one of the supported formats:


 * for data on a wiki page, use "action=raw" as part of the URL;
 * for data in an uploaded file, use the full path.

If the MediaWiki page with the data is on the same wiki, it is best to use the fullurl: parser function, e.g.



Similarly, for uploaded files, you can use the filepath: function, e.g.

For wiki pages that have additional information, the External Data extension provides a way to create an API of your own, at least for CSV data. To get this working, first place the data you want accessed in its own wiki page, in CSV format, with the headers as the top row of data. Then, the special page 'GetData' will provide an "instant API" for accessing either certain rows of that data, or the entire table. By adding "field-name=value" to the URL, you can limit the set of rows returned.

A URL for the 'GetData' page can then be used in a call to #get_web_data, just as any other data URL would be; the data will be returned as a CSV file with a header row, so the 'format' parameter of #get_web_data should be set to 'CSV with header'. The retrieved data can then be displayed using #for_external_table. In this way, you can use any table-based data within your wiki without the need for custom programming.

Data caching
You can configure External Data to cache the data contained in the URLs that it accesses, both to speed up retrieval of values and to reduce the load on the system whose data is being accessed. To do this, run the SQL contained in the extension file 'ExternalData.sql' in your database, which will create the table 'ed_url_cache', then set $edgCacheTable to 'ed_url_cache' in your LocalSettings.php file, after the inclusion of External Data.

You should also set $edgCacheExpireTime, the expiration time of the cache, in seconds; for example, a value of 7 * 24 * 60 * 60 will cache the data for a week.

By default, if data cannot be retrieved, and a cache table exists, #get_web_data will use the cached value for this data even if the cache has already expired. This behavior can be disallowed through a setting in LocalSettings.php.
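
The caching rules described in this section and under "cache seconds" can be summarized in a small Python sketch. It is an illustration under stated assumptions, not the extension's implementation: the function names, the in-memory dictionary standing in for the cache table, and the use of OSError to signal a failed retrieval are all hypothetical.

```python
import time

def effective_expiry(cache_seconds, global_expiry):
    """The larger of the per-call and wiki-wide values; 0 means no caching."""
    return max(cache_seconds or 0, global_expiry or 0)

def fetch_cached(url, fetch, cache, expiry, allow_stale=True, now=None):
    """Return (data, used_stale_cache) for the given URL."""
    now = time.time() if now is None else now
    entry = cache.get(url)  # (timestamp, data) or None
    if expiry > 0 and entry and now - entry[0] < expiry:
        return entry[1], False  # fresh cache hit, no request made
    try:
        data = fetch(url)
    except OSError:
        if entry and allow_stale:
            return entry[1], True  # retrieval failed: fall back to stale data
        raise
    if expiry > 0:
        cache[url] = (now, data)
    return data, False

def broken_fetch(url):
    """Stand-in for a server that cannot be reached."""
    raise OSError("server unavailable")
```

In this sketch, a call made after the entry has expired still returns the old value (flagged as stale) when the server is unreachable, unless allow_stale is turned off.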

String replacement in URLs
One or more of the URLs you use may contain a string that you would prefer to keep secret, like an API key. If that's the case, you can use the array $edgStringReplacements to specify a dummy string you can use in its place. For instance, let's say you want to access the URL "http://worlddata.com/api?country=Guatemala&key=123abcd", but you don't want anyone to know your API key. You can add an entry to $edgStringReplacements in your LocalSettings.php file, after the inclusion of External Data, that maps the dummy string 'WORLDDATA_KEY' to the real key '123abcd'.

Then, in your call to #get_web_data, you can replace the real URL with: "http://worlddata.com/api?country=Guatemala&key=WORLDDATA_KEY".
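
The substitution itself amounts to a plain string replacement performed before the request is sent. A minimal Python sketch of that idea, using the dummy string and key from the example above (the function name apply_replacements is hypothetical, and this is not the extension's code):

```python
# Dummy string -> real secret, mirroring the example in the text.
REPLACEMENTS = {"WORLDDATA_KEY": "123abcd"}

def apply_replacements(url, replacements=REPLACEMENTS):
    """Swap every dummy string in the URL for the secret it stands for."""
    for dummy, secret in replacements.items():
        url = url.replace(dummy, secret)
    return url

public_url = "http://worlddata.com/api?country=Guatemala&key=WORLDDATA_KEY"
print(apply_replacements(public_url))
```

Only the dummy string ever appears in the wiki text; the real key lives in server-side configuration.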

Whitelist for URLs
You can create a "whitelist" for URLs accessed by #get_web_data: in other words, a list of domains such that only URLs from those domains can be accessed. If you are using string replacements in order to hide secret keys, it is highly recommended that you create such a whitelist, in order to prevent users from finding out those keys by including them in a URL within a domain that they control.
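
To show why a whitelist protects secret keys, here is a small Python sketch of a domain check. How the extension actually compares URLs against its whitelist is not described here, so the host-based matching below (including subdomains) is an assumption, and is_allowed is a hypothetical name.

```python
from urllib.parse import urlsplit

def is_allowed(url, allowed_domains):
    """True if the URL's host is a whitelisted domain or one of its subdomains."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed_domains)

ALLOWED = ["worlddata.com"]

# The legitimate API call passes the check.
print(is_allowed("http://worlddata.com/api?key=WORLDDATA_KEY", ALLOWED))
# A URL on a domain controlled by someone else is rejected, so the
# replaced key is never sent there.
print(is_allowed("http://evil.example/log?key=WORLDDATA_KEY", ALLOWED))
```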

To create a whitelist with one domain, set $edgAllowExternalDataFrom to that domain in LocalSettings.php.

To create a whitelist with multiple domains, set $edgAllowExternalDataFrom to an array of domains instead.

HTTP options
By default, #get_web_data allows for HTTPS-based wikis to access plain HTTP URLs, and vice versa, without the need for certificates (see Transport Layer Security on Wikipedia for a full explanation). If you want to require the presence of a certificate, add the following to LocalSettings.php :

Additionally, the global variable $edgHTTPOptions lets you set a number of other HTTP-related settings. It is an array that can take in any of the following keys:


 * timeout - how many seconds to wait for a response from the server (default is 'default', which corresponds to the value of $wgHTTPTimeout, which by default is 25)
 * sslVerifyCert - whether to verify the SSL certificate, if retrieving an HTTPS URL (default is false)
 * followRedirects - whether to retrieve another URL if the specified URL redirects to it (default is false)

So, for instance, if you want to verify the SSL certificate of any URL being accessed by #get_web_data, you would set $edgHTTPOptions['sslVerifyCert'] to true in LocalSettings.php.

Using XPath
In some cases, the same tag or attribute name can be used more than once in an XML or HTML file, and you only want to get a specific instance of it. You can do that using the XPath notation. To do it, you just need to add the parameter "use xpath", and then have each "external variable name" in the "data=" parameter be in XPath notation, instead of just a simple name.

We won't get into the details of XPath notation here.
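
For readers unfamiliar with XPath, the following Python sketch shows the kind of selection it makes possible: picking one specific instance of a repeated tag. It uses Python's standard library, which supports only a subset of XPath, and a hypothetical XML document; it says nothing about how the extension evaluates XPath internally.

```python
import xml.etree.ElementTree as ET

# Hypothetical document in which the tag "name" appears more than once.
DOC = """
<catalog>
  <item id="1"><name>Apple</name></item>
  <item id="2"><name>Banana</name></item>
</catalog>
"""

root = ET.fromstring(DOC)

# A simple name would match every <name> element...
all_names = [element.text for element in root.findall(".//name")]

# ...while a path with a predicate selects one specific instance.
second = root.find("./item[@id='2']/name").text

print(all_names, second)
```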

Using JSONPath
Just as with XML (see the section above), in JSON, specifying which data you want can require more than simply specifying a key name. Just as XML has XPath, JSON has JSONPath, which is less well-known but just as useful.

To use JSONPath, just add the parameter "use jsonpath" to the parser function call, and then have each "external variable name" in the "data=" parameter be in JSONPath notation.
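
To give a feel for the notation, here is a Python sketch of an evaluator for a deliberately tiny subset of JSONPath: dotted keys, numeric indexes and the [*] wildcard. It is a teaching aid only; full JSONPath has many more features, the sample data is hypothetical, and the extension's own evaluator is not shown here.

```python
import json
import re

# One path step: .key, [index] or [*]
TOKEN = re.compile(r"\.([A-Za-z_][A-Za-z0-9_]*)|\[(\d+)\]|\[\*\]")

def json_path(document, path):
    """Return the list of values matched by a path such as $.a.b[0]."""
    if not path.startswith("$"):
        raise ValueError("path must start with '$'")
    nodes, pos = [document], 1
    while pos < len(path):
        match = TOKEN.match(path, pos)
        if match is None:
            raise ValueError(f"unsupported syntax at position {pos}")
        key, index = match.group(1), match.group(2)
        if key is not None:
            nodes = [n[key] for n in nodes if isinstance(n, dict) and key in n]
        elif index is not None:
            i = int(index)
            nodes = [n[i] for n in nodes if isinstance(n, list) and i < len(n)]
        else:  # the [*] wildcard: every element of each list
            nodes = [item for n in nodes if isinstance(n, list) for item in n]
        pos = match.end()
    return nodes

DATA = json.loads('{"store": {"books": [{"title": "A"}, {"title": "B"}]}}')
print(json_path(DATA, "$.store.books[*].title"))
print(json_path(DATA, "$.store.books[1].title"))
```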

Using CSS-style selectors
With the "HTML" format, you can either use XPath (see above) or CSS-style selectors. For CSS-style selection, you do not need to specify a special parameter: it is the default approach used when "use xpath" is not specified. CSS selectors are a notation that uses tag names, classes and IDs to locate one or more elements in an HTML page; it is also the syntax used in jQuery.

INI texts
Texts (fetched from the web, files or program output) in the INI format can be parsed by setting the format to INI. The delimiter parameter (by default, "=") sets the delimiter between key and value. Strings without a delimiter are added to a special external variable.
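
The behavior described above (split each line at the delimiter, and collect lines that have none) can be sketched in Python as follows. This is a simplified illustration, not the extension's parser: the function name parse_ini, the "=" default, and the decision to skip blank lines and comment lines are assumptions.

```python
def parse_ini(text, delimiter="="):
    """Return (key/value pairs, lines that contain no delimiter)."""
    values, loose = {}, []
    for line in text.splitlines():
        line = line.strip()
        # Assumption: blank lines and ';' or '#' comments are ignored.
        if not line or line.startswith((";", "#")):
            continue
        if delimiter in line:
            key, _, value = line.partition(delimiter)
            values[key.strip()] = value.strip()
        else:
            loose.append(line)
    return values, loose

print(parse_ini("name = Apple\ncolor=red\norphan\n"))
print(parse_ini("name: Apple", delimiter=":"))
```

Note that in this sketch a section header such as "[fruit]" has no delimiter and therefore ends up among the loose strings.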

#get_soap_data - web data via SOAP
The parser function #get_soap_data, similarly to #get_web_data, lets you get data from a URL, but here using the SOAP protocol. It is called in the following way:

All of the LocalSettings.php settings that can be applied for #get_web_data can also be applied for #get_soap_data: $edgCacheTable, $edgCacheExpireTime, $edgStringReplacements, $edgAllowExternalDataFrom and $edgAllowSSL.

#get_file_data - retrieve files on the local server
You can get data from a file on the server on which the wiki resides, using #get_file_data. This parser function is called in a similar manner to #get_web_data - the set of allowed formats is the same, as are most of the other parameters. Unlike with #get_web_data, however, you cannot retrieve the data from any file; rather, the set of allowed files, and/or directories, must be set beforehand in LocalSettings.php, with an alias for each one, so that the actual file paths remain private. It is called in the following way:

Either "file=", or the combination of "directory=" and "file name=", should be set, but not both. If you want to give the wiki access to one or a small number of files, you could add one or more lines like the following to LocalSettings.php:

You would then set "file=" to the ID for that file.

And if there are any directories that you want the wiki to be able to access all files from, you could add one or more lines like the following to LocalSettings.php :

You would then set "directory=" to the ID of that directory, and "file name=" to the name of the file you want to access in this #get_file_data call. Note that the External Data code ensures that users cannot do tricks like adding "../.." and so on to the file name to access directories outside of the specified one.
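
The kind of check that prevents "../.." tricks can be sketched in Python as below. This is not the extension's code, only one common way to enforce the rule: the function name and the paths are hypothetical, and the check is purely lexical (it does not follow symbolic links).

```python
import posixpath

def resolve_in_directory(directory, file_name):
    """Join directory and file name, refusing results outside the directory."""
    base = posixpath.normpath(directory)
    candidate = posixpath.normpath(posixpath.join(base, file_name))
    # After normalization, the candidate must still live under the base.
    if posixpath.commonpath([base, candidate]) != base:
        raise ValueError("file name escapes the allowed directory")
    return candidate

print(resolve_in_directory("/srv/lab/results", "2024/run1.csv"))

try:
    resolve_in_directory("/srv/lab/results", "../../etc/passwd")
except ValueError as error:
    print("rejected:", error)
```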

To give an example, let's say that a lab wants to publish test results on their wiki. The results are all in CSV files in one directory on a server. So, they might add the following to LocalSettings.php :

Then, a #get_file_data call on the wiki might look like this:

Below that, there would presumably be a call to #for_external_table or #display_external_table to display the resulting data.

It is also possible to process all files in a directory, optionally only those with names matching a mask. Example:

will produce a table of PHP classes with their parents in this extension, provided that  contains. File name, relative to, will be saved to the external variable.
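
Selecting files by a name mask can be pictured with the following Python sketch. Whether the extension uses shell-style wildcards exactly like these is an assumption, and the file names and the helper select_files are hypothetical.

```python
import fnmatch

def select_files(names, mask="*"):
    """Return the names matching a shell-style mask, in sorted order."""
    return sorted(n for n in names if fnmatch.fnmatchcase(n, mask))

# Hypothetical directory listing.
NAMES = ["ParserXml.php", "README.md", "ParserJson.php", "notes.txt"]
print(select_files(NAMES, "*.php"))
```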