Extension:External Data/Web pages/en

The External Data extension can be used to extract data from pages or documents on the web in a variety of formats, including CSV, JSON, XML, HTML, GFF and INI. The data can be retrieved either directly or, if necessary, using the SOAP protocol.

#get_web_data - CSV, JSON, XML, etc.
To get data from a web page that holds structured data, call the parser function #get_web_data. It can take the following syntax:

The parameters format, delimiter, regex, start line, end line, header lines, footer lines, use xpath, default xmlns prefix, use jsonpath, json offset and allow trailing commas all relate to parsing specific data formats, and are covered in the documentation on parsing those formats.

The parameters cache seconds and use stale cache relate to caching of data, and are covered, along with caching in general, in the documentation on caching.

An explanation of the other parameters:

If the value  is passed in, then all existing external variables (if there are any) will be mapped to internal ones of the same name, brought to lowercase, if field names are case-insensitive in the used format. Unless one of the options  or   is set, the parameter   can be omitted altogether: the effect will be the same as setting. Additionally, some "special variables" will be set as well; see.
 * url - sets the full URL of the file being retrieved.
 * data - holds the "mappings" that connect local variable names to external variable names. Each mapping pairs a local variable name with an external one, and mappings are separated by commas. External variable names are the names of the values in the file (in the case of a header-less CSV file, the names are simply the indexes of the values: 1, 2, 3, etc.), and local variable names are the names used later in the page to display the values.
 * filters - sets filtering on the set of rows being returned. You can set any number of filters, separated by commas; each filter sets a specific value for a specific external variable. Filters are optional; most APIs are expected to provide their own filtering through the URL's query string.
 * post data - an optional parameter that lets you send some set of data to the URL via POST, instead of via the query string.
 * suppress error - an optional parameter that prevents any error message from getting displayed if there is a problem retrieving the data.
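
The way the data and filters parameters work together can be modeled with a short Python sketch. This is an illustration only, not the extension's own code. It assumes that each mapping is written as local=external and each filter as external=value; the helper names parse_pairs and apply_mappings, and the sample rows, are hypothetical.

```python
def parse_pairs(spec):
    """Split 'a=b,c=d' into a list of (left, right) tuples, trimming spaces."""
    pairs = []
    for item in spec.split(","):
        if not item.strip():
            continue  # ignore empty entries such as a trailing comma
        left, _, right = item.partition("=")
        pairs.append((left.strip(), right.strip()))
    return pairs


def apply_mappings(rows, data, filters=""):
    """Keep rows matching every filter, then rename external fields to local names."""
    wanted = parse_pairs(filters)   # (external name, required value)
    mappings = parse_pairs(data)    # (local name, external name): assumed order
    result = []
    for row in rows:
        if all(row.get(ext) == value for ext, value in wanted):
            result.append({local: row.get(ext) for local, ext in mappings})
    return result


# Hypothetical rows, as they might look after a file has been parsed.
rows = [
    {"Name": "Apple", "Color": "red"},
    {"Name": "Banana", "Color": "yellow"},
]
selected = apply_mappings(rows, data="fruit=Name,shade=Color", filters="Color=red")
print(selected)
```

Only the row whose Color is red survives the filter, and its values are exposed under the local names fruit and shade.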

This function can be called more than once in a page. If you do so, make sure that every local variable name is unique.

For data from XML sources, the variable names are determined by both tag and attribute names. For example, given the following XML text:

the variable type would have the value Apple, and the variable color would have the value red.
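
As a rough illustration of the rule that both tag names and attribute names become variable names, the following Python sketch collects values under each name. It is not the extension's parser; the function name collect_variables and the sample XML (written to match the type and color values described above) are assumptions.

```python
import xml.etree.ElementTree as ET


def collect_variables(xml_text):
    """Gather values by tag name and by attribute name, as a list per name."""
    variables = {}
    for element in ET.fromstring(xml_text).iter():
        # Every attribute contributes a value under the attribute's name.
        for name, value in element.attrib.items():
            variables.setdefault(name, []).append(value)
        # Every element with non-blank text contributes a value under its tag name.
        text = (element.text or "").strip()
        if text:
            variables.setdefault(element.tag, []).append(text)
    return variables


# Hypothetical sample: 'type' is an attribute, 'color' is a tag.
sample = '<fruit type="Apple"><color>red</color></fruit>'
variables = collect_variables(sample)
print(variables)
```

When the same tag or attribute occurs several times, each name holds several values, which is what makes the result usable as a table.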

Similarly, the following XML text would be interpreted as a table of values defining two variables named type and color:

A CSV file must be literally a CSV file, i.e., delimited by commas. A call for a headerless CSV file might look like:



while a call to CSV with a header row might look like:



where the header contains, which is retrieved as   in the wiki.
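
The difference between the two CSV variants comes down to how external variable names are chosen. The Python sketch below models this under the assumption stated earlier on this page, that columns of a headerless file are named 1, 2, 3, and so on; the function name read_csv and the sample data are hypothetical, and this is not the extension's own parser.

```python
import csv
import io


def read_csv(text, has_header):
    """Return rows as dicts keyed by header names, or by 1-based column index."""
    rows = list(csv.reader(io.StringIO(text)))
    if has_header:
        # The first row supplies the external variable names.
        header, rows = (rows[0], rows[1:]) if rows else ([], [])
    else:
        # No header row: name the columns "1", "2", "3", ...
        width = max((len(row) for row in rows), default=0)
        header = [str(index) for index in range(1, width + 1)]
    return [dict(zip(header, row)) for row in rows]


headerless = read_csv("Apple,red\nBanana,yellow\n", has_header=False)
with_header = read_csv("Type,Color\nApple,red\n", has_header=True)
print(headerless)
print(with_header)
```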

You can also set caching to be done on the data retrieved, and string replacement to hide API keys; see the "Usage" section, below, for how to do both of those.

Getting data from a non-API text file
If the data you wish to access is on a MediaWiki page or in an uploaded file, you can use the above methods to retrieve it, provided that the page or file contains only data in one of the supported formats:


 * for data on a wiki page, use " " as part of the URL;
 * for data in an uploaded file, use the full path.

If the MediaWiki page with the data is on the same wiki, it is best to use the fullurl: parser function, e.g.



Similarly, for uploaded files, you can use the filepath: function, e.g.

For wiki pages that have additional information, the External Data extension provides a way to create an API of your own, at least for CSV data. To get this working, first place the data you want accessed in its own wiki page, in CSV format, with the headers as the top row of data. Then, the special page 'GetData' will provide an "instant API" for accessing either certain rows of that data, or the entire table. By adding "field-name=value" to the URL, you can limit the set of rows returned.

A URL for the 'GetData' page can then be used in a call to #get_web_data, just as any other data URL would be; the data will be returned as a CSV file with a header row, so the 'format' parameter of #get_web_data should be set to 'CSV with header'. In this way, you can use any table-based data within your wiki without the need for custom programming.
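
Adding "field-name=value" pairs to a URL can be sketched in Python as follows. The function name add_row_filters is hypothetical, and the address used for the 'GetData' page is only a placeholder, since the exact URL depends on your wiki; only the appending of properly encoded query parameters is being illustrated.

```python
from urllib.parse import urlencode


def add_row_filters(getdata_url, field_values):
    """Append field-name=value pairs to a URL as query-string parameters."""
    if not field_values:
        return getdata_url
    # Use "&" if the URL already carries a query string, "?" otherwise.
    separator = "&" if "?" in getdata_url else "?"
    return getdata_url + separator + urlencode(field_values)


# Hypothetical address of a 'GetData' page; substitute your wiki's real URL.
base = "https://wiki.example.org/wiki/Special:GetData/Fruit_data"
filtered = add_row_filters(base, {"Color": "red"})
print(filtered)
```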

String replacement in URLs
One or more of the URLs you use may contain a string that you would prefer to keep secret, like an API key. If that's the case, you can configure the relevant data source with a dummy string to use in its place. For instance, let's say you want to access the URL "http://worlddata.com/api?country=Guatemala&key=123abcd", but you don't want anyone to know your API key. You can add the following to your LocalSettings.php file, after the inclusion of External Data:

Then, in your call to #get_web_data, you can replace the real URL with: "http://worlddata.com/api?country=Guatemala&key=WORLDDATA_KEY".
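
Conceptually, the dummy string is swapped for the real value just before the URL is requested, so the secret never appears in wikitext. The Python sketch below models that substitution with the example values from this section; the function name expand_url and the dictionary layout are assumptions, not the extension's actual configuration format.

```python
def expand_url(url, replacements):
    """Replace each dummy string in the URL with its secret counterpart."""
    for dummy, secret in replacements.items():
        url = url.replace(dummy, secret)
    return url


# Dummy string -> real value, kept on the server side only.
replacements = {"WORLDDATA_KEY": "123abcd"}

public_url = "http://worlddata.com/api?country=Guatemala&key=WORLDDATA_KEY"
real_url = expand_url(public_url, replacements)
print(real_url)
```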

Whitelist for URLs
You can create a "whitelist" for URLs accessed by #get_web_data: in other words, a list of domains such that only URLs from those domains can be accessed.

As with other extension settings, there can be a common whitelist or a whitelist for a host or second level domain (effectively blacklisting the whole host or domain except the whitelisted URLs).
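
The effect of a whitelist can be modeled with the Python sketch below. It assumes that entries are bare domain names and that a URL matches when its host is that domain or one of its subdomains; the extension's real matching rule may differ, and the function name is_allowed is hypothetical.

```python
from urllib.parse import urlparse


def is_allowed(url, whitelist=None):
    """Return True if no whitelist is set, or if the URL's host is covered by it."""
    if whitelist is None:
        return True  # assumption: without a whitelist, any URL may be accessed
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == domain or host.endswith("." + domain)
        for domain in (entry.lower() for entry in whitelist)
    )


whitelist = ["example.org"]
print(is_allowed("https://api.example.org/data.csv", whitelist))
print(is_allowed("https://example.org.evil.net/data.csv", whitelist))
```

Comparing whole host names, rather than searching for the domain anywhere in the URL, keeps look-alike hosts from slipping through.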

To create a whitelist with one URL, add the following to LocalSettings.php:

To create a whitelist with multiple URLs:

HTTP options
By default, #get_web_data allows for HTTPS-based wikis to access plain HTTP URLs, and vice versa, without the need for certificates (see Transport Layer Security on Wikipedia for a full explanation). If you want to require the presence of a certificate, add the following to LocalSettings.php:

Additionally, the setting  lets you set a number of other HTTP-related settings. It is an array that can take in any of the following keys:


 * - how many seconds to wait for a response from the server (default is 'default', which corresponds to the value of $wgHTTPTimeout, which by default is 25)
 * - whether to verify the SSL certificate, if retrieving an HTTPS URL (default is false)
 * - whether to retrieve another URL if the specified URL redirects to it (default is false)

So, for instance, if you want to verify the SSL certificate of any URL being accessed by #get_web_data, you would add the following to LocalSettings.php:

As with other settings, the global settings (data source ) can be overridden with the specific settings for a URL, host or second level domain.
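
The idea of global settings being overridden by more specific ones can be sketched in Python as a series of dictionary merges. Everything here is an assumption made for illustration: the option names (timeout, verify_certificate, follow_redirects), the use of "*" for the global entry, the naive way the second level domain is derived, and the order of precedence. The sketch does not reproduce the extension's configuration format.

```python
from urllib.parse import urlparse


def resolve_options(url, settings):
    """Merge option dicts from least to most specific: global, domain, host, URL."""
    host = urlparse(url).hostname or ""
    domain = ".".join(host.split(".")[-2:])  # naive second level domain
    merged = {}
    for key in ("*", domain, host, url):
        # Later (more specific) entries overwrite earlier (more general) ones.
        merged.update(settings.get(key, {}))
    return merged


# Hypothetical settings table keyed by scope.
settings = {
    "*": {"timeout": 25, "verify_certificate": False, "follow_redirects": False},
    "example.org": {"verify_certificate": True},
    "api.example.org": {"timeout": 5},
}
options = resolve_options("https://api.example.org/data.json", settings)
print(options)
```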

ExternalDataBeforeWebCall hook
The ExternalDataBeforeWebCall hook can be used to alter the HTTP request options or the URL, to make preparations for data retrieval such as a complex authentication procedure, or to abort data retrieval altogether.

Example:

Examples
You can see some example calls to #get_web_data, featuring real-world data sources, at the Examples page.

#get_soap_data - web data via SOAP
The parser function #get_soap_data, similarly to #get_web_data, lets you get data from a URL, but here using the SOAP protocol. It is called in the following way:

All of the settings for a data source that can be applied for #get_web_data can also be applied for #get_soap_data.

All of the parsing-related parameters that #get_web_data supports (format, delimiter, use xpath, etc.) can be used for #get_soap_data as well.

The caching-related parameters that #get_web_data supports (cache seconds and use stale cache) can be used for #get_soap_data as well.