# Extension:External Data

MediaWiki extensions manual
External Data

Release status: stable

Implementation Parser function, Special page
Description Allows for using and displaying values retrieved from various sources: external URLs and local wiki pages (in CSV, GFF, JSON and XML formats), database tables, and LDAP servers.
Author(s) Yaron Koren and others
MediaWiki 1.17 or greater
Database changes Yes

Translate the External Data extension if it is available at translatewiki.net

Check usage and version matrix; code metrics

The External Data extension allows MediaWiki pages to retrieve, filter, and format structured data from one or more sources. These sources can include external URLs, regular wiki pages, uploaded files, databases or LDAP directories.

## Parser functions

The extension has nine parser functions:

• #get_web_data retrieves CSV, GFF, JSON or XML data from a URL and assigns it to variables that can be accessed on the page.
• #get_soap_data retrieves data from a URL via the SOAP protocol.
• #get_db_data retrieves data from a database.
• #get_ldap_data retrieves data from an LDAP server.
• #external_value displays the value of any such variable.
• #for_external_table cycles through all the values retrieved for a set of variables, displaying the same "container" text for each one.
• #store_external_table cycles through a table of values, storing them as semantic data via the Semantic MediaWiki extension. The storage used depends on SMW's storage system: if it uses SQLStore3, #store_external_table mimics a call to SMW's #subobject function for each row, while if SMW uses SQLStore2, it mimics a call to the #set_internal function, defined in the Semantic Internal Objects extension - so in that case, Semantic Internal Objects must be installed for this function to work.
• #display_external_table cycles through all the values retrieved for a set of variables, displaying each "row" using a template.
• #clear_external_data erases the current set of retrieved data.

You can also download the code directly via Git from the MediaWiki source code repository. From a command line, you can call the following:

git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/ExternalData.git


You can also view the code online here.

## Installation

To install this extension, create an 'ExternalData' directory (either by extracting a compressed file or downloading via Git), and place this directory within the main MediaWiki 'extensions' directory. Then, in the file 'LocalSettings.php', add the following line:

include_once("$IP/extensions/ExternalData/ExternalData.php");  ## Authors External Data was created, and is maintained, by Yaron Koren (reachable at yaron57@gmail.com). The overall code base, though, is the work of many people; important code contributions were made by Michael Dale, David Macdonald, Siebrand Mazeland, Ryan Lane, Chris Wolcott, Jelle Scholtalbers, Kostis Anagnostopoulos, Nick Lindridge, Dan Bolser and Joel Natividad. Development of some features was funded by KeyGene and KDZ – Zentrum für Verwaltungsforschung. ## Retrieving data Data can be retrieved from three different sources: from a web page containing structured data (including a page on the wiki itself), from a database, and from an LDAP server. ### #get_web_data - CSV, GFF, JSON, XML To get data from a web page that holds structured data, call the parser function #get_web_data. It can take the following syntax: {{#get_web_data: url=<data source URL> |format={CSV|CSV with header|GFF|JSON|XML} |data=<local_variable_name1>=<external_variable_name1>[,...] |filters=<external_variable_name1>=<filter_value1>[,...] |use xpath |post data=<additional data> |cache seconds=<number of seconds> }}  An explanation of the parameters: • url= sets the full URL of the CSV, GFF, JSON or XML file. (CSV, JSON and XML are standard data formats; GFF, or the Generic Feature Format, is a format for genomic data.) • format= specifies the format of the file: it should be one of either 'CSV', 'CSV with header', 'GFF', 'JSON' or 'XML'. The difference between 'CSV' and 'CSV with header' is that 'CSV' is simply a set of lines with values; while in 'CSV with header', the first line is a "header", holding a comma-separated list of the name of each column. • data= holds the "mappings" that connect local variable names to external variable names. Each mapping (of the form <local_variable_name1>=<external_variable_name1>) is separated by a comma. External variable names are the names of the values in the file (in the case of a header-less CSV file, the names are simply the indexes of the values: 1, 2, 3, etc.), and local variable names are the names that are later passed in to #external_value. • filters= sets filtering on the set of rows being returned. You can set any number of filters, separated by commas; each filter sets a specific value for a specific external variable. It is not necessary to use any filters; most APIs, it is expected, will provide their own filtering ability through the URL's query string. • use xpath is an optional parameter that can be used with the "XML" format, to indicate that "data" mappings should be done using XPath notation; see Using XPath, below. • post data= is an optional parameter that lets you send some set of data to the URL via POST, instead of via the query string. • cache seconds= is an optional parameter that sets the number of seconds that the values from this call should be cached; it overrides the value of$edgCacheExpireTime, if there is any.

More than one #get_web_data call can be used in a page. If this happens, though, make sure that every local variable name is unique.

For data from XML sources, the variable names are determined by both tag and attribute names. For example, given the following XML text:

<fruit type="Apple"><color>red</color></fruit>

the variable type would have the value Apple, and the variable color would have the value red.

Similarly, the following XML text would be interpreted as a table of values defining two variables named type and color:

<fruits>
<fruit type="Apple"><color>red</color></fruit>
<fruit type="Kiwi"><color>brown</color></fruit>
</fruits>

A CSV file must be literally a CSV file, i.e., delimited by commas. A call for a headerless CSV file might look:

{{#get_web_data:url=<URL>|format=CSV|data=first name=1}}

while a call to CSV with a header row might look like:

where the header contains 'FirstName', which is retrieved as 'first name' in the wiki.

You can also set caching to be done on the data retrieved, and string replacement to hide API keys; see the "Usage" section, below, for how to do both of those.

#### Getting data from a non-API text file

If the data you wish to access is on a MediaWiki page or in an uploaded file, you can use the above methods to retrieve the data assuming the page or file only contains data in one of the supported formats:

• for data on a wiki page, use "&action=raw" as part of the URL;
• for data in an uploaded file, use the full path.

If the MediaWiki page with the data is on the same wiki, it is best to use the fullurl: parser function, e.g.

• {{fullurl:Test/test.csv|action=raw}}.

Similarly, for uploaded files, you can use the filepath: function, e.g.

• {{filepath:xyzzy.csv}}

For wiki pages that have additional information, the External Data extension provides a way to create an API of your own, at least for CSV data. To get this working, first place the data you want accessed in its own wiki page, in CSV format, with the headers as the top row of data (see here for an example). Then, the special page 'GetData' will provide an "instant API" for accessing either certain rows of that data, or the entire table. By adding "field-name=value" to the URL, you can limit the set of rows returned.

A URL for the 'GetData' page can then be used in a call to #get_web_data, just as any other data URL would be; the data will be returned as a CSV file with a header row, so the 'format' parameter of #get_web_data should be set to 'CSV with header'. See here for an example of such data being retrieved and displayed using #get_web_data and #for_external_table. In this way, you can use any table-based data within your wiki without the need for custom programming.

#### Data caching

You can configure External Data to cache the data contained in the URLs that it accesses, both to speed up retrieval of values and to reduce the load on the system whose data is being accessed. To do this, you can run the SQL contained in the extension file 'ExternalData.sql' in your database, which will create the table 'ed_url_cache', then add the following to your LocalSettings.php file, after the inclusion of External Data:

$edgCacheTable = 'ed_url_cache';  You should also add a line like the following, to set the expiration time of the cache, in seconds; this example line will cache the data for a week: $edgCacheExpireTime = 7 * 24 * 60 * 60;


#### String replacement in URLs

One or more of the URLs you use may contain a string that you would prefer to keep secret, like an API key. If that's the case, you can use the array $edgStringReplacements to specify a dummy string you can use in its place. For instance, let's say you want to access the URL "http://worlddata.com/api?country=Guatemala&key=123abcd", but you don't want anyone to know your API key. You can add the following to your LocalSettings.php file, after the inclusion of ExternalData: $edgStringReplacements['WORLDDATA_KEY'] = '123abcd';


Then, in your call to #get_web_data, you can replace the real URL with: "http://worlddata.com/api?country=Guatemala&key=WORLDDATA_KEY".

#### Whitelist for URLs

You can create a "whitelist" for URLs accessed by #get_web_data: in other words, a list of domains, that only URLs from those domains can be accessed. If you are using string replacements in order to hide secret keys, it is highly recommended that you create such a whitelist, in order to prevent users from finding out those keys by including them in a URL within a domain that they control.

To create a whitelist with one domain, add the following to LocalSettings.php:

$edgAllowExternalDataFrom = 'http://example.org';  To create a whitelist with multiple domains, add something like the following instead: $edgAllowExternalDataFrom = array('http://example.org', 'http://example2.com');


#### Security

By default, #get_web_data allows for HTTPS-based wikis to access plain HTTP URLs, and vice versa, without the need for certificates (see Transport Layer Security on Wikipedia for a full explanation). If you want to require the presence of a certificate, add the following to LocalSettings.php, below the inclusion of External Data:

### #get_db_data - retrieve data from a database

The parser function #get_db_data allows retrieval of data from external databases. This function executes a simple SELECT statement and assigns the results to local variables that can then be used with the #external_value or #for_external_table functions.

A note about security: - If you are going to use #get_db_data you should think hard about the security implications. Configuring a database in LocalSettings.php will allow anyone with edit access to your wiki to run SQL select statements against that database. You should use a database user that has the minimum permissions for what you are trying to achieve. It is possible that complex SQL constructions could be passed to this function to cause it to do things vastly different to what it was designed for. You should know what you are doing before enabling this function.

#### Configuration

Each database being accessed needs to be configured separately in LocalSettings.php. For normal databases (i.e., everything except for SQLite), add the following stanza for each database:

$edgDBServer['ID'] = "server URL";$edgDBServerType['ID'] = "DB type";
$edgDBName['ID'] = "DB name";$edgDBUser['ID'] = "username";
$edgDBPass['ID'] = "password";  Where: • ID is a label for this database which is used when calling #get_db_data • server URL is the hostname on which the database lives • DB type is the type of database, i.e. mysql, postgres, mssql, oracle, sqlite, db2 or mongodb • DB name, username and password are details for accessing the database. An example of a set of values would be: $edgDBServer['employee-db'] = "127.0.0.1";
$edgDBServerType['employee-db'] = "mysql";$edgDBName['employee-db'] = "employeesDatabase";
$edgDBUser['employee-db'] = "guest";$edgDBPass['employee-db'] = "p@ssw0rd";


The following optional settings can also be added:

$edgDBFlags['id'] = "MediaWiki DB flags";$edgDBTablePrefix['id'] = "table prefix";


Example values for these variables are:

$edgDBFlags['employee-db'] = DBO_NOBUFFER & DBO_TRX;$edgDBTablePrefix['employee-db'] = "emp_";


#### Support for database systems

MySQL, Postgres (i.e. PostgreSQL), DB2 and MongoDB should work fully by default (though there are syntax limitations, and differences, for MongoDB - see below). For MS SQL/SQLServer, SQLite and Oracle, you may need to perform some special handling.

MS SQL/SQLServer

For Microsoft SQLServer, if it doesn't work by default, there are three options that may help:

• If the wiki is running on Windows, you can install Microsoft drivers for PHP for SQL Server. See instructions here.
• If the wiki is running on Linux, the MSSQLBackCompat or OdbcDatabase extensions may help.
• For MSSQLBackCompat, you will need to specify "mssqlold" (not "mssql") as the database type.
• For OdbcDatabase, you will need to specify "odbc" (not "mssql") as the database type.
• If you get messages at the top that look like, "Warning: Invalid argument supplied for foreach() in ...\includes\db\DatabaseMssql.php", this code change might help: in the MediaWiki file includes/db/DatabaseMssql.php, look for lines that look like:
                $rows = sqlsrv_fetch_array($queryresult, SQLSRV_FETCH_ASSOC );

foreach( $rows as$row ) {


...and replace them with:

                //$rows = sqlsrv_fetch_array($queryresult, SQLSRV_FETCH_ASSOC );

//foreach( $rows as$row ) {
while ($row = sqlsrv_fetch_array($queryresult, SQLSRV_FETCH_ASSOC ) ) {

SQLite

To connect to SQLite, you need something like the following in LocalSettings.php:

$edgDBServerType['ID'] = "sqlite";$edgDBDirectory['ID'] = "/directory/to/DB/file";
$edgDBName['ID'] = "Name of file, without .sqlite";  Oracle Connecting to Oracle may work by default. If it doesn't work, the following may help: • Make sure that the Oracle client, and the PHP version being used, are using the same architecture: they have to either both be 32-bit, or both be 64-bit. • Make sure that the value of$edgDBServer for the installation matches something in the corresponding Oracle client .ora files. The value may need to look like "serverName/dbName", as opposed to "serverName".
• If none of the above are the issue, you could try using the OdbcDatabase extension, which should work as well.
MongoDB

For MongoDB, there are no special connection parameters, although the username and password may be optional.

#### Usage

To get data from an external database, call the following:

{{#get_db_data:
db=[database ID]
|from=[from clause]
|where=[where clause]
|limit=[limit]
|order by=[order by clause]
|data=[data mappings]
}}


An explanation of the fields:

• db - the identifying label configured in LocalSettings.php (this parameter used to be known as "server")
• from - an SQL from clause
• where - an SQL where clause (optional)
• limit - a number, limiting the number of results (optional)
• order by - an SQL "order by" clause (optional)
• data - mapping of database column names to local variables (syntax: localVariable=databaseColumn - i.e. "strName" is the name of the database column in the example below).

An example call, using the "employee database" example from above:

{{#get_db_data:
db=employee-db
|from=tblPersonalInfo
|where=employeeID='{{{id}}}'
|limit=50
|order by=employeeName ASC
|data=dbemail=employeeEmail,name=employeeName
}}

##### MongoDB

MongoDB is a non-SQL (or "NoSQL", if you prefer) database system, with its own querying language. When accessing MongoDB, you can either pass in a standard MongoDB query, or use the standard SQL-like syntax of #get_db_data. To use standard MongoDB querying, pass in the query to the parameter "find query=".

You can also use the standard querying functionality. There are some restrictions and differences, however, for the "where" clause:

• only "AND"s can be used, not "OR"s
• for the "LIKE" comparison, no text should be placed around the comparator - it should look like "Username LIKE Jo", not "Username LIKE '%Jo%'".

Because MongoDB returns values in JSON that may be complex, and contain compound values, you can get data that is stored in such a way by separating field names with dots. For instance, if the return data contains a value for a field called "Measurements" that is an array, holding values for fields called "Height" and "Width", then the "data=" parameter to #get_web_data could have a value like "height=Measurements.Height,width=Measurements.Width".

You can do Memcached-based caching of values retrieved from MongoDB; to do that, you need the following two lines in LocalSettings.php:

$wgMainCacheType = CACHE_MEMCACHED;$edgMemCachedMongoDBSeconds = ''number of seconds'';


### #get_ldap_data - retrieve data from LDAP directory

The parser function #get_ldap_data allows retrieval of data from external LDAP directories. This function executes LDAP queries and assigns the results to local variables that can then be used with the #external_value function. It currently only handles a single row of results, and so is most appropriate for querying directories for data about individual users.

A note about security: - If you are going to use #get_ldap_data you should think hard about the security implications. Configuring an LDAP server in LocalSettings.php will allow anyone with edit access to your wiki to run queries against that server. You should use a domain user that has the minimum permissions for what you are trying to achieve. Wiki users could run queries to extract all sorts of information about your domain. You should know what you are doing before enabling this function.

#### Configuration

You need to configure each LDAP server in LocalSettings.php. Add the following stanza for each server:

$edgLDAPServer['domain'] = "myldapserver.com";$edgLDAPBaseDN['domain'] = "[basedn]";
$edgLDAPUser['domain'] = "myDomainUser";$edgLDAPPass['domain'] = "myDomainPassword";


Where:

• domain is a label to be used when calling #get_ldap_data
• myDomainuser and myDomainPassword are credentials used to bind to the LDAP server
• [basedn] is the base DN used for the search.

Example:

$edgLDAPServer['foobar'] = "foobar.com";$edgLDAPBaseDN['foobar'] = "OU=Users,dc=foobar,dc=com";
$edgLDAPUser['foobar'] = "u12345";$edgLDAPPass['foobar'] = "mypassword";


#### Usage

To query the LDAP server add this call to a wiki page:

{{#get_ldap_data:
domain=[domain]
|filter=[ldap filter]
|data=[data mappings]
}}


Where:

• domain is the label used in LocalSettings.php
• filter is the ldap filter used for the search
• data is the mappings of LDAP attributes to local variables

An example that retrieves a user from with Win2003/AD, using a userid passed to a template:

{{#get_ldap_data:
domain=foobar
|filter=(sAMAccountName={{{id}}})
|data=email=mail,title=title,company=company,department=department
}}


Note that #get_ldap_data will only retrieve one result.

## Displaying data

Once you have retrieved the data onto the page, from any source, there are two ways to display it on the page: #external_value and #for_external_table.

### Displaying individual values

If this call retrieved a single value for each variable specified, you can call the following:

{{#external_value:local_variable_name}}


{{#get_web_data:url=http://semanticweb.org/wiki/Special:Ask/-5B-5BGermany-5D-5D/-3FBorders/-3FPopulation/-3FArea/format%3Dcsv/sep%3D,

* Germany borders the following countries and bodies of water:
{{#arraymap:{{#external_value:bordered countries}}|,|x|[[Borders country::x]]}}.
* Germany has population {{#external_value:population}}.


The page gets data from a URL at semanticweb.org, generated by the Semantic MediaWiki extension. That URL contains the following text:

Germany,"North Sea,Denmark,Baltic Sea,Poland,Czech Republic,Austria,Switzerland,France,Luxembourg,Belgium,Netherlands","82,411,000",3.5705e+11 m²


The page then uses #external_value to display the 'bordered countries' and 'population' values; although it uses the #arraymap function, defined by the Semantic Forms extension, to apply some transformations to the 'bordered countries' value (you can ignore this detail if you want).

By default, #external_value displays an error message if it is called for a variable that has not been set. You can disable that by adding the following to LocalSettings.php:

$edgExternalValueVerbose = false;  ### Displaying a table of values The data returned by #get_web_data or #get_db_data (#get_ldap_data doesn't support this feature) can also be a "table" of data (many values per field), instead of just a single "row" (one value per field). In this case, you can display it using one of either the functions #for_external_table or #display_external_table. #### #for_external_table This URL at semanticweb.org contains information similar to that above, but for all the countries in Africa instead of just one country. Calling #get_web_data with this URL, with the same format as above, will set the local variables to contain arrays of data, rather than single values. You can then call #for_external_table, which has the following format: {{#for_external_table:[expression]}}  ...where "expression" is a string that contains one or more variable names, surrounded by triple brackets. This string is then displayed for each retrieved "row" of data. For an example, this page contains a call to #get_web_data for the semanticweb.org URL mentioned above, followed by this call: {| class="wikitable" ! Name ! Borders ! Population ! Area {{#for_external_table:<nowiki/> {{!}}- {{!}} {{{name}}} {{!}} {{{borders}}} {{!}} {{{population}}} {{!}} {{{area}}} }} |}  The call to #for_external_table holds a single row of a table, in wiki-text; it's surrounded by wiki-text to create the top and bottom of the table. The presence of "{{!}}" is a standard MediaWiki trick to display pipes from within parser functions; to get it to work, you just have to create a page called "Template:!" that contains a single pipe. There are much easier calls to #for_external_table that can be made, if you just want to display a line of text per data "row", but an HTML table is the standard approach. There's one other interesting feature of #for_external_table, which is that it lets you URL-encode specific values, by calling them with {{{field-name.urlencode}}} instead of just {{{field-name}}}. For instance, if you wanted to show links to Google searches on a set of terms retrieved, you could call: {{#for_external_table: http://www.google.com/search?q={{{term.urlencode}}} }}  This is required because standard parser functions can't be used within #for_external_table - so the following, for example, will not work: {{#for_external_table: http://www.google.com/search?q={{urlencode:{{{term}}}}} }}  #### #display_external_table #display_external_table is similar in concept to #for_external_table, but it passes the values in each row to a template, which handles the display. This function is called as: {{#display_external_table: template=[template name] |data=[set of parameters, separated by commas]}}  For example, to display the data from the previous example in a table as before, you could create a template called "Country info row", that had the parameters "Country name", "Countries bordered", "Population" and "Area", and then call the following: {| class="wikitable" ! Name ! Borders ! Population ! Area {{#display_external_table:template=Country info row|data=Country name=name,Countries bordered=borders,Population=population,Area=area}} |}  The template should then contain wikitext like the following: |- |{{{Country name}}} |{{{Countries bordered}}} |{{{Population}}} |{{{Area}}}  ## Clearing data You can also clear all external data that has already been retrieved, so that it doesn't conflict with calls to retrieve external data further down the page. The most likely case in which this is useful is when data is retrieved and displayed in a template that is called more than once on a page. To clear the data, just call "{{#clear_external_data:}}". Note that the ":" has to be there at the end of the call, or else MediaWiki will ignore the parser function. There is no way to clear the values for only one field; #clear_external_data erases the entire set of data. ## Using External Data with Semantic MediaWiki A common approach is to use External Data in conjunction with the Semantic MediaWiki extension: the data is retrieved using External Data, then stored using Semantic MediaWiki tags, so it can then be queried, aggregated, mapped etc. on the wiki. If you take this approach, you should note a common problem, which is that the data stored by SMW does not get automatically updated when the data coming from the external source changes. The best solution for this, assuming you expect the data to change over time, is to create a cron job to call the SMW maintenance script "SMW_refreshData.php" at regular intervals, such as once a day; that way, the data is never more than a day old. If you want to store an entire table of data in a single page, instead of just individual values, you should use the #store_external_table function. This function works as a hybrid of the #for_external_table function and one of either the #subobject function (defined in the Semantic MediaWiki extension) or the #set_internal function (defined in the Semantic Internal Objects extension). The function used depends on whether Semantic MediaWiki uses SQLStore2 or SQLStore3; if SQLStore2 is used, #set_internal is used, so Semantic Internal Objects must be installed. The syntax of #store_external_table is very similar to the syntax of #set_internal (see the Semantic Internal Objects page for documentation), but it loops over each row, and uses variables, in the same way as #for_external_table. You can see a demonstration of this function on the page fruits semantic data; the call to #store_external_table on that page looks like: {{#store_external_table:Is fruit in |Has name={{{name}}} |Has color={{{color}}} |Has shape={{{shape}}} }}  ## Common problems • If the call to #get_web_data or #for_external_table isn't returning any data, and the page being accessed is large, it could be because the call to retrieve is getting timed out. You should set the$wgHTTPTimeout flag in your LocalSettings.php file (which represents a number of seconds) to some number greater than 3, its default value. You could call, for instance:
\$wgHTTPTimeout = 20;

• If the data being accessed has changed, but the wiki page accessing it still shows the old data, it is because that page is being cached by MediaWiki. There are several solutions to this: if you are an administrator, you can hit the "refresh" tab above the page, which will purge the cache. You can also easily disable caching for the entire wiki; see here for how. Finally, if you wait long enough (typically no more than 24 hours), the page will get refreshed on its own and display the new data.

## Version history

External Data is currently at version 1.7. See the entire version history.

## Contributing

### Bugs and feature requests

You should use the Semantic MediaWiki mailing list, semediawiki-user, for any questions, suggestions or bug reports about External Data. If possible, please add "[ED]" at the beginning of the subject line, to clarify the subject matter.

(Although the External Data extension in general is independent of Semantic MediaWiki, the fact that it uses Semantic Internal Objects for one of its functions, along with the fact that the two extensions are often used together, means that External Data can be considered part of the Semantic MediaWiki "family" for communication purposes.)

You can also send specific code patches to Yaron Koren, at yaron57@gmail.com.

### Translating

Translation of External Data is done through translatewiki.net. The translation for this extension can be found here. To add language values or change existing ones, you should create an account on translatewiki.net, then request permission from the administrators to translate a certain language or languages on this page (this is a very simple process). Once you have permission for a given language, you can log in and add or edit whatever messages you want to in that language.