User:Jeblad/XMLstarlet

XMLstarlet adds a tag and parser functions that make it possible to hotlink data from external sources by extracting it from web services or published XML or HTML documents. Such sources are assumed to be services for national statistics and other services that publish distinct and systematic data, but they may also provide highly complex and structured data.

Usage
The extension relies upon redirection through a system message for access to external sites, or upon cooperation from the external site. If there is no redirect through the system messages, a file is checked to see whether the extension is allowed to use the specified content from the external site. If access is allowed, the extension proceeds to download the data, extract and process it, and optionally store the final data in a local cache with time-out information from the external site.

The data are extracted, and possibly reprocessed, according to directives given by the tag or parser functions. Some actions might be allowed or denied, depending on the specific site and possibly on local configuration. The data are then processed by an XSLT processor. If the cache time limit is exceeded, the external site is accessed again to fetch fresh data. The cache time can be overridden by local configuration, or by presets from the system message.
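The caching behaviour described above can be sketched as follows. This is a minimal illustration, not the extension's implementation: the class name `ExpiringCache`, the `fetch` callback (assumed to return the data together with the time-out in seconds announced by the external site) and the `ttl_override` parameter (standing in for local configuration or system message presets) are all hypothetical.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class ExpiringCache:
    """Keep fetched data until its time limit is exceeded (hypothetical sketch)."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock
        # url -> (expiry timestamp, cached data)
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, url: str,
            fetch: Callable[[str], Tuple[str, float]],
            ttl_override: Optional[float] = None) -> str:
        now = self._clock()
        entry = self._entries.get(url)
        if entry is not None and now < entry[0]:
            return entry[1]  # still fresh, no access to the external site
        # Cache miss or time limit exceeded: fetch fresh data.
        data, site_ttl = fetch(url)
        # An override (assumed to come from local configuration) wins
        # over the time-out announced by the external site.
        ttl = site_ttl if ttl_override is None else ttl_override
        self._entries[url] = (now + ttl, data)
        return data
```

Passing the clock in as a parameter is only a convenience for testing the expiry logic without waiting.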

Parser function
The basic method for accessing data from other sites is through the xml function. This downloads a specific document and extracts a specific value identified by an XPath statement. It takes exactly one additional argument.
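As a rough analogue of extracting one value by an XPath statement, the sketch below uses Python's standard library. The function name `extract_value`, the sample document and its figure are invented for illustration, and `xml.etree.ElementTree` supports only a subset of XPath, so this is narrower than what an XSLT-based tool offers.

```python
import xml.etree.ElementTree as ET


def extract_value(document: str, path: str) -> str:
    """Return the text of the first node matching path, or '' if none matches."""
    root = ET.fromstring(document)
    node = root.find(path)
    if node is None:
        return ""
    return (node.text or "").strip()


# Invented sample data, standing in for a downloaded document.
SAMPLE = (
    "<stats>"
    "<region name='North'><population>1200</population></region>"
    "<region name='South'><population>3400</population></region>"
    "</stats>"
)

value = extract_value(SAMPLE, "./region[@name='South']/population")
print(value)  # prints 3400
```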

When the protocol part of the URL is replaced by a service name, the given record may override local configuration for this service. If the URL contains no protocol part, the protocol is inferred from the domain name and defaults to http. Basic protocols cannot be used as service names.
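The rules for the protocol part can be sketched like this. Everything concrete here is an assumption made for illustration: the `prefix://rest` syntax, the set of basic protocols, and the function names `resolve_target` and `register_service`. Inference from the domain name is not modelled; the sketch only applies the http default.

```python
BASIC_PROTOCOLS = frozenset({"http", "https", "ftp"})  # assumed set


def register_service(services: dict, name: str, config: dict) -> None:
    """Add a service; basic protocols cannot be used as service names."""
    if name.lower() in BASIC_PROTOCOLS:
        raise ValueError(f"{name!r} is a basic protocol, not a service name")
    services[name.lower()] = config


def resolve_target(url: str, services: dict) -> tuple:
    """Return (kind, protocol_or_service, rest) for a URL-like argument."""
    prefix, sep, rest = url.partition("://")
    if not sep:
        # No protocol part: fall back to the default protocol.
        return ("protocol", "http", url)
    prefix = prefix.lower()
    if prefix in BASIC_PROTOCOLS:
        return ("protocol", prefix, rest)
    if prefix in services:
        return ("service", prefix, rest)
    raise ValueError(f"unknown protocol or service: {prefix!r}")
```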

In this basic form the value cannot be rewritten further. It is passed through a separate rewrite stage and exported as text, and any special characters in the resulting text are escaped.

A slightly more advanced method is to use additional arguments, of which at least one must be an operation. The operations typically alternate between select commands, edit commands and a few additional commands. If an operation belonging to another command is inserted into the pipeline, an additional command is inserted to switch to a command that supports the operation. Usually there will be a phase to build a selection set, then a rewrite phase to adapt the set to something usable.

Note that the short form of an operation never triggers a change of command. If the active command has no such short form, the operation is silently ignored.
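The switching rules can be made concrete with a small sketch. It is an illustration under assumptions, not the extension's code: the `OPERATIONS` table is only an excerpt of the operation lists, the function name `build_pipeline` is hypothetical, and starting in the select command is assumed.

```python
# Excerpt of operations per command: long form -> short form.
OPERATIONS = {
    "select": {"copy-of": "c", "value-of": "v", "match": "m", "sort": "s"},
    "edit": {"delete": "d", "rename": "r", "update": "u", "move": "m"},
}


def build_pipeline(ops, start="select"):
    """Turn operation names into a pipeline with implicit command switches."""
    active = start
    pipeline = [("command", active)]
    for name in ops:
        long_forms = OPERATIONS[active]
        short_forms = {short: long for long, short in long_forms.items()}
        if name in long_forms:
            pipeline.append(("operation", name))
        elif name in short_forms:
            # A short form is resolved against the active command only.
            pipeline.append(("operation", short_forms[name]))
        else:
            # A long form of another command triggers an implicit switch.
            owner = next((cmd for cmd, table in OPERATIONS.items()
                          if name in table), None)
            if owner is not None:
                active = owner
                pipeline.append(("command", active))
                pipeline.append(("operation", name))
            # Otherwise: a short form unknown to the active command,
            # which is silently ignored.
    return pipeline


steps = build_pipeline(["match", "v", "delete", "c", "r"])
```

Here `delete` switches the pipeline to the edit command, the short form `c` is dropped because edit has no such short form, and `r` resolves to `rename`.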

Dangerous tags are removed from the final text, and the final text is armoured against further transformation.

Commands
It is possible to enforce a new rule set by inserting a command into the pipeline. This is an explicit change of command, as opposed to the implicit changes made by the operations. The command is one of
 * select (sel) data or query xml document
 * edit (ed) and/or update documents
 * format (fo) xml documents
 * escape (esc) special XML characters
 * unescape (unesc) special XML characters

Note that changing to specific commands can be denied or allowed on a per-service basis, or blocked altogether.

The following are not supported as commands:
 * validate (val), validate documents for well-formedness
 * elements (el), display the structure of a document
 * canonic (c14n), XML canonicalization
 * list (ls), list directory as XML
 * xmln (pyx), convert XML into PYX format
 * depyx (p2x), convert PYX into XML

Select
The select command selects data from, or queries, XML documents. It is used to extract those elements from a document that will be used to produce the final fragment.
 * copy-of (c) =xpath, print copy of XPATH expression
 * value-of (v) =xpath, print value of XPATH expression
 * output (o) =string, output string literal
 * nl (n), print new line
 * inp-name (f), print input file name (or URL)
 * match (m) =xpath, match XPATH expression
 * if (i) =test-xpath, check condition 
 * elem (e) =name, print out element 
 * attr (a) =name, add attribute 
 * break (b), break nesting
 * sort (s) =op xpath, sort in order (used after match); op can be followed by a colon, a space or both, and has the form X:Y:Z, where
   * X is A for order="ascending"
   * X is D for order="descending"
   * Y is N for data-type="numeric"
   * Y is T for data-type="text"
   * Z is U for case-order="upper-first"
   * Z is L for case-order="lower-first"

There can be multiple match, copy-of, value-of, and similar options in a single function call.
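The X:Y:Z code of the sort operation maps directly onto the attributes of an XSLT sort instruction. The sketch below decodes such an argument; the function name `parse_sort` and the returned dictionary layout are illustrative choices, not part of the extension.

```python
import re

ORDER = {"A": "ascending", "D": "descending"}
DATA_TYPE = {"N": "numeric", "T": "text"}
CASE_ORDER = {"U": "upper-first", "L": "lower-first"}

# op is X:Y:Z and may be followed by a colon, a space or both.
_SORT = re.compile(r"^([AD]):([NT]):([UL])[: ]+(.+)$")


def parse_sort(argument: str) -> dict:
    """Decode 'X:Y:Z xpath' into the attributes of an xsl:sort element."""
    match = _SORT.match(argument.strip())
    if match is None:
        raise ValueError(f"not a valid sort argument: {argument!r}")
    x, y, z, xpath = match.groups()
    return {
        "select": xpath.strip(),
        "order": ORDER[x],
        "data-type": DATA_TYPE[y],
        "case-order": CASE_ORDER[z],
    }


spec = parse_sort("D:N:U price")
# spec["order"] is "descending", spec["data-type"] is "numeric"
```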

Edit
Edit or update XML documents. The parts of an operation's argument are separated by a broken bar (¦).
 * delete (d) =xpath, delete object identified by xpath
 * insert (i) =xpath¦(e|elem|t|text|a|attr)¦name¦value
 * append (a) =xpath¦(e|elem|t|text|a|attr)¦name¦value
 * subnode (s) =xpath¦(e|elem|t|text|a|attr)¦name¦value
 * move (m) =xpath1¦xpath2, move object identified by xpath1 to xpath2
 * rename (r) =xpath¦new-name
 * update (u) =xpath¦((v|value)¦value|(x|expr)¦xpath)
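To show what the delete, rename and update operations do to a document, here is a small standard-library analogue. The helper names and the sample document are invented for illustration, and `xml.etree.ElementTree` supports only a limited XPath subset, so this is a sketch of the effect rather than of the extension itself.

```python
import xml.etree.ElementTree as ET


def delete(root: ET.Element, path: str) -> None:
    """Remove every element matched by path from its parent."""
    parents = {child: parent for parent in root.iter() for child in parent}
    for node in root.findall(path):
        parents[node].remove(node)


def rename(root: ET.Element, path: str, new_name: str) -> None:
    """Give every element matched by path a new tag name."""
    for node in root.findall(path):
        node.tag = new_name


def update(root: ET.Element, path: str, value: str) -> None:
    """Replace the text value of every element matched by path."""
    for node in root.findall(path):
        node.text = value


doc = ET.fromstring(
    '<items>'
    '<item id="1"><name>a</name><note>x</note></item>'
    '<item id="2"><name>b</name><note>y</note></item>'
    '</items>'
)
delete(doc, ".//note")
rename(doc, "./item", "entry")
update(doc, "./entry[@id='2']/name", "beta")
result = ET.tostring(doc, encoding="unicode")
```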

Format
Reformat extracted text from the external source.
 * noindent (n), do not indent
 * indent-tab (t), indent output with tabulation
 * indent-spaces (s) =num, indent output with spaces
 * omit-decl (o), omit xml declaration 
 * recover (R), try to recover what is parsable
 * dropdtd (D), remove the DOCTYPE of the input docs
 * nocdata (C), replace cdata section with text nodes
 * nsclean (N), remove redundant namespace declarations
 * encode (e) =encoding, output in the given encoding (utf-8, unicode...)
 * html (H), input is HTML
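A few of the format options (noindent, indent-tab, indent-spaces and omit-decl) can be imitated with the standard library as a rough illustration. The function `format_xml` and its parameter names are hypothetical, and `ET.indent` requires Python 3.9 or later.

```python
import xml.etree.ElementTree as ET


def format_xml(document: str, indent_spaces: int = 2, indent_tab: bool = False,
               noindent: bool = False, omit_decl: bool = False) -> str:
    """Re-serialize a document with the chosen indentation."""
    root = ET.fromstring(document)
    if not noindent:
        # Tabs take precedence over spaces when both are requested.
        ET.indent(root, space="\t" if indent_tab else " " * indent_spaces)
    body = ET.tostring(root, encoding="unicode")
    if omit_decl:
        return body
    return '<?xml version="1.0"?>\n' + body


pretty = format_xml("<a><b>1</b></a>", indent_spaces=4, omit_decl=True)
print(pretty)
```

With `noindent=True` the sketch leaves the existing whitespace untouched instead of stripping it.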

Escape
Escape special XML characters in the response from the external web service or published XML or HTML documents.

Unescape
Unescape special XML characters in the response from the external web service or published XML or HTML documents.
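The effect of escaping and unescaping can be illustrated with Python's standard library, which handles the ampersand and the angle brackets by default. The sample string is invented, and the exact set of characters handled by the extension may differ.

```python
from xml.sax.saxutils import escape, unescape

raw = "<b>Fish & chips</b>"

escaped = escape(raw)
print(escaped)  # prints &lt;b&gt;Fish &amp; chips&lt;/b&gt;

# Unescaping reverses the substitution and restores the original text.
restored = unescape(escaped)
```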

Tag extension
This is a tag extension that uses a complete XSLT document to transform a document into a new form. The tag extension processes the attributes to replace templates and template parameters, but it does not process the content of the tag.

Parser functions
The effect of applying function arguments can be illustrated with the following XSLT analogue

which is equivalent to applying the following XSLT

Tag functions
Transform XML documents using XSLT.

Installation
Install and configure the XMLStarlet Command Line XML Toolkit. Make sure it works and that the MediaWiki installation can access it.

The usual: Copy XMLstarlet.php and XMLstarlet.i18n.php to a subfolder XMLstarlet in the extensions folder, then add the following to LocalSettings.php:

Optionally, some defaults can be changed. In particular, add a line to configure where the executable can be found.