User:Jeblad/XMLstarlet

XMLstarlet adds a tag and parser functions that make it possible to hot-link data from external sources by extracting data from web services or from published XML or HTML documents. Such sources are expected to be national statistics services and other services for distinct and systematic data, but also services for highly complex and structured data.

Usage

The extension relies on redirection through a system message for access to external sites, or on cooperation from the external site. If there is no redirect through the system messages, the file robots.txt is checked to see whether the extension is allowed to use the specified content from the external site. If access is allowed, the extension proceeds to download the data, extract and process it, and optionally store the final data in a local cache, with time-out information taken from the external site or set in the local configuration.
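
The access decision described above can be sketched as follows. This is only an illustration using Python's standard `urllib.robotparser`; the user agent name, the `redirects` set standing in for the system message, and the function name `may_fetch` are assumptions, not part of the extension.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical user agent name; the extension may identify itself differently.
USER_AGENT = "XMLstarlet-extension"


def may_fetch(url, robots_txt, redirects=None):
    """Decide whether `url` may be downloaded.

    `redirects` stands in for the system message: a collection of URL
    prefixes approved locally. An approved prefix bypasses the robots.txt
    check; otherwise robots.txt decides.
    """
    for prefix in (redirects or ()):
        if url.startswith(prefix):
            return True
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(USER_AGENT, url)


robots = """User-agent: *
Disallow: /private/
"""
print(may_fetch("http://example.org/stats/data.xml", robots))
print(may_fetch("http://example.org/private/data.xml", robots))
print(may_fetch("http://example.org/private/data.xml", robots,
                redirects={"http://example.org/private/"}))
```

The robots.txt content is passed in as text here so that the sketch runs without network access; a real implementation would fetch it from the external site first.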

Data are extracted, and possibly reprocessed, according to directives given by the tag or parser functions. Some actions might be allowed or denied, depending on the specific site and possibly on the local configuration. The data will then be processed by an XSLT parser. If the cache time limit is exceeded, the external site will be accessed again to fetch fresh data. The cache time can be overruled by local configuration, or by presets from the system message.

Accessing the external site can be set up as a separate stage in the processing, and several parameters are then available to change the behaviour of the underlying system. When this is enabled, the extension will also cache pages according to values given by the external site, and only the maximum and minimum time-outs for the cache are taken from the local configuration.
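
One way to picture the interplay between the time-out given by the external site and the local minimum and maximum is the small cache below. It is a sketch under assumptions: the class `TimedCache`, the function `effective_ttl` and the in-memory storage are invented for illustration and say nothing about how the extension actually stores data.

```python
import time


def effective_ttl(remote_ttl, min_ttl, max_ttl):
    """Clamp the time-out suggested by the external site to local limits."""
    return max(min_ttl, min(remote_ttl, max_ttl))


class TimedCache:
    """Tiny in-memory cache; each entry expires after its own time-out."""

    def __init__(self, min_ttl, max_ttl, clock=time.time):
        self.min_ttl = min_ttl
        self.max_ttl = max_ttl
        self.clock = clock
        self._entries = {}

    def put(self, key, value, remote_ttl):
        ttl = effective_ttl(remote_ttl, self.min_ttl, self.max_ttl)
        self._entries[key] = (value, self.clock() + ttl)

    def get(self, key):
        """Return the cached value, or None when it is missing or expired."""
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expires = entry
        if self.clock() >= expires:
            # Expired: the caller is expected to fetch fresh data.
            del self._entries[key]
            return None
        return value
```

A caller would try `get` first and only contact the external site when it returns `None`, then store the fresh result with `put`.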

There are both tag and parser function versions of the two access methods, the xml function and the xslt function. Which version to use depends mainly on when the function has to be evaluated, and on which is more readable.

<xml src="url" path="xpath">
…
</xml>
<xslt src="url">
…
</xslt>

The tag functions of the extension allow lines of statements inside a set of tags (<xml> or <xslt>), while the tag attributes are used for additional arguments. The tag extension will process the attributes to replace templates and template parameters, but it will not process the content of the tag.

Similar parser functions exist, and all their arguments will be processed to replace templates and template parameters. The xml function operates in two states, a basic state and an extended state. Which state is used is determined by the number of arguments.

{{#xml:''url''|''xpath''}}
{{#xml:''url''|…}}
{{#xslt:''url''|…}}

When the protocol part of the URL is replaced by a service name, the record given for that service may override the local configuration for the service. If the URL contains no protocol part, the protocol is inferred from the domain name, and defaults to http. Basic protocols cannot be used as service names.
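
The resolution rules above might be sketched like this. Everything named here is hypothetical: the `services` mapping, its `protocol` and `overrides` keys, the list of basic protocols, and the rule that a host starting with `ftp.` implies ftp are assumptions chosen only to make the example concrete.

```python
BASIC_PROTOCOLS = {"http", "https", "ftp"}  # assumed list of basic protocols


def resolve_source(url, services):
    """Return (real_url, overrides) for a source given to the functions.

    `services` is a hypothetical mapping from service name to a record
    holding the real "protocol" and optional configuration "overrides".
    """
    for name in services:
        if name in BASIC_PROTOCOLS:
            raise ValueError("basic protocol used as service name: " + name)
    scheme, sep, rest = url.partition("://")
    if not sep:
        # No protocol part: infer it from the domain name, default to http.
        host = url.split("/", 1)[0]
        guessed = "ftp" if host.startswith("ftp.") else "http"
        return guessed + "://" + url, {}
    if scheme in services:
        # A service name replaces the protocol; its record may override
        # the local configuration.
        record = services[scheme]
        return record["protocol"] + "://" + rest, record.get("overrides", {})
    return url, {}
```

With a service record such as `{"stats": {"protocol": "http", "overrides": {"cache_max": 600}}}`, the source `stats://data.example.org/t.xml` resolves to an ordinary http URL plus the overrides for that service.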

The xml function

The basic method for accessing data from other sites is through the xml function. It downloads a specific document and extracts a specific value identified by an XPath statement. It takes exactly one additional argument, the XPath statement. It is not possible to rewrite the value further. The value is transformed through a separate rewrite stage and exported as text. Any special characters in the resulting text are escaped.
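
The basic form amounts to "pick one value by path, then escape it". A minimal Python sketch of that idea follows; it uses the standard `xml.etree.ElementTree`, which only supports a subset of XPath, and takes the document as a string instead of downloading it. The function name `xml_value` and the sample document are made up for the example.

```python
import xml.etree.ElementTree as ET
from html import escape


def xml_value(document, xpath):
    """Return the text of the first node matching `xpath`, escaped."""
    root = ET.fromstring(document)
    text = root.findtext(xpath, default="")
    # Special characters are escaped before the value is exported as text.
    return escape(text)


doc = "<stats><population year='2009'>4 &lt; 5 million</population></stats>"
print(xml_value(doc, "./population[@year='2009']"))  # 4 &lt; 5 million
```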

The slightly more advanced method is to use additional arguments, of which at least one must be an operation. The operations typically alternate between select commands, edit commands and a few additional commands. If an operation belonging to another command is inserted into the pipeline, an additional command will be inserted to switch to a command that supports the operation. Usually there will be a phase that builds a selection set, followed by a rewrite phase that adapts the set to something usable.

Note that the short form of an operation will never trigger a change of command. If the active command has no such short form, the operation is silently ignored.
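
The implicit command switching can be illustrated by grouping a list of operations into stages. The sketch below is an interpretation of the rules above, not the extension's code: the function `plan_pipeline`, the choice of `sel` as the starting command, and the exact handling of unknown names are assumptions. The operation tables are taken from the lists on this page.

```python
COMMANDS = {
    "sel": {"copy-of": "c", "value-of": "v", "output": "o", "nl": "n",
            "inp-name": "f", "match": "m", "if": "i", "elem": "e",
            "attr": "a", "break": "b", "sort": "s"},
    "ed": {"delete": "d", "insert": "i", "append": "a", "subnode": "s",
           "move": "m", "rename": "r", "update": "u"},
    "fo": {"noindent": "n", "indent-tab": "t", "indent-spaces": "s",
           "omit-decl": "o", "recover": "R", "dropdtd": "D",
           "nocdata": "C", "nsclean": "N", "encode": "e", "html": "H"},
}


def plan_pipeline(operations, start="sel"):
    """Group operations into (command, [operation, ...]) stages."""
    stages = [(start, [])]
    for op in operations:
        name, _, _value = op.partition("=")
        active = stages[-1][0]
        if name in COMMANDS[active] or name in COMMANDS[active].values():
            # Long or short form known to the active command.
            stages[-1][1].append(op)
            continue
        # Only a long form from another command switches command.
        owner = next((c for c, ops in COMMANDS.items() if name in ops), None)
        if owner is not None:
            stages.append((owner, [op]))
        # Anything else, such as a foreign short form, is silently ignored.
    return [stage for stage in stages if stage[1]]


print(plan_pipeline(["m=//row", "v=cell", "delete=//note", "r=x", "C",
                     "omit-decl"]))
```

In this run the long form `delete` switches to the edit command, the short form `C` is dropped because the edit command has no such operation, and `omit-decl` switches to the format command.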

Dangerous tags are removed from the final text, and the final text is armoured against further transformation.
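
Removing dangerous tags could, for instance, be done by walking the result tree and dropping elements on a deny list. The list of tag names below is an assumption, as is the function name; the page does not say which tags the extension treats as dangerous. The sketch keeps the text that followed a removed element and does not handle a dangerous root element.

```python
import xml.etree.ElementTree as ET

DANGEROUS = {"script", "iframe", "object", "embed"}  # assumed deny list


def strip_dangerous(fragment):
    """Remove deny-listed elements but keep the text that followed them."""
    root = ET.fromstring(fragment)
    for parent in list(root.iter()):
        previous = None
        for child in list(parent):
            if child.tag.lower() in DANGEROUS:
                # Re-attach the tail text before dropping the element.
                if previous is None:
                    parent.text = (parent.text or "") + (child.tail or "")
                else:
                    previous.tail = (previous.tail or "") + (child.tail or "")
                parent.remove(child)
            else:
                previous = child
    return ET.tostring(root, encoding="unicode")


print(strip_dangerous("<div>a<script>x</script>b<p>ok</p></div>"))
# <div>ab<p>ok</p></div>
```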

It is possible to enforce a new rule set by inserting a command into the pipeline. This is an explicit change of command, as opposed to the implicit changes made by the operations. The following commands are supported: select (sel), to select data or query XML documents; edit (ed), to edit and/or update documents; format (fo), to format XML documents; and escape/unescape (esc/unesc), to escape or unescape special XML characters.

Note that changing to specific commands can be denied or allowed on a per-service basis, or blocked altogether.

The following XMLstarlet commands are not supported as commands for the parser function: validate (val), validate documents or check well-formedness; elements (el), display the element structure; canonic (c14n), XML canonicalization; list (ls), list a directory as XML; xmln (pyx), convert XML into PYX format; and depyx (p2x), convert PYX into XML.

Select

The select command selects data from, or queries, XML documents. It is used to extract those elements from a document that will be used to produce the final fragment.

  • copy-of (c) =xpath, print copy of XPATH expression
  • value-of (v) =xpath, print value of XPATH expression
  • output (o) =string, output string literal
  • nl (n), print new line
  • inp-name (f), print input file name (or URL)
  • match (m) =xpath, match XPATH expression
  • if (i) =test-xpath, check condition <xsl:if test="test-xpath">
  • elem (e) =name, print out element <xsl:element name="name">
  • attr (a) =name, add attribute <xsl:attribute name="name">
  • break (b), break nesting
  • sort (s) =op xpath, sort in order (used after match); op can be followed by a colon, a space or both, and has the form X:Y:Z, where
      • X is A for order="ascending"
      • X is D for order="descending"
      • Y is N for data-type="numeric"
      • Y is T for data-type="text"
      • Z is U for case-order="upper-first"
      • Z is L for case-order="lower-first"

There can be multiple match, copy-of, value-of, and similar options in a single function call.
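
The select operations listed above correspond to options of the XMLStarlet `sel` command. The sketch below turns a list of operations into a command line; the function name, the default executable name `xmlstarlet` (some installations call it `xml`), and the idea that the URL is handed directly to the tool are assumptions. Whether the extension builds its command line this way is not stated on this page.

```python
SELECT_FLAGS = {"copy-of": "c", "value-of": "v", "output": "o", "nl": "n",
                "inp-name": "f", "match": "m", "if": "i", "elem": "e",
                "attr": "a", "break": "b", "sort": "s"}
NO_VALUE = {"n", "f", "b"}  # operations that take no argument


def select_argv(url, operations, executable="xmlstarlet"):
    """Build an argument list for `xmlstarlet sel` from select operations."""
    argv = [executable, "sel"]
    for op in operations:
        name, _, value = op.partition("=")
        if name == "template":
            argv.append("-t")  # start a new template
            continue
        flag = SELECT_FLAGS.get(name, name)  # accept long or short form
        if flag not in SELECT_FLAGS.values():
            raise ValueError("unknown select operation: " + name)
        argv.append("-" + flag)
        if flag == "s":
            # "op xpath": X:Y:Z followed by a colon, a space or both.
            argv += [value[:5], value[5:].lstrip(": ")]
        elif flag not in NO_VALUE:
            argv.append(value)
    argv.append(url)
    return argv


print(select_argv("http://some.where.net",
                  ["template", "m=//row", "s=D:N:U: price", "v=name", "nl"]))
```

Passing the result as a list to something like `subprocess.run` avoids shell quoting problems with XPath expressions.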

Edit

Edit and/or update xml documents.

  • delete (d) =xpath, delete object identified by xpath
  • insert (i) =xpath¦(e|elem|t|text|a|attr)¦name¦value
  • append (a) =xpath¦(e|elem|t|text|a|attr)¦name¦value
  • subnode (s) =xpath¦(e|elem|t|text|a|attr)¦name¦value
  • move (m) =xpath1¦xpath2, move the object identified by xpath1 to xpath2
  • rename (r) =xpath¦new-name
  • update (u) =xpath¦((v|value)¦value|(x|expr)¦xpath)
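
The ¦-separated values above map naturally onto the options of the XMLStarlet `ed` command. The following sketch translates a single edit operation into such options; the function name `edit_argv` is invented, and it is an assumption that the extension performs the translation in this way.

```python
NODE_TYPES = {"e": "elem", "t": "text", "a": "attr"}
INSERTING = {"insert": "i", "append": "a", "subnode": "s"}


def edit_argv(operation):
    """Translate one edit operation into `xmlstarlet ed` arguments."""
    name, _, value = operation.partition("=")
    parts = value.split("¦")
    if name in ("delete", "d"):
        return ["-d", parts[0]]
    if name in INSERTING or name in INSERTING.values():
        xpath, node_type, node_name, node_value = parts
        flag = INSERTING.get(name, name)
        node_type = NODE_TYPES.get(node_type, node_type)
        return ["-" + flag, xpath, "-t", node_type,
                "-n", node_name, "-v", node_value]
    if name in ("move", "m"):
        return ["-m", parts[0], parts[1]]
    if name in ("rename", "r"):
        return ["-r", parts[0], "-v", parts[1]]
    if name in ("update", "u"):
        xpath, kind, operand = parts
        # "v"/"value" sets a literal value, "x"/"expr" an XPath expression.
        return ["-u", xpath, "-v" if kind in ("v", "value") else "-x", operand]
    raise ValueError("unknown edit operation: " + name)


print(edit_argv("insert=//row¦a¦class¦odd"))
print(edit_argv("u=//price¦x¦. * 2"))
```

Only the first `=` separates the operation name from its value, so XPath predicates such as `[@id='1']` pass through unchanged.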

Format

Reformat extracted text from the external source.

  • noindent (n), do not indent
  • indent-tab (t), indent output with tabulation
  • indent-spaces (s) =num, indent output with <num> spaces
  • omit-decl (o), omit xml declaration <?xml version="1.0"?>
  • recover (R), try to recover what is parsable
  • dropdtd (D), remove the DOCTYPE of the input docs
  • nocdata (C), replace cdata section with text nodes
  • nsclean (N), remove redundant namespace declarations
  • encode (e) =encoding, output in the given encoding (utf-8, unicode...)
  • html (H), input is HTML
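
To give a feel for what a few of these options do, the sketch below mimics noindent, indent-spaces and omit-decl with Python's standard library rather than with XMLStarlet itself. It needs Python 3.9 or later for `ET.indent`, and the function name and parameters are made up for the example.

```python
import xml.etree.ElementTree as ET


def reformat(document, indent_spaces=2, noindent=False, omit_decl=False):
    """Re-serialize `document`, mimicking a few of the format options."""
    root = ET.fromstring(document)
    if not noindent:
        ET.indent(root, space=" " * indent_spaces)  # Python 3.9+
    body = ET.tostring(root, encoding="unicode")
    if omit_decl:
        return body
    return '<?xml version="1.0"?>\n' + body


print(reformat("<a><b>1</b></a>", indent_spaces=4, omit_decl=True))
```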

Escape/Unescape

Escape/Unescape special XML characters in the response from the external web service or published xml or html documents.
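
As an illustration of what escaping and unescaping mean here, the same round trip can be shown with Python's standard `xml.sax.saxutils`. This only demonstrates the concept; the sample string is made up, and the extension itself relies on the XMLStarlet commands.

```python
from xml.sax.saxutils import escape, unescape

raw = "Fish & chips <cheap>"
escaped = escape(raw)
print(escaped)                   # Fish &amp; chips &lt;cheap&gt;
print(unescape(escaped) == raw)  # True
```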

The xslt function

For very specialized processing it is sometimes necessary to use the full power of XSLT. Attributes for the tag function are mainly specialized attributes passed to the curl function.

The parser function version of the extension allows an embedded transform as the last anonymous attribute. Additional attributes are passed on to the underlying programs, subject to filtering.

Dangerous tags are removed from the final text, and the final text is armoured against further transformation.

Special page

A special page exists to help analyze the structure of external pages and to build the statements. This page takes the tag form of the functions and produces either the content, the source or the element structure. By alternating between the different visualizations it is fairly easy to iteratively discover the best and most fail-safe call structure.

Like the functions, the page will not present dangerous content, but will display it replaced by special markers. This behaviour cannot be overruled through service declarations.
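
The element structure view can be thought of as a list of element paths, similar to what the XMLStarlet `el` command prints. A minimal sketch, with an invented function name and sample document, could look like this:

```python
import xml.etree.ElementTree as ET


def element_paths(document):
    """List the path of every element, in document order."""
    paths = []

    def walk(element, prefix):
        path = prefix + element.tag
        paths.append(path)
        for child in element:
            walk(child, path + "/")

    walk(ET.fromstring(document), "")
    return paths


print(element_paths("<a><b/><c><d/></c></a>"))
# ['a', 'a/b', 'a/c', 'a/c/d']
```

Such a listing makes it easier to write the XPath statements for the match, copy-of and value-of operations.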

Examples

Parser functions

The effect of applying function arguments can be illustrated with an XSLT analogue. Consider the call

{{#xml:http://some.where.net
|template|c=xpath0|m=xpath1|m=xpath2|v=xpath3
|template|m=xpath4|c=xpath5}}

which is equivalent to applying the following XSLT:

<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="/">
  <xsl:call-template name="t1"/>
  <xsl:call-template name="t2"/>
</xsl:template>
<xsl:template name="t1">
  <xsl:copy-of select="xpath0"/>
  <xsl:for-each select="xpath1">
    <xsl:for-each select="xpath2">
      <xsl:value-of select="xpath3"/>
    </xsl:for-each>
  </xsl:for-each>
</xsl:template>
<xsl:template name="t2">
  <xsl:for-each select="xpath4">
    <xsl:copy-of select="xpath5"/>
  </xsl:for-each>
</xsl:template>
</xsl:stylesheet>
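
The mapping from arguments to the stylesheet above is mechanical, which the following sketch tries to show: each template argument starts a named template, each match opens a nested xsl:for-each, and copy-of and value-of emit the corresponding instruction. The function `args_to_xslt` is hypothetical and covers only these three operations; it is not taken from the extension or from XMLStarlet.

```python
from xml.sax.saxutils import quoteattr


def args_to_xslt(args):
    """Build the XSLT that a list of select arguments corresponds to."""
    bodies = []  # one list of lines per template
    depth = 0    # number of currently open xsl:for-each elements

    def close_loops():
        nonlocal depth
        while depth:
            depth -= 1
            bodies[-1].append("  " * (depth + 1) + "</xsl:for-each>")

    for arg in args:
        name, _, xpath = arg.partition("=")
        if name == "template":
            close_loops()          # finish the previous template, if any
            bodies.append([])
            continue
        if not bodies:
            bodies.append([])
        pad = "  " * (depth + 1)
        if name in ("m", "match"):
            bodies[-1].append(pad + "<xsl:for-each select=%s>" % quoteattr(xpath))
            depth += 1
        elif name in ("c", "copy-of"):
            bodies[-1].append(pad + "<xsl:copy-of select=%s/>" % quoteattr(xpath))
        elif name in ("v", "value-of"):
            bodies[-1].append(pad + "<xsl:value-of select=%s/>" % quoteattr(xpath))
        else:
            raise ValueError("unsupported in this sketch: " + arg)
    close_loops()

    lines = ['<?xml version="1.0"?>',
             '<xsl:stylesheet version="1.0" '
             'xmlns:xsl="http://www.w3.org/1999/XSL/Transform">',
             '<xsl:template match="/">']
    lines += ['  <xsl:call-template name="t%d"/>' % n
              for n in range(1, len(bodies) + 1)]
    lines.append("</xsl:template>")
    for n, body in enumerate(bodies, 1):
        lines.append('<xsl:template name="t%d">' % n)
        lines += body
        lines.append("</xsl:template>")
    lines.append("</xsl:stylesheet>")
    return "\n".join(lines)


print(args_to_xslt(["template", "c=xpath0", "m=xpath1", "m=xpath2",
                    "v=xpath3", "template", "m=xpath4", "c=xpath5"]))
```

Run on the arguments of the example call, the sketch prints a stylesheet with the same structure as the one shown above.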

Tag functions

Transform XML documents using XSLT.

<xslt src="http://some.where.net">
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="/">
  <xsl:call-template name="t1"/>
  <xsl:call-template name="t2"/>
</xsl:template>
<xsl:template name="t1">
  <xsl:copy-of select="xpath0"/>
  <xsl:for-each select="xpath1">
    <xsl:for-each select="xpath2">
      <xsl:value-of select="xpath3"/>
    </xsl:for-each>
  </xsl:for-each>
</xsl:template>
<xsl:template name="t2">
  <xsl:for-each select="xpath4">
    <xsl:copy-of select="xpath5"/>
  </xsl:for-each>
</xsl:template>
</xsl:stylesheet>
</xslt>

Installation

Install and configure the XMLStarlet Command Line XML Toolkit. Make sure it works and can be accessed by the MediaWiki installation.

As usual, copy XMLstarlet.php and XMLstarlet.i18n.php to a subfolder XMLstarlet in the extensions folder, then add the following to LocalSettings.php:

require_once( "$IP/extensions/XMLstarlet/XMLstarlet.php");

Optionally, some defaults can be changed. In particular, add a line to configure where the executable can be found.

External links