Extension:External Data/Parsing data

The parser functions #get_web_data, #get_soap_data, #get_file_data and #get_program_data all retrieve data that is expected to take a certain structured form. The following parameters can be used in any of these calls to help parse that data. Some apply only to specific formats, while others are valid across all formats.

Cross-format parameters

 * format - specifies the format of the data being retrieved: it should be one of 'CSV', 'CSV with header', 'GFF', 'JSON', 'XML', 'HTML' or 'text'. CSV, JSON and XML are standard data formats; GFF, or the Generic Feature Format, is a format for genomic data. The difference between 'CSV' and 'CSV with header' is that 'CSV' is simply a set of lines with values, while in 'CSV with header' the first line is a header, holding a comma-separated list of the column names. 'text' indicates that the contents of the file should be retrieved as-is.


 * start line, end line, header lines, footer lines - use these to cut out a fragment of the data. Line numbers are one-based; negative values (-1 meaning the last line) are possible, as are percentages (0% to 100%). Use header lines and footer lines to carve out a valid CSV, JSON or XML fragment. Note that if any of these is set, additional newlines are injected into XML or JSON to guarantee that the required tag/variable blocks begin and end on new lines, which affects the start line and end line values you need. External variables are also provided that hold the beginning and end of the main fragment (without header or footer), the number of lines returned, and the total number of lines in the file.
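
As an illustration of the two parameters above, here is a minimal Python sketch of cutting a fragment out of a file by line number and then reading it as 'CSV' versus 'CSV with header'. It is not the extension's own code: the helper names (resolve_line, cut_fragment), the sample data and the choice to treat end line as inclusive are assumptions made for the example, and percentages are left out because their exact semantics are not described here.

```python
import csv
import io


def resolve_line(n: int, total: int) -> int:
    """Turn a one-based (or negative) line number into a zero-based index."""
    if n == 0:
        raise ValueError("line numbers are one-based; 0 is not valid")
    # -1 means the last line, -2 the one before it, and so on.
    return n - 1 if n > 0 else total + n


def cut_fragment(text: str, start_line: int = 1, end_line: int = -1) -> str:
    """Return lines start_line..end_line (inclusive; an assumption of this sketch)."""
    lines = text.splitlines()
    first = resolve_line(start_line, len(lines))
    last = resolve_line(end_line, len(lines))
    return "\n".join(lines[first:last + 1])


# Hypothetical file: one junk line before and one after the real CSV.
raw = (
    "# exported report\n"
    "name,population\n"
    "Paris,2100000\n"
    "Rome,2800000\n"
    "-- end of report --"
)

fragment = cut_fragment(raw, start_line=2, end_line=-2)

# 'CSV': every line, including the first, is just a row of values.
plain_rows = list(csv.reader(io.StringIO(fragment)))

# 'CSV with header': the first line supplies the column names.
named_rows = list(csv.DictReader(io.StringIO(fragment)))
```

In the second reading, each row can be addressed by column name (for instance named_rows[0]["name"]), which is the practical difference between the two CSV formats.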

Format-specific parameters

 * delimiter -
   * in CSV format, specifies the delimiter between values in the data set. The default value is ",". To specify a tab delimiter, use "\t".
   * in INI format, specifies the delimiter between key and value.


 * regex - specifies a PHP regular expression that should be used to extract specific strings; used with the "text" format. For example, a suitable regex applied to a sample text can return the string "Heading" to an external variable.


 * use xpath - an optional parameter that can be used with the "XML" or "HTML" formats, to indicate that "data" mappings should be done using XPath notation. This is especially useful if the same tag or attribute name is used more than once in the file, and you only want to get a specific instance of it. The details of XPath notation are beyond the scope of this page.


 * default xmlns prefix - an optional parameter that can be used with "use xpath", which sets the default namespace prefix to be used.


 * use jsonpath - an optional parameter that can be used with the "JSON" format, to indicate that "data" mappings should be done using JSONPath notation. JSONPath is less well known than XPath, but guides to its syntax and online evaluators are available.


 * json offset - an optional parameter that represents the number of characters to ignore at the beginning of the data set being parsed. It is used with JSON values, in case the JSON being accessed has some kind of security string at the beginning.


 * allow trailing commas - if this is set, JSON files with commas before "]" or "}" will be parsed, even though the JSON specification does not allow trailing commas. This setting is useful when start line, end line, header lines or footer lines are set.
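
To make the two JSON-related parameters above more concrete, the following Python sketch shows one way a leading security string can be skipped and trailing commas tolerated before parsing. It only illustrates the idea; the function parse_lenient_json and the sample payload are hypothetical, and the regular expression is deliberately naive (it would also alter a comma followed by "]" or "}" inside a string value).

```python
import json
import re


def parse_lenient_json(text: str, json_offset: int = 0,
                       allow_trailing_commas: bool = False):
    """Parse JSON after skipping a prefix and, optionally, dropping trailing commas."""
    # Ignore the first json_offset characters (e.g. a security prefix).
    body = text[json_offset:]
    if allow_trailing_commas:
        # Remove a comma that is followed only by whitespace and then ] or }.
        # Naive: does not protect commas inside string literals.
        body = re.sub(r",\s*([\]}])", r"\1", body)
    return json.loads(body)


# Hypothetical response: a 4-character security string, then JSON
# with a trailing comma in both the array and the object.
raw = ")]}'\n" + '{"items": [1, 2, 3,], "ok": true,}'

data = parse_lenient_json(raw, json_offset=4, allow_trailing_commas=True)
```

Without the offset, the parser would fail on the prefix; without the trailing-comma handling, it would fail on the array and object endings.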

Using CSS-style selectors
With the "HTML" format, you can use either XPath (see above) or CSS-style selectors. For CSS-style selection, you do not need to specify a special parameter: it is the default approach used when "use xpath" is not specified. CSS selectors are a notation that uses tag names, classes and IDs to locate one or more elements in an HTML page; it is also the syntax used in jQuery.
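
The sketch below illustrates how the two notations address the same elements. Python's standard library has no CSS selector engine, so the code runs the XPath form (using the limited XPath subset of xml.etree.ElementTree) and gives the roughly equivalent CSS-style selector in a comment. The sample page, class names and IDs are invented for the example, and it assumes a well-formed page; it says nothing about how the extension itself evaluates selectors.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page in which the class "title" occurs twice.
page = """<html><body>
<div id="summary"><span class="title">Overview</span></div>
<div id="details"><span class="title">Specifications</span><span class="price">42</span></div>
</body></html>"""

root = ET.fromstring(page)

# CSS-style: span.title
# XPath:     .//span[@class='title']   (exact attribute match, unlike CSS,
#            which would also match class="title large")
all_titles = [e.text for e in root.findall(".//span[@class='title']")]

# CSS-style: #details > span.title
# XPath:     .//div[@id='details']/span[@class='title']
details_title = root.find(".//div[@id='details']/span[@class='title']").text

# CSS-style: #details > span.price
price = root.find(".//div[@id='details']/span[@class='price']").text
```

Qualifying the selector with the enclosing element's ID is what singles out one instance when the same tag or class appears more than once.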

INI texts
Texts in INI format (fetched from the web, files or program output) can be parsed by specifying the INI format. The delimiter setting specifies the delimiter between key and value. The comment delimiter setting specifies the delimiter from which line comments begin. Comments are collected in a dedicated external variable. Lines without a delimiter are treated as comments.
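
The rules above can be sketched in a few lines of Python. This is an illustration only, not the extension's parser: the function parse_ini and the sample text are hypothetical, and the defaults used here ("=" between key and value, ";" for comments) are the conventional INI characters, assumed for the example.

```python
def parse_ini(text: str, delimiter: str = "=", comment_delimiter: str = ";"):
    """Split INI-style text into key/value pairs and a list of comments."""
    values = {}
    comments = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Everything from the comment delimiter to the end of the line is a comment.
        content, found, comment = line.partition(comment_delimiter)
        if found:
            comments.append(comment.strip())
        content = content.strip()
        if not content:
            continue
        key, found, value = content.partition(delimiter)
        if found:
            values[key.strip()] = value.strip()
        else:
            # A line without a key/value delimiter is treated as a comment.
            comments.append(content)
    return values, comments


# Hypothetical INI text.
sample = """; database settings
host = localhost
port = 5432
just a note
"""

values, comments = parse_ini(sample)
```

Here values holds the two key/value pairs, while comments receives both the explicit comment and the line that has no delimiter.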