User:Kephir/XML parse tree

From MediaWiki.org

This page is unofficial documentation of the XML parse tree format returned by Special:ExpandTemplates and by API modules such as API:Expandtemplates and API:Properties#revisions / rv when a generatexml argument is passed to the API call.
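As a rough illustration of such an API call, the sketch below only builds the request URL and does not send it. The endpoint URL and the helper name parse_tree_request_url are assumptions made for this example; the generatexml argument is the one described above, and whether a given MediaWiki version still accepts it should be checked against that wiki's API help.

```python
from urllib.parse import urlencode

# Assumed endpoint; substitute the api.php URL of the wiki being queried.
API_URL = "https://www.mediawiki.org/w/api.php"


def parse_tree_request_url(wikitext, api_url=API_URL):
    """Build a GET URL asking expandtemplates to include the XML parse tree.

    Hypothetical helper for illustration; it performs no network access.
    """
    params = {
        "action": "expandtemplates",
        "text": wikitext,
        "generatexml": 1,  # the argument described above
        "format": "json",
    }
    return api_url + "?" + urlencode(params)


url = parse_tree_request_url("{{foo|bar}}")
```

The resulting URL can then be fetched with any HTTP client; long wikitext is usually better sent in a POST body than in the query string.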

DTD

<!DOCTYPE root [
	<!ENTITY % mixed-markup "(#PCDATA|template|comment|h|tplarg|ext|ignore)*">
 
	<!ELEMENT root       %mixed-markup;                     >
	<!ELEMENT template   (title, part*)                     >
	<!ELEMENT tplarg     (title, part*)                     >
	<!ELEMENT part       (#PCDATA|name|value)*              >
	<!ELEMENT title      %mixed-markup;                     >
	<!ELEMENT name       %mixed-markup;                     >
	<!ELEMENT value      %mixed-markup;                     >
	<!ELEMENT h          %mixed-markup;                     >
	<!ELEMENT comment    (#PCDATA)                          >
	<!ELEMENT ext        (name, attr, inner?, close?)       >
	<!ELEMENT attr       (#PCDATA)                          >
	<!ELEMENT inner      (#PCDATA)                          >
	<!ELEMENT ignore     (#PCDATA)                          >
	<!ELEMENT close      (#PCDATA)                          >
 
	<!ATTLIST root
		xml:space    (default|preserve) #FIXED "preserve"   >
	<!ATTLIST template
		lineStart    CDATA    #IMPLIED                  >
	<!ATTLIST tplarg
		lineStart    CDATA    #IMPLIED                  >
	<!ATTLIST name
		index        CDATA    #IMPLIED                  >
	<!ATTLIST h
		i            CDATA    #REQUIRED
		level        CDATA    #REQUIRED                 >
]>

Elements

root
The root element. Has no interesting attributes by itself.
Since whitespace is significant in reconstructing wiki markup, it is a good idea to parse the XML document as if root had an xml:space="preserve" attribute. MediaWiki does not specify it explicitly, however.
template
Indicates a template invocation ({{ ... }}). Must contain at least a title element, followed by optional part elements.
The lineStart attribute is present and set to 1 if the template immediately follows a newline.
tplarg
Indicates a template argument reference ({{{ ... }}}). Contents are just like template, a title element followed by optional parts. The lineStart attribute has the same meaning as above.
part
Indicates a template argument (or default value for a template argument reference). Always contains a name and a value element, in that order, with an equal sign between them if the name is given explicitly. If the template argument is an implicitly numbered one, the name element will be empty and contain an index attribute specifying the index.
For tplarg elements, only the first part child supplies the default value; any further part children are ignored. The split into name and value is disregarded.
h
Indicates a header (=== ... ===). The level attribute contains the header level, while i contains the section number, counted regardless of level (the same number that the &section= query string parameter uses).
ext
Indicates a parser extension tag, such as <ref>...</ref>, <source>...</source> or <nowiki>...</nowiki>. Not all tags are parser extension tags; <b>...</b> or <table>...</table>, for example, are not. Which tags are considered parser tags depends on the MediaWiki installation. To obtain a list of extension tags, use API:Meta#siteinfo / si with the siprop=extensiontags query parameter.
This element always contains (possibly empty) name (tag name) and attr (attributes) child elements, optionally an inner element, and optionally close following it. The contents of attr need not conform to HTML or XML attribute syntax.
If the parser tag is specified in a self-closing form (e.g. <nowiki/>), the ext element will lack inner and close child elements.
ignore
Indicates text to be ignored, usually a <noinclude>...</noinclude>, <onlyinclude>...</onlyinclude> or <includeonly>...</includeonly> tag and/or its contents.
There is no option in the publicly available API to preprocess wikitext in transclusion mode, i.e. ignoring contents of <noinclude>...</noinclude> while parsing <includeonly>...</includeonly> or restricting parsing to <onlyinclude>...</onlyinclude> (bug #49353).
comment
Indicates an HTML-style comment, i.e. <!-- ... -->. The contents of this element include the comment start mark (<!--) and end mark (-->).
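To make the element descriptions more concrete, here is a minimal sketch that walks a parse tree with Python's standard xml.etree.ElementTree module. The sample document is hand-written from the descriptions above rather than captured from a live wiki, and the helper name template_calls is made up for this example.

```python
import xml.etree.ElementTree as ET

# Hand-written sample for the markup "{{foo|bar|baz=qux}}\n== Head ==",
# following the element descriptions above (not real MediaWiki output).
SAMPLE = (
    '<root><template><title>foo</title>'
    '<part><name index="1" /><value>bar</value></part>'
    '<part><name>baz</name>=<value>qux</value></part>'
    '</template>\n<h level="2" i="1">== Head ==</h></root>'
)


def template_calls(root):
    """Yield (title, {argument name: value}) for each template element."""
    for tpl in root.iter("template"):
        title = "".join(tpl.find("title").itertext()).strip()
        args = {}
        for part in tpl.findall("part"):
            name = part.find("name")
            # Implicitly numbered arguments carry their position in @index.
            key = name.get("index") or "".join(name.itertext()).strip()
            # itertext() flattens nested markup; good enough for plain values.
            args[key] = "".join(part.find("value").itertext())
        yield title, args


root = ET.fromstring(SAMPLE)
calls = list(template_calls(root))
headings = [(h.get("level"), h.get("i")) for h in root.iter("h")]
```

ElementTree keeps whitespace in text nodes as is, which matches the advice to treat root as if it carried xml:space="preserve".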

Serialisation

Note: the following method guarantees only that valid parser output will serialise back into original markup. Modifying parse trees without regard for escaping may produce unexpected results. See below for information on escaping template arguments.

Turning the XML parse tree back into wiki markup is rather simple. It amounts to four substitutions, three of them being:

<template>...</template> → {{...}}
<tplarg>...</tplarg> → {{{...}}}
<part>...</part> → |...

Care has to be taken when handling ext elements. For elements that contain an inner element, the following substitution is appropriate:

<ext><name>...</name><attr>...</attr>...</ext> → <......>...

Otherwise, use:

<ext><name>...</name><attr>...</attr></ext> → <....../>

Other elements can have their contents passed through as is.

The whole process is equivalent to applying the following XSLT stylesheet:

<?xml version="1.0" standalone="yes" ?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
	<xsl:output method="text" media-type="text/x-wiki" />
	<xsl:preserve-space elements="*" />
 
	<xsl:template match="template">
		<xsl:text>{{</xsl:text>
		<xsl:apply-templates />
		<xsl:text>}}</xsl:text>
	</xsl:template>
 
	<xsl:template match="tplarg">
		<xsl:text>{{{</xsl:text>
		<xsl:apply-templates />
		<xsl:text>}}}</xsl:text>
	</xsl:template>
 
	<xsl:template match="part">
		<xsl:text>|</xsl:text>
		<xsl:apply-templates />
	</xsl:template>
 
	<xsl:template match="ext[inner]">
		<xsl:text>&lt;</xsl:text>
		<xsl:apply-templates />
	</xsl:template>
 
	<xsl:template match="ext[not(inner)]">
		<xsl:text>&lt;</xsl:text>
		<xsl:apply-templates />
		<xsl:text>/&gt;</xsl:text>
	</xsl:template>
 
	<xsl:template match="inner">
		<xsl:text>&gt;</xsl:text>
		<xsl:apply-templates />	
	</xsl:template>
 
	<xsl:template match="*">
		<xsl:apply-templates />
	</xsl:template>
</xsl:stylesheet>
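For readers who prefer not to run XSLT, the same substitutions can be sketched in Python with the standard xml.etree.ElementTree module. This is an illustrative equivalent under the rules stated above, not code taken from MediaWiki; the function name serialise and the sample tree are made up for the example.

```python
import xml.etree.ElementTree as ET


def serialise(elem):
    """Turn a parse tree element back into wiki markup."""
    # Text directly inside the element, then each child followed by its tail.
    body = elem.text or ""
    for child in elem:
        body += serialise(child) + (child.tail or "")
    if elem.tag == "template":
        return "{{" + body + "}}"
    if elem.tag == "tplarg":
        return "{{{" + body + "}}}"
    if elem.tag == "part":
        return "|" + body
    if elem.tag == "ext":
        # Self-closing form when there is no inner element.
        return "<" + body + ("" if elem.find("inner") is not None else "/>")
    if elem.tag == "inner":
        return ">" + body
    # All other elements pass their contents through unchanged.
    return body


# Hand-written sample tree, not real MediaWiki output.
SAMPLE = (
    '<root>x <template><title>foo</title>'
    '<part><name index="1" /><value>bar</value></part>'
    '<part><name>baz</name>=<value>qux</value></part></template> '
    '<ext><name>ref</name><attr> name="a"</attr><inner>text</inner>'
    '<close>&lt;/ref&gt;</close></ext>'
    '<ext><name>nowiki</name><attr></attr></ext></root>'
)
markup = serialise(ET.fromstring(SAMPLE))
```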

Escaping and transformations

The pipe character, the equal sign and consecutive curly braces are interpreted specially in template invocations. If you wish to use any of them as literal characters, you have to escape them. Unfortunately, MediaWiki markup does not lend itself to escaping very well. There are many methods of escaping markup, and they come with many caveats. Proper escaping matters when modifying parse trees, hence we discuss it here.

The simplest method is to wrap special characters, or the whole string, inside a <nowiki> tag, or escape them with numerical HTML escapes: &#124;, &#61;, &#123; and &#125; (and possibly escape other characters as well). This has two disadvantages: first, wikilinks and transclusions stop working (obviously). Second, the escaped text might not be recognised by template or module logic that processes it. In this section, more universal alternatives will be discussed.

If you want to allow wikilinks in an argument, but not templates (or template arguments), the simplest universal method is to perform the following substitutions:

  • {{{ → {<noinclude/>{<noinclude/>{
  • }}} → }<noinclude/>}<noinclude/>}
  • {{ → {<noinclude/>{
  • }} → }<noinclude/>}
  • = → {{lc:=}}
  • | → {{!}} (built-in magic word since MediaWiki 1.24; for older versions, you have to ensure that Template:! expands to the pipe character.)
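The substitutions in this list can be applied in a single pass, which avoids re-escaping the braces and equal sign that the replacements themselves introduce. The sketch below is one possible way to do that; the function name escape_argument is made up, and inserting <noinclude/> between every pair of adjacent identical braces is a generalisation that gives the listed results for runs of two and three braces and also covers longer runs.

```python
import re

# Replacement for each matched character. A brace is only matched when it is
# immediately followed by another brace of the same kind.
REPLACEMENTS = {
    "{": "{<noinclude/>",
    "}": "}<noinclude/>",
    "=": "{{lc:=}}",
    "|": "{{!}}",
}

PATTERN = re.compile(r"\{(?=\{)|\}(?=\})|=|\|")


def escape_argument(text):
    """Escape braces, = and | so the text is safe as a template argument."""
    # re.sub scans the original text once, so the braces and equal sign
    # inside the replacement strings are never escaped a second time.
    return PATTERN.sub(lambda m: REPLACEMENTS[m.group()], text)


escaped = escape_argument("a|b=c {{x}}")
```

Wikilink brackets are left alone, so links keep working, but a piped link still comes out with its pipe replaced.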

It has the disadvantage that piped wikilinks come out of it as [[link target{{!}}label]], which may be aesthetically unpleasing, although it still renders as expected. It also prevents the pipe trick from working. If you wish to avoid that, you will have to count the pairs of brackets preceding each | to see whether they are balanced; if they are, the | is not part of a wikilink and needs escaping.
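The bracket counting can be sketched as a small scanner that tracks how many [[ are currently open and escapes a pipe only when none are. This is a simplified illustration: the function name is made up, and an unclosed [[ is treated as an open link until the end of the text, which a real implementation may want to handle differently.

```python
def escape_pipes_outside_links(text):
    """Replace | with {{!}} except inside [[...]] wikilinks."""
    out = []
    depth = 0  # number of currently open [[ pairs
    i = 0
    while i < len(text):
        if text.startswith("[[", i):
            depth += 1
            out.append("[[")
            i += 2
        elif text.startswith("]]", i) and depth > 0:
            depth -= 1
            out.append("]]")
            i += 2
        elif text[i] == "|" and depth == 0:
            # Brackets before this pipe are balanced: not part of a link.
            out.append("{{!}}")
            i += 1
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


result = escape_pipes_outside_links("[[target|label]]|rest")
```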

You also need to be cautious in case a = character is contained in the wiki markup; if there is one, make sure that the template argument is passed with an explicit name, or escape the = character appropriately as well (usually this is done by transcluding Template:=). The above substitutions may also not work as expected for substituted templates. Depending on the desired effect, you may need to adjust the escapes accordingly. For example:

  • {{{ → {{{subst:lc:{}}{
  • }}} → }{{subst:lc:}}}{{subst:lc:}}}
  • {{ → {{subst:lc:{}}{
  • }} → }{{subst:lc:}}}
  • = → {{subst:lc:=}}
  • | → {{subst:!}} (see above)
  • ~~~~ → ~~{{subst:lc:~~}}
  • ~~~ → ~{{subst:lc:~~}}

The result can be safely embedded in a substituted template invocation. Performing pre-save transform on it alone will return the original string, with all subst: markers intact. Depending on the use case, this may or may not be the desired effect.

If you want to allow both links and templates, but prevent misinterpretations of | and premature template closures, follow these steps:

  1. Parse the markup you wish to escape. (The following will assume that you get an XML tree as described above.)
  2. For each direct child text node of the <root> element, escape =, |, }}} and }}, as discussed.
  3. Serialise the parse tree back into wiki markup.

The resultant text will be interpreted as if it were a stand-alone piece of markup, even inside a template argument. Following these steps is the only universal method of escaping wikitext.
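Steps 2 and 3 can be sketched in Python as follows, assuming the parse tree from step 1 is already available as an XML string. The helper names are made up for this example, the escapes chosen for =, | and closing braces are the non-substituted ones discussed earlier, and the sample tree is hand-written rather than real MediaWiki output.

```python
import re
import xml.etree.ElementTree as ET

ESCAPES = {"=": "{{lc:=}}", "|": "{{!}}", "}": "}<noinclude/>"}


def escape_text(text):
    """Escape =, | and runs of closing braces in a plain text node."""
    return re.sub(r"=|\||\}(?=\})", lambda m: ESCAPES[m.group()], text or "")


def serialise(elem):
    """Turn a parse tree element back into wiki markup."""
    body = elem.text or ""
    for child in elem:
        body += serialise(child) + (child.tail or "")
    if elem.tag == "template":
        return "{{" + body + "}}"
    if elem.tag == "tplarg":
        return "{{{" + body + "}}}"
    if elem.tag == "part":
        return "|" + body
    if elem.tag == "ext":
        return "<" + body + ("" if elem.find("inner") is not None else "/>")
    if elem.tag == "inner":
        return ">" + body
    return body


def escape_for_argument(tree_xml):
    """Escape only top-level text, leaving nested markup untouched."""
    root = ET.fromstring(tree_xml)
    # In ElementTree, the direct child text nodes of root are root.text
    # and the tail of each direct child element.
    root.text = escape_text(root.text)
    for child in root:
        child.tail = escape_text(child.tail)
    return serialise(root)


# Hand-written tree for the markup "a=b|c {{t|k=v}} d}}".
SAMPLE = (
    "<root>a=b|c <template><title>t</title>"
    "<part><name>k</name>=<value>v</value></part></template> d}}</root>"
)
safe = escape_for_argument(SAMPLE)
```

The equal sign and pipe inside the nested template are left alone because they are not direct children of root, which is what keeps the template invocation working after escaping.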

Implementation