User:Kephir/XML parse tree

The following is unofficial documentation of the XML parse tree format returned by Special:ExpandTemplates and by API modules such as API:Expandtemplates and API:Properties when parse tree output is requested in the API call.

Elements

 * root
 * The root element. Has no interesting attributes by itself.
 * Since whitespace is significant in reconstructing wiki markup, it is a good idea to parse the XML document as if the root element had an xml:space="preserve" attribute. MediaWiki does not specify it explicitly, however.


 * template
 * Indicates a template, variable, or parser function invocation ({{...}}). Must contain at least a title element, followed by optional part elements.
 * The lineStart attribute is present and set to 1 if the template immediately follows a newline.
 * It is impossible in general to determine whether the node represents a transclusion or a parser function/variable until the contents of the title element are expanded, because the title may itself be produced by a template argument reference.
 * API:Siteinfo provides several ways to gather the list of variables and parser functions (siprop=magicwords, siprop=functionhooks and siprop=variables), but none of them can be reliably used to recognise their precise syntax as of MediaWiki 1.24.
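As an illustration of the structure described above, here is a minimal Python sketch that lists the template nodes of a parse tree with the standard xml.etree.ElementTree module. The sample document and the helper names (SAMPLE, text_of, list_templates) are made up for this example, and the element and attribute names (title, part, lineStart) are assumed to match what MediaWiki emits.

```python
import xml.etree.ElementTree as ET

# Hypothetical parse tree for "Hello {{foo|bar|baz=qux}}\n{{PAGENAME}}".
SAMPLE = (
    '<root>Hello <template><title>foo</title>'
    '<part><name index="1" /><value>bar</value></part>'
    '<part><name>baz</name>=<value>qux</value></part>'
    '</template>\n'
    '<template lineStart="1"><title>PAGENAME</title></template></root>'
)


def text_of(element):
    """Concatenate all text inside an element, flattening nested markup."""
    return "".join(element.itertext())


def list_templates(xml_source):
    """Return title, part count and lineStart flag for every template node."""
    root = ET.fromstring(xml_source)  # ElementTree keeps whitespace as is
    found = []
    for node in root.iter("template"):
        found.append({
            "title": text_of(node.find("title")),
            "parts": len(node.findall("part")),
            "line_start": node.get("lineStart") == "1",
        })
    return found


for entry in list_templates(SAMPLE):
    print(entry)
```

Note that the sketch cannot tell that the second node is a variable rather than a transclusion; both appear as template nodes, as explained above.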


 * tplarg
 * Indicates a template argument reference ({{{...}}}). Its contents are structured like those of template: a title element followed by optional part elements. The lineStart attribute has the same meaning as above.


 * part
 * Indicates a template argument (or default value for a template argument reference). Always contains a name and a value element, in that order, with an equal sign between them if the name is given explicitly. If the template argument is an implicitly numbered one, the name element will be empty and contain an index attribute specifying the index.
 * For tplarg elements, only the first part child should be looked at to provide the default value; the rest are ignored. The split into name and value is disregarded.
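The two rules above can be sketched in Python as follows. The function names (template_arguments, tplarg_default) are hypothetical, the element names name and value and the index attribute are assumed to match MediaWiki output, and values are returned unexpanded and untrimmed.

```python
import xml.etree.ElementTree as ET


def template_arguments(template):
    """Map argument names (or implicit indexes) to raw, unexpanded values."""
    args = {}
    for part in template.findall("part"):
        name = part.find("name")
        value = part.find("value")
        index = name.get("index")
        # Implicitly numbered arguments carry their position in "index";
        # explicit names are stored as the text of the name element.
        key = index if index is not None else "".join(name.itertext()).strip()
        args[key] = "".join(value.itertext())
    return args


def tplarg_default(tplarg):
    """Return the default of a tplarg node: the whole first part, unsplit."""
    first = tplarg.find("part")
    if first is None:
        return None
    # itertext() also yields the "=" between name and value, so the split
    # into name and value is effectively undone here.
    return "".join(first.itertext())


template = ET.fromstring(
    '<template><title>foo</title>'
    '<part><name index="1" /><value>bar</value></part>'
    '<part><name>baz</name>=<value>qux</value></part></template>'
)
print(template_arguments(template))
```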


 * h
 * Indicates a header (== ... ==). The level attribute contains the header level, while i contains the section number, regardless of level (the same number that the section query string parameter uses).
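A minimal Python sketch of reading header nodes follows. The sample tree and the function name list_headers are invented for this example, and the attribute names level and i are assumed to match MediaWiki output. The title is recovered by cutting level equal signs from each end of the header markup, which is a simplification.

```python
import xml.etree.ElementTree as ET

# Hypothetical parse tree with one level-2 and one level-3 header.
SAMPLE = (
    '<root><h level="2" i="1">== Intro ==</h>\nSome text.\n'
    '<h level="3" i="2">=== Details ===</h>\nMore text.</root>'
)


def list_headers(xml_source):
    """Return (level, section number, title text) for every h node."""
    root = ET.fromstring(xml_source)
    headers = []
    for node in root.iter("h"):
        level = int(node.get("level"))
        markup = "".join(node.itertext()).strip()
        # Drop "level" equal signs on both sides, then surrounding spaces.
        title = markup[level:-level].strip()
        headers.append((level, int(node.get("i")), title))
    return headers


print(list_headers(SAMPLE))
```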


 * ext
 * Indicates a parser extension tag. Not all tags that appear in wiki markup are parser extension tags. Which tags are considered parser tags depends on the MediaWiki installation. To obtain a list of extension tags, use API:Meta with the siprop=extensiontags query parameter.
 * This element always contains (possibly empty) name (tag name) and attr (attributes) child elements, optionally an inner element, and optionally a close element following it. The contents of attr need not conform to HTML or XML attribute syntax.
 * If the parser tag is specified in a self-closing form, the ext element will lack the inner and close child elements.
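The following Python sketch shows one way to read ext nodes and to tell the self-closing form apart. The sample tree and the helper names (describe_ext, list_extension_tags) are made up for illustration, and the child element names (name, attr, inner, close) are assumed to match MediaWiki output.

```python
import xml.etree.ElementTree as ET

# Hypothetical parse tree: a paired ref tag and a self-closing references tag.
SAMPLE = (
    '<root>Text<ext><name>ref</name><attr> name="a"</attr>'
    '<inner>Source</inner><close>&lt;/ref&gt;</close></ext>'
    '<ext><name>references</name><attr> </attr></ext></root>'
)


def describe_ext(node):
    """Summarise one ext node; inner is None for the self-closing form."""
    inner = node.find("inner")
    return {
        "name": node.findtext("name", ""),
        # attr is kept as raw text because it need not be valid attributes.
        "attr": node.findtext("attr", ""),
        "inner": None if inner is None else (inner.text or ""),
        "self_closing": inner is None,
    }


def list_extension_tags(xml_source):
    return [describe_ext(n) for n in ET.fromstring(xml_source).iter("ext")]


for tag in list_extension_tags(SAMPLE):
    print(tag)
```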


 * ignore
 * Indicates text to be ignored, usually a noinclude, includeonly or onlyinclude tag and/or its contents.
 * There is no option in the publicly available API to preprocess wikitext in transclusion mode, i.e. ignoring the contents of <noinclude></noinclude> while parsing <includeonly></includeonly>, or restricting parsing to <onlyinclude></onlyinclude> (bug #49353).


 * comment
 * Indicates an HTML-style comment, i.e. <!-- ... -->. The contents of this element include the comment start mark (<!--) and end mark (-->).

Serialisation

 * Note: the following method guarantees only that valid parser output will serialise back into original markup. Modifying parse trees without regard for escaping may produce unexpected results. See below for information on escaping template arguments.

Turning the XML parse tree back into wiki markup is rather simple. It amounts to four substitutions, three of them being:

<template>...</template> → {{...}}
<tplarg>...</tplarg> → {{{...}}}
<part>...</part> → |...

Care has to be taken when handling ext elements. For elements that contain an inner element, the following substitution is appropriate (the tag name and attributes are concatenated inside the opening tag, followed by the contents of inner and close):

<ext><name>...</name><attr>...</attr>...</ext> → <......>...

Otherwise, use:

<ext><name>...</name><attr>...</attr></ext> → <....../>

Other elements can have their contents passed through as is.
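The substitutions above can be sketched in Python as a short recursive function. This is an illustration rather than a reference implementation: the function name serialise and the sample tree are invented here, and the ext child element names (name, attr, inner, close) are assumed to match MediaWiki output.

```python
import xml.etree.ElementTree as ET


def serialise(node):
    """Turn a parse tree node back into wiki markup."""
    if node.tag == "ext":
        name = node.findtext("name", "")
        attr = node.findtext("attr", "")
        inner = node.find("inner")
        if inner is None:
            # Self-closing form: no inner and no close element.
            return "<" + name + attr + "/>"
        return ("<" + name + attr + ">" + (inner.text or "")
                + node.findtext("close", ""))
    # Text before the first child, then each child followed by its tail
    # (the "=" between name and value is the tail of the name element).
    body = node.text or ""
    for child in node:
        body += serialise(child) + (child.tail or "")
    if node.tag == "template":
        return "{{" + body + "}}"
    if node.tag == "tplarg":
        return "{{{" + body + "}}}"
    if node.tag == "part":
        return "|" + body
    return body  # root, title, name, value, h, ignore, comment


SAMPLE = (
    '<root>A <template><title>foo</title>'
    '<part><name index="1" /><value>bar</value></part>'
    '<part><name>baz</name>=<value><tplarg><title>1</title>'
    '<part><name index="1" /><value>d</value></part></tplarg></value></part>'
    '</template> <ext><name>ref</name><attr> name="a"</attr><inner>t</inner>'
    '<close>&lt;/ref&gt;</close></ext>'
    '<ext><name>references</name><attr> </attr></ext>'
    '<comment>&lt;!-- c --&gt;</comment></root>'
)
print(serialise(ET.fromstring(SAMPLE)))
```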

The whole process can equally be expressed as an XSLT stylesheet.

Escaping and transformations

The pipe character, the equal sign and consecutive curly braces are interpreted specially in template invocations. If you wish to employ any of them as literal characters, you have to escape them. Unfortunately, MediaWiki markup does not lend itself to escaping very well. There are many methods of escaping markup, and they come with many caveats. Proper escaping matters when modifying parse trees, hence we discuss it here.

The simplest method is to wrap special characters, or the whole string, inside a nowiki tag, or escape them with numerical HTML escapes: &#124; (pipe), &#61; (equal sign), &#123; and &#125; (curly braces), and possibly escape other characters as well. This has two disadvantages: first, wikilinks and transclusions stop working (obviously). Second, the escaped text might not be recognised by template or module logic that processes it. In this section, more universal alternatives will be discussed.
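As a small illustration of the numerical-escape approach, the following Python sketch replaces each special character with a decimal character reference. The table NUMERIC_ESCAPES and the function name escape_numeric are assumptions made for this example; it covers only the pipe, the equal sign and curly braces.

```python
import html

# Assumed set of characters to escape: pipe, equal sign and curly braces.
NUMERIC_ESCAPES = {
    "|": "&#124;",
    "=": "&#61;",
    "{": "&#123;",
    "}": "&#125;",
}


def escape_numeric(text):
    """Replace special characters with numerical HTML escapes, one pass."""
    return "".join(NUMERIC_ESCAPES.get(ch, ch) for ch in text)


escaped = escape_numeric("a|b={{c}}")
print(escaped)
# Decoding the references gives back the original string.
print(html.unescape(escaped))
```

Working character by character avoids escaping the output of an earlier replacement a second time.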

If you want to allow wikilinks in an argument, but not templates (or template arguments), the simplest universal method is to perform substitutions of the following kind:
 * | → {{!}} (built-in magic word since MediaWiki 1.24; for older versions, you have to ensure that Template:! expands to the pipe character.)

This has the disadvantage that piped wikilinks come out as [[target{{!}}label]], which may be aesthetically unpleasing, although they still render as expected. It also prevents the pipe trick from working. If you wish to avoid that, you will have to count the pairs of brackets preceding each | to see whether they match; if they do, the pipe is not part of a wikilink and needs escaping.
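The bracket-counting idea can be sketched in Python as below. This is a simplified illustration under stated assumptions: the function name escape_pipes_outside_links is invented, only pipes are handled (curly braces would need their own escaping), and a pipe that follows an unclosed [[ is left alone, so malformed input is not made safe.

```python
def escape_pipes_outside_links(text):
    """Replace | with {{!}} unless it sits inside a [[...]] wikilink."""
    out = []
    depth = 0  # number of currently open [[ pairs
    i = 0
    while i < len(text):
        if text.startswith("[[", i):
            depth += 1
            out.append("[[")
            i += 2
        elif text.startswith("]]", i) and depth > 0:
            depth -= 1
            out.append("]]")
            i += 2
        elif text[i] == "|" and depth == 0:
            # All preceding bracket pairs match, so this pipe is not part
            # of a wikilink and has to be escaped.
            out.append("{{!}}")
            i += 1
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


print(escape_pipes_outside_links("a|b [[c|d]] e|f"))
```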

You also need to be cautious in case a = character is contained in the wiki markup; if there is one, make sure that the template argument is passed with an explicit name, or escape the = character appropriately as well (usually this is done by transcluding Template:=).

The above substitutions may also not work as expected for substituted templates. Depending on the desired effect, you may need to adjust the escapes accordingly. A result adjusted in this way can be safely embedded in a substituted template invocation. Performing pre-save transform on it alone will return the original string, with all  markers intact. Depending on the use case, this may or may not be the desired effect.

If you want to allow both links and templates, but prevent misinterpretation of special characters and premature template closures, you need to follow these steps:
 * 1) Parse the markup you wish to escape. (The following will assume that you get an XML tree as described above.)
 * 2) For each direct child text node of the root element, escape the pipe character, the equal sign and curly braces, as discussed.
 * 3) Serialise the parse tree back into wiki markup.

The resultant text will be interpreted as if it were a stand-alone piece of markup, even inside a template argument. Following these steps is the only universal method of escaping wikitext.
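The steps above can be sketched in Python as follows, starting from an XML parse tree that has already been obtained. Everything named here is an assumption made for illustration: the function names, the substitution table ESCAPES (which relies on {{!}} and on Template:= being available on the wiki), and the ext child element names. Only text directly under the root node is escaped, as the steps describe; text inside templates, headers and other elements is left untouched.

```python
import xml.etree.ElementTree as ET

# Assumed substitution table; adjust it to the escapes your wiki supports.
ESCAPES = {"|": "{{!}}", "=": "{{=}}", "{": "&#123;", "}": "&#125;"}


def escape_text(text):
    """Escape special characters in one pass, so that the braces introduced
    by {{!}} and {{=}} are not escaped again."""
    return "".join(ESCAPES.get(ch, ch) for ch in (text or ""))


def serialise(node):
    """Turn a parse tree node back into wiki markup."""
    if node.tag == "ext":
        head = "<" + node.findtext("name", "") + node.findtext("attr", "")
        inner = node.find("inner")
        if inner is None:
            return head + "/>"
        return head + ">" + (inner.text or "") + node.findtext("close", "")
    body = node.text or ""
    for child in node:
        body += serialise(child) + (child.tail or "")
    wrap = {"template": ("{{", "}}"), "tplarg": ("{{{", "}}}"),
            "part": ("|", "")}
    before, after = wrap.get(node.tag, ("", ""))
    return before + body + after


def escape_markup(xml_source):
    """Escape top-level text of a parse tree and serialise the result."""
    root = ET.fromstring(xml_source)
    # Direct child text nodes of root: its own text and each child's tail.
    root.text = escape_text(root.text)
    for child in root:
        child.tail = escape_text(child.tail)
    return serialise(root)


SAMPLE = (
    '<root>a|b=<template><title>t</title>'
    '<part><name index="1" /><value>x</value></part></template> }}</root>'
)
print(escape_markup(SAMPLE))
```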

Implementation

 * [//git.wikimedia.org/blob/mediawiki%2Fcore.git/HEAD/includes%2Fparser%2FPreprocessor_DOM.php <tt>Preprocessor_DOM.php</tt> in the git tree]
 * [//git.wikimedia.org/blob/mediawiki%2Fcore.git/HEAD/includes%2Fparser%2FPreprocessor_Hash.php <tt>Preprocessor_Hash.php</tt> in the git tree]