Parsoid/Normalizations


While serializing (html2wt), Parsoid performs a number of normalizations, some behind a scrub_wikitext flag.

Most of them are implemented in normalizeDOM.js.

Default

These are the normalizations that Parsoid performs by default.

  • Tag minimization (<i>/<b> tags)
  • Serialize invalid <a> tags to text
  • Enforce single-line context (in headings and lists)

scrub_wikitext

These normalizations are enabled if the scrub_wikitext parameter is passed to the Parsoid API.

  • Strip empty headings and style tags (only performed on new nodes)
  • Tag minimization (<a> tags, when at least one is new)
  • Whitespace at the start of paragraphs
  • New links that end in spaces
  • New table cells starting with escapable prefixes

Other normalizations work around issues in Parsoid, VisualEditor (VE), and other clients. For now, they are a simpler way to generate clean wikitext than fixing those issues at the source:

  • Force category links and behaviour switches to serialize before/after headings (only performed on new nodes)
  • Strip <br> tags in headings (Parsoid introduces these in some paragraphs, and they stick around when such a paragraph is converted to a heading in VE)
  • Strip trailing <nowiki/> from wikitext lines (this one will be unnecessary once Parsoid stops introducing these)

Examples

Tag minimization (<i>/<b> tags)

<b>X</b><b>Y</b>

// becomes

<b>XY</b>

and

<i>A</i><b><i>X</i></b><b><i>Y</i></b><i>Z</i>

// becomes 

<i>A<b>XY</b>Z</i>
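
The first example can be sketched as a merge of adjacent siblings that share a tag name. The tree shape below (strings for text, `{ tag, children }` for elements) and the names `mergeAdjacentTags` and `renderNodes` are assumptions made for illustration, not Parsoid's implementation. The sketch does not reorder nested tags, which the second example also requires.

```javascript
// Hypothetical minimal tree: a node is either a string (text) or
// { tag: 'b', children: [...] }. This is not Parsoid's DOM representation.
function mergeAdjacentTags(children) {
  const out = [];
  for (const child of children) {
    const prev = out[out.length - 1];
    if (typeof child === 'string') {
      if (typeof prev === 'string') {
        // Join neighbouring text nodes so "X" + "Y" becomes "XY".
        out[out.length - 1] = prev + child;
      } else {
        out.push(child);
      }
    } else if (prev && typeof prev !== 'string' && prev.tag === child.tag) {
      // Same tag as the previous sibling: fold the children into it.
      prev.children.push(...child.children);
    } else {
      // Copy the node so the input tree is not mutated.
      out.push({ tag: child.tag, children: [...child.children] });
    }
  }
  // Recurse after merging so the combined children are minimized too.
  return out.map((node) =>
    typeof node === 'string'
      ? node
      : { tag: node.tag, children: mergeAdjacentTags(node.children) }
  );
}

// Render the hypothetical tree back to an HTML string.
function renderNodes(nodes) {
  return nodes
    .map((node) =>
      typeof node === 'string'
        ? node
        : `<${node.tag}>${renderNodes(node.children)}</${node.tag}>`
    )
    .join('');
}

const merged = mergeAdjacentTags([
  { tag: 'b', children: ['X'] },
  { tag: 'b', children: ['Y'] },
]);
console.log(renderNodes(merged)); // prints "<b>XY</b>"
```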

Force category links and behaviour switches to serialize before/after headings

<h2>hello there<link href="Category:A1" rel="mw:PageProp/Category" /></h2>

// becomes

<h2>hello there</h2>
<link href="Category:A1" rel="mw:PageProp/Category" />

and

<h2><meta property="mw:PageProp/toc" /> ok</h2>

// becomes

<meta property="mw:PageProp/toc" />
<h2> ok</h2>
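
A minimal sketch of both examples, assuming that leading page-property nodes move before the heading and trailing ones move after it. The node shape and the names `isPageProp` and `hoistPageProps` are hypothetical, and the sketch ignores the "only performed on new nodes" restriction.

```javascript
// Hypothetical node shape: strings are text, objects are elements with a
// `name` and an `attrs` map. This is not Parsoid's actual DOM API.
function isPageProp(node) {
  if (typeof node === 'string') return false;
  const marker = node.attrs.rel || node.attrs.property || '';
  return marker.startsWith('mw:PageProp/');
}

// Split a heading's children into nodes that should serialize before the
// heading, the remaining heading content, and nodes that should come after.
function hoistPageProps(headingChildren) {
  const before = [];
  const after = [];
  const rest = [...headingChildren];
  while (rest.length > 0 && isPageProp(rest[0])) {
    before.push(rest.shift());
  }
  while (rest.length > 0 && isPageProp(rest[rest.length - 1])) {
    after.unshift(rest.pop());
  }
  return { before, heading: rest, after };
}

const category = {
  name: 'link',
  attrs: { href: 'Category:A1', rel: 'mw:PageProp/Category' },
};
const result = hoistPageProps(['hello there', category]);
console.log(result.heading); // the heading keeps only its text
console.log(result.after);   // the category link now follows the heading
```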

Serialize invalid <a> tags to text

<a rel="mw:WikiLink" href="[[foo]]">text</a>

// serializes to

text

and

<a rel="mw:WikiLink" href="[[foo]]">*a [[foo]]</a>

// serializes to

<nowiki>*a [[foo]]</nowiki>

Enforce single-line context

<h2>testing
123</h2>

// becomes

<h2>testing 123</h2>

and

<ul><li>asd
sdf</li></ul>

// becomes

<ul><li>asd sdf</li></ul>

However, newlines in transclusion parameters are preserved.

<h2> hi <span about="#mwt1" typeof="mw:Transclusion" data-mw='{"parts":[{"template":{"target":{"wt":"bogus","href":"./Template:Bogus"},"params":{"1":{"wt":"there\nyou"}},"i":0}}]}'>there</span><span about="#mwt1">
</span><span about="#mwt1">you</span> </h2>

// serializes to

== hi {{bogus|there
you}} ==
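
The rule above can be sketched as a recursive walk that replaces newlines in text with a space and leaves transclusion output untouched. The node shape, the `transclusion` flag, and the name `enforceSingleLine` are assumptions for illustration. Exactly how Parsoid collapses surrounding whitespace is not covered here.

```javascript
// Hypothetical node shape: strings are text; elements are
// { tag, children, transclusion? }. The `transclusion` flag stands in for
// nodes that belong to template output (an `about` group in the HTML above).
function enforceSingleLine(node) {
  if (typeof node === 'string') {
    return node.replace(/\n+/g, ' ');
  }
  if (node.transclusion) {
    // Newlines inside template output are preserved.
    return node;
  }
  return {
    ...node,
    children: node.children.map((child) => enforceSingleLine(child)),
  };
}

const heading = enforceSingleLine({ tag: 'h2', children: ['testing\n123'] });
console.log(heading.children[0]); // prints "testing 123"
```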

Strip empty headings and style tags

Normally,

<h2></h2>
<i></i><b></b>

// serializes to

==<nowiki/>==
''<nowiki/>'''''<nowiki/>'''

but with scrubbing it's all dropped.
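
A minimal sketch of the scrubbed behaviour, assuming an `isNew` flag marks nodes added by the client and that the affected tags are headings, <i>, and <b> (the tag set, the flag, and the name `stripEmptyNewNodes` are all hypothetical):

```javascript
// Assumed set of tags that are dropped when empty; the source only shows
// <h2>, <i> and <b>.
const STRIPPABLE_WHEN_EMPTY = new Set([
  'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'i', 'b',
]);

// Hypothetical node shape: strings are text; elements are
// { tag, children, isNew }.
function stripEmptyNewNodes(nodes) {
  return nodes.filter((node) => {
    if (typeof node === 'string') return true;
    const isEmpty = node.children.length === 0;
    // Only new, empty, strippable nodes are dropped.
    return !(node.isNew && isEmpty && STRIPPABLE_WHEN_EMPTY.has(node.tag));
  });
}

const kept = stripEmptyNewNodes([
  { tag: 'h2', children: [], isNew: true },
  { tag: 'i', children: [], isNew: true },
  { tag: 'b', children: [], isNew: false }, // not new, so it is kept
]);
console.log(kept.length); // prints 1
```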

Tag minimization (<a> tags)

<a href="Football">Foot</a><a href="Football">ball</a>

// becomes

<a href="Football">Football</a>

and

<a href="Football"><i>Foot</i></a><a href="Football"><b><i>ball</i></b></a>

// becomes

<a href="Football"><i>Foot<b>ball</b></i></a>

Move formatting from link text to the entire link (with some exceptions)

<a rel="mw:WikiLink" href="./Football"><u><i><b>Football</b></i></u></a>

// becomes

<u><i><b><a rel="mw:WikiLink" href="./Football">Football</a></b></i></u>

This enables the simplified wikilink format when the href and the link text match. Without the reordering, [[Football|<u>'''''Football'''''</u>]] would be emitted. With the reordering, <u>'''''[[Football]]'''''</u> is emitted.

Exceptions:

  • If the formatting tags have attributes such as color, style, or class, the reordering is not done, since it can change rendering in some cases. The A-tag's color style overrides the outer style, i.e. <i color='brown'>[[Foo]]</i> doesn't render the same as [[Foo|<i color='brown'>Foo</i>]]
  • If the link text is not identical to the href, the reordering is not done since the simplified link form is not enabled in this case.
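
The two exceptions can be read as a precondition check, sketched below. The shapes of `link` and `wrappers` and the name `canMoveFormattingOutside` are assumptions for illustration. Real title comparison would need normalization (underscores, case, namespaces), which this sketch omits.

```javascript
// Hypothetical shapes: `link` is { href, text }, and `wrappers` is the chain
// of formatting elements inside the link, each { tag, attrs }.
function canMoveFormattingOutside(link, wrappers) {
  // Exception 1: any attribute on a formatting tag blocks the reordering.
  const noAttrs = wrappers.every(
    (wrapper) => Object.keys(wrapper.attrs).length === 0
  );
  // Exception 2: the link text must be identical to the link target.
  // Stripping the "./" prefix is a simplification of title handling.
  const target = link.href.replace(/^\.\//, '');
  return noAttrs && target === link.text;
}

const plain = [
  { tag: 'u', attrs: {} },
  { tag: 'i', attrs: {} },
  { tag: 'b', attrs: {} },
];
console.log(
  canMoveFormattingOutside({ href: './Football', text: 'Football' }, plain)
); // prints true
```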

Whitespace at the start of paragraphs

Without scrubbing, the serializer emits nowikis here to prevent the text from roundtripping as preformatted text.

<p> hi
 ho</p>

// normally serializes to

<nowiki> </nowiki>hi
<nowiki> </nowiki>ho

// but with scrubbing becomes

hi
ho
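
A minimal sketch of the scrubbed output, assuming the paragraph content is available as a plain string (the name `scrubLeadingWhitespace` is hypothetical, and the real normalization operates on the DOM rather than on text):

```javascript
// Drop spaces and tabs at the start of every line so that no line can be
// parsed as preformatted text and no <nowiki> is needed.
function scrubLeadingWhitespace(paragraphText) {
  return paragraphText.replace(/^[ \t]+/gm, '');
}

console.log(JSON.stringify(scrubLeadingWhitespace(' hi\n ho'))); // prints "hi\nho"
```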

New links that end in spaces

Without scrubbing, the serializer emits a nowiki here to prevent the following text from being parsed as a link trail.

<p><a rel="mw:WikiLink" href="./Berlin" title="Berlin">Berlin </a>is the capital of Germany.</p>

// normally serializes to

[[Berlin ]]<nowiki/>is the capital of Germany.

// but with scrubbing becomes

[[Berlin]] is the capital of Germany.
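
One way to picture the scrubbed result is to move the trailing whitespace out of the link text before emitting the wikilink. The names `splitTrailingSpace` and `serializeSimpleLink` are hypothetical, and the sketch only covers links whose text equals their target.

```javascript
// Split link text into the trimmed text and the whitespace that followed it.
function splitTrailingSpace(linkText) {
  const trimmed = linkText.replace(/\s+$/, '');
  return { text: trimmed, trailing: linkText.slice(trimmed.length) };
}

// Emit a simple wikilink with the trailing whitespace placed after the
// closing brackets, so no <nowiki/> is needed to block a link trail.
function serializeSimpleLink(linkText) {
  const { text, trailing } = splitTrailingSpace(linkText);
  return `[[${text}]]${trailing}`;
}

console.log(serializeSimpleLink('Berlin ') + 'is the capital of Germany.');
// prints "[[Berlin]] is the capital of Germany."
```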

New table cells starting with escapable prefixes

<table>
<tr><td>a</td></tr>
<tr><td>-</td></tr>
<tr><td>+</td></tr>
</table>

// normally serializes to

{|
|a
|-
|<nowiki>-</nowiki>
|-
|<nowiki>+</nowiki>
|}

// but with scrubbing becomes

{|
|a
|-
| -
|-
| +
|}
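
The two outputs above can be reproduced with a small sketch for a single-column table. The names `serializeCell` and `serializeSingleColumnTable` are hypothetical, only the `-` and `+` prefixes shown above are handled, and wrapping the whole cell in <nowiki> is a simplification that matches the one-character cells in the example.

```javascript
// Serialize one cell. A leading "-" or "+" would otherwise be read as table
// syntax ("|-" starts a row, "|+" starts a caption).
function serializeCell(text, scrub) {
  if (!/^[-+]/.test(text)) return `|${text}`;
  // With scrubbing, a space separates the pipe from the prefix;
  // without it, the content is escaped with <nowiki>.
  return scrub ? `| ${text}` : `|<nowiki>${text}</nowiki>`;
}

// Serialize a table that has one cell per row.
function serializeSingleColumnTable(cells, scrub) {
  const rows = cells.map((cell) => serializeCell(cell, scrub));
  return ['{|', rows.join('\n|-\n'), '|}'].join('\n');
}

console.log(serializeSingleColumnTable(['a', '-', '+'], true));
```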

Related links