Parsoid/Roundtrip testpages

= Kinds of RT problems =


 * Wikitext escaping diffs: Lots of pages have wikitext escaping of text in brackets ( [..] ). This piece of wikitext escaping code needs fixing.
 * Whitespace/quote diffs for ref and other extension tags: Lots of pages have whitespace/quote diffs around extension tags ( vs,  vs )
 * Template diffs: Missing template output / duplicate template output. Most of these are probably because of incorrect DSR (DOM wikitext source range) information.
 * Unbalanced quote diffs: Some wiki pages don't close quotes, and Parsoid adds the missing balancing quotes, which show up as RT diffs.
 * Table wikitext diffs: Missing "-" or missing "|" chars or some such -- maybe a serializer or html/missing-rt-data issue -- to be investigated.
 * Lists/lists in tables diffs: List rt diffs or lists in table rt diffs (mostly involving dl-dt-dd lists)
 * Unbalanced tag diffs: On several pages, there are unbalanced opening and closing tags that obviously don't RT correctly -- need workarounds/fixes/hacks/or wont-fix.
 * Wikitext "syntax errors": Pages where the wikitext syntax is erroneous in the surrounding context and leads to differences in parsing and roundtripping -- unbalanced tags above are one special case of this broader category.
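
For triaging pages into the categories above, a first pass can separate pure whitespace diffs from everything else. The following is only a minimal sketch: the function names `classify_rt_diff` and `changed_lines` are hypothetical and are not part of Parsoid's roundtrip test tooling.

```python
import difflib
import re


def classify_rt_diff(original: str, roundtripped: str) -> str:
    """Return 'none', 'whitespace', or 'other' for an original/roundtripped pair."""
    if original == roundtripped:
        return "none"
    # Remove every whitespace character; if the texts then match,
    # the roundtrip changed whitespace only.
    def strip_ws(text: str) -> str:
        return re.sub(r"\s+", "", text)

    if strip_ws(original) == strip_ws(roundtripped):
        return "whitespace"
    return "other"


def changed_lines(original: str, roundtripped: str) -> list[str]:
    """Return the removed ('- ') and added ('+ ') lines of a line-based diff."""
    diff = difflib.ndiff(original.splitlines(), roundtripped.splitlines())
    return [line for line in diff if line.startswith(("- ", "+ "))]
```

Anything classified as 'other' still needs manual sorting into the quote, template, table, list, and tag categories.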

Zero diffs

 * Medha_Patkar
 * John_McCain
 * Political_science
 * Hindu_reform_movements
 * PHP
 * Middle_Way
 * Substitution Cipher

Extension (ref, source, code, etc.) whitespace and quote diffs

 * No-self
 * Mahatma_Gandhi

Crashes

 * Advaita_Vedanta

When I rescue errors and continue, the real problem emerges: the parsed output for this page is mangled. The output from templates is interlaced rather than being serialized correctly. This messes up the template encapsulation code -- something is up with the async pipeline that needs fixing.


 * Help:Templates parser crashes in template encapsulation code.

Several template diffs
Nothing major right now.

Fixed

 * Anexo:Monumentos_Históricos_de_Panamá -- lots of template diffs.
 * Hayasdan -- couple significant template diffs.
 * Buddha
 * Anna_Hazare -- " " is the reduced test case that causes most of the RT problems on this page because of nested pre-tags showing up in the token stream. Yet to fix.
 * Adi_Shankara -- some newline/whitespace issues in one rt-ed cite template use. Yet to investigate.
 * Theravada -- now mostly whitespace, nowiki, and mismatched quote diffs.
 * Dependent_origination
 * Yoga
 * Vijayanagar Empire -- there is one notable diff which is a stray is rt-ed as &lt;/s
 * K-1 Rising 2002 -- template / table interaction

Lots of Whitespace/quote diffs

 * Pāli
 * Sanskrit
 * Node.js
 * Paul Ryan
 * Bellary -- cite template uses are losing newlines in RT-ing

Unbalanced quote diffs

 * Theravada
 * Emma Cairns
 * Gondi bank -- here and on many other pages, quotes are nested improperly across ref-tags. Ex: This is an example of a ''quotes with each other .  This then parses and round trips improperly.  Fixed by moving cite processing earlier in the stage 3 pipeline.

Table wikitext diffs

 * Dont_Ask, Dont_Tell Repeal Act of 2010 -- templates used to set td attributes are not RT-ing.
 * Nagarjuna -- scroll down to find the table char diffs.
 * Enigma Machine
 * Simpsons dvd sets -- replaces empty lines with a td-wikitext char (|) -- probably either a parse or a serializer error.

Test cases
1. ! chars in links in table cells

This example does not parse and RT as expected.

2. Table lines starting with a leading space <-- The 3rd line has a leading space -->

Another one where the ! marks don't parse properly. !h1!!h2 gets parsed as a string in a td.

3. Unclosed attributes (missing " char) and | char in table cells

This example does not parse and RT as expected.

4. Table cells separated by "|" instead of "||"

The 'foo bar foo' in this example gets parsed as 3 attributes of the td and the second 'foo' gets dropped as a duplicate.
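
The loss described in test case 4 can be illustrated with a toy model. This is a sketch under assumptions, not Parsoid's attribute parser: `parse_cell_attributes` and `serialize_cell_attributes` are hypothetical helpers that treat each word as a valueless attribute name and keep only the first occurrence.

```python
def parse_cell_attributes(attr_text: str) -> dict[str, str]:
    """Treat each whitespace-separated word as a valueless attribute name.

    The first occurrence of a name wins, so a repeated name is dropped.
    """
    attrs: dict[str, str] = {}
    for name in attr_text.split():
        attrs.setdefault(name, "")
    return attrs


def serialize_cell_attributes(attrs: dict[str, str]) -> str:
    """Write the attribute names back out, separated by single spaces."""
    return " ".join(attrs)


attrs = parse_cell_attributes("foo bar foo")
roundtripped = serialize_cell_attributes(attrs)  # the second 'foo' is gone
```

Once the text has been interpreted as attributes, the duplicate cannot be recovered at serialization time unless the original source is also kept.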

Lists / lists-in-tables RT diffs

 * Romance languages -- lots of dl-dt list rt diffs
 * noticeboard/IncidentArchive541 (WP Admin IncidentArchives541) -- lots of list rt diffs
 * User page -- lists in indented tables rt diffs ( ::{|.. )

Unbalanced tag diffs

 * Complete games -- uses a &lt;li&gt; tag at the start of a #-wikilist but gets rt-ed with an extra &lt;/li&gt; tag as expected. This seems related to the bug Roan reported.

Fixed
Idea: If detectable, we could add a flag on automatically-closed tags so those tags can be skipped on RT. But, unsure if we can detect it since the treebuilder closes the tags and we cannot pass along attribute information on closing tags. Implemented in git SHA 051bf97b
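
The flag idea above can be sketched roughly as follows. This is an illustration only and makes no claim about what the referenced commit does: the `EndTag` class and its `auto_inserted` field are hypothetical, and the sketch assumes the flag can actually be attached to end tags, which is exactly the open question raised above.

```python
from dataclasses import dataclass


@dataclass
class EndTag:
    name: str
    # Hypothetical flag: True when the tree builder closed the tag itself
    # because the source wikitext never closed it.
    auto_inserted: bool = False


def serialize(tokens: list) -> str:
    """Serialize a flat token list, skipping end tags the source never had."""
    out = []
    for tok in tokens:
        if isinstance(tok, EndTag):
            if not tok.auto_inserted:
                out.append(f"</{tok.name}>")
        else:
            out.append(tok)  # plain strings stand in for all other tokens
    return "".join(out)
```

With the flag set, an unclosed `<li>` in the source is serialized without an added `</li>`, so the roundtrip shows no diff for that tag.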


 * Karate Coyote -- unbalanced/incorrect closing-br-tag diffs.
 * 2002 Australian Formula 3 season -- uses &lt;hiddentext&gt; but an incorrect closing tag &lt;\hiddentext&gt; instead of &lt;/hiddentext&gt; -- either because of this or even otherwise, this extension text gets fostered out of the table header and introduces a rt diff.
 * Mallor -- several unclosed &lt;li&gt; tags.
 * SC Canada Opinions -- lots of unbalanced &lt;small&gt; tags in table cells (opened but not closed). This diff is present in other pages as well, for example Nova Scotia Lt. Govs. list -- both are Canadian govt. page lists (clearly the same editor is involved in both :-)).  These two pages have the most semantic diffs in our RT testing.
 * Pitt Panthers football -- lots of unclosed &lt;font&gt; tags generate RT diffs since Parsoid closes unclosed tags.
 * Indonesia's Got Talent -- lots of unclosed &lt;center&gt; tags generate RT diffs since Parsoid closes unclosed tags.

Wikitext "syntax errors"

 * List of scooter manufacturers -- in the table listing the scooter manufacturers, table cell content is added after table row wikitext ( |- ) instead of table data wikitext ( | ). The wikitext in question is this: |-Gorilla Motor Works || China || .  The PHP parser ignores this content (see http://en.wikipedia.org/wiki/List_of_scooter_manufacturers).  Parsoid moves the content to the beginning of the table rather than ignoring it.  When roundtripped, this content shows up in the wrong place.  So, a possible Parsoid fix would be to keep fostering content that shows up in a table row rather than swallowing it (fostering lets the editor recognize the error in the output and fix it), introduce a placeholder tag so the content is rt-ed in its original place, and add a marker on the fostered content so that it is suppressed during rt-ing.  Or, maybe just not bother about these kinds of errors.
 * Food Science Australia -- similar bug as in the page above.
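
Pages with this kind of error could be found ahead of time with a simple scan. The following is a heuristic sketch, not part of Parsoid: `ROW_WITH_CELLS` and `find_row_lines_with_cells` are hypothetical names, and the pattern only catches row lines that also contain a `||` cell separator, so it can miss other variants or misfire on unusual row attributes.

```python
import re

# A table-row marker ("|-") followed on the same line by text that
# contains a cell separator ("||"), as in "|-Gorilla Motor Works || China ||".
ROW_WITH_CELLS = re.compile(r"^\s*\|-+[^|\n]*\|\|")


def find_row_lines_with_cells(wikitext: str) -> list[tuple[int, str]]:
    """Return (line number, line) pairs where cell content follows a row marker."""
    return [
        (number, line)
        for number, line in enumerate(wikitext.splitlines(), start=1)
        if ROW_WITH_CELLS.match(line)
    ]


sample = "{|\n|-Gorilla Motor Works || China ||\n|-\n| ok || fine\n|}"
hits = find_row_lines_with_cells(sample)
```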

Wikitext escaping diffs
Nothing here at this time.

Fixed
Math-related pages use a lot of braces. Because brace pairs are unconditionally nowiki-escaped right now, several of these pages have a lot of semantic/syntactic diffs.


 * Chinese Restaurant Process
 * Swamee-Jain equation
 * Angle sum law

The wikitext escaping code is due for an overhaul -- there are several cases of escaping. For example, right now, the escaper always escapes content in "[" and "]". The specifics of how that content is escaped vary (sometimes everything is escaped; other times, only the right bracket is escaped).
 * -- Parserfunction-generated external link target in bracketed external link, closing bracket nowiki-escaped.
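
The two escaping variants mentioned above (escape everything, or escape only the right bracket) can be contrasted in a small sketch. This is not Parsoid's escaper: `escape_whole` and `escape_right_bracket` are hypothetical helpers that handle only simple, non-nested single-bracket spans.

```python
import re

# A single-bracket span with no nested brackets, e.g. "[1]" or "[note]".
BRACKETED = re.compile(r"\[([^\[\]]*)\]")


def escape_whole(text: str) -> str:
    """Wrap each bracketed span, brackets included, in nowiki tags."""
    return BRACKETED.sub(lambda m: f"<nowiki>{m.group(0)}</nowiki>", text)


def escape_right_bracket(text: str) -> str:
    """Leave the span alone but nowiki-escape only its closing bracket."""
    return BRACKETED.sub(lambda m: f"[{m.group(1)}<nowiki>]</nowiki>", text)
```

Either variant introduces nowiki tags that the source did not contain, so unconditional escaping shows up as an RT diff wherever the source used plain brackets.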

Other RT diffs

 * JRuby
 * Stomach cutting -- notable template rt diff
 * WP page -- lots of missing newlines among other white-space and nowiki diffs.
 * Difference in error recovery between Parsoid and PHP parser when extension and template tags nest improperly. This affects parse output and RTing on this page Milky way galaxy. Simplified test case:

= Performance issues =


 * Sanskrit -- takes a long time (~80 sec) to parse this page.
 * Anexo:Monumentos_Históricos_de_Panamá -- takes ~80-85 secs to parse this page.