Parsoid/Roundtrip testpages

= Kinds of RT problems =


 * Wikitext escaping diffs: Lots of pages have wikitext escaping of text in brackets ( [..] ). This piece of wikitext escaping code needs fixing.
 * Whitespace/quote diffs for ref and other extension tags: Lots of pages have whitespace/quote diffs around extension tags ( vs,  vs )
 * Template diffs: Missing template output / duplicate template output. Most of these are probably because of incorrect DSR (DOM wikitext source range) information.
 * Unbalanced quote diffs: Some wikipages don't close quotes, and Parsoid adds the missing balancing quotes, which show up as RT diffs.
 * Table wikitext diffs: Missing "-" or missing "|" chars or some such -- maybe a serializer or html/missing-rt-data issue -- to be investigated.
 * Lists/lists in tables diffs: List rt diffs or lists in table rt diffs (mostly involving dl-dt-dd lists)
 * Unbalanced tag diffs: On several pages, there are unbalanced opening and closing tags that obviously don't RT correctly -- need workarounds/fixes/hacks/or wont-fix.
 * Wikitext "syntax errors": Pages where the wikitext syntax is erroneous in the surrounding context and leads to differences in parsing and roundtripping -- unbalanced tags above are one special case of this broader category.
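
As a rough illustration of how the categories above could be told apart mechanically, here is a minimal Python sketch that gives a coarse label to a single roundtrip result. The function name and the label strings are made up for this example and are not part of Parsoid or its RT test tooling; it only separates whitespace-only and quote/nowiki-only diffs from everything else.

```python
import re


def _strip_ws(text: str) -> str:
    """Remove all whitespace so whitespace-only changes compare equal."""
    return re.sub(r"\s+", "", text)


def _strip_syntactic(text: str) -> str:
    """Also drop quote characters and <nowiki> wrappers (hypothetical heuristic)."""
    return re.sub(r"</?nowiki\s*/?>|['\"]", "", _strip_ws(text))


def classify_rt_diff(original: str, roundtripped: str) -> str:
    """Return a coarse, illustrative label for one roundtrip result."""
    if original == roundtripped:
        return "zero"
    if _strip_ws(original) == _strip_ws(roundtripped):
        return "whitespace"
    if _strip_syntactic(original) == _strip_syntactic(roundtripped):
        return "quote-or-nowiki"
    # Template, table, list, and tag diffs all land here in this sketch.
    return "other"
```

A diff labelled "other" would still need manual triage into the template, table, list, and unbalanced-tag categories.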

Zero diffs

 * Medha_Patkar
 * John_McCain
 * Political_science
 * Hindu_reform_movements
 * PHP
 * Middle_Way
 * Substitution Cipher

Extension (ref, source, code, etc.) whitespace and quote diffs

 * No-self
 * Mahatma_Gandhi

Crashes

 * Advaita_Vedanta

When I rescue errors and continue, the real problem emerges -- the parsed output for this page is mangled. The output from templates is interlaced rather than being serialized correctly. This messes up the template encapsulation code -- something is wrong with the async pipeline and needs fixing.


 * Help:Templates parser crashes in template encapsulation code.

Few major template diffs

 * Buddha

Several template diffs

 * Anna_Hazare -- " " is the reduced test case that causes most of the RT problems on this page because of nested pre-tags showing up in the token stream. Yet to fix.
 * Adi_Shankara -- some newline/whitespace issues in one rt-ed cite template use. Yet to investigate.
 * Theravada -- now mostly whitespace, nowiki, and mismatched quote diffs.
 * Dependent_origination
 * Yoga
 * Vijayanagar Empire -- there is one notable diff which is a stray is rt-ed as &lt;/s
 * K-1 Rising 2002 -- template / table interaction

Lots of Whitespace/quote diffs

 * Pāli
 * Sanskrit
 * Node.js
 * Paul Ryan
 * Bellary -- cite template uses are losing newlines in RT-ing

Unbalanced quote diffs

 * Theravada
 * Emma Cairns
 * Gondi bank -- here and on many other pages, quotes are nested improperly across ref-tags. Ex: This is an example of a ''quotes with each other .  This then parses and round trips improperly.

Table wikitext diffs

 * Dont_Ask, Dont_Tell Repeal Act of 2010 -- templates used to set td attributes are not RT-ing.
 * Nagarjuna -- scroll down to find the table char diffs.
 * Enigma Machine
 * Simpsons dvd sets -- empty lines are replaced with a td-wikitext char (|) -- probably either a parse or a serializer error.

Test cases
1. ! chars in links in table cells

This example does not parse and RT as expected.

Another one (NOTE: the lines here all have a leading space) where the ! marks don't parse properly. !!h2 gets fostered out and !h1 gets parsed as a td cell instead of as 2 th cells.

2. Unclosed attributes (missing " char) and | char in table cells

This example does not parse and RT as expected.

Lists / lists-in-tables RT diffs

 * Romance languages -- lots of dl-dt list rt diffs
 * noticeboard/IncidentArchive541 WP Admin IncidentArchives541 -- lots of list rt diffs
 * User page -- lists in indented tables rt diffs ( ::{|.. )

Unbalanced tag diffs
Idea: If detectable, we could add a flag on automatically-closed tags so those tags can be skipped on RT. However, it is unclear whether we can detect this, since the treebuilder closes the tags and we cannot pass along attribute information on closing tags.
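
To make the idea concrete, here is a toy Python sketch. It is not Parsoid code: the `Node` class, the `auto_closed` field, and the token format are all invented for illustration. It assumes the flag can live on the element node rather than on the closing tag, and it ignores real HTML5 tree-builder behavior such as reopening formatting elements.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    children: list = field(default_factory=list)  # Node or str items
    auto_closed: bool = False  # hypothetical flag: the closer was synthesized


def build_tree(tokens):
    """Toy tree builder; tokens are ("open", tag), ("close", tag), ("text", s)."""
    root = Node("root")
    stack = [root]
    for kind, value in tokens:
        if kind == "text":
            stack[-1].children.append(value)
        elif kind == "open":
            node = Node(value)
            stack[-1].children.append(node)
            stack.append(node)
        elif kind == "close":
            if value not in [n.tag for n in stack[1:]]:
                continue  # stray closing tag: simply dropped in this toy model
            # Elements above the match were never closed in the source.
            while stack[-1].tag != value:
                stack.pop().auto_closed = True
            stack.pop()
    # Anything still open at end of input was auto-closed as well.
    while len(stack) > 1:
        stack.pop().auto_closed = True
    return root


def serialize(node):
    """Serialize the tree, skipping closers that the builder synthesized."""
    inner = "".join(c if isinstance(c, str) else serialize(c) for c in node.children)
    if node.tag == "root":
        return inner
    closer = "" if node.auto_closed else f"</{node.tag}>"
    return f"<{node.tag}>{inner}{closer}"
```

For example, the token stream for `<td><small>x</td>` serializes back without an added `</small>`. Whether a real treebuilder can expose this information is exactly the open question noted above.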


 * Karate Coyote -- unbalanced/incorrect closing-br-tag diffs.
 * SC Canada Opinions -- lots of unbalanced <small> tags in table cells (opened but not closed). This diff is present in other pages as well, e.g. Nova Scotia Lt. Govs. list -- both are Canadian govt. page lists (clearly same editor is involved in both :-)).  These two pages have the most semantic diffs in our RT testing.
 * 2002 Australian Formula 3 season -- uses <hiddentext> but an incorrect closing tag <\hiddentext> instead of </hiddentext> -- either because of this or even otherwise, this extension text gets fostered out of the table header and introduces a rt diff.
 * Complete games -- uses a <li> tag at the start of a #-wikilist but gets rt-ed with an extra </li> tag as expected. This seems related to the bug Roan reported.
 * Pitt Panthers football -- lots of unclosed <font> tags generate RT diffs since Parsoid closes unclosed tags.
 * Indonesia's Got Talent -- lots of unclosed <center> tags generate RT diffs since Parsoid closes unclosed tags.

Wikitext "syntax errors"

 * List of scooter manufacturers -- in the table listing the scooter manufacturers, table cell content is added after table row wikitext ( |- ) instead of table data wikitext ( | ). The wikitext in question is this: |-Gorilla Motor Works || China || .  The PHP parser ignores this content (see http://en.wikipedia.org/wiki/List_of_scooter_manufacturers).  Parsoid moves the content to the beginning of the table rather than ignoring it, so when roundtripped, this content shows up in the wrong place.  A possible Parsoid fix: do not swallow content that shows up in a table row (having it fostered lets the editor recognize the error in the output and fix it), introduce a placeholder tag so the fostered content is rt-ed in place, and add a marker on the fostered content so it is suppressed during rt-ing.  Or, maybe just do not bother about these kinds of errors.
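
Separately from any Parsoid fix, rows like this could be found ahead of time with a simple check. The Python sketch below is only a heuristic and not how either parser works: it assumes a valid table-row line is `|-` followed by nothing but optional HTML-style attributes, and it flags any other `|-` line. The regex and function name are invented for this example.

```python
import re

# Assumption: a well-formed row line is "|-" (one or more dashes) followed
# only by optional name=value attributes.
ROW_LINE = re.compile(
    r"""^\|-+\s*(?:[\w:-]+\s*=\s*(?:"[^"]*"|'[^']*'|[^\s"'|]+)\s*)*$"""
)


def rows_with_stray_content(wikitext: str):
    """Return (line number, line) pairs for "|-" lines carrying non-attribute text."""
    hits = []
    for lineno, line in enumerate(wikitext.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith("|-") and not ROW_LINE.match(stripped):
            hits.append((lineno, stripped))
    return hits
```

On a table containing the line `|-Gorilla Motor Works || China ||`, this reports that line, while plain `|-` rows and rows such as `|- style="color:red"` pass.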

Other RT diffs

 * JRuby
 * Stomach cutting -- notable template rt diff
 * WP page -- lot of missing newlines among other white-space and nowiki diffs.

= Performance issues =


 * Sanskrit -- takes a long time (~80 sec) to parse this page.