Parsoid/Roundtrip testpages

= Kinds of RT problems =


 * Wikitext escaping diffs: Lots of pages have wikitext escaping of text in brackets ( [..] ). This piece of wikitext escaping code needs fixing.
 * Whitespace/quote diffs for ref and other extension tags: Lots of pages have whitespace/quote diffs around extension tags ( vs,  vs )
 * Template diffs: Missing template output / duplicate template output. Most of these are probably because of incorrect DSR (DOM wikitext source range) information.
 * Unbalanced quote diffs: Some wikipages don't close quotes, and Parsoid adds the missing balancing quotes, which show up as RT diffs.
 * Table wikitext diffs: Missing "-" or missing "|" chars or some such -- maybe a serializer or html/missing-rt-data issue -- to be investigated.
 * Unbalanced tag diffs: On several pages, there are unbalanced opening and closing tags that obviously don't RT correctly -- need workarounds/fixes/hacks/or won't-fix.
 * Lists/lists in tables diffs: List rt diffs or lists in table rt diffs (mostly involving dl-dt-dd lists)
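
To make the buckets above a bit more concrete, here is a rough sketch of how an RT pair could be sorted into "no diff", "whitespace/quote only", and "everything else". This is not Parsoid's test code: the function names and the normalization rule (drop all whitespace and quote chars before comparing) are assumptions made for illustration, and the rule is deliberately coarse.

```python
import difflib
import re


def normalize(text: str) -> str:
    """Hypothetical normalization: drop all whitespace and quote characters."""
    text = re.sub(r"\s+", "", text)
    return text.replace('"', "").replace("'", "")


def classify_rt_diff(original: str, roundtripped: str) -> str:
    """Return 'none', 'whitespace/quote', or 'other' for a round-trip pair."""
    if original == roundtripped:
        return "none"
    if normalize(original) == normalize(roundtripped):
        return "whitespace/quote"
    return "other"


def changed_lines(original: str, roundtripped: str) -> list:
    """List the removed ('- ') and added ('+ ') lines between the two texts."""
    diff = difflib.ndiff(original.splitlines(), roundtripped.splitlines())
    return [line for line in diff if line.startswith(("- ", "+ "))]
```

Because wikitext bold/italic markers are also quote chars, this rule lumps unbalanced-quote diffs in with the whitespace/quote bucket; separating the two would need a smarter comparison.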

Zero diffs

 * Medha_Patkar
 * John_McCain
 * Political_science
 * Hindu_reform_movements
 * PHP
 * Middle_Way
 * Substitution Cipher

Extension (ref, source, code, etc.) whitespace and quote diffs

 * No-self
 * Mahatma_Gandhi

Crashes

 * Advaita_Vedanta

When I rescue errors and continue, the real problem emerges -- the parsed output for this page is mangled. The output from templates is interleaved rather than being serialized in order. This messes up the template encapsulation code -- something is up with the async pipeline that needs fixing.


 * Help:Templates -- parser crashes in template encapsulation code.

Few major template diffs

 * Buddha

Several template diffs

 * Anna_Hazare -- " " is the reduced test case that causes most of the RT problems on this page because of nested pre-tags showing up in the token stream. Yet to fix.
 * Adi_Shankara -- some newline/whitespace issues in one rt-ed cite template use. Yet to investigate.
 * Theravada -- now mostly whitespace, nowiki, and mismatched quote diffs.
 * Dependent_origination
 * Yoga
 * Vijayanagar Empire -- there is one notable diff which is a stray is rt-ed as &lt;/s
 * K-1 Rising 2002 -- template / table interaction

Lots of Whitespace/quote diffs

 * Pāli
 * Sanskrit
 * Node.js
 * Paul Ryan
 * Bellary -- cite template uses are losing newlines in RT-ing

Unbalanced quote diffs

 * Theravada
 * Emma Cairns

Table wikitext diffs

 * Dont_Ask, Dont_Tell Repeal Act of 2010 -- templates used to set td attributes are not RT-ing.
 * Nagarjuna -- scroll down to find the table char diffs.
 * Enigma Machine
 * Simpsons dvd sets -- RT replaces empty lines with a td-wikitext char (|); probably either a parse or a serializer error.

Test cases
1. ! chars in links in table cells

This example does not parse and RT as expected.

Another one (NOTE: the lines here all have a leading space) where the ! marks don't parse properly. !!h2 gets fostered out and !h1 gets parsed as a td cell, instead of both being parsed as th cells.

2. Unclosed attributes (missing " char) and | char in table cells

This example does not parse and RT as expected.

Unbalanced tag diffs
Idea: If detectable, we could add a flag on automatically-closed tags so those tags can be skipped on RT. But, unsure if we can detect it since the treebuilder closes the tags and we cannot pass along attribute information on closing tags.
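
A minimal sketch of the serializer half of this idea, assuming the flag could somehow be set. The `Element` class, the `auto_closed` field, and `serialize` are all made up for illustration; none of them are Parsoid names, and the open question of whether the treebuilder lets us detect auto-closed tags is not addressed here.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Element:
    tag: str
    children: List[Union["Element", str]] = field(default_factory=list)
    # Hypothetical flag: True when the end tag was synthesized by the
    # tree builder and was not present in the original wikitext.
    auto_closed: bool = False


def serialize(node: Union[Element, str]) -> str:
    """Serialize a node, skipping end tags that were added automatically."""
    if isinstance(node, str):
        return node
    inner = "".join(serialize(child) for child in node.children)
    end_tag = "" if node.auto_closed else f"</{node.tag}>"
    return f"<{node.tag}>{inner}{end_tag}"
```

With the flag set, `serialize(Element("small", ["text"], auto_closed=True))` gives back `<small>text`, matching source that opened the tag and never closed it; without the flag the closing tag is emitted as usual.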


 * Karate Coyote -- unbalanced/incorrect closing-br-tag diffs.
 * SC Canada Opinions -- lots of unbalanced &lt;small&gt; tags in table cells (opened but not closed). This diff is present on other pages as well, e.g. Nova Scotia Lt. Govs. list; both are Canadian govt. page lists (clearly the same editor is involved in both :-)). These two pages have the most semantic diffs in our RT testing.
 * 2002 Australian Formula 3 season -- uses &lt;hiddentext&gt; but with an incorrect closing tag, &lt;\hiddentext&gt; instead of &lt;/hiddentext&gt;. Either because of this or even otherwise, this extension text gets fostered out of the table header and introduces an rt diff.
 * Complete games -- uses a &lt;li&gt; tag at the start of a #-wikilist but gets rt-ed with an extra &lt;/li&gt; tag, as expected. This seems related to the bug Roan reported.
 * Pitt Panthers football -- lots of unclosed &lt;font&gt; tags generate RT diffs, since Parsoid closes unclosed tags and the result RTs differently.

Lists / lists-in-tables RT diffs

 * Romance languages -- lots of dl-dt list rt diffs
 * noticeboard/IncidentArchive541 WP Admin IncidentArchives541 -- lots of list rt diffs
 * User page -- rt diffs for lists in indented tables ( ::{|.. )

Other RT diffs

 * JRuby
 * Stomach cutting -- notable template rt diff
 * WP page -- lots of missing newlines, among other whitespace and nowiki diffs.

= Performance issues =


 * Sanskrit -- takes a long time (~80 sec) to parse this page.