Parsoid/Roundtrip testpages

= Kinds of RT problems =


 * Wikitext escaping diffs: Many pages show diffs caused by wikitext escaping of text in brackets ( [..] ). The escaping code for this case needs fixing.
 * Whitespace/quote diffs for ref and other extension tags: Lots of pages have whitespace/quote diffs around extension tags ( vs,  vs )
 * Template diffs: Missing template output / duplicate template output. Most of these are probably because of incorrect DSR (DOM wikitext source range) information.
 * Unbalanced quote diffs: Some wiki pages don't close quotes; Parsoid adds the missing balancing quotes, which show up as RT diffs.
 * Table wikitext diffs: Missing "-" or "|" chars or some such -- maybe a serializer or html/missing-rt-data issue -- to be investigated.
 * Unbalanced tag diffs: On several pages, there are unbalanced opening and closing tags that obviously don't RT correctly -- these need workarounds/fixes/hacks or a won't-fix.
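
A rough way to sort pages into the categories above is to compare the original wikitext against the roundtripped wikitext after progressively looser normalization. The following is only a minimal sketch in Python, not Parsoid's own test code: the function names (`classify_rt`, `rt_diff`) and the sample strings in the usage note are made up for illustration, and the quote handling only strips double quotes (as used around tag attributes), so it says nothing about unbalanced wikitext quotes.

```python
import difflib
import re


def normalize_ws(text):
    """Collapse runs of whitespace so whitespace-only changes compare equal."""
    return re.sub(r"\s+", " ", text).strip()


def normalize_quotes(text):
    """Drop double quotes so attribute-quoting changes compare equal."""
    return text.replace('"', "")


def classify_rt(original, roundtripped):
    """Return 'zero', 'whitespace', 'whitespace/quote', or 'other'.

    Hypothetical categories, loosely mirroring the list above.
    """
    if original == roundtripped:
        return "zero"
    a, b = normalize_ws(original), normalize_ws(roundtripped)
    if a == b:
        return "whitespace"
    if normalize_quotes(a) == normalize_quotes(b):
        return "whitespace/quote"
    return "other"


def rt_diff(original, roundtripped):
    """Return a unified diff (list of lines) between the two wikitext strings."""
    return list(
        difflib.unified_diff(
            original.splitlines(),
            roundtripped.splitlines(),
            "original",
            "roundtripped",
            lineterm="",
        )
    )
```

For example, `classify_rt('<ref name="x">', '<ref name=x>')` falls into the `whitespace/quote` bucket, while a diff where the roundtripped text gains a nowiki wrapper ends up in `other` and would need a closer look with `rt_diff`.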

Zero diffs

 * Medha_Patkar
 * John_McCain
 * Political_science
 * Hindu_reform_movements
 * PHP
 * Middle_Way
 * Substitution Cipher

Extension (ref, source, code, etc.) whitespace and quote diffs

 * No-self
 * Mahatma_Gandhi

Crashes

 * Advaita_Vedanta

When I rescue errors and continue, the real problem emerges -- the parsed output for this page is mangled. The output from different templates is interleaved rather than being serialized in order. This breaks the template encapsulation code -- something in the async pipeline needs fixing.


 * Help:Templates -- parser crashes in template encapsulation code.
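
The interleaving problem seen on Advaita_Vedanta can be illustrated in isolation. The sketch below is not Parsoid code (Parsoid's pipeline is not written in Python); it is a toy model, with hypothetical names (`expand`, `interleaved`, `ordered`), of concurrent template expansions. When every expansion appends to one shared output, chunks from different templates can mix; when each expansion writes to its own buffer and the buffers are joined in source order, each template's output stays contiguous.

```python
import asyncio


async def expand(name, chunks, delay, sink):
    """Pretend to expand one template, emitting its chunks with async pauses."""
    for chunk in chunks:
        await asyncio.sleep(delay)  # stands in for an async fetch/expansion step
        sink.append(f"{name}:{chunk}")


async def interleaved(templates):
    """All expansions append to one shared list, so chunks may interleave."""
    shared = []
    await asyncio.gather(*(expand(n, c, d, shared) for n, c, d in templates))
    return shared


async def ordered(templates):
    """Each expansion gets its own buffer; buffers are joined in source order."""
    buffers = [[] for _ in templates]
    await asyncio.gather(
        *(expand(n, c, d, buf) for (n, c, d), buf in zip(templates, buffers))
    )
    return [token for buf in buffers for token in buf]


if __name__ == "__main__":
    # Hypothetical input: two templates, each emitting a start and an end chunk.
    templates = [("A", ["start", "end"], 0.02), ("B", ["start", "end"], 0.03)]
    print(asyncio.run(interleaved(templates)))
    print(asyncio.run(ordered(templates)))
```

With the shared list, the order depends on timing, which is what would confuse any code that expects a template's start and end markers to be adjacent in the stream. The per-template buffer version always yields all of A's chunks before B's.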

Few major template diffs

 * Buddha

Several template diffs

 * Anna_Hazare -- " " is the reduced test case that causes most of the RT problems on this page because of nested pre-tags showing up in the token stream. Yet to fix.
 * Adi_Shankara -- some newline/whitespace issues in one rt-ed cite template use. Yet to investigate.
 * Theravada -- now mostly whitespace, nowiki, and mismatched quote diffs.
 * Dependent_origination
 * Yoga
 * Vijayanagar Empire -- there is one notable diff: a stray closing tag is rt-ed as &lt;/s
 * K-1 Rising 2002 -- template / table interaction

Lots of Whitespace/quote diffs

 * Pāli
 * Sanskrit
 * Node.js
 * Paul Ryan
 * Bellary -- cite template uses are losing newlines in RT-ing

Unbalanced quote diffs

 * Theravada
 * Emma Cairns

Table wikitext diffs

 * Dont_Ask, Dont_Tell Repeal Act of 2010 -- templates used to set td attributes are not RT-ing.
 * Nagarjuna -- scroll down to find the table char diffs.
 * Enigma Machine

Test cases
1. ! chars in links in table cells

This example does not parse and RT as expected.

2. Unclosed attributes (missing " char) and | char in table cells

This example does not parse and RT as expected.

Unbalanced tag diffs

 * Karate Coyote -- unbalanced/incorrect closing-br-tag diffs.
 * SC Canada Opinions -- lots of unbalanced <small> tags in table cells (opened but not closed). This diff is present on other pages as well, e.g. the Nova Scotia Lt. Govs. list -- both are Canadian govt. list pages (clearly the same editor is involved in both :-)). These two pages have the most semantic diffs in our RT testing.

Other RT diffs

 * JRuby
 * Romance languages -- lots of dl-dt list RT diffs
 * WP Admin noticeboard/IncidentArchive541 -- lots of list RT diffs
 * Stomach cutting -- notable template RT diff
 * WP page -- lots of missing newlines, among other whitespace and nowiki diffs.

= Performance issues =


 * Sanskrit -- takes a long time (~80 sec) to parse this page.