Parsoid/Roundtrip testpages

From mediawiki.org

Kinds of RT problems

  • Wikitext escaping diffs: Many pages show diffs because text in brackets ([..]) gets wikitext-escaped. This part of the wikitext escaping code needs fixing.
  • Whitespace/quote diffs for ref and other extension tags: Lots of pages have whitespace/quote diffs around extension tags (<ref name='foo'/> vs <ref name='foo' />, <source lang='javascript' /> vs <source lang="javascript"/>)
  • Template diffs: Missing template output / duplicate template output. Most of these are probably because of incorrect DSR (DOM wikitext source range) information.
  • Unbalanced quote diffs: Some wiki pages don't close quotes, and Parsoid adds the missing balancing quotes, which show up as RT diffs.
  • Table wikitext diffs: Missing "-" or missing "|" chars or some such -- maybe a serializer or html/missing-rt-data issue -- to be investigated.
  • Lists/lists in tables diffs: List rt diffs or lists in table rt diffs (mostly involving dl-dt-dd lists)
  • Unbalanced tag diffs: On several pages, there are unbalanced opening and closing tags that obviously don't RT correctly -- need workarounds/fixes/hacks/or won't-fix.
  • Wikitext "syntax errors": Pages where the wikitext syntax is erroneous in the surrounding context and leads to differences in parsing and roundtripping -- unbalanced tags above are one special case of this broader category.
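The whitespace/quote category above (ref and other extension tags) is cosmetic: the attribute values are the same and only the quoting or spacing differs. One way to separate such diffs from semantic ones is to canonicalize tag syntax before comparing. The following is a minimal Python sketch under stated assumptions: normalize_tags and is_cosmetic_tag_diff are hypothetical helpers, not part of Parsoid or its round-trip test runner, and the regular expressions only cover simple tags with simple attributes.

```python
import re

# Hypothetical helpers for illustration only; not Parsoid code.
# An attribute value is double-quoted, single-quoted, or bare.
VALUE = r'''"[^"]*"|'[^']*'|[^\s>/]+'''
ATTR_RE = re.compile(r'''([\w-]+)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>/]+))''')
TAG_RE = re.compile(
    r"<(/?)(\w+)((?:\s+[\w-]+\s*=\s*(?:" + VALUE + r"))*)\s*(/?)>"
)


def normalize_tags(wikitext):
    """Rewrite simple tags with double-quoted attributes and a ' /' self-close."""
    def repl(match):
        closing, name, attrs, self_close = match.groups()
        parts = []
        for attr in ATTR_RE.finditer(attrs):
            # Exactly one of the three value groups matched.
            value = next(g for g in attr.groups()[1:] if g is not None)
            parts.append(f'{attr.group(1)}="{value}"')
        attr_text = "".join(" " + p for p in parts)
        tail = " /" if self_close else ""
        return f"<{closing}{name}{attr_text}{tail}>"
    return TAG_RE.sub(repl, wikitext)


def is_cosmetic_tag_diff(original, roundtripped):
    """True if the two strings differ only in tag quoting/whitespace."""
    return (original != roundtripped
            and normalize_tags(original) == normalize_tags(roundtripped))
```

With this sketch, the two example pairs from the list above compare as cosmetic, while a changed attribute value does not. Values that themselves contain quote characters are not handled.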

Zero diffs

Extension (ref, source, code, etc.) whitespace and quote diffs

Crashes

When I rescue errors and continue, the real problem emerges -- the parsed output for this page is mangled. The output from templates is interleaved rather than being serialized in order. This breaks the template encapsulation code -- something in the async pipeline needs fixing.

  • ApocalyPS3 - Cannot call "removeChild" of null at deleteNode (/data/project/parsoid/js/lib/mediawiki.DOMPostProcessor.js:101:15)
  • Book:Bromine - RangeError: Maximum call stack size exceeded - Doesn't seem to happen on the web service, maybe an artifact of the round-trip test runner having a lot of callbacks?
  • Gordan Kožulj - TypeError: Cannot read property 'nextSibling' of undefined

Several template diffs

Nothing major right now.

Fixed

Lots of Whitespace/quote diffs

Unbalanced quote diffs

  • Theravada
  • Emma Cairns
  • Gondi bank -- here and on many other pages, quotes are nested improperly across ref tags. Ex: This is an example of a ''quotes <ref> and ref tags '' overlapping </ref> with each other. This then parses and round-trips improperly. Fixed by moving cite processing earlier in the stage 3 pipeline.

Table wikitext diffs

Test cases

1. ! chars in links in table cells

This example does not parse and RT as expected.

{|
| [[Foo!! bar]]
|}

2. Table lines starting with a leading space

<!-- The 3rd line has a leading space -->

{|
|-
 |a||b 
|}

Another one where the ! marks don't parse properly: !h1!!h2 gets parsed as a string in a td.

<!-- all lines below have a leading space -->
 {|
 |-
 !h1!!h2
 |foo||bar
 |}

3. Unclosed attributes (missing " char) and | char in table cells

This example does not parse and RT as expected.

{|
|  style="text-align:center; {{Party shading/Republican}}|'''[[United States presidential election, 2008|2008]]'''
|  style="text-align:center; {{Party shading/Republican}}|'''53.1%''' ''4,523''
|  style="text-align:center; {{Party shading/Democratic}}|43.9% ''3,743''
|  style="text-align:center; background:honeyDew;"|2.9% ''243''
|}

Fixed in https://gerrit.wikimedia.org/r/#/c/45919/ and https://gerrit.wikimedia.org/r/#/c/45699/

4. Table cells separated by "|" instead of "||". Unsure whether this should be fixed at all, since there is no clean fix for it. This may just have to be treated as bad wikitext that cannot be RT-ed as written.

{|
| foo bar foo | baz
|}

The 'foo bar foo' in this example gets parsed as 3 attributes of the td and the second 'foo' gets dropped as a duplicate.
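The behavior described above can be modeled in a few lines. This is a simplified Python sketch, not Parsoid's tokenizer: split_cell and serialize_cell are hypothetical names, the model ignores links, templates and "||" separators, and it treats every whitespace-separated word before the first "|" as a valueless attribute name.

```python
def split_cell(line):
    """Split a '| attrs | content' cell line at the first single pipe.

    Returns (attribute_names, content). Duplicate attribute names are
    dropped, mirroring the behavior described in the text.
    """
    body = line.lstrip()[1:]  # drop the leading cell marker '|'
    attr_src, sep, content = body.partition("|")
    if not sep:
        # No second pipe: the whole line is cell content.
        return [], body.strip()
    names = []
    for name in attr_src.split():
        if name not in names:  # a repeated name is discarded
            names.append(name)
    return names, content.strip()


def serialize_cell(attrs, content):
    """Write the cell back out from the parsed pieces."""
    if attrs:
        return "| " + " ".join(attrs) + " | " + content
    return "| " + content


attrs, content = split_cell("| foo bar foo | baz")
roundtripped = serialize_cell(attrs, content)
```

In this model the second 'foo' is gone by the time the cell is serialized, so the output cannot match the input without extra round-trip data recording the original attribute source.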

Lists / lists-in-tables RT diffs

The lists-in-tables and tables-in-lists combinations have now been dealt with adequately, and these diffs are no longer present.

Fixed

Unbalanced tag diffs

Fixed

Idea: If detectable, we could add a flag on automatically-closed tags so those tags can be skipped on RT. However, it is unclear whether we can detect this, since the treebuilder closes the tags and we cannot pass along attribute information on closing tags. Implemented in git SHA 051bf97b.
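The flagging idea can be illustrated with a toy tag balancer. This is a hedged Python sketch, not the change in the commit mentioned above: balance and serialize are hypothetical functions, the input is assumed to contain only plain text and attribute-less tags, and the balancer itself inserts the end tags (which sidesteps the treebuilder limitation noted in the text).

```python
import re

# Matches an attribute-less start/end tag, or a run of plain text.
TOKEN_RE = re.compile(r"<(/?)(\w+)>|([^<]+)")


def balance(source):
    """Return (kind, value, auto) tokens with missing end tags added.

    End tags inserted by the balancer carry auto=True so that a
    serializer can skip them when round-tripping.
    """
    tokens, stack = [], []
    for match in TOKEN_RE.finditer(source):
        closing, name, text = match.groups()
        if text is not None:
            tokens.append(("text", text, False))
        elif not closing:
            stack.append(name)
            tokens.append(("open", name, False))
        elif name in stack:
            # Auto-close tags opened after `name` that were never closed.
            while stack[-1] != name:
                tokens.append(("close", stack.pop(), True))
            stack.pop()
            tokens.append(("close", name, False))
        # Stray end tags without a matching start tag are dropped here.
    while stack:
        tokens.append(("close", stack.pop(), True))
    return tokens


def serialize(tokens, skip_auto=True):
    """Serialize tokens, optionally omitting auto-inserted end tags."""
    out = []
    for kind, value, auto in tokens:
        if kind == "text":
            out.append(value)
        elif kind == "open":
            out.append(f"<{value}>")
        elif not (auto and skip_auto):
            out.append(f"</{value}>")
    return "".join(out)


source = "<center>a <small>b</center> c"
tokens = balance(source)
```

Serializing with skip_auto=True reproduces the unbalanced source, while skip_auto=False emits the balanced form that would show up as an RT diff.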

  • Karate Coyote unbalanced/incorrect closing-br-tag diffs.
  • 2002 Australian Formula 3 season -- uses <hiddentext> but with an incorrect closing tag <\hiddentext> instead of </hiddentext> -- whether because of this or for some other reason, this extension text gets fostered out of the table header and introduces an RT diff.
  • Mallor -- several unclosed <li> tags.
  • SC Canada Opinions -- lots of unbalanced <small> tags in table cells (opened but not closed). This diff is present on other pages as well, e.g. Nova Scotia Lt. Govs. list. Both are Canadian government list pages (clearly the same editor is involved in both :-)). These two pages have the most semantic diffs in our RT testing.
  • Pitt Panthers football -- lots of unclosed <font> tags generate RT diffs since Parsoid closes unclosed tags.
  • Indonesia's Got Talent -- lots of unclosed <center> tags generate RT diffs since Parsoid closes unclosed tags.

Wikitext "syntax errors"

  • List of scooter manufacturers -- in the table listing the scooter manufacturers, table cell content is added after table row wikitext (|-) instead of table data wikitext (|). The wikitext in question is this: |-[[Gorilla Motor Works]] || China ||. The PHP parser ignores this content (see http://en.wikipedia.org/wiki/List_of_scooter_manufacturers). Parsoid moves the content to the beginning of the table rather than ignoring it, so when roundtripped, this content shows up in the wrong place. A possible Parsoid fix would be to keep fostering content that shows up in a table row rather than swallowing it (having it fostered lets the editor recognize the error in the output and fix it), introduce a placeholder tag so the fostered content RTs in its original place, and add a marker on the fostered content so it is suppressed during RT-ing. Or maybe just not bother with these kinds of errors.
  • Food Science Australia -- same kind of bug as on the page above.
  • Chicago's Northwest Side -- bare lists in tables get fostered out of the table (same in php) which doesn't RT back to original wikitext.
  • List of Philippine radio stations by province (AM) -- bad <th> code
  • Northwest Side Chicago's Northwest Side -- lists inside tables get fostered out of the table and won't RT as expected.

Wikitext escaping diffs

Nothing here at this time.

Fixed

Math-related pages use a lot of braces. Because brace pairs {{ or }} are unconditionally nowiki-escaped right now, several of these pages have a lot of semantic/syntactic diffs.

The wikitext escaping code is due for an overhaul -- there are several cases of escaping. For example, right now, the escaper always escapes content in "[" and "]". The specifics of how that content is escaped vary (sometimes everything is escaped; other times, only the right bracket is escaped).

  • [1] -- Parserfunction-generated external link target in bracketed external link, closing bracket nowiki-escaped.
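To make the brace-escaping problem described above concrete, the following Python sketch contrasts unconditional escaping with one possible context check. Both functions are hypothetical and are not Parsoid's escaper. The check in escape_if_template_like rests on the assumption that a "{{" only needs escaping when a "}}" follows it, and it ignores the surrounding context (for example, text that sits inside a template argument).

```python
import re

BRACE_PAIR_RE = re.compile(r"\{\{|\}\}")


def escape_unconditional(text):
    """Wrap every '{{' or '}}' in <nowiki>, as the text above describes."""
    return BRACE_PAIR_RE.sub(
        lambda m: f"<nowiki>{m.group(0)}</nowiki>", text
    )


def escape_if_template_like(text):
    """Hypothetical alternative: escape only if '{{' is later followed by '}}'."""
    start = text.find("{{")
    if start != -1 and text.find("}}", start + 2) != -1:
        return escape_unconditional(text)
    return text
```

Under the unconditional rule, text such as "a }} b" picks up nowiki wrappers and produces a diff, whereas the conditional variant leaves it untouched and only escapes text that contains an opening pair followed by a closing pair.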

Other RT diffs

  • JRuby
  • Stomach cutting notable template rt diff
  • WP page -- lots of missing newlines among other whitespace and nowiki diffs.
  • Difference in error recovery between Parsoid and PHP parser when extension and template tags nest improperly. This affects parse output and RTing on this page Milky way galaxy. Simplified test case:
    {{echo|blah <ref name=foo>blah {{echo|blah}} blah}}</ref> blah}}
  • Use of {{!}} in non-table contexts (Example snippet from en:Death of the soviet union)
    [[Image:Edgar Savisaar 2005.jpg{{!}}100px|left|thumb|this is a caption]]

Performance issues