Talk:Parsing/Replacing Tidy

ParserMigration no longer deployed to Wikimedia Foundation wikis

Jdforrester (WMF) (talkcontribs)

Just FYI, for those that are interested and probably watch this page.

IKhitron (talkcontribs)

A pity.

Dipsacus fullonum (talkcontribs)

Hi,

At dawiki we found that some pages have changed appearance because Tidy apparently used to exchange div and span HTML elements so that a div wasn't placed inside a span. "<span><div>Text here</div></span>" was changed to "<div><span>Text here</span></div>". Would it be possible for Special:LintErrors to find all such cases?

SSastry (WMF) (talkcontribs)

For God's sake, Tidy!! :-(

Yes, we can find them. But, it will take a few days to get it deployed. Do you have a sense of how many pages are affected? Is it from a template? If it is coming from a template, perhaps you can fix those right away and see what happens?

197.218.81.173 (talkcontribs)

Special:LintErrors/html5-misnesting does find it, although it seems to ignore certain cases like the one you noted. Perhaps that case was ignored because it didn't affect how the final page looked. Other cases, like the one below, are detected properly.

<span>bb<div>Text here</div></span>
SSastry (WMF) (talkcontribs)

Yes, Parsoid doesn't modify <span><div>foo</div></span> because paragraph tags aren't added around it. So, it doesn't trigger a html5-misnesting error.

But, <span>x<div>foo</div></span> has p-tags added because of the text node in the span tag. That is then broken up by the HTML5 parser and triggers a html5-misnesting error.
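The distinction described above (a div directly inside a span versus a div that follows text inside a span) can be approximated offline with a rough sketch like the one below. It is not how Parsoid or the Linter extension detects html5-misnesting; the class and function names are made up for illustration, and it only looks at literal span and div tags in a string.

```python
from html.parser import HTMLParser


class SpanDivFinder(HTMLParser):
    """Record every <div> opened while a <span> is still open."""

    def __init__(self):
        super().__init__()
        # One entry per open <span>: True once it has seen non-whitespace text.
        self.open_spans = []
        # (line, column, span_had_text_before_div) for each hit.
        self.hits = []

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.open_spans.append(False)
        elif tag == "div" and self.open_spans:
            line, col = self.getpos()
            self.hits.append((line, col, self.open_spans[-1]))

    def handle_endtag(self, tag):
        if tag == "span" and self.open_spans:
            self.open_spans.pop()

    def handle_data(self, data):
        # Text directly inside a span is what gets wrapped in p-tags.
        if self.open_spans and data.strip():
            self.open_spans[-1] = True


def find_div_in_span(wikitext):
    """Return the list of hits for a piece of markup (hypothetical helper)."""
    finder = SpanDivFinder()
    finder.feed(wikitext)
    finder.close()
    return finder.hits
```

A hit whose last field is False corresponds to the <span><div>foo</div></span> shape, which is left alone; a hit whose last field is True corresponds to the <span>x<div>foo</div></span> shape, which is the one reported as misnested.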

Dipsacus fullonum (talkcontribs)

Hi, all known occurrences at dawiki were from one template which contained '<span style="style 1">{{{Text}}}</span>'.

That template was then used in other templates like '{{foo|Text=<div style="style 2">bar</div>}}'.

So when the span and div tags were exchanged by Tidy, it changed the order of the style attributes, and thus the appearance. I guess it may have affected around 30,000 pages at dawiki, mostly user talk pages.

The template with the span tag is already changed. But there may be unknown cases and similar cases at other wikis.

197.218.91.75 (talkcontribs)

This seems to occur in several cases beyond divs and spans. It might be best to develop a general solution, but whitelist such reports in tags only when someone reports occurrences. That way it becomes a simple configuration change rather than coding.

See :

If one adds something like style="background-color:green" to the parent tag, it does show rendering differences. It might be good to go over https://phabricator.wikimedia.org/tag/tidy/ and close all tasks that became obsolete, or decline them, as Tidy is disabled.

The labels of categories in the linter could probably be changed too; eventually the concept of Tidy will become irrelevant. Maybe it should instead focus on invalid HTML vs. wrong / undesirable parser output.

SSastry (WMF) (talkcontribs)

This linter category is now live, but it probably has a number of false positives, since rendering is affected in only a small number of cases. We'll take a look at that. It shows up in the Miscellaneous-Tidy-Replacement-Issues category.

2A01:E0A:30:2B70:0:0:D724:C9A4 (talkcontribs)

A lot of templates do not test for the presence or absence of a leading colon in page names, but simply always prepend a ":" to make sure they get a link, and do not render a file as an image (or audio/video player), categorize a page, or add interwikis. Extra leading colons are harmless, but now the lint checker complains about them, and I don't know why this needs fixing, given that page names can never start with a colon.

Fixing that in templates is sometimes very complex, as it forces testing the values to check whether they start with a colon, and generating the colon conditionally (and this test increases the expansion node count, so it will break several pages, and it will also increase the expansion depth by 1)...

Can't we fix that simply in the new parser instead of asking people to fix pages and templates ? -- Verdy p (talk) 01:21, 22 June 2018 (UTC)

JJMC89 (talkcontribs)

[[::Test]] gives [[::Test]] with either parser.

Verdy p (talkcontribs)

These were harmless, and this is a radical change... All of these were equivalent, each generating a wikilink to the same page:

  • [[::Test]], [[:Test]], [[Test]]
    (gives now: [[::Test]], Test, Test)
  • [[:::fr:Test]], [[::fr:Test]], [[:fr:Test]]
    (gives now: [[:::fr:Test]], [[::fr:Test]], fr:Test)
    (but not [[fr:Test]] which generates an interwiki metadata and not a link to a resolved wiki)
  • [[::Category:Test]], [[:Category:Test]]
    (gives now: [[::Category:Test]], Category:Test)
    (but not [[Category:Test]] which sets a categorization metadata)

In images, we can use |link= with a value that is either a wikilink or a URL (starting with "http:", "https:" or "//"); the colon may also be used to force a pagename wikilink instead of a URL starting with "http". Testing parameter values to know whether to generate a colon is complex, or requires adding helper templates that decide when to generate a colon according to specific rules.

I do not see the interest of displaying verbatim "[]" pairs enclosing multiple leading colons except for [[:]] because there's no trailing pagename after the leading colons and it's impossible to generate a link from that.

If an article has to refer to a title starting with ":" we need to change the pagename: do you want to allow pagenames starting with colons, or just a page name with the title ":" (that pagename would be quite difficult to use as the target of wikilinks or URLs)?

Or do you plan to use "::" for adding new syntactic features to MediaWiki (disambiguating pagenames more easily from other interwiki, namespace or special prefixes)?


Note that when we use multiple leading colons in the visual editor (and in this talk thread, which uses "Flow"), the whole link with its brackets is now surrounded by "nowiki" tags (added silently). This does not happen when using the wikitext editor. I think this silent (and unexpected) addition of "nowiki" is in fact a nuisance (pollution, really). It unnecessarily obscures the code (and also caused edit bugs in this message when I added the "(gives now: ...)" lines above, where all subsequent tags and wiki markup were corrupted).

Arlolra (talkcontribs)

I believe [[:::fr:Test]] would always have been rendered as plaintext; it's only 1 or 2 leading colons that would have worked. However, 2 colons would have resulted in a leading colon being part of the link text, which differs from the single colon escaping.

Verdy p (talkcontribs)

There's absolutely NO possibility for a link to start with a colon, as it is invalid in every page name (on all wikis, not just those from Wikimedia).

And there's absolutely no point at all in changing that to plain text with visible brackets surrounding the text with the multiple leading colons, and no point in enforcing it (badly) by silently inserting nowiki tags, which also obscures the wikitext and makes it even less editable.

Why do you want to see "[[::" and "]]" as plain-text ? If one really wants to see that, the "nowiki" tags can be added manually to escape them **only** where this plain-text is expected (extremely rare case in fact, compared to the very common cases where extra leading colons may be inserted by templates using optional parameters which may be empty for an optional namespace indicator or interwiki prefix).

Leading colons are used explicitly to force the interpretation as a wikilink and not as a rendered image or categorization metadata, or to make distinction between template names in the template namespace (no leading colon) or a transcluded page from another specified namespace or the root namespace.

This new requirement removes that useful distinction, breaks many pages, and greatly complicates the development of templates by forcing them to inspect the values of substituted parameters (we now need to add various tricky "#if" tests, the expansion time or nesting level is increased, and the number of expanded nodes increases because parameters are forced to be expanded multiple times... in summary this adds load on the server, or makes some pages impossible to render correctly due to resource exhaustion caused by these extra tests).

So in summary I do not like this new requirement, which just breaks things and makes them more complicated (and does not even help other parsers to disambiguate things). For me, any wikitext sequence matched by this regexp (except those found in "nowiki" sections, or in HTML comments, which are normally stripped in an early first stage of the parser, before "includeonly", "noinclude" and "onlyinclude" sections are handled in the second stage):

\[\[[ \t]*:*[ \t]*([^:\|\[\]][^\|\]]*)[ \t]*(\|[^\]]+)?\]\]

is a wikilink (or interwiki link). Its target is the page captured by the first group, rendered as an HTML link with inline text content if there are one or more leading colons (after stripping ignorable whitespace, indicated only as [ \t]* in this simplified regexp, although there are other ignorable whitespace characters). The content of the second group is the inline content to render inside that HTML link, independently of the number of leading colons (the :* part).

Note that the content matched by the first group may include transclusions of templates (or expansions of magic keywords) surrounded by {{...}}; their expansion could return leading colons, which should be discarded silently as well if they are in excess and there is already at least one colon before them.

Only when there is no leading colon at all (either in the :* part or in the expansion of templates in the first group) may the target be interpreted as inline file rendering (image thumbnails, or audio/video player objects) or as a categorization (when the content of the first group starts with the special namespace names for files or categories); otherwise it also generates an HTML link with the inline content of the second group displayed.
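For readers who want to try the pattern, here is a minimal Python sketch of the proposal above. The regexp is copied unchanged from the post; the function name parse_link and the choice to fall back to the target when no label is given are assumptions made for illustration, and none of this reflects what the MediaWiki parser or Parsoid actually does.

```python
import re

# The simplified pattern from the post, unchanged. Group 1 is the link
# target, group 2 the optional "|label" part.
LINK_RE = re.compile(
    r"\[\[[ \t]*:*[ \t]*([^:\|\[\]][^\|\]]*)[ \t]*(\|[^\]]+)?\]\]"
)


def parse_link(wikitext):
    """Return (target, label) under the proposed rule, or None if no match.

    Hypothetical helper: the label defaults to the target when the link
    has no "|label" part.
    """
    match = LINK_RE.fullmatch(wikitext)
    if match is None:
        return None
    # The greedy target group can swallow trailing spaces, so trim them.
    target = match.group(1).strip()
    label = match.group(2)[1:] if match.group(2) else target
    return target, label


for sample in ["[[::Test]]", "[[:::fr:Test]]", "[[:Category:Test|cats]]", "[[:]]"]:
    print(sample, "->", parse_link(sample))
```

Note that the pattern has no capturing group around the colons, so it cannot by itself tell a link with zero leading colons from one with several; the file and category cases described above would need a separate check.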

Arlolra (talkcontribs)

To be clear, before any changes were made,

[[:{3,} ... ]]

already rendered as plaintext. It was only,

[[:{1,2} ... ]]

that gave the desired wikilink escaping.

The change was made because the 2 was likely the arbitrary result of some refactoring and not an explicit goal. There was no comment in the parser saying why it existed. And the functionality of being able to escape a wikilink did not depend on it.

When it came time to write another wikitext parser, it was a surprising find.

The point of the linting pass was to try and determine the extent to which it was relied upon.

As we saw in the ambassadors thread, it did result in some template authors having to use cumbersome workarounds rather than specifying that page titles passed to the templates shouldn't be manually colon escaped to begin with.

At the time I said I'd be willing to revert the change if it proved too bothersome but that was a year ago and there hasn't been much noise about it.

The proposal you're making here,

[[:+ ... ]]

is obviously more lenient and seems fine, but let's not pretend like that was ever the case.
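The three rules discussed in this exchange differ only in how many leading colons still count as an escape, which the following sketch tries to make concrete. The rule names and the helper escapes_link are invented for illustration, and the "current" rule (exactly one colon) is inferred from the examples earlier in the thread rather than taken from parser source.

```python
import re

# Leading-colon counts that escape a wikilink under each rule in the thread.
# "current" (exactly one colon) is inferred from the examples above, where
# [[::Test]] now renders as plain text; it is not quoted from parser source.
RULES = {
    "legacy": range(1, 3),    # [[:{1,2} ... ]]
    "current": range(1, 2),   # [[: ... ]]
    "proposed": None,         # [[:+ ... ]], any count of one or more
}

LEADING_COLONS = re.compile(r"\[\[(:*)")


def escapes_link(wikitext, rule):
    """True if the leading colons in `wikitext` escape the link under `rule`."""
    match = LEADING_COLONS.match(wikitext)
    if match is None:
        return False
    count = len(match.group(1))
    allowed = RULES[rule]
    if allowed is None:
        return count >= 1
    return count in allowed


for text in ["[[Test]]", "[[:Test]]", "[[::Test]]", "[[:::fr:Test]]"]:
    print(text, {rule: escapes_link(text, rule) for rule in RULES})
```

The sketch only counts colons; it does not model the detail mentioned earlier that, under the legacy behaviour, a second colon ended up as part of the link text.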

Reply to "extra leading colons"
Summary by SSastry (WMF)

We didn't enable it, but we haven't used it for this testing after those early tests. So, not relevant anymore.

Sunpriat (talkcontribs)

Could you enable on the wikitextexp.wmflabs.org :

This post was hidden by TerraCodes (history)
SSastry (WMF) (talkcontribs)

Regarding wikidata, since we are comparing output between identical configurations, the error itself shouldn't matter unless we expect Tidy to render the extension output differently. As for the others, I'll take a look before we do another round of testing in the coming week or so.

Elitre (WMF) (talkcontribs)

Sorry if this means you get irrelevant pings, and thanks for your understanding!

      • @Jonesey95, Samuele2002:. Please see https://phabricator.wikimedia.org/T161341 for templates that would need fixing. The linter category mentioned in that ticket should start populating later this week. SSastry (WMF) (talk) 18:09, 12 April 2017 (UTC)
        • Unable to find a link to or mention of any such category. As with most phab tickets, the insider lingo is impenetrable. Jonesey95 (talk) 04:58, 13 April 2017 (UTC)
        • @Jonesey95: The link will show up later today. But is the explanation here any better? Based on your feedback, we'll try to improve the explanation. See this example on testwiki (and examine the source). You will see that there is no red border around the table. SSastry (WMF) (talk) 12:40, 13 April 2017 (UTC)
          • Jonesey95, did that help? Elitre (WMF) (talk) 15:30, 27 April 2017 (UTC)
          • Also, we had to disable Linter on large wikis temporarily because of performance issues. It will come back in 2 weeks. Also, the Linter help pages have more information now. SSastry (WMF) (talk) 15:33, 27 April 2017 (UTC)
            • Thanks for the new Linter pages. They are as well written as documentation pages tend to get around here, and they should serve as good technical conversation-starters in en.WP gnome hangouts. I am looking forward to the arrival of these features on en.WP. As I mentioned on a page recently (I can't recall where), I believe that https://phabricator.wikimedia.org/T157670 is still blocking progress on creating full lists of pages with these errors. Pages that haven't been null-edited in a while may not show up on the lists. Jonesey95 (talk) 18:08, 27 April 2017 (UTC)

Listing what existing tools can do to help ?

NicoV (talkcontribs)

I was wondering if we should start listing what existing tools can do to help fixing the wikitext that needs to be fixed.

I have developed a few things in WPCleaner to help detect and fix some errors (like self-closing tags): should I list them on the page?

SSastry (WMF) (talkcontribs)

Yes, that will be very helpful. Thanks!

NicoV (talkcontribs)

Ok, I have written a first short description. Is it ok ? Feel free to expand and modify...

SSastry (WMF) (talkcontribs)

Thanks! Good for now. We might reorganize / rearrange to highlight these fixup options as we get further along.

SSastry (WMF) (talkcontribs)

Feel free to reorganize it in a way that seems most useful. :-)

NicoV (talkcontribs)

Yes, same for me : if it's better in the FAQ, feel free to move the information there. I will add more later

Categories need to be fully populated before Tidy can be discontinued

Summary by SSastry (WMF)

This has been done.

Jonesey95 (talkcontribs)

I know I keep harping on this, but now that Tech News has announced that Tidy will be going away in 2017, the "Pages using invalid self-closed HTML tags" categories really do need to be fully populated on all wikis. The category was added to MediaWiki in July 2016, and it has still not fully populated. For example, on te.wikisource.org, there is only one page, "మూస:Transform-rotate", in the error category at this writing, but that page doesn't even have a problem – the problem is in a page that is transcluded on that page. That transcluded page is not yet in the error category (after seven months). This is task T132467.

How can we get an accurate list of all pages that need to be fixed?

See also task T106685 (insource searches don't work right), which is another bug that makes it more difficult to migrate away from Tidy. Let me know how I can help.

IKhitron (talkcontribs)

You can do the same as I did: take a month (for enwiki it will be a decade), and run a null-edit bot on all pages.

Jonesey95 (talkcontribs)

Get a null-edit bot approved and running on 868 different wikis? That's outside of my scope.

If WMF wants to retire Tidy, WMF needs to null-edit all pages on all wikis or otherwise fix the conditions that prevent the categories from being fully populated. If that does not happen, it seems to me that Tidy's retirement will not be able to occur.

IKhitron (talkcontribs)

Yap. And 902 wikis.

SSastry (WMF) (talkcontribs)

@Jonesey95, we discussed this recently and @Legoktm updated T132467#3004685. We'll track this. But, note that we have backward compatibility fixes in the parser for self-closed tags, so, it is not catastrophic to not be able to fix all those self-closing tags before Tidy is removed. We'll eventually remove the compatibility fix once we are sure it is tackled. Meanwhile, we are also looking at how to ensure that pages are refreshed in a fixed time frame (as discussed in that phab task link above).

Jonesey95 (talkcontribs)

The dashboard report shows zero pages in en.WP, but there are 23 pages with errors. They are in a subcategory. The report may need to traverse subcategories.

Legoktm (talkcontribs)

I intentionally didn't traverse subcategories, because if "Category:Foo" has an improperly closed tag in it, then it will appear as a subcategory of the main one, but the articles inside that subcategory are totally fine.

IKhitron (talkcontribs)

Can you parse the system message with category names, retrieving only relevant subcategories, Legoktm?

Estimated counts, rather than actual ones.

Elitre (WMF) (talkcontribs)

Per Subbu: "Because of some performance-related fixes, Linter is now reporting estimated error counts instead of actual counts. Therefore, wikis might notice an artificial increase in error counts. Once we figure out a better solution to the performance problem, this will be fixed. [...] this only seems to affect wikis that have linter categories with large error counts." Keep an eye on https://phabricator.wikimedia.org/T184280: we appreciate your understanding.

Elitre (WMF) (talkcontribs)

You mean the fact that it reaches a specific section? I have noticed you love anchors, yes :p

PerfektesChaos (talkcontribs)

Yeah. But only before midnight.
