Topic on Talk:Parsing/Replacing Tidy

Cleanup and designing abuse filters to capture (and prevent?) new additions

Billinghurst (talkcontribs)

While we have clean-up processes identified, has there been any consideration of how to prevent the breaking tags from being added in the future? Similarly, for the required cleanup, it would seem that, as usual, the biggest of the big wikis will be able to manage and monitor their cleanup, while the smaller wikis with less technical knowledge are going to need assistance and guidance.

It would be worthwhile

  1. having a simple list of each of the "bad" tags
  2. publishing usable regexp searches, and some of the regexp cleanup scripts, so that the wikis can self-serve on the cleanup
  3. designing abuse filter(s) that can capture the new addition of such tags, this can be loaded as a meta global filter to protect small/medium wikis and shareable to the big wikis
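
As a rough illustration of the second point, a self-serve regexp search and cleanup could look like the sketch below. The tag list is an illustrative assumption rather than an agreed list of "bad" tags, and the function names are hypothetical.

```python
import re

# Assumption: an illustrative, incomplete set of non-void HTML elements that
# editors sometimes self-close (for example <div/>). Adjust per wiki.
NON_VOID_TAGS = ("div", "span", "p", "b", "i", "small", "center", "font")

# Matches e.g. <div/>, <span />, <font color="red"/>, case-insensitively.
# The \b keeps "b" from matching inside "br".
SELF_CLOSED_RE = re.compile(
    r"<(" + "|".join(NON_VOID_TAGS) + r")\b[^<>]*/>",
    re.IGNORECASE,
)


def find_self_closed_tags(wikitext: str) -> list[str]:
    """Return every self-closed non-void tag found in the wikitext."""
    return [m.group(0) for m in SELF_CLOSED_RE.finditer(wikitext)]


def fix_self_closed_tags(wikitext: str) -> str:
    """Rewrite <tag .../> as <tag ...></tag>. Naive: review before saving."""

    def repl(match: re.Match) -> str:
        name = match.group(1)
        # Drop the trailing "/>" and any whitespace before it.
        opening = match.group(0)[:-2].rstrip() + ">"
        return f"{opening}</{name}>"

    return SELF_CLOSED_RE.sub(repl, wikitext)
```

For example, `fix_self_closed_tags('<span class="x" />')` yields `<span class="x"></span>`, while `<br />` is left alone because `br` is a void element.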

Whatamidoing (WMF) (talkcontribs)

I've seen a couple of requests for regexp searches, and I believe that easy-to-use scripts would be welcome.

Do you think that you could add some suggestions for insource: searches to this page, especially for anything that might not be easily fixed via bot?

Jonesey95 (talkcontribs)

I added a long list of varieties copied from the script that I have been using to fix errors on en.WP.

Jdforrester (WMF) (talkcontribs)

There's a danger of over-engineering this.

How does the current system stop people from breaking the page? It does it via humans — if you edit the page and it's then broken, you go back and fix it. If you currently write <di v> and find that it doesn't work, you go remove the space. The communities of each wiki don't seem to feel the need to add custom abuse filters for that kind of problem now — at least, I'm not aware of such.

With a good set of updates to documentation, Tech/News notes, and the like, a long enough cut-over process, and judicious use of bots to find and fix particularly common issues, we might not necessarily need it for this change.

Jonesey95 (talkcontribs)

I think steps should go in this order:

  • create Wikimedia-level maintenance categories for the relevant errors,
  • find a way to get the categories to populate quickly (the self-closed tag category is still filling, months after its creation),
  • have bots and editors fix all of the existing problems,
  • watch the error categories to see how new errors are being added,
  • fix any tools and scripts that are programmed to add new broken code,
  • evaluate the next options based on how bad the problem is (bot-delivered warnings to editors about errors they have made, continuing to have bots follow editors and fix problems, implementing edit filters for things that will truly break pages).

Whatamidoing (WMF) (talkcontribs)

Maybe it would be better to find a system that didn't rely upon maintenance categories.

Jonesey95 (talkcontribs)

Well, Tidy is a system that doesn't rely on maintenance categories, but it is apparently going away, and it's a workaround, for sure.

CheckWiki is a system that doesn't rely on maintenance categories, but it amounts to the same thing: the database is scanned for errors, and then lists of pages with errors are compiled.

Big red JavaScript error messages don't rely on error categories, but they do not provide a way for gnomes to locate articles with errors so that they can be fixed, and they are likely to confuse regular editors.

It may be that maintenance categories, like democracy, are the worst form of error correction except for all those other forms that have been tried. (with apologies to Sir Winston Churchill)

Whatamidoing (WMF) (talkcontribs)

The CheckWiki lists are available to gnomes. A ready-made list of regexp searches, or AWB-like scripts, would also be available to gnomes. So that's two options.

SSastry (WMF) (talkcontribs)

On the Parsoid end, we are finishing up the first prototype of a linting tool (finally, after the first half was done a couple of years back). This tool will allow errors like these to be tagged and added to the database with precise source locations, which existing tools could leverage. For details, see https://phabricator.wikimedia.org/T48705. Hopefully, this can be an alternative to maintenance categories in the longer term.
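
To make "tagged with precise source locations" concrete, here is a minimal sketch of what such a record could carry. It is not the Parsoid linter's actual data model: the `LintRecord` fields, the category label, and the two-tag regex are all assumptions chosen for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class LintRecord:
    """Hypothetical lint record; field names are illustrative only."""
    category: str  # assumed label for the kind of error
    start: int     # offset of the first character of the match
    end: int       # offset one past the last character of the match
    snippet: str   # the offending source text


# Assumption: only two example tags are checked here.
SELF_CLOSED_RE = re.compile(r"<(?:div|span)\b[^<>]*/>", re.IGNORECASE)


def lint(wikitext: str) -> list[LintRecord]:
    """Return one record per self-closed tag, with character offsets."""
    return [
        LintRecord("self-closed-tag", m.start(), m.end(), m.group(0))
        for m in SELF_CLOSED_RE.finditer(wikitext)
    ]
```

With offsets stored alongside each error, a cleanup tool can slice `wikitext[record.start:record.end]` to get exactly the text to fix, instead of re-scanning the page.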

Amire80 (talkcontribs)

Reviving this old thread...

Apparently, the Italian Wikipedia has AbuseFilter 423, which totally prevents publishing if any tags that are not supported by Remex are used.

I'm concerned about this.

Doing this with an AbuseFilter feels wrong in general. AbuseFilters are local to every wiki. The Tidy migration project, on the other hand, uses the same technology for all the wikis. Hence, the blocking or the warnings about deprecated tags should be uniform as well.

If anybody really wants AbuseFilters, I'd recommend defining them not as blocking, but as warnings and tags, at least for now, while the dust hasn't completely settled on the transition. Such edits are usually made by well-meaning people who don't necessarily understand error messages about self-closing tags. People usually don't insert such tags intentionally or maliciously, but because they use some template, or because they are copying from some other place. It's better to track the problematic edits as they happen and explore the reasons for them than to block people from editing with cryptic error messages.
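
The warn-and-tag approach can be sketched as follows. This is plain Python rather than AbuseFilter rule syntax, and the action names, the tag name, and the two-tag regex are assumptions made for the example.

```python
import re
from collections import Counter

# Assumption: only two example tags are checked here.
BAD_TAG_RE = re.compile(r"<(?:div|span)\b[^<>]*/>", re.IGNORECASE)


def newly_added_bad_tags(old_text: str, new_text: str) -> list[str]:
    """Return bad tags that occur more often after the edit than before it."""
    old_counts = Counter(BAD_TAG_RE.findall(old_text))
    new_counts = Counter(BAD_TAG_RE.findall(new_text))
    # Counter subtraction keeps only positive differences.
    return sorted((new_counts - old_counts).elements())


def check_edit(old_text: str, new_text: str, blocking: bool = False) -> dict:
    """Decide what to do with an edit; the default is warn and tag, not block."""
    added = newly_added_bad_tags(old_text, new_text)
    if not added:
        return {"action": "allow", "tags": [], "added": []}
    action = "disallow" if blocking else "warn"
    # Hypothetical change tag so the edits can be tracked and examined later.
    return {"action": action, "tags": ["self-closed-tag-added"], "added": added}
```

Because only the difference between the old and new text is checked, an editor who fixes a typo on a page that already contains `<div/>` is not warned about markup they did not add.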

Elitre (WMF) (talkcontribs)

Do I understand correctly, 0 hits in 9 511 edits?

Amire80 (talkcontribs)

Yeah, it looks a bit strange to me as well, but it did happen at least once, otherwise I wouldn't even know about it. See Topic:U64esu5s3dmxq9jn.

Elitre (WMF) (talkcontribs)

It now says 2 out of 3 095. I am not entirely sure whether the figures are right, but it seems to be triggered so rarely that the community's choice to block the action rather than just warn isn't the end of the world in this case, at least until a proper solution is developed by Subbu's team. I mean, we can still discuss it with them if you think that's necessary.

Amire80 (talkcontribs)

Yeah, I'd prefer a non-blocking filter. It makes it easier to actually examine and fix the issues. If it happens rarely, it's a good reason not to block.

Elitre (WMF) (talkcontribs)

I brought this up on the village pump subpage where the fixes are being discussed. HTH

Daimona Eaytoy (talkcontribs)

Since I'm the author of the filter, I gave an explanation on it.wiki, which I'm going to repeat here.

First, let's talk about numbers: the filter has been up since late September, and it has about 1000 hits right now. Of course that's not much, but we can't let it go, since hits are not that rare.

I completely agree that users hit it in good faith, though this doesn't mean much. In most cases, people add those tags when translating from another wiki where such tags are still used. We no longer have any obsolete tags on it.wiki, meaning they come either from user intention or from outside. So, a solution that really works would be to clean up those wikis as well, but that's a huge amount of work: it would be 16e+6 tags on en.wiki alone.

About users not understanding the error message: that's definitely true. However, it's also hard to write an error message that is both short and clear, even for those who don't know HTML at all. Since the filter became active, I have tried asking for feedback from people who hit it more than 5 times in a row, but I received almost none, which made it even more difficult to improve the message. Moreover, we should also consider the fact that, while using CX, people don't get the right warning; instead, they're given raw code, as pasted by Codas in the other topic, which is hard for almost anyone without good technical skills to understand.

Finally, that filter was actually a warning-only one for about two months. The reason I made it block the action is that almost no one fixed their edits before saving, which wasn't good either.

So, there's no problem in making it warning-only, especially if we're looking forward to a built-in solution, though there isn't a true way to solve this problem right now.

Amire80 (talkcontribs)

I wasn't aware of it until I started this discussion. If it becomes warning-only, I'll be happy to help investigate the issues, and get them properly reported or fixed. It's not a problem for me that it is in Italian. I have experience with doing something very similar around problematic markup in the Hebrew Wikipedia.

If you make it warning-only and reach out to me, I promise nice surprises.

Daimona Eaytoy (talkcontribs)

No problem in making it warning-only, I'll do it right now. Though, I'm not that confident that we can get to a proper solution without fixing the other main wikis. Anyway, let's see how it goes; there might still be good results.

SSastry (WMF) (talkcontribs)

There are two possible strategies here, one of which is applicable to the immediate task of Tidy replacement, and the other is more general and applicable to linting in general.

As far as Tidy replacement is concerned, once Remex replaces Tidy, the rendering breakage offers immediate clear feedback about broken markup - I don't think abuse filters or other such tooling is required. So, it doesn't make sense to deploy something that is only applicable for the interim period between now and when Remex is deployed.

However, more generally, it does make sense to close the linting loop to add a pre-save linting ability that lints the page and displays any issues on the page -- however, this needs some thought and design work to figure out what best makes sense. Do we display this to all users? Or to some subset of users (however that is determined) more likely to be able to deal with the notices / warnings? Do we lint only the edited portions of the page or the whole page? But, linting only edited portions of the page is difficult to get right because wikitext ... But, displaying lint errors on the entire page can also be overwhelming. But, yes, we definitely need to close the loop and add a pre-save linting feature. We are aware of this need, and it has been requested a few times now. We haven't been able to get around to it in the middle of everything else we are already working on. But, it will happen.
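
The "lint only the edited portions" option can be sketched with a line-based diff, as below. This is only an illustration under simplifying assumptions: it treats each line independently, uses a two-tag regex as a stand-in for a real linter, and the function names are hypothetical.

```python
import difflib
import re

# Assumption: a stand-in check for two example tags, not a real linter.
BAD_TAG_RE = re.compile(r"<(?:div|span)\b[^<>]*/>", re.IGNORECASE)


def changed_line_numbers(old_text: str, new_text: str) -> list[int]:
    """Return 1-based numbers of lines in new_text that were added or changed."""
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    changed: list[int] = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "insert"):
            changed.extend(range(j1 + 1, j2 + 1))
    return changed


def lint_edited_lines(old_text: str, new_text: str) -> list[tuple[int, str]]:
    """Report bad tags only on lines the edit touched."""
    new_lines = new_text.splitlines()
    return [
        (n, new_lines[n - 1])
        for n in changed_line_numbers(old_text, new_text)
        if BAD_TAG_RE.search(new_lines[n - 1])
    ]
```

A per-line check like this keeps the notice short, but it cannot see problems that span lines, such as a tag opened on an untouched line and left unclosed in the edited part, so it would likely miss some errors that a whole-page lint catches.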

Not sure if this addresses your question / concerns. Let me know.

Amire80 (talkcontribs)

That's OK, that's the line of thought I was pointing at.

Ideally, I'd love to see a live comment while editing, similar to what an editor sees in Visual Editor when writing wiki syntax such as [[.

If not, it can be a warning similar to what is shown when a user is trying to save without writing an edit summary.

Finding a list of pages that have lint errors is already possible using Special:LintErrors.
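
For scripts, the same list should be retrievable through the wiki's API. The sketch below only builds the request URL and sends nothing. It assumes the Linter extension's `list=linterrors` query module with `lntcategories` and `lntlimit` parameters, and a category named `self-closed-tag`; check these names against the target wiki's API help before relying on them.

```python
from urllib.parse import urlencode


def lint_errors_query_url(api_base: str, category: str, limit: int = 50) -> str:
    """Build a query URL for lint errors in one category (no request is sent)."""
    params = {
        "action": "query",
        "list": "linterrors",       # assumed module name from the Linter extension
        "lntcategories": category,  # assumed parameter name
        "lntlimit": limit,          # assumed parameter name
        "format": "json",
    }
    return f"{api_base}?{urlencode(params)}"


# Example: the wiki URL and category are placeholders for illustration.
url = lint_errors_query_url("https://it.wikipedia.org/w/api.php", "self-closed-tag", 10)
```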
