Talk:Parsoid/Language conversion/Preprocessor fixups

About this board

DePiep (talkcontribs)

I propose to wait a few weeks before making a new page listing. By then, the larger wikis already listed will have had a first cleanup, and a more complete (and simpler) script run would be helpful.

Cscott (talkcontribs)

@Amire80 asked for more complete results, so I've made one based on the 2017-05-01 dumps. There probably won't be another dump made until 2017-05-20 or so, at the earliest.

Elitre (WMF) (talkcontribs)

@Cscott, I think getting a new list (and eventually sending an update to affected communities) is the last step so I can declare https://phabricator.wikimedia.org/T165175 closed. LMK when you can do it. I think it's safe for you to skip Wikidata (dunno about the rest).

Whatamidoing (WMF) (talkcontribs)

I looked over the list, and most of the wikis have no problems or just a small number of pages. Are you all feeling ready to go? Or do you want to wait for the next dump (a few days from now?) and make a decision then?

DePiep (talkcontribs)

In short: 1. Don't renew the list; the 'done-exception' notes are helpful. 2. Once this goes live, pages will break only in minor ways, and we can continue to work through this list (if I understood correctly).

Making a new list from the next dump looks like a bad idea. It would remove all notes like "done, some issues", etc., so helpful status information would be lost (we'd have to revisit and check those wikis once more). Also, because the required edits leave some hits behind (false positives, no harm), we are not striving for a zero-pages list (in which case a new dump list would be helpful).

If "ready to go" means roll out the change, I'd have to leave that decision to others. If I understand the big issue correct, pages will be broken in details (e.g., text disappearing), but not fatally. I can not judge on the effects in sister wikis.

This post was hidden by Elitre (WMF) (history)
SSastry (WMF) (talkcontribs)

The current plan is to merge the patch on Monday and have it be deployed next week. In the meantime, we are considering adding a new Linter category so that we can more precisely identify the remaining instances that might need fixing. So, yes, we can skip doing another round of dump grepping unless found necessary for some unforeseen reason.

Whatamidoing (WMF) (talkcontribs)

Is this live now, or was it reverted?

SSastry (WMF) (talkcontribs)

This was live on group2 wikis briefly as part of 1.30.0-wmf.2 before it was reverted because of T166345. So, as of now, this is only live on group0 and group1 wikis.

Cscott (talkcontribs)

It has since gone live on all wikis. I'm generating another list from the 20170601 dump, though I think it will be more useful to wait for the 20170620 dump to complete.

IKhitron (talkcontribs)

Hello, User:Cscott. Thank you for your work. Is there a page I can watch to stay aware of new runs? I only found out about the June 1 run now, from your post above. Thanks.

Elitre (WMF) (talkcontribs)
IKhitron (talkcontribs)

Thank you.

Cscott (talkcontribs)

Yeah. I deliberately didn't make a lot of noise about the 20170601 run, because I'm about to replace it with results from the 20170620 dump which should be better, since they won't include as many pages which were already fixed up in the last big community cleanup push. But if you watch the Preprocessor_fixups page, you should get notified when that's done.

IKhitron (talkcontribs)

What is that page?

Elitre (WMF) (talkcontribs)

The one I linked above? :)

IKhitron (talkcontribs)

I thought that was the intention, but the June 1 run was published there yesterday, so I don't think Cscott is talking about that.

Elitre (WMF) (talkcontribs)

Whatever comes up next, you'll find it linked there. Promise!

IKhitron (talkcontribs)

👍

IKhitron (talkcontribs)

N3: It seems like too big a like. Something wrong with Flow?

Elitre (WMF) (talkcontribs)

Leave it! I like it.

IKhitron (talkcontribs)

Fine for here, but if it's a bug, it should be fixed for future usage.

Elitre (WMF) (talkcontribs)
IKhitron (talkcontribs)

??? OK...............

Elitre (WMF) (talkcontribs)

We REALLY LIKE things around here.

IKhitron (talkcontribs)

:-)

Cscott (talkcontribs)

New 2017-06-20 dump is up and linked from Preprocessor fixups. Enjoy!

I'm going to be looking at adding a parser warning or a linter rule to catch these in the future, hopefully as they occur, since I noticed a few cases of editors flailing around trying to figure out why their templates weren't working.

IKhitron (talkcontribs)

Most of the problems on our wiki came from the Tech News issues about this very problem :-). Surprisingly, about 99% of the occurrences on user pages subscribed to Tech News were not picked up here.

Elitre (WMF) (talkcontribs)

I'd appreciate it if you could clarify this comment. I don't understand what you mean.

IKhitron (talkcontribs)

There were two issues of Tech News that explained the language conversion problem, and they contain the forbidden string in the news message. So the new run found those pages in the Wikipedia namespace, but it did not find them in the User talk namespace, even though many users are subscribed to the bulletin.

Elitre (WMF) (talkcontribs)

So @Cscott, I know you're filtering out non-wikitext, but if the mention in Tech News shows up in one namespace, shouldn't it also show up in the user talk one? (TN can get delivered to subscribers' talk pages.) Or maybe we are assuming that it shows up in the Wikipedia namespace because of Tech News, when that's actually not the case.

Cscott (talkcontribs)

Can you give me two specific URLs (or article names), one where the mention is included in the results and one where it is not? That will let me diagnose the issue.

Cscott (talkcontribs)

I looked into this a bit. We use the -pages-articles.xml.bz2 dumps for everything except labswiki. These contain "articles, templates, media/file descriptions, and primary meta-pages". It appears they do *not* contain user talk pages. The Tech News subscription page appears to push to a list of "community pages", and then to the *talk* page for the pages listed under "Users". I found hits from the "community pages", but some of these have archiving or other features which complicate the issue. For example, Commons:User_scripts/tech_news is subscribed, but if you look at the history you see that User:ArchiverBot moves the published Tech News to its own archive after 7 days. So the actual hit listed in the 2017-06-20 dump is Commons:User_scripts/tech_news/Archives/2017/May, which is where Tech News 2017-19 ended up. It just so happens that in this case the archive page is still included in the -pages-articles dump. On other wikis, with other archivers or different distribution destinations, the ultimate location of Tech News 2017-19 might not be included in -pages-articles, and thus it wouldn't show up in the results.
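For reference, the kind of dump scan being discussed here is roughly the following. This is a minimal sketch only, not the actual script that produced the lists; the dump filename and function name are made up for illustration. It streams a -pages-articles.xml.bz2 file and lists the titles of wikitext pages containing the -{ sequence.

<pre>
# Minimal, illustrative sketch of a -pages-articles.xml.bz2 scan.
import bz2
import xml.etree.ElementTree as ET

DUMP = "enwiki-20170620-pages-articles.xml.bz2"  # hypothetical filename

def pages_with_language_converter_markup(path):
    title, model = None, None
    for _, elem in ET.iterparse(bz2.open(path, "rb")):
        tag = elem.tag.rsplit("}", 1)[-1]  # drop the export XML namespace
        if tag == "title":
            title = elem.text
        elif tag == "model":
            model = elem.text
        elif tag == "text":
            # Only wikitext pages can be affected; skip Lua modules, JSON, etc.
            if model == "wikitext" and elem.text and "-{" in elem.text:
                yield title
        elif tag == "page":
            elem.clear()  # keep memory use bounded while streaming

for t in pages_with_language_converter_markup(DUMP):
    print(t)
</pre>

Like the published lists, a scan of this kind still includes false positives (already-escaped or intentional uses of -{), so the results need manual review.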

Elitre (WMF) (talkcontribs)

I do not think we need to dig further. The mention in Tech News doesn't seem to break the page. In the cases I have seen, the syntax at fault was in users' signatures, and it wasn't breaking anything either. Wikis like the Dutch one didn't want fixes outside the article namespace, and didn't fix the other stuff.

IKhitron (talkcontribs)

I thought about this. But does the script know to avoid Tech News? And anyway, if not all the namespaces are covered, maybe you should consider changing that.

Elitre (WMF) (talkcontribs)

Why? Literally nobody AFAIK has complained about finding errors elsewhere and not being able to determine where they come from. The main namespaces are covered. We shouldn't request additional work when it is not evidently necessary - and there are still articles to fix, FWIW.

IKhitron (talkcontribs)

I see. Your choice.

Elitre (WMF) (talkcontribs)

I mean, in an ideal world, of course. But we are talking about a few dozen pages here, while we already need to focus on another project that requires intervention on many, many more pages - and this will be a theme for a while.

IKhitron (talkcontribs)

So, should you change the dump? And do you still need the links? Thank you.

Reply to "Next page listing"
SSastry (WMF) (talkcontribs)

I see Module namespace titles in some of the lists for fixups. Modules aren't parsed as wikitext, are they? I don't think they are affected.

DePiep (talkcontribs)

Rule: skip all module pages (Lua code). But: module:.../doc pages have plain text and may be edited.

Cscott (talkcontribs)

I'm skipping pages where the format is not wikitext now. That should handle modules, while still checking the module doc pages.

DePiep (talkcontribs)

I'm not sure if we should edit this example in mw:WfMsg statistics:

<code>|wfMessage( "protect-level-{$permission}" )</code>

There is no closing hyphen, so my bet would be to edit.

In general, mw:Writing systems/Syntax is the reference, right? (So the constructs described there should be left untouched; they are false positives here.) I hope I can recognise these.

Cscott (talkcontribs)

<pre> -{ ... } </pre> (and extension tags in general) is safe; you don't need to edit inside those. But I'm not sure about <code>: I think HTML tags are just copied literally into the output and wouldn't end the -{ region, so I think <code> would need to be edited.

Bdijkstra (talkcontribs)

So what about future cases of preprocessor breakage? When users type <code>-{</code> in template arguments or in a URL without intending to start a "language converter" construct, will they get an error message? If not, and if the result is not what the user expects, then how are they supposed to "debug" this? Bdijkstra (talk) 11:49, 9 May 2017 (UTC)

Elitre (WMF) (talkcontribs)

My 2 cents: code gets deprecated and breaks all the time. I expect that any documentation where that syntax is mentioned will clearly label it as invalid, suggest workarounds, etc. Tech-savvy people like template editors should also be aware, so they can tell others and fix future problems.

Cscott (talkcontribs)

If they use VE, it will get nowiki'ed automatically. For wikitext editors, the intent is to add a parser warning (PHP side) or an automatic linter rule (Parsoid side) to prevent regressions in the future. Unfortunately, there's a chicken-and-egg issue, in that I can't add the warning until the PHP/Parsoid code is merged, and we're currently trying to clean up as many cases as possible *before merge* so that things don't break too much.

Cscott (talkcontribs)

The Parsoid patch isn't quite merged yet ( https://gerrit.wikimedia.org/r/140235 ), but when it is, it will protect language converter markup used inadvertently in VE. I'm going to look at a linter warning and/or a parser warning to flag these for source editors.

Reply to "Future cases"
IKhitron (talkcontribs)

Hi. I tried to find the problems on hewiki. There are 336 possible cases of transclusion, fair enough: . But I can't check the URLs: CirrusSearch fails. If I exclude the File namespace, it's OK, 0 results: . But for files, there are too many image descriptions from Commons to scan. Is there a way to search only the local wiki? Thank you.

Amire80 (talkcontribs)

More simply, can anybody please run this for all projects, rather than only the wikis with >1,000,000 articles?

Cscott (talkcontribs)

@Amire80 Sure. There will be false positives in wikis with LanguageConverter turned on, of course. Give me a bit of time to download the latest dumps and re-run the grep.

Elitre (WMF) (talkcontribs)

Will this also help IKhitron though?

Amire80 (talkcontribs)

Yes, I think it will help, but @IKhitron can correct me if I'm wrong.

Wikipedia's search box is better than ever for finding information useful for readers, but I don't really expect it to be useful for finding precise strings of exotic wiki syntax, which is what's needed in this case. If @Cscott can run a comprehensive grep in all namespaces in all wikis, it will be exactly what we all need.

IKhitron (talkcontribs)

Of course, thanks a lot.

Elitre (WMF) (talkcontribs)

When you say all wikis, I think you mean Wikipedias - Cscott, were other sites checked/should they be? Meta, Commons, ...

IKhitron (talkcontribs)

Think about the possibility of not creating 902 subpages, but putting the results inside the wikis themselves, at a unique address, for example "(ns:4):preprocessor fixups-May 2017".

Amire80 (talkcontribs)
Elitre (WMF) (talkcontribs)

I don't know about that. It's certainly not common practice, and people come here to look for documentation and such. I also don't know how long it would take to get all of this done.

DePiep (talkcontribs)

I suggest listing all cases in all language Wikipedias and all sister projects, wikilinked on one or two pages (two = language wikis and sister projects split). Page names could be systematic and simplified, like :lang/sistercode:ns:pagename. Actual red-marking of the offending code, as was done last time, is not needed. All of this aims to keep the script run simple and to reduce post-processing. It would introduce false positives, but that is acceptable -- the alternative would be to exclude situations (e.g. in regex strings...). The task is then to manually/visually check whether each edit is needed (that is, AWB-style, not bot-style editing).

Cscott (talkcontribs)

I suspect we'll run up against the page size limit in some cases---certainly if we include the wikis where language converter is actually enabled, like zhwiki. I'm currently downloading the 20170501 dump for all 739 not-private not-closed wikis. I'll try to tweak the munge script so it dumps raw wikilinked titles onto a small number of pages... patches welcome of course.
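To make the munging step concrete, here is a minimal sketch of the idea under stated assumptions (the real munge script may differ; the hit format, interwiki prefixes, and size limit are all made up): take (wiki, title) hits, emit them as wikilinked list items, and start a new output page whenever the current one would exceed a size limit.

<pre>
# Illustrative sketch only; input format, prefixes, and limit are assumptions.
MAX_PAGE_BYTES = 1_000_000  # rough stand-in for the wiki page size limit

def munge(hits, prefix_for_wiki):
    """hits: iterable of (wiki, title) pairs; yields page-sized chunks of wikitext."""
    chunk, size = [], 0
    for wiki, title in hits:
        line = "* [[:%s:%s]]\n" % (prefix_for_wiki[wiki], title)
        if chunk and size + len(line.encode()) > MAX_PAGE_BYTES:
            yield "".join(chunk)
            chunk, size = [], 0
        chunk.append(line)
        size += len(line.encode())
    if chunk:
        yield "".join(chunk)

# Example with two made-up hits.
prefixes = {"dewiki": "de", "commonswiki": "commons"}
hits = [("dewiki", "Beispielseite"), ("commonswiki", "File:Example.jpg")]
for i, page in enumerate(munge(hits, prefixes), 1):
    print("== Output page %d ==" % i)
    print(page)
</pre>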

DePiep (talkcontribs)

Sounds good. Completeness across most wikis would be great, and I wanted to reduce the post-run processing for you. With this, the number of listing pages is not an issue.

Cscott (talkcontribs)
IKhitron (talkcontribs)

Thanks a lot!

Cscott (talkcontribs)
Cscott (talkcontribs)

All wikis except for wikidata are now complete. For some reason the JSON blobs in the wikidata dump seem to take much longer than any other wiki. I'm not entirely sure how/if to fix them, either -- it depends on whether the type of the wikidata item is wikitext, plaintext, or something else. There don't seem to be very many matches in wikidata in any case.

Elitre (WMF) (talkcontribs)
Elitre (WMF) (talkcontribs)
Elitre (WMF) (talkcontribs)
Thiemo Kreuz (WMDE) (talkcontribs)

There are 2 results in other namespaces. I have not looked at them in detail, but I think it's fine to let them break and fix them after the fact.

The 8000+ results in the main namespace are mostly because of labels of chemicals (obviously from a mass import, because they have sequential IDs). These labels are meant to be plain text, and must be wikitext-escaped when used in a wikitext context. The code we maintain does this. It might be that Lua modules and other code written by volunteers use these labels in the wrong way, assuming they don't need escaping. But this is not a new issue introduced by the planned change. Such code already does unexpected things whenever it encounters something that looks like wikitext syntax. If it did not break before on labels that contain [[, or {{, or ''', it won't break now with -{.

Elitre (WMF) (talkcontribs)

That's reassuring to hear. Danke.

Elitre (WMF) (talkcontribs)

So if I understand it correctly, the question now is: what if a template elsewhere starts embedding that content after the change? Should we nowiki everything just to stay safe?

Cscott (talkcontribs)

In general, properly escaping wikitext (other than wrapping <nowiki> around the whole thing) is quite tricky. So I'd hope that any existing code would in fact be just wrapping <nowiki> around everything, and thus wouldn't require any changes. But if someone was trying to be clever and (for example) only escape "special characters" like [[, then they might miss the newly-special -{ sequence. I thought it was worth bringing this to the attention of the wikidata team just in case they knew of any specific code which we could proactively patch.
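To make that failure mode concrete, here is an illustrative sketch (not Wikidata's or any wiki's actual escaping code; both function names are hypothetical). An escaper that only handles the previously special sequences would pass -{ through untouched; breaking the sequence up with <nowiki/> is one way to extend it.

<pre>
# Illustrative sketch only; escape_label_old / escape_label_new are hypothetical.
import re

def escape_label_old(text):
    # A naive pre-existing escaper: neutralises links, templates and bold only.
    return re.sub(r"(\[\[|\{\{|''')", lambda m: "<nowiki>%s</nowiki>" % m.group(1), text)

def escape_label_new(text):
    # Same idea, but also breaks up "-{" so it cannot open a language
    # converter region once the preprocessor change is deployed.
    return escape_label_old(text).replace("-{", "-<nowiki/>{")

label = "3-{[(1S,2R)-2,15-Dimethyl-...]sulfanyl}-propanoic acid"
print(escape_label_old(label))  # unchanged: the old escaper misses "-{"
print(escape_label_new(label))  # "3-<nowiki/>{[(1S,2R)-..." -- safe in wikitext
</pre>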

Elitre (WMF) (talkcontribs)

I'm reading the exchange between DePiep and Thiemo here, and (again, if I'm understanding correctly: I wasn't able to track down a chemistry template pulling data from Wikidata) I don't know if we should be worried, because I don't know whether templates embedding data from Wikidata actually nowiki anything.

DePiep (talkcontribs)

Please continue at: Parsoid/Language conversion/Preprocessor fixups/20170501#wikidatawiki

Example: such a template would be something like :en:Template:Infobox chemical

<nowiki>{{infobox |label1=IUPAC name |data1={{#property|P123456}} }}</nowiki>

With the property value being, as an example case, "3-<nowiki/>{[(1''S'',2''R'')-2,15-Dimethyl-5,14-dioxotetracyclo]sulfanyl}-propanoic acid" (that is: it contains the tricky code, which is really there unescaped; it is only escaped here for the obvious reason).

Note 1: Actually not often applied this way in enwiki chemical templates, because editors are worried about data quality & sourcing.

Note 2: The IUPAC name is not yet a Property in Wikidata. The example still stands.

Now, we know that the value is safe within Wikidata. But in this case it is read into an enwiki article for regular template parameter processing. This way that value, while safe in Wikidata, can create the error we are trying to prevent in enwiki.

IKhitron (talkcontribs)

Off-topic: can anyone finally explain to me what the code -{...}- is supposed to mean when it is not an error? Thank you.

SSastry (WMF) (talkcontribs)
IKhitron (talkcontribs)

Thank you.

Variant markup in URL fragment

Tgr (talkcontribs)

The example about fixing URLs is questionable: unlike other parts of the URL, fragments (the part after the #) do not necessarily use fragment encoding. Modern browsers understand it, older browsers don't, client-side applications won't unless the author has specifically considered that this might happen.

SSastry (WMF) (talkcontribs)
SSastry (WMF) (talkcontribs)

@Cscott and I are talking about this, and it looks related to the HTML5 ID discussion as well.

Cscott (talkcontribs)

We should pursue the same strategy we use for HTML5 IDs. I think the latest insight there is that percent-encoding works everywhere in browsers old and new. If that is the case, then this is just "broken client-side javascript applications don't always handle percent encoding as they ought". That's not our bug to fix. I will test to see if entity encoding will be a reasonable workaround in this case (it should) but if the standards-compliant thing is url encoding, then I don't know that we should confuse editors by mentioning this special case for buggy clients.

If it turns out the rules to properly encode URLs for backwards compatibility are complex, we could also implement this in the PHP Sanitizer, which is in charge of correctly encoding URLs for output to HTML. We already URL-decode article titles in wikitext, before re-encoding them for output in HTML. We could do the same for external links, which would allow hiding the complexities of fragments, etc (assuming there are complexities) from the author.
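For what it's worth, percent-encoding a fragment that contains variant markup looks like this (a quick sketch; the URL is made up for illustration). Both { and } become %7B / %7D, so the raw -{ sequence never reaches the wikitext.

<pre>
# Quick illustration of percent-encoding variant markup in a URL fragment.
from urllib.parse import quote

fragment = "-{zh-hans:例;zh-hant:例}-"
url = "https://zh.wikipedia.org/wiki/Example#" + quote(fragment, safe="")
print(url)
# https://zh.wikipedia.org/wiki/Example#-%7Bzh-hans%3A%E4%BE%8B%3Bzh-hant%3A%E4%BE%8B%7D-
</pre>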

Tgr (talkcontribs)

If you want to go by standards, none of the IETF URI, IETF IRI, and WHATWG URL specs allows a raw { or } in the fragment, so arguably any application relying on that is wrong. None of those standards says anything about how to encode disallowed characters, though. (This is understandable, as the semantics of fragments is left intentionally unspecified in these standards so that each MIME type standard can define its own semantics.)

As far as JavaScript applications go, Firefox will force percent-encoding and transparently convert the URL if needed (and unencode it visually), while other browsers just return the exact bytes in location.hash. So I guess anything that does not understand percent encoding (and outputs URLs with curly braces in them) is already broken in a major browser, and we can ignore it. I retract this thread :)

DePiep (talkcontribs)
SSastry (WMF) (talkcontribs)

Thanks for logging your work - this is useful documentation.

Cscott (talkcontribs)

Fantastic work! @DePiep, it seems like you slowed down after 15 Apr -- are you blocked on something, or did you just run out of time? We'd like to put an item about this in an upcoming Tech News, and then deploy the preprocessor change some time (2 weeks?) after that. Is there anything you specifically need help with, like admin help to fix templates on various wikis, etc.?

DePiep (talkcontribs)

Oops, that's soon. Unexpectedly busy in RL, a distracting issue.

What I need (to get) is AWB permission on each of these wikis. No template fixes are foreseen (the edits are pretty simple and straightforward). I will try to show some action here shortly.

SSastry (WMF) (talkcontribs)

@DePiep .. no worries. We weren't expecting you to fix all the problems. :) Mostly checking if you were blocked on us for any reason. We figured making a wider announcement will let us alert editors to potential breakage on wikis we haven't tracked yet.

DePiep (talkcontribs)

I definitely need a wikibreak for RL activity. Will be back on or after May 8.

Elitre (WMF) (talkcontribs)

I am seeing questions on that page. Were they addressed?

DePiep (talkcontribs)

If 'that page' is the ongoing edit log: no, they were not addressed. It's more of a building-wisdom approach.

Elitre (WMF) (talkcontribs)

I guess I was echoing Subbu's concern, please let us know if there's anything you need the team to say/do to make your work easier :)

Elitre (talkcontribs)

These fixes are ridiculously easy. I do have a tip though - I'm using Opera - at least on my Air, the search box will stay open and remember the query so I can locate the faulty syntax very easily, paste the nowiki tag in, then save.

Reply to "Edit log subpage"
DePiep (talkcontribs)

I am familiar with the enwiki template that uses <code>| IUPACName=</code>, which apparently causes a lot of the issues. Can I claim this fixup task, or can we find out whether someone else is working on it? -DePiep (talk) 09:02, 6 April 2017 (UTC)

Cscott (talkcontribs)

Go ahead! No one else has volunteered yet, it would be great to get that fixed up. Thanks!

SSastry (WMF) (talkcontribs)

@DePiep ... FYI in case you didn't notice this update.

DePiep (talkcontribs)

Done for enwiki. See my notes in the table; I expect some 75 pages were left untouched. Unfortunately, I have no permission for other wikis like de or fr (for now).

Whatamidoing (WMF) (talkcontribs)

What rights do you need, to be able to do this work?

DePiep (talkcontribs)

[[:en:WP:AWB|AWB]] is my tool of choice. I could do this cleanup on all listed wikis except zhwiki and mediawikiwiki (so 13 + enwiki remain). A this-task-only, April-limited AWB permission would be OK too.

I am not an admin anywhere. Permissions are requested manually; for dewiki, for example, I should ask through [[:de:Wikipedia:AutoWikiBrowser/CheckPage|this]] page.

I see that there are false positives in these lists, so especially in languages I don't know I'll have to stay on the safe side (when unclear, don't edit).

Whatamidoing (WMF) (talkcontribs)

IMO the global bot policy is in need of some updates to address this kind of situation. It appears that the policy wants AWB users to visit every single wiki separately, figure out the local process (and hope that there both is a process and that there's a community there to make the process work), and then fix the problems. I don't think that's entirely functional for this kind of problem.

(Why can't you do this here at mw.org?)

DePiep (talkcontribs)

Required to "visit every page" - sure, that's how I use AWB, and I must do so. Stricktly speaking, it's not a bot (I must manually check & save each edit. For these ~1000 pages that's doable).

I got the impression you knew of some general all-wiki AWB permission, but alas, it's all local. I'll see what I can do.

(And: the mw.org list is full of intentional, or possibly intentional, uses of "-{". See the /Release Notes listings, for example. I am not familiar enough with that code, and most of it should stay anyway. BTW, the total is only 60 pages, so a manual check is viable.)

Whatamidoing (WMF) (talkcontribs)

It's the visit-every-wiki-for-local-permission thing that shouldn't be necessary. Maybe we should go to Meta and find out what would be necessary to update the global policy (or policies).

DePiep (talkcontribs)

I'll carry on with this task wiki by wiki. These days I don't have the time or the mindset to pursue that Meta proposal.

Whatamidoing (WMF) (talkcontribs)

Thank you.

People were talking about this last year on Meta, so I've tried to re-start that discussion. I don't know if anything will come of it at all, but nothing is likely to happen soon. You'll probably be done with this list by then.

Thank you, thank you very much, for doing this. I really appreciate it.

Reply to "Pick up"
DePiep (talkcontribs)

In the enwiki list, a lot of pages in the Wikipedia namespace appear. These are mostly very old (<2010) archive-like pages. Should these be left untouched?

Whatamidoing (WMF) (talkcontribs)

Well... to the extent that we want to be able to read these pages easily, we probably should change them. For the English Wikipedia, it's probably worth discussing at a Village Pump first.

Reply to "Old archived pages"