Topic on Talk:Parsoid/Language conversion/Preprocessor fixups

Variant markup in URL fragment

5 comments • 17:51, 24 May 2017 6 years ago

Tgr (talk | contribs)

The example about fixing URLs is questionable: unlike other parts of the URL, fragments (the part after the #) do not necessarily use percent-encoding. Modern browsers understand it and older browsers don't; client-side applications won't either, unless the author has specifically considered that this might happen.

05:36, 23 May 2017 6 years ago

SSastry (WMF) (talk | contribs)

Hmm ... one way around this is to update the parser to decode entities (in fragments) before generating HTML. While use of -{ markup is relatively rare in URLs, https://en.wikipedia.org/wiki/Help:Citation_Style_1#Special_characters would likely affect a lot more instances.

14:31, 23 May 2017 6 years ago

SSastry (WMF) (talk | contribs)

@Cscott and I are talking about this, and it looks related to the HTML5 ID discussion as well.

15:32, 23 May 2017 6 years ago

Cscott (talk | contribs)

We should pursue the same strategy we use for HTML5 IDs. I think the latest insight there is that percent-encoding works everywhere, in browsers old and new. If that is the case, then this is just "broken client-side JavaScript applications don't always handle percent-encoding as they ought". That's not our bug to fix. I will test to see if entity encoding will be a reasonable workaround in this case (it should be), but if the standards-compliant thing is URL encoding, then I don't know that we should confuse editors by mentioning this special case for buggy clients.

If it turns out the rules to properly encode URLs for backwards compatibility are complex, we could also implement this in the PHP Sanitizer, which is in charge of correctly encoding URLs for output to HTML. We already URL-decode article titles in wikitext, before re-encoding them for output in HTML. We could do the same for external links, which would allow hiding the complexities of fragments, etc (assuming there are complexities) from the author.
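For illustration only, the decode-then-re-encode idea for external link fragments could be sketched like this in Python. The function name `normalize_fragment` and the `FRAGMENT_SAFE` set are hypothetical; the safe set is loosely based on the RFC 3986 fragment grammar and is an assumption, not what the PHP Sanitizer actually does.

```python
from urllib.parse import quote, unquote, urlsplit, urlunsplit

# Characters left unescaped in the fragment. This set is an assumption
# loosely based on RFC 3986, not the Sanitizer's real rule set.
FRAGMENT_SAFE = "!$&'()*+,;=:@/?-._~"

def normalize_fragment(url: str) -> str:
    """Percent-decode the fragment, then re-encode it consistently."""
    parts = urlsplit(url)
    decoded = unquote(parts.fragment)             # undo any existing %XX escapes
    encoded = quote(decoded, safe=FRAGMENT_SAFE)  # escape everything outside the safe set
    return urlunsplit(parts._replace(fragment=encoded))

print(normalize_fragment("https://example.org/page#-{foo}-"))
```

Decoding first means an author can write the fragment either raw or already percent-encoded and get the same output, so running the function twice changes nothing.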

16:32, 23 May 2017 6 years ago

Tgr (talk | contribs)

If you want to go by standards, none of the IETF URI, IETF IRI and WHATWG URL specs allow a raw { or } in the fragment, so arguably any application relying on that is wrong. None of those standards says anything about how to encode disallowed characters, though. (This is understandable, as the semantics of fragments are intentionally left unspecified in these standards so that each MIME type standard can define its own semantics.)
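As a rough check of the URI case only, the fragment grammar in RFC 3986 (unreserved characters, sub-delims, ":", "@", "/", "?" and percent-escapes) can be written as a regular expression. The helper name `is_rfc3986_fragment` is hypothetical, and this sketch does not cover the IRI or WHATWG URL rules.

```python
import re

# fragment = *( pchar / "/" / "?" ) per RFC 3986; pchar covers unreserved
# characters, sub-delims, ":", "@" and percent-encoded triplets.
_FRAGMENT_RE = re.compile(r"(?:[A-Za-z0-9._~!$&'()*+,;=:@/?-]|%[0-9A-Fa-f]{2})*")

def is_rfc3986_fragment(fragment: str) -> bool:
    """Return True if the fragment (without the leading '#') matches the grammar."""
    return _FRAGMENT_RE.fullmatch(fragment) is not None

print(is_rfc3986_fragment("-{foo}-"))      # raw braces
print(is_rfc3986_fragment("-%7Bfoo%7D-"))  # percent-encoded braces
```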

As far as JavaScript applications go, Firefox will force percent-encoding and transparently convert the URL if needed (and decode it for display), while other browsers just return the exact bytes in location.hash. So I guess anything that does not understand percent-encoding (and outputs URLs with curly braces in them) is already broken in a major browser and we can ignore it. I retract this thread :)
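A client that wants to be robust to both behaviours can percent-decode whatever it receives before comparing fragments. A minimal sketch in Python, with a hypothetical helper name and the hash value passed in as a plain string:

```python
from urllib.parse import unquote

def canonical_fragment(hash_value: str) -> str:
    """Strip a leading '#' and percent-decode, so raw and encoded forms match."""
    if hash_value.startswith("#"):
        hash_value = hash_value[1:]
    return unquote(hash_value)

# Both the forced-encoding form and the raw-bytes form map to the same key.
print(canonical_fragment("#-%7Bfoo%7D-") == canonical_fragment("#-{foo}-"))
```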

17:25, 24 May 2017 6 years ago