Topic on Extension talk:SyntaxHighlight

safe highlighting of source code: how to HTML-unescape character entities?

Verdy p (talkcontribs)

When we use either the XML-like syntax or the #tag syntax, we have a problem embedding source code that is NOT written in the wiki markup language (i.e. the source language is NOT "html").

Some characters in the embedded source code will cause

  • either the XML-like syntax to be terminated, if the source code contains any occurrence of `</source>` or `</syntaxhighlight>`,
  • or the #tag: syntax to break, if the source code contains any vertical bar (`|`).

The temporary solution (if the source code is in the same page and not transcluded) is then to:

  • for the XML-like syntax, HTML-escape the problematic characters `<` or `&` as numeric character entities `&#60;` or `&#38;`; however, the syntax highlighter, when called with the XML-like syntax, does not unescape them to recover the raw character.
  • for the #tag: syntax, escape the problematic character `|` with an HTML escape like `&#124;`.

The problem is that this does not work: the extension still does not unescape the source code given.
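To make the missing step concrete, here is a minimal sketch in Python (not the extension's own code) of the escaping described above and of the unescaping that would have to happen before highlighting. The function names are hypothetical; only `html.unescape` is a real standard-library call.

```python
import html

# Characters the post identifies as problematic inside the tag content.
PROBLEM_CHARS = "<&|"

def escape_numeric(code: str) -> str:
    """Replace each problematic character with a decimal numeric entity."""
    return "".join(f"&#{ord(ch)};" if ch in PROBLEM_CHARS else ch for ch in code)

def unescape(code: str) -> str:
    """Decode character entities back to raw characters (the missing step)."""
    return html.unescape(code)

sample = 'if (a < b && flags | MASK) puts("</source>");'
escaped = escape_numeric(sample)
print(escaped)                      # no raw "<", "&" or "|" is left
print(unescape(escaped) == sample)  # the round trip restores the original
```

The escaped form no longer contains `</source>` or `|`, so it cannot terminate the tag or split the #tag: arguments; what is missing on the extension side is the second function.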

It is then impossible to properly transclude the source code from another page using `{{msgnw:Full:Pagename}}`, which generates a lot of numeric character entities that are still not decoded.

We should have either a parameter for the syntax highlight extension, instructing it to unescape all character entities present in the given source code, or another variant of the tag, where this instruction is explicit.

Without proper HTML-unescaping, this extension does not work properly: some source code will either break the parser or cause the MediaWiki parser to interpret and modify the source code (when using the #tag: syntax instead).

Can such an option be added to this extension tag, as an optional parameter like `|unescape=yes` (when using it with the `#tag:` syntax, notably when the source code is transcluded by msgnw from another page) or `unescape="yes"` (when using it with the XML-like syntax)?
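As an illustration only, the requested behaviour could look like the following Python sketch. `prepare_source` and its `attrs` dictionary are hypothetical stand-ins for the tag handler and its attributes, and `unescape` is the parameter proposed here, not an option the extension currently has.

```python
import html

def prepare_source(content: str, attrs: dict[str, str]) -> str:
    """Return the text a highlighter would receive (hypothetical handler)."""
    # 'unescape' is the parameter proposed in this post; it is an assumption.
    if attrs.get("unescape", "").lower() == "yes":
        return html.unescape(content)
    # Default: keep today's behaviour and pass the content through unchanged.
    return content

raw = "x &#124; y &#60; z"
print(prepare_source(raw, {"lang": "c"}))                     # x &#124; y &#60; z
print(prepare_source(raw, {"lang": "c", "unescape": "yes"}))  # x | y < z
```

Making the decoding opt-in keeps existing pages that intentionally display entities as written unaffected.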

Useful examples of use (with safe transclusion of source code from another page, using msgnw):

  • {{#tag:source|{{msgnw:Full:Pagename}}|lang=html|unescape=yes}}
  • {{#tag:source|{{msgnw:Template:Templatename}}|lang=html<!--or mediawiki?-->|unescape=yes}}
  • {{#tag:source|{{msgnw:User:Someone/foo.c}}|lang=C|unescape=yes}}
  • {{#tag:source|{{msgnw:Module:Pagename}}|lang=lua|unescape=yes}}

Alternative (or addition): don't embed the source code in the content of the tag, but pass the name of a page to transclude. This would be more efficient, as there is then no need to first HTML-encode the transcluded source code with msgnw and then unescape it in #tag:source; instead, #tag:source would just read the raw content of the page, without passing it through the wiki parser:

  • {{#tag:source||page=Full:Pagename|lang=html}}, or
    <source page="Full:Pagename" as="html"/> with the XML/HTML like syntax
  • {{#tag:source||page=Template:Templatename|lang=html<!--or mediawiki?-->}}
  • {{#tag:source||page=User:Someone/foo.c|lang=C}}
  • {{#tag:source||page=Module:Pagename|lang=lua}}
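A rough Python sketch of how such a `page` parameter might be resolved, under stated assumptions: `PAGES` is a stand-in for the wiki's page store, `resolve_source` is a hypothetical handler, and prepending the inline content is just one of the possible behaviours.

```python
# Stand-in for the wiki's page store; a real implementation would read the
# raw page text from storage without running it through the wiki parser.
PAGES = {
    "Module:Pagename": "local p = {}\nfunction p.f(a, b) return a | b end\nreturn p\n",
}

def resolve_source(content: str, attrs: dict[str, str]) -> str:
    """Return the raw text to highlight (hypothetical handler)."""
    page = attrs.get("page")
    if page is None:
        # No page given: behave as today and use the inline content.
        return content
    if page not in PAGES:
        raise KeyError(f"no such page: {page}")
    # Assumption: inline content, if any, is prepended to the page text.
    return content + PAGES[page]

print(resolve_source("", {"page": "Module:Pagename", "lang": "lua"}))
```

Because the page text never passes through the template argument parser, the `|` in the stored code needs no escaping at all.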
Pppery (talkcontribs)

You can work around the limitations of the #tag method by escaping the vertical bar as {{!}}, instead of using an HTML entity.
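For static code, this workaround can be applied mechanically. The Python helper below is a hypothetical sketch that builds such a call; it handles only the vertical bar, and code containing `{{` or `}}` would need further treatment that it does not attempt.

```python
def tag_invocation(code: str, lang: str) -> str:
    """Build a {{#tag:syntaxhighlight|...}} call with pipes written as {{!}}."""
    # Only the vertical bar is rewritten; braces are left untouched.
    body = code.replace("|", "{{!}}")
    return "{{#tag:syntaxhighlight|" + body + "|lang=" + lang + "}}"

print(tag_invocation("x = a | b;", "c"))
# {{#tag:syntaxhighlight|x = a {{!}} b;|lang=c}}
```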

Verdy p (talkcontribs)

No, it does not work (I've tried), and certainly not for simply transcluding the source from another page/subpage.

It makes editing wiki pages that contain source code very complicated if we need to check every character used in that language. And there's still no way to prevent MediaWiki from transforming the source code, even if it is not using the MediaWiki syntax (e.g. it could be Lua code, C, C++...).

And this does not work if the goal is to embed some source code in a review wiki page presenting the code used in some external page: the source code is edited independently of the review wiki page, and it may be blocked from editing while the review page is open for talks/comments.

What you suggest are complicated "hacks"; it would just be simpler for this extension tag to support unescaping or direct transclusion: users will provide their source code in a page or subpage (or a Lua module, or a CSS stylesheet page), and will reference that full page name.

PerfektesChaos (talkcontribs)

BTW, please note that source has been obsolete for 8 years now and will be replaced by syntaxhighlight entirely; that means: no support for <source> any longer in the near future. It is throwing maintenance categories already.

The method is dedicated to show code exactly as-is; that does mean: it is not supposed to “properly unescape” but to reproduce that entity source code.

Verdy p (talkcontribs)

Yes, I know. I've used "source", but my comment is equally valid with the "syntaxhighlight" keyword (a name so horrible that many have difficulty typing it correctly, or remembering it...).

  • Adding the "unescape" parameter has a demonstrated use: safe transclusion, and simpler editing of source code in a separate page/subpage, possibly in a dedicated namespace with a better editor like the Lua source editor or a CSS editor (the wiki editor is not fun for editing C, C++, Lua, CSS, or raw HTML code, or other markup languages like BB-code).
  • Adding the "page" parameter (to replace the content parameter 1, which would then be ignored or just prepended to the transcluded page) also makes all this simpler for users (no need to know the "msgnw:" transclusion mechanism and its caveats), and much more efficient. Transclusion with "msgnw:" is very costly in memory, in terms of generated text length, because of the many HTML escapes spread everywhere! Newlines, tabs, and almost all ASCII punctuation are HTML-escaped; only letters, digits and a very few symbols like ".", or non-ASCII characters, are not HTML-escaped by "msgnw:" transclusion. "msgnw:" has no practical use and should be deprecated: it is really too costly and impossible to process simply in any wiki template, even if it may be processed in Lua by performing the needed unescaping and using the #tag extension directly without returning to the wiki parser.
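To get a feel for the size overhead claimed above, here is a small Python estimate. It does not reproduce the real "msgnw:" output; the set of characters left unescaped is an assumption taken from the description in this post (letters, digits, "." and non-ASCII characters), and `escape_like_msgnw` is a hypothetical name.

```python
import html
import string

# Assumption from the post: only letters, digits, "." and non-ASCII
# characters are left as-is; everything else becomes a numeric entity.
SAFE = set(string.ascii_letters + string.digits + ".")

def escape_like_msgnw(text: str) -> str:
    """Escape every character outside SAFE as a decimal numeric entity."""
    return "".join(
        ch if ch in SAFE or ord(ch) > 127 else f"&#{ord(ch)};" for ch in text
    )

code = "for (int i = 0; i < n; ++i) {\n\tsum += a[i] | b[i];\n}\n"
escaped = escape_like_msgnw(code)
print(len(code), len(escaped))        # original vs. escaped length
print(html.unescape(escaped) == code) # decoding restores the source
```

Under this assumption each escaped character grows from one character to four or five, so punctuation-heavy source code expands several times over before it ever reaches the highlighter.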

Many community wikis have strongly rejected the suppression of the "source" name (or alias); they don't like the "syntaxhighlight" name, which they can't memorize or type correctly (and which is overlong). Unless you propose a better name (a single word please, like "syntax"!), "source" is there and will continue to be widely used in many wiki pages.

There are more urgent and useful things to do than rejecting "source". Notably, "lang" conflicts with the HTML attribute of the same name, does not indicate a BCP47 language code, and causes confusion! Note that some programming languages have linguistic variants (e.g. Excel formulas: syntax="excel" lang="fr", where "IF(condition;value1;value2)" in English Excel must be rewritten "SI(condition;valeur1;valeur2)" in French Excel...). Likewise, lang="*" would be useful for indicating the human language used in the text literals of the programming language, to allow a spell checker to scan these literals: the literal values would be given to the spell checker by the syntax parser, so the spell checker needs no parsing of the programming language's syntax; the spell checker may also be hinted (i.e. "in which language are the literal values that the spell checker will process") by directives detected in the source code by the program parser.

For the related (but different) HTML element "script", the programming language (for the embedded source code, which is not rendered but executed) is indicated by a "type" attribute, not "lang". It should be the same for the "source" extension tag, aka "syntaxhighlight", aka "syntax", aka "program", or whatever simple name you choose for the XML-like syntax. The name used with the "#tag:" syntax, however, is not an issue, and "source" does not need to be removed there: there is no conflict with XML/HTML, as the "#tag:" syntax already isolates it in its own namespace. For the HTML/XML-like syntax, extension tags should have used a "tag:" namespace from the beginning, in which case there would have been no problem with <tag:source type="C">...</tag:source> or <tag:code type="C">...</tag:code> or <tag:pre type="C">...</tag:pre> or <tag:tt type="C">...</tag:tt>... (not to be confused, on some wikis like the OSM wiki, with the prefix "Tag:", which is also used to name pages or MediaWiki namespaces for page titles).

PerfektesChaos (talkcontribs)

(other message crossed) Nothing needs escaping other than the pipe (and perhaps }} terminators) in the code parameter.

I have been doing this for quite a long time, and there is no problem at all if you embed static code in <syntaxhighlight> tags. If you want dynamic embedding of external things, then you are running into a philosophical problem: you need to be clear about what code you want to be presented, and where and how you draw the distinction from the code that triggers the dynamic transclusion, which shall not be visible and shall be interpreted.

I do not type <syntaxhighlight>, never, I use C&P or editing tools for insertion.

Verdy p (talkcontribs)

You can do this only within the same wiki page. This is still horrible, complicated, and very error-prone. It is impossible via transclusion (which is definitely simpler for all users, given that they can edit their code separately in an editor better suited to the programming language than the default wiki editor).

For example, we have a Wikitext editor (along with its visual editor) for wiki pages, a CSS editor, and a Lua editor. We could have dedicated editors for data (e.g. CSV or tabulated, or JSON), C/C++, PHP, JavaScript, Java, XML, SVG... These editors would also internally embed their syntax highlighter (probably the same as the one used with the source/syntaxhighlight tag in MediaWiki pages, i.e. GeSHi or Pygments).
