Talk:Markup spec/Archive 1

A long time ago, before MediaWiki had such fantastical things as extensions, I did a bit of hacking around in the rendering code in order to implement a few syntax enhancements that I needed at that time. As part of the process I documented some of what I found. I don't know if it is at all relevant any more - it was for MW 1.3.10, I think, and I'm sure a lot has changed - but I've posted it at User:HappyDog/WikiText parsing in case it's of any use to anyone. --HappyDog 14:54, 17 May 2006 (UTC)

Parsing Expression Grammar
I rather like the idea of using a Parsing expression grammar. I'll give it a try here as soon as I work out where to start. HTH HAND —Phil | Talk 22:02, 24 May 2006 (UTC)

BNF
I saw that HappyDog started by giving the links in BNF. I thought a bit about it and concluded that I'd prefer something easier to get started with, so I tried to describe articles containing only text and horizontal rules at Markup spec/BNF/Article. It has been a long time since I last did this, but I seem to remember that the alternatives in BNF must be disjoint. I don't think it is possible to achieve this without producing an absurdly long specification. So I decided to add some comments telling how the rules should be applied.

I don't mind which formalism we use to specify the wiki markup, but after this short attempt with BNF I can see why Phil wants to use a PEG. However, there is nothing inherently wrong with BNF + English. In short, I wish I had a better memory and actually remembered what I was taught in the Formal Languages course. -- Jitse Niesen 14:47, 27 May 2006 (UTC)


 * I started with links, because I figured it was easier to start at the bottom and work up than the other way round. Actually, it probably makes little difference though.  I also started work on a top-down approach, which is different to yours.  I don't want to go in and just change what you've done, so I'm going to post what I have so far in Talk:Markup spec/BNF/Article for discussion. --HappyDog 02:18, 29 May 2006 (UTC)


 * Actually - on closer inspection they are not that different. --HappyDog 02:29, 29 May 2006 (UTC)

Exceptions, Context-sensitivity and hacks
I do not know whether this is the best central page to discuss the formal description of MediaWiki syntax (is there a better one?).

In any case, there are many hacks in the original parser that make the language elements highly context-sensitive. The question is whether these should be described and implemented in future parsers, which would make such a parser very difficult to create and maintain, or whether it would be better to remove them from the language so that the meaning is easier to grasp for both humans and computers.

Quotes are among the most difficult things to parse correctly: contrary to what the article text seems to imply, two quotes are not always italics, three quotes are not always bold, etc. If two quotes are followed by three quotes, then the three quotes are interpreted as end of italics plus a literal quote, but not if they are followed again by three quotes, in which case they are interpreted as start bold (and the second triple of quotes is interpreted as end bold). This gets even more complicated with longer combinations, and sometimes the placement of the literal quote depends on whether it is followed by a single lower case character. All quote-induced formatting is ended by a single newline in the input, but the equivalent HTML tags (e.g. <i>), which are usually allowed in the input, will not be ended even by paragraph breaks.

So I think a detailed English language description of how the markup is processed would be a necessary first step: this would need to include exceptions and information about which constructs (if any) will terminate other constructs (e.g. end of line terminates any open italic/bold if it was started by a quote-construct) or under what circumstances the construct is NOT interpreted in the way one would expect (i.e. three quotes not interpreted as start/end of bold). Johann p 16:37, 20 February 2007 (UTC)
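The two cases described above (two quotes followed by three quotes, and two quotes followed by two triples) can be illustrated with a minimal Python sketch. It models only the behaviour stated in this comment, not the actual parser code. The function names are hypothetical, runs of four or more quotes are simply left as text, and the choice of the first triple as the one that gets demoted is an assumption; as noted above, the real placement of the literal quote depends on the surrounding characters.

```python
import re

# A run of exactly two or three apostrophes. Longer runs are left as plain
# text in this sketch (an assumption made to keep the example small).
QUOTE_RUN = re.compile(r"(?<!')('{2,3})(?!')")


def classify_line(line):
    """Return (kind, text) tokens for one line; kind is text, italic or bold."""
    tokens = []
    # With one capturing group, odd indices of the split are the quote runs.
    for index, part in enumerate(QUOTE_RUN.split(line)):
        if index % 2 == 1:
            tokens.append(("italic" if len(part) == 2 else "bold", part))
        elif part:
            tokens.append(("text", part))

    italics = sum(1 for kind, _ in tokens if kind == "italic")
    bolds = sum(1 for kind, _ in tokens if kind == "bold")
    if italics % 2 == 1 and bolds % 2 == 1:
        # Both counts are odd, so neither can be balanced. Reinterpret one
        # triple as a literal quote plus an italic marker. Picking the first
        # triple is an assumption of this sketch.
        first_bold = next(i for i, (kind, _) in enumerate(tokens) if kind == "bold")
        tokens[first_bold:first_bold + 1] = [("text", "'"), ("italic", "''")]
    return tokens


def classify_text(text):
    """Classify each line on its own, since a newline ends quote formatting."""
    return [classify_line(line) for line in text.split("\n")]
```

Because each line is classified independently, a quote opened on one line can never be paired with one on the next, which mirrors the rule that end of line terminates any open italic or bold.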

Lists
Lists can't be handled by BNF when using a naive lexical analyser, but it may be possible to handle them using a more complicated lexer. The idea is to consider the blocks of *:;# at the start of lines to be separate from the rest of the line; instead of generating tokens for *, :, etc., it makes much more sense to generate tokens for 'one more * than the previous line', 'one less * than the previous line', etc. Representing these tokens as {*, *}, etc., and a newline as \n, the following complicated nested list:

 # a
 #* b
 #* c
 # d
 # e
 # f
 #:: g
 #:** h
 **** i
 # j

would be lexically analysed as

 {# a \n {* b \n c \n *} d \n e \n f \n {: {: g \n :} {* {* h \n *} *} :} #} {* {* {* {* i \n *} *} *} *} {# j \n #}

This maps to the equivalent HTML, list structure, etc., in a very BNF-able way, and is easily obtainable from the lexical analyser (which tokenises the wikitext before passing it to the BNF). It makes a lot more sense than trying to treat * as a token, anyway... Ais523 10:37, 14 November 2007 (UTC)
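The lexer idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `lex_list_lines` is hypothetical, the sample wikitext is reconstructed from the token stream quoted in the comment, and the closing token for # is written as #} for consistency with *} and :}.

```python
import re


def lex_list_lines(wikitext):
    """Turn leading *#:; prefixes into open/close tokens relative to the previous line."""
    tokens = []
    previous = ""
    for line in wikitext.splitlines():
        prefix = re.match(r"[*#:;]*", line).group(0)
        text = line[len(prefix):].strip()

        # Length of the part of the prefix shared with the previous line.
        common = 0
        while (common < min(len(previous), len(prefix))
               and previous[common] == prefix[common]):
            common += 1

        # Close the levels that ended (innermost first), then open the new ones.
        for marker in reversed(previous[common:]):
            tokens.append(marker + "}")
        for marker in prefix[common:]:
            tokens.append("{" + marker)

        if text:
            tokens.append(text)
        tokens.append("\\n")  # the two-character newline token used in the comment
        previous = prefix

    # Close whatever is still open at the end of the input.
    for marker in reversed(previous):
        tokens.append(marker + "}")
    return tokens


# Sample input reconstructed from the token stream above (an assumption).
SAMPLE = "\n".join([
    "# a", "#* b", "#* c", "# d", "# e", "# f",
    "#:: g", "#:** h", "**** i", "# j",
])

print(" ".join(lex_list_lines(SAMPLE)))
```

The only state the lexer carries is the previous line's prefix, so the grammar that consumes these tokens sees balanced open/close pairs and never has to count markers itself.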