Alternative grammar 
I was also working on an article definition (actually, article is not quite the top level...). Here's what I've got so far - we can discuss where the two descriptions differ and try and incorporate them both into a more definitive form... --HappyDog 02:22, 29 May 2006 (UTC)
<wiki-page> ::= <redirect> [<article>] | [<article>]
<redirect> ::= <redirect-tag> <whitespace> <link>
<redirect-tag> ::= "#redirect" [:]
<article> ::= <paragraph> [<article>]
<paragraph> ::= <para-start> [<para-body>]
<para-start> ::= (BOF | <newline> <newline>) [<newlines>]
<para-body> ::= (<wiki-text> | <html> | <plain-text>) [<para-body>]
<wiki-text> ::= <link> | <heading> | <formatting> | <ruler> | <wiki-htmlstyle-tags> | <table> | <list> | ...
////////////////////////////////////
<formatting> ::= <bold> | <italic>
<bold> ::= "'''" [<wiki-text>] "'''"
<italic> ::= "''" [<wiki-text>] "''"
////////////////////////////////////
<ruler> ::= <BOL> "----" "-"*
////////////////////////////////////
<heading> ::= <BOL> (<l1-heading> | <l2-heading> | <l3-heading> | <l4-heading> | <l5-heading> | <l6-heading>)
<l1-heading> ::= "=" <wiki-text> "="
<l2-heading> ::= "==" <wiki-text> "=="
<l3-heading> ::= "===" <wiki-text> "==="
<l4-heading> ::= "====" <wiki-text> "===="
<l5-heading> ::= "=====" <wiki-text> "====="
<l6-heading> ::= "======" <wiki-text> "======"
////////////////////////////////////
<wiki-htmlstyle-tags> ::= <opening-nowiki-tag> [<plain-text>] (<closing-nowiki-tag> | EOF)
                        | <opening-pre-tag> [<plain-text>] (<closing-pre-tag> | EOF)
                        | <html>
////////////////////////////////////
<html> ::= <opening-paired-tag> [<wiki-text>] <closing-tag> | <unpaired-tag>
<opening-nowiki-tag> ::= "<" [<white-space>] "nowiki" [<white-space>] ">"
<closing-nowiki-tag> ::= "</nowiki" [<white-space>] ">"
<opening-pre-tag> ::= "<" [<white-space>] "pre" [<white-space>] ">"
<closing-pre-tag> ::= "</pre" [<white-space>] ">"
<opening-paired-tag> ::= "<" [<white-space>] <tag-name> [<attribute-list>] [<white-space>] ">"
<closing-tag> ::= "</" <tag-name> [<white-space>] ">"
<unpaired-tag> ::= "<" [<white-space>] <tag-name> [<attribute-list>] [<white-space> ["/"]] ">"
<attribute-list> ::= <whitespace> <attribute-name> [ "=" <quoted-attribute-value>] [<attribute-list>]
<quoted-attribute-value> ::= <attribute-value> | """ <attribute-value> """ | "'" <attribute-value> "'"
<tag-name> ::= <non-whitespace-char> <tag-name>
<attribute-name> ::= <non-whitespace-char> [<attribute-name>]
<attribute-value> ::= <plain-text>
////////////////////////////////////
<plain-text> ::= <character> <plain-text>
- Great. I think it will work much better if we have a bit of discussion and exchange ideas, because it won't be easy. One problem with the above grammar is that it assumes that paragraphs are separated by two newlines. This is not necessarily the case, e.g.
- foo
- ==bar==
- is parsed as a paragraph and a heading, but the above grammar does not capture that (as far as I can see). That's why the grammar on Markup spec/BNF/Article distinguishes between special-block and text-block.
- Then there is the strange case of
- ==heading== text
- more text
- which is parsed as a heading plus *two* paragraphs. However, we could argue that the MediaWiki parser is at fault here … -- Jitse Niesen 13:57, 29 May 2006 (UTC)
- To clarify, the above grammar does match
- foo
- ==bar==
- but it parses this text as one <paragraph>, containing <plain-text> "foo" and <l2-heading> "bar". The grammar on Markup spec/BNF/Article parses this text as a <text-block> (which should probably be called paragraph) containing "foo" and a <level-2-heading> "bar". I think the latter is closer to the MediaWiki parser, since it translates the above wikitext to <p>foo</p><h2>bar</h2> .
- Yeah - that's a fundamental thing we need to sort out. My example clearly doesn't deal with it correctly, but I'm not sure if yours does quite either. How about something like this:
<article> ::= (<para-start> | <non-paragraph>) [<paragraph>] [<article>]
<para-start> ::= (BOF | <newline> <newline>) [<newlines>]
<paragraph> ::= (<wiki-text> | <html> | <plain-text>) [<paragraph>]
<non-paragraph> ::= <non-para-start> <heading> | <ruler> | <table> | <list> | ...
<non-para-start> ::= (BOF | <newline>)
<wiki-text> ::= <link> | <formatting> | <wiki-htmlstyle-tags> | ...
- Better, but that does not handle two non-paragraphs in succession, like
- ==foo==
- ==bar==
- This should parse as two headings.
- Could you please be a bit more specific about "I'm not sure if yours does quite either"? -- Jitse Niesen 06:39, 4 June 2006 (UTC)
- The above example will be parsed as follows (where [CR] is a newline character):
==foo== [CR]==bar==
--> <non-paragraph> <article>
--> <non-para-start> <heading> <non-paragraph>
--> BOF <heading> <non-para-start> <heading>
--> BOF <heading> <newline> <heading>
- I don't know if that is clear, but hopefully it explains how it would be parsed. --HappyDog 23:50, 4 June 2006 (UTC)
- (Sorry to interrupt your text, but there is no easy way to reply to a specific statement.) That's quite clear. I guess I was careless when I wrote that; I think I didn't spot that the <paragraph> is optional. Well, your grammar is quite close at least. It doesn't do ==foo== [CR] [CR] ==bar== if I read it correctly, but that can be fixed by including an optional newline in <non-para-start>.
- I guess it is a bit counterintuitive to allow <para-start> between <non-paragraph>s. The rules with which the current parser handles multiple newlines are also tricky, and I don't think they are captured well by your grammar, but perhaps this can be fixed. -- Jitse Niesen 13:58, 6 June 2006 (UTC)
- When I said "I'm not sure if yours does quite either", I guess I really meant that I couldn't quite follow it, so I wasn't sure if it was right or not. I remember at the time coming up with something that I didn't think it could parse, but I can't remember what it was, or if the criticism still stands given your recent changes. If I spot anything specific then I'll let you know. --HappyDog 23:50, 4 June 2006 (UTC)
- Actually, I just spotted this: Your parser will interpret:
This is a piece of text, but == this isn't a heading ==
- as <paragraph-and-more> <special-block-and-more> due to the <newline> being optional when a special block follows a paragraph.
- The way I understand it, paragraphs and special blocks must _always_ start at the beginning of a line. This was why my grammar went the opposite way and defined them by starting with a newline, rather than ending with one - although in retrospect I think defining them by how they end is better. --HappyDog 00:41, 5 June 2006 (UTC)
- Actually, I think that is what my parser does. A paragraph in my parser must end with a newline: a paragraph contains lines-of-text, and a line-of-text ends with a newline. For this reason, it doesn't parse your example as you say. -- Jitse Niesen 13:58, 6 June 2006 (UTC)
- I'm going to do a little thinking aloud here, so please excuse me if it is not completely coherent... :)
- One thing I think we should aim for is something that decomposes properly in English, as well as being a correct description of the grammar. For example, I would describe the wiki markup as follows:
An article is either an empty page, or contains article-content.
article-content is either a paragraph or a special-block, followed by more article-content.
a paragraph is wiki-text followed by a newline
a special-block is one of the special-items followed directly by a paragraph or a newline
special-items are heading, ruler, list, etc.
wiki-text is a wiki-text-item followed by more wiki-text
wiki-text-items are link, formatting, magic-link, etc. or just plain-text
- The above description doesn't quite cut it - it doesn't deal with paragraph following paragraph. This may be fixable, but I'm not immediately sure how. Another approach would be to look at an article as consisting of lines of text, so maybe:
An article is either an empty page, or contains article-content.
article-content is either a paragraph or a special-block, followed by more article-content.
a paragraph is one or more paragraph-lines followed by a blank-line or a special-block
a special-block is one of the special-items followed by a paragraph or a newline
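The line-oriented description above can be sketched in Python. This is only an illustration and not the MediaWiki parser: the function name `split_blocks`, the tuple output, and the simplification that a heading or ruler occupies a whole line are all assumptions, and lists, tables and the special handling of repeated empty lines are left out.

```python
import re

# Assumed simplification: a heading is a whole line of the form ==text==.
HEADING = re.compile(r"^(={1,6})([^=].*?)\1\s*$")
# A ruler is four or more dashes at the beginning of a line.
RULER = re.compile(r"^-{4,}\s*$")


def split_blocks(wikitext):
    """Split wikitext into ('p', text), ('hN', text) and ('hr', '') blocks."""
    blocks = []
    paragraph = []  # paragraph-lines collected so far

    def flush():
        # Emit the open paragraph, if any, and start a new one.
        if paragraph:
            blocks.append(("p", " ".join(paragraph)))
            paragraph.clear()

    for line in wikitext.split("\n"):
        heading = HEADING.match(line)
        if heading:
            flush()  # a special-block ends the open paragraph
            level = len(heading.group(1))
            blocks.append((f"h{level}", heading.group(2).strip()))
        elif RULER.match(line):
            flush()
            blocks.append(("hr", ""))
        elif line.strip() == "":
            flush()  # a blank-line ends the paragraph
        else:
            paragraph.append(line)
    flush()
    return blocks
```

With this sketch, `split_blocks("foo\n==bar==")` yields a paragraph followed by a level-2 heading, and `split_blocks("==foo==\n==bar==")` yields two headings, because paragraphs are ended either by a blank line or by the start of a special block.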
- It should be noted that (as far as I can tell) all wiki-text-items that are left open are automatically closed at the end of a paragraph (or special block). However, the way this works when a line-break is encountered varies:
strikethrough [CR] strikethrough continues
bold [CR] bold does not continue
- As you can see, the bold tags are closed at the line-break, whilst the strikethrough is not. Anyway - these are a lot of late-night ramblings that might not make much sense even to me. Let me know what you think - I may come back and tidy up this edit when I'm a bit more awake... :) --HappyDog 00:41, 5 June 2006 (UTC)
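The bold case can be illustrated with a small Python sketch. It only models the behaviour described above (an open ''' is closed at the end of the line); the function name `render_bold_line` is hypothetical, italics and nesting are ignored, and the code is not taken from Parser.php.

```python
def render_bold_line(line):
    """Convert ''' markers in one line to <b> tags, closing any open tag at the line end."""
    parts = line.split("'''")
    output = []
    bold_open = False
    for index, part in enumerate(parts):
        if index > 0:
            # Every ''' marker toggles bold on or off.
            output.append("</b>" if bold_open else "<b>")
            bold_open = not bold_open
        output.append(part)
    if bold_open:
        # Assumption from the observation above: bold never survives a line break.
        output.append("</b>")
    return "".join(output)
```

A construct such as strikethrough, which continues across the line break, would instead need its open/closed state carried from one line to the next until the paragraph ends.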
It is not clear whether recognition of redirects is part of the parser. I think redirects are recognized in Title.php and not in Parser.php. Furthermore, in preview and diff mode, redirects are rendered differently from normal page-view mode. On the other hand, logically the recognition of redirects is probably part of the parser. -- Jitse Niesen 14:07, 31 May 2006 (UTC)
- They definitely need to be recognised by any wikitext parser, even if they are handled in a different place in the code. If the page does a hard redirect then the rest of the page is ignored, but if this is not the case (&redirect=no, preview edit, etc.) then the whole page is rendered, including the redirect. The actual detail of how the redirect should be displayed should be left out of the specification, but nonetheless it is part of the page parsing, wherever it happens. --HappyDog 07:11, 1 June 2006 (UTC)
- What I tried to say is that &redirect=no and preview edit are handled differently. Compare
- The rest of the page is only shown in the second case. However, I agree with including redirects in the parser. -- Jitse Niesen 10:49, 1 June 2006 (UTC)
- Yeah - I think we're both saying the same thing with regard to your final point. It is interesting though how the two differ. In the redirect=no case, all wiki text is ignored BUT the page must still get parsed somewhere, as the category links inherited from the template are present. In the preview version though, the # is just treated as a numbered list (as it would be in any other case) and rendered appropriately. This implies that the page is 'parsed' differently on different pages. A standard parsing (if not a standard format for handling the parsed data) may well be one of the things we can derive from this project. --HappyDog 18:56, 1 June 2006 (UTC)
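As a rough illustration of treating redirect recognition as a parsing step, here is a Python sketch that follows the <redirect> rule proposed at the top of this section ("#redirect", an optional colon, whitespace, then a link). The function name `redirect_target` is hypothetical, case-insensitive matching is an assumption, and the sketch says nothing about how the different views render the page.

```python
import re

# "#redirect", optional ":", whitespace, then [[target]] or [[target|label]].
# Case-insensitive matching is an assumption, not something the grammar states.
REDIRECT = re.compile(
    r"#redirect:?\s+\[\[([^\[\]|]+)(?:\|[^\[\]]*)?\]\]",
    re.IGNORECASE,
)


def redirect_target(page_text):
    """Return the redirect target of a page, or None for an ordinary page."""
    # match() anchors at the start of the text, so only a leading redirect counts.
    match = REDIRECT.match(page_text)
    return match.group(1).strip() if match else None
```

A caller could use the result to decide between a hard redirect and rendering the whole page, and a line starting with "#" that does not match would fall through to normal list handling.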
Empty lines 
Some notes about how the current parser handles empty lines. The tables show the wikitext on the left (with "CR" indicating the end of line, in order to show empty lines) and the generated HTML on the right.
Some editors on the English Wikipedia complained about the current behaviour, but it seems unlikely to change.
Empty lines between paragraphs 
Wikitext | Generated HTML
text CR and more CR | <p>text and more</p>
text CR CR and more CR | <p>text</p> <p>and more</p>
text CR CR CR and more CR | <p>text</p> <p><br />and more</p>
text CR CR CR CR and more CR | <p>text</p> <p><br /></p> <p>and more</p>
text CR CR CR CR CR and more CR | <p>text</p> <p><br /></p> <p><br />and more</p>
Empty lines between a paragraph and a horizontal rule 
Wikitext | Generated HTML
text CR ---- CR | <p>text</p> <hr />
text CR CR ---- CR | <p>text</p> <hr />
text CR CR CR ---- CR | <p>text</p> <p><br /></p> <hr />
text CR CR CR CR ---- CR | <p>text</p> <p><br /></p> <hr />
Empty lines between a horizontal rule and a paragraph 
Wikitext | Generated HTML
---- CR text CR | <hr /> <p>text</p>
---- CR CR text CR | <hr /> <p>text</p>
---- CR CR CR text CR | <hr /> <p><br />text</p>
---- CR CR CR CR text CR | <hr /> <p><br /></p> <p>text</p>
Empty lines between two horizontal rules 
Wikitext | Generated HTML
---- CR ---- CR | <hr /> <hr />
---- CR CR ---- CR | <hr /> <hr />
---- CR CR CR ---- CR | <hr /> <p><br /></p> <hr />
---- CR CR CR CR ---- CR | <hr /> <p><br /></p> <hr />
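The four tables seem to follow one pattern, which the Python sketch below tries to capture. The rule is inferred from the rows above and is not taken from the parser source: the first empty line only separates the blocks, every further pair of empty lines gives a `<p><br /></p>`, and a leftover single empty line becomes a `<br />` at the start of a following paragraph or, before a horizontal rule, a `<p><br /></p>` of its own. The function name `render_pair` is hypothetical, whitespace between the generated elements is ignored, and behaviour beyond the number of empty lines shown in the tables is an extrapolation.

```python
def render_pair(first, second, empty_lines):
    """Render two one-line blocks separated by a number of empty lines."""

    def is_rule(line):
        return line.startswith("----")

    def block(line, prefix=""):
        return "<hr />" if is_rule(line) else f"<p>{prefix}{line}</p>"

    # Two adjacent lines of text belong to the same paragraph.
    if empty_lines == 0 and not is_rule(first) and not is_rule(second):
        return f"<p>{first} {second}</p>"

    # The first empty line only separates the blocks.
    extra = max(empty_lines - 1, 0)
    if is_rule(second):
        # A rule cannot absorb a <br />, so an odd leftover line
        # still produces a spacer paragraph.
        spacers = (extra + 1) // 2
        prefix = ""
    else:
        spacers = extra // 2
        prefix = "<br />" if extra % 2 else ""
    return block(first) + "<p><br /></p>" * spacers + block(second, prefix)
```

For example, `render_pair("text", "and more", 2)` reproduces the third row of the first table, apart from the whitespace between elements.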
One other thing. I'm finding it quite hard to see how the various grammars work. Do you have any experience with parser generators? I only know yacc/bison, but that is not so easy and if I remember correctly, they don't quite accept the grammar in our form. Antlr seems quite promising. A problem is that I'm not sure what to do with the lexer. Perhaps it's best to just skip it? For instance, in this sentence [[ is not parsed as start of link, because the link is not closed. Finally, I'm not sure whether it is worth the effort to implement it. We don't want to duplicate Magnus' work, or that of the many others that have started on building a parser. -- Jitse Niesen 13:58, 6 June 2006 (UTC)
- To be honest, I don't have much experience with parser generators either, but I'm not sure whether that is our ultimate aim or not. As far as I can see there are too many dynamic variables (e.g. terminals that are derived from config/language settings or DB data) to do this, though maybe this is less of a problem than I think. I think the main point is to remove any ambiguity from the way a page is parsed and the way that it is rendered. Maybe I am mistaken though. Are we planning for this to be used with existing parser technology, or are we just trying to describe the syntax in an unambiguous way?
- Regarding your example of the unterminated [[, this is what I was referring to above about some tags being automatically closed by the end of a paragraph, or the end of a line, and I think this is something that can be described by the grammar (though perhaps not so easily...) --HappyDog 19:20, 6 June 2006 (UTC)
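One way to see why a separate lexer is awkward here is a Python sketch that recognises a link only as a complete construct, so an unclosed [[ stays plain text. The function name `render_links`, the simplified `<a>` output, and the assumption that a link may not span a line break are all illustrative and not taken from MediaWiki.

```python
import re

# A link is only recognised when "[[" has a matching "]]" on the same line
# (the same-line restriction is an assumption made for this sketch).
LINK = re.compile(r"\[\[([^\[\]\n]+)\]\]")


def render_links(text):
    """Replace complete [[target]] constructs and leave a stray [[ untouched."""
    return LINK.sub(lambda match: f"<a>{match.group(1)}</a>", text)
```

Because the pattern matches the whole construct, no "start of link" token ever has to be emitted and later withdrawn, which is the difficulty a tokenising lexer would face.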
My aim for the implementation is to check that the grammar that we are building actually does what I want. We could just use the English language defaults. I do not plan to have it actually used in practice. It's probably just too much work. Besides, we may well find that automatic parser generators are not flexible enough to handle all the gory details of the MediaWiki markup.
More fun with tags:
bla bla bla <div style="color: red;"> bla bla bla bla bla bla </div> bla bla bla
bla bla bla
bla bla bla
By the way, as you may know Brion's going to make some changes to the handling of HTML tags which will modify the behaviour of the parser in some places. -- Jitse Niesen 09:34, 7 June 2006 (UTC)
- Hmmm... that is weird. The engine automatically closes _some_ tags, then? Or is there an HTML tidier that does all that post-parsing? Maybe we should wait until Brion's changes are in place (are they in the new 1.6.7? - some changes have been made, I notice, but are there more to come?) before we delve into this any further. --HappyDog 21:04, 7 June 2006 (UTC)
Yes, the parser is weird. I'm not sure what's happening here.
If Manual:$wgUseTidy is set, then HTML Tidy is used to clean up the tags, otherwise an internal routine is used. Consequently, the above example parses differently. This is what Brion's changes were supposed to fix. However, he seems to have run into some opposition so the changes have been postponed.
Something else did change. Things like
== heading == with more text
are no longer recognised as a heading. I guess we just have to live with the fact that the parser changes all the time. -- Jitse Niesen 07:55, 10 June 2006 (UTC)
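The change can be pictured as the difference between two patterns, sketched here in Python. Both regular expressions and the function name `heading_level` are assumptions made for illustration; they are not the expressions used by the parser.

```python
import re

# Whole-line form: only whitespace may follow the closing run of "=".
STRICT = re.compile(r"^(={1,6})(.+?)\1[ \t]*$")
# Older form as described in this thread: trailing text is allowed.
LENIENT = re.compile(r"^(={1,6})(.+?)\1(.*)$")


def heading_level(line, strict=True):
    """Return the heading level of a line, or None if it is not a heading."""
    match = (STRICT if strict else LENIENT).match(line)
    return len(match.group(1)) if match else None
```

Under the strict pattern a line such as `== heading == with more text` is ordinary text, while the lenient pattern still reports a level-2 heading for it.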
- Maybe we should work from the latest major point release (i.e. 1.6), otherwise there will be too much shifting of the ground. For example, until 1.7 is released, we can't be sure that changes such as the above are permanent (rather than a bug, or a test feature that may be reverted before release). If we're too near the bleeding edge then we'll never be able to pin anything down! --HappyDog 12:46, 12 June 2006 (UTC)
That's a good point. However, according to the MediaWiki roadmap, 1.7 will be released in two weeks' time. So I suggest we stick with the bleeding edge for the moment and work from 1.7 when it comes out. -- Jitse Niesen 14:36, 17 June 2006 (UTC)
- OK, but I would suggest that once 1.7 is released, we work to 1.7, even if 1.7.1, 1.7.2,... 1.7.15 are released. When 1.8 is released we can then incorporate the changes introduced at that point and from then on are working to 1.8. Minor point releases shouldn't be making any significant changes to the markup anyway, so I think this will work and avoid the problem of trying to keep up-to-date with something that could, potentially, change every day. --HappyDog 01:01, 18 June 2006 (UTC)
I agree. -- Jitse Niesen 11:15, 18 June 2006 (UTC)
Error handling 
I think the specification of the grammar should not be limited to describing which constructs are well-formed; it should also say how invalid constructs are handled and how errors are detected.
It might help to make some of the definitions stricter than they are now, so that the parser can detect an error early without consuming too much memory or processing time.
That means a stricter definition of which nestings can occur and which constructs definitely terminate other constructs.
Finally, there should be some agreement about how errors influence the result. After all, we do not want the parser to just produce an error (as happens, for example, when parsing XHTML in strict mode); we want the parsed input to still be usable in some way (more like HTML tag soup processing).
18.104.22.168 11:17, 25 January 2007 (UTC)