Markup spec/OCaml

Draft lexer for the MediaWiki markup implemented in OCaml.

Other wikis have benefited from using a real lexer instead of a series of regular expressions. For example, the Lexer from Wookee engine for UseModWiki (which is unfortunately written in Perl, not PHP) could be even quite usable if html attributes parsing was added. (The Parser, however, is questionable).

Later some proof-of-concept parser could be made. It should probably generate some AST, not (X)HTML directly, so it can be used in many useful bots too. Or have 2 modes - AST for bots and HTML for maximum performance.

Idea

 * New Line
 * End of File
 * whitespace +
 * (\* | # )+ (only at linestart)
 * ={1,6} (at linestart or lineend)
 * [ [ articlenamespecification]]
 * [ [ articlenamespecification|
 * ] ]
 * urlspecification
 * [ urlspecification
 * ]
 * &lt;pre> anything_but_close_pre &lt;/pre>
 * &lt;nowiki> anything_but_close_nowiki &lt;/nowiki>
 * &lt;math> anything_but_close_math &lt;/math>
 * &lt;!-- anything_but_close_html_comment -->
 * ISBN whitespace [0-9X-]+
 * RFC whitespace \d+
 * (one token per valid HTML tag)
 * & entityspecification ;
 * anyothercharacter
 * variables (+ magic to parse "articlename" for variables to emulate current multipass parser)
 * &lt;math> anything_but_close_math &lt;/math>
 * &lt;!-- anything_but_close_html_comment -->
 * ISBN whitespace [0-9X-]+
 * RFC whitespace \d+
 * (one token per valid HTML tag)
 * & entityspecification ;
 * anyothercharacter
 * variables (+ magic to parse "articlename" for variables to emulate current multipass parser)

Regular expressions used:
 * articlenamespecification = ???
 * entityspecification = ???
 * urlspecification = ???
 * whitespace = [\s|\t]

= Code = Mini-lexer has 4 files lexer.mll, util.ml, tokens.ml and main.ml (which is kind of trivial parser). It's incomplete: