PEG tokenizer

From mediawiki.org


The PEG tokenizer is a component of Parsoid. It is a PEG-based wiki tokenizer that produces a combined token stream from wiki and HTML syntax. The PEG grammar is a context-free grammar that can be ported to different parser generators, mostly by adapting the parser actions to the target language. Currently we use WikiPEG, a fork of PEG.js, to build the actual JavaScript tokenizer. We try to do as much work as possible in the grammar-based tokenizer, so that the emitted tokens are already mostly syntax-independent.
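To illustrate what a combined, mostly syntax-independent token stream means, here is a minimal hand-written sketch. It is not Parsoid's grammar and does not use WikiPEG: the `tokenize` function, the token shapes (`tag`, `endTag`, `text`) and the `syntax` field are all hypothetical names chosen for this example. It only shows the idea that wiki bold (`'''...'''`) and HTML bold (`<b>...</b>`) can end up as the same kind of token, so later stages do not need to know which syntax the author used.

```javascript
// Toy illustration only: NOT Parsoid's grammar or its token classes.
// Wiki bold ('''...''') and HTML bold (<b>...</b>) both map to the same
// hypothetical token shapes; the "syntax" field merely records the origin.
function tokenize(input) {
  const tokens = [];
  let boldOpen = false; // whether the next ''' run closes an open wiki bold
  let pos = 0;
  let text = '';

  // Emit any accumulated plain text as a single text token.
  const flushText = () => {
    if (text) {
      tokens.push({ type: 'text', value: text });
      text = '';
    }
  };

  while (pos < input.length) {
    const rest = input.slice(pos);
    const htmlBold = /^<(\/?)b>/i.exec(rest);
    if (rest.startsWith("'''")) {
      // Wiki syntax: alternate between opening and closing bold.
      flushText();
      tokens.push({ type: boldOpen ? 'endTag' : 'tag', name: 'b', syntax: 'wiki' });
      boldOpen = !boldOpen;
      pos += 3;
    } else if (htmlBold) {
      // HTML syntax: a leading slash marks the end tag.
      flushText();
      tokens.push({ type: htmlBold[1] ? 'endTag' : 'tag', name: 'b', syntax: 'html' });
      pos += htmlBold[0].length;
    } else {
      text += input[pos];
      pos += 1;
    }
  }
  flushText();
  return tokens;
}
```

For example, `tokenize("'''a''' <b>b</b>")` yields seven tokens: a `tag`/`text`/`endTag` triple for the wiki bold, a `text` token for the space, and a `tag`/`text`/`endTag` triple for the HTML bold. The two triples differ only in the `syntax` field. A real PEG grammar would express the same mapping declaratively, with the token construction placed in the parser actions.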

Source code: https://phabricator.wikimedia.org/diffusion/GPAR/browse/master/lib/wt2html/tokenizer.js