Parsing/Notes/Moving Parsoid Into Core/Porting

As part of the Moving Parsoid Into Core project, the code is being ported from JavaScript to PHP. This page is a collection of notes towards that effort.

Introduction

 * PHP Coding Conventions adopted by MediaWiki
 * Running PHP linters and tests via composer

Regex
PHP uses PCRE. Via Anomie:

After some searching, it looks like you should generally be good to go for JS→PCRE with ASCII strings and patterns; most of the differences are features that PCRE has and JS lacks. Differences of note include:


 * JS /\s/ matches against a bunch of Unicode whitespace (e.g. U+00A0, the non-breaking space) while PCRE by default only matches ASCII whitespace.
 * JS /[^]/ is more or less the same as /./ (it's interpreted as "the inverse of the empty set", so unlike /./ it also matches newlines), while in PCRE it's a syntax error (it's interpreted like /[^\]/, so the character class is unterminated).
 * There are some differences if you do weird things with capturing groups and backreferences, e.g. referencing a capturing group that didn't match the pattern (like /(.)?\1/ or any of the capturing groups in /(?:(a)|(b)|(c))/), or referencing a group inside itself (like /(.\1)/), or forward-referencing a group (like /\1(.)/). In general, JS ignores the backreference (or treats it as the empty string) while PCRE fails the match.
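
The JS half of the three differences above can be checked with a short script. This is only a sketch: the variable names are made up for illustration, and the PCRE behaviour mentioned in the comments is taken from the list above rather than executed here.

```javascript
// 1. JS \s covers Unicode whitespace such as U+00A0
//    (per the list above, PCRE by default only matches ASCII whitespace).
const nbspIsSpace = /\s/.test('\u00A0');

// 2. [^] is "not in the empty set", so it matches any character, newline included,
//    whereas a plain . skips newlines (per the list above, PCRE rejects [^] as unterminated).
const emptyNegatedMatchesNewline = /[^]/.test('\n');
const dotMatchesNewline = /./.test('\n');

// 3. A backreference to a group that has not captured anything matches the empty string
//    (per the list above, PCRE fails the match instead).
const unmatchedGroupRef = /^(a)?\1b$/.test('b'); // group 1 never participates
const forwardRef = /\1(.)/.exec('a');            // \1 used before group 1 is closed
const selfRef = /(.\1)/.exec('ab');              // \1 used inside group 1 itself

console.log(nbspIsSpace, emptyNegatedMatchesNewline, dotMatchesNewline);
console.log(unmatchedGroupRef, forwardRef[0], selfRef[0]);
```

Any JS pattern that relies on one of these behaviours is a candidate for manual review when it is ported, since the same pattern text will not behave the same way under PCRE.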

There are more differences if you need Unicode-sensitive matching, beginning with the fact that PCRE needs the 'u' modifier on the pattern (e.g. '/\s/u', which BTW matches everything JS /\s/ does except U+FEFF) and uses "\x{####}" (which can handle all code points U+0000 to U+10FFFF), while JS uses "\u####" and, unless the ES2015 'u' flag is set, cannot directly handle code points above U+FFFF. On top of that, we generally use UTF-8 strings in PHP, while JS uses either UCS-2 or UTF-16.
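
The JS side of the Unicode points above can be sketched as follows. The variable names are illustrative only, the sample character U+1F600 is an arbitrary choice of a code point above U+FFFF, and `TextEncoder` is assumed to be available as a global (as it is in current browsers and Node.js). The ES2015 `u` flag used at the end is a JS feature, not something the PHP port needs; it is shown only to contrast with the default behaviour.

```javascript
// U+FEFF counts as whitespace for JS \s (the text above notes PCRE's /\s/u excludes it).
const bomIsSpace = /\s/.test('\uFEFF');

// U+1F600 is above U+FFFF: a JS string stores it as two UTF-16 code units,
// while the same character encoded as UTF-8 (the usual PHP encoding) takes four bytes.
const astral = '\uD83D\uDE00';
const utf16Units = astral.length;
const utf8Bytes = new TextEncoder().encode(astral).length;

// Without the `u` flag the regex engine sees two separate code units,
// so a single "." cannot cover the whole character.
const dotWithoutU = /^.$/.test(astral);
const dotWithU = /^.$/u.test(astral);

// With the `u` flag, \u{...} names the code point directly, much like PCRE's \x{1F600}.
const codePointEscape = /^\u{1F600}$/u.test(astral);

console.log(bomIsSpace, utf16Units, utf8Bytes, dotWithoutU, dotWithU, codePointEscape);
```

The different unit counts matter for any ported code that uses string offsets or lengths next to regex matches, because JS counts UTF-16 code units while PHP byte-string functions count UTF-8 bytes.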