Parsing/Notes/Moving Parsoid Into Core/Porting

As part of the Moving Parsoid Into Core project, the code is being ported from JavaScript to PHP. This page is a collection of notes towards that effort.

Introduction

 * PHP Coding Conventions adopted by MediaWiki
 * Running PHP linters and tests via composer

Regex
PHP uses PCRE. Via Anomie:

After some searching, it looks like you should generally be good to go for JS→PCRE for ASCII strings and patterns, most of the differences seem to be features that PCRE has and JS lacks. Differences of note include:


 * JS  matches against a bunch of Unicode whitespace (e.g. , the non-breaking space) while PCRE by default only matches ASCII whitespace.
 * JS  is more or less the same as   (it's interpreted as "the inverse of the empty set"), while in PCRE it's a syntax error (it's interpreted like , so the character class is unterminated).
 * There are some differences if you do weird things with capturing groups and backreferences, e.g. referencing a capturing group that didn't match the pattern (like  or any of the capturing groups in , or referencing a group inside itself (like  ), or forward-referencing a group (like  ). In general, JS ignores the backreference (or treats it as the empty string) while PCRE fails the match.

There are more differences if you're needing Unicode-sensitive matching, beginning with the fact that PCRE needs the 'u' modifier on the pattern (e.g., which BTW matches everything JS   does except  ) and uses   (which can handle all code points   to  ) while JS uses   and cannot directly handle code points above. Plus the fact that we generally use UTF-8 strings in PHP while JS uses either UCS-2 or UTF-16.