Parsing/Notes/Moving Parsoid Into Core/Porting

As part of the Moving Parsoid Into Core project, the code is being ported from JavaScript to PHP. This page is a collection of notes towards that effort.

Introduction

 * PHP Coding Conventions adopted by MediaWiki
 * Running PHP linters and tests via composer

Regex
PHP uses PCRE. Via Anomie:

After some searching, it looks like you should generally be good to go for JS→PCRE for ASCII strings and patterns, most of the differences seem to be features that PCRE has and JS lacks. Differences of note include:


 * JS  matches against a bunch of Unicode whitespace (e.g. , the non-breaking space) while PCRE by default only matches ASCII whitespace.
 * JS  is more or less the same as   (it's interpreted as "the inverse of the empty set"), while in PCRE it's a syntax error (it's interpreted like , so the character class is unterminated).
 * There are some differences if you do weird things with capturing groups and backreferences, e.g. referencing a capturing group that didn't match the pattern (like  or any of the capturing groups in , or referencing a group inside itself (like  ), or forward-referencing a group (like  ). In general, JS ignores the backreference (or treats it as the empty string) while PCRE fails the match.

There are more differences if you're needing Unicode-sensitive matching, beginning with the fact that PCRE needs the 'u' modifier on the pattern (e.g., which BTW matches everything JS   does except  ) and uses   (which can handle all code points   to  ) while JS uses   and cannot directly handle code points above. Plus the fact that we generally use UTF-8 strings in PHP while JS uses either UCS-2 or UTF-16.

Boolean tests
From Anomie:

In general,
 * JS  => PHP   (or   if $foo is an object) is strictly correct, but if null can be treated the same as undefined we prefer to do like the next bullet.
 * JS  => PHP
 * JS  => PHP  . is_null does exist, but there's no reason to use it.
 * JS, where it might be undefined => PHP
 * JS, where it is supposed to be defined => PHP

In the last two cases, keep in mind that in PHP the string "0" and the empty array are considered falsey while JS considers both of those truthy.