Parsing/Notes/Moving Parsoid Into Core/Porting

As part of the Moving Parsoid Into Core project, the code is being ported from JavaScript to PHP. This page is a collection of notes towards that effort.

Introduction

 * PHP Coding Conventions adopted by MediaWiki
 * Running PHP linters and tests via composer

Strings
JavaScript represents strings internally as UCS-2, although in some places it uses the superset UTF-16. Mostly this is transparent to the coder, although it's why a string holding the single character U+1F4A9 has a length of 2 (U+1F4A9 is represented using the surrogate pair U+D83D U+DCA9).

PHP's core string functions work on binary strings, although it also has multibyte string functions that work on strings with many different encodings.
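
As a rough illustration of how one string can have several "lengths", the following JavaScript sketch compares UTF-16 code units (what the JS length property reports), code points, and UTF-8 bytes (what a byte-oriented function would count for a UTF-8 string). The helper name describeLengths is made up for this example.

```javascript
// Hypothetical helper: report three different "lengths" of one string.
function describeLengths(str) {
  return {
    // What JS `length` reports: UTF-16 code units.
    utf16Units: str.length,
    // Array.from iterates by code point, so surrogate pairs count once.
    codePoints: Array.from(str).length,
    // Byte count of the UTF-8 encoding of the same text.
    utf8Bytes: new TextEncoder().encode(str).length,
  };
}

const poo = "\u{1F4A9}"; // U+1F4A9, outside the Basic Multilingual Plane
console.log(describeLengths(poo));        // utf16Units 2, codePoints 1, utf8Bytes 4
console.log(describeLengths("a\u00E1"));  // utf16Units 2, codePoints 2, utf8Bytes 3
```

Any length or offset that crosses the JS/PHP boundary needs to state which of these three units it is measured in.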

In practice much PHP code (including MediaWiki) represents strings using UTF-8, which has the benefit that most of those binary-string functions do the right thing in most cases. For example, searching a UTF-8 string for "X" or "á" using a bytewise-search function will never find spurious matches by looking at parts of another character's representation or by combining (parts of) multiple adjacent characters, as it could with UCS-2 or most other multibyte encodings. This also tends to work well with the fact that much of the Internet has standardized on UTF-8 (for many of these same reasons), so conversion is often not needed.
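
The claim about spurious matches can be checked with a small Node.js sketch. It encodes one CJK ideograph (U+4158, chosen only because its UTF-16 code unit contains the byte 0x58) in both encodings and searches the raw bytes for ASCII "X" (0x58). The use of Node's Buffer is an assumption of this example, not something the porting effort relies on.

```javascript
// Node.js sketch: does a bytewise search for ASCII "X" (0x58) give a false hit?
const ideograph = "\u4158"; // a single CJK ideograph; the text contains no "X"

const asUtf8 = Buffer.from(ideograph, "utf8");       // bytes E4 85 98
const asUtf16 = Buffer.from(ideograph, "utf16le");   // bytes 58 41

// In UTF-8, every byte of a multibyte character is >= 0x80, so no ASCII byte can appear.
const spuriousInUtf8 = asUtf8.includes(0x58);
// In UTF-16, one half of the code unit happens to equal the byte for "X".
const spuriousInUtf16 = asUtf16.includes(0x58);

console.log({ spuriousInUtf8, spuriousInUtf16 }); // false, true
```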

Some cases where care over binary versus UTF-8 may be required:
 * When passing string lengths or offsets over public interfaces, including to/from clients.
 * When using regular expressions (see Regex below). For example, a pattern element that isn't anchored by literals might match a partial character.
 * When dealing with user input. In MediaWiki, almost all fetching of GET/POST data goes through the UtfNormal library, which converts it to NFC and replaces invalid sequences with U+FFFD (the replacement character).
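
For the first case, offsets computed on the JS side count UTF-16 code units, while offsets into a UTF-8 string in PHP count bytes. A minimal sketch of the conversion, assuming the offset does not fall between the two halves of a surrogate pair (the helper name utf16OffsetToUtf8ByteOffset is hypothetical):

```javascript
// Hypothetical helper: convert a JS string offset (UTF-16 code units)
// into the byte offset of the same position in the UTF-8 encoding.
function utf16OffsetToUtf8ByteOffset(str, offset) {
  // Encode only the prefix; its byte length is the UTF-8 offset.
  // Assumption: `offset` does not split a surrogate pair.
  return new TextEncoder().encode(str.slice(0, offset)).length;
}

const text = "a\u{1F4A9}b";
const jsOffset = text.indexOf("b");                              // 3 code units
const byteOffset = utf16OffsetToUtf8ByteOffset(text, jsOffset);  // 1 + 4 = 5 bytes
console.log(jsOffset, byteOffset);
```

Encoding the prefix is linear in the offset, so code that converts many offsets in one string would probably want a single pass instead.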

String function equivalents
In general, remember that code points beyond U+FFFF use surrogate pairs in JS but not in PHP, which also influences string lengths and offsets.

Regex
PHP uses PCRE. Via Anomie:

After some searching, it looks like you should generally be good to go for JS→PCRE for ASCII strings and patterns; most of the differences seem to be features that PCRE has and JS lacks. Differences of note include:


 * JS \s matches against a bunch of Unicode whitespace (e.g. U+00A0, the non-breaking space) while PCRE by default only matches ASCII whitespace.
 * JS [^] matches any character (it's interpreted as "the inverse of the empty set"), while in PCRE it's a syntax error (the ] is treated as the first member of the class, so the character class is unterminated).
 * There are some differences if you do weird things with capturing groups and backreferences, e.g. referencing a capturing group that didn't take part in the match, referencing a group inside itself, or forward-referencing a group. In general, JS ignores the backreference (or treats it as the empty string) while PCRE fails the match.
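
The JS side of these differences can be observed directly. The sketch below only records what a JavaScript engine does; the patterns are illustrative examples chosen for this note, and each one is a spot where the ported PCRE pattern would need checking.

```javascript
// Each entry is true in JavaScript; per the notes above, PCRE behaves differently.
const jsRegexQuirks = {
  // \s covers Unicode whitespace such as the non-breaking space.
  whitespaceMatchesNbsp: /\s/.test("\u00A0"),
  // [^] is "not in the empty set", so it matches anything, including newlines.
  emptyNegatedClassMatchesNewline: /[^]/.test("\n"),
  // Group 1 did not take part in the match; \1 is treated as the empty string.
  unmatchedGroupBackref: /(a)?\1b/.test("b"),
  // Forward reference to a group that has not been captured yet.
  forwardBackref: /\1(a)/.test("a"),
  // Reference to a group from inside that same group.
  selfBackref: /(a\1)/.test("a"),
};
console.log(jsRegexQuirks);
```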

There are more differences if you need Unicode-sensitive matching, beginning with the fact that PCRE needs the 'u' modifier on the pattern (e.g. /\s/u, which BTW matches everything JS /\s/ does except U+FEFF) and uses \x{...} escapes (which can handle all code points U+0000 to U+10FFFF) while JS uses \uXXXX escapes and cannot directly handle code points above U+FFFF. On top of that, we generally use UTF-8 strings in PHP while JS uses either UCS-2 or UTF-16.
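
A short sketch of how JS patterns see a code point above U+FFFF. Without any flags, a pattern works on UTF-16 code units, so the character has to be written as its surrogate pair. Engines supporting ES2015 also accept a u flag that switches to code points; that flag is not covered by the notes above and is shown only for comparison.

```javascript
const pileOfPoo = "\u{1F4A9}"; // one code point, two UTF-16 code units

const unicodeChecks = {
  // Without the u flag, "." matches one code unit, so one dot is not enough.
  oneDotNoFlag: /^.$/.test(pileOfPoo),
  twoDotsNoFlag: /^..$/.test(pileOfPoo),
  // The character can be spelled as its surrogate pair with \uXXXX escapes.
  surrogatePairEscape: /^\uD83D\uDCA9$/.test(pileOfPoo),
  // With the ES2015 u flag, "." matches a whole code point.
  oneDotWithUFlag: /^.$/u.test(pileOfPoo),
  // JS \s also matches U+FEFF.
  bomIsWhitespace: /\s/.test("\uFEFF"),
};
console.log(unicodeChecks);
```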

Boolean tests
From Anomie:

In general,
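 * JS foo.bar === undefined => PHP !array_key_exists( 'bar', $foo ) (or !property_exists( $foo, 'bar' ) if $foo is an object) is strictly correct, but if null can be treated the same as undefined we prefer to do like the next bullet.
 * JS foo.bar === undefined || foo.bar === null => PHP !isset( $foo['bar'] )
 * JS foo === null => PHP $foo === null. is_null does exist, but there's no reason to use it.
 * JS !foo, where foo might be undefined => PHP empty( $foo )
 * JS !foo, where foo is supposed to be defined => PHP !$foo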

In the last two cases, keep in mind that in PHP the string "0" and the empty array are considered falsey while JS considers both of those truthy.
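
When auditing JS code before porting, it can help to flag values that would flip a truthiness test. A minimal sketch, covering only the two cases named above (the string "0" and the empty array); the helper name is hypothetical:

```javascript
// Hypothetical helper: true for values that are truthy in JS
// but would be considered falsey by PHP, per the note above.
function truthyInJsButFalseyInPhp(value) {
  return value === "0" || (Array.isArray(value) && value.length === 0);
}

const samples = ["0", [], "", 0, "a", [0]];
for (const sample of samples) {
  // Print the value, its JS truthiness, and whether PHP would disagree.
  console.log(JSON.stringify(sample), Boolean(sample), truthyInJsButFalseyInPhp(sample));
}
```

Only "0" and [] are flagged here; the other samples have the same truthiness on both sides.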

Domino (JS) vs PHP DOM
Domino seems to use upper-case tag names by default (LI, UL, DIV, ...) whereas the PHP DOM seems to use lower-case tag names by default (li, ul, div, ...). So, during porting, anywhere an upper-case tag name is encountered, it should be replaced with its lower-case form.
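
While both code bases coexist, a case-insensitive comparison avoids depending on either default. A minimal sketch using plain stand-in objects rather than real Domino or PHP DOM nodes; the helper name hasTagName is hypothetical:

```javascript
// Hypothetical helper: compare a node's tag name without depending on case.
function hasTagName(node, name) {
  return node.nodeName.toLowerCase() === name.toLowerCase();
}

// Stand-in objects for illustration only, not real DOM nodes.
const dominoStyleNode = { nodeName: "LI" };
const phpStyleNode = { nodeName: "li" };

console.log(hasTagName(dominoStyleNode, "li")); // true
console.log(hasTagName(phpStyleNode, "LI"));    // true
console.log(hasTagName(phpStyleNode, "ul"));    // false
```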