Parsing/Notes/Moving Parsoid Into Core/Porting

As part of the Moving Parsoid Into Core project, the code is being ported from JavaScript to PHP. This page is a collection of notes towards that effort.

Introduction

 * PHP Coding Conventions adopted by MediaWiki
 * Running PHP linters and tests via composer

Strings
JavaScript represents strings internally as UCS-2, although in some places it uses the superset UTF-16. Mostly this is transparent to the coder, although it's why a single astral character such as U+1F4A9 has a length of 2 in JS (U+1F4A9 is represented using the surrogate pair U+D83D U+DCA9).

PHP's core string functions work on binary strings, although it also has multibyte string functions that work on strings with many different encodings.

In practice much PHP code (including MediaWiki) represents strings using UTF-8, which has the benefit that most of those binary-string functions do the right thing in most cases. For example, searching a UTF-8 string for "X" or "á" using a bytewise-search function will never find spurious matches by looking at parts of another character's representation or by combining (parts of) multiple adjacent characters, as it could with UCS-2 or most other multibyte encodings. This also tends to work well with the fact that much of the Internet has standardized on UTF-8 (for many of these same reasons), so conversion is often not needed.

Some cases where care over binary versus UTF-8 may be required:
 * When passing string lengths or offsets over public interfaces, including to/from clients.
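 * When using regular expressions (see the Regex section below). For example, a pattern element that can match arbitrary bytes, such as a wildcard or a negated character class, might match a partial character if it isn't anchored by literals.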
 * When dealing with user input. In MediaWiki, almost all the fetching of GET/POST data passes through the UtfNormal library to be converted to NFC and have invalid sequences replaced with U+FFFD (the replacement character).

We'd discussed trying to maintain UTF-16 offsets in the DSR pass, in order to more easily compare the PHP and JS implementations. This is a bit tricky: mb_strlen will return the number of *codepoints*, which is almost-but-not-quite right. Those darn astral characters are two JavaScript "characters" but only one "codepoint". That is, mb_strlen of a single astral character is 1, but we want (hold your nose) its length in UTF-16 code units, which is 2.
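The difference between bytes, code points, and UTF-16 code units can be sketched as follows. This is only an illustration, assuming the mbstring extension is available; the helper name `utf16Length` is hypothetical and not part of Parsoid.

```php
<?php
/**
 * Hypothetical helper: length of a UTF-8 string in UTF-16 code units,
 * i.e. the number JavaScript's String length would report.
 */
function utf16Length( string $utf8 ): int {
	// Every code point counts once ...
	$codePoints = mb_strlen( $utf8, 'UTF-8' );
	// ... and astral code points (above U+FFFF) need a surrogate pair
	// in UTF-16, so each of them counts one extra time.
	$astral = preg_match_all( '/[\x{10000}-\x{10FFFF}]/u', $utf8 );
	return $codePoints + $astral;
}

$poo = "\u{1F4A9}";
$bytes = strlen( $poo );               // 4 (UTF-8 bytes)
$points = mb_strlen( $poo, 'UTF-8' );  // 1 (code point)
$units = utf16Length( $poo );          // 2 (UTF-16 code units)
```

Counting astral characters separately avoids converting the whole string to UTF-16 just to measure it.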

String function equivalents
In general, remember that code points beyond U+FFFF use surrogate pairs in JS but not in PHP, which also influences string lengths and offsets.

Regex
PHP uses PCRE. Via Anomie:

After some searching, it looks like you should generally be good to go for JS→PCRE for ASCII strings and patterns; most of the differences seem to be features that PCRE has and JS lacks. Differences of note include:


 * JS \s matches against a bunch of Unicode whitespace (e.g. U+00A0, the non-breaking space) while PCRE by default only matches ASCII whitespace. (CSA audited most places in Parsoid once upon a time to replace \s with a PHP-compatible character class, but...)
 * JS  (or  ) is more or less the same as   (it's interpreted as "the inverse of the empty set"), while in PCRE it's a syntax error (it's interpreted like , so the character class is unterminated).
 * JS can use  as a regex that will never match anything. In PCRE, use.
 * There are some differences if you do weird things with capturing groups and backreferences, e.g. referencing a capturing group that didn't match the pattern (like  or any of the capturing groups in , or referencing a group inside itself (like  ), or forward-referencing a group (like  ). In general, JS ignores the backreference (or treats it as the empty string) while PCRE fails the match.

There are more differences if you're needing Unicode-sensitive matching, beginning with the fact that PCRE needs the 'u' modifier on the pattern (e.g., which BTW matches everything JS   does except  ) and uses   (which can handle all code points   to  ) while JS uses   and cannot directly handle code points above. Plus the fact that we generally use UTF-8 strings in PHP while JS uses either UCS-2 or UTF-16.
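A few of the differences above can be checked directly with `preg_match`. The patterns below are illustrative choices, not taken from Parsoid. In particular, `(?!)` is just one PCRE pattern that is known to never match; the notes above do not say which one they recommend.

```php
<?php
$nbsp = "\u{00A0}"; // NO-BREAK SPACE, stored as the UTF-8 bytes C2 A0

// Without the 'u' modifier the subject is treated as bytes and \s is
// ASCII-only, so this is not expected to match (assuming the default
// "C" locale).
$byteMode = preg_match( '/\s/', $nbsp );

// With the 'u' modifier the subject is decoded as UTF-8 and \s becomes
// Unicode-aware, so the non-breaking space matches.
$utf8Mode = preg_match( '/\s/u', $nbsp );

// An empty negative lookahead can never succeed in PCRE.
$never = preg_match( '/(?!)/', 'anything' );

// Backreference to a group that did not participate in the match:
// PCRE fails the match, whereas the same regex in JS would treat \1
// as empty and match the string 'b'.
$backref = preg_match( '/^(a)?\1b$/', 'b' );
```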

Boolean tests
From Anomie:

In general,
 * JS  => PHP   (or   if $foo is an object) is strictly correct, but if null can be treated the same as undefined we prefer to do like the next bullet.
 * JS  => PHP
 * JS  => PHP  . is_null does exist, but there's no reason to use it.
 * JS, where it might be undefined => PHP
 * JS, where it is supposed to be defined => PHP

In the last two cases, keep in mind that in PHP the string "0" and the empty array are considered falsey while JS considers both of those truthy.
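The two truthiness differences can be made concrete with a small sketch. The helper `jsTruthy` is hypothetical, written only to illustrate the JS rules for the PHP types that differ; it is not an existing Parsoid or MediaWiki function.

```php
<?php
/**
 * Hypothetical helper: evaluate a value using JavaScript's truthiness
 * rules rather than PHP's.
 */
function jsTruthy( $value ): bool {
	if ( is_string( $value ) ) {
		// In JS only the empty string is falsy; "0" is truthy.
		return $value !== '';
	}
	if ( is_array( $value ) ) {
		// JS arrays and objects are always truthy, even when empty.
		return true;
	}
	if ( is_float( $value ) && is_nan( $value ) ) {
		// NaN is falsy in JS.
		return false;
	}
	return (bool)$value;
}

$phpZeroString = (bool)'0';      // false in PHP
$phpEmptyArray = (bool)[];       // false in PHP
$jsZeroString = jsTruthy( '0' ); // true, as in JS
$jsEmptyArray = jsTruthy( [] );  // true, as in JS
```

In most ported code an explicit comparison (for example against the empty string or against null) is clearer than a helper like this; the sketch only shows where the two languages disagree.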

Depending on the context, JS calls to  can either be removed entirely or be replaced like JS.

Tests for method existence with code like JS  can be directly translated to PHP , but it's often more idiomatic to test PHP   instead.

Callbacks
In PHP, you can do callbacks in five ways:
 * 1) Referring to a (global) function, by passing its name as a string:
 * 2) Referring to a static class method, by passing the class and method as a string:
 * 3) Referring to a static class method, by passing the class and method as a two-element string array:
    * Note  is the same as the string , but the former is preferred because it does the expected thing with respect to   and  , as well as making it easier for static analysis tools.
    * You can get the "compile"-time class name with, and the runtime class name (e.g. the class of  ) with.
 * 4) Referring to an object instance method, by passing the object instance and method name as a two-element array:
 * 5) As an anonymous function/closure, represented as a Closure object:
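The five callback forms listed above can be sketched as follows. The `Greeter` class and its methods are made up for this illustration.

```php
<?php
// Hypothetical class used only to demonstrate the callback forms.
class Greeter {
	/** @var string */
	private $greeting;

	public function __construct( string $greeting ) {
		$this->greeting = $greeting;
	}

	public static function shout( string $s ): string {
		return strtoupper( $s );
	}

	public function greet( string $name ): string {
		return $this->greeting . ', ' . $name;
	}
}

$words = [ 'a', 'b' ];
$greeter = new Greeter( 'Hello' );

// 1) Global function, by name
$r1 = array_map( 'strtoupper', $words );
// 2) Static method, as a single string
$r2 = array_map( 'Greeter::shout', $words );
// 3) Static method, as a two-element string array
$r3 = array_map( [ Greeter::class, 'shout' ], $words );
// 4) Instance method, as [ object, method name ]
$r4 = array_map( [ $greeter, 'greet' ], $words );
// 5) Anonymous function (a Closure object)
$r5 = array_map( function ( string $s ): string {
	return $s . '!';
}, $words );
```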

When documenting or typehinting, use "callable" as the type name. Note that Doxygen doesn't allow for directly documenting the arguments and return type like the JS documentor does; you'll have to describe them in prose if it's relevant.

Anonymous functions/closures in PHP work much like in JS, however only  from the outer scope is available by default. You can pass variables from the current scope into the closure like. This works much like argument passing for a function call:  and   are passed by value and later reassignments in the outer scope won't change the values seen inside the closure, while   is passed by reference and any modifications in either scope will be seen in both scopes.
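A short sketch of by-value versus by-reference capture, with variable names chosen only for this illustration:

```php
<?php
$byValue = 'old';
$byRef = 'old';

// $byValue is copied when the closure is created; $byRef is shared.
$peek = function () use ( $byValue, &$byRef ): array {
	return [ $byValue, $byRef ];
};

// Reassign both variables in the outer scope after creating the closure.
$byValue = 'new';
$byRef = 'new';
$seen = $peek(); // [ 'old', 'new' ]

// A write through the reference inside a closure is visible outside too.
$bump = function () use ( &$byRef ): void {
	$byRef = 'changed inside';
};
$bump();
// $byRef is now 'changed inside'
```

This differs from JS, where a closure always sees the current value of an outer variable without any explicit capture list.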

PHP doesn't have the ability to override an object's defined methods at runtime; if you need to do that, you might use a property holding a callable instead.
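A minimal sketch of the property-holding-a-callable approach; the class and property names are hypothetical. Note the parentheses needed to invoke a callable stored in a property, since without them PHP would look for a method with that name.

```php
<?php
// Hypothetical class: behaviour that JS code would override by
// reassigning a method is kept in a callable property instead.
class TokenHandler {
	/** @var callable */
	public $transform;

	public function __construct() {
		// Default behaviour: return the token unchanged.
		$this->transform = function ( string $token ): string {
			return $token;
		};
	}

	public function handle( string $token ): string {
		// Parenthesize the property access to call the stored callable.
		return ( $this->transform )( $token );
	}
}

$handler = new TokenHandler();
$before = $handler->handle( 'abc' ); // 'abc'

// "Override" the behaviour at runtime by swapping the callable.
$handler->transform = 'strtoupper';
$after = $handler->handle( 'abc' );  // 'ABC'
```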

JS code passing  as a callback will be basically equivalent to PHP code passing. You may be able to do some other JS bind tricks using Closure::fromCallable, Closure::bindTo, and Closure::call, although if possible that should probably be avoided.

Ternary operator associativity
JavaScript's ternary operator is right-associative, i.e.  is interpreted as , which is generally what you'd expect. In PHP, however, it's left-associative:  is interpreted as. In PHP you should always use explicit parentheses when writing an expression with nested ternary operators to avoid confusion, and phpcs will flag it as an issue if you don't.

Note this doesn't affect the binary operators  and. works like  like you'd expect, just like JS.
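The two groupings can be compared by writing the parentheses out explicitly. The variable names are illustrative. Newer PHP versions reject the unparenthesized nested form altogether (deprecated in PHP 7.4, an error in PHP 8.0), so the sketch below only uses parenthesized expressions.

```php
<?php
$isFirst = true;
$isSecond = false;

// JavaScript grouping (right-associative): a ? b : ( c ? d : e )
$jsGrouping = $isFirst ? 'first' : ( $isSecond ? 'second' : 'third' );

// Historical PHP grouping (left-associative): ( a ? b : c ) ? d : e
// The inner result 'first' is truthy, so the outer ternary yields 'second'.
$phpGrouping = ( $isFirst ? 'first' : $isSecond ) ? 'second' : 'third';

// Chained short ternaries need no parentheses: both groupings agree.
$a = null;
$b = 0;
$c = 'fallback';
$chained = $a ?: $b ?: $c;
```

When porting a nested JS ternary, adding the right-associative parentheses as in `$jsGrouping` preserves the original meaning.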

Domino (JS) vs PHP DOM
Domino reports tag names in upper case by default (LI, UL, DIV, ...), whereas the PHP DOM seems to report them in lower case (li, ul, div, ...). So, during porting, any upper-case tag name in the JS code should be replaced with its lower-case form.
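A quick way to see the lower-case behaviour, assuming PHP's DOM extension is enabled and the markup is parsed with `DOMDocument::loadHTML`:

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML( '<ul><li>item</li></ul>' );

$li = $doc->getElementsByTagName( 'li' )->item( 0 );

// Where the JS code compared against 'LI', the PHP port compares
// against the lower-case name.
$name = $li->nodeName;
$isListItem = ( $li->nodeName === 'li' );
$parentName = $li->parentNode->nodeName;
```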

PHP's DOM implementation is based on DOM Level 1 Core and DOM Level 2 XML, and lacks the features added in higher levels of those modules and features from the DOM HTML module. See T215000 for further discussion.

A  utility class is being implemented to address DOM functionality gaps and those helpers should be used for all missing functionality.

Code replacement tips

 * Replace  in JS code with   in PHP. That JS Util helper existed to deal with language differences in the first place, and in a port, we can go back to the original PHP version.
 * Replace  with
 * Do not use  (see Requests for comment/Assert for more info if you want to dig into this more). The problem is apparently ameliorated in PHP7 but not entirely gone. So, the recommendation is to use   for temporary asserts during porting (for flagging unported methods or for redirecting uses to other native PHP methods) or throwing appropriate semantic exception classes like   for example.
 * In JS land, strings and other non-string tokens are all objects. But, in PHP, string is a primitive type and non-string tokens are non-primitive types. So, the token type tests used in JS cannot be used as is in PHP. You can neither use  nor do a   on strings. You would first have to check   before testing for other types. To DRY this,   provide two helpers that let the rest of the codebase deal with this uniformly just like in JS.
 * In JavaScript  and   will produce the first truthy value, while in PHP they will always produce boolean true/false. This is sometimes used in JS to assign default values, as in  . PHP has   and   operators that can sometimes be used for this purpose.
 * is shorthand for, and can be used for JS   as long as   is known to exist. Thus, it can't be safely used when   is something like   or.
 * is shorthand for, and can be used for JS   when you only want to use   if   is undefined or null. To be clear: if   is boolean false, 0, empty string, string-0, or the empty array, that will use  , not
 * In other cases, use the ternary operator or explicit  statements, e.g..
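The difference between PHP's short ternary `?:` and null-coalescing `??` operators when porting JS default-value idioms can be sketched as follows. The `$options` array and its keys are made up for the example.

```php
<?php
$options = [ 'limit' => 0, 'title' => null ];

// ?: falls back on any falsy value, so a legitimate 0 is replaced.
$elvisLimit = $options['limit'] ?: 50;

// ?? falls back only when the value is unset or null, so 0 survives.
$coalesceLimit = $options['limit'] ?? 50;

// ?? is also safe on keys that do not exist (no notice is raised).
$title = $options['title'] ?? 'Untitled';
$missing = $options['missing'] ?? 'n/a';

// When neither operator has the right semantics, spell the test out.
$explicit = isset( $options['limit'] ) && $options['limit'] > 0
	? $options['limit']
	: 50;
```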
 * Array membership in PHP is . Note the last parameter, which enables strict mode and checks types as well. This is important since DOM node comparison seems to give incorrect results without strict mode. The same applies to   which is used to get the index of   in.
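The effect of strict mode can be sketched with `in_array` and `array_search`. Plain `stdClass` objects stand in for DOM nodes here; that substitution is an assumption made to keep the example self-contained.

```php
<?php
// Two distinct objects that happen to look identical.
$first = new stdClass();
$second = new stdClass();
$nodes = [ $first, $second ];

// Loose mode uses ==, which compares objects by class and properties,
// so an unrelated but identical-looking object is "found".
$looseHas = in_array( new stdClass(), $nodes );
// Strict mode uses ===, which requires the very same object instance.
$strictHas = in_array( new stdClass(), $nodes, true );

// The same applies when looking up an index.
$looseIndex = array_search( $second, $nodes );        // 0: matches $first
$strictIndex = array_search( $second, $nodes, true ); // 1: the real position
```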

Make use of types

 * Declare strict types: By default, we are going to add the  declaration for strict types in all files.
 * Use type hints: Type hints provide all the benefits of typing. Let us use them wherever possible and catch bugs early. We are going to be on PHP 7.1+ and we can use scalar type hints, nullable type hints for args and return types, as well as void return types.
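A small sketch of what strict types plus PHP 7.1 type hints look like in practice. The function names are hypothetical and only serve to show nullable types, void returns, and the TypeError raised in strict mode.

```php
<?php
declare( strict_types = 1 );

// Nullable return type: null signals "not found".
function findOffset( string $haystack, string $needle ): ?int {
	$pos = strpos( $haystack, $needle );
	return $pos === false ? null : $pos;
}

// Nullable argument and void return type.
function appendToken( array &$tokens, ?string $token ): void {
	if ( $token !== null ) {
		$tokens[] = $token;
	}
}

$tokens = [];
appendToken( $tokens, 'a' );
appendToken( $tokens, null );

// In strict mode an int is not silently converted to a string.
try {
	findOffset( 'abc', 1 );
	$caught = false;
} catch ( TypeError $e ) {
	$caught = true;
}
```

Without the strict types declaration the last call would coerce 1 to "1" and return null instead of failing, which is the kind of bug the declaration is meant to surface early.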

Testing ported code
Given that Parsoid primarily relies on a lot of integration tests (parser tests, mocha tests, roundtrip testing on production pages), we cannot directly use those tests for testing ported code. But, we are not going to wait till the end of the port to actually test everything. Depending on the code in question, there are different testing modalities available.

WT2HTML: Token transformers
-- to be documented --

WT2HTML: DOM transformers
Generate pre- and post-transform DOM outputs from the JavaScript parse of a page, and then use them to verify the transform. If the output fails to match, the  flag can be used to dump it for comparison.

Note that all the regular uses of  are in play here, so simplified test cases, as opposed to whole pages, can be used to narrow down the focus.

HTML2WT: DOMDiff
-- to be documented --

HTML2WT: DOM Normalization
-- to be documented --

Other code
-- to be documented --