Parsing/Notes/Moving Parsoid Into Core/Porting

From mediawiki.org

As part of the Moving Parsoid Into Core project, the code is being ported from JavaScript to PHP. This page is a collection of notes towards that effort.


Language differences


JavaScript represents strings internally as UCS-2, although in some places it uses the superset UTF-16. Mostly this is transparent to the coder, although it's why '💩'.length === 2 (U+1F4A9 is represented using the surrogate pair U+D83D U+DCA9).
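The difference between code units, code points, and bytes can be seen by measuring the same string three ways. This is a minimal sketch for Node.js; the variable names are illustrative, and TextEncoder is used only to count UTF-8 bytes, which is roughly what a bytewise PHP function would see.

```javascript
// U+1F4A9 is outside the Basic Multilingual Plane, so JS stores it as a
// surrogate pair of two 16-bit code units.
const pileOfPoo = '\u{1F4A9}';

// .length counts UTF-16 code units.
const codeUnitCount = pileOfPoo.length;

// The string iterator walks code points, so spreading yields one element.
const codePointCount = [...pileOfPoo].length;

// UTF-8 byte count, i.e. the length a bytewise function would report.
const utf8ByteCount = new TextEncoder().encode(pileOfPoo).length;

// charCodeAt() exposes the individual surrogates; codePointAt() joins them.
const highSurrogate = pileOfPoo.charCodeAt(0).toString(16);
const lowSurrogate = pileOfPoo.charCodeAt(1).toString(16);
const codePoint = pileOfPoo.codePointAt(0).toString(16);

console.log(codeUnitCount, codePointCount, utf8ByteCount); // 2 1 4
console.log(highSurrogate, lowSurrogate, codePoint);       // d83d dca9 1f4a9
```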

PHP's core string functions work on binary strings, although it also has multibyte string functions that work on strings with many different encodings.

In practice much PHP code (including MediaWiki) represents strings using UTF-8, which has the benefit that most of those binary-string functions do the right thing in most cases. For example, searching a UTF-8 string for "X" or "á" using a bytewise-search function will never find spurious matches by looking at parts of another character's representation or by combining (parts of) multiple adjacent characters, as it could with UCS-2 or most other multibyte encodings. This also tends to work well with the fact that much of the Internet has standardized on UTF-8 (for many of these same reasons), so conversion is often not needed.

Some cases where care over binary versus UTF-8 may be required:

  • When passing string lengths or offsets over public interfaces, including to/from clients.
  • When using regular expressions (see #Regex). For example, a . or .* that isn't anchored by literals might match a partial character.
  • When dealing with user input. In MediaWiki, almost all the fetching of GET/POST data goes through UtfNormal\Validator::cleanUp() from the UtfNormal library to be converted to NFC and have invalid sequences replaced with �.

We'd discussed trying to maintain UTF-16 offsets in the DSR pass, in order to more easily compare the PHP and JS implementations. This is a bit tricky: mb_strlen(<utf8 string>) will return the number of *codepoints*, which is almost-but-not-quite right. Those darn astral characters (code points above U+FFFF) are two JavaScript "characters" but only one "codepoint". That is, mb_strlen( "\u{1F4A9}" ) is 1, but we would want (hold your nose) strlen( mb_convert_encoding( "\u{1F4A9}", 'utf-16' ) ) / 2 or mb_strlen( mb_convert_encoding( "\u{1F4A9}", 'utf-16' ), 'ucs-2' ), either of which is 2.
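When comparing offsets between the two implementations, one option is to convert on the JS side instead. The helpers below are a hypothetical sketch (utf16ToUtf8Offset and utf16ToCodePointOffset are illustrative names, not part of Parsoid). They assume the offset does not fall in the middle of a surrogate pair.

```javascript
// Hypothetical helpers, not part of Parsoid: translate an offset measured in
// UTF-16 code units (JS) into other units for the same string.
const encoder = new TextEncoder();

// Offset in UTF-8 bytes, comparable to PHP strlen()/substr() offsets.
function utf16ToUtf8Offset(str, utf16Offset) {
  // The UTF-8 length of the prefix is the byte offset of the position.
  return encoder.encode(str.slice(0, utf16Offset)).length;
}

// Offset in code points, comparable to PHP mb_strlen()/mb_substr() offsets.
function utf16ToCodePointOffset(str, utf16Offset) {
  return [...str.slice(0, utf16Offset)].length;
}

// 'a' (1 unit), U+00E1 (1 unit), U+1F4A9 (2 units), 'b' (1 unit)
const sample = 'a\u00E1\u{1F4A9}b';
const unitOffsetOfB = sample.indexOf('b');

console.log(
  unitOffsetOfB,                                  // 4 UTF-16 code units
  utf16ToCodePointOffset(sample, unitOffsetOfB),  // 3 code points
  utf16ToUtf8Offset(sample, unitOffsetOfB)        // 7 UTF-8 bytes (1 + 2 + 4)
);
```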

String function equivalents

In general, remember that code points beyond U+FFFF use surrogate pairs in JS but not in PHP, which also influences string lengths and offsets.

| JavaScript | PHP binary | PHP multibyte | Notes |
| --- | --- | --- | --- |
| .length | strlen() | mb_strlen() | Be careful: mb_strlen is "codepoints" and not actually the same as what JS would return. (See above.) |
| .substr(), .substring(), .charAt() | substr() | mb_substr() | Note that JS .substring() takes an ending index while PHP substr() and mb_substr() take a length (like JS .substr()). In PHP before 8.0, passing null as the length to substr() means a length of 0, while to mb_substr() it means "until the end of the string". |
| .indexOf() | strpos() | n/a | In PHP, to test for failure, use === false. |
| .lastIndexOf() | strrpos() | n/a | In PHP, to test for failure, use === false. |
| .split() | explode(), preg_split() | n/a | Note PHP str_split() is not the same thing. |
| .join() | implode() | n/a | While PHP also has join(), MediaWiki's coding conventions prefer implode(). |
| .trim(), .trimStart(), .trimEnd() | trim(), ltrim(), rtrim() | None built-in | PHP trim also trims ASCII NUL, which hopefully makes no difference in any real code, and also allows for passing a set of bytes to trim if the default isn't what you need. If you need to trim non-ASCII characters, use preg_replace() with an appropriate regex. |
| .charCodeAt( 0 ), .codePointAt( 0 ) | ord() | mb_ord(), UtfNormal\Util::utf8ToCodepoint() | |
| .fromCharCode(), .fromCodePoint() | chr() | mb_chr(), UtfNormal\Util::codepointToUtf8() | |
| .padStart(), .padEnd() | str_pad() | None built-in | |
| .toLowerCase() | strtolower() | mb_strtolower() | Whether the exact behavior is the same for all strings I don't know. Note PHP 7.3 has different behavior from earlier versions for some characters, see T207100. |
| .toUpperCase() | strtoupper() | mb_strtoupper() | |
| non-regex .replace() | str_replace(), strtr() | n/a | PHP can process arrays of replacements in one call, but doesn't support a callback function for the replacement. You can do a callback using regular expressions with preg_replace_callback(), but this doesn't let you pass flags like PREG_OFFSET_CAPTURE. |
| Regular expression test/match | preg_match() | | See #Regex. |
| Regular expression replace | preg_replace(), preg_replace_callback() | | See #Regex. |
| encodeURIComponent() | rawurlencode() | n/a | PHP will happily encode non-UTF-8 data if given non-UTF-8 input. The similar urlencode() encodes space as '+' rather than '%20'. |
| decodeURIComponent() | rawurldecode() | n/a | JS throws if the input doesn't represent a valid UTF-8 string, while PHP just decodes the bytes. JS throws if the input contains '%' that isn't followed by two hex digits, while PHP treats it as a literal '%'. The similar urldecode() also decodes '+' to a space. |
| encodeURI() | None built-in | | |
| decodeURI() | None built-in | | |
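The index-versus-length difference noted for .substring() is an easy source of off-by-N bugs when porting. As a sketch, the hypothetical helper below rewrites the (start, end) arguments of JS .substring() into the (start, length) pair that PHP substr() expects. It assumes 0 <= start <= end and does not model the argument swapping and clamping that .substring() otherwise performs.

```javascript
// Hypothetical helper: convert .substring()-style (start, end) arguments into
// the (start, length) pair used by PHP substr() and mb_substr().
// Assumes 0 <= start <= end; .substring()'s swapping/clamping is not modeled.
function substringArgsToSubstrArgs(start, end) {
  return { start, length: end - start };
}

const text = 'abcdef';

// The end index is exclusive.
const viaSubstring = text.substring(2, 4);

// The same slice expressed as (start, length), as a PHP port would need it.
const args = substringArgsToSubstrArgs(2, 4);
const viaStartLength = text.slice(args.start, args.start + args.length);

// .indexOf() signals "not found" with -1, where PHP strpos() returns false.
const missing = text.indexOf('z');

console.log(viaSubstring, viaStartLength, missing); // cd cd -1
```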


Regex

PHP uses PCRE. Via Anomie:

After some searching, it looks like you should generally be good to go for JS→PCRE for ASCII strings and patterns, most of the differences seem to be features that PCRE has and JS lacks. Differences of note include:

  • JS /\s/ matches against a bunch of Unicode whitespace (e.g. U+00A0, the non-breaking space) while PCRE by default only matches ASCII whitespace. (CSA audited most places in Parsoid once upon a time to replace \s with a PHP-compatible character class, but...)
  • JS /[^]/ is more or less the same as /./ or /[\s\S]/ (it's interpreted as "the inverse of the empty set"), while in PCRE it's a syntax error (it's interpreted like /[^\]/, so the character class is unterminated).
  • JS can use /[]/ as a regex that will never match anything. In PCRE, use /(?!)/.
  • There are some differences if you do weird things with capturing groups and backreferences, e.g. referencing a capturing group that didn't match the pattern (like /(.)?\1/ or any of the capturing groups in /(?:(a)|(b)|(c))/), or referencing a group inside itself (like /(.\1)/), or forward-referencing a group (like /\1(.)/). In general, JS ignores the backreference (or treats it as the empty string) while PCRE fails the match.

There are more differences if you're needing Unicode-sensitive matching, beginning with the fact that PCRE needs the 'u' modifier on the pattern (e.g. /\s/u, which BTW matches everything JS /\s/ does except U+FEFF) and uses \x{####} (which can handle all code points U+0000 to U+10FFFF) while JS uses \u#### and cannot directly handle code points above U+FFFF. Plus the fact that we generally use UTF-8 strings in PHP while JS uses either UCS-2 or UTF-16.
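The JS side of these differences can be checked directly in Node.js. The snippet below is a small sketch with illustrative variable names; it only demonstrates the JS behavior, and the PCRE behavior described above is not exercised here.

```javascript
// JS \s covers Unicode whitespace, including U+00A0 and U+FEFF.
const matchesNbsp = /\s/.test('\u00A0');
const matchesBom = /\s/.test('\uFEFF');

// /[^]/ is "not in the empty set", so unlike /./ it also matches a newline.
const anyCharMatchesNewline = /[^]/.test('\n');
const dotMatchesNewline = /./.test('\n');

// /[]/ is the empty set and can never match.
const emptyClassMatches = /[]/.test('x');

// Backreference to a group that did not participate: JS treats \1 as empty.
const unmatchedGroupBackref = /^(a)?\1b$/.test('b');

// Forward reference: \1 is evaluated before group 1 has captured anything.
const forwardBackref = /^\1(.)$/.test('x');

console.log(matchesNbsp, matchesBom);                  // true true
console.log(anyCharMatchesNewline, dotMatchesNewline); // true false
console.log(emptyClassMatches);                        // false
console.log(unmatchedGroupBackref, forwardBackref);    // true true
```

Any pattern that relies on one of these results needs a closer look when it is moved to preg_match().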

Boolean tests

From Anomie:

In general,

  • JS foo.bar === undefined => PHP !array_key_exists( 'bar', $foo ) (or !property_exists( $foo, 'bar' ) if $foo is an object) is strictly correct, but if null can be treated the same as undefined we prefer the form in the next bullet.
  • JS foo.bar === undefined || foo.bar === null => PHP !isset( $foo['bar'] )
  • JS foo.bar === null => PHP $foo['bar'] === null. is_null() does exist, but there's no reason to use it.
  • JS foo.bar, where it might be undefined => PHP !empty( $foo['bar'] )
  • JS foo.bar, where it is supposed to be defined => PHP $foo['bar']

In the last two cases, keep in mind that in PHP the string "0" and the empty array are considered falsey while JS considers both of those truthy.
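To make the truthiness caveat concrete, the sketch below shows the JS side and adds a hypothetical audit helper (truthinessDiffersInPhp is an illustrative name, not an existing utility). It flags only the two cases named above and does not attempt to cover every JS/PHP difference.

```javascript
// In JS both of these are truthy; the equivalent PHP values are falsey.
const zeroStringIsTruthy = Boolean('0');
const emptyArrayIsTruthy = Boolean([]);

// Hypothetical audit helper: true for values whose truthiness differs
// between JS and PHP according to the note above ("0" and the empty array).
function truthinessDiffersInPhp(value) {
  return value === '0' || (Array.isArray(value) && value.length === 0);
}

console.log(zeroStringIsTruthy, emptyArrayIsTruthy);                   // true true
console.log(truthinessDiffersInPhp('0'), truthinessDiffersInPhp([]));  // true true
console.log(truthinessDiffersInPhp(''), truthinessDiffersInPhp([1]));  // false false
```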

Depending on the context, JS calls to Object.hasOwnProperty() can either be removed entirely or be replaced like JS foo.bar !== undefined.

Tests for method existence with code like JS foo.methodName can be directly translated to PHP is_callable( [ $foo, 'methodName' ] ), but it's often more idiomatic to test PHP $foo instanceof ClassOrInterface instead.


Callbacks

In PHP, you can do callbacks in five ways:

  1. Referring to a (global) function, by passing its name as a string: 'fooBar'
  2. Referring to a static class method, by passing the class and method as a string: 'Some\Class::fooBar'
  3. Referring to a static class method, by passing the class and method as a two-element string array: [ Some\Class::class, 'fooBar' ]
    • Note Some\Class::class is the same as the string 'Some\Class', but the former is preferred because it does the expected thing with respect to namespace and use, as well as making it easier for static analysis tools.
    • You can get the "compile"-time class name with self::class, and the runtime class name (e.g. the class of $this) with static::class.
  4. Referring to an object instance method, by passing the object instance and method name as a two-element array: [ $obj, 'fooBar' ]
  5. As an anonymous function/closure, represented as a Closure object: function ( $args ) { /*...*/ }

When documenting or typehinting, use "callable" as the type name. Note that Doxygen doesn't allow for directly documenting the arguments and return type like the JS documentor does; you'll have to describe them in prose if relevant.

Anonymous functions/closures in PHP work much like in JS, however only $this from the outer scope is available by default. You can pass variables from the current scope into the closure like function ( $args ) use ( $var1, $var2, &$var3 ) {}. This works much like argument passing for a function call: $var1 and $var2 are passed by value and later reassignments in the outer scope won't change the values seen inside the closure, while $var3 is passed by reference and any modifications in either scope will be seen in both scopes.
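The capture difference matters when JS code reassigns a variable after creating a closure. The sketch below (illustrative names only) shows the default JS behavior, which corresponds to a by-reference use in PHP, next to a JS emulation of by-value capture.

```javascript
let counter = 1;

// A JS closure sees the live outer binding, like PHP use ( &$counter ).
const readLive = () => counter;

// Emulating PHP use ( $counter ): copy the value when the closure is created.
const readSnapshot = ((copied) => () => copied)(counter);

// Reassign after both closures exist.
counter = 2;

console.log(readLive(), readSnapshot()); // 2 1
```

When porting, a closure that depends on seeing later reassignments needs the by-reference form, or a restructuring so the value is passed explicitly.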

PHP doesn't have the ability to override an object's defined methods at runtime; if you need to do that, you might use a property holding a callable instead.

JS code passing obj.fooBar.bind( obj ) as a callback will be basically equivalent to PHP code passing [ $obj, 'fooBar' ]. You may be able to do some other JS bind tricks using Closure::fromCallable(), Closure::bindTo(), and Closure::call(), although if possible that should probably be avoided.

Ternary operator associativity

JavaScript's ternary operator is right-associative, i.e. a ? b : c ? d : e is interpreted as a ? b : ( c ? d : e ), which is generally what you'd expect. In PHP, however, it's left-associative: $a ? $b : $c ? $d : $e is interpreted as ( $a ? $b : $c ) ? $d : $e. In PHP you should always use explicit parentheses when writing an expression with nested ternary operators to avoid confusion, and phpcs will flag it as an issue if you don't.

Note this doesn't affect the binary operators ?: and ??. $a ?: $b ?: $c works like $a ? $a : ( $b ? $b : $c ) like you'd expect, just like JS a || b || c.
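The effect of the two groupings can be reproduced in JS by adding the parentheses by hand. This is a minimal sketch with arbitrary sample values; the second expression only imitates how PHP would group the same tokens.

```javascript
const a = true, b = 'b', c = true, d = 'd', e = 'e';

// JS parses this as a ? b : (c ? d : e).
const jsGrouping = a ? b : c ? d : e;

// The grouping PHP would apply to the same unparenthesized expression.
const phpGrouping = (a ? b : c) ? d : e;

console.log(jsGrouping, phpGrouping); // b d
```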

Library differences

Domino (JS) vs PHP DOM

Domino uses upper-case tag names by default (LI, UL, DIV, ...) whereas the PHP DOM seems to use lower-case tag names by default (li, ul, div, ...). So, during porting, anywhere an upper-case tag name is encountered, it should be replaced with its lower-case equivalent.

PHP's DOM implementation is based on DOM Level 1 Core and DOM Level 2 XML, and lacks the features added in higher levels of those modules and features from the DOM HTML module. See T215000 for further discussion. But even then, some PHP features are broken, for example Node::normalize() doesn't remove zero-length Text nodes in PHP.

A DOMCompat utility class is being implemented to address DOM functionality gaps and those helpers should be used for all missing functionality.

Code replacement tips

  • Replace Util.phpURLEncode(str) in JS code with urlencode($str) in PHP. That JS Util helper existed to deal with language differences in the first place, and in a port, we can go back to the original PHP version.
  • Replace JSUtils.lastItem(a) with end($a).
  • Do not use assert (see Requests for comment/Assert for more info if you want to dig into this more). The problem is apparently ameliorated in PHP7 but not entirely gone. So, the recommendation is to use throw new \BadMethodCallException('...') for temporary asserts during porting (for flagging unported methods or for redirecting uses to other native PHP methods) or throwing appropriate semantic exception classes like InvalidTokenException for example.
  • In JS land, strings and other non-string tokens are all objects. But, in PHP, string is a primitive type and non-string tokens are non-primitive types. So, the token type tests used in JS cannot be used as is in PHP. You can neither use instanceof nor do a $token->getType() on strings. You would first have to check is_string($token) before testing for other types. To DRY this, TokenUtils.php provides two helpers that let the rest of the codebase deal with this uniformly, just like in JS.
  • In JavaScript, || and && produce one of their operands (|| the first truthy operand, && the first falsy one, or the last operand if there is none), while in PHP they always produce boolean true/false. This is sometimes used in JS to assign default values, as in value = something || "default". PHP has ?: and ?? operators that can sometimes be used for this purpose.
    • $a ?: $b is shorthand for $a ? $a : $b, and can be used for JS || as long as $a is known to exist. Thus, it can't be safely used when $a is something like $obj->maybeUnset or $arr['maybeUnset'].
    • $a ?? $b is shorthand for isset( $a ) ? $a : $b, and can be used for JS || when you only want to use $b if $a is undefined or null. To be clear: if $a is boolean false, 0, empty string, string-0, or the empty array, that will use $a, not $b.
    • In other cases, use the ternary operator or explicit if statements, e.g. !empty( $a ) ? $a : $b.
  • Array membership in PHP is in_array($v, $array, true). Note the last parameter, which enables strict mode and checks types as well. This is important since DOM node comparison seems to give incorrect results without strict mode. The same applies to array_search($v, $array, true), which is used to get the index of $v in $array.
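For the || and && item in the list above, it helps to see exactly which operand JS returns, since that determines which PHP operator is a safe replacement. The sketch below uses illustrative names; the ?? operator shown is the JS one (ES2020), included because its null/undefined-only fallback is close to what the list describes for PHP ??.

```javascript
// || and && return one of their operands rather than a boolean.
const orResult = 0 || 'default';    // first truthy operand
const andResult = 'left' && 0;      // first falsy operand
const andLast = 'left' && 'right';  // all truthy: last operand

// JS ?? only falls through on null/undefined, so falsy values survive.
const nullishKeepsZero = 0 ?? 'default';
const nullishKeepsEmpty = '' ?? 'default';
const nullishDefaults = undefined ?? 'default';

console.log(orResult, andResult, andLast);      // default 0 right
console.log(nullishKeepsZero, nullishDefaults); // 0 default
```

A JS default written with || that must also replace 0 or the empty string therefore cannot be ported to ?? without changing behavior.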

Make use of types

  • Declare strict types: By default, we are going to add the declare( strict_types = 1 ); declaration for strict types in all files.
  • Use type hints: Type hints provide all the benefits of typing. Let us use them wherever possible and catch bugs early. We are going to be on PHP 7.1+ and we can use scalar type hints, nullable type hints for args and return types, as well as void return types.

Testing ported code

Given that Parsoid primarily relies on a lot of integration tests (parser tests, mocha tests, roundtrip testing on production pages), we cannot directly use those tests for testing ported code. But, we are not going to wait till the end of the port to actually test everything. Depending on the code in question, there are different testing modalities available.

WT2HTML: Token transformers

Testing a ported token transformer in isolation

Generate a dump of tokens passing in and out of the token transformer via the parse.js script

node bin/parse.js --pageName <page> --genTest <pass> --genTestOut <filename> < /dev/null > /dev/null

Use the JS code in/out token dump to verify your PHP port of the transformer

php bin/TransformTests.php --transformer <pass> --inputFile <testfile>

Note: All the regular uses of bin/parse.js are in play here, so simplified test cases, as opposed to whole pages, can be used to narrow down the focus.

Testing a ported token transformer via JS-PHP hybrid parsing

Edit phpconfig.yaml (possibly creating it from phpconfig.example.yaml) to add the new pass to the phpTokenTransformers list. You have to edit bin/runTransform.php and update the switch statement to construct the PHP token transformer. Now, try to run the parser tests,

node bin/parserTests.js --quiet --phpConfigFile phpconfig.yaml

This splices the PHP code in place of the JS code and runs parser tests.

WT2HTML: DOM transformers

Testing a ported DOM pass in isolation

Generate pre- and post-pass DOM outputs from the JavaScript parse of a page,

node bin/parse.js --genTest dom:<pass> --pageName <page> < /dev/null > /dev/null

and then use them to verify the transform,

php bin/DOMPassTester.php --transformer <pass> --inputFilePrefix <page>

If the output fails to match, the --debug_dump flag can be used to dump the DOM before and after the pass for comparison.

Note: All the regular uses of bin/parse.js are in play here, so simplified test cases, as opposed to whole pages, can be used to narrow down the focus.

Testing a ported DOM pass via JS-PHP hybrid parsing

Edit phpconfig.yaml (possibly creating it from phpconfig.example.yaml) to add the new pass to the phpDOMTransformers list. You have to edit bin/runDOMTransform.php and update the switch statement to construct the PHP DOM transformer. Now, try to run the parser tests,

node bin/parserTests.js --quiet --phpConfigFile phpconfig.yaml

This splices the PHP code in place of the JS code and runs parser tests.

HTML2WT: DOMDiff

You will need to port bin/domdiff.test.js to PHP and use that version to run tests.

TODO: We are going to be updating the JS-PHP hybrid parsing infrastructure to be able to plug in the PHP domdiff code and run parser tests. At that time, we'll update the instructions here.

HTML2WT: DOM Normalization

You will need to port bin/normalize.test.js to PHP and use that version to run tests.

TODO: We are going to be updating the JS-PHP hybrid parsing infrastructure to be able to plug in the PHP DOMNormalizer code and run parser tests. At that time, we'll update the instructions here.

Other code

-- to be documented --

Running js2php

Make sure your node_modules/ are up-to-date and then run,

$ node tools/js2php.js