Parsoid/Debugging

From mediawiki.org

Debugging tips for commandline parsing (parse.php script)

This section assumes you are in the root of your Parsoid checkout, so that the script is reachable as bin/parse.php.

Debugging the wt2html mode

php bin/parse.php --help lists the available options and is a useful command to remember. The rest of this section covers a few of those options. Since Parsoid processes wikitext in a pipeline composed of synchronous and asynchronous phases, it is sometimes useful to know how to examine the contents of the pipeline at various stages.

1. If you want to debug the tokenizer, php bin/parse.php --trace peg is useful. Each time the tokenizer emits a token array to the next stage in the pipeline, this option prints out the token array.

[subbu@earth tests] echo -e "foo bar\nThis is a [[link]]" | php bin/parse.php --trace peg
0-[peg]        | ---->   ["foo bar"]
0-[peg]        | ---->   [{"type":"NlTk","dataAttribs":{"tsr":[7,8]}}]
0-[peg]        | ---->   ["This is a ",{"type":"SelfclosingTagTk","name":"wikilink","attribs":[{"k":"href","v":["link"],"srcOffsets":[20,20,20,24],"vsrc":"link"}],"dataAttribs":{"tsr":[18,26],"src":"[[link]]"}}]
0-[peg]        | ---->   [{"type":"NlTk","dataAttribs":{"tsr":[26,27]}}]

The end of the output is signalled by the EOFTk. Also note that multiple tokenizers might be active at the same time because of concurrent template expansions. A future enhancement of this debugging output could assign a debug id to every tokenizer, so that output from different tokenizers can be told apart.
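The tsr values in the trace above read like [start, end) offsets into the input wikitext. The following Python sketch checks that reading against the sample output. parse_peg_trace and slice_by_tsr are hypothetical helper names, and the sketch assumes the trace text has already been captured into a string. The sample is pure ASCII, so character and byte offsets coincide here.

```python
import json

# Two lines copied from the sample `--trace peg` output above.
SAMPLE_TRACE = '''\
0-[peg]        | ---->   ["foo bar"]
0-[peg]        | ---->   ["This is a ",{"type":"SelfclosingTagTk","name":"wikilink","attribs":[{"k":"href","v":["link"],"srcOffsets":[20,20,20,24],"vsrc":"link"}],"dataAttribs":{"tsr":[18,26],"src":"[[link]]"}}]
'''

# The wikitext that `echo -e "foo bar\nThis is a [[link]]"` sends to the parser
# (echo appends the trailing newline).
WIKITEXT = "foo bar\nThis is a [[link]]\n"


def parse_peg_trace(text):
    """Return one decoded token array per '| ---->' line (hypothetical helper)."""
    chunks = []
    for line in text.splitlines():
        _, arrow, payload = line.partition("| ---->")
        if arrow:
            chunks.append(json.loads(payload))
    return chunks


def slice_by_tsr(source, token):
    """Return the source text covered by a token's tsr, read as [start, end)."""
    start, end = token["dataAttribs"]["tsr"]
    return source[start:end]


chunks = parse_peg_trace(SAMPLE_TRACE)
link_token = chunks[1][1]
print(slice_by_tsr(WIKITEXT, link_token))  # prints [[link]]
```

The slice matches the token's own "src" field, which is a quick sanity check when source offsets look suspicious.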

2. If you want to look at the fully expanded and in-order token stream, php bin/parse.php --trace tsp is your friend. This emits the tokens as seen by the TokenStreamPatcher handler, which is the very first handler in the third phase of the pipeline (the in-order, synchronous transformation passes). It is therefore a good proxy for the in-order token stream of the top-level document.

[subbu@earth tests] echo -e "foo \n{{1x|bar}}\n[[link]]" | php bin/parse.php --trace tsp
0-[TSP]        | "foo "
0-[TSP]        | {"type":"NlTk","dataAttribs":{"tsr":[4,5]}}
0-[TSP]        | {"type":"SelfclosingTagTk","name":"meta","attribs":[{"k":"typeof","v":"mw:Transclusion"},{"k":"about","v":"#mwt1"}],"dataAttribs":{"tsr":[5,15],"src":"{{1x|bar}}","tmp":{"tplarginfo":"{\"dict\":{\"target\":{\"wt\":\"1x\",\"href\":\"./Template:1x\"},\"params\":{\"1\":{\"wt\":\"bar\"}}},\"paramInfos\":[{\"k\":\"1\",\"srcOffsets\":[10,10,10,13]}]}"}}}
0-[TSP]        | "bar"
0-[TSP]        | {"type":"SelfclosingTagTk","name":"meta","attribs":[{"k":"typeof","v":"mw:Transclusion/End"},{"k":"about","v":"#mwt1"}],"dataAttribs":{"tsr":[null,15]}}
0-[TSP]        | {"type":"NlTk","dataAttribs":{"tsr":[15,16]}}
0-[TSP]        | {"type":"TagTk","name":"a","attribs":[{"k":"rel","v":"mw:WikiLink"},{"k":"href","v":"./Link"},{"k":"title","v":"Link"}],"dataAttribs":{"tsr":[16,24],"stx":"simple","a":{"href":"./Link"},"sa":{"href":"link"}}}
0-[TSP]        | "link"
0-[TSP]        | {"type":"EndTagTk","name":"a","attribs":[],"dataAttribs":{}}
0-[TSP]        | {"type":"NlTk","dataAttribs":{"tsr":[24,25]}}
0-[TSP]        | {"type":"EOFTk"}
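When the TSP trace gets long, a compact per-token summary can be easier to scan than the raw JSON. The sketch below is one way to produce it in Python. It assumes each trace line carries exactly one JSON-encoded token after the "| " separator, as in the sample above; parse_tsp_trace and summarize are hypothetical names.

```python
import json

# A few lines copied from the sample `--trace tsp` output above.
SAMPLE_TRACE = '''\
0-[TSP]        | "foo "
0-[TSP]        | {"type":"NlTk","dataAttribs":{"tsr":[4,5]}}
0-[TSP]        | {"type":"TagTk","name":"a","attribs":[{"k":"rel","v":"mw:WikiLink"},{"k":"href","v":"./Link"},{"k":"title","v":"Link"}],"dataAttribs":{"tsr":[16,24],"stx":"simple","a":{"href":"./Link"},"sa":{"href":"link"}}}
0-[TSP]        | "link"
0-[TSP]        | {"type":"EndTagTk","name":"a","attribs":[],"dataAttribs":{}}
0-[TSP]        | {"type":"EOFTk"}
'''


def parse_tsp_trace(text):
    """Decode the token on each [TSP] line (hypothetical helper)."""
    tokens = []
    for line in text.splitlines():
        prefix, bar, payload = line.partition("| ")
        if bar and "[TSP]" in prefix:
            tokens.append(json.loads(payload))
    return tokens


def summarize(token):
    """One short label per token: text tokens are strings, the rest are objects."""
    if isinstance(token, str):
        return "text " + repr(token)
    name = token.get("name")
    return token["type"] + (f"({name})" if name else "")


summary = [summarize(t) for t in parse_tsp_trace(SAMPLE_TRACE)]
print("\n".join(summary))
```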

3. If you want to look at the fully processed and transformed token stream (post all transformations), php bin/parse.php --trace html is a good proxy. The output is a little bit noisier than it needs to be. Refining it and making it more useful is left as an enhancement.

4. If you want to look at the DOM at different stages of transformation, --dump dom:post-builder, --dump dom:pre-dsr, and --dump dom:pre-encap are useful DOM debug options. They can be combined, as in --dump dom:pre-dsr,dom:pre-encap.

5. Sometimes, it is useful to look at the preprocessed template source that Parsoid then tokenizes. --dump tplsrc is useful in those scenarios.

6. There are a bunch of other handler-specific tracing flags. "php bin/parse.php --help" should tell you what they do. There are tracers for the PreHandler, ListHandler, and ParagraphWrapper. There is no tracer currently for the QuoteTransformer.
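If you switch between trace and dump flags often, a small wrapper that assembles the command line can save typing. This is only a sketch: build_parse_command is a hypothetical helper, it uses only flag values mentioned on this page, and it assumes that --trace accepts a comma-separated list in the same way --dump does.

```python
import shlex


def build_parse_command(trace=(), dump=(), extra=()):
    """Assemble an argv list for bin/parse.php (hypothetical helper)."""
    argv = ["php", "bin/parse.php"]
    if trace:
        # Assumption: several trace flags can be joined with commas.
        argv += ["--trace", ",".join(trace)]
    if dump:
        argv += ["--dump", ",".join(dump)]
    argv += list(extra)
    return argv


cmd = build_parse_command(trace=["tsp"], dump=["dom:pre-dsr", "dom:pre-encap"])
print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run(cmd, input=wikitext, text=True) from the root of a Parsoid checkout.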

Debugging the html2wt mode

php bin/parse.php --help lists a few options that help debug the serializer, which converts HTML to wikitext.

1. If you want to trace the actions of the regular serializer, --trace wts is what you want.

$ echo "<p>foo</p>" | php bin/parse.php --html2wt --trace wts

2. If you want to debug the wikitext escaping behavior of the serializer, --trace wt-escape is what you want.

$ echo -e "<p> foo\nbar</p>\n\n<p>*a\n*b</p>" | php bin/parse.php --trace wt-escape --html2wt

Debugging the selective serializer (selser)

In order to test the selective serializer, you need (a) the original wikitext, (b) the original HTML, and (c) the modified HTML.

Running selser (for HTML with inline data-parsoid)

Let us first look at ways to test the selective serializer on the command line.

$ echo "<p>foo</p><p>boo</p>" > /tmp/wt
$ php bin/parse.php < /tmp/wt > /tmp/orig.html 
$ sed "s/foo/bar/g" < /tmp/orig.html > /tmp/edited.html
$ php bin/parse.php --selser --oldtextfile /tmp/wt --oldhtmlfile /tmp/orig.html  < /tmp/edited.html
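The same workflow can be scripted. The Python sketch below prepares the three input files and builds the selser command line without running it. prepare_selser_inputs is a hypothetical helper, and the HTML used in the demo is a stand-in: real Parsoid output would carry data-parsoid attributes.

```python
import tempfile
from pathlib import Path


def prepare_selser_inputs(workdir, wikitext, orig_html, old, new):
    """Write wt, orig.html and edited.html; return (argv, edited_file)."""
    workdir = Path(workdir)
    wt_file = workdir / "wt"
    orig_file = workdir / "orig.html"
    edited_file = workdir / "edited.html"
    wt_file.write_text(wikitext)
    orig_file.write_text(orig_html)
    # Plain-string equivalent of: sed "s/old/new/g" < orig.html > edited.html
    edited_file.write_text(orig_html.replace(old, new))
    argv = ["php", "bin/parse.php", "--selser",
            "--oldtextfile", str(wt_file),
            "--oldhtmlfile", str(orig_file)]
    return argv, edited_file


with tempfile.TemporaryDirectory() as tmp:
    argv, edited_file = prepare_selser_inputs(
        tmp,
        wikitext="<p>foo</p><p>boo</p>\n",
        orig_html="<p>foo</p><p>boo</p>",  # stand-in for real Parsoid output
        old="foo",
        new="bar",
    )
    edited_html = edited_file.read_text()

print(edited_html)
```

To run selser for real, keep the files around and pass the edited HTML on standard input, for example with subprocess.run(argv, stdin=open(edited_file)).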

Running selser on HTML and data-parsoid in separate files

This is useful for testing Parsoid output after dumping the original HTML and the edited HTML (from VE -- see instructions for doing that later in this file) and fetching data-parsoid from RESTBase. This effectively simulates the v2 html2wt API endpoint that VE and other clients use via RESTBase.

$ php bin/parse.php --dpinfile data-parsoid.json --selser true --oldhtmlfile old.html --oldtextfile wt.txt < new.html

There are also purely commandline options for running selser on very simple examples. Check php bin/parse.php --help to find out more.

Debugging DOMDiff

Selser first compares the old and new HTML and generates a diff-marked DOM. This is done by the DOMDiff class. There is a commandline script to test and debug this functionality in isolation.

The output below has been edited to fit the window.

$ php bin/domdiff.test.php /tmp/orig.html /tmp/edited.html
----- DIFF-marked DOM -----
<html data-parsoid-diff="{&quot;id&quot;:null,&quot;diff&quot;:[&quot;subtree-changed&quot;]}">
<head></head>
<body data-parsoid="{&quot;dsr&quot;:[0,21,0,0]}" data-parsoid-diff="{&quot;id&quot;:null,&quot;diff&quot;:[&quot;subtree-changed&quot;]}">
<p data-parsoid="{&quot;stx&quot;:&quot;html&quot;,&quot;dsr&quot;:[0,10,3,4]}" data-parsoid-diff="{&quot;id&quot;:null,&quot;diff&quot;:[&quot;children-changed&quot;,&quot;subtree-changed&quot;]}">bar</p>
<p data-parsoid="{&quot;stx&quot;:&quot;html&quot;,&quot;dsr&quot;:[10,20,3,4]}">boo</p>
</body></html>
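The diff markers are JSON stored in the data-parsoid-diff attribute, so they are easy to pull out programmatically. The following Python sketch lists the marked elements using only the standard library. DiffMarkerCollector is a hypothetical name, and the sample is copied from the two paragraph elements in the output above.

```python
import json
from html.parser import HTMLParser


class DiffMarkerCollector(HTMLParser):
    """Collect (tag, diff markers) for every element carrying data-parsoid-diff."""

    def __init__(self):
        super().__init__()
        self.markers = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser has already unescaped &quot; in attribute values.
        raw = dict(attrs).get("data-parsoid-diff")
        if raw is not None:
            self.markers.append((tag, json.loads(raw)["diff"]))


SAMPLE = (
    '<p data-parsoid="{&quot;stx&quot;:&quot;html&quot;,&quot;dsr&quot;:[0,10,3,4]}" '
    'data-parsoid-diff="{&quot;id&quot;:null,&quot;diff&quot;:'
    '[&quot;children-changed&quot;,&quot;subtree-changed&quot;]}">bar</p>'
    '<p data-parsoid="{&quot;stx&quot;:&quot;html&quot;,&quot;dsr&quot;:[10,20,3,4]}">boo</p>'
)

collector = DiffMarkerCollector()
collector.feed(SAMPLE)
collector.close()
print(collector.markers)
```

Only the first paragraph is reported: the second one carries no data-parsoid-diff attribute because it was not touched by the edit.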

You can look at (currently very) verbose output of DOM-diffing by turning on the --debug option.

$ php bin/domdiff.test.php --debug /tmp/orig.html /tmp/edited.html
...
$ php bin/domdiff.test.php --help

Debugging selser

To debug the selective serializer, you can use --trace selser. Using this flag will automatically enable tracing of the regular serializer, so there is no need to say --trace wts as well.

$ php bin/parse.php --selser --oldtextfile /tmp/wt --oldhtmlfile /tmp/orig.html --trace selser < /tmp/edited.html

Debugging what's happening on a local mediawiki install

The following command runs Parsoid from the command line against a given page of the locally installed MediaWiki. Standard output is discarded, so only diagnostics such as errors and trace output remain visible:

$ MW_INSTALL_PATH=/var/www/html/mediawiki php bin/parse.php --integrated --domain http://localhost --pageName Main_Page < /dev/null > /dev/null

Reproducing roundtrip testing conditions

In order to reproduce issues that appear in the roundtrip testing ("rt-testing"), some options must be used with parse.php:

  • --wrapSections must be used in the wt->html direction
  • --scrubWikitext must be used in the html->wt direction
  • very rarely, some bugs only happen with the --pageBundle option
  • some html2wt bugs are only visible with the --selser option

Debugging Translate annotations

The Translate extension defines annotations; to debug annotation behaviour coming from production, one way is to temporarily apply the following patch to a local Parsoid checkout:

diff --git a/bin/parse.php b/bin/parse.php
index f67fecd69..732af5e22 100644
--- a/bin/parse.php
+++ b/bin/parse.php
@@ -23,6 +24,7 @@ use Wikimedia\Parsoid\Mocks\MockDataAccess;
 use Wikimedia\Parsoid\Mocks\MockPageConfig;
 use Wikimedia\Parsoid\Mocks\MockPageContent;
 use Wikimedia\Parsoid\Mocks\MockSiteConfig;
+use Wikimedia\Parsoid\ParserTests\DummyAnnotation;
 use Wikimedia\Parsoid\ParserTests\TestUtils;
 use Wikimedia\Parsoid\Parsoid;
 use Wikimedia\Parsoid\Tools\ExtendedOptsProcessor;
@@ -253,7 +255,8 @@ class Parse extends \Wikimedia\Parsoid\Tools\Maintenance {
         */
        private function setupMwConfig( array $configOpts ) {
                $services = MediaWikiServices::getInstance();
-               $siteConfig = $services->getParsoidSiteConfig();
+               $parsoidServices = new ParsoidServices( $services );
+               $siteConfig = $parsoidServices->getParsoidSiteConfig();
                // Overwriting logger so that it logs to console/file
                $logFilePath = null;
                if ( $this->hasOption( 'logFile' ) ) {
@@ -287,6 +290,7 @@ class Parse extends \Wikimedia\Parsoid\Tools\Maintenance {
                $api = new ApiHelper( $configOpts );
 
                $siteConfig = new SiteConfig( $api, $configOpts );
+               $siteConfig->registerExtensionModule( DummyAnnotation::class );
                if ( $this->hasOption( 'logFile' ) ) {
                        // Overwrite logger so that it logs to file
                        $siteConfig->setLogger( SiteConfig::createLogger( $this->getOption( 'logFile' ) ) );
diff --git a/src/ParserTests/DummyAnnotation.php b/src/ParserTests/DummyAnnotation.php
index d7245dc3c..2b2a91b17 100644
--- a/src/ParserTests/DummyAnnotation.php
+++ b/src/ParserTests/DummyAnnotation.php
@@ -16,7 +16,7 @@ class DummyAnnotation extends ExtensionTagHandler implements ExtensionModule {
                        'name' => 'DummyAnnotation',
                        // If these are not the same length as "translate" and "tvar"
                        // respectively, it requires adjusting wtOffsets in the (large) test file.
-                       'annotations' => [ 'dummyanno', 'ann2' ]
+                       'annotations' => [ 'dummyanno', 'ann2', 'translate', 'tvar' ]
                ];
        }
 }

Debugging tips for parser tests (parserTests.php script)

Running tests in all modes

parserTests is the script to run parser tests. The following command runs tests in all 5 modes. You can also run any subset of the 5 modes. The default commandline runs in 4 modes (it excludes selser mode).

$ php bin/parserTests.php --wt2html --wt2wt --html2wt --html2html --selser true

All commandline options that the parse.php script accepts can be passed into parserTests as well. So, the debugging techniques from the previous section are applicable here. In addition, a couple of parserTests-specific options are useful when debugging parser test failures.

Running a subset of tests

parserTests accepts the --filter <string> option which can be used to run a single test or a subset of tests. Examples below:

$ php bin/parserTests.php --wt2html --filter "Tabs don't trigger preformatted text"
$ php bin/parserTests.php --wt2wt --selser true --filter "Tabs don't trigger preformatted text" --trace wts,selser
$ php bin/parserTests.php --wt2wt --selser true --filter "Tabs don't trigger preformatted text" --trace wts,selser --knownFailures false

The last commandline ignores entries in the knownFailures file and dumps failure output (input, expected output, and rendered output).

Additionally, it's also possible to pass a single test file to parserTests.php:

$ php bin/parserTests.php --wt2wt --selser true tests/parser/annotationParserTests.txt

This can be combined with the --filter option to execute a single test or a subset of tests without having to go through irrelevant files to find it. This is particularly useful in debug sessions, where the overhead of xdebug makes a full parserTests run slow.

Running selser with a specific edit

In selser mode, the parserTests script generates a set of edited-DOM tests: it generates random DOM changes, applies them to the HTML, and runs a selser test on the result. A generated set of changes is called a changetree and is specific to the wt2html output produced from the wikitext of a test. To run a selser test for a specific changetree, use the --changetree commandline flag.

$ php bin/parserTests.php --selser true --filter "Tabs don't trigger preformatted text" --changetree "[2,0,0]" --knownFailures false tests/parser/indentPre.txt

Usually this last commandline is used to debug a failing selser test recorded in the knownFailures file and to determine whether the wt2wt output or the selser output is incorrect. This is easy to determine by dumping the edited DOM after the changetree is applied to the wt2html output, as follows:

$ php bin/parserTests.php --selser true --filter "Tabs don't trigger preformatted text" --changetree "[2,0,0]" --dump dom:post-changes --knownFailures false tests/parser/indentPre.txt 
Loaded known failures from /home/subbu/work/wmf/parsoid/tests/parser/indentPre-standalone-knownFailures.json
----- Original DOM -----
<body data-parsoid='{"dsr":[0,75,0,0]}' lang="en" class="mw-content-ltr sitedir-ltr ltr mw-body-content parsoid-body mediawiki mw-parser-output" dir="ltr"><p data-parsoid='{"dsr":[0,33,0,0]}'>	This is not
	 preformatted text.</p>
<pre data-parsoid='{"dsr":[34,75,1,0]}'>This is preformatted text.
	So is this.</pre></body>

----- Change Tree -----
[2,0,0]
----- Edited DOM -----
<body data-parsoid='{"dsr":[0,75,0,0]}' lang="en" class="mw-content-ltr sitedir-ltr ltr mw-body-content parsoid-body mediawiki mw-parser-output" dir="ltr"><p>1cm3xh0</p><p data-parsoid='{"dsr":[0,33,0,0]}'>	This is not
	 preformatted text.</p>
<pre data-parsoid='{"dsr":[34,75,1,0]}'>This is preformatted text.
	So is this.</pre></body>

The --changetree parameter represents the changes applied to the tree:

  • each node is assigned to an element of the array, in order; nested nodes are represented by nested arrays.
  • operations are assigned to these nodes:
    • 0: no change
    • 1: change node wrapper - adds a data-foobar attribute
    • 2: insert new (text) node before child - unless we're in a list / table / body, in which case we insert li, table element or p.
    • 3: delete tree rooted at child
    • 4: change tree rooted at child (delete + insert)

Additionally the single operation [5] appends a comment, which is then stripped from the output, and the result is compared to the original wikitext.
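To make the encoding concrete, here is a small Python sketch that spells out what a changetree means, node by node. It follows the reading of the list above in which element i of the array describes child i and a nested array descends into that child's own children; describe_changetree is a hypothetical helper, not part of parserTests.

```python
# Operation codes as listed above.
OPERATIONS = {
    0: "no change",
    1: "change node wrapper (adds a data-foobar attribute)",
    2: "insert new node before child",
    3: "delete tree rooted at child",
    4: "change tree rooted at child (delete + insert)",
    5: "append a comment (stripped from the output)",
}


def describe_changetree(tree, path=()):
    """Yield (path, description) for every operation in a changetree."""
    for index, entry in enumerate(tree):
        here = path + (index,)
        if isinstance(entry, list):
            # Nested array: the operations apply to the children of this node.
            yield from describe_changetree(entry, here)
        else:
            yield here, OPERATIONS[entry]


for path, description in describe_changetree([2, 0, 0]):
    print(path, description)
```

For [2,0,0] this reports an insertion before the first child of the body and no change for the other two children, which matches the Edited DOM shown above: a new paragraph appears before the original first paragraph.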

Using debug_selser.sh script

For debugging selser failures for a specific test with a specific edit, debug_selser.sh script is your friend. It takes the test name as the first argument and the edit tree as its second argument.

Running a roundtrip test and emitting roundtrip diffs (roundtrip-test.js)

You can use roundtrip-test.js to run a roundtrip test (converting wikitext to HTML and back) on a title on a wiki with a registered wiki prefix. roundtrip-test.js additionally supports trace, dump, and debug flags that are passed through to the parser and the serializer.

$ node roundtrip-test.js --prefix enwiki "Medha Patkar"
.... verbose diff omitted here ....
=====================================================================
=====================================================================
SUMMARY:
Semantic differences : 0
Syntactic differences: 1
---------------------------------------------------------------------
ALL differences      : 1
=====================================================================
=====================================================================
$ node roundtrip-test.js --prefix enwiki "Medha Patkar" --trace selser
-----------------WTS-mode-----------------
["WTS:"," DOM ==> \n","<body data-parsoid
.... verbose diff omitted here ....
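When running roundtrip tests over several titles, it can help to extract the counts from the SUMMARY block printed by the first command above. The Python sketch below does that with a regular expression. parse_rt_summary is a hypothetical helper, and it assumes the summary lines keep the layout shown in the sample.

```python
import re

# Copied from the SUMMARY block in the sample output above.
SAMPLE = """\
SUMMARY:
Semantic differences : 0
Syntactic differences: 1
---------------------------------------------------------------------
ALL differences      : 1
"""

SUMMARY_LINE = re.compile(
    r"^(Semantic|Syntactic|ALL) differences[ \t]*:[ \t]*(\d+)[ \t]*$",
    re.MULTILINE,
)


def parse_rt_summary(text):
    """Return the difference counts keyed by lower-cased category name."""
    return {m.group(1).lower(): int(m.group(2)) for m in SUMMARY_LINE.finditer(text)}


counts = parse_rt_summary(SAMPLE)
print(counts)
```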

Other helpful scripts

fetch-wt.js is a useful script to fetch wikitext for a revision. This is useful when you want to debug Parsoid behavior on the commandline for a specific page.

[subbu@earth tests] fetch-wt.js --help
Usage: node ./fetch-wt.js [options] <page-title or rev-id>
If first argument is numeric, it is used as a rev id; otherwise it is
used as a title.  Use the --title option for a numeric title.

Options:
  --output  Write page to given file                                                                           
  --prefix  Which wiki prefix to use; e.g. "enwiki" for English Wikipedia, "eswiki" for Spanish,
            "mediawikiwiki" for mediawiki.org  [default: "en"]
  --revid   Page revision to fetch                                                                             
  --title   Page title to fetch (only if revid is not present)                                                 
  --help    Show this message

Generating PHP parser output (without Tidy) on snippets

Note that running the PHP parser without Tidy enabled has been deprecated and will shortly be removed. In addition, the below is out-of-date since the maintenance/parse.php script in core has been tweaked to tidy by default. If you really want to see no-tidy output, you need to give the --no-tidy option to parse.php.

In the browser

This requires you to have a MediaWiki install with Tidy not enabled. The default MediaWiki installation comes without Tidy enabled.

Create a wikipage in your browser and save or look at preview.

View Source in the browser (Inspect Element in the DOM inspector will show you the DOM that your browser helpfully fixed up for you, but that is not what you want).

Via parserTests script on the commandline

Create a test file in the parser tests format that contains the wikitext snippet you want to generate output for, but leave the result section empty, as in the example below:

!! test
Sample test
!! input
{||}

{|
|}
!! result
!! end
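A file in this shape can also be generated from a script, which is handy when trying many snippets. The Python sketch below reproduces the example above exactly; make_parser_test is a hypothetical helper, and only the sections shown in the example are emitted.

```python
def make_parser_test(name, wikitext):
    """Return a parser test entry with an empty result section."""
    return (
        "!! test\n"
        f"{name}\n"
        "!! input\n"
        f"{wikitext}\n"
        "!! result\n"
        "!! end\n"
    )


snippet = "{||}\n\n{|\n|}"
test_text = make_parser_test("Sample test", snippet)
print(test_text, end="")
```

Writing test_text to a file, for example with pathlib.Path("/tmp/tst").write_text(test_text), gives the input for the parserTests run.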

Now run this through the PHP parserTests script as follows:

$ cd <mediawiki-core-install>
$ php tests/parserTests.php --file=/tmp/tst
This is MediaWiki version 1.23alpha (edac6c3).

Running test Sample test... FAILED!
--- /tmp/mwParser-1026460993-expected	2013-11-15 16:39:57.096375665 -0600
+++ /tmp/mwParser-1026460993-actual	2013-11-15 16:39:57.096375665 -0600
@@ -1 +1,7 @@
+<table>
+
+<table>
+<tr><td></td></tr></table>
+<tr><td></td></tr>
+</table>
 

Passed 0 of 1 tests (0%)... 1 tests failed!

And you have the PHP parser's output without Tidy getting in your way.

Stepping through the ParsoidService.js (for Parsoid/JS)

Pass `-n 0` to avoid spawning workers with cluster.

npm install -g node-inspector
node --debug-brk api/server.js -n 0
node-inspector &
open http://127.0.0.1:8080/debug?port=5858

Dumping HTML DOM in VE

In Chrome or Firefox, you can interactively explore the HTML in the console by typing:

  • ve.init.target.doc for the original HTML before edits, or
  • ve.init.target.docToSave for the HTML after edits

It might also be helpful to copy the HTML as a string, so that you can paste it into a file for further analysis or debugging. To do this, use:

  • copy(ve.init.target.doc.body.outerHTML) for the original HTML before edits, or
  • copy(ve.init.target.docToSave.body.outerHTML) for the HTML after edits.

Stepping into Parsoid code from a browser session with XDebug / PHPStorm on a local installation

Warning: This section contains information that mostly comes from ongoing experimentation. It documents a constellation that seems to mostly work on one person's computer, and may contain inaccurate or downright wrong information.
Warning: The following configuration should only ever be used in a development environment and never in a production setting. Enabling xdebug has a strong impact on performance, and the cookie forwarding may have security consequences, in particular on private wikis!

Initial setup

  • Follow Download from Git and Parsoid#Linking_a_developer_checkout_of_Parsoid to install MediaWiki linked to a local checkout of Parsoid.
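  • Set up the PHP used by the web server so that it loads xdebug. For Apache on Ubuntu, this probably means editing something like /etc/php/7.4/apache2/conf.d/20-xdebug.ini (up to versions and paths) to add the following, and restarting Apache. These are the Xdebug 2 setting names; Xdebug 3 renamed them to xdebug.mode=debug, xdebug.client_host and xdebug.client_port: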
zend_extension=xdebug.so
xdebug.remote_enable=1
xdebug.remote_host=127.0.0.1
xdebug.remote_port="9000"
  • Check in the output of phpinfo() that xdebug is indeed activated.
  • Follow the documentation on https://www.jetbrains.com/help/phpstorm/2021.2/zero-configuration-debugging.html to install the PHP debugging browser extension and get information on PHPStorm Debug Connection feature.
  • Set up a breakpoint at the beginning of wfIndexMain in index.php, start Listening for PHP Debug Connections in PHPStorm, go to your local mediawiki index page, and click on the browser extension to activate debugging.
  • On the first execution of the debugging session, PHPStorm displays a popup on connection asking where the source code is located. Give the path to your mediawiki core source code.

If this works, congratulations, MediaWiki can be debugged. If it doesn't, the server path mappings are probably not set up correctly; this is addressed in #Path mapping ("Remote file path is not mapped to any file path in project").

Forward cookies

To indicate to PHPStorm that it should follow an HTTP request into the debugger, a cookie is set (setting this cookie is essentially the job of the browser extension). That cookie is, by default, lost when RESTBase requests are made - which is the case for Parsoid requests. When that cookie is lost, the debugging request is lost, and the corresponding breakpoint is never hit in PHPStorm.

To allow the PHPStorm XDebug cookie to be forwarded with these requests, add the following to LocalSettings.php:

$wgVirtualRestConfig['modules']['parsoid'] = [
        'forwardCookies' => true
];

Path mapping ("Remote file path is not mapped to any file path in project")

Depending on the local setup, and in particular if it involves symlinks, the debugger may fail to stop at breakpoints. Depending on PHPStorm's configuration, this shows up as a silent failure, as a break on the first line of index.php, or as a message at the bottom of the editor saying "Debug session was finished without being paused". In any case, this can be fixed in Settings > Servers. The popup that appeared during the setup created a Server configuration (typically for localhost), which contains a mapping between the files of the project and the paths on the server. Adding an explicit mapping between the file that contains the breakpoint (in File/Directory) and its "Absolute path on the server" (which, if you are debugging the same copy of the code that the server runs, is the same path, but no symlinks allowed!) should allow PHPStorm to find the breakpoint. Mapping a directory above that file is sometimes enough, but not reliably (more testing required). Adding a mapping between the Parsoid codebase and the path that gets resolved on the server is probably a good idea anyway.

Activating Xdebug on demand only

PHP with XDebug is far slower than without, but restarting the web server every time one wants to switch between a debug and a non-debug session is also quite annoying. Tim Starling explains how he deals with that in a blog post: Xdebug on demand. It may be a good idea to set up the "always-on" version first, so as not to be debugging several moving pieces at the same time.

Good test pages

Purging RESTBase

Use the regular parsercache purge urls to purge stored content from RESTBase as well (https://wiki-base-url/wiki/$title?action=purge, Ex: https://www.mediawiki.org/wiki/Parsoid?action=purge). Monitor the etags to determine when it took effect.
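For scripted purges, the URL can be assembled from the wiki base URL and the page title. The Python sketch below only builds the URL and does not send any request. purge_url is a hypothetical helper; it assumes the /wiki/ path shown in the pattern above and the usual MediaWiki convention of writing spaces in titles as underscores.

```python
from urllib.parse import quote


def purge_url(wiki_base_url, title):
    """Build the action=purge URL for a page title (hypothetical helper)."""
    # Assumption: spaces become underscores; ':' and '/' stay readable.
    page = quote(title.replace(" ", "_"), safe=":/")
    return f"{wiki_base_url.rstrip('/')}/wiki/{page}?action=purge"


print(purge_url("https://www.mediawiki.org", "Parsoid"))
```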

See also