Parsoid/Debugging

Debugging tips for commandline parsing (parse.php script)
This section assumes you are in the bin/ directory.

Debugging the wt2html mode
php bin/parse.php --help is a useful command to remember. Continue reading to find out more about a few of these options. Since Parsoid processes wikitext in a pipeline composed of synchronous and asynchronous phases, it is sometimes useful to know how to examine the contents of the pipeline at various stages.

1. If you want to debug the tokenizer, php bin/parse.php --trace peg is useful. Each time the tokenizer emits a token array to the next stage in the pipeline, this option prints out the token array.

The end of output is signalled by the EOFTk. Also note that multiple tokenizers might be active at the same time because of concurrent template expansions. A future enhancement of this debugging output could assign a debug id to every tokenizer and use that id to tell the output of different tokenizers apart.

2. If you want to look at the fully expanded and in-order token-stream, php bin/parse.php --trace tsp is your friend. This emits the tokens as seen by the TokenStreamPatcher handler which is the very first handler in the in-order third phase synchronous transformation passes. So, it is a good proxy for the in-order token stream of the top-level document.

3. If you want to look at the fully processed and transformed token stream (post all transformations), php bin/parse.php --trace html is a good proxy. The output is a little bit noisier than it needs to be. Refining it and making it more useful is left as an enhancement.

4. If you want to look at the DOM at different stages of transformation, --dump dom:post-builder, --dump dom:pre-dsr, --dump dom:pre-encap are useful DOM debug options which can be combined as --dump dom:pre-dsr,dom:pre-encap

5. Sometimes, it is useful to look at the preprocessed template source that Parsoid then tokenizes. --dump tplsrc is useful in those scenarios.

6. There are a bunch of other handler-specific tracing flags. "php bin/parse.php --help" should tell you what they do. There are tracers for the PreHandler, ListHandler, and ParagraphWrapper. There is no tracer currently for the QuoteTransformer.
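The flags in the list above can be assembled programmatically when you switch between them often. The following is a minimal sketch, not part of Parsoid: the helper name parse_command is made up for illustration, and it only builds the command line without running it. It uses only the --trace and --dump values named above, and joins several dump stages with commas as described in item 4.

```python
import shlex

# Invocation used throughout this page.
PARSE_SCRIPT = ["php", "bin/parse.php"]


def parse_command(trace=None, dump=()):
    """Build an argv list for parse.php (hypothetical helper).

    trace: a single tracer name such as "peg", "tsp" or "html".
    dump:  an iterable of dump stages such as "dom:pre-dsr".
    """
    argv = list(PARSE_SCRIPT)
    if trace:
        argv += ["--trace", trace]
    if dump:
        # Several dump stages are combined into one comma-separated value.
        argv += ["--dump", ",".join(dump)]
    return argv


cmd = parse_command(trace="tsp", dump=["dom:pre-dsr", "dom:pre-encap"])
# Print a copy-pasteable shell line instead of executing anything.
print(shlex.join(cmd))
```

The printed line can be pasted into a shell. How the wikitext input is supplied (for example on standard input) is left out here; check php bin/parse.php --help for the options your checkout supports.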

Debugging the html2wt mode
php bin/parse.php --help has a few options to help debug the serializer (this converts HTML to Wikitext).

1. If you want to trace the actions of the regular serializer, --trace wts is what you want.

2. If you want to debug the wikitext escaping behavior of the serializer, --trace wt-escape is what you want.

Debugging the selective serializer (selser)
In order to test the selective serializer, you need (a) original wikitext (b) original html (c) modified html.

Running selser (for HTML with inline data-parsoid)
Let us first look at ways to test the selective serializer on the command line.

Running selser on HTML and data-parsoid in separate files
This is useful for testing Parsoid output after dumping the original HTML and the edited HTML (from VE -- see instructions for doing that later in this file) and fetching data-parsoid from RESTBase. This effectively simulates the v2 html2wt API endpoint that VE and other clients use via RESTBase.

For very simple examples, selser can be run entirely with commandline options. Check php bin/parse.php --help to find out more.

Debugging DOMDiff
Selser first compares the old and new html and generates a diff-marked DOM. This is the DOMDiff class. There is a commandline script to test and debug this functionality in isolation.

You can look at (currently very) verbose output of DOM-diffing by turning on the --debug option.

Debugging selser
To debug the selective serializer, you can use --trace selser. Using this flag will automatically enable tracing of the regular serializer, so there is no need to say --trace wts as well.

Debugging what's happening on a local mediawiki install
will throw what happens with a given page on the locally installed MediaWiki on the command line.

Reproducing roundtrip testing conditions
In order to reproduce issues that appear in the roundtrip testing ("rt-testing"), some options must be used with parse.php:


 * --wrapSections must be used in the wt->html direction
 * --scrubWikitext must be used in html->wt direction
 * very rarely, some bugs only happen with the --pageBundle option
 * some html2wt bugs are only visible with the --selser option
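The conditions listed above can be captured in a small lookup so that a reproduction attempt always starts from the same flags. This is a sketch only: the helper name rt_flags and its direction labels are made up for illustration, and it deliberately leaves out how the conversion direction itself is selected on the parse.php command line.

```python
def rt_flags(direction, selser=False, page_bundle=False):
    """Return the parse.php flags that mirror rt-testing (hypothetical helper).

    direction: "wt2html" or "html2wt" (labels used by this helper only).
    """
    if direction == "wt2html":
        flags = ["--wrapSections"]
    elif direction == "html2wt":
        flags = ["--scrubWikitext"]
        if selser:
            # Some html2wt bugs only show up with selective serialization.
            flags.append("--selser")
    else:
        raise ValueError(f"unknown direction: {direction!r}")
    if page_bundle:
        # Only needed for the rare bugs that depend on --pageBundle.
        flags.append("--pageBundle")
    return flags


print(rt_flags("html2wt", selser=True))
```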

Running tests in all modes
parserTests is the script to run parser tests. The following command runs tests in all 5 modes. Of course, you can run tests for any of the 5 combinations. The default commandline runs in 4 modes (excludes selser mode).

All commandline options that the parse.php script accepts can be passed to parserTests as well, so the debugging techniques from the previous section apply here too. In addition, a couple of parserTests-specific options are useful when debugging parser test failures.

Running a subset of tests
parserTests accepts the --filter option, which can be used to run a single test or a subset of tests. Examples below:

The last commandline ignores entries in the knownFailures file and dumps failure output (input, expected output, and rendered output).

Additionally, it is possible to pass a single test file to parserTests.php. This can be combined with the --filter option to execute a single test or a subset of tests without having to go through irrelevant files to find the test. This is particularly useful in debug sessions, where the performance overhead of Xdebug makes a full parserTests run slow to get through when it is not needed.
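As a rough illustration of combining a single test file with --filter, the sketch below builds such a command line without running it. Several details are assumptions rather than facts from this page: the script location bin/parserTests.php, the test file path, the filter text, and passing the test file as a trailing positional argument. The helper name parser_tests_command is made up.

```python
import shlex


def parser_tests_command(test_file=None, filter_text=None):
    """Build an argv list for parserTests.php (hypothetical helper)."""
    # Assumed script location; adjust to your checkout.
    argv = ["php", "bin/parserTests.php"]
    if filter_text:
        argv += ["--filter", filter_text]
    if test_file:
        # Assumption: the test file is given as a positional argument.
        argv.append(test_file)
    return argv


# Hypothetical file name and filter text, for illustration only.
cmd = parser_tests_command("tests/parser/example.txt", "Simple paragraph")
# shlex.join quotes the filter text so the line is safe to paste into a shell.
print(shlex.join(cmd))
```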

Running selser with a specific edit
In selser mode, the parserTests script generates a number of edited-DOM tests by generating random DOM changes, applying them to the HTML, and running a selser test on the result. The generated set of changes is called a changetree and is specific to the wt2html output produced from the wikitext of a test. To run a selser test for a specific changetree, use the --changetree commandline flag.

This last commandline is usually used to debug a failing selser test recorded in the knownFailures file and to determine whether the wt2wt output or the selser output is incorrect. This is easy to determine by dumping the edited DOM after the changetree is applied to the wt2html output, as follows:

The changeTree parameter represents the changes applied to the tree:


 * each node is assigned to an element of the array, in order; nested nodes are represented by nested arrays.
 * operations are assigned to these nodes:
    * 0: no change
    * 1: change node wrapper - adds a data-foobar attribute
    * 2: insert new (text) node before child - unless we're in a list / table / body, in which case we insert li, table element or p.
    * 3: delete tree rooted at child
    * 4: change tree rooted at child (delete + insert)

Additionally the single operation [5] appends a comment, which is then stripped from the output, and the result is compared to the original wikitext.
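To make the encoding above concrete, the following sketch walks a changetree and counts how often each operation occurs. It is an illustration only and not Parsoid code: the example tree is made up, the helper name tally_operations is hypothetical, and reading the tree as a JSON array is an assumption about how such a value is written down.

```python
import json
from collections import Counter

# Operation codes as described above.
OPERATIONS = {
    0: "no change",
    1: "change node wrapper",
    2: "insert node before child",
    3: "delete subtree",
    4: "change subtree (delete + insert)",
    5: "append comment",
}


def tally_operations(changetree):
    """Count operations in a nested changetree (hypothetical helper)."""
    counts = Counter()
    for entry in changetree:
        if isinstance(entry, list):
            # Nested arrays stand for nested nodes: recurse into them.
            counts.update(tally_operations(entry))
        else:
            counts[OPERATIONS[entry]] += 1
    return counts


# Made-up example: first node gets a wrapper change, the second node has two
# children (one untouched, one deleted), and a new node is inserted before
# the third.
tree = json.loads("[1, [0, 3], 2]")
for name, count in tally_operations(tree).items():
    print(f"{count} x {name}")
```

An unknown operation code raises a KeyError, which is a cheap way to catch a mistyped changetree before passing it to parserTests.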

Using debug_selser.sh script
For debugging selser failures for a specific test with a specific edit, the debug_selser.sh script is your friend. It takes the test name as its first argument and the edit tree as its second argument.

Running a roundtrip test and emitting roundtrip diffs (roundtrip-test.js)
You can use roundtrip-test.js to run a roundtrip test (converting from html to wikitext and back) on a title on a wiki with a registered wiki prefix. Roundtrip-test additionally supports trace, dump, and debug flags that are passed through to the parser and the serializer.

Other helpful scripts
fetch-wt.js is a useful script to fetch wikitext for a revision. This is useful when you want to debug Parsoid behavior on the commandline for a specific page.

Generating PHP parser output (without Tidy) on snippets
Note that running the PHP parser without Tidy enabled has been deprecated and will shortly be removed. In addition, the below is out-of-date since the  script in core has been tweaked to tidy by default. If you really want to see no-tidy output, you need to give the  option to.

In the browser
This requires you to have a MediaWiki install with Tidy not enabled. The default MediaWiki installation comes without Tidy enabled.

Create a wikipage in your browser and save or look at preview.

View Source in the browser (Inspect Element in the DOM inspector will show you the DOM that your browser helpfully fixed up for you, but that is not what you want).

Via parserTests script on the commandline
Create a test file in the parser tests format containing the wikitext snippet you want to generate output for, but leave the result section empty (see example below).

Now run this through the PHP parserTests script as follows:

Passed 0 of 1 tests (0%)... 1 tests failed!

And you have the PHP parser's output without Tidy getting in your way.

Stepping through the ParsoidService.js (for Parsoid/JS)
Pass `-n 0` to avoid spawning workers with cluster.

Dumping HTML DOM in VE
In Chrome or Firefox, you can interactively explore the HTML in the console by typing:
 * for the original HTML before edits, or
 * for the HTML after edits

For further analysis, it might be helpful to copy the HTML as a string, so that you can paste it into a file for further analysis or debugging. To do this, use:
 * for the original HTML before edits, or
 * for the HTML after edits.

Initial setup

 * Follow Download from Git and Parsoid to install MediaWiki linked to a local checkout of Parsoid.
 * Set up the PHP used by the web server so that it loads Xdebug. For Apache on Ubuntu, this probably means editing something like  (versions and paths may differ) to add the following, and restarting Apache:


 * Check in the phpinfo output that Xdebug is indeed activated.
 * Follow the documentation on https://www.jetbrains.com/help/phpstorm/2021.2/zero-configuration-debugging.html to install the PHP debugging browser extension and get information on PHPStorm Debug Connection feature.
 * Set up a breakpoint at the beginning of wfIndexMain in index.php, start Listening for PHP Debug Connections in PHPStorm, go to your local MediaWiki index page, and click on the browser extension to activate debugging.
 * On the first execution of the debugging session, PHPStorm displays a popup on connection asking where the source code is located. Give the path to your mediawiki core source code.

If this works, congratulations, MediaWiki can be debugged. If it doesn't, it's probably due to server path mappings not being set up correctly - this is addressed in.

Forward cookies
To indicate to PHPStorm that it should follow an HTTP request into the debugger, a cookie is set (setting this cookie is essentially the job of the browser extension). That cookie is, by default, lost when RESTBase requests are made - which is the case for Parsoid requests. When that cookie is lost, the debugging request is lost, and the corresponding breakpoint is never hit in PHPStorm.

To allow the PHPStorm Xdebug cookie to be forwarded with these requests, add the following to LocalSettings.php:

Path mapping ("Remote file path is not mapped to any file path in project")
Depending on the local setup, and in particular if it involves symlinks, the debugger may fail to stop at breakpoints. Depending on PHPStorm's configuration, this shows up as a silent failure, as a break on the first line of index.php, or as a message at the bottom of the editor saying "Debug session was finished without being paused". In every case, it can be fixed in Settings > Servers. The popup that appeared during the setup created a Server configuration (typically for localhost), which contains a mapping between the files of the project and the paths on the server.

Adding an explicit mapping between the file that contains the breakpoint (in File/Directory) and its "Absolute path on the server" should allow PHPStorm to find the breakpoint. If you are debugging that same copy of the code, the two paths are identical, but the server path must not contain symlinks. Mapping a directory above that file is sometimes enough, but not reliably so (more testing required). Adding a mapping between the Parsoid codebase and the path that gets resolved on the server is probably a good idea anyway.
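Since symlinks are the usual culprit, it can help to check what a project path resolves to before entering it as the "Absolute path on the server". The sketch below is a generic filesystem check and is not tied to PHPStorm or Xdebug; the helper name resolve_for_mapping is made up, and it assumes you run it on the machine that serves the code.

```python
import os


def resolve_for_mapping(path):
    """Return (resolved_path, has_symlink) for a path (hypothetical helper).

    resolved_path is the symlink-free absolute path; has_symlink is True
    when resolving symlinks changed the path.
    """
    absolute = os.path.abspath(path)
    resolved = os.path.realpath(absolute)
    return resolved, resolved != absolute


# Example: check the current directory; pass the path of your checkout instead.
resolved, has_symlink = resolve_for_mapping(".")
print(resolved)
print("contains symlinks" if has_symlink else "no symlinks involved")
```

If the second line reports symlinks, the first line is the path to try in the server mapping.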

Activating Xdebug on demand only
PHP with Xdebug is far slower than without, but restarting the web server every time one wants to switch between a debug and a non-debug session is also quite annoying. Tim Starling explains how he deals with that in a blog post: Xdebug on demand. It may be a good idea to set up the "always-on" version first, so that you are not debugging several moving pieces at the same time.

Good test pages

 * en:Barack Obama -- our favorite. For testing.
 * en:Keretapi Tanah Melayu -- huge templated table in the rail route diagram
 * en:List_of_go_games -- interesting tables again

Purging RESTBase
Use the regular parsercache purge urls to purge stored content from RESTBase as well ( https://wiki-base-url/wiki/$title?action=purge, Ex: https://www.mediawiki.org/wiki/Parsoid?action=purge ). Monitor the etags to determine when it took effect.
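The URL pattern above can be generated for any title with a few lines of code. This is a small sketch with a made-up helper name, purge_url; it only builds the URL, replacing spaces in the title with underscores as MediaWiki does, and does not send the request or inspect etags.

```python
from urllib.parse import quote


def purge_url(wiki_base_url, title):
    """Build the action=purge URL for a page title (hypothetical helper)."""
    # MediaWiki page URLs use underscores in place of spaces.
    page = quote(title.replace(" ", "_"))
    return f"{wiki_base_url.rstrip('/')}/wiki/{page}?action=purge"


print(purge_url("https://www.mediawiki.org", "Parsoid"))
```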