Parsoid/Debugging

From MediaWiki.org

Debugging tips for commandline parsing (parse.js script)

This section assumes you are in the tests/ directory.

Debugging the wt2html mode

node parse --help is a useful command to remember. Continue reading to find out more about a few of these options. Since Parsoid processes wikitext in a pipeline composed of synchronous and asynchronous phases, it is sometimes useful to know how to examine the contents of the pipeline at various stages.

1. If you want to debug the tokenizer, node parse --trace peg is useful. Each time the tokenizer emits a token array to the next stage in the pipeline, this option prints out the token array.

[subbu@earth tests] echo "foo bar\nThis is a [[link]]" | node parse --trace peg
0-[peg]        | ---->   ["foo bar"]
0-[peg]        | ---->   [{"type":"NlTk","dataAttribs":{"tsr":[7,8]}}]
0-[peg]        | ---->   ["This is a ",{"type":"SelfclosingTagTk","name":"wikilink","attribs":[{"k":"href","v":["link"],"vsrc":"link"}],"dataAttribs":{"tsr":[18,26],"src":"[[link]]"}}]
0-[peg]        | ---->   [{"type":"NlTk","dataAttribs":{"tsr":[26,27]}}]
0-[peg]        | ---->   [{"type":"EOFTk"}]

The end of the output is signalled by the EOFTk token. Also note that multiple tokenizers might be active at the same time because of concurrent template expansions. A future enhancement of this debugging output could assign a debug id to every tokenizer and use that id to distinguish the output of different tokenizers.
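The tsr values in the trace above appear to be [start, end) character offsets into the input wikitext. Assuming that reading is correct (the trace does not spell it out), a small JavaScript sketch can map a token back to the source text it came from. The helper name sourceForToken is made up for illustration and is not part of Parsoid.

```javascript
// Illustrative helper (not part of Parsoid): given the input wikitext and a
// token copied from the --trace peg output, return the source substring that
// its tsr covers. Assumes tsr is a [start, end) pair of string offsets.
function sourceForToken(wikitext, token) {
  if (typeof token === 'string') {
    return token; // plain text is emitted as a bare string
  }
  const tsr = token.dataAttribs && token.dataAttribs.tsr;
  if (!tsr || tsr[0] === null || tsr[1] === null) {
    return null; // no usable source range (for example, EOFTk)
  }
  return wikitext.slice(tsr[0], tsr[1]);
}

// Input and token taken from the trace above (echo appends the final newline).
const wikitext = 'foo bar\nThis is a [[link]]\n';
const linkToken = {
  type: 'SelfclosingTagTk',
  name: 'wikilink',
  dataAttribs: { tsr: [18, 26], src: '[[link]]' }
};
console.log(sourceForToken(wikitext, linkToken)); // [[link]]
```

Comparing the sliced text with the token's src field is a quick way to spot offsets that have drifted.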

2. If you want to look at the fully expanded, in-order token stream, node parse --trace tsp is your friend. This emits the tokens as seen by the TokenStreamPatcher handler, which is the very first handler in the third phase of the pipeline (the in-order, synchronous transformation passes). So, it is a good proxy for the in-order token stream of the top-level document.

[subbu@earth tests] echo "foo \n{{echo|bar}}\n[[link]]" | node parse --trace tsp
0-[TSP]        | "foo "
0-[TSP]        | {"type":"NlTk","dataAttribs":{"tsr":[4,5]}}
0-[TSP]        | {"type":"SelfclosingTagTk","name":"meta","attribs":[{"k":"typeof","v":"mw:Transclusion"},{"k":"about","v":"#mwt1"}],"dataAttribs":{"tsr":[5,17],"src":"{{echo|bar}}","tmp":{"tplarginfo":"{\"dict\":{\"target\":{\"wt\":\"echo\",\"href\":\"./Template:Echo\"},\"params\":{\"1\":{\"wt\":\"bar\"}}},\"paramInfos\":[{\"k\":\"1\",\"srcOffsets\":[12,12,12,15],\"spc\":[\"\",\"\",\"\",\"\"]}]}"}}}
0-[TSP]        | "bar"
0-[TSP]        | {"type":"SelfclosingTagTk","name":"meta","attribs":[{"k":"typeof","v":"mw:Transclusion/End"},{"k":"about","v":"#mwt1"}],"dataAttribs":{"tsr":[null,17]}}
0-[TSP]        | {"type":"NlTk","dataAttribs":{"tsr":[17,18]}}
0-[TSP]        | {"type":"TagTk","name":"a","attribs":[{"k":"rel","v":"mw:WikiLink"},{"k":"href","v":"./Link"}],"dataAttribs":{"tsr":[18,26],"stx":"simple","a":{"href":"./Link"},"sa":{"href":"link"}}}
0-[TSP]        | "link"
0-[TSP]        | {"type":"EndTagTk","name":"a","attribs":[],"dataAttribs":{}}
0-[TSP]        | {"type":"NlTk","dataAttribs":{"tsr":[26,27]}}
0-[TSP]        | {"type":"EOFTk"}
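When a trace gets long, it can help to post-process it. The following is a minimal sketch, assuming each trace line has the shape shown above: a number, a bracketed stage name, a "|" separator, an optional "---->" arrow, and then JSON. The function names and the meaning given to the leading number are assumptions made for illustration; none of this is part of Parsoid.

```javascript
// Split one trace line into its prefix and JSON payload.
// Returns null for lines that do not look like trace output.
function parseTraceLine(line) {
  const m = /^(\d+)-\[([^\]]+)\]\s*\|\s*(?:---->\s*)?(.*)$/.exec(line);
  if (!m) {
    return null;
  }
  return {
    id: Number(m[1]),           // leading number; its exact meaning is assumed
    stage: m[2],                // for example "peg" or "TSP"
    payload: JSON.parse(m[3])   // a token, or an array of tokens for peg
  };
}

// One-word summary of a token: "text" for strings, else type plus tag name.
function describeToken(tok) {
  if (typeof tok === 'string') {
    return 'text';
  }
  return tok.type + (tok.name ? ':' + tok.name : '');
}

// A few lines copied from the --trace tsp output above.
const traceLines = [
  '0-[TSP]        | "foo "',
  '0-[TSP]        | {"type":"NlTk","dataAttribs":{"tsr":[4,5]}}',
  '0-[TSP]        | {"type":"EndTagTk","name":"a","attribs":[],"dataAttribs":{}}',
  '0-[TSP]        | {"type":"EOFTk"}'
];
console.log(traceLines.map((l) => describeToken(parseTraceLine(l).payload)).join(' '));
// text NlTk EndTagTk:a EOFTk
```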

3. If you want to look at the fully processed and transformed token stream (post all transformations), node parse --trace html is a good proxy. The output is a little bit noisier than it needs to be. Refining it and making it more useful is left as an enhancement.

4. If you want to look at the DOM at different stages of transformation, --dump dom:post-builder, --dump dom:pre-dsr, and --dump dom:pre-encap are useful DOM debug options. They can be combined, as in --dump dom:pre-dsr,dom:pre-encap.

5. Sometimes, it is useful to look at the preprocessed template source that Parsoid then tokenizes. --dump tplsrc is useful in those scenarios.

6. There are a bunch of other handler-specific tracing flags. "node parse --help" should tell you what they do. There are tracers for the PreHandler, ListHandler, and ParagraphWrapper. There is no tracer currently for the QuoteTransformer.

Debugging the html2wt mode

node parse --help has a few options to help debug the serializer (this converts HTML to Wikitext).

1. If you want to trace the actions of the regular serializer, --trace wts is what you want.

$ echo "<p>foo</p>" | node parse --html2wt --trace wts

2. If you want to debug the wikitext escaping behavior of the serializer, --trace wt-escape is what you want.

$ echo "<p> foo\nbar</p>\n\n<p>*a\n*b</p>" | node parse --trace wt-escape --html2wt

Debugging the selective serializer (selser)

In order to test the selective serializer, you need (a) the original wikitext, (b) the original HTML, and (c) the modified HTML. Strictly speaking, (b) is not necessary since selser reparses (a) to generate (b) as necessary. However, in certain cases where you want to control testing conditions, it is useful to provide the original HTML as well.

Running selser (for HTML with inline data-parsoid)

Let us first look at ways to test the selective serializer on the command line.

$ echo "<p>foo</p><p>boo</p>" > /tmp/wt
$ node parse < /tmp/wt > /tmp/orig.html 
$ sed "s/foo/bar/g" < /tmp/orig.html > /tmp/edited.html
$ node parse --selser --oldtextfile /tmp/wt < /tmp/edited.html
OR
$ node parse --selser --oldtextfile /tmp/wt --oldhtmlfile /tmp/orig.html  < /tmp/edited.html

Running selser on HTML and data-parsoid in separate files

This is useful for testing Parsoid output after dumping the original HTML and the edited HTML (from VE; see the instructions for doing that later in this document) and fetching data-parsoid from RESTBase. This effectively simulates the v2 html2wt API endpoint that VE and other clients use via RESTBase.

$ node parse --dpinfile data-parsoid.json --selser --oldhtmlfile old.html --oldtextfile wt.txt < new.html

There are also purely commandline options for running selser on very simple examples. Check node parse --help to find out more.

Debugging DOMDiff

Selser first compares the old and new HTML and generates a diff-marked DOM. This is done by the DOMDiff class. There is a commandline script to test and debug this functionality in isolation.

Output edited below to fit window

$ node domdiff.test.js /tmp/orig.html /tmp/edited.html
----- DIFF-marked DOM -----
<html data-parsoid-diff="{&quot;id&quot;:null,&quot;diff&quot;:[&quot;subtree-changed&quot;]}">
<head></head>
<body data-parsoid="{&quot;dsr&quot;:[0,21,0,0]}" data-parsoid-diff="{&quot;id&quot;:null,&quot;diff&quot;:[&quot;subtree-changed&quot;]}">
<p data-parsoid="{&quot;stx&quot;:&quot;html&quot;,&quot;dsr&quot;:[0,10,3,4]}" data-parsoid-diff="{&quot;id&quot;:null,&quot;diff&quot;:[&quot;children-changed&quot;,&quot;subtree-changed&quot;]}">bar</p>
<p data-parsoid="{&quot;stx&quot;:&quot;html&quot;,&quot;dsr&quot;:[10,20,3,4]}">boo</p>
</body></html>
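The data-parsoid and data-parsoid-diff attributes in this output are entity-escaped JSON. Here is a minimal sketch for decoding them and for reading a dsr value, assuming dsr is [start, end, open-tag width, close-tag width] measured in characters of the original wikitext. That reading matches the numbers above, but the output itself does not state it. Both helper names are hypothetical.

```javascript
// Decode an attribute value copied from the HTML source into an object.
// Only the entities needed here are handled; &amp; is decoded last so that
// an escaped "&amp;quot;" is not turned into a quote by mistake.
function decodeAttr(value) {
  const json = value
    .replace(/&quot;/g, '"')
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&amp;/g, '&');
  return JSON.parse(json);
}

// Map a dsr array back onto the wikitext (assumed layout, see the lead-in).
function sliceByDsr(wikitext, dsr) {
  const [start, end, openWidth, closeWidth] = dsr;
  return {
    outer: wikitext.slice(start, end),
    inner: wikitext.slice(start + openWidth, end - closeWidth)
  };
}

// Second <p> from the diff-marked DOM above; /tmp/wt holds this wikitext.
const dp = decodeAttr('{&quot;stx&quot;:&quot;html&quot;,&quot;dsr&quot;:[10,20,3,4]}');
console.log(sliceByDsr('<p>foo</p><p>boo</p>\n', dp.dsr));
// { outer: '<p>boo</p>', inner: 'boo' }
```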

You can look at (currently very) verbose output of DOM-diffing by turning on the --debug option.

$ node domdiff.test.js --debug /tmp/orig.html /tmp/edited.html
...
$ node domdiff.test.js --help

Debugging selser

To debug the selective serializer, you can use --trace selser. Using this flag will automatically enable tracing of the regular serializer, so there is no need to say --trace wts as well.

$ node parse --selser --oldtextfile /tmp/wt --oldhtmlfile /tmp/orig.html --trace selser < /tmp/edited.html

Debugging tips for parser tests (parserTests.js script)

Running tests in all modes

parserTests is the script to run parser tests. The following command runs tests in all 5 modes. Of course, you can run tests in any combination of the 5 modes. The default commandline runs in 4 modes (it excludes selser mode).

$ node parserTests --wt2html --wt2wt --html2wt --html2html --selser

All commandline options that the parse.js script accepts can be passed into parserTests as well. So, the debugging techniques from the previous section are applicable here. In addition, a couple of parserTests-specific options are useful when debugging parser test failures.

Running a subset of tests

parserTests accepts the --filter <string> option which can be used to run a single test or a subset of tests. Examples below:

$ node parserTests --wt2html --filter "Tabs don't trigger preformatted text"
$ node parserTests --wt2wt --selser --filter "Tabs don't trigger preformatted text" \
    --trace wts,selser
$ node parserTests --wt2wt --selser --filter "Tabs don't trigger preformatted text" \
    --trace wts,selser --no-blacklist

The last commandline ignores the blacklist entries and dumps failure output (input, expected output, and rendered output).

Running selser with a specific edit

In selser mode, the parserTests script generates a number of edited-DOM tests by generating random DOM changes, applying them to the HTML, and running a selser test on the result. The set of generated changes is called a changetree and is specific to the wt2html output produced from the wikitext of a test. In order to run a selser test for a specific changetree, the --changetree commandline flag can be used.

$ node parserTests --selser --filter "Tabs don't trigger preformatted text" --changetree "[2,0,0]" --no-blacklist

Usually this last commandline is used to debug a failing selser test recorded in the blacklist file and to determine whether the wt2wt output or the selser output is incorrect. This is easy to determine by dumping the edited DOM after the changetree is applied to the wt2html output, as follows:

$ node parserTests --selser --filter "Tabs don't trigger preformatted text" \
    --changetree "[2,0,0]" --dump dom:post-changes --no-blacklist
WARNING: parserTests.txt not up-to-date with upstream.
ParserTests running with node v0.10.19
Initialisation complete. Now launching tests.
-------------------------
Change tree: [2,0,0]
-------------------------
DOM with changes applied: <body data-parsoid="{&quot;dsr&quot;:[0,75,0,0]}">gzlly4beyrx561or<p data-parsoid="{&quot;dsr&quot;:[0,33,0,0]}">	This is not
	 preformatted text.</p>
<pre data-parsoid="{&quot;dsr&quot;:[34,75,1,0]}">This is preformatted text.
	So is this.</pre></body>

Using debug_selser.sh script

For debugging selser failures for a specific test with a specific edit, the debug_selser.sh script is your friend. It takes the test name as its first argument and the edit tree as its second argument.

Running a roundtrip test and emitting roundtrip diffs (roundtrip-test.js)

You can use roundtrip-test.js to run a roundtrip test (converting from html to wikitext and back) on a title on a wiki with a registered wiki prefix. Roundtrip-test additionally supports trace, dump, and debug flags that are passed through to the parser and the serializer.

$ node roundtrip-test.js --prefix enwiki "Medha Patkar"
.... verbose diff omitted here ....
=====================================================================
=====================================================================
SUMMARY:
Semantic differences : 0
Syntactic differences: 1
---------------------------------------------------------------------
ALL differences      : 1
=====================================================================
=====================================================================
$ node roundtrip-test.js --prefix enwiki "Medha Patkar" --trace selser
-----------------WTS-mode-----------------
["WTS:"," DOM ==> \n","<body data-parsoid
.... verbose diff omitted here ....
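If you run roundtrip tests over several titles, it can be handy to pull the counts out of the SUMMARY block shown above. The following is a minimal sketch that assumes the summary lines keep exactly the wording shown in that output. The function name is hypothetical and not part of roundtrip-test.js.

```javascript
// Extract the three counts from captured roundtrip-test.js output.
// A count is null if its summary line is not found.
function parseRoundtripSummary(output) {
  const grab = (label) => {
    const re = new RegExp('^' + label + ' differences\\s*:\\s*(\\d+)', 'm');
    const m = re.exec(output);
    return m ? Number(m[1]) : null;
  };
  return {
    semantic: grab('Semantic'),
    syntactic: grab('Syntactic'),
    all: grab('ALL')
  };
}

// Summary lines copied from the example run above.
const sampleSummary = [
  'SUMMARY:',
  'Semantic differences : 0',
  'Syntactic differences: 1',
  '---------------------------------------------------------------------',
  'ALL differences      : 1'
].join('\n');
console.log(parseRoundtripSummary(sampleSummary));
// { semantic: 0, syntactic: 1, all: 1 }
```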

Other helpful scripts

fetch-wt.js is a useful script to fetch wikitext for a revision. This is useful when you want to debug Parsoid behavior on the commandline for a specific page.

[subbu@earth tests] fetch-wt.js --help
Usage: node ./fetch-wt.js [options] <page-title or rev-id>
If first argument is numeric, it is used as a rev id; otherwise it is
used as a title.  Use the --title option for a numeric title.

Options:
  --output  Write page to given file                                                                           
  --prefix  Which wiki prefix to use; e.g. "enwiki" for English Wikipedia, "eswiki" for Spanish,
            "mediawikiwiki" for mediawiki.org  [default: "en"]
  --revid   Page revision to fetch                                                                             
  --title   Page title to fetch (only if revid is not present)                                                 
  --help    Show this message

Generating PHP parser output (without Tidy) on snippets

In the browser

This requires a MediaWiki install with Tidy not enabled. The default MediaWiki installation comes without Tidy enabled.

Create a wiki page in your browser and save it or look at the preview.

View Source in the browser (Inspect Element in the DOM inspector will show you the DOM that your browser helpfully fixed up for you, but that is not what you want).

Via parserTests script on the commandline

Create a test file in the parser tests format containing the wikitext snippet you want to generate output for, but leave the result section empty (see the example below).

!! test
Sample test
!! input
{||}

{|
|}
!! result
!! end

Now run this through the PHP parserTests script as follows (here the test file was saved as /tmp/tst):

$ cd <mediawiki-core-install>
$ php tests/parserTests.php --file=/tmp/tst
This is MediaWiki version 1.23alpha (edac6c3).

Running test Sample test... FAILED!
--- /tmp/mwParser-1026460993-expected	2013-11-15 16:39:57.096375665 -0600
+++ /tmp/mwParser-1026460993-actual	2013-11-15 16:39:57.096375665 -0600
@@ -1 +1,7 @@
+<table>
+
+<table>
+<tr><td></td></tr></table>
+<tr><td></td></tr>
+</table>
 

Passed 0 of 1 tests (0%)... 1 tests failed!

And you have the PHP parser's output without Tidy getting in your way.
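If you do this often, the test-file skeleton used above can be generated rather than typed by hand. The following is a minimal sketch that reproduces only the section markers shown in the example (test, input, result, end). The function name is made up for illustration.

```javascript
// Build a parser-test stub with an empty result section, so that the PHP
// parserTests run fails and prints the actual output as a diff.
function makeParserTestStub(testName, wikitext) {
  return [
    '!! test',
    testName,
    '!! input',
    wikitext,
    '!! result',
    '!! end',
    '' // trailing newline
  ].join('\n');
}

// Same snippet as in the example above.
const stub = makeParserTestStub('Sample test', '{||}\n\n{|\n|}');
console.log(stub);
```

Writing the result to disk, for example with require('fs').writeFileSync('/tmp/tst', stub), gives a file that can be passed to --file as in the run above.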

Stepping through the ParsoidService.js

Pass `-n 0` to avoid spawning workers with cluster.

npm install -g node-inspector
node --debug-brk api/server.js -n 0
node-inspector &
open http://127.0.0.1:8080/debug?port=5858

Dumping HTML DOM in VE

In Chrome or Firefox, you can interactively explore the HTML in the console by typing:

  • ve.init.target.doc for the original HTML before edits, or
  • ve.init.target.docToSave for the HTML after edits

It might also be helpful to copy the HTML as a string, so that you can paste it into a file for further analysis or debugging. To do this, use:

  • copy(ve.init.target.doc.body.outerHTML) for the original HTML before edits, or
  • copy(ve.init.target.docToSave.body.outerHTML) for the HTML after edits.

Good test pages

Purging RESTBase

Use the regular parser cache purge URLs to purge stored content from RESTBase as well (https://wiki-base-url/wiki/$title?action=purge; for example, https://www.mediawiki.org/wiki/Parsoid?action=purge). Monitor the ETags to determine when the purge took effect.
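As a small illustration of the URL pattern above, here is a sketch that builds a purge URL from a wiki base URL and a page title. It assumes the wiki serves articles under /wiki/ and that replacing spaces with underscores and then percent-encoding is an acceptable way to encode the title. The function name is hypothetical.

```javascript
// Build the ?action=purge URL for a title, following the pattern
// https://wiki-base-url/wiki/$title?action=purge given above.
function purgeUrl(wikiBaseUrl, title) {
  const base = wikiBaseUrl.replace(/\/+$/, '');                // drop trailing slashes
  const encoded = encodeURIComponent(title.replace(/ /g, '_')); // assumed title encoding
  return base + '/wiki/' + encoded + '?action=purge';
}

console.log(purgeUrl('https://www.mediawiki.org', 'Parsoid'));
// https://www.mediawiki.org/wiki/Parsoid?action=purge
```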

See also