Parsing/Visual Diff Testing

The test results are accessible at http://mw-expt-tests.wmflabs.org/. For debugging help, see Parsing/Visual Diff Testing/Debugging.

Overview
For evaluating changes to parsing or to the parser setup, we are using mass visual diff testing. In this setup, we have two mediawiki installs. One is the default (base) mediawiki, and the other is the experimental (expt) mediawiki install. Currently we run these via mediawiki-vagrant on labs VMs, but, these could be setup wherever. Currently these two vms are wikitextexp-base-1002.eqiad.wmflabs and wikitextexp-expt-1002.eqiad.wmflabs. Each of them is a multi-wiki setup initialized with production content from about 41 wikis from wikipedia, wikisource, wiktionary, and wikivoyage. As of April 29, 2016, there are about 50K titles that are usable for running tests.

Separately on parsing-qa-01.eqiad.wmflabs, we run a testreduce-based testing setup that runs a visualdiff test on a test client. The visualdiff test requests the test title from $wiki-base-wikitextexp.eqiad.wmflabs and $wiki-expt-wikitextexp.eqiad.wmflabs, generates screenshots for each of those via phantomjs (after doing some CSS and JS post-processing to strip the chrome, expand all collapsed boxes, etc.), and the compares the two screenshots via uprightdiff which in turns, generates a diff image with differences marked up while accounting for vertical pixel shifts of content on the page.

Testreduce code
The testreduce code is in  which is used to run the mw-expts-vd and mw-expts-vd-client services. The systemd controller files for these services are in  and   — these files have derived from the puppetized code for similar services on ruthenium used for Parsoid's roundtrip testing.

The testreduce server config is in. The testreduce client config is in  which also includes a section that provides the config for the visual diff tests that are to be run.

Visualdiff code
The visualdiff code is in  that also provides config and hooks to use it with testreduce. The file  also provides the visualdiff config. It specifies how to fetch the HTML for the two screenshots, specifics uprightdiff as the diffing engine to use, and a few other parameters that control these -- the comments should be fairly self-explanatory. The uprightdiff code is in.

There is a separate helper service for viewing results for a single title without having to go digging for them in the directory containing them. On parsing-qa-01, the code in  is run as the visualdiff-item service. The config for this is in. The systemd controller file is in.

Managing services: mw-expts-vd, mw-expts-vd-client, visualdiff-item
To {stop,restart,start} all clients: Client logs are in systemd journals and can be accessed as: The public-facing web UIs for these services are managed by a nginx config in  and provides access to the web UI for the mw-expts-vd and visualdiff-item services and also enables directory listing for the screenshots generated during the test runs. The config should be self-explanatory.

Updating the code to test (and being run by the clients)
Unlike Parsoid where the code to test is determined by the latest git commit, in the mw-expts setup, the code to run lives on a separate VM, and sometimes the change might be in the config files, and may not be available in a git repository (at least as of today). The testreduce codebase implicitly assumes that the test to run is a git commit. However, the testreduce client config file (/etc/testreduce/mw-expts-vd-client.config.js) can declare a getGitCommit function that is then used by the server as clients to identify the test run in the database. So, in our case, this function simply returns a unique string identifying the test run based on changes to the code on the wikitextexp-expt-1002 labs VM. So, to initiate a new test run, simply change the string being returned by this function, save the file, and restart the mw-expts-vd-client service and you will be ready to go.

Anyway, here are the steps:
 * 1) Update the code / config on the wikitextexp-expt-1002.eqiad.wmflabs VM. You would do this by going to   and checking out the specific gerrit patch or branch to test, or by updating the config in   appropriately.  IMPORTANT: In order to get accurate results, ensure that on both wikitextexp-base-1002 and wikitextexp-expt-1002 VMs, you are on master branch, and on the wikitextexp-expt-1002 VM, you can checkout a branch / gerrit patch and rebase it on top of the latest master. This way, the only diffs between the two are VMs is the code you are testing.
 * 2) Login to parsing-qa-01.wikitextexp.eqiad.wmflabs. Edit   and update the string in the   function at the bottom.
 * 3) Restarting the mw-expts-vd and mw-expts-vd-client services shouldn't be necessary, but doesn't hurt just in case they aren't currently running.

Updating the testreduce, visualdiff, uprightdiff code
Of course, there will continue to be bug fixes and tweaks to these codebases. To update the relevant code, simply go to,  , or  , and do a  , and restart the affected services. As simple as that!

Retesting a subset of titles
The only way to do this is to clear the result entries in the mysql db. The mysql credentials (username, db, password) are in /etc/testreduce/mw-expts-vd.settings.js That will clear all test results for titles that have a score > 5000 which is equivalent to pages that have rendering diff > 5%. Score = errors * 1M + truncate(diff%) * 1000 + fractional-part-of-diff%. This weird scoring formula is just a result of shoe-horning the visualdiff results into the testreduce setup that was built for parsoid-rt testing. So, to clear test results for all erroring pages, you use latest_score >= 1000000.

Look at the schema for the pages table to clear results for other subsets.

Resource usage and # of test clients
parsng-qa-01 is a large labs vm with 12 cpu cores, 32 gb memory, and a 400+gb disk. Even so, visual diff testing can use up all these resources. 20 testreduce clients seem to be about the upper-end of how many can be run at the same time. This is enough to sometimes bring cpu load to 13-15 and memory usage to 28+gb. Probably 16 clients is a more comfortable number. The # of test clients to run can be tweaked by editing /lib/system/systemd/mw-expts-vd-client.service

The screenshots from phantomjs and from uprightdiff are written to /data/visualdiffs/pngs organized by wiki prefix. These images are overwritten with each test run. It takes too much disk space to store these images per test run. 125GB is used per test run. But, in the future, we could consider storing results from the most recent 2-3 runs or get a larger disk and expand that range a bit more.

Web UI for browsing results
The screenshots from phantomjs and from uprightdiff are written to /data/visualdiffs/pngs organized by wiki prefix and are accessible via HTTP @ http://mw-expt-tests.wmflabs.org/visualdiff/pngs/.

However, a better way of browsing these results is via the mw-expts-vd web UI at http://mw-expt-tests.wmflabs.org. The /topfails link sorts results in descending order of score which makes it easy to look at pages that generate the most prominent diffs first. The @remote link on these results listing page is a easy way to look at the 2 HTML screenshots and the uprightdiff screenshot. That output is outsourced to the visualdiff-item service. It simply links to the existing screenshots (or if missing, generates them on demand).

Uprightdiff numeric scoring
Uprightdiff compares the two candidate images and returns 3 metrics: In other words, The goal of generating a numerical score is to be able to (a) compare test results for different pages and identify the most significant ones, and (b) compare test results for the same page across test runs and determine whether our fixes improved or worsened the situation. With these goals in mind, the visual diffing code takes the totalArea of the image and uses the above 3 metrics to generate 2 different numbers. The total score is then computed as 1,000,000 * ErrorMetric + 1,000 * SignificantDiffMetric + InsignificantDiffMetric (In other words, this can be seen as a number in base-1000 notation).
 * if modifiedArea == 0, then the images had pixel-perfect match. In this scenario, movedArea and residualArea will also be zero.
 * if modifiedArea > 0, then the images obviously differed. If residualArea == 0, then it tells us that all the differences could be accounted for by vertical motion and the rendering differences are mostly insignificant. In this scenario, movedArea tells us how many pixels were affected.
 * 1) SignificantDiffMetric (when residualArea > 0): 75 * residualArea / totalArea + 0.25 * min(max(2^(residualArea / 100000) - 1, 0), 100)
 * 2) InsignificantDiffMetric (when residualArea == 0): 50 * modifiedArea / totalArea + 50 * movedArea / totalArea
 * 3) ErrorMetric: 1 if the test had a fatal error, 0 otherwise.

This scoring technique gives us what we want. In addition, the signficant diff metric tries to flag pages that are really large (big totalArea value), that have a sizeable pixel diff (big residualArea), but which is fairly small relative to the size of the page (small residualArea / totalArea ratio). A simple residualArea / totalArea ratio would favor small pages with mostly insignificant residualArea values over large pages with mostly significant residualArea values. So, we pick a 1M area as our baseline and figure out how big the residual area is relative to that and use exponentiation to weight those heavily.

We believe that this numeric metric lets us quickly identify problematic rendering differences and use mass visual diff testing without having to manually sift through thousands of diff images to identify where to focus our efforts.