Parsing/Visual Diff Testing


The code for generating visual diffs is mirrored from the integration/visualdiff repo on gerrit to github at:

  • diffserver/ has code for running a visual-diff server for generating diffs on demand.
  • testreduce/ has code for running mass visual diff testing via the testreduce setup, and for configuring the testreduce server.


Visual diff code is set up on parsing-qa-02. This is a labs server, so you can run visual diff tests only against public APIs (whether Parsoid, MediaWiki, or something else altogether).

Currently, there is one visual-diff instance on this VM.

  • is used with the parsoid_vs_core database and the parsoid-vs-core-vd and parsoid-vs-core-vd-client testreduce services. This instance is set up to compare Parsoid rendering and core parser rendering for production wiki pages.

As many instances as necessary can be set up, as long as the relevant visualdiff, testreduce, proxy domain, and nginx configs are updated.

Previously, there was also a second instance, described below. That instance and its associated VMs have been decommissioned as no longer needed. The information is retained here for historical reasons, in case it needs to be reviewed in the future.

  • is used with the mwexpts testreduce database and mw-expts-vd and mw-expts-vd-client testreduce services. This instance is set up to compare wiki pages on two labs vms.

The documentation on the rest of this page uses the mw-expts-vd visual-diff instance (even though these services and the associated labs VMs have been decommissioned). The instructions for the services running on parsing-qa-02 transfer over to the parsoid-vs-core-vd* instances.

For debugging help, see Parsing/Visual Diff Testing/Debugging.

For evaluating changes to parsing or to the parser setup, we use mass visual diff testing. In this setup, we have two MediaWiki installs: the default (base) install and the experimental (expt) install. Currently we run these via mediawiki-vagrant on labs VMs, but they could be set up anywhere. Currently these two VMs are wikitextexp-base-1002.eqiad.wmflabs and wikitextexp-expt-1002.eqiad.wmflabs. Each of them is a multi-wiki setup initialized with production content from about 41 wikis from wikipedia, wikisource, wiktionary, and wikivoyage. As of April 29, 2016, there are about 50K titles that are usable for running tests.

Separately, we run a testreduce-based testing setup that runs a visualdiff test on a test client. The visualdiff test requests the test title from $wiki-base-wikitextexp.eqiad.wmflabs and $wiki-expt-wikitextexp.eqiad.wmflabs, generates screenshots for each of those via puppeteer (after doing some CSS and JS post-processing to strip the chrome, expand all collapsed boxes, etc.), and then compares the two screenshots via uprightdiff, which in turn generates a diff image with differences marked up while accounting for vertical pixel shifts of content on the page.

Testreduce code

The testreduce code is in /srv/testreduce, which is used to run the mw-expts-vd and mw-expts-vd-client services. The systemd controller files for these services are in /lib/systemd/system/mw-expts-vd.service and /lib/systemd/system/; these files are derived from the puppetized code for similar services on ruthenium used for Parsoid's roundtrip testing.

The testreduce server config is in /etc/testreduce/mw-expts-vd.settings.js. The testreduce client config is in /etc/testreduce/mw-expts-vd-client.config.js which also includes a section that provides the config for the visual diff tests that are to be run.

Visualdiff code

The visualdiff code is in /srv/visualdiff, which also provides config and hooks to use it with testreduce. The file /etc/testreduce/mw-expts-vd-client.config.js also provides the visualdiff config. It specifies how to fetch the HTML for the two screenshots, specifies uprightdiff as the diffing engine to use, and sets a few other parameters that control these; the comments should be fairly self-explanatory. The uprightdiff code is in /srv/uprightdiff.

There is a separate helper service for viewing results for a single title without having to go digging for them in the directory containing them. On parsing-qa-02, the code in /srv/visualdiff/diffserver/diffserver.js is run as the visualdiff-item service. The config for this is in /etc/visualdiff/mw-expts-diffserver.config.js. The systemd controller file is in /lib/systemd/system/mw-expts-diffserver.service.

Managing services: mw-expts-vd, mw-expts-vd-client, mw-expts-diffserver

To {stop,restart,start} all clients:

sudo service mw-expts-vd-client stop
sudo service mw-expts-vd-client restart
sudo service mw-expts-vd-client start

Client logs are in systemd journals and can be accessed as:

### Logs for the mw-expts-vd-client service
# equivalent to tail -f <log-file>
sudo journalctl -f -u mw-expts-vd-client
# equivalent to tail -n 1000
sudo journalctl -n 1000 -u mw-expts-vd-client

### Logs of the mw-expts-vd testreduce server
sudo journalctl -f -u mw-expts-vd

### Logs of the mw-expts-diffserver service
sudo journalctl -u mw-expts-diffserver

The public-facing web UIs for these services are managed by an nginx config in /etc/nginx/sites-available/mw-expts-vd. It provides access to the web UI for the mw-expts-vd and mw-expts-diffserver services and also enables directory listing for the screenshots generated during the test runs. The config should be self-explanatory.

Updating the code to test (and being run by the clients)

Unlike Parsoid, where the code to test is determined by the latest git commit, in the mw-expts setup the code to run lives on a separate VM. Sometimes the change is in config files and may not be available in a git repository (at least as of today). The testreduce codebase implicitly assumes that the test to run is a git commit. However, the testreduce client config file (/etc/testreduce/mw-expts-vd-client.config.js) can declare a getGitCommit function that is then used by the server and clients to identify the test run in the database. In our case, this function simply returns a unique string identifying the test run based on changes to the code on the wikitextexp-expt-1002 labs VM. So, to initiate a new test run, change the string returned by this function, save the file, and restart the mw-expts-vd-client service.

Anyway, here are the steps:

  1. Update the code / config on the wikitextexp-expt-1002.eqiad.wmflabs VM. You would do this by going to /srv/mediawiki-vagrant/mediawiki and checking out the specific gerrit patch or branch to test, or by updating the config in /srv/mediawiki-vagrant/settings.d appropriately. IMPORTANT: To get accurate results, ensure that both the wikitextexp-base-1002 and wikitextexp-expt-1002 VMs are on the master branch; then, on the wikitextexp-expt-1002 VM, check out the branch / gerrit patch and rebase it on top of the latest master. This way, the only difference between the two VMs is the code you are testing.
  2. Log in to the testreduce host and edit /etc/testreduce/mw-expts-vd-client.config.js, updating the string in the getGitCommit function at the bottom.
  3. Restarting the mw-expts-vd and mw-expts-vd-client services shouldn't be necessary, but doesn't hurt just in case they aren't currently running.

Updating the testreduce, visualdiff, uprightdiff code

Of course, there will continue to be bug fixes and tweaks to these codebases. To update the relevant code, simply go to /srv/testreduce, /srv/visualdiff, or /srv/uprightdiff, and do a git pull, and restart the affected services. As simple as that!

Retesting a subset of titles

The only way to do this is to clear the result entries in the mysql db. The mysql credentials (username, db, password) are in /etc/testreduce/mw-expts-vd.settings.js

mysql> update pages set claim_hash="",claim_num_tries=0, claim_timestamp=null,latest_stat=null,latest_result=null,latest_score=0,num_fetch_errors=0 where latest_score > 5000;

That will clear all test results for titles that have a score > 5000 which is equivalent to pages that have rendering diff > 5%. Score = errors * 1M + truncate(diff%) * 1000 + fractional-part-of-diff%. This weird scoring formula is just a result of shoe-horning the visualdiff results into the testreduce setup that was built for parsoid-rt testing. So, to clear test results for all erroring pages, you use latest_score >= 1000000.
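
As a rough illustration of how the thresholds above relate to the score layout, here is a minimal Python sketch. The helper names (decode_score, selected_by_threshold) are hypothetical and not part of testreduce; the sketch only assumes the layout described above, with errors in the millions place and the whole diff percentage in the thousands place.

```python
ERROR_WEIGHT = 1_000_000   # weight of the error count in the score
PERCENT_WEIGHT = 1_000     # weight of the truncated diff percentage


def decode_score(score):
    """Split a latest_score value into (errors, whole diff percent).

    Hypothetical helper: assumes the score layout described in the text.
    """
    errors = int(score // ERROR_WEIGHT)
    whole_percent = int((score % ERROR_WEIGHT) // PERCENT_WEIGHT)
    return errors, whole_percent


def selected_by_threshold(score, threshold=5000):
    """Mirror a `latest_score > threshold` condition from the update query."""
    return score > threshold


if __name__ == "__main__":
    for score in (4200.0, 7750.0, 1_000_000.0):
        print(score, decode_score(score), selected_by_threshold(score))
```

With this layout, a page that errored always lands at or above 1000000, which is why latest_score >= 1000000 selects exactly the erroring pages.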

Look at the schema for the pages table to clear results for other subsets.

Resource usage and # of test clients

parsing-qa-01 is a large labs VM with 12 CPU cores, 32 GB of memory, and a 400+ GB disk. Even so, visual diff testing can use up all these resources. About 20 testreduce clients seems to be the upper end of how many can be run at the same time; this is enough to sometimes bring CPU load to 13-15 and memory usage to 28+ GB. 16 clients is probably a more comfortable number. The number of test clients to run can be tweaked by editing /lib/systemd/system/mw-expts-vd-client.service.

The screenshots from puppeteer and from uprightdiff are written to /data/visualdiffs/pngs organized by wiki prefix. These images are overwritten with each test run. It takes too much disk space to store these images per test run. 125GB is used per test run. But, in the future, we could consider storing results from the most recent 2-3 runs or get a larger disk and expand that range a bit more.

Web UI for browsing results

The screenshots from puppeteer and from uprightdiff are written to /data/visualdiffs/pngs organized by wiki prefix and are accessible via HTTP @

However, a better way of browsing these results is via the mw-expts-vd web UI. The /topfails link sorts results in descending order of score, which makes it easy to look at pages that generate the most prominent diffs first. The @remote link on the results listing page is an easy way to look at the 2 HTML screenshots and the uprightdiff screenshot. That output is outsourced to the visualdiff-item service. It simply links to the existing screenshots (or, if they are missing, generates them on demand).

Uprightdiff numeric scoring

Uprightdiff compares the two candidate images and returns 3 metrics:

* modifiedArea : This is a simple count of the number of pixels for which the source does not match the destination (after they have both been expanded to the same size).
* movedArea    : The number of pixels for which nonzero motion was detected.
* residualArea : The number of pixels which differed between the resulting image and the second input image.

In other words,

  • if modifiedArea == 0, then the images were a pixel-perfect match. In this scenario, movedArea and residualArea will also be zero.
  • if modifiedArea > 0, then the images obviously differed. If residualArea == 0, then it tells us that all the differences could be accounted for by vertical motion and the rendering differences are mostly insignificant. In this scenario, movedArea tells us how many pixels were affected.
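
The two cases above, plus the remaining case (residualArea > 0), can be sketched as a small classifier. This is only an illustration in Python: the type and function names are hypothetical, and the labels "identical", "insignificant", and "significant" are chosen here for readability rather than taken from uprightdiff.

```python
from typing import NamedTuple


class UprightDiffMetrics(NamedTuple):
    """The three pixel counts described above (field names are illustrative)."""
    modified_area: int
    moved_area: int
    residual_area: int


def classify_diff(metrics):
    """Return a coarse label for a pair of compared screenshots."""
    if metrics.modified_area == 0:
        # Pixel-perfect match; moved and residual areas are zero as well.
        return "identical"
    if metrics.residual_area == 0:
        # Every difference is explained by vertical motion of content.
        return "insignificant"
    # Some pixels still differ after accounting for motion.
    return "significant"


if __name__ == "__main__":
    print(classify_diff(UprightDiffMetrics(0, 0, 0)))
    print(classify_diff(UprightDiffMetrics(1200, 1200, 0)))
    print(classify_diff(UprightDiffMetrics(1200, 800, 300)))
```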

The goal of generating a numerical score is to be able to (a) compare test results for different pages and identify the most significant ones, and (b) compare test results for the same page across test runs and determine whether our fixes improved or worsened the situation. With these goals in mind, the visual diffing code takes the totalArea of the image and uses the above 3 metrics to generate 2 different numbers (items 1 and 2 below), alongside an error flag (item 3).

  1. SignificantDiffMetric (when residualArea > 0): 75 * residualArea / totalArea + 0.25 * min(max(2^(residualArea / 100000) - 1, 0), 100)
  2. InsignificantDiffMetric (when residualArea == 0): 50 * modifiedArea / totalArea + 50 * movedArea / totalArea
  3. ErrorMetric: 1 if the test had a fatal error, 0 otherwise.

The total score is then computed as 1,000,000 * ErrorMetric + 1,000 * SignificantDiffMetric + InsignificantDiffMetric (In other words, this can be seen as a number in base-1000 notation).
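
The formulas above can be transcribed into a short Python sketch. This is not the visualdiff implementation, and the function names are hypothetical. It makes two labeled assumptions: whichever of the two diff metrics does not apply to a result is taken as 0, and a fatally failed test contributes only the error term.

```python
def significant_diff_metric(residual_area, total_area):
    """Metric 1 above; only meaningful when residual_area > 0."""
    ratio_term = 75 * residual_area / total_area
    # Capping the exponent at 7 does not change the result, because
    # 2**7 - 1 = 127 already exceeds the cap of 100; it only avoids a float
    # overflow for extremely large residual areas.
    exponent = min(residual_area / 100000, 7.0)
    size_term = 0.25 * min(max(2 ** exponent - 1, 0), 100)
    return ratio_term + size_term


def insignificant_diff_metric(modified_area, moved_area, total_area):
    """Metric 2 above; applies when residual_area == 0."""
    return 50 * modified_area / total_area + 50 * moved_area / total_area


def total_score(modified_area, moved_area, residual_area, total_area,
                fatal_error=False):
    """Combine the metrics into the single score described above."""
    if fatal_error:
        # Assumption: no usable images exist, so only the error term counts.
        return 1_000_000.0
    if residual_area > 0:
        significant = significant_diff_metric(residual_area, total_area)
        insignificant = 0.0  # assumption: not computed in this case
    else:
        significant = 0.0    # assumption: not computed in this case
        insignificant = insignificant_diff_metric(
            modified_area, moved_area, total_area)
    return 1_000 * significant + insignificant


if __name__ == "__main__":
    print(total_score(20_000, 10_000, 0, 1_000_000))
    print(total_score(150_000, 50_000, 100_000, 1_000_000))
```

Both diff metrics are bounded by 100, so the three components never overlap in the combined score.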

This scoring technique gives us what we want. In addition, the significant diff metric tries to flag pages that are really large (big totalArea value), that have a sizeable pixel diff (big residualArea), but which is fairly small relative to the size of the page (small residualArea / totalArea ratio). A simple residualArea / totalArea ratio would favor small pages with mostly insignificant residualArea values over large pages with mostly significant residualArea values. So, we pick a 1M area as our baseline and figure out how big the residual area is relative to that and use exponentiation to weight those heavily.
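
To see the effect of the exponential term, the following Python sketch compares a small page with a large one. The page sizes are made-up numbers for illustration only, the function names are hypothetical, and the divisor 100000 is the one written in the SignificantDiffMetric formula.

```python
def ratio_term(residual_area, total_area):
    """The plain residualArea / totalArea part of the metric, scaled to 75."""
    return 75 * residual_area / total_area


def size_term(residual_area):
    """The exponential part, which depends only on the absolute diff size."""
    # The exponent is capped at 7 purely to avoid float overflow; the result
    # is already clamped to 100 at that point (2**7 - 1 = 127).
    exponent = min(residual_area / 100000, 7.0)
    return 0.25 * min(max(2 ** exponent - 1, 0), 100)


# Illustrative inputs: (residualArea, totalArea) in pixels.
small_page = (50_000, 500_000)        # 10% of a small page differs
large_page = (500_000, 50_000_000)    # 1% of a very large page differs

if __name__ == "__main__":
    for name, (residual, total) in (("small", small_page),
                                    ("large", large_page)):
        ratio_only = ratio_term(residual, total)
        full = ratio_only + size_term(residual)
        print(name, round(ratio_only, 3), round(full, 3))
```

With these inputs, the ratio alone ranks the small page higher, while the full metric ranks the large page higher because its absolute residual area is ten times larger.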

We believe that this numeric metric lets us quickly identify problematic rendering differences and use mass visual diff testing without having to manually sift through thousands of diff images to identify where to focus our efforts.

Updating the VMs

Just to be clear, the above talks about labs VMs which, in the following discussion, are the hosts to the VMs that mediawiki-vagrant spins up. This section is about keeping mediawiki-vagrant and the VMs it spins up up-to-date.

In the future, it might be easiest to just create new labs VMs and start from scratch. has some notes from when we updated the VMs this way in 2018. In addition, the following notes might nevertheless be a useful guide in case problems arise while upgrading.

Troubleshooting notes from May 2020 while upgrading vagrant and mediawiki checkout

Keeping mediawiki-vagrant up-to-date is supposed to be as simple as git pull && vagrant provision. In practice, that wasn't so, most likely because of NFS issues that were left unresolved when setting it up. At the time, vagrant reload was abused until no errors were reported when starting up. To consistently start up without error, the suggestion from T139859 was used to set vagrant config nfs_shares no.

Unfortunately, after booting, the permissions in /vagrant are in a problematic state. In order to work around it, on the hosts, do sudo chown -R mwvagrant:wikidev /srv/mediawiki-vagrant and, in the VM, do sudo chown -R vagrant:www-data /vagrant. That at least allows for basic vagrant commands to work.

Generally, to update mediawiki in the VMs, vagrant ssh in and then fix the permissions. Then, instead of using vagrant git-update on the hosts, just invoke run-git-update from inside the VM.

There were a few other one-off problems when provisioning the VMs that required apt-get install php-redis php-igbinary php-luasandbox and fixing links to the available modules when going to PHP 7.2; these likely won't need repeating. These were from T213016 and T213993.

All that said and done, the major hurdle was that we were using an import from before the actor migration began. I imagine that because users weren't imported, when the schema migration scripts in maintenance/update.php ran, we ended up in a broken state. In order to fix the revision_actor_temp table, I just assigned all the actions to the Admin user. Based on T249185#6028521, in the VMs, create a file t.sql,

insert into revision_actor_temp (revactor_rev, revactor_actor, revactor_timestamp, revactor_page) select rev_id, 1, rev_timestamp, rev_page from revision r where not exists ( select 1 from revision_actor_temp a where a.revactor_rev = r.rev_id );

and then run,

#!/usr/bin/env bash

for db in $(alldbs); do
	echo $db
	mysql $db < t.sql