Parsing/Visual Diff Testing

The code for generating visual diffs is mirrored from the integration/visualdiff repo on gerrit to github at:

GitHub:

diffserver/ has code for running a visual-diff server for generating diffs on demand.
testreduce/ has code for running mass visual diff testing via the testreduce setup, and for configuring the testreduce server.

Overview

We have visual diff code set up on parsing-qa-02.wikitextexp.eqiad1.wikimedia.cloud. parsing-qa-02 is a labs server and you can run visual diff tests only against public APIs (whether Parsoid, mediawiki, something else altogether).

Currently, there is one visual-diff instance on this VM.

http://parsoid-vs-core.wmflabs.org is used with the parsoid_vs_core and other databases, and parsoid-vs-core-vd and parsoid-vs-core-vd-client testreduce services. This instance is set up to compare Parsoid rendering and core parser rendering for production wiki pages.

Other visual diff instances can be set up as long as the right visualdiff, testreduce, proxy domain and nginx configs are updated.

Testreduce code

The testreduce code is in /srv/testreduce which is used to run the parsoid-vs-core-vd and parsoid-vs-core-vd-client services. The systemd controller files for these services are in /lib/systemd/system/parsoid-vs-core-vd.service and /lib/systemd/system/parsoid-vs-core-vd-client.services — these files have derived from the puppetized code for similar services on scandum used for Parsoid's roundtrip testing.

The testreduce server config is in /etc/testreduce/parsoid-vs-core-vd.settings.js. The testreduce client config is in /etc/testreduce/parsoid-vs-core-vd-client.config.js which also includes a section that provides the config for the visual diff tests that are to be run.

Visualdiff code

The visualdiff code is in /srv/visualdiff that also provides config and hooks to use it with testreduce. The file /etc/testreduce/parsoid-vs-coe-vd-client.config.js also provides the visualdiff config. It specifies how to fetch the HTML for the two screenshots, specifics uprightdiff as the diffing engine to use, and a few other parameters that control these -- the comments should be fairly self-explanatory. The uprightdiff code is in /srv/uprightdiff.

There is a separate helper service for viewing results for a single title without having to go digging for them in the directory containing them. On parsing-qa-02, the code in /srv/visualdiff/diffserver/diffserver.js is run as the visualdiff-item service. The config for this is in /etc/visualdiff/parsoid-vs-core-diffserver.config.js. The systemd controller file is in /lib/systemd/system/parsoid-vs-core-diffserver.service.

Managing services: parsoid-vs-core-vd, parsoid-vs-core-vd-client, parsoid-vs-core-diffserver

To {stop,restart,start} all clients:

sudo service parsoid-vs-core-vd-client stop
sudo service parsoid-vs-core-vd-client restart
sudo service parsoid-vs-core-vd-client start

Client logs are in systemd journals and can be accessed as:

### Logs for the parsoid-vs-core-vd-client service
# equivalent to tail -f <log-file>
sudo journalctl -f -u parsoid-vs-core-vd-client
# equivalent to tail -n 1000
sudo journalctl -n 1000 -u parsoid-vs-core-vd-client

### Logs of the parsoid-vs-core-vd testreduce server
sudo journalctl -f -u parsoid-vs-core-vd

### Logs of the parsoid-vs-core-diffserver service
sudo journalctl -u parsoid-vs-core-diffserver

The public-facing web UIs for these services are managed by a nginx config in /etc/nginx/sites-available/parsoid-vs-core-vd and provides access to the web UI for the parsoid-vs-core-vd and parsoid-vs-core-diffserver services and also enables directory listing for the screenshots generated during the test runs. The config should be self-explanatory.

Updating the code to test (and being run by the clients)

Unlike Parsoid where the code to test is determined by the latest git commit, in the parsoid-vs-core setup, the code to run lives on a separate VM, and sometimes the change might be in the config files, and may not be available in a git repository (at least as of today). The testreduce codebase implicitly assumes that the test to run is a git commit. However, the testreduce client config file (/etc/testreduce/parsoid-vs-core-vd-client.config.js) can declare a getGitCommit function that is then used by the server as clients to identify the test run in the database. So, in our case, this function simply returns a unique string identifying the test run. So, to initiate a new test run, simply change the string being returned by this function, save the file, and restart the parsoid-vs-core-vd-client service and you will be ready to go.

Anyway, here are the steps:

Login to parsing-qa-02.wikitextexp.eqiad1.wikimedia.cloud. Edit /etc/testreduce/parsoid-vs-core-vd-client.config.js and update the string in the getGitCommit function at the bottom.
Restarting the parsoid-vs-core-vd service shouldn't be necessary, but occasonally that service might crash and might need restarting

Updating the testreduce, visualdiff, uprightdiff code

Of course, there will continue to be bug fixes and tweaks to these codebases. To update the relevant code, simply go to /srv/testreduce, /srv/visualdiff, or /srv/uprightdiff, and do a git pull, and restart the affected services. As simple as that!

Retesting a subset of titles

The only way to do this is to clear the result entries in the mysql db. The mysql credentials (username, db, password) are in /etc/testreduce/parsoid-vs-core-vd.settings.js

mysql> update pages set claim_hash="",claim_num_tries=0, claim_timestamp=null,latest_stat=null,latest_result=null,latest_score=0,num_fetch_errors=0 where latest_score > 5000;

That will clear all test results for titles that have a score > 5000 which is equivalent to pages that have rendering diff > 5%. Score = errors * 1M + truncate(diff%) * 1000 + fractional-part-of-diff%. This weird scoring formula is just a result of shoe-horning the visualdiff results into the testreduce setup that was built for parsoid-rt testing. So, to clear test results for all erroring pages, you use latest_score >= 1000000 (or a really high score value less that 1 million).

Look at the schema for the pages table to clear results for other subsets.

Deleting 404-ing titles

Occasonally, titles in the test database might be deleted on the wiki -- so visual diff tests for these titles will start failing after that point, and will show up as an erroring title. Here is how you would go about detecting and deleting those:

mysql> select page_id from results where result like '%code: 404%' and commit_hash='.. latest hash ..';
.. this gives you a set of page ids ...
mysql> delete from stats where page_id in (... above set ...);
mysql> delete from results where page_id in (... above set ...);
mysql> delete from pages where id in (... above set ...);

Resource usage and # of test clients

parsng-qa-01 is a large labs vm with 12 cpu cores, 32 gb memory, and a 400+gb disk. Even so, visual diff testing can use up all these resources. 20 testreduce clients seem to be about the upper-end of how many can be run at the same time. This is enough to sometimes bring cpu load to 13-15 and memory usage to 28+gb. Probably 16 clients is a more comfortable number. The # of test clients to run can be tweaked by editing /lib/system/systemd/parsoid-vs-core-vd-client.service

The screenshots from puppeteer and from uprightdiff are written to /data/visualdiffs/pngs organized by wiki prefix. These images are overwritten with each test run. It takes too much disk space to store these images per test run. 125GB is used per test run. But, in the future, we could consider storing results from the most recent 2-3 runs or get a larger disk and expand that range a bit more.

Web UI for browsing results

The screenshots from puppeteer and from uprightdiff are written to /data/visualdiffs/pngs organized by wiki prefix and are accessible via HTTP @ http://mw-expt-tests.wmflabs.org/visualdiff/pngs/.

However, a better way of browsing these results is via the parsoid-vs-core-vd web UI at http://mw-expt-tests.wmflabs.org. The /topfails link sorts results in descending order of score which makes it easy to look at pages that generate the most prominent diffs first. The @remote link on these results listing page is a easy way to look at the 2 HTML screenshots and the uprightdiff screenshot. That output is outsourced to the visualdiff-item service. It simply links to the existing screenshots (or if missing, generates them on demand).

Uprightdiff numeric scoring

Uprightdiff compares the two candidate images and returns 3 metrics:

* modifiedArea : This is a simple count of the number of pixels for which the source does not match the destination (after they have both been expanded to the same size).
* movedArea    : The number of pixels for which nonzero motion was detected.
* residualArea : The number of pixels which differed between the resulting image and the second input image.

In other words,

if modifiedArea == 0, then the images had pixel-perfect match. In this scenario, movedArea and residualArea will also be zero.
if modifiedArea > 0, then the images obviously differed. If residualArea == 0, then it tells us that all the differences could be accounted for by vertical motion and the rendering differences are mostly insignificant. In this scenario, movedArea tells us how many pixels were affected.

The goal of generating a numerical score is to be able to (a) compare test results for different pages and identify the most significant ones, and (b) compare test results for the same page across test runs and determine whether our fixes improved or worsened the situation. With these goals in mind, the visual diffing code takes the totalArea of the image and uses the above 3 metrics to generate 2 different numbers.

SignificantDiffMetric (when residualArea > 0): 75 * residualArea / totalArea + 0.25 * min(max(2^(residualArea / 100000) - 1, 0), 100)
InsignificantDiffMetric (when residualArea == 0): 50 * modifiedArea / totalArea + 50 * movedArea / totalArea
ErrorMetric: 1 if the test had a fatal error, 0 otherwise.

The total score is then computed as 1,000,000 * ErrorMetric + 1,000 * SignificantDiffMetric + InsignificantDiffMetric (In other words, this can be seen as a number in base-1000 notation).

This scoring technique gives us what we want. In addition, the signficant diff metric tries to flag pages that are really large (big totalArea value), that have a sizeable pixel diff (big residualArea), but which is fairly small relative to the size of the page (small residualArea / totalArea ratio). A simple residualArea / totalArea ratio would favor small pages with mostly insignificant residualArea values over large pages with mostly significant residualArea values. So, we pick a 1M area as our baseline and figure out how big the residual area is relative to that and use exponentiation to weight those heavily.

We believe that this numeric metric lets us quickly identify problematic rendering differences and use mass visual diff testing without having to manually sift through thousands of diff images to identify where to focus our efforts.

Updating the VMs

Just to be clear, the above talks about labs VMs which, in the following discussion, are the hosts to the VMs that mediawiki-vagrant spins up. This section is about keeping mediawiki-vagrant and the VMs it spins up up-to-date.

In the future, it might be easiest to just create new labs VMs and start from scratch. https://phabricator.wikimedia.org/T204566#4797907 has some notes from when we updated the VMs this way in 2018. In addition, the following notes might nevertheless be a useful guide in cases problems arise while upgrading.

Troubleshooting notes from May 2020 while upgrading vagrant and mediawiki checkout

Keeping mediawiki-vagrant up-to-date is supposed to be as simple as git pull && vagrant provision. In practice, that wasn't so. This is most likely because of nfs issues that were left unresolved when setting it up. At the time, vagrant reload was abused until no errors were reported when starting up. To consistently startup without error, the suggestion from T139859 was used to set vagrant config nfs_shares no.

Unfortunately, after booting, the permissions in /vagrant are in a problematic state. In order to work around it, on the hosts, do sudo chown -R mwvagrant:wikidev /srv/mediawiki-vagrant and, in the VM, do sudo chown -R vagrant:www-data /vagrant. That at least allows for basic vagrant commands to work.

Generally, to update mediawiki in the VMs, vagrant ssh in and then fix the permissions. Then, instead of using vagrant git-update on the hosts, just invoke run-git-update from inside the VM.

There were a few other one off problems when provisioning the VMs that required apt-get install php-redis php-igbinary php-luasandbox and fixing links to the available modules when going to php 7.2 that won't likely need repeating. These were from T213016 and T213993

All that said and done, the major hurdle was that we were using an import from before the actor migration began. I imagine that because users weren't imported, when the schema migration scripts in maintenance/update.php ran, we ended up in a broken state. In order to fix the revision_actor_temp table, I just assigned all the actions to the Admin user. Based on T249185#6028521, in the VMs, create a file t.sql,

insert into revision_actor_temp (revactor_rev, revactor_actor, revactor_timestamp, revactor_page) select rev_id, 1, rev_timestamp, rev_page from revision r where not exists ( select 1 from revision_actor_temp a where a.revactor_rev = r.rev_id );

and then run,

#!/usr/bin/env bash

for db in $(alldbs); do
	echo $db
	mysql $db < t.sql
done