Requests for comment/Unit testing

This document explains:
 * why MediaWiki needs even better unit testing,
 * what we're doing now,
 * and what we need to do next.

This is also a request for comments & commitments to make our testing environment better.

Why MediaWiki needs even better unit testing
There are lots of approaches to software testing, and "integration testing" (testing the integrated whole as it operates together) is one of them. But we've found that automated integration testing of MediaWiki is less useful, and more likely to break. That's why many MediaWiki contributors want to focus on "unit testing" right now.

Unit testing
In unit testing, you precisely and quickly test one isolated part at a time (a unit). For each unit, there are one or more tests. And, instead of mucking around trying to create realistic input that tests every edge case, each test uses a very precisely defined decoy object called a mock. Mock objects are stand-ins for all the code that the test isn't checking; each mock thus isolates the function. For example, if you want to test how an application deals with database failures, you create a mock database that's programmed to fail at appropriate times, and then test the database access code against that mock.
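The mock database idea can be sketched in a few lines of JavaScript. This is only an illustration, not MediaWiki code: `loadTitle`, `makeMockDb`, `selectRow` and `check` are hypothetical names, and the sketch uses a hand-rolled check helper instead of QUnit so that it runs on its own.

```javascript
// Unit under test (hypothetical): loads a page title and falls back
// gracefully when the database fails. `db` is passed in so that a test
// can hand over a mock instead of a real connection.
function loadTitle(db, pageId) {
  try {
    const row = db.selectRow("page", pageId);
    return row ? row.title : "(missing)";
  } catch (err) {
    return "(database unavailable)";
  }
}

// Hypothetical mock: returns canned rows and throws on the call numbers
// listed in `failOn` (1 = first call, 2 = second call, ...).
function makeMockDb(rows, failOn = []) {
  let calls = 0;
  return {
    selectRow(table, id) {
      calls += 1;
      if (failOn.includes(calls)) {
        throw new Error("mock: connection lost");
      }
      return rows[id] || null;
    },
    callCount: () => calls,
  };
}

// Minimal check helper so the sketch needs no test framework.
function check(actual, expected, label) {
  if (actual !== expected) {
    throw new Error(`${label}: expected ${expected}, got ${actual}`);
  }
}

// Normal paths: row present, row absent.
const healthy = makeMockDb({ 1: { title: "Main Page" } });
check(loadTitle(healthy, 1), "Main Page", "row found");
check(loadTitle(healthy, 2), "(missing)", "row absent");

// Failure path: the mock is programmed to fail on its second call.
const flaky = makeMockDb({ 1: { title: "Main Page" } }, [2]);
check(loadTitle(flaky, 1), "Main Page", "first call succeeds");
check(loadTitle(flaky, 1), "(database unavailable)", "second call fails");
```

Because the failure is scripted by call number, the error-handling branch is exercised on every run without needing a real database outage.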

If the mock is right, then every time an automated test suite runs the unit test, the test will thoroughly exercise all the function's code paths, and we'll know if there's a failure.

So you can see that unit tests are a valuable form of automation (and of automated regression testing). You instantly see not just that something's broken but what's broken. When someone breaks a unit test, they know there is a specific function they need to go investigate. When developers want to refactor code, improve its performance or add a new feature, the automated regression testing is a huge time saver.

(And to be honest, unit tests make it easy to encourage people to write quality code. If a test fails, either the function needs fixing or the test is wrong and needs fixing.  And either way, it's just one small, well-defined thing, so it's ridiculously specific and easy-to-act-on feedback.)

The future
Unit testing highlights breaking code and makes it easier for developers to find what they broke. Unit tests, done properly and running quickly and consistently, should be our first line of defense in baking quality into MediaWiki. This is why we're prioritizing unit testing frameworks and tools.

What we're doing right now
MediaWiki, as a 2011 web application, has two separate code trees using different programming languages: JavaScript is run by the client in a web browser, and PHP is run on the application servers. So we use two unit testing frameworks:
 * for JavaScript: QUnit (but it's really doing integration testing)
 * for PHP: PHPUnit

We also have a homemade integration testing system for the Parser code (parser tests).

JavaScript: From Selenium to QUnit & TestSwarm
We tried Selenium Grid for automated integration testing, then decided that approach had fundamental problems and switched to TestSwarm.

QUnit is a JavaScript unit testing framework developed by the jQuery project. In this case we're using it more for integration testing. TestSwarm is a platform that continuously distributes these unit tests to different browsers through its swarm.

TestSwarm does its distribution solely through JavaScript and the browser, thus making it possible for any OS or browser to join the swarm (including mobile browsers). TestSwarm is to be installed on a WMF server in July 2011. Currently, an instance of TestSwarm runs on the Toolserver and, although limited by max_user_connections, is already bearing fruit. Krinkle leads this implementation effort; see Krinkle's Berlin hackfest presentation.

But using TestSwarm doesn't rule out using virtual machines. Ashar notes that we can reproduce the Grid's advantages by running browsers in VMs and pointing their homepages to the TestSwarm. We will probably (if needed) maintain a few VMs running with old versions of browsers like IE or Safari.

The JavaScript tests only cover code run on or outputted to the client side. For the server side, we have to test the PHP source code. So we use:

PHP: PHPUnit, CruiseControl, and parser tests
PHPUnit is a testing framework for PHP, much like QUnit is for JavaScript. CruiseControl is a platform to automatically run tests against new code.

For example, we have parser tests, a homegrown set of tests that have been around for a very long time. They use strange edge-case markup as test case inputs and confirm that the parser parses them into the expected HTML outputs. These are now integrated into our PHPUnit tests and thus run every time CruiseControl runs.

Originally, Hexmode set up CruiseControl in mid-2010. By late 2010, people weren't paying attention to it, and it was no longer functioning. As of July 2011, Chad, Ashar and others have repaired CruiseControl. It currently fetches MediaWiki code every 2 minutes and runs PHPUnit automatically against the new code. If there are any failures, CC saves them as a text file. A bot then fetches it and announces the result in the #mediawiki IRC channel, spitting an error if something fails.

Because it runs on a schedule, rather than as a post-commit hook, it runs only once for a set of commits, even if it's testing 3 or 4 commits in a row. That has confusing consequences for the yelling bot. Right now, it's possible to get into a situation where there are 2, 5 or more commits, one of which broke a test, but we're not sure which. This means the test breakage leads to an incomplete sense of ownership & shame. The setup is good, but we need to improve it to be truly effective.

Moreover, if a revision fails, every subsequent run will be marked as a failure. That makes it hard to diagnose the root cause.
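One way to narrow things down is to look at the run history and find where the current failing streak began. The sketch below is an assumption about how that could be done, not something CruiseControl provides; the `runs` data, the revision numbers and `suspectCommits` are all invented for illustration.

```javascript
// Each entry is one scheduled run: the commits it picked up and whether
// the suite passed. Revision numbers are made up.
const runs = [
  { commits: [101], passed: true },
  { commits: [102, 103, 104], passed: false }, // the streak starts here
  { commits: [105], passed: false },           // failure carried over
  { commits: [106, 107], passed: false },      // failure carried over
];

// Returns the commits from the run that started the current failing
// streak, or an empty list if the latest run passed.
function suspectCommits(history) {
  let i = history.length - 1;
  if (i < 0 || history[i].passed) {
    return []; // nothing is currently broken
  }
  // Walk back while the previous run also failed.
  while (i > 0 && !history[i - 1].passed) {
    i -= 1;
  }
  return history[i].commits;
}

console.log(suspectCommits(runs)); // [ 102, 103, 104 ]
```

This still leaves three candidate commits, which is exactly the ambiguity described above; only a run per commit would reduce the list to one.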

What we should do next
What's supposed to happen:

Anytime you develop a new piece of code, it would be great if you also supplied unit tests for it.

BIG PROBLEM:

Our unit test coverage is terrible, 2% at most.

But our codebase is so spaghetti & global-ridden that writing proper unit tests will be very difficult until we do some refactoring, which will be very time-consuming. Our codebase is cluttered and messy. We are scaring off new contributors & slowing down experienced contributors. And we see now that the codebase's messiness is also stopping us from taking steps towards automating quality into future releases.

This is the bulk of our technical debt that is keeping us from achieving velocity.

PHPUnit has some tools & tricks built in to deal with legacy codebase issues. For example, you can make a copy of the global namespace, modify it during the test, and restore it afterwards. The problem with this approach is that we have so many globals that the call takes about 2 minutes, which is unacceptable performance for a post-commit hook.

We want to focus on the testing that is amenable to post-commit hooks. For example, Ashar suggests we could set up a sanity-check group of fast tests for a post-commit hook, and only run the slower tests locally or in CruiseControl.
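The fast/slow split could look roughly like the sketch below. It assumes every test is labelled with a group; the `registry` entries, the group names and `runGroup` are hypothetical and do not describe an existing MediaWiki test runner.

```javascript
// Hypothetical registry: every test declares a group, "fast" or "slow".
const registry = [
  {
    name: "title normalization",
    group: "fast",
    run: () => "Main Page".replace(/ /g, "_") === "Main_Page",
  },
  { name: "empty input", group: "fast", run: () => "".trim() === "" },
  // Stands in for a long-running test.
  { name: "full parser suite", group: "slow", run: () => true },
];

// Runs only the tests in the requested group and reports failures by name.
function runGroup(tests, group) {
  const selected = tests.filter((t) => t.group === group);
  const failures = selected.filter((t) => !t.run()).map((t) => t.name);
  return { ran: selected.length, failures };
}

// A post-commit hook would run just the "fast" group; the "slow" group
// would be left to local runs or the scheduled run.
const sanity = runGroup(registry, "fast");
console.log(`${sanity.ran} fast tests, ${sanity.failures.length} failures`);
```

Keeping the group label next to each test means the hook and the scheduled run can share one registry and differ only in the group they ask for.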

We already have the infrastructure (CruiseControl and PHPUnit), and the interest and skill to get it done. Let's do high-yield stuff first.

Request for comments & commitments
What are the next steps?

Better code coverage would be excellent. One suggestion: "70% should probably be our aim, with critical code (such as validating user input) being at 100% and carefully reviewed to make sure we actually cover 100%."
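A coverage report could be checked against those two targets along these lines. This is a sketch under assumptions: the report format, the component names, the line counts and `checkCoverage` are invented; only the 70% overall and 100% critical targets come from the suggestion above.

```javascript
// Hypothetical coverage report: covered and total lines per component,
// with critical components flagged. All numbers are made up.
const report = [
  { file: "input validation", critical: true, covered: 50, total: 50 },
  { file: "parser", critical: false, covered: 120, total: 200 },
  { file: "special pages", critical: false, covered: 30, total: 50 },
];

// Checks the overall percentage against the target and lists critical
// components that are not fully covered.
function checkCoverage(files, overallTarget = 70) {
  const covered = files.reduce((n, f) => n + f.covered, 0);
  const total = files.reduce((n, f) => n + f.total, 0);
  const overall = total === 0 ? 0 : (covered / total) * 100;
  const criticalGaps = files
    .filter((f) => f.critical && f.covered < f.total)
    .map((f) => f.file);
  return { overall, meetsOverall: overall >= overallTarget, criticalGaps };
}

const result = checkCoverage(report);
console.log(result.overall.toFixed(1), result.meetsOverall, result.criticalGaps);
```

Line counts alone cannot confirm that critical code is covered meaningfully, which is why the suggestion also asks for careful review.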

So: remember to try to write testable code and start considering writing tests alongside code you contribute. Writing these tests and mocks correctly, and running them quickly and consistently, is important to making good unit tests & getting what we want out of unit testing as a practice.

If you're getting rid of unnecessary globals when you see them, great -- keep doing that. And Chad's global config object project is a strong step towards getting us out of global hell so we can have proper unit testing in the future, so please comment on his request for comments.
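To see why globals get in the way of unit testing, compare a function that reads a global with one that receives its settings as an argument. The sketch is in JavaScript for brevity and is not the design proposed in Chad's request for comments; `wgSitename` is named after MediaWiki's `$wgSitename` setting, while `pageTitle`, `makeConfig` and `get` are hypothetical.

```javascript
// Before: the function reads shared global state, so a test has to
// change that state and remember to put it back.
globalThis.wgSitename = "My Wiki";
function pageTitleFromGlobal(page) {
  return `${page} - ${globalThis.wgSitename}`;
}

// After: settings arrive as an argument, so a test can pass its own.
function pageTitle(page, config) {
  return `${page} - ${config.get("Sitename")}`;
}

// Hypothetical config object wrapping a plain map of settings.
function makeConfig(settings) {
  return {
    get(name) {
      if (!Object.prototype.hasOwnProperty.call(settings, name)) {
        throw new Error(`unknown setting: ${name}`);
      }
      return settings[name];
    },
  };
}

// The test builds exactly the configuration it needs; no global is touched.
const testConfig = makeConfig({ Sitename: "Test Wiki" });
console.log(pageTitle("Sandbox", testConfig)); // Sandbox - Test Wiki
```

With the second form there is nothing to back up or restore between tests, so the cost of copying the global namespace goes away.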

Statistics might also help us. Ashar says: "It would be great to have statistics for most changed functions between versions, or over 1, 3 or 6 months periods. That can help us identify the code which is stable enough and the parts that keep breaking/changing."
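Such statistics could be computed from a change log that records which functions each commit touched. The sketch below assumes that such a log exists; the `changeLog` entries, the dates and `mostChanged` are made up for illustration, and extracting the log from version control is not shown.

```javascript
// Hypothetical change log: one entry per commit, with its date and the
// functions it touched. The data is invented.
const changeLog = [
  { date: "2011-01-10", functions: ["Parser::parse", "Title::newFromText"] },
  { date: "2011-03-02", functions: ["Parser::parse"] },
  { date: "2011-05-20", functions: ["Parser::parse", "Sanitizer::removeHTMLtags"] },
  { date: "2011-06-30", functions: ["Title::newFromText"] },
];

// Counts changes per function between two ISO dates (inclusive) and
// returns [name, count] pairs, most changed first, ties sorted by name.
function mostChanged(log, since, until) {
  const counts = new Map();
  for (const entry of log) {
    // ISO dates (YYYY-MM-DD) compare correctly as plain strings.
    if (entry.date < since || entry.date > until) continue;
    for (const fn of entry.functions) {
      counts.set(fn, (counts.get(fn) || 0) + 1);
    }
  }
  return [...counts.entries()].sort(
    (a, b) => b[1] - a[1] || (a[0] < b[0] ? -1 : a[0] > b[0] ? 1 : 0)
  );
}

// Six-month window covering the whole sample log.
console.log(mostChanged(changeLog, "2011-01-01", "2011-06-30"));
```

Running the same function over 1, 3 and 6 month windows would show which functions have settled down and which keep changing.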

How can you help?

 This document was originally written by Sumana with material by RobLa, Chad, Krinkle, and Hashar.