Quality Assurance/Refactoring 2014

What I learned from refactoring browser tests

Background, or why I spent so much time on this

In November of 2014 we published the "Team Health Check" survey results.

The least satisfactory category was "Support":

  • Teams experience too many external dependencies and feel like they can’t get them resolved quickly
  • Teams share a sense that things often go sideways that are outside the control of the team that they can’t get resolved
  • Teams share a sense of difficulty in escalating problems/blockers/etc externally
  • Teams share a sense of 'always waiting'

One reason to refactor the browser tests was to ease some of the stress in these areas: I can improve support from QA by improving and modernizing the browser test code across all the repositories that have it.

Another area with issues that caught my interest was "Code Quality":

  • Legacy stuff is getting everyone down! Lots of technical debt and teams are struggling to figure out how to cope with this
  • Folks generally feeling better about more recently developed code. Teams are generally increasing test coverage, documentation, refactoring

I do not want the browser test code to become legacy ever.

The updates every repository needs

We have two modernization efforts under way that need to be applied to all the existing code. For one thing, we need to update the syntax of the test assertions to use RSpec3 instead of RSpec2. While the two versions can coexist for some time, RSpec3 is the superior choice for a number of reasons.

For another thing, we've recently brought in a Ruby linter called "rubocop" to enforce important style rules.

The timing was good, so I set out first to update the assertions in our tests for RSpec3, and then to update the format of all the code to conform to rubocop style rules. Along the way I put things in alphabetical order and generally "swept the floor".
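
To make the assertion change concrete, here is a small illustrative sketch (the ViewPage class and its elements are hypothetical, not taken from any particular repository) of moving from the RSpec2 "should" syntax to the RSpec3 "expect" syntax:

  # RSpec2 style: the monkey-patched "should" syntax, with be_true.
  #   on(ViewPage).heading.should == 'Refactoring example'
  #   on(ViewPage).heading_element.visible?.should be_true

  # RSpec3 style: explicit expect(), with eq and be_visible matchers.
  expect(on(ViewPage).heading).to eq('Refactoring example')
  expect(on(ViewPage).heading_element).to be_visible

rubocop then keeps the surrounding Ruby consistent, with rules like preferring single-quoted strings where no interpolation is needed, so that every repository reads the same way.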

A funny thing happened on the way to the refactoring

As I looked at the assertions I was updating, I started finding areas of confusion in the tests. As I went along, I attempted to change the tests so that they would continue to be effective but would express their intent more clearly. I did this for a number of reasons: ease of maintenance was one, convenience in demonstrating the tests to others was another, and better understanding of failures was a third.

The issues I encountered and (mostly) fixed along the way fall into a few rough categories:

  • Confusion about intent of the steps in the test, code re-use (good) and tight coupling (bad), and what UI tests can and cannot accomplish
  • Fighting the test framework instead of using the power it provides
  • Coding for the future: organization and alphabetization, TODO/FIXIT comments, cruft in general

I will be having conversations about these areas at the Tech Days this year. But for now, here is an overview of each area of concern:

Intent and purpose

When you have a hammer, everything looks like a nail. It is tempting to try to test everything with browser tests, but in practice they are effective for certain purposes and not others.

Browser tests should emulate those actions a user would take in order to demonstrate that some feature is behaving as the user would expect. A browser test should be an arrow fired at a target.

Ideally, a browser test will set up some sort of initial conditions. Think of these as nouns: the stuff that has to be in place before the user can start testing the feature. This might mean logging in, or it might mean that a particular wiki page with particular content has to be in place. These conditions should be in the "Given" steps of the Cucumber Scenario.

Given the initial conditions, the user will take some set of actions to arrive at a destination where the feature being tested can be assessed. Think of these as verbs. This might mean typing in text areas; setting radio buttons, checkboxes, or select menus; clicking Save. These are the When steps.

And having navigated to a place in the application where the effects of the feature should be visible to users, the test should assert that those conditions do in fact exist. These are the Then steps. Two things make Then steps different from Given and When steps: the text of the Cucumber Scenario step ought to contain the word "should", and the step itself should contain an RSpec assertion like "expect(foo).to be_visible" or similar.
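
Putting the three kinds of steps together, a Scenario might look something like the following sketch (the feature, step text, and page class are invented for illustration):

  Feature: Page editing

    Scenario: Saving an edit shows the new text
      # Given steps: the nouns, the state that must exist before the test starts
      Given I am logged in
      # When steps: the verbs, the most direct route to the feature under test
      When I click the edit link
        And I type "refactoring example" into the edit box
        And I click save
      # Then steps: "should" in the text, an RSpec assertion in the step code
      Then the page text should contain "refactoring example"

And the Then step definition is where the RSpec assertion lives:

  Then(/^the page text should contain "(.*?)"$/) do |text|
    # The page-object "on" helper and the ViewPage class are assumed here.
    expect(on(ViewPage).text).to include(text)
  end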

However, nothing prevents test designers from using Then steps as Given or When, or Given steps as Then steps, or any permutation. I saw this often in the tests that I refactored. This is an issue because it takes valuable time for maintainers to figure out what the test is doing, and it confuses analysis when the test fails. Let's look at two examples.

The tester wants a certain condition to exist at the start of the test: a Given step. The tester sees that such an assertion already exists in a Then step, so they re-use the Then step in the Given clause. But when the test fails, someone analyzing it sees an error message from RSpec instead of from the test framework. And what if the Then step passes in its normal location but fails when used as a Given? Analyzing steps re-used in the wrong place can take significant effort to unravel.

From time to time I saw people get around this by creating a step with new text that simply calls a different step of a different Given/When/Then type. This is even worse, because in order to analyze the failure one has to dig into the code to find that the failing step is not actually the step they think it is, but is somewhere else in the code entirely. Having a step that does nothing but call a single other step is a "code smell".
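
A sketch of what this looks like in the step code (the step text and the API helper are invented for illustration), along with a more direct alternative:

  # Code smell: a step whose only job is to call a different step, of a
  # different type, with different text. Failure analysis now has to pass
  # through two layers of step text to find the code that actually ran.
  Given(/^there is a page with some content$/) do
    step 'the page text should contain "some content"'
  end

  # Clearer: let the Given step do its own setup work directly
  # (the "api" helper is hypothetical).
  Given(/^a page exists with the text "(.*?)"$/) do |text|
    api.create_page('Refactoring example', text)
  end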

Finally, let me emphasize that browser tests should have a particular purpose, to demonstrate that some feature is in place in the way that the user expects. Browser tests should not check every page for some error that may or may not exist, nor should browser tests attempt to validate features that are not manifested in the UI. Browser tests should set up their conditions and navigate by the most efficient route to the place where a particular feature is manifested and can be checked.

Fighting the test framework instead of using the power it provides

Re-using one step as a different kind of step is an example of tight coupling. When a Given step is re-used as a Then step, it confuses the behavior of the test, and when the behavior of the element acted upon by the step changes, the test can become meaningless. Likewise when some step only calls one other step but with a different description of behavior, the semantics of the test become confused in ways that only become apparent when the behavior of the application changes and the test fails for need of maintenance.

Our framework of Cucumber/page-object/watir-webdriver/RSpec provides a simple and powerful way to control the elements on a web page with code. In fact, the simplicity of the framework *is* the power of the framework.

It is tempting to notice that some elements on the page are similar, making it possible to loop over them, manipulating and checking. This is a bad idea.

It is tempting to note that some condition may or may not be in place, and to write a conditional statement to move the test along. This is also a bad idea.

In the course of refactoring all the tests, I encountered a number of instances of what I call "stunt programming". Rather than use the simple and powerful tools provided by the framework, test designers implement loops, complex data structures, and conditionals in an effort to be efficient. But this stunt programming is a false economy. It is in fact a species of tight coupling.

In the case of looping over constructs, we often find that the behavior with regard to the 0th element or the last element or some other element in the loop will change. When the test fails because of this, the maintainer has to deconstruct or dismantle the looping construct completely in order to isolate the proper behavior of the element that changed. This can be extraordinarily expensive. It is far better to use the simple and direct controls provided by the framework, and refrain from stunt programming.
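
As a hedged sketch of the difference (the page class is only illustrative, though the element IDs follow MediaWiki's edit form), compare a loop over "similar" controls with the declarative accessors the page-object gem provides:

  require 'page-object'

  class EditPage
    include PageObject

    # Each control is declared by name; a change to one control is a
    # one-line change here, and every generated method fails on its own.
    text_area(:edit_box,  id: 'wpTextbox1')
    checkbox(:minor_edit, id: 'wpMinoredit')
    button(:save,         id: 'wpSave')
  end

  # Stunt programming: looping couples the controls together, and a change
  # to any one of them means dismantling the whole loop.
  #   %w(wpTextbox1 wpMinoredit wpSave).each { |id| browser.element(id: id).click }

  # Using the framework directly (the "on" helper comes from
  # PageObject::PageFactory in the Cucumber World): one line per intent.
  on(EditPage) do |page|
    page.edit_box = 'refactoring example'
    page.check_minor_edit
    page.save
  end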

Introducing a conditional instantly makes a test nondeterministic. For each run of the test, some branch of the conditional is *not* being checked, and the behavior of the unexercised branch can be surprising and confusing.
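
A minimal sketch of the problem (the step text, page class, and notice element are invented): on any given run only one branch actually executes, so whichever branch is skipped goes untested.

  # Code smell: the step silently does different things on different runs.
  Then(/^the notice should be dismissed$/) do
    if on(ArticlePage).notice_element.exists?
      on(ArticlePage).dismiss_notice
    end
    # When no notice is present, this step passes without checking anything.
  end

  # Deterministic alternative: make the precondition explicit so every run
  # follows the same path, and a missing notice is a real, visible failure.
  Given(/^a notice is displayed$/) do
    expect(on(ArticlePage).notice_element.when_present).to be_visible
  end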

Conditionals, loops, and complex data structures beyond what the framework provides are "code smells" and should be avoided. Find a way to work with the framework rather than against it.

Cruft

Just a few examples:

Comments in the Cucumber Scenarios

Cucumber Scenarios are free-form step-by-step text descriptions of what that test should be doing. It should rarely if ever be necessary to put comments in a Scenario; that is like commenting on a comment (on a comment). If you find yourself adding comments to Cucumber scenarios, you may not have a complete understanding either of what the test is supposed to be doing, or of what the feature itself is supposed to be doing.

TODO/FIXME

In some places I found comments like

  1. TODO this is dead code, it should be removed
  2. TODO this belongs in a different place

Just remove the dead code, please, or move the stuff to its proper place. Or don't. These comments are technical debt that someone has to address later, and the longer they sit the more confusing they get. It is better to leave the code alone than to drop a drive-by TODO comment. To paraphrase Yoda: there is only DO or NOT DO; there is no TODO.

  1. FIXME add scenario to check foo

The place for this is not a comment in the code; it is a branch in Gerrit with an unimplemented Scenario. It doesn't take any longer to make.
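
For example, the missing check can live in a Gerrit patch as a real Scenario rather than as a comment; something like this sketch, with invented step text and a tag the suite could use to skip it until it is implemented:

  @pending
  Scenario: Foo is checked
    Given I am on a page where foo appears
    When I look at foo
    Then foo should behave as expected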

And just a few comments about structure:

  • Make the spacing in the Cucumber Scenarios nice, and keep them nice.
  • In the steps files, alphabetize each step within the categories of Given/When/Then. Not only does this make code easier to find when doing maintenance, but it also points out discrepancies if any steps are used incorrectly.
  • Put some thought into organizing the page objects. Each repository has different schemes for doing this, and a future project may be to standardize our approach.

Payoff

Refactoring all these tests was expensive. It was a significant investment. You may ask yourself what return we get from such an investment.

Investing in clean, understandable test code yields a number of payoffs. I have experienced these benefits in my daily work, and so have others:

  • Test results are not ambiguous in any way. Failures are easily diagnosed.
  • Maintenance becomes simpler. Eliminating stunt programming makes updating tests for new behavior much easier.
  • Training others is easier. It is much easier to bring new people up to speed when the code is where you expect to find it, and it does what it says it does.
  • Tests are faster. Removing cruft, replacing sleep() statements with appropriate waits, and taking out unnecessary steps all contribute to faster builds. In one case I shaved more than three minutes off a build by refactoring a poorly conceived sleep(10) into three different wait conditions (see the sketch after this list).
  • Better examples available today means better tests created in the future.
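
As a rough sketch of the sleep-to-wait refactoring mentioned above (EditPage and its elements are the hypothetical page class from earlier), the page-object and watir-webdriver APIs provide explicit waits that return as soon as their condition holds:

  # Before: always pays the full ten seconds, and still flakes when the
  # application is slower than expected.
  #   sleep 10
  #   on(EditPage).save

  # After: each wait lasts exactly as long as it needs to.
  on(EditPage) do |page|
    page.wait_until { page.edit_box_element.visible? }
    page.edit_box = 'refactoring example'
    page.save_element.when_present.click
  end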

So that's why I spent the end of 2014 refactoring browser tests. As the Team Health Check points out, we do not want the browser test suites to be an external dependency that drags on projects. We don't want browser tests to go sideways out of control (or even to be confusing in any way at all). We don't want these tests to block anything, or for anyone to encounter difficulty resolving issues with the tests, nor should people have to wait on anything in these repos.

We keep these tests valid, vibrant, and valuable by addressing the technical debt and by preventing them from becoming a legacy anchor on future work.

(Also see: Quality Assurance/Browser testing/Guidelines and good practices.)