Help:Pickle

From MediaWiki.org

Warning: This is a work in progress. Apologies to everyone who wants to do full spec tests in Lua right away!

Pickle is a project to integrate continuous testing into an environment for continuous integration, such as Wikipedia, which uses Lua scripts. Within Wikipedia, Lua is used to implement advanced templates, and the same solution is used on many other websites and projects. Continuous integration is a core element of continuous deployment, which is very important for sites that must be up and running 24×7. Pickle uses spec-style testing, which can be described as a variant of unit testing, or how to make the thing right. Later on, Pickle might be extended with step-style testing, or how to build the right thing. What to build and how to build it are not the only concerns; it is also important to reduce the overall risk associated with running live Lua development within an active MediaWiki project.

In English, a pickle is a vegetable that has undergone pickling, such as a gherkin, which also happens to be the name of the Gherkin language used for step-style testing. At the same time, you pick at some code, trying to dissect it, finding flaws and correcting them in a continuous process.

Background

Testing allows us to say something qualitative about the software, that is, the modules in a wiki, such as the number and type of known defects, especially for functional requirements. That helps the developers to identify and fix defects during development, but it also builds the community's confidence in the final module. When someone later wants to use the module as part of a template, they can check the results and possibly be assured that this is, in fact, a good module before calling it from the template. If something goes wrong later on, possibly because of a defect in some other included module, then the tests will hopefully identify the root cause of the problem.[1]

In this solution, anyone can write the necessary tests, but it should be expected that the user developing the code also develops the tests. That user knows best what to test and how to test it. You would not accept half-done code, and you should not accept half-done tests either. Still, it is easy to forget some important cases, and when you see such cases, do add them! Better test coverage means fewer defects now and in the future.

Basic idea

Screendump of the "page status indicator" showing a good final result of test run.

When a tester (that is, a pickle) has run all its tests for a specific testee (that is, the lib or module) and all the tests have passed, the page is flagged as passed both by a page indicator and by a tracking category. If the tests start to fail for whatever reason, the page is flagged as failed by a page indicator and put in a category for failed libs. Whenever the state changes from one state to another, a log entry is also created. Together, these are the primary mechanisms for tracking down failing libs.
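The flagging described above can be sketched in a few lines of Lua. This is only an illustration and not how Pickle is implemented: the functions overallState and updateStatus, the page table, and the category names are all hypothetical.

```lua
-- Hypothetical sketch; none of these names come from Pickle itself.

-- Reduce a list of per-test booleans to a single state.
local function overallState( results )
	for _, ok in ipairs( results ) do
		if not ok then
			return 'failed'
		end
	end
	return 'passed'
end

-- Update the (assumed) page record and log only when the state changes.
local function updateStatus( page, results )
	local newState = overallState( results )
	local changed = page.state ~= newState
	if changed then
		table.insert( page.log, { from = page.state, to = newState } )
	end
	page.state = newState
	-- the category names are made up for this example
	page.category = newState == 'passed' and 'Passed libs' or 'Failed libs'
	return changed
end

local page = { state = nil, log = {} }
updateStatus( page, { true, true } )   -- unknown -> passed, logged
updateStatus( page, { true, true } )   -- still passed, no new log entry
updateStatus( page, { true, false } )  -- passed -> failed, logged
```

After the three runs the page is in the failed state and the log holds two entries, one for each change of state.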

The outer frame (world) should only try to test a single testee, and because of this there should only be a single describe. It is completely valid to have several, but that does not really make sense when testing single libs. Having several can be rephrased as a change of context, that is, a call to context, which implies a change of environment for running the testee. Such a change is given as a descriptive text, and data is extracted from that string. The extracted data snippets are then cast to the correct type and provided to the frame (world). Most of the time it will not be necessary to create a separate context, and there will only be a describe with a number of it calls.
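One way to picture the extraction and casting step is the following sketch. It assumes that quoted strings and bare numbers are the snippets of interest; the function extract is hypothetical, and the rules Pickle actually uses may differ.

```lua
-- Hypothetical helper: pull quoted strings and bare numbers out of a
-- description, cast them, and return them in order of appearance.
local function extract( description )
	local found = {}
	-- quoted strings stay strings
	for pos, str in description:gmatch( '()"([^"]*)"' ) do
		found[#found + 1] = { pos = pos, value = str }
	end
	-- blank out the quoted parts so digits inside them are not picked up
	local stripped = description:gsub( '"[^"]*"', function( s )
		return string.rep( ' ', #s )
	end )
	-- bare numbers are cast with tonumber
	for pos, num in stripped:gmatch( '()(%-?%d+%.?%d*)' ) do
		found[#found + 1] = { pos = pos, value = tonumber( num ) }
	end
	table.sort( found, function( a, b ) return a.pos < b.pos end )
	local values = {}
	for i, item in ipairs( found ) do
		values[i] = item.value
	end
	return values
end

local values = extract( 'with 3 retries it shall reply with "Hello"' )
```

Here values holds the number 3 followed by the string Hello, which is the kind of typed data that would be handed on to the frame.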

Assume that there is a module for the ubiquitous "Hello World". Its code is shown in example code 1.

local p = {}
function p.hello()
    return 'Hello'
end
return p

Example code 1, an example of the ubiquitous "Hello World". There is nothing extraordinary about the definition of this module.

The code for testing this module is shown in example code 2.

return describe 'the ubiquitous module for "Hello World"' (function()
	subject:hello()
	it 'shall reply with "Hello"' (function( str )
		expect( str ):toBeEqual()
	end)
end)

Example code 2, example of a spec-style test. There is (will be) both an implicit and explicit form of calls on subjects.[note 1]

Note that the inner it takes a string as an argument, the description, and that this string contains an inner string "Hello". This inner string is delivered to the "frame" ("world") and is used in the test. The outer describe also has an inner string in its description, but this is not used.

The first describe tries to attach the test to the correct lib, that is, the testee, and stores it as the subject.[note 2] The subject can be tested by comparing it to an expectation. The expectation is what you want a value from the subject to be, not the subject itself. This makes it possible to build a model of what the subject should be. You compare the model with the subject to evaluate it in some respect. How the valuation is done, and its outcome, is then reported. If every valuation passes, the overall state for the testee is set as passed.

Statements that are truthy (upvalues) are stated as expect something, but what about statements that should be falsy (downvalues)? Those statements use the function contradict. It is very similar to expect, but its final valuation is negated. This makes it simple to state something that the code should not fulfill. A closer inspection reveals that each test is a row in a Karnaugh map, with expect being rows with a truthy outcome and contradict those with a falsy outcome.[note 3]
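The relation between expect and contradict can be illustrated with a small stand-alone sketch. It is not Pickle's implementation: the helper make is hypothetical, and unlike example code 2 the actual value is passed explicitly to toBeEqual so that the block runs on its own.

```lua
-- Hypothetical stand-in for the two kinds of statements.
local valuations = {}

-- Build a statement function; `negate` flips the final valuation.
local function make( negate )
	return function( expected )
		return {
			toBeEqual = function( self, actual )
				local outcome = ( expected == actual )
				if negate then
					outcome = not outcome
				end
				valuations[#valuations + 1] = outcome
				return outcome
			end
		}
	end
end

local expect = make( false )
local contradict = make( true )

-- the testee from example code 1
local function hello()
	return 'Hello'
end

expect( 'Hello' ):toBeEqual( hello() )        -- row with a truthy outcome
contradict( 'Goodbye' ):toBeEqual( hello() )  -- row with a falsy outcome, negated

-- the overall state is passed only if every valuation passed
local state = 'passed'
for _, ok in ipairs( valuations ) do
	if not ok then
		state = 'failed'
	end
end
```

Both statements pass, so the overall state ends up as passed; swapping the two expected strings would make both valuations fail.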

The columns in a Karnaugh map represent the branching decisions in the code, that is, the nodes in a control flow graph, while the rows represent valuated paths. Because our tests only hit a fraction of the possible valuations, the actual valuations must be as representative as possible.[note 4] Usually, we put great effort into identifying border, edge, and corner cases, but if the code is badly written it can be very difficult or even impossible to hit all alternate valuations from the columns. Often the best we can do is to try to cover most of the branches (branch coverage, "edge cases"), that is, the number of tests should be of the same order as the cyclomatic complexity, or close to it but still lower. If we are adventurous, we can head for independent paths (path coverage).
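As a small worked case, consider a function with two decision points, which gives a cyclomatic complexity of three. The sketch below uses a hypothetical function classify and a hand-made hit table to record which branches the tests reach; it is plain Lua and makes no use of Pickle.

```lua
-- Hypothetical testee with two decisions (cyclomatic complexity 3).
local hit = {}

local function classify( n )
	if n < 0 then
		hit.negative = true
		return 'negative'
	elseif n == 0 then
		hit.zero = true
		return 'zero'
	end
	hit.positive = true
	return 'positive'
end

-- Three tests around the border at zero reach every branch.
local results = {
	classify( -1 ) == 'negative',
	classify( 0 ) == 'zero',
	classify( 1 ) == 'positive',
}

-- Count the branches that were actually reached.
local covered = 0
for _ in pairs( hit ) do
	covered = covered + 1
end
```

With only the first two tests, the positive branch would never be reached, and a defect there would go unnoticed.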

Notes

  1. In the given example the explicit form is used. This is available in the current implementation. It will probably also be possible to use the implicit form later on. That form modifies context and it so that member functions are redirected to the current subject instance, thereby making it possible to drop the explicit use of subject.
  2. The reference is to the same subject through the whole test run, and if the lib has side effects, that can create problems. If we recreated the full environment, a test run could easily take too long.
  3. All adaptations are assumed to be part of the universal quantification (∀), that is, for-all statements. Later it might be possible to define a frame that uses existential quantification (∃), that is, there-exists statements.
  4. Another formulation could be that we sample only some of the possible valuations from a possibly very large set and that a given set of tests could be incomplete.

References

  1. ISTQB: Certified Tester – Foundation Level Syllabus. Chapter 1.1, 1.2