Using our Integration Testing Framework for Monitoring

Monitoring of production web services is currently accomplished with Service-Checker, which reads test data from an OpenAPI extension. This works well but has a number of drawbacks.

For example: OpenAPI specifications are a machine-readable definition of a service's interface, information that moves in lock-step with the implementing code. However, the extension used to define tests often requires case-by-case configuration. This mixing of concerns is problematic: it leads to duplication, fragmentation of the configuration, or worse, it encourages an even tighter coupling by using the specification as application configuration (see RESTBase for an example of the latter).

Additionally, Service-Checker has a hard dependency on OpenAPI specifications, and consequently offers no answer for non-RESTful HTTP interfaces (MediaWiki's Action API, for example).

Finally, as we begin standardizing integration testing around our Integration Testing Framework, we want to minimize duplication (DRY) with regard to monitoring. Monitoring tests are, in fact, integration tests, and as an initial step we want to investigate whether the Integration Testing Framework is sufficient to provide monitoring for instances of the Kask service.

Investigation
We begin our investigation by comparing the current monitoring tool, Service-Checker, to our Integration Testing Framework against a set of minimum requirements for web service monitoring. In that comparison, our testing framework meets the requirements for monitoring and even excels in certain aspects, such as HTTP requests: it supports sending all HTTP methods, whereas Service-Checker supports only GET and POST. And because Service-Checker relies on URL template interpolation, it limits the kinds of substitutions that can occur, while our testing framework allows any URL to be used.

On the other hand, the flexibility our testing framework provides also creates a greater opportunity for abuse. We therefore suggest that all integration tests, whether used for monitoring or not, follow the patterns and guidelines set in the Integration Testing Framework repo.

Proposed approach
We are not proposing a change in overall functionality, but rather a change in test runner, from Service-Checker to our Integration Testing Framework. Currently, to use Service-Checker post-merge and in production, we pass the base URL, the spec URL (the location of the tests), and the timeout (the number of seconds to wait for each request). To use our framework in the same way, we propose adding integration tests to each service’s repository, limiting dependencies to those currently used in our framework where appropriate, and ensuring all dependencies are installed when the service is deployed. We would then only need a script that sets the base URL as an environment variable and invokes Mocha with the location of the tests (file(s) or directory) and the timeout.
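A minimal wrapper might look like the following. The script name, the BASE_URL environment variable, and the default paths are illustrative assumptions, not settled conventions; it is shown here as a dry run that prints the command it would execute:

```shell
#!/bin/sh
# monitor-service.sh -- hypothetical wrapper mirroring Service-Checker's
# parameters (base URL, test location, timeout) for a Mocha-based check.

BASE_URL="${1:-http://localhost:8081}"   # service under test (example default)
TEST_PATH="${2:-test/monitoring}"        # file or directory of tests
TIMEOUT_MS="${3:-5000}"                  # per-test timeout, 5 seconds

# The tests read BASE_URL from the environment; Mocha enforces the timeout.
# Dry run: print the invocation instead of exec-ing it.
echo "BASE_URL=$BASE_URL npx mocha --timeout $TIMEOUT_MS --recursive $TEST_PATH"
```

Mocha’s `--timeout` flag covers the per-test budget; the overall service-check budget would still be enforced by the monitoring system invoking the script.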

As for the tests themselves, they should be written with monitoring in mind. That is to say, they should not exceed the timeout specified for each request (5 seconds), nor the budget for the entire service check (typically 60 seconds for Nagios-like systems). Also, since all integration tests to date have been written with the Chai assertion library, monitoring tests should follow the same pattern and limit the use of other assertion libraries where appropriate.

Finally, we suggest running both Service-Checker and our integration testing framework in parallel for several weeks to gather live data and to adjust the guidelines and implementation details as necessary.