Core Platform Team/Decisions Architecture Research Documentation/Using our Integration Testing Framework for Monitoring

A potential use case for our Integration Test Framework is as a monitoring tool to support health checks on services. As an initial step, we want to investigate whether the Integration Test Framework is sufficient to provide monitoring for instances of the Kask service.

Investigation
We began our investigation by comparing the current monitoring tool, Service-Checker, against our integration testing framework on a set of minimum requirements for web service monitoring. In this comparison, our integration testing framework comes out ahead of Service-Checker in the flexibility of its HTTP requests. Service-Checker only supports GET and POST requests, whereas our testing framework supports every method supported by Node. In addition, because Service-Checker relies on URL template interpolation, it limits the kinds of substitutions that can occur, while our testing framework allows any URL to be used.

On the other hand, this flexibility also creates a greater opportunity for abuse. Guidelines should therefore be in place, and a distinction must be made between thorough integration tests that run during development and monitoring tests that will run in production.

Proposed approach
To use Service-Checker post-merge and in production, we pass the base URL, the spec URL (the location of the tests), and the timeout (the number of seconds to wait for each request). To use our framework in the same way, we propose adding monitoring tests to each service’s repo, limiting dependencies to those currently used in our framework where appropriate, and ensuring all dependencies are installed when deploying the service. Then we would simply need a script that sets the base URL as an environment variable and invokes mocha with the location of the tests and the timeout.

As for the tests, they should be written with monitoring in mind. That is to say, they should exceed neither the timeout specified for each request (5 seconds) nor the budget for the entire service check (typically 60 seconds for Nagios-like systems). In addition, since all integration tests to date have been written with the Chai assertion library, monitoring tests should follow the same pattern and limit the use of other assertion libraries where appropriate.
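As a sketch of that timing discipline, the helper below bounds a single request by the 5-second per-request deadline. The mocha/chai wiring shown in the comment mirrors the pattern of the existing integration tests and is an assumption, not a prescription; `fetchStatus` is a hypothetical request helper.

```javascript
// Assumed usage inside a monitoring test, following the Chai pattern:
//
//   describe('service health', function () {
//     this.timeout(60000);  // whole check fits a Nagios-style 60 s budget
//     it('responds on the status endpoint', async () => {
//       const res = await withDeadline(fetchStatus(process.env.BASE_URL), 5000);
//       assert.equal(res.status, 200);  // chai's assert interface
//     });
//   });

// Reject if the wrapped request outlives its per-request deadline.
function withDeadline(promise, ms) {
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`exceeded ${ms} ms deadline`)), ms);
  });
  return Promise.race([promise, deadline]).finally(() => clearTimeout(timer));
}
```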

Finally, we suggest running both Service-Checker and our integration testing framework in parallel for a period of several weeks to gather live data and adjust the guidelines and implementation details as necessary.