As you'd expect in this field, with this kind of data, a lot of care is taken to understand how the app performs and what caused any changes in that performance across releases. There are many layers of testing.
In one of those layers we have medical test cases. (And these are literally cases in the medical sense.) Each one represents data about an individual who might present to a doctor with a particular set of symptoms, a given medical history, possible comorbidities and so on. Each also comes with a set of acceptable condition suggestions and other expectations about how the software should behave when asking questions of this kind of user.
My team has recently merged with another and taken over responsibility for the software that runs these test cases. This made me happy: a test runner is an attack vector for testing!
Having learned how the runner works, I've spent the last couple of weeks pairing with our medical staff, using the test cases and the runner to get "landscape views" of the performance of our software.
How? By taking a single test case and mutating it systematically on a handful of variables to generate loads of variant cases differing from one another only slightly, but predictably.
To take a simple example, let's say I have a test case that describes a 10-year-old boy with a nominated illness. I might generate 100 versions of that case that change only the age, i.e. the same case data for a male of age one, two, three, and so on up to 100 years. I'd then run them, parse the results of each case, and load it into a spreadsheet for analysis.
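To make that concrete, here's a minimal sketch of the generation step. It assumes, purely for illustration, that each case is a JSON file with a top-level age field and that the runner can be pointed at a single case file; the file names, the field name and the run_case.sh invocation are all hypothetical, not our real setup.

```bash
#!/usr/bin/env bash
# Illustrative only: assumes each test case is a JSON file with a top-level
# "age" field, and a runner that takes one case file and prints its results.
# All file names, field names and the runner invocation are hypothetical.
set -euo pipefail

base_case="base_case.json"
out_dir="variants"
mkdir -p "$out_dir"

# Generate one variant per age, changing nothing but the age field.
for age in $(seq 1 100); do
  jq --argjson age "$age" '.age = $age' "$base_case" > "$out_dir/case_age_${age}.json"
done

# Run each variant and capture its output alongside the case file.
for case_file in "$out_dir"/case_age_*.json; do
  ./run_case.sh "$case_file" > "${case_file%.json}.result"
done
```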
The obfuscated screenshot at the top is an example of one such experiment. The expected most likely condition is highlighted in green. Sorting and filtering the data shows that its position in the list of suggestions is invariant except for a few places, shown in the rows near the top of the image.
That's interesting to the doctors, and it's also very easy to look for and highlight in this kind of view.
Maybe you're thinking that equivalence class partitioning might have found the same thing. Well, you're right ... it might. But when the software under test is complex and likely to exhibit emergent behaviour, this kind of approach, making many small variations in inputs and comparing the outputs in bulk to identify patterns or outliers, can be a productive way to find places to dig into.
The approach feels like a close relative of metamorphic testing. There's nothing particularly complex about it either: I have a couple of bash scripts that mangle the data on the way in and collate it on the way out, and the test runner itself just does what it always did.
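If you wanted to try something similar, the collation half might look like the sketch below. It assumes each result file lists the suggested conditions one per line, best first, and that the age is encoded in the file name; the format and the names are guesses rather than the real thing.

```bash
#!/usr/bin/env bash
# Illustrative only: assumes each .result file lists suggested conditions one
# per line, best first, and that the file name encodes the age. The expected
# condition name and the result format are assumptions, not the real data.
set -euo pipefail

expected="Expected Condition"
echo "age,rank_of_expected" > landscape.csv

for result in variants/case_age_*.result; do
  age=$(basename "$result" .result | sed 's/^case_age_//')
  # grep -n prints "line:match"; the line number doubles as the rank.
  rank=$(grep -n -F "$expected" "$result" | head -n 1 | cut -d: -f1 || true)
  echo "${age},${rank:-not_found}" >> landscape.csv
done
```

Something like the resulting landscape.csv is what drops straight into a spreadsheet for the sorting and filtering described above.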
It's great to mutate, the title says, and it's right, but I also love exploiting existing test data and infrastructure to ask new questions.