On Fridays I pair with doctors from Ada's
medical quality team. It's a fun and productive collaboration where I gain
deeper insight into the way that diagnostic information is encoded in our
product and they get to see a testing perspective unhindered by domain
knowledge.
We meet at the same time each week and decide on our focus late,
choosing something that one of us is working on that's in a state
where it can be shared. This week we picked up a task that I'd been hoping to
get to for a while: exploring an API which takes a list of symptoms and
returns a list of potential medical conditions that are consistent with those
symptoms.
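For the sake of the sketches below, imagine request and response shapes something like these (hypothetical, for illustration only, not necessarily the real schema):

{"symptoms": ["headache", "fever", "stiff neck"]}

{"conditions": [{"name": "A"}, {"name": "B"}, {"name": "C"}]}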
I was interested to know whether I could find small input differences that led to large output differences. Without domain knowledge, though, I wasn't really sure what "small" and "large" might mean.
I prepared an input payload and
wrote a simple shell script, sketched after this list, which did the following:
- make a timestamped directory for this run inside results
- copy the payload to results
- POST the payload to the API endpoint
- copy the response to results
- parse the response to summarise just the condition list
- copy the condition list to results
- echo the condition list to the terminal
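A minimal sketch of such a runner, assuming curl and jq are available, and with a hypothetical endpoint URL and response shape (the conditions array imagined above), might look like this:

#!/usr/bin/env bash
# run.sh -- a hypothetical sketch; the endpoint URL and the response's
# JSON shape ("conditions", each with a "name") are assumptions.
set -euo pipefail

payload="$1"
results="results/$(date +%Y-%m-%d_%H%M%S)"

# Make a timestamped directory for this run and keep a copy of the payload.
mkdir -p "$results"
cp "$payload" "$results/input.json"

# POST the payload to the API endpoint and save the raw response.
curl --silent --show-error --header "Content-Type: application/json" \
  --data @"$payload" https://example.test/assessments \
  > "$results/response.json"

# Summarise the response to just the condition list, and keep that too.
jq --raw-output '[.conditions[].name] | join(",")' "$results/response.json" \
  > "$results/conditions.txt"

# Echo the run directory and the condition list to the terminal.
echo "-----"
echo "$results"
cat "$results/conditions.txt"

Keeping the payload, the raw response, and the summary together in one directory per run is what makes the searching and re-running described below so cheap.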
This super simple runner gave us the ability to loop tightly and efficiently
like this:
- edit the payload
- call the runner
- inspect the conditions
- choose the next edit
We found that we often wanted to compare two successive runs, and this was easy because the summaries sat one above the other in the console:
$ run.sh input.json
-----
results/2022-01-09_092501
A,B,C,D,E
$ run.sh input.json
-----
results/2022-01-09_092520
A,B,C,D,F
If we needed more of the metadata around the list we could look at the raw response. If we needed to check exactly which payload had produced a specific response, or contained a specific symptom, we could easily search, and if we wanted to re-run a particular payload that was straightforward too:
$ grep -l symptomX results/*/input.json
results/2022-01-09_092501/input.json
$ run.sh results/2022-01-09_092501/input.json
-----
results/2022-01-09_093345
A,B,C,D,E
At
the start of our session we prioritised a set of strategies that we thought
had the potential to show the kind of effect I was interested in. As it
happens, none of the differences we saw were medically significant. But that's not a major
problem: we only spent an hour on this work, starting from a list of symptoms I had
created more or less at random.
I now have a
tool with which I can easily control and observe the system under test, and some insight into the differences that might matter. With
those pieces I can take lists of more plausible symptoms, create many slight
variations of them, run the script and use my new heuristics to target relevant differences.
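For example, assuming the hypothetical payload shape above, a throwaway loop could generate one-symptom-removed variants and push each one through the runner:

# For each symptom in the payload, emit a variant with that symptom
# removed, then run it. Assumes run.sh and the "symptoms" array above.
count=$(jq '.symptoms | length' input.json)
for i in $(seq 0 $((count - 1))); do
  jq "del(.symptoms[$i])" input.json > "variant_$i.json"
  run.sh "variant_$i.json"
done

Each variant's condition list then lands in the console one after another, ready for eyeballing or grepping.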
I love
testing by exploring and I love the way that automation can be a force multiplier for that exploration.
Image: Science Fiction and Fantasy Stack Exchange
Highlighting: Pinetools