
Infinite Loop * Infinite Space

The Wikipedia page on infinite loops in programming describes them as "a sequence of instructions that, as written, will continue endlessly, unless an external intervention occurs." A common example is a while loop whose exit condition is never met and which has to be aborted by a human pressing Control-C.

With that concept in mind, we can make an easy analogy to software development where the same kinds of events happen over and over and over until our product becomes irrelevant, or uneconomic, or our organisation closes down, and the loop is exited. 

Inside the loop, the world in which our product exists will change, the market in that world will change, the requirements on our product in the market will change, the product itself changes as our teams add features, or fix bugs, or update libraries, or run on new platforms, and so on. 

So, for their lifetimes, our products inhabit an infinite loop of change and, if they are non-trivial, also contain an infinite space of possibility: potential inputs, possible outputs, paths through the execution, timings, integrations, network, and the rest.

And that's one of our challenges as testers: 

  • How can we decide what to test and to what extent in the face of that change? 
  • How can we test it in the face of that huge range of potential states? 
  • Given that we already do both of those things today somehow, how could we do it more efficiently and effectively?

--00-- 

On my current project, a dialog engine for a medical symptom checking application, I responded by building a tool, the dialog walker, which:

  • Runs on any available dialog, not just one standard end-to-end test example.
  • Makes random inputs, so the input and output data changes on every run.
  • Follows legal rather than expected paths, exploring product possibilities rather than assumed user behaviour.
  • Provides a framework for me to ask and answer questions.
  • Generates large amounts of structured data for post-hoc review.

In large, complex, interconnected systems we can't know in general where we might see issues, and we find it difficult to model and reason about where they could arise. The walker puts the system under test into novel states — and novel sequences of states — on each run in an attempt to find states where the system doesn't behave as we would expect.

I keep it in step with the product as it develops, teaching it to understand and use new features as they are added, and allowing it to work around bugs until they are fixed.

 --00-- 

The sequence below shows a handful of steps from an assessment dialog in Ada's symptom checker. The user provides different kinds of inputs (highlighted in red) in different turns and ultimately ends up at a list of potential conditions, reflecting the information the user entered:

You can imagine, from the open-ended free text input, the range of choice points that can affect the dialog flow, and the broad range of medical conditions a system like this needs to handle, that there are many possible dialogs, if not an infinite number. If you factor in the ability to go back a step at any point, the variability in time taken between turns, and the dependencies on external services, it becomes effectively infinite.

  --00--

In its first iteration, the dialog walker contained a model of a generic dialog, understood how to use the engine's API to traverse it, and used dice rolls to decide what choices to make and what input to provide at each turn in the dialog: 
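
In code terms, that first iteration amounted to something like the rough sketch below. The types and method names are invented for illustration; they are not the walker's real API:

```java
import java.util.List;
import java.util.Random;

// Rough sketch of the first walker: hypothetical types, not the real implementation.
public class DialogWalkerSketch {

    // Minimal view of a dialog turn as the engine's API might report it.
    record Step(String id, List<String> legalInputs, boolean terminal) {}

    // Stand-in for the dialog engine's API client.
    interface DialogEngine {
        Step start();
        Step submit(String stepId, String input);
    }

    private final Random dice = new Random();

    // Walk one dialog to completion, choosing a legal input at random on each turn.
    public void walk(DialogEngine engine) {
        Step step = engine.start();
        while (!step.terminal()) {
            List<String> options = step.legalInputs();
            String choice = options.get(dice.nextInt(options.size()));
            step = engine.submit(step.id(), choice);
        }
    }
}
```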

The orange bubbles on the walker show the kinds of things that it could assert on, for example that some raw template text like {he/she/they} had been sent to the client side rather than being processed into a specific pronoun in the engine, or that a valid user input had provoked an HTTP error code from the service. These are generic, invariant assertions: they are true for any dialog.

You'll notice that there are orange bubbles on the system under test too; that's because inspecting the behaviour and output of the system under test while running the walker can be valuable. This is especially true when running the walker at scale: you can choose how many iterations to execute and how many dialogs to run in parallel in each iteration.
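
As a sketch of what a scaled run can mean in practice, assuming a simple thread pool and a placeholder walkOneDialog() routine rather than the walker's actual machinery, each iteration fans out a batch of dialogs and waits for them to finish:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch of a scaled run: several iterations, each running a batch of dialogs in parallel.
// The numbers and walkOneDialog() are placeholders, not the walker's actual configuration.
public class ScaledRunSketch {
    public static void main(String[] args) throws InterruptedException {
        int iterations = 10;
        int dialogsPerIteration = 20;
        for (int i = 0; i < iterations; i++) {
            ExecutorService pool = Executors.newFixedThreadPool(dialogsPerIteration);
            for (int d = 0; d < dialogsPerIteration; d++) {
                pool.submit(ScaledRunSketch::walkOneDialog);
            }
            pool.shutdown();
            pool.awaitTermination(30, TimeUnit.MINUTES);
        }
    }

    static void walkOneDialog() {
        // Placeholder: start a dialog via the engine's API and walk it to completion.
    }
}
```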

Note that the walker is not asserting on specific outcomes for specific inputs. It can and does assert on data consistency but in order to operate on any dialog (such as a symptom assessment, questionnaire, or user feedback collection) it needs to have invariants to look for: what must always be true here? what would tell us that something was wrong? what would tell us that something might be wrong?
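
For illustration only, two of the generic checks mentioned above could be written as something like the following; the method and its wiring are mine rather than the walker's:

```java
// Illustrative invariant checks, applied to every response regardless of the dialog being walked.
public class InvariantChecks {

    static void assertInvariants(int httpStatus, String bodyText) {
        // A valid user input should never provoke an HTTP error from the service.
        if (httpStatus >= 400) {
            throw new AssertionError("HTTP error " + httpStatus + " for a valid input");
        }
        // Raw template text such as {he/she/they} should never reach the client unprocessed.
        if (bodyText.matches("(?s).*\\{[a-z]+(/[a-z]+)+\\}.*")) {
            throw new AssertionError("Unprocessed template text in response: " + bodyText);
        }
    }
}
```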

 --00-- 

Little by little, I've added features as I've had the need for them. The architecture diagram below shows logs, configuration, access to the SUT's database, scripts, and visualisation of runs: 

The logs record all HTTP traffic between the walker and the dialog engine. With access to logs, I can ask questions after a run. How many times was state X entered? What kinds of error responses were seen? Were any dialog steps never encountered? As I learn which of these are productive, I add them to the walker as assertions.
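
The log format and file name below are assumptions for the sake of example, but a post-run question can be as simple as counting matching lines:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative post-run questions over the HTTP log; the file name and the
// strings being matched are assumptions, not the walker's real log format.
public class LogQuestions {
    public static void main(String[] args) throws IOException {
        Path log = Path.of("walker-http.log");
        long stateXEntries = Files.readAllLines(log).stream()
                .filter(line -> line.contains("\"state\":\"X\""))
                .count();
        long errorResponses = Files.readAllLines(log).stream()
                .filter(line -> line.contains("\"status\":5"))
                .count();
        System.out.println("State X entered: " + stateXEntries);
        System.out.println("5xx responses seen: " + errorResponses);
    }
}
```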

The scripts are used to replay dialogs to help with debugging and reproduction. They're simple brute-force bash scripts that use jq and curl to replay a turn and then wait for a keypress. This means that I can find interesting runs and step through them at my own pace with the application under test running in the debugger and a client open against the back-end database.
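
The real scripts are bash, built on jq and curl; purely to show the shape of the idea, here is the same stepping loop sketched in Java, with the file name and engine URL invented for illustration:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Scanner;

// Replay a recorded dialog one turn at a time, pausing for a keypress between turns.
// The file name and URL are assumptions, not the project's actual values.
public class ReplayDialog {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        List<String> turns = Files.readAllLines(Path.of("interesting-run-turns.jsonl"));
        Scanner keyboard = new Scanner(System.in);
        for (String body : turns) {
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://localhost:8080/dialog/turn"))
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(body))
                    .build();
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode() + ": " + response.body());
            System.out.print("Press Enter to replay the next turn...");
            keyboard.nextLine();
        }
    }
}
```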

Configuration is used to tune the route taken through the dialogs, for example to say how often to go back a step, or to give a list of possible inputs for specific questions. This helps to drive the walker towards particular behaviours that I want to test.
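
A sketch of how that kind of tuning might plug into the walk, with invented knob names (a go-back probability and pinned answers for particular questions):

```java
import java.util.Map;
import java.util.Random;

// Sketch of configuration-driven steering; the property names are invented for illustration.
public class WalkTuning {
    private final double goBackProbability;          // e.g. 0.1 = go back on roughly 10% of turns
    private final Map<String, String> pinnedAnswers; // question id -> answer to always give
    private final Random dice = new Random();

    public WalkTuning(double goBackProbability, Map<String, String> pinnedAnswers) {
        this.goBackProbability = goBackProbability;
        this.pinnedAnswers = pinnedAnswers;
    }

    // Decide whether to step back rather than answer this turn.
    public boolean shouldGoBack() {
        return dice.nextDouble() < goBackProbability;
    }

    // Use a configured answer when one exists, otherwise fall back to a random legal choice.
    public String answerFor(String questionId, String randomChoice) {
        return pinnedAnswers.getOrDefault(questionId, randomChoice);
    }
}
```

The idea is that the dice still drive the walk, but the configuration skews them towards the behaviour under investigation.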

Access to the application's database is interesting. With it, the walker can make assertions about internal state consistency, for instance that the information given to the client is the same information that the application sees itself, or that the information provided by the client ends up in the right place in the database.
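
As an illustration of that kind of internal-state assertion, assuming a hypothetical dialog_turns table rather than the application's real schema:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Illustrative internal-consistency check; the table and column names are assumptions.
public class DatabaseAssertions {

    // Assert that the answer the client sent is the answer stored against that turn.
    static void assertAnswerStored(Connection db, String dialogId, String turnId,
                                   String answerSentByClient) throws SQLException {
        String sql = "SELECT answer FROM dialog_turns WHERE dialog_id = ? AND turn_id = ?";
        try (PreparedStatement query = db.prepareStatement(sql)) {
            query.setString(1, dialogId);
            query.setString(2, turnId);
            try (ResultSet row = query.executeQuery()) {
                if (!row.next()) {
                    throw new AssertionError("No stored turn for " + dialogId + "/" + turnId);
                }
                String stored = row.getString("answer");
                if (!answerSentByClient.equals(stored)) {
                    throw new AssertionError("Client sent '" + answerSentByClient
                            + "' but database holds '" + stored + "'");
                }
            }
        }
    }
}
```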

  --00-- 

Visualisation was introduced most recently. After a run, I can start the visualiser and it shows a summary of the outcomes, flagging particular instances that had assertion failures:

Clicking through from there, I can look at logs, a traditional HTTP request-response view, or a "chatbot" representation of the dialog (as shown below). This presents the data that the user would see along with metadata per turn and a chunk of system state (greyed out in the image below, sorry):

The system state is particularly interesting when there is a problem because the before and after states can be diffed to help with diagnosis. There's also a search interface that helps to find dialogs where the "user" saw particular text, or entered particular states.

The visualisation is incredibly useful for me to understand what was happening at any given point in a dialog, but it also encourages others to use the walker and is exceptional for sharing findings (as my team's PM noted).

--00--  

Which all sounds great, for sure, but how does it help to address the infinite loop and the infinite space?

Infinite space first: over time and over runs, because of the random nature of the paths taken, more and more of the space of possible assessments is covered. Of course, because of the infinite loop the space is not static, so by choosing how many runs to make and tuning the way they run, I can cover the space for today's investigation based on current perceived risk.

On the infinite loop, the walker helps at different levels. When a PR is ready for review I will often kick off the walker against the engine in that branch while I explore the product changes. Sometimes this produces an interesting finding immediately and, if it doesn't, I can either configure the walker to try to exercise a specific thing, change the walker to touch the new feature, or accept that there's low risk of unexpected side-effects.

Some of the developers on my team run the walker themselves while they are writing code, and some ask to run the walker with me after they have finished. A significant value to them is finding side-effects in behaviours they did not expect to change and so didn't test for. It's a safety net of sorts against limited confirmatory testing. 

My team has a good test culture, in the sense that the developers see it as their responsibility to write unit, integration, and even occasional end-to-end tests in code when they change the product. However, those kinds of tests only cover the things that the author (a) can think of, (b) can think of a way to implement, and (c) has time and motivation to actually do. 

The walker complements this because (a) it isn't trying to think about potential consequences, (b) there is no additional cost to running along one path or another, and (c) it runs unattended, for as many times as you ask it, without complaint.

Above the implementation level, I can frame questions like: could we ever see ...? how likely is it that ...? is there a route to ...? For example, could we ever see data from separate but concurrent user sessions being mixed up in the database? 

Now, absolutely, we can and should investigate this by inspecting the code, by writing tests for the integration between the product and the database, and so on. But we can additionally run lots of dialogs in parallel and then check for that kind of cross-contamination. Because the dialogs are random, the experiment is more "user-like" than running the same handful of canned dialogs over and over, which is what most regression tests will be doing, and so increases the chances of finding the particular state where there is scope for some kind of data issue.
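
A sketch of what such a post-run check could look like, assuming the walker keeps a record of what each "user" entered per session and that we can read the equivalent records back from the database:

```java
import java.util.Map;
import java.util.Set;

// Illustrative cross-contamination check after a parallel run: every answer stored
// against a session must be one that the walker actually entered in that session.
// The data shapes here are assumptions about the walker's records and the database.
public class CrossContaminationCheck {

    static void assertNoMixing(Map<String, Set<String>> answersEnteredBySession,
                               Map<String, Set<String>> answersStoredBySession) {
        answersStoredBySession.forEach((sessionId, storedAnswers) -> {
            Set<String> entered = answersEnteredBySession.getOrDefault(sessionId, Set.of());
            for (String answer : storedAnswers) {
                if (!entered.contains(answer)) {
                    throw new AssertionError("Session " + sessionId
                            + " holds an answer it never entered: " + answer);
                }
            }
        });
    }
}
```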

--00--

The walker amplifies my exploratory testing by helping me to ask specific questions and generating data to help answer them, and by looking for potential problems that it knows about already. It operates unsupervised to cover the infinite space and is configurable to allow it to be focused when that's necessary.

It takes ideas from model-based testing, unit testing, property testing, and observability to create a custom tool that pays back the investment I have made in it in spades. But that investment has been incremental: the first version was a horrible bash script, just enough to navigate a dialog randomly against an early skeleton of the product. The current version is a mixture of JavaScript for the UI and Java for the walker itself, built on GraphWalker, and in the last year I have used Cursor for many of the changes, notably in building the UI.

I have been talking about exploring with automation on this blog for a long time and I have been actually exploring with automation for a very long time. I find it a productive approach for starting to address the problem of the infinite space inside the infinite loop and, to be honest, I also wouldn't want to work any other way.
Image: Ivan Slade on Unsplash  

This post is a summary of a talk (slides are here) I gave at EC Utbildning and Ministry of Testing Cambridge recently.
