
How do I Test AI?

 

Recently a few people have asked me how I test AI. I'm happy to share my experiences, but I frame the question more broadly, perhaps something like this: what kinds of things do I consider when testing systems with artificial intelligence components?

I freestyled liberally the first time I answered but when the question came up again I thought I'd write a few bullets to help me remember key things. This post is the latest iteration of that list.

Caveats: I'm not an expert; what you see below is a reminder of things to pick up on during conversations so it's quite minimal; it's also messy; it's absolutely not a guide or a set of best practices; each point should be applied in context; the categories are very rough; it's certainly not complete. 

Also note that I work with teams who really know what they're doing on the domain, tech, and medical safety fronts and some of the things listed here are things they'd typically do some or all of.

Testing AI

  • It's the same as testing anything: looking for relevant incongruities.
  • You can use all the same skills you use in your regular testing role.
  • CODS - Control, Observe, Decompose, Simplify
  • You can never test everything. Even more so here. 
  • You'll need to find comparisons: input(s) result in behaviour(s), against a (partial) oracle.
  • Empirical data on the system behaviour is key.
  • Your (perceived) coverage of the input and output space is going to be key.
  • Even if it's "AI" don't be fooled into thinking that the system "understands."
  • What are the risks of this thing in this use in this context?
  • What are the egregious outputs?
  • What are the potential bad outcomes? How bad? For who? When?
  • Is it solving the problem? Could the problem be solved another way? 

Information Spaces

  • Input and output spaces can be functionally infinite.
  • Depending on the underlying models, you can't be sure what data is present.
  • You can't be sure what "reasoning" will come out.

Oracles

  • How do you define the characteristics of good output?
  • How do you evaluate the output against those characteristics?
  • Look for related/overlapping/implicit concerns, e.g. societal, sociological, interactional, ...
  • Your problem is probably not unidimensional.
  • You probably want multiple metrics.
  • How do you balance multiple metrics to judge the behaviour of the system as a whole?
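
One way to think about balancing several metrics is a weighted overall score combined with a hard floor per metric, so that a strong score in one dimension can't mask an unacceptable one in another. The sketch below is only an illustration: the metric names, weights, and thresholds are hypothetical, and the right combination depends entirely on context.

```python
from dataclasses import dataclass


@dataclass
class Metric:
    name: str
    weight: float   # relative importance in the overall score
    minimum: float  # hard floor: below this the run fails whatever the other scores are


def judge(scores: dict[str, float], metrics: list[Metric]) -> tuple[bool, float]:
    """Return (passed, weighted_average) for per-metric scores in [0, 1]."""
    total_weight = sum(m.weight for m in metrics)
    overall = sum(m.weight * scores[m.name] for m in metrics) / total_weight
    passed = all(scores[m.name] >= m.minimum for m in metrics)
    return passed, overall


# Hypothetical metrics and thresholds, purely for illustration.
METRICS = [
    Metric("factual_accuracy", weight=3.0, minimum=0.9),
    Metric("tone", weight=1.0, minimum=0.5),
    Metric("conciseness", weight=1.0, minimum=0.0),
]

if __name__ == "__main__":
    print(judge({"factual_accuracy": 0.8, "tone": 1.0, "conciseness": 1.0}, METRICS))
```

In this example a run with a high overall average still fails, because factual accuracy falls below its floor.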

Data

  • Think carefully about input space coverage.
  • Think carefully about output space coverage.
  • Consider very large data sets with some metric on acceptable correctness rates.
  • Are there specific cases which MUST have a specific outcome?  
  • Are there specific outcomes we MUST NOT see?
  • Try semantically-identical data with different syntax, synonyms, null content, length, ordering, position of the relevant content (e.g. buried), ...
  • Where is your test data coming from? 
  • Do you have an existing system with user data and expected outputs?
  • Are you making the data up? On what basis? Using what tools? How can you judge how similar it will be/is to real user data?
  • Nonsense. What data should never get a response? (Beware the happy path glow.)
  • Which languages to allow? LLMs will often respond to anything you input.
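
The idea of trying semantically-identical data in different forms can be sketched as a small generator of input variants. This is a minimal illustration, not a recommendation of specific transformations: the function name, the synonym table, and the filler text are all made up, and whether each variant really preserves meaning is a judgement for the domain.

```python
def make_variants(text: str, synonyms: dict[str, str], filler: str) -> dict[str, str]:
    """Return named rewrites of `text` that are meant to keep its meaning."""
    swapped = text
    for word, replacement in synonyms.items():
        swapped = swapped.replace(word, replacement)
    return {
        "original": text,
        "upper_case": text.upper(),
        "extra_whitespace": "  " + text.replace(" ", "   ") + "\n",
        "synonyms": swapped,
        # The relevant content is surrounded by irrelevant text.
        "buried": f"{filler} {text} {filler}",
        # The relevant content is followed by a lot of irrelevant text.
        "long": f"{text} " + " ".join([filler] * 50),
    }


if __name__ == "__main__":
    variants = make_variants(
        "What is the maximum dose?",
        synonyms={"maximum": "highest"},  # hypothetical synonym table
        filler="By the way, the weather is nice.",
    )
    for name, variant in variants.items():
        print(f"{name}: {variant[:60]!r}")
```

Each variant can then be sent to the system under test and the responses compared against the response to the original.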

Assorted Approaches

  • In most projects you'll want scale.
  • At scale you'll need automation to exercise your system.
  • Automation for change detection over time.
  • Automation for exploration.
  • At scale you'll need statistical evaluation of the outcomes.
  • At scale you probably still want a human in the loop somewhere.
  • Can you sample from test and production for human review? How much? How often? Which data?
  • Property-based testing is a good model to consider.
  • Metamorphic testing can be a valuable approach.
  • Think about combinations of inputs.
  • Adversarial testing (using another AI).
  • Domain experts can identify subtleties that you will miss.
  • How to evaluate the difference between two versions of a system?
  • Evaluating the level of variability in some version of a system.
  • Is there bias? Against who? According to who?
  • Bespoke tooling for different kinds of experiments.
  • Bespoke metrics for different kinds of experiments.
  • Can you break the system under test down into steps and inspect them?
  • If the system is chat-based are you evaluating each turn, the end result, both?
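
To make the metamorphic testing idea a little more concrete: rather than knowing the right answer for each input, you assert a relation between the outputs for an input and a transformed version of it. The sketch below checks one such relation (the output should not change when the input is lower-cased) against a toy stand-in for the system. Both `check_invariance` and `toy_classifier` are hypothetical names, and a real system would usually need a fuzzier comparison than strict equality.

```python
from typing import Callable


def check_invariance(
    system: Callable[[str], str],
    inputs: list[str],
    transform: Callable[[str], str],
) -> list[tuple[str, str, str]]:
    """Return (input, output_before, output_after) for every input where the
    output changed under a transformation that should not matter."""
    failures = []
    for text in inputs:
        before = system(text)
        after = system(transform(text))
        if before != after:
            failures.append((text, before, after))
    return failures


def toy_classifier(text: str) -> str:
    # Stand-in for the system under test; deliberately case-sensitive
    # so that the check has something to find.
    return "urgent" if "URGENT" in text else "routine"


if __name__ == "__main__":
    samples = ["URGENT: chest pain", "repeat prescription"]
    for failure in check_invariance(toy_classifier, samples, str.lower):
        print(failure)
```

The same harness works for other relations by swapping the transformation, for example reordering sentences or adding irrelevant text.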

Development

  • How can you test the effect of changes? (e.g. adding "Don't be biased." to a prompt.)
  • What is the goal of this piece of testing?
  • When do you run your tests? Who runs your tests?
  • What is the cost of testing? What is the potential cost of not testing?

Model/Provider Choice

  • Is this the right model/provider? (What constraints do you have on this choice? Why?)
  • How easy would it be to swap to another model/provider?
  • What SLA does the provider give?
  • Do you believe they will stick to it? 
  • How can you test that?

Reproducibility/Explainability

  • Output is typically non-deterministic, even at low temperatures.
  • Tracking down bugs can be difficult.
  • Often not clear how any particular result was arrived at.
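
One rough way to get empirical data on non-determinism is to send the same input many times and summarise how much the outputs agree. The sketch below assumes outputs can be compared by exact string equality, which is often too strict for free text. The `variability` function and the cycling stand-in for the system are illustrative only.

```python
from collections import Counter
from itertools import cycle
from typing import Callable


def variability(system: Callable[[str], str], prompt: str, runs: int = 20) -> dict:
    """Call the system repeatedly with one prompt and summarise output agreement."""
    outputs = Counter(system(prompt) for _ in range(runs))
    most_common_output, count = outputs.most_common(1)[0]
    return {
        "distinct": len(outputs),    # number of different outputs seen
        "agreement": count / runs,   # share of runs giving the most common output
        "mode": most_common_output,
    }


if __name__ == "__main__":
    # Stand-in for a non-deterministic system: answers "A" three times in four.
    answers = cycle(["A", "A", "A", "B"])
    print(variability(lambda prompt: next(answers), "same prompt every time"))
```

Tracking a summary like this across versions of a system gives one simple signal of whether variability is changing.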

Morality

  • Whose data was the model trained on?
  • What is the effect on the world of using this provider? (Electricity, etc)
  • Does the system accept and emit controversial language, e.g. racism, sexism, ...?

Long-Term

  • Logging (but be careful of logging e.g. PHI).
  • Monitoring in production (what data will tell us there might be a problem? How do we know?)
  • Local development. Which tests to run? Smoke, regression, behavioural, ...
  • Build pipelines. Which tests to run? Smoke, regression, behavioural, ...
  • Integration tests. Is there (enough) consistency from any external services?
  • When to retest? e.g. pipeline change, prompt change, model change, ...
  • Human review of some production traffic. Which data? Why? When?
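
For the question of which production traffic goes to human review, one simple option is deterministic sampling on a request identifier, so the same request is always either in or out of the sample and the decision can be reproduced later. This sketch assumes each request has a stable ID. The function name and the 10% rate are arbitrary, and sampling at random is only one of several reasonable answers to "which data?".

```python
import hashlib


def selected_for_review(request_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` (0.0 to 1.0) of requests by ID."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes of the hash as an integer in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < rate * 2**64


if __name__ == "__main__":
    ids = [f"request-{n}" for n in range(10_000)]
    sampled = [i for i in ids if selected_for_review(i, rate=0.1)]
    print(f"{len(sampled)} of {len(ids)} requests selected for review")
```

Only the ID is hashed here, so the selection itself doesn't need to touch request content that might contain PHI.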

Reliability

  • Prompts are like code but not code. 
  • A white box view of the prompt is not a white box view of the code.
  • Non-determinism.
  • Hallucinations.
  • Model changes outside our control (if using external providers).
  • Performance (e.g. latency, error rate).
  • Back-off strategy, circuit-breaker, etc (if connection to LLM fails, is too slow etc).
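
As an illustration of the back-off and circuit-breaker item, here is a minimal sketch of a wrapper around a call to an LLM backend. It retries with exponentially increasing delays and stops calling the backend altogether after too many consecutive failures. Everything here is an assumption for the example: the class and parameter names are invented, the backend is any callable that raises `ConnectionError` on failure, and a production breaker would normally also re-close after a cool-down period, which this one does not.

```python
import time
from typing import Callable


class CircuitOpenError(RuntimeError):
    """Raised when the breaker refuses to call the backend."""


class GuardedClient:
    def __init__(
        self,
        call: Callable[[str], str],
        retries: int = 2,
        base_delay: float = 0.5,
        max_failures: int = 3,
        sleep: Callable[[float], None] = time.sleep,
    ):
        self.call = call
        self.retries = retries
        self.base_delay = base_delay
        self.max_failures = max_failures
        self.sleep = sleep  # injectable so tests need not actually wait
        self.consecutive_failures = 0

    def __call__(self, prompt: str) -> str:
        if self.consecutive_failures >= self.max_failures:
            raise CircuitOpenError("too many consecutive failures")
        for attempt in range(self.retries + 1):
            try:
                result = self.call(prompt)
            except ConnectionError:
                self.consecutive_failures += 1
                out_of_retries = attempt == self.retries
                if out_of_retries or self.consecutive_failures >= self.max_failures:
                    raise
                # Exponential back-off: base_delay, then 2x, then 4x, ...
                self.sleep(self.base_delay * 2 ** attempt)
            else:
                self.consecutive_failures = 0
                return result
        raise AssertionError("unreachable: the loop always returns or raises")
```

Injecting the `sleep` function makes it possible to test the retry behaviour (which delays were requested, when the breaker opens) without slowing the test suite down.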

Security

  • Jailbreaks.
  • Exposure of training data.
  • Exposure of user details.

Image: https://flic.kr/p/2iNyfvg
