How do I Test AI?

 

Recently a few people have asked me how I test AI. I'm happy to share my experiences, but I frame the question more broadly, perhaps something like this: what kinds of things do I consider when testing systems with artificial intelligence components?

I freestyled liberally the first time I answered but when the question came up again I thought I'd write a few bullets to help me remember key things. This post is the latest iteration of that list.

Caveats: I'm not an expert; what you see below is a reminder of things to pick up on during conversations so it's quite minimal; it's also messy; it's absolutely not a guide or a set of best practices; each point should be applied in context; the categories are very rough; it's certainly not complete. 

Also note that I work with teams who really know what they're doing on the domain, tech, and medical safety fronts and some of the things listed here are things they'd typically do some or all of.

Testing AI

  • It's the same as testing anything: looking for relevant incongruities.
  • You can use all the same skills you use in your regular testing role.
  • CODS - Control, Observe, Decompose, Simplify
  • You can never test everything. Even more so here. 
  • You'll need to find comparisons: input(s) result in behaviour(s), against a (partial) oracle.
  • Empirical data on the system behaviour is key.
  • Your (perceived) coverage of the input and output space is going to be key.
  • Even if it's "AI" don't be fooled into thinking that the system "understands."
  • What are the risks of this thing in this use in this context?
  • What are the egregious outputs?
  • What are the potential bad outcomes? How bad? For who? When?
  • Is it solving the problem? Could the problem be solved another way? 

Information Spaces

  • Input and output spaces can be functionally infinite.
  • Depending on the underlying models, you can't be sure what data is present.
  • You can't be sure what "reasoning" will come out.

Oracles

  • How do you define the characteristics of good output?
  • How do you evaluate the output against those characteristics?
  • Look for related/overlapping/implicit concerns, e.g. societal, sociological, interactional, ...
  • Your problem is probably not unidimensional.
  • You probably want multiple metrics.
  • How do you balance multiple metrics to judge the behaviour of the system as a whole?
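
One possible way to keep multiple metrics visible, rather than folding them into a single number, is to give each metric its own minimum and report which ones an output fails. This is a minimal sketch, not a recommendation: the `Metric` class, the `evaluate` function, the two example metrics, and the banned-word list are all hypothetical and invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Metric:
    name: str
    score: Callable[[str], float]  # maps an output to a score in 0.0..1.0
    minimum: float                 # lowest acceptable score for this metric


def evaluate(output: str, metrics: list[Metric]) -> tuple[dict[str, float], list[str]]:
    """Score one output on every metric and list the metrics it fails."""
    scores = {m.name: m.score(output) for m in metrics}
    failures = [m.name for m in metrics if scores[m.name] < m.minimum]
    return scores, failures


# Hypothetical metrics, made up for this example only.
BANNED = {"guaranteed", "cure"}


def brevity(output: str) -> float:
    # 1.0 up to 50 words, then falling linearly to 0.0 at 100 words.
    words = len(output.split())
    return max(0.0, min(1.0, (100 - words) / 50))


def avoids_banned_terms(output: str) -> float:
    tokens = {w.strip(".,!?").lower() for w in output.split()}
    return 0.0 if tokens & BANNED else 1.0


METRICS = [
    Metric("brevity", brevity, minimum=0.5),
    Metric("avoids_banned_terms", avoids_banned_terms, minimum=1.0),
]
```

Calling `evaluate(some_output, METRICS)` returns the per-metric scores alongside the names of any failed metrics, so the question of how to balance them stays an explicit decision rather than something hidden in a weighted sum.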

Data

  • Think carefully about input space coverage.
  • Think carefully about output space coverage.
  • Consider very large data sets with some metric on acceptable correctness rates.
  • Are there specific cases which MUST have a specific outcome?  
  • Are there specific outcomes we MUST NOT see?
  • Try semantically-identical data with different syntax, synonyms, null content, length, change ordering, bury the relevant content, ...
  • Where is your test data coming from? 
  • Do you have an existing system with user data and expected outputs?
  • Are you making the data up? On what basis? Using what tools? How can you judge how similar it will be/is to real user data?
  • Nonsense. What data should never get a response? (Beware the happy path glow.)
  • Which languages to allow? LLMs will often respond to anything you input.
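
The bullet about semantically-identical data can be sketched as a small generator of variants. This is a rough illustration only: the `variants` function, its labels, and the naive word-level synonym swap are my assumptions, and real data would need something more careful.

```python
import random


def variants(text: str, synonyms: dict[str, str], filler: str, seed: int = 0):
    """Yield (label, variant) pairs intended to mean the same as `text`."""
    rng = random.Random(seed)  # seeded so a failing variant can be reproduced
    yield "original", text
    yield "upper_case", text.upper()
    yield "extra_whitespace", "  " + text.replace(" ", "   ") + "\n"
    # Naive word-level swap; punctuation attached to a word prevents a match.
    yield "synonyms", " ".join(synonyms.get(w.lower(), w) for w in text.split())
    # Bury the relevant content in the middle of irrelevant text.
    yield "buried", f"{filler} {text} {filler}"
    # Change the order of the sentences.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    rng.shuffle(sentences)
    yield "reordered", ". ".join(sentences) + "."
```

Each variant can then be sent to the system under test and its output compared with the output for the original, keeping the label so that any difference can be traced back to the kind of change that caused it.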

Assorted Approaches

  • In most projects you'll want scale.
  • At scale you'll need automation to exercise your system.
  • Automation for change detection over time.
  • Automation for exploration.
  • At scale you'll need statistical evaluation of the outcomes.
  • At scale you probably still want a human in the loop somewhere.
  • Can you sample from test and production for human review? How much? How often? Which data?
  • Property-based testing is a good model to consider.
  • Metamorphic testing can be a valuable approach.
  • Think about combinations of inputs.
  • Adversarial testing (using another AI).
  • Domain experts can identify subtleties that you will miss.
  • How to evaluate the difference between two versions of a system?
  • Evaluating the level of variability in some version of a system.
  • Is there bias? Against who? According to who?
  • Bespoke tooling for different kinds of experiments.
  • Bespoke metrics for different kinds of experiments.
  • Can you break the system under test down into steps and inspect them?
  • If the system is chat-based are you evaluating each turn, the end result, both?
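
To make the metamorphic testing and statistical evaluation bullets a little more concrete, here is a minimal sketch. It assumes a metamorphic relation of the form "this change to the input should not change the output" and measures how often the relation is violated across a data set. `toy_triage` is a hypothetical stand-in for the system under test, and `violation_rate` is a name I made up.

```python
from typing import Callable


def violation_rate(system: Callable[[str], str],
                   inputs: list[str],
                   transform: Callable[[str], str]) -> tuple[float, list[str]]:
    """Return the fraction of inputs whose output changes under `transform`,
    plus the offending inputs so a human can review them."""
    if not inputs:
        raise ValueError("need at least one input")
    violations = [x for x in inputs if system(x) != system(transform(x))]
    return len(violations) / len(inputs), violations


# Hypothetical stand-in for the system under test.
def toy_triage(text: str) -> str:
    return "urgent" if "chest pain" in text.lower() else "routine"


INPUTS = ["I have chest pain", "My knee is sore", "Chest pain since noon"]

# Relation 1: changing case should not change the result.
upper_rate, _ = violation_rate(toy_triage, INPUTS, str.upper)

# Relation 2: doubling the spaces should not change the result.
spaced_rate, spaced_failures = violation_rate(
    toy_triage, INPUTS, lambda s: s.replace(" ", "  "))
```

The toy system survives the case change but not the extra spaces. With a real system you would not expect a rate of zero, so the interesting questions become what rate is acceptable and which of the violations matter.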

Development

  • How can you test the effect of changes? (e.g. adding "Don't be biased." to a prompt.)
  • What is the goal of this piece of testing?
  • When do you run your tests? Who runs your tests?
  • What is the cost of testing? What is the potential cost of not testing?

Model/Provider Choice

  • Is this the right model/provider? (What constraints do you have on this choice? Why?)
  • How easy would it be to swap to another model/provider?
  • What SLA does the provider give?
  • Do you believe they will stick to it? 
  • How can you test that?

Reproducibility/Explainability

  • Typically non-deterministic, even at low temperatures.
  • Tracking down bugs can be difficult.
  • Often not clear how any particular result was arrived at.
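
One way to put a number on non-determinism is to send the same input many times and count the distinct outputs. The sketch below assumes outputs can be compared for exact equality, which is often too strict for free text; `variability` and `flaky_system` are hypothetical names, and `flaky_system` just replays canned answers in place of a real model.

```python
from collections import Counter
from itertools import cycle
from typing import Callable


def variability(system: Callable[[str], str], prompt: str, runs: int = 20) -> dict:
    """Call `system` repeatedly with one prompt and summarise the spread."""
    counts = Counter(system(prompt) for _ in range(runs))
    _, top_count = counts.most_common(1)[0]
    return {
        "distinct": len(counts),         # how many different outputs were seen
        "agreement": top_count / runs,   # share of runs giving the commonest output
        "counts": counts,
    }


# Hypothetical stand-in for a non-deterministic system: it ignores the
# prompt and cycles through canned answers.
_canned = cycle(["A", "A", "A", "B"])


def flaky_system(prompt: str) -> str:
    return next(_canned)
```

Running the same measurement against two versions of a system gives a crude basis for asking whether a change made the behaviour more or less stable.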

Morality

  • Whose data was the model trained on?
  • What is the effect on the world of using this provider? (Electricity, etc)
  • Does the system accept and emit controversial language, e.g. racism, sexism, ...

Long-Term

  • Logging (but be careful of logging e.g. PHI).
  • Monitoring in production (what data will tell us there might be a problem? How do we know?)
  • Local development. Which tests to run? smoke, regression, behavioural, ...
  • Build pipelines. Which tests to run? smoke, regression, behavioural, ...
  • Integration tests. Is there (enough) consistency from any external services?
  • When to retest? e.g. pipeline change, prompt change, model change, ...
  • Human review of some production traffic. Which data? Why? When?

Reliability

  • Prompts are like code but not code. 
  • A white box view of the prompt is not a white box view of the code.
  • Non-determinism.
  • Hallucinations.
  • Model changes outside our control (if using external providers).
  • Performance (e.g. latency, error rate).
  • Back-off strategy, circuit-breaker, etc (if connection to LLM fails, is too slow etc).
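
For the back-off and circuit-breaker bullet, this is a minimal sketch of the two ideas, assuming the call to the LLM is wrapped in a plain function that raises an exception on failure. The class and function names, the default limits, and the injectable `sleep` and `clock` parameters (there so tests need not wait in real time) are my assumptions and not taken from any particular library.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when calls are being refused because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("refusing call while circuit is open")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success closes the circuit fully
        return result


def call_with_backoff(fn, attempts=4, base_delay=0.5, sleep=time.sleep):
    """Retry `fn`, doubling the delay after each failure."""
    for attempt in range(attempts):
        try:
            return fn()
        except CircuitOpenError:
            raise  # no point retrying while the breaker is refusing calls
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

From a testing point of view the interesting part is less the code than the behaviour around it: what the user sees while the circuit is open, and whether the retries themselves make a slow provider slower.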

Security

  • Jailbreaks.
  • Exposure of training data.
  • Exposure of user details.

Image: https://flic.kr/p/2iNyfvg
