Recently a few people have asked me how I test AI. I'm happy to share my experiences, but I frame the question more broadly, perhaps something like this: what kinds of things do I consider when testing systems with artificial intelligence components?
I freestyled liberally the first time I answered but when the question came up again I thought I'd write a few bullets to help me remember key things. This post is the latest iteration of that list.
Caveats: I'm not an expert; what you see below is a reminder of things to pick up on during conversations so it's quite minimal; it's also messy; it's absolutely not a guide or a set of best practices; each point should be applied in context; the categories are very rough; it's certainly not complete.
Also note that I work with teams who really know what they're doing on the domain, tech, and medical safety fronts and some of the things listed here are things they'd typically do some or all of.
Testing AI
- It's the same as testing anything: looking for relevant incongruities.
- You can use all the same skills you use in your regular testing role.
- CODS - Control, Observe, Decompose, Simplify
- You can never test everything. Even more so here.
- You'll need to find comparisons: input(s) result in behaviour(s), against a (partial) oracle.
- Empirical data on the system behaviour is key.
- Your (perceived) coverage of the input and output space is going to be key.
- Even if it's "AI" don't be fooled into thinking that the system "understands."
- What are the risks of this thing in this use in this context?
- What are the egregious outputs?
- What are the potential bad outcomes? How bad? For who? When?
- Is it solving the problem? Could the problem be solved another way?
Information Spaces
- Input and output spaces can be functionally infinite.
- Depending on the underlying models, you can't be sure what data is present.
- You can't be sure what "reasoning" will come out.
Oracles
- How do you define the characteristics of good output?
- How do you evaluate the output against those characteristics?
- Look for related/overlapping/implicit concerns, e.g. societal, sociological, interactional, ...
- Your problem is probably not unidimensional.
- You probably want multiple metrics.
- How do you balance multiple metrics to judge the behaviour of the system as a whole?
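
One way I could imagine balancing several metrics is a weighted overall score plus a hard floor on each metric, so a great score on one dimension can't hide a terrible score on another. This is only a sketch: the metric names, weights, and floors in the usage note are invented for illustration, and scores are assumed to be in the range 0 to 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    name: str
    weight: float  # relative importance in the combined score (made up)
    floor: float   # minimum acceptable value for this metric alone (made up)


def judge(scores: dict[str, float], metrics: list[Metric]) -> tuple[bool, float, list[str]]:
    """Return (acceptable, combined score, names of metrics below their floor)."""
    total_weight = sum(m.weight for m in metrics)
    # Weighted average across all metrics.
    combined = sum(scores[m.name] * m.weight for m in metrics) / total_weight
    # A metric below its own floor fails the run, whatever the average says.
    failed = [m.name for m in metrics if scores[m.name] < m.floor]
    return (not failed, combined, failed)
```

For example, with `Metric("accuracy", 2.0, 0.8)` and `Metric("tone", 1.0, 0.5)`, scores of 0.95 for accuracy and 0.2 for tone are rejected because tone is under its floor, even though the weighted average still looks reasonable.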
Data
- Think carefully about input space coverage.
- Think carefully about output space coverage.
- Consider very large data sets with some metric on acceptable correctness rates.
- Are there specific cases which MUST have a specific outcome?
- Are there specific outcomes we MUST NOT see?
- Try semantically-identical data with different syntax, synonyms, null content, length, change ordering, bury the relevant content, ...
- Where is your test data coming from?
- Do you have an existing system with user data and expected outputs?
- Are you making the data up? On what basis? Using what tools? How can you judge how similar it will be/is to real user data?
- Nonsense. What data should never get a response? (Beware the happy path glow.)
- Which languages to allow? LLMs will often respond to anything you input.
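
The "same meaning, different surface form" idea above can be sketched as a small variant generator. The transformations here are deliberately crude (splitting sentences on full stops is a simplifying assumption) and the filler text is made up; real variants such as synonyms would need domain input.

```python
import random


def variants(text: str, filler: str = "Some unrelated background text.", seed: int = 0) -> dict[str, str]:
    """Build surface-level variants of `text` that should not change its meaning."""
    rng = random.Random(seed)  # seeded so the same variants come out every run
    # Naive sentence split on full stops; good enough for a sketch.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    shuffled = sentences[:]
    rng.shuffle(shuffled)
    return {
        "original": text,
        "upper": text.upper(),
        "extra_whitespace": "  " + text.replace(" ", "   ") + "\n",
        "reordered": ". ".join(shuffled) + ".",
        # Bury the relevant content in the middle of irrelevant text.
        "buried": " ".join([filler] * 3 + [text] + [filler] * 3),
        # Null content: what does the system do with nothing at all?
        "empty": "",
    }
```

Each variant can then be sent to the system under test and the responses compared with the response to the original.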
Assorted Approaches
- In most projects you'll want scale.
- At scale you'll need automation to exercise your system.
- Automation for change detection over time.
- Automation for exploration.
- At scale you'll need statistical evaluation of the outcomes.
- At scale you probably still want a human in the loop somewhere.
- Can you sample from test and production for human review? How much? How often? Which data?
- Property-based testing is a good model to consider.
- Metamorphic testing can be a valuable approach.
- Think about combinations of inputs.
- Adversarial testing (using another AI).
- Domain experts can identify subtleties that you will miss.
- How to evaluate the difference between two versions of a system?
- Evaluating the level of variability in some version of a system.
- Is there bias? Against who? According to who?
- Bespoke tooling for different kinds of experiments.
- Bespoke metrics for different kinds of experiments.
- Can you break the system under test down into steps and inspect them?
- If the system is chat-based are you evaluating each turn, the end result, both?
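
To make the metamorphic testing bullet a little more concrete: the idea is to check a relationship between the outputs for related inputs, without needing to know the right answer for either. A minimal sketch follows, where `toy_classify` is a purely hypothetical stand-in for the system under test and exact equality is assumed as the relation. With a non-deterministic system you would probably need a looser comparison.

```python
from typing import Callable


def toy_classify(text: str) -> str:
    """Stand-in for the real system under test (purely hypothetical)."""
    return "urgent" if "chest pain" in text.lower() else "routine"


def metamorphic_failures(
    system: Callable[[str], str],
    inputs: list[str],
    transform: Callable[[str], str],
) -> list[tuple[str, str, str]]:
    """Return (input, output, transformed output) wherever the two outputs differ."""
    failures = []
    for text in inputs:
        before = system(text)
        after = system(transform(text))
        # The relation: a meaning-preserving transform should not change the output.
        if before != after:
            failures.append((text, before, after))
    return failures
```

Here, upper-casing the inputs finds nothing because the toy system lower-cases its input, but doubling every space turns "I have chest pain" from "urgent" into "routine", which is exactly the kind of incongruity worth investigating.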
Development
- How can you test the effect of changes? (e.g. adding "Don't be biased." to a prompt.)
- What is the goal of this piece of testing?
- When do you run your tests? Who runs your tests?
- What is the cost of testing? What is the potential cost of not testing?
Model/Provider Choice
- Is this the right model/provider? (What constraints do you have on this choice? Why?)
- How easy would it be to swap to another model/provider?
- What SLA does the provider give?
- Do you believe they will stick to it?
- How can you test that?
Reproducibility/Explainability
- Typically non-deterministic, even at low temperatures.
- Tracking down bugs can be difficult.
- Often not clear how any particular result was arrived at.
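
A simple way to get a feel for the non-determinism is to send the same input many times and count the distinct outputs. This is a sketch under the assumption that the system can be wrapped as a function from a string to a string and that outputs can be compared by exact match; free-text responses would need some kind of normalisation or similarity measure first.

```python
from collections import Counter
from typing import Callable


def agreement_rate(system: Callable[[str], str], prompt: str, runs: int = 20) -> tuple[float, Counter]:
    """Call `system` with the same prompt `runs` times.

    Returns the share of runs that produced the most common output,
    plus the full tally of outputs seen.
    """
    outputs = Counter(system(prompt) for _ in range(runs))
    most_common_count = outputs.most_common(1)[0][1]
    return most_common_count / runs, outputs
```

A rate of 1.0 means every run agreed. Anything lower gives you a tally of alternative outputs to look at, which can also help when trying to reproduce an intermittent bug.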
Morality
- Whose data was the model trained on?
- What is the effect on the world of using this provider? (Electricity, etc)
- Does the system accept and emit controversial language, e.g. racism, sexism, ...?
Long-Term
- Logging (but be careful of logging e.g. PHI).
- Monitoring in production (what data will tell us there might be a problem? How do we know?)
- Local development. Which tests to run? smoke, regression, behavioural, ...
- Build pipelines. Which tests to run? smoke, regression, behavioural, ...
- Integration tests. Is there (enough) consistency from any external services?
- When to retest? e.g. pipeline change, prompt change, model change, ...
- Human review of some production traffic. Which data? Why? When?
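
For the question of which production data to review, one option is deterministic sampling keyed on a request identifier, so the same request is always either in or out of the sample. This is only a sketch: the idea of a `request_id` string and the sampling rate are assumptions, and it says nothing about which data is safe to show a reviewer.

```python
import hashlib


def selected_for_review(request_id: str, rate: float) -> bool:
    """Decide whether a request falls into the human-review sample.

    `rate` is the fraction of traffic to sample, between 0.0 and 1.0.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 4 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < rate
```

Because the decision depends only on the identifier, it can be recomputed later to explain why a given request was or wasn't reviewed.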
Reliability
- Prompts are like code but not code.
- A white box view of the prompt is not a white box view of the code.
- Non-determinism.
- Hallucinations.
- Model changes outside our control (if using external providers).
- Performance (e.g. latency, error rate).
- Back-off strategy, circuit-breaker, etc. (if the connection to the LLM fails, is too slow, etc.).
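
As an illustration of the back-off bullet, here is a minimal retry wrapper with exponential back-off. It is a sketch rather than a recommendation: the attempt count and delays are arbitrary, it retries on any exception, and it doesn't include a circuit-breaker. The `sleep` parameter is injectable so the behaviour can be tested without actually waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    call: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `call`, doubling the wait after each failure; re-raise the last error."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            # Out of attempts: let the caller see the final error.
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")
```

In a test, passing `sleep=recorded.append` with a list called `recorded` captures the delays, so you can check that a call which fails twice and then succeeds waited 0.5 and then 1.0 seconds.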
Security
- Jailbreaks.
- Exposure of training data.
- Exposure of user details.
Image: https://flic.kr/p/2iNyfvg