I2E, the flagship product at Linguamatics, is a text mining engine and so sits in the broad space of search tools such as grep, Ctrl-F, and even Google. In that world, evaluating "how good", "how relevant", or "how correct" a set of search results is can be interesting for a number of reasons, including:
- it may be hard to define what those terms mean, in general cases.
- it may be possible to calculate some kind of metric on small, well-understood data sets but less so at scale.
- it may be possible to calculate some kind of metric for simple searches, but less so for complex ones.
- on different occasions the person searching may have different intent and needs from the same search.
But today we'll concentrate on two standard metrics with agreed definitions: precision (roughly "how useful the search results are") and recall (roughly "how complete the results are").
Imagine we want to test our search engine. We have a set of documents and we will search them for the single word "testing". The image below, from Wikipedia, shows how we could calculate the metrics.
There's a lot of information in there, so let's unpack some of it:
- The square represents the documents.
- The solid grey circles are occurrences of the word "testing".
- The clear grey circles are occurrences of other words.
- The central black circle is the set of results from the search.
- The term positive means that a word is in the results.
- The term negative means that a word is not in the results.
- The term true means that a word is classified correctly.
- The term false means that a word is classified incorrectly.
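Those four terms combine into four buckets, which can be sketched in a few lines of Python. This is only a toy illustration, assuming a document held as a list of words and a search that hands back a set of word positions; it is not how I2E works, and the names `classify`, `words`, and `returned` are made up for the example.

```python
def classify(words, returned, target="testing"):
    """Count the four outcomes for a search over word positions.

    words    -- the document as a list of words (toy data)
    returned -- set of positions a hypothetical search gave back
    target   -- the word we were searching for
    """
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for position, word in enumerate(words):
        in_results = position in returned  # positive vs. negative
        is_target = word == target         # what the right answer is
        if in_results and is_target:
            counts["TP"] += 1  # returned, and should have been
        elif in_results:
            counts["FP"] += 1  # returned, but should not have been
        elif is_target:
            counts["FN"] += 1  # missed: should have been returned
        else:
            counts["TN"] += 1  # correctly left out
    return counts


words = "testing is fun but testing software needs testing".split()
returned = {0, 1, 4}  # positions of "testing", "is", "testing"
print(classify(words, returned))
```

Here the search finds two of the three occurrences of "testing", wrongly includes "is", and correctly ignores the other four words.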
We run our search using the system under test (SUT) and get back 50 results (the central black circle). We inspect those results and find that 35 are the word "testing" (the true positives) and 15 are something else (the false positives - asserted to be correct, but in fact incorrect).
The pictographs at the bottom of the image give us the formulae we need: precision comes only from the set of results we can see, and in this case is 35/50 or 70%. Recall requires knowledge of the whole set of documents: ours contain 100 occurrences of "testing" in total, so recall is 35/100 or 35%.
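The same arithmetic as a rough Python sketch. The function names are mine, and returning 0.0 when there is nothing to divide by is just one possible convention, chosen here to keep the example short. The 65 false negatives follow from the recall figure above: 100 occurrences of "testing" in total, of which 35 were found.

```python
def precision(true_positives, false_positives):
    """Fraction of the returned results that are correct."""
    returned = true_positives + false_positives
    # Assumption: report 0.0 rather than fail when nothing was returned.
    return true_positives / returned if returned else 0.0


def recall(true_positives, false_negatives):
    """Fraction of the correct items that were actually returned."""
    relevant = true_positives + false_negatives
    # Assumption: report 0.0 rather than fail when nothing was relevant.
    return true_positives / relevant if relevant else 0.0


tp, fp = 35, 15   # what we saw when inspecting the 50 results
fn = 100 - tp     # occurrences of "testing" the search missed

print(f"precision = {precision(tp, fp):.0%}")  # precision = 70%
print(f"recall    = {recall(tp, fn):.0%}")     # recall    = 35%
```

Note that `precision` never looks at `fn`: it can be worked out from the results alone, while `recall` cannot.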
A striking difference, but which is better? Can one be better? These things are metrics, so can they be gamed?
Well, if the search simply returned every word in the documents (1000 of them, say) its recall would be 100/100, or 100%, but precision would be very low at 100/1000, or 10%, because precision is dragged down by every false positive in the search results.
So can you get 100% precision? You certainly can: have the search return only those results with an extremely high confidence of being correct. Imagine only one result is returned, and it's a good one. Then precision is 1/1 or 100%. Sadly, recall in this case is 1/100 or 1%.
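To see the trade-off side by side, here is a small Python sketch that runs the numbers for all three searches. It assumes the figures used above (1000 words in total, 100 of them "testing"); the strategy labels and the `precision_recall` helper are invented for the illustration.

```python
def precision_recall(returned, relevant_returned, relevant_total):
    """Return (precision, recall) from simple counts."""
    return relevant_returned / returned, relevant_returned / relevant_total


TOTAL_WORDS = 1000    # every word in the documents
TOTAL_TESTING = 100   # occurrences of "testing" among them

# strategy name -> (number of results returned, how many are "testing")
strategies = {
    "original search": (50, 35),
    "return everything": (TOTAL_WORDS, TOTAL_TESTING),
    "return one sure thing": (1, 1),
}

for name, (returned, hits) in strategies.items():
    p, r = precision_recall(returned, hits, TOTAL_TESTING)
    print(f"{name:>22}: precision {p:.0%}, recall {r:.0%}")
```

Each of the gamed searches maxes out one metric by sacrificing the other, which is why neither number means much on its own.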
Which is very interesting, really, but what does it have to do with testing?
Good question; it's background for a rough and ready analogy that squirted out of a conversation at work this week, illustrating the appealing trap of simple confirmatory testing. Imagine that you run your system under test with nominated input, inspect what comes out, and check that against some acceptance criteria. Everything in the output meets the criteria. Brilliant! Job done? Or precision 100%?
Images: https://flic.kr/p/sqtWUT, Wikipedia