Bayesian Testing?

Introduction

I'm tossing this idea out into the world. It's half-formed and I'm learning as I go along. It may be invalid, it may be old news, it may not. What I'm hoping for is that someone who knows more about at least one of testing and Bayesian inference than I do will come and set me straight.

UPDATE: Laurent Bossavit turned out to be that person. The results below have been adjusted significantly as a result of a very illuminating conversation with him. Whatever virtue these results now have is due to him (and the defects remain my responsibility). Laurent, many thanks.

In addition, a bunch of folks kindly came along to an open space session at Xp Day London this year; here is the commentary of one of them. That session alone left the idea better formed, and this article reflects that improvement. Thanks, all. If you want to skip the motivation and cut to the chase, go here.


Evidence

You may have read that absence of evidence is not evidence of absence. Of course, this is exactly wrong. I've just looked, and there is no evidence to be found that the room in which I am sitting (nor the room in which you are, I'll bet: look around you right now) contains an elephant. I consider this strong evidence that there is no elephant in the room. Not proof, and in some ways not the best reason for inferring that there is no elephant, but certainly evidence that there is none. This seems to be different from the form of bad logic that Sagan is actually criticising, in which the absence of evidence that there isn't an elephant in the room would be considered crackpot-style evidence that there was an elephant in the room.

You may also have read (on page 7 of that pdf) that program testing can be used to show the presence of bugs, but never to show their absence! I wonder. In the general case this certainly seems to be so, but I'm going to claim that working programmers don't often address the general case.

Dijkstra's argument is that, even in the simple example of a multiplication instruction, we do not have the resources available to test the implementation exhaustively, but we still demand that it should correctly multiply any two numbers within the range of the representation. Dijkstra says that we can't afford to take even a representative sample (whatever that might look like) of all the possible multiplications that our multiplier might be asked to do. And that seems plausible, too. Consider how many distinct values a numerical variable in your favourite language can take, and then square it. That's how many cases you expect the multiplication operation in your language to deal with, and deal with correctly. As an aside: do you expect it to work correctly? If so, why do you?
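To put a number on that: with 32-bit integers, for example, there are 2^32 possible values for each operand, so 2^32 × 2^32 = 2^64 (roughly 1.8 × 10^19) distinct multiplications to check. Even at a billion checks per second that is more than five hundred years of testing.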


A Small Example of Confidence

Let's say that we wish to write some code to recognise if a stone played in a game of Go is in atari or not (this is my favourite example, for the moment). The problem is simple to state: a stone with two or more "liberties" is not in atari, a stone with one liberty is in atari. A stone can have 1 or more liberties. In a real game situation it can be some work to calculate how many liberties a stone has, but the condition for atari is that simple.
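For concreteness, the condition could be sketched in Python as nothing more than this (the function name is my own):

    def in_atari(liberties):
        # A stone is in atari exactly when it has one remaining liberty.
        return liberties == 1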

A single stone can have only 1, 2, 3 or 4 liberties, and those are the cases I will address here. I write some code to implement this function and I'll say that I'm fairly confident I've got it right (after all, it's only an if), but not greatly so. Laurent proposed a different, and better, question to ask than the one I was asking before, and he helped me find and understand a better answer.

The prior probability of correctness that question leads to is 1 ⁄ 16. This is because there are 16 possible mappings from {1, 2, 3, 4} to {T, F} (two possible outputs for each of the four inputs, so 2^4 = 16 functions in all) and only one of them is the correct function. The prior, then, is the probability that my function behaves identically to the one function that is correct by definition.

How might a test result influence that probability of correctness? There is a spreadsheet which shows a scheme for doing that, using what very little I understand of Bayesian inference, applied slightly less naïvely than before.

Cells in the spreadsheet are colour-coded to give a guide as to how the various values are used in the Bayesian formula. The key, as discussed in the XpDay session, is how to count cases to find the conditional probabilities of seeing the evidence.
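To make the counting concrete, here is a minimal sketch in Python of the scheme as I understand it (the names and structure are my own, not the spreadsheet's). The hypotheses are the sixteen candidate functions; the probability of the evidence given a correct implementation is 1 or 0, depending on whether the observed result matches the specification; and the probability of the evidence itself is the fraction of the still-surviving candidates that would produce the observed result.

    from itertools import product

    LIBERTIES = (1, 2, 3, 4)

    def spec(liberties):
        # The specification: a stone is in atari exactly when it has one liberty.
        return liberties == 1

    # Every function from {1, 2, 3, 4} to {True, False}, represented as the tuple
    # of its outputs for 1, 2, 3 and 4 liberties: 2^4 = 16 candidates in all.
    candidates = list(product((True, False), repeat=len(LIBERTIES)))

    def posterior_after(tests, prior=1.0 / len(candidates)):
        # Update P(my code is correct) given a list of (liberties, observed) results.
        # P(evidence | correct) is 1 if the observation matches the spec, else 0.
        # P(evidence) is the fraction of the surviving candidate functions that
        # would produce the observation. Bayes' rule then gives the new posterior.
        surviving = candidates
        p_correct = prior
        for liberties, observed in tests:
            index = LIBERTIES.index(liberties)
            p_evidence_given_correct = 1.0 if observed == spec(liberties) else 0.0
            agreeing = [c for c in surviving if c[index] == observed]
            p_evidence = len(agreeing) / len(surviving)
            p_correct = p_evidence_given_correct * p_correct / p_evidence
            surviving = agreeing
        return p_correct

    print(posterior_after([(1, True)]))                                      # 0.125
    print(posterior_after([(1, True), (2, False)]))                          # 0.25
    print(posterior_after([(1, True), (2, True)]))                           # 0.0
    print(posterior_after([(1, True), (2, False), (3, False)]))              # 0.5
    print(posterior_after([(1, True), (2, False), (3, False), (4, False)]))  # 1.0

The printed values match the results from the spreadsheets discussed below.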

The test would look something like this:
One Liberty Means Atari
liberties | atari?
1         | true

The posterior probability of correctness is 0.125.
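One way to see where that number comes from, counting over the sixteen candidate functions: eight of them return true for one liberty, so the probability of seeing this evidence at all is 8/16, while the probability of seeing it if my code is correct is 1. Bayes' rule then gives P(correct | evidence) = 1 × (1/16) ÷ (8/16) = 1/8 = 0.125.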


Adding More Test Cases

Suppose that I add another case that shows that when there are 2 liberties the code correctly determines that the stone is not in atari.
One Liberty Means Atari
liberties | atari?
1         | true
2         | false
Using the same counting scheme as in the first case, and using the updated probability from the first case as the prior for the second, the updated probability of correctness given the new evidence increases to 0.25, as this sheet shows.
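Counting candidates again: of the eight functions consistent with the first test, four also return false for two liberties, so P(correct | evidence) = 1 × (1/8) ÷ (4/8) = 1/4 = 0.25. Each passing case halves the number of surviving candidate functions and so doubles the probability that mine is the correct one.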

But suppose that the second test actually showed an incorrect result: 2 liberties and atari true.
One Liberty Means Atari
liberties | atari?
1         | true
2         | true
Then, as we might expect, the updated probability of correctness falls to 0.0, as shown here. And since the formula works by multiplying the prior probability by a factor based on the evidence, the updated probability will stay at zero no matter what further evidence is presented, which seems like the right behaviour to me.
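In the formula it is the numerator that does the damage: the correct function returns false for two liberties, so the probability of seeing this evidence if my code is correct is 0, and the posterior is 0 × (1/8) ÷ (4/8) = 0.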

This problem is very small, so in fact we can exhaustively test the solution. What happens to the probability of correctness then? Extending test coverage to these cases
One Liberty Means Atari
liberties | atari?
1         | true
2         | false
3         | false
gives an updated probability of 0.5, as shown here.
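That matches the counting: only two of the sixteen candidate functions are still consistent with all three results, the correct one and the one that wrongly reports atari for a stone with four liberties. Half of the surviving candidates being correct is another way of reading the 0.5.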

One more case remains to be added:
One Liberty Means Atari
liberties | atari?
1         | true
2         | false
3         | false
4         | false
and the posterior probability of correctness is updated to 1.0 as shown here.
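With the fourth case only one candidate function remains consistent with the evidence, and it is the correct one by definition: P(correct | evidence) = 1 × (1/2) ÷ (1/2) = 1.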

That result seems to contradict Dijkstra: exhaustive testing, in a case where we can do that, does show the absence of bugs. He probably knew that.


Next?

My brain is fizzing with all sorts of questions to ask about this approach: I talked here about retrofitted tests; can it help with TDD? Can this approach guide us in choosing good tests to write next? Can the structure of the domain and co-domain of the functions we test guide us to high confidence quickly, or can't it? Can the current level of confidence be a guide to how much further investment we should make in testing?

Some interesting suggestions are coming in via the comments; many thanks for those.

My next plan, I think, is to repeat this exercise for a slightly more complex function.