Bayesian Testing?


I'm tossing this idea out into the world. It's half-formed and I'm learning as I go along. It may be invalid, it may be old news, it may not. What I'm hoping for is that someone who knows more about at least one of testing and Bayesian inference than I do will come and set me straight.

UPDATE: Laurent Bossavit turned out to be that person. The results below have be adjusted significantly a a result of a very illuminating conversation with him. Whatever virtue these results now have is due to him (and the defects remain my responsibility). Laurent, many thanks.

In addition, a bunch of folks kindly came along to an open space session at Xp Day London this year. Here is the commentary of one. From that already the idea became better formed, and this article reflects that improvement, thanks all. If you want to skip the motivation and cut to the chase, go here.


You may have read that absence of evidence is not evidence of absence. Of course, this is exactly wrong. I've just looked, and there is no evidence to be found that the room in which I am sitting (nor the room in which you are, I'll bet: look around you right now) contains an elephant. I consider this strong evidence that there is no elephant in the room. Not proof, and in some ways not the best reason for inferring that there is no elephant, but certainly evidence that there is none. This seems to be different from the form of bad logic that Sagan is actually criticising, in which the absence of evidence that there isn't an elephant in the room would be considered crackpot-style evidence that there was an elephant in the room.

You may also have read (on page 7 of that pdf) that program testing can be used to show the presence of bugs, but never to show their absence! I wonder. In the general case this certainly seems to be so, but I'm going to claim that working programmers don't often address the general case.

Dijkstra's argument is that, even in the simple example of a multiplication instruction, we do not have the resources available to exhaustively test the implementation but we still demand that it should correctly multiply any two numbers within the range of the representation. Dijkstra says that we can't afford to take even a representative sample (whatever that might look like) of all the possible multiplications that our multiplier might be asked to do. And that seems plausible, too. Consider how many distinct values a numerical variable in your favourite language can take, and then square it. That's how many cases you expect the multiplication operation in your language to deal with, and deal with correctly. As an aside: do you expect it to work correctly? If so, why do you?

A Small Example of Confidence

Let's say that we wish to write some code to recognise if a stone played in a game of Go is in atari or not (this is my favourite example, for the moment). The problem is simple to state: a stone with two or more "liberties" is not in atari, a stone with one liberty is in atari. A stone can have 1 or more liberties. In a real game situation it can be some work to calculate how many liberties a stone has, but the condition for atari is that simple.

A single stone can have only 1, 2, 3 or 4 liberties and those are the cases I will address here. I write some code to implement this function and I'll say that I'm fairly confident I've got it right (after all, it's only an if), but not greatly so. Laurent proposed a different question to ask from the one I was asking before—a better question, and he helped me find and understand a better answer.

The prior probability of correctness that question leads to is 1 ⁄ 16. This is because there are 16 possible one-to-one onto mappings from {1, 2, 3, 4} to {T, F} and only one of them is the correct function. Thus, the prior is the prior probability that my function behaves identically to some other function that is correct by definition.

How might a test result influence that probability of correctness? There is a spreadsheet which shows a scheme for doing that using what very little I understand of Bayesian inference, slightly less naïvely applied than before.

Cells in the spreadsheet are colour–coded to give a guide as to how the various values are used in the Bayesian formula. The key, as discussed in the XpDay session is how to count cases to find the conditional probabilities of seeing the evidence.

The test would look something like this:
One Liberty Means Atari

The posterior probability of correctness is 0.125

Adding More Test Cases

Suppose that I add another case that shows that when there are 2 liberties the code correctly determines that the stone is not in atari.
One Liberty Means Atari
Using the same counting scheme as in the first case and using the updated probability from the first case as the prior in the second then it seems as if the updated probability of correctness with the new evidence is increased to 0.25 as this sheet shows.

But suppose that the second test actually showed an incorrect result: 2 liberties and atari true.
One Liberty Means Atari
Then, as we might expect, the updated probability of correctness falls to 0.0 as shown here. And as the formula works by multiplication of the prior probability by a factor based on the evidence, the updated probability will stay at zero no matter what further evidence is presented—which seems like the right behaviour to me.

This problem is very small, so in fact we can exhaustively test the solution. What happens to the probability of correctness then? Extending test coverage to these cases
One Liberty Means Atari
gives an updated probability of 0.5 as shown here.

One more case remains to be added:
One Liberty Means Atari
and the posterior probability of correctness is updated to 1.0 as shown here.

That result seems at to contradict Dijkstra: exhaustive testing, in a case where we can do that, does show the absence of bugs. He probably knew that.


My brain is fizzing with all sorts of questions to ask about this approach: I talked here about retrofitted tests, can it help with TDD? Can this approach guide us in choosing good tests to write next? How can the structure of the domain and co-domain of the functions we test guide us to high confidence quickly? Or can't they? Can the current level of confidence be a guide to how much further investment we should make in testing?

Some interesting suggestions are coming in in the comments, many thanks for those.

My next plan I think will be to repeat this exercise for a slightly more complex function.


Markus Gärtner said...

From my vague understanding of Bayesian filtering approaches your consideration does not seem to be misleading to me.

While trying to come up with some elaboration on it, I realized that I need to think more about it. I tried to apply the models I know from pattern classification to software testing, but wasn't able to bring it to a pragmatic model, yet. Bayesian classification is just one of the applications that could serve some purpose.

keithb said...

Hi Markus,
During the discussion at Xp Day someone did suggest that it might we worth restating the question as a choice between hypotheses, as a classifier might do, rather that as the degree of support for an hypothesis.

I don't know enough about Bayesian classifiers to know where to start with that.

Matthew said...

I think the baysian filtering concept can work to give us rules of thumb - heuristics - about the state of the software.

For the ability to prove it, you might want to google "impossibility of complete testing." Lots and lots of good stuff in that body of knwoledge. (Or check out chapter 16 of "beautiful testing", which you can get for free on the web here: It's okay, I wrote it and distribute it under creative commons.)

Markus Gärtner said...

After looking up the formula from Paul Graham, the basic algorithm used for email spam filtering works something like this:

You have a repository of words counted as good and one counted as bad. In each you have the number of occurrences of the words in good mails and in bad mails. To yield an email, you iterate over all words in that mail and calculate a score. You look up each word in your mail in both repositories, getting the counts for the good and the bad occurrences. You give these measures also some weighting (2 for good in Grahams explanation, 1 for bad). Then you divide the number of bad occurrences through the number of good plus bad. This results in a percentage value of being bad. After scoring each word on this scheme, you take the 5 or 10 best words and the 5 or ten worst words and calculate these up in a correlation model. This gives you an overall score for the email at hand. Then you use that heuristic to train your classificator with the words in that mail and put them into either the good or bad repository. Of course, this is a heuristic approach and needs human intervention from time to time to foster the training basis.

How could this approach be transported to software? Well, you have working software and buggy software. Training then a classificator could give you hints on where bad software lies. Actually, Thomas Zimmermann has done something like this. Search for "When do changes induce fixes?" which is a pretty good paper on the topic. The idea is to combine the bug database together with the source code repository and feedback the information into the GUI of the developer to indicate risky areas in the code. The approach might be taken further in combination with a pattern classification approach to show what past patterns of bugs were in the source code base thus yielding a classificator for your particular development team.

This is the model of pattern classification that I have transported to software thus far. Your write-up gives a new idea for yet another heuristic which might prove to be working. As I said before, I think I need to think a bit more over this.

I hope I have given some deeper insights into the topic.

tonyx said...

Hi, thanks for the post, this is my comment:

I don't know how much is the probability of a bug, given that all test passes (I want it to be close to zero).

So I want to keep p(bug | test passes) small.

Bayes rule is
P(bug | test passes) = P( test passes | error) * p(error)/p(test passes)


Test passes means p(test passes)=1
thus we can simplify:

p(bug | test passes) = p(test passes | bug) * p(bug)

p(test passes | bug) is the probability that test passes given that there is a bug. More the tests I do, the more code I cover, the lower is this value.
it is also higher if I have to test mocks.

p(bug) = probablity of a bug for a program that we can write for solving it. More complex is the problem, more this value is high.

So for complex problem, p(bug) is higher, and I have to do more tests in order to make p(bug | test passes) still low.

How can we define complexity here?

There is a theory about, called Kolmogorov-Chaitin complexity, that states (roughly) that the size of the lowest program possible that does something is the complexity of "something".

For example: "the taxes that you have to pay are 30%"
is simple
while the following is more complex:

the taxes that you have to pay are 0% under Y,
17% for the part that exeeds Y
23% for the part that exeeds Y1
and so on...

The latter schema is clearly more complex that the former, and it's intuitive that it requires more tests.

So we could estimate the complexity of a problem(from the size of the refactored code (minus boilerplate code), and mocks used when solving that problem?

I would also consider the possibility of bugs due to lack of communications between customer and developers. This is alswo a complexity issue.


keithb said...

Hi Tony, thanks for your comment. I think I agree that Kolmogorov complexity probably has something to do with this—but I need to get my understanding of how to apply Bayes sorted out first. it got a lot better just today.

Michael Bolton said...

Testing for specific problems is not the only reason we test. I think that you might be confusing the specific case and the general case.

Absence of an elephant is one thing; absence of evidence for an elephant is another. However, we also test

- to find out if there are elephants in the next room.
- to find out if there are ways that an elephant might get in.
- to see whether our conceptions of "an elephant" are consistent with others. For example: there is an elephant in the room that I'm in; it's a stuffed elephant that belongs to my daughter. In addition, the words "an elephant" appear several times in the text in front of me. Now, you might say "that's not what I meant; I meant a real elephant." Yet my daughter's elephant is real. So then you might say "I meant a real living elephant." In that case, we can agree that there is no elephant in the room that we can perceive. But you might have met something different. That's the kind of information that excellent testing is intended to reveal.

Want an example closer to programming? "The program will accept an integer as input." By that, someone might mean a short integer (16 bits, maybe), a long integer (32 bits, maybe, or 64 bits, maybe), or an integer of arbitrary or indeterminate length. Testing helps to reveal that kind of misunderstanding.

The important thing about Bayes, as far as I know, is that it's a heuristic and not an algorithmic approach to problem-solving. Bayesian analysis is a way of evaluating the strength of a heuristic decision governed by a particular model. Yet it's important to check to see if we've got the right model, too.

---Michael B.

keithb said...

Hi Michael,
I absolutely am talking about what I believe you call "checking" and not about "exploration".

My experience, though, is that pushing really hard on the checking lever results in the explorers meeting fewer elephants.

I also absolutely agree with you about the benefits of testing practice in uncovering misunderstandings. That's why I encourage my clients to get testers involved in requirements work.

What I'm investigating here (and will continue to in some subsequent posts) is very narrow, and deliberately so. Perhaps I should make that clearer.

simi said...


This Software Testing artical is very useful for me. I am a QA Engineer and always looking to learn something new. I would like to introduce another good Software Testing blog, Have a look.


Brian Marick said...

A few random thoughts:

- there's been academic work with both Bayesian estimation & Kolmogorov complexity applied to testing. For Bayes, Robert Horgan comes to mind, also Keith Miller For Kolmogorov, John Cherniavsky. This is all from the early 90's, I think.

- Doug Hoffman wrote a nice little article about how they used a hypercube machine to test its own sqrt instruction for *all* 2^32 possible inputs. They found two failing values due to a microcode bug that the developers thought would have been unlikely to have been found with more conventional methods.

- A problem with random testing is mapping existence of a bug to impact. For example, the long-ago Pentium floating-point divide bug did not (if I recall correctly) affect very many floating point numbers, but the economic cost was huge. Similarly, a rare bug for the population as a whole may be an extremely frequent bug for some particular subset. [That's a bit - maybe a lot - afield from the topic of the post, but consider that "confidence" in the Bayesian sense may be different for different populations, and that "confidence" in the statistical sense can be quite different than "confidence" in the "I'll stake my job on releasing this" sense.]

tonyx said...
This comment has been removed by the author.
Immo Hüneke said...

Hi Keith,

This is thought-provoking stuff. A more realistic "evidence of absence" example than the elephant in the room might be the difficulty of determining when a species is truly extinct in the wild. For example, a specimen of the Pyant Cheezar turtle was discovered in a Chinese food market in 1994, the last living individual having been observed in 1908. Perhaps naturalists have something to tell us about deciding the probability that something you can't find actually isn't there. (Perhaps disturbingly, sometimes I can't find something that I know must be there. My wife and I have concluded that there must be aliens in cloaked space craft studying the human race, who occasionally beam up artefacts for examination and surreptitiously replace them later).

I don't know whether IBM still does this, but they used to have a requirement on projects to measure defect statistics throughout the lifecycle. By studying the trend line of defects being discovered and rectified week by week, it was possible to predict fairly accurately how many defects remained in a software product and use this information to direct resources to finding them. If the curve showed a sudden upward kink after the release of a product to end users, then clearly the testing had not been thorough enough - subsequent phases and other projects could use this information to allocate correspondingly more effort to testing.

Michael Bolton said...

Hi, Keith...

Thank you for the reply.

What I'm investigating here (and will continue to in some subsequent posts) is very narrow, and deliberately so. Perhaps I should make that clearer.

I look forward to those subsequent posts.


---Michael B.

Dathan said...


Very cool stuff, and timely since I'm experimenting with Bayesian methods for a project at work right now.

The first thought that occurs to me is, for a method that's more or less unbounded (the square function comes to mind, say, with 64-bit integers), isn't it still going to take a prohibitive number of trials to get your confidence high? In this case, is a Bayesian method applicable?