Tests vs checks
Trying to spread the good word on "testing" vs "checking" in this article for T. E. S. T. Magazine.
Bayesian Testing?
Introduction
I'm tossing this idea out into the world. It's half-formed and I'm learning as I go along. It may be invalid, it may be old news, it may not. What I'm hoping for is that someone who knows more about at least one of testing and Bayesian inference than I do will come and set me straight.
UPDATE: Laurent Bossavit turned out to be that person. The results below have been adjusted significantly as a result of a very illuminating conversation with him. Whatever virtue these results now have is due to him (and the defects remain my responsibility). Laurent, many thanks.
In addition, a bunch of folks kindly came along to an open space session at Xp Day London this year. Here is the commentary of one. The idea became better formed as a result, and this article reflects that improvement; thanks, all. If you want to skip the motivation and cut to the chase, go here.
Evidence
You may have read that absence of evidence is not evidence of absence. Of course, this is exactly wrong. I've just looked, and there is no evidence to be found that the room in which I am sitting (nor the room in which you are, I'll bet: look around you right now) contains an elephant. I consider this strong evidence that there is no elephant in the room. Not proof, and in some ways not the best reason for inferring that there is no elephant, but certainly evidence that there is none. This seems to be different from the form of bad logic that Sagan is actually criticising, in which the absence of evidence that there isn't an elephant in the room would be considered crackpot-style evidence that there was an elephant in the room.

You may also have read (on page 7 of that pdf) that program testing can be used to show the presence of bugs, but never to show their absence! I wonder. In the general case this certainly seems to be so, but I'm going to claim that working programmers don't often address the general case.
Dijkstra's argument is that, even in the simple example of a multiplication instruction, we do not have the resources available to exhaustively test the implementation but we still demand that it should correctly multiply any two numbers within the range of the representation. Dijkstra says that we can't afford to take even a representative sample (whatever that might look like) of all the possible multiplications that our multiplier might be asked to do. And that seems plausible, too. Consider how many distinct values a numerical variable in your favourite language can take, and then square it. That's how many cases you expect the multiplication operation in your language to deal with, and deal with correctly. As an aside: do you expect it to work correctly? If so, why do you?
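To get a feel for the size of that problem, here is a back-of-the-envelope sketch in Python; the 32-bit width and the billion-checks-a-second rate are my illustrative assumptions, not Dijkstra's figures:

```python
# Back-of-the-envelope size of an exhaustive test of multiplication,
# assuming 32-bit integers (an illustrative choice, not Dijkstra's numbers).
distinct_values = 2 ** 32            # values a 32-bit integer variable can take
cases = distinct_values ** 2         # every (a, b) pair the multiplier must handle correctly
print(f"{cases:.3e} cases")          # about 1.845e+19

# Even at a (generous) billion checks per second, that is centuries of checking.
years = cases / 1e9 / (60 * 60 * 24 * 365.25)
print(f"roughly {years:.0f} years")  # roughly 585 years
```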
A Small Example of Confidence
Let's say that we wish to write some code to recognise whether a stone played in a game of Go is in atari or not (this is my favourite example, for the moment). The problem is simple to state: a stone with two or more "liberties" is not in atari, a stone with one liberty is in atari. A stone can have 1 or more liberties. In a real game situation it can be some work to calculate how many liberties a stone has, but the condition for atari is that simple. A single stone can have only 1, 2, 3 or 4 liberties and those are the cases I will address here. I write some code to implement this function and I'll say that I'm fairly confident I've got it right (after all, it's only an if), but not greatly so. Laurent proposed a different, better question to ask from the one I was asking before, and he helped me find and understand a better answer.

The prior probability of correctness that question leads to is 1/16. This is because there are 16 possible functions from {1, 2, 3, 4} to {T, F} and only one of them is the correct one. Thus, the prior is the probability that my function behaves identically to some other function that is correct by definition.
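To make that concrete, here is a minimal sketch in Python (the function and names are mine, not from the post's code or the spreadsheet) of the function under test and of the sixteen candidate functions that the prior ranges over:

```python
from itertools import product

def in_atari(liberties: int) -> bool:
    """My implementation under test: a stone with exactly one liberty is in atari."""
    return liberties == 1

LIBERTIES = (1, 2, 3, 4)

# Every possible function from {1, 2, 3, 4} to {True, False}: 2**4 = 16 of them,
# each represented as the tuple of its outputs for liberties 1..4.
candidates = list(product([True, False], repeat=len(LIBERTIES)))
assert len(candidates) == 16

# Exactly one of the 16 behaves like the (by definition correct) specification,
# so before seeing any test results the prior probability of correctness is 1/16.
prior = 1 / len(candidates)
print(prior)  # 0.0625
```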
How might a test result influence that probability of correctness? There is a spreadsheet which shows a scheme for doing that using what very little I understand of Bayesian inference, slightly less naïvely applied than before.
Cells in the spreadsheet are colour-coded to give a guide as to how the various values are used in the Bayesian formula. The key, as discussed in the XpDay session, is how to count cases to find the conditional probabilities of seeing the evidence.
The test would look something like this:
| One Liberty Means Atari | |
|---|---|
| liberties | atari? |
| 1 | true |
The posterior probability of correctness is 0.125.
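The spreadsheet isn't reproduced here, but a small counting sketch in Python (my own reconstruction of the scheme, not the spreadsheet's formulae) arrives at the same number:

```python
from itertools import product

# All 16 possible functions from liberties {1,2,3,4} to {True, False} (atari or not),
# each written as the tuple of outputs for liberties 1..4. Uniform prior: 1/16 each.
candidates = list(product([True, False], repeat=4))
correct = (True, False, False, False)   # the specification: atari iff exactly one liberty

def posterior(observed):
    """P(my code is the correct function | the (liberties, result) cases observed from it),
    computed by counting the candidate functions consistent with the observations."""
    consistent = [f for f in candidates if all(f[n - 1] == result for n, result in observed)]
    return 1 / len(consistent) if correct in consistent else 0.0

# The first test case: my code says a stone with one liberty is in atari, which matches the spec.
print(posterior([(1, True)]))   # 0.125 -- eight candidates remain consistent, one of them correct
```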
Adding More Test Cases
Suppose that I add another case that shows that when there are 2 liberties the code correctly determines that the stone is not in atari.

| One Liberty Means Atari | |
|---|---|
| liberties | atari? |
| 1 | true |
| 2 | false |
Using the same counting scheme as in the first case, and taking the updated probability from the first case as the prior for the second, the updated probability of correctness given the new evidence rises to 0.25, as this sheet shows.
But suppose that the second test actually showed an incorrect result: 2 liberties and atari true.
Then, as we might expect, the updated probability of correctness falls to 0.0 as shown here. And as the formula works by multiplication of the prior probability by a factor based on the evidence, the updated probability will stay at zero no matter what further evidence is presented—which seems like the right behaviour to me.
| One Liberty Means Atari | |
|---|---|
| liberties | atari? |
| 1 | true |
| 2 | true |
This problem is very small, so in fact we can exhaustively test the solution. What happens to the probability of correctness then? Extending test coverage to these cases gives an updated probability of 0.5, as shown here.
| One Liberty Means Atari | |
|---|---|
| liberties | atari? |
| 1 | true |
| 2 | false |
| 3 | false |
One more case remains to be added, and the posterior probability of correctness is updated to 1.0, as shown here.
That result seems to contradict Dijkstra: exhaustive testing, in a case where we can do that, does show the absence of bugs. He probably knew that.

| One Liberty Means Atari | |
|---|---|
| liberties | atari? |
| 1 | true |
| 2 | false |
| 3 | false |
| 4 | false |
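Putting the whole progression together, the same counting sketch (again my reconstruction in Python, not the spreadsheet) reproduces the sequence of updates, including the collapse to zero when a case contradicts the specification:

```python
from itertools import product

candidates = list(product([True, False], repeat=4))    # all 16 functions {1,2,3,4} -> {True, False}
correct = (True, False, False, False)                  # atari iff exactly one liberty

def posterior(observed):
    """Bayes by counting: the probability my code is the correct function,
    given the (liberties, result) cases observed from running it."""
    consistent = [f for f in candidates if all(f[n - 1] == result for n, result in observed)]
    return 1 / len(consistent) if correct in consistent else 0.0

passing = [(1, True), (2, False), (3, False), (4, False)]
for k in range(1, len(passing) + 1):
    print(k, posterior(passing[:k]))       # 0.125, 0.25, 0.5, 1.0: doubling with each passing case

# A contradicting result (two liberties reported as atari) drives the posterior to zero,
# and no amount of further evidence can bring it back up.
print(posterior([(1, True), (2, True)]))   # 0.0
```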
Next?
My brain is fizzing with all sorts of questions to ask about this approach: I talked here about retrofitted tests; can it help with TDD? Can this approach guide us in choosing good tests to write next? How can the structure of the domain and co-domain of the functions we test guide us to high confidence quickly? Or can't they? Can the current level of confidence be a guide to how much further investment we should make in testing?

Some interesting suggestions are coming in in the comments, many thanks for those.
My next plan I think will be to repeat this exercise for a slightly more complex function.
Bridging the Communication Gap
Call it automated acceptance testing, functional test driven development, checked examples or what you will—the use of automatic validation is one of the most effective tools in the Agile developer's kit.
This book is a much-needed checkpoint in the on-going adventure to discover (and re-discover) how to write software effectively. Gojko is a very energetic enthusiast for these ideas, and a very experienced practitioner of them. His knowledge and expertise are evident on every page.
A strong theme runs through the book—that the reason we capture examples of required behaviour and automate validation of them is to improve communication. Examples turn out to be a very powerful way to understand a problem domain and to explore a solution suitable for that domain. There turn out to be fascinating reasons for why this is true, but Gojko quite reasonably focusses on practical advice.
The main body of the book tells a story, a story of understanding, finding, and using examples to create shared understanding across a team. Gojko gives very concrete advice in a series of short chapters: how to organise a workshop to find examples, how to find good examples, how to use tools to automate validation, and how to use the resulting tests to guide development. Each chapter ends with a handy bullet list of key points. Together with other material on the best use developers can make of such checked examples, and on how to fit example discovery and capture into a typical Agile development process, Bridging the Communication Gap provides as close to a vade mecum for newcomers to the discipline of functional test driven development as we are likely to see.
Gojko draws informative parallels with other techniques more or less strongly aligned with the Agile development world. This places the practice of Agile acceptance testing in context, and as a team-wide activity, reinforcing the cross-functional nature of the tool. Always the emphasis is on helping the various stakeholders in a development project communicate better.
There is a survey of tools available for this kind of work, which I might wish were slightly broader in scope and a little more detailed, but it does give a good overview of the market leaders. "Market leaders" in the weakest sense, since it turns out that the best tools for this kind of work are all FOSS: big-ticket corporate testing tools really aren't in this game.
Various points regarding writing and using tests are illustrated with (of course) illuminating examples. Also described are limitations of these techniques and some pitfalls to watch out for, something that more promoters of development techniques should provide.
The book is self-published and my copy was printed by Lightning Source. Books produced this way are getting better all the time, but are still not presented at the level of quality one would expect from a commercial publishing house. The pages seem very full, and combined with the choice of font this makes the text a very dark colour, which I don't find easy to read. The section and sub-section headings are sometimes overlong and are not laid out well, a combination that I found made the book less easy to navigate than it might have been.
I will be using this book with clients and recommending it to them for future reference. A boon to the community.
Dynamic vs static: once more with feeling
In what feels to me like a voyage through a time-warp to the beginnings of my programming career in the mid–90's, Jason Gorman has revived the old static vs dynamic typing debate.
Oh, how it all comes flooding back: "strong[sic] typing is for weak minds", "static type systems catch the kind of bug that managers understand" etc. etc. etc.
Jason's concerns seem to have been raised by something to do with this sort of thing (see the part on the var keyword), although he seems to muddle up a language having a dynamic type system with it being dynamic. These aspects are closely related, but are not quite the same (as that post explains reasonably well). It's picking nits of this variety that keeps us all in work. Anyway, I very much agree with Jason that this resurgence of interest in dynamicish, scripty languages is driven largely by fashion, and that claims made about it should be closely examined. I'm not so sure about the rest of his argument. Not least because I'm not sure that what he complains about:
"Proponents of such languages cite the relative flexibility of dynamic typing compared to statically-typed languages like Java and C++. Type safety, they argue, can be achieved through unit testing."

...is actually being said by anyone. Type safety through unit testing? Really? Maybe someone is saying it and I haven't seen it; I'd be interested to see a link if anyone has one.
Personally, I do tend towards dynamic languages and away from manifest static type systems. My objection to manifest static typing is that (in the words of, IIRC, Kent Beck) it makes me say things that I don't know are true. On the other hand, I've been dabbling a bit with Haskell, which has a very strong, very expressive static type system and offers the promise (through type inference) of not requiring all those pesky declarations. That has been both educational and fun. Unfortunately, as Nat pointed out in another context, that lovely promise might not be delivered upon:
"[in Haskell] If you don't write explicit type constraints you can end up with a type inference error somewhere in your code. Where? The only way to find out is to incrementally add explicit type constraints (using binary chop, for example) to narrow down where the error is. It's not much different, and no easier, than using printf for debugging C code."

If this is the best case of static typing, then we have a problem.
Meanwhile, let's consider the distinction between systems programming and application programming. Try googling around the various attempts to nail down that distinction; I don't find any of them terribly satisfactory. For me, the crucial distinction is that the systems programmer must allow for any possible use of their code, whereas an application programmer need not.
This means that in systems land a function declared like f :: int -> int must, unless very carefully specified otherwise, be known to do the right thing at every element of the whole Cartesian product of the int type with itself. But in application land we might, for example, know that those ints are really the numbers of the days of the week, so we only really need the function to do the right thing over {0,1,2,3,4,5,6}². Demonstrating those two different kinds of correctness requires different techniques, I think.

Of course, using int when you mean {0,1,2,3,4,5,6} is a smell. And here is the deodorant.
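The linked "deodorant" isn't reproduced here, but as a rough illustration of the idea (a sketch in Python rather than in a statically typed language, and the names are mine), a dedicated day-of-week type narrows the domain to exactly the values that matter:

```python
from enum import Enum

class Day(Enum):
    """A type whose values are exactly the seven days, rather than a bare int
    that only stays within {0,...,6} by convention."""
    MON = 0
    TUE = 1
    WED = 2
    THU = 3
    FRI = 4
    SAT = 5
    SUN = 6

def is_weekend(day: Day) -> bool:
    # The function need only be right over the seven Day values,
    # not over the whole Cartesian product of int with int.
    return day in (Day.SAT, Day.SUN)

print(is_weekend(Day.SAT))   # True
```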
I want to join these two things up. I want to make a connection between the kind of correctness that systems code needs to have and the way that static typing might help us with that, versus the kind of correctness that application code needs to have and the way that unit testing might help us with that. But I'm not there yet. Watch this space.
An unreasonably high bar?
Some time ago Adam Goucher posted this response to my 5 questions interview with Michael Hunter. There's a few points in there that I want to come back to, but right now the one that's at the front of my mind is this:
"Count tests to get a useless number; I can write a million tests that provide useless information but still shows 7 figures in the count."

Well yes, you could. But why would you? We seem to have a hankering in the industry for techniques that would give good results even when badly applied by malicious idiots. That seems unreasonable. And also pointless: I don't believe that the industry is populated by malicious idiots. On the other hand, the kind of answer one gets depends a lot on how a question is asked.
There is (I read somewhere recently) a principle in economics that one cannot use one number as both a measure of and a target for the same thing and expect anything sensible to happen. [Allan tells me that this is Goodhart's Law --kb] In our world this is the route to the gaming of metrics. I also don't believe that gaming works by folks consciously sitting down and conspiring to fabricate results. I do believe that if we measure, say, test coverage at every check-in and publish it on our whizzy CI server dashboard thingy and have a trend line of coverage over time and we talk a lot about higher coverage being better, or even that test coverage has something to do with "quality" [that would be the "surrogate measure" part of Goodhart's Law --kb] then it is in fact the response of a smart and well intentioned team member to write more tests to get the number up. Even if those tests turn out not to be much use for anything else.
I think (certainly I hope so) that my recommendation to measure scope by counting tests doesn't fall into that trap. Don't write the tests so that you can measure scope. But observe that you can if you write the tests the right way. Of which I shall have a bit more to say later.
Shameless self promotion dept.
I've been interviewed by Michael Hunter in his "five questions" column for Dr. Dobb's.
Pols' Perfect Purchaser
Andy Pols tells a great story about just how great it can be to have your customer 1) believe in testing and 2) completely engage in your development activities. Nice.
Subtext continues to amaze
I just watched the video describing the latest incarnation of Subtext. The video shows how the liveness, direct manipulation and example–based nature of this new table–based Subtext makes the writing and testing of complex conditionals easy.
Tests and Gauges
At the recent Unicom conference re Agility and business value I presented on a couple of cases where I'd seen teams get real, tangible, valuable...value from adopting user/acceptance/checked-example type testing. David Peterson was also presenting. He's the creator of Concordion, an alternative to Fit/Library/nesse. Now, David had a look at some of my examples and didn't like them very much. In fact, they seemed to be exactly the sort of thing that he found Fit/etc encouraged, of which he disapproved, and to avoid which he created Concordion in the first place. Fair enough. We tossed this back and forth for a while, and I came to an interesting realization. I would in fact absolutely agree with David's critique of the Fit tests that I was exhibiting, if I thought that they were for the purpose that David thinks his Concordion tests are for. Which would probably mean that any given project should have both. But I don't think that, so I don't.
Turning the Tables
David contends that Fit's table-oriented approach affords large tests, with lots of information in each test. He's right. I like the tables because most of the automated testing gigs that I've done have involved financial trading systems, and the users of those eat, drink, and breathe spreadsheets. I love that I can write a fixture that will directly parse rows off a spreadsheet built by a real trader to show the sort of thing that they mean when they say that the proposed system should blah blah blah. The issue that David sees is that these rows probably contain information that is, variously: redundant, duplicated, irrelevant, obfuscatory, and various other epithets. He's right, often they do.

What David seems to want is a larger number of smaller, simpler tests. I don't immediately agree that more, simpler things to deal with all together are easier than fewer, more complex things, but that's another story. And these smaller, simpler tests would have the principal virtue that they more nearly capture a single functional dependency. That's a good thing to have. These tests would capture all and only the information required to exercise the function being tested for. This would indeed be an excellent starting point for implementation.
There's only one problem: such tests are further away from the users' world and closer to the programmers'. All that stuff about duplication and redundancy is programmers' talk. And that's fair enough. And it's not enough. I see David's style of test as somewhat intermediate between unit tests and what I want, which is executable examples in the users' language. When constructing these small, focussed tests we're already doing abstraction, and I don't want to make my users do that. Not just yet, anyway.
So then I realised where the real disagreement was. The big, cumbersome, Fit style tests are very likely too complicated and involved to be a good starting point for development. And I don't want them to be that. If they are, as I've suggested, gauges, then they serve only to tell the developers whether or not their users' goals have been met. The understanding of the domain required to write the code will, can (should?) come from elsewhere.
Suck it and See
And this is how gauges are used in fabrication. You don't work anything out from a gauge. What you do is apply it to see if the workpiece is within tolerance or not. And then you trim a bit off, or build a bit up, or bend it a bit more, or whatever, and re-apply the gauge. And repeat. And it doesn't really matter how complicated an object the gauge itself is (or how hard it was to make, and it's really hard to make good gauges), because it is used as if it were both atomic and a given. It's also made once, and used again and again and again and again...

Until this very illuminating conversation with David I hadn't fully realised quite the full implications of the gauge metaphor. It actually implies something potentially quite deep about how these artifacts are built, used and managed. Something I need to think about some more.
Oh, and when (as we should) we start to produce exactly those finer–grained, simpler, more focussed tests that David rightly promotes, and we find out that the users' understanding of their world is all denormalised and stuff, what interesting conversations we can have with them then about how their world really works, it turns out.
Might even uncover the odd onion in the varnish. But let's not forget that having the varnish (with onion) is more valuable to them than getting rid of the onion.
Exemplary Thoughts
So, I was asked to write up the "lightning talk" on examples and exemplars I gave at Agile 2007. That was a short and largely impromptu talk, so there is some extra material here.
Trees
It used to be that botanists thought that the Sugar Maple, the Sycamore and the Plane trees were closely related. These days they are of the opinion that the Sycamore and Sugar Maple are closely related to one another, but the Plane is related to neither. This is one example of the way that our idea of how we organise the world can change.

As it happens, this confusion is encoded in the binomial names of these species: the Sugar Maple is Acer saccharum, the (London) Plane tree is Platanus x acerifolia, while the (European) Sycamore is Acer pseudoplatanus. Oh, and if you are a North American then you call your Plane trees "Sycamores" anyway. And furthermore, not one of these trees is the true Sycamore: that's a fig, Ficus sycomorus.
Botanists and zoölogists are the masters of classification, but as we see they have to modify their ideas from time to time (and these days the rise of cladistics is turning the whole enterprise inside out).
Greeks
A well-known dead Greek laid the foundations of our study of classification about two and a half thousand years ago, in terms of what can be predicated of a thing. It all seemed perfectly reasonable, and was the basis of ontological and taxonomic thinking for many centuries. This is interesting to us who build systems, because the predicates that are (jointly) sufficient and (individually) necessary for a thing to be a member of a category in Aristotle's scheme can be nicely reinterpreted as, say, the properties of a class in an OO language, or the attributes of an entity in an E-R model, and so forth. All very tidy. One small problem arises, however: this isn't actually how people put things into categories! It also has a terrible failure mode: as we can see from all this "acerifolia" and "pseudoplatanus" stuff in the trees' names, the shape of the leaves was not a good choice of shared characteristic to use to classify them. It is of this mistake (amongst other reasons) that the unspeakable pain of schema evolution arises.

The Greeks, by the way, already knew that there were difficulties with definitions. After much straining between the ears, an almost certainly apocryphal story goes, the school of Plato decided that the category of "man" (in the broadest sense, perhaps even including women and slaves) was composed of all those featherless bipeds. Diogenes of Sinope ("the cynic") promptly presented the Academicians with a plucked chicken. At which they refined their definition of Man to be featherless bipeds with flat nails and not claws.
In 1973 Eleanor Rosch published the results of some actual experiments into how people really categorise things, which seem to show that the members of a category are not equal (as they are in Aristotle's scheme): a small number of them are dominant, and Rosch calls these the "prototypes" of the category. And what these prototypes are (and therefore what categories you recognise in the world) is intimately tied in with your experience of being in the world. And these ideas have been developed in various directions since.
One implication of the non-uniformity of categories is that they are fuzzy, and that they overlap. The import for us in building systems is that maybe people have difficulty writing down all these hard-and-fast rules about hard-edged, homogeneous categories of thing, as many requirements elicitation techniques want, because that's just not a good fit for how they really think about the world.
Germans
But perhaps examples do. Examples can be extracted from a person's usual existential setting, which means that they can be more ready-to-hand than present-at-hand. This is probably good for requirements and specifications (it's not universally good: retrospectives force a process in which one is usually immersed to be present-at-hand, and this is good too). Also, people can construct bags of examples that have a family resemblance without necessarily having to be able to systematize exactly why they think of them as having that resemblance. This can usefully help delay the tendency of us system builders to prematurely kill off options and strangle flexibility by wanting to know the nethermost implication and/or abstraction of a thing before working with it.

And maybe that's why example-based "testing", which is really requirements engineering, which is really a communication mode, does so much better than the other.
I'm proposing a session on this very topic for Agile 2008. I encourage you to think about proposing a session there, too.
Gauges
The "test" word in TDD is problematical. People are (rightly) uncomfortable with using it to describe the executable design documents that get written in TDD. The idea of testing has become too tightly bound to the practice of building a system and then shaking it really hard to see what defects fall out. There is an older sense of test, meaning "to prove", which would help but isn't current enough. Fundamentally, though, these artefacts are called tests for historical reasons (ie, intellectual laziness). One attempt to fix this vocabulary problem has the twin defects of going too far in the direction of propaganda, and not far enough in the actual changes it proposes.
In any case, I'm more interested in finding explanatory metaphors to help people use the tools that are currently widely available and supported than I am in...doing whatever it is to people's heads that the BDD crowd think they are doing. Anyway, I've found that it's a bit helpful to talk about test-first tests being gauges (as I've mentioned in passing before). Trouble is that too few people these days have done any metalwork.
A Metaphor Too Far
So, the important thing about a plug gauge or suchlike is that it isn't, in the usual sense, a measuring tool. It gives a binary result: the workpiece is correctly sized to within a certain tolerance, or it isn't. This makes, for example, turning a bushing to a certain outside diameter a much quicker operation than it would be if the machinist had to get out the vernier micrometer and actually measure the diameter after each episode of turning and compare that with the dimensioned drawing that specifies the part. Instead, they get (or assemble, or make) a gauge that will tell whether or not a test article conforms to the drawing, and use that.
And this is exactly what we do with tests: rather than compare the software we build against the requirement after each development episode, we build a test that will tell us if the requirement is being conformed to. But so few people these days have spent much time in front of a lathe that this doesn't really fly.
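For example, a gauge-like check might look something like this sketch (Python, with a made-up nominal diameter and tolerance); the point is that it answers pass or fail against the specification, not how far off the workpiece is:

```python
# A hypothetical "gauge" expressed as a test: the nominal size and tolerance are made up.
NOMINAL_DIAMETER_MM = 25.0
TOLERANCE_MM = 0.05

def within_gauge(measured_diameter_mm: float) -> bool:
    """True iff the workpiece conforms to the drawing; like a plug gauge,
    it neither knows nor cares by how much the piece passes or fails."""
    return abs(measured_diameter_mm - NOMINAL_DIAMETER_MM) <= TOLERANCE_MM

# Applied after each episode of turning: a binary answer, not a measurement.
print(within_gauge(25.03))   # True  -- within tolerance, the part conforms
print(within_gauge(25.20))   # False -- trim a bit more and re-apply the gauge
```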
But, flying home from a client visit today my eye was caught by one of those cage-like affairs into which you dunk your cabin baggage (or not). It would be far too slow for the check-in staff to get out a tape measure, measure your bag, and compare the measurements with the permitted limits. So instead, they have a gauge. From now on (until I find a better one), that's my explanatory metaphor. Hope it works.