Complexity and Test-first 0

It recently occurred to me to run a cyclomatic complexity tool over a codebase I know to have been largely written test-first/code/refactor (and often test-driven). The code is Java, so I used this Eclipse plugin to measure the per-method complexity.

The distribution of complexity per method looked like this:

Note that the count of methods is shown on a logarithmic scale.

There were 19,917 methods in this codebase. The mean complexity per method was 1.46, and the median was 4.5. Not shown on the chart are a few outliers: one method with complexity 91, one with 47, and one with 35.

The Extrema

The complexity-91 method turned out to be the doGet of a servlet providing a simple web service, crammed full of dispatching code of the form if (method.equals("getLocales")) printLocales(out, ids);. Similarly, the complexity-47 method is mainly concerned with calling out to a third-party API and finding which of a small number of equivalence sets a value from a large domain of return codes falls into. It does this by means of a huge switch with many, many fall-throughs. So in these cases we see a disconnect between the arithmetic of cyclomatic complexity and any notion of the code being complicated, in the sense of being difficult to understand (and therefore maintain).
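A sketch of that dispatch pattern (the command names here are invented, not the ones from the real servlet): each equals() test adds one decision point, so N commands give a cyclomatic complexity of N + 1, yet the code remains trivially easy to follow.

```java
// Hypothetical servlet-style dispatch. Four string comparisons, so
// cyclomatic complexity 5, despite the logic being a flat lookup.
static String dispatch(String method) {
    if (method.equals("getLocales"))    return "locales";
    if (method.equals("getCurrencies")) return "currencies";
    if (method.equals("getTimeZones"))  return "timeZones";
    if (method.equals("getCountries"))  return "countries";
    return "unknown";
}
```

A table-driven rewrite (say, a Map from command name to handler) would drive the measured complexity back down to 1 or 2 without making the behaviour any easier or harder to understand, which is rather the point.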

Down at the other end of the scale, the tens of thousands of complexity-1 and -2 methods are mostly of these sorts:
  • constructors
  • bean methods
  • forwarders
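For illustration, invented examples of those three kinds, each with a single path through it and hence cyclomatic complexity 1:

```java
// Illustrative complexity-1 methods: a constructor, a bean accessor,
// and a forwarder that simply delegates. None contains a decision point.
class Price {
    private final long pence;

    Price(long pence) {          // constructor
        this.pence = pence;
    }

    long getPence() {            // bean accessor
        return pence;
    }

    String describe() {          // forwarder: delegates straight on
        return Long.toString(pence);
    }
}
```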
In between are some more interesting methods, which I will deal with in a later post (which is why this is post #0), comparing the amount of testing that was actually done in writing each method test-first with the amount that the cyclomatic complexity measure suggests would be required after writing it.


In one discussion by McCabe there is the suggestion that 10 is an important threshold for cyclomatic complexity. The SEI treatment gives 11 as the lower bound of the "simple program, without much risk" class. They don't state quite what a "program" is in this sense; I guess they mean "subroutine" in the structured-programming sense. As we can see, methods with complexity >= 10 are vanishingly rare in the codebase I examined: evidence that the team working on this code is keeping per-method complexity under good control.

Now, while we can be confident that these methods are overwhelmingly very simple individually, the codebase itself is nonetheless considered (by the folks that work on it) highly complex. It does a lot of different things, interacts with many external systems, and, although it is all written in Java, contains several very different technology stacks. So the distribution of cyclomatic complexity isn't the whole story. The total complexity of the whole codebase is about 20,000, yet it has far fewer than 20,000 test methods (about 7,000, in fact), as a simple-minded application of cyclomatic complexity suggests would be required. Although, if you leave out the complexity-1 methods, the total complexity is only 9,000 or so, which paints a different picture.

I don't want to get embroiled in a discussion of whether or not cyclomatic complexity even makes sense for object-oriented code, not least because the code inside a Java method isn't particularly object-oriented. I do want to examine a mainstream notion of complexity and testing (the rule is that the complexity equals the smallest number of test cases required to cover all paths through the code) in the light of code written test-first. Once I've found a suitable candidate method and tests to explore, I'll let you know.
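As a reminder of how that rule works, a small invented example: a method with two independent decisions has cyclomatic complexity 3, and the rule predicts at least three test cases to cover its three paths.

```java
// Two if-statements: cyclomatic complexity 3, so the rule says a
// minimum of three test cases is needed, one per path.
static String classify(int n) {
    if (n < 0)  return "negative";
    if (n == 0) return "zero";
    return "positive";
}
```

Inputs of -1, 0, and 1 exercise all three paths, and no smaller set can.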

Wait a minute though...

Something about the shape of that chart caught my eye. It looks very much as if it has something close to a power-law distribution. This shows up better on a log-log rendering:

If we switch to a cumulative count of methods we can extract a pretty convincing Pareto distribution:
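For the curious, the slope of a fit like this can be estimated by ordinary least squares in log-log space. The sketch below assumes a plain table of complexity values against cumulative method counts; the data in the test is invented for illustration, not the real measurements.

```java
// Estimate the Pareto exponent by fitting a straight line to
// log(cumulative count) against log(complexity) and negating the slope.
static double paretoSlope(int[] complexity, int[] cumulativeCount) {
    int n = complexity.length;
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (int i = 0; i < n; i++) {
        double x = Math.log(complexity[i]);
        double y = Math.log(cumulativeCount[i]);
        sx += x; sy += y; sxx += x * x; sxy += x * y;
    }
    double slope = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    return -slope;  // report the exponent as a positive number
}
```

On exact power-law data (for instance cumulative counts of 16, 4, 1 at complexities 1, 2, 4) this recovers a slope of 2.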

It so happens that inside the same repository is a bunch of Java that I'm confident was not written test-first (and certainly not test-driven). That doesn't mean that it's bad code, just that it was written differently. What does its complexity chart look like?

Looks like fairly good scale-free behaviour, but let's look at a fitted Pareto distribution:

Not as good a fit. But perhaps more significantly, the slope is about half that of the test-first code: 1.39 vs 2.79.


We can imagine all sorts of reasons for these differences (not least that the non-test-first code is an order of magnitude smaller than the test-first code, which might account for the lower R-squared), but I'm interested in further investigating the tentative hypothesis that the test-code-refactor cycle results in code whose distribution of complexity is closer to Pareto, and with a steeper slope, than traditional approaches produce.

If that were true, how might we use it? Well, if the complexity distribution is Pareto/Zipf then there is no "preferred" complexity for the codebase. I'd imagine that if there were a preferred complexity then that might point to a (perhaps widely dispersed) collection of methods that could be candidates for refactoring. Question: do the well-known code smells [pdf] cause such a thing?

I'd further guess that the higher level of duplication expected in an inadequately refactored codebase (not DRY enough, not OAOO enough) would cause a bit of a hump in the complexity distribution. A sag in the distribution I have less of an intuition about, although it might point to something along these lines. You'll have to dig around in Prof. Salingaros's work to see what I'm getting at, but it's worth it.

Also, if the slope of the Pareto distribution is low then there will be proportionately more of the more complex methods in the codebase relative to the less complex ones; that can't be good, can it?
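To put rough numbers on that, assuming the idealised cumulative Pareto form where the fraction of methods with complexity at or above c goes as c to the power of minus the slope:

```java
// Under an idealised cumulative Pareto, the fraction of methods with
// complexity >= c is c^(-alpha), where alpha is the fitted slope.
static double fractionAtOrAbove(double c, double alpha) {
    return Math.pow(c, -alpha);
}
```

With the two slopes quoted above, the shallow fit (1.39) would put roughly 4% of methods at complexity 10 or above, against roughly 0.16% for the steep test-first fit (2.79): a difference of more than an order of magnitude in the proportion of awkward methods.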

Well, if it weren't for the day job I'd be looking into this in earnest. But as it is, I'm going to tinker about with this some more as and when I get the time. I'll report back when I do. If anyone else out there does any measurements like this, I'd be fascinated to hear about it.

The story continues here.


Anonymous said...

I think your graphs represent a sort of equilibrium between a few forces: adding more features tends to complicate matters, adding more structure statements. On the other hand, overly complex methods tend to "fall apart" as the people working on them try to make them easier to understand and maintain.

You'll have noticed that the lines fitted through the points are not such a good fit. The "best fit" curves are not lines. In particular, the more complex methods sit above the line, until just before the maximum complexity. That may have to do with the fact that large (legacy) methods are harder to refactor. You cannot break a huge and hugely complex method down in small steps any more. So, especially if it is barely tested or if the tests are also huge, such a method, which is in great need of refactoring to be understood, will resist refactoring because it cannot be understood or handled.

It would be nice to follow these metrics over time. You might be able to calculate the half-life of a complex method.

Best regards,
Willem Bogaerts

Anonymous said...

Pareto and Zipf are different distributions, and this is neither.

Robin Goodfellow said...

While this is an interesting area of research, it would be impossible to say that your work here represented any degree of rigor. Thus I can only say your investigations are interesting and deserve further study, but it would be foolish to say that this work supports any conclusions whatsoever.