November 4, 2007

Indirection Has a Price, Too

Posted in Software tagged , , , at 9:00 am by mj

Robert C. Martin asks How big should a function be?.

His conclusion? Functions should be tiny:

Functions should be small. Very small. A well written function might be 4, 5, perhaps 8 lines long; but no longer.
Typically, the body of an if or while statement should be no more than a single function call.

The body of a try block should also be a function call; and the word try should be the first word in the function.

I’ve seen this principle applied too vigorously. To figure out in exactly what way a value of 2 for the baz parameter affects the outcome of a function, you end up stepping into twelve other functions. Not fun.

As programmers, we need to recognize that any indirection has two costs (excluding performance considerations). First, there is the present cost to producing the code. Second, there is the future cost of understanding the code.

We’re accustomed to indirection. Object hierarchies, composition, configuration files, resource bundles, enumerations, … all introduce indirection, often with lofty goals like reducing coupling and isolating future changes.

What are the lofty goals behind keeping functions to 4-8 lines?

One goal is better unit testing. The more you decompose your function, the more individual pieces of it you can test in isolation, and the easier it is to figure out precisely what’s broken when a test fails.

Another goal is code reuse. If you have four high-level functions that do almost the same thing, but some criticial piece changes depending on the value of a parameter, they can each delegate most of their work to other functions.

But does that necessarily outweight the cost?

Consider this analogy. Some bloggers like to write entries that look like this:

Foo wonders if the Martian invasion may be good for humans. Then Bar responded eloquently with this look at Martian imperial history. Very Important Baz then took it a step further with Earth: the Final Frontier. I concur! What do you think?

Pop quiz: what is the thesis of that post?

You simply don’t know because of all the indirection!

Sure, it’s great if you have an hour to click on every link, and read the other entries, and, if you really care about the poster–because, say, you have a secret crush on this person–you’ll even tie it all together in the same way and leave a comment like “You’re absolutely right!”

Now, imagine a Wikipedia entry written in that way. Wikipedia has a policy against original research, and forces you into using secondary sources. So, why not do away entirely with summaries and excerpts?

Is it because humans just aren’t equipped to deal with indirection in that way?

So it is with code, because code is as much about communicating with other programmers as it is about driving the processor.

In my experience, functions around 20-25 lines usually hit the sweet spot. Such functions usually call at least two or three other functions, so they reasonably decompose distinct subtasks. Yet, they avoid unnecessary indirection, so a human reading the function can easily understand its corner cases, and perhaps has a better shot at simplifying it by spotting duplicated code, or introducing earlier returns/exception throwing.

My habit leans toward writing a single function, and decomposing into sub functions when (a) unit testing is significantly simplified or clarified, or (b) I need to reuse pieces in other contexts, or (c) I find that, in the end, it just looks messy and wouldn’t be something I’d want to come back to.

If we must talk about perfect function sizes as a function of lines of code, it’s best to distinguish types of functions. Data access functions should tend to be small. Input validation functions will call more sub functions. String parsing functions will tend to use a loop and only call sub functions if the processing is complex. Basic algorithms like searching and sorting will do all their processing inline. And so on.


October 7, 2007

The Evolution of a Programming Puzzle

Posted in Coding tagged , , at 10:59 am by mj

Mark-Jason Dominus is currently examining a series of programs attempting to solve Van der Waerden’s problem: four that he wrote in 1988, and one he wrote this year.

Van der Waerden’s theorem says that for any number of colors, say C, a sufficiently-long row of colored dots will contain n evenly-spaced same-color dots for any n. Or, put another way, if you partition the integers into C disjoint classes, at least one class will contain arbitrarily long arithmetic progressions.

So far, he’s only written an introduction and dissected two of his 1988 programs, but it’s already a memorable series. Not only does it show the iterative process of solving an interesting problem (suitable for a SSE interview?), but it tells the story of his evolution as a programmer.

Mark’s blog is one of my favorites, and even though his posts tend to be long and interesting, I never mark them as “toread” the way I do most other long posts when I’m catching up on feeds. I always reflect on them, even if it means making negative progress on my feed reading.

Some quotes to set the mood:

In yesterday’s article I wrote about a crappy program to search for “good” strings in van der Waerden’s problem. It was crappy because it searched the entire space of all 327 3^27 strings, with no pruning. I can’t remember whether I expected this to be practical at the time. Did I really think it would work?
I was thinking this afternoon how it’s intersting the way I wrote this. It’s written the way it would have been done, had I been using a functional programming language. In a functional language, you would never mutate the same string for each function call; you always copy the old structure and construct a new one, just as I did in this program. This is why C programmers abominate functional languages.


On an unrelated note, I’m thinking about auto-posting items from my Google Reader shared items to this blog. Good idea? Bad idea? Should I do so on a daily basis, or put a bit more work into rating them and posting the top N weekly? Or do it all manually?

Update (2007-10-13): Mark pointed out that the exponent did not survive my copy and paste, so it appeared he was only searching through 327 strings, instead of 3^27. D’oh!