October 7, 2007

The Evolution of a Programming Puzzle

Posted in Coding tagged , , at 10:59 am by mj

Mark-Jason Dominus is currently examining a series of programs attempting to solve Van der Waerden’s problem: four that he wrote in 1988, and one he wrote this year.

Van der Waerden’s theorem says that for any number of colors, say C, a sufficiently-long row of colored dots will contain n evenly-spaced same-color dots for any n. Or, put another way, if you partition the integers into C disjoint classes, at least one class will contain arbitrarily long arithmetic progressions.

So far, he’s only written an introduction and dissected two of his 1988 programs, but it’s already a memorable series. Not only does it show the iterative process of solving an interesting problem (suitable for a SSE interview?), but it tells the story of his evolution as a programmer.

Mark’s blog is one of my favorites, and even though his posts tend to be long and interesting, I never mark them as “toread” the way I do most other long posts when I’m catching up on feeds. I always reflect on them, even if it means making negative progress on my feed reading.

Some quotes to set the mood:

In yesterday’s article I wrote about a crappy program to search for “good” strings in van der Waerden’s problem. It was crappy because it searched the entire space of all 327 3^27 strings, with no pruning. I can’t remember whether I expected this to be practical at the time. Did I really think it would work?
I was thinking this afternoon how it’s intersting the way I wrote this. It’s written the way it would have been done, had I been using a functional programming language. In a functional language, you would never mutate the same string for each function call; you always copy the old structure and construct a new one, just as I did in this program. This is why C programmers abominate functional languages.


On an unrelated note, I’m thinking about auto-posting items from my Google Reader shared items to this blog. Good idea? Bad idea? Should I do so on a daily basis, or put a bit more work into rating them and posting the top N weekly? Or do it all manually?

Update (2007-10-13): Mark pointed out that the exponent did not survive my copy and paste, so it appeared he was only searching through 327 strings, instead of 3^27. D’oh!

September 22, 2007

Turning a corner on unit testing

Posted in Coding, Development, Me at 3:53 pm by mj

I think I’ve finally done it.

This week, I found myself struggling within the confines of existing infrastructure code that had no unit tests. The task was relatively simple: a new feature at the infrastructure-level that would enable many other product-level features in coming months. If I’d gone down the hackish route and not worked at the infrastructure level, I could’ve isolated this code and tested it in complete isolation. But that would just lead to more bugs down the road, and I’d still have a lot of integration testing to do.

It was one of the more painful and slow development experiences in recent memory. Even worse than dealing with 10-year old, undocumented, legacy Perl code written by non-programmers. Yeah, that painful.

For a long time, I contended that unit testing’s primary benefit was long-term: ensuring that refactoring code which you did not write and do not fully understand does not break. And that you need to reach a certain critical mass in terms of number and variety of tasks before it becomes effective. And that’s still true. I think my skepticism rubbed off, too, because I hardly hear anyone saying, “But the unit tests pass, there can’t be a bug” anymore.

But then I started looking at how I approached the practice of programming. Before I write code–often as part of designing the system–I write out stubs of how I’m going to use the code. Because an aesthetically unpleasing or overly complicated API is an error-prone API. And it dawned on me that this is one of the big benefits of test-first development.

Last year, when I read Agile Software Development: Principles, Patterns, and Practices (finally after it had been on my bookshelf for years), it was a bit like a revelation. Agile development, really, is formalizing what are otherwise good development-time practices anyway. So I resolved to better formalize my development activities.

It’s taken a while, but now I am so accustomed to using unit tests to isolate problems, and the quick turnaround that entails, that there is a significant mental barrier to any other way of doing it. I’ve turned that corner and am heading down NotReallyUtopiaButGoodEnough Street.

July 21, 2007

Non-cooked mode input problems with piping over SSH

Posted in Coding, input redirection, Linux, Software, ssh at 8:46 pm by mj

Thought I’d share this, since what I thought was a 2-hour Saturday early morning side project went horribly awry.

Consider this Perl code snippet:

use Term::ReadKey;

while (1) {
    my $key = ReadKey(-1);
    if (defined($key)) {
        print "Got key $key\\n";

Simple enough: it does nothing until you press a key. This idiom might be used, for example, with non-blocking reads while you’re processing another file, as below:

use Term::ReadKey;
use IO::Select;

my $handle = IO::File->new("tail -f /some/file |");
my $selectable = IO::Select->new();

my ($tailCommandPid) = getChildPids();

while (1) {
    my $key = ReadKey(-1);
    if (defined($key)) {
        print "Got key $key\\n";
    my @ready = $selectable->can_read(1);
    foreach my $h (@ready) {
        my $line = <$h>;
        print $line if defined($line);
kill 9, $tailCommandPid;

This also works. It continually echoes the lines it reads, until you press a key.

But what if you want to read the lines from a remote shell? Can we replace the line

my $handle = IO::File->new("tail -f /some/file |");


my $handle = IO::File->new("ssh someServer tail -f /some/file |");

and be on our merry way?

Well, no. In this case, your input on STDIN is completely ignored, unless you switch back to plain-old cooked mode.

Or is it really ignored? Maybe ssh is using your input for its own nefarious purposes. Makes perfect sense if your program is but a mediary between your user and an interactive remote shell.

Turns out, ssh accepts -f as an argument to force it into the background. Will that work?

my $handle = IO::File->new("ssh -f someServer tail -f /some/file |");

Success! But now we’re killing the wrong process at the end.

By forcing itself into the background, ssh is really daemonizing itself: its parent PID (on Linux) is 1. Your program will exit, but you’ll leave ssh happily running in the background, unless you result to less-than-safe tricks with ps and grep.

Further down in the man page, however, and we find the trick: -n keeps ssh running as a normal child process, but does not read from STDIN:

my $handle = IO::File->new("ssh -n someServer tail -f /some/file |");

And this, my friends, is a mistake I will never make again. And neither will you.

June 3, 2007

Linus Torvalds on Source Control

Posted in Coding, Development, Software at 12:51 pm by mj

This was on Slashdot earlier. It’s a really interesting video of Linus speaking at the Googleplex about the design philosophy of GIT, and why other version control systems are solving the wrong problems.

Forget all the comments about Linus being an arrogant git and so on, he makes a compelling point about distributed repositories and networks of trust, even inside corporate environments. Toward the end, he also talks about how GIT’s use of SHA-1 to provide consistency also aids in recovery of a lost/corrupted repository even from people you don’t trust.

Of course, he is a bit wrong about the problems with CVS/SVN merges. Developers experienced with merging in those environments know how to get around some of the most annoying problems (e.g., remember (tag) the point at which you last merged to avoid phantom conflicts). But he’s still right about the more fundamental design flaws.

January 25, 2007

Is Code Quality Subjective?

Posted in Coding, Development, Software, Style at 4:42 am by mj

Most developers I know fall into one of two camps: either they believe that their code is the best code and that everybody else who does it differently needs to learn a thing or two, or they believe that some “standard” codebase is the best code and, de facto, represents the best way to do something.

Certainly, I have a hard time understanding why anybody would write

    if (someCondition) doSomething();

rather than

    if (someCondition) {

Nowhere has my annoyance been higher than when I started dealing with code from a team of offshore contractors, who not only do the former (against all Java coding conventions), but they like to confuse others with beauties like:

/* is the following code active or inactive? * /
if (complexLogic1() && complexLogic2()) doSomethingAgain();


 <!-- MyTag
 </MyTag -->
   <SubTag>another value</SubTag>

Another recent example: I’ve been evaluating Solr for use at Webshots. The Solr guys have done a tremendous job producing an enterprise-y application layer on top of Lucene, and it came out of an internal CNET project by another team (developed in response, IIRC, to the acquisition of Webshots, which introduced Lucene to CNET). There are still some suspicious gaps in Solr that make it not yet suitable for Webshots’ scale, but nothing a few good contributions from us can’t tackle.

But the code style!

One good, and recent, example is negative filters. It just so happens that I solved the same issue recently at Webshots without knowledge that Solr already had an issue open on it. My solution was a NegatedFilter class that can be utilized by any consumer of Lucene, usually in conjunction with a CachingWrapperFilter. I just didn’t contribute it to Lucene (bad developer, no cookie!).

Their solution (patch available from the link) is quite Solr-specific, messy, and long. I haven’t yet compared to see if theirs is a significantly better performer, but since mine is just flipping bits and took, e.g., a 4 minute filter down to 10 seconds, I don’t see that there’s much need for improved performance.

And yet, Yonik and team will tell you that the Webshots search application is also messy and fragile, despite its scalability. (I know because I’ve had this discussion with them before. :)) In fact, they’ll also tell you that our application isn’t scalable. Which is somewhat true–but, so far, I haven’t seen where Solr is more scalable. I’ll have better performance metrics in a few weeks, though.

Finally, another example is using Spring for app configuration. The problem that I see with spring is precisely the same discussion that takes place in the Ruby community: the ability to use domain-specific languages. If you specify your own XML format for configuration, the XML file becomes a domain-specific language that feels natural for the application that you’re trying to configure. If you use Spring for your configuration, you get a lot of nicetities on the code side, but your vocabulary becomes limited to beans and properties and props, which feels natural and intuitive in only a few contexts, and certainly is nothing you’d want to give non-engineers to manage.

Thoughts, further examples or rebuttals?