September 24, 2006

M.J. Dominus on Design Patterns

Posted in Design Patterns, Programming Languages, Software at 6:51 pm by mj

In his recent essay, Design Patterns of 1972, Mark-Jason Dominus (a long-time favorite writer of mine from the Perl community) writes:

Had the “Design Patterns” movement been popular in 1960, its goal would have been to train programmers to recognize situations in which the “subroutine” pattern was applicable, and to implement it habitually when necessary. While this would have been a great improvement over not using subroutines at all, it would have been vastly inferior to what really happened, which was that the “subroutine” pattern was codified and embedded into subsequent languages.

He draws this analogy because, he believes, the “Design Patterns” movement, noble in its intent, has weakened programmers by desensitizing them to defects in their tools.

He continues:

The stance of the “Design Patterns” movement seems to be that it is somehow inevitable that programmers will need to implement Visitors, Abstract Factories, Decorators, and Façades. But these are no more inevitable than the need to implement Subroutine Calls or Object-Oriented Classes in the source language. These patterns should be seen as defects or missing features in Java and C++. The best response to identification of these patterns is to ask what defects in those languages cause the patterns to be necessary, and how the languages might provide better support for solving these kinds of problems.

This isn’t the first time he’s made this point. Several years ago, he offered a presentation on Christopher Alexander, where he asserted that the “Design Patterns” movement has turned out to be less useful for programmers than what Alexander proposed for architects. (To reduce the possibility of misunderstanding, it’s probably best to skip ahead and read slide 9, slide 10, slide 11 and slide 12 first.)

It’s sometimes difficult to envision a programming language embodying what we commonly think of as a pattern, which is why his analogies to the “subroutine pattern” of the 1960s and the “object pattern” of the 1980s are so apt.
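
To make his point concrete, here's a toy sketch of my own (it's not from the essay, and the class names are hypothetical): in a language without first-class functions, varying a sort order takes the full “Strategy pattern” ceremony, while in a language where functions are values, the pattern dissolves into a plain argument, just as the subroutine pattern dissolved into a language feature.

    import functools

    # The ceremony a less expressive language forces on you: an
    # interface, a concrete class per behavior, and an adapter at
    # the call site.
    class Comparator:
        def compare(self, a, b):
            raise NotImplementedError

    class ByLength(Comparator):
        def compare(self, a, b):
            return len(a) - len(b)

    words = ["patterns", "of", "1972"]
    print(sorted(words, key=functools.cmp_to_key(ByLength().compare)))

    # Where functions are first-class values, the "pattern" collapses
    # into an ordinary argument; the language feature subsumes it.
    print(sorted(words, key=len))  # ['of', '1972', 'patterns']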

September 17, 2006

Responding to Your Users: How Does Webshots Fare?

Posted in Blogging, webshots, Work at 2:01 pm by mj

The recent Facebook controversy brought out a few more good posts on how to engage, learn from, and respond to your users.

Teresa Valdez Klein at Blog Business Summit reiterates the standard advice: ask what your customers want, admit when you screw up, and, most important, don’t sneak back into your shell once the immediate crisis is over.

As usual, Robert Scoble has the best advice on corporate blogging. He takes it all several steps further: publish a video blog, meet with a cross-section of your users in person, link to blogger criticisms “out there”, and post responses on others’ blogs.

That’s quite a bit to take in for any business.

I am happy to say that Webshots has been improving on its communications.

The Webshots blog has been a growing source of communication between members and staff. Anne, Amy and Jessica have been doing a great job.

Anne put together an invited group of members on a separate blog, where we kick around new ideas, demonstrate new features, and so on. It has quite a different feel.

We also have a staff picks blog, which, while not truly a communications avenue, is a volunteer-run blog where Webshots employees pick and write about our great members. It puts even more of a personal touch on things.

Gone are the days when Webshots shunned communicating with its users, and avoided catering to the tech-savvy crowd. We’re now open with our failings, and when we do something wrong, Anne and Amy take a lot of flak so that we engineers don’t have to. 🙂 (Which isn’t usually fair to them.)

Going back to Scoble’s advice, probably the biggest struggle we’ve had is putting together a true cross-section of our users with whom to interact. Every group thinks they’re the target audience. And why not? Since the site doesn’t individualize the content it features to each member, everybody has a claim. That’s the problem with pushing the same content to 20M+ members.

After reading Scoble’s post, I realize how far we have left to go. Webshots never links to other blogs (I don’t think it’s a policy; maybe just an oversight or unspoken fear). And we never officially comment on others’ blogs, or promote Webshots in our own.

The latter point merits some extra thinking. Posting on others’ blogs in discussions about your employer blurs the line between “official company activity” and “personal opinion.” The opportunities for screwing up are pretty numerous. It’s easier if you’re the CEO.

Even this post blurs the line. (Obviously it’s all personal opinion and my employer will disavow just about everything I write. But the average reader might not see it that way.)

Is it good if your employees are blogging about you, even if they sometimes say stupid things, or criticize your management, or leave for a competitor? It’s been good for Scoble. Does it work for everyone?

September 11, 2006

SIGIR ’06 – Faceted Search & Indexing Efficiency

Posted in Conferences, Information Retrieval, Scale, Search, sigir2006 at 11:17 pm by mj

Writing recently about faceted search compared to “live filter” reminds me that I haven’t yet written much about SIGIR.

My favorite paper was Type Less, Find More: Fast Autocompletion Search with a Succinct Index. It was presented during the Efficiency session by Holger Bast of the Max-Planck-Institut für Informatik, and the results kind of blew me away, because they seem immediately practical.

The paper was also adapted into When You’re Lost for Words: Faceted Search with Autocompletion, and presented by his student and co-author, Ingmar Weber.

(I’ve learned that, a few days later, Holger also gave a Google talk on efficient autocompletion. Watch the video; Holger’s really a blast.)

There are two sides to this paper: first, proposing a dialogue-style search UI; and second, proposing a hybrid inverted-index data structure (which they call HYB, versus INV for a standard inverted index) to perform autocompletions efficiently. The latter is what really piqued my interest.

Let’s jump directly to the results. On a 426GB raw document collection, they claim to have obtained the following:

  • mean query time: 0.581s INV -vs- 0.106s HYB
  • 90%-ile query time: 0.545s INV -vs- 0.217s HYB
  • 99%-ile query time: 16.83s INV -vs- 0.865s HYB
  • max query time: 28.84s INV -vs- 1.821s HYB

That’s anywhere from a 2.5x speedup (at the 90th percentile) to a nearly 20x speedup (at the 99th), with the biggest wins in the tail latencies. Now, the tests they were performing were specific to the task of finding autocompletion terms and displaying intermediate results immediately as the user is typing. But get this:

Once they solved this problem, they realized it applies equally well to faceted search: simply treat your facets as prefix searches, and store your values as, e.g., cat:family. Then, for a given query of "holger bast", you convert that on the back-end to the prefix query "holger bast cat:", which instantly returns all of the categories in which Holger Bast has been classified.
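
To make the trick concrete, here's a toy, in-memory sketch of my own (nothing like the paper's actual HYB machinery; the index layout and helper names are hypothetical):

    from collections import defaultdict

    # Toy inverted index: term -> set of document IDs. Facet values
    # are indexed as ordinary terms under a reserved prefix, e.g.
    # "cat:family", so facet lookups become prefix queries.
    index = defaultdict(set)

    def add_doc(doc_id, terms):
        for t in terms:
            index[t].add(doc_id)

    add_doc(1, ["holger", "bast", "cat:algorithms"])
    add_doc(2, ["holger", "bast", "cat:information-retrieval"])
    add_doc(3, ["ingmar", "weber", "cat:algorithms"])

    def prefix_query(words, prefix):
        """Docs matching all words, plus every 'prefix*' term they contain."""
        docs = set.intersection(*(index[w] for w in words))
        completions = {t for t, ds in index.items()
                       if t.startswith(prefix) and ds & docs}
        return docs, completions

    docs, cats = prefix_query(["holger", "bast"], "cat:")
    print(sorted(cats))  # ['cat:algorithms', 'cat:information-retrieval']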

The reception during the faceted search workshop was mixed:

  • Yoelle Maarek of Google in Haifa (one of the organizers) argued with Holger over whether this was the same as Google Suggest (it’s not: Google Suggest uses a pre-computed list of popular queries, and does not perform query intersections).
  • Marti Hearst of UC Berkeley (the “grandmother” of faceted search, though she is much younger and cuter than the name might imply) at first did not see the applicability to faceted search.
  • Several members complained that the index had to be huge and inefficient.

On the last point, I think there was some confusion. (It’s hard to read all the papers before a session.) It took me a couple of readings before I got it, too.

The confusion seemed to be over the assumption that the words were being stored as prefixes. For example, a prefix list with minimum size 3 would store the word “alphabet” as the (unique) terms "alp" - "alph" - "alpha" - "alphab" - "alphabe" - "alphabet". This is (obviously) inefficient in disk usage.

What their HYB index is actually doing is storing the word “alphabet” within a multiset of postings (document ID lists, of inverted-index fame), along with the words “alpaca”, “alpha”, “alphameric”, and so on, assuming those terms exist in your document collection. They demonstrate a mathematical model for choosing the size of the range of words within a multiset based on the total size of the block: that is, the size of the encodings of the range of words plus the size of the encodings of the document IDs within which those words appear (the postings).

They are trading (slightly) worse performance when computing non-prefix result sets for (much) better performance when computing autocompletion results.

It’s clear, then, that there is minimal overhead in terms of disk usage: each word is still stored exactly once within the hybrid inverted index. The overhead comes from interleaving the word encodings with the posting encodings within each multiset block.
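
Here’s a toy rendering of that layout (my own sketch; the real HYB index uses careful bit-level encodings, not Python lists):

    import bisect

    # Each block covers a lexicographic range of words and stores one
    # (word, doc ID) pair per occurrence, sorted by doc ID, so every
    # word is still indexed exactly once, interleaved with its postings
    # rather than kept in a separate word list.
    blocks = [
        ("alpaca", [("alpha", 1), ("alphabet", 1), ("alpaca", 2), ("alphameric", 3)]),
        ("beta",   [("beta", 1), ("gamma", 2)]),
    ]

    def prefix_postings(prefix):
        """Decode only the blocks whose word range can contain the prefix."""
        firsts = [first for first, _ in blocks]
        start = max(bisect.bisect_right(firsts, prefix) - 1, 0)
        hits = []
        for first, pairs in blocks[start:]:
            if first > prefix + "\uffff":  # past all words with this prefix
                break
            hits += [(w, d) for w, d in pairs if w.startswith(prefix)]
        return hits

    print(prefix_postings("alpha"))
    # [('alpha', 1), ('alphabet', 1), ('alphameric', 3)]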

Unanswered questions: How well does this scale with real-world use (query throughput versus index size)? How much does this impact index build times and complexity (they claim no impact)? Does this affect relevancy?

September 10, 2006

Facebook: Publicity to Die Invade Privacy For

Posted in Business, Community, Fun at 8:09 pm by mj

(Full disclosure: I work for Webshots, which is a sort of competitor to Facebook.)

When the “Facebook Fiasco” started, I felt a little uneasy. Everybody I knew, and most in the blogosphere, were saying what an embarrassment this was for Facebook.

Hot on the heels of the AOL search-data stir-up, I could feel management at every community and social networking site gritting its collective teeth, preparing morning memos decreeing that all new features be vetted by legal.

My thought? No publicity is bad publicity.

As if a company that gets 100K+ of its members to protest a change by using its own services is really going to experience any lasting repercussions.

It now appears that I was right:

This is an excellent example of a company listening to its users and quickly pushing intelligent changes, in a transparent manner, to deal with a problem. Facebook is growing up, in a good way.

Also see their Alexa traffic spike.

Now that’s how to launch a new feature.

Now, Facebook didn’t do this intentionally. And many of these users certainly would have fled, eventually. There are some serious points in here, but it’s all quite funny, too.

Just consider their pitch to their advertisers: Last month, we committed a bit of a faux pas with a small little feature, and over one hundred and fifty thousand people came together on our site in a single day! Thousands of newspapers and blogs linked to us. Imagine if your campaign were running that day…

Backlashes, especially “unprecedented” ones, are better proof of your reach and the vitality of your business than anything else. For better or worse.

Interestingly, Sam Ruby has a different take on what was most disconcerting about Facebook’s feature: information overload.

One year assessment

Posted in Lessons Learned at 3:40 pm by mj

Last Monday (September 4) was my “blogiversary!” I didn’t even remember. That’s how much I suck. 🙂

So, in one year I posted a paltry 59 entries, going an average of 6 days between posts, and a max of 40 (during the winter holidays). Median time elapsed? 3 days. The mode of days elapsed? 1. Those stats sound much better, but are misleading.

Number of legitimate comments? 9.

Number of spam comments? I don’t know. I’ve manually deleted about a hundred over the year, but WordPress’ spam filter catches a dozen or more per day.

I can’t get WordPress to show individual post stats beyond the past two days (*sigh*), but from sampling my stats page, my most popular post, by far, is last December’s Webshots and Flickr: A (possibly) more thorough analysis.

Similarly, from my own sampling of my stats page, the most popular search term that leads to a pageview is… sex. It wasn’t until a month after I titled my blog that I realized I could have guessed the consequences. I’m probably also penalized by Google et al. Maybe it’s time for a name change.

My dismal Technorati rank is 838,064. (This is the first time I’ve checked my rank.)

My lesson is: post more. post better. link more. link better. and blogger, promote thyself.

September 9, 2006

“Live Filter” compared to Faceted Search

Posted in Programming Languages, Search, Software at 6:11 pm by mj

Ajaxian points to an article on a so-called “Live Filter” pattern (if the original article‘s site is down, see the Google cache).

The definition from the article:

We propose that for many problem domains, the basic concept of filtering is now much more appropriate than searching. With a search, you start off with nothing and potentially end up with nothing. Counter to this approach is filtering, where we present everything available, and then encourage the user to progressively remove what they do not need. Using Ajax to provide immediate feedback, we can display the number of results a user can expect to see if they were to submit their query at this moment.

The original article goes into many details about how it’s implemented underneath, and there is a Live Filter demo available.

This seems very much like a solution in need of a problem. For example, nowhere in the article is there a mention of faceted searching/navigation. It’s unclear why faceted searching did not meet the author’s needs, or why the concept of filtering should be thought unique in 2006. In all, it looks a bit like an excuse to promote AJAX and, especially, Ruby on Rails.

Judging by the demo, this is a poor man’s faceted search system. When you click on one of the (static and pre-determined) “filters”, the only thing the AJAX interface gives you is a message reading “Your search will return 7 results” and a button you can click to actually see the results.

A real faceted search system, on the other hand, presents you with context-aware options and immediately tells you how many results each option will yield. This is usually implemented without AJAX, but an AJAXy interface can be a great design help: hiding suboptions, preventing full page refreshes, or even displaying the best results for an option as you hover over it.
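
For illustration, here’s roughly what the counting looks like behind the scenes (a minimal sketch of my own; a real system computes these counts inside the search engine, and the field names here are made up):

    from collections import Counter

    # Each document carries facet values; for the current result set we
    # count how many results each facet option would leave. These are
    # exactly the numbers a faceted UI displays next to each option.
    docs = [
        {"id": 1, "color": "red",  "size": "S"},
        {"id": 2, "color": "red",  "size": "M"},
        {"id": 3, "color": "blue", "size": "M"},
    ]

    def facet_counts(results, facet):
        return Counter(d[facet] for d in results)

    # After filtering to color=red, the UI can label each size option:
    results = [d for d in docs if d["color"] == "red"]
    print(facet_counts(results, "size"))  # Counter({'S': 1, 'M': 1})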

I can see how “live filters” are slightly better than a normal search box for some specific, narrow applications, but, compared to building out a faceted search system, you might as well be doing all this in a TN3270 terminal.

The original article goes into good detail on what’s unique about Ruby on Rails in making this application easier to build and maintain than on other platforms. As an introduction to building applications in Rails and/or server-side AJAX design considerations, it’s great, and probably on par with a lot of toy applications you’ll find in intro books. (I recommend looking at the Live Filter source.)

September 7, 2006

Google Bigtable at OSDI ’06

Posted in Information Retrieval, Scale, Software at 11:01 pm by mj

A team of nine at Google is presenting a surprisingly detailed paper titled Bigtable: A distributed storage system for structured data at OSDI ’06 (another conference I wish I were hip enough to attend).

From the abstract:

Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers.

*drool*

A couple of highlights/thoughts (read the paper for the technical stuff):

  • Bigtable itself contains about 100,000 lines of code, excluding tests;
  • they leveraged GFS and Chubby (a “highly-available and persistent distributed lock service”); remember, kids: internal corporate infrastructure and services really matter;
  • in some ways, the Bigtable API is more-or-less logically equivalent to maintaining multiple distributed hashes, one for each column or property (e.g., title, date, etc.); however, unification not only feels more right to the developer, but allows for a more optimized implementation;
  • row keys are sorted lexicographically, allowing developers to control locality of access (e.g., “com.example/example1.html” and “com.example/example2.html” are very likely to be in the same cluster, which means processing all pages from the same domain is more efficient; see the sketch after this list);
  • scale is slightly less than linear due to varying temporal load imbalances (good to know even Google has this problem);
  • “One group of 14 busy clusters with 8069 total tablet servers saw an aggregate volume of more than 1.2 million requests per second, with incoming RPC traffic of about 741MB/s and outgoing RPC traffic of about 16GB/s.” I started drooling until I realized that’s only 148 requests per second per server, and 94KB/s in and 2MB/s out per server. Then I just laughed at those numbers!
  • various teams at Google are now using Bigtable, but it took some convincing (apparent resistance to a non-RDBMS for non-search activities); now many large services–among them, Google Earth, Google Analytics, and Personalized Search–are employing Bigtable to great effect; remember, kids: internal corporate infrastructure and services really matter (did I already say that?)
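
To make the row-key point above concrete, here’s a little sketch of my own (not Google’s API, just the idea of lexicographic range scans over reversed-domain keys):

    import bisect

    # Pages keyed by reversed domain sort adjacently, so one contiguous
    # range scan covers an entire site.
    rows = sorted([
        "com.example/example1.html",
        "com.example/example2.html",
        "com.other/index.html",
    ])

    def scan_prefix(prefix):
        """Binary-search to the first matching key; read until the prefix ends."""
        i = bisect.bisect_left(rows, prefix)
        while i < len(rows) and rows[i].startswith(prefix):
            yield rows[i]
            i += 1

    print(list(scan_prefix("com.example/")))
    # ['com.example/example1.html', 'com.example/example2.html']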

Finally, an excerpt from their “lessons learned,” which we’d all do well to remember:

[…] [L]arge, distributed systems are vulnerable to many types of failures, not just the standard network partitions and fail-stop failures assumed in many distributed protocols. […] memory and network consumption, large clock skew, hung machines, extended and asymmetric network partitions, bugs in other systems that we are using, overflow of GFS quotas, and planned and unplanned hardware maintenance.
[…]
[I]t is important to delay adding new features until it is clear how the new features will be used.
[…]
[T]he importance of system-level monitoring (i.e., monitoring Bigtable itself, as well as the client processes using Bigtable).
[…]
The most important lesson we learned is the value of simple designs.