September 11, 2006

SIGIR ’06 – Faceted Search & Indexing Efficiency

Posted in Conferences, Information Retrieval, Scale, Search, sigir2006 at 11:17 pm by mj

Writing recently about faceted search compared to “live filter” reminds me that I haven’t yet written much about SIGIR.

My favorite paper was Type Less, Find More: Fast Autocompletion Search with a Succinct Index. This paper was presented during the Effiency section by Holger Blast of Max-Planck-Institut für Informatik, and the results kind of blew me away, because they seem immediately practical.

The paper was also adapted into When You’re Lost for Words: Faceted Search with Autocompletion, and presented by his student and cohort, Ingmar Weber.

(I’ve learned that, a few days later, Holger also gave a Google talk on efficient autocompletion. Watch the video – Holger’s really a blast.)

There are two sides to this paper: first, proposing a dialogue-style search UI; and second, proposing a hybrid inverted index data structure to perform autocompletions efficiently (referred to as HYB, whereas inverted indexes are referred to as INV). The latter is what really piqued my interest.

Let’s jump directly to the results. On a 426GB raw document collection, they claim to have obtained the following:

  • mean query time: 0.581s INV -vs- 0.106s HYB
  • 90%-ile query time: 0.545s INV -vs- 0.217s HYB
  • 99%-ile query time: 16.83s INV -vs- 0.865s HYB
  • max query time: 28.84s INV -vs- 1.821s HYB

That’s a 300 – 2000% improvement. Now, the tests they were performing were specific to the task of finding autocompletion terms, and displaying intermediate results immediately as the user is typing. But get this..

Once they solved this problem, they realized it applies equally well to faceted search: simply treat your facets as prefix searches, and store your values as, e.g., cat:family. Then, for a given query of "holger blast", you convert that on the back-end to the prefix query "holger blast cat:"which instantly returns you all of the categories in which Holger Blast has been classified.

The reception during the faceted search workshop was mixed:

  • Yoelle Maarek of Google in Haifa (one of the organizers) argued with Holger over whether this was the same as Google Suggest (it’s not–Google suggest uses a pre-computed list of popular queries, and does not perform query intersections).
  • Marti Hearst of UC Berkely (the “grandmother” of faceted search–although she is much younger and cuter than the name might imply) at first did not see the applicability to faceted search.
  • Several members complained that the index had to be huge and inefficient

On the last point, I think there was some confusion. (It’s hard to read all the papers before a session.) It took me a couple of readings before I got it, too.

The confusion seemed to be over the assumption that the words were being stored as prefixes. For example, a prefix list with minimum size 3 would store the word “alphabet” as the (unique) terms "alp" - "alph" - "alpha" - "alphab" - "alphabe" - "alphabet". This is (obviously) inefficient in disk usage.

What their HYB index is actually doing is storing the word “alphabet” as a multiset of postings (document Id lists from inverted index fame), along with the words “alpaca”, “alpha”, “alphameric”, and so on, assuming those terms exist in your document collection. They demonstrate a mathematical model for choosing the size of the range of words within a multiset based on the total size of the block–that is, the size of the encodings of the range of words plus the size of the encodings of the document Ids within which those words appear (the postings).

They are trading off (much) better performance for computing auto completion results with (slightly) worse performance for computing non-prefix result sets.

It’s clear, then, there is minimal overhead in terms of disk usage: each word is still stored exactly once within the hybrid inverted index. The overhead comes from weaving the word encodings with the posting encodings within each multiset block.

Unanswered questions: how well does this scale with real-world use (query throughput versus index size)? how much does this impact index build times/complexities (they claim no impact)? does this affect relevancy?

August 6, 2006

SIGIR ’06 – Day 0: Welcoming Reception

Posted in Conferences, sigir2006 at 11:47 pm by mj

After a fourteen hour drive shared with my wife, I arrived in time to check into my hotel, check in to the conference, and attend the welcoming reception. I’m sure I looked like a walking zombie.

First, let me say that I am a bit out of my league here. I did not go down the mathematical/theoretical route with my career, and I have, thus far, not been able to take Webshots’ search where I believe it should go. So I am a bit embarrassed introducing myself to the people here, since most of them are either researching in an academic environment or applying theory on a large-ish scale.

The first thing that has struck me is how heavily dominated by the GYM team this conference is. While it’s the largest SIGIR to date, my guess is most of that’s due to the GYM recruiting competition: by my count, 135 of the 653 official attendees are from Google, Yahoo, or Microsoft (that’s 20%!!!). The largest contingent, of course, being from Microsoft (maybe 70?). Every one of the grad students in attendence can look forward to a lucrative career. 🙂

Unfortunately, I forgot my camera at the hotel, and my Webshots account is having trouble accepting mobile uploads from my phone. (See? Even Webshots engineers sometimes have problems. And yes, I did track down the issue. And no, I can not magically push the fix through in the middle of the night. What I can do is call customer support in the morning…) I have some grainy photos of the Microsoft and Google booths sitting side-by-side: a match made in, um, Seattle.

The welcoming reception was all right. I expected a half hour of socializing followed by two hours of (essentially boring) speeches, announcements, introductions, instructions, etc. Instead, we had a relaxing 30-minute bus ride to Boeing’s Future of Flight Museum, several hours of socialing/networking/recruiting in a party-like atmosphere, pretty good food, and five minutes of “screw it. we’re not going to bore you. welcome to sigir. now get back to socializing.”

But, as I said, I was a walking zombie and talked to only a few people. My natural shyness didn’t help, of course.

Tomorrow, the real interesting stuff begins. But for now, I need some rest in a real bed.