September 11, 2006

SIGIR ’06 – Faceted Search & Indexing Efficiency

Posted in Conferences, Information Retrieval, Scale, Search, sigir2006 at 11:17 pm by mj

Writing recently about faceted search compared to “live filter” reminds me that I haven’t yet written much about SIGIR.

My favorite paper was Type Less, Find More: Fast Autocompletion Search with a Succinct Index. It was presented during the Efficiency session by Holger Bast of the Max-Planck-Institut für Informatik, and the results kind of blew me away, because they seem immediately practical.

The paper was also adapted into When You’re Lost for Words: Faceted Search with Autocompletion, presented by his student and co-author, Ingmar Weber.

(I’ve learned that, a few days later, Holger also gave a Google talk on efficient autocompletion. Watch the video – Holger’s really a blast.)

There are two sides to this paper: first, a dialogue-style search UI; and second, a hybrid inverted-index data structure that performs autocompletion efficiently (referred to as HYB, with the standard inverted index as the INV baseline). The latter is what really piqued my interest.

Let’s jump directly to the results. On a 426GB raw document collection, they claim to have obtained the following:

  • mean query time: 0.581s INV -vs- 0.106s HYB
  • 90%-ile query time: 0.545s INV -vs- 0.217s HYB
  • 99%-ile query time: 16.83s INV -vs- 0.865s HYB
  • max query time: 28.84s INV -vs- 1.821s HYB

That’s roughly a 2.5x to 19x speedup, depending on the percentile. Now, the tests they were performing were specific to the task of finding autocompletion terms and displaying intermediate results immediately as the user is typing. But get this…

Once they solved this problem, they realized it applies equally well to faceted search: simply treat your facets as prefix searches, and store your values as, e.g., cat:family. Then, for a given query of "holger bast", you convert that on the back-end to the prefix query "holger bast cat:", which instantly returns all of the categories in which Holger Bast has been classified.
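To make that concrete, here’s a toy Python sketch of the trick. The in-memory dict-of-sets index and the function names are my own inventions, not the paper’s; the real system does all of this over the compressed HYB structure described below.

    from collections import defaultdict

    # Toy in-memory index for illustrating the facets-as-words trick.
    postings = defaultdict(set)  # term -> set of doc IDs

    def index_document(doc_id, words, facets):
        for word in words:
            postings[word].add(doc_id)
        for facet, value in facets:
            postings[f"{facet}:{value}"].add(doc_id)  # e.g. "cat:family"

    def facet_values(query_terms, facet):
        # Documents matching the query so far (posting-list intersection).
        matches = set.intersection(*(postings[t] for t in query_terms))
        # Treat 'query facet:' as a prefix search: collect every facet:* term
        # whose posting list intersects the current matches.
        prefix = f"{facet}:"
        return {term[len(prefix):] for term, docs in postings.items()
                if term.startswith(prefix) and docs & matches}

    index_document(1, ["holger", "bast"], [("cat", "family")])
    index_document(2, ["holger", "bast"], [("cat", "computer-science")])
    print(facet_values(["holger", "bast"], "cat"))
    # -> {'family', 'computer-science'} (set order may vary)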

The reception during the faceted search workshop was mixed:

  • Yoelle Maarek of Google in Haifa (one of the organizers) argued with Holger over whether this was the same as Google Suggest (it’s not: Google Suggest uses a precomputed list of popular queries and does not perform query intersections).
  • Marti Hearst of UC Berkeley (the “grandmother” of faceted search, although she is much younger and cuter than the name might imply) at first did not see the applicability to faceted search.
  • Several members complained that the index had to be huge and inefficient.

On the last point, I think there was some confusion. (It’s hard to read all the papers before a session.) It took me a couple of readings before I got it, too.

The confusion seemed to be over the assumption that the words were being stored as prefixes. For example, a prefix list with minimum size 3 would store the word “alphabet” as the (unique) terms "alp" - "alph" - "alpha" - "alphab" - "alphabe" - "alphabet". This is (obviously) inefficient in disk usage.
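A quick sketch of that naive scheme (my own illustration, not the paper’s) shows the blow-up:

    def prefixes(word, min_len=3):
        # Naive scheme: index every prefix of length >= min_len as a separate term.
        return [word[:i] for i in range(min_len, len(word) + 1)]

    print(prefixes("alphabet"))
    # -> ['alp', 'alph', 'alpha', 'alphab', 'alphabe', 'alphabet']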

What their HYB index is actually doing is storing the word “alphabet” exactly once, inside a block covering a lexicographic range of neighboring words such as “alpaca”, “alpha”, and “alphameric” (assuming those terms exist in your document collection), where each block holds a multiset of postings (document ID lists, of inverted-index fame). They demonstrate a mathematical model for choosing the range of words within each block based on the total size of the block, that is, the size of the encodings of the word range plus the size of the encodings of the document IDs within which those words appear (the postings).
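Here’s a rough, uncompressed Python sketch of that layout as I understand it. The class, the hand-picked block boundaries, and the string-comparison shortcuts are mine; the real HYB compresses both the word and document IDs and chooses block boundaries via the size model above.

    import bisect

    class HybIndex:
        """Toy HYB-style index: each block covers a lexicographic range of
        words and stores that range's (word, doc ID) pairs interleaved."""

        def __init__(self, block_starts):
            # Sorted lower-bound words, one per block. The paper's model picks
            # these so encoded block sizes balance out; here they're given.
            self.starts = sorted(block_starts)
            self.blocks = [[] for _ in self.starts]

        def add(self, word, doc_id):
            # Route the pair to the block whose word range contains it.
            i = bisect.bisect_right(self.starts, word) - 1
            self.blocks[max(i, 0)].append((word, doc_id))

        def prefix_search(self, prefix):
            # Scan only the blocks whose word range can contain the prefix,
            # filtering the interleaved pairs as we go.
            i = max(bisect.bisect_right(self.starts, prefix) - 1, 0)
            results = []
            for start, pairs in zip(self.starts[i:], self.blocks[i:]):
                if start[:len(prefix)] > prefix:  # past all possible matches
                    break
                results += [(w, d) for w, d in pairs if w.startswith(prefix)]
            return results

    idx = HybIndex(["", "alp", "be"])
    for word, doc in [("alpaca", 2), ("alpha", 3), ("alphabet", 7), ("beethoven", 9)]:
        idx.add(word, doc)
    print(idx.prefix_search("alph"))  # -> [('alpha', 3), ('alphabet', 7)]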

They are trading (slightly) worse performance for computing non-prefix result sets for (much) better performance for computing autocompletion results.

It’s clear, then, that there is minimal overhead in terms of disk usage: each word is still stored exactly once within the hybrid inverted index. The overhead comes from weaving the word encodings with the posting encodings within each multiset block.

Unanswered questions: How well does this scale with real-world use (query throughput versus index size)? How much does it impact index build times and complexity (they claim no impact)? Does it affect relevance?


1 Comment

  1. […] eBay was among the first to require not only near-real-time indexing, but also to provide large-scale faceted search. Listening to Eric, solving the near-real-time indexing problem using message queues and what-not was apparently easier than solving the faceted browsing problems: showing exactly how many results there are in any given category, where the categories change with every query. (See also my previous post on faceted search and indexing efficiency for a recent cool attempt to solve this.) […]

