September 11, 2006

SIGIR ’06 – Faceted Search & Indexing Efficiency

Posted in Conferences, Information Retrieval, Scale, Search, sigir2006 at 11:17 pm by mj

Writing recently about faceted search compared to “live filter” reminds me that I haven’t yet written much about SIGIR.

My favorite paper was Type Less, Find More: Fast Autocompletion Search with a Succinct Index. This paper was presented during the Efficiency session by Holger Bast of the Max-Planck-Institut für Informatik, and the results kind of blew me away because they seem immediately practical.

The paper was also adapted into When You’re Lost for Words: Faceted Search with Autocompletion, presented by his student and co-author, Ingmar Weber.

(I’ve learned that, a few days later, Holger also gave a Google talk on efficient autocompletion. Watch the video – Holger’s really a blast.)

There are two sides to this paper: first, proposing a dialogue-style search UI; and second, proposing a hybrid inverted index data structure to perform autocompletions efficiently (referred to as HYB, whereas inverted indexes are referred to as INV). The latter is what really piqued my interest.

Let’s jump directly to the results. On a 426GB raw document collection, they claim to have obtained the following:

  • mean query time: 0.581s INV -vs- 0.106s HYB
  • 90%-ile query time: 0.545s INV -vs- 0.217s HYB
  • 99%-ile query time: 16.83s INV -vs- 0.865s HYB
  • max query time: 28.84s INV -vs- 1.821s HYB

That’s anywhere from a 2.5x to nearly 20x improvement, depending on which number you look at. Now, the tests they were performing were specific to the task of finding autocompletion terms and displaying intermediate results immediately as the user is typing. But get this…

Once they solved this problem, they realized it applies equally well to faceted search: simply treat your facets as prefix searches, and store your values as, e.g., cat:family. Then, for a given query of "holger bast", you convert that on the back-end to the prefix query "holger bast cat:" which instantly returns you all of the categories in which Holger Bast has been classified.
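To make the trick concrete, here’s a tiny sketch in Python of how I imagine wiring this up (the function and field names are mine, not theirs): facet values get indexed as ordinary words with a reserved prefix, and the facet query is just the user’s query with an empty "cat:" term appended.

    # A minimal sketch of the facet-as-prefix trick, assuming a hypothetical
    # autocompletion back-end; the function and field names are illustrative.

    def index_terms(doc):
        """Turn a document's text and facet values into index terms."""
        terms = set(doc["text"].lower().split())
        for facet, values in doc.get("facets", {}).items():
            for value in values:
                # Facet values become ordinary words with a reserved prefix,
                # e.g. "cat:family", so they live in the same index as the text.
                terms.add(facet + ":" + value.lower())
        return terms

    def facet_query(user_query, facet):
        """Rewrite a keyword query into a prefix query that enumerates every
        value of `facet` co-occurring with the query terms."""
        return user_query.split() + [facet + ":"]

    print(sorted(index_terms({"text": "Type Less Find More",
                              "facets": {"cat": ["family"]}})))
    # ['cat:family', 'find', 'less', 'more', 'type']

    print(facet_query("holger bast", "cat"))
    # ['holger', 'bast', 'cat:'] -- hand this to the autocompletion engine and
    # the "completions" of the empty cat: prefix are exactly the categories
    # matching the query.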

The reception during the faceted search workshop was mixed:

  • Yoelle Maarek of Google in Haifa (one of the organizers) argued with Holger over whether this was the same as Google Suggest (it’s not: Google Suggest uses a pre-computed list of popular queries and does not perform query intersections).
  • Marti Hearst of UC Berkeley (the “grandmother” of faceted search–although she is much younger and cuter than the name might imply) at first did not see the applicability to faceted search.
  • Several members complained that the index had to be huge and inefficient.

On the last point, I think there was some confusion. (It’s hard to read all the papers before a session.) It took me a couple of readings before I got it, too.

The confusion seemed to be over the assumption that the words were being stored as prefixes. For example, a prefix list with a minimum prefix length of 3 would store the word “alphabet” as the (unique) terms "alp" - "alph" - "alpha" - "alphab" - "alphabe" - "alphabet". This is (obviously) inefficient in disk usage.
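Just to make the misunderstanding concrete, here’s what that naive scheme would look like; this is my own illustration, not anything from the paper:

    # The naive scheme people assumed: index every prefix of every word
    # (with a minimum prefix length of 3). This is NOT what HYB does.

    def naive_prefixes(word, min_len=3):
        return [word[:i] for i in range(min_len, len(word) + 1)]

    print(naive_prefixes("alphabet"))
    # ['alp', 'alph', 'alpha', 'alphab', 'alphabe', 'alphabet']
    # Every posting for "alphabet" would effectively be duplicated six times,
    # which is why this approach blows up the index.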

What their HYB index is actually doing is storing the word “alphabet” exactly once, inside a multiset of postings (document ID lists, of inverted-index fame) shared with lexicographically neighboring words such as “alpaca”, “alpha”, and “alphameric”, assuming those terms exist in your document collection. They also present a mathematical model for choosing the size of the range of words within each multiset based on the total size of the block–that is, the size of the encodings of the range of words plus the size of the encodings of the document IDs within which those words appear (the postings).
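Here’s a toy version of how I read the HYB idea (a sketch under my own simplifying assumptions, not their actual encoding, which compresses the word and document IDs within each block): whole words are grouped into lexicographic ranges, each range stores one block of interleaved (word, doc ID) pairs, and a prefix query only scans the block(s) whose range can contain the prefix.

    from bisect import bisect_right

    # Toy HYB-style index: whole words (never prefixes) are grouped into
    # lexicographic ranges, and each range holds one block of (word, doc_id)
    # postings. A sketch of my reading of the idea, not the paper's encoding.

    class ToyHybIndex:
        def __init__(self, docs, words_per_block=4):
            # docs: {doc_id: "text ..."}
            postings = sorted(
                (word, doc_id)
                for doc_id, text in docs.items()
                for word in set(text.lower().split())
            )
            words = sorted({w for w, _ in postings})
            ranges = [words[i:i + words_per_block]
                      for i in range(0, len(words), words_per_block)]
            self.range_starts = [r[0] for r in ranges]
            self.blocks = [[(w, d) for w, d in postings if w in set(r)]
                           for r in ranges]

        def complete(self, prefix, candidate_docs=None):
            """Return {word: [doc_ids]} for every word starting with `prefix`,
            optionally restricted to `candidate_docs` (the result set of the
            query typed so far)."""
            hits = {}
            # Only scan blocks whose word range can contain the prefix.
            start = max(0, bisect_right(self.range_starts, prefix) - 1)
            for i in range(start, len(self.blocks)):
                if self.range_starts[i] > prefix + "\uffff":
                    break
                for word, doc_id in self.blocks[i]:
                    if word.startswith(prefix) and \
                       (candidate_docs is None or doc_id in candidate_docs):
                        hits.setdefault(word, []).append(doc_id)
            return hits

    docs = {1: "alpha alpaca", 2: "alphabet alphameric", 3: "cat:family alphabet"}
    index = ToyHybIndex(docs)
    print(index.complete("alph"))          # {'alpha': [1], 'alphabet': [2, 3], 'alphameric': [2]}
    print(index.complete("cat:", {2, 3}))  # {'cat:family': [3]}

The block-size model in the paper is essentially the words_per_block knob above: bigger blocks mean fewer, cheaper prefix scans, but more non-matching postings to skip over on ordinary queries.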

They are trading (slightly) worse performance on ordinary, non-prefix queries for (much) better performance when computing autocompletion results.

It’s clear, then, that there is minimal overhead in terms of disk usage: each word is still stored exactly once within the hybrid inverted index. The overhead comes from interleaving the word encodings with the posting encodings within each multiset block.

Unanswered questions: How well does this scale with real-world use (query throughput versus index size)? How much does it impact index build times and complexity (they claim no impact)? Does it affect relevancy?

September 7, 2006

Google Bigtable at OSDI ’06

Posted in Information Retrieval, Scale, Software at 11:01 pm by mj

A team of eight at Google is presenting a surprisingly detailed paper titled Bigtable: A Distributed Storage System for Structured Data at OSDI ’06 (another conference I wish I were hip enough to attend).

From the abstract:

Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers.

*drool*

A couple of highlights/thoughts (read the paper for the technical stuff):

  • Bigtable itself contains about 100,000 lines of code, excluding tests;
  • they leveraged GFS and Chubby (a “highly-available and persistent distributed lock service”); remember, kids: internal corporate infrastructure and services really matter;
  • in some ways, the Bigtable API is more-or-less logically equivalent to maintaining multiple distributed hashes, one for each column or property (e.g., title, date, etc.); however, unification not only feels more right to the developer, but allows for a more optimized implementation;
  • row keys are sorted lexicographically, allowing developers to control locality of access (e.g., “com.example/example1.html” and “com.example/example2.html” are very likely to end up in the same tablet, which means processing all pages from the same domain is more efficient; see the key sketch after this list);
  • scale is slightly less than linear due to varying temporal load imbalances (good to know even Google has this problem);
  • “One group of 14 busy clusters with 8069 total tablet servers saw an aggregate volume of more than 1.2 million requests per second, with incoming RPC traffic of about 741MB/s and outgoing RPC traffic of about 16GB/s.” I started drooling until I realized that’s only 148 requests per second per server, and 94KB/s in and 2MB/s out per server. Then I just laughed at those numbers!
  • various teams at Google are now using Bigtable, but it took some convincing (apparent resistance to a non-RDBMS for non-search activities); now many large services–among them, Google Earth, Google Analytics, and Personalized Search–are employing Bigtable to great effect; remember, kids: internal corporate infrastructure and services really matter (did I already say that?)
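As an aside, the row-key trick is easy to play with. Here’s a small illustration (mine, not Google’s API) of why reversing the hostname in the key keeps a domain’s pages adjacent in a lexicographically sorted table:

    from urllib.parse import urlsplit

    # A small illustration (not Bigtable's API) of the row-key trick: reversing
    # the hostname makes all pages from one domain lexicographically adjacent,
    # so they land in the same tablet (or neighboring tablets) and can be
    # fetched together with a single range scan.

    def row_key(url):
        parts = urlsplit(url)
        reversed_host = ".".join(reversed(parts.hostname.split(".")))
        return reversed_host + parts.path

    urls = [
        "http://maps.example.com/index.html",
        "http://www.example.com/example1.html",
        "http://www.example.com/example2.html",
        "http://www.other.org/page.html",
    ]

    for key in sorted(row_key(u) for u in urls):
        print(key)
    # com.example.maps/index.html
    # com.example.www/example1.html
    # com.example.www/example2.html
    # org.other.www/page.html
    #
    # A range scan over keys starting with "com.example." touches every page
    # of example.com and nothing else.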

Finally, an excerpt from their “lessons learned,” which we’d all do well to remember:

    […] [L]arge, distributed systems are vulnerable to many types of failures, not just the standard network partitions and fail-stop failures assumed in many distributed protocols. […] memory and network consumption, large clock skew, hung machines, extended and asymmetric network partitions, bugs in other systems that we are using, overflow of GFS quotas, and planned and unplanned hardware maintenance.
    […]
    [I]t is important to delay adding new features until it is clear how the new features will be used.
    […]
    the importance of system-level monitoring (i.e., monitoring Bigtable itself, as well as the client processes using Bigtable).
    […]
    The most important lesson we learned is the value of simple designs.