November 5, 2006
Google CSE: Google Mashup
I was busy with other things, so I’m only now getting around to checking out Google Custom Search Engine (GCSE). I’m a bit disappointed after reading Ethan Zuckerman’s explanation of where GCSE falls short:
A little poking solves the mystery pretty quickly. Google Coop Search works by searching against the main Google search catalog, retrieving 1000 results and filtering them against the sites you’ve included in your catalog. This makes sense, computationally – these searches are fast, almost as fast as normal Google searches. Rather than conducting 3000 “site:” searches and collating and reranking the results, Google is sacrificing recall, getting 1000 results and discarding those not in your set of chosen sites, which requires one call to the index and a really big regular expression match.
The upshot, in his words:
In other words, the little engine I’ve built is useful only if the sites I’ve chosen are relatively high ranking and authoritative sites on the topics I’m searching on.
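To make the recall problem concrete, here is a toy sketch of the behavior Zuckerman describes. Everything in it is made up for illustration: the URLs, the `post_filter` helper, and the shape of the ranked list. It only shows how a page from a low-ranking site vanishes once the list is cut at 1000 before filtering.

```python
from urllib.parse import urlparse

def post_filter(ranked_urls, allowed_sites, n=1000):
    """Take the top n results, then discard any whose host is not allowed."""
    top = ranked_urls[:n]
    return [u for u in top if urlparse(u).hostname in allowed_sites]

# Hypothetical ranking: 1,500 pages from a big site outrank one page
# from a small site that happens to be in my custom engine.
ranked = [f"http://bigsite.example/page{i}" for i in range(1500)]
ranked.append("http://smallblog.example/great-post")

allowed = {"smallblog.example"}

# Cut at 1000 first, then filter: the small site never makes it in.
truncated = post_filter(ranked, allowed, n=1000)

# Filter over the full list (roughly what a "site:" search would see).
exhaustive = post_filter(ranked, allowed, n=len(ranked))

print(truncated)
print(exhaustive)
```

The first list comes back empty even though a matching page exists, which is the recall being sacrificed.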
When I first read about GCSE, I was picturing tens of millions of bit vectors (and entries in BigTable), one per “custom engine,” each updated with every refresh of Google’s index. Perhaps there would also be some clever logic to make sure engines whose entries haven’t been rebuilt yet keep using the old index until they are (BigTable seems good for managing that – see my previous entry on the BigTable paper).
I couldn’t imagine a way to scale it practically, but I figured, “Hey, it’s Google…”
Instead it turns out that it’s pretty much a mash-up. Anybody off the street could retrieve the top N results from Google’s API, filter out sites based on include/exclude lists, and dynamically rerank the rest based on preferences.
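Here is roughly what I mean, as a minimal sketch. The `results` list stands in for whatever a search API would hand back; no real Google API is called, and the `custom_search` function, the scores, and the per-site boost scheme are my own assumptions for illustration, not how GCSE actually works.

```python
from urllib.parse import urlparse

def custom_search(results, include=None, exclude=(), boosts=None):
    """Filter (url, score) pairs by host, then rerank by boosted score.

    include: set of allowed hosts, or None to allow every host
    exclude: hosts to drop outright
    boosts:  hypothetical per-host multipliers expressing preferences
    """
    boosts = boosts or {}
    kept = []
    for url, score in results:
        host = urlparse(url).hostname
        if host in exclude:
            continue
        if include is not None and host not in include:
            continue
        kept.append((url, score * boosts.get(host, 1.0)))
    # sorted() is stable, so ties keep their original relative order.
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

# Made-up top-N results with made-up relevance scores.
results = [
    ("http://a.example/1", 0.9),
    ("http://spam.example/x", 0.8),
    ("http://b.example/2", 0.5),
]

reranked = custom_search(
    results,
    exclude={"spam.example"},
    boosts={"b.example": 2.0},
)
for url, score in reranked:
    print(url, score)
```

With these numbers the excluded site disappears and the boosted site moves ahead of the one that originally outranked it.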
I’m not knocking it. That’s the definition of dynamic reranking, and it’s usually how personalization is implemented. I’m just disappointed that they’re not doing something way beyond the norm, technically speaking.
Probably more interesting is how they’ll take the data from CSEs and feed some of the keywords and usage data back into Google Co-op.