February 4, 2007
Ever get curious about the trends in the APIs provided by web search engines and social sites with a public search? Well, I did, couldn’t find a convenient reference, spent a morning doing some research, and am sharing my data here.
I’ve only included mainstream communities with public search APIs that do not require user-level authentication. That is, it’s possible to get “whole web” or “whole site” results that match keywords/tags, and not just get back a user’s own posts/photos/etc. (which excludes del.icio.us, simpy, bloglines, tailrank, facebook, among others).
Highlights for the ADD crowd: Nearly everybody requires an API key. Most rate limit. Almost nobody supports OpenSearch. REST APIs are overwhelmingly preferred. Yahoo! (+ Flickr) wins the “easiest to work with” award (no surprise).
Read on for comparisons of eight players, presented in alphabetical order. Then, add comments with corrections or APIs that I missed.
January 19, 2007
Greg Linden (founder, architect, designer, programmer, visionary, and banker behind Findory) has put Findory on autopilot until its resources are depleted.
Findory has been my primary start page for quite a while. Sure, it had its shortcomings. Too many of its sources were aggregators: Slashdot, Metafilter, Planet fill-in-the-blank, and so on. Clicking on a news story through one of those aggregators meant weeks or months of getting stories from them in your top results.
While its traffic peaked in 2006, Greg recently lamented on his blog that Q3-Q4 saw a serious decline in uniques. I suspect that’s what sparked his introspection.
I stand with those asking Greg to open source his code, or at least produce a well-footnoted, well-referenced book from it.
People say they want personalization, but not in a void. What’s more important than personalization is social context. For news, that means: who’s talking about this story? what are the reactions and additions to this story? do other people like this story? is it a story that crosses social cliques?
And, yes, it also means that what the A-listers write about carries more weight than what, say, I write about (a Z-lister?), regardless of overlapping interests.
As Greg has pointed out previously, personalized search means a smaller shared context among searchers, which is good for fighting spam (in the short-term). But is it bad for searchers? How do you share your search results with somebody else by simply copying the URL?
That problem is magnified with personalized news. Although you may get stories that are really interesting to you, you may just as well get no interesting stories since your interests aren’t generating much worth writing about at the moment. And you’re less likely to engage in a shared cultural experience.
I take note that the good bloggers didn’t use Findory to find interesting things to write about. I don’t think Findory’s bugs and little quirks are the reason for that. I think it’s the lack of social awareness, which suppresses the social cues we use to find interesting stuff.
November 5, 2006
I was busy with other things, so I’m just now getting around to checking out Google Custom Search Engine (GCSE). I find I’m a bit disappointed after reading Ethan Zuckerman’s explanation of how GCSE is lacking:
A little poking solves the mystery pretty quickly. Google Coop Search works by searching against the main Google search catalog, retrieving 1000 results and filtering them against the sites you’ve included in your catalog. This makes sense, computationally – these searches are fast, almost as fast as normal Google searches. Rather than conducting 3000 “site:” searches and collating and reranking the results, Google is sacrificing recall, getting 1000 results and discarding those not in your set of chosen sites, which requires one call to the index and a really big regular expression match.
With the result being:
In other words, the little engine I’ve built is useful only if the sites I’ve chosen are relatively high ranking and authoritative sites on the topics I’m searching on.
When I first read about GCSE, I was picturing tens of millions of bit vectors (and entries in BigTable), corresponding to each “custom engine,” and updated with every refresh of their index. Perhaps some smart stuff to make sure entries that haven’t been rebuilt yet use the old index until they are (BigTable seems good for managing that – see my previous entry on the BigTable paper).
I couldn’t imagine a way to scale it practically, but I figured, “Hey, it’s Google…”
Instead it turns out that it’s pretty much a mash-up. Anybody off the street could retrieve the top N results from Google’s API, filter out sites based on include/exclude lists, and dynamically rerank the rest based on preferences.
I’m not knocking it. That’s the definition of dynamic reranking and usually is how personalization is implemented. I’m just disappointed that they’re not doing something way beyond the norm, technically speaking.
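To make that concrete, here is a minimal sketch of the retrieve-filter-rerank loop described above. All names, URLs, and the boost scheme are my own illustration, not Google’s code:

```python
# Sketch of the mash-up approach (all names hypothetical): take the top-N
# results from a general index, keep only those whose host is in the custom
# engine's include list, then rerank by per-host preference boosts.

from urllib.parse import urlparse

def custom_engine_results(results, included_sites, boosts=None):
    """results: list of (url, score) from the main index, best first.
    included_sites: hosts the custom engine covers.
    boosts: optional per-host score multipliers (the 'preferences')."""
    boosts = boosts or {}
    kept = []
    for url, score in results:
        host = urlparse(url).netloc
        if host in included_sites:
            kept.append((url, score * boosts.get(host, 1.0)))
    # rerank the survivors by adjusted score
    return sorted(kept, key=lambda pair: -pair[1])

top_results = [
    ("http://example.org/a", 0.9),
    ("http://spam.example/b", 0.8),
    ("http://blog.example/c", 0.7),
]
print(custom_engine_results(top_results, {"example.org", "blog.example"},
                            boosts={"blog.example": 2.0}))
```

The recall sacrifice Ethan describes falls out naturally: anything outside the top N from the main index is simply never seen by the filter.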
Probably more interesting is how they’ll take the data from CSEs and feed some of the keywords and usage data back into Google Co-op.
October 22, 2006
A number of indexing technologies and standards – robots.txt, nocache, noindex – have been adopted by all major search engines to protect the authorship rights of websites across the internet. Yet, to date, the search engines have not created a standard of privacy for their users.
The proposal treats #privacy as a sort of pre-processing directive in search queries:
The standard is simple: if a user includes #privacy in a search query, the search engine should not associate that IP (or other tracking mechanism, such as cookies) with the query, nor should that query be made available via public or private keyword tools.
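A front end could honor such a directive with a trivial pre-processing step. A minimal sketch (the function name and behavior are my own assumptions, not part of the proposal):

```python
# Hypothetical sketch: strip a #privacy directive from the raw query and
# flag the request so downstream logging/analytics skip it.

def parse_query(raw):
    tokens = raw.split()
    private = "#privacy" in tokens
    terms = [t for t in tokens if t != "#privacy"]
    return " ".join(terms), private

query, private = parse_query("stuffed bears in microwaves #privacy")
# query == "stuffed bears in microwaves", private == True
```

The hard part, of course, is not the parsing but making every downstream system actually respect the flag.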
I think what Russ and his co-workers are going for–aside from a bit of nice publicity for their company–is much simpler than what many “privacy rights” advocates are seeking, and more feasible too.
For example, Michael Zimmer says #privacy doesn’t go far enough, and I think he speaks for a lot of people:
Forcing users to append their searches with a tag in order to protect their privacy accepts the premise that search engines should be allowed to collect personal information by default. And that is what must change.
The argument for 100% email encryption is valid here, as well: namely, if you only protect yourself when you have “something to hide,” then it becomes a lot easier to determine who’s doing things they’re not supposed to be, and to show intent in a legal proceeding.
I’m not clear how far the original proposal wants to go. The straightforward reading would imply that such searches do not contribute to popularity lists, relatedness of queries, relevancy feedback (including clickthrough tracking), etc. That’s a lot of stuff around search to prevent innovation upon, should such a standard become the default setting. I’m not sure what harm it does to know that 10,000 people searched for “stuffed bears in microwaves” if none of those queries are attributable to a specific individual or ISP.
It also probably rules out keyword-based advertising, and especially keyword-based advertising targeted to your interests, or from which your profile might be gleaned (for example, clicking on most ad links will give the advertiser information about you and the context of your click–it has to, or else advertisers will not be able to track success rates).
It gets worse. Even if the search engine respects my privacy, any links I click on will, by default, send my search query to the host’s site (through the HTTP “referer” header). Should search engines somehow mangle the URLs, or push every click through a redirect that has no correlation with the original search? (A conspiracy theorist would say that is the goal, as it will make SEO marketing firms much more valuable. ;-))
There’s something in me that likes #privacy as a manual, special-circumstance directive. A naive implementation, though, will lull people into a false sense of privacy, as it cuts across several areas of the business and underlying infrastructure.
Beyond that, search engines can go a long way toward alleviating fear, uncertainty and doubt simply by being totally clear about how personal information is used. For example, to establish the relatedness of one query to another, you need to associate each search with a unique user and then correlate multiple searches across similar users. However, that data does not need to be queryable on a per-user basis, nor does it need to survive a long time (maybe 30 days). Be clear about that, and most people won’t care most of the time.
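That separation can be sketched in a few lines: the per-user histories exist only transiently inside the aggregation step, and only anonymous pair counts come out. A toy illustration, not any engine’s actual pipeline:

```python
# Sketch (assumptions mine): derive query relatedness by counting
# co-occurrence within each user's recent history, then keep only the
# aggregate pair counts -- the per-user data never leaves this function.

from collections import Counter
from itertools import combinations

def relatedness(user_histories):
    """user_histories: {user_id: [query, ...]}, e.g. from the last 30 days.
    Returns aggregate co-occurrence counts with no user attribution."""
    pairs = Counter()
    for queries in user_histories.values():
        for a, b in combinations(sorted(set(queries)), 2):
            pairs[(a, b)] += 1
    return pairs  # user ids discarded; only pair counts survive

counts = relatedness({
    "u1": ["findory", "personalized news"],
    "u2": ["findory", "personalized news", "rss"],
})
# counts[("findory", "personalized news")] == 2
```

Expire the raw histories after the retention window and all that remains is the anonymous, aggregate signal.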
October 12, 2006
Early on, Eric talks about the design of their search engine in 2002. They had been trying to make a legacy in-house search work, and brought in all the major search vendors (Verity, Google, Excite, …) but found their search systems were no better.
Why is that? IR systems usually are built for query-time performance, not for real-time indexing. There are some very good reasons for that.
eBay was among the first to require not only near-real-time indexing, but also to provide large-scale faceted search. Listening to Eric, solving the near-real-time indexing problem using message queues and what-not was apparently easier than solving the faceted browsing problem: showing exactly how many results there are in any given category, where the categories change with every query. (See also my previous post on faceted search and indexing efficiency for a recent cool attempt to solve this.)
Like many things in this changing world, what was once large-scale (20M items) and high-performance in 2002 has quickly become expected behavior, not only at eBay but at most social networking sites as well. For example, tag browsing is expected to be near-real-time. Blog search is expected to be near-real-time: Technorati indexes more blog entries every week than eBay had in total in 2002.
Later, Eric demos a couple of ideas in the oven, including what appears to be dynamic ranking of results (reordering of results based on click streams).
One thing that sticks out is he talks about eBay having a full site release every two weeks, and he describes this as “massively high frequency.” In the Web world, I think that is an exaggeration. Weekly–or even daily–releases are more and more the norm. Java shops have a more difficult time keeping up than Perl, Python, PHP or Ruby shops (though the Java shops tend to be larger).
What is probably unique is that he says they have a highly regimented release schedule, which, presumably, means no slips and no code freezes. That’s hard to do with a company the size of eBay (in terms of number of developers).
The interview goes on about operational issues and a nice-looking (but probably not very useful) social interaction browser.
Interesting fact from near the end: eBay has 768 back-end servers just serving search queries. Documents (items) are split into one of 16 buckets, each of which is served by a cluster of 16 servers, and there are 3 fully redundant copies of the whole system (16 × 16 × 3 = 768), each capable of taking all the traffic in a crunch.
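A toy sketch of query routing under that layout; the numbers come from the interview, but the routing details beyond them are my guess from the description:

```python
# Toy sketch of the layout described above (details assumed): items hash
# into one of 16 buckets; a query fans out to every bucket, picks one of
# the 16 servers in that bucket's cluster, and merges the partial results.

import random

BUCKETS, SERVERS_PER_CLUSTER, REPLICAS = 16, 16, 3

def bucket_for(item_id):
    # items are hash-partitioned into one of the 16 buckets
    return hash(item_id) % BUCKETS

def route_query(replica=0):
    """One (replica, bucket, server) target per bucket; a real router
    would also load-balance across the 3 replicas."""
    return [(replica, b, random.randrange(SERVERS_PER_CLUSTER))
            for b in range(BUCKETS)]

# 16 buckets x 16 servers per cluster x 3 replicas = 768 machines
print(BUCKETS * SERVERS_PER_CLUSTER * REPLICAS)  # -> 768
```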
Worth a watch.
September 11, 2006
My favorite paper was Type Less, Find More: Fast Autocompletion Search with a Succinct Index. This paper was presented during the Efficiency session by Holger Bast of the Max-Planck-Institut für Informatik, and the results kind of blew me away, because they seem immediately practical.
The paper was also adapted into When You’re Lost for Words: Faceted Search with Autocompletion, and presented by his student and cohort, Ingmar Weber.
(I’ve learned that, a few days later, Holger also gave a Google talk on efficient autocompletion. Watch the video – Holger’s really a blast.)
There are two sides to this paper: first, proposing a dialogue-style search UI; and second, proposing a hybrid inverted index data structure to perform autocompletions efficiently (referred to as HYB, whereas inverted indexes are referred to as INV). The latter is what really piqued my interest.
Let’s jump directly to the results. On a 426GB raw document collection, they claim to have obtained the following:
- mean query time: 0.581s INV -vs- 0.106s HYB
- 90%-ile query time: 0.545s INV -vs- 0.217s HYB
- 99%-ile query time: 16.83s INV -vs- 0.865s HYB
- max query time: 28.84s INV -vs- 1.821s HYB
That’s a 300–2000% improvement. Now, the tests they were performing were specific to the task of finding autocompletion terms and displaying intermediate results immediately as the user types. But get this:
Once they solved this problem, they realized it applies equally well to faceted search: simply treat your facets as prefix searches, and store your values as, e.g., cat:family. Then, for a given query of "holger bast", you convert that on the back-end to the prefix query "holger bast cat:", which instantly returns all of the categories in which Holger Bast has been classified.
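A toy illustration of the trick (my sketch, not the paper’s code): facet values live in the index as cat:-prefixed terms, and a trailing cat: prefix query reads the categories off the already-matched documents:

```python
# Toy sketch of the facet-as-prefix trick: facet values are indexed as
# terms like "cat:family"; intersect the plain query terms' postings,
# then find which "cat:" terms occur in those matching documents.

index = {  # term -> set of document ids
    "holger":         {1, 2},
    "bast":           {1, 2},
    "cat:ir":         {1},
    "cat:algorithms": {2},
}

def prefix_facets(query_terms, facet_prefix="cat:"):
    # documents matching all the plain query terms
    docs = set.intersection(*(index[t] for t in query_terms))
    # every facet term whose postings intersect the matching docs
    return {t: index[t] & docs
            for t in index
            if t.startswith(facet_prefix) and index[t] & docs}

print(prefix_facets(["holger", "bast"]))
# -> {'cat:ir': {1}, 'cat:algorithms': {2}}
```

The point of HYB is that this prefix lookup is fast even when the prefix matches many terms, which is exactly the faceted case.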
The reception during the faceted search workshop was mixed:
- Yoelle Maarek of Google in Haifa (one of the organizers) argued with Holger over whether this was the same as Google Suggest (it’s not: Google Suggest uses a pre-computed list of popular queries and does not perform query intersections).
- Marti Hearst of UC Berkeley (the “grandmother” of faceted search, although she is much younger and cuter than the name might imply) at first did not see the applicability to faceted search.
- Several members complained that the index had to be huge and inefficient.
On the last point, I think there was some confusion. (It’s hard to read all the papers before a session.) It took me a couple of readings before I got it, too.
The confusion seemed to be over the assumption that the words were being stored as prefixes. For example, a prefix list with minimum size 3 would store the word “alphabet” as the (unique) terms "alp", "alph", "alpha", "alphab", "alphabe", "alphabet". This is (obviously) inefficient in disk usage.
What their HYB index is actually doing is storing the word “alphabet” as a multiset of postings (document id lists, of inverted-index fame), along with the words “alpaca”, “alpha”, “alphameric”, and so on, assuming those terms exist in your document collection. They present a mathematical model for choosing the size of the range of words within a multiset based on the total size of the block: that is, the size of the encodings of the range of words plus the size of the encodings of the document ids within which those words appear (the postings).
They are trading a (slight) performance hit when computing non-prefix result sets for (much) better performance when computing autocompletion results.
It’s clear, then, there is minimal overhead in terms of disk usage: each word is still stored exactly once within the hybrid inverted index. The overhead comes from weaving the word encodings with the posting encodings within each multiset block.
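Here is a rough sketch of that layout; it is my simplification of HYB, ignoring the compression and the block-sizing model. The sorted vocabulary is cut into ranges, each block stores the (word, document id) pairs for its range, and a prefix query touches only the blocks whose range can contain the prefix:

```python
# Rough sketch of the HYB idea (my simplification): the vocabulary is cut
# into contiguous word ranges; each block stores (word, doc_id) pairs for
# every word in its range, so each word is still stored exactly once.

postings = {              # word -> doc ids (the usual inverted-index view)
    "alpaca":     [3],
    "alpha":      [1, 3],
    "alphabet":   [1, 2],
    "alphameric": [2],
}

def build_hyb(postings, words_per_block=2):
    words = sorted(postings)
    blocks = []
    for i in range(0, len(words), words_per_block):
        block_words = words[i:i + words_per_block]
        pairs = [(w, d) for w in block_words for d in postings[w]]
        blocks.append((block_words[0], block_words[-1], pairs))
    return blocks

def prefix_query(blocks, prefix):
    """Return (word, doc_id) pairs for every word starting with prefix:
    the completions and their hits come back in one pass."""
    out = []
    for lo, hi, pairs in blocks:
        # skip blocks whose word range cannot contain the prefix
        if hi >= prefix and lo[:len(prefix)] <= prefix:
            out += [(w, d) for w, d in pairs if w.startswith(prefix)]
    return out

hyb = build_hyb(postings)
print(prefix_query(hyb, "alpha"))
```

Each word appears in exactly one block, which is why the disk overhead stays minimal; the cost is scanning a whole block even when only part of its word range matches.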
Unanswered questions: how well does this scale with real-world use (query throughput versus index size)? how much does this impact index build times/complexities (they claim no impact)? does this affect relevancy?
September 9, 2006
Ajaxian points to an article on a so-called “Live Filter” pattern (if the original article’s site is down, see the Google cache).
The definition from the article:
We propose that for many problem domains, the basic concept of filtering is now much more appropriate than searching. With a search, you start off with nothing and potentially end up with nothing. Counter to this approach is filtering, where we present everything available, and then encourage the user to progressively remove what they do not need. Using Ajax to provide immediate feedback, we can display the number of results a user can expect to see if they were to submit their query at this moment.
The original article goes into many details about how it’s implemented underneath, and there is a Live Filter demo available.
This seems very much like a solution in need of a problem. For example, nowhere in the article is there a mention of faceted searching/navigation. It’s unclear why faceted searching did not meet the author’s needs, or why the concept of filtering should be thought unique in 2006. In all, it looks a bit like an excuse to promote AJAX and, especially, Ruby on Rails.
Judging by the demo, this is a poor man’s faceted search system. When you click on one of the (static and pre-determined) “filters”, the only thing the AJAX interface gives you is a message reading “Your search will return 7 results” and a button you can click to actually see the results.
A real faceted search system, on the other hand, will present you with context-aware options and immediately tell you how many results each option will give you. This is usually implemented without AJAX, but an AJAX-y interface can be a great design help: hiding suboptions, preventing full page refreshes, or even displaying the best results for an option as you hover over it.
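For a sense of what that means computationally, here is a minimal sketch of the per-option counts a faceted system shows (toy data, my own illustration):

```python
# Minimal sketch of facet counts: for the current result set, count how
# many results each facet value would leave if the user selected it.

from collections import Counter

items = [
    {"title": "red shirt",  "color": "red",  "size": "M"},
    {"title": "red hat",    "color": "red",  "size": "S"},
    {"title": "blue shirt", "color": "blue", "size": "M"},
]

def facet_counts(results, facets=("color", "size")):
    return {f: Counter(item[f] for item in results) for f in facets}

# counts for a query that matched all three items
print(facet_counts(items))
```

The hard part at scale, as the eBay and SIGIR posts above discuss, is computing these counts quickly when the result set changes with every query.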
I can see how “live filters” are slightly better than a normal search box for some specific, narrow applications, but, compared to building out a faceted search system, you might as well be doing all this in a TN3270 terminal.
The original article goes into good detail on what makes Ruby on Rails well suited to building and maintaining this application compared with other platforms. As an introduction to building applications in Rails and/or to server-side AJAX design considerations, it’s great, and probably on par with a lot of the toy applications you’ll find in intro books. (I recommend looking at the Live Filter source.)
June 11, 2006
The ACM SIGIR is back in the states this year, hosted at the University of Washington in Seattle the second week of August. Registration was just posted last week, after months of a “coming soon” page.
Conference registration is $550 for ACM members.
The panels I’m planning on attending, from the program:
- User behavior and modelling (day 1, session 1, room 1)
- Exploiting Graph Structure (day 1, session 2, room 1)
- Formal Models (day 1, session 3, room 2)
- Machine Learning (day 2, session 1, room 2)
- Efficiency (day 2, session 2, room 3)
- Clustering (day 3, session 1, room 2)
- Recommendations: Use and Abuse (day 3, session 2, room 2)
- Web IR: Current Topics (day 3, session 3, room 2)
- Faceted Search Workshop (day 4) ($110 more, plus an extra night stay; Andrei Broder, who I wrote about earlier, is the co-leader)
I’m not much of a conference person, and even less of a socialite, but I’m looking forward to learning a lot from the people there–both the panel members, and the audience.
Who knows, I may even come away having recruited a couple of smart young researchers to work on my team (though, with Yahoo, Microsoft, Google, and several start-ups well represented on the panels, it will be tough).
I’d love to hear if any of my six readers (including the four spammers) is planning on attending.
As reported seemingly everywhere, eBay is entering the contextual advertising business, where ads on affiliates’ sites will link directly to active eBay auctions whose items match the content on the current page. This is most likely a good thing for eBay sellers. The value to small-time content publishers remains to be seen, since I believe the TOS on the GYM team offerings forbids forays into multiple advertisers.
This marks the fourth major player to enter this arena, which means it’s time for somebody else to come along and change the nature of the game. Once everybody has the know-how and the infrastructure, the market becomes ripe for a superior differentiated product.
Contextual advertising is a bit of a misnomer, since the actual context of the user’s session really doesn’t come into play. Rather, it refers to an advertisement appearing in the context of the content on the page.
For example, let’s say I (as the content publisher) know that you came to a page by searching for “aluminum siding” (yeah, I know). Although the page itself probably has at least one of those words, my advertising partner of choice has no real way of distinguishing my interest in aluminum siding from my interest in vinyl siding (which is also contained on the page). And they certainly have no clue that I’ve skipped over 12 other search results because they didn’t contain exactly what I wanted.
But intent through explicit search is only a small piece of the puzzle. What if I knew you came to a page through a recommendation my system offered you, and I (of course) knew the criteria that were used to make that recommendation?
Most advertisers are equipped to take “hints” from the publisher, in the form of additional keywords, but they’re not equipped to (a) accept a lot of additional keywords, or (b) accept keywords that we’d like to negate, or (c) consider the real context of the user’s session, or (d) learn from a user’s behavior, to further refine their model of the user’s context (intent).
Maybe by considering, say, the last 8 pageviews within the last 30 minutes (those with contextual ads, anyway), they’d get closer in some circumstances, but they’d flub it in many situations. This is even more true when only certain pages contain calls to the advertiser, and those pages probably are not the ones providing the meat of the context.
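The pageview-window heuristic is easy to sketch. The 8-view/30-minute parameters come from the text above; everything else is my assumption:

```python
# Sketch of the session-context heuristic: keep keywords from the last 8
# pageviews no older than 30 minutes, and offer them as ad hints.

import time
from collections import deque

MAX_VIEWS, MAX_AGE = 8, 30 * 60  # 8 pageviews, 30 minutes in seconds

class SessionContext:
    def __init__(self):
        self.views = deque(maxlen=MAX_VIEWS)  # (timestamp, keywords)

    def record(self, keywords, now=None):
        self.views.append((now if now is not None else time.time(), keywords))

    def hint_keywords(self, now=None):
        now = now if now is not None else time.time()
        return [kw for ts, kws in self.views if now - ts <= MAX_AGE
                for kw in kws]

ctx = SessionContext()
ctx.record(["aluminum", "siding"], now=0)
ctx.record(["vinyl", "siding"], now=60)
print(ctx.hint_keywords(now=120))  # -> ['aluminum', 'siding', 'vinyl', 'siding']
```

Which illustrates the flub: the window knows you looked at siding pages, but not that you skipped twelve of them because they weren’t what you wanted.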
Further down in the report, the reporter also mentions that eBay is studying the possibility of opening up their user feedback system in some way. That seems like more of a trial balloon being floated to gauge interest and, more importantly, to take suggestions on how to do so in a way that provides value but still keeps the most important part proprietary. Hence the “it could take several years” comment from their director of developer relations.
Still, tying reputation systems into advertising–and, maybe going even further, establishing seller reputation on a publisher-by-publisher or user-by-user basis–seems like the next logical step.