January 19, 2007
Greg Linden–founder, architect, designer, programmer, visionary, banker behind Findory—put Findory on autopilot until its resources are depleted.
Findory has been my primary start page for quite a while. Sure, it had its shortcomings. Too many of its sources were aggregators: Slashdot, Metafilter, Planet fill-in-the-blank, and so on. Clicking on a news story through one of those aggregators meant weeks or months of getting stories from them in your top results.
While its traffic peaked in 2006, Greg recently lamented on his blog that Q3-Q4 saw a serious decline in uniques. I suspect that’s what sparked his introspection.
I stand with those asking Greg to open source his code, or at least produce a well-footnoted, well-referenced book from it.
People say they want personalization, but not in a void. What’s more important than personalization is social context. For news, that means: who’s talking about this story? what are the reactions and additions to this story? do other people like this story? is it a story that crosses social cliques?
And, yes, it also means that what the A-listers write about carries more weight than what, say, I write about (a Z-lister?), regardless of overlapping interests.
As Greg has pointed out previously, personalized search means a smaller shared context among searchers, which is good for fighting spam (in the short-term). But is it bad for searchers? How do you share your search results with somebody else by simply copying the URL?
That problem is magnified with personalized news. Although you may get stories that are really interesting to you, you may just as well get no interesting stories since your interests aren’t generating much worth writing about at the moment. And you’re less likely to engage in a shared cultural experience.
I take note that the good bloggers didn’t use Findory to find interesting things to write about. I don’t think Findory’s bugs and little quirks are the reason for that. I think it’s the lack of social awareness, which suppresses the social cues we use to find interesting stuff.
October 22, 2006
A number of indexing technologies and standards – robots.txt, nocache, noindex – have been adopted by all major search engines to protect the authorship rights of websites across the internet. Yet, to date, the search engines have not created a standard of privacy for their users.
#privacy as a sort of pre-processing directive in search queries:
The standard is simple: if a user includes #privacy in a search query, the search engine should not associate that IP (or other tracking mechanism such as cookies) with the query, nor should that query be made available via public or private keyword tools
I think what Russ and his co-workers are going for–aside from a bit of nice publicity for their company–is much simpler than what many “privacy rights” advocates are seeking, and more feasible too.
For example, Michael Zimmer says #privacy doesn’t go far enough, and I think he speaks for a lot of people:
Forcing users to append their searches with a tag in order to protect their privacy accepts the premise that search engines should be allowed to collect personal information by default. And that is what must change.
The argument for 100% email encryption is valid here, as well: namely, if you only protect yourself when you have “something to hide,” then it becomes a lot easier to determine who’s doing things they’re not supposed to be, and to show intent in a legal proceeding.
I’m not clear how far the original proposal wants to go. The straightforward reading would imply that such searches do not contribute to popularity lists, relatedness of queries, relevancy feedback (including clickthrough tracking), etc. That’s a lot of search features to wall off from innovation, should such a standard become the default setting. I’m not sure what harm it does to know that 10,000 people searched for stuffed bears in microwaves if none of those queries are attributable to a specific individual or ISP.
It also probably rules out keyword-based advertising, and especially keyword-based advertising targeted to your interests, or from which your profile might be gleaned (for example, clicking on most ad links will give the advertiser information about you and the context of your click–it has to, or else advertisers will not be able to track success rates).
It gets worse. Even if the search engine respects my privacy, any links I click on will, by default, send my search query to the host’s site (through the HTTP “referer” header). Should search engines somehow mangle the URLs, or push every click through a redirect that has no correlation with the original search? (A conspiracy theorist will say that is the goal, as it will make SEO Marketing firms much more valuable. ;-))
There’s something in me that likes #privacy as a manual, special-circumstance directive. A naive implementation, though, will lull people into a false sense of privacy, as it cuts across several areas of the business and underlying infrastructure.
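To make the "naive implementation" concrete, here is a rough sketch of what honoring the directive at the front end alone might look like. The function names and the in-memory `query_log` list are my own placeholders, not anything from the proposal, and the sketch deliberately covers only query logging, which is exactly why it would be incomplete in practice.

```python
PRIVACY_TAG = "#privacy"


def split_privacy_directive(raw_query):
    """Return (clean_query, is_private) for a raw search string."""
    tokens = raw_query.split()
    is_private = any(t.lower() == PRIVACY_TAG for t in tokens)
    clean = " ".join(t for t in tokens if t.lower() != PRIVACY_TAG)
    return clean, is_private


def handle_query(raw_query, client_ip, query_log):
    """Hypothetical front-end hook: log the query unless the user opted out."""
    clean, is_private = split_privacy_directive(raw_query)
    if not is_private:
        # Only non-private queries get associated with the client.
        query_log.append({"ip": client_ip, "query": clean})
    return clean
```

Note what this does not touch: ad click tracking, relevancy feedback, and whatever the destination site learns from the click. Those live in other systems entirely.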
Beyond that, search engines can go a long way toward alleviating fear, uncertainty and doubt by simply being totally clear about how users’ personal information is being used. For example, to establish the relatedness of one query to another, you need to associate each search with a unique user and then correlate multiple searches to similar users. However, that data does not need to be queryable on a per-user basis, nor does it need to survive a long time (maybe 30 days). Be clear about that, and most people won’t care most of the time.
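As a rough illustration of that point, the sketch below counts how often two queries come from the same user, ignores anything older than 30 days, and returns only the aggregate pair counts. The event tuple layout and the function name are assumptions of mine; a real system would work on far larger data and would also need to delete the expired raw events, not just skip them.

```python
from collections import Counter
from datetime import datetime, timedelta
from itertools import combinations

RETENTION = timedelta(days=30)  # the "maybe 30 days" window from the text


def related_query_counts(events, now):
    """Count how often two queries were issued by the same user.

    `events` is an iterable of (user_id, query, timestamp) tuples.
    Only aggregate pair counts are returned; user ids never leave
    this function.
    """
    queries_by_user = {}
    for user_id, query, ts in events:
        if now - ts > RETENTION:
            continue  # expired events are skipped
        queries_by_user.setdefault(user_id, set()).add(query)

    pair_counts = Counter()
    for queries in queries_by_user.values():
        # Sorting gives each unordered pair one canonical key.
        for a, b in combinations(sorted(queries), 2):
            pair_counts[(a, b)] += 1
    return pair_counts
```

The per-user grouping exists only for the duration of the computation; what gets stored and queried afterwards is the pair table.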
October 9, 2006
I met Yi Zhang at SIGIR, and I hope her team wins the grand prize. (For one thing, I was impressed by her forging ahead with an IR program at UCSC.)
Here’s the Netflix Prize in a nutshell: Netflix released 100M customer movie ratings (out of > 1B), and announced a competition where they’re offering a $1M “grand prize” (plus a $50K yearly “progress prize”). The goal? Using the 100M ratings as your training set, produce an algorithm that beats Netflix’s own in-house “Cinematch” recommender system by at least 10%. You have at least five years to pull it off.
I love it when private entities release large-ish data sets into the research community. Usually, there are limited partnerships involving NDAs signed in triplicate. (That’s still beneficial – I wish Webshots would partner with some researchers.)
Anyhow, Zhang’s team was the first to (slightly) beat Cinematch on the RMSE metric, but, as of this writing, another team has already shown greater improvement. In fact, “The Thought Gang” is the first to qualify for the yearly “progress prize” (and, hence, $50K next October).
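For anyone unfamiliar with the metric, RMSE is the square root of the mean squared difference between predicted and actual ratings, so lower is better. The sketch below computes it, along with a relative improvement over a baseline; treating "beats by 10%" as a 10% relative reduction in RMSE is my reading, and the numbers in the usage note are toy values, not anybody's real scores.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equal-length rating lists."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


def relative_improvement(baseline_rmse, candidate_rmse):
    """Fraction by which the candidate lowers the baseline's RMSE."""
    return (baseline_rmse - candidate_rmse) / baseline_rmse
```

With a made-up baseline RMSE of 1.0, a candidate would need an RMSE of 0.9 or lower for `relative_improvement` to reach 0.10.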
I’m not surprised that researchers are already beating Netflix’s system on this metric a week into the competition.
First, and most important, Netflix places no constraints on performance. While their in-house recommender system needs to scale in multiple directions, the winning algorithm need not. That’s worth at least a percentage point.
Second, while 100M is larger than the typical “large” training sets used in academia for these types of problems, it’s not so large that it requires a complete rethinking of how you approach scaling your solution.
Third, movie recommendations are the kind of thing academic researchers in personalization often focus on, in the absence of larger-scale projects. I’d expect a few good academic recommender systems to be so well-tuned to this particular problem that their first iteration scores very well.
Regardless of who wins the prize, contests like this are good for researchers in this area, as they provide a nice, high-profile introduction to non-researchers. Especially once that $1M prize is handed out.
As a further digression, I wonder how many papers will be published in the coming years based on the Netflix data?
June 20, 2006
For the last week or two, I’ve been listening to Pandora almost non-stop at work. (And, when working from home, I’m putting my 7-speaker computer sound system to good use.)
I was skeptical at first, but it is surprisingly easy to use, and generates good recommendations.
At the heart of Pandora’s service are “stations,” which are much like individually personalized radio stations. You create a station by entering the name of a song or band you like. Pandora will then find music similar to the song or band you entered. You refine the music played on that station by either entering more songs or bands that you like, or giving thumbs up/thumbs down to songs as Pandora plays them.
Their revenue model seems to be a combination of (a) advertising, (b) offering a premium service, and (c) providing referrals to Amazon and iTunes. I can’t imagine many people look at their ads (I keep mine playing on my Windows machine while I work on Linux), but I can see the direct links to Amazon and iTunes generating quite a bit of traffic.
The killer is that Pandora’s free service not only allows unlimited listening, but up to 100 stations. Of course, like real radio, you can’t rewind or select a song to play. They also limit the number of songs you can skip in one session. (Consequently, after the limit is reached, when you give a thumbs down to a song, you still have to finish listening to it.)
I started my first (and, so far, only) station by entering Marillion, and soon supplemented that with Van der Graaf Generator and Nick Cave, mostly to see if I could fool their algorithm. Between those and the recommended songs that I have thumbed up, my station has probably 4 (maybe 5) strong-ish clusters.
One annoyance with Pandora is that their algorithm tends to stick to one cluster at a time, lasting maybe a dozen songs before heading into a new direction–and those durations are increasing. (Their algorithm may be designed to converge on a single cluster per station.) In fact, the best way to get them to start playing music from a different genre is to enter the name of a different kind of song that you like. Otherwise, you could be waiting quite a while.
The second annoyance–and maybe related to the first–is that they tend to wear out an album. For example, for several days, I could have sworn the only Marillion album they had was Real to Reel, since all of the Marillion songs they played were from that album. Similarly for VDGG and H to He, and Nick Cave and From Her to Eternity. But, eventually, they do expand the songs they play from an artist.
The third annoyance is that they weight songs that you’ve thumbed up heavily in selecting which song to play next. This often results in hearing the same song twice an hour, and produces a ton of “repeats.” Not so bad if you’ve thumbed up 500 songs, but as I was just starting out, I really wanted to hear a wider variety of recommendations. And, though you can ask them to not play a song for 30 days because you’re tired of it, that just seems like using a sledgehammer for a thumbtack: no nuance.
All in all, it’s a good service, and one that is still being actively developed. Yesterday, I logged on and found that they’d improved their interface a bit, and added song bookmarks. Previously–and still–I cannot find a way to get a list of all the songs I’ve given a thumbs up to, so I’ve started using bookmarks in addition. You can view my profile at my Pandora profile page. I do not bookmark songs I already own, and I’ve only been bookmarking since Monday.
The feature I really want is an “I Own This” button, which would weight the song highly when computing similar songs, but prevent the song from getting played often (after all, if I own it, why would I need to listen to it on Pandora?).
In addition, there should be a way of submitting corrected title information, as I’ve discovered a number of songs with the wrong titles (usually the title of another song on the same album).
Anyhow, when I make my next Amazon purchase, I’ll be sure to do so in a way so Pandora gets the referral for the artists they’ve turned me onto.
June 11, 2006
As reported seemingly everywhere, Ebay is entering the contextual advertising business, where ads on affiliates’ sites will link directly to active auctions on Ebay whose items match the content on the current page. This is most likely a good thing for Ebay sellers. The value to small-time content publishers remains to be seen, since I believe the TOS on the GYM team offerings forbids forays into multiple advertisers.
This marks the fourth major player to enter this arena, which means it’s time for somebody else to come along and change the nature of the game. Once everybody has the know-how and the infrastructure, the market becomes ripe for a superior differentiated product.
Contextual advertising is a bit of a misnomer, since the actual context of the user’s session really doesn’t come into play. Rather, it refers to an advertisement appearing in the context of the content on the page.
For example, let’s say I (as the content publisher) know that you came to a page by searching for “aluminum siding” (yeah, I know). Although the page itself probably has at least one of those words, my advertising partner of choice has no real way of distinguishing your interest in aluminum siding from an interest in vinyl siding (which is also mentioned on the page). And they certainly have no clue that you’ve skipped over 12 other search results because they didn’t contain exactly what you wanted.
But intent through explicit search is only a small piece of the puzzle. What if I knew you came to a page through a recommendation my system offered you, and I (of course) know the criteria that were used to make that recommendation?
Most advertisers are equipped to take “hints” from the publisher, in the form of additional keywords, but they’re not equipped to (a) accept a lot of additional keywords, or (b) accept keywords that we’d like to negate, or (c) consider the real context of the user’s session, or (d) learn from a user’s behavior, to further refine their model of the user’s context (intent).
Maybe by considering, say, the last 8 pageviews within the last 30 minutes (those with contextual ads, anyway), they’d get closer in some circumstances, but they’d flub it in many situations. This is even more true when only certain pages contain calls to the advertiser, and those pages probably are not the ones providing the meat of the context.
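To show what that windowing heuristic amounts to, here is a small sketch that merges the keywords from the most recent 8 pageviews inside the last 30 minutes. The `(timestamp, keywords)` layout and the function name are hypothetical; I have no idea how any advertiser actually models a session, and the sketch makes the weakness plain, since it only sees pages that reported keywords at all.

```python
from collections import Counter
from datetime import datetime, timedelta


def session_context(pageviews, now, max_views=8, window=timedelta(minutes=30)):
    """Merge keywords from the most recent pageviews in the time window.

    `pageviews` is a list of (timestamp, keywords) tuples, in any order.
    Returns a Counter mapping keyword -> number of recent pages using it.
    """
    recent = [pv for pv in pageviews if timedelta(0) <= now - pv[0] <= window]
    recent.sort(key=lambda pv: pv[0], reverse=True)  # newest first

    context = Counter()
    for _, keywords in recent[:max_views]:
        context.update(set(keywords))  # count each keyword once per page
    return context
```

A keyword that shows up on several recent pages gets a higher count, which is about as much "intent" as this approach can infer.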
Further down in the report, the reporter also mentions that Ebay is studying the possibility of opening up their user feedback system in some way. That seems like more of a trial balloon being floated to gauge interest and, more importantly, to take suggestions on how to do so in a way that provides value, but still keeps the most important part proprietary. Hence the “it could take several years” comment from their director of developer relations.
Still, tying reputation systems into advertising–and, maybe going even further, establishing seller reputation on a publisher-by-publisher or user-by-user basis–seems like the next logical step.
June 8, 2006
More and more I find myself thinking, “Those guys at Yahoo get it.”
The latest example is Andrei Broder from Yahoo! Research, who, at last month’s Future of Web Search Workshop, gave the keynote talk titled, From query based Information Retrieval to Context Driven Information Supply [link is a PDF].
While this is a CTO-level presentation (i.e., high level, few details), it was well illustrated and to the point (and quite funny in spots).
According to Broder, Classic Information Retrieval makes all the wrong assumptions for the web context: classic IR ignores context, ignores individuals, and ignores dynamism (which is to say, the corpus is static). This is one reason I’ve never put much faith in academic search criteria (such as the TREC corpora).
He goes on to outline the first three generations of web search: from keyword matching and word frequency analysis (1st generation); to link analysis, clickthrough feedback, and social linktext (2nd generation); and, currently, in the midst of “answering the need behind the query” (3rd generation), which is mostly about supplementing core search with tools (spell checking, shortcuts, dynamic result filtering, …), or with high-ranking, high-certainty results from verticals (maps, local searches, phonebooks, …).
And what of the newborn 4th generation? It’s about going “from information retrieval to information supply” (emphasis mine), which is all about implicit searches: personalization, recommendations, …and, of course, advertising.
If you know me, it’s this 4th (and possibly 5th, see my notes below) generation that I’m always harping about. I wrote a bit about the future of search last year. I make a bit of an ass of myself about it sometimes, but it turns me on.
And advertising, of course, is the big payoff from a corporate POV.
His final slide (slide 49) lists the challenges with this 4th generation of search: it involves a lot more data collection, a lot more data modelling, a lot more math, and a lot more understanding of the significance of the relationships between users and content.
What I find even more interesting, though, is what Broder left out:
- Search is well on its way to being integrated into normal navigation (faceted search is just one step)
- Social networks can, and should, affect relevance (social search is just one step)
- Search is being used as a platform, and soon, partners will be able to affect relevance for their users–providing yet more information to the core search system
- Search will soon be but a mediator between users and content–whether integrated into normal navigation or not–which provides the missing context for advertising, which can not be merely gleaned from content matching, or third-party user profiling
Did he stop short because much of the future of search makes Yahoo’s current search business positioning irrelevant?
Or because he has his team secretly working on the 5th generation of search and doesn’t want to give away his edge?
Or maybe he just wants to underpromise and overdeliver…
(Link via Greg Linden, who is one of the few bloggers I’ve made time to read in the last two weeks.)
October 2, 2005
Maybe this is a silly question, but I want to follow this line of thought.
Let’s start with this answer: search is a textbox, into which users enter text queries in order to retrieve relevant results. Sounds kind of boring, doesn’t it? Not that there aren’t a lot of interesting challenges here: content relevancy factors, social network analysis, targeted relevancy based on submitter behavior, and so on. In fact, a lot of resources have been poured into improving upon the “search is a textbox” mode of operation, and a lot of really cool innovations have come from it. The result is a firm mental model among users that search == textbox.
What about tagging? Tagging, like search, is an explicit activity: both in tag creation, and tag browsing. It is a bit more compelling than plain “search is a textbox,” because it offers additional navigation elements that are closely linked with the content. Some problems that are intermediate or difficult in the “search is a textbox” model of full-text indexing–such as clustering, related content, and so on–become technically simpler with tagging, and integrate well with the existing tag browsing UI. Users also gain more benefit from tag systems if they have a relationship with the organization. So if you’re a tag system developer, are you developing a search system? Does it matter whether the back-end is primarily MySQL, or Lucene, or Berkeley DB?
The next level “above” tagging is faceted navigation, especially where tags (or metatags) become facets. Faceted navigation merges tagging and textbox searching into one UI. One way of thinking about faceted navigation is as an interactive boolean query builder. Faceted systems make a ton of queries into the back-end to fulfill their promise of providing context-sensitive navigation, and benefit from efficient search algorithms. A faceted navigation developer is obviously a search developer, even if all the context is pre-generated (good luck with that strategy, though).
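The "interactive boolean query builder" view can be sketched in a few lines: each facet selection is an OR over the chosen values, the facets are ANDed together, and the counts for the remaining choices are recomputed from the hits. The document layout (plain dicts with a `title` key) and the function names are assumptions for illustration; a real system would run these as queries against a search back-end rather than scanning a list.

```python
from collections import Counter


def matches(doc, selections):
    """AND across facets, OR within a single facet's selected values."""
    return all(doc.get(facet) in values for facet, values in selections.items())


def facet_search(docs, selections):
    """Return matching docs plus per-facet value counts among the hits.

    `selections` maps a facet name to the set of values the user picked.
    The counts drive the context-sensitive links shown next to the results.
    """
    hits = [d for d in docs if matches(d, selections)]
    counts = {}
    for doc in hits:
        for facet, value in doc.items():
            if facet == "title":
                continue  # the title is content, not a facet
            counts.setdefault(facet, Counter())[value] += 1
    return hits, counts
```

Every click on a facet value changes `selections` and triggers another round of matching and counting, which is where the "ton of queries" comes from.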
The context-sensitive links in faceted navigation are, of course, primitive (or not-so-primitive, depending on POV) recommendations, which brings us to recommender systems. Recommender systems usually rely heavily on statistical analysis, but then so do more traditional search features such as targeted/personalized relevancy and clustering. Recommender systems can be entirely driven by profiles consisting of a small set of keywords based on past (or recent) behavior. Recommender systems can be used to recommend just about anything: news stories, user-generated content, fellow members in a community, editorially “programmed” content, advertisers.
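As a toy version of a keyword-profile recommender, the sketch below builds a small profile from the keywords of recently viewed items and ranks candidates by how much of the profile they hit. Everything here (the names, the top-5 cutoff, the overlap score) is an assumption chosen for brevity; a real recommender would lean on much heavier statistics, and in effect this is just a keyword search where the query is the profile.

```python
from collections import Counter


def build_profile(viewed_items, top_n=5):
    """Keep the top_n most frequent keywords from recently viewed items."""
    counts = Counter()
    for keywords in viewed_items:
        counts.update(keywords)
    return dict(counts.most_common(top_n))


def recommend(profile, candidates, limit=3):
    """Rank candidates by the summed profile weight of their keywords.

    `candidates` maps an item name to its list of keywords.
    Items that share no keyword with the profile are left out.
    """
    scored = []
    for name, keywords in candidates.items():
        score = sum(profile.get(k, 0) for k in set(keywords))
        if score > 0:
            scored.append((score, name))
    scored.sort(key=lambda s: (-s[0], s[1]))  # best score first, then by name
    return [name for _, name in scored[:limit]]
```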
But now we’re talking about more than a textbox with a submit button. We’re talking dynamically generated content, which might even look like ordinary content, with ordinary navigation choices. That’s the heart of personalization, isn’t it? Isn’t that the dream we’re all (users, producers, pundits) sharing? That can’t all be driven by something as mundane as “search,” can it?