October 22, 2006
A number of indexing technologies and standards – robots.txt, nocache, noindex – have been adopted by all major search engines to protect the authorship rights of websites across the internet. Yet, to date, the search engines have not created a standard of privacy for their users.
Russ and his co-workers propose #privacy as a sort of pre-processing directive in search queries:
The standard is simple: if a user includes #privacy in a search query, the search engine should not associate that IP (or other tracking mechanism such as cookies) with the query, nor should that query be made available via public or private keyword tools.
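The pre-processing pass itself is trivial. Here's a toy sketch (the function name and behavior are my own illustration, not anyone's actual implementation): strip the directive from the query and surface it as a flag the rest of the pipeline can honor.

```python
def parse_query(raw: str) -> tuple[str, bool]:
    """Toy pre-processing pass for the proposed #privacy directive.
    Returns the cleaned query and whether privacy was requested.
    Purely illustrative -- not any search engine's real code."""
    terms = raw.split()
    private = "#privacy" in terms
    cleaned = " ".join(t for t in terms if t != "#privacy")
    return cleaned, private
```

The hard part, of course, isn't parsing the flag; it's everything downstream that has to respect it.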
I think what Russ and his co-workers are going for–aside from a bit of nice publicity for their company–is much simpler than what many “privacy rights” advocates are seeking, and more feasible too.
For example, Michael Zimmer says #privacy doesn’t go far enough, and I think he speaks for a lot of people:
Forcing users to append their searches with a tag in order to protect their privacy accepts the premise that search engines should be allowed to collect personal information by default. And that is what must change.
The argument for 100% email encryption is valid here, as well: namely, if you only protect yourself when you have “something to hide,” then it becomes a lot easier to determine who’s doing things they’re not supposed to be, and to show intent in a legal proceeding.
I’m not clear how far the original proposal wants to go. The straightforward reading would imply that such searches do not contribute to popularity lists, relatedness of queries, relevancy feedback (including clickthru tracking), etc. That’s a lot of search infrastructure to foreclose innovation on, should such a standard become the default setting. I’m not sure what harm it does to know that 10,000 people searched for “stuffed bears in microwaves” if none of those queries are attributable to a specific individual or ISP.
It also probably rules out keyword-based advertising, and especially keyword-based advertising targeted to your interests, or from which your profile might be gleaned (for example, clicking on most ad links will give the advertiser information about you and the context of your click–it has to, or else advertisers will not be able to track success rates).
It gets worse. Even if the search engine respects my privacy, any links I click on will, by default, send my search query to the host’s site (through the HTTP “referer” header). Should search engines somehow mangle the URLs, or push every click through a redirect that has no correlation with the original search? (A conspiracy theorist will say that is the goal, as it will make SEO marketing firms much more valuable. ;-))
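A minimal sketch of that redirect idea (all names here are hypothetical — `search.example` and both functions are mine, not any engine's API): encode the target into an opaque token, so the URL the browser leaks via Referer reveals nothing about the original search.

```python
import base64

def make_redirect_url(target_url: str) -> str:
    """Wrap an outbound result link in an opaque redirect. The token
    encodes only the destination, never the search query, so the
    Referer seen by the destination carries no query terms."""
    token = base64.urlsafe_b64encode(target_url.encode()).decode()
    return f"https://search.example/r/{token}"

def resolve_redirect(redirect_url: str) -> str:
    """Server side: decode the token and (in a real system) issue a
    302 to the target."""
    token = redirect_url.rsplit("/", 1)[1]
    return base64.urlsafe_b64decode(token.encode()).decode()
```

The cost is an extra hop on every click, and, as noted, the destination site loses the query context advertisers and SEO firms currently mine.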
There’s something in me that likes #privacy as a manual, special-circumstance directive. A naive implementation, though, will lull people into a false sense of ~~security~~ privacy, as it cuts across several areas of the business and underlying infrastructure.
Beyond that, search engines can go a long way toward alleviating fear, uncertainty and doubt by simply being totally clear about how users’ personal information is used. For example, to establish the relatedness of one query to another, you need to associate each search with a unique user and then correlate multiple searches to similar users. However, that data does not need to be queryable on a per-user basis, nor does it need to survive a long time (maybe 30 days). Be clear about that, and most people won’t care most of the time.
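One way to get both properties — correlation within the window, no per-user lookup after it — is to key logs by a salted pseudonym rather than the raw user ID. A sketch under my own assumptions (the function and salt scheme are illustrative, not any engine's pipeline): rotate the salt every retention window and destroy the old one.

```python
import hashlib

def pseudonym(user_id: str, period_salt: str) -> str:
    """One-way pseudonym for correlating queries within a retention
    window. If the salt rotates (say, every 30 days) and old salts
    are destroyed, the same user correlates within a window, but
    pseudonyms can't be linked back to users or across windows."""
    digest = hashlib.sha256(f"{period_salt}:{user_id}".encode()).hexdigest()
    return digest[:16]
```

Within a window, two queries from the same user share a pseudonym, which is all relatedness mining needs; once the salt is gone, so is the mapping.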
October 20, 2006
Here’s a bit of semi-fun, semi-seriousness.
Two interesting–provocative?–articles about women’s Halloween costumes.
The first from the New York Times. I admit just reading the intro paragraph is a bit of a turn-on:
IN her thigh-highs and ruby miniskirt, Little Red Riding Hood does not appear to be en route to her grandmother’s house. And Goldilocks, in a snug bodice and platform heels, gives the impression she has been sleeping in everyone’s bed. There is a witch wearing little more than a Laker Girl uniform, a fairy who appears to shop at Victoria’s Secret and a cowgirl with a skirt the size of a tea towel.
Of course, I am picturing my wife in these outfits. Really.
What I find puzzling is this little observation from one of them “experts” we’re always reading about:
“Decades after the second wave of the women’s movement, you would expect more of a gender-neutral range of costumes,” said Adie Nelson.
Because, yeah, girls want to look like men all the time.
Gray, drab dress pants are fine, and I’m hardly an expert, but I see a lot of women being taken seriously without sacrificing their femininity.
Further down, another puzzle:
Deborah Tolman, the director of the Center for Research on Gender and Sexuality at San Francisco State University and a professor of human sexuality studies there, found that some 30 teenage girls she studied understood being sexy as “being sexy for someone else, not for themselves,” she said.
When the girls were asked what makes them feel sexy, they had difficulty answering, Dr. Tolman said, adding that they heard the question as “What makes you look sexy?”
Yes, it is puzzling how anybody can think the most basic social impulse has anything to do with attracting the attention of other people.
Remember–I’m not saying this because I’m a hot, sexy thing who likes to strut my stuff. I’d be better off if everybody wore paper bags over their heads. I’m just a realist.
With names like “Transylvania Temptress,” “Handy Candy,” “Major Flirt,” and “Red Velvet Devil Bride,” there is no doubt that costumes marketed to children and teens have become more suggestive.
If you know me, you probably know I have a hard enough time buying into the whole little-girl-dreams-of-being-a-princess schtick that children are taught. (I mean, politically, you do know princes and princesses exist on the backs of other people for no reason other than their heritage… right? And that dreaming of marrying rich is probably the worst dream you can foist upon your child if you want her to be successful in life? Yes?)
I sympathize very much with the intent to show children–especially girls–that there’s more to life than “looking pretty” for other people, and this article has some good parenting advice. (Where, by “good,” I mean, “seems reasonable,” though I have not tried it myself, so I have no idea how practical and, hence, good the advice actually is.)
Judging from the average gallery of 13-, 12-, and even 11-year-old girls I see strutting around with their mothers and grandmothers in Victoria’s Secret (and even the outlet malls!!), a lot of parents are ill-equipped to, well, be parents. I mean, these girls have less subtlety than the average crack-addicted prostitute walking the street. But let’s forget that can of worms. Teenagers go through phases.
The biggest challenge I see, the biggest dilemma facing our ever-sexualized culture, is that so much of mainstream sexuality does–to paraphrase the NYT article–“indulge male Lolita fantasies.” Is that indulgence warranted?
Now, if you’ll excuse me, I have a costume to order for my wife… and I need to shorten my own skirt this year. Need to “keep up with the Joneses” and all that.
October 18, 2006
Anybody know how to clear the Linux filesystem read cache? A good answer would be much appreciated.
I have some performance tests that are being adversely affected by the filesystem’s read cache. If it makes any difference, we’re using ext3. I don’t want to disable the cache, simply clear it between tests.
A chance conversation on my walk to BART this evening got me in mind to simply copy my files between each round of tests. This kind of works. My full test data is ~140GB, but I can get by with ~20GB for tests running during the day. It takes a while to copy, but that’s better than nothing.
Luckily, my data set compresses well (5.5:1 @ 1.5GB/m), so it’s possible to blow away even the full set and restore. I haven’t tried that yet – hopefully the cache is doing a simple “has this inode/block changed recently?” test, and copying the same data back over a cached block still invalidates the cache.
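For what it's worth, kernels since 2.6.16 expose exactly this knob: writing to `/proc/sys/vm/drop_caches` discards clean page-cache entries (and optionally dentries and inodes) without disabling caching. A sketch, with the control file parameterized only so it can be exercised without root:

```python
import os

def drop_linux_caches(ctl_path: str = "/proc/sys/vm/drop_caches") -> None:
    """Ask the kernel to drop clean caches between test runs.
    Writing "1" drops the page cache, "2" drops dentries and inodes,
    "3" drops both. Requires root against the real /proc file.
    sync() first so dirty pages become clean and droppable."""
    os.sync()
    with open(ctl_path, "w") as f:
        f.write("3\n")
```

The shell equivalent is `sync && echo 3 > /proc/sys/vm/drop_caches`, which beats re-copying 20GB between rounds.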
That’s how I’d do it. Actually, I’d implement it using
IDirectConnectionToMJsBrain to obtain my intent at every stage. But I won’t fault anybody for overlooking this.
October 12, 2006
Early on, Eric talks about the design of their search engine in 2002. They had been trying to make a legacy in-house search work, and brought in all the major search vendors (Verity, Google, Excite, …) but found their search systems were no better.
Why is that? IR systems usually are built for query-time performance, not for real-time indexing. There are some very good reasons for that.
eBay was among the first to require not only near-real-time indexing, but also to provide large-scale faceted search. Listening to Eric, solving the near-real-time indexing problem using message queues and what-not was apparently easier than solving the faceted browsing problems: showing exactly how many results there are in any given category, where the categories change with every query. (See also my previous post on faceted search and indexing efficiency for a recent cool attempt to solve this.)
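The counting itself is easy to state in toy form (my own sketch, nothing to do with eBay's actual implementation): given the set of items matching a query, tally how many fall into each category. The hard part is doing this per query, over millions of items, fast.

```python
from collections import Counter

def facet_counts(result_ids, item_categories):
    """Per-query facet counts: given the item IDs matching a query
    and a map of item -> category, count results per category.
    Trivial at this scale; the challenge is doing it for every
    query over tens of millions of constantly changing items."""
    return Counter(item_categories[i] for i in result_ids)
```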
Like many things in this changing world, what was once large-scale (20M items) and high-performance in 2002 has quickly become expected behavior, not only at eBay but at most social networking sites as well. For example, tag browsing is expected to be near-real-time. Blog search is expected to be near-real-time: Technorati indexes more blog entries every week than eBay had in total in 2002.
Later, Eric demos a couple of ideas in the oven, including what appears to be dynamic ranking of results (reordering of results based on click streams).
One thing that sticks out is he talks about eBay having a full site release every two weeks, and he describes this as “massively high frequency.” In the Web world, I think that is an exaggeration. Weekly–or even daily–releases are more and more the norm. Java shops have a more difficult time keeping up than Perl, Python, PHP or Ruby shops (though the Java shops tend to be larger).
What is probably unique is that he says they have a highly regimented release schedule, which, presumably, means no slips and no code freezes. That’s hard to do with a company the size of eBay (in terms of number of developers).
The interview goes on about operational issues and a nice-looking (but probably not very useful) social interaction browser.
Interesting fact from near the end: eBay has 768 back-end servers just serving search queries. Documents (items) are split into one of 16 buckets, each of which is served by a cluster of 16 servers, and there are 3 full redundant versions of the whole system, each capable of taking all the traffic in a crunch.
Worth a watch.
October 9, 2006
I met Yi Zhang at SIGIR, and I hope her team wins the grand prize. (For one thing, I was impressed by her forging ahead with an IR program at UCSC.)
Here’s the Netflix Prize in a nutshell: Netflix released 100M customer movie recommendations (out of > 1B), and announced a competition where they’re offering a $1M “grand prize” (plus a $50K yearly “progress prize”). The goal? Using the 100M movie recommendations as your training set, produce an algorithm that beats Netflix’s own in-house “Cinematch” recommender system by at least 10%. You have at least five years to pull it off.
I love it when private entities release large-ish data sets into the research community. Usually, there are limited partnerships involving NDAs signed in triplicate. (That’s still beneficial – I wish Webshots would partner with some researchers.)
Anyhow, Zhang’s team was the first to (slightly) beat Cinematch using the RMSE metric, but, as of this writing, another team has already shown greater improvement. In fact, “The Thought Gang” is the first to qualify for the yearly “progress prize” (and, hence, $50K next October).
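For reference, the contest's yardstick is simple to state (a generic sketch of the metric, not Netflix's code): root-mean-squared error over predicted vs. actual ratings, with the grand prize requiring a 10% improvement over Cinematch's RMSE.

```python
import math

def rmse(predicted, actual):
    """Root-mean-squared error over paired ratings -- the Netflix
    Prize yardstick. Lower is better; the grand prize goes to the
    first algorithm that beats Cinematch's RMSE by at least 10%."""
    n = len(predicted)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n)
```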
I’m not surprised that researchers are already beating Netflix’s system on this metric a week into the competition.
First, and most important, Netflix places no constraints on performance. While their in-house recommender system needs to scale in multiple directions, the winning algorithm need not. That’s worth at least a percentage point.
Second, while 100M is larger than the typical “large” training sets used in academia for these types of problems, it’s not so large that it requires a complete rethinking of how you approach scaling your solution.
Third, movie recommendations are the kind of thing academic researchers in personalization often focus on, in the absence of larger-scale projects. I’d expect a few good academic recommender systems to be so well-tuned to this particular problem that their first iteration scores very well.
Regardless of who wins the prize, contests like this are good for researchers in this area, as they provide a nice, high-profile introduction to non-researchers. Especially once that $1M prize is handed out.
As a further digression, I wonder how many papers will be published in the coming years based on the Netflix data?
October 8, 2006
I’m fooling around with alternate titles for my blog. Too many lost souls come here seeking “sex”–and not of “I’m a cute twentysomething chick who wants to meet you in the alley near BART” kind! (That would be creepy, too.)
I tried simply “mj’s bs” which is probably the best name. But it looks too barren and out of place in the heading. I think it suits me, though. (The connotations and the fact that I, too, look barren and out of place much of the time.)
For now, I’ve settled on replacing “Sex” with “S*”. Let’s see how that works.
I’ve also restored my “Things I Read” section. I forget why I pulled it down; I think I intended to slowly add links as I write about them. Yeah, that worked. So I’ve taken a somewhat random mix of things I find myself reading often.
I should also trim down that Categories panel on the side…
October 1, 2006
If you haven’t read it already, Googler Steve Yegge rants about Agile development and paints a rosy picture of life at Google. Go ahead, read it. (My second favorite rant of the year, but more informative than my favorite.)
Commenters have already critiqued his post. To summarize:
- Google’s products are rarely profitable; Google is acting much as a VC firm, except their bets are almost all in-house
- Google hires from the rightmost end of the Bell curve, which, while great and representing the kind of people most of us would want to work with, isn’t universally applicable (by definition)
- Google has yet to weather a downturn; it is young and its methods are unproven; they’ll likely modify their practices during lean times, as all companies who survived the late ’90s did
With that out of the way, Steve makes some great points and has helped me pinpoint some of my own frustrations. I think, ultimately, what Google has found works for them addresses the same frustrations.