October 22, 2006

Search privacy through #privacy: somewhat feasible, or dreaming too small?

Posted in Personalization, Search at 8:39 pm by mj

Russ Jones (no relation), CTO of SEO Marketing shop Virante, has generated quite a bit of buzz around pound privacy. In his open letter to G/Y/M/A, he says

A number of indexing technologies and standards – robots.txt, nocache, noindex – have been adopted by all major search engines to protect the authorship rights of websites across the internet. Yet, to date, the search engines have not created a standard of privacy for their users.

His solution? #privacy as a sort of pre-processing directive in search queries:

The standard is simple: if a user includes #privacy in a search query, the search engine should not associate that IP (or other tracking mechanism such as cookies) with the query, nor should that query be made available via public or private keyword tools
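To make the "pre-processing directive" idea concrete, here is a minimal sketch of how a front end might peel the tag off a query before anything is logged. The function name and the whitespace-token matching rule are assumptions for illustration; the open letter does not specify either.

```python
def split_privacy_directive(raw_query, directive="#privacy"):
    """Return (query_without_directive, private_flag).

    Hypothetical helper: treats the directive as a standalone,
    case-insensitive, whitespace-delimited token.
    """
    tokens = raw_query.split()
    kept = [t for t in tokens if t.lower() != directive]
    # If any token was dropped, the user asked for privacy.
    private = len(kept) != len(tokens)
    return " ".join(kept), private
```

Everything downstream (logging, keyword tools, click tracking) would then have to honor the flag, which is where the real work lies.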

I think what Russ and his co-workers are going for–aside from a bit of nice publicity for their company–is much simpler than what many “privacy rights” advocates are seeking, and more feasible too.

For example, Michael Zimmer says #privacy doesn’t go far enough, and I think he speaks for a lot of people:

Forcing users to append their searches with a tag in order to protect their privacy accepts the premise that search engines should be allowed to collect personal information by default. And that is what must change.

The argument for 100% email encryption is valid here, as well: namely, if you only protect yourself when you have “something to hide,” then it becomes a lot easier to determine who’s doing things they’re not supposed to be, and to show intent in a legal proceeding.

It’s an interesting dilemma. There is obvious scaremongering going on, of the kind that first stigmatized the use of cookies 10 years ago. Still, there is definitely risk here, as your searches can be (and maybe already have been) subpoenaed.

I’m not clear how far the original proposal wants to go. The straightforward reading would imply that such searches do not contribute to popularity lists, relatedness of queries, relevancy feedback (including clickthru tracking), etc. That’s a lot of search-related innovation to rule out, should such a standard become the default setting. I’m not sure what harm it does to know that 10,000 people searched for stuffed bears in microwaves if none of those queries are attributable to a specific individual or ISP.

It also probably rules out keyword-based advertising, and especially keyword-based advertising targeted to your interests, or from which your profile might be gleaned (for example, clicking on most ad links will give the advertiser information about you and the context of your click–it has to, or else advertisers will not be able to track success rates).

It gets worse. Even if the search engine respects my privacy, any link I click on will, by default, send my search query to the destination site (through the HTTP “referer” header, which carries the URL of the results page). Should search engines somehow mangle the URLs, or push every click through a redirect that has no correlation with the original search? (A conspiracy theorist will say that is the goal, as it will make SEO Marketing firms much more valuable. ;-))
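To see how little effort the leak takes to exploit, here is a rough sketch of what a destination site can do with the referring URL. The parameter name `q` and the example hostnames are assumptions; each engine uses its own query-string layout.

```python
from urllib.parse import urlsplit, parse_qs

def query_from_referer(referer, param="q"):
    """Pull the search terms out of a referring URL, if present.

    `param` is an assumed query-string key, not any engine's documented one.
    """
    values = parse_qs(urlsplit(referer).query).get(param, [])
    return values[0] if values else None

# A results-page URL gives the query away...
leaked = query_from_referer("http://search.example.com/search?q=stuffed+bears&hl=en")
# ...while an opaque redirect URL carries nothing useful.
hidden = query_from_referer("http://search.example.com/r?id=abc123")
```

An opaque redirect only helps if the redirect URL itself cannot be mapped back to the query by anyone outside the search engine.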

There’s something in me that likes #privacy as a manual, special-circumstance directive. A naive implementation, though, will lull people into a false sense of privacy, as it cuts across several areas of the business and underlying infrastructure.

Beyond that, search engines can go a long way toward alleviating fear, uncertainty and doubt by simply being totally clear how users’ personal information is being used. For example, to establish the relatedness of one query to another, you need to associate each search with a unique user and then correlate multiple searches to similar users. However, that data does not need to be queryable on a per-user basis, nor does it need to survive a long time (maybe 30 days). Be clear about that, and most people won’t care most of the time.
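As a toy illustration of that last point, the sketch below derives query relatedness from a short-lived log and keeps only aggregate pair counts. The log shape, the function name, and the co-occurrence measure are all assumptions, not a description of how any engine actually does it.

```python
from collections import Counter, defaultdict
from itertools import combinations

def related_query_counts(log):
    """Count how many users issued both queries of each pair.

    `log` is an iterable of (user_key, query) tuples from a limited window
    (say, the last 30 days). Only the pair counts are returned; the
    per-user grouping is a temporary and is discarded with the log.
    """
    by_user = defaultdict(set)
    for user_key, query in log:
        by_user[user_key].add(query)

    pairs = Counter()
    for queries in by_user.values():
        # Sort so (a, b) and (b, a) land on the same key.
        for a, b in combinations(sorted(queries), 2):
            pairs[(a, b)] += 1
    return pairs

sample_log = [
    ("u1", "teddy bear"), ("u1", "microwave"),
    ("u2", "microwave"), ("u2", "teddy bear"),
    ("u3", "microwave"),
]
counts = related_query_counts(sample_log)
```

The output says "these two queries go together" without saying who typed them, which is the kind of distinction a clear privacy policy could spell out.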


October 20, 2006

How short will your skirt be on Halloween?

Posted in Fun, Sex at 12:01 am by mj

Here’s a bit of semi-fun, semi-seriousness.

Two interesting–provocative?–articles about women’s Halloween costumes.

The first from the New York Times. I admit just reading the intro paragraph is a bit of a turn-on:

IN her thigh-highs and ruby miniskirt, Little Red Riding Hood does not appear to be en route to her grandmother’s house. And Goldilocks, in a snug bodice and platform heels, gives the impression she has been sleeping in everyone’s bed. There is a witch wearing little more than a Laker Girl uniform, a fairy who appears to shop at Victoria’s Secret and a cowgirl with a skirt the size of a tea towel.

Of course, I am picturing my wife in these outfits. Really.

What I find puzzling is this little observation from one of them “experts” we’re always reading about:

“Decades after the second wave of the women’s movement, you would expect more of a gender-neutral range of costumes,” said Adie Nelson

Because, yeah, girls want to look like men all the time.

Gray, drab dress pants are fine, and I’m hardly an expert, but I see a lot of women being taken seriously without sacrificing their femininity.

Further down, another puzzle:

Deborah Tolman, the director of the Center for Research on Gender and Sexuality at San Francisco State University and a professor of human sexuality studies there, found that some 30 teenage girls she studied understood being sexy as “being sexy for someone else, not for themselves,” she said.

When the girls were asked what makes them feel sexy, they had difficulty answering, Dr. Tolman said, adding that they heard the question as “What makes you look sexy?”

Yes, it is puzzling how anybody can think the most basic social impulse has anything to do with attracting the attention of other people.

Remember–I’m not saying this because I’m a hot, sexy thing who likes to strut my stuff. I’d be better off if everybody wore paper bags over their heads. I’m just a realist.

Anyhow, the second article from azcentral is a bit more disturbing: Sexy Halloween costumes rile parents:

With names like “Transylvania Temptress,” “Handy Candy,” “Major Flirt,” and “Red Velvet Devil Bride,” there is no doubt that costumes marketed to children and teens have become more suggestive.

If you know me, you probably know I have a hard enough time buying into the whole little-girl-dreams-of-being-a-princess schtick that children are taught. (I mean, politically, you do know princes and princesses exist on the backs of other people for no reason other than their heritage… right? And that dreaming of marrying rich is probably the worst dream you can foist upon your child if you want her to be successful in life? Yes?)

I sympathize very much with the intent to show children–especially girls–that there’s more to life than “looking pretty” for other people, and this article has some good parenting advice. (Where, by “good,” I mean, “seems reasonable,” though I have not tried it myself, so I have no idea how practical and, hence, good the advice actually is.)

Judging from the average gallery of 13, 12, and even 11 year-old girls I see strutting around with their mothers and grandmothers in Victoria’s Secret (and even the outlet malls!!), a lot of parents are ill-equipped to, well, be parents. I mean, these girls have less subtlety than the average crack-addicted prostitute walking the street. But let’s forget that can of worms. Teenagers go through phases.

The biggest challenge I see, the biggest dilemma facing our ever-sexualized culture, is that so much of mainstream sexuality does–to paraphrase the NYT article–“indulge male lolita fantasies.” Is that indulgence warranted?


Now, if you’ll excuse me, I have a costume to order for my wife… and I need to shorten my own skirt this year. Need to “keep up with the joneses” and all that.

October 18, 2006

Clearing Linux Filesystem Read Cache?

Posted in Linux, Questions at 8:24 pm by mj

Anybody know how to clear the Linux filesystem read cache? A good answer would be much appreciated.

I have some performance tests that are being adversely affected by the filesystem’s read cache. If it makes any difference, we’re using ext3. I don’t want to disable the cache, simply clear it between tests.

A chance conversation on my walk to BART this evening got me in mind to simply copy my files between each round of tests. This kind of works. My full test data is ~140GB, but I can get by with ~20GB for tests running during the day. It takes a while to copy, but that’s better than nothing.

Luckily, my data set compresses well (5.5:1 @ 1.5GB/m), so it’s possible to blow away even the full set and restore. I haven’t tried that yet – hopefully the cache is doing a simple “has this inode/block changed recently?” test, and copying the same data back over a cached block still invalidates the cache.

That’s how I’d do it. Actually, I’d implement it using IDirectConnectionToMJsBrain to obtain my intent at every stage. But I won’t fault anybody for overlooking this.
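For what it’s worth, kernels from 2.6.16 onward expose /proc/sys/vm/drop_caches, which looks like the cheap way to do this: write 1 to drop the page cache, 2 for dentries and inodes, 3 for both. Here is a minimal sketch, assuming a new enough kernel, root privileges, and Python 3.3+ for `os.sync()`; the function name is made up, and the `path` parameter exists only so the function can be tried against a scratch file.

```python
import os

def drop_page_cache(path="/proc/sys/vm/drop_caches", mode="3"):
    """Flush dirty pages, then ask the kernel to drop its clean caches.

    Needs root when pointed at the real /proc file. mode "1" drops the
    page cache, "2" dentries and inodes, "3" both.
    """
    # Only clean pages can be dropped, so write dirty ones out first.
    os.sync()
    with open(path, "w") as f:
        f.write(mode + "\n")
```

Calling `drop_page_cache()` between test rounds should avoid the copy step entirely, though I’d still verify with a cold-read timing before trusting it.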

October 12, 2006

Eric Billingsley of eBay Research Interviewed by Scoble

Posted in Development, Excellence, Search, Software at 5:50 am by mj

Scoble recently interviewed Eric Billingsley, head of eBay Research, for his video podcast. (Note: you’re better off downloading the cast in the background.)

Early on, Eric talks about the design of their search engine in 2002. They had been trying to make a legacy in-house search work, and brought in all the major search vendors (Verity, Google, Excite, …) but found their search systems were no better.

Why is that? IR systems usually are built for query-time performance, not for real-time indexing. There are some very good reasons for that.

eBay was among the first to require not only near-real-time indexing, but also to provide large-scale faceted search. Listening to Eric, solving the near-real-time indexing problem using message queues and what-not was apparently easier than solving the faceted browsing problems: showing exactly how many results there are in any given category, where the categories change with every query. (See also my previous post on faceted search and indexing efficiency for a recent cool attempt to solve this.)

Like many things in this changing world, what was once large-scale (20M items) and high-performance in 2002 has quickly become expected behavior, not only at eBay but at most social networking sites as well. For example, tag browsing is expected to be near-real-time. Blog search is expected to be near-real-time: Technorati indexes more blog entries every week than eBay had in total in 2002.

Later, Eric demos a couple of ideas in the oven, including what appears to be dynamic ranking of results (reordering of results based on click streams).

One thing that sticks out is he talks about eBay having a full site release every two weeks, and he describes this as “massively high frequency.” In the Web world, I think that is an exaggeration. Weekly–or even daily–releases are more and more the norm. Java shops have a more difficult time keeping up than Perl, Python, PHP or Ruby shops (though the Java shops tend to be larger).

What is probably unique is that he says they have a highly regimented release schedule, which, presumably, means no slips and no code freezes. That’s hard to do with a company the size of eBay (in terms of number of developers).

The interview goes on about operational issues and a nice-looking (but probably not very useful) social interaction browser.

Interesting fact from near the end: eBay has 768 back-end servers just serving search queries. Documents (items) are split into one of 16 buckets, each of which is served by a cluster of 16 servers, and there are 3 full redundant versions of the whole system, each capable of taking all the traffic in a crunch.
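The arithmetic checks out, and the layout is easy to sketch. Only the three numbers below come from the interview; the hash-based bucket assignment and the function names are assumptions, since Eric doesn’t say how items are mapped to buckets.

```python
import zlib

BUCKETS = 16             # document partitions
SERVERS_PER_BUCKET = 16  # servers in each bucket's cluster
REPLICAS = 3             # full redundant copies of the whole system

def total_servers():
    """Back-end search servers implied by the layout."""
    return BUCKETS * SERVERS_PER_BUCKET * REPLICAS

def bucket_for(item_id):
    """Assumed assignment: stable hash of the item id, modulo bucket count."""
    return zlib.crc32(str(item_id).encode("utf-8")) % BUCKETS
```

A query presumably has to fan out to all 16 buckets within one replica, which is why each replica must be able to carry the full traffic on its own.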

Worth a watch.

October 9, 2006

Netflix Prize Competition Gets Going

Posted in Personalization at 10:01 pm by mj

Slashdot jumped the gun today in reporting that a team led by Yi Zhang at UCSC was atop the Netflix Prize leaderboard.

I met Yi Zhang at SIGIR, and I hope her team wins the grand prize. (For one thing, I was impressed by her forging ahead with an IR program at UCSC.)

Here’s the Netflix Prize in a nutshell: Netflix released 100M customer movie ratings (out of > 1B), and announced a competition where they’re offering a $1M “grand prize” (plus a $50K yearly “progress prize”). The goal? Using the 100M movie ratings as your training set, produce an algorithm that beats Netflix’s own in-house “Cinematch” recommender system by at least 10%. You have at least five years to pull it off.

I love it when private entities release large-ish data sets into the research community. Usually, there are limited partnerships involving NDAs signed in triplicate. (That’s still beneficial – I wish Webshots would partner with some researchers.)

Anyhow, Zhang’s team was the first to (slightly) beat Cinematch on the RMSE metric, but, as of this writing, another team has already shown greater improvement. In fact, “The Thought Gang” is the first to qualify for the yearly “progress prize” (and, hence, $50K next October).
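For anyone unfamiliar with the metric, RMSE is the square root of the mean squared difference between predicted and actual ratings, and the 10% target is a relative reduction in that number. A minimal sketch follows; the function names are mine and the numbers in the usage lines are made up, not Cinematch’s actual scores.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between two equal-length rating lists."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))

def improvement(baseline_rmse, candidate_rmse):
    """Relative reduction in RMSE versus the baseline (0.10 means 10%)."""
    return (baseline_rmse - candidate_rmse) / baseline_rmse

# Illustrative values only: every prediction off by exactly one star.
error = rmse([1, 2, 3], [2, 3, 4])
gain = improvement(1.0, 0.9)
```

Lower is better, so a small absolute drop in RMSE can still be a long way from the 10% bar.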

I’m not surprised that researchers are already beating Netflix’s system on this metric a week into the competition.

First, and most important, Netflix places no constraints on performance. While their in-house recommender system needs to scale in multiple directions, the winning algorithm need not. That’s worth at least a percentage point.

Second, while 100M is larger than the typical “large” training sets used in academia for these types of problems, it’s not so large that it requires a complete rethinking of how you approach scaling your solution.

Third, movie recommendations are the kind of thing academic researchers in personalization often focus on, in the absence of larger-scale projects. I’d expect a few good academic recommender systems to be so well-tuned to this particular problem that their first iteration scores very well.

Regardless of who wins the prize, contests like this are good for researchers in this area, as they provide a nice, high-profile introduction to non-researchers. Especially once that $1M prize is handed out.

As a further digression, I wonder how many papers will be published in the coming years based on the Netflix data?

October 8, 2006

Changing titles

Posted in General at 8:42 pm by mj

I’m fooling around with alternate titles for my blog. Too many lost souls come here seeking “sex”–and not of the “I’m a cute twentysomething chick who wants to meet you in the alley near BART” kind! (That would be creepy, too.)

I tried simply “mj’s bs” which is probably the best name. But it looks too barren and out of place in the heading. I think it suits me, though. (The connotations and the fact that I, too, look barren and out of place much of the time.)

For now, I’ve settled on replacing “Sex” with “S*”. Let’s see how that works.

I’ve also restored my “Things I Read” section. I forget why I pulled it down; I think I intended to slowly add links as I wrote about them. Yeah, that worked. So I’ve taken a somewhat random mix of things I find myself reading often.

I should also trim down that Categories panel on the side…

Other suggestions?

October 1, 2006

I’m Agile, You’re Agile…mmm’kay?

Posted in Development, Excellence at 11:12 pm by mj

If you haven’t read it already, Googler Steve Yegge rants about Agile development and paints a rosy picture of life at Google. Go ahead, read it. (My second favorite rant of the year, but more informative than my favorite.)

Commenters have already critiqued his post. To summarize:

  • Google’s products are rarely profitable; Google is acting much as a VC firm, except their bets are almost all in-house
  • Google hires from the rightmost end of the bell curve, which, while great and representing the kind of people most of us would want to work with, isn’t universally applicable (by definition)
  • Google has yet to weather a downturn; it is young and its methods are unproven; they’ll likely modify their practices during lean times, as all companies who survived the late 90’s did

With that out of the way, Steve makes some great points and has helped me pinpoint some of my own frustrations. I think, ultimately, what Google has found works for them addresses the same frustrations.


Closing out my tabs: some quickies

Posted in Fun, Links, Search, Software at 12:23 am by mj

Some things that have remained in my open Firefox tabs…some, for weeks…which I’ve realized I’m never going to blog by themselves, but I still want to share.


Why Search Sucks
Danny Sullivan on Why Search Sucks & You Won’t Fix It The Way You Think. A nice look at the relatively static history of (web) search UIs and some of the more interesting experiments. All of which give too many options in an unintuitive way.


Mad Money: Or Just Mad?
Is the Cramer Effect costing average investors money?

I gave Cramer’s advice on “markup week” a test. There are already good tracking blogs out there: Mad Money Machine, Mad Money Recap, and, for longer-term performance, Booyah Boy Audit has pretty graphs.

My result was mixed. Up 1.1% following his advice to a tee. So, with $10K to invest, and (very) low trading fees, you could be up $100 for the week.

Sounds low, but if all weeks performed just as well, a 50% pre-tax annual return on your discretionary investment income is pretty good. Most of his advice, though, is longer-term than one week.
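The back-of-the-envelope math, for the curious: roughly 1% a week (the $100 on $10K) over 52 weeks. This sketch assumes every week performs identically, which it won’t, and the function names are just for illustration.

```python
def simple_annual_return(weekly, weeks=52):
    """Gains withdrawn each week: returns just add up."""
    return weekly * weeks

def compound_annual_return(weekly, weeks=52):
    """Gains reinvested each week: returns multiply."""
    return (1 + weekly) ** weeks - 1

simple = simple_annual_return(0.01)
compounded = compound_annual_return(0.01)
```

The simple figure is where the “about 50%” comes from; reinvesting the gains each week would push it higher.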


Best rant of the year
Leon Spencer responding to Tim Bray on domain-specific markup languages in Ruby.

Frankly, if you’re embedding enough HTML markup in your code to make it worthwhile to create, use, or even argue over HTML-like DSLs…you’re doing something wrong.

Now, (ir)regular expressions and XPath, on the other hand…


Is YouTube Worth Buying?
Mark Cuban says no. True, YouTube is big only because of all the illegal or questionable content. (Odd how they do a good job of policing porn, though.) But don’t forget its name. It is the Napster of its time. It could be burned to the ground by the RIAA, MPAA, SAG, WGA, and NCAA, and still…it will rise again, in some form or another, under one ownership or another. That could be valuable. Hell, even Netscape is still kinda sorta in some weird, scandalous way “recognizable” today.

How much would it take to make a YouTube “killer” in six months? Let’s go all out and hire a team of 100 engineers (overkill). 1000 high-end servers (overkill). Viral marketing like nobody’s seen since MySpace (overkill). Add in PHB overhead (always overkill). Total cost? Maybe $80M.

So are the name and the (unloyal, but impressionable) audience worth $920+M? I don’t think so.

Which just means we’re looking at a purchase price of about $3B by January. You read it here first.


“Jim McGreevey” does Letterman’s Top Ten List
I was just thinking how I rarely watch Letterman anymore… and I miss the best Top Ten in probably five years. (Thanks, YouTube!)

Would they have allowed this on the air 10 years ago?


That’s enough tabs for tonight…