December 19, 2008

How to Use Comments To Attract Visitors…And Make Money

Posted in Community, Software at 11:22 pm by mj

Obligatory disclaimer: I’m hardly a “typical case,” and I don’t have the resources to conduct usability studies, experiments, or surveys for these kinds of things (wouldn’t that be an interesting job).

Sometime in 2005, Steven Levitt and Stephen Dubner–authors of Freakonomics–started an excellent Freakonomics blog. It was more interesting than the book, honestly, and far less self-aggrandizing than the first edition.

By late 2007, it was bought by the New York Times, and soon after, they switched to partial feeds (after breaking their feeds entirely for a time, as I recall). I, of course, summarily removed it from my feed reader. Who has time for partial feeds, especially if you spend most of your time reading on the train, plane, or automobile (OK, rarely do I read blogs in an automobile, but trains and planes have easily constituted 75% of my blog reading time for the last several years)? And how can 30 characters or 30 words or whatever arbitrary cutoff point provide you enough information to decide whether you want to star it to read later?

This evening, I just happened to follow a link to a Freakonomics blog article (this story on standardized test answers, which I followed from God plays dice, a math-oriented blog) and spent about 90 minutes on the site. It really does have interesting content. Most of that time was spent perusing some of the interesting comments, and that’s also what drove me to click through to additional stories–hoping to find more interesting comments.

Here’s an example: A Career Option for Bernie Madoff?. I would never, ever guess that this would be worth reading, especially if it meant first starring the item in my reader, and then finding it and clicking it, and waiting for it to load and then render. Never. I doubt it was even intended to be interesting; it was more of a throw-away story submitted for the amusement of regular readers.

But I’ve found the discussion interesting, mostly because the (convicted felon) former CFO of Crazy Eddie was commenting.

Robert Scoble recently lamented a related shortcoming with blog presentation. Scoble wants a service that highlights individual commenters’ “social capital” so you know who’s talking out of their ass (such as me), and whose opinions really matter or might even hint at things to come.

The projects discussed in his comments all seem to be headed in the same direction: some kind of JavaScript pop-up or other icon that tells you something about a comment author as you’re reading the comments (I’m convinced that in the future, we will only hear our friends and idols and important decision-makers; history really is cyclic). But what if you’re deciding whether to read the comments at all?

Slashdot is another example. I’ve been “reading” (skimming, really) Slashdot for 10 years, since its first year of operation. I have never bothered creating an account. Aside from a handful of anonymous comments over the years, I’ve never really cared to participate in that community or its discussions. It’s rarely my first source for news (unless I’ve been under a metaphorical rock for a few weeks); the summaries are usually wrong; and my ophthalmologist has attributed 28.3% of my retinal decay to the site design.

Yet I keep coming back–often weeks after a story was submitted–because of the interesting comments. But even if you go to the front page–sans feed reader–it’s a crap-shoot. Even with the community moderation system, you simply don’t know if there are great comments embedded within a story. So I sometimes click hoping for a definitive comment, and sort by score (highest first). That often yields good results, at least in a statistical sense: the more comments on a story, the more of them will be truly excellent (maybe because authors are trying to increase their karma by commenting on high-traffic stories?).

So here’s what I think can make us all happy, and make partial feeds useful too: find some way to incorporate the interesting-ness of comments on a story into the feed.

It’s not going to be enough to say “ten of your friends have commented on this story”–although that would be exciting, and doable with existing feed readers.

It’s not even going to be enough to say “two CEOs and three software engineers of companies whose stocks you own have commented.” That would be awesome, too.

It’ll have to combine “people who are interesting” with “comments that are interesting, regardless of the author.”

And dammit, if it wouldn’t revolutionize the way I, at least, read blogs.

If I had that, I’d have all the information I need to once again start skimming partial feeds. It’s even better than ratings on the story level, since it tells you so much more about how it’s engaging its audience.

Is there a market opportunity in there somewhere?

October 23, 2008

Cost of servers 20 years ago — request for help!

Posted in Software at 9:30 pm by mj

I’m putting together a presentation, and I need your help!

I’m looking for the cost and specs of typical “commodity” hardware 20 years ago, versus the cost and specs for typical “big guns” at the same time.

I’m also looking for database options other than the usual suspects (Oracle, DB2, Sybase) that may have been available at the time. Cost comparisons are ideal.

In other words… if you were transported back to 1988, and you had to support a large (for the time) data set, and still knew what you know now about scaling … would you have had any alternatives other than (something akin to) Oracle on big iron?

Many thanks, and I’ll post a follow-up later.

October 19, 2008

Restlet: A REST ‘Framework’ for Java

Posted in Development, Software at 2:33 pm by mj

Building an API on the REST architectural style? Building it in Java?

This past week, on a project doing just that, I ran into Restlet. I’d never heard of a REST framework for Java before, but it’s been featured on InfoQ, TSS, ONJava, and others over the past three years. (Damn, I need to pay more attention.)

And it kicks ass.

Here’s a quick run-down:

Restlet is an API for developing REST-based services in the same way that Servlet is an API for developing Web-based services. Your application never deals with the Servlet API, HTTP-specific attributes, cookies, sessions, JSPs, or any of that baggage.

Instead, you think and code in terms of REST: Routing, Resources, Representations.

It has an intuitive URL routing system that parses out resource identifiers and other data (that is, a URL template of /user/{user_id} would give you a ‘user_id’ attribute for any URL matching that pattern, which is fed into your User resource).

Resources are easily able to define which of the verbs (GET, POST, PUT, DELETE) they respond to, with default behavior defined for verbs that are unsupported.

There are plug-ins available for SSL and OAuth, an emerging best practice for authenticating third party access to user accounts.
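
To make the routing and verb-handling ideas above concrete, here is a minimal sketch, written in Python rather than Java to keep it short. It does not use Restlet’s API at all: compile_template, dispatch, and UserResource are hypothetical names invented for illustration, and the 404/405 responses are only an assumption about what sensible default behavior might look like.

```python
import re

VERBS = {"GET", "POST", "PUT", "DELETE"}


def compile_template(template):
    """Turn a template like '/user/{user_id}' into an anchored regex."""
    # re.split with a capturing group alternates literal text and variable names.
    parts = re.split(r"\{(\w+)\}", template)
    pattern = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            pattern += re.escape(part)           # literal path text
        else:
            pattern += "(?P<%s>[^/]+)" % part    # one path segment per variable
    return re.compile("^" + pattern + "$")


class UserResource:
    """Hypothetical resource that supports only GET."""

    def get(self, user_id):
        return 200, "user %s" % user_id


def dispatch(routes, verb, path):
    """Find the first matching route and call the resource's verb method."""
    for template, resource in routes:
        match = compile_template(template).match(path)
        if match is None:
            continue
        handler = getattr(resource, verb.lower(), None) if verb in VERBS else None
        if handler is None:
            return 405, "Method Not Allowed"     # assumed default for unsupported verbs
        return handler(**match.groupdict())
    return 404, "Not Found"
```

With routes = [("/user/{user_id}", UserResource())], a GET on /user/42 reaches UserResource.get with user_id set to "42", while a DELETE on the same URL falls through to the default response.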

The documentation is a bit lacking. However, there is an excellent IBM developerWorks tutorial on Restlet (registration required) that lays out pretty much everything you need, with a (nearly-)complete example for study.

September 27, 2008

Three subversion tips: svn:ignore, svn merge, svn move

Posted in Development, Software at 7:57 am by mj

Since I complained earlier this year about the state of Subversion tools, I’ve been thinking about a follow-up that’s a bit more positive.

This doesn’t exactly count, but I thought I’d share a few productivity lessons I’ve learned recently.

Using svn:ignore
svn:ignore is a special Subversion property that instructs Subversion to ignore any files (or directories) that match a given pattern.

The common use case is to ignore build artifacts to prevent accidental check-ins and eliminate clutter on svn status, etc. For example, you can ignore all *.jar files in a particular directory, or ignore your build directory, etc.

Unfortunately, this can tend to hide problems with your build artifacts. For a project I’m working on now, we have timestamped JAR files stuffed into a common directory. The JAR files themselves are svn:ignore’d, which means svn status will never display them.

And as I found recently, this could result in 8 GB of “hidden” files that only become apparent when you, say, try to copy a remote workspace into a local one for managing with Eclipse.

Shame on the developers for not deleting them as part of ant clean. But it happens, no getting around that.

Thankfully, the Subversion developers thought about this case, and introduced the --no-ignore flag to svn status. With this option, ignored files are displayed along with added, modified and deleted files, with an I in the first column.

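Cleaning up your Subversion working copy is, therefore, as simple as:

# List ignored paths (status "I") and delete each one; rm gets the path as a separate argument, so odd filenames are not re-parsed by a shell.
svn status --no-ignore |
grep '^I' |
perl -n -e 'next unless /^I\s+(.*)$/; my $f = $1; if (-d $f) { print "Deleting directory $f\n"; system("rm", "-rv", $f); } else { print "Deleting file $f\n"; system("rm", "-v", $f); }'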

That will remove all files and directories that Subversion is ignoring (but not files that just have not yet been added to source control). Stick that in a script in your path, and live happily ever after.


Merging back into trunk
The most common use case when merging is to provide a range of revisions in trunk to pull into your branch. For example:

svn merge -r 100:114

What happens is you tell Subversion, “I don’t care what happened before revision 100, because that’s already in my branch…so just apply changes between revisions 100 and 114.”

But what’s not obvious–nor, as far as I can tell, available in standard reference books–is how to merge back into trunk. It turns out, the way to do this is to disregard everything you’ve learned about Subversion.

The problem is that you’ve been merging changes from trunk into your branch. So if you simply choose the naive approach of picking up all changes since your base branch revision until your final check-in, and try to apply those to trunk, you’ll get conflicts galore, even on files you never touched in your branch (except to pull from trunk).

The solution is to use a different form of the merge command, like so:

svn merge ./@115

where revision 115 represents your last merge from trunk.

This actually just compares the two repositories at the specified revision, and pulls in the differences, all the differences, and nothing but the differences. So help me Knuth.


Beware the power of svn move
One of the much-touted benefits of Subversion (particularly as compared to CVS) is the support for moving files around. But, until 1.5, there has been a glaring error that is often overlooked and can get you into trouble.

Because svn move is implemented as an svn delete followed by an svn add, Subversion thinks the new file has no relation to the old file. Therefore, if you have local changes to foo, and your arch nemesis (er, co-worker) Randy moves it to bar, your changes will simply disappear!

Subversion 1.5 has partially addressed this, at least for single files. Under the new regime, your changes to foo will be merged with any changes to bar. However, you still need to be careful with moving directories.

This is more insidious than moving methods around inside the same file. While in that case Subversion will freak out and your merges will become difficult, at least you’ll see the conflict and your changes won’t disappear while you’re not looking.

The lesson, then, is to talk with your team-mates before any refactoring. (svn lock doesn’t seem to provide any help unless everybody’s in trunk.)

Rumor has it svn 1.6 will address this even more practically by introducing the svn fuck-you-and-your-dog command. But until then, you have to do it the old-fashioned way.

September 20, 2008

Match-making for Java Strings

Posted in Software at 11:10 am by mj

(Inspired by Jeff Atwood’s recent ‘outing’ as a regex sympathizer, which got me thinking about the line between “too many” and “too few” regular expressions and how some languages make it a choice between “too few” and “none”.)

Java has a Pattern, which forces you to pre-declare your regex string.

And it has a Matcher, which matches on a String.

It should be noted that a Pattern’s pattern turns most patterns into a mess of backslashes since the pattern is wrapped in a plain-old Java String.

So a Matcher has matches(), which matches using the Pattern’s pattern, but only if the pattern would otherwise match the Matcher’s whole string.

A Matcher can also find(), which matches a Pattern pattern even if the pattern would only match a substring of the String, which is what most patterns match and what most languages call matching on a string.

A Matcher can lookingAt(), which matches on a Pattern pattern, which, like find(), can match a pattern on a substring of the string, but only if the String’s matching substring starts at the beginning of the string.

The String matched by the Matcher can be sliced by a call to region(start,end), which allows matches() and lookingAt() to interpret a substring of the String as being the whole string.

Now, after calling find() or any of Matcher’s String-matching cousins, a consumer of a Matcher can call group(int) to get the String substring that the Matcher’s Pattern’s pattern captured when matching on the Matcher’s String’s string.

But if you’re lazy, and you have no groups in your pattern, and a Matcher’s matches() is sufficient, then String gives you matches(pattern) which is precisely equivalent to constructing a Pattern with your pattern and passing a new Matcher your existing String!

So with effective use of Java object syntax, you too can use regular expressions to make your matches on Java Strings almost as obscurely as other languages clearly make matches on their strings!

Is it any wonder Java programmers don’t realize that regular expressions are a beautiful… thing?
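
For comparison, here is roughly the same trio in one of those other languages. In Python the three behaviors are spelled re.fullmatch, re.search, and re.match. Treat the mapping onto Matcher’s matches(), find(), and lookingAt() as an analogy rather than an exact equivalence; the pattern and the sample text are made up for illustration.

```python
import re

# Raw strings keep the backslashes readable, unlike a pattern inside a Java String.
pattern = re.compile(r"(\d+)-(\d+)")
text = "range 10-20 units"

whole = pattern.fullmatch(text)    # like matches(): the whole string must match
prefix = pattern.match(text)       # like lookingAt(): must match at the start
anywhere = pattern.search(text)    # like find(): any substring may match

# Captured groups are read from the match object, much like Matcher.group(int).
low, high = anywhere.group(1), anywhere.group(2)
```

Here whole and prefix are both None because the text starts with "range ", while anywhere matches the "10-20" substring.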

September 17, 2008

Shameless Promotion for my Shared Items

Posted in Software at 8:25 am by mj

I may not be writing much lately, but, thanks to Google Reader’s Offline Mode, I try to continue reading and adding to my shared items.

Unfortunately, I tend to sync at most once a week (except when I’m back in the Bay Area), so they tend to come in batches…and 3 weeks after the original post. It looks like the Reader team finally fixed the problem of only showing the sync time, though (except in my private view).

In today’s sync, I shared 24 items from 18 bloggers.

While some may go for quantity (ahem, Scoble, Digg), I only share things I’d want to read and refer to again…and which I’d prefer my whole team read, too.

Fortunately, the world is teeming with interestingness.

Some examples of things I’ve found interesting and shared recently:

Implementing Persistent Vectors in Scala.
Daniel Spiewak explains how immutable data structures can, nevertheless, be efficient even during writes. Perhaps the clearest example I’ve seen.

I still don’t claim to understand how multiple threads can share the same (changing) state without locking (perhaps something along the lines of Java’s ConcurrentHashMap, the code from which is well worth studying).


Shard Lessons.
Dan Pritchett shares his experience with database sharding. Worth it for the second lesson alone (“Use Math on Shard Counts”), where he explains why multiples of 12 are a more efficient scaling strategy.


Singletons are Pathological Liars.
Miško Hevery has been writing the clearest (bestest!) introductions to designing for unit testing that I have seen. They’re not really introductions so much as they are motivators.

You know the drill: you join a new team with a code base that’s been around seemingly since Pascal walked the Earth. Maybe everybody has heard of unit testing, but nobody really understands what it’s all about or why their singleton-ridden/new-operator-ridden existing code (or existing so-called “unit tests”) isn’t sufficient.

Don’t buy them a book. Point them to Miško Hevery’s blog.


There’s more. (Much more.) There’s the excellent ongoing REST discussion involving Tim Bray, Dare, Dave Winer, Bill de hÓra, Damien Katz (and others); a lot of fantastic Drizzle commentary that goes into mucho detail; discussions on edge cases and performance degradation in MySQL; and so on.

I wish I had a job that allowed me to just take what I’ve read and experiment with the ideas and contribute to some of the projects.

Yes, that would be an awesome job.

September 13, 2008

Designing a Distributed Hi-Lo Strategy

Posted in Scale, Software at 6:57 am by mj

In a previous post, I lamented that the “hi-lo” ID generation strategy has one big wart: it’s a single point of failure.

After thinking about it a bit, though, it occurred to me that we can eliminate the SPOF without limiting our ability to add more shards later. And it’s quite easy–just more sophisticated than you typically need.

WARNING: I consider this design to be a bit over-engineered. You probably don’t need to eliminate the SPOF. But, if ID generation is the only critical SPOF in your system, and UUIDs aren’t practical for your purpose, it may be worth going this route.

The basis of this lies in expanding the fields in our id_sequences table, reproducing the table in each shard, and introducing a stateless agent that’s always running in the background to maintain the various id_sequences tables across all our shards.

Database Design

 sequence_name        varchar(255) not null
 window_start_value   bigint(19) not null
 window_end_value     bigint(19) not null
 next_hi_value        bigint(19) not null
 PRIMARY KEY (sequence_name, window_start_value)
 KEY idx_find_window (sequence_name, window_end_value, next_hi_value, window_start_value)

The key change is the introduction of window_start_value and window_end_value, which together define an ID window from which threads can reserve IDs on each shard.

Each shard will have multiple open windows, but only one is used at a time. A window is open if next_hi_value < window_end_value.

Windows are created (and pruned, if necessary) by the agent, more on which later.

Application Hi-Lo Strategy

As expected, the in-memory buffer works as normal. The difference is in the select and increment operation.

When a shard has exhausted its in-memory buffer, we reserve another batch with the following sequence of steps:

Step 1. Begin transaction

Step 2. Query the database for the first open window

    SELECT *
    FROM id_sequences
    WHERE sequence_name = ?
       AND window_end_value > next_hi_value
    ORDER BY window_start_value 
    LIMIT 1

Step 3. Increment the max reserved ID by our buffer size, but do not exceed window_end_value.

Step 4. Update the window with the new next_hi_value

Step 5. Commit

This is guaranteed to always return a single open window (unless there are no open windows). Multiple threads trying to fetch the open window at the same time will not conflict. If thread A and B arrive simultaneously, and thread A exhausts the first open window, thread B will simply use the next window.
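
The five steps above can be sketched as a single function. This is only an illustration under several assumptions: it uses Python’s built-in sqlite3 as a stand-in for MySQL, reserve_batch is a hypothetical name, next_hi_value is treated as the next unreserved ID, and windows are treated as half-open ranges [window_start_value, window_end_value). SQLite’s BEGIN IMMEDIATE is used here to serialize writers; on another engine something like SELECT ... FOR UPDATE might play that role.

```python
import sqlite3

# SQLite rendering of the id_sequences table (types simplified for the sketch).
SCHEMA = """
CREATE TABLE id_sequences (
    sequence_name      TEXT    NOT NULL,
    window_start_value INTEGER NOT NULL,
    window_end_value   INTEGER NOT NULL,
    next_hi_value      INTEGER NOT NULL,
    PRIMARY KEY (sequence_name, window_start_value)
)
"""


def reserve_batch(conn, sequence_name, buffer_size):
    """Reserve up to buffer_size IDs; return (first_id, end_exclusive) or None."""
    conn.execute("BEGIN IMMEDIATE")                      # Step 1: begin transaction
    try:
        row = conn.execute(                              # Step 2: first open window
            "SELECT window_start_value, window_end_value, next_hi_value "
            "FROM id_sequences "
            "WHERE sequence_name = ? AND window_end_value > next_hi_value "
            "ORDER BY window_start_value LIMIT 1",
            (sequence_name,),
        ).fetchone()
        if row is None:
            conn.execute("ROLLBACK")                     # no open windows left
            return None
        window_start, window_end, next_hi = row
        new_hi = min(next_hi + buffer_size, window_end)  # Step 3: never pass the window end
        conn.execute(                                    # Step 4: record the new high value
            "UPDATE id_sequences SET next_hi_value = ? "
            "WHERE sequence_name = ? AND window_start_value = ?",
            (new_hi, sequence_name, window_start),
        )
        conn.execute("COMMIT")                           # Step 5: commit
        return next_hi, new_hi
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

The connection is assumed to be opened with isolation_level=None so that the explicit BEGIN and COMMIT statements control the transaction. A caller that gets back a short batch (because the window ran out) simply asks again and is served from the next open window.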

Controlling Agent

This agent can be always running, or it could be a simple nightly cron job, or even a periodic manual process. Its responsibility is to create windows on each shard.

Since the current state of the system can be reconstructed on-the-fly without locking (see caveat below), we don’t have to worry about agents crashing or getting killed in the middle of their thing.

There are only two system parameters that the agent concerns itself with: min_open_windows and window_size. Whenever any shard has fewer than the minimum number of open windows, the agent creates a new window on that shard.

Re-constructing the system state can be as simple as running

    SELECT max(window_end_value)
    FROM id_sequences
    WHERE sequence_name = ?

on each shard before proceeding.

You probably also want a first pass that finds all unique sequence_names

    SELECT DISTINCT(sequence_name)
    FROM id_sequences

so introducing a new sequence is as simple as inserting a single row into one of your shards, and the agent will propagate it elsewhere.

Then, for each shard, it queries a count of the open windows for each sequence, and inserts new windows as necessary.

No locking. No external state.
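
A single pass of the agent might look something like the sketch below. As before, this is an illustration and not a tested production design: it uses sqlite3 connections as stand-ins for shards, agent_pass is a hypothetical name, windows are assumed to be half-open ranges, and it assumes only one agent runs at a time and that every shard is reachable.

```python
import sqlite3

# SQLite rendering of the id_sequences table (types simplified for the sketch).
SCHEMA = """
CREATE TABLE id_sequences (
    sequence_name      TEXT    NOT NULL,
    window_start_value INTEGER NOT NULL,
    window_end_value   INTEGER NOT NULL,
    next_hi_value      INTEGER NOT NULL,
    PRIMARY KEY (sequence_name, window_start_value)
)
"""


def agent_pass(shards, min_open_windows, window_size):
    """Top up every shard to min_open_windows open windows per sequence."""
    # First pass: collect every sequence name known to any shard.
    names = set()
    for conn in shards:
        rows = conn.execute("SELECT DISTINCT sequence_name FROM id_sequences")
        names.update(name for (name,) in rows)

    for name in sorted(names):
        # Reconstruct the system state: the highest window end on any shard.
        high = 0
        for conn in shards:
            (shard_max,) = conn.execute(
                "SELECT max(window_end_value) FROM id_sequences "
                "WHERE sequence_name = ?",
                (name,),
            ).fetchone()
            if shard_max is not None:
                high = max(high, shard_max)

        # Create new windows above that mark wherever a shard is running low.
        for conn in shards:
            (open_count,) = conn.execute(
                "SELECT count(*) FROM id_sequences "
                "WHERE sequence_name = ? AND window_end_value > next_hi_value",
                (name,),
            ).fetchone()
            for _ in range(max(0, min_open_windows - open_count)):
                conn.execute(
                    "INSERT INTO id_sequences VALUES (?, ?, ?, ?)",
                    (name, high, high + window_size, high),
                )
                high += window_size
```

Running the pass a second time with nothing consumed creates no new windows, which is what makes it safe to rerun from cron.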

Is the Agent a SPOF?

That’s true – if the server on which the agent is set to run goes down, game over. But, you can run the agent from a cron job hourly, and stagger it across N servers, each running at a different hour.

I can’t envision a scenario where you’d need the agent to be continuously running and this would not suffice as a highly available design. If N-1 of your servers go down, then at most you’d go N hours without creating new windows. But your window sizes are sufficient to support a week or more of growth, yes?

What about Master-Master?

Some database systems are deployed in master-master pairs. In this case, you can either stick the id_sequences table on only one of the masters and always hit that master, or give each master its own window. The latter is probably preferable, although it means excluding id_sequences from replication.

Adding a new shard

Before opening up a new shard, the agent needs an opportunity to initialize the table. Not a big deal.

Deleting a shard or taking a server out of rotation

This is the one flaw, as hinted above. Reconstructing the system state on-the-fly requires that the server with the highest window_end_value can be reached by the agent, and that we know which server that is.

This may require external state to work around, such as always writing the largest window_end_value into at least one other server.

It’s probably sufficient for the agent to simply refuse to run when any server is unavailable. If you have a shard that’s offline long enough to exhaust all of your open ID windows, you have bigger problems.


As I said, this is probably a bit over-engineered for most systems. While I have tested the behavior under MySQL, I have not had an opportunity to deploy it in a running system (I may soon, if we determine it’s not too over-engineered), and I have not heard that anybody else has employed a similar solution.

Which also means no existing frameworks support it out of the box.

September 6, 2008

Creating Database Sandboxes for Unit/Integration Tests

Posted in Development, Software at 9:45 am by mj

After Baron Schwartz’s recent hint at having solved unit testing database sandboxes at a previous employer, I got to thinking about the problem again.

To be clear: this is not really unit testing, but I’ve found integration tests at various levels are just as important as unit tests. So much so that I have taken to creating both test and integration source directories, and, whenever possible, requiring both suites to pass as part of the build process.

There are two suggestions I’ve seen for solving this problem, both of which are applicable for local in-memory databases as well.

First, starting with a completely empty database, populating it, and then tearing it down. Unfortunately, this is not only difficult, it’s time consuming. If you do this before each test, your tests will take hours to run. If you do this before the whole suite, your tests will not be isolated enough.

A previous co-worker had suggested an incremental approach. Start out with an empty data set, and let each test (perhaps through annotations) define which data must be fresh. I like that. It requires a lot of infrastructure and discipline. It could encourage simpler tests, although with simpler tests come more tests, thus more discipline.

The other approach I’ve seen suggested a couple of times now (including in a comment on Baron’s blog) is the use of a global transaction. Unfortunately, this does not work with all database engines. MySQL tends to be the real killjoy, because nested transactions are not supported and DDL statements are not transactional. Yeah, even in the transactional engines.

So, here’s what I’m thinking. If I were starting over with a new team, with minimal code already existing, I think I wouldn’t solve this problem from an engineering/code perspective. Instead, I’d solve it from an operational perspective (though it still requires application/test infrastructure changes).

Picture a central test database server with one pristine copy of the data, and thousands of database instances. The application (test) asks this server for an available database instance, uses it for a single test, and then moves on. The next test resets the application state, so it asks the server for another available database instance, and so on.

Meanwhile, there is a daemon on that server that is constantly checking each database instance. If the underlying data files do not match the pristine copy, they are restored and the instance is placed back into the available pool.

An instance is considered available for testing when (a) there are no threads connected to it, and (b) its data files match the pristine copy.

Tests that do not alter the underlying data files do not require restoration.
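
The daemon’s availability check could be sketched as follows. I haven’t built this, so take it as a thought experiment: fingerprint, is_available, and restore are hypothetical names, and comparing byte-for-byte hashes is just one assumed way to decide that data files “match the pristine copy” (a real database engine may rewrite files even when nothing logical has changed, in which case the comparison would need to be smarter).

```python
import hashlib
import os
import shutil


def fingerprint(data_dir):
    """Hash every file under data_dir (relative path plus contents) in a stable order."""
    digest = hashlib.sha256()
    for root, dirs, files in os.walk(data_dir):
        dirs.sort()                                  # make traversal order deterministic
        for name in sorted(files):
            path = os.path.join(root, name)
            digest.update(os.path.relpath(path, data_dir).encode())
            with open(path, "rb") as handle:
                digest.update(handle.read())
    return digest.hexdigest()


def is_available(instance_dir, pristine_dir, connected_threads):
    """Rule (a): no connected threads. Rule (b): data files match the pristine copy."""
    if connected_threads != 0:
        return False
    return fingerprint(instance_dir) == fingerprint(pristine_dir)


def restore(instance_dir, pristine_dir):
    """Throw away a dirty instance and copy the pristine data files back in."""
    shutil.rmtree(instance_dir)
    shutil.copytree(pristine_dir, instance_dir)
```

The daemon would loop over instances, call restore on any idle instance whose fingerprint differs, and leave untouched instances in the pool.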

What about schema changes? Answer: you have to unit/integration test them too. When you’re ready to promote your code, you deploy to the pristine copy as part of the standard production release process. An interesting side effect of this is it will, in many cases, force other developers to merge production changes back into their private branches, because many of their tests will probably fail.

Contrary to Baron’s suggestion, in a properly designed system this does not require changes to production code. As long as you can inject a database connection pool into your application–and all quality code should have this property (cough, cough)–your test framework can inject a connection pool that interrogates the test server first.

And it can scale to multiple test database servers as your team and the number of tests grows.

I haven’t tried this, and I have too many deadlines (and too much legacy code that I’m still learning in my current team) to experiment with a real-world application.

But what do you think? What holes are there in this proposal?

Aside from violating the Engineering Aesthetic that the application should control the environment it needs for testing. Which is what I think has caused me the most problems over the years.

August 23, 2008

Wisdom of Crowds

Posted in Software at 9:33 am by mj

I am living without internet in my temporary condo in Seattle (the horror! my god! I’m dying! how did people live twenty years ago? no wonder there are so many wars!), and am working on deadlines at the office, so have few chances to write.

But I am trying to keep up with the news using Google Reader’s offline mode.

Which brings me to this bit on collective intelligence from Nat Torkington:

Systems that channel individual behaviours to create new and valuable data are showing up everywhere. We point to Amazon Recommendations as the canonical example, but it’s hard to find an area that isn’t using individual actions to produce collective wisdom.

Not that I disagree, but the thought just struck me. We always bring up recommendation engines on Amazon or Pandora or Netflix or Facebook…

Wisdom represents the ability to understand the world better and, through that understanding, improve it (or at least one’s standing in it). (See the Wikipedia entry on wisdom, which agrees with me.)

How is “finding your niche” (or even moving outside your niche) in books or music or movies or online friends…wisdom?

That seems more like plain ol’ socialization to me. On a much larger scale than ever before, granted, but can we call what we’re doing with these tools at this moment in history increasing our wisdom?

Coincidentally, this chart from New Scientist (via Paul Kedrosky–no direct link available) shows what happens when people “recommend” (in a generic sense) stocks to one another.

I bet if we plotted the popularity of artists and movies (at all points along the head and long tail) we’d find similar results.

I guess I’m still waiting for automated tools that increase my wisdom. Are there tools that will look for trends in people that live longer, which will help me live to 100? Are there tools that will look for trends in people that are successful, which will help me retire when I’m 45 and spend the next 55 years traveling the world (and, hopefully, the moon and Mars)? Are there tools that will help us reform our political structure so that it’s even worth living longer? Tools that make it harder to not have compassion? Tools that prevent us from foisting dictators and nanny-states upon ourselves?

Yes, some of these things are coming–but at the moment, we basically have simple data mining tools that help experts know where to focus their attention, then filter and draw conclusions and make suggestions to the rest of us.

I love my Pandora. I love recommendation engines in general. I even love my credit card company’s fraud algorithms. My life is much better–I am much happier–as a result.

But I don’t feel any more wise.

July 17, 2008

MySQL to MSSQL Server Replication?

Posted in Software at 1:51 pm by mj

The team I’m working with is in the process of moving from SQL Server to MySQL.

To ease the burden, we need all writes to the new MySQL databases to be replicated back into SQL Server, so that all of our back-end reports and what-not will continue to function. It’s short term, but not shorter than six months.

I can’t find any generic tool online to do this.

I thought about setting up a bridge MySQL instance with triggers that create a message queue, then reuse our existing Java DAOs to do the heavy lifting (they know how to read/write either SQL Server or MySQL). But, that quickly gets invasive and brittle.

My next thought was a lightweight wrapper around the mysqlbinlog program which does essentially what the MySQL replication threads do: reads a portion of the binlogs, persists its current position, then replays the statements.

Then I thought, “well, if we’re doing that, why not just reuse MySQL’s replication code and build something around that core?”

You can see the hole I’m getting myself into.

Are there better strategies? Are there (free or commercial) tools that can replay replication logs on a MSSQL Server instance?

The lightweight Perl/Python wrapper seems the best solution. Anybody have experience with something like this (for SQL Server or otherwise)?
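
For what it’s worth, the core loop of such a wrapper (read from a saved position, replay, persist the new position) might be sketched like this. Everything here is hypothetical: read_events stands in for whatever parses the mysqlbinlog output, apply_statement stands in for the code that executes a statement against SQL Server, and the JSON position file is just one assumed way to persist progress.

```python
import json
import os


def load_position(path):
    """Return the last persisted position, or 0 if nothing has been replayed yet."""
    if not os.path.exists(path):
        return 0
    with open(path) as handle:
        return json.load(handle)["position"]


def save_position(path, position):
    """Write the position to a temp file, then rename it into place."""
    temp_path = path + ".tmp"
    with open(temp_path, "w") as handle:
        json.dump({"position": position}, handle)
    os.replace(temp_path, path)


def replay(read_events, apply_statement, position_path):
    """Replay every event after the saved position; return the final position."""
    position = load_position(position_path)
    for next_position, statement in read_events(position):
        apply_statement(statement)                   # e.g. execute on the target database
        save_position(position_path, next_position)  # persist only after a successful apply
        position = next_position
    return position
```

Because the position is saved after each statement is applied, a crash between the two steps would replay that one statement on restart, so the statements (or the apply step) would need to tolerate being run twice.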
