July 16, 2008

UUIDs for Database Primary Keys?

Posted in Software tagged , , , at 3:18 am by mj

We’ve all been through this exercise a number of times: you’re partitioning your data set across multiple physical servers, which means you can no longer rely on your database’s built-in auto_increment/sequence generation.

One common pattern is what’s come to be known as the “hi-lo” strategy. This requires each component in your distributed environment to communicate with a central node (table) to generate an ID, usually by reserving a batch of available IDs for itself to reduce the communication.

(“Hi lo” isn’t exactly self descriptive, so I always use “externalized ID generation” which, despite being longer and more difficult to say, seems to be more descriptive for those not on your core team. Anyway…)

Another pattern that’s emerging is the use of UUIDs. These are 64-bit unique identifiers that can be generated in isolation by any component in your system without talking with a central node (table).

I’m going through this exercise again with a new team, and I’m debating whether I should step outside my comfort zone and try UUIDs. I’m having a hard time
seeing many advantages, however.

The disadvantages that I see are: longer IDs mean longer URLs (unless you go the “slug” route with SEO-friendly URLs), and they don’t play nice with clustered indexes.

The latter is a bit more damning in most scenarios, actually. It just blows away InnoDB’s clustered primary key, for example, which is going to be an issue even outside of web contexts.

The main problem with “hi-lo” is it still introduces a SPOF. In reality, it’s not going to be a bottleneck, since you can always adjust both the granularity and the number of IDs you reserve in each transaction (the “lo” part of “hi-lo”). But that lingering SPOF is a bit ugly.

The other advantages I see are: better use of the available ID space and ubiquitous implementations (even MySQL now has the uuid() built-in). Many hi-lo implementations are specific to a given framework and/or broken for real distributed environments.

Are there advantages I’m missing?

The only situation that I can see where UUIDs offer a clear advantage is for aggregating content across the web, where the UUID can be based on a SHA-1 hash of the source URL, and you’re not clustering on primary key (or, indeed, may not even be using a relational database). FriendFeed seems to have gone this route, for example.

Now obviously, I’m talking about a narrow context. UUIDs are applicable in many other situations (device IDs, node IDs, etc.).