September 13, 2008

Designing a Distributed Hi-Lo Strategy

Posted in Scale, Software at 6:57 am by mj

In a previous post, I lamented that the “hi-lo” ID generation strategy has one big wart: it’s a single point of failure (SPOF).

After thinking about it a bit, though, it occurred to me that we can eliminate the SPOF without limiting our ability to add more shards later. And it’s quite easy, just more sophisticated than you typically need.

WARNING: I consider this design to be a bit over-engineered. You probably don’t need to eliminate the SPOF. But, if ID generation is the only critical SPOF in your system, and UUIDs aren’t practical for your purpose, it may be worth going this route.

The basis of this lies in expanding the fields in our id_sequences table, reproducing the table in each shard, and introducing a stateless agent that’s always running in the background to maintain the various id_sequences tables across all our shards.

Database Design

    CREATE TABLE id_sequences (
      sequence_name        varchar(255) not null,
      window_start_value   bigint(19) not null,
      window_end_value     bigint(19) not null,
      next_hi_value        bigint(19) not null,
      PRIMARY KEY (sequence_name, window_start_value),
      KEY idx_find_window (sequence_name, window_end_value, next_hi_value, window_start_value)
    )

The key change is the introduction of window_start_value and window_end_value, which together define an ID window from which threads can reserve IDs on each shard.

Each shard will have multiple open windows, but only one is used at a time. A window is open if next_hi_value < window_end_value.
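
For illustration, here is what one shard’s table might hold for a single sequence (values invented; the gaps between windows belong to windows handed out to other shards). The first window is closed because next_hi_value has reached window_end_value; the second is open and in use; the third is open and untouched:

    sequence_name   window_start_value   window_end_value   next_hi_value
    'comments'                       1              10000           10000
    'comments'                   20001              30000           23500
    'comments'                   40001              50000           40001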

Windows are created (and pruned, if necessary) by the agent, more on which later.

Application Hi-Lo Strategy

As before, the in-memory buffer works as normal. The difference is in the select-and-increment operation.

When the in-memory buffer for a shard is exhausted, we reserve another batch with the following sequence of steps:

Step 1. Begin transaction

Step 2. Query the database for the first open window

    SELECT *
    FROM id_sequences
    WHERE sequence_name = ?
      AND window_end_value > next_hi_value
    ORDER BY window_start_value
    LIMIT 1
    FOR UPDATE

Step 3. Increment the max reserved ID by our buffer size, but do not exceed window_end_value. (Near the end of a window, this may mean reserving a smaller batch than usual.)

Step 4. Update the window with the new next_hi_value
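
That statement might look like the following, where the first parameter is the new value computed in step 3 (i.e. the smaller of next_hi_value + buffer_size and window_end_value), and the other two identify the window we locked:

    UPDATE id_sequences
    SET next_hi_value = ?
    WHERE sequence_name = ?
      AND window_start_value = ?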

Step 5. Commit

This is guaranteed to return a single open window (unless there are no open windows). Multiple threads trying to fetch the open window at the same time will not conflict: the FOR UPDATE lock serializes them. If threads A and B arrive simultaneously, and thread A exhausts the first open window, thread B will simply use the next window.

Controlling Agent

The agent can run continuously, as a simple nightly cron job, or even as a periodic manual process. Its responsibility is to create windows on each shard.

Since the current state of the system can be reconstructed on-the-fly without locking (see caveat below), we don’t have to worry about agents crashing or getting killed in the middle of a run.

There are only two system parameters that the agent concerns itself with: min_open_windows and window_size. Whenever any shard has fewer than the minimum number of open windows, the agent creates a new window on that shard.

Reconstructing the system state can be as simple as running

    SELECT max(window_end_value)
    FROM id_sequences
    WHERE sequence_name = ?

on each shard before proceeding.

You probably also want a first pass that finds all unique sequence_names

    SELECT DISTINCT sequence_name
    FROM id_sequences

so introducing a new sequence is as simple as inserting a single row into one of your shards, and the agent will propagate it elsewhere.

Then, for each shard, it queries a count of the open windows for each sequence, and inserts new windows as necessary.
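
A sketch of that per-shard pass in SQL (the placement rule for new windows here is one reasonable choice, not the only one: start just past the largest window_end_value found during the reconstruction step, and size the window by window_size):

    -- count the open windows for one sequence on this shard
    SELECT COUNT(*)
    FROM id_sequences
    WHERE sequence_name = ?
      AND window_end_value > next_hi_value

    -- if the count is below min_open_windows, create a new window:
    --   window_start_value = global max window_end_value + 1
    --   window_end_value   = window_start_value + window_size
    --   next_hi_value      = window_start_value
    INSERT INTO id_sequences
      (sequence_name, window_start_value, window_end_value, next_hi_value)
    VALUES (?, ?, ?, ?)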

No locking. No external state.

Is the Agent a SPOF?

To a degree, yes: if the server on which the agent is set to run goes down, no new windows get created. But you can run the agent from an hourly cron job and stagger it across N servers, each running at a different hour.

I can’t envision a scenario where you’d need the agent to be continuously running and this would not suffice as a highly available design. If N-1 of your servers go down, then at most you’d go N hours without creating new windows. But your window sizes are sufficient to support a week or more of growth, yes?

What about Master-Master?

Some database systems are deployed in master-master pairs. In this case, you can either put the id_sequences table on only one of the masters and always hit that master, or give each master its own windows. The latter is probably preferable, although it means excluding id_sequences from replication.

Adding a new shard

Before opening up a new shard, the agent needs an opportunity to initialize the table. Not a big deal.

Deleting a shard or taking a server out of rotation

This is the one flaw, as hinted above. Reconstructing the system state on-the-fly requires that the server with the highest window_end_value can be reached by the agent, and that we know which server that is.

This may require external state to work around, such as always writing the largest window_end_value into at least one other server.

It’s probably sufficient for the agent to simply refuse to run when any server is unavailable. If you have a shard that’s offline long enough to exhaust all of your open ID windows, you have bigger problems.

Caveats

As I said, this is probably a bit over-engineered for most systems. While I have tested the behavior under MySQL, I have not had an opportunity to deploy it in a running system (I may soon, if we determine it’s not too over-engineered), and I have not heard that anybody else has employed a similar solution.

Which also means no existing frameworks support it out of the box.
