September 13, 2008
In a previous post, I lamented that the “hi-lo” ID generation strategy has one big wart: it’s a single point of failure.
After thinking about it a bit, though, it occurred to me that we can eliminate the SPOF without limiting our ability to add more shards later. And it's quite easy; it's just more sophisticated than you typically need.
WARNING: I consider this design to be a bit over-engineered. You probably don’t need to eliminate the SPOF. But, if ID generation is the only critical SPOF in your system, and UUIDs aren’t practical for your purpose, it may be worth going this route.
The basis of this lies in expanding the fields in our `id_sequences` table, reproducing the table in each shard, and introducing a stateless agent that's always running in the background to maintain the various `id_sequences` tables across all our shards.
```sql
CREATE TABLE id_sequences (
  sequence_name      varchar(255) NOT NULL,
  window_start_value bigint(19)   NOT NULL,
  window_end_value   bigint(19)   NOT NULL,
  next_hi_value      bigint(19)   NOT NULL,
  PRIMARY KEY (sequence_name, window_start_value),
  KEY idx_find_window (sequence_name, window_end_value, next_hi_value, window_start_value)
);
```
The key change is the introduction of `window_start_value` and `window_end_value`, which together define an ID window from which threads can reserve IDs on each shard.
Each shard will have multiple open windows, but only one is used at a time. A window is open if `next_hi_value < window_end_value`.
Windows are created (and pruned, if necessary) by the agent, more on which later.
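Put another way (a trivial sketch; the helper name is mine, not part of the schema):

```python
# A window is open while next_hi_value has not yet reached window_end_value.
def is_open(next_hi_value, window_end_value):
    return next_hi_value < window_end_value

print(is_open(400, 1000))   # True: 600 IDs still unreserved in this window
print(is_open(1000, 1000))  # False: window exhausted, move to the next one
```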
Application Hi-Lo Strategy
As expected, the in-memory buffer works as normal. The difference is in the select and increment operation.
When a shard has exhausted its in-memory buffer, we reserve another batch with the following sequence of steps:
Step 1. Begin transaction.
Step 2. Query the database for the first open window:

```sql
SELECT * FROM id_sequences
WHERE sequence_name = ? AND window_end_value > next_hi_value
ORDER BY window_start_value
LIMIT 1
FOR UPDATE
```

Step 3. Increment the max reserved ID (`next_hi_value`) by our buffer size, but do not exceed `window_end_value`.
Step 4. Update the window with the new `next_hi_value`.
Step 5. Commit.
This is guaranteed to always return a single open window (unless there are no open windows). Multiple threads trying to fetch the open window at the same time will not conflict. If thread A and B arrive simultaneously, and thread A exhausts the first open window, thread B will simply use the next window.
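The five steps above can be sketched as follows: a minimal Python example using an in-memory SQLite database as a stand-in for MySQL. SQLite has no `SELECT ... FOR UPDATE`, so `BEGIN IMMEDIATE` plays the role of the lock here; the function name and the demo data are mine.

```python
import sqlite3

def reserve_batch(conn, sequence_name, buffer_size):
    """Reserve up to buffer_size IDs from the first open window.

    Returns the reserved half-open range (lo, hi), or None if no
    window is open for this sequence.
    """
    cur = conn.cursor()
    cur.execute("BEGIN IMMEDIATE")                 # step 1: begin (and lock)
    cur.execute(
        """SELECT window_start_value, window_end_value, next_hi_value
             FROM id_sequences
            WHERE sequence_name = ? AND window_end_value > next_hi_value
            ORDER BY window_start_value LIMIT 1""",
        (sequence_name,))                          # step 2: first open window
    row = cur.fetchone()
    if row is None:
        conn.rollback()
        return None
    start, end, next_hi = row
    new_hi = min(next_hi + buffer_size, end)       # step 3: cap at window end
    cur.execute(
        """UPDATE id_sequences SET next_hi_value = ?
            WHERE sequence_name = ? AND window_start_value = ?""",
        (new_hi, sequence_name, start))            # step 4: record new hi
    conn.commit()                                  # step 5: commit
    return (next_hi, new_hi)

# Demo: two windows of 1000 IDs each on one "shard".
conn = sqlite3.connect(":memory:", isolation_level=None)  # manual transactions
conn.execute("""CREATE TABLE id_sequences (
    sequence_name      TEXT    NOT NULL,
    window_start_value INTEGER NOT NULL,
    window_end_value   INTEGER NOT NULL,
    next_hi_value      INTEGER NOT NULL,
    PRIMARY KEY (sequence_name, window_start_value))""")
conn.execute("INSERT INTO id_sequences VALUES ('orders', 0, 1000, 0)")
conn.execute("INSERT INTO id_sequences VALUES ('orders', 1000, 2000, 1000)")

print(reserve_batch(conn, "orders", 600))  # (0, 600)
print(reserve_batch(conn, "orders", 600))  # (600, 1000): capped at window end
print(reserve_batch(conn, "orders", 600))  # (1000, 1600): rolled to next window
```

In MySQL, `FOR UPDATE` makes concurrent callers serialize on the window row rather than the whole database, which is what lets thread B fall through to the next window when thread A exhausts the first.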
The Agent
The agent can be always running, or it could be a simple nightly cron job, or even a periodic manual process. Its responsibility is to create windows on each shard.
Since the current state of the system can be reconstructed on-the-fly without locking (see caveat below), we don’t have to worry about agents crashing or getting killed in the middle of their thing.
There are only two system parameters that the agent concerns itself with: the window size and the minimum number of open windows per shard. Whenever any shard has fewer than the minimum number of open windows, the agent creates a new window of `window_size` IDs on that shard.
Re-constructing the system state can be as simple as running

```sql
SELECT max(window_end_value) FROM id_sequences WHERE sequence_name = ?
```

on each shard before proceeding.
You probably also want a first pass that finds all unique sequence names:

```sql
SELECT DISTINCT(sequence_name) FROM id_sequences
```

This way, introducing a new sequence is as simple as inserting a single row into one of your shards, and the agent will propagate it elsewhere.
Then, for each shard, it queries a count of the open windows for each sequence, and inserts new windows as necessary.
No locking. No external state.
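Here's a rough sketch of one agent pass, again with in-memory SQLite connections standing in for the shards. `WINDOW_SIZE` and `MIN_OPEN_WINDOWS` are my names for the two parameters; nothing here comes from an existing framework.

```python
import sqlite3

WINDOW_SIZE = 1000        # IDs per window (assumed value)
MIN_OPEN_WINDOWS = 2      # floor of open windows per shard (assumed value)

def agent_pass(shards):
    """One stateless pass: reconstruct the high-water mark, then top up
    every shard to MIN_OPEN_WINDOWS open windows per sequence."""
    # First pass: collect every sequence name seen on any shard.
    names = set()
    for conn in shards:
        names |= {r[0] for r in conn.execute(
            "SELECT DISTINCT sequence_name FROM id_sequences")}
    for name in names:
        # Reconstruct the global high-water mark without locking.
        high = 0
        for conn in shards:
            (m,) = conn.execute(
                "SELECT COALESCE(MAX(window_end_value), 0) FROM id_sequences"
                " WHERE sequence_name = ?", (name,)).fetchone()
            high = max(high, m)
        # Create new, non-overlapping windows where a shard runs short.
        for conn in shards:
            (open_count,) = conn.execute(
                "SELECT COUNT(*) FROM id_sequences WHERE sequence_name = ?"
                " AND window_end_value > next_hi_value", (name,)).fetchone()
            for _ in range(MIN_OPEN_WINDOWS - open_count):
                conn.execute("INSERT INTO id_sequences VALUES (?, ?, ?, ?)",
                             (name, high, high + WINDOW_SIZE, high))
                conn.commit()
                high += WINDOW_SIZE

# Demo: shard 0 has one window already; shard 1 is brand new.
DDL = """CREATE TABLE id_sequences (
    sequence_name      TEXT    NOT NULL,
    window_start_value INTEGER NOT NULL,
    window_end_value   INTEGER NOT NULL,
    next_hi_value      INTEGER NOT NULL,
    PRIMARY KEY (sequence_name, window_start_value))"""
shards = [sqlite3.connect(":memory:") for _ in range(2)]
for s in shards:
    s.execute(DDL)
shards[0].execute("INSERT INTO id_sequences VALUES ('orders', 0, 1000, 0)")
shards[0].commit()

agent_pass(shards)
for s in shards:
    print(s.execute("SELECT COUNT(*) FROM id_sequences").fetchone()[0])  # 2
```

Note that committing each new window as soon as it's inserted is what makes a crash mid-pass harmless: the next pass just picks up from the new high-water mark.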
Is the Agent a SPOF?
In a sense, yes: if the server on which the agent is set to run goes down, game over. But you can run the agent from a cron job hourly, and stagger it across N servers, each running at a different hour.
I can’t envision a scenario where you’d need the agent to be continuously running and this would not suffice as a highly available design. If N-1 of your servers go down, then at most you’d go N hours without creating new windows. But your window sizes are sufficient to support a week or more of growth, yes?
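Back-of-the-envelope, with entirely made-up numbers:

```python
# How long can the system run if every agent host is down and no new
# windows get created? (All figures below are assumptions for illustration.)
ids_per_day      = 1_000_000    # assumed allocation rate on one shard
window_size      = 10_000_000   # IDs per window
min_open_windows = 2            # floor maintained by the agent
days_of_headroom = window_size * min_open_windows / ids_per_day
print(days_of_headroom)  # 20.0
```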
What about Master-Master?
Some database systems are deployed in master-master pairs. In this case, you can either stick the `id_sequences` table on only one of the masters and always hit that master, or give each master its own window. The latter is probably preferable, although it means excluding `id_sequences` from replication.
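A sketch of the window-per-master option, with SQLite standing in for the pair. In MySQL you would also keep the table out of replication, e.g. with `replicate-ignore-table=mydb.id_sequences` in each server's config (the database name is made up).

```python
import sqlite3

# Each master in the pair gets its own disjoint rows in id_sequences,
# so either one can hand out IDs while the other is down or lagging.
DDL = """CREATE TABLE id_sequences (
    sequence_name      TEXT    NOT NULL,
    window_start_value INTEGER NOT NULL,
    window_end_value   INTEGER NOT NULL,
    next_hi_value      INTEGER NOT NULL,
    PRIMARY KEY (sequence_name, window_start_value))"""

masters = [sqlite3.connect(":memory:") for _ in range(2)]
for m in masters:
    m.execute(DDL)

# Disjoint windows: master A owns [0, 1000), master B owns [1000, 2000).
masters[0].execute("INSERT INTO id_sequences VALUES ('orders', 0, 1000, 0)")
masters[1].execute("INSERT INTO id_sequences VALUES ('orders', 1000, 2000, 1000)")
for m in masters:
    m.commit()
```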
Adding a new shard
Before opening up a new shard, the agent needs an opportunity to initialize the table. Not a big deal.
Deleting a shard or taking a server out of rotation
This is the one flaw, as hinted above. Reconstructing the system state on-the-fly requires that the server with the highest `window_end_value` can be reached by the agent, and that we know which server that is.
This may require external state to work around, such as always writing the largest `window_end_value` into at least one other server.
It’s probably sufficient for the agent to simply refuse to run when any server is unavailable. If you have a shard that’s offline long enough to exhaust all of your open ID windows, you have bigger problems.
As I said, this is probably a bit over-engineered for most systems. While I have tested the behavior under MySQL, I have not had an opportunity to deploy it in a running system (I may soon, if we determine it’s not too over-engineered), and I have not heard that anybody else has employed a similar solution.
Which also means no existing frameworks support it out of the box.