May 20, 2007
A month ago, Paul Tuckfield’s (of YouTube) keynote at the MySQL Conference & Expo received a lot of attention, mainly for outlining a parallelized prefetching strategy for MySQL replication.
The basic strategy is that, prior to running updates from the binlog, you instead turn the updates into selects and parallelize them. This brings the necessary data into buffer, which speeds up the (serialized) updates. Paul called this the “oracle algorithm,” so-called because it lets the replication thread see into the future and prime its cache for the upcoming data update. Jan Lehnardt gave a better run-down of the reasoning.
Earlier this week, Farhan Mashraqi implemented the same MySQL replication strategy. Farhan is the DBA at Fotolog, whose MySQL conference presentation I blogged about earlier
Paul claimed a 400% improvement in replication performance at YouTube via this method. I’m most interested in Farhan’s results. I’m skeptical that you’d get that much improvement if your writes lean more heavily toward inserts than updates, but the beauty is that it’s an improvement that can be done completely outside the normal MySQL replication, and does not significantly affect your slave’s ability to respond to end user requests.
I’m wondering, though, if it’s possible to really parallelize MySQL replication given the emerging trends in data partitioning. But, first, let’s back up.
When replicating queries (insert/update/delete/alter table/drop table/etc.) from the master to the slave, all your statements get serialized in the order in which they were executed by the MySQL master. Yet, when you’re performing the original queries, they’re parallelized–multiple application servers are executing queries simultaneously. Depending on your hardware (and, especially, disk) configuration, you’re achieving varying levels of real concurrency at the database level. This is why replication on the slaves often falls behind the master–it’s not just, or even primarily, a data transfer issue. (Update: obviously, another factor is that slaves are also accepting end user selects, where masters often do not.)
So why serialize the statements when they’re replicated? The reason is determinancy. If inserts/deletes/updates are getting executed in a different order on the slaves than on the master, you could end up with inconsistent data.
For example, you might upload a photo and then delete it. But if the slave executes them in the opposite order, then you could end up deleting the photo before you uploaded it–effectively the same as not deleting it at all! (Or, in some cases, stopping the replication thread completely. D’oh!)
In a typical by-the-book RDBMS, the same applies to transactions. For example, you might own a site that allows logged out visitors to sign up for a new account on the same page that they upload their first photo. On the server, you’re going to create the user account first, and then create the photo and store the pixel data.
But in many modern high-volume Web applications, this already doesn’t hold. Often, your users table and your photos table are going to be on different servers, and you’re using MySQL. So what does this mean? This means that in order to eek out the last bit of performance, you’re already accepting that the slaves may be inconsistent. The photo might be created first, and the user may get replicated seconds, minutes or even hours later. Depending on how resilient your code is, a friend viewing your photo may see missing parts of the page, or a bunch of “null”s displayed, or an error page.
Increasingly, high traffic–and, these days, even moderate and low traffic–sites are converging on data partitioning as the optimal solution. Specifically, a particular kind of data partitioning that we may call sharding. That is, splitting a single table (say, your photos table) into multiple tables spread across multiple physical servers, each of which is responsible for only a (usually non-overlapping) segment of your data.
As I observed earlier, a lot of people leave it at that, and assume that there is a 1:1 correspondence between the number of shards and the number of physical servers. However, there are already both scaling and performance advantages to splitting your table into more shards than you have physical servers.
So, let’s assume you’ve done just that: maybe you have 16 masters, each serving 16 shards of your data. And each of those 16 masters is seeing a lot of writes, such that their slaves often run several minutes behind during peak traffic.
What’s really most important for data consistency in this kind of environment is that statements that affect the same pieces of data get serialized. Then in a non-hierarchical, sharded table, you can accomplish that by simply serializing the writes to each table. See where I’m going?
What I propose is configuring a small number of binlogs, and mapping each table in your database to those binlogs. In the above example, each master might put 4 shards into each of 4 binlogs. Each slave then runs 4 replication threads corresponding to each binlog. Voila!, better concurrency!
Of course, if you’re only running with a single disk in your servers, nothing is going to help you anyway. Or if you have a very, very high slave-to-master ratio, you have to be careful (remember, this quadruples the number of connections to your master).
My original thought on this took it a step further–that is, serializing based on a hash of the primary key (or row number), but that has issues, is more complicated, and requires too many assumptions about how the application is behaving. What I like about this proposal is that your application need behave no different than it already does to accomplish its partitioning strategy; it’s easy to configure and reconfigure; and it is as easy to scale down as it is to scale up. Oh, and it’s more cost effective than simply adding more masters with their own slaves.
Has anybody tried this before? Can anybody see fundamental flaws?