May 20, 2007
A month ago, Paul Tuckfield’s (of YouTube) keynote at the MySQL Conference & Expo received a lot of attention, mainly for outlining a parallelized prefetching strategy for MySQL replication.
The basic strategy is that, prior to running updates from the binlog, you instead turn the updates into selects and parallelize them. This brings the necessary data into buffer, which speeds up the (serialized) updates. Paul called this the “oracle algorithm,” so-called because it lets the replication thread see into the future and prime its cache for the upcoming data update. Jan Lehnardt gave a better run-down of the reasoning.
Earlier this week, Farhan Mashraqi implemented the same MySQL replication strategy. Farhan is the DBA at Fotolog, whose MySQL conference presentation I blogged about earlier
Paul claimed a 400% improvement in replication performance at YouTube via this method. I’m most interested in Farhan’s results. I’m skeptical that you’d get that much improvement if your writes lean more heavily toward inserts than updates, but the beauty is that it’s an improvement that can be done completely outside the normal MySQL replication, and does not significantly affect your slave’s ability to respond to end user requests.
I’m wondering, though, if it’s possible to really parallelize MySQL replication given the emerging trends in data partitioning. But, first, let’s back up.
When replicating queries (insert/update/delete/alter table/drop table/etc.) from the master to the slave, all your statements get serialized in the order in which they were executed by the MySQL master. Yet, when you’re performing the original queries, they’re parallelized–multiple application servers are executing queries simultaneously. Depending on your hardware (and, especially, disk) configuration, you’re achieving varying levels of real concurrency at the database level. This is why replication on the slaves often falls behind the master–it’s not just, or even primarily, a data transfer issue. (Update: obviously, another factor is that slaves are also accepting end user selects, where masters often do not.)
So why serialize the statements when they’re replicated? The reason is determinancy. If inserts/deletes/updates are getting executed in a different order on the slaves than on the master, you could end up with inconsistent data.
For example, you might upload a photo and then delete it. But if the slave executes them in the opposite order, then you could end up deleting the photo before you uploaded it–effectively the same as not deleting it at all! (Or, in some cases, stopping the replication thread completely. D’oh!)
In a typical by-the-book RDBMS, the same applies to transactions. For example, you might own a site that allows logged out visitors to sign up for a new account on the same page that they upload their first photo. On the server, you’re going to create the user account first, and then create the photo and store the pixel data.
But in many modern high-volume Web applications, this already doesn’t hold. Often, your users table and your photos table are going to be on different servers, and you’re using MySQL. So what does this mean? This means that in order to eek out the last bit of performance, you’re already accepting that the slaves may be inconsistent. The photo might be created first, and the user may get replicated seconds, minutes or even hours later. Depending on how resilient your code is, a friend viewing your photo may see missing parts of the page, or a bunch of “null”s displayed, or an error page.
Increasingly, high traffic–and, these days, even moderate and low traffic–sites are converging on data partitioning as the optimal solution. Specifically, a particular kind of data partitioning that we may call sharding. That is, splitting a single table (say, your photos table) into multiple tables spread across multiple physical servers, each of which is responsible for only a (usually non-overlapping) segment of your data.
As I observed earlier, a lot of people leave it at that, and assume that there is a 1:1 correspondence between the number of shards and the number of physical servers. However, there are already both scaling and performance advantages to splitting your table into more shards than you have physical servers.
So, let’s assume you’ve done just that: maybe you have 16 masters, each serving 16 shards of your data. And each of those 16 masters is seeing a lot of writes, such that their slaves often run several minutes behind during peak traffic.
What’s really most important for data consistency in this kind of environment is that statements that affect the same pieces of data get serialized. Then in a non-hierarchical, sharded table, you can accomplish that by simply serializing the writes to each table. See where I’m going?
What I propose is configuring a small number of binlogs, and mapping each table in your database to those binlogs. In the above example, each master might put 4 shards into each of 4 binlogs. Each slave then runs 4 replication threads corresponding to each binlog. Voila!, better concurrency!
Of course, if you’re only running with a single disk in your servers, nothing is going to help you anyway. Or if you have a very, very high slave-to-master ratio, you have to be careful (remember, this quadruples the number of connections to your master).
My original thought on this took it a step further–that is, serializing based on a hash of the primary key (or row number), but that has issues, is more complicated, and requires too many assumptions about how the application is behaving. What I like about this proposal is that your application need behave no different than it already does to accomplish its partitioning strategy; it’s easy to configure and reconfigure; and it is as easy to scale down as it is to scale up. Oh, and it’s more cost effective than simply adding more masters with their own slaves.
Has anybody tried this before? Can anybody see fundamental flaws?
April 24, 2007
For Monday’s afternoon “tutorial” session (yes, I’m behind, so what?), I attended Wikipedia: Site Internals, Configuration and Code Examples, and Management Issues, presented by Domas Mituzas.
I have to say that my main interest at this conference is more on what’s being actively deployed and improved on in high-traffic production systems. Scalability is an area where theory interests me less than war stories.
Wikipedia’s story reminds me a lot of mailinator’s story. That is, Domas repeatedly emphasized that Wikipedia is free, is run mostly by volunteers, has no shareholders, and nobody’s going to get fired if the site goes down. Which means they can take some shortcuts and simplify their maintenance tasks with the right architectural designs, which may not scale as well as they’d like, but work anyway.
There were a lot of details here. Maybe too many. Any discussion is going to leave out at least a dozen interesting things. Here’s what I found interesting.
Data: 110 million revisions over 8 million pages. 26 million watch lists. So, not as large as Webshots, Flickr, Facebook, Photobucket, etc.
They utilize several layers of caching, from multiple Squid caches to app-level caches that reside on local disk. They also use UDP-based cache invalidation, and, in keeping with the theme, don’t care much if a few packets are dropped.
Their databases sit behind an LVS load balancer, which will take slaves out of service if replication falls behind. If all slaves are behind, the site is put into read-only mode.
Logged in users always bypass the Squid caches. Anonymous users who edit a page get a cookie set that also bypasses the Squid caches.
There was some discussion that page edits wait to ensure the slaves are caught up, but two direct questions from my colleague were sidestepped. So, my best guess is that either they’re relying on their load balancer’s slave status check, or they’re writing a sentinel value into another table within the same database then selecting that sentinel first thing after getting a connection to a slave.
They never issue
ORDER BY clauses at the SQL level, even when paginating results. Instead, they rely on the natural ordering of their indexes and issue something akin to
WHERE id > ? LIMIT ?. I don’t know how they handle jumping straight to the 500th page, but it seems a reasonable performance adjustment for many queries in the context of their application.
They’re still running MySQL 4.0.x, have no problems, and don’t plan to upgrade anytime soon.
I didn’t quite grasp their partitioning strategy. The 29-page book of notes he provided discusses various partitioning strategies more hypothetically, and more in terms of distributing reads or intensive tasks with indexes that reside on a subset of slaves.
Finally, revision histories are not stored in their own records, but are stored as compressed blobs, with each revision concatenated together uncompressed, then compressed. Makes a lot of sense to me.
My feeling is that, underneath, Wikipedia’s architecture strikes me as a bit overly complex for their size, as something that’s grown incrementally without the requisite resources to trim down some of the complexities. So, while their philosophy is: “simple, simple, simple, who cares if we’re down a few hours?” there still remains some cruft and relics of prior architectural decisions that they wouldn’t choose again if they were starting over. Which is great. It means they’re human after all.
April 23, 2007
Today was the first day of the MySQL Conference & Expo in Santa Clara. This first day consisted of two extended (3.5 hour) “tutorial” sessions, with your choice of seven topics during each session.
Interesting audience metric: most of the audience members considered their applications to be in their “late teens,” that is, the awkward time where audience size has grown and you’re seriously in need of a new architecture to scale to the next level.
I wouldn’t have expected so many to self-identify with that camp, although, as Jeremy pointed out, the whole interaction-everywhere UI of “Web 2.0” means a lot more sites are in that position than would be otherwise.
Jumping right to the “takeaway” slide: their recommendation to achieve both scalability and high availability is data partitioning (using HiveDB) across multiple tiers, coupled with a master-master replication topology using the IP takeover strategy to avoid writing to both masters in a tier simultaneously. (That sounds like a run-on sentence, doesn’t it?)
I’d like to focus on a couple of observatons/critiques.
The audience expressed concerns about how much of scalability was people/process management, and how early in the lifecycle should you bite the bullet and commit to a really scalable architecture. Neither concern was really addressed in any depth, which, given the audience mix noted above, probably wasn’t important. It would have been more important in a room full of two-month old startups.
When Jeremy talked about partitioning, he referred to distributing data with the same schema across multiple physical servers. This needed clarification several times. He also totally discounted the value (for either raw performance or scalability) of distributing across multiple tables on the same server.
It’s pretty trivial to show that, performance wise, a MyISAM-based table with a lot of writes benefits greatly from splitting into multiple tables on the same server. Since MyISAM tables only support table-level locks, the more tables you have, the less contention.
I also suspect–though I’m not enough of an expert on InnoDB to prove it at this point–that smaller tables will be faster with InnoDB, especially in applications where you have a lot of inserts that require rearranging by primary key. (InnoDB tables are index-ordered; if you’re not using auto_increment, you’re probably rearranging data on inserts.) Not to mention it’s more likely that blocks that make it into the file system cache incidentally will actually be used later.
Even barring performance gains, though, it seems a good recommendation from a scaling perspective. One of the things Jeremy spent some time on is the difficulty of repartitioning as you add more servers. But such growth is made much easier if you start off with many independent tables, and move them onto new servers as necessary.
There were two interesting audience-suggested partitioning strategies. A woman from Travelocity suggested geography-based paritioning with enough authority to make me think they’re doing that in production. Then, a guy from Technorati suggested date-based partitioning, providing an example of calculating the authority of a blog. While this is usually an archiving strategy, it feels perfectly natural in their application for on-line data.
Another possible oversight was the suggestion that critical reads should always hit the master to avoid stale data. The downside, of course, is that critical reads can swamp the master and take the master down–a far worse situation than merely serving stale data. There are strategies for dealing with replication latency outside of the database (session affinity, copying new data into a local cache, etc.), but they were not touched on in this presentation. That makes me wonder how often such strategies are deployed in the (webified) real world.
Finally, he spent a lot of time on discussing various replication topologies. More time than I would have liked, but I think it served a purpose: to drive home the fact that replication never provides gains for writes. Something that’s easy to overlook if you’re new to replication.
For a good overview of the presentation’s structure and progession, I direct you to Ronald Bradford’s excellent session notes.