September 13, 2008
In a previous post, I lamented that the “hi-lo” ID generation strategy has one big wart: it’s a single point of failure.
After thinking about it a bit, though, it occurred to me that we can eliminate the SPOF without limiting our ability to add more shards later. And it’s quite easy–just more sophisticated than you typically need.
WARNING: I consider this design to be a bit over-engineered. You probably don’t need to eliminate the SPOF. But, if ID generation is the only critical SPOF in your system, and UUIDs aren’t practical for your purpose, it may be worth going this route.
The basis of this lies in expanding the fields in our `id_sequences` table, reproducing the table in each shard, and introducing a stateless agent that's always running in the background to maintain the various `id_sequences` tables across all our shards.
```sql
CREATE TABLE id_sequences (
  sequence_name       varchar(255) not null,
  window_start_value  bigint(19)   not null,
  window_end_value    bigint(19)   not null,
  next_hi_value       bigint(19)   not null,
  PRIMARY KEY (sequence_name, window_start_value),
  KEY idx_find_window (sequence_name, window_end_value, next_hi_value, window_start_value)
);
```
The key change is the introduction of `window_start_value` and `window_end_value`, which together define an ID window from which threads can reserve IDs on each shard. Each shard will have multiple open windows, but only one is used at a time. A window is open if `next_hi_value < window_end_value`. Windows are created (and pruned, if necessary) by the agent, more on which later.
Application Hi-Lo Strategy
As in the standard hi-lo strategy, the in-memory buffer works as normal. The difference is in the select-and-increment operation.
When a shard has exhausted its in-memory buffer, we reserve another batch with the following sequence of steps:
Step 1. Begin a transaction.
Step 2. Query the database for the first open window:

```sql
SELECT *
  FROM id_sequences
 WHERE sequence_name = ?
   AND window_end_value > next_hi_value
 ORDER BY window_start_value
 LIMIT 1
 FOR UPDATE
```

Step 3. Increment the max reserved ID by our buffer size, but do not exceed `window_end_value`.
Step 4. Update the window with the new `next_hi_value`.
Step 5. Commit.
This is guaranteed to always return a single open window (unless there are no open windows). Multiple threads trying to fetch the open window at the same time will not conflict. If threads A and B arrive simultaneously, and thread A exhausts the first open window, thread B will simply use the next window.
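To make the steps concrete, here's a minimal sketch of the whole reservation transaction in MySQL, assuming a sequence named 'photos' and a buffer size of 1,000 (both illustrative). The session variables only exist to carry the reserved range back out:

```sql
START TRANSACTION;

-- Step 2: find and lock the first open window for this sequence
SELECT window_start_value, next_hi_value, window_end_value
  INTO @win_start, @next_hi, @win_end
  FROM id_sequences
 WHERE sequence_name = 'photos'
   AND window_end_value > next_hi_value
 ORDER BY window_start_value
 LIMIT 1
 FOR UPDATE;

-- Steps 3 and 4: advance next_hi_value by the buffer size,
-- but never past the end of the window
UPDATE id_sequences
   SET next_hi_value = LEAST(next_hi_value + 1000, window_end_value)
 WHERE sequence_name = 'photos'
   AND window_start_value = @win_start;

-- Step 5
COMMIT;

-- the application now owns the half-open ID range
-- [@next_hi, LEAST(@next_hi + 1000, @win_end))
```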
The Agent

The agent can be always running, or it could be a simple nightly cron job, or even a periodic manual process. Its responsibility is to create windows on each shard.
Since the current state of the system can be reconstructed on-the-fly without locking (see caveat below), we don't have to worry about agents crashing or getting killed mid-run.
There are only two system parameters that the agent concerns itself with: the `window_size`, and the minimum number of open windows per shard. Whenever any shard has fewer than the minimum number of open windows, the agent creates a new window of `window_size` IDs on that shard.
Re-constructing the system state can be as simple as running

```sql
SELECT max(window_end_value) FROM id_sequences WHERE sequence_name = ?
```

on each shard before proceeding.
You probably also want a first pass that finds all unique sequence names:

```sql
SELECT DISTINCT(sequence_name) FROM id_sequences
```

That way, introducing a new sequence is as simple as inserting a single row into one of your shards, and the agent will propagate it elsewhere.
Then, for each shard, it queries a count of the open windows for each sequence, and inserts new windows as necessary.
No locking. No external state.
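In SQL terms, the agent's pass might look like the following sketch. The sequence name and `@window_size` are illustrative, and the loop over shards and sequences lives in whatever language the agent is written in:

```sql
-- across all shards: find the global high-water mark for this sequence
SELECT COALESCE(MAX(window_end_value), 0)
  FROM id_sequences
 WHERE sequence_name = 'photos';

-- on each shard: how many windows are still open?
SELECT COUNT(*)
  FROM id_sequences
 WHERE sequence_name = 'photos'
   AND window_end_value > next_hi_value;

-- for each window a shard is short, insert one starting at the high-water
-- mark, then bump the mark by @window_size before the next insert
INSERT INTO id_sequences
       (sequence_name, window_start_value, window_end_value, next_hi_value)
VALUES ('photos', @high_water, @high_water + @window_size, @high_water);
```

Note that a new window is born open, since its `next_hi_value` equals its `window_start_value`, which is less than its `window_end_value`.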
Is the Agent a SPOF?
That's true: if the server on which the agent is set to run goes down, game over. But you can run the agent from an hourly cron job, staggered across N servers, each taking a different hour.
I can’t envision a scenario where you’d need the agent to be continuously running and this would not suffice as a highly available design. If N-1 of your servers go down, then at most you’d go N hours without creating new windows. But your window sizes are sufficient to support a week or more of growth, yes?
What about Master-Master?
Some database systems are deployed in master-master pairs. In this case, you can either stick the `id_sequences` table on only one of the masters and always hit that master, or give each master its own window. The latter is probably preferable, although it means excluding `id_sequences` from replication.
Adding a new shard
Before opening up a new shard, the agent needs an opportunity to initialize the table. Not a big deal.
Deleting a shard or taking a server out of rotation
This is the one flaw, as hinted above. Reconstructing the system state on-the-fly requires that the server with the highest `window_end_value` can be reached by the agent, and that we know which server that is. Working around this may require external state, such as always writing the largest `window_end_value` into at least one other server.
It’s probably sufficient for the agent to simply refuse to run when any server is unavailable. If you have a shard that’s offline long enough to exhaust all of your open ID windows, you have bigger problems.
As I said, this is probably a bit over-engineered for most systems. While I have tested the behavior under MySQL, I have not had an opportunity to deploy it in a running system (I may soon, if we determine it’s not too over-engineered), and I have not heard that anybody else has employed a similar solution.
Which also means no existing frameworks support it out of the box.
May 20, 2007
A month ago, the keynote by Paul Tuckfield (of YouTube) at the MySQL Conference & Expo received a lot of attention, mainly for outlining a parallelized prefetching strategy for MySQL replication.
The basic strategy is that, prior to running updates from the binlog, you first turn the updates into selects and parallelize them. This brings the necessary data into the buffer pool, which speeds up the (serialized) updates. Paul called this the "oracle algorithm" because it lets the replication thread see into the future and prime its cache for the upcoming data updates. Jan Lehnardt gave a better run-down of the reasoning.
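As a quick illustration (table and values invented), the prefetcher rewrites an upcoming binlog statement like this:

```sql
-- statement waiting in the relay log:
UPDATE photos SET view_count = view_count + 1 WHERE photo_id = 12345;

-- what a prefetch thread issues ahead of the (serialized) SQL thread,
-- pulling the row into the buffer pool so the real update won't stall on disk:
SELECT * FROM photos WHERE photo_id = 12345;
```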
Earlier this week, Farhan Mashraqi implemented the same MySQL replication strategy. Farhan is the DBA at Fotolog, whose MySQL conference presentation I blogged about earlier.
Paul claimed a 400% improvement in replication performance at YouTube via this method. I’m most interested in Farhan’s results. I’m skeptical that you’d get that much improvement if your writes lean more heavily toward inserts than updates, but the beauty is that it’s an improvement that can be done completely outside the normal MySQL replication, and does not significantly affect your slave’s ability to respond to end user requests.
I’m wondering, though, if it’s possible to really parallelize MySQL replication given the emerging trends in data partitioning. But, first, let’s back up.
When replicating queries (insert/update/delete/alter table/drop table/etc.) from the master to the slave, all your statements get serialized in the order in which they were executed by the MySQL master. Yet, when you're performing the original queries, they're parallelized: multiple application servers are executing queries simultaneously. Depending on your hardware (and, especially, disk) configuration, you're achieving varying levels of real concurrency at the database level. This is why replication on the slaves often falls behind the master; it's not just, or even primarily, a data transfer issue. (Update: obviously, another factor is that slaves are also accepting end user selects, whereas masters often are not.)
So why serialize the statements when they're replicated? The reason is determinism. If inserts/deletes/updates are getting executed in a different order on the slaves than on the master, you could end up with inconsistent data.
For example, you might upload a photo and then delete it. But if the slave executes them in the opposite order, then you could end up deleting the photo before you uploaded it–effectively the same as not deleting it at all! (Or, in some cases, stopping the replication thread completely. D’oh!)
In a typical by-the-book RDBMS, the same applies to transactions. For example, you might own a site that allows logged out visitors to sign up for a new account on the same page that they upload their first photo. On the server, you’re going to create the user account first, and then create the photo and store the pixel data.
But in many modern high-volume Web applications, this already doesn't hold. Often, your users table and your photos table are going to be on different servers, and you're using MySQL. So what does this mean? This means that in order to eke out the last bit of performance, you're already accepting that the slaves may be inconsistent. The photo might be created first, and the user may get replicated seconds, minutes or even hours later. Depending on how resilient your code is, a friend viewing your photo may see missing parts of the page, or a bunch of "null"s displayed, or an error page.
Increasingly, high-traffic sites (and, these days, even moderate- and low-traffic ones) are converging on data partitioning as the optimal solution. Specifically, a particular kind of data partitioning that we may call sharding. That is, splitting a single table (say, your photos table) into multiple tables spread across multiple physical servers, each of which is responsible for only a (usually non-overlapping) segment of your data.
As I observed earlier, a lot of people leave it at that, and assume that there is a 1:1 correspondence between the number of shards and the number of physical servers. However, there are already both scaling and performance advantages to splitting your table into more shards than you have physical servers.
So, let’s assume you’ve done just that: maybe you have 16 masters, each serving 16 shards of your data. And each of those 16 masters is seeing a lot of writes, such that their slaves often run several minutes behind during peak traffic.
What's really most important for data consistency in this kind of environment is that statements that affect the same pieces of data get serialized. And in a non-hierarchical, sharded table, you can accomplish that by simply serializing the writes to each individual table. See where I'm going?
What I propose is configuring a small number of binlogs and mapping each table in your database to one of those binlogs. In the above example, each master might put 4 shards into each of 4 binlogs. Each slave then runs 4 replication threads, one per binlog. Voilà: better concurrency!
Of course, if you’re only running with a single disk in your servers, nothing is going to help you anyway. Or if you have a very, very high slave-to-master ratio, you have to be careful (remember, this quadruples the number of connections to your master).
My original thought on this took it a step further, serializing based on a hash of the primary key (or row number), but that has issues, is more complicated, and requires too many assumptions about how the application is behaving. What I like about this proposal is that your application needs to behave no differently than it already does to accomplish its partitioning strategy; it's easy to configure and reconfigure; and it is as easy to scale down as it is to scale up. Oh, and it's more cost effective than simply adding more masters with their own slaves.
Has anybody tried this before? Can anybody see fundamental flaws?
April 29, 2007
John Engates, CTO of Rackspace, presented his experiences on The 7 Stages of Scaling Web Applications: Strategies for Architects.
This was a bit less interesting than I'd hoped, mainly because there were a lot of generalities and few specifics. One thing that the CTO of Rackspace brings, of course, is experience working with many different organizations as they grow. Obviously, he wasn't in a position to give specifics of any given organization's growing pains.
He did provide a good sanity check and general road map to make sure your application is evolving correctly. If you find yourself deviating significantly, you should pause to reflect on the reasons.
John gave the best definitions of “high availability” and “scalability” that I saw at the conference. Namely:
- high availability: a design and implementation that ensures a certain degree of operational continuity
- scalability: a desirable property of a system which indicates its ability to either handle growing amounts of work in a graceful manner, or to be readily enlarged as demands increase
I’m kind of blogged out at the moment. Here are the 7 stages, in brief:
- 2-tier; pair of web servers, single database, internal storage, low operational cost, low complexity
- more of same, just bigger; maybe shared storage across multiple databases
- exponential traffic increase from publicity; more web servers, simple master-slave replication topology, split reads and writes, some application retooling
- intensified pain; replication latency, too many writes from single master, segmenting data by features, shared storage for application data, big-time application rearchitecting
- panicky, severe pain; rethinking the whole application; data partitioning (by geographical, user id, etc.), user clustering with all features available on each cluster, directory-based mapping of users to clusters
- less pain; finally adding new features again, horizontally scalable with more hardware, acceptable performance
- "entering the unknown"; investigating potential bottlenecks in firewalls, load balancers, network, storage, processes, backup/recovery; thinking about moving beyond a single datacenter; still difficult to replicate and load balance geographically
Most of these should sound familiar to many readers. Some of us do it a bit backwards (for example, eliminating network bottlenecks before application or database bottlenecks), and the smart ones focus on bottlenecks 12 months before they’re the limiting factor.
Among his recommendations:
- leverage existing technologies and platforms
- favor horizontal scaling over vertical scaling
- shared nothing architectures are common for a reason
- develop sufficient useful instrumentation
- don’t overoptimize, but do load test
- RAM is more important than CPU
- consider each feature in the context of its performance and scaling implications
April 28, 2007
Farhan “Frank Mash” Mashraqi, DBA at Fotolog, gave a nice little presentation titled Scaling the world’s largest photo blogging community.
Fotolog is a bit different from most sites presenting at this year’s conference. Because it’s based on the idea of photo “blogging,” free members are limited to posting (uploading) one photo per day, while paid members are limited to posting 6 photos per day.
He presented an Alexa graph showing Fotolog recently surpassing Flickr in pageviews. This was really surprising to me and made me take notice (I wasn’t the only one). However, later I looked at the same data in compete.com, which shows an utterly different picture.
- 2.4 billion comments
- 228 million photos (unsure whether they’re counting the “Flickr way” or the normal way)
- 500,000 uploads per day peak (probably 200-300K unique members uploading)
- average 24 minutes per visit (high for a web site)
- running Solaris 10
- converting from PHP to Java (no motivation given)
- 40 x 3GB memcached instances
- 32 MySQL servers, segmented into 4 tiers (user, guestbook, photo, friends+faves)
- recently converted from MyISAM to InnoDB
- using 3par servers for image store
When they converted to InnoDB, they found they still had table-lock contention. Why? Because they were using auto_increment to generate their IDs. To get around this, they changed their primary key to be a composite of existing fields, which, additionally, represents the way the data is commonly queried.
For example, their comments use a (photo identifier, post date, comment identifier) composite for their primary key. Since they usually show comments for a given photo ordered by date, that can be done entirely through a primary key lookup, which, with InnoDB, is much faster than even a secondary key lookup.
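In DDL, that comments table might look roughly like this (column names and types are my guesses, not Fotolog's actual schema):

```sql
CREATE TABLE comments (
  photo_id   bigint   not null,
  posted_at  datetime not null,
  comment_id bigint   not null,  -- disambiguates comments posted at the same time
  body       text     not null,
  PRIMARY KEY (photo_id, posted_at, comment_id)
) ENGINE=InnoDB;

-- the common query walks the clustered primary key; no secondary index needed
SELECT body
  FROM comments
 WHERE photo_id = ?
 ORDER BY posted_at
 LIMIT 25;
```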
One thing not discussed is whether the photo identifier in that case is ordered, or how often “random inserts” happen. This is important because of InnoDB’s clustered primary key, which sorts row data in the same order as the primary key. I think he kind of touched on this from a different direction when he digressed a bit to explain how InnoDB’s primary keys are stored and the implications for secondary keys.
I was impressed by some of the benchmarking graphs he produced. He claimed a 30% performance improvement by disabling MySQL’s query cache, and a similar improvement (I think – wasn’t quite sure about this in my notes) by moving from 4GB to 16GB RAM.
Currently, their data is partitioned by the first letter of the username. This, of course, is quite skewed toward certain letters, and results in under-utilization for some instances. It wasn’t clear how these map to physical servers.
The latter part of his presentation focused on the driving factors behind planning their new architecture, wherein he proposed partitioning by date. There seemed to be confusion here, as the lines between “current implementation” and “proposed implementation” were blurred. That may have been cleared up in the Q&A, but I had co-workers tapping on my shoulder and had to leave. 😦
April 25, 2007
Technology at Digg, presented by Eli White and Tim Ellis on Tuesday.
98% reads. 30GB data. Running MySQL 5.0 on Debian across 20 databases, with another ~80 app servers. InnoDB for real-time data, MyISAM for OLAP purposes.
The big thing at Digg has been effective use of memcached, particularly in failure scenarios. The problem: you start with 10 memcache daemons running. One of them goes down, often, they say, because the daemon is doing internal memory management and simply is non-responsive for a bit. So your client starts putting those buckets of data onto another instance. Then…the original instance comes back. The client starts reading data from the original instance, which means potential for reading stale data.
They didn’t give many details about how they solved this problem. One solution given toward the end is to store the daemon server’s name with the key and store that information in a database. When the memcache daemon comes back up, the key names don’t match, so you invalidate it. This requires not only a highly available MySQL database to work, but it also requires two network accesses per data fetch in the best case.
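As described, that fix might look something like the following (table and column names are mine; Digg didn't share specifics):

```sql
-- maps each cache key to the memcached instance that last stored it
CREATE TABLE cache_key_locations (
  cache_key    varchar(250) not null,  -- memcached keys max out at 250 bytes
  server_name  varchar(64)  not null,
  PRIMARY KEY (cache_key)
);

-- on every read, fetch where the key is *supposed* to live;
-- if the daemon that answered isn't server_name, treat the value as stale
SELECT server_name FROM cache_key_locations WHERE cache_key = ?;

-- on every write, record which daemon the client hashed the key to
REPLACE INTO cache_key_locations (cache_key, server_name) VALUES (?, ?);
```

That extra SELECT is exactly the second network access per fetch mentioned above.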
One interesting thing is they’re running their memcached instances on their DB slaves. It sounds like this developed simply because their MySQL servers have more RAM (4GB) than their web servers. I wasn’t the only one a little concerned by this, and I wonder if part of their problem with unresponsive memcache daemons stems from this.
They’ve had an initiative underway for a year to partition their data, which hasn’t been implemented yet. Once again, there was terminology confusion. At Digg, a “shard” refers to a physical MySQL server (node), and “partition” refers to a table on the server. Prior discussions at the conference used opposite definitions. I suspect the community will come to a consensus pretty soon (more on that in a later post).
There was a brief audience digression into the difference between horizontal partitioning (scaling across servers) and vertical partitioning (multiple tables on the same server), which is closer to what partitioning connotes in the Oracle world.
- developers are pushing back hard against partitioning, partly, it sounds, because it fudges up their query joins. No mention of MySQL’s inefficient CPU/memory usage on joins.
- struggling with optimizing bad I/O-bound queries
- issuing a lot of `select * from ...` queries, which causes problems with certain kinds of schema changes that leave outdated fields in their wake
- had issues with filesystem reporting writes were synced to disk when they hadn’t really been synced; lots of testing/fudging with parameters; wrote diskTest.pl to assist with testing
- image filers running xfs because ext3 “doesn’t work at all” for that purpose?! not substantiated by any data; unclear whether they’re talking about storing the images or serving them, or what their image serving architecture is (Squid proxy?)
- memcached serves as a write-through cache for submitted stories, which hides any replication delay for the user submitting the story
April 24, 2007
For Monday’s afternoon “tutorial” session (yes, I’m behind, so what?), I attended Wikipedia: Site Internals, Configuration and Code Examples, and Management Issues, presented by Domas Mituzas.
I have to say that my main interest at this conference is more on what’s being actively deployed and improved on in high-traffic production systems. Scalability is an area where theory interests me less than war stories.
Wikipedia’s story reminds me a lot of mailinator’s story. That is, Domas repeatedly emphasized that Wikipedia is free, is run mostly by volunteers, has no shareholders, and nobody’s going to get fired if the site goes down. Which means they can take some shortcuts and simplify their maintenance tasks with the right architectural designs, which may not scale as well as they’d like, but work anyway.
There were a lot of details here. Maybe too many. Any discussion is going to leave out at least a dozen interesting things. Here’s what I found interesting.
Data: 110 million revisions over 8 million pages. 26 million watch lists. So, not as large as Webshots, Flickr, Facebook, Photobucket, etc.
They utilize several layers of caching, from multiple Squid caches to app-level caches that reside on local disk. They also use UDP-based cache invalidation, and, in keeping with the theme, don’t care much if a few packets are dropped.
Their databases sit behind an LVS load balancer, which will take slaves out of service if replication falls behind. If all slaves are behind, the site is put into read-only mode.
Logged in users always bypass the Squid caches. Anonymous users who edit a page get a cookie set that also bypasses the Squid caches.
There was some discussion that page edits wait to ensure the slaves are caught up, but two direct questions from my colleague were sidestepped. So, my best guess is that either they’re relying on their load balancer’s slave status check, or they’re writing a sentinel value into another table within the same database then selecting that sentinel first thing after getting a connection to a slave.
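The second guess, the sentinel, might look something like this (entirely my speculation; the table and columns are made up):

```sql
-- on the master, in the same transaction as the page edit:
REPLACE INTO replication_sentinel (page_id, last_edit_seq)
VALUES (?, ?);

-- first thing after checking out a slave connection:
SELECT last_edit_seq FROM replication_sentinel WHERE page_id = ?;
-- if this comes back older than the edit just made,
-- the slave hasn't caught up yet
```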
They never issue `ORDER BY` clauses at the SQL level, even when paginating results. Instead, they rely on the natural ordering of their indexes and issue something akin to `WHERE id > ? LIMIT ?`. I don't know how they handle jumping straight to the 500th page, but it seems a reasonable performance adjustment for many queries in the context of their application.
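In other words, paginating forward looks something like this (table and column names are hypothetical), with the application passing in the last id from the previous page:

```sql
-- page N+1: everything after the last row the user saw;
-- rows come back in index order, so no ORDER BY is issued
SELECT *
  FROM revisions
 WHERE page_id = ?   -- leftmost column of the index
   AND rev_id > ?    -- last rev_id from the previous page
 LIMIT 50;
```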
They’re still running MySQL 4.0.x, have no problems, and don’t plan to upgrade anytime soon.
I didn’t quite grasp their partitioning strategy. The 29-page book of notes he provided discusses various partitioning strategies more hypothetically, and more in terms of distributing reads or intensive tasks with indexes that reside on a subset of slaves.
Finally, revision histories are not stored in their own records. Instead, consecutive revisions are concatenated together uncompressed, then compressed and stored as a single blob. Makes a lot of sense to me.
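A hedged sketch of the shape of such a table (the names and chunking scheme are my invention, not Wikipedia's schema):

```sql
CREATE TABLE revision_blobs (
  page_id   bigint not null,
  chunk_no  int    not null,       -- the Nth batch of revisions for this page
  -- e.g. gzip(rev_1 || rev_2 || ... || rev_k), compressed by the application;
  -- consecutive revisions differ very little, so they compress extremely well
  revisions mediumblob not null,
  PRIMARY KEY (page_id, chunk_no)
);
```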
Underneath, Wikipedia's architecture strikes me as a bit overly complex for their size: something that's grown incrementally without the requisite resources to trim down some of the complexities. So, while their philosophy is "simple, simple, simple, who cares if we're down a few hours?", there still remain some cruft and relics of prior architectural decisions that they wouldn't choose again if they were starting over. Which is great. It means they're human after all.
April 23, 2007
Today was the first day of the MySQL Conference & Expo in Santa Clara. This first day consisted of two extended (3.5 hour) “tutorial” sessions, with your choice of seven topics during each session.
Interesting audience metric: most of the audience members considered their applications to be in their “late teens,” that is, the awkward time where audience size has grown and you’re seriously in need of a new architecture to scale to the next level.
I wouldn’t have expected so many to self-identify with that camp, although, as Jeremy pointed out, the whole interaction-everywhere UI of “Web 2.0” means a lot more sites are in that position than would be otherwise.
Jumping right to the “takeaway” slide: their recommendation to achieve both scalability and high availability is data partitioning (using HiveDB) across multiple tiers, coupled with a master-master replication topology using the IP takeover strategy to avoid writing to both masters in a tier simultaneously. (That sounds like a run-on sentence, doesn’t it?)
I'd like to focus on a couple of observations/critiques.
The audience expressed concerns about how much of scalability is people/process management, and how early in the lifecycle you should bite the bullet and commit to a really scalable architecture. Neither concern was really addressed in any depth, which, given the audience mix noted above, probably wasn't important. It would have been more important in a room full of two-month-old startups.
When Jeremy talked about partitioning, he referred to distributing data with the same schema across multiple physical servers. This needed clarification several times. He also totally discounted the value (for either raw performance or scalability) of distributing across multiple tables on the same server.
It’s pretty trivial to show that, performance wise, a MyISAM-based table with a lot of writes benefits greatly from splitting into multiple tables on the same server. Since MyISAM tables only support table-level locks, the more tables you have, the less contention.
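A minimal sketch of that split on a single server (the table names and modulo-16 routing are illustrative):

```sql
-- sixteen structurally identical tables: comments_00 through comments_15
CREATE TABLE comments_00 (
  user_id    bigint not null,
  comment_id bigint not null,
  body       text   not null,
  PRIMARY KEY (user_id, comment_id)
) ENGINE=MyISAM;

-- the application routes each row by key, e.g. to comments_(user_id MOD 16);
-- a write now locks 1/16th of the data instead of the whole table
INSERT INTO comments_07 (user_id, comment_id, body) VALUES (?, ?, ?);
```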
I also suspect–though I’m not enough of an expert on InnoDB to prove it at this point–that smaller tables will be faster with InnoDB, especially in applications where you have a lot of inserts that require rearranging by primary key. (InnoDB tables are index-ordered; if you’re not using auto_increment, you’re probably rearranging data on inserts.) Not to mention it’s more likely that blocks that make it into the file system cache incidentally will actually be used later.
Even barring performance gains, though, it seems a good recommendation from a scaling perspective. One of the things Jeremy spent some time on is the difficulty of repartitioning as you add more servers. But such growth is made much easier if you start off with many independent tables, and move them onto new servers as necessary.
There were two interesting audience-suggested partitioning strategies. A woman from Travelocity suggested geography-based partitioning with enough authority to make me think they're doing that in production. Then, a guy from Technorati suggested date-based partitioning, providing an example of calculating the authority of a blog. While this is usually an archiving strategy, it feels perfectly natural in their application for on-line data.
Another possible oversight was the suggestion that critical reads should always hit the master to avoid stale data. The downside, of course, is that critical reads can swamp the master and take the master down–a far worse situation than merely serving stale data. There are strategies for dealing with replication latency outside of the database (session affinity, copying new data into a local cache, etc.), but they were not touched on in this presentation. That makes me wonder how often such strategies are deployed in the (webified) real world.
Finally, he spent a lot of time discussing various replication topologies. More time than I would have liked, but I think it served a purpose: to drive home the fact that replication never provides gains for writes. That's easy to overlook if you're new to replication.
For a good overview of the presentation's structure and progression, I direct you to Ronald Bradford's excellent session notes.
April 14, 2007
As a follow-up 8 hours later to my previous post on Twitter & Rails, I’ll point to a couple of great technical discussions of how to solve the problem in Rails.
First, Ryan Tomayko discusses a solution he uses in production, which seems to be more of a partitioned/federated approach.
He also points to an excellent proof-of-concept how-to for an in-Rails load balancing solution.
As a side note, a person with experience in more mature platforms could offer a lot by focusing in on Rails for a while…
> Running on Rails has forced us to deal with scaling issues – issues that any growing site eventually contends with – far sooner than I think we would on another framework.

> The common wisdom in the Rails community at this time is that scaling Rails is a matter of cost: just throw more CPUs at it. The problem is that more instances of Rails (running as part of a Mongrel cluster, in our case) means more requests to your database. At this point in time there’s no facility in Rails to talk to more than one database at a time. The solutions to this are caching the hell out of everything and setting up multiple read-only slave databases, neither of which are quick fixes to implement. So it’s not just cost, it’s time, and time is that much more precious when people can[’t] reach your site.
It was compounded this week by Mike Pence’s attempt to dramatize disagreements between the Rails and Seaside communities.
Then Coding Horror tried to break it down by per-language runtime execution time.
I’ll be honest: I don’t know much beyond the basics about Rails. I cut my teeth on the Active Record design pattern, even if I find it lacking now. And what I know about Ruby, I like, because it feels like a much improved Perl.
I find it implausible that Twitter has reached the limits of a non-partitioned (some would call it “federated,” though that is, technically, incorrect) database schema. That comes later. It also doesn’t seem plausible that Rails forces all of your tables into a single database.
That leaves database replication. So, my understanding of his complaint is that Rails doesn’t support a one master-many slaves architecture. Otherwise, that is fairly quick and easy for a company in trouble to set up.
There are really three takeaways from this.
First, it doesn’t matter how “slow” Ruby is compared to C++ or Java. Their problem isn’t with the Ruby VM as an execution environment, it’s with Rails as a platform. And the only problem there is its lack of support for common, sensible data-level scaling strategies. Really, any language out there today is capable of scaling. (Sure, I’m simplifying. Platforms that don’t require spawning a new process for each request are “more scalable” and also may have shorter response times.)
Second, Twitter is a company with a lot of goodwill in the community. They’ll get through this. A lot of us only wish our employer were in the position of not losing members and positive press every minute our site cannot withstand the traffic. I envy their developers right now: their team will gel in a way it probably hasn’t yet, and a year from now they’ll be sharing their scaling experiences at all the conferences, and swapping war stories with new team members. (Aren’t they still hiring?)
Third, Rails will get through this. Twitter is, as far as I know, the first Rails-based company to push Rails to its (well-known, well-criticized) limits. If the Twitter developers don’t contribute improvements back to the Rails community, other Rails developers will. This experience only makes Rails more attractive to new start-ups who, if they’re lucky, will have the same traffic issues in a year.
December 26, 2006
While I’ve been on vacation, others have pointed out an SDForum presentation given by two senior eBay technologists, Randy Shoup and Dan Pritchett, on The Ebay Architecture: Striking a balance between site stability, feature velocity, performance and cost.
They tell a scalability story that is becoming more common, one which many people still find counter-intuitive:
- partitioned database schemas, based on usage patterns; mostly partitioned by primary key, with mapping tables where necessary;
- running with database slaves with various lag times;
- no stored procedures;
- no joins, sorts, or referential integrity constraints performed at the database layer: all this is done at the application layer, where CPU availability is easier to scale (see the sketch after this list);
- no client-side or distributed transactions;
- homebrew frameworks as necessary (in eBay’s case, they use their own Java ORM framework, and their own connection pooling)
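To make the "no joins" rule concrete: a lookup that would be a single JOIN elsewhere becomes two indexed queries plus an in-memory merge (table and column names are illustrative):

```sql
-- instead of: SELECT ... FROM items JOIN users ON items.seller_id = users.user_id
SELECT item_id, seller_id, title
  FROM items
 WHERE item_id IN (1001, 1002, 1003);

SELECT user_id, username
  FROM users
 WHERE user_id IN (501, 502);  -- the seller_ids collected from the first result

-- the application stitches the two result sets together by seller_id = user_id,
-- trading database CPU (hard to scale) for application CPU (easy to scale)
```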
Aside from the tale of what they’re doing now, they provide an excellent history of eBay’s scaling problems through the years. It’s a great overview of what leads a company to these solutions.
If rules like "no joins at the database level" are becoming so common (or, as Joe Gregorio commented, it's almost as if everybody is slowly implementing their own version of BigTable), why are they still counter-intuitive? I blame it on University Education. The approach to teaching databases at most universities is a lot like teaching multiplication tables to first graders: a lot of rule learning and regurgitation.
(There’s a very predictable 10-to-15 month cycle for a new hire at Webshots, which starts with blaming MySQL for all the world’s problems, moves through secretly inserting joins into production code, followed by resentment of the Establishment Authority, finally leading to enlightenment. Not that Webshots is anywhere near as good as the big guys when it comes to scaling.)
If you find this interesting, be sure to check out Scoble’s interview of Eric Billingsley, head of eBay Research, which I blogged about in October. Eric focused more on scaling search, and also goes into some of the history.
What I still find most fascinating about eBay is their approach to rolling out features: there’s a new release that contains 100K+ LOC (how much is auto-generated?) every two weeks, and they complete 300 new features per quarter. From what I hear from those inside eBay, this requires careful release management and putting in hours during night shifts, but it’s still awe-inspiring.
Finally, check out the summary and commentary by Frank Sommers over at Artima, which concludes with the following insight:
> [T]he main message of the eBay presentation […] [is] the fact that eBay was able to meet the challenges of its growth with subsequent refinements to its system, all the while keeping the site operational.
>
> The reason that’s interesting is because it suggests that you can start with almost any architecture—even with Perl or Rails or JSP pages—as long as you know how to migrate to the next step, and have the capability to do so, if and when you need to scale your app. That, in turn, suggests that the key test of scalability is not so much how each architecture stage scales, but how readily a company or an organization can move an application from one architecture step to the next. That indicates that scaling is as much an individual or organizational question as a technical one.