April 29, 2007
John Engates, CTO of Rackspace, presented his experiences on The 7 Stages of Scaling Web Applications: Strategies for Architects.
This was a bit less interesting that I’d hoped, mainly because there were a lot of generalities and few specifics. One thing that the CTO of Rackspace brings, of course, is experience working with many different organizations as they grow. Obviously, he wasn’t in a position to give specifics of any given organization’s growing pains.
He did provide a good sanity check and general road map to make sure your application is evolving correctly. If you find yourself deviating significantly, you should pause to reflect on the reasons.
John gave the best definitions of “high availability” and “scalability” that I saw at the conference. Namely:
- high availability
a design and implementation that ensures a certain degree of operational continuity
a desirable property of a system which indicates its ability to either handle growing amounts of work in a graceful manner, or to be readily enlarged as demands increase
I’m kind of blogged out at the moment. Here are the 7 stages, in brief:
- 2-tier; pair of web servers, single database, internal storage, low operational cost, low complexity
- more of same, just bigger; maybe shared storage across multiple databases
- expontential traffic increase from publicity; more web servers, simple master-slave replication topology, split reads and writes, some application retooling
- intensified pain; replication latency, too many writes from single master, segmenting data by features, shared storage for application data, big-time application rearchitecting
- panicky, severe pain; rethinking the whole application; data partitioning (by geographical, user id, etc.), user clustering with all features available on each cluster, directory-based mapping of users to clusters
- less pain; finally adding new features again, horizontally scalable with more hardware, acceptable performance
- “entering the unknown”; investigating potential bottleness in firewalls, load balancers, network, storage, processes, backup/recovery; thinking about moving beyond a single datacenter; still difficult to replicate and load balance geographically
Most of these should sound familiar to many readers. Some of us do it a bit backwards (for example, eliminating network bottlenecks before application or database bottlenecks), and the smart ones focus on bottlenecks 12 months before they’re the limiting factor.
Among his recommendations:
- leverage existing technologies and platforms
- favor horizontal scaling over vertical scaling
- shared nothing architectures are common for a reason
- develop sufficient useful instrumentation
- don’t overoptimize, but do load test
- RAM is more important than CPU
- consider each feature in the context of its performance and scaling implications
April 28, 2007
Farhan “Frank Mash” Mashraqi, DBA at Fotolog, gave a nice little presentation titled Scaling the world’s largest photo blogging community.
Fotolog is a bit different from most sites presenting at this year’s conference. Because it’s based on the idea of photo “blogging,” free members are limited to posting (uploading) one photo per day, while paid members are limited to posting 6 photos per day.
He presented an Alexa graph showing Fotolog recently surpassing Flickr in pageviews. This was really surprising to me and made me take notice (I wasn’t the only one). However, later I looked at the same data in compete.com, which shows an utterly different picture.
- 2.4 billion comments
- 228 million photos (unsure whether they’re counting the “Flickr way” or the normal way)
- 500,000 uploads per day peak (probably 200-300K unique members uploading)
- average 24 minutes per visit (high for a web site)
- running Solaris 10
- converting from PHP to Java (no motivation given)
- 40 x 3GB memcached instances
- 32 MySQL servers, segmented into 4 tiers (user, guestbook, photo, friends+faves)
- recently converted from MyISAM to InnoDB
- using 3par servers for image store
When they converted to InnoDB, they found they still had table lock contentions. Why? Because they were using auto_increment to generate their IDs. To get around this, they changed their primary key to be a composite of existing fields, which, additionally, represents the way data is commonly queried.
For example, their comments use a (photo identifier, post date, comment identifier) composite for their primary key. Since they usually show comments from a given photo ordered by date, that can be done entirely through the primary key lookup, which, with InnODB, is much faster even than a secondary key lookup.
One thing not discussed is whether the photo identifier in that case is ordered, or how often “random inserts” happen. This is important because of InnoDB’s clustered primary key, which sorts row data in the same order as the primary key. I think he kind of touched on this from a different direction when he digressed a bit to explain how InnoDB’s primary keys are stored and the implications for secondary keys.
I was impressed by some of the benchmarking graphs he produced. He claimed a 30% performance improvement by disabling MySQL’s query cache, and a similar improvement (I think – wasn’t quite sure about this in my notes) by moving from 4GB to 16GB RAM.
Currently, their data is partitioned by the first letter of the username. This, of course, is quite skewed toward certain letters, and results in under-utilization for some instances. It wasn’t clear how these map to physical servers.
The latter part of his presentation focused on the driving factors behind planning their new architecture, wherein he proposed partitioning by date. There seemed to be confusion here, as the lines between “current implementation” and “proposed implementation” were blurred. That may have been cleared up in the Q&A, but I had co-workers tapping on my shoulder and had to leave. 😦
April 25, 2007
Technology at Digg presented by Eli White and Tim Ellis on Tuesday.
98% reads. 30GB data. Running MySQL 5.0 on Debian across 20 databases, with another ~80 app servers. InnoDB for real-time data, MyISAM for OLAP purposes.
The big thing at Digg has been effective use of memcached, particularly in failure scenarios. The problem: you start with 10 memcache daemons running. One of them goes down, often, they say, because the daemon is doing internal memory management and simply is non-responsive for a bit. So your client starts putting those buckets of data onto another instance. Then…the original instance comes back. The client starts reading data from the original instance, which means potential for reading stale data.
They didn’t give many details about how they solved this problem. One solution given toward the end is to store the daemon server’s name with the key and store that information in a database. When the memcache daemon comes back up, the key names don’t match, so you invalidate it. This requires not only a highly available MySQL database to work, but it also requires two network accesses per data fetch in the best case.
One interesting thing is they’re running their memcached instances on their DB slaves. It sounds like this developed simply because their MySQL servers have more RAM (4GB) than their web servers. I wasn’t the only one a little concerned by this, and I wonder if part of their problem with unresponsive memcache daemons stems from this.
They’ve had an initiative underway for a year to partition their data, which hasn’t been implemented yet. Once again, there was terminology confusion. At Digg, a “shard” refers to a physical MySQL server (node), and “partition” refers to a table on the server. Prior discussions at the conference used opposite definitions. I suspect the community will come to a consensus pretty soon (more on that in a later post).
There was a brief audience digression into the difference between horizontal partitioning (scaling across servers) and vertical partitioning (multiple tables on the same server), which is closer to what partitioning connotes in the Oracle world.
- developers are pushing back hard against partitioning, partly, it sounds, because it fudges up their query joins. No mention of MySQL’s inefficient CPU/memory usage on joins.
- struggling with optimizing bad I/O-bound queries
- issuing a lot of
select * from ...queries, which causes problems with certain kinds of schema changes that leave outdated fields in their wake
- had issues with filesystem reporting writes were synced to disk when they hadn’t really been synced; lots of testing/fudging with parameters; wrote diskTest.pl to assist with testing
- image filers running xfs because ext3 “doesn’t work at all” for that purpose?! not substantiated by any data; unclear whether they’re talking about storing the images or serving them, or what their image serving architecture is (Squid proxy?)
- memcached serves as a write-through cache for submitted stories, which hides any replication delay for the user submitting the story
April 24, 2007
I have done a lot of embarrassing things in my life. At least two every day, I think. (OK, three.)
But none as odd as falling off into the Hyatt pool during the MySQL conference lunch on Tuesday.
Didn’t see it coming.
Didn’t even realize it was happening until I saw the plate breaking in my hand as I sank into the water.
Had to ask myself whether I was dreaming.
“Yes,” I lied.
The Hyatt people took care of me, dried my clothes, and pretended that these things happen. Something for them to laugh at at office parties for years to come.
(I couldn’t tell if the liability guy taking my statement was trying to suppress laughter or trying to avoid accusing me of planning the whole thing.)
Yes, the laptop survived. My Targus backpack, in fact, didn’t let in any water, as far as I can tell. The laptop and all my papers were dry.
My cell phone, on the other hand… it works, but the screen is blank. I hope it can be repaired.
For Monday’s afternoon “tutorial” session (yes, I’m behind, so what?), I attended Wikipedia: Site Internals, Configuration and Code Examples, and Management Issues, presented by Domas Mituzas.
I have to say that my main interest at this conference is more on what’s being actively deployed and improved on in high-traffic production systems. Scalability is an area where theory interests me less than war stories.
Wikipedia’s story reminds me a lot of mailinator’s story. That is, Domas repeatedly emphasized that Wikipedia is free, is run mostly by volunteers, has no shareholders, and nobody’s going to get fired if the site goes down. Which means they can take some shortcuts and simplify their maintenance tasks with the right architectural designs, which may not scale as well as they’d like, but work anyway.
There were a lot of details here. Maybe too many. Any discussion is going to leave out at least a dozen interesting things. Here’s what I found interesting.
Data: 110 million revisions over 8 million pages. 26 million watch lists. So, not as large as Webshots, Flickr, Facebook, Photobucket, etc.
They utilize several layers of caching, from multiple Squid caches to app-level caches that reside on local disk. They also use UDP-based cache invalidation, and, in keeping with the theme, don’t care much if a few packets are dropped.
Their databases sit behind an LVS load balancer, which will take slaves out of service if replication falls behind. If all slaves are behind, the site is put into read-only mode.
Logged in users always bypass the Squid caches. Anonymous users who edit a page get a cookie set that also bypasses the Squid caches.
There was some discussion that page edits wait to ensure the slaves are caught up, but two direct questions from my colleague were sidestepped. So, my best guess is that either they’re relying on their load balancer’s slave status check, or they’re writing a sentinel value into another table within the same database then selecting that sentinel first thing after getting a connection to a slave.
They never issue
ORDER BY clauses at the SQL level, even when paginating results. Instead, they rely on the natural ordering of their indexes and issue something akin to
WHERE id > ? LIMIT ?. I don’t know how they handle jumping straight to the 500th page, but it seems a reasonable performance adjustment for many queries in the context of their application.
They’re still running MySQL 4.0.x, have no problems, and don’t plan to upgrade anytime soon.
I didn’t quite grasp their partitioning strategy. The 29-page book of notes he provided discusses various partitioning strategies more hypothetically, and more in terms of distributing reads or intensive tasks with indexes that reside on a subset of slaves.
Finally, revision histories are not stored in their own records, but are stored as compressed blobs, with each revision concatenated together uncompressed, then compressed. Makes a lot of sense to me.
My feeling is that, underneath, Wikipedia’s architecture strikes me as a bit overly complex for their size, as something that’s grown incrementally without the requisite resources to trim down some of the complexities. So, while their philosophy is: “simple, simple, simple, who cares if we’re down a few hours?” there still remains some cruft and relics of prior architectural decisions that they wouldn’t choose again if they were starting over. Which is great. It means they’re human after all.
April 23, 2007
Today was the first day of the MySQL Conference & Expo in Santa Clara. This first day consisted of two extended (3.5 hour) “tutorial” sessions, with your choice of seven topics during each session.
Interesting audience metric: most of the audience members considered their applications to be in their “late teens,” that is, the awkward time where audience size has grown and you’re seriously in need of a new architecture to scale to the next level.
I wouldn’t have expected so many to self-identify with that camp, although, as Jeremy pointed out, the whole interaction-everywhere UI of “Web 2.0” means a lot more sites are in that position than would be otherwise.
Jumping right to the “takeaway” slide: their recommendation to achieve both scalability and high availability is data partitioning (using HiveDB) across multiple tiers, coupled with a master-master replication topology using the IP takeover strategy to avoid writing to both masters in a tier simultaneously. (That sounds like a run-on sentence, doesn’t it?)
I’d like to focus on a couple of observatons/critiques.
The audience expressed concerns about how much of scalability was people/process management, and how early in the lifecycle should you bite the bullet and commit to a really scalable architecture. Neither concern was really addressed in any depth, which, given the audience mix noted above, probably wasn’t important. It would have been more important in a room full of two-month old startups.
When Jeremy talked about partitioning, he referred to distributing data with the same schema across multiple physical servers. This needed clarification several times. He also totally discounted the value (for either raw performance or scalability) of distributing across multiple tables on the same server.
It’s pretty trivial to show that, performance wise, a MyISAM-based table with a lot of writes benefits greatly from splitting into multiple tables on the same server. Since MyISAM tables only support table-level locks, the more tables you have, the less contention.
I also suspect–though I’m not enough of an expert on InnoDB to prove it at this point–that smaller tables will be faster with InnoDB, especially in applications where you have a lot of inserts that require rearranging by primary key. (InnoDB tables are index-ordered; if you’re not using auto_increment, you’re probably rearranging data on inserts.) Not to mention it’s more likely that blocks that make it into the file system cache incidentally will actually be used later.
Even barring performance gains, though, it seems a good recommendation from a scaling perspective. One of the things Jeremy spent some time on is the difficulty of repartitioning as you add more servers. But such growth is made much easier if you start off with many independent tables, and move them onto new servers as necessary.
There were two interesting audience-suggested partitioning strategies. A woman from Travelocity suggested geography-based paritioning with enough authority to make me think they’re doing that in production. Then, a guy from Technorati suggested date-based partitioning, providing an example of calculating the authority of a blog. While this is usually an archiving strategy, it feels perfectly natural in their application for on-line data.
Another possible oversight was the suggestion that critical reads should always hit the master to avoid stale data. The downside, of course, is that critical reads can swamp the master and take the master down–a far worse situation than merely serving stale data. There are strategies for dealing with replication latency outside of the database (session affinity, copying new data into a local cache, etc.), but they were not touched on in this presentation. That makes me wonder how often such strategies are deployed in the (webified) real world.
Finally, he spent a lot of time on discussing various replication topologies. More time than I would have liked, but I think it served a purpose: to drive home the fact that replication never provides gains for writes. Something that’s easy to overlook if you’re new to replication.
For a good overview of the presentation’s structure and progession, I direct you to Ronald Bradford’s excellent session notes.
April 14, 2007
As a follow-up 8 hours later to my previous post on Twitter & Rails, I’ll point to a couple of great technical discussions of how to solve the problem in Rails.
First, Ryan Tomayko discusses a solution he uses in production, which seems to be more a partitioned/federated approach.
He also points to an excellent proof-of-concept how-to for an in-Rails load balancing solution.
As a side note, a person with experience in more mature platforms could offer a lot by focusing in on Rails for a while…
Running on Rails has forced us to deal with scaling issues – issues that any growing site eventually contends with – far sooner than I think we would on another framework.
The common wisdom in the Rails community at this time is that scaling Rails is a matter of cost: just throw more CPUs at it. The problem is that more instances of Rails (running as part of a Mongrel cluster, in our case) means more requests to your database. At this point in time there’s no facility in Rails to talk to more than one database at a time. The solutions to this are caching the hell out of everything and setting up multiple read-only slave databases, neither of which are quick fixes to implement. So it’s not just cost, it’s time, and time is that much more precious when people can[’t] reach your site.
It was compounded this week by Mike Pence’s attempt to dramatize disagreements between the Rails and Seaside communities.
Then Coding Horror tried to break it down by per-language runtime execution time.
I’ll be honest: I don’t know much beyond the basics about Rails. I cut my teeth on the Active Record design pattern, even if I find it lacking now. And what I know about Ruby, I like, because it feels like a much improved Perl.
I find it implausible that Twitter has reached the limits of a non-partitioned (some would call it “federated,” though that is, technically, incorrect) database schema. That comes later. It also doesn’t seem plausible that Rails forces all of your tables into a single database.
That leaves database replication. So, my understanding of his complaint is that Rails doesn’t support a one master-many slaves architecture. Otherwise, that is fairly quick and easy for a company in trouble to set up.
There are really three takeaways from this.
First, it doesn’t matter how “slow” Ruby is compared to C++ or Java. Their problem isn’t with the Ruby VM as an execution environment, it’s with Rails as a platform. And the only problem there is its lack of support for common, sensible data-level scaling strategies. Really, any language out there today is capable of scaling. (Sure, I’m simplifying. Platforms that don’t require spawning a new process for each request are “more scalable” and also may have shorter response times.)
Second, Twitter is a company with a lot of goodwill in the community. They’ll get through this. A lot of us only wish our employer were in the position of not losing members and positive press every minute our site cannot withstand the traffic. I envy their developers right now: their team will gel in a way it probably hasn’t yet, and a year from now they’ll be sharing their scaling experiences at all the conferences, and swapping war stories with new team members. (Aren’t they still hiring?)
Third, Rails will get through this. Twitter is, as far as I know, the first Rails-based company to push Rails to its (well-known, well-criticized) limits. If the Twitter developers don’t contribute improvements back to the Rails community, other Rails developers will. This experience only makes Rails more attractive to new start-ups who, if they’re lucky, will have the same traffic issues in a year.
April 11, 2007
The latter still has some deal breakers (no web interface being one), but Reader has come a long way. It’s really much slicker than it used to be.
So, this weekend, I imported about 200 or so feeds into Reader, and made a commitment that I’ll stick with it for at least 2 months. I hope I will become a better blogger and just generally more in-touch. Right now, of course, I’m just feeling overwhelmed.
Anyhow, Reader has a nice little sharing feature, which gives me a pretty good link blog.
In fact, it was because I’d grown addicted to Scoble’s link blog that I decided to check out Reader again.
So, check out my link blog. It’s a hundred times better than this thing here, mainly because there’s no inane rambling from yours truly. 🙂
April 5, 2007
I’m home sick today, and found this interesting. It’s a clip from Japanese TV of a train tour of North Korea. Apparently they start out boarding in China and take a train to Pyongyang, then head out on a local train line to see a less idealized version of the country.