April 20, 2009
I’m back at the MySQL Conference again. This year, I skipped the tutorials/workshops. And it’s a good thing, too, because I had a full day between my day jobs, tax issues and other matters.
You might be interested in my tentative schedule of sessions. Even after much editing, there are still conflicts. Not to mention MySQL Camp and Percona sessions upstairs.
This year I’ve decided my focus will be on MySQL performance and areas that could use some contributions. I need to get involved with the tools I use. I’ve also been looking to evaluate PostgreSQL more, and think a deeper understanding of many of the performance trouble spots that I’ve taken for granted will help.
In years past, I focused on my first love: hearing what other people are building, the scaling challenges they’ve faced, and their (sometimes novel) solutions. Which I just call “emerging best practices,” but it’s more than that.
This year I will not be going with co-workers, so I’m on my own mostly. I know of at least two former co-workers who will be at the conference, but many are skipping this year (read: their employers are struggling and don’t want to pay up). Only thing left is for me to show up naked.
Maybe this year I can avoid falling into the swimming pool.
Finally: this year, I have no pretense that I will be able to blog the sessions. Tried it two years in a row, didn’t work out so well. Besides, there are plenty of other sources for that.
My first challenge is going to be being up and dressed and ready to go by 7:30am. I’ve been working from home way too much…
April 14, 2008
Today was tutorial/workshop day. It’s good to be back. This year, the Santa Clara Convention Center is undergoing a bit of construction, and so the large green yard has been torn up. Too bad. Having lunch on that lawn on a nice Bay Area afternoon is hard to beat. Instead, we’re limited to the pool area (still nice), or an indoor area.
My morning tutorial was Florian Haas and Philipp Reisner’s Hands-on Introduction to High-availability MySQL and DRBD.
It’s hard to read any MySQL blogs and not have heard about DRBD in some capacity, but somehow I’ve totally missed picking up any details about it. This really was a good introduction, with a combination overview of what DRBD is and what it’s good for, and walking through multiple examples of configuring and operating DRBD (including failure scenarios).
There were two big take-aways for me.
First, DRBD replicates block devices, and operates beneath the file system. The downside to this is only one server can mount the file system at a time. Thus, DRBD is only useful in “hot standby” scenarios. This was a little disappointing to me.
Second, failing over to the standby is not a task for DRBD, but for a cluster manager like Heartbeat. During a failover, the cluster manager promotes the DRBD device on the standby to primary, fsck’s and mounts the file system, and then starts MySQL. This is potentially slow for a large file system, and thus is not guaranteed to have near-zero downtime during the transition.
Overall, it was a good presentation, and the way they played off each other kind of reminded me of some of Peter and Heikki’s presentations. (This is a good thing.)
My afternoon tutorial was Ask Bjørn Hansen’s Real World Web: Performance & Scalability, MySQL Edition.
Overall, this was a good, energetic presentation. Ask wants people to “think scalable”–which means thinking critically about all advice, and making decisions from actual numbers instead of assumptions.
He gave an excellent overview of the myriad scaling options that have emerged in the last decade. Horizontal scaling, shared nothing architectures, page vs fragment vs data vs query caching, memcached, intelligent cache invalidation, partitioning, master-slave replication, master-master replication, summary tables, and so on.
I think by the end of the week, Ask’s tutorial will be remembered as one of the better ones. I think it would have been better if he had gone slower, kept his focus on scaling (he sort of lost focus and spent the last hour talking about version control and code reviews and so on–important, but not strictly scalability topics), and tied his advice back to actual teams implementing that advice. Who’s using master-master replication, and why, and at what scale are they operating?
There were a couple of “unconventional” bits of advice in his presentation that are worth considering.
First, he advocates against RAID 5. If you need that level of redundancy, better to use RAID 6 because it allows faster and less read-intensive recovery of a single failed disk, and because it can withstand 2 failures instead of only 1. Makes sense, but all the teams whose presentations I’ve seen or read have gone with RAID 5 or RAID 10 depending on need. Is anybody using RAID 6?
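To keep the redundancy arithmetic straight, here is a minimal sketch of usable capacity and guaranteed fault tolerance per RAID level. It assumes equal-size disks and two-way mirrors for RAID 10; the function name `raid_summary` is made up for illustration and none of this comes from Ask's slides.

```python
def raid_summary(level, disks, disk_tb):
    """Usable capacity and guaranteed fault tolerance for equal-size disks."""
    if level == "raid5":
        # One disk's worth of parity: survives any single disk failure.
        minimum, redundant_disks, tolerated = 3, 1, 1
    elif level == "raid6":
        # Two disks' worth of parity: survives any two disk failures.
        minimum, redundant_disks, tolerated = 4, 2, 2
    elif level == "raid10":
        if disks % 2:
            raise ValueError("RAID 10 with two-way mirrors needs an even disk count")
        # Half the disks are mirror copies. Only one failure is guaranteed
        # safe; a second failure is survivable only if it hits another pair.
        minimum, redundant_disks, tolerated = 4, disks // 2, 1
    else:
        raise ValueError(f"unknown RAID level: {level}")
    if disks < minimum:
        raise ValueError(f"{level} needs at least {minimum} disks")
    return {
        "usable_tb": (disks - redundant_disks) * disk_tb,
        "tolerated_failures": tolerated,
    }
```

With eight 1 TB disks, RAID 6 gives up one more disk of capacity than RAID 5 in exchange for surviving a second failure, which matters most during a long rebuild.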
Second, he advocates against persistent database connections, or at least reconsidering them. This isn’t completely unheard of. Flickr, for example, re-establishes a DB connection on each request. It’s acceptable because MySQL is fast at establishing connections, and in a master-master sharded architecture like Flickr’s you’re only dealing with one or two database connections per request anyway. I think it’s safer to say “if you’re following other best practices, don’t put any extra effort into database connection pooling until you’ve proven it’s a bottleneck.”
Finally, and something we hear too little of, he says it’s OK to use the “scale up” strategy for a while. It’s refreshing to hear somebody say this, because too often we’re so focused on “getting it right” from the ground floor that we forget that, say, “scaling out with cheap commodity hardware” for YouTube means purchasing hundreds of servers that are each four times more expensive than most of us can afford in the first year of our site’s existence.
Anyway, this was a good start.
The MySQL conference is still the best in terms of having power strips everywhere and generally good wireless coverage. Unfortunately, there were a couple of times today when the network became unusably slow. I hope this is not a sign of problems the rest of the week.
I got to briefly speak with several people, including a guy from iPhotoStock and a guy from Adzilla. I’m not good at striking up conversations with people, so I try to use conferences as a place to practice my terrible socialization skills. So far, it seems to have gone all right. And that there are so many people from outside the Bay Area–and a ton of Canadians–makes the conference even more worthwhile.
April 13, 2008
To the (at least) eight of you whom I owe e-mails or phone calls (or more…), I apologize. The past three weeks, I’ve been commuting from SF to Seattle (Virgin America rocks!), and the week before I was in Cleveland. It’s been an exciting time, but it hasn’t left much room for other things.
Anyway, today–after finishing my taxes–I finally sat down with the MySQL conference program and drafted out my schedule.
Unfortunately, O’Reilly does not provide a way to share my personal schedule with anybody, which is disappointing. (Um, hello?)
My primary interests really run in two directions: hard-core scaling and performance metrics and strategies, and details about how other teams have solved their problems.
As long as I remind myself of that focus, I can minimize conflicts between sessions. Although, there are still several.
Last year’s conference was excellent. This year’s conference seems geared in a different direction, but still excellent. I hope to learn and think a lot, and get out of my comfort zone a bit.
I haven’t yet seen any of the BoF sessions scheduled, but I intend to attend a couple this year.
If you’re interested in meeting up with me–particularly if you’re someone I owe an email or a call to–ping me.
I shall endeavour to post at least once per day this week with my notes and observations.
April 29, 2007
John Engates, CTO of Rackspace, presented his experiences on The 7 Stages of Scaling Web Applications: Strategies for Architects.
This was a bit less interesting than I’d hoped, mainly because there were a lot of generalities and few specifics. One thing that the CTO of Rackspace brings, of course, is experience working with many different organizations as they grow. Obviously, he wasn’t in a position to give specifics of any given organization’s growing pains.
He did provide a good sanity check and general road map to make sure your application is evolving correctly. If you find yourself deviating significantly, you should pause to reflect on the reasons.
John gave the best definitions of “high availability” and “scalability” that I saw at the conference. Namely:
- high availability
a design and implementation that ensures a certain degree of operational continuity
- scalability
a desirable property of a system which indicates its ability to either handle growing amounts of work in a graceful manner, or to be readily enlarged as demands increase
I’m kind of blogged out at the moment. Here are the 7 stages, in brief:
- 2-tier; pair of web servers, single database, internal storage, low operational cost, low complexity
- more of same, just bigger; maybe shared storage across multiple databases
- exponential traffic increase from publicity; more web servers, simple master-slave replication topology, split reads and writes, some application retooling
- intensified pain; replication latency, too many writes from single master, segmenting data by features, shared storage for application data, big-time application rearchitecting
- panicky, severe pain; rethinking the whole application; data partitioning (by geographical, user id, etc.), user clustering with all features available on each cluster, directory-based mapping of users to clusters
- less pain; finally adding new features again, horizontally scalable with more hardware, acceptable performance
- “entering the unknown”; investigating potential bottlenecks in firewalls, load balancers, network, storage, processes, backup/recovery; thinking about moving beyond a single datacenter; still difficult to replicate and load balance geographically
Most of these should sound familiar to many readers. Some of us do it a bit backwards (for example, eliminating network bottlenecks before application or database bottlenecks), and the smart ones focus on bottlenecks 12 months before they’re the limiting factor.
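The "directory-based mapping of users to clusters" from the fifth stage is easier to picture with a little code. This is only a sketch under my own assumptions: the class and method names are hypothetical, the directory lives in a plain dict (in practice it would sit in a small, highly available database), and new users simply go to the least-loaded cluster.

```python
class ClusterDirectory:
    """Maps each user id to the cluster that holds all of that user's data."""

    def __init__(self, clusters):
        if not clusters:
            raise ValueError("need at least one cluster")
        self._clusters = list(clusters)
        self._assignment = {}  # user_id -> cluster name

    def assign(self, user_id):
        """Place a new user on the least-loaded cluster; existing users stay put."""
        if user_id in self._assignment:
            return self._assignment[user_id]
        load = {cluster: 0 for cluster in self._clusters}
        for cluster in self._assignment.values():
            load[cluster] += 1
        # min() returns the first cluster among equally loaded ones.
        target = min(self._clusters, key=lambda cluster: load[cluster])
        self._assignment[user_id] = target
        return target

    def lookup(self, user_id):
        """Every request starts here to find which cluster to talk to."""
        return self._assignment[user_id]

    def move(self, user_id, cluster):
        """Rebalance by pointing a user at another cluster (after copying data)."""
        if cluster not in self._clusters:
            raise ValueError(f"unknown cluster: {cluster}")
        self._assignment[user_id] = cluster
```

The appeal over hashing the user id is that a single user can be moved without reshuffling anyone else; the cost is one extra lookup per request and a directory that must never go down.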
Among his recommendations:
- leverage existing technologies and platforms
- favor horizontal scaling over vertical scaling
- shared nothing architectures are common for a reason
- develop sufficient useful instrumentation
- don’t overoptimize, but do load test
- RAM is more important than CPU
- consider each feature in the context of its performance and scaling implications
April 28, 2007
Farhan “Frank Mash” Mashraqi, DBA at Fotolog, gave a nice little presentation titled Scaling the world’s largest photo blogging community.
Fotolog is a bit different from most sites presenting at this year’s conference. Because it’s based on the idea of photo “blogging,” free members are limited to posting (uploading) one photo per day, while paid members are limited to posting 6 photos per day.
He presented an Alexa graph showing Fotolog recently surpassing Flickr in pageviews. This was really surprising to me and made me take notice (I wasn’t the only one). However, later I looked at the same data in compete.com, which shows an utterly different picture.
- 2.4 billion comments
- 228 million photos (unsure whether they’re counting the “Flickr way” or the normal way)
- 500,000 uploads per day peak (probably 200-300K unique members uploading)
- average 24 minutes per visit (high for a web site)
- running Solaris 10
- converting from PHP to Java (no motivation given)
- 40 x 3GB memcached instances
- 32 MySQL servers, segmented into 4 tiers (user, guestbook, photo, friends+faves)
- recently converted from MyISAM to InnoDB
- using 3par servers for image store
When they converted to InnoDB, they found they still had table lock contentions. Why? Because they were using auto_increment to generate their IDs. To get around this, they changed their primary key to be a composite of existing fields, which, additionally, represents the way data is commonly queried.
For example, their comments use a (photo identifier, post date, comment identifier) composite for their primary key. Since they usually show comments from a given photo ordered by date, that can be done entirely through the primary key lookup, which, with InnoDB, is much faster even than a secondary key lookup.
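Here is a rough sketch of what such a composite key could look like. I'm using Python's built-in SQLite as a stand-in for MySQL/InnoDB (a `WITHOUT ROWID` table is likewise stored in primary key order), and the table and column names are my guesses, not Fotolog's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Rows are stored in primary key order, so all comments for one photo sit
# together, already sorted by date. No auto-increment counter is involved.
conn.execute("""
    CREATE TABLE comments (
        photo_id   INTEGER NOT NULL,
        posted_at  TEXT    NOT NULL,
        comment_id INTEGER NOT NULL,
        body       TEXT    NOT NULL,
        PRIMARY KEY (photo_id, posted_at, comment_id)
    ) WITHOUT ROWID
""")
conn.executemany(
    "INSERT INTO comments VALUES (?, ?, ?, ?)",
    [
        (7, "2007-04-02T10:00:00", 3, "third"),
        (7, "2007-04-01T09:00:00", 1, "first"),
        (9, "2007-04-01T12:00:00", 2, "other photo"),
        (7, "2007-04-01T18:30:00", 4, "second"),
    ],
)

def comments_for_photo(photo_id):
    """Fetch one photo's comments by date using only the primary key."""
    cursor = conn.execute(
        "SELECT body FROM comments WHERE photo_id = ? "
        "ORDER BY posted_at, comment_id",
        (photo_id,),
    )
    return [body for (body,) in cursor]
```

Note that the rows were inserted out of order, which is exactly the "random insert" case that costs extra in a clustered index.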
One thing not discussed is whether the photo identifier in that case is ordered, or how often “random inserts” happen. This is important because of InnoDB’s clustered primary key, which sorts row data in the same order as the primary key. I think he kind of touched on this from a different direction when he digressed a bit to explain how InnoDB’s primary keys are stored and the implications for secondary keys.
I was impressed by some of the benchmarking graphs he produced. He claimed a 30% performance improvement by disabling MySQL’s query cache, and a similar improvement (I think – wasn’t quite sure about this in my notes) by moving from 4GB to 16GB RAM.
Currently, their data is partitioned by the first letter of the username. This, of course, is quite skewed toward certain letters, and results in under-utilization for some instances. It wasn’t clear how these map to physical servers.
The latter part of his presentation focused on the driving factors behind planning their new architecture, wherein he proposed partitioning by date. There seemed to be confusion here, as the lines between “current implementation” and “proposed implementation” were blurred. That may have been cleared up in the Q&A, but I had co-workers tapping on my shoulder and had to leave. :(
April 25, 2007
Technology at Digg was presented by Eli White and Tim Ellis on Tuesday.
98% reads. 30GB data. Running MySQL 5.0 on Debian across 20 databases, with another ~80 app servers. InnoDB for real-time data, MyISAM for OLAP purposes.
The big thing at Digg has been effective use of memcached, particularly in failure scenarios. The problem: you start with 10 memcache daemons running. One of them goes down, often, they say, because the daemon is doing internal memory management and simply is non-responsive for a bit. So your client starts putting those buckets of data onto another instance. Then…the original instance comes back. The client starts reading data from the original instance, which means potential for reading stale data.
They didn’t give many details about how they solved this problem. One solution given toward the end is to store the daemon server’s name with the key and store that information in a database. When the memcache daemon comes back up, the key names don’t match, so you invalidate it. This requires not only a highly available MySQL database to work, but it also requires two network accesses per data fetch in the best case.
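To make the failure concrete, here is a small simulation of the stale read and of the "remember which daemon owns the key" fix as I understood it. Everything here is an assumption on my part: the daemons are in-memory fakes, the client hashes keys with CRC32 over whichever daemons respond, and the `owners` dict stands in for the database table they mentioned.

```python
import zlib

class FakeDaemon:
    """In-memory stand-in for one memcached daemon."""
    def __init__(self, name):
        self.name = name
        self.data = {}
        self.up = True

def pick_daemon(key, daemons):
    """Hash the key over the daemons that currently respond."""
    live = [d for d in daemons if d.up]
    return live[zlib.crc32(key.encode("utf-8")) % len(live)]

def cache_set(key, value, daemons, owners):
    daemon = pick_daemon(key, daemons)
    daemon.data[key] = value
    owners[key] = daemon.name  # record which daemon holds the fresh copy

def naive_get(key, daemons):
    """Trusts whatever the hashed-to daemon returns."""
    return pick_daemon(key, daemons).data.get(key)

def guarded_get(key, daemons, owners):
    """Treats the value as a miss if the daemon is not the recorded owner."""
    daemon = pick_daemon(key, daemons)
    if owners.get(key) != daemon.name:
        daemon.data.pop(key, None)  # invalidate the possibly stale copy
        return None
    return daemon.data.get(key)

def simulate_flap():
    daemons = [FakeDaemon("mc1"), FakeDaemon("mc2")]
    owners = {}
    key = "story:42"
    home = pick_daemon(key, daemons)
    cache_set(key, "v1", daemons, owners)  # lands on the home daemon
    home.up = False                        # home daemon stops responding
    cache_set(key, "v2", daemons, owners)  # update lands on the other daemon
    home.up = True                         # home daemon returns with old data
    return naive_get(key, daemons), guarded_get(key, daemons, owners)
```

Calling `simulate_flap()` returns the stale "v1" for the naive read and a miss for the guarded one, at which point the caller would reload from the database. The extra owner lookup is the second network access mentioned above.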
One interesting thing is they’re running their memcached instances on their DB slaves. It sounds like this developed simply because their MySQL servers have more RAM (4GB) than their web servers. I wasn’t the only one a little concerned by this, and I wonder if part of their problem with unresponsive memcache daemons stems from this.
They’ve had an initiative underway for a year to partition their data, which hasn’t been implemented yet. Once again, there was terminology confusion. At Digg, a “shard” refers to a physical MySQL server (node), and “partition” refers to a table on the server. Prior discussions at the conference used opposite definitions. I suspect the community will come to a consensus pretty soon (more on that in a later post).
There was a brief audience digression into the difference between horizontal partitioning (scaling across servers) and vertical partitioning (multiple tables on the same server), which is closer to what partitioning connotes in the Oracle world.
- developers are pushing back hard against partitioning, partly, it sounds, because it fudges up their query joins. No mention of MySQL’s inefficient CPU/memory usage on joins.
- struggling with optimizing bad I/O-bound queries
- issuing a lot of select * from ... queries, which causes problems with certain kinds of schema changes that leave outdated fields in their wake
- had issues with filesystem reporting writes were synced to disk when they hadn’t really been synced; lots of testing/fudging with parameters; wrote diskTest.pl to assist with testing
- image filers running xfs because ext3 “doesn’t work at all” for that purpose?! not substantiated by any data; unclear whether they’re talking about storing the images or serving them, or what their image serving architecture is (Squid proxy?)
- memcached serves as a write-through cache for submitted stories, which hides any replication delay for the user submitting the story
April 24, 2007
For Monday’s afternoon “tutorial” session (yes, I’m behind, so what?), I attended Wikipedia: Site Internals, Configuration and Code Examples, and Management Issues, presented by Domas Mituzas.
I have to say that my main interest at this conference is more on what’s being actively deployed and improved on in high-traffic production systems. Scalability is an area where theory interests me less than war stories.
Wikipedia’s story reminds me a lot of mailinator’s story. That is, Domas repeatedly emphasized that Wikipedia is free, is run mostly by volunteers, has no shareholders, and nobody’s going to get fired if the site goes down. Which means they can take some shortcuts and simplify their maintenance tasks with the right architectural designs, which may not scale as well as they’d like, but work anyway.
There were a lot of details here. Maybe too many. Any discussion is going to leave out at least a dozen interesting things. Here’s what I found interesting.
Data: 110 million revisions over 8 million pages. 26 million watch lists. So, not as large as Webshots, Flickr, Facebook, Photobucket, etc.
They utilize several layers of caching, from multiple Squid caches to app-level caches that reside on local disk. They also use UDP-based cache invalidation, and, in keeping with the theme, don’t care much if a few packets are dropped.
Their databases sit behind an LVS load balancer, which will take slaves out of service if replication falls behind. If all slaves are behind, the site is put into read-only mode.
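The decision logic is simple enough to sketch. This is my own guess at the shape of it, not Wikipedia's code: the function name, the 30-second threshold, and the use of `None` for a slave whose replication has stopped are all assumptions.

```python
def choose_read_pool(slave_lag_seconds, max_lag_seconds=30):
    """Return (slaves fit to serve reads, whether the site should go read-only).

    slave_lag_seconds maps a slave name to its replication lag in seconds,
    or to None when replication is not running on that slave.
    """
    healthy = sorted(
        name
        for name, lag in slave_lag_seconds.items()
        if lag is not None and lag <= max_lag_seconds
    )
    # With no slave caught up, stop accepting edits rather than
    # serve readers from badly outdated copies.
    read_only = not healthy
    return healthy, read_only
```

For example, `choose_read_pool({"db1": 2, "db2": 95, "db3": None})` keeps only `db1` in rotation, and the site stays writable.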
Logged in users always bypass the Squid caches. Anonymous users who edit a page get a cookie set that also bypasses the Squid caches.
There was some discussion that page edits wait to ensure the slaves are caught up, but two direct questions from my colleague were sidestepped. So, my best guess is that either they’re relying on their load balancer’s slave status check, or they’re writing a sentinel value into another table within the same database then selecting that sentinel first thing after getting a connection to a slave.
They never issue ORDER BY clauses at the SQL level, even when paginating results. Instead, they rely on the natural ordering of their indexes and issue something akin to WHERE id > ? LIMIT ?. I don’t know how they handle jumping straight to the 500th page, but it seems a reasonable performance adjustment for many queries in the context of their application.
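Here is a minimal sketch of that style of pagination, using Python's built-in SQLite instead of MySQL. The table and function names are made up. One deliberate difference: I keep an explicit ORDER BY id, because SQL does not promise any row order without it, whereas Domas described leaving it out and leaning on the index order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page (id INTEGER PRIMARY KEY, title TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO page (id, title) VALUES (?, ?)",
    [(i, f"Page {i}") for i in range(1, 11)],
)

def fetch_page(after_id, page_size):
    """Return the next page_size rows whose id is greater than after_id."""
    cursor = conn.execute(
        "SELECT id, title FROM page WHERE id > ? ORDER BY id LIMIT ?",
        (after_id, page_size),
    )
    return cursor.fetchall()

def iterate_pages(page_size):
    """Walk the whole table, resuming each time from the last id seen."""
    last_id = 0
    while True:
        rows = fetch_page(last_id, page_size)
        if not rows:
            return
        yield rows
        last_id = rows[-1][0]
```

Each query seeks straight to the right spot in the index instead of counting past skipped rows, which is why it stays fast on deep pages. The catch is the one I wondered about: you need the last id of the previous page, so there is no cheap way to jump directly to page 500.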
They’re still running MySQL 4.0.x, have no problems, and don’t plan to upgrade anytime soon.
I didn’t quite grasp their partitioning strategy. The 29-page book of notes he provided discusses various partitioning strategies more hypothetically, and more in terms of distributing reads or intensive tasks with indexes that reside on a subset of slaves.
Finally, revision histories are not stored in their own records, but are stored as compressed blobs, with each revision concatenated together uncompressed, then compressed. Makes a lot of sense to me.
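A quick way to see why this helps: consecutive revisions of a page are nearly identical, and a compressor can only exploit that if it sees them in the same stream. The sketch below uses zlib and a 4-byte length prefix per revision so the blob can be split apart again; both choices are my assumptions, not necessarily what MediaWiki does.

```python
import zlib

def pack_revisions(revisions):
    """Concatenate the revisions uncompressed, then compress the result once."""
    parts = []
    for text in revisions:
        raw = text.encode("utf-8")
        parts.append(len(raw).to_bytes(4, "big"))  # length prefix (assumed format)
        parts.append(raw)
    return zlib.compress(b"".join(parts))

def unpack_revisions(blob):
    """Reverse pack_revisions: decompress once, then split on the prefixes."""
    raw = zlib.decompress(blob)
    revisions = []
    pos = 0
    while pos < len(raw):
        size = int.from_bytes(raw[pos:pos + 4], "big")
        pos += 4
        revisions.append(raw[pos:pos + size].decode("utf-8"))
        pos += size
    return revisions

def separately_compressed_size(revisions):
    """Total size if every revision were compressed on its own."""
    return sum(len(zlib.compress(text.encode("utf-8"))) for text in revisions)
```

On a run of similar revisions, `len(pack_revisions(revs))` comes out well below `separately_compressed_size(revs)`. The trade-off is that reading any single old revision means decompressing the whole blob.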
My feeling is that, underneath, Wikipedia’s architecture strikes me as a bit overly complex for their size, as something that’s grown incrementally without the requisite resources to trim down some of the complexities. So, while their philosophy is: “simple, simple, simple, who cares if we’re down a few hours?” there still remains some cruft and relics of prior architectural decisions that they wouldn’t choose again if they were starting over. Which is great. It means they’re human after all.
September 11, 2006
My favorite paper was Type Less, Find More: Fast Autocompletion Search with a Succinct Index. This paper was presented during the Efficiency section by Holger Bast of Max-Planck-Institut für Informatik, and the results kind of blew me away, because they seem immediately practical.
The paper was also adapted into When You’re Lost for Words: Faceted Search with Autocompletion, and presented by his student and cohort, Ingmar Weber.
(I’ve learned that, a few days later, Holger also gave a Google talk on efficient autocompletion. Watch the video – Holger’s really a blast.)
There are two sides to this paper: first, proposing a dialogue-style search UI; and second, proposing a hybrid inverted index data structure to perform autocompletions efficiently (referred to as HYB, whereas inverted indexes are referred to as INV). The latter is what really piqued my interest.
Let’s jump directly to the results. On a 426GB raw document collection, they claim to have obtained the following:
- mean query time: 0.581s INV -vs- 0.106s HYB
- 90%-ile query time: 0.545s INV -vs- 0.217s HYB
- 99%-ile query time: 16.83s INV -vs- 0.865s HYB
- max query time: 28.84s INV -vs- 1.821s HYB
That’s roughly a 2.5x to 19x speedup, depending on which row you compare. Now, the tests they were performing were specific to the task of finding autocompletion terms, and displaying intermediate results immediately as the user is typing. But get this...
Once they solved this problem, they realized it applies equally well to faceted search: simply treat your facets as prefix searches, and store your values as, e.g., cat:family. Then, for a given query of "holger bast", you convert that on the back-end to the prefix query "holger bast cat:", which instantly returns all of the categories in which Holger Bast has been classified.
The reception during the faceted search workshop was mixed:
- Yoelle Maarek of Google in Haifa (one of the organizers) argued with Holger over whether this was the same as Google Suggest (it’s not–Google Suggest uses a pre-computed list of popular queries, and does not perform query intersections).
- Marti Hearst of UC Berkeley (the “grandmother” of faceted search–although she is much younger and cuter than the name might imply) at first did not see the applicability to faceted search.
- Several members complained that the index had to be huge and inefficient.
On the last point, I think there was some confusion. (It’s hard to read all the papers before a session.) It took me a couple of readings before I got it, too.
The confusion seemed to be over the assumption that the words were being stored as prefixes. For example, a prefix list with minimum size 3 would store the word “alphabet” as the (unique) terms "alp", "alph", "alpha", "alphab", "alphabe", "alphabet". This is (obviously) inefficient in disk usage.
What their HYB index is actually doing is storing the word “alphabet” as a multiset of postings (document Id lists from inverted index fame), along with the words “alpaca”, “alpha”, “alphameric”, and so on, assuming those terms exist in your document collection. They demonstrate a mathematical model for choosing the size of the range of words within a multiset based on the total size of the block–that is, the size of the encodings of the range of words plus the size of the encodings of the document Ids within which those words appear (the postings).
They are trading off (much) better performance for computing auto completion results with (slightly) worse performance for computing non-prefix result sets.
It’s clear, then, there is minimal overhead in terms of disk usage: each word is still stored exactly once within the hybrid inverted index. The overhead comes from weaving the word encodings with the posting encodings within each multiset block.
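To convince myself, I sketched a toy version of the idea. This is a heavy simplification and not the paper's implementation: I group a fixed number of words per block (the paper sizes blocks by their encoded volume), I store plain Python tuples instead of compressed encodings, and the function names are mine.

```python
def build_hyb(docs, words_per_block=3):
    """Split the sorted vocabulary into blocks of adjacent words. Each block
    keeps one doc-id-ordered list of (doc_id, word) postings for its words."""
    vocab = sorted({word for words in docs.values() for word in words})
    blocks = []
    for i in range(0, len(vocab), words_per_block):
        chunk = vocab[i:i + words_per_block]
        members = set(chunk)
        postings = sorted(
            (doc_id, word)
            for doc_id, words in docs.items()
            for word in set(words)
            if word in members
        )
        blocks.append({"first": chunk[0], "last": chunk[-1], "postings": postings})
    return blocks

def hyb_complete(blocks, candidate_docs, prefix):
    """For documents already matching the earlier query words, return
    {completion: [doc ids]} for every word starting with prefix."""
    upper = prefix + "\U0010ffff"  # sorts after every word with this prefix
    completions = {}
    for block in blocks:
        # Skip blocks whose word range cannot contain the prefix.
        if block["last"] < prefix or block["first"] > upper:
            continue
        for doc_id, word in block["postings"]:
            if doc_id in candidate_docs and word.startswith(prefix):
                completions.setdefault(word, []).append(doc_id)
    return completions
```

Each word lives in exactly one block, so nothing is duplicated per prefix; a prefix query just scans the one or two blocks covering its range. Facets work the same way: with terms like cat:letters in the vocabulary, the prefix cat: returns every category among the candidate documents.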
Unanswered questions: how well does this scale with real-world use (query throughput versus index size)? how much does this impact index build times/complexities (they claim no impact)? does this affect relevancy?
August 20, 2006
Well, I am back in my cozy (read: messy) home, and I am pooped. My plan to blog SIGIR during SIGIR just sort of evaporated. They kept us much busier than I expected. I had to sneak away from a couple of lunches and one dinner just to catch up on email from work and try to read some of the papers prior to their presentation.
Instead, I will, over the coming week, write up some of the highs and lows of the conference through my eyes. I don’t really believe in the blogging-as-stream-of-consciousness paradigm of conference blogging, anyhow. I do have quite a few notes (both hand-written and typed), and there were some interesting papers presented.
Post-conference, we spent a week exploring Washington’s Olympic Peninsula and Oregon, from the shores to Crater Lake, and visited a number of smaller towns along the way. The locals were, without exception, friendly, helpful, upbeat, and environmentally-friendly. The vacationers from Portland, on the other hand, … well, I digress.
It’s all about the meals
Boeing, Google, and Microsoft sponsored buffet-style dinners on three consecutive nights at SIGIR. And, interestingly, they all offered salmon as the entree. I like salmon (as you’ll see, sometimes I can’t resist ordering a salmon entree, perhaps subconsciously to test how the chef treats it as an indication of how he/she treats the rest of the menu), but none of these meals were all that great. The best SIGIR-sponsored meal was probably the last lunch, which offered self-serve fajitas.
During our vacation, we’d ask the locals what restaurants they recommend. (Yes, I should have checked Chowhound first.) This turns out (not surprisingly) to be a great strategy.
- Lunch at the 42nd Street Cafe in Long Beach, our favorite meal, and a surprise recommendation from the lady at the Chamber of Commerce. Better–and cheaper–than any other meal, including those more than twice as expensive. I had a salmon (hmm) with some kind of walnut-based sauce (and I don’t like walnuts!), while Yinghua had a bowl of clams seasoned with (I think) some kind of pesto base.
- Dinner at The Drift Inn in Yachats. From the outside, it looks like a typical local bar/hangout with typical food. On the inside, it’s a great atmosphere with awesome food. I had another salmon (wth?) with blackberry sauce (delicious–and I don’t like blackberry seeds!), and she had a bowl of seafood chowder, half of which I ate. The live music (the night we were there, it was Richard Sharpless) adds to the atmosphere, and somehow put us in a better mood leaving than when we went in.
- Lunch at The 3 Crabs in Dungeness, Washington. The building itself looks like a typical fast food joint, but the seafood is fresh and the wait staff (including the owner?) is friendly and fast. Of course we had a fresh crab and a bowl of mixed seafood. There’s not much room for chef-ly artistry here: it all comes down to freshness and not drowning the flavor in butter/herbs/salt the way many American restaurants do.
- Dinner at Crater Lake Lodge. Once again, I ordered Salmon the first night (Yinghua ordered it the second night). Their Chef’s magic didn’t happen with the flavoring, but by subtly undercooking the center (and serving a huge fish). Unfortunately, their other meals–including the duck and halibut–weren’t nearly as well prepared. The other downside was the garlic butter-based sauce served with the Salmon. Trust me: just move that to the side. I’m glad I didn’t think to smother my meal with it til I was at the last third, because it just gets in the way of the juiciest salmon I had the whole time. (The butter was good, it just gets in the way of the fish.)
The biggest disappointment to me was our dinner at Sky City in Seattle’s Space Needle. Perversely expensive and touristy, none of the food was outstanding in any way. The wait staff was friendly, the atmosphere was great, the view was good. The food just wasn’t worth it.
With that, I must prepare for my first full on-site week in a month. Between the family emergency, SIGIR, and my real vacation, I’m actually starting to miss the urine smell in downtown San Francisco. I think tomorrow I shall open my window and breathe in huge whiffs of the stuff.
August 6, 2006
After a fourteen hour drive shared with my wife, I arrived in time to check into my hotel, check in to the conference, and attend the welcoming reception. I’m sure I looked like a walking zombie.
First, let me say that I am a bit out of my league here. I did not go down the mathematical/theoretical route with my career, and I have, thus far, not been able to take Webshots’ search where I believe it should go. So I am a bit embarrassed introducing myself to the people here, since most of them are either researching in an academic environment or applying theory on a large-ish scale.
The first thing that has struck me is how heavily dominated by the GYM team this conference is. While it’s the largest SIGIR to date, my guess is most of that’s due to the GYM recruiting competition: by my count, 135 of the 653 official attendees are from Google, Yahoo, or Microsoft (that’s 20%!!!). The largest contingent, of course, being from Microsoft (maybe 70?). Every one of the grad students in attendance can look forward to a lucrative career. :-)
Unfortunately, I forgot my camera at the hotel, and my Webshots account is having trouble accepting mobile uploads from my phone. (See? Even Webshots engineers sometimes have problems. And yes, I did track down the issue. And no, I can not magically push the fix through in the middle of the night. What I can do is call customer support in the morning…) I have some grainy photos of the Microsoft and Google booths sitting side-by-side: a match made in, um, Seattle.
The welcoming reception was all right. I expected a half hour of socializing followed by two hours of (essentially boring) speeches, announcements, introductions, instructions, etc. Instead, we had a relaxing 30-minute bus ride to Boeing’s Future of Flight Museum, several hours of socializing/networking/recruiting in a party-like atmosphere, pretty good food, and five minutes of “screw it. we’re not going to bore you. welcome to sigir. now get back to socializing.”
But, as I said, I was a walking zombie and talked to only a few people. My natural shyness didn’t help, of course.
Tomorrow, the real interesting stuff begins. But for now, I need some rest in a real bed.