April 25, 2007

MySQL Conference: Technology at Digg

Posted in Conferences, Development, digg, Scale, Software at 11:04 pm by mj

Technology at Digg presented by Eli White and Tim Ellis on Tuesday.

98% reads. 30GB data. Running MySQL 5.0 on Debian across 20 databases, with another ~80 app servers. InnoDB for real-time data, MyISAM for OLAP purposes.

The big thing at Digg has been effective use of memcached, particularly in failure scenarios. The problem: you start with 10 memcache daemons running. One of them goes down, often, they say, because the daemon is doing internal memory management and simply is non-responsive for a bit. So your client starts putting those buckets of data onto another instance. Then…the original instance comes back. The client starts reading data from the original instance, which means potential for reading stale data.

They didn’t give many details about how they solved this problem. One solution given toward the end is to store the daemon server’s name with the key and store that information in a database. When the memcache daemon comes back up, the key names don’t match, so you invalidate it. This requires not only a highly available MySQL database to work, but it also requires two network accesses per data fetch in the best case.

One interesting thing is they’re running their memcached instances on their DB slaves. It sounds like this developed simply because their MySQL servers have more RAM (4GB) than their web servers. I wasn’t the only one a little concerned by this, and I wonder if part of their problem with unresponsive memcache daemons stems from this.

They’ve had an initiative underway for a year to partition their data, which hasn’t been implemented yet. Once again, there was terminology confusion. At Digg, a “shard” refers to a physical MySQL server (node), and “partition” refers to a table on the server. Prior discussions at the conference used opposite definitions. I suspect the community will come to a consensus pretty soon (more on that in a later post).

There was a brief audience digression into the difference between horizontal partitioning (scaling across servers) and vertical partitioning (multiple tables on the same server), which is closer to what partitioning connotes in the Oracle world.

Other notes:

  • developers are pushing back hard against partitioning, partly, it sounds, because it fudges up their query joins. No mention of MySQL’s inefficient CPU/memory usage on joins.
  • struggling with optimizing bad I/O-bound queries
  • issuing a lot of select * from ... queries, which causes problems with certain kinds of schema changes that leave outdated fields in their wake
  • had issues with filesystem reporting writes were synced to disk when they hadn’t really been synced; lots of testing/fudging with parameters; wrote diskTest.pl to assist with testing
  • image filers running xfs because ext3 “doesn’t work at all” for that purpose?! not substantiated by any data; unclear whether they’re talking about storing the images or serving them, or what their image serving architecture is (Squid proxy?)
  • memcached serves as a write-through cache for submitted stories, which hides any replication delay for the user submitting the story