September 20, 2008
(Inspired by Jeff Atwood’s recent ‘outing’ as a regex sympathizer, which got me thinking about the line between “too many” and “too few” regular expressions and how some languages make it a choice between “too few” and “none”.)
Java has a Pattern, which forces you to pre-declare your regex string. And it has a Matcher, which matches on a String. It should be noted that a Pattern’s pattern turns most patterns into a mess of backslashes, since the pattern is wrapped in a plain-old Java String.

A Matcher can matches(), which matches using the Pattern’s pattern, but only if the pattern would otherwise match the Matcher’s whole string. A Matcher can also find(), which matches a Pattern’s pattern even if the pattern would only match a substring of the String, which is what most patterns match and what most languages call matching on a string. And it can lookingAt(), which matches a Pattern’s pattern and which, like find(), can match a pattern on a substring of the string, but only if the String’s matching substring starts at the beginning of the string.

The String matched by the Matcher can be sliced by a call to region(start, end), which allows lookingAt() to interpret a substring of the String as being the whole string.

Now, after calling find() or any of its String-matching cousins, a consumer of a Matcher can call group(int) to get the String substring that the Pattern’s pattern captured when matching on the String.

But if you’re lazy, and you have no groups in your pattern, and a matches() is sufficient, then String gives you matches(pattern), which is precisely equivalent to constructing a Pattern with your pattern and passing a new Matcher your existing String.
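To see the whole dance in one place, here is a minimal, self-contained sketch (the input string and pattern are invented for illustration):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexDemo {
    public static void main(String[] args) {
        // Note the doubled backslashes: the pattern is wrapped in a plain-old Java String.
        Pattern digits = Pattern.compile("\\d+");
        Matcher m = digits.matcher("abc123def");

        System.out.println(m.matches());    // false: the pattern must cover the whole string
        m.reset();
        System.out.println(m.lookingAt());  // false: the match must start at the beginning
        m.reset();
        System.out.println(m.find());       // true: a substring match is enough
        System.out.println(m.group());      // 123

        // region() lets lookingAt() treat a slice as if it were the whole string.
        m.reset();
        m.region(3, 6);
        System.out.println(m.lookingAt());  // true: the region begins at "123"

        // The lazy shortcut: compiles the Pattern and runs matches() for you.
        System.out.println("123".matches("\\d+"));  // true
    }
}
```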
So with effective use of Java object syntax, you too can use regular expressions to make your matches on Java Strings almost as obscurely as other languages clearly make matches on their strings!
Is it any wonder Java programmers don’t realize that regular expressions are a beautiful… thing?
September 17, 2008
I may not be writing much lately, but, thanks to Google Reader’s Offline Mode, I try to continue reading and adding to my shared items.
Unfortunately, I tend to sync at most once a week (except when I’m back in the Bay Area), so they tend to come in batches…and 3 weeks after the original post. It looks like the Reader team finally fixed the problem of only showing the sync time, though (except in my private view).
In today’s sync, I shared 24 items from 18 bloggers.
While some may go for quantity (ahem, Scoble, Digg), I only share things I’d want to read and refer to again…and which I’d prefer my whole team read, too.
Fortunately, the world is teeming with interestingness.
Some examples of things I’ve found interesting and shared recently:
Implementing Persistent Vectors in Scala.
Daniel Spiewak explains how immutable data structures can, nevertheless, be efficient even during writes. Perhaps the clearest example I’ve seen.
I still don’t claim to understand how multiple threads can share the same (changing) state without locking (perhaps something along the lines of Java’s ConcurrentHashMap, the code from which is well worth studying).
Dan Pritchett shares his experience with database sharding. Worth it for the second lesson alone (“Use Math on Shard Counts”), where he explains why multiples of 12 are a more efficient scaling strategy.
Singletons are Pathological Liars.
Miško Hevery has been writing the clearest (bestest!) introductions to designing for unit testing that I have seen. They’re not really introductions so much as they are motivators.
You know the drill: you join a new team with a code base that’s been around seemingly since Pascal walked the Earth. Maybe everybody has heard of unit testing, but nobody really understands what it’s all about or why their singleton-ridden/new-operator-ridden existing code (or existing so-called “unit tests”) isn’t sufficient.
Don’t buy them a book. Point them to Miško Hevery‘s blog.
There’s more. (Much more.) There’s the excellent ongoing REST discussion involving Tim Bray, Dare, Dave Winer, Bill de hÓra, Damien Katz (and others); a lot of fantastic Drizzle commentary that goes into mucho detail; discussions on edge cases and performance degradation in MySQL; and so on.
I wish I had a job that allowed me to just take what I’ve read and experiment with the ideas and contribute to some of the projects.
Yes, that would be an awesome job.
September 13, 2008
In a previous post, I lamented that the “hi-lo” ID generation strategy has one big wart: it’s a single point of failure.
After thinking about it a bit, though, it occurred to me that we can eliminate the SPOF without limiting our ability to add more shards later. And it’s quite easy–just more sophisticated than you typically need.
WARNING: I consider this design to be a bit over-engineered. You probably don’t need to eliminate the SPOF. But, if ID generation is the only critical SPOF in your system, and UUIDs aren’t practical for your purpose, it may be worth going this route.
The basis of this lies in expanding the fields in our
id_sequences table, reproducing the table in each shard, and introducing a stateless agent that’s always running in the background to maintain the various
id_sequences tables across all our shards.
sequence_name      varchar(255) not null
window_start_value bigint(19) not null
window_end_value   bigint(19) not null
next_hi_value      bigint(19) not null
PRIMARY KEY (sequence_name, window_start_value)
KEY idx_find_window (sequence_name, window_end_value, next_hi_value, window_start_value)
The key change is the introduction of window_start_value and window_end_value, which together define an ID window from which threads can reserve IDs on each shard.
Each shard will have multiple open windows, but only one is used at a time. A window is open if
next_hi_value < window_end_value.
Windows are created (and pruned, if necessary) by the agent, more on which later.
Application Hi-Lo Strategy
As expected, the in-memory buffer works as normal. The difference is in the select and increment operation.
When a shard has exhausted its in-memory buffer, we reserve another batch with the following sequence of steps:
Step 1. Begin transaction
Step 2. Query the database for the first open window
SELECT * FROM id_sequences WHERE sequence_name = ? AND window_end_value > next_hi_value ORDER BY window_start_value LIMIT 1 FOR UPDATE
Step 3. Increment the max reserved ID by our buffer size, but do not exceed window_end_value
Step 4. Update the window with the new next_hi_value
Step 5. Commit
This is guaranteed to always return a single open window (unless there are no open windows). Multiple threads trying to fetch the open window at the same time will not conflict. If thread A and B arrive simultaneously, and thread A exhausts the first open window, thread B will simply use the next window.
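Assuming a JDBC connection per shard and the id_sequences schema above, the select-and-increment transaction could be sketched like this (a sketch, not a hardened implementation):

```java
import java.sql.*;

// Hypothetical sketch of steps 1-5. IDs [returned, new next_hi_value)
// become the shard's new in-memory buffer.
public class WindowedHiLo {
    // Step 3 as a pure function: advance by the buffer size, but never past the window end.
    static long advance(long nextHi, long batchSize, long windowEnd) {
        return Math.min(nextHi + batchSize, windowEnd);
    }

    static long reserveBatch(Connection conn, String sequenceName, long batchSize)
            throws SQLException {
        conn.setAutoCommit(false);                       // Step 1: begin transaction
        try {
            long windowStart, windowEnd, nextHi;
            try (PreparedStatement select = conn.prepareStatement(
                    "SELECT window_start_value, window_end_value, next_hi_value "
                  + "FROM id_sequences "
                  + "WHERE sequence_name = ? AND window_end_value > next_hi_value "
                  + "ORDER BY window_start_value LIMIT 1 FOR UPDATE")) {  // Step 2
                select.setString(1, sequenceName);
                try (ResultSet rs = select.executeQuery()) {
                    if (!rs.next()) throw new IllegalStateException(
                            "no open windows for " + sequenceName);
                    windowStart = rs.getLong("window_start_value");
                    windowEnd = rs.getLong("window_end_value");
                    nextHi = rs.getLong("next_hi_value");
                }
            }
            long newHi = advance(nextHi, batchSize, windowEnd);          // Step 3
            try (PreparedStatement update = conn.prepareStatement(
                    "UPDATE id_sequences SET next_hi_value = ? "
                  + "WHERE sequence_name = ? AND window_start_value = ?")) {  // Step 4
                update.setLong(1, newHi);
                update.setString(2, sequenceName);
                update.setLong(3, windowStart);
                update.executeUpdate();
            }
            conn.commit();                               // Step 5
            return nextHi;
        } catch (SQLException | RuntimeException e) {
            conn.rollback();
            throw e;
        }
    }
}
```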
The Agent
The agent can be always running, or it could be a simple nightly cron job, or even a periodic manual process. Its responsibility is to create windows on each shard.
Since the current state of the system can be reconstructed on-the-fly without locking (see caveat below), we don’t have to worry about agents crashing or getting killed in the middle of their thing.
There are only two system parameters that the agent concerns itself with: the minimum number of open windows per shard, and the window_size. Whenever any shard has fewer than the minimum number of open windows, the agent creates a new window of window_size IDs on that shard.
Re-constructing the system state can be as simple as running
SELECT max(window_end_value) FROM id_sequences WHERE sequence_name = ?
on each shard before proceeding.
You probably also want a first pass that finds all unique sequence_names
SELECT DISTINCT(sequence_name) FROM id_sequences
so introducing a new sequence is as simple as inserting a single row into one of your shards, and the agent will propagate it elsewhere.
Then, for each shard, it queries a count of the open windows for each sequence, and inserts new windows as necessary.
No locking. No external state.
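A single agent pass could be sketched as follows, assuming one JDBC Connection per shard; the window-creation details (insert columns, error handling) are illustrative assumptions:

```java
import java.sql.*;
import java.util.*;

// Hypothetical sketch of one agent pass over all shards.
public class WindowAgent {
    // Reconstructed high-water mark: new windows begin after the highest end seen on any shard.
    static long nextWindowStart(List<Long> maxEndPerShard) {
        long max = 0;
        for (long end : maxEndPerShard) max = Math.max(max, end);
        return max;
    }

    static void runOnce(List<Connection> shards, int minOpenWindows, long windowSize)
            throws SQLException {
        // First pass: find every sequence name anywhere in the system.
        Set<String> sequences = new TreeSet<>();
        for (Connection shard : shards) {
            try (Statement st = shard.createStatement();
                 ResultSet rs = st.executeQuery(
                         "SELECT DISTINCT(sequence_name) FROM id_sequences")) {
                while (rs.next()) sequences.add(rs.getString(1));
            }
        }
        for (String seq : sequences) {
            // Reconstruct state on the fly: the max window_end_value per shard.
            List<Long> maxEnds = new ArrayList<>();
            for (Connection shard : shards) {
                try (PreparedStatement ps = shard.prepareStatement(
                        "SELECT max(window_end_value) FROM id_sequences "
                      + "WHERE sequence_name = ?")) {
                    ps.setString(1, seq);
                    try (ResultSet rs = ps.executeQuery()) {
                        maxEnds.add(rs.next() ? rs.getLong(1) : 0L);
                    }
                }
            }
            long nextStart = nextWindowStart(maxEnds);
            // Top up each shard to the minimum number of open windows.
            for (Connection shard : shards) {
                for (int i = countOpenWindows(shard, seq); i < minOpenWindows; i++) {
                    try (PreparedStatement ins = shard.prepareStatement(
                            "INSERT INTO id_sequences (sequence_name, window_start_value, "
                          + "window_end_value, next_hi_value) VALUES (?, ?, ?, ?)")) {
                        ins.setString(1, seq);
                        ins.setLong(2, nextStart);
                        ins.setLong(3, nextStart + windowSize);
                        ins.setLong(4, nextStart); // nothing reserved yet
                        ins.executeUpdate();
                    }
                    nextStart += windowSize;
                }
            }
        }
    }

    static int countOpenWindows(Connection shard, String seq) throws SQLException {
        try (PreparedStatement ps = shard.prepareStatement(
                "SELECT count(*) FROM id_sequences "
              + "WHERE sequence_name = ? AND window_end_value > next_hi_value")) {
            ps.setString(1, seq);
            try (ResultSet rs = ps.executeQuery()) {
                rs.next();
                return rs.getInt(1);
            }
        }
    }
}
```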
Is the Agent a SPOF?
That’s true – if the server on which the agent is set to run goes down, game over. But, you can run the agent from a cron job hourly, and stagger it across N servers, each running at a different hour.
I can’t envision a scenario where you’d need the agent to be continuously running and this would not suffice as a highly available design. If N-1 of your servers go down, then at most you’d go N hours without creating new windows. But your window sizes are sufficient to support a week or more of growth, yes?
What about Master-Master?
Some database systems are deployed in master-master pairs. In this case, you can either stick the
id_sequences table on only one of the masters and always hit that master, or give each master its own window. The latter is probably preferable, although it means excluding
id_sequences from replication.
Adding a new shard
Before opening up a new shard, the agent needs an opportunity to initialize the table. Not a big deal.
Deleting a shard or taking a server out of rotation
This is the one flaw, as hinted above. Reconstructing the system state on-the-fly requires that the server with the highest
window_end_value can be reached by the agent, and that we know which server that is.
This may require external state to work around, such as always writing the largest
window_end_value into at least one other server.
It’s probably sufficient for the agent to simply refuse to run when any server is unavailable. If you have a shard that’s offline long enough to exhaust all of your open ID windows, you have bigger problems.
As I said, this is probably a bit over-engineered for most systems. While I have tested the behavior under MySQL, I have not had an opportunity to deploy it in a running system (I may soon, if we determine it’s not too over-engineered), and I have not heard that anybody else has employed a similar solution.
Which also means no existing frameworks support it out of the box.
September 12, 2008
I wrote this in July (!!) as a morning-after continuation of this update on my life.
Since it still accurately reflects my thoughts at this time, and since I keep sucking at connecting with old friends in real life, I’m publishing it.
I have started thinking through my career trajectory for the next 2-5 years.
I’ve been lucky thus far. Yeah, there were some rough patches (many depressing rounds of layoffs at Excite, for example), but I’ve had a number of excellent managers, and been able to observe and learn from many awesome co-workers. Webshots paid off both literally (twice!), and figuratively, in all that I’ve had the opportunity to stumble through.
What I didn’t have before, really, was the ability to choose my path.
In 2000, I thought what I really wanted to do was to earn enough money to go back and pursue my PhD. I’d even started discussions with my manager at the time to this effect. Thankfully, the layoffs came, and I discovered that the real geniuses in a company (the PhDs) get let go before the inexperienced idjits (that would be, uh, me). And then they go over to Google and make a ton of money.
Er. What was my point?
When I landed at Webshots, there was one thing I’d wanted to accomplish, and I had many thoughts of leaving. Then I developed my weird health issues, and it was all I could do to get to work (almost) every day. And yet, they stuck with me, and I worked it out, and I accomplished and learned quite a bit, and it worked out remarkably well for me.
Three employers, four job titles, six job responsibilities, and eight teams later, here I am.
That kind of change certainly helped keep me from stagnating.
And now… now, I am having a great time, applying lessons on scaling in a new context, increasing the operational complexity I have to tame, and learning (slowly) how to navigate political waters in a large, established company.
Is this where I’ll be in 2 years? Will this be what will make me happiest for the next 18 months?
Honestly, I don’t know.
What I do know is that I intend to intentionally choose what I do next.
September 6, 2008
After Baron Schwartz’s recent hint at having solved unit testing database sandboxes at a previous employer, I got to thinking about the problem again.
To be clear: this is not really unit testing, but I’ve found integration tests at various levels are just as important as unit tests. So much so that I have taken to creating both unit and integration source directories, and, whenever possible, requiring both suites to pass as part of the build process.
There are two suggestions I’ve seen for solving this problem, both of which are applicable for local in-memory databases as well.
The first is to start with a completely empty database, populate it, and then tear it down. Unfortunately, this is not only difficult, it’s time consuming. If you do this before each test, your tests will take hours to run. If you do this once before the whole suite, your tests will not be isolated enough.
A previous co-worker had suggested an incremental approach. Start out with an empty data set, and let each test (perhaps through annotations) define which data must be fresh. I like that. It requires a lot of infrastructure and discipline. It could encourage simpler tests, although with simpler tests come more tests, thus more discipline.
The other approach I’ve seen suggested a couple of times now (including in a comment on Baron’s blog) is the use of a global transaction. Unfortunately, this does not work with all database engines. MySQL tends to be the real killjoy, because nested transactions are not supported and DDL statements are not transactional. Yeah, even in the transactional engines.
So, here’s what I’m thinking. If I were starting over with a new team, with minimal code already existing, I think I wouldn’t solve this problem from an engineering/code perspective. Instead, I’d solve it from an operational perspective (though it still requires application/test infrastructure changes).
Picture a central test database server with one pristine copy of the data, and thousands of database instances. The application (test) asks this server for an available database instance, uses it for a single test, and then moves on. The next test resets the application state, so it asks the server for another available database instance, and so on.
Meanwhile, there is a daemon on that server that is constantly checking each database instance. If the underlying data files do not match the pristine copy, they are restored and the instance is placed back into the available pool.
An instance is considered available for testing when (a) there are no threads connected to it, and (b) its data files match the pristine copy.
Tests that do not alter the underlying data files do not require restoration.
What about schema changes? Answer: you have to unit/integration test them too. When you’re ready to promote your code, you deploy to the pristine copy as part of the standard production release process. An interesting side effect of this is it will, in many cases, force other developers to merge production changes back into their private branches, because many of their tests will probably fail.
Contrary to Baron’s suggestion, in a properly designed system this does not require changes to production code. As long as you can inject a database connection pool into your application–and all quality code should have this property (cough, cough)–your test framework can inject a connection pool that interrogates the test server first.
And it can scale to multiple test database servers as your team and the number of tests grows.
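As a sketch of that injection seam (all names here are hypothetical, not any real framework’s API):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.function.Supplier;

// The application depends only on this seam; production and test wiring differ
// only in what gets injected.
interface ConnectionProvider {
    Connection get() throws SQLException;
}

// Production wiring: a fixed JDBC URL (or a real pool behind it).
class FixedUrlProvider implements ConnectionProvider {
    private final String jdbcUrl;
    FixedUrlProvider(String jdbcUrl) { this.jdbcUrl = jdbcUrl; }
    public Connection get() throws SQLException {
        return DriverManager.getConnection(jdbcUrl);
    }
}

// Test wiring: before each test, lease an available pristine instance from the
// test-database server (e.g. via a tiny HTTP call) and connect to it.
class LeasedInstanceProvider implements ConnectionProvider {
    private final Supplier<String> leaseJdbcUrl;
    LeasedInstanceProvider(Supplier<String> leaseJdbcUrl) { this.leaseJdbcUrl = leaseJdbcUrl; }
    public Connection get() throws SQLException {
        return DriverManager.getConnection(leaseJdbcUrl.get());
    }
}

// The one decision point: which provider the application is handed.
class Providers {
    static ConnectionProvider choose(boolean underTest,
                                     ConnectionProvider prod, ConnectionProvider test) {
        return underTest ? test : prod;
    }
}
```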
I haven’t tried this, and I have too many deadlines (and too much legacy code that I’m still learning in my current team) to experiment with a real-world application.
But what do you think? What holes are there in this proposal?
Aside from violating the Engineering Aesthetic that the application should control the environment it needs for testing. Which is what I think has caused me the most problems over the years.
August 23, 2008
I am living without internet in my temporary condo in Seattle (the horror! my god! I’m dying! how did people live twenty years ago? no wonder there are so many wars!), and am working on deadlines at the office, so have few chances to write.
But I am trying to keep up with the news using Google Reader’s offline mode.
Which brings me to this bit on collective intelligence from Nat Torkington:
Systems that channel individual behaviours to create new and valuable data are showing up everywhere. We point to Amazon Recommendations as the canonical example, but it’s hard to find an area that isn’t using individual actions to produce collective wisdom.
Not that I disagree, but the thought just struck me. We always bring up recommendation engines on Amazon or Pandora or Netflix or Facebook…
Wisdom represents the ability to understand the world better and, through that understanding, improve it (or at least one’s standing in it). (See the Wikipedia entry on wisdom, which agrees with me.)
How is “finding your niche” (or even moving outside your niche) in books or music or movies or online friends…wisdom?
That seems more like plain ol’ socialization to me. On a much larger scale than ever before, granted, but can we call what we’re doing with these tools at this moment in history increasing our wisdom?
Coincidentally, this chart from New Scientist (via Paul Kedrosky–no direct link available) shows what happens when people “recommend” (in a generic sense) stocks to one another:
I bet if we plotted the popularity of artists and movies (at all points along the head and long tail) we’d find similar results.
I guess I’m still waiting for automated tools that increase my wisdom. Are there tools that will look for trends in people that live longer, which will help me live to 100? Are there tools that will look for trends in people that are successful, which will help me retire when I’m 45 and spend the next 55 years traveling the world (and, hopefully, the moon and Mars)? Are there tools that will help us reform our political structure so that it’s even worth living longer? Tools that make it harder to not have compassion? Tools that prevent us from foisting dictators and nanny-states upon ourselves?
Yes, some of these things are coming–but at the moment, we basically have simple data mining tools that help experts know where to focus their attention, then filter and draw conclusions and make suggestions to the rest of us.
I love my Pandora. I love recommendation engines in general. I even love my credit card company’s fraud algorithms. My life is much better–I am much happier–as a result.
But I don’t feel any more wise.
July 19, 2008
Thus concludes any feelings of affection I’ve retained for the Libertarian Party.
(Yes, I’ve been under a rock, and this is news to me.)
If Barr is really serious about reversing his position on…well, everything he ever advocated…he should do it within the Republican Party framework. Because it’s no good to spend decades doing damage with the weight of the federal government, then “repent” when a third party offers you an ego boost late in life. You have to reverse the policies you’ve helped implement, not just speak a few words that the mainstream won’t even take seriously.
And how the hell does a guy like that get accepted into the Libertarian Party, of all things? The party of “principle”? The party that would rather come in fourth in an election than give up an ounce of liberty?
I mean…are they all smoking dope?
July 17, 2008
The team I’m working with is in the process of moving from SQL Server to MySQL.
To ease the burden, we need all writes to the new MySQL databases to be replicated back into SQL Server, so that all of our back-end reports and what-not will continue to function. It’s short term, but not shorter than six months.
I can’t find any generic tool online to do this.
I thought about setting up a bridge MySQL instance with triggers that create a message queue, then reuse our existing Java DAOs to do the heavy lifting (they know how to read/write either SQL Server or MySQL). But, that quickly gets invasive and brittle.
My next thought was a lightweight wrapper around the
mysqlbinlog program which does essentially what the MySQL replication threads do: reads a portion of the binlogs, persists its current position, then replays the statements.
Then I thought, “well, if we’re doing that, why not just reuse MySQL’s replication code and build something around that core?”
You can see the hole I’m getting myself into.
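For what it’s worth, the mysqlbinlog wrapper idea could be sketched like this (the position tracking and statement filtering here are deliberately naive stand-ins, not a real tool):

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;

// Hypothetical sketch: read statements out of a binlog via mysqlbinlog, replay
// each against SQL Server through a sink, and persist the position so a crash
// resumes where it left off.
public class BinlogBridge {
    static String[] command(String binlogFile, long startPosition) {
        // mysqlbinlog can resume from a byte offset with --start-position.
        return new String[] {
            "mysqlbinlog", "--start-position=" + startPosition, binlogFile
        };
    }

    static void replay(String binlogFile, Path positionFile, StatementSink sink)
            throws IOException {
        long start = Files.exists(positionFile)
                ? Long.parseLong(Files.readString(positionFile).trim())
                : 4; // binlogs begin with a 4-byte magic header
        Process p = new ProcessBuilder(command(binlogFile, start)).start();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = r.readLine()) != null) {
                if (line.startsWith("#") || line.isBlank()) continue; // skip comments/metadata
                sink.apply(line); // e.g. execute via a SQL Server JDBC connection
            }
        }
        // A real tool would track the "# at NNN" position markers as it reads;
        // persisting the position is what makes the bridge restartable.
        Files.writeString(positionFile, Long.toString(start));
    }

    interface StatementSink { void apply(String sql); }
}
```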
Are there better strategies? Are there (free or commercial) tools that can replay replication logs on a MSSQL Server instance?
The lightweight Perl/Python wrapper seems the best solution. Anybody have experience with something like this (for SQL Server or otherwise)?
July 16, 2008
We’ve all been through this exercise a number of times: you’re partitioning your data set across multiple physical servers, which means you can no longer rely on your database’s built-in auto-increment to generate unique IDs.
One common pattern is what’s come to be known as the “hi-lo” strategy. This requires each component in your distributed environment to communicate with a central node (table) to generate an ID, usually by reserving a batch of available IDs for itself to reduce the communication.
(“Hi lo” isn’t exactly self descriptive, so I always use “externalized ID generation” which, despite being longer and more difficult to say, seems to be more descriptive for those not on your core team. Anyway…)
Another pattern that’s emerging is the use of UUIDs. These are 128-bit unique identifiers that can be generated in isolation by any component in your system without talking with a central node (table).
I’m going through this exercise again with a new team, and I’m debating whether I should step outside my comfort zone and try UUIDs. I’m having a hard time seeing many advantages, however.
The disadvantages that I see are: longer IDs mean longer URLs (unless you go the “slug” route with SEO-friendly URLs), and they don’t play nice with clustered indexes.
The latter is a bit more damning in most scenarios, actually. It just blows away InnoDB’s clustered primary key, for example, which is going to be an issue even outside of web contexts.
The main problem with “hi-lo” is it still introduces a SPOF. In reality, it’s not going to be a bottleneck, since you can always adjust both the granularity and the number of IDs you reserve in each transaction (the “lo” part of “hi-lo”). But that lingering SPOF is a bit ugly.
The other advantages I see are: better use of the available ID space and ubiquitous implementations (even MySQL now has the
uuid() built-in). Many hi-lo implementations are specific to a given framework and/or broken for real distributed environments.
Are there advantages I’m missing?
The only situation that I can see where UUIDs offer a clear advantage is for aggregating content across the web, where the UUID can be based on a SHA-1 hash of the source URL, and you’re not clustering on primary key (or, indeed, may not even be using a relational database). FriendFeed seems to have gone this route, for example.
Now obviously, I’m talking about a narrow context. UUIDs are applicable in many other situations (device IDs, node IDs, etc.).
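For reference, generating UUIDs in Java takes no coordination at all; note that java.util.UUID only ships the MD5-based name flavor (version 3), so the SHA-1-based variant (version 5) implied above would need a little extra code:

```java
import java.util.UUID;

public class UuidDemo {
    public static void main(String[] args) {
        // 128 bits, generated with no central node to talk to.
        UUID random = UUID.randomUUID();
        System.out.println(random + " (version " + random.version() + ")");

        // Name-based: the same source URL always yields the same ID, which is
        // what makes UUIDs attractive for aggregating content from across the web.
        UUID byUrl = UUID.nameUUIDFromBytes("http://example.com/feed".getBytes());
        UUID again = UUID.nameUUIDFromBytes("http://example.com/feed".getBytes());
        System.out.println(byUrl.equals(again));  // true: deterministic
    }
}
```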
July 6, 2008
There’s an excellent Google TechTalk by Mary Poppendieck on the history of management/leadership organization, starting with railroads in the 1840’s and working through WWII and many of the buzzwords we soaked up in the 1990’s.
By the end, she brings it back around to software engineering teams and identifies two essential leadership roles. Well worth the 90 minutes.