September 27, 2008

Three subversion tips: svn:ignore, svn merge, svn move

Posted in Development, Software tagged , , at 7:57 am by mj

Since I complained earlier this year about the state of Subversion tools, I’ve been thinking about a follow-up that’s a bit more positive.

This doesn’t exactly count, but I thought I’d share a few productivity lessons I’ve learned recently.

Using svn:ignore
svn:ignore is a special subversion property that instructs Subversion to ignore any files (or directories) that match a given pattern.

The common use case is to ignore build artifacts to prevent accidental check-ins and eliminate clutter on svn status, etc. For example, you can ignore all *.jar files in a particular directory, or ignore your build directory, etc.

Unfortunately, this can tend to hide problems with your build artifacts. For a project I’m working on now, we have timestamped JAR files stuffed into a common directory. The JAR files themselves are svn:ignore‘d, which means svn status will never display them.

And as I found recently, this could result in 8 GB of “hidden” files that only becomes apparent when you, say, try to copy a remote workspace into a local one for managing with Eclipse.

Shame on the developers for not deleting them as part of ant clean. But it happens, no getting around that.

Thankfully, the Subversion developers thought about this case, and introduced the --no-ignore flag to svn status. With this option, ignored files are displayed along with added, modified and deleted files, with an I in the first column.

Cleaning up your subversion repository is, therefore, as simple as:

svn status --no-ignore |
grep -P '^I' |
perl -n -e '/^\I[\s\t]+(.*)$/; my $f=$1; if (-d $f) { print "Deleting directory $f\n"; `rm -rv "$f"`; } else { print "Deleting file $f\n"; `rm -v "$f"`; }'

That will remove all files and directories that Subversion is ignoring (but not files that just have not yet been added to source control). Stick that in a script in your path, and live happily ever after.


Merging back into trunk
The most common use case when merging is to provide a range of revisions in trunk to pull into your branch. For example:

svn merge -r 100:114

What happens is you tell Subversion, “I don’t care what happened before revision 100, because that’s already in my branch…so just apply changes between version 100 and 114.”

But what’s not obvious–nor, as far as I can tell, available in standard reference books–is how to merge back into trunk. It turns out, the way to do this is to disregard everything you’ve learned about subversion.

The problem is that you’ve been merging changes from trunk into your branch. So if you simply choose the naive approach of picking up all changes since your base branch revision until your final check-in, and try to apply those to trunk, you’ll get conflicts galore, even on files you never touched in your branch (except to pull from trunk).

The solution is to use a different form of the merge command, as so:

svn merge ./@115

where revision 115 represents your last merge from trunk.

This actually just compares the two repositories at the specified revision, and pulls in the differences, all the differences, and nothing but the differences. So help me Knuth.


Beware the power of svn move
One of the much-touted benefits of subversion (particularly as compared to CVS) is the support for moving files around. But, until 1.5, there has been a glaring error that is often overlooked and can get you into trouble.

Because svn move is implemented as a svn delete followed by a svn add, Subversion thinks the new file has no relation to the old file. Therefore, if you have local changes to foo, and your arch nemesisco-worker Randy moves it to bar, your changes will simply disappear!

Subversion 1.5 has partially addressed this, at least for single files. Under the new regime, your changes to foo will be merged with any changes to bar. However, you still need to be careful with moving directories.

This is more insidious than moving methods around inside the same file. While in that case Subversion will freak out and your merges will become difficult, at least you’ll see the conflict and your changes won’t disappear while you’re not looking.

The lesson, then, is to talk with your team-mates before any refactoring. (svn lock doesn’t seem to provide any help unless everybody’s in trunk.)

Rumor has it svn 1.6 will address this even more practically by introducing the svn fuck-you-and-your-dog command. But until then, you have to do it the old fashion way.

September 24, 2008

FDIC Insurance Myths & Sound Personal Banking Practices

Posted in finance tagged , , at 3:08 pm by mj

The economy is in the crapper. Banks are failing. The “full faith and credit of the United States government” is all people believe in. Which is scary, if you think about it.


Everybody’s concerned about FDIC insurance coverage, and graphs like this (via the WSJ via Paul Kedrosky) are sending people into fits:

From Paul:

The U.S. has a $6.881-trillion on deposit with banks, but only $4.241-trillion is insured. In the case of IndyMac something like $1-billion deposits was uninsured.

It seems this is one of those cases where subtleties are nearly impossible to communicate, because summaries of FDIC regulations are incomplete.

Hence why I’m writing this, hoping to do my part to help spread accurate information and reduce fear in my tiny part of the world.

First, let’s get this out of the way:

NEVER put all your money in a single bank.

The examples below are extreme cases. In addition to the possibility of bank failures or robberies, you also have to deal with compromised account numbers, being held at gun point, and so on.

A rule we like is a minimum three banks, and a minimum of two accounts that require going to the physical bank to access (e.g., CDs).

OK. So the rule everybody hears is “The FDIC insures you up to $100,000.” What they leave out is the multiple “ownership categories.”

The best source of information is the FDIC’s own introduction to FDIC insurance.

To quote:

Deposits maintained in different categories of legal ownership at the same bank can be separately insured. Therefore, it is possible to have deposits of more than $100,000 at one insured bank and still be fully insured.

Read that carefully. Then read the following pages that describe the eight ownership categories.

For example, use the FDIC Deposit Insurance Estimator to calculate your coverage under the following scenario at a single banking institution:

  • Bob & Alice have a joint savings account with $200,000 balance;
  • Bob has a single savings account with $100,000 balance; and
  • Alice has a single savings account with $100,000 balance

The result? Bob and Alice have $400,000 covered under the FDIC program.

How does that work?

Under FDIC rules, a combined savings account is split equally among all owners of the account, each of whom can be covered up to $100,000 in the “joint ownership” category.

And the “join ownership” category is independent from any coverage in the “single ownership” category.

This has other advantages, as well. If Alice were to get held at gun point, she could not, alone, wipe out their savings, because she does not have access to her husband’s money.

Similarly, if Bob were to get hit by a bus, Alice would immediately have access to 1/4 of their savings–even if there were some hold placed on the joint account (say, if Alice were being investigated because her best friend was driving the bus).

Also, if Bob and Alice were to get a divorce, they’d both be able to get by for a while–and amicably–even if there were a dispute about their shared property. And if they love each other now, it only makes sense they’d want to protect their partner in the event that things turn sour.

This is actually less than what’s possible. Add in a couple of IRAs ($250,000 each) and requited “payable-on-death” accounts, and it balloons to $1,100,000. Beyond that, and I believe you’re past typical personal banking needs. (Unfortunately, the EDIE tool doesn’t allow deep linking to the final reports.)

Given this, how is it that so much of the nation’s deposits are not insured? Too many single people? Too many rich idiots? Or are those graphs wrong and simply based on assuming any amount over $100,000 is uninsured?

My take-away is that the FDIC’s rules–which may seem a little troubling (why only $100,000?)–reinforce sound personal banking practices.

But more troubling for me is the possibility that the FDIC may not actually be financially prepared for what’s coming. From this article:

The total amount of losses to be covered is estimated to be as high as $8 billion. According to the FDIC 2007 Annual Report, the FDIC has only $53 billion to cover losses of this nature. If all the banks on the FDIC watch list were to fail, how much would it cost the FDIC? Does the FDIC have estimates calculated for this?

Of course the FDIC has calculated the estimates.

And, as with almost all institutions, they don’t have enough money.

So of course if all the banks on the list failed, they’d be in trouble, and so would we all.

But Bob and Alice have each other.

And they have their gold bullion investments.

And that secret stash of diamonds buried in their basement.

And that’s all that matters when all the world economies fail. Love. And diamonds.

What’s that you say? They should have buried gasoline instead? Dang. I guess they’ll just suffer, then. Poor Bob and Alice.

September 20, 2008

Match-making for Java Strings

Posted in Software tagged , , , at 11:10 am by mj

(Inspired by Jeff Atwood’s recent ‘outing’ as a regex sympathizer, which got me thinking about the line between “too many” and “too few” regular expressions and how some languages make it a choice between “too few” and “none”.)

Java has a Pattern, which forces you to pre-declare your regex string.

And it has a Matcher, which matches on a String.

It should be noted that a Pattern‘s pattern turns most patterns into a mess of backslashes since the pattern is wrapped in a plain-old Java String.

So a Matcher has matches(), which matches using the Pattern‘s pattern, but only if the pattern would otherwise match the Matcher‘s whole string.

A Matcher can also find(), which matches a Pattern pattern even if the pattern would only match a substring of the String, which is what most patterns match and what most languages call matching on a string.

A Matcher can lookAt(), which matches on a Pattern pattern, which, like find(), can match a pattern on a substring of the string, but only if the String‘s matching substring starts at the beginning of the string.

The String matched by the Matcher can be sliced by a call to region(start,end), which allows matches() and lookAt() to interpret a substring of the String as being the whole string.

Now, after calling find() or any of Matcher‘s String-matching cousins, a consumer of a Matcher can call group(int) to get the String substring that the Matcher‘s Pattern‘s pattern captured when matching on the Matcher‘s String‘s string.

But if you’re lazy, and you have no groups in your pattern, and a Matcher‘s matches() is sufficient, then String gives you matches(pattern) which is precisely equivalent to constructing a Pattern with your pattern and passing a new Matcher your existing String!

So with effective use of Java object syntax, you too can use regular expressions to make your matches on Java Strings almost as obscurely as other languages clearly make matches on their strings!

Is it any wonder Java programmers don’t realize that regular expressions are a beautiful… thing?

September 17, 2008

Shameless Promotion for my Shared Items

Posted in Software tagged , , , , , at 8:25 am by mj

I may not be writing much lately, but, thanks to Google Reader’s Offline Mode, I try to continue reading and adding to my shared items.

Unfortunately, I tend to sync at most once a week (except when I’m back in the Bay Area), so they tend to come in batches…and 3 weeks after the original post. It looks like the Reader team finally fixed the problem of only showing the sync time, though (except in my private view).

In today’s sync, I shared 24 items from 18 bloggers.

While some may go for quantity (ahem, Scoble, Digg), I only share things I’d want to read and refer to again…and which I’d prefer my whole team read, too.

Fortunately, the world is teeming with interestingness.

Some examples of things I’ve found interesting and shared recently:

Implementing Persistent Vectors in Scala.
Daniel Spiewak explains how immutable data structures can, nevertheless, be efficient even during writes. Perhaps the clearest example I’ve seen.

I still don’t claim to understand how multiple threads can share the same (changing) state without locking (perhaps something along the lines of Java’s ConcurrentHashMap, the code from which is well worth studying).


Shard Lessons.
Dan Pritchett shares his experience with database sharding. Worth it for the second lesson alone (“Use Math on Shard Counts”), where he explains why multiples of 12 are a more efficient scaling strategy.


Singletons are Pathological Liars.
Miško Hevery has been writing the clearest (bestest!) introductions to designing for unit testing that I have seen. They’re not really introductions so much as they are motivators.

You know the drill: you join a new team with a code base that’s been around seemingly since Pascal walked the Earth. Maybe everybody has heard of unit testing, but nobody really understands what it’s all about or why their singleton-ridden/new-operator-ridden existing code (or existing so-called “unit tests”) isn’t sufficient.

Don’t buy them a book. Point them to Miško Hevery‘s blog.


There’s more. (Much more.) There’s the excellent ongoing REST discussion involving Tim Bray, Dare, Dave Winer, Bill de hÓra, Damien Katz (and others); a lot of fantastic Drizzle commentary that go into mucho detail; discussions on edge cases and performance degradation in MySQL; and so on.

I wish I had a job that allowed me to just take what I’ve read and experiment with the ideas and contribute to some of the projects.

Yes, that would be an awesome job.

September 13, 2008

Designing a Distributed Hi-Lo Strategy

Posted in Scale, Software tagged , , , at 6:57 am by mj

In a previous post, I lamented that the “hi-lo” ID generation strategy has one big wart: it’s a single point of failure.

After thinking about it a bit, though, it occurred to me that we can eliminate the SPOF without limiting our ability to add more shards later. And it’s quite easy–just more sophisticated than you typically need.

WARNING: I consider this design to be a bit over-engineered. You probably don’t need to eliminate the SPOF. But, if ID generation is the only critical SPOF in your system, and UUIDs aren’t practical for your purpose, it may be worth going this route.

That basis of this lies in expanding the fields in our id_sequences table, reproducing the table in each shard, and introducing a stateless agent that’s always running in the background to maintain the various id_sequences tables across all our shards.

Database Design

 sequence_name        varchar(255) not null
 window_start_value   bigint(19) not null
 window_end_value     bigint(19) not null
 next_hi_value        bigint(19) not null
 PRIMARY KEY (sequence_name, window_start_value)
 KEY idx_find_window (sequence_name, window_end_value, next_hi_value, window_start_value)

The key change is the introduction of window_start_value and window_end_value, which together define an ID window from which threads can reserve IDs on each shard.

Each shard will have multiple open windows, but only one is used at a time. A window is open if next_hi_value < window_end_value.

Windows are created (and pruned, if necessary) by the agent, more on which later.

Application Hi-Lo Strategy

As expected, the in-memory buffer works as normal. The difference is in the select and increment operation.

When a shard has exhausted its in-memory buffer, we reserve another batch with the following sequence of steps:

Step 1. Begin transaction

Step 2. Query the database for the first open window

    SELECT *
    FROM id_sequences
    WHERE sequence_name = ?
       AND window_end_value > next_hi_value
    ORDER BY window_start_value 
    LIMIT 1

Step 3. Increment the max reserved ID by our buffer size, but do not exceed window_end_value.

Step 4. Update the window with the new next_hi_value

Step 5. Commit

This is guaranteed to always return a single open window (unless there are no open windows). Multiple threads trying to fetch the open window at the same time will not conflict. If thread A and B arrive simultaneously, and thread A exhausts the first open window, thread B will simply use the next window.

Controlling Agent

This agent can be always running, or it could be a simple nightly cron job, or even a periodic manual process. Its responsibility is to create windows on each shard.

Since the current state of the system can be reconstructed on-the-fly without locking (see caveat below), we don’t have to worry about agents crashing or getting killed in the middle of their thing.

There are only two system parameters that the agent concerns itself with: min_open_windows and window_size. Whenever any shard has fewer than the minimum number of open windows, the agent creates a new window on that shard.

Re-constructing the system state can be as simple as running

    SELECT max(window_end_value)
    FROM id_sequences
    WHERE sequence_name = ?

on each shard before proceeding.

You probably also want a first pass that finds all unique sequence_names

    SELECT DISTINCT(sequence_name)
    FROM id_sequences

so introducing a new sequence is as simple as inserting a single row into one of your shards, and the agent will propagate it elsewhere.

Then, for each shard, it queries a count of the open windows for each sequence, and inserts new windows as necessary.

No locking. No external state.

Is the Agent a SPOF?

That’s true – if the server on which the agent is set to run goes down, game over. But, you can run the agent from a cron job hourly, and stagger it across N servers, each running at a different hour.

I can’t envision a scenario where you’d need the agent to be continuously running and this would not suffice as a highly available design. If N-1 of your servers go down, then at most you’d go N hours without creating new windows. But your window sizes are sufficient to support a week or more of growth, yes?

What about Master-Master?

Some database systems are deployed in master-master pairs. In this case, you can either stick the id_sequences table on only one of the masters and always hit that master, or give each master its own window. The latter is probably preferable, although it means excluding id_sequences from replication.

Adding a new shard

Before opening up a new shard, the agent needs an opportunity to initialize the table. Not a big deal.

Deleting a shard or taking a server out of rotation

This is the one flaw, as hinted above. Reconstructing the system state on-the-fly requires that the server with the highest window_end_value can be reached by the agent, and that we know which server that is.

This may require external state to work around, such as always writing the largest window_end_value into at least one other server.

It’s probably sufficient for the agent to simply refuse to run when any server is unavailable. If you have a shard that’s offline long enough to exhaust all of your open ID windows, you have bigger problems.


As I said, this is probably a bit over-engineered for most systems. While I have tested the behavior under MySQL, I have not had an opportunity to deploy it in a running system (I may soon, if we determine it’s not too over-engineered), and I have not heard that anybody else has employed a similar solution.

Which also means no existing frameworks support it out of the box.

September 12, 2008

Musings on the Next Few Years

Posted in Me at 4:22 am by mj

I wrote this in July (!!) as a morning-after continuation on this update on my life.

Since it still accurately reflects my thoughts at this time, and since I keep sucking at connecting with old friends in real life, I’m publishing it.


I have started thinking through my career trajectory for the next 2-5 years.

I’ve been lucky thus far. Yeah, there were some rough patches (many depressing rounds of layoffs at Excite, for example), but I’ve had a number of excellent managers, and been able to observe and learn from many awesome co-workers. Webshots paid off both literally (twice!), and figuratively, in all that I’ve had the opportunity to stumble through.

What I didn’t have before, really, was the ability to choose my path.

In 2000, I thought what I really wanted to was to earn enough money to go back and pursue my PhD. I’d even started discussions with my manager at the time to this effect. Thankfully, the layoffs came, and I discovered that the real geniuses in a company (the PhDs) get let go before the inexperienced idjits (that would be, uh, me). And then they go over to Google and make a ton of money.

Er. What was my point?

When I landed at Webshots, there was one thing I’d wanted to accomplish, and I had many thoughts of leaving. Then I developed my weird health issues, and it was all I could do to get to work (almost) every day. And yet, they stuck with me, and I worked it out, and I accomplished and learned quite a bit, and it worked out remarkably well for me.

Three employers, four job titles, six job responsibilities, and eight teams later, here I am.

That kind of change certainly helped keep me from stagnating.

And now… now, I am having a great time, applying lessons on scaling in a new context, increasing the operational complexity I have to tame, and learning (slowly) how to navigate political waters in a large, established company.

Is this where I’ll be in 2 years? Will this be what will make me happiest for the next 18 months?

Honestly, I don’t know.

We’ll see.

What I do know is that I intend to intentionally choose what I do next.

September 6, 2008

Creating Database Sandboxes for Unit/Integration Tests

Posted in Development, Software tagged , , , at 9:45 am by mj

After Baron Schwartz’s recent hint at having solved unit testing database sandboxes at a previous employer, I got to thinking about the problem again.

To be clear: this is not really unit testing, but I’ve found integration tests at various levels are just as important as unit tests. So much so that I have taken to creating both test and integration source directories, and, whenever possible, requiring both suites to pass as part of the build process.

There are two suggestions I’ve seen for solving this problem, both of which are applicable for local in-memory databases as well.

First, starting with a completely empty database, populating it, and then tearing it down. Unfortunately, this is not only difficult, it’s time consuming. If you do this before each test, your tests will take hours to run. If you do this before the whole suite, your tests will not be isolated enough.

A previous co-worker had suggested an incremental approach. Start out with an empty data set, and let each test (perhaps through annotations) define which data must be fresh. I like that. It requires a lot of infrastructure and discipline. It could encourage simpler tests, although with simpler tests come more tests, thus more discipline.

The other approach I’ve seen suggested a couple of times now (including in a comment on Baron’s blog) is the use of a global transaction. Unfortunately, this does not work with all database engines. MySQL tends to be the real killjoy, because nested transactions are not supported and DDL statements are not transactional. Yeah, even in the transactional engines.

So, here’s what I’m thinking. If I were starting over with a new team, with minimal code already existing, I think I wouldn’t solve this problem from an engineering/code perspective. Instead, I’d solve it from an operational perspective (though it still requires application/test infrastructure changes).

Picture a central test database server with one pristine copy of the data, and thousands of database instances. The application (test) asks this server for an available database instance, uses it for a single test, and then moves on. The next test resets the application state, so it asks the server for another available database instance, and so on.

Meanwhile, there is a daemon on that server that is constantly checking each database instance. If the underlying data files do not match the pristine copy, they are restored and the instance is placed back into the available pool.

An instance is considered available for testing when (a) there are no threads connected to it, and (b) its data files match the pristine copy.

Tests that do not alter the underlying data files do not require restoration.

What about schema changes? Answer: you have to unit/integration test them too. When you’re ready to promote your code, you deploy to the pristine copy as part of the standard production release process. An interesting side effect of this is it will, in many cases, force other developers to merge production changes back into their private branches, because many of their tests will probably fail.

Contrary to Baron’s suggestion, in a properly designed system this does not require changes to production code. As long as you can inject a database connection pool into your application–and all quality code should have this property (cough, cough)–your test framework can inject a connection pool that interrogates the test server first.

And it can scale to multiple test database servers as your team and the number of tests grows.

I haven’t tried this, and I have too many deadlines (and too much legacy code that I’m still learning in my current team) to experiment with a real-world application.

But what do you think? What holes are there in this proposal?

Aside from violating the Engineering Aesthetic that the application should control the environment it needs for testing. Which is what I think has caused me the most problems over the years.