October 21, 2007

Streaming MySQL Results in Java

Posted in Development, Software at 7:17 pm by mj

The MySQL C API provides two methods for fetching results:

mysql_store_result()
the preferred method, which stores all rows in a temporary buffer on the client before returning
mysql_use_result()
the alternative method, which hands the caller each row as it's read via mysql_fetch_row(), without first storing all rows in a temporary buffer

For example, when I need to process a huge number of results from a MySQL table, such as for an offline data migration, I reach for the mysql command-line client with the --quick argument. This prevents the dreaded knock on the door from kswapd, and is often faster than doing data transformations in the database (when combined with a subsequent LOAD DATA LOCAL INFILE ...).

You can do the same thing with Perl's DBD::mysql via $dbh->{'mysql_use_result'} = 1. In PHP, you use mysql_unbuffered_query(). I assume any library built on top of the native C API provides a similar way to set this on a per-handle basis.

What I didn’t know was that Connector/J also supports streaming results. According to their documentation, you achieve this with the following:

stmt = conn.createStatement(java.sql.ResultSet.TYPE_FORWARD_ONLY, java.sql.ResultSet.CONCUR_READ_ONLY);
stmt.setFetchSize(Integer.MIN_VALUE);

But it turns out that com.mysql.jdbc.Connection#createStatement() uses TYPE_FORWARD_ONLY and CONCUR_READ_ONLY by default. All that’s left for the caller to do, then, is use the unintuitive “magic value” for fetchSize.

What you get back from executeQuery(query) is a ResultSet implementation with a backing store of type com.mysql.jdbc.RowDataDynamic, which reads the next row on each call to next().
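Putting the pieces together, here's a minimal sketch. The table, columns, and connection URL are made up for illustration; substitute your own. Note the defaults mentioned above mean the two createStatement() arguments are technically redundant, but spelling them out documents the requirement:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class StreamingExample {
    // Connector/J treats this "magic value" as a request to stream
    // rows one at a time instead of buffering the whole result set.
    static final int STREAMING_FETCH_SIZE = Integer.MIN_VALUE;

    static Statement streamingStatement(Connection conn) throws SQLException {
        Statement stmt = conn.createStatement(
                ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
        stmt.setFetchSize(STREAMING_FETCH_SIZE);
        return stmt;
    }

    public static void main(String[] args) throws SQLException {
        // Hypothetical URL and query; adjust for your environment.
        String url = "jdbc:mysql://localhost/test?user=reader&password=secret";
        Connection conn = DriverManager.getConnection(url);
        Statement stmt = streamingStatement(conn);
        ResultSet rs = stmt.executeQuery("SELECT id, payload FROM big_table");
        try {
            while (rs.next()) {
                // Each call to next() pulls one row off the wire
                // (via RowDataDynamic), so client memory use stays flat.
                long id = rs.getLong(1);
                String payload = rs.getString(2);
                // ... process the row ...
            }
        } finally {
            rs.close();
            stmt.close();
            conn.close();
        }
    }
}
```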

This means it’s easy to achieve this even from within Hibernate using session.createSQLQuery. It’s unfortunate that you have to work outside the framework, but special cases like this are OK as far as I’m concerned. And if you’re processing a huge number of results, you probably don’t want all the reflection and serialization overhead of Hibernate, anyway. Update (2007-10-23): I was wrong. You can set the fetch size from a normal Hibernate Query, so no need to work outside the framework. What I said about the reflection overhead when processing a massive result set still goes, though.

Unfortunately, this doesn't save you from all the GC overhead. Even if you're careful to reuse your own buffers, com.mysql.jdbc.MysqlIO#nextRow isn't so careful: each row allocates its own byte[][].

This is useful if, for example, you need to process tens of millions of rows from a table, but it’s incredibly expensive to generate the result set so batched fetches using LIMIT ofs,count won’t work. It’s also useful if you’re memory constrained on the client.

A couple of caveats

First, mysql_use_result() PREVENTS ANY OTHER CONCURRENT UPDATES. Even with InnoDB, this means each row in your result set will be locked until you completely process the results. You should only use this on a backend database designed for such queries, and even then, you might consider writing your results into a temporary table first. Update (2007-10-23): The same is apparently true whenever results are sent to the client; the caveat just matters more when streaming, because of the temptation/presumption of doing additional work per row, and therefore taking longer to consume the results. I'll have more on this later.

Second, each MySQL server has a net_write_timeout setting. Be sure you know what this value is. The default is 60 seconds, but it's often set to 1 hour or more. You must process your results within this amount of time, or the gremlins will come in through your window.
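If you don't know what you're working with, you can check the value (and build a per-session override) from the same connection before you start streaming. A sketch, with hypothetical helper names of my own:

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class NetWriteTimeout {
    // Builds a per-session override; SESSION scope avoids changing
    // the server-wide default (and doesn't require SUPER).
    static String sessionTimeoutSql(int seconds) {
        return "SET SESSION net_write_timeout = " + seconds;
    }

    // Reads the current value so you know your deadline before streaming.
    static int currentTimeout(Connection conn) throws SQLException {
        Statement stmt = conn.createStatement();
        try {
            ResultSet rs = stmt.executeQuery(
                    "SHOW SESSION VARIABLES LIKE 'net_write_timeout'");
            rs.next();
            return rs.getInt(2); // column 1 is the name, column 2 the value
        } finally {
            stmt.close();
        }
    }
}
```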

October 14, 2007

The IEEE Secret Handshake?

Posted in Usability at 9:49 am by mj

This morning I decided to plop down $215 for an IEEE + Computer Society membership. Note I said I decided to, not I did, because the IEEE wouldn’t take my money.

I’ve been considering joining IEEE for a while. I’m happy with my ACM membership (the Digital Library is awesome), but occasionally there will be a spate of good articles available only through IEEE. Plus, I expect a different perspective from their publications than you get from ACM.

It started with an ominous warning:

It appears you are using Firefox on Linux UNKNOWN. This site supports numerous Internet browsers on multiple platforms, however, it is best viewed with Internet Explorer 6.0 on Windows.

While I was surprised the warning didn't reference Internet Explorer 4.0 (think 1997), I continued on.

Next up was "Web Account" creation, which requires usernames to be entered in ALL UPPERCASE. (Huh?) When you fail to enter a password of the correct length, or a username in all uppercase, you're taken to an ugly error screen.

Somehow, I managed to make it through the cumbersome registration process, only to be greeted with this screen:

Update: This was not the original error message. The original error message was much less user friendly. The above error message was generated when I tried to do a screen capture, but mistakenly hit my ‘back’ button first. I didn’t notice the difference until I attempted to send IEEE member services the same screen shot. Bah.

Nice error message, folks. It appears my account was created, but the membership could not be added to my cart. No problem.

I started over, only to find that memberships would now forevermore be added to /dev/null:

That’s right. It claims to have added my membership to my cart, but the same page says my cart is empty.

Either I’m not cool enough, or they don’t want to just straight up tell me that I don’t qualify, or I need to find the secret handshake. But whatever, I’m not sure I want to join IEEE at this point. If I have this much trouble just signing up, what will my experience be when I want to search for articles or view their online periodicals?

For what it's worth, I also tried Internet Explorer 6 on Windows XP, without first logging into my account (in case my account was wonky). On IE, all I get is a completely blank screen (no screen caps from Windows), even after I log in.

Granted, the ACM registration/renewal process sucks raw eggs, too. But at least they always accept my money.

October 7, 2007

The Evolution of a Programming Puzzle

Posted in Coding at 10:59 am by mj

Mark Jason Dominus is currently examining a series of programs attempting to solve Van der Waerden’s problem: four that he wrote in 1988, and one he wrote this year.

Van der Waerden’s theorem says that for any number of colors, say C, a sufficiently long row of colored dots will contain n evenly spaced same-color dots for any n. Or, put another way, if you partition the integers into C disjoint classes, at least one class will contain arbitrarily long arithmetic progressions.
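To make the statement concrete, here's a small brute-force check (my own illustration, not Dominus's code) for whether a string of "colored dots" contains n evenly spaced same-color positions:

```java
public class VanDerWaerden {
    // Returns true if s contains positions i, i+d, i+2d, ..., i+(n-1)*d
    // (for some start i and gap d >= 1) all holding the same character.
    static boolean hasMonochromaticAP(String s, int n) {
        for (int d = 1; (n - 1) * d < s.length(); d++) {
            for (int i = 0; i + (n - 1) * d < s.length(); i++) {
                boolean allSame = true;
                for (int k = 1; k < n && allSame; k++) {
                    allSame = s.charAt(i + k * d) == s.charAt(i);
                }
                if (allSame) return true;
            }
        }
        return false;
    }
}
```

A "good" string in Dominus's sense is one where this returns false; the theorem guarantees that for a fixed number of colors and fixed n, long enough strings are never good.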

So far, he’s only written an introduction and dissected two of his 1988 programs, but it’s already a memorable series. Not only does it show the iterative process of solving an interesting problem (suitable for an SSE interview?), but it tells the story of his evolution as a programmer.

Mark’s blog is one of my favorites, and even though his posts tend to be long and interesting, I never mark them as “toread” the way I do most other long posts when I’m catching up on feeds. I always reflect on them, even if it means making negative progress on my feed reading.

Some quotes to set the mood:

In yesterday’s article I wrote about a crappy program to search for “good” strings in van der Waerden’s problem. It was crappy because it searched the entire space of all 3^27 strings, with no pruning. I can’t remember whether I expected this to be practical at the time. Did I really think it would work?
[…]
I was thinking this afternoon how it’s interesting the way I wrote this. It’s written the way it would have been done, had I been using a functional programming language. In a functional language, you would never mutate the same string for each function call; you always copy the old structure and construct a new one, just as I did in this program. This is why C programmers abominate functional languages.


On an unrelated note, I’m thinking about auto-posting items from my Google Reader shared items to this blog. Good idea? Bad idea? Should I do so on a daily basis, or put a bit more work into rating them and posting the top N weekly? Or do it all manually?

Update (2007-10-13): Mark pointed out that the exponent did not survive my copy and paste, so it appeared he was only searching through 327 strings, instead of 3^27. D’oh!