October 21, 2007

Streaming MySQL Results in Java

Posted in Development, Software at 7:17 pm by mj

The MySQL C API provides two methods for fetching results:

mysql_store_result()
the default and usually preferred method, which stores all rows in a temporary buffer on the client before returning
mysql_use_result()
the alternative, which hands the caller each row as it is read via mysql_fetch_row(), without first storing all rows in a temporary buffer

For example, when I need to process a huge number of results from a MySQL table (such as for offline data migration), I reach for the mysql command-line client with the --quick argument, which makes it fetch via mysql_use_result(). This prevents the dreaded knock on the door from kswapd, and is often faster than doing data transformations in the database (when combined with a subsequent LOAD DATA LOCAL INFILE ...).

You can do the same thing with Perl's DBD::mysql via $dbh->{'mysql_use_result'} = 1. In PHP, there's the mysql_unbuffered_query() function. I assume any library built on top of the native C API offers a similar per-handle switch.

What I didn’t know was that Connector/J also supports streaming results. According to its documentation, you achieve this with the following:

stmt = conn.createStatement(java.sql.ResultSet.TYPE_FORWARD_ONLY,
                            java.sql.ResultSet.CONCUR_READ_ONLY);
stmt.setFetchSize(Integer.MIN_VALUE);

But it turns out that com.mysql.jdbc.Connection#createStatement() uses TYPE_FORWARD_ONLY and CONCUR_READ_ONLY by default. All that’s left for the caller to do, then, is use the unintuitive “magic value” for fetchSize.
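In other words, this should be all you need (a minimal sketch, assuming conn is an open java.sql.Connection to a MySQL server):

Statement stmt = conn.createStatement(); // already TYPE_FORWARD_ONLY, CONCUR_READ_ONLY
stmt.setFetchSize(Integer.MIN_VALUE);    // the magic value that turns on streaming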

What you get back from executeQuery(query) is a ResultSet implementation with a backing store of type com.mysql.jdbc.RowDataDynamic, which reads the next row on each call to next().
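Putting it together, a streaming read might look like this end-to-end. This is only a sketch: the JDBC URL, credentials, and table are hypothetical, and error handling is trimmed for brevity.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class StreamingReader {
    public static void main(String[] args) throws Exception {
        Class.forName("com.mysql.jdbc.Driver"); // unnecessary with JDBC 4 auto-loading
        Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/test", "user", "secret");
        Statement stmt = conn.createStatement();
        stmt.setFetchSize(Integer.MIN_VALUE); // stream rows instead of buffering them
        ResultSet rs = stmt.executeQuery("SELECT id, payload FROM huge_table");
        try {
            while (rs.next()) { // RowDataDynamic reads one row off the wire per call
                long id = rs.getLong(1);
                String payload = rs.getString(2);
                // do something with the row here
            }
        } finally {
            rs.close();   // close promptly: this drains any unread rows
            stmt.close();
            conn.close();
        }
    }
}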

This means it’s easy to achieve even from within Hibernate using session.createSQLQuery. It’s unfortunate that you have to work outside the framework, but special cases like this are OK as far as I’m concerned. And if you’re processing a huge number of results, you probably don’t want all the reflection and serialization overhead of Hibernate anyway.

Update (2007-10-23): I was wrong. You can set the fetch size on a normal Hibernate Query, so there’s no need to work outside the framework. What I said about the reflection overhead when processing a massive result set still goes, though.
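For the record, something along these lines ought to work (a sketch against the Hibernate 3 API; AuditRecord is a made-up entity, and an open Session is assumed):

import org.hibernate.ScrollMode;
import org.hibernate.ScrollableResults;

ScrollableResults rows = session.createQuery("from AuditRecord")
        .setFetchSize(Integer.MIN_VALUE) // passed down to the underlying JDBC statement
        .scroll(ScrollMode.FORWARD_ONLY);
while (rows.next()) {
    AuditRecord record = (AuditRecord) rows.get(0);
    // process the record, then evict it so the first-level cache doesn't grow
    session.evict(record);
}
rows.close();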

Unfortunately, this doesn’t save you from all the GC overhead. Even if you’re careful to reuse your own buffers, com.mysql.jdbc.MysqlIO#nextRow isn’t so careful: each row allocates its own byte[][].

This is useful if, for example, you need to process tens of millions of rows from a table, but the result set is so expensive to generate that batched fetches using LIMIT ofs,count won’t work. It’s also useful if you’re memory-constrained on the client.

A couple of caveats

First, mysql_use_result() PREVENTS ANY OTHER CONCURRENT UPDATES. Even with InnoDB, this means each row in your result set will be locked until you have completely processed the results. You should only use this on a backend database designed for such queries, and even then, you might consider writing your results into a temporary table first (see the sketch below).

Update (2007-10-23): The same is apparently true whenever results are sent to the client; the caveat just matters more when streaming, because of the temptation to do additional work per row and therefore take longer to consume the results. I’ll have more on this later.
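If you do go the temporary-table route, the idea is simply to snapshot the rows first and stream from the copy. A sketch, with a made-up table and predicate, again assuming an open Connection conn:

Statement setup = conn.createStatement();
setup.executeUpdate(
        "CREATE TEMPORARY TABLE huge_table_snapshot AS " +
        "SELECT id, payload FROM huge_table WHERE updated_at < '2007-10-01'");
setup.close();
// ...then run the streaming SELECT against huge_table_snapshot,
// leaving huge_table free for concurrent updates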

Second, each MySQL server has a net_write_timeout setting, and you should know what it’s set to. The default is 60 seconds, but it’s often raised to an hour or more. If the server blocks for longer than that while writing results to you, it aborts the connection. Keep consuming rows promptly, or the gremlins will come in through your window.
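It’s easy to check the value from the same connection before you begin (a quick sketch):

Statement check = conn.createStatement();
ResultSet vars = check.executeQuery("SHOW VARIABLES LIKE 'net_write_timeout'");
if (vars.next()) {
    System.out.println(vars.getString("Variable_name")
            + " = " + vars.getString("Value") + " seconds");
}
vars.close();
check.close();
// it can also be raised per session: SET SESSION net_write_timeout = 3600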