CouchDB benchmarking followup

I wanted to post a public thanks to the great minds over in freenode #couchdb, including Paul Davis and Jan Lehnardt, for helping me learn more about CouchDB and for helping me investigate the performance issue I posted about last time. In fact, they both posted on Planet CouchDB with some thoughts about benchmarking in general.

I wanted to share my reasons for benchmarking CouchDB in the manner I did. They were entirely personal: I was trying to make loading my own dataset with my own code as fast as possible. I’m not trying to generate pretty graphs for my management, for publication, or for personal gain. This work comes out of having to load and reload fairly large datasets from scratch on a regular basis as part of my research methodology. I need a large dataset to get meaningful results out of the rest of my system, and I was not content to wait an hour or two each time my system bulk loaded the content.

So the suggestions provided – not using random UUIDs as document IDs (so that CouchDB can keep its B+-tree balanced), using bulk insert correctly, and redoing my code to use the official CouchDB interface instead of a hacked-up version built on raw PUTs and POSTs – helped a lot.
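
For concreteness, here is a minimal sketch of what the reworked loader looks like, assuming the couchdb-python package; the database name, batch size, and the load_records() helper are illustrative stand-ins rather than my actual code.

```python
# Minimal sketch of the reworked loader, assuming the couchdb-python package.
# Database name, batch size, and load_records() are illustrative stand-ins.
import couchdb

server = couchdb.Server("http://localhost:5984/")
db = server.create("analytics")        # or server["analytics"] if it already exists

BATCH_SIZE = 1000
batch = []
for seq, record in enumerate(load_records()):    # load_records() is a placeholder
    doc = dict(record)
    doc["_id"] = "%010d" % seq                   # sequential 10-digit id, not a random UUID
    batch.append(doc)
    if len(batch) == BATCH_SIZE:
        db.update(batch)                         # one _bulk_docs round trip per batch
        batch = []
if batch:
    db.update(batch)
```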

It turned out that the spike I was seeing (see last post) disappeared when I randomized the order in which I swept the batch size – so much so that three randomized runs showed almost no peaks. However, when I “scroll” through the batch sizes in increasing order, I still see the peak around a batch size of 3,000.

Trying the software on another platform (an AMD x64-3500 running Debian lenny with a 3ware hardware RAID array, as opposed to a 4-core Mac Pro with only a single local disk) revealed that the peak shifted to a different, much higher value.

Lesson? Always benchmark your own application against your own data, and tweak until you’re satisfied. Or, in the words immortalized at Soundtracks Recording Studio, New Trier High School, “Jiggle the thingy until it clicks.”

I suspected that Jan’s article ranting about benchmarking was at least in part stimulated by my experiences as shared over IRC. (I was wrong – see the comments.) They must have seemed somewhat inscrutable – why would someone care so much about something most database-backed applications will do rarely, compared to reads, queries and updates? My application right now has a very particular performance requirement (repeatable, high-performance bulk load) that is not likely to be anyone’s standard use case but my own. Nor would it be worthwhile for the developers to spend a bundle of effort optimizing this particular scenario.

That said, Jan is also calling for people to start compiling profiling suites that “…simulate different usage patterns of their system.” With this research, I don’t have the weight of a corporation willing to agree on “…a set of benchmarks that objectively measure performance for easy comparison,” but I can at least contribute my use case for use by others. Paul Davis’ benchmark script looks quite a bit like what I’m doing, except that my document count is larger by a factor of 100 (~2 million here) and my per-document size is smaller by a factor of 25 (100-200 bytes here). Knowing the time it takes to insert and to run a basic map/reduce function over fairly similar data is a great place to start thinking about performance considerations in my application.
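
As a rough illustration of the kind of map/reduce timing check I have in mind, here is a sketch using the same couchdb-python setup as above; the design document, view name, and the doc.type field are made up for the example.

```python
# Sketch of timing a basic map/reduce view with couchdb-python; the design
# document, view name, and the doc.type field are made-up examples.
import time
import couchdb

db = couchdb.Server("http://localhost:5984/")["analytics"]

# A trivial view: count documents per "type" field.
db["_design/bench"] = {
    "views": {
        "count_by_type": {
            "map": "function(doc) { emit(doc.type, 1); }",
            "reduce": "function(keys, values, rereduce) { return sum(values); }",
        }
    }
}

start = time.time()
rows = list(db.view("bench/count_by_type", group=True))  # first query builds the index
print("view built and queried in %.1f s (%d groups)" % (time.time() - start, len(rows)))
```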

Oh, and knowing that the new code on the upcoming branches will get me a performance increase of at least 2x with no effort on my part is the icing on the cake.

Thanks, davisp. Thanks, jan.

CouchDB 0.9.0 bulk document post performance

Based on a tip from my university colleague Chris Teplovs, I started looking at CouchDB for some analytics code I’ve been working on for my graduate studies. My experimental data set is approximately 1.9 million documents, with an average document size of 256 bytes. Documents range in size from approximately 100 to 512 bytes. (FYI, this represents about a 2x increase in size from the raw data’s original form, prior to the extraction of desired metadata.)

I struggled for a while with performance problems in initial data load, feeling unenlightened by other posts, until I cornered a few of the developers and asked them for advice. Here’s what they suggested:

  1. Use bulk insert. This is the single most important thing you can do. It reduced the initial load time from ~8 hours to under an hour, and it removes the need to compact the database afterwards.
     Baseline time: 42 minutes, using 1,000 documents per batch.

  2. Don’t use the default _id assigned by CouchDB. It’s just a random ID, and apparently it really slows down the insert operation. Instead, create your own sequence; a 10-digit sequential number was recommended. This bought me a 3x speedup and a 6x reduction in database size. (A sketch combining both suggestions follows this list.)
     Baseline time: 12 minutes, again using 1,000 documents per batch.
Using 1,000 documents per batch was a wild guess, so I decided it was time to run some tests. With a simple shell script and GNU time, I generated the following plot of batch size vs. elapsed time (a rough Python equivalent of the harness appears after the plot):

Strange bulk insert performance under CouchDB 0.9.0
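
For reference, the sweep itself was nothing fancy. My harness was a shell script timed with GNU time; a rough Python equivalent is sketched below, with the database name, document shape, and total document count as stand-ins for my real data.

```python
# Rough Python equivalent of the batch-size sweep (the actual harness was a
# shell script timed with GNU time). Database name, document shape, and
# TOTAL_DOCS are stand-ins; the real dataset was ~1.9 million documents.
import json
import time
import requests

COUCH = "http://localhost:5984"
DB = "bench"
TOTAL_DOCS = 100000   # stand-in so the sweep finishes quickly

def recreate_db():
    requests.delete("%s/%s" % (COUCH, DB))                  # 404 on first run is fine
    requests.put("%s/%s" % (COUCH, DB)).raise_for_status()

def load(batch_size):
    for start in range(0, TOTAL_DOCS, batch_size):
        docs = [{"_id": "%010d" % i, "payload": "x" * 200}
                for i in range(start, min(start + batch_size, TOTAL_DOCS))]
        requests.post("%s/%s/_bulk_docs" % (COUCH, DB),
                      data=json.dumps({"docs": docs}),
                      headers={"Content-Type": "application/json"}).raise_for_status()

for batch_size in (1000, 2000, 3000, 4000, 5000, 6000):
    recreate_db()
    t0 = time.time()
    load(batch_size)
    print("%5d docs/batch: %6.1f s" % (batch_size, time.time() - t0))
```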

The more-than-exponential growth at the right of the graph is expected; however, the peak around 3,000 documents per batch is not. I was so surprised by the results that I ran the test three times – and got consistent data. I’m currently running a denser set of tests between 1,000 and 6,000 documents per batch to characterize the peak a bit better.

Are there any CouchDB developers out there who can comment? You can find me on the #couchdb freenode channel as well.