Based on a tip from my university colleague Chris Teplovs, I started looking at CouchDB for some analytics code I’ve been working on for my graduate studies. My experimental data set is approximately 1.9 million documents, with an average document size of 256 bytes. Documents range in size from approximately 100 to 512 bytes. (FYI, this represents about a 2x increase in size from the raw data’s original form, prior to the extraction of desired metadata.)
I struggled for a while with performance problems in initial data load, feeling unenlightened by other posts, until I cornered a few of the developers and asked them for advice. Here’s what they suggested:
- Use bulk inserts. This is the single most important thing you can do: it cut the initial load time from ~8 hours to 42 minutes (using 1,000 documents per batch) and removes the need to compact the database afterwards.
- Don’t use the default _id assigned by CouchDB. It’s just a random UUID, and random keys really slow down inserts. Instead, create your own sequence; a 10-digit sequential number was recommended. This cut the load time further, to 12 minutes (again at 1,000 documents per batch) – roughly a 3x speedup – and gave a 6x reduction in database size. A sketch combining both suggestions follows this list.
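For the curious, here’s a minimal sketch of what such a loader boils down to, assuming Python with the requests library and a local CouchDB instance. The database name and helper names are placeholders for this example, not my actual code:

```python
import requests

COUCH = "http://localhost:5984"   # assumes a local CouchDB instance
DB = "analytics"                  # hypothetical database name
BATCH_SIZE = 1000                 # the batch size behind the numbers above

def flush(batch):
    # POST /{db}/_bulk_docs with a {"docs": [...]} body inserts the whole batch in one request
    resp = requests.post("%s/%s/_bulk_docs" % (COUCH, DB), json={"docs": batch})
    resp.raise_for_status()

def bulk_load(docs):
    """Insert docs in batches, assigning our own sequential _ids instead of random UUIDs."""
    batch = []
    for seq, doc in enumerate(docs):
        doc["_id"] = "%010d" % seq   # 10-digit sequential key, per the developers' advice
        batch.append(doc)
        if len(batch) == BATCH_SIZE:
            flush(batch)
            batch = []
    if batch:
        flush(batch)   # don't forget the final partial batch
```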
The figure of 1,000 documents per batch was a wild guess, so I decided it was time to run some tests. With a simple shell script and GNU time, I generated the following plot of batch size vs. elapsed time:
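My actual harness was a shell script wrapped in GNU time; a rough Python equivalent of the measurement loop might look like the following. It reuses the hypothetical flush() helper from the sketch above, generate_test_docs() stands in for my real data set, and each batch size should really be run against a freshly created database so earlier inserts don’t skew the timing:

```python
import time

def timed_load(docs, batch_size):
    """Load docs in batches of batch_size; return elapsed wall-clock seconds."""
    start = time.time()
    batch = []
    for seq, doc in enumerate(docs):
        doc["_id"] = "%010d" % seq
        batch.append(doc)
        if len(batch) == batch_size:
            flush(batch)           # flush() as in the sketch above
            batch = []
    if batch:
        flush(batch)
    return time.time() - start

# sweep a few sample batch sizes and print a table suitable for plotting
for batch_size in (500, 1000, 2000, 3000, 4000, 6000):
    elapsed = timed_load(generate_test_docs(), batch_size)
    print(batch_size, elapsed)
```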
The more-than-exponential growth at the right of the graph is expected; the peak around 3,000 documents per batch, however, is not. I was so surprised by the results that I ran the test three times – and got consistent data each time. I’m currently running a denser set of tests between 1,000 and 6,000 documents per batch to characterize the peak a bit better.
Are there any CouchDB developers out there who can comment? You can find me on the #couchdb freenode channel as well.
I *just* started looking at CouchDB today. I notice you mention using a sequential number.
http://books.couchdb.org/relax/why-couchdb
There’s a section on “Auto Increment IDs”. They seem to have a justification for doing what they’ve done.
Hi Daniel, thanks for stopping by.
There’s definitely some conflicting advice out there for the newcomer to CouchDB. The point made in the book is an important one – namely, that you don’t want a single central auto-incrementing ID generator in a system built on eventual consistency, lest you reinvent a single point of failure (and a performance bottleneck).
HOWEVER.
It turns out that the completely random UUIDs CouchDB generates by default carry a serious performance penalty: they can produce close to worst-case performance in the underlying b+-trees used in the implementation, a point one of the developers (davisp) made to me on IRC.
The upshot is that an incrementing (or decrementing) sequence is useful even if it’s just the number of seconds since the UNIX epoch. Generating a timestamp at document insertion time is effectively free computationally, and certainly cheaper than generating 128 bits of pseudorandom data. But how should you deal with the points raised in the Why CouchDB chapter? One suggestion is to prefix or suffix an identifier unique to each instance of the database or application server (depending on your architecture – whether you’re using CouchDB-hosted applications or an external application server such as RoR or Django). With that scheme you can be sure you’ll never generate duplicate IDs, grabbing documents “near each other” (in an assumed time series) stays straightforward, and you avoid a singleton bottleneck around a centrally determined sequence.
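As a rough illustration, here’s a minimal sketch of that scheme in Python; the node identifier, the counter width, and the formatting are my own choices for this example, not anything CouchDB prescribes:

```python
import time
import itertools

NODE_ID = "app01"              # hypothetical identifier unique to this app server / DB instance
_counter = itertools.count()   # disambiguates documents created within the same second

def make_doc_id():
    """Time-ordered, node-unique document _id: seconds since epoch, counter, node suffix."""
    # assumes fewer than a million documents per second per node, so the
    # 6-digit counter never wraps within a single timestamp
    return "%010d-%06d-%s" % (int(time.time()), next(_counter) % 10**6, NODE_ID)

# Lexicographic order of these IDs matches creation order per node, which keeps
# inserts clustered at one edge of the b+-tree, and the suffix prevents
# collisions across nodes without any central coordination.
print(make_doc_id())
```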
In the short time I’ve been on #couchdb this has come up a few times. Perhaps I should start a new post on this specific topic, or post this on the CouchDB wiki…
Hmmm, it would be very interesting to run similar tests with MongoDB and compare the results.
Hi Dwight, thanks for stopping by.
I’ll take a look at MongoDB once this sprint is complete. Right now I’m not “shopping around” for DBs, just trying to figure out how to get the most performance out of the one I’ve chosen already.
Pingback: John P. Wood » CouchDB: Databases and Documents
Pingback: Massive CouchDB Brain Dump – Matt Woodward’s posterous « mnml
Pingback: CouchDB For A Real-Time Monitoring System | Stoat – Where?
Pingback: Importing Medline into CouchDB Redux
Hi Evelyn, I’m afraid the units are long gone, but feel free to shoot me an email or similar with any specific questions you might have.