CouchDB benchmarking followup

I wanted to post a public thanks to the great minds over in freenode #couchdb, including Paul Davis and Jan Lehnardt, for helping me learn more about CouchDB and for helping me investigate the performance issue I posted about last time. In fact, they both posted on Planet CouchDB with some thoughts about benchmarking in general.

I wanted to share my reason for benchmarking CouchDB in the manner I did. The fact is that my motivation was entirely personal: I was trying to make my own dataset and code as fast as possible. I’m not trying to generate pretty graphs for my management, for publication or for personal gain. This work comes out of having to load and reload fairly large datasets from scratch on a regular basis as part of my research methodology. I need a large dataset to get meaningful results out of the rest of my system, and I was not content to wait an hour or two each time my system bulk-loaded the content.

So the suggestions provided helped a lot: using sequential IDs instead of random UUIDs so that CouchDB can keep its B+-tree balanced, using bulk insert correctly, and redoing my code to use the official CouchDB interface instead of a hacked-up version built on raw PUTs and POSTs.
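
For the curious, here’s a minimal sketch of the reworked approach, assuming the Python couchdb package; the database name, the read_source_records() generator and the batch size are placeholders rather than my actual loader:

    import couchdb

    server = couchdb.Server("http://localhost:5984/")
    db = server.create("experiment")        # placeholder database name

    BATCH_SIZE = 1000                       # documents per bulk request
    batch = []
    for i, record in enumerate(read_source_records()):  # hypothetical generator over the raw data
        doc = dict(record)
        doc["_id"] = "%010d" % i            # sequential, zero-padded id instead of a random UUID
        batch.append(doc)
        if len(batch) == BATCH_SIZE:
            db.update(batch)                # one bulk insert round trip
            batch = []
    if batch:
        db.update(batch)                    # flush the final partial batch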

It turned out that the spike I was seeing (see my last post) disappeared when I randomized the order in which I stepped through batch sizes, so much so that three randomized runs showed almost no peaks. However, when I “scroll” through the batch sizes in increasing order, I still see the peak around a batch size of 3,000.

Trying the software on another platform (an AMD x64-3500 running Debian lenny with a 3ware hardware RAID array, as opposed to a 4-core Mac Pro with only a single local disk) revealed that the peak shifted to a different, much higher batch size.

Lesson? Always benchmark your own application against your own data, and tweak until you’re satisfied. Or, in the words immortalized at Soundtracks Recording Studio, New Trier High School, “Jiggle the thingy until it clicks.”

I suspected that Jan’s article ranting about benchmarking was at least in part stimulated by my experiences as shared over IRC. (I was wrong – see the comments.) Those experiences must have seemed somewhat inscrutable — why would someone care so much about something most database-backed applications will do rarely, as compared to reads, queries and updates? My application right now has a very particular set of criteria for performance (repeatable high-performance bulk load) that is not likely anyone’s standard use case but my own. Nor is it going to be worth the developers’ while to spend a bundle of effort optimizing this particular scenario.

That said, Jan is also calling for people to start compiling profiling suites that “…simulate different usage patterns of their system.” With this research I don’t have the weight of a corporation willing to agree on “…a set of benchmarks that objectively measure performance for easy comparison,” but I can at least contribute my use case for others to use. Paul Davis’ benchmark script looks quite a bit like what I’m doing, except that my document count is larger by a factor of 100 (~2 million here) and my per-document size is smaller by a factor of 25 (100-200 bytes here). Knowing the time it takes to insert and to run a basic map/reduce function on fairly similar data is a great place to start thinking about performance considerations in my application.

Oh, and knowing that the new code on the upcoming branches will get me a performance increase of at least 2x with no effort on my part is the icing on the cake.

Thanks, davisp. Thanks, jan.

CouchDB 0.9.0 bulk document post performance

Based on a tip from my university colleague Chris Teplovs, I started looking at CouchDB for some analytics code I’ve been working on for my graduate studies. My experimental data set is approximately 1.9 million documents, with an average document size of 256 bytes. Documents range in size from approximately 100 to 512 bytes. (FYI, this represents about a 2x increase in size from the raw data’s original form, prior to the extraction of desired metadata.)

I struggled for a while with performance problems in initial data load, feeling unenlightened by other posts, until I cornered a few of the developers and asked them for advice. Here’s what they suggested:

  1. Use bulk insert. This is the single most important thing you can do. This reduced the initial load time from ~8 hours to under an hour, and prevents the need to compact the database. Baseline time: 42 minutes, using 1,000 documents per batch.
  2. Don’t use the default _id assigned by CouchDB. It’s just a random ID and apparently really slows down the insert operation. Instead, create your own sequence; a 10-digit sequential number was recommended. This bought me a 3x speedup and a 6x reduction in database size. Baseline time: 12 minutes, again using 1,000 documents per batch. (A sketch of what one such bulk request looks like follows below.)
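
Here’s a minimal sketch of what one bulk request with sequential, zero-padded _ids looks like, using only the Python standard library; the host, database name and document contents are placeholders:

    import json
    import urllib.request

    # One batch of 1,000 small documents with 10-digit sequential _ids.
    docs = [{"_id": "%010d" % i, "value": "example document %d" % i} for i in range(1000)]

    req = urllib.request.Request(
        "http://localhost:5984/experiment/_bulk_docs",   # placeholder database name
        data=json.dumps({"docs": docs}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(resp.status)                               # 201 when the batch is accepted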

Using 1,000 documents per batch was a wild guess, so I decided it was time to run some tests. Using a simple shell script and GNU time, I generated the following plot of batch size vs. elapsed time:

[Plot: batch size vs. elapsed time, showing strange bulk insert performance under CouchDB 0.9.0]

The more-than-exponential growth at the right of the graph is expected; however, the peak around 3,000 documents per batch is not. I was so surprised by the results that I ran the test 3 times – and got consistent data. I’m currently running a denser set of tests between 1,000 and 6,000 documents per batch to characterize the peak a bit better.
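
If anyone wants to reproduce something like this sweep, here’s a rough Python equivalent of my shell-script-plus-GNU-time loop, assuming the Python couchdb package; generate_docs() and the database name are placeholders, and the batch sizes mirror the denser range mentioned above:

    import time
    import couchdb

    def timed_load(db, docs, batch_size):
        """Bulk-load docs in batches of batch_size and return the elapsed time."""
        start, batch = time.time(), []
        for doc in docs:
            batch.append(doc)
            if len(batch) == batch_size:
                db.update(batch)             # one bulk request per batch
                batch = []
        if batch:
            db.update(batch)
        return time.time() - start

    server = couchdb.Server("http://localhost:5984/")
    for batch_size in range(1000, 6001, 500):
        if "bench" in server:
            del server["bench"]              # start each run with a fresh database
        elapsed = timed_load(server.create("bench"), generate_docs(), batch_size)
        print(batch_size, elapsed)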

Are there any CouchDB developers out there who can comment? You can find me on the #couchdb freenode channel as well.

thing-a-day #6: recovering old digital performer projects

I had a terrible scare tonight. None of my Digital Performer (my DAW) projects from before I moved to my new Mac would open. It suddenly felt like I’d lost 5+ years’ worth of musical experimentation.

After panicking a bit, I did a whole lot of research and came up with this process. It’s slow, but it works. And it counts as a thing for today, since no one else has ever written it all up in one place before.

  1. Go to the Terminal and change directories to your project, for example: cd Waynemanor/DP\ Projects/Barracuda\ Project/ (If using a UNIX command prompt and escaping spaces is new to you, you may want to read through a tutorial first.)
  2. Use ls to find the files that are your project files. In this case, I have two: Barracuda and dys4ik 2006-02-28.
  3. Install the Apple OSX Developer Tools if you don’t already have them.
  4. Use the following command: SetFile -t PERF -c MOUP <project-file-name>, substituting each project name in turn. (A sketch for applying this to several files at once follows below.)
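
If you have more than a couple of project files, a small loop saves some typing. This is a hypothetical sketch in Python, assuming SetFile from the Developer Tools is on your PATH and using the file names from step 2 as placeholders:

    import subprocess

    project_files = ["Barracuda", "dys4ik 2006-02-28"]   # placeholder names from step 2
    for name in project_files:
        # Restore the Digital Performer type/creator codes on each project file.
        subprocess.run(["SetFile", "-t", "PERF", "-c", "MOUP", name], check=True)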

You’re not done yet: your audio files may also be corrupted. Try loading the project into DP. Still having problems? Getting a “Resource file was not found (-193)” error? Your DPII’s resource fork got lost, probably because you copied to a non-Mac system and back. Try these steps; some guesswork may be required.

  1. Download, install and run SoundHack.
  2. Use File > Open Any (cmd-A) to open your first sound file from the Audio Files directory.
  3. Use Hack > Header Change (cmd-H) to assign the correct sample rate, number of channels and encoding. Most DP projects use single-channel files. You just have to know what the sample rate is (usually 44.1 or 48 kHz) and how many bits deep the audio is (8, 16, 24 and 32 are most common). Press Save Info.
  4. Select File > Save a Copy (cmd-S). Be sure to set the same bit depth here as you used in the file’s header, or SoundHack will do a conversion! Save the copy somewhere else, like your desktop, under the same name as the original file to prevent confusion later.
  5. Navigate to where you saved the file and double-click to open in your favourite sound program. This could be QuickTime, AudioFinder, DSP-Quattro, or even DP itself. Play to make sure it sounds right. If not, you got the sample rate or number of bits wrong. Go back to SoundHack and try again.
  6. Painstakingly repeat this for each of your sound files. This could take a while.
  7. In the DP project folder, move your Audio Files folder aside. Place all of the newly patched files into a new folder called Audio Files.
  8. Try re-opening the project in DP. You should be able to pick up where you left off.
  9. Grab a cold one. You deserve it!

microblogging silliness

Today a friend linked me to this article in the NY Times about networks, the US presidential inauguration, and twitter. Here’s the key quote, emphasis his/mine:

Biz Stone, co-founder of Twitter, said the company was hoping to sidestep network hiccups. He is not expecting the same traffic spikes as during the election, when the site was flooded with as many as 10 messages a second, but says the service “will nevertheless be doubling our through-put capacity before Tuesday.”

Because of all the hype Twitter gets, I couldn’t believe the figure was so low, so I checked elsewhere. “Twitter had by one measure over 3 million accounts and, by another, well over 5 million visitors in September 2008.” Simple math says there are about 2.5 million seconds in a month, so 5 million impressions translates to one request every half a second. Presuming each query pulls a few pictures as well as text, 10 messages a second sounds about right based on published data. Let’s further assume each message is the size of an average Twitter page; mine came in at 34,100 bytes just now, or 341,000 bytes a second.

Bandwidth-wise, that’s 2.728 Mbit/s, or roughly the bandwidth of two T1s. My home DSL line can push 700 kbit/s. With 5-6 of them bonded together, and the appropriate back-end servers, I could run Twitter out of my basement.
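
In case anyone wants to check my arithmetic, here’s the back-of-the-envelope version in Python, using the figures above (all approximate):

    messages_per_second = 10
    bytes_per_page = 34100                                     # my Twitter page, measured just now
    bytes_per_second = messages_per_second * bytes_per_page    # 341,000 bytes/s
    mbits_per_second = bytes_per_second * 8 / 1e6              # ~2.728 Mbit/s
    t1_equivalents = mbits_per_second / 1.544                  # a T1 carries 1.544 Mbit/s
    print(mbits_per_second, t1_equivalents)                    # ~2.7 Mbit/s, just under 2 T1s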

It also isn’t very much if you compare it with other semi-synchronous messaging technologies like IRC, Jabber and IM servers, which have been capable of pushing more data per second for well over 15 years. I’m sure mainframes were doing similar amounts of data I/O 30 years ago.

The snarky nerd in me wants to smear Ruby on Rails, the technology platform on which Twitter relies, but others did that 2 years ago already (and yes, that link defends the technology, and makes the ridiculous assumption that you can’t build in scalability). I’m convinced it’s the incorrect application of a specific technology to a problem for which it is ill-suited. Perhaps Twitter never planned for its infrastructure to expand so greatly, but I find it laughable that we’re in 2009 and “important” services like Twitter can’t survive a “flood” of 10 messages a second. My friend agrees: “no i’m sure facebook is laughing at the 10 messages/a second ‘flood’ too.”

I’m also quite surprised that such a “popular” site, one that gets so much hype and marketing, really doesn’t get that much use. For comparison, here are the figures for the Top 10 sites. Being generous and assuming those 5 million hits for Twitter are all unique visitors, the largest sites see more than 25 times Twitter’s traffic. Facebook sees at least 10 times the number of unique visitors, and certainly pushes more content, what with all the pictures and rich media it has vs. twitter’s limited use of graphics (small avatars only). Of course, none of this even gets into what AWS/S3 and content accelerators push from a pure bandwidth standpoint.

Increasingly, I’m convinced microblogging sites are hiveminds for particular flavours of individual. Disingenuously: StumbleUpon/Digg are “OMG LOOK AT THIS LINK!” Twitter feels like “marketing marketing SEO yadda yadda bend to my will.” Plurk is “cheerleader YAY happy happy dancing banana.” BrightKite is “mommy! mommy! look at me now!” And yes, IRC is probably the Wild Wild West. Others I know have made similar comparisons between LiveJournal, mySpace, Facebook and Friendster. I’m not sure what predestines a technology for a specific sort of person, but the link is there. This might make a good research paper. ;)

plurk

Well, if you’ve visited my website in the last day, you’ll see I’ve added my plurk widget to the sidebar. I’m finding plurk altogether more motivational than twitter. A recent comparison of the two services was, I think, unfair to plurk, mostly due to overhyped twitter features that aren’t even working correctly half the time.

Even if Twitter solves their reliability concerns, plurk just feels more alive. It’s not a single thread with everyone’s crap mixed in (and hacks like #hashtags). It’s a threaded timeline: part web forum, part IM, and part IRC. And yes, a bit twitter, too. I like the fact that it’s Web 2.0 enough for my friends who won’t get on IRC (still my #1 place for synchronous chat) and that the web layout is attractive. The “get more points to get more features” thing is a pain, but I’d rather have that than some sort of “pay us $$ to get more features,” I think.

I moved over to plurk when some of my friends followed Leo Laporte over. I’m not a personality cult-er, but I do like keeping tabs on my friends. And I’m finding that plurk’s interface encourages more synchronous collaboration (read: chatting) than twitter ever did. In about 48h on the service I’ve managed 15 “plurks” (microblog entries) and 38 responses; I never hit that level of engagement with twitter.

Once they add some sort of SMS interface (I can’t get their IM interface to work…) it’ll be a sure-fire hit, I think.

So you twitter types out there: would you miss me from twitter if i semi-abandoned it?

joanbits 2008.03

As the snow finally melts away, I start to get busier and busier. Here’s what I’ve been up to the past week:

  • Work: supporting new projects as usual, writing position papers, working on newsletters, technical enablement, beta testing.
  • House: cooking every night (pasta from scratch, ramen from scratch, gourmet hamburgers…will try and post some pics soon), planning garden, staring at wall that needs repairing and trying to motivate to fix it, regular cleaning, indoor gardening…
  • School: Developing axiology, epistemology, methodology for design research approach. Gave guest lecture on internal Wikipedia politics.
  • Other: Dealing with horrendous migraine. Developing novel database application. Attended One Of A Kind show with friends and got fabulous clothing, jewelry, housewares. Reverse engineering synth. Looking at motorcycle today. Petting cat to deal with stress from everything else i listed.

thing-a-day 2008 over

Here’s a photo collage of the most photogenic things I made for this year’s thing-a-day. Edit: If you can’t view the video below, the original is here.

I’d recommend thing-a-day to anyone who is looking to push their comfort zone and prove to themselves that they can be creative, and can produce something a day. It was eye-opening for me.

wohbits 2007.12

some neat tidbits from the past 24h:

fixed lj-pic comments

If you’re a LiveJournal user, and you want your LJ picture to come up when you comment on my blog (and don’t have a Gravatar), enter your email address as <lj-user>@livejournal.com. This will display your LJ default picture instead of asking Gravatar for your picture/avatar.

If you’d rather use your real email address, you should head over to Gravatar and register, uploading your pic there.