Sometimes you need to load data that comes in the form of csv files into a Couchbase server. I am not talking about continuously monitoring a file, but about one-time data loading.
There are several different approaches to this task.
The problem is that most of them are external to Couchbase and require writing your own code (so there is no standard). I prefer to do this with an integrated tool like cbtransfer. cbtransfer is a utility program that ships with the Couchbase server installation, and its main purpose is transferring data between Couchbase instances. In addition, it can dump data to csv files or load data from them.
I used the following simple bash script to generate a sample csv file. Note that the first line must contain the column names, with no spaces; cbtransfer uses those headers to determine field names.
#!/bin/bash
MAXLINES=$1
COUNTER=1
echo $MAXLINES
echo "id,value,message" > sampledata.csv
while [ "$COUNTER" -le "$MAXLINES" ]
do
  NUMBER=$RANDOM
  echo $COUNTER,$NUMBER,"Line number" $COUNTER >> sampledata.csv
  let "COUNTER+=1"
done
This script accepts a numeric parameter that determines the number of lines in the output file. Each line has a line number that can also be used as an id, a value which is a random number, and a text message. It is not sophisticated or meaningful data, but it will do for a demonstration.
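Before loading a big file it can be worth sanity-checking the csv format. The snippet below is a condensed version of the generator loop above, producing just five data lines so you can eyeball the header and the row layout (the file name and line count here are only for illustration):

```shell
#!/bin/bash
# Condensed version of the generator above: a header line plus 5 data lines.
MAXLINES=5
echo "id,value,message" > sampledata.csv
COUNTER=1
while [ "$COUNTER" -le "$MAXLINES" ]
do
  echo "$COUNTER,$RANDOM,Line number $COUNTER" >> sampledata.csv
  COUNTER=$((COUNTER + 1))
done
head -2 sampledata.csv   # the header, then the first data row
wc -l < sampledata.csv   # 6 lines: 1 header + 5 data rows
```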
Here is the full command we will run. We will use only some of the cbtransfer options; see the full documentation for the rest. cbtransfer does not create buckets automatically, so you will have to create the target bucket (sampledata in our case) manually beforehand.
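If you prefer to create the bucket from the command line instead of the web console, couchbase-cli can do it. Treat this as a sketch: the exact flags vary between Couchbase versions, and the 256 MB RAM quota is an arbitrary choice for this demo.

```shell
# Sketch: create the target bucket before running cbtransfer.
# Flags differ slightly across Couchbase Server versions; check
# `couchbase-cli bucket-create --help` on your installation.
couchbase-cli bucket-create -c couch1.lan:8091 \
  -u Administrator -p manager \
  --bucket sampledata \
  --bucket-type couchbase \
  --bucket-ramsize 256
```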
cbtransfer -u Administrator -p manager -B sampledata -t 3 /opt/couchbase/sampledata.csv http://couch1.lan:8091
You can see I passed the program a username and password, the target bucket name (-B), the number of concurrent threads to use (-t), the source file, and the URL of the Couchbase server.
Let’s run it:
-bash-4.1$ cbtransfer -u Administrator -p manager -B sampledata /opt/couchbase/sampledata.csv http://couch1.lan:8091
  ...
  bucket: sampledata.csv, msgs transferred...
       :         total |       last |    per sec
  byte :         30000 |      30000 |    11021.4
done
Now we can take a look at the bucket and see that all 10K documents are there.
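One way to check the document count without opening the web console is the bucket's REST endpoint, which reports an itemCount under basicStats. The hostname and credentials below match the earlier examples; python is used only to parse the JSON response.

```shell
# Query the bucket details over the REST API and extract the item count.
# Requires a running cluster reachable at couch1.lan:8091.
curl -s -u Administrator:manager \
  http://couch1.lan:8091/pools/default/buckets/sampledata |
  python -c 'import sys, json; print(json.load(sys.stdin)["basicStats"]["itemCount"])'
```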
I think this is a cleaner way to load csv data, as you do not have to deal with code or reinvent the wheel.
One problem I did see is with performance. To check it, I tried to load a larger csv file with 30 million lines. First I ran cbtransfer with only one thread. This ran forever, loading about 2000 lines per second; after two hours it had completed only about half of the file, so I gave up and aborted the operation.
I tried again, this time with three threads. Now it completed loading all 30 million lines, but it took an hour and a half, at a load rate of about 5000 lines/second. I guess on stronger servers you could run more threads and complete the loading faster, but performance was a bit disappointing (I have not had the chance to compare it to other loading methods; they may be even slower).
Still, if you do not need to load large amounts of data very often, this is a solid option.