There are many tools and techniques out there for generating artificial load on Hadoop systems, and some are bundled with the Apache Hadoop distribution itself. However, when I tried them they did not run out of the box without problems, and they were sometimes more than I needed.
So I decided to create my own very simple load generating program using Hive.
First, I used this small bash script to generate a large data file:

#!/bin/bash
MAXLINES=$1
COUNTER=1
echo $MAXLINES
while [ "$COUNTER" -le "$MAXLINES" ]
do
  NUMBER=$RANDOM
  echo $COUNTER,$NUMBER,"Line number" $COUNTER >> sampledata.csv
  let "COUNTER+=1"
done
Running this script with a line count of 15,000,000 produces a nice beefy 489 MB file.
Our load generating program will run several concurrent threads. If we use Cloudera's Hadoop distribution then we are all set, because Cloudera uses PostgreSQL as its default Hive metastore. But if we use Apache Hadoop, as I did, the default Derby metastore is not a good fit: it cannot handle more than one client connection. Although my favorite RDBMS flavor is Oracle, just to keep things simple I used a local MySQL instance for the metastore:
yum install mysql-server
yum install mysql-connector-java

Now start MySQL:

/etc/init.d/mysqld start
cd $HIVE_HOME/lib
ln -s /usr/share/java/mysql-connector-java.jar mysql-connector-java.jar
Now add this to the hive-site.xml file:
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost/hcatalog?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
Before starting Hive we have to create the MySQL user and databases:
mysql -u root -p

CREATE USER 'hive'@'localhost' IDENTIFIED BY 'hive';
CREATE DATABASE hive;
GRANT ALL ON hive.* TO hive;
CREATE DATABASE hcatalog;
GRANT ALL ON hcatalog.* TO hive;
As I mentioned, my load generator will run several threads concurrently, and I don't want the storage to become a bottleneck. So I used the following statements to create three identical tables called sampledata1, sampledata2 and sampledata3, and loaded the contents of the file we created earlier into each of them (repeating both statements for sampledata2 and sampledata3):
create table sampledata1 (
  seqnum int,
  value bigint,
  message string
)
row format delimited
fields terminated by ','
stored as textfile;

LOAD DATA LOCAL INPATH '/home/hive/sampledata.csv' OVERWRITE INTO TABLE sampledata1;
I created a simple Java program that connects to Hive via JDBC.
I used the Cloudera JDBC driver, which can be downloaded here, but I guess it will work with the Apache Hive driver as well (I didn't test that, though). The driver comes bundled with a bunch of jar files, which you should place in a "lib" directory next to my jar file.
You can find the jar file here.
The program takes its parameters from a file with this format:
# This is the HiveLoader parameter file; all parameters should be in lowercase.
# Host name, mandatory, no default value.
host=xxxx.xxxx.net
# Port, optional, default value is 10000
port=10000
# Schema, optional, default value is "default"
schema=default
# Number of worker threads, optional, default is 4
workers=5
# User name, optional, default is "hive"
user=hive
# Password, optional, default is "hive"
password=hive
# Queue name, optional, default is "default"
queue=default
# Whether to show debug information, optional, default is "no"
debug=no
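The actual parsing code inside HiveLoader isn't shown here, but a minimal sketch of reading such a parameter file with java.util.Properties, applying the documented defaults, might look like this (the class and field names are my own, not necessarily those in the program):

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;

// Hypothetical sketch of reading HiveLoader-style parameters.
// The defaults mirror the ones documented in the parameter file above.
public class LoaderConfig {
    final String host;      // mandatory, no default
    final int port;
    final String schema;
    final int workers;
    final String user;
    final String password;
    final String queue;
    final boolean debug;

    LoaderConfig(Properties p) {
        host = p.getProperty("host");   // no default: must be present
        if (host == null) {
            throw new IllegalArgumentException("host is mandatory");
        }
        port = Integer.parseInt(p.getProperty("port", "10000"));
        schema = p.getProperty("schema", "default");
        workers = Integer.parseInt(p.getProperty("workers", "4"));
        user = p.getProperty("user", "hive");
        password = p.getProperty("password", "hive");
        queue = p.getProperty("queue", "default");
        debug = p.getProperty("debug", "no").equalsIgnoreCase("yes");
    }

    static LoaderConfig fromFile(String path) throws IOException {
        Properties p = new Properties();
        try (FileInputStream in = new FileInputStream(path)) {
            p.load(in);   // lines starting with '#' are treated as comments
        }
        return new LoaderConfig(p);
    }
}
```

One nice property of this format is that java.util.Properties handles the '#' comment lines and key=value pairs natively, so no custom parser is needed.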
When you run the program you should point it to the properties file, for example:
java -jar HiveLoader.jar -file c:\users\guy\hiveloader.props
The program spawns some workers. Each worker, in an infinite loop, picks a random table out of the three, picks a random query out of three, runs the selected query on the selected table, and then starts over. The only way to stop the program is Ctrl-C.
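The worker logic described above can be sketched roughly as follows. The three query templates here are my own illustrative stand-ins (the post doesn't list the actual queries), and in the real program the chosen statement would be executed through a JDBC Statement against the Hive connection:

```java
import java.util.Random;

// Rough sketch of one worker's loop: pick a random table and a random
// query template, combine them, and (in the real program) run the result
// over the JDBC connection, repeating until the process is killed.
// The query templates below are stand-ins, not the ones HiveLoader uses.
public class WorkerSketch {
    static final String[] TABLES = { "sampledata1", "sampledata2", "sampledata3" };
    static final String[] TEMPLATES = {
        "select count(*) from %s",
        "select max(value) from %s",
        "select avg(value) from %s"
    };

    // Build one random query: random table, random template.
    static String pickQuery(Random rnd) {
        String table = TABLES[rnd.nextInt(TABLES.length)];
        String template = TEMPLATES[rnd.nextInt(TEMPLATES.length)];
        return String.format(template, table);
    }

    public static void main(String[] args) {
        Random rnd = new Random();
        // In the real worker this would be an infinite loop calling
        // statement.executeQuery(sql); here we just print a few samples.
        for (int i = 0; i < 3; i++) {
            System.out.println(pickQuery(rnd));
        }
    }
}
```

Because every query is a full scan or aggregation over a 15-million-row table, each iteration turns into a MapReduce job, which is exactly the kind of sustained load the cluster should see.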
It is very simple and basic, and I can think of many additional features and parameters, but I wanted to keep it fast and simple, and I do not need more than that right now.
This enables me to put some load on my cluster and later explore its behavior with different settings.