Kudu, named after the large African antelope, is a new distributed storage engine that was developed by Cloudera, donated to the Apache Software Foundation, and recently promoted to a top-level project.
Kudu is a columnar storage system consisting of a master server, which stores metadata, and tablet servers, which hold the tablets (data is organized in tables, and each table is divided into tablets, analogous to blocks in other storage systems). Each table also has a primary key, which can be a compound key (but secondary indexes are not supported).
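For illustration, a compound primary key is declared by listing several columns in the kudu.key_columns table property, using the same Impala DDL syntax shown later in this post (the table and column names here are made up):

```sql
-- Hypothetical table with a compound primary key (event_date, event_id)
CREATE TABLE events_k (
  event_date string,
  event_id bigint,
  payload string
)
DISTRIBUTE BY HASH INTO 5 BUCKETS
TBLPROPERTIES(
  'storage_handler' = 'com.cloudera.kudu.hive.KuduStorageHandler',
  'kudu.table_name' = 'events_k',
  'kudu.master_addresses' = 'cloudera1',
  'kudu.key_columns' = 'event_date, event_id'
);
```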
Columnar storage can speed up data retrieval, since only the needed columns are read (unlike row-oriented storage, where whole rows are retrieved). Additionally, it allows better compression ratios, because each column contains data of a single type.
One last note before we begin: as of August 2016, Kudu is still in beta and considered experimental, and Cloudera does not offer commercial support for it at this point.
We will not cover the Kudu installation procedure, as it is well documented here. Just follow the instructions to install Kudu with Cloudera Manager using parcels, which is the easiest way.
Currently Kudu is not fully integrated into the Hadoop ecosystem. It is tightly integrated with Impala, has weaker integration with Spark, and some integration with Hive, but things are dynamic and work is ongoing to improve this.
For our tests we will use a sample data file with 30 million lines. I have created three different tables based on this file.
A regular Hive table:
create table sampledata_1 ( seqnum int, value bigint, message string ) row format delimited fields terminated by ',' stored as textfile;
A parquet based table. Parquet is a columnar file format and it will be interesting to compare it to Kudu:
create table sampledata_p ( seqnum int, value bigint, message string ) row format delimited fields terminated by ',' stored as parquet;
The first two tables reside on HDFS, but the last one resides on Kudu. It also has a primary key on the “seqnum” column:
CREATE TABLE sampledata_k (
  seqnum int,
  value bigint,
  message string
)
DISTRIBUTE BY HASH INTO 5 BUCKETS
TBLPROPERTIES(
  'storage_handler' = 'com.cloudera.kudu.hive.KuduStorageHandler',
  'kudu.table_name' = 'sampledata_k',
  'kudu.master_addresses' = 'cloudera1',
  'kudu.key_columns' = 'seqnum'
);
If you are using regular Impala, this statement will fail with an error like this:
ERROR: AnalysisException: Syntax error in line 5: distribute by hash (value) into 5 buckets ^ Encountered: IDENTIFIER Expected: CACHED, COMMENT, LOCATION, PARTITIONED, PRODUCED, ROW, STORED, TBLPROPERTIES, UNCACHED, WITH CAUSED BY: Exception: Syntax error
This is because Kudu is not yet fully integrated with Impala. You can create the table without the DISTRIBUTE BY clause, which is Kudu specific, but distributing the table is a good idea. To do so we need to install a special version of Impala called Impala-Kudu. It can coexist with regular Impala, and the installation procedure is here. With this version of Impala I was able to create the table with the DISTRIBUTE BY option. To load data into the table I just used an insert-select from the text table sampledata_1.
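The load step was a plain insert-select; based on the table definitions above, it would look something like this:

```sql
-- Copy all 30 million rows from the text table into the Kudu table
INSERT INTO sampledata_k
SELECT seqnum, value, message
FROM sampledata_1;
```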
So I have three types of tables now: a regular text-based table named sampledata_1, a Parquet table with snappy compression named sampledata_p, and a Kudu table named sampledata_k.
First, I ran a select avg(value) query on each of the tables. Impala is blazing fast as it is, so the difference is not dramatic, but it exists. In this kind of workload, Kudu is just a bit ahead of the text table, and Parquet outperforms both. I would expect Parquet and Kudu to be closer, since they use similar technology.
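For reference, the aggregation query was of this form, run once per table (shown here for the text table):

```sql
-- Full-scan aggregation: reads only the "value" column on columnar storage
SELECT AVG(value) FROM sampledata_1;
```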
Then, I wanted to examine Kudu’s RDBMS-style indexing, which is unique for this kind of system. So I ran a query that fetches a single row by the table’s primary key. This query should maximize the impact of Kudu’s index:
select * from sampledata_1 where seqnum=29400387;
As expected, this is where Kudu really shines and outperforms both other competitors by far.
You can see the results in the table and chart below:
Kudu execution times (seconds)
| | Specific row and column query | Average query |
|---|---|---|
| Parquet with snappy compression | 1.1 | 1.32 |
Kudu is a project early in its life with a lot of room for improvement. It is way ahead of the competition when using its index to select particular rows, but this kind of query is not a typical analytics workload. When scanning and aggregating large bulks of data, it doesn’t seem to perform better than Parquet-based tables. It is not general-purpose storage like HDFS and you cannot upload files to it; at this time it can be used only for Impala/Hive tables.
I wish it had better integration with the Hadoop ecosystem (Hive, Spark). Currently only Impala is fully supported, and you need a special version of Impala at that. Some features, such as data compression, are mentioned in the documentation but are not yet supported in practice. I hope these will be added in later releases (it is still in beta).
There is unofficial support for Hive on Kudu in a nice project called the Hive-Kudu handler. It is still at an early stage, so I do not know if it realizes the full performance potential, but it does enable accessing Kudu from Hive. I will examine it more deeply in a later post. Impala is very fast even with text-based tables, and the difference is a matter of a second or two. It will be interesting to see how it performs with Hive, where the difference should be more visible.