Need for speed II: Using and testing Alluxio (formerly Tachyon)

In the last post we saw how to download and install Alluxio, along with some very basic tests. This time we will see it in action.

Alluxio by itself is just a storage system; it needs a computation framework to run applications on top of it. According to the documentation, it works with Spark, MapReduce and Flink. I will use Spark to test it.

Our data file will be my standard sampledata.csv file, which contains 15 million lines. To easily process a CSV file, I used the nice spark-csv library from Databricks, which enables running SQL queries against CSV files. If you want to download the compiled JAR file I used, it's here.

You will also need alluxio-core-client-1.1.1-jar-with-dependencies.jar

This file usually comes with the Alluxio installation and should be in $ALLUXIO_HOME/core/client/target. If it is not there, you can download it from the Alluxio downloads page (choose the Alluxio Spark client).

Make sure those two jar files are on your server and make a note of where you put them.

If you are on Apache Spark, go to $SPARK_HOME/conf and add this to the core-site.xml file (create it if it is missing); this is the client property the Alluxio documentation specifies:

<configuration>
  <property>
    <name>fs.alluxio.impl</name>
    <value>alluxio.hadoop.FileSystem</value>
  </property>
</configuration>

If you are using Cloudera Hadoop then the configuration directory should be /opt/cloudera/parcels/CDH-5.6.0-1.cdh5.6.0.p0.45/etc/spark/conf.dist

The Alluxio documentation suggests adding the Alluxio client jar to the Spark classpath by adding the line below to spark-env.sh, or by just exporting it in your environment (change the locations according to your setup):

export SPARK_CLASSPATH=$SPARK_CLASSPATH:/usr/bin/Hadoop/alluxio-1.1.0/core/client/target/alluxio-core-client-1.1.0-jar-with-dependencies.jar

This works, but in Cloudera, spark-shell complains that SPARK_CLASSPATH is deprecated. You can safely ignore this message and continue. I managed to get rid of the deprecation message by editing /opt/cloudera/parcels/CDH-5.6.0-1.cdh5.6.0.p0.45/etc/spark/conf.dist/spark-defaults.conf instead and adding this to it:

spark.driver.extraClassPath /usr/alluxio/alluxio-1.1.1/core/client/target/alluxio-core-client-1.1.1-jar-with-dependencies.jar
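
To verify that the client jar actually ended up on the driver classpath, you can run a quick check from spark-shell (a sketch of mine, not from the original post; alluxio.hadoop.FileSystem is the class that the fs.alluxio.impl property above points at):

// throws ClassNotFoundException if the Alluxio client jar is missing
Class.forName("alluxio.hadoop.FileSystem")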

We will continue where we left off last time, with the sampledata.csv file we uploaded to Alluxio. You can see it here:

[guy@cloudera1 ~]$ alluxio fs ls /tmp/guy

488.94MB  07-17-2016 12:53:24:984  Not In Memory  /tmp/guy/sampledata.csv

You can see that the file is not in memory, so Alluxio cannot serve it from memory. Fortunately, we persisted it to the underFS, so we just have to load it back into memory:

[guy@cloudera1 ~]$ alluxio fs load /tmp/guy/sampledata.csv

/tmp/guy/sampledata.csv loaded

[guy@cloudera1 ~]$ alluxio fs ls /tmp/guy

488.94MB  07-17-2016 12:53:24:984  In Memory      /tmp/guy/sampledata.csv

Now the file is in memory and ready to use.
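
At this point you can also run a quick sanity check from spark-shell to confirm that Spark reads through Alluxio (a sketch, assuming the classpath setup above; 19998 is the default Alluxio master port):

// count the lines of the file through the alluxio:// URI scheme
sc.textFile("alluxio://cloudera1:19998/tmp/guy/sampledata.csv").count()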

You can see the format of the file, with the random “value” column, here. I will use the spark-shell to calculate the average of the “value” column. First we will run it directly from HDFS, without Alluxio.
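
In case that link is unavailable: each row holds a line number, a random integer "value" and a text message, matching the schema defined in the script below. A few hypothetical rows to illustrate the format (not actual data from the file):

1,5437,some message text
2,912,some message text
3,77001,some message text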

I will use this small Scala script:

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}
import java.util.Calendar

// define the schema of the CSV file: line number, random value, message
val mySchema = StructType(Array(
  StructField("line", IntegerType, true),
  StructField("value", IntegerType, true),
  StructField("message", StringType, true)))

// load the file through spark-csv and register it as a temp table
val sampledata1 = sqlContext.read.format("com.databricks.spark.csv").option("header", "false").schema(mySchema).load("hdfs://cloudera1:8020/alluxio/tmp/guy/sampledata.csv")
sampledata1.registerTempTable("sample")

// the query is lazy; show() triggers the actual computation,
// and the timestamps around it measure the run time
val average = sqlContext.sql("select avg(value) from sample")
Calendar.getInstance().getTime
average.show()
Calendar.getInstance().getTime



Now we will launch spark-shell with the spark-csv package:

spark-shell --packages com.databricks:spark-csv_2.11:1.4.0

and run our program:

scala> :load /var/lib/spark/test.scala

The result looks like this:

scala> :load /var/lib/spark/test.scala

Loading /var/lib/spark/test.scala...

import org.apache.spark.sql.SQLContext

import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

import java.util.Calendar

mySchema: org.apache.spark.sql.types.StructType = StructType(StructField(line,IntegerType,true), StructField(value,IntegerType,true), StructField(message,StringType,true))

sampledata1: org.apache.spark.sql.DataFrame = [line: int, value: int, message: string]

average: org.apache.spark.sql.DataFrame = [_c0: double]

res5: java.util.Date = Sun Jul 24 13:38:21 IDT 2016


+-------------+
|          _c0|
+-------------+
|          ...|
+-------------+


Now we can try the same thing, only accessing the file via Alluxio. Change the file location in test.scala file from hdfs://cloudera1:8020/alluxio/tmp/guy/sampledata.csv to alluxio://cloudera1:19998/tmp/guy/sampledata.csv and run the program again.
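
In other words, only the load path changes; the modified line would look like this (same schema and options as before):

val sampledata1 = sqlContext.read.format("com.databricks.spark.csv").option("header", "false").schema(mySchema).load("alluxio://cloudera1:19998/tmp/guy/sampledata.csv")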

Running the same task, once with Alluxio and once without it, I immediately noticed that the runtime difference between the Alluxio run and the non-Alluxio run was not dramatic. Alluxio claims speedups of up to 300x, and what I saw was very far from that.

So I needed some more test runs, just to smooth any spikes and anomalies.

I decided to take my 15 Million lines file and run the program on it 10 times with Alluxio and then 10 times without Alluxio. Then I wanted to check whether the difference would be more evident with larger files, so I created a 30 Million lines CSV file of the same format and, again, ran 10 tests with Alluxio and 10 tests without it.
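
For reference, this is roughly how each run can be timed from within spark-shell (a sketch; the timeRun helper is mine, not part of the original test script):

// hypothetical helper: run a block and report wall-clock seconds
def timeRun[T](body: => T): (T, Double) = {
  val start = System.nanoTime()
  val result = body
  (result, (System.nanoTime() - start) / 1e9)
}

// time one full query, including the show() that triggers execution
val (_, secs) = timeRun { sqlContext.sql("select avg(value) from sample").show() }
println(f"Run took $secs%.1f seconds")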

The results are shown in the two charts below (the x-axis is the test number, the y-axis is run time in seconds):

15 Million lines file:

30 Million lines file:


This table summarizes the results (average run time in seconds):

Difference between Alluxio and non-Alluxio runs

                   15 Million lines   30 Million lines
With Alluxio       70.6               138.8
Without Alluxio    92.7               141
% Difference       23                 7

You can probably see that for some reason some runs took considerably longer than the average (like runs 9 and 10 of the 15 Million lines file).

So, trying to be fair, I took the last two runs out and recalculated the averages: 67.6 seconds with Alluxio and 83 seconds without it, which makes an 18.5% difference ((83 - 67.6) / 83 ≈ 18.5%).


I should say this with great caution: my tests do not represent all workloads and use cases. Some well-known enterprises use Alluxio in production, so I guess they see a benefit in using it.

I started testing Alluxio with great excitement, expecting to find the holy grail of data processing (Alluxio claims to be much faster than Spark's own in-memory caching).

At least in my tests, Alluxio did not live up to expectations. It does make a difference, but the difference is not consistent and not as big as expected and claimed. Still, a 7-18 percent performance improvement is nice, and you should test how much your own application can benefit from using Alluxio.
