Need for speed: Introducing Alluxio (formerly Tachyon)

Alluxio is a very interesting and promising project (until a few versions ago it was called Tachyon). It is a caching system that acts as a distributed storage layer kept entirely in memory, backed by an under storage system such as HDFS or Amazon S3 (it supports many others). Its API resembles the HDFS API, and it claims to be even faster than Spark's in-memory processing. It runs as its own cluster and does not need Hadoop (unless it is configured to use HDFS as an underFS).

So I started testing it and playing with it, first on my laptop as a local installation and later on my lab cluster.

There are two annoyances I found while working with it (this is my personal opinion):

  • Although some enterprises use it in production, the project is still early in its life. It is not well documented: there is no real user manual, just short tutorials. The install base is probably not wide yet, so if you hit a problem or a misunderstanding, a Google search will not turn up much. There is also almost no third-party documentation (such as books). You are pretty much on your own here.
  • It is not very well integrated with the Hadoop ecosystem (and does not claim to be). It works well with Spark, but its integration with Hive is very weak.

In this post we will install and configure Alluxio. Installing, configuring and testing would make too long a post, so I will show its usage and testing in the next post.


Download Alluxio from the download section of the Alluxio site.

Although the documentation mentions the source code in many places, I did not find where to download the sources from. The download section of the Alluxio site offers precompiled versions for many Hadoop versions. I first tried the Hadoop 2.6 version, but the one I used for this post is the CDH5 version.

The following steps need to be done on each node participating in the Alluxio cluster:

Create an alluxio user that can ssh to itself without a password. We also need to add user alluxio to the sudoers file so it can sudo without requiring a password.
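For example, a minimal sketch on a standard Linux box (commands may differ on your distribution):

sudo useradd -m alluxio
sudo su - alluxio
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys
ssh localhost true    # should return without asking for a password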

Logged in as root, run visudo and add this line (the standard passwordless-sudo entry):
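alluxio ALL=(ALL) NOPASSWD: ALL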


Copy the alluxio tar file to each node and extract it.
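For example, assuming the CDH5 build of 1.1.1 and /usr/alluxio as the target directory (adjust the tarball name to the build you downloaded):

sudo mkdir -p /usr/alluxio
sudo tar -xzf alluxio-1.1.1-cdh5-bin.tar.gz -C /usr/alluxio
sudo chown -R alluxio: /usr/alluxio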

From now on we will call the directory where we extracted Alluxio ALLUXIO_HOME.

Now go to $ALLUXIO_HOME and run:

./bin/alluxio bootstrap-conf cloudera1

Replace “cloudera1” with the name of the server that will be the master.

This will create a file called alluxio-env.sh in $ALLUXIO_HOME/conf.

This file holds the default values of some Alluxio parameters; these values are set by the file only if they are not already set in the environment. We will have to change them, since the default worker memory size is 2/3 of the host memory (too high for a server that is not dedicated to Alluxio) and the default underfs is the local filesystem (the underfs is the underlying “real” filesystem).
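The file does this with the usual shell default-value pattern, so anything you export in the environment wins. Roughly like this (a sketch of the mechanism with illustrative values, not the exact generated contents):

export ALLUXIO_MASTER_HOSTNAME=${ALLUXIO_MASTER_HOSTNAME:-"cloudera1"}
export ALLUXIO_WORKER_MEMORY_SIZE=${ALLUXIO_WORKER_MEMORY_SIZE:-"10666MB"}
export ALLUXIO_UNDERFS_ADDRESS=${ALLUXIO_UNDERFS_ADDRESS:-"${ALLUXIO_HOME}/underFSStorage"}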

So now we have to edit .profile or .bash_profile on all participating nodes and set these variables according to each server's memory size and workload, and according to the chosen underfs.

This is how it looks in my .bash_profile:


export ALLUXIO_UNDERFS_ADDRESS=hdfs://cloudera1:8020/alluxio

If we are using HDFS as the underFS, then ALLUXIO_UNDERFS_ADDRESS should point to the NameNode.
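The worker memory size can be overridden the same way; 4GB here is just an illustrative value, so size it according to what the host can spare:

export ALLUXIO_WORKER_MEMORY_SIZE=4GB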

We also need to create the alluxio directory in HDFS before starting Alluxio:

[hdfs@cloudera1 ~]$ hdfs dfs -mkdir /alluxio

[hdfs@cloudera1 ~]$ hdfs dfs -chown alluxio:supergroup /alluxio

It’s also a good idea to set ALLUXIO_HOME and add the bin directory to the PATH:

export ALLUXIO_HOME=/usr/alluxio/alluxio-1.1.1
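export PATH=$PATH:$ALLUXIO_HOME/bin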


Formatting and starting Alluxio:

On the master server, log in as alluxio and run “alluxio format”:

[alluxio@cloudera1 ~]$ alluxio format

/usr/alluxio/alluxio-1.1.1/bin/ line 43: /usr/alluxio/alluxio-1.1.1/logs/task.log: No such file or directory

Waiting for MASTER tasks to finish...

/usr/alluxio/alluxio-1.1.1/bin/ line 44: /usr/alluxio/alluxio-1.1.1/logs/task.log: No such file or directory

All MASTER tasks finished, please analyze the log at /usr/alluxio/alluxio-1.1.1/logs/task.log.

Formatting Alluxio Master @ cloudera1

Now we can start the master:

$ALLUXIO_HOME/bin/alluxio-start.sh all

Then on all workers run:


$ALLUXIO_HOME/bin/alluxio-start.sh worker
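To verify that the daemons came up, a quick check on each node (assuming a JDK's jps is on the PATH) is to look for the AlluxioMaster and AlluxioWorker processes:

jps | grep Alluxio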

The workers communicate with the master to form a cluster. You can go to http://<master host>:19999 to see the admin UI. Here is a general screenshot:

[Screenshot: the Alluxio admin UI overview page]

And here is a page showing the active workers and their capacity:

[Screenshot: the active workers and their capacity]

Running some tests:


We can now run some built-in tests by adding $ALLUXIO_HOME/bin to the PATH and just typing:

alluxio runTests

This will run a set of tests, and they should all pass. The output is quite long, so I will show only a small part of it here:


2016-07-14 15:47:30,977 INFO type - Connecting to remote worker @ cloudera6.lan/

2016-07-14 15:47:31,044 INFO type - Connected to remote machine cloudera6.lan/

2016-07-14 15:47:31,143 INFO type - status: SUCCESS from remote machine cloudera6.lan/ received

2016-07-14 15:47:31,185 INFO type - Connecting to remote worker @ cloudera6.lan/

2016-07-14 15:47:31,207 INFO type - Connecting to remote worker @ cloudera6.lan/

2016-07-14 15:47:31,225 INFO type - Connected to remote machine cloudera6.lan/

2016-07-14 15:47:31,233 INFO type - Data 151062052864 from remote machine cloudera6.lan/ received

Passed the test!

2016-07-14 15:47:31,322 INFO type - writeFile to file /default_tests_files/Basic_CACHE_PROMOTE_THROUGH took 75 ms.

2016-07-14 15:47:31,396 INFO type - Connecting to remote worker @ cloudera4.lan/

2016-07-14 15:47:31,500 INFO type - Connected to remote machine cloudera4.lan/

2016-07-14 15:47:31,590 INFO type - status: SUCCESS from remote machine cloudera4.lan/ received

2016-07-14 15:47:31,620 INFO type - readFile file /default_tests_files/Basic_CACHE_PROMOTE_THROUGH took 298 ms.

Passed the test!

Now we can work with Alluxio in pretty much the same way we work with HDFS. Let’s create a directory and load a CSV file into it:

[alluxio@cloudera1 ~]$ alluxio fs mkdir /alluxio/guy

Successfully created directory /alluxio/guy

[alluxio@cloudera1 ~]$ alluxio fs ls /alluxio

0.00B 07-14-2016 16:05:24:325 /alluxio/guy

[alluxio@cloudera1 ~]$ ls

[alluxio@cloudera1 ~]$ alluxio fs copyFromLocal ./sampledata.csv /tmp/guy

Copied ./sampledata.csv to /tmp/guy

[alluxio@cloudera1 ~]$ alluxio fs ls /tmp/guy

488.94MB  07-17-2016 12:53:24:984 In Memory /tmp/guy/sampledata.csv

The file is now in memory only and it will be gone if we restart Alluxio. If we want it written to the underFS we should use the persist option:

[alluxio@cloudera1 ~]$ alluxio fs persist /tmp/guy/sampledata.csv

persisted file /tmp/guy/sampledata.csv with size 512688250

Now we can also see it directly in the underlying HDFS filesystem:

[alluxio@cloudera1 ~]$ hdfs dfs -ls /

Found 3 items

drwxr-xr-x   - alluxio supergroup          0 2016-07-17 12:55 /alluxio

drwxrwxrwx   - hdfs    supergroup          0 2016-06-26 23:04 /tmp

drwxr-xr-x   - hdfs    supergroup          0 2016-03-23 10:56 /user

[alluxio@cloudera1 ~]$ hdfs dfs -ls /alluxio

Found 2 items

drwxrwxrwx   - alluxio supergroup          0 2016-07-14 15:47 /alluxio/default_tests_files

drwxrwxrwx   - alluxio supergroup          0 2016-07-17 12:55 /alluxio/tmp

[alluxio@cloudera1 ~]$ hdfs dfs -ls /alluxio/tmp/guy

Found 1 items

-rwxrwxrwx   3 alluxio supergroup  512688250 2016-07-17 12:55 /alluxio/tmp/guy/sampledata.csv

You can see that the base directory for Alluxio is /alluxio and the sampledata.csv file we persisted earlier is there.

Just one final note: when I tested Alluxio on Apache Hadoop (not Cloudera), I initially encountered a “connection refused” message when I tried to persist files; Alluxio was not able to communicate with the underFS. I solved this by adding this property to core-site.xml:



<property>
  <name>fs.defaultFS</name>
  <value>hdfs://<host name>:9000</value>
</property>


In the next post, running and testing Alluxio, we will see how to actually run programs on Alluxio and whether it is really that fast.
