Alluxio is a very interesting and promising project (until a few versions ago it was called Tachyon). It is a caching system that acts as a distributed storage layer kept entirely in memory, backed by an under storage system such as HDFS or Amazon S3 (it supports quite a few). Its API resembles the HDFS API, and it claims to be even faster than Spark at in-memory processing. It runs its own cluster and does not need Hadoop (unless configured to use HDFS as an underFS).
So I started testing it and playing with it, first on my laptop as a local installation and later on my lab cluster.
There are two annoyances I found while working with it (this is my personal opinion):
- Although some enterprises use it in production, the project is still early in its life. It is not well documented: there is no real user manual, just short tutorials. The install base is probably small, so if you run into a problem or misunderstanding, a Google search will not turn up much. There is also almost no third-party documentation (such as books). You are pretty much on your own here.
- It is not well integrated with the Hadoop ecosystem (and does not claim to be). It works well with Spark, but its integration with Hive is very weak.
In this post we will install and configure Alluxio. Installing, configuring and testing is too long for one post, so I will show its usage and testing in the next post.
Download alluxio from here.
Although the documentation mentions source code in many places, I did not find where to download the sources from. The download section of the Alluxio site offers precompiled builds for many Hadoop versions. I first tried the Hadoop 2.6 version, but the one I used for this post is the CDH5 build.
The following steps need to be done on each node participating in the Alluxio cluster:
Create an alluxio user that can ssh to itself without a password. We also need to add user alluxio to the sudoers file so it can sudo without requiring a password.
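On a typical Linux box, the user setup above might look like the following sketch (run as root; paths and key type are my assumptions, adjust to taste):

```shell
# Create the alluxio user with a home directory
useradd -m alluxio

# As the alluxio user: generate a key pair and authorize it,
# so alluxio can ssh to itself without a password
su - alluxio -c 'ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa'
su - alluxio -c 'cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys'
su - alluxio -c 'chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys'
```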
Logged in as root, run visudo and add this line:
alluxio ALL=(ALL) NOPASSWD: ALL
Copy the alluxio tar file to each node and extract it.
From now on we will call the directory where we extracted Alluxio ALLUXIO_HOME.
Now go to $ALLUXIO_HOME and run:
./bin/alluxio bootstrap-conf cloudera1
Replace the “cloudera1” with the name of the server that will be the master.
This will create a file called alluxio-env.sh in $ALLUXIO_HOME/conf.
This file holds the default values of some Alluxio parameters; these values are used only if they are not already set in the environment. We will have to change them, since the default worker memory size is 2/3 of host memory (too high for a server that is not dedicated to Alluxio) and the default underFS is the local filesystem (the underFS is the underlying “real” filesystem).
So now we have to edit .profile or .bash_profile on all participating nodes and set those settings according to server memory size and workload and according to the chosen underfs.
This is how it looks in my .bash_profile:
export ALLUXIO_WORKER_MEMORY_SIZE=3000MB
export ALLUXIO_UNDERFS_ADDRESS=hdfs://cloudera1:8020/alluxio
If we are using HDFS as an underFS, then ALLUXIO_UNDERFS_ADDRESS should point to the NameNode.
We also need to create the alluxio directory in HDFS before starting Alluxio:
$ hdfs dfs -mkdir /alluxio
$ hdfs dfs -chown alluxio:supergroup /alluxio
It’s also a good idea to set ALLUXIO_HOME and add the bin directory to PATH:
export ALLUXIO_HOME=/usr/alluxio/alluxio-1.1.1
export PATH=$PATH:$ALLUXIO_HOME/bin
Formatting and starting Alluxio:
On the master server, log in as alluxio and run “alluxio format”:
$ alluxio format
/usr/alluxio/alluxio-1.1.1/bin/alluxio-workers.sh: line 43: /usr/alluxio/alluxio-1.1.1/logs/task.log: No such file or directory
Waiting for MASTER tasks to finish...
/usr/alluxio/alluxio-1.1.1/bin/alluxio-workers.sh: line 44: /usr/alluxio/alluxio-1.1.1/logs/task.log: No such file or directory
All MASTER tasks finished, please analyze the log at /usr/alluxio/alluxio-1.1.1/logs/task.log.
Formatting Alluxio Master @ cloudera1
Now we can start the master:
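In Alluxio 1.x the master is started with the alluxio-start.sh script; run from $ALLUXIO_HOME as the alluxio user it looks roughly like this:

```shell
# Start the Alluxio master process on this node
./bin/alluxio-start.sh master
```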
Then on all workers run:
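The same script starts a worker; the Mount argument (re)mounts the worker's RAMFS, which is where the passwordless sudo we configured earlier comes in (SudoMount is the variant that explicitly mounts via sudo):

```shell
# Start a worker and mount its in-memory storage
./bin/alluxio-start.sh worker Mount
```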
The workers communicate with the master to form a cluster. You can go to http://<master host>:19999 to see the admin UI. Here is a general screenshot:
And here is a page showing the active workers and their capacity:
Running some tests:
We can now run some built-in tests by adding $ALLUXIO_HOME/bin to the path and just typing:
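The built-in test suite is invoked with the runTests sub-command:

```shell
# Runs Alluxio's bundled read/write tests against the cluster
alluxio runTests
```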
This will run a set of tests, and they should all pass. The output is quite long, so I will show only a small part of it here:
runTest BasicNonByteBuffer CACHE_PROMOTE CACHE_THROUGH
2016-07-14 15:47:30,977 INFO type (BlockWorkerClient.java:connectOperation) - Connecting to remote worker @ cloudera6.lan/192.168.1.136:29998
2016-07-14 15:47:31,044 INFO type (NettyRemoteBlockWriter.java:write) - Connected to remote machine cloudera6.lan/192.168.1.136:29999
2016-07-14 15:47:31,143 INFO type (NettyRemoteBlockWriter.java:write) - status: SUCCESS from remote machine cloudera6.lan/192.168.1.136:29999 received
2016-07-14 15:47:31,185 INFO type (BlockWorkerClient.java:connectOperation) - Connecting to remote worker @ cloudera6.lan/192.168.1.136:29998
2016-07-14 15:47:31,207 INFO type (BlockWorkerClient.java:connectOperation) - Connecting to remote worker @ cloudera6.lan/192.168.1.136:29998
2016-07-14 15:47:31,225 INFO type (NettyRemoteBlockReader.java:readRemoteBlock) - Connected to remote machine cloudera6.lan/192.168.1.136:29999
2016-07-14 15:47:31,233 INFO type (NettyRemoteBlockReader.java:readRemoteBlock) - Data 151062052864 from remote machine cloudera6.lan/192.168.1.136:29999 received
Passed the test!
runTest Basic CACHE_PROMOTE THROUGH
2016-07-14 15:47:31,322 INFO type (BasicOperations.java:writeFile) - writeFile to file /default_tests_files/Basic_CACHE_PROMOTE_THROUGH took 75 ms.
2016-07-14 15:47:31,396 INFO type (BlockWorkerClient.java:connectOperation) - Connecting to remote worker @ cloudera4.lan/192.168.1.241:29998
2016-07-14 15:47:31,500 INFO type (NettyRemoteBlockWriter.java:write) - Connected to remote machine cloudera4.lan/192.168.1.241:29999
2016-07-14 15:47:31,590 INFO type (NettyRemoteBlockWriter.java:write) - status: SUCCESS from remote machine cloudera4.lan/192.168.1.241:29999 received
2016-07-14 15:47:31,620 INFO type (BasicOperations.java:readFile) - readFile file /default_tests_files/Basic_CACHE_PROMOTE_THROUGH took 298 ms.
Passed the test!
Now we can work with Alluxio in pretty much the same way we work with HDFS. Let’s create a directory and load a csv file into it:
$ alluxio fs mkdir /alluxio/guy
Successfully created directory /alluxio/guy
$ alluxio fs ls /alluxio
0.00B 07-14-2016 16:05:24:325 /alluxio/guy
$ ls
sampledata.csv
$ alluxio fs copyFromLocal ./sampledata.csv /tmp/guy
Copied ./sampledata.csv to /tmp/guy
$ alluxio fs ls /tmp/guy
488.94MB 07-17-2016 12:53:24:984 In Memory /tmp/guy/sampledata.csv
The file is now in memory only, and it will be gone if we restart Alluxio. If we want it written to the underFS, we should use the persist option:
$ alluxio fs persist /tmp/guy/sampledata.csv
persisted file /tmp/guy/sampledata.csv with size 512688250
Now we can also see it directly from the HDFS underlying filesystem:
$ hdfs dfs -ls /
Found 3 items
drwxr-xr-x - alluxio supergroup 0 2016-07-17 12:55 /alluxio
drwxrwxrwx - hdfs supergroup 0 2016-06-26 23:04 /tmp
drwxr-xr-x - hdfs supergroup 0 2016-03-23 10:56 /user
$ hdfs dfs -ls /alluxio
Found 2 items
drwxrwxrwx - alluxio supergroup 0 2016-07-14 15:47 /alluxio/default_tests_files
drwxrwxrwx - alluxio supergroup 0 2016-07-17 12:55 /alluxio/tmp
$ hdfs dfs -ls /alluxio/tmp/guy
Found 1 items
-rwxrwxrwx 3 alluxio supergroup 512688250 2016-07-17 12:55 /alluxio/tmp/guy/sampledata.csv
You can see that the base directory for Alluxio is /alluxio and the sampledata.csv file we persisted earlier is there.
Just one final note: when I tested Alluxio on Apache Hadoop (not Cloudera), I initially got a “connection refused” message when I tried to persist files; Alluxio was not able to communicate with the underFS. I solved this by adding this parameter to core-site.xml (note that on recent Hadoop versions this property is named fs.defaultFS, with fs.default.name kept as a deprecated alias):
<property>
  <name>fs.default.name</name>
  <value>hdfs://<host name>:9000</value>
</property>
In the next post, running and testing Alluxio, we will see how to actually run programs on Alluxio and whether it is really that fast.