Setting up HDFS transparent encryption

In my very first post on this blog I explored HDFS blocks and saw that HDFS keeps files on disk just as they are. You can bypass the HDFS API entirely, go directly through the OS, and view the files (or file parts). The files are not processed or transformed in any way; they just sit there as plain text, images, sound files and so on.

This is a major security concern, because anyone with access to the OS-level files can view or copy them.

A possible solution is to encrypt the files on disk. All major Hadoop distributions, such as Hortonworks and Hadoop 3.0 (alpha), offer HDFS encryption and encryption zones. My favorite Hadoop flavor is Cloudera, which offers a relatively easy way to configure transparent HDFS encryption in its distribution.

First of all, let’s take a look at the situation before encryption. I will create a demo text file and copy it to HDFS:

[hdfs@client ~]$ echo "This is a demo file for HDFS" > demo.txt
[hdfs@client ~]$ ls -l
total 8
-rw-rw-r-- 1 hdfs hdfs 29 May 7 22:44 demo.txt
-rwxrwxrwx 1 root root 51 Apr 16 22:45 test.cfg

[hdfs@client ~]$ hdfs dfs -put demo.txt /tmp/demo.txt
[hdfs@client ~]$ hdfs dfs -ls /tmp
Found 5 items
drwxrwxrwx - hdfs supergroup 0 2017-05-07 22:45 /tmp/.cloudera_health_monitoring_canary_files
-rw-r--r-- 3 hdfs supergroup 29 2017-05-07 22:45 /tmp/demo.txt
drwxr-xr-x - yarn supergroup 0 2017-04-14 23:03 /tmp/hadoop-yarn
drwx-wx-wx - hive supergroup 0 2017-03-26 23:31 /tmp/hive
drwxrwxrwt - mapred hadoop 0 2017-04-14 23:03 /tmp/logs
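
By the way, you don’t have to guess which datanode and subdirectory a block landed in: hdfs fsck will print the block ID and its locations. This is a standard HDFS command; block IDs and hosts will of course differ on your cluster:

[hdfs@client ~]$ hdfs fsck /tmp/demo.txt -files -blocks -locations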

Now let’s see the same file from the regular Linux filesystem. The default directory where HDFS keeps its data is /dfs/dn, so I went to one of the datanodes and dove deeper into the sub-directories until I got to where the file was:

[root@datanode1 ~]# cd /dfs/dn/current/BP-1769584072-
[root@datanode1 subdir14]# ls -l
total 16
-rw-r--r-- 1 hdfs hdfs 29 May 7 22:45 blk_1073745440
-rw-r--r-- 1 hdfs hdfs 11 May 7 22:45 blk_1073745440_4619.meta
-rw-r--r-- 1 hdfs hdfs 56 May 7 22:56 blk_1073745451
-rw-r--r-- 1 hdfs hdfs 11 May 7 22:56 blk_1073745451_4630.meta
[root@datanode1 subdir14]# cat blk_1073745440
This is a demo file for HDFS

The file is just there, in plain text, readable by anyone with access to the machine.

Cloudera supports two ways to hold the encryption keys. The first is a Java file-based keystore, which is good for development purposes but isn’t secure enough for production and may not be able to handle the load of a busy production cluster. The second is a dedicated pair of servers offered by Cloudera itself (Key Trustee Server and Key Trustee KMS). It is recommended to install each of them on a separate physical server and to configure high availability for them, so you will need four additional servers.

Unfortunately, Cloudera’s key server is only available to paid Enterprise Data Hub customers, so for this demo I had to use the Java keystore. You can find the documentation for configuring both keystores here.

First, we need to add the Java KeyStore KMS service. From the cluster menu choose “Add Service”, select Java KeyStore KMS from the list, and then choose which server will run the new service:


The next step is to generate the ACLs. Just choose a user name and a group name other than “hdfs” (which is not allowed), then click “Generate ACLs”:


Then just follow the wizard to the end without changing anything:


When you’re done, you will see the new Java KeyStore in your cluster’s services list:

Now that the KMS service is up and running, head to the cluster menu and click “Set up HDFS Data At Rest Encryption”:

This opens a window that lets you choose a keystore type. As I mentioned before, we are not following all the best practices here, so we will choose a file-based keystore. In the lower section you can see a list of tasks to complete, which changes according to your keystore selection.

Each step is a link to the instructions on how to complete it. Enabling Kerberos and TLS encryption is highly recommended but not mandatory. You can look at some of my older posts for how to configure TLS encryption and Kerberos authentication. Although the documentation says it’s optional, I could not get encryption to work without setting up Kerberos authentication first. When all the steps are done, the “validate data encryption” link becomes active.
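
Once Kerberos is in place, a quick sanity check is hitting the KMS REST API directly to confirm the service is reachable and accepts your ticket. This is just a sketch: it assumes your curl was built with SPNEGO support, and kms1.lan:16000 is the KMS host from my cluster (the same one that appears in the KMSClientProvider URL later in this post), so substitute your own:

[hdfs@client ~]$ kinit hdfs@REALM
[hdfs@client ~]$ curl --negotiate -u : "http://kms1.lan:16000/kms/v1/keys/names"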


Clicking the validation link opens a window with instructions on how to create a key and an encryption zone, and then test them:


HDFS encryption introduces a concept called an “encryption zone”: an HDFS directory in which every file is automatically encrypted when written. The files are transparently decrypted only when accessed through an HDFS client. To create an encryption zone, the directory must already exist and must be empty.

Let’s test it as shown in the validation screen. First, create a key (we configured Kerberos, so we must use kinit to authenticate the user):

[root@client ~]# su - hdfs
[hdfs@client ~]$ kinit hdfs@REALM
Password for hdfs@REALM:
[hdfs@client ~]$ hadoop key create mykey1
mykey1 has been successfully created with options Options{cipher='AES/CTR/NoPadding', bitLength=128, description='null', attributes=null}.
KMSClientProvider[http://kms1.lan:16000/kms/v1/] has been updated.
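
To double-check that the key was actually registered in the KMS, you can list the keys together with their metadata (cipher, length, creation date and so on):

[hdfs@client ~]$ hadoop key list -metadata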

Now create an example encryption zone at HDFS location /tmp/safezone:

[hdfs@client ~]$ hdfs dfs -mkdir /tmp/safezone
[hdfs@client ~]$ hdfs crypto -createZone -keyName mykey1 -path /tmp/safezone
Added encryption zone /tmp/safezone
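
To verify the zone was registered, list the encryption zones (this requires HDFS superuser privileges); it should show something like:

[hdfs@client ~]$ hdfs crypto -listZones
/tmp/safezone  mykey1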

Now let’s put our demo file (the one we used at the beginning of this post) in the new encryption zone and access it through the HDFS client:

[hdfs@client ~]$ hdfs dfs -put demo.txt /tmp/safezone
[hdfs@client ~]$ hdfs dfs -cat /tmp/safezone/demo.txt
This is a demo file for HDFS

As its name implies, the encryption is transparent: clients see the data just as they did before encryption.
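
If you ever need to prove that a specific file really is stored encrypted, newer Hadoop releases let you inspect its encryption metadata through the HDFS client (a sketch; check that your CDH version ships this subcommand):

[hdfs@client ~]$ hdfs crypto -getFileEncryptionInfo -path /tmp/safezone/demo.txt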

Now let’s try to access the file outside the HDFS client, directly from the OS, as we did at the beginning of this post:

[root@datanode1 subdir21]# cd /dfs/dn/current/BP-1769584072-
[root@datanode1 subdir21]# ls -l
total 8
-rw-r--r-- 1 hdfs hdfs 29 May 16 16:32 blk_1073747436
-rw-r--r-- 1 hdfs hdfs 11 May 16 16:32 blk_1073747436_6615.meta
[root@datanode1 subdir21]# cat blk_1073747436
▒▒)▒▒~8▒▒▒▒b▒▒'▒▒6Z▒[root@datanode1 subdir21]#

Voila! The file is encrypted and unreadable from outside HDFS.
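
A nice corollary: even through HDFS itself, the superuser can read the raw encrypted bytes via the special /.reserved/raw namespace. This is what tools like distcp use to back up encryption zones without ever decrypting the data. For example, this should print the same gibberish we saw on the datanode:

[hdfs@client ~]$ hdfs dfs -cat /.reserved/raw/tmp/safezone/demo.txt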
