Being able to replicate your data to another cluster is important, both for keeping backup copies of your data and for building development or test clusters with the same data as production. Keep in mind that this is not automatic, continuous replication, but rather a remote sync that runs on demand or at set intervals.
Apache Hadoop offers distcp, a MapReduce-based tool for copying HDFS data across clusters. distcp only copies files and directories, so you have to wrap it with your own scripts if you want scheduled backups or more complex logic. Cloudera includes such a wrapper that can be easily managed from Cloudera Manager. Please note that this is supported only in the Enterprise Data Hub edition, not in the free Express edition. Users of the Express edition will not have a “Backup” menu and will have to write their own wrapper around distcp, just as in Apache Hadoop; Cloudera provides a distcp guide for this purpose.
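To give an idea of what such a wrapper looks like, here is a minimal sketch of a cron-able script around distcp. The NameNode hostnames and paths are made-up examples, not from a real cluster; the script only builds and prints the command it would run, since executing it requires live clusters.

```shell
#!/bin/sh
# Minimal sketch of a scheduled-backup wrapper around distcp.
# Hostnames and paths below are hypothetical examples.
SRC_NN="hdfs://prod-nn.example.com:8020"    # assumed source NameNode URI
DST_NN="hdfs://backup-nn.example.com:8020"  # assumed destination NameNode URI
SRC_PATH="/data"
DST_PATH="/backup/data"

# -update copies only files that differ from the destination;
# -delete (valid together with -update) removes destination files
# that no longer exist at the source, mirroring the source tree.
CMD="hadoop distcp -update -delete $SRC_NN$SRC_PATH $DST_NN$DST_PATH"

# Printed here as a dry run; a cron entry could invoke this script nightly.
echo "$CMD"
```

A real wrapper would typically add logging, error handling, and retention logic on top of this single invocation.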
Cloudera builds on top of distcp and also enables replication of data stored in Hive tables, Hive metastore data, Impala metadata, and HBase data. We will cover HDFS and Hive replication in this post series.
For the demonstration, I created two CDH 5.9 clusters: a 4-node cluster named cluster1 as the source, and a 3-node cluster named cluster2 as the destination.
Cloudera recommends using TLS/SSL; a warning is shown if the URL scheme is http instead of https. You can refer to my previous post for instructions on enabling level 1 TLS encryption for your clusters.
In addition to the regular TLS/SSL setup, you will need to establish trust between the machines that run the Cloudera SCM Server on both clusters. Use the same command you used to import the certificate into the truststore when you configured TLS encryption, but this time create a certificate on cluster1 and import it into the truststore on cluster2, and vice versa.
For example, move the selfsigned.cer file from cloudera1 to cloudera7 and import it into the truststore:
keytool -import -alias cloudera1.lan -file /tmp/selfsigned.cer -keystore $JAVA_HOME/jre/lib/security/jssecacerts -storepass changeit
Then move selfsigned.cer from cloudera7 to cloudera1 and import it there. If you don’t do that, creating a new peer later will fail.
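Both directions of the trust can be summarized as follows. This sketch only prints the commands rather than executing them, since they need each host's real truststore; the cloudera7.lan alias for the reverse direction is my assumption, mirroring the cloudera1.lan alias used above.

```shell
# Dry-run sketch of the two-way trust setup between the SCM Server hosts.
# Printed, not executed: each command must run on the host named in the comment.
TRUSTSTORE='$JAVA_HOME/jre/lib/security/jssecacerts'

# On cloudera7: import the certificate exported from cloudera1
echo "keytool -import -alias cloudera1.lan -file /tmp/selfsigned.cer -keystore $TRUSTSTORE -storepass changeit"

# On cloudera1: import the certificate exported from cloudera7
# (alias cloudera7.lan is an assumed name mirroring the forward direction)
echo "keytool -import -alias cloudera7.lan -file /tmp/selfsigned.cer -keystore $TRUSTSTORE -storepass changeit"
```

keytool will prompt you to confirm trusting each certificate unless you add -noprompt.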
The procedure of data replication is described in the official documentation here.
This demonstration uses non-Kerberized clusters. Replicating Kerberized clusters requires some extra steps, which you can find in the documentation.
Setting up and running replication is done only from the destination cluster (pull). First, log in to the destination cluster as an administrator, then go to Backup->Peers:
There should be no peers defined at this point, so all you will see is the “Add Peer” button. Click it and fill in the form with the name of the source cluster, the Cloudera Manager URL, and the admin credentials, then press “Add Peer”:
If everything went well, you will see a “Connected” message with a green check mark:
The two clusters are now connected, and it's time to choose what to replicate.
Click the HDFS service, and under Quick Links choose “Replication”. From the drop-down list choose HDFS replication:
Then fill in the replication form. You have to supply the source and destination clusters, the path to replicate (choose / for everything), and the schedule (run once now, run once in the future, or a recurring schedule). The default user for the replication task is hdfs, and it's usually best to leave it that way:
If you want to change the default values, go to the Resources tab, where you can set how many MapReduce tasks will run concurrently (the default is 20) and how they will pick their work:
The Advanced tab lets you change things like how to handle replication errors, whether to replicate deletions, how to deal with permissions and block size, and some finer tuning options:
Now our replication is ready to run. You can wait for the scheduled run or start it immediately using the drop-down menu on the right:
If you click “Run Now”, the replication starts. You can see its source, destination, and progress percentage. There are also two links: “Command Details”, which shows the actual command-line output, and “Performance Report”, which downloads a CSV file listing any performance problems.
If we click on the command details we can see that the actual command running behind the scenes is distcp:
Now let's test it. Host cloudera1 belongs to the source cluster1 and host cloudera7 belongs to the destination cluster2. At the beginning, /tmp on cluster1 has two directories. I then create a new file, replication.test, containing one line of text, and upload it to /tmp:
[hdfs@cloudera1 ~]$ hdfs dfs -ls /tmp
Found 2 items
drwxrwxrwx   - hdfs   supergroup          0 2016-12-26 15:48 /tmp/.cloudera_health_monitoring_canary_files
drwxrwxrwt   - mapred hadoop              0 2016-12-26 10:46 /tmp/logs
[hdfs@cloudera1 ~]$ echo "This is a replication test" > replication.test
[hdfs@cloudera1 ~]$ hdfs dfs -put replication.test /tmp/replication.test
[hdfs@cloudera1 ~]$ hdfs dfs -ls /tmp
Found 3 items
drwxrwxrwx   - hdfs   supergroup          0 2016-12-26 15:52 /tmp/.cloudera_health_monitoring_canary_files
drwxrwxrwt   - mapred hadoop              0 2016-12-26 10:46 /tmp/logs
-rw-r--r--   3 hdfs   supergroup         27 2016-12-26 15:53 /tmp/replication.test
Then, on the destination cluster, we can see that the replication.test file does not exist at first. After running the replication process it appears:
[hdfs@cloudera7 ~]$ hdfs dfs -ls /tmp
Found 2 items
drwxrwxrwx   - hdfs   supergroup          0 2016-12-26 15:55 /tmp/.cloudera_health_monitoring_canary_files
drwxrwxrwt   - mapred hadoop              0 2016-12-26 15:23 /tmp/logs
[hdfs@cloudera7 ~]$ hdfs dfs -ls /tmp
Found 3 items
drwxrwxrwx   - hdfs   supergroup          0 2016-12-26 16:25 /tmp/.cloudera_health_monitoring_canary_files
drwxrwxrwt   - mapred hadoop              0 2016-12-26 15:23 /tmp/logs
-rw-r--r--   3 hdfs   supergroup         27 2016-12-26 15:53 /tmp/replication.test
So we have successfully set up HDFS replication. Here are a few things to note:
- Synchronization is one-way. Any changes made to files in the destination cluster will be lost the next time the replication runs.
- There are certain limitations: no more than 100 million files can be processed in one replication job, and no more than 10 million files can be processed if the last replication job ran less than 8 hours ago. You can read more about this in the documentation.
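The one-way nature of the sync can be illustrated without a cluster. This is a local-filesystem sketch that mimics the mirror-style semantics (copy changed files, delete extras) using plain cp and rm; it is only an analogy, and the actual behavior for deletions depends on the delete policy you choose in the Advanced tab.

```shell
# Local sketch of one-way mirror semantics: destination edits are lost on the next sync.
src=$(mktemp -d)
dst=$(mktemp -d)

echo "v1" > "$src/a.txt"
cp "$src/a.txt" "$dst/a.txt"       # first replication run

echo "local edit" > "$dst/a.txt"   # someone edits the destination copy
echo "extra" > "$dst/b.txt"        # ...and adds a file on the destination

cp -f "$src/a.txt" "$dst/a.txt"    # next run overwrites the destination edit
rm -f "$dst/b.txt"                 # and removes files absent at the source

cat "$dst/a.txt"                   # → v1
```

After the second "run", the destination is byte-identical to the source again, which is exactly why the destination should be treated as read-only between replication runs.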
I wanted to cover Hive replication as well, but this post is long enough already, so I will cover it in my next post.