In the last post I showed how to replicate HDFS files and directories. Now we will continue where we left off and set up Hive replication. Setting up Hive replication has a lot in common with setting up HDFS replication, and some of the steps are identical in both processes.
Cloudera recommends enabling snapshots on the directory where the Hive warehouse resides and on any directories that contain Hive’s external tables. If you are replicating a big and busy Hive table, replication might fail if the table changes while the job is running. To prevent this, Cloudera Manager first creates a snapshot in order to “freeze” the table and only then copies it to the other cluster. If you do not enable snapshots, you will see a warning message like this:
This is easy to set up, so I recommend doing it before setting up the actual replication. Go to Hive service -> Configuration and find “Hive Warehouse Directory”. Mine is “/user/hive/warehouse”. The easiest way I found to enable snapshots is via the command line. Connect to one of the nodes of the source cluster as the hdfs user and run:
$ hdfs dfsadmin -allowSnapshot /user/hive/warehouse
Allowing snaphot on /user/hive/warehouse succeeded
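If you also have external tables, the same command has to be run once per directory. Here is a minimal Python sketch that only builds and prints the commands so you can review them before running them as the hdfs user. The function name and the second directory are made up for illustration; only the warehouse path comes from my cluster.

```python
import shlex


def allow_snapshot_commands(directories):
    """Build one `hdfs dfsadmin -allowSnapshot <dir>` command per unique directory."""
    commands = []
    seen = set()
    for directory in directories:
        # Normalize trailing slashes so the same path is not listed twice.
        path = directory.rstrip("/") or "/"
        if path in seen:
            continue
        seen.add(path)
        commands.append(["hdfs", "dfsadmin", "-allowSnapshot", path])
    return commands


if __name__ == "__main__":
    dirs = [
        "/user/hive/warehouse",   # the Hive warehouse directory from this post
        "/data/external/sales",   # hypothetical external table location
    ]
    for command in allow_snapshot_commands(dirs):
        # Print only; paste the lines into a shell on a source cluster node.
        print(shlex.join(command))
```

Printing instead of executing keeps the sketch safe to run anywhere, including on a machine without an HDFS client.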
Now we can proceed. To set up Hive replication, as with HDFS replication, you first need a peer relationship between the clusters. My last post demonstrates how to set that up. Then go to Hive service -> quick links -> replication.
In the “create schedule” drop down, choose Hive replication:
You will see this form, which is almost identical to the one of HDFS replication:
After you save the new schedule, you can run it from the replication schedule page, just like you did with HDFS replication:
You start the job from the actions drop down menu, as with HDFS replication. Looking at the running job, you can see four stages: exporting the Hive metadata database, checking the export file, copying the data itself and finally importing the Hive metastore data:
If we expand the data replication stage we can see that it actually runs good old distcp:
So the metadata is transferred using export-import, and the data itself, which is just HDFS files, is transferred using distcp, exactly as in HDFS replication.
The process described above results in full replication. All of the Hive tables and the metastore are copied whenever the replication schedule runs. This can consume a lot of time and resources, as the whole database is copied even if there were only minor changes or no changes at all. Incremental replication eliminates this waste by copying only the changes made since the last replication job.
To configure this, first go to the source cluster:
- Log in to the source cluster and go to the configuration tab of the Hive service.
- Look for “Enable Stored Notifications in Database” and make sure it is checked.
- Look for “Time-to-live for Database Notifications” and make sure it is greater than your replication schedule interval (default value is 2 days so it should be ok).
- Save the changes you made and restart Hive service.
This will make Hive save all DDLs in the database so they can be re-run on the destination cluster.
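The time-to-live condition from the steps above is easy to get wrong if you later change the schedule, so here is a tiny Python sketch of the check. The function name is mine, and the reasoning in the comment (notifications being purged before the next run can use them) is my assumption about why the TTL has to be longer than the interval.

```python
from datetime import timedelta


def ttl_covers_schedule(notification_ttl, schedule_interval):
    """Return True if the notification TTL is longer than the replication interval.

    Assumption: if the TTL is shorter, notifications may be purged before the
    next replication run has a chance to read them.
    """
    return notification_ttl > schedule_interval


if __name__ == "__main__":
    default_ttl = timedelta(days=2)  # default value mentioned in this post
    print(ttl_covers_schedule(default_ttl, timedelta(days=1)))  # daily schedule
    print(ttl_covers_schedule(default_ttl, timedelta(days=7)))  # weekly schedule
```

With the default of 2 days, a daily schedule passes the check while a weekly one does not, so a weekly schedule would need a larger TTL.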
Then on the destination cluster:
- Log in to the destination cluster and go to the configuration tab of the Hive service.
- Edit “Hive Replication Environment Advanced Configuration Snippet” and add this to it: “USE_INCR_EXPORT_SUPPORT=true”
That’s it! Now let’s test it.
I will create a Hive table on the source system and load a file containing 15 million rows into it:
create table sampledata1 (
  seqnum int,
  value bigint,
  message string
)
row format delimited fields terminated by ','
stored as textfile;

LOAD DATA local INPATH '/home/hive/sampledata.csv' OVERWRITE INTO TABLE sampledata1;
I then ran the Hive replication job from the destination cluster. After that I created another data file, this time with only 10,000 rows, and loaded it into the same table without the OVERWRITE option.
I ran the replication job again. Since I was using incremental replication, I expected the second run to take less time to complete than the first one.
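If you want to repeat the test, the two data files could be generated with something like the Python sketch below. This is not the script I used for the original files; the function name, the random values and the message text are placeholders that simply match the three columns of sampledata1.

```python
import csv
import random


def write_sample_csv(path, row_count, start_seqnum=1, seed=0):
    """Write row_count rows of (seqnum, value, message) as comma-delimited text."""
    rng = random.Random(seed)  # fixed seed so the file is reproducible
    with open(path, "w", newline="") as handle:
        # Use "\n" line endings so no stray carriage return ends up in the
        # last column when Hive reads the file as delimited text.
        writer = csv.writer(handle, lineterminator="\n")
        for seqnum in range(start_seqnum, start_seqnum + row_count):
            value = rng.randrange(0, 2**40)  # fits in a Hive bigint
            # The message contains no commas, so no CSV quoting is produced.
            writer.writerow([seqnum, value, f"message {seqnum}"])
    return row_count


if __name__ == "__main__":
    # Small counts for a quick try; the post used 15 million and 10,000 rows.
    write_sample_csv("sampledata.csv", 1000)
    write_sample_csv("sampledata_delta.csv", 100, start_seqnum=1001)
```

The second call starts its sequence numbers where the first file ends, which makes the appended rows easy to spot on the destination cluster.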
Here is the replication history report:
As expected, the first run copied 46MB of data while the second run had to copy only 267.9KB. This is all nice, but surprisingly both runs took roughly 3 minutes each. I think the reason is that replication uses MapReduce, a framework that requires a lot of housekeeping and preparation before it even gets to the actual work. Setting up mappers, dividing the work among them, running the jobs, setting up reducers and so on can take a minute or even more. So the total time taken to complete the job doesn’t tell the whole story. If we try the same test on a larger table, we should see a difference that becomes more and more evident as the table grows, since the housekeeping time doesn’t change much.
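To make that reasoning concrete, here is a rough Python model of a run as a fixed overhead plus copy time. The overhead of 170 seconds and the throughput of 10MB per second are numbers I made up for illustration, not measurements from my clusters; only the 46MB and 267.9KB sizes come from the report above.

```python
MB = 1024 ** 2
GB = 1024 ** 3


def estimated_runtime_seconds(data_bytes,
                              overhead_seconds=170.0,
                              throughput_bytes_per_second=10 * MB):
    """Model a run as fixed job overhead plus time spent copying data.

    Both default values are illustrative assumptions, not measured numbers.
    """
    return overhead_seconds + data_bytes / throughput_bytes_per_second


if __name__ == "__main__":
    full_small = estimated_runtime_seconds(46 * MB)         # first run in this post
    incr_small = estimated_runtime_seconds(267.9 * 1024)    # second run in this post
    full_large = estimated_runtime_seconds(100 * GB)        # hypothetical big table
    incr_large = estimated_runtime_seconds(1 * GB)          # hypothetical small change
    print(f"small table: full {full_small:.0f}s, incremental {incr_small:.0f}s")
    print(f"large table: full {full_large:.0f}s, incremental {incr_large:.0f}s")
```

Under these assumptions the two runs on the small table differ by only a few seconds, because the overhead dominates, while on the hypothetical 100GB table the full copy takes far longer than the incremental one.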