Setting up Oracle external Metastore for Hive

When installing CDH, the installer asks for a database in which to build the Hive Metastore. You can choose your own external database (MySQL, Oracle, or PostgreSQL) or an embedded PostgreSQL database that the Cloudera installer sets up for you. Many people choose the embedded database because it is the quickest and easiest option. However, it is less manageable than an external database, and it is harder to tune and back up. If you chose the embedded database at install time and this is a production cluster, it is recommended to migrate the Metastore to an external database. Cloudera supports Oracle, MySQL, and PostgreSQL; Oracle is my favorite RDBMS, so I will show how to migrate the Hive Metastore to Oracle.

To complete the migration and fully abandon the embedded database, we also have to migrate Cloudera Manager and all its management services; we will cover that in a later post.

Preparing Oracle database

For the sake of speed and simplicity, I did not follow every best practice when configuring Oracle. Wherever I deviated from a best practice, I noted the recommended way to configure it.

You should make sure you install the database with the UTF8 character set. If you are just experimenting, you can keep the default Oracle parameters. But if you are about to attach this database to a large, busy Hadoop cluster, you should change some Oracle parameters to better support the workload. Run these commands in sqlplus (the file locations will probably differ in your database, and the memory allocation should be sized according to your available RAM):

alter system set FILESYSTEMIO_OPTIONS=setall scope=spfile;
alter system set MEMORY_TARGET=8G scope=spfile;
alter system set sessions=550 scope=spfile;
alter system set transactions=605 scope=spfile;
alter system set processes=500 scope=spfile;

After changing those parameters, restart the database so they take effect (some of the parameters can be changed directly in memory, but not all of them).
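A restart from the command line might look like this (a minimal sketch, assuming you can connect as SYSDBA via OS authentication on the database host):

```shell
# Bounce the instance so the spfile parameters above take effect
# (assumes OS authentication as SYSDBA on the database server)
sqlplus / as sysdba <<'EOF'
SHUTDOWN IMMEDIATE
STARTUP
EOF
```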

Now we should install the Oracle JDBC driver on the Cloudera Manager server node and on every node in the cluster that hosts any of the following roles: Activity Monitor, Reports Manager, Hive Metastore Server, Hue Server, Sentry Server, Cloudera Navigator Audit Server, or Cloudera Navigator Metadata Server.

Cloudera recommends assigning all of those roles to a single node and installing the JDBC driver on that node, but this is not mandatory.

You can download the ojdbc6.jar file from Oracle's website.

Now just copy it to the server in the appropriate location:

mkdir -p /usr/share/java
cp /tmp/ojdbc6.jar /usr/share/java/oracle-connector-java.jar
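If several nodes host those roles, the same copy can be scripted. A sketch, assuming passwordless SSH and hypothetical host names:

```shell
# Hypothetical host list - replace with the nodes that run the listed roles
for host in cloudera1.lan cloudera2.lan cloudera3.lan; do
  ssh "$host" 'mkdir -p /usr/share/java'
  scp /tmp/ojdbc6.jar "$host":/usr/share/java/oracle-connector-java.jar
done
```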

Cloudera recommends installing a separate Oracle instance for each service, on the same node that runs the service, to minimize latency when accessing the data. I installed everything on one dedicated Oracle server.

I did not create a separate tablespace for every user/schema; I created them all in the “users” tablespace. In the real world, you should create a separate tablespace for each user. I just increased the tablespace size to accommodate all those users:

alter database datafile '/oracle/data/orcl/cloudera/users01.dbf' resize 1G;

Now create the user/schema for Cloudera Manager and grant it some privileges. Although this post is only about the Hive Metastore, I will create all the users necessary to support the management services that we will migrate in the next post:

create user scmserver identified by scmserver default tablespace users;
grant CREATE SESSION to scmserver;
grant CREATE ANY TABLE to scmserver;
grant CREATE ANY SEQUENCE to scmserver;
grant CREATE ANY INDEX to scmserver;
grant ALTER ANY TABLE to scmserver;
grant ALTER ANY INDEX to scmserver;
alter user scmserver quota unlimited on users;

In a similar way, we should now create Oracle users for the Activity Monitor, Reports Manager, Hive Metastore, Sentry Server, Cloudera Navigator Audit Server, and Cloudera Navigator Metadata Server. In this post we will not use all of those users, but we can create them all now to prepare the infrastructure, following the same pattern as above.
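Such a script might look like the sketch below, which mirrors the grants given to scmserver above (the user names and passwords are placeholders; substitute your own):

```shell
# Creates one schema per service with the same privileges as scmserver
# (placeholder names/passwords: amon = Activity Monitor, rman = Reports Manager,
#  hive = Hive Metastore, sentry = Sentry Server, nav = Navigator Audit,
#  navms = Navigator Metadata; run as a DBA user)
sqlplus / as sysdba <<'EOF'
CREATE USER hive IDENTIFIED BY hive DEFAULT TABLESPACE users;
GRANT CREATE SESSION, CREATE ANY TABLE, CREATE ANY SEQUENCE,
      CREATE ANY INDEX, ALTER ANY TABLE, ALTER ANY INDEX TO hive;
ALTER USER hive QUOTA UNLIMITED ON users;
-- Repeat the three statements above for amon, rman, sentry, nav and navms
EOF
```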

Backing up the current database contents

We should export all the data from the hive database user in order to import it into the new Oracle database. First, stop the cluster’s services so that no changes are made to the Metastore database.

If you kept the embedded PostgreSQL password from the CDH installation, you are lucky. If not, use the Cloudera Manager API to retrieve the password (your version and cluster name may be different):
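A hedged example of such a call, using the Cloudera Manager REST endpoint for the Hive service configuration (the admin credentials, host, API version, and cluster name here are assumptions; adjust them for your deployment):

```shell
# Query the Hive service configuration from the Cloudera Manager API
# (placeholder credentials, host, API version and cluster name)
curl -u admin:admin \
  'http://cloudera1.lan:7180/api/v8/clusters/cluster1/services/hive/config'
```

The hive_metastore_database_password item in the response holds the Metastore password.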


  "items" : [ {
    "name" : "hbase_service",
    "value" : "hbase",
    "sensitive" : false
  }, {
    "name" : "hive_metastore_database_host",
    "value" : "cloudera1.lan",
    "sensitive" : false
  }, {
    "name" : "hive_metastore_database_name",
    "value" : "hive",
    "sensitive" : false
  }, {
    "name" : "hive_metastore_database_password",
    "value" : "kiJH12yv8A",
    "sensitive" : true
  }, {
    "name" : "hive_metastore_database_port",
    "value" : "7432",
    "sensitive" : false
  }, {
    "name" : "hive_metastore_database_type",
    "value" : "postgresql",
    "sensitive" : false
  }, {
    "name" : "mapreduce_yarn_service",
    "value" : "yarn",
    "sensitive" : false
  }, {
    "name" : "zookeeper_service",
    "value" : "zookeeper",
    "sensitive" : false
  } ]

Export the current Hive data (it will prompt you for the password):

pg_dump -U hive -W --column-inserts --format p -h cloudera1.lan -p 7432 -f hive-in.sql -a hive

This will produce a file named “hive-in.sql” that contains SQL insert statements. We will use it later to import the data into Oracle.


Open hive-in.sql and remove the following line:

INSERT INTO VERSION (VER_ID, SCHEMA_VERSION, VERSION_COMMENT) VALUES (1, '1.1.0', 'Hive release version 1.1.0');

We do not need it, because Cloudera Manager inserts its own version record when it creates the Metastore tables.

Configuring Hive to use the new database

In Cloudera Manager, go to Hive -> Configuration and search for “metastore”. Locate the database settings and change the database type from PostgreSQL to Oracle, along with the host name, port, and username of the new database, then press Save.


Under Hive, in the Actions menu, select “Create Hive Metastore Database Tables”.

Now run the hive-in.sql file we created earlier against the Oracle database as the hive user to populate the Metastore tables. If you run it as-is, you will hit a foreign key violation, so you first have to disable this foreign key:

alter table TBLS disable constraint TBLS_FK1;

Then run the script and when it’s done, re-enable this foreign key:

alter table TBLS enable constraint TBLS_FK1;
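The whole import can also be run as a single sqlplus session; a sketch, assuming the Metastore schema user is named hive (the credentials and connect identifier are placeholders):

```shell
# Disable the FK, load the dump, commit, then re-validate the constraint
# (placeholder credentials and connect identifier)
sqlplus hive/hive@orcl <<'EOF'
ALTER TABLE TBLS DISABLE CONSTRAINT TBLS_FK1;
@hive-in.sql
COMMIT;
ALTER TABLE TBLS ENABLE CONSTRAINT TBLS_FK1;
EOF
```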

Now start Hive and verify that everything is OK. It should start without errors, and all your tables should be present.

Next time we will use the database users we created to migrate Cloudera Manager and all its management services to the same Oracle database.

