SQLing the NoSQL with Apache Drill (Embedded mode)

To be honest, as a long time DBA, I am much more fluent and comfortable with SQL than with JSON. SQL is much closer to spoken English and easier to write for me. Maybe I’m just used to it. When writing JSON queries, especially complex ones, I often get lost in piles of braces and curly braces. I find myself running a query over and over, getting errors and trying to figure out if all open braces has matching closing ones.

So besides learning JSON better I was looking for a tool that can translate SQL to JSON and other formats. There are several solutions and a significant one is Apache drill.

Apache drill is a very interesting project based on Google’s Dremel technology that enables running standard SQL on non-SQL storage systems. It has many plugins that enable it to connect to many different platforms from filesystems to databases to cloud services (hive, hbase, Amazon s3, MongoDB, avro, sequencefiles and more). This is no CQL or HiveQL or any other mutation, drill fully supports the SQL 2003 standard for querying all those data sources.

  • One important drawback of drill is that it cannot write data to its data sources, only read.

Installing and starting Drill

Drill servers are called drillbits and they can be installed in two ways:

  1. Embedded mode. In this way drill is installed locally on a certain machine and it can only query data sources on this same machine. It does not use ZooKeeper and it cannot be accessed remotely via JDBC/ODBC. This kind of installation is very simple and quick but it does not use the full power of the Hadoop/MongoDB/whatever cluster it queries. This method is good for testing, experimenting and interactive queries.
  2. Distributed mode, where there is a drillbit installed on each node of the cluster and they all communicate using ZooKeeer. In this mode drill tries to use data locality and perform the work on the nodes that host the data. This is recommended for most busy production applications.

For this demonstration I will use the simpler embedded mode, in a later post I will try the distributed mode.

I will use ContOs 6.8 Linux as an operating system. Lets start.

Make sure you remove PpenJDK (all versions) and install Oracle JDK 8.

No installation is needed on Linux, just download the latest Apache Drill from here, move it to the server, extract it and you’re ready to go.

cd <drill directory>/bin


This will start the Drill shell.

Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support was removed in 8.0
Feb 07, 2017 12:02:41 PM org.glassfish.jersey.server.ApplicationHandler initialize
INFO: Initiating Jersey application, version Jersey: 2.8 2014-04-29 01:25:26...
apache drill 1.9.0
"a drill in the hand is better than two in the bush"
0: jdbc:drill:zk=local>

To exit the shell type “!quit”.

Drill has a built in web console. Although you can do everything from the command shell, I found Drill web console to be the easiest way to manage and use Drill. After starting Drill you can access it at http://:8047

Drill comes with several pre-installed storage plugins and there are many more that can be added. We will test it with MongoDB. In the web console, go to the storage menu and you will see all available storage plugins:

You can see that mongo plugin is disabled, so first we will click “update” and update it’s configuration so it points to our MongoDB. When your done and want to save the configuration, click the “update” button:

Now enable the plugin and see it appears at the upper section where the active plugins are:

Now its time to start querying. Go to the query menu and run “show databases”. You will see all the databases associated with the Three active storage plugins:

You can see that Drill treats all of them the same, regardless what the underlying data source is.

I already have a collection named “sampledata” in test database in MongoDB, so I will be using mongo.test. Go back to query page and run “select * from mongo.test.sampledata where line in (7000,122000)”.  You will get a tabular result just as you would with a relational database.

View full size image

If you wonder what “physical” and “logical” are, drill takes the sql statement and transform it into a logical plan consists of logical steps. Then, the plugin helps it transform it into a physical plan which is a plan individual to the specific storage we are using, in our case it’s MongoDB. So if you have a physical or logical plan text you can also run them from here.

If you want to see the explain plan you have two ways. One is to run your query preceded by “explain plan for”, for example “explain plan for select * from mongo.test.sampledata where line=7000”. The result is the underlying JSON query that ran on MongoDB (although not in a tidy/pretty format):

View full size image

Another way is to go to the “profile” menu, where you can see a list of all the queries you ran. Choosing one, you will see a tab saying “physical plan”. If you open it you will see a nice formatted execution plan (in MongoDB terms like project and screen) and there is also another tab, “visualize plan”  that shows the execution plan visually:

View full size image

If you scroll down the page you will find a “JSON profile” button. Clicking it will show the full (and very verbose) JSON query. According to the documentation you can copy it and paste in the query page and run it using the “physical” option, but when I tried it I got an error.

Aggregate functions such as SUM, AVG, MIN, MAX also work well and the syntax is just the same in relational database sql. You can read about available aggregate functions here.

Nested objects

Since SQL is designed for tabular data, it has problem retrieving and displaying nested data (columns or objects inside other columns or objects). So it will be interesting to see how Drill handles this.

So I updated one document to contain another document. I changed the value1 field that was of type Integer before to contain some personal info (I also had to change the column name since “value” is a reserved word):

use test

switched to db test
> db.sampledata.update (
... {line: 7000},
... {
... $set: {"value1": {
... first: "Guy",
... last: "Shilo",
... skills: "data"
... }
... }})
WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })

We can see the nested data if we look for it in the mongo shell:

{ "_id" : ObjectId("5899c53cd4151004e5fd53cd"), "line" : 7000, "value1" : { "first" : "Guy", "last" : "Shilo", "skills" : "data" }, "message" : "Line number 7000" }

Now let’s go and run the sql query “select * from mongo.test.sampledata where line=7000;” from Drill:

View full size image

You can see that in the “value1” field, Drill just returned plain JSON representation of the data. Any attempt to drill deeper like “select value1.first” or “where value1.first=’Guy'” resulted in error.

At this point we are hitting the inherent limit of SQL in dealing with complex, schema less data. It seems that Drill can only handle one level of nesting and it handles it by simple returning the nested object as JSON.


Drill does a good job enabling SQL on various platforms. Although some functionality of the individual platform gets lost on the way (and there should be some performance penalty too). Many people are fluent in SQL which was the standard query language for so many years and drill helps them get up and running quickly. Another thing is that you can keep a unified way of accessing your various data sources regardless the underlying technology. Connecting to drill remotely using JDBC/ODBC makes it even more usable.

I will test drill in distributed mode in the future and report the outcome soon.


This entry was posted in Apache Drill, MongoDB and tagged , , , . Bookmark the permalink.

Leave a Reply