Monitoring Kafka consumer lag with Burrow

Burrow is a monitoring tool developed at LinkedIn. Its sole purpose is to detect consumer lag and raise alerts when such lag is detected.

In a healthy Kafka cluster, producers push messages into topics and consumers pull those messages out at the other end. Things start to go wrong when a consumer cannot keep up with the producer, or hangs for some reason. When this happens, messages don’t get through and quickly pile up in the topic. This is called lag, as the consumer is lagging behind the producer. It is important to detect such situations and fix the cause of the lag (although not every lag indicates a problem; sometimes the lag is temporary and the consumer eventually catches up).

Burrow is an open-source project, and its GitHub page also has some documentation about its installation and use. It is a command-line tool with no graphical user interface; it relies on email or third-party visual monitoring systems to receive and show its alerts.

Burrow’s functionality is quite narrow: it does not monitor any aspect of Kafka other than consumer lag. But it does so in a special way that justifies using it in addition to any regular monitoring and alerting systems you may have.

Instead of using a hard threshold of messages to define lag, it monitors Kafka’s internal committed-offsets topic (__consumer_offsets) and evaluates a consumer’s condition by the offsets it commits. A consumer that never commits an offset is invisible to Burrow; for consumers that do commit offsets, Burrow uses sliding windows to evaluate the rate of offset commits: is it stable, decreasing or increasing? Is the lag growing, and at what rate? Is it growing consistently? Based on this information, Burrow decides whether the status of the consumer group is fine or not.
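Burrow’s actual evaluation rules are more involved, but the core sliding-window idea can be illustrated with a toy sketch. This is only an illustration of the concept, not Burrow’s code; the window size and the “strictly increasing” rule are simplified assumptions:

```python
# Toy illustration of sliding-window lag evaluation (NOT Burrow's actual code).
# We keep the last N observed lag values for a consumer group and flag the
# group only if the lag is non-zero and strictly increasing across the whole
# window -- a temporary spike that recovers does not trigger an alert.

from collections import deque

WINDOW = 10  # number of offset commits to evaluate per decision

def evaluate(lags, window=WINDOW):
    """Return 'OK' or 'WARN' given the most recent lag observations."""
    recent = list(deque(lags, maxlen=window))
    if len(recent) < window:
        return "OK"          # not enough data to judge yet
    if recent[-1] == 0:
        return "OK"          # the consumer is fully caught up
    increasing = all(a < b for a, b in zip(recent, recent[1:]))
    return "WARN" if increasing else "OK"

print(evaluate([0, 0, 1, 0, 2, 0, 0, 1, 0, 0]))   # spikes that recover -> OK
print(evaluate([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]))  # consistently growing -> WARN
```

This is why a fixed message-count threshold produces more false alarms: it fires on every spike, while a windowed rule only fires when the lag keeps growing.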

You can read two nice and detailed articles about Burrow from Hortonworks: part I and part II.



Burrow is written in the Go language, so we first have to install Go:

rpm -Uvh
yum install golang
yum install mercurial   # the Mercurial package; "hg" is the client command it provides
yum install git

If you are on RHEL 6 with git 1.7, you need to upgrade your git first. See here for instructions on how to do it.

Install the Go package manager gpm. It is a single shell script; download it from the gpm GitHub repository, then make it executable and put it on your path:

chmod +x gpm
mv gpm /usr/local/bin

Create a directory and point the $GOPATH environment variable to it:

mkdir /gopath
export GOPATH=/gopath

Now download and install Burrow itself:

go get

cd $GOPATH/src/
gpm install
go install


Now edit the configuration file $GOPATH/config/burrow.cfg.

You should make some changes according to your system. The configuration file has several sections. First, change the host names in the zookeeper and kafka sections to fit your hosts (mine is “hadoop”), set the Kafka cluster name, and adjust any other non-standard information:


[kafka "guy1"]

If you have Storm, you should also make the changes in the “storm” section (I do not have Storm). Also check that the http server section points to a free port.
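For reference, the relevant sections can look roughly like this. This is a sketch only: the host name “hadoop” and cluster name “guy1” are from my setup, the ports are common defaults, and the key names follow the sample config shipped with Burrow 0.x, so verify them against the file in your distribution:

```ini
; sketch only -- verify key names against the sample burrow.cfg in your distribution
[zookeeper]
hostname=hadoop
port=2181

[kafka "guy1"]
broker=hadoop
broker-port=9092
zookeeper=hadoop
zookeeper-path=/
offsets-topic=__consumer_offsets

[httpserver]
server=on
port=8000
```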

In the “general” section, point to the log directory and to the logging configuration file that comes with the Burrow distribution.


Towards the end of the file there are settings for the notifiers, the ways Burrow can communicate its findings. There are three supported types of notifiers: email, HTTP and Slack. There is also an “smtp” section that holds the details of the email server and account for sending out email messages. I do not need Slack or HTTP notifications, so I cut out those parts and kept only the email notifier. Here is how it looks:

[smtp]
from=sender@example.com

[emailnotifier "recipient@example.com"]

Note the pointer to the email template; you can edit it if you want to change the information sent by the email notifier. You also have to define, in the “groups” parameter, the consumer groups you want to be notified about.
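Here is a hedged sketch of how those notifier sections can look, filled in. The “groups”, “interval” and template parameters are the ones mentioned above; the key names follow the Burrow 0.x sample config, and all hosts and addresses are placeholders:

```ini
; sketch only -- hosts and addresses are placeholders
[smtp]
server=smtp.example.com
port=25
from=burrow@example.com

; the section name is the address that receives the alerts
[emailnotifier "ops-team@example.com"]
groups=guy1,test
interval=600
template=config/default-email.tmpl
```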

Here is the actual config file I used for my tests.


I used the producer and consumer from this old post about moving binary data with Kafka. I changed the consumer code so that it slows down and eventually stops processing messages (but stays alive). This creates a lag that Burrow can detect. Note that Burrow differentiates between a consumer that is down and a consumer that is lagging (if it is down, the status will be different), so trying to create a lag by just shutting down the consumer will not yield the expected result.

As I said before, Burrow only sees consumers that commit offsets, so if the consumer halts before Burrow is started, Burrow will not see it. I am talking about consumers (and so is the documentation), but the granularity at which Burrow monitors is really consumer groups, not individual consumers.

Here is what the Burrow log looks like after startup:

2017-01-03 16:32:10 [INFO] Starting Zookeeper client
2017-01-03 16:32:10 [INFO] Starting Offsets Storage module
2017-01-03 16:32:10 [INFO] Starting HTTP server
2017-01-03 16:32:10 [INFO] Starting Zookeeper client for cluster guy1
2017-01-03 16:32:10 [INFO] Starting Kafka client for cluster guy1
2017-01-03 16:32:11 [INFO] Starting consumers for 50 partitions of __consumer_offsets in cluster guy1
2017-01-03 16:32:35 [INFO] Start email notify
2017-01-03 16:32:35 [INFO] Acquired Zookeeper notify lock
2017-01-03 16:42:35 [INFO] Start evaluating consumer group test in cluster guy1

First of all, you can query the Burrow HTTP server for data about the cluster and the consumer groups. For example, to list all the consumer groups in your cluster, go to:

http://{your host name}:8000/v2/kafka/{cluster name}/consumer

To see the status of a specific consumer group, go to:

http://{your host name}:8000/v2/kafka/{cluster name}/consumer/{group name}/status

For a full list of URLs, see the second Hortonworks article mentioned earlier.

Here is an example of the server's answer to a status query for consumer group "test" in cluster "guy1":
{
  "error": false,
  "message": "consumer group status returned",
  "status": {
    "cluster": "guy1",
    "group": "test",
    "status": "OK",
    "complete": false,
    "partitions": [],
    "partition_count": 1,
    "maxlag": null,
    "totallag": 0
  },
  "request": {
    "url": "/v2/kafka/guy1/consumer/test/status",
    "host": "hadoop.localdomain",
    "cluster": "guy1",
    "group": "test",
    "topic": ""
  }
}
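Since the response is plain JSON, it is easy to consume programmatically. A small Python sketch, using the response above as a literal string (in a real check you would fetch it with an HTTP client from the status URL instead):

```python
import json

# Literal server response from the status query above; in a real check you
# would GET http://<host>:8000/v2/kafka/guy1/consumer/test/status instead.
raw = ('{"error":false,"message":"consumer group status returned",'
       '"status":{"cluster":"guy1","group":"test","status":"OK",'
       '"complete":false,"partitions":[],"partition_count":1,'
       '"maxlag":null,"totallag":0},'
       '"request":{"url":"/v2/kafka/guy1/consumer/test/status",'
       '"host":"hadoop.localdomain","cluster":"guy1","group":"test","topic":""}}')

reply = json.loads(raw)
status = reply["status"]
print(status["group"], status["status"], status["totallag"])  # test OK 0

# A simple health check: anything other than "OK" deserves a look.
healthy = not reply["error"] and status["status"] == "OK"
print(healthy)  # True
```

This is handy for wiring Burrow into an existing monitoring system that can poll an HTTP endpoint.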

After running my hacked consumer for a while, I started receiving email notifications like this:

[screenshot of the alert email]

As you can see, the alert is for the entire consumer group “test”, not for an individual consumer.

I only tested Burrow with the email notifier, and here is what I think:

Burrow uses a unique and very effective way to detect consumer lag. It allows Burrow to be more accurate and raise fewer false alarms than systems that simply count the messages in the topic and compare the count to a fixed threshold. However, there are several shortcomings, or maybe “things I wish Burrow had”.

  • As a “slim” open-source project, there is little documentation. If you look for forums and community support you will not find much; you will pretty much have to figure things out yourself, especially when you run into problems or errors.
  • Burrow is designed to monitor Kafka, which is a distributed, large-scale system, but Burrow itself is a standalone program that is vulnerable to hardware and software failures. A Burrow cluster, or some other high-availability solution, would be a good idea.
  • The Burrow email notifier did not seem to obey the “interval” parameter and clogged my inbox with hundreds of messages while the consumer was in “Warning” status for just a few minutes.
  • When the status became good again, Burrow did not send any message indicating that the error condition had cleared.
  • Burrow only gives you the basic information that your consumer group is “OK” or “Not OK”; you have to find out what is wrong and which specific consumers are lagging yourself. Internally, Burrow analyzes the situation to determine the consumer group status, so it has more information, and it would be nice if it could share it in the alerts (maybe tweaking the email template, which I did not do, can change this behavior).

I guess some of these things can be achieved through better setup of the templates and config file. Burrow seems to be under active development; bugs are fixed and new features are added all the time, so I believe most of these problems will be taken care of in future releases.

