Byzantine Reality

Searching for Byzantine failures in the world around us

HOWTO: Write a Simple Monit Config File

Monit is a daemon that monitors processes running on a Linux box, and can restart them if:

  • They die for any reason (e.g., they crash, or they finish doing whatever it is they do).
  • They use too much CPU or memory.

We use it in AppScale for exactly these reasons, for nearly every daemon AppScale relies on. This includes:

  • Cassandra
  • ZooKeeper
  • Memcache
  • Celery
  • RabbitMQ
  • Ejabberd
  • App Engine apps that users upload

The official Monit documentation is pretty thorough, but it doesn’t show in a minimal way how to write a Monit config file that revives processes that die or take up too much memory. With that in mind, let’s do exactly that!

A Monit config file has four parts:

  1. The command used to start your service.
  2. The command used to stop your service.
  3. The maximum amount of CPU or memory allowed for your service.
  4. How to see if your service is running.

For (1) and (2), the command has to be fully-qualified. For example, you can’t say python /home/cgb/blah.py. You have to say /usr/bin/python /home/cgb/blah.py.

For (3), you can say either that the main process itself is limited to a certain amount of CPU and memory, or that the process and anything it forks are limited to a certain amount of CPU and memory. In AppScale-land, we always want the latter, so we say totalcpu and totalmem instead of cpu and mem.

Finally, for (4), you specify a command that monit uses to see if the process is actually running. If your process doesn’t fork, then this can be the same as (1). If it does fork, then you can run monit procmatch (.*) once your process is running to see what monit can see, and use that as what monit calls the “match command”.
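The fully-qualified rule for (1) and (2) is easy to get wrong by hand, so here is a minimal Python sketch that looks up where a program lives and rewrites the command to use that path. The helper name fully_qualify is hypothetical (it is not part of Monit or AppScale), and the path it returns depends on the machine it runs on.

```python
import os
import shlex
import shutil


def fully_qualify(command):
    """Return `command` with its program replaced by an absolute path.

    Hypothetical helper for illustration; not part of Monit or AppScale.
    """
    parts = shlex.split(command)
    if not parts:
        raise ValueError("empty command")
    # shutil.which searches PATH, much like the `which` shell command.
    resolved = shutil.which(parts[0])
    if resolved is None:
        raise ValueError(f"cannot find {parts[0]!r} on PATH")
    return shlex.join([os.path.abspath(resolved)] + parts[1:])


if __name__ == "__main__":
    # The printed path depends on where `sh` is installed on this machine.
    print(fully_qualify("sh /home/cgb/blah.py"))
```

The result can be pasted into the start and stop commands. It is still worth checking the path on the machine where monit actually runs, since it may differ from the one where you ran the lookup.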

Let’s look at an example, with what we use as a monit template file in AppScale:

template.cfg
check process myprocessname matching "match_command"
  start program = "start_command"
  stop program = "stop_command"
  if totalmem > X MB for 3 cycles then restart
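If you have more than a couple of services, filling in this template by hand gets tedious. Below is a rough Python sketch of how the placeholders could be filled in programmatically. It is not AppScale’s actual code: the function name render_monit_config and its parameters are made up for illustration, and the check on start and stop commands only verifies that they begin with a slash.

```python
# Mirrors the template above; the names in braces are placeholders.
TEMPLATE = """check process {name} matching "{match}"
  start program = "{start}"
  stop program = "{stop}"
  if totalmem > {max_mem_mb} MB for {cycles} cycles then restart
"""


def render_monit_config(name, match, start, stop, max_mem_mb, cycles=3):
    """Fill in the template for one service (hypothetical helper)."""
    for label, command in (("start", start), ("stop", stop)):
        # Commands must be fully-qualified, so reject bare program names.
        if not command.startswith("/"):
            raise ValueError(f"{label} command must be fully-qualified: {command!r}")
    return TEMPLATE.format(
        name=name,
        match=match,
        start=start,
        stop=stop,
        max_mem_mb=max_mem_mb,
        cycles=cycles,
    )


if __name__ == "__main__":
    print(render_monit_config(
        "myprocessname", "match_command",
        "/path/to/start_command", "/path/to/stop_command", 100,
    ))
```

The output is plain text that you would write to a config file and then point monit at.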

Let’s use this template to see how we monitor a service like ZooKeeper. With ZooKeeper, we start it by running service zookeeper start and stop it by running service zookeeper stop. However, it forks into a different process to actually run ZooKeeper, so we can’t tell that it’s running by checking whether service zookeeper start is running. I ran monit procmatch (.*) and saw that zookeeper.jar appeared in the command line of the ZooKeeper process, so we can use that as our match command. Our monit config file for ZooKeeper therefore looks like the following:

zookeeper.cfg
check process zookeeper matching "zookeeper.jar"
  start program = "/usr/sbin/service zookeeper start"
  stop program = "/usr/sbin/service zookeeper stop"
  if totalmem > 250 MB for 3 cycles then restart

Notice that we didn’t just say service zookeeper start, because that doesn’t work: monit needs the fully-qualified path. To find out where the service command was installed, I ran which service, which on our Ubuntu Precise virtual machines returned /usr/sbin/service. I picked the 250 MB value somewhat arbitrarily, so adjust that as you need to for your system. I used totalmem here just in case ZooKeeper forks off other processes that take up memory that we want to track as well.
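Before settling on a match string like zookeeper.jar, it can help to check that the pattern picks out only the process you care about. The Python sketch below approximates that check on a list of command lines. Python’s re module stands in for monit’s own pattern matching, which may differ in detail, and the sample command lines are invented for illustration rather than copied from a real machine.

```python
import re


def preview_matches(pattern, command_lines):
    """Return the command lines that contain a match for `pattern`.

    Rough stand-in for eyeballing `monit procmatch` output; hypothetical helper.
    """
    regex = re.compile(pattern)
    return [line for line in command_lines if regex.search(line)]


# Invented command lines, for illustration only.
SAMPLE_COMMAND_LINES = [
    "/bin/sh /usr/sbin/service zookeeper start",
    "java -cp /usr/share/java/zookeeper.jar org.apache.zookeeper.server.quorum.QuorumPeerMain",
    "/usr/bin/python /home/cgb/blah.py",
]

if __name__ == "__main__":
    for line in preview_matches("zookeeper.jar", SAMPLE_COMMAND_LINES):
        print(line)
```

With these sample lines, the pattern zookeeper.jar selects only the java process, while a looser pattern like zookeeper would also pick up the service wrapper, which is the kind of ambiguity you want to avoid in a match command.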

That’s a lightning-fast intro on using monit, and how we use monit to monitor processes in AppScale. Enjoy!