Sometimes processes just die. It’s unavoidable. In most cases, a dead process gives your users a chance to see the snarky error page you made late one night or, more likely, the ubiquitous “502 Bad Gateway” page. The art of getting your servers to automatically recover from unexpectedly lost daemons is part of process management. It’s just one of the tools on the sysadmin’s utility belt, but it’s a fundamental one.
There are many different process management solutions out there to choose from. In this post, I’ll walk through some of the more popular options and point out the good, the bad, and the ugly of each.
Obligatory disclaimer: I’m not an expert in process management, and what follows reflects my personal opinions and impressions of each solution.
At first glance, Upstart seems like it’s only meant to replace /etc/init.d and similar daemonization techniques. (It says so right on the project’s homepage.) Maybe you already know how to write a half-decent init.d script, so why bother with these newfangled .conf files? Because Upstart does a heck of a lot more than replacing simple start/stop/restart daemonization, including keeping daemons alive!
In many cases, a short Upstart script can replace a more verbose init.d script to achieve the same goal of being able to start, stop, restart, and get the status of a daemon process. Of course, the devil is in the details. The majority of a typical init.d script tends to be dedicated to validating the environment, reading configuration, setting environment variables, etc. Upstart supports all that too. The cookbook goes into detail about everything Upstart can do.
Note: While the cookbook is a great resource, it can be overwhelming for a beginner…it certainly was for me.
Here’s a simple configuration to manage a Django-Celery worker process:
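A minimal sketch of what such a job file could look like. The project path `/srv/myproject`, the virtualenv location, the `celery` user, and the exact worker command are all assumptions made for illustration (the worker command in particular depends on the django-celery version in use), and `setuid` needs Upstart 1.4 or later:

```
# /etc/init/celery.conf (illustrative sketch; paths and user are hypothetical)
description "Django-Celery worker"

# Start at normal multi-user runlevels, stop everywhere else.
start on runlevel [2345]
stop on runlevel [!2345]

# Restart the worker automatically if it exits unexpectedly.
respawn

# Drop privileges and run from the project directory.
setuid celery
chdir /srv/myproject

exec /srv/myproject/env/bin/python manage.py celery worker --loglevel=INFO
```

The `respawn` stanza is what tells Upstart to bring the job back after an unexpected exit; without it the job simply stays stopped.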
It’s straightforward enough that I won’t go through it line by line, but it is important to note that in this example, Upstart will immediately start the daemon again if it unexpectedly goes away.
Save it as `/etc/init/celery.conf` and then run `initctl reload-configuration` to let Upstart know about it. (The documentation claims changes are automatically discovered in the `/etc/init` directory, but that has proved unreliable for me.) The daemon will now start when the server boots, stop when it shuts down, and can be manually controlled by the likes of `service celery start` or `stop celery`, where the service name is whatever comes before the `.conf` in the filename, in this case “celery”.
Overall, I like Upstart a lot. Its functionality is rich enough to let power-users achieve complex tasks while still facilitating simple daemonization thanks to well-designed defaults. Since all processes managed by Upstart are subprocesses of a master Upstart process, daemons that exit unexpectedly are immediately detected and respawned.
I like that I don’t have to worry about the question “What if Upstart itself dies?” On Ubuntu and several other distributions, Upstart is used to manage most of the core system-level processes like networking, syslog, ssh, tty terminals, etc. This is comforting. If Upstart dies, you’ll have bigger problems on your hands than your web server daemon disappearing.
Unfortunately, Upstart lacks support for custom commands. The ability to send arbitrary signals to a process or execute arbitrary scripts, as init.d scripts allow, comes in handy. The convenience of custom commands like `/etc/init.d/nginx configtest`, which checks my nginx configuration syntax without affecting the running nginx service, is useful enough to keep me from migrating my nginx daemonization to Upstart.
Monit is an established player in the process management game. Its sole purpose is to monitor daemon processes, files, directories, filesystems, etc. on your server and respond with appropriate actions whenever something is not as it should be.
Here’s a simple configuration to monitor an SSH server daemon process:
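A minimal sketch of what such a Monit control file could look like. The PID file path and the init script path are assumptions that match a typical Debian or Ubuntu layout and may differ on your system:

```
# Poll every 60 seconds.
set daemon 60

# Illustrative paths; check where your distribution puts the PID file and init script.
check process sshd with pidfile /var/run/sshd.pid
  start program = "/etc/init.d/ssh start"
  stop program  = "/etc/init.d/ssh stop"
  # Beyond checking that the PID exists, speak SSH to port 22 and restart on failure.
  if failed port 22 protocol ssh then restart
```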
What’s going on here? Given the daemon’s PID filepath and how to start/stop the daemon, Monit will check that the process exists every 60 seconds and start it anew if it is not found. What’s more, Monit will also attempt an SSH connection on port 22 and restart the ssh server process if the test fails.
That last bit is the true power of Monit. On top of simply checking that a process exists, it can perform network tests as well as check system resources like CPU usage, memory consumption, number of child processes, and many other things. This can aid greatly in determining if a webserver is correctly serving traffic on port 80. It’s also a great band-aid for a process known to have a memory leak.
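As a rough illustration of those network and resource tests, here is a sketch of a check for a web server. The nginx paths and every threshold are placeholders chosen for the example, not recommendations:

```
# Hypothetical paths and thresholds, for illustration only.
check process nginx with pidfile /var/run/nginx.pid
  start program = "/etc/init.d/nginx start"
  stop program  = "/etc/init.d/nginx stop"
  # Network test: restart if nothing answers HTTP on port 80.
  if failed port 80 protocol http then restart
  # Resource tests: a band-aid for leaks and runaway workers.
  if totalmem > 500 MB for 5 cycles then restart
  if cpu > 80% for 5 cycles then alert
  if children > 50 then restart
```

The `for 5 cycles` qualifier keeps a short spike from triggering an action; the condition has to hold across five consecutive polls first.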
After Monit is initially configured and started, it’s controlled via the command line. Executing `monit summary` gives the state of all the processes it is monitoring. Monitoring for a given process can be temporarily disabled/enabled with the `monit unmonitor` and `monit monitor` commands. This can be useful as part of a deploy process for ensuring a daemon is stopped while a source code directory is rsynced or otherwise updated. A handful of other actions are also available.
By the way, a built-in web interface can be enabled to provide much of the same functionality available from the command line in a more user-friendly way. Just be sure to properly lock down the web interface from public access, since it exposes so many powerful and potentially dangerous functions.
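One way to lock it down, sketched here under the assumption that you only need access from the server itself (for example through an SSH tunnel). The port is Monit's conventional default and the credentials are placeholders:

```
set httpd port 2812 and
  use address localhost     # listen on the loopback interface only
  allow localhost           # accept connections from this host only
  allow admin:changeme      # placeholder credentials; pick your own
```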
Monit is a pretty solid and useful tool. It’s portable and easy to compile since it has very few library dependencies. Note that, unlike most of the other solutions described here, Monit does not daemonize processes; it only monitors them. This can be seen as either a positive or negative. On one hand, separation of concerns is a good way to keep things simple. On the other hand, it’s one more system to maintain.
Monit seems to have a good developer community around it with a fairly responsive mailing list. The wiki includes an amazing collection of configuration examples for just about every common service out there.
Monit can also be used to monitor the existence, contents, and other properties of arbitrary files or directories on the server. While this likely has some interesting use cases, I imagine these features have been somewhat superseded by configuration management tools like Puppet and Chef.
I ran into several gotchas while getting familiar with Monit.
When specifying the start program and stop program directives, don’t make any assumptions about the environment’s PATH variable; use absolute paths for all executables. This led me to more than a few head-banging moments.
Don’t forget about Monit when manually starting or stopping a daemon it is watching. This can lead to a process either inexplicably being resurrected shortly after stopping it or, even worse, a process left unmonitored once it is started again! Either commit yourself to never forgetting that Monit is running (good luck with that) or get in the habit of using the `monit stop` and `monit start` commands when manually controlling daemon processes.
Programmatically issuing commands like `monit unmonitor` and `monit monitor` in rapid succession will often lead to errors. To avoid this, use groups intelligently so that only one command is ever required at a time. If groups aren’t enough, adding a one-second sleep between monit commands is a reasonable solution.
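To make the grouping and absolute-path advice concrete, here is a sketch with two checks placed in one group. The service names, PID files, init scripts, and the group name `app` are all hypothetical:

```
# Both checks share the hypothetical group "app".
check process celery with pidfile /var/run/celery/worker.pid
  # Absolute paths only; do not rely on PATH being set.
  start program = "/etc/init.d/celeryd start"
  stop program  = "/etc/init.d/celeryd stop"
  group app

check process gunicorn with pidfile /var/run/gunicorn.pid
  start program = "/etc/init.d/gunicorn start"
  stop program  = "/etc/init.d/gunicorn stop"
  group app
```

With this in place, a deploy script can call `monit -g app unmonitor` once before updating the code and `monit -g app monitor` once afterwards, instead of issuing one command per process back to back.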
Supervisor is a Python-based process management solution. It’s one of the newer contenders in the space, and shares design principles and goals with Upstart. As such, it takes care of daemonization as well as process monitoring.
Here’s a configuration analogous to the one shown for Upstart:
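A minimal sketch of what such a Supervisor program section could look like. The project path, virtualenv location, `celery` user, and worker command are assumptions made for illustration, and the command depends on the django-celery version in use:

```
; Illustrative sketch; paths and user are hypothetical.
[program:celery]
command=/srv/myproject/env/bin/python /srv/myproject/manage.py celery worker --loglevel=INFO
directory=/srv/myproject
user=celery
; Start with supervisord and restart the worker whenever it exits.
autostart=true
autorestart=true
```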