September 15, 2004

SysAdmin to SysAdmin: Service monitoring with Nagios

Author: Preston St. Pierre

Nagios calls itself an "open source host, service and network monitoring
program". In reality, though, it's more of a monitoring framework, in
that it allows an administrator to quickly fold the one-liners they use to
gather information right into the configuration. Add to this the numerous
plugins available, and you can easily integrate Nagios with monitoring tools
you already use, like RRDTool or MRTG.

First, though, you need to get your head around the way Nagios approaches
configuration in general, so we'll start there with a relatively simple
configuration. To get anything useful out of Nagios, you need to configure at
least four things: hosts, hostgroups, contacts, and services. I'm going to
assume that, as administrators, you're as capable of reading the README and
INSTALL files that come with Nagios as I am, so I'm not covering installation.
I'm also assuming that, once installed, the configuration directory is
/etc/nagios. In this directory, there should be sample configuration files to
give you an idea of how things work. If not, no worries -- we'll create them.
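
For orientation, here is a minimal sketch of how the main configuration file,
nagios.cfg, might pull in the object files discussed in this article. The
cfg_file directive is standard, but the file names hostgroups.cfg,
services.cfg and timeperiods.cfg are assumptions for illustration; use
whatever names your sample files or package actually provide.

# /etc/nagios/nagios.cfg (excerpt) -- some file names below are assumptions
cfg_file=/etc/nagios/hosts.cfg
cfg_file=/etc/nagios/hostgroups.cfg
cfg_file=/etc/nagios/services.cfg
cfg_file=/etc/nagios/contacts.cfg
cfg_file=/etc/nagios/contactgroups.cfg
cfg_file=/etc/nagios/timeperiods.cfg
cfg_file=/etc/nagios/command.cfg

Nagios doesn't care how the definitions are split across files, so one file
per object type is purely a housekeeping choice.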

The logic behind configuring Nagios is very (almost too) simple. You
have hosts, which presumably run services. Hosts providing the
same services can be grouped together into hostgroups for easy
summarization in the web front end. Likewise, your organization probably has
contacts for the different services. If there's more than one contact
for a particular service, you can put these contacts together under an alias or
contactgroup. If a machine Nagios monitors goes down or loses a service
it's been running, Nagios can be configured to notify the proper contact or
group for that host or service.

Configuring Hosts

We'll go in the order of the above paragraph in our configuration, so you can
always refer back to it if your mind gets numb. Let's create two hosts. Here
are two from my test configuration:


define host{
     name      generic-host
     notifications_enabled      1
     event_handler_enabled      1
     flap_detection_enabled      1
     process_perf_data      1
     retain_status_information      1
     retain_nonstatus_information      1
     register      0
     }

define host{
     use      generic-host
     host_name      cycle1
     alias      Cycle Server 1
     address      192.168.122.165
     check_command      check-host-alive
     max_check_attempts      10
     notification_interval      120
     notification_period      24x7
     notification_options      d,u,r
     }

define host{
     use      generic-host
     host_name      vpn1
     alias      VPN Server 1
     address      192.168.122.166
     check_command      check-host-alive
     max_check_attempts      10
     notification_interval      120
     notification_period      24x7
     notification_options      d,u,r
     }

This is a small host configuration file. The first entry will save you some
typing, since "generic-host" is just a template. In the other two entries here,
I've put use generic-host, which automatically applies generic-host's settings
to all of the hosts that use it. Line 1 of the template assigns a name. Line 2
allows you to turn notifications on and off, which is great for keeping your
inbox from exploding during testing with a large number of hosts. Line 3
enables event handling, which allows you to define a set of actions to take
when Nagios detects a change in the state of a host or service it's
monitoring. Line 4 enables flap detection, which protects your inbox or pager
in the event that a service or host is intermittently (and frequently)
changing state due to a network anomaly. Line 5 tells Nagios to process the
performance data that checks return alongside their status, which you can then
hand off to external tools for graphing and reporting. Lines 6 and 7 cause
Nagios to hold on to "last known values" across restarts of Nagios. Keep in
mind that this includes the program's own settings! Read the user guide on
ways to get around this. The last line tells Nagios not to register this entry
as a real host. It's just a template!
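
Because a host's own settings win over anything inherited from the template,
you can silence a single machine during testing without touching generic-host.
A minimal sketch, assuming a hypothetical test box named testbox1 with a
made-up address:

define host{
     use      generic-host
     host_name      testbox1 ; hypothetical host, not part of my test configuration
     alias      Test Box 1
     address      192.168.122.200 ; made-up address
     check_command      check-host-alive
     max_check_attempts      10
     notification_interval      120
     notification_period      24x7
     notification_options      d,u,r
     notifications_enabled      0 ; overrides the template's value of 1
     }

Flip the last line back to 1 (or just delete it) once you're happy with the
checks, and the host falls back in line with the template.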

Our host definitions are fairly ho-hum. First, we use the generic-host template
for both host definitions, so all things that are true for the template are
automagically true for the host definitions that use it. The
host_name line is supposed to be the actual hostname the machine in
question goes by, but the alias line is the descriptive name shown in the
Nagios web interface. The check_command line specifies which configured
Nagios command to use to determine if the host is even up. You can figure out
what the check-host-alive command does by looking in the
command.cfg file for the corresponding entry. The notification
settings are such that notifications can go out at any time (24x7), a host
that stays down triggers a repeat notification every 120 time units (that's
120 minutes with the default interval_length of 60 seconds), and a host is
considered "dead" only if the check fails 10 times in a row. The
notification_options line tells Nagios, per host, what will cause
a notification to be sent. In my case, I use "d"own, "u"nreachable, and
"r"ecovered. So if a machine is down, I get a message when it goes down, and
when it recovers from the down state. Using "unreachable" as an option is a
little obsessive if you're in a network where small occasional glitches can
cause one machine or another to become unreachable temporarily.
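
For reference, the check-host-alive entry in the sample command definitions
that ship with Nagios typically looks something like the following. Treat the
exact thresholds as an assumption and compare against your own command.cfg.
$USER1$ is the macro that normally points at the plugin directory, and
$HOSTADDRESS$ expands to the address line of the host being checked.

define command{
     command_name      check-host-alive
     command_line      $USER1$/check_ping -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 1
     }

Here -w and -c are the warning and critical thresholds (round-trip time in
milliseconds, then packet loss), and -p is the number of packets to send. With
thresholds this generous, the check is really only asking "did anything come
back at all?"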

Monitoring Services

OK, so we now have two hosts configured. At this point, all Nagios knows how to
do is ping them to see if they're alive, though. Let's set up monitoring for
the individual services on those machines that we care about. In this case,
since I administer the machines, I know that they both run SSH daemons, and
vpn1 also runs a name server and a print server. So we have three services to
set up. The configuration file format is the same for pretty much everything,
so this configuration file shouldn't be too scary by now.

First, again, we have a template entry where we can set flags that can then
essentially be "included" by the other entries, just like we had for our hosts.


define service{
     name      generic-service ; The 'name' of this service template, referenced in other service definitions
     active_checks_enabled      1 ; Active service checks are enabled
     passive_checks_enabled      1 ; Passive service checks are enabled/accepted
     parallelize_check      1 ; Active service checks should be parallelized (disabling this can lead to major performance problems)
     obsess_over_service      1 ; We should obsess over this service (if necessary)
     check_freshness      0 ; Default is to NOT check service 'freshness'
     notifications_enabled      1 ; Service notifications are enabled
     event_handler_enabled      1 ; Service event handler is enabled
     flap_detection_enabled      1 ; Flap detection is enabled
     process_perf_data      1 ; Process performance data
     retain_status_information      1 ; Retain status information across program restarts
     retain_nonstatus_information      1 ; Retain non-status information across program restarts

     register      0 ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
     }

With the template defined, we can go about the business of configuring our
actual services, and reference the template to get those settings on a
per-service basis.


define service{
     use     generic-service
     host_name      *
     service_description      SSH
     is_volatile      0
     check_period      24x7
     max_check_attempts      3
     normal_check_interval      5
     retry_check_interval      1
     contact_groups      linux-admins
     notification_interval      120
     notification_period      24x7
     notification_options      w,u,c,r
     check_command      check_ssh
     }

Looks somewhat familiar, no? Note that for the SSH service, I've used a
wildcard for the host_name. This is because I want to monitor that
service on every host configured. Most of the other flags are the same as in
the other configuration files, which is good news. One difference: for
services, the notification options are "w"arning, "u"nknown, "c"ritical, and
"r"ecovery. Note the check_command here. This command is defined in
command.cfg, and that definition points to the actual script used to check the
service. This is good for two reasons. First, it allows you to tweak the
script if needed. Second, it means that you could also drop your own script in
place to check whatever wacky service you might have running, and define your
own command, which can then be applied to whatever hosts you want.
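
To make that second reason concrete, here is a sketch of a home-grown command
and a service that uses it to watch the print server on vpn1. The command name
check_lpd_port and the choice of port 515 are assumptions for illustration,
not something from my test configuration. check_tcp is one of the standard
plugins, and the value after the ! in check_command is handed to the command
as $ARG1$.

define command{
     command_name      check_lpd_port ; hypothetical name
     command_line      $USER1$/check_tcp -H $HOSTADDRESS$ -p $ARG1$
     }

define service{
     use      generic-service
     host_name      vpn1
     service_description      Print Server
     is_volatile      0
     check_period      24x7
     max_check_attempts      3
     normal_check_interval      5
     retry_check_interval      1
     contact_groups      linux-admins
     notification_interval      120
     notification_period      24x7
     notification_options      w,u,c,r
     check_command      check_lpd_port!515 ; 515 assumes an LPD print server
     }

Swapping in your own script is just a matter of changing command_line to point
at it. As long as the script exits with the return codes Nagios plugins are
expected to use, Nagios won't know the difference.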

Configuring Hostgroups

So what if only a subset of my hosts are running SSH? I don't want to have to
list every single host that runs SSH individually, and I don't want to run SSH
everywhere -- what to do? Configure a hostgroup called "SSH Servers" that
contains the hostnames of those hosts in the hosts.cfg file that
run SSH. Here's an example of what a hostgroup looks like:


define hostgroup{
     hostgroup_name ssh-servers
     alias SSH Servers
     contact_groups ssh-gurus
     members ssh-host1,ssh-host2,ssh-host3
     }

Now, when you define the SSH service, just use hostgroup_name in place of
host_name, and all the right things will
happen. This is also nice, because as you add the service to different
machines, you can just add their hostnames to the right group, and off you go!
One quick note, though: if you define a hostgroup with a contact group that
doesn't exist, you'll get errors from Nagios, so let's create one
of those!
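
Before moving on, here is a sketch of the SSH service from the previous
section, rewritten to target the ssh-servers hostgroup rather than every
configured host. Everything except the hostgroup_name line is carried over
unchanged from that definition.

define service{
     use      generic-service
     hostgroup_name      ssh-servers ; replaces the host_name wildcard
     service_description      SSH
     is_volatile      0
     check_period      24x7
     max_check_attempts      3
     normal_check_interval      5
     retry_check_interval      1
     contact_groups      linux-admins
     notification_interval      120
     notification_period      24x7
     notification_options      w,u,c,r
     check_command      check_ssh
     }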

Configuring Contactgroups

If a service becomes unavailable, you probably want someone to know about it
PDQ. After all, if all Nagios did was make shiny pictures in your browser, nobody would
use it! Just as there are hosts and hostgroups, there are also
contacts and contactgroups. You define all of your contacts in the
contacts.cfg file, and collect those contacts into groups in the
contactgroups.cfg file. Here is a typical contact definition:


define contact{
     contact_name      jonesy
     alias      Brian K. Jones
     service_notification_period      workhours
     host_notification_period      workhours
     service_notification_options      c,r
     host_notification_options      d,r
     service_notification_commands      notify-by-email
     host_notification_commands      host-notify-by-email
     email      jonesy@my.domain.com
     }

Here, I've defined a contact (with an alias for easy viewing in a browser), the
periods during which this contact will receive host and service outage
notifications, what types of notifications I'll get (only "d"own and "r"ecovery
messages for hosts, for example), the notification mechanism (email in this
case), and finally, an email address. Setting up groups of contacts simply
requires that the contacts exist in the contacts.cfg file, and they're put
together exactly like hostgroups -- just give 'em a name, an alias, and a list
of members, and you're all set. Here's a quick example:


define contactgroup{
     contactgroup_name      sun-admins
     alias      Solaris Administrators
     members      jonesy
     }

It's a good idea to create all the groups you're likely to need going forward,
even if they only have one member. This way, when you add a Solaris
administrator (in this case), you only have to add them to the contacts file,
and then to the group definition, instead of having to hard-code the contact
name everywhere it belongs.
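
The examples in this article lean on a few names that are never defined: the
24x7 and workhours time periods, and the linux-admins and ssh-gurus contact
groups. Nagios will complain about any reference it can't resolve, so here is
a minimal sketch of those definitions. The hours and the group memberships are
assumptions; adjust them to fit your own shop.

define timeperiod{
     timeperiod_name      24x7
     alias      24 Hours A Day, 7 Days A Week
     sunday      00:00-24:00
     monday      00:00-24:00
     tuesday      00:00-24:00
     wednesday      00:00-24:00
     thursday      00:00-24:00
     friday      00:00-24:00
     saturday      00:00-24:00
     }

define timeperiod{
     timeperiod_name      workhours
     alias      Normal Working Hours
     monday      09:00-17:00 ; assumed hours
     tuesday      09:00-17:00
     wednesday      09:00-17:00
     thursday      09:00-17:00
     friday      09:00-17:00
     }

define contactgroup{
     contactgroup_name      linux-admins
     alias      Linux Administrators
     members      jonesy ; assumed membership
     }

define contactgroup{
     contactgroup_name      ssh-gurus
     alias      SSH Gurus
     members      jonesy ; assumed membership
     }

Days left out of a time period (the weekend, in the workhours case) simply
aren't covered by it, so a contact tied to workhours won't be notified then.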

In Conclusion

The goal of this article was not to see pretty stuff in your browser. It was to
get the uninitiated over the initial hump of understanding how Nagios is
generally configured. With this knowledge it will be a little easier to set up
a simple configuration, get some useful information back in the web interface,
and then browse the documentation and sample config files to grow your
monitoring solution.

Nagios benefits from fairly good documentation, and a fairly simple
configuration. It also suffers from this same simple configuration. In a
production environment with a web server farm, several DNS servers, mail
servers, development servers, remote access hosts, file servers and the like,
writing the configuration by hand can really be slow as molasses. There are a couple of web-based
tools to help ease the configuration burden, but none label themselves as
anything more than "beta". Overall, Nagios is a very powerful, very flexible
monitoring solution, with many plugins available to do almost anything, and
with a seemingly endless number of options for notification and service
monitoring. The best part, though, is that the layout and design of Nagios
makes it amazingly easy to drop in your own ideas that may be specific to your
environment's needs. I highly recommend getting to know Nagios.
