April 6, 2007

Complex service checks with Nagios

Author: David Josephsen

Nagios is a GPL-licensed framework that allows you to intelligently schedule little monitoring programs written in any language you choose. Nagios lets you monitor hosts, services, and networks. Here are a couple of examples of real-world monitoring scenarios.

This article is excerpted from the newly published book Building a Monitoring Infrastructure with Nagios, published by Prentice Hall Professional, Copyright 2007 Pearson Education, Inc. All rights reserved.

In the first example, Company B uses a rather unreliable combination of filters to block unwanted email on its public MXes. The problem is that its business partner, Company A, seems to be particularly disliked by the filters for some reason, so every few weeks, the filters arbitrarily decide to block all email originating from Company A. Various meetings have taken place to resolve the problem, but the combination of filters is so complex and Company B is so large that Company B just cannot seem to get it together. Every time the people who meet think the problem is fixed, it happens again; and worse, every time it happens, it takes up to a day to figure out that it's happening because the filters at Company B don't bounce the mail. Instead, they answer "250 not OK" and then silently drop the mail on the floor. (A security consultant told them this was the best thing to do.)

To at least provide timely detection of the problem, the system administrator at Company A defines a command that uses the check_smtp plugin to periodically perform an SMTP handshake with Company B's mail server:

define command{
   command_name   check_spam_block
   command_line   $USER1$/check_smtp -H $HOSTADDRESS$ \
                  -C 'helo companyA.com' -R '250 OK' \
                  -C 'mail from: <alice@companya.com>' \
                  -R '250 OK' \
                  -C 'rcpt to: <bob@companyb.com>' \
                  -R '250 OK'
}

This works well; if Company B answers anything other than "250 OK" to any part of the handshake, then the administrators at Company A are immediately notified. Further, there's no reason this definition cannot be expanded to include the data portion of the SMTP conversation, if it were required.
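The pass/fail idea behind this check can be sketched in a few lines of Python. This is only an illustration, not the check_smtp source: the function name first_mismatch and the sample replies are hypothetical, and the sketch assumes that a reply matches when it simply begins with the expected text.

```python
def first_mismatch(exchange):
    """Return the index of the first step whose reply does not start with
    the expected text, or None if every step matched.

    `exchange` is a list of (command, expected_prefix, actual_reply) tuples.
    """
    for index, (_command, expected, reply) in enumerate(exchange):
        if not reply.startswith(expected):
            return index
    return None


# Hypothetical transcript in which the recipient step is rejected.
transcript = [
    ("helo companyA.com", "250 OK", "250 OK"),
    ("mail from: <alice@companya.com>", "250 OK", "250 OK"),
    ("rcpt to: <bob@companyb.com>", "250 OK", "550 blocked"),
]

failed = first_mismatch(transcript)
if failed is None:
    print("OK - every reply matched")
else:
    print("PROBLEM - unexpected reply to:", transcript[failed][0])
```

Stopping at the first unexpected reply mirrors the behavior described above: a problem at any step of the handshake is enough to raise an alert.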

For the record, you should get permission from someone before you do things such as this. Monitoring things you don't own can get you into trouble. Another thing to keep in mind is that service checks that actually interact with the services they are watching affect things such as logs and connection statistics. If the data portion were included, Bob at Company B would actually get an email message; it's usually advisable to stop short of doing something that directly affects a human being. On the other hand, poorly written daemons might actually have problems with service checks that sever the connection at unexpected times. Finally, administrators on the other end might use filters to block access to your monitoring tools if they think the traffic might be malicious in nature. Always put some thought into the things you monitor, especially if those things don't belong to your company or group.

The next example centers on Ted, who is a systems administrator for a moderately sized health care company. Ted is responsible for obtaining SSL certificates from the company's rather shady PKI vendor, VeriSure. Ted is also responsible for registering new domain names, but the company doesn't use VeriSure for this. Recently, Ted's mailbox has been filling up with email from VeriSure. Most of it is marketing, offering Ted discounts to move his company's domain registry to VeriSure. Because his company owns a few domains and SSL certificates, Ted is receiving about 20 of these messages per day, so he has a dilemma. Ted wants to /dev/null all email from VeriSure, but he also needs to get SSL expiry notifications. Here's the command he uses:

define command{
   command_name   check_ssl
   command_line   $USER1$/check_http $ARG1$ -C 10
}

The check_http plugin can do all sorts of useful things. The job of the -C switch is to check the expiry date of a given Web site's SSL certificate. If the certificate on the Web site expires in less than the number of days given (10, in this case), the plugin generates a critical error. This solves Ted's problem and is probably a bit more reliable than VeriSure notifications.
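The threshold rule described here can be sketched as follows. This is a simplified illustration rather than the plugin's own code: the function names are hypothetical, the dates are made up, and the sketch assumes whole days and the single critical threshold described above.

```python
from datetime import datetime, timedelta


def days_remaining(not_after, now):
    """Whole days between `now` and the certificate's expiry time."""
    return (not_after - now).days


def expiry_status(not_after, now, threshold_days=10):
    """Return 'CRITICAL' when fewer than `threshold_days` days remain, else 'OK'."""
    if days_remaining(not_after, now) < threshold_days:
        return "CRITICAL"
    return "OK"


# Hypothetical check date and expiry dates.
now = datetime(2007, 4, 6)
print(expiry_status(now + timedelta(days=30), now))  # plenty of time left
print(expiry_status(now + timedelta(days=3), now))   # inside the 10-day window
```

A certificate that has already expired yields a negative day count, so it also falls below the threshold and is reported as critical.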

This definition is the first we've seen that doesn't use the $HOSTADDRESS$ macro. This is because we're specifying the name of a Web site, as opposed to the address of the host the service is attached to. The site name is passed via an ARG macro:

define service{
   host_name              webServer
   service_description    check_ssl
   check_command          check_ssl!www.myweb.org
   notification_options   c,w,r
   use                    chapter6template
}

An interesting digression is that, because the $HOSTADDRESS$ macro is normally what decides which host the plugin runs against, the host_name directive in the service definition can be whatever you want when that macro isn't used. That is, you can specify an unrelated accounting database server for host_name in the above code and the check will still work. In the absence of the $HOSTADDRESS$ macro in the command definition, the only place the host_name directive is used is the Web UI, which lists the check_ssl service as belonging to whatever host host_name references.
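A toy model of macro expansion shows why the host makes no difference here. This is a simplified sketch, not how Nagios itself implements macros: the expand_command function, the plugin directory, and the two addresses are all hypothetical.

```python
def expand_command(command_line, check_command, macros):
    """Fill $ARGn$ and any other supplied macros into a command_line template.

    `check_command` has the form 'name!arg1!arg2'; the name is ignored here.
    """
    _name, *args = check_command.split("!")
    values = dict(macros)
    for position, arg in enumerate(args, start=1):
        values[f"$ARG{position}$"] = arg
    for macro, value in values.items():
        command_line = command_line.replace(macro, value)
    return command_line


template = "$USER1$/check_http $ARG1$ -C 10"
check = "check_ssl!www.myweb.org"

# Same service attached to two different hosts (hypothetical addresses).
on_web_server = expand_command(
    template, check,
    {"$USER1$": "/usr/local/nagios/libexec", "$HOSTADDRESS$": "192.0.2.10"},
)
on_db_server = expand_command(
    template, check,
    {"$USER1$": "/usr/local/nagios/libexec", "$HOSTADDRESS$": "192.0.2.20"},
)

print(on_web_server)
print(on_web_server == on_db_server)
```

Because the template never mentions $HOSTADDRESS$, the expanded command line is identical whichever host the service is attached to.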
