January 30, 2006

CLI Magic: Use top to monitor PCs across a network

Author: JT Smith

User level: Advanced

Most Linux users are familiar with the top command, which is typically used to examine the system load on a local PC. However, have you considered using top to monitor other machines on your network automatically and to warn you when a server is being overloaded?

In order to identify an overloaded server you must first find out what the system load is. Look at the right side of the first line of top's output, where it says something like this:

top - 17:59:26 up  9:44,  2 users,  load average: 1.05, 0.36, 0.02

This information tells you the load average for the last one, five, and 15 minutes. What exactly does load average mean? Here's a great definition I learned years ago when I first started working with Unix:

The load average represents the number of computers you would need to be able to run all of the processes at the same time.

All you have to do is extract the load-average fields and use them to identify an overloaded server. You can't do this in normal top mode because the default is to display to the screen continually, refreshing itself every few seconds. However, you can use the -n flag to limit the number of iterations:

top -n 1

This causes top to run once and then exit.
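
The extraction step can be tried on a single header line before any scripting is involved. What follows is a minimal sketch rather than part of the scripts in this article: parse_load is a hypothetical helper name, and it assumes the header contains the literal text "load average:" and uses a period as the decimal separator, as in the sample line above.

```shell
#!/bin/bash
# parse_load is a hypothetical helper (not part of top): it prints the
# 1-, 5- and 15-minute load averages found in one top header line.
parse_load() {
    # Drop everything up to "load average:", then remove the commas.
    printf '%s\n' "$1" | sed -n 's/.*load average: *//p' | tr -d ','
}

# Sample header line copied from the article; prints: 1.05 0.36 0.02
parse_load "top - 17:59:26 up  9:44,  2 users,  load average: 1.05, 0.36, 0.02"
```

On a live system the same helper could be fed from top itself, for example with parse_load "$(top -b -n 1 | head -n 1)".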

To check the load on other machines, use SSH (for example, the OpenSSH SSH client) to run top on a remote server (in batch mode, using the -b flag):

ssh bainm@aeneas "export TERM=linux; top -n 1 -b"

This command connects to the host aeneas as the user bainm and then runs the top command. You can do this on multiple machines by using code similar to this:

USER="bainm"
HOSTS="acamas cassandra hector"

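getLoad ()
{
	HOST=$1
	TMP=$HOST.tmp
	ssh "$USER@$HOST" "export TERM=linux; top -n 1 -b" > "$TMP"
	# Keep only the three numbers after "load average:". Matching on the
	# label still works when the uptime field contains an extra comma
	# (for example, "up 3 days,  9:44").
	head -n 1 "$TMP" |
	sed 's/.*load average: *//' |
	tr -d ','
}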

for HOST in $HOSTS
do
	echo $HOST: $(getLoad $HOST)
done

This simple script logs onto each of the hosts in a list, runs top, and then extracts the load average. The resulting output looks like this:

acamas: 0.22 0.12 0.04
cassandra: 0.35 0.14 0.05
hector: 0.33 0.16 0.06
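
When a loop like this runs unattended, a single unreachable host can stall it or leave ssh waiting for a password. The sketch below shows one possible way to guard against that, using the OpenSSH client options BatchMode and ConnectTimeout. It is an illustration under stated assumptions, not the article's script: REMOTE_USER, fetch_top, and get_load are hypothetical names, and the 10-second timeout is an arbitrary choice.

```shell
#!/bin/bash
# Hypothetical variant of getLoad. REMOTE_USER, fetch_top and get_load are
# illustrative names, and the 10-second timeout is an assumed value.
REMOTE_USER="bainm"

# Fetch one batch-mode top snapshot from the host given as $1.
# BatchMode=yes makes ssh fail instead of prompting for a password.
fetch_top() {
    ssh -o BatchMode=yes -o ConnectTimeout=10 "$REMOTE_USER@$1" \
        "export TERM=linux; top -n 1 -b"
}

# Print the three load averages for host $1, or "unreachable" when no
# load-average line could be obtained.
get_load() {
    local load
    load=$(fetch_top "$1" 2>/dev/null | head -n 1 |
           sed -n 's/.*load average: *//p' | tr -d ',')
    if [ -n "$load" ]; then
        echo "$load"
    else
        echo "unreachable"
    fi
}
```

The block only defines the functions. A loop such as for HOST in $HOSTS; do echo "$HOST: $(get_load "$HOST")"; done would then report "unreachable" for a host that is down instead of hanging.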

Once you master this simple technique, you can start adapting it to your own requirements. For instance, you might prefer to create output files using crontab and then view the results through your Web browser by means of a CGI script. First, create a script to be run using crontab:

#!/bin/bash

USER="bainm"
HOSTS="acamas cassandra hector"
#/usr/lib/cgi-bin
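getLoad ()
{
        HOST=$1
        TMP=/tmp/$HOST.tmp
        ssh "$USER@$HOST" "export TERM=linux; top -n 1 -b" > "$TMP"
        # Keep only the three numbers after "load average:". Matching on the
        # label still works when the uptime field contains an extra comma
        # (for example, "up 3 days,  9:44").
        head -n 1 "$TMP" |
        sed 's/.*load average: *//' |
        tr -d ','
}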

for HOST in $HOSTS
do
        echo "$HOST $(getLoad $HOST)" > /tmp/myload_$HOST.tmp
done

Then write a CGI script to format the resulting data:

#!/bin/bash

HOSTS="acamas cassandra hector"
echo "Content-type: text/html"
echo ""
echo "<table>"
echo "<tr><th><th colspan=3>Load Averages</tr>"
echo "<tr><th>Host<th>1 Min.<th>5 Min.<th>15 Min.</tr>"
for HOST in $HOSTS
do
        cat /tmp/myload_$HOST.tmp |
        awk '{ print "<tr><td>"$1"<td>"$2"<td>"$3"<td>"$4"</tr>"}'
done
echo "</table>"
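
The CGI script assumes that every load file exists and holds a host name followed by three numbers. If a host could not be reached, the file may be missing or incomplete, and the table row will come out short. The following sketch shows one way the row could be generated defensively. render_row is a hypothetical helper name, and the "no data" placeholder text is an assumption, not something the original script produces.

```shell
#!/bin/bash
# render_row is a hypothetical helper: it prints one HTML table row for
# host $1 from the load file $2, or a placeholder row when the file is
# missing or does not hold exactly four fields on its first line.
render_row() {
    local host=$1 file=$2
    if [ -r "$file" ] && [ "$(awk '{ print NF; exit }' "$file")" = "4" ]; then
        awk 'NR == 1 { print "<tr><td>"$1"<td>"$2"<td>"$3"<td>"$4"</tr>" }' "$file"
    else
        echo "<tr><td>$host<td colspan=3>no data</tr>"
    fi
}
```

Inside the CGI loop, the cat and awk pipeline could then be replaced by a call such as render_row "$HOST" "/tmp/myload_$HOST.tmp".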

Use crontab -e to schedule the first script (in this case, to run once every five minutes):

*/5 * * * * /home/bainm/myLoad

Then use your favourite Web browser to view the results.

Hard-coding the HOSTS string into both scripts can cause problems if you have a large number of servers or if they're likely to change often. You don't want to have to continually edit both files just to add or remove host names. In such cases, I recommend saving the host names to a file and then reading the file into the scripts. Here's the host list file (~/hostlist):

acamas
cassandra
hector

and the amended code:

HOSTS="$(cat ~/hostlist)"
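
A host list that is edited by hand tends to pick up blank lines and notes. As a hedged sketch, the reading step could tolerate both. read_hosts is a hypothetical helper name, and treating lines that start with # as comments is an assumption about how you might want to annotate the file, not a convention the article defines.

```shell
#!/bin/bash
# read_hosts is a hypothetical helper: it prints the host names listed in
# the file given as $1 on a single line, separated by spaces. Blank lines
# and lines starting with # (an assumed comment convention) are skipped.
read_hosts() {
    awk 'NF && $1 !~ /^#/ { printf "%s%s", sep, $1; sep = " " }
         END { print "" }' "$1"
}
```

Either script could then set HOSTS="$(read_hosts ~/hostlist)" in place of the cat command.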

Whichever way you decide to identify the hosts to be monitored, you must still look at the output and decide which of the servers are being overloaded. You need to automate the process completely so that you can be alerted to a problem instead of having to find it yourself.

First, ask yourself what exactly "overloaded" means. A load average of around 1 implies that a single processor is fully utilized. When does that change to being overloaded, and more importantly, when will your users start to notice? I'd recommend a trigger level somewhere between 2 and 3 for a single-processor PC. You also need to consider how quickly you want to be notified. You don't really want to know about one freak peak that stops immediately. On the other hand, you don't want to ignore anything that's been going on for too long. The five-minute load average seems to be an appropriate time frame.
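
Because load averages are decimal numbers, the shell's integer tests cannot compare them directly, so awk is a convenient place to do the comparison. The sketch below illustrates the idea. is_overloaded is a hypothetical helper name, and scaling the trigger level by the number of processors is an assumption that extends the single-processor recommendation above. Treat it as a starting point rather than a rule.

```shell
#!/bin/bash
# is_overloaded is a hypothetical helper. It succeeds (exit status 0) when
# the load in $1 exceeds the per-processor trigger level in $2 multiplied
# by the processor count in $3 (assumed to be 1 when omitted).
is_overloaded() {
    awk -v load="$1" -v level="$2" -v cpus="${3:-1}" \
        'BEGIN { exit !(load + 0 > level * cpus) }'
}

# Example: a five-minute load of 2.5 against a trigger level of 2.
if is_overloaded 2.5 2; then
    echo "overloaded"
fi
```

With a third argument of 4, the same load of 2.5 would not trigger, because the effective limit becomes 8.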

Further reading

Learn more about top and what each field represents in Joe Barr's article, "CLI Magic: Getting on top of things."

To better understand how to set up and use SSH so that you don't have to enter your password whenever you log on to a remote machine, take a look at Joe "Zonker" Brockmeier's article, "CLI Magic: More on SSH."

To learn more about crontab and the techniques I've used in the scripts, take a look at my articles, "CLI Magic: Make time for crontab" and "Pipes and filters."

Now you can add some code that checks to see if any servers have reached the trigger levels. If they have, the script can fire off an email message to a system administrator. Put this code at the end of the file being used with crontab. That way, you have the best of both worlds: an automatic alert and a viewer:

TRIGGER_LEVEL=2

TRIGGER_TIME=5
case $TRIGGER_TIME in
1)TRIGGER_FIELD=2;;
5)TRIGGER_FIELD=3;;
15)TRIGGER_FIELD=4;;
esac

OVERLOAD=0
for HOST in $HOSTS
do
        AVLOAD=$(awk -vtrigger_field=$TRIGGER_FIELD '{print $trigger_field}' /tmp/myload_$HOST.tmp)
        OLCHECK=$(echo $AVLOAD $TRIGGER_LEVEL| awk '{if ($1>$2) {print 1} else {print 0}}')
        if [ $OLCHECK -eq 1 ]
        then
                let OVERLOAD=$OVERLOAD+1
                LOADHOST[$OVERLOAD]=$HOST
                LOADING[$OVERLOAD]=$AVLOAD
        fi
done

if [ $OVERLOAD -gt 0 ]
then
        WARN_EMAIL="/tmp/myload_warn.tmp"
        echo "Subject: Overload List" > $WARN_EMAIL
        echo "" >> $WARN_EMAIL
        C=1
        while [ $C -le $OVERLOAD ]
        do
                echo ${LOADHOST[$C]} ${LOADING[$C]} >> $WARN_EMAIL
                let C=$C+1
        done

        cat $WARN_EMAIL | ssmtp author@markbain-writer.co.uk
fi

From here you can develop the script to suit the way that you want to work. For instance, you can easily identify the owners of the processes loading the server:

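for HOST in $HOSTS
do
	# Discard everything before the COMMAND header line; -s suppresses
	# the byte counts that csplit would otherwise print
	csplit -s /tmp/$HOST.tmp %COMMAND%
	# The line after the header is the first process listed (top sorts
	# by CPU usage by default): print its owner and %CPU
	head -2 xx00 | tail -1 | awk '{print $2,$9}'
done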

You can now easily add this information to your message to the system administrator -- or, even better, you can have the script email the user directly. Chances are the user will either stop the offending process immediately or contact you to let you know why the process must keep running.
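
If you would rather not leave csplit's xx00 output files lying around, the same lookup can be done in a single awk pass. This is an alternative sketch, not the method used above. top_process is a hypothetical helper name, and it assumes top's default column order, in which USER is the second field and %CPU is the ninth.

```shell
#!/bin/bash
# top_process is a hypothetical helper: it prints the owner and %CPU of the
# first process listed after the column header in saved "top -b" output.
# $1 is the file holding that output. Assumes top's default column order
# (USER is field 2, %CPU is field 9).
top_process() {
    awk 'found { print $2, $9; exit } /COMMAND/ { found = 1 }' "$1"
}
```

Called as top_process "/tmp/$HOST.tmp" inside the loop, it produces the same two fields as the csplit version without creating any temporary files.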

This simple yet effective use of top helps monitor your system and keep your network running more efficiently.
