Linux.com

Feature

CLI Magic: Use top to monitor PCs across a network

By Mark Alexander Bain on January 30, 2006 (8:00:00 AM)

User level: Advanced

Most Linux users are familiar with the top command, typically used to examine the system load on a local PC and others on the network. However, have you considered using top to monitor your system automatically and to warn you when a server is being overloaded?

In order to identify an overloaded server you must first find out what the system load is. Look at the right-hand end of the first line of top's output, where it says something like this:

top - 17:59:26 up  9:44,  2 users,  load average: 1.05, 0.36, 0.02

This information tells you the load average for the last one, five, and 15 minutes. What exactly does load average mean? Here's a great definition I learned years ago when I first started working with Unix:

The load average represents the number of processors you would need to be able to run all of the processes at the same time.

All you have to do is extract the load-average fields and use them to identify an overloaded server. You can't do this in normal top mode because the default is to display to the screen continually, refreshing itself every few seconds. However, you can use the -n flag to limit the number of iterations:

top -n 1

This causes top to run once and then exit.
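
The extraction step can be tried on its own before any remote machines are involved. The following is a minimal sketch rather than part of the article's scripts: the function name parse_load is hypothetical, and the sample line is the header shown earlier. It assumes that top prints the text "load average:" followed by three comma-separated numbers, as the procps version of top does in an English locale.

```shell
#!/bin/sh
# parse_load (hypothetical helper): read one top header line on stdin and
# print the 1-, 5- and 15-minute load averages separated by spaces.
parse_load ()
{
	# Delete everything up to "load average:", then remove the commas.
	sed -e 's/.*load average: *//' -e 's/,//g'
}

# Sample header line copied from the example above.
LINE='top - 17:59:26 up  9:44,  2 users,  load average: 1.05, 0.36, 0.02'
echo "$LINE" | parse_load
```

On a live system the same function could be fed with top -n 1 -b | head -1 | parse_load. Because it keys on the "load average:" label rather than on field positions, it should give the same result when the uptime part of the line contains an extra comma (for example "up 3 days, 9:44").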

To check the load on other machines, use SSH (for example, the OpenSSH SSH client) to run top on a remote server (in batch mode, using the -b flag):

ssh bainm@aeneas "export TERM=linux; top -n 1 -b"

This command connects to the host aeneas as the user bainm and then runs the top command. You can do this on multiple machines by using code similar to this:

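#!/bin/bash

USER="bainm"
HOSTS="acamas cassandra hector"

# Run top once on the given host and print its three load averages.
getLoad ()
{
	local HOST=$1
	local TMP=$HOST.tmp
	ssh "$USER@$HOST" "export TERM=linux; top -n 1 -b" > "$TMP"
	# Keep only what follows "load average:" and drop the commas, so an
	# uptime such as "up 3 days, 9:44" does not shift the fields.
	head -1 "$TMP" |
	sed -e 's/.*load average: *//' -e 's/,//g'
}

for HOST in $HOSTS
do
	echo "$HOST: $(getLoad "$HOST")"
done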

This simple script logs onto each of the hosts in a list, runs top, and then extracts the load average. The resulting output looks like this:

acamas: 0.22 0.12 0.04
cassandra: 0.35 0.14 0.05
hector: 0.33 0.16 0.06

Once you master this simple technique, you can start adapting it to your own requirements. For instance, you might prefer to create output files using crontab and then view the results through your Web browser by means of a CGI script. First, create a script to be run using crontab:

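#!/bin/bash

USER="bainm"
HOSTS="acamas cassandra hector"
#/usr/lib/cgi-bin

# Run top once on the given host, keep the full output in /tmp/$HOST.tmp,
# and print the three load averages.
getLoad ()
{
        local HOST=$1
        local TMP=/tmp/$HOST.tmp
        ssh "$USER@$HOST" "export TERM=linux; top -n 1 -b" > "$TMP"
        # Keep only what follows "load average:" and drop the commas, so an
        # uptime such as "up 3 days, 9:44" does not shift the fields.
        head -1 "$TMP" |
        sed -e 's/.*load average: *//' -e 's/,//g'
}

# Write one line per host: name, then the 1-, 5- and 15-minute averages.
for HOST in $HOSTS
do
        echo "$HOST $(getLoad "$HOST")" > "/tmp/myload_$HOST.tmp"
done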

Then write a CGI script to format the resulting data:

#!/bin/bash

HOSTS="acamas cassandra hector"
echo "Content-type: text/html"
echo ""
echo "<table>"
echo "<tr><th><th colspan=3>Load Averages</tr>"
echo "<tr><th>Host<th>1 Min.<th>5 Min.<th>15 Min.</tr>"
for HOST in $HOSTS
do
        cat /tmp/myload_$HOST.tmp |
        awk '{ print "<tr><td>"$1"<td>"$2"<td>"$3"<td>"$4"</tr>"}'
done
echo "</table>"

Use crontab -e to schedule the script (in this case, once every five minutes):

*/5 * * * * /home/bainm/myLoad

Then use your favourite Web browser to view the results.

[Figure: Output in browser]

Hard-coding the HOSTS string into both scripts can cause problems if you have a large number of servers or if they're likely to change often. You don't want to have to continually edit both files just to add or remove host names. In such cases, I recommend saving the host names to a file and then reading the file into the scripts. Here's the host list file (~/hostlist):

acamas
cassandra
hector

and the amended code:

HOSTS="$(cat ~/hostlist)"
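
Reading the file with cat works as long as it contains nothing but host names. As a sketch under stated assumptions (the "#" comment convention and the function name read_hosts are not part of the article), the list can also be filtered so that blank lines and comment lines are ignored:

```shell
#!/bin/sh
# read_hosts FILE (hypothetical helper): print the host names found in a
# list file, one per line, skipping blank lines and lines starting with "#".
read_hosts ()
{
	# NF > 0 skips blank lines; the regex test skips comment lines.
	# Printing $1 also trims any surrounding whitespace.
	awk 'NF > 0 && $1 !~ /^#/ { print $1 }' "$1"
}

# Demonstration with a throwaway list file instead of ~/hostlist.
LIST=$(mktemp)
printf '%s\n' '# web servers' 'acamas' '' 'cassandra' '  hector' > "$LIST"

HOSTS="$(read_hosts "$LIST")"
for HOST in $HOSTS
do
	echo "would check: $HOST"
done
rm -f "$LIST"
```

In the real scripts the call would be HOSTS="$(read_hosts ~/hostlist)", leaving the rest of the code unchanged.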

Whichever way you decide to identify the hosts to be monitored, you must still look at the output and decide which of the servers are being overloaded. You need to automate the process completely so that you can be alerted to a problem instead of having to find it yourself.

First, ask yourself what exactly "overloaded" means. A load average of around 1 implies that a processor is being utilized correctly. When does that change to being overloaded, and more importantly, when will your users start to notice? I'd recommend a trigger level somewhere between 2 and 3 for a single-processor PC. You also need to consider how quickly you want to be notified. You don't really want to know about one freak peak that stops immediately. On the other hand, you don't want to ignore anything that's been going on for too long. The five-minute load average seems to be an appropriate time frame.
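
The comparison itself needs a little care, because load averages are decimal numbers and the shell's own [ ... -gt ... ] test only handles integers. Here is a minimal sketch of one way around this, handing the comparison to awk. The function name is_overloaded and the sample values are illustrative only:

```shell
#!/bin/sh
# is_overloaded LOAD TRIGGER (hypothetical helper)
# Succeeds (exit status 0) when LOAD is strictly greater than TRIGGER.
is_overloaded ()
{
	# Adding 0 forces awk to compare the two values as numbers rather
	# than as strings (as strings, "10.40" would sort before "2").
	awk -v load="$1" -v trigger="$2" 'BEGIN { exit !(load + 0 > trigger + 0) }'
}

TRIGGER_LEVEL=2
for LOAD in 0.36 2.00 2.75 10.40
do
	if is_overloaded "$LOAD" "$TRIGGER_LEVEL"
	then
		echo "$LOAD: overloaded"
	else
		echo "$LOAD: ok"
	fi
done
```

With a trigger level of 2, the first two sample values are reported as ok and the last two as overloaded.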

Further reading

Learn more about top and what each field represents in Joe Barr's article, "CLI Magic: Getting on top of things."

To better understand how to set up and use SSH so that you don't have to enter your password whenever you log on to a remote machine, take a look at Joe "Zonker" Brockmeier's article, "CLI Magic: More on SSH."

To learn more about crontab and the techniques I've used in the scripts, take a look at my articles, "CLI Magic: Make time for crontab" and "Pipes and filters."

Now you can add some code that checks to see if any servers have reached the trigger levels. If they have, the script can fire off an email message to a system administrator. Put this code at the end of the file being used with crontab. That way, you have the best of both worlds: an automatic alert and a viewer:
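TRIGGER_LEVEL=2         # load average that counts as overloaded

TRIGGER_TIME=5          # which average to test: 1, 5, or 15 minutes
# Map the time frame to its field in /tmp/myload_$HOST.tmp
# (field 1 is the host name).
case $TRIGGER_TIME in
1)TRIGGER_FIELD=2;;
5)TRIGGER_FIELD=3;;
15)TRIGGER_FIELD=4;;
*)echo "TRIGGER_TIME must be 1, 5, or 15" >&2; exit 1;;
esac

# Collect the overloaded hosts and their load averages in two arrays.
OVERLOAD=0
for HOST in $HOSTS
do
        AVLOAD=$(awk -v trigger_field=$TRIGGER_FIELD '{print $trigger_field}' /tmp/myload_$HOST.tmp)
        # The shell's -gt only handles integers, so let awk compare the decimals.
        OLCHECK=$(echo $AVLOAD $TRIGGER_LEVEL| awk '{if ($1>$2) {print 1} else {print 0}}')
        if [ "$OLCHECK" -eq 1 ]
        then
                let OVERLOAD=$OVERLOAD+1
                LOADHOST[$OVERLOAD]=$HOST
                LOADING[$OVERLOAD]=$AVLOAD
        fi
done

# If any host is overloaded, build a message and mail it.
if [ $OVERLOAD -gt 0 ]
then
        WARN_EMAIL="/tmp/myload_warn.tmp"
        echo "Subject: Overload List" > $WARN_EMAIL
        echo "" >> $WARN_EMAIL
        C=1
        while [ $C -le $OVERLOAD ]
        do
                echo ${LOADHOST[$C]} ${LOADING[$C]} >> $WARN_EMAIL
                let C=$C+1
        done

        ssmtp author@markbain-writer.co.uk < $WARN_EMAIL
fi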

You can now start developing the script to suit the way that you want to work. For instance, you can easily identify the owners of the processes loading the server:

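for HOST in $HOSTS
do
	# Discard everything before the column-header line (the one containing
	# COMMAND); the remainder is written to xx00 in the current directory.
	csplit -s /tmp/$HOST.tmp %COMMAND%
	# Line 2 of xx00 is the first process listed: print its user and %CPU.
	head -2 xx00 | tail -1 | awk '{print $2,$9}'
done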

You can now easily add this information to your message to the system administrator -- or, even better, you can have the script email the user directly. Chances are the user will either stop the offending process immediately or contact you to let you know why the process must keep running.
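
As a sketch of the first step, the owner and CPU share of the first process listed can also be pulled out of saved top output with awk alone, without leaving xx00 files behind. The function name top_user and the sample data below are made up for illustration, and the column positions (user in field 2, %CPU in field 9) assume the same top layout as the loop above:

```shell
#!/bin/sh
# top_user FILE (hypothetical helper): print the user and %CPU of the first
# process listed after the column-header line of saved "top -b" output.
top_user ()
{
	awk 'seen { print $2, $9; exit }   # first line after the header
	     /COMMAND/ { seen = 1 }        # header line: start looking' "$1"
}

# Made-up sample of top batch output, standing in for /tmp/$HOST.tmp.
SAMPLE=$(mktemp)
cat > "$SAMPLE" <<'EOF'
top - 17:59:26 up  9:44,  2 users,  load average: 1.05, 0.36, 0.02
Tasks:  85 total,   2 running,  83 sleeping,   0 stopped,   0 zombie

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 4242 bainm     25   0  2184  964  808 R 97.3  0.2   5:12.40 cruncher
    1 root      16   0  1584  520  456 S  0.0  0.1   0:01.02 init
EOF

top_user "$SAMPLE"
rm -f "$SAMPLE"
```

The printed user name could then be used as the recipient when the warning message is sent.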

This simple yet effective use of top helps monitor your system and keep your network running more efficiently.

Comments

on CLI Magic: Use top to monitor PCs across a network

Note: Comments are owned by the poster. We are not responsible for their content.

ehh....why on earth...

Posted by: Anonymous Coward on January 30, 2006 05:53 PM
...would you want to fire up top when uptime(1) will do the job nicely and on more platforms as well?

#

Re:ehh....why on earth...

Posted by: Anonymous Coward on January 30, 2006 06:08 PM
Or even better, ruptime(1).

#

Re:ehh....why on earth...

Posted by: Anonymous Coward on January 30, 2006 06:31 PM
Ahemm.....friends don't let friends use r-utils

#

Re:ehh....why on earth...

Posted by: Anonymous Coward on January 30, 2006 09:19 PM
I hardly think that allowing ruptime connections to hosts on the local (switched) network is as bad as, say, using rexec.

#

Re:ehh....why on earth...

Posted by: Administrator on January 31, 2006 03:30 AM
a) Any unsecured remote operation is a security risk, even if it seems unlikely. Who would have thought you could use 20 year old code designed to draw metafiles to own a Windows box?



b) Being on a switched network does not preclude sniffing. I'd argue that it makes it easier because most people THINK switched networks are "secure", and thus do nothing to actually secure them. Google for "arp cache poisoning", and leave the quotes.

#

Re:ehh....why on earth...

Posted by: Anonymous Coward on January 31, 2006 10:16 PM
Or even better:

cat /proc/loadavg

#

Ahem ...

Posted by: Anonymous Coward on January 30, 2006 06:41 PM
you might want to give some more thought to this. It won't work with one of the servers actually being overloaded. Try to do this on a server with a really too high load (try 700 or 800).

Just a thought.

Someone ought to design an snmp kernel module for this. With locked memory etc.

#

Ganglia

Posted by: Anonymous Coward on January 31, 2006 01:36 AM
There is also software called Ganglia; maybe it's more for clusters?
I guess it works on all server farms and such.

http://www.ganglia.info/
http://freshmeat.net/projects/ganglia/

It scales well, can handle 2000 nodes.

#

local

Posted by: Anonymous Coward on January 31, 2006 08:40 PM
Use 'local' for variables in functions.

#

Re:my waffle

Posted by: Anonymous Coward on February 01, 2006 01:18 AM
I agree that Cacti and Nagios make great tools, but this series of articles is called CLI Magic, meaning command line interface magic. Introducing the tools you mentioned defeats the point of the series: how to use the CLI. While the solution is not the best, it is a useful exercise to see what can be done. I personally like this article.

#

Very good

Posted by: Anonymous Coward on February 03, 2006 09:31 AM
Very good!!!

#

my waffle

Posted by: Administrator on January 30, 2006 07:12 PM
Two things:

First thing: I realise this is a Linux site, but it might be worth pointing out that on some OSes load average is divided by number of processors, and some not. Solaris being a notable example of the latter. On a machine with 64 CPUs, it's quite nice to have a load average of 60 with Solaris, but with Linux, time to call the sysop...

Second thing: calling top to get load average is pretty inefficient. Use "uptime" or "cat /proc/loadavg" (less portable). Having said that, setting up the ssh connection weighs a fair bit, so the resources for top aren't such a biggie. There are better ways though. A little netcat wizardry?

Hmm, well, there are nice monitoring packages already available, so I guess it's just an example.

I like this column. It gives me a chance to be a pedant. :p

[goes back to sleep]

#

What the hell?

Posted by: Administrator on January 30, 2006 09:35 PM
I'm sure that I'm repeating what other people are commenting on, but I think it needs to be strongly emphasized.

I'm not even going to read your article because plain and simple, anyone who thinks that top is a good choice for incorporating into a web script for remote system monitoring has either not been shown the right tools to use OR has not put enough time into finding them.

It really shouldn't be that hard to find either. top is resource intensive enough that it's going to throw off your system load a bit. uptime is a better program to use. But even beyond that, there are many systems for remotely monitoring system load and other statistics.

Please do yourself a favor and read up on the following topics:


  • SNMP: http://www.net-snmp.com/ (the proper protocol for seeing remote system statistics)

  • Cacti: http://www.cacti.net/ (for graphing)

  • Nagios: http://www.nagios.org/ (for monitoring and alerting)



There is far more you can do with the above and they are not as hard as most people think they are.

#

This story has been archived. Comments can no longer be posted.



 