September 2, 2008

Taming your daemons with PSMon

Author: Ben Martin

The PSMon utility lets you specify which processes should be running, how much CPU or RAM each is allowed to use when it runs, and how many instances may run at once. PSMon then ensures that these processes are running, kills a process if it starts to use too many resources, and can restart a process if it has crashed.

PSMon is not in the repositories for Fedora 9, Ubuntu Hardy, or openSUSE 11. You can install PSMon using CPAN as described in the PSMon manual. There is also an install script stored in the utility's support subdirectory that will take care of installation tasks for you.

PSMon needs a few Perl modules to function. The support/ script will install those Perl modules for you, or you can get them from your distribution's package repository first. The advantage of installing from the package repository is that you can keep the modules up-to-date through your normal Linux distribution updates. The commands shown below first install these extra Perl modules, then run the install script for the PSMon program.

# yum install perl-CPAN perl-YAML
# yum install perl-Config-General perl-Proc-ProcessTable perl-Unix-Syslog
# tar xjf psmon-1.29.tar.bz2
# cd psmon*
# ./support/
Checking for Config::General ... found
Checking for Proc::ProcessTable ... found
Checking for Unix::Syslog ... found
Checking for Getopt::Long ... found
Installing psmon ... done
Installing psmon-config ... done
Installing etc/psmon.conf ... done
Generating HTML documentation support/psmon.html ... done
Installing manual psmon.1 ... done
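If you want to confirm that the required Perl modules are visible to Perl before running the installer, a quick check along the following lines may help. This is only a sketch: the check_perl_module helper is a hypothetical name, not part of PSMon, and the module list is copied from the installer output above.

```shell
# check_perl_module MODULE
# Succeeds if Perl can load MODULE, fails otherwise (including when
# perl itself is not installed).
check_perl_module() {
    perl -M"$1" -e 1 2>/dev/null
}

# Module names taken from the installer output shown above.
for module in Config::General Proc::ProcessTable Unix::Syslog Getopt::Long; do
    if check_perl_module "$module"; then
        echo "$module ... found"
    else
        echo "$module ... missing"
    fi
done
```

Any module reported as missing can then be installed from your distribution's repository or from CPAN.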

The configuration file generated by the script has key value pairs either at the top level of the file or nested inside Process groupings. The syntax is designed to be similar to that of the Apache configuration file. There is a special Process * group that lets you apply settings for all processes. However, this might not work as you expect -- it could end up killing many processes that you did not intend to get rid of, so you should avoid using the Process * group.

Near the top of the default /etc/psmon.conf file you will see Disabled True, making PSMon not do anything until you have changed this directive in the configuration file.

PSMon supports a small collection of directives that are designed to be used at the top level, outside of any Process group. One of these sets the frequency (in seconds, default 60) with which PSMon scans the process table. Changing this to 5 seconds will cause crashed processes to be respawned and badly behaving processes to be killed more quickly, but PSMon will consume more CPU time on the machine. The AdminEmail directive (default root@localhost) lets you set the email address that PSMon notifies when processes are spawned or killed, or when a failure occurs while it performs those operations.

There are also two directives, NeverKillPID and NeverKillProcessName, that can be used to protect processes from ever being killed. They take a space-delimited list of process IDs (PIDs) and of process names, respectively. By default NeverKillPID is set to 1, and NeverKillProcessName to a list of kernel threads that you really don't want to kill by mistake.
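To make the top-level directives concrete, the sketch below writes an illustrative fragment to a scratch file so that you can review it before merging anything into /etc/psmon.conf. Several details are assumptions rather than facts from the PSMon manual: the helper name write_toplevel_sample is hypothetical, Frequency is the assumed name of the scan-interval directive, and the kernel thread names are placeholders for whatever your default file lists.

```shell
# write_toplevel_sample FILE
# Writes an illustrative set of top-level directives to FILE.
# Nothing here touches /etc/psmon.conf.
write_toplevel_sample() {
    cat > "$1" <<'EOF'
# Keep PSMon inert until the rest of the file has been reviewed.
Disabled True

# Assumed directive name for the scan interval, in seconds.
Frequency 60

AdminEmail root@localhost

# PID 1 is init; the process names below are placeholders only.
NeverKillPID 1
NeverKillProcessName kswapd kjournald
EOF
}

sample=$(mktemp)
write_toplevel_sample "$sample"
cat "$sample"
```

Check each directive name against the psmon manual page before copying lines into the live configuration.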

The example below shows a Process group, which is started and finished with XML-like tags. After the Process declaration you put the name of the process that you are describing. You cannot include path information in the process name, and should omit any command-line options that the command might have taken. Being able to specify the full path (or a regular expression to match against) of the process you wish to use PSMon with would be a welcome enhancement. For the SSH daemon, simply using sshd is not likely to generate any false hits with other running processes. In this example the sshd process group ensures that the SSH daemon is up and running, should it exit or crash for any reason.

<Process sshd>
SpawnCmd /sbin/service sshd start
</Process>

Other directives that you can use in a Process group include Instances, to control the maximum number of processes that can be running, and KillCmd, which lets you specify a custom way to close the process if it is misbehaving. If KillCmd is not specified, a SIGKILL will be sent to close the process. You might like to consider using a KillCmd to send a SIGTERM to the process, wait a few seconds, and then send a stronger SIGKILL if the process is still around. Another good option for the KillCmd is to use the /etc/init.d scripts to stop a service.
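The SIGTERM-then-SIGKILL approach could be wrapped in a small helper such as the one sketched below. The function name graceful_kill and the five-second default are assumptions made for illustration. How the PID reaches the helper (for instance, read from the daemon's PID file by a wrapper script named in KillCmd) is also left to you, since this sketch does not assume that PSMon passes a PID to the command.

```shell
# graceful_kill PID [GRACE_SECONDS]
# Send SIGTERM, wait up to GRACE_SECONDS (default 5) for the process
# to exit, then send SIGKILL if it is still present.
graceful_kill() {
    pid=$1
    grace=${2:-5}

    # Nothing to do if the process is already gone. Note that kill -0
    # also fails when you lack permission to signal the process.
    kill -0 "$pid" 2>/dev/null || return 0

    kill -TERM "$pid" 2>/dev/null || return 0

    waited=0
    while [ "$waited" -lt "$grace" ]; do
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
        waited=$((waited + 1))
    done

    # Still present after the grace period: escalate.
    kill -KILL "$pid" 2>/dev/null
    return 0
}
```

Giving a daemon a few seconds to respond to SIGTERM lets it flush logs and remove lock files, which a bare SIGKILL does not allow.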

You can set resource limits for a process with the PctCPU, PctMEM, and TTL directives, which cap the percentage of CPU and RAM the process may use and how long it may live in total. The PIDFile directive gives PSMon the path of a file that contains the process ID of a daemon that you don't want PSMon to kill. PIDFile is only useful if you are also using the PctCPU, PctMEM, or TTL directives. As an example of why you might like to use the PIDFile directive, consider a daemon that spawns many children to perform network communications. You might like to make sure that the children do not consume more than 70% of the system's RAM. Using the PIDFile you can tell PSMon not to kill the main control process, but only the child worker processes if they start to consume too much memory.
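A Process group for the daemon-with-children scenario might look something like the fragment written by the sketch below. The process name httpd, the PID file path, and the 90% CPU figure are hypothetical, and the assumption that the limits take a bare number should be checked against the psmon manual page. The helper writes to a scratch file rather than to /etc/psmon.conf.

```shell
# write_limits_sample FILE
# Writes an illustrative Process group with resource limits to FILE.
write_limits_sample() {
    cat > "$1" <<'EOF'
<Process httpd>
    # The PID stored in this file (the parent) is exempt from killing;
    # only the worker children remain candidates. Path is hypothetical.
    PIDFile /var/run/httpd.pid
    PctMEM 70
    PctCPU 90
</Process>
EOF
}

limits=$(mktemp)
write_limits_sample "$limits"
cat "$limits"
```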

The TTL directive is handy to ensure that processes that are meant to complete within a known amount of time have done so. For example, you can limit the updatedb command or the use of unison or find to a fixed duration to stop them from running unchecked from a user's cron job. The example below gives find a TTL of 86400 seconds (24 hours) and also caps it at 30 instances:

<Process find>
ttl 86400
instances 30
</Process>
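Before enabling a TTL, it can be useful to preview which running processes would already exceed it. The sketch below does that with ps and awk. It is independent of PSMon and is not how PSMon itself inspects the process table. The function names are hypothetical, and the etimes field assumes a procps-ng version of ps, as found on current Linux distributions.

```shell
# filter_over_ttl NAME TTL
# Reads "PID ELAPSED_SECONDS COMMAND" lines on stdin and prints the
# PIDs whose command is NAME and whose elapsed time exceeds TTL.
filter_over_ttl() {
    awk -v name="$1" -v ttl="$2" '$3 == name && $2 > ttl { print $1 }'
}

# list_over_ttl NAME TTL
# Feeds the live process table into the filter. The comm field is
# truncated by the kernel to 15 characters.
list_over_ttl() {
    ps -eo pid=,etimes=,comm= | filter_over_ttl "$1" "$2"
}

# Example: PIDs of find processes older than 24 hours (86400 seconds).
list_over_ttl find 86400
```

An empty result means no matching process has been running longer than the limit.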

You can control how verbose PSMon is using the NoEmail, NoEmailOnKill, and NoEmailOnSpawn directives. These all default to False, but setting them to True will result in no emails at all, none on process killing, or none on process spawning, respectively.

You can also set the LogLevel and AdminEmail directives on a per-process section basis, so you can send email to an SMS gateway when a very important process such as Apache has crashed. Changing the LogLevel also affects how failed respawn attempts are reported. PSMon reports a failure to stop or start a process using the LogLevel plus one, so setting the Apache group to have a high LogLevel will also cause PSMon to report respawn errors to syslog with a high priority.

Sending the USR1 signal to PSMon when it is running as a daemon will make it rescan the running processes immediately. You can start PSMon as a daemon using the --daemon command-line option.
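A small helper for triggering that rescan might look like the sketch below. It assumes the daemon appears in the process table under the name psmon, which may not hold on your system, and it uses pgrep to find it. The function name rescan_psmon is hypothetical.

```shell
# rescan_psmon [NAME]
# Sends SIGUSR1 to every process whose name is exactly NAME
# (default: psmon, an assumed process name).
rescan_psmon() {
    name=${1:-psmon}
    pids=$(pgrep -x "$name") || {
        echo "no running process named $name" >&2
        return 1
    }
    # $pids is deliberately unquoted so that several PIDs are passed
    # to kill as separate arguments.
    kill -USR1 $pids
}

# Typical use, as root:
#   psmon --daemon     # start the daemon
#   rescan_psmon       # force an immediate scan
```

Only point this helper at a process that handles USR1, because the default action for that signal is to terminate the process.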

Final words

I am not too sold on the idea of killing processes if they are using too much of a system's resources, since a process may legitimately be using 95% of the CPU for a few minutes and you wouldn't want it to be killed. Enforcing a maximum run time, if you select a time well beyond what most legitimate uses of the command would require, can help to protect the system from badly behaving cron jobs when you are not around to notice them. Being able to respawn processes automatically if they have exited is certainly useful -- although sshd and Apache do not tend to crash much, you can bet the one time they do is when you board an airplane for a nine-hour flight. Its multiple capabilities make PSMon a worthy utility for your system administration toolkit.

