My sysadmin toolbox

Author: Ben Browning

I am the senior system administrator for a national ISP. We run a cluster of blade servers as our primary mail/Web/DNS/RADIUS farm. I have found several tools that I cannot live without in this environment.

The pipe

Pipefitting commands is integral to my daily routine. It’s not uncommon for me to have a command with 10 or more pipes in it. The | operator makes everything more useful, but a few of the commands I use religiously with it are cut, grep, perl -ne (more on that in a second), find, wc, xargs, nc (netcat), sort, head, tail, and uniq.

For example, to see which three SMTP servers are the busiest, I run the following command on my central log host:

grep smtpd /var/log/qmail.log | cut -d ':' -f3 | cut -d ' ' -f 2 | sort | uniq -c | sort -n -r | head -n 3
449997 server003
448539 server002
445012 server001
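
The counting idiom at the end of that pipeline (sort | uniq -c | sort -n -r | head) works on any stream of names, not just mail logs. Here is a minimal sketch on made-up input; the host names and counts are illustrative only and do not come from a real log:

```shell
# Hypothetical sample data: one server name per log line.
sample='server002
server001
server002
server003
server002
server001'

# sort groups identical names together, uniq -c counts each group,
# sort -n -r puts the largest count first, and head keeps the top two.
top=$(printf '%s\n' "$sample" | sort | uniq -c | sort -n -r | head -n 2)
printf '%s\n' "$top"
```

The first sort matters: uniq only collapses adjacent duplicate lines, so unsorted input would produce several partial counts for the same name.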

Perl

If the Internet is the Information Superhighway, then Perl is the Fix-a-Flat and the spare tire — and the spare drive-shaft, should you need it. Anything you can do in a shell or sed or awk script, you can do in Perl. With the -ne options, you can iterate automatically over every line of input in a pipe chain:

cat /etc/passwd | perl -ne 'print if $_=~/daemon/'
daemon:x:1:1:daemon:/usr/sbin:/bin/sh

Another handy feature of Perl is that it comes with the rename tool, which lets you rename files based on Perl regular expressions. You could use it to translate FilesWithCaps to fileswithcaps, or replace spaces with underscores (_), or do a number of other fun things such as appending or prepending strings to the filenames:

$ ls
01.jpg  02.JPG  file 07.jpg
$ rename 's/^/image-/; tr/A-Z /a-z_/' *
$ ls
image-01.jpg  image-02.jpg  image-file_07.jpg
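
Some distributions ship a different rename command (the util-linux one) that does not accept Perl expressions. On those systems a plain shell loop can do the same job. This is a substitute technique, not the Perl tool itself, and it runs in a scratch directory with made-up file names so nothing real gets renamed:

```shell
# Scratch directory with the same three sample names as above.
dir=$(mktemp -d)
touch "$dir/01.jpg" "$dir/02.JPG" "$dir/file 07.jpg"

# The glob is expanded once, before the loop starts, so files that
# have already been renamed are not picked up a second time.
for path in "$dir"/*; do
    name=${path##*/}
    # Lowercase the name, turn spaces into underscores, then add the prefix.
    new="image-$(printf '%s' "$name" | tr 'A-Z ' 'a-z_')"
    mv -- "$path" "$dir/$new"
done
ls "$dir"
```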

OpenSSH and SSH keys

Naturally, I use OpenSSH to handle all remote shells on my servers. I also use SSH keys — a public key in ~/.ssh/authorized_keys that matches the private one on my laptop so I don’t have to type a password to log in. This can be a real lifesaver when managing multiple machines. For example, I wrote this little script:

---
#!/usr/bin/perl
    $domain = "foo.mycompany.net";
    @hosts=(
	    "server001",
	    "server002",
	    "server003",
	    "server004",
	    "server005",
	    "server006",
	    "server007",
	    "server014",
	    "server015",
	    "server016",
	    "server017",
	    "server018",
	    "server019",
	    "server020",
	    "server021"
	   );

    die "Usage: runonall 'command'\n" unless $ARGV[0];
    foreach(@hosts){
	print "$_.$domain: $ARGV[0]:\n";
	print `ssh $_.$domain $ARGV[0]`;
    }
---

This lets me run a command (or 10 piped ones) using my permissions on all the machines in my cluster. Here’s an example of the output:

$ runonall 'uptime'
server001.foo.mycompany.net: uptime:
17:04:01 up 53 days, 18:28,  0 users,  load average: 0.40, 0.39, 0.44
server002.foo.mycompany.net: uptime:
17:04:02 up 53 days, 18:04,  0 users,  load average: 0.13, 0.43, 0.42
server003.foo.mycompany.net: uptime:
17:04:09 up 80 days, 22:08,  0 users,  load average: 14.13, 14.20, 14.23
server004.foo.mycompany.net: uptime:
17:04:11 up 80 days, 21:56,  0 users,  load average: 0.15, 0.17, 0.11
server005.foo.mycompany.net: uptime:
17:04:12 up 80 days, 21:52,  1 user,  load average: 0.14, 0.13, 0.09
server006.foo.mycompany.net: uptime:
17:04:13 up 44 days, 19:04,  0 users,  load average: 0.11, 0.08, 0.01
server007.foo.mycompany.net: uptime:
17:04:15 up 44 days, 18:58,  0 users,  load average: 0.07, 0.02, 0.00
server014.foo.mycompany.net: uptime:
17:04:18 up 41 days, 13:01,  0 users,  load average: 0.13, 0.26, 0.28
server015.foo.mycompany.net: uptime:
17:04:22 up 62 days, 22:58,  0 users,  load average: 0.61, 0.39, 0.37
server016.foo.mycompany.net: uptime:
17:04:26 up 51 days, 12:29,  0 users,  load average: 0.18, 0.23, 0.25
server017.foo.mycompany.net: uptime:
17:04:29 up 51 days, 12:34,  0 users,  load average: 0.61, 0.50, 0.31
server018.foo.mycompany.net: uptime:
17:04:32 up 51 days, 13:13,  0 users,  load average: 0.34, 0.41, 0.36
server019.foo.mycompany.net: uptime:
17:04:36 up 62 days, 22:48,  0 users,  load average: 0.20, 0.32, 0.26
server020.foo.mycompany.net: uptime:
17:04:39 up 44 days, 19:57,  0 users,  load average: 0.01, 0.03, 0.01
server021.foo.mycompany.net: uptime:
17:04:41 up 31 days, 21:59,  0 users,  load average: 0.07, 0.08, 0.01

That lets me see, for example, that server003 needs attention because its load is much higher than it should be. SSH keys and passwordless sudo for selected commands make this even more powerful.
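
For readers who would rather stay in the shell, the same loop might be sketched as below. The shortened host list and the SSH_CMD variable are assumptions made for this example; SSH_CMD exists only so the loop can be previewed with echo before it opens any real connections.

```shell
# Hypothetical host list and domain, mirroring the Perl script above.
domain="foo.mycompany.net"
hosts="server001 server002 server003"

# SSH_CMD is an assumption added for this sketch: set it to "echo" to
# preview what would run without connecting to anything.
run_on_all() {
    for host in $hosts; do
        printf '%s.%s: %s:\n' "$host" "$domain" "$1"
        ${SSH_CMD:-ssh} "$host.$domain" "$1"
    done
}

# Dry run: prints each header line, then the target and command
# that ssh would have been given.
SSH_CMD=echo run_on_all 'uptime'
```

Leaving SSH_CMD unset makes the function fall back to ssh, at which point it behaves like the Perl version and relies on the same SSH keys.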

The find utility

Every time I read man find I learn something new. This tool is amazing in its simplicity and power. For example, find /home -name cur -mtime +31 shows me all of my users that have not checked their mail in the last month.

I use find routinely to perform filesystem maintenance, such as deleting all the messages in a user’s IMAP Trash folder that are more than a week old. It can also be handy to save the output to a file for later searching (e.g. find ./ > index.txt). I use that trick all the time when dealing with SquirrelMail, which hashes preferences and address book files using a nonstandard checksum four directories deep.

The -exec option for find can make the tool more potent, but be sure to run it without -exec first to make sure it’s hitting only what you want. For example, find ./ -type d finds all directories in ./, including ./ itself, which is rarely what you want to hand to a destructive command.
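
The preview-then-act habit can be sketched as follows. The scratch directory and file names are invented for the example, and the backdating step assumes GNU touch; a real Trash folder would simply take the place of the temporary directory.

```shell
# Scratch directory standing in for a Trash folder; names are made up.
trash=$(mktemp -d)
touch "$trash/new-message"
# GNU touch syntax: backdate a file so that it looks ten days old.
touch -d '10 days ago' "$trash/old-message"

# Step 1: preview. -mtime +7 matches files modified more than seven
# whole 24-hour periods ago; -type f keeps directories (including the
# top-level one) out of the results.
find "$trash" -type f -mtime +7 -print

# Step 2: once the preview lists only the intended files, act on them.
find "$trash" -type f -mtime +7 -exec rm -- {} +
ls "$trash"
```

Keeping the two find commands identical apart from the final action is the point: whatever the preview printed is exactly what the second command removes.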

Logging with syslog-ng

If you are managing multiple systems, a central log host is invaluable. Syslog-ng will handle the task for you, and if you use it for both the client and the server you can use TCP instead of UDP, and let the client buffer log messages if the server is temporarily unavailable.

Additionally, you can filter your logs using many more criteria than the antiquated standard syslog “priority and facility” system. This can be handy because you can filter out stuff that is extraneous, such as monitoring system traffic.

SystemImager

SystemImager is basically a specialized boot disk, with some advanced DHCP, rsync (optionally over SSH), and a liberal helping of Perl to glue it all together.

The end result is that you can have an install CD (or diskette, USB drive, or a special kernel on the hard drive that you LILO in to the boot sector before rebooting, or even a bootp kernel) that syncs a new box to a stock image you have created and tweaked.

This makes installing new boxes a snap. I can have a new blade for our blade servers ready for production within 10 minutes of removing it from the packaging. As an added bonus, you can clone the image and then chroot to it, using that chroot to patch or develop new stuff.

Simply run the patching mechanism for your distribution (after disabling things like service restarts and bootloader re-installation) on the image you have chrooted into, then install it on a test box. If it fails miserably, not to worry — just reinstall the stable image.

This tool enables you to roll out patches in record time with a nearly perfect rollback mechanism built in. Also, backing up only your image server effectively backs up every box it serves. It even rsyncs an “imaging-complete” file when it is done so you can monitor its progress in its logfile. If you manage a farm of identical machines, you need this tool.

The rsync utility

The rsync utility is more powerful than you might think. I use it to distribute configs to my cluster, make backups, and keep my home fileserver and laptop MP3 collection synchronized.

For config files, I have custom scripts that import them via rsync, munge them together or make local-specific changes, check their consistency and viability, then slide them into place and restart the daemon in question as needed. Combine rsync with SSH and SSH keys and you have an easy way to automate file backups.

Filelight

Filelight is an X-based tool, so naturally it won’t work without X. On boxes with X, however, it can be invaluable, as it provides pie charts showing where all your space is being used. It’s super handy on your home directory, and scanning / when you run it as root can be even more informative.

You can click on the pie slices to drill down into selected directories. I rarely use du on my laptop any more; I prefer to use Filelight when possible.

GNU Screen

“OK honey, I’ll be home as soon as this damn process finishes running. I know, but I started it this morning and it still isn’t done, and I can’t shut down my laptop until it’s finished.” Sound familiar? Well, not any more.

Screen lets you have multiple windows on a single terminal, and it lets you detach and reattach from each session, so you can run a long process before you leave the office and simply reattach to see the output when you get home.

Another nice side effect is that an unexpectedly terminated session gets detached, so your editor with an hour’s worth of typing will still be there after you kill the cat and plug your Wi-Fi access point back in.

Watch

The watch utility is a great tool: it reruns a command at a regular interval and redraws the output in place. Ever wish the df utility had an interface like top? Well, now it can. This can be very handy in a separate window (or screen terminal) when doing things like flushing SMTP queues or downloading big files you are afraid won’t fit.

You can even give it multiple commands, like uptime; qmailctl stat; tail -n 20 /var/log/messages (quoted so the shell passes them to watch as one argument), and adjust the frequency with which it runs the commands using the -n option.

Telnet

Telnet is invaluable to get raw output from your servers. I use it daily to check latency on SMTP servers and test users’ POP/IMAP accounts.

nmap

The nmap utility is a network scanner that makes all others seem feeble. It can do everything from OS fingerprinting (giving you a guess as to what OS a remote host is running) to port checking.

It can even scan a group of machines. I have uncovered more than one compromised colocation box by noticing an unexpected IRC or FTP server running on it. For abuse cases, it can be handy to see what is likely to be a problem: for example, SMB ports that are open to the world, something listening on port 25, or a running proxy server.

An additional valuable function is the -sP option (spelled -sn in newer nmap releases), which does a “ping sweep” scan, pinging a range of boxes to tell you which ones are up.

Lokkit

Lokkit is just a front end to iptables. I don’t use the interface beyond initial config — I simply edit the lokkit config to add, remove, or tweak firewalling rules, then restart lokkit to have it automatically change the chains in the kernel. I find this much easier than writing my own chains and putting them in by hand, and once I do it right in the config it just works on a reboot.

Monit

Monit can watch processes, files, or services, and take actions ranging from paging you to restarting the service, or simply logging the event to syslog. If you configure the SSL interface, which is complete with a built-in access control list (ACL) system, you can even restart processes from your browser and view the local health of the machine in a snappy Web interface.

Wget

I use wget to do all sorts of things — recursively copying pictures family members put online, in some rare cases importing information from remote servers, and most notably for grabbing single files. I can surf to a SourceForge site, get all the way to the download part, then cancel the download and copy the URL from the “if it doesn’t start automatically” link and feed it to Wget on a server.

That way I don’t have to surf to the site in lynx, and I don’t have to upload the file I just downloaded. Wget can handle HTTPS, cookies, and authentication. It obeys robots.txt by default (though you can disable that) and can spoof itself as any UserAgent you might wish, which is useful for sites that serve up different content based on your browser or operating system.

I use many other tools, but those are the ones I could not live without. I strongly advise people to at least poke around with these tools and see if they make life any easier. They certainly do for me.