Monitoring your machines should be your number one concern. With Nagios, I'm often alerted to problems and able to fix most of them before the end user even realizes anything is wrong.
Nagios is a monitoring system that lets us know when things go wrong. It's set up to send email to our pagers when mail queues get too large, system resources get too low, services die, or machines go down. Nagios lets us customize the alert levels so failures on critical machines or services wake us up at night, and failures on less critical machines or services wait until morning to page us.
Configuring Nagios can be quite a chore if you have several machines. We ended up writing our own Python configuration tool to do the bulk of the work for us.
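The author's tool was written in Python; purely as an illustration of the approach, here is a minimal shell sketch that turns a plain "name address" host list into Nagios host definitions. The hosts.txt format and the generic-host template name are assumptions, not details of the actual tool:

```shell
#!/bin/sh
# Sketch only: generate Nagios host definitions from a "name address" list.
# The input format and the generic-host template name are assumptions.
gen_hosts() {
    while read -r name address; do
        cat <<EOF
define host {
    use        generic-host
    host_name  $name
    alias      $name
    address    $address
}
EOF
    done
}

# Typical use (paths are examples):
#   gen_hosts < hosts.txt > /etc/nagios/conf.d/hosts.cfg
```

The point is simply that the host inventory lives in one flat file, and the verbose Nagios syntax is generated rather than maintained by hand.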
coWiki
There's nothing special about our wiki; we store long commands, bug fixes, code snippets, research, and documentation for anything complex we've configured. We ended up with coWiki, a wiki that provides an easy way to secure access.
We needed a wiki that we could tie into LDAP so we could use the existing Active Directory logins on the network. This way, any staff member can log in and contribute, but we still have access control to keep different sections private.
Cacti
Cacti is a Web application similar to the Multi Router Traffic Grapher (MRTG) for graphing and statistical reporting. Cacti looks great, which is a big bonus when you're presenting the graphs it creates to your boss, and it does its job very well.
We use Cacti for keeping historical data on various things such as system load, disk space, network traffic, and mail throughput. By default it grabs its information from SNMP, but you can get it to graph just about anything with a little scripting. The tool isn't limited to Linux either; we have it graphing the Windows servers too, for everything from Internet Information Services (IIS) and Serv-U FTP connections to Active Directory login attempts.
Historical data is incredibly useful when you want to justify hardware upgrades or want to see what impact a configuration change has made to the load on the machine. As with coWiki, this tool ties into Active Directory (LDAP) easily.
GNU RCS
When you're working on a problem with someone else, you aren't always sure what configuration changes the other person has made to try to fix the problem. That's where GNU Revision Control System (RCS) comes in. RCS allows you to manage individual files (great for things in your /etc/ directory), view previous versions, see recent changes, and so on. I use it on my home box as well, so I have an easy way to roll back files when I'm playing with things.
Typical usage for RCS goes something like this:
rcsdiff filename checks for differences between the current file and the last time it was checked out.
co -l filename checks out and locks the file for editing.
After you're finished editing, unlock the file and check it back in using ci -u filename.
Apt-cacher
The majority of the machines we put into production are running Debian Sarge. Apt-cacher allows you to have one machine in your network that acts as a proxy for apt-get.
Since we often grab packages such as Apache and MySQL over and over, apt-cacher saves us bandwidth and time -- and not just our bandwidth, but also the bandwidth of Debian mirrors. Be a good netizen and use Apt-cacher if you're distributing updates and packages to a large number of Debian machines.
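On the client side this amounts to one change in each machine's apt configuration. As a sketch, assuming the cache runs behind Apache on a host called apt-proxy (the hostname, URL layout, and mirror name are all examples; the exact path depends on how your apt-cacher is installed):

```
# /etc/apt/sources.list on each client -- hostname, path, and mirror are examples
deb http://apt-proxy/apt-cacher/ftp.debian.org/debian sarge main
deb-src http://apt-proxy/apt-cacher/ftp.debian.org/debian sarge main
```

Every client then pulls through the cache, and each package is fetched from the mirror only once.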
SSL Expire
We have tons of SSL certificates for various sites and servers we run. I can give SSL Expire a list of IP addresses and port numbers, and it will check them for expired or expiring SSL certificates, emailing us if it finds any that are going to expire in the next month or so. We set this up to run daily from a cron job.
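A rough equivalent of that check can be sketched with plain openssl; this is not the SSL Expire tool itself, and the ~30-day warning window, function names, and host-list file are assumptions:

```shell
#!/bin/sh
# Sketch of an SSL expiry check using openssl (not the SSL Expire tool).
WARN_SECONDS=$((30 * 24 * 3600))   # warn when a cert expires within ~30 days

# cert_ok: succeeds if the PEM certificate on stdin is valid beyond the window.
cert_ok() {
    openssl x509 -noout -checkend "$WARN_SECONDS" >/dev/null
}

# check_host: fetch a server's certificate and report on it.
check_host() {
    host=$1; port=$2
    if echo | openssl s_client -connect "$host:$port" 2>/dev/null | cert_ok; then
        echo "OK $host:$port"
    else
        echo "EXPIRING $host:$port"
    fi
}

# From cron, loop over a "host port" list (filename is an example):
#   while read -r host port; do check_host "$host" "$port"; done < ssl-hosts.txt
```

The heavy lifting is openssl's -checkend option, which exits non-zero when the certificate's notAfter date falls inside the given window.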
The blq Realtime Blackhole List (RBL) checker
Since I work in a Web hosting environment, we can't afford to have our mail servers blacklisted. Blq is a simple Perl script that takes a list of realtime blackhole lists and checks a list of IPs against them.
To use blq, just supply the names of the RBLs and the IPs you'd like to check: blq rbl ipaddress. You'll get either an "OK" or a "BLOCKED" message, depending on whether the IP is on the blacklist. Run blq without any arguments to see the list of RBLs it supports; you can also add RBLs to the script if your favorite RBL isn't represented.
We have a cron job that runs regularly to check our mail servers against the most common RBLs on the Internet. If any of our servers is found on a blacklist, it sends us a notification via email.
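That cron job can be sketched as a thin wrapper around blq. The RBL zone names, server IPs, and alert address below are examples, not our real list, and the OK/BLOCKED output format is as described above:

```shell
#!/bin/sh
# Sketch of a cron-driven RBL check built around blq; zones, IPs, and the
# alert address are examples.
RBLS="sbl-xbl.spamhaus.org bl.spamcop.net"
IPS="192.0.2.10 192.0.2.11"

# Run every mail server IP through every list; keep only BLOCKED results.
check_rbls() {
    for rbl in $RBLS; do
        for ip in $IPS; do
            blq "$rbl" "$ip"
        done
    done | grep BLOCKED
}

# From cron, mail the admins only when something is actually listed:
#   hits=$(check_rbls) && echo "$hits" | mail -s "RBL alert" admins@example.com
```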
Rsync, SSH, and a good boot CD
We often have situations where the only differences between two machines are the hostname and IP address. It's pointless to go through all the building, patching, and tweaking needed to get each box set up from scratch. Instead we boot with a good boot CD (Debian From Scratch works well because it supports pretty much everything we use), create the partitions on the new box, mount them, and rsync the source machine over. After that, it's a matter of doing a quick grep through /etc/ and changing all instances of the old hostname to the new one. If you're putting machines into a load-balanced network and want them to be proper clones, this is the easiest method I've found.
The commands below assume that your machine is partitioned with separate /, /usr, /var, and /var/www partitions, and that you have already created these partitions on the target machine -- the one you run this from. You need to sync each partition individually to create the directory structure; I usually then run them all again in a loop to ensure that nothing was missed.
mount /dev/md0 /mnt/temp/
rsync -e ssh -axvPSH --numeric-ids --delete root@source_machine:/ /mnt/temp/
mount /dev/md1 /mnt/temp/usr
rsync -e ssh -axvPSH --numeric-ids --delete root@source_machine:/usr/ /mnt/temp/usr/
mount /dev/md2 /mnt/temp/var
rsync -e ssh -axvPSH --numeric-ids --delete root@source_machine:/var/ /mnt/temp/var/
mount /dev/md3 /mnt/temp/var/www/
rsync -e ssh -axvPSH --numeric-ids --delete root@source_machine:/var/www/ /mnt/temp/var/www/

for dir in / /usr/ /var/ /var/www/
do
    rsync -e ssh -axvPSH --numeric-ids --delete root@source_machine:$dir /mnt/temp$dir
done
This isn't the only way to replicate machines for production use, but it works for me.
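The hostname swap mentioned above can also be scripted. A minimal sketch, assuming the clone's /etc is still mounted under /mnt/temp; the function name and paths are illustrative, and it's worth eyeballing the grep output before letting sed loose on /etc:

```shell
#!/bin/sh
# Sketch: replace the old hostname everywhere under the cloned /etc.
# Review "grep -rl" output first -- a blind sed across /etc deserves a look.
rename_host() {
    old=$1; new=$2; dir=$3
    grep -rl "$old" "$dir" | while read -r f; do
        sed -i "s/$old/$new/g" "$f"
    done
}

# e.g. rename_host oldbox newbox /mnt/temp/etc
```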
Winbind
As mentioned previously, we use Active Directory for authentication. We used to be a Windows-only shop, but our shop has become increasingly heterogeneous. While the senior admin and I both have local accounts, we do have some machines -- such as development boxes or internal-use machines -- that require other people in the company to log in via SSH. Rather than giving them another login to manage, winbind allows us to give them access to the box via Active Directory. Winbind understands user groups in AD, so you can still configure SSH to allow access only to certain groups. This has the added advantage of letting users have the same username and password for SSH as they use to log in to Windows.
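The SSH side of that group restriction is a one-line sketch in sshd's configuration. The group name here is an example; how the AD group actually appears (for instance with a DOMAIN+ prefix) depends on your winbind separator and default-domain settings:

```
# /etc/ssh/sshd_config -- only members of this AD group may log in via SSH
# (group name is an example; check `getent group` for the exact winbind form)
AllowGroups developers
```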
vi
When you're sitting in a datacenter at 3 a.m. and trying to get a machine up and running, you shouldn't be wasting time trying to get Nano, Pico, or Emacs installed before working on the problem. It doesn't matter whether it's your editor of choice -- vi is a powerful program you should master.
I resisted learning vi at first, but a senior admin basically said, "You don't get root access until you can use vi as well as you can use Nano" -- Nano being my old editor of choice. It's a tradition I will pass along to any future junior admins I'm looking after.
The last thing I'll say is that if you've been playing with Linux in school or at home and are looking to make the jump to professional sysadmin, find a job where you have a great mentor. While I had a decent base of knowledge and am confident in my ability to figure things out, I was amazed by how much I didn't know when critical machines went down. Having a good mentor helps you get the knowledge you need to fix things, but more importantly shows you the best way to avoid having such problems in the first place. A huge thanks to my mentor, who is still teaching me things every day -- thanks, Ian!
Kevin Millman is a system administrator in Toronto, Canada.