The Fifth Commandment of system administration

Author: Brian Warshawsky

If you’re a good administrator, you pride yourself on developing a fundamental understanding of the systems you build. After a while, as you begin to comprehend the full complexity of building and maintaining your infrastructure, the commands and procedures that control it become second nature. You have to look at the documentation less and less, until eventually people refer to you as a guru. Having this kind of understanding of your servers is important, but it does no good if you aren’t available when something crashes. By writing detailed policies in advance that cover the ins and outs of your systems, you can give your backup admin the critical background information needed to restore functionality in your absence.

V. Thou shalt document complete and effective policies and procedures

In the past I have found documented policies especially useful at two different times. The first is at the inception of a project. Before the system goes into production, sometimes even before the hardware is bought, detail in writing exactly what you need the server to accomplish, where its performance bottlenecks will be, and how you intend to address them. This will allow you (and upper management!) to know that your time is not being spent chasing a fantasy implementation that will never work. It also helps you to better understand the nature of the beast you’re building. If anything goes wrong during the installation and configuration process (and something always does), you’ll be better prepared to deal with it simply because of the understanding you gained by mapping everything out beforehand. At this point you don’t need anything more than an outline (sometimes in the form of a project plan) and a few diagrams to guide you. If it’s a much larger-scale implementation, though, you’ll need a detailed project plan dividing the entire process into phases. For instance, a large-scale Beowulf cluster would require a detailed project plan, while a new intranet Web server might only require a brief outline of configuration tasks and a diagram showing how it’s integrated into the network.

The second time these policies are important is after the server has been configured and is ready to go into a production environment. Before it is rolled out, take some time to create detailed step-by-step documents explaining the backup restoration process, the steps necessary to restart a service (or, depending upon the experience of your backup admins, just a list of important services that might need to be restarted), and anything else that might be helpful. Just remember that you won’t always be available to fix something; if you are unavailable, having detailed instructions for common problems or routine exercises can make the difference between 10 minutes of downtime and a week and a half.
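
As a rough illustration of what such a step-by-step document might look like, here is a minimal Python sketch that keeps a procedure as structured data and prints it as a numbered checklist. The `Procedure` class, the `render` function, and the sample steps are all hypothetical and not taken from any particular tool; substitute the real steps for your own servers.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Procedure:
    """One documented procedure, for example restarting a service."""
    title: str
    when_to_use: str
    steps: List[str] = field(default_factory=list)


def render(procedure: Procedure) -> str:
    """Render a procedure as a plain-text, numbered checklist."""
    lines = [
        procedure.title,
        "=" * len(procedure.title),
        f"When to use: {procedure.when_to_use}",
        "",
    ]
    for number, step in enumerate(procedure.steps, start=1):
        lines.append(f"{number}. {step}")
    return "\n".join(lines)


# Hypothetical example content; replace with the real steps for your server.
restart_web = Procedure(
    title="Restart the intranet Web server",
    when_to_use="The intranet site stops responding",
    steps=[
        "Log in to the server with an administrative account.",
        "Check the service log for the cause of the failure.",
        "Restart the Web server service.",
        "Load the intranet home page to confirm it responds.",
    ],
)

if __name__ == "__main__":
    print(render(restart_web))
```

The output is deliberately plain text, on the assumption that a backup admin may need to read it from a printout or a terminal while the server itself is down.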

The commandments so far:
I. Thou shalt make regular and complete backups
II. Thou shalt establish absolute trust in thy servers
III. Thou shalt be the first to know when something goes down
IV. Thou shalt keep server logs on everything
V. Thou shalt document complete and effective policies and procedures