Perhaps this has happened to you: you're on the phone, your hands have been three feet from your keyboard for the past half hour, and your system is idle, when suddenly the computer reboots or the screen goes black and locks up. You wiggle the mouse a bit, thinking it's some kind of screen saver, but that doesn't seem to be the case. The first time this happens you brush it off; maybe it was a tiny power surge or a software bug. At times you can hear clicking from your hard drive, the same sounds it makes when you first turn it on. Your RAID controller or software may report that your drives are out of sync, or you may occasionally experience data loss. You may wonder why your drive's built-in failure detection technology hasn't kicked in to warn you that something's wrong.
If you think your computer is simply dying of old age and you'll need to spend hundreds or thousands of dollars to replace it, you're not alone, but the solution may be much simpler. These are all typical symptoms of a power supply that intermittently fails to deliver enough power to the system's components. Replacing this part costs anywhere from $50 to $200 on a desktop, far cheaper than replacing the entire machine.
Since x86 hardware is designed as inexpensive commodity hardware, your computer was expected to last no more than five years, and some components may need to be replaced even earlier, much like the original battery in your car won't last 100,000 miles. If you're running x86 hardware in a data center where you can't hear these symptoms, or may not catch the screen going black, a problem like this can be hard to track down. Fortunately, tools are available to detect the most common hardware failures. I'll review them in the order the components are most likely to fail and suggest some projects that can address these problems.
In almost 20 years of desktop support experience, I've seen more power supplies fail than any other part of the system. A power supply failure is often misdiagnosed as a hard drive, processor, memory, or video failure. The MTBF (Mean Time Between Failures) rating on a power supply tells you how long it's designed to last, and most OEMs build systems with the lowest MTBF they can get away with (it's cheap, and they pass the savings along to their buyers). The general rule of thumb is that if voltages drop more than approximately 10% below spec (e.g., down to 10.8V on what's supposed to be a 12V rail), the supply is starting to fail. Devices experience "brownout" conditions: a hard drive may not finish a write operation (losing data), a processor may skip an instruction and execute a program incorrectly, or a video card may reset its resolution, causing the monitor to go blank and not return.
The lm_sensors project will tell you about power supply voltage information, but the amount of data available varies depending on what chips are on the motherboard and the availability of open specifications to build a complete driver for the chip.
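As a sketch of how such a reading could be checked against the 10% rule of thumb above, the snippet below parses a `sensors`-style voltage line. The sample line, label, and values are made up for illustration; real lm_sensors output varies by motherboard chip.

```shell
# Sample line standing in for real `sensors` output (hypothetical values)
sample='+12V: +10.75 V  (min = +10.80 V, max = +13.20 V)'
# Strip the '+' sign and 'V' unit from the reading field
reading=$(echo "$sample" | awk '{gsub(/[+V]/, "", $2); print $2}')
# Flag the rail if it sits more than 10% below the nominal 12.0 V
low=$(awk -v r="$reading" 'BEGIN { print (r < 12.0 * 0.9) ? 1 : 0 }')
if [ "$low" -eq 1 ]; then
  echo "WARNING: 12V rail reads ${reading} V, below the 10.8 V floor"
fi
```

On a live system you would feed the output of `sensors` into the same kind of check instead of the hard-coded sample.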
Most modern hard drives include Self-Monitoring, Analysis, and Reporting Technology, or SMART, which can detect impending failure due to normal wear and tear. Not every failure can be predicted this way, but it does provide a helpful hint when the drive knows it's dying. That's the most automated way to detect a failure, but it's not the only method.
The smartmontools project will read SMART warnings and runs not only on Linux but also on Darwin (Mac OS X), FreeBSD, and even Windows. If you're familiar with smartsuite, you'll be happy to know this project was derived from that older project.
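For example, a quick health query with smartmontools' smartctl might be wrapped like this. The device name is a placeholder; smartctl reports trouble through its exit-status bits, with bit 3 (value 8) set when the drive's SMART status says it is failing. A sample status stands in here so the logic runs end to end.

```shell
# On a real system:  smartctl -q silent -H /dev/sda; status=$?
status=8   # sample value: bit 3 (value 8) = SMART says disk failing
if [ $(( status & 8 )) -ne 0 ]; then
  echo "SMART health check FAILED -- back up this drive now"
else
  echo "SMART health check passed"
fi
```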
What goes in must come out, so you can also stress test a hard drive to determine whether it's failing before all of the data disappears. This test is not automated and will affect performance, so it's best to take the machine out of service before running it. The badblocks utility in the e2fsprogs package is one such tool.
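A sketch of how that check might be run and its results used follows. The device name and file path are examples, and note that badblocks' `-w` write mode destroys data, so the read-only default is what's shown; a simulated result file stands in for a real scan so the follow-up accounting is runnable.

```shell
# On a real system:  badblocks -sv /dev/sdb > /tmp/badlist.txt
# badblocks prints one bad block number per line; simulate a result file
printf '1055872\n1055873\n' > /tmp/badlist.txt
count=$(grep -c . /tmp/badlist.txt)
if [ "$count" -gt 0 ]; then
  echo "$count bad blocks found -- schedule a replacement drive"
fi
```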
As most Linux gurus know, fsck (file system check) commands will check and repair most popular file systems.
WARNING: When hard drives start to fail, they fail fast, often because debris is being dragged across the platter (which spins at a very high speed). When you find your hard drive has reached that point, boot with another disk (e.g., a Linux LiveCD) to get in, retrieve your data, and get out ASAP. It's like rescuing someone from a burning building: the longer you're inside, the less successful your rescue attempt will be.
Processor and Chipsets
The leading cause of death for processors and chipsets is heat. Until I was employed by a chip company, I didn't realize the importance of the thermal tape or thermal paste/grease between the processor and its cooling solution (commonly a heatsink with a fan), nor how easily the heat-conductive properties of each are ruined. Most modern Intel and AMD processors will slow down and reduce their power consumption if they get too hot so as not to permanently damage the processor, and some motherboards will shut off the machine and flash an LED if they detect a heat problem. However, consistent low-grade "fevers" can still destroy a processor without triggering these mechanisms. The lm_sensors project was designed to catch this sort of problem, but all too often it relies on the BIOS to contain correct values for the temperature limits, and many OEMs and motherboard manufacturers don't bother to populate those fields properly. If the upper limit is below absolute zero or above the melting point of steel, you can have lm_sensors override that setting and alert you only when the temperature rises above a limit you specify (check the processor manufacturer's web site for each model's expected thermal operating limits).
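An override like that lives in the sensors configuration file. The fragment below is a hypothetical example: the chip name, feature labels, and limits vary by motherboard, so run `sensors -u` to find yours and check the processor's datasheet for sensible values.

```
# Hypothetical /etc/sensors3.conf fragment; chip and label names vary
chip "w83627hf-*"
    set temp2_max 70        # alarm above 70 degrees C
    set temp2_max_hyst 65   # clear the alarm once it cools below 65
```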
Like a hard drive, what goes into memory must come out. If you store "Mary had a little lamb" and read back "Little Red Riding Hood" from the same memory address, your memory chips are not doing their job. Unfortunately, memory problems are rarely that obvious; more often they show up as a single intermittently incorrect bit.
The memtester application will stress test your memory chips and report problems. Figuring out which DIMM is bad can be challenging, especially since many motherboards interleave memory addresses across chips to increase performance.
Another popular method uses dd and md5sum, but it's not as thorough, so utilities like memtester are preferred.
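A minimal wrapper shows how memtester's exit status can drive an alert. The size, loop count, and log path are examples, and a real run needs root so the memory region can be locked; a sample status stands in here so the wrapper logic is runnable as-is.

```shell
# On a real system:
#   memtester 256M 1 > /var/log/memtester.log 2>&1; status=$?
# memtester exits nonzero if any pattern test failed
status=0   # sample value standing in for a real run
if [ "$status" -ne 0 ]; then
  echo "memtester reported RAM errors -- check the log"
else
  echo "memtester passed"
fi
```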
Your Linux system already uses crond to schedule maintenance tasks such as log cleanup. For system health checks that temporarily impact performance, use your favorite crond setup utility (many administrator control panels include a system maintenance task scheduler that may not be called crond on the outside, but is still crond inside) to schedule these checks to run when your server will not be in use. Weekly is best for things like hard drive and memory tests. As a tip, many IT administrators have their servers send SMS messages to their cell phones when something is terribly wrong.
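As an illustration, crontab entries for those weekly checks might look like this. The paths, times, and device names are placeholders; pick hours when the machine is idle.

```
# m h dom mon dow  command
0 3 * * 0  /sbin/badblocks -s /dev/sda  > /var/log/badblocks.log  2>&1
0 4 * * 0  /usr/bin/memtester 256M 1    > /var/log/memtester.log  2>&1
```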
Other tools, such as smartmontools' smartd and lm_sensors, can and should run constantly so that problems are caught promptly; their performance impact is minimal.
Like a car, your computer will eventually reach a point where one thing after another goes wrong, and at that point it's time to replace it. There are two ways to handle an aging car: hold it together with silly putty and band-aids until you've been stranded roadside a few times, then buy a new one and pay it off over the next five years; or budget for a new one when you expect it to fail and replace it before you're stranded. Computers are the same way. If you don't want data loss and downtime, budget with that in mind, whether for your personal computer or your IT data center.
OEM power supplies last about three years, hard drives four to six, and everything else is rated for about five.
Backups and Failover
Many tools are available to automatically back up hard drives and relocate a process as soon as a failure occurs. Since there are many types of failover solutions, I won't go into much detail, though I will list a few worth looking into:
- rsync is popular for synchronizing storage, though most IT administrators prefer a full commercial SAN solution to protect their data
- The linux-ha project provides high-availability cluster tools to keep mission-critical resources online by managing a failover machine
- For virtualized servers, tools such as VMware's VMotion and Xen's live migration feature can shift the entire stack, OS and all, to another machine extremely quickly
If you implement all of these suggestions, you might never again have unscheduled downtime. You can take credit for this during your next performance review; I won't mind. ;-)