November 12, 2004

The myth of stability

Author: Jem Matzan

Many computer cognoscenti scoff at new software and hardware because it's "not stable yet" or because "the bugs have not been worked out of it." They often use this as an excuse not to buy computer upgrades or install software updates. But what does stability mean in the IT industry, and does waiting to buy or upgrade really ensure a greater degree of reliability?

Obviously products that are listed as being in beta or in any stage of debug testing can't really be trusted in production environments without your own testing procedures in place. But for most other products that have seen an official release, waiting weeks or months to buy or use them will not give you any advantage over being the first kid on your block to have them. In fact, you might end up worse off if you wait too long.

Two kinds of stability

Hardware enthusiasts generally use "stability" in the sense that a system will remain crash- and error-free throughout heavy system usage. In other words, the computer is stable if it operates within acceptable and predictable parameters, especially under load. Usually this is in reference to overclocking and high-performance computing, where a system is modified, hacked, or overclocked and is pronounced "stable" only after having passed standardized stress tests. Overclockers will often label a motherboard or other component "unstable" if it can't be overclocked. This does not mean that the board is unreliable; it only means that it can't be abused beyond its tested and advertised parameters.

A related sense of the word appears often on technology-related message forums, where the standard advice to would-be upgraders is to avoid buying new hardware because "the bugs have not been worked out of it yet."

Are there truly bugs to be worked out? The last major "bug" in motherboard technology was in Intel's earlier Rambus-based boards, and before that the first-generation Pentium processor had a math flaw that produced wrong results in certain types of complex calculations. The faulty motherboard technology resulted in a delay in manufacturing; the Pentium bug was deemed a small matter because it did not appear to affect everyday desktop computing, and Intel corrected it in the next generation of CPUs. Since that time there have been no public acknowledgments of major hardware design flaws, which probably says more about how hardware companies deal publicly with pre-release technical problems than anything else. There has, however, been a consistent trend of complex incompatibilities with certain types or versions of other unrelated components. For instance, the ATI Radeon 9700 Pro had a lot of trouble doing extensive 3D rendering when coupled with motherboards that used Intel's E7205 chipset, and it went through several hardware revisions until the problem was corrected. The E7205-based motherboards were delayed for months, released late, grossly overpriced and overrated, and few people bought them -- so the problem with the ATI video cards affected only a small portion of the market. Was it Intel's fault, or ATI's? The answer is irrelevant, as Nvidia-based graphics cards worked perfectly with the same motherboards. One way or the other, it was ATI's problem, even if the flaw was in Intel's product.

As far as motherboards are concerned, it's practically a guarantee that over the course of the product's life, the manufacturer will come out with several BIOS revisions that you can download and install if necessary. You can say the same about nearly any electronic device that has a flash EEPROM controlling the hardware. The trouble is, just because an update exists doesn't mean that it will be (or should be) applied. You could wait several months to buy a product and still receive the original BIOS or programming that shipped with the first-generation units. Such updates rarely fix major problems; the showstopper bugs are discovered and corrected during initial pre-release testing, and beyond those the BIOS is generally unable to cause that level of havoc. BIOS updates usually cover specific problems encountered intermittently with certain versions of unrelated hardware, as was the case with the ATI video cards and Intel-based motherboards mentioned above. Later BIOS revisions usually serve no purpose other than to add support for newly released hardware.

The other kind of stability in the IT world has to do with the integrity of software code. While in development, a program can be stable in the sense that it never crashes or errors out, but unstable in that major changes are frequently being made. While this implies that the code may cause crashes or errors due to a lack of extensive testing, it is not necessarily an indicator of unreliable performance.

Old != stable

The perception among pundits and dime-store critics is that the only true indication of stability is time. Apparently the idea is that the longer a program, device, or component has been around, the more stable it is, assuming no problems arise. The trouble with this philosophy is that hardware and software infrastructure is constantly changing. If you have a piece of equipment or software that is unanimously declared "stable," the whole situation changes when new products are released -- you don't know if the supposedly stable product will maintain its vaunted stability when forced to deal with new variables such as a different video card, operating system, or even a different userland program. There can be flaws in a product that aren't apparent or have been worked around by other vendors and don't show up until something new and incompatible hits the market. It goes deeper than that: the new product that causes problems with older, "stable" products can be designed and manufactured perfectly according to published standards and specifications and still cause problems. One obvious example that springs to mind is Microsoft Internet Explorer's famous incompatibility with some important elements of the W3C's cascading style sheet standards. Web designers can design a site that is perfectly, by-the-book, according-to-Hoyle W3C compliant, yet Internet Explorer will not display it correctly. Whose fault is it? It doesn't matter -- the end result is that the Web designer has to correct the problem by working around the bugs in Internet Explorer.

The problems that older products cause with newer ones are best shown with operating systems. Debian's "stable" release, Woody, uses an older Linux kernel that doesn't work with a rapidly growing list of hardware components that were brought to market after the 2.4 kernel was released, such as serial ATA hard drives. Even commercial distributions such as Sun's Java Desktop System have trouble with commonly used hardware because of the age of the kernel. Sun told me in an interview for the release of Java Desktop System 2 that they chose to stay with the old 2.4.19 kernel because they had worked with it for a while and felt that it was stable. If it's so stable, then why did it crash during a keynote speech given by Sun's CEO last spring?

The Debian project seems to be the epitome of the misguided belief that old is equivalent to stable. Packages in the Debian "stable" or even the "testing" and "unstable" branches can be months or years old. At the other end of the spectrum, Novell releases a new edition of SUSE Linux Professional for three architectures approximately twice per year, and each includes more than a thousand of the latest desktop software applications. SUSE is a distribution made for production environments and home desktops alike, and in terms of reliability, SUSE is as stable as they come.

Edit, compile, test, debug

How much testing and time do we really need to declare a piece of software stable? If SUSE can release a totally free (as in rights) software distribution twice per year (and some community distributions release twice as often as that) and remain reliable, other companies and projects can do it too. Certainly there is some degree of testing that has to go into software before it can be released, but how much is too much?

It is functionally impossible to test every possible variable in the broad computer equation. What works well on one system may crash and burn on another, or with a slightly different revision of motherboard, or with a different type of RAM, or with an earlier or later kernel version. The best that developers can do is release the software to the community for beta testing, make it easy to report and track bugs, and fix all of the showstoppers before the official release. It's not possible to track down every bug because, as we've already established, hardware and software are perpetually moving targets.
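To put rough numbers on that claim, here is a minimal sketch of how quickly a test matrix grows. The component names and variant counts are made up for illustration and are not measured data; the only point is that the total is the product of the individual counts.

```python
from math import prod

# Hypothetical number of variants per component (illustrative only).
variants = {
    "motherboard_revision": 40,
    "ram_type": 12,
    "video_card": 30,
    "kernel_version": 25,
    "driver_version": 15,
}


def total_configurations(options):
    """Number of distinct combinations if every component varies independently."""
    return prod(options.values())


def days_to_test_all(options, minutes_per_run=10, machines=100):
    """Wall-clock days for one test pass over every combination.

    minutes_per_run and machines are assumed values, not real lab figures.
    """
    total_minutes = total_configurations(options) * minutes_per_run
    return total_minutes / machines / (60 * 24)


if __name__ == "__main__":
    print(total_configurations(variants))      # 5400000 combinations
    print(round(days_to_test_all(variants)))   # 375 days on 100 machines
```

Even with these modest assumed counts, a single ten-minute pass over every combination would keep a hundred test machines busy for about a year, by which time new hardware and kernels would have been released.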

Keeping the enterprise out of spacedock

In an enterprise setting, stability in the reliability sense is of utmost importance. Businesses can lose hundreds of thousands of dollars per hour of computer downtime, so messing with products that can't guarantee maximum reliability is not an option.

High-priced proprietary workstations from companies like Sun Microsystems, IBM, SGI, Apple, and others are not of any higher quality than x86 or AMD64 systems you can build yourself from the ground up; actually, in most instances, brand-name components in a custom-built system will be of higher quality at lower prices. The big hardware vendors produce the cheapest systems they can without completely sacrificing reliability. This means cheaper hard drives and optical drives, and CPUs and motherboards that are made in the same or similar fabrication facilities as "aftermarket" components from companies like Intel and ASUS. Open up a Sun Blade 1500, for instance, and you'll find a Seagate ATA100 consumer-grade hard drive. While this is not a bad hard drive, it does not offer the same level of reliability or performance as, for instance, Seagate's 15,000rpm SCSI drives, which are designed for heavy-duty operation. You'll find a Lite-On optical drive, which is decent for a home desktop system (where, if it fails, you can afford to be without it for a few days until you buy a replacement) but again does not provide the enterprise-grade reliability that you would expect from a system that costs $8,000 or more.

Enterprise-grade servers are usually a little better, but not always. Instead of upgrading to more reliable components, they usually employ some form of hardware redundancy, such as hot-swappable power supplies and RAID arrays. With all proprietary workstations and servers, the way a vendor covers for suboptimal reliability is with expensive around-the-clock service contracts and quick turnaround times -- so if it breaks, they will fix it quickly. The rationale seems to be that it's easier to sell cheaper systems and more expensive service contracts than it is to build and sell higher priced yet more reliable equipment with little or no support contract.
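The arithmetic behind that rationale can be sketched roughly as follows. Every figure here is hypothetical: the 5% failure chance, the four-hour repair time, and the $200,000 hourly cost are placeholders, and the model assumes failures are independent and that a failed unit is not replaced before its twin also fails.

```python
def all_fail_probability(p_single, units):
    """Chance that all `units` redundant components fail in the same period,
    assuming independent failures (a simplifying assumption)."""
    return p_single ** units


def expected_downtime_cost(p_outage, hours_to_repair, cost_per_hour):
    """Expected loss for the period: outage probability x outage length x hourly cost."""
    return p_outage * hours_to_repair * cost_per_hour


if __name__ == "__main__":
    p_cheap = 0.05            # hypothetical yearly failure chance of one cheap unit
    repair_hours = 4          # hypothetical turnaround under a service contract
    hourly_cost = 200_000     # hypothetical cost of an hour of downtime

    single = expected_downtime_cost(p_cheap, repair_hours, hourly_cost)
    paired = expected_downtime_cost(
        all_fail_probability(p_cheap, 2), repair_hours, hourly_cost
    )
    print(f"one unit:  ${single:,.0f} expected loss per year")
    print(f"two units: ${paired:,.0f} expected loss per year")
```

Under these assumptions a second cheap unit cuts the expected loss by a factor of twenty, and a faster repair turnaround scales it down further, which is consistent with vendors leaning on redundancy and service contracts rather than on pricier components.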

The software situation in an enterprise setting is generally a lot more complex. Instead of using a self-proclaimed "stable" operating system, a large corporation will choose a platform that has been through certification testing by organizations such as The Open Group (successor to X/Open) or vendors such as Sun Microsystems. These certifiers attest that an operating system or other software will reliably perform within certain parameters under certain conditions and on certain hardware. As long as you stick to the certified criteria, you're reasonably assured that your software will remain reliable.

Conditions do not generally change in an enterprise setting; what the company computers do today, they will do tomorrow and the next day until the company implements a major company-wide restructuring of the production environment. New software will not be installed, and new procedures are not usually implemented, without a substantial testing procedure beforehand. With that in mind, there is no reason to change or upgrade enterprise hardware or software unless it is unable to scale up to meet heavier loads over time. Thus, in addition to its certified reliability, the software is also stable in the sense that it remains unchanged and consistent.

This article is officially stable

The one glaring exception to this article's reasoning is the game industry. For more than 10 years, quality assurance testing of computer games has steadily declined. It's not uncommon to buy a new computer game off the shelf only to watch it crash back to the desktop (or crash the whole system) every few minutes, or fail to work as intended in one way or another. Patches are always released weeks or months later, but if you wait six months to buy the game, you're still generally buying the original release and you'll have to download and install the updates yourself. So the choice is between buying a partially functional game, raising hell on the company's tech support message forums, and waiting for the patch, or waiting several weeks or months for the patches to come out, then buying the game and installing the patches. Either way you can't really play it until the patches are released. What harm is there in taking a chance that the game will work well enough right out of the box?

The only resolution to the "stable" dilemma is for software and hardware manufacturers to do their best to release bug-free products, and for consumers to remain vigilant in applying updates to the software as they become available -- and they will always become available. No computer product of any significant complexity has ever been or ever will be developed without some kind of problem in some facet of its operation. What is stable today may be unstable tomorrow, or unstable with your hardware, or unstable with your operating system or software. There is no reason to believe that waiting a while will make the software or hardware that you need more reliable.

You can run away, scared and screaming, from new software and hardware, or extol the virtues of Debian's ancient software repository over Gentoo's more up-to-date Portage, but you're not going to be any better off than those who live on the razor's edge. You'll just have fewer bug fixes, security updates, and features -- but hey, you'll be stable.

Jem Matzan is the author of three books, a freelance journalist and the editor-in-chief of The Jem Report.
