January 27, 2006

Sun's ZFS builds on promise of RAID

Author: Stephen Feller

ZFS is the filesystem Sun Microsystems began shipping in November with its operating systems to provide data management and protection from the loss of data due to file corruption.

After five years of development and testing, Sun says ZFS and the new data replication model it includes, RAID-Z, can eliminate the need for additional expensive hardware. RAID-Z is a new take on the RAID systems that many system administrators employ to promote data integrity. Sun believes the ZFS filesystem with RAID-Z offers "virtually unlimited capacity, provable data integrity, and near-zero administration."

ZFS provides transactional semantics -- data changes and updates are committed all at once, rather than written to disk piecemeal as a file is altered -- and end-to-end data integrity, according to Sun developer Jeff Bonwick. It also offers live data scrubbing, instantaneous snapshots and clones, and built-in compression, according to the ZFS community Web site. Bonwick said that ZFS's simplified administration interface has surprised people with its ease of use.

The main selling point Bonwick and Sun offer for ZFS is the elimination of the RAID write hole -- the potential to lose data when backing it up on one of the devices in an array. He said the software's end-to-end data integrity and better checksum method, which stores the data verification information, or metadata, apart from the data itself, allow for both protection against the write hole and better protection of data. Normally, metadata is stored in the same write stripe as the data itself.

"[ZFS and RAID-Z] putting the data replication knowledge in the filesystem rather than inside an array product yields data integrity you cannot get from an array product," Bonwick said. "Fundamentally, if the data is checksummed inside the array, then anything that happens to that data in transit will not be detected by the checksum."

Generally, Bonwick said, the checksum file is stored with the data block it is meant to validate, which he said can compromise the effectiveness of validation. This can prevent users from knowing whether the data or checksum is incorrect when they get an error message, if they get one at all.

The ZFS file system is patterned as a Merkle tree, the general structure of which allows each level of data block to validate the things contained below it. According to Bonwick, this allows users to know whether the data or checksum returned by the system is incorrect in the event they disagree, because "you have physical fault isolation" between the two.
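The idea of keeping each checksum in the parent rather than next to the data can be illustrated with a minimal Python sketch. It is not ZFS code: the `BlockPointer` class, the dictionary standing in for a disk, and the use of SHA-256 are all assumptions made for illustration.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Hash a block; SHA-256 is an arbitrary choice for this sketch."""
    return hashlib.sha256(data).hexdigest()


class BlockPointer:
    """Hypothetical parent-side pointer: a child's address plus the child's checksum."""

    def __init__(self, address: int, expected: str):
        self.address = address
        self.expected = expected


def write_block(disk: dict, address: int, data: bytes) -> BlockPointer:
    """Store the data and hand the checksum back to the caller (the parent)."""
    disk[address] = data
    return BlockPointer(address, checksum(data))


def read_block(disk: dict, ptr: BlockPointer) -> bytes:
    """Verify the block against the checksum held in the parent's pointer."""
    data = disk[ptr.address]
    if checksum(data) != ptr.expected:
        raise IOError(f"checksum mismatch at block {ptr.address}")
    return data


disk = {}
leaf_a = write_block(disk, 1, b"hello")
leaf_b = write_block(disk, 2, b"world")
# The parent block's payload is its children's checksums; the parent's own
# checksum lives one level up, in the root pointer.
root = write_block(disk, 0, (leaf_a.expected + leaf_b.expected).encode())

disk[2] = b"w0rld"  # simulate silent corruption of one leaf
print(read_block(disk, leaf_a))  # the intact leaf still verifies
try:
    read_block(disk, leaf_b)
except IOError as err:
    print("detected:", err)
```

Because the expected checksum is stored in a different block from the data it covers, a block that is damaged, stale, or written to the wrong place cannot carry a matching checksum along with it.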

The checksum system is one of the things even RAID-Z detractors such as Jeff Darcy, the chief software architect at the company Revivio, can appreciate. Darcy has had a hand in the development of shared-storage and distributed filesystems at several companies, and has written extensively at his blog, Canned Platypus, on his view of ZFS and RAID-Z. He said that while RAID-Z is not a significant improvement over other RAID systems, if it is one at all, ZFS as a whole does present improvements over RAID.

Darcy said the checksum system has the potential to protect users against a variety of problems, such as a bad driver, a bug in the filesystem itself, or the possible degradation of files over time because of bit rot. And while he said the checksum functions of ZFS can save you in a variety of scenarios where simple RAID would not, it still does not constitute any kind of RAID, because it eliminates the necessity of a physical disk array, the basis and focus of a RAID system.

"The checksumming provides a really robust form of protection," Darcy said. "Even if you're on the highest-end storage in the world, that actually provides some protection. But that's all above RAID-Z. That's all in another part of ZFS entirely."

According to Darcy, Sun has not actually fixed the write hole problem, but rather has worked its way around it with some of the functions contained in ZFS. He said a true RAID level is at the lower end of a system, and does not need to be aware of the software operating above it. All the system needs to know is which devices are actually part of the array.

In a November 25 post on his blog, Darcy said the main benefit of RAID-Z over RAID-5, the level that Sun has compared its version to, is that it uses less memory to store data, but he does believe it improves data integrity. The post was part of an ongoing dialogue between Darcy and the engineers at Sun who developed ZFS, including Bonwick.

Bonwick said he has been using his blog as a tutorial for explaining the operation of ZFS and RAID-Z because the differences between a traditional RAID system and the one Sun has put out are "subtle." RAID-Z, he said, makes sense only within a transactional model of a filesystem, such as the one implemented with ZFS.

Because RAID-Z does only full-stripe writes to disk as part of the transactional filesystem model, the entire stripe is overwritten and the existing stripe does not need to be read and updated, Bonwick said. That read-and-update step can both slow the performance of the system and fail to correct any error that already existed in the data. Also, unlike other RAID levels, which write all files in the same size stripe regardless of the file size, RAID-Z allows for variable stripe width. Without correct metadata to tell the filesystem how to reconstruct a file if it is lost, as well as where in the system it is supposed to go, the filesystem would not be able to serve as a backup.
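The difference between updating a stripe in place and writing a whole stripe can be shown with a small Python model of XOR parity. This is a simplified sketch of RAID-5-style parity, not RAID-Z's actual on-disk layout; the function names and the `crash_before_parity` flag are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def full_stripe_write(data_blocks):
    """Compute parity fresh from the new data; nothing old is read or patched."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def partial_update(stripe, index, new_data, crash_before_parity=False):
    """Read-modify-write: change one data block in place, then the parity (last block)."""
    stripe = list(stripe)
    stripe[index] = new_data
    if crash_before_parity:
        return stripe  # power lost here: data is new, parity is stale
    stripe[-1] = xor_blocks(stripe[:-1])
    return stripe


def rebuild(stripe, lost):
    """Reconstruct the block at position `lost` from every other block in the stripe."""
    return xor_blocks([blk for i, blk in enumerate(stripe) if i != lost])


old = full_stripe_write([b"AAAA", b"BBBB", b"CCCC"])

# An interrupted in-place update leaves data and parity out of step, so a
# later rebuild of an untouched block silently returns the wrong bytes.
torn = partial_update(old, 0, b"ZZZZ", crash_before_parity=True)
print(rebuild(torn, 1) == b"BBBB")  # False: this is the write hole

# A full-stripe write never depends on stale parity.
new = full_stripe_write([b"ZZZZ", b"BBBB", b"CCCC"])
print(rebuild(new, 1) == b"BBBB")   # True
```

In this model the hole appears only when data and parity are updated separately; computing the whole stripe in one step leaves no window in which the two can disagree.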

Part of the distinction Darcy made between the Sun RAID system and traditional RAID devices is the ability to automatically update disks, which he said cannot be done with the Sun system. Removing the boundary layering that allowed this to happen automatically and bringing the filesystem and storage array into communication is a bad thing, Darcy said, because what makes RAID levels powerful tools is that they can be implemented separately.

To Sun, however, this is the major innovation of the new filesystem. In RAID-Z, Bonwick said, new data writes go only to unallocated space on the disk and are not limited to a defined stripe width on the disk. This lack of definition, he said, is also the reason that ZFS works without an array separate from the filesystem -- the two must be able to communicate in order to both retrieve and update data.
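A toy Python model can illustrate what writing only to unallocated space, with no fixed stripe width, might look like. The `TinyPool` class and its methods are hypothetical and far simpler than anything in ZFS; the sketch assumes a single pointer switch is what commits a write.

```python
class TinyPool:
    """Toy copy-on-write store: a write never overwrites blocks that are live."""

    def __init__(self, size):
        self.blocks = [None] * size
        self.free = set(range(size))
        self.live = {}  # name -> block indices of the committed version

    def write(self, name, chunks):
        """Place a new version in free blocks only, then switch the pointer."""
        if len(chunks) > len(self.free):
            raise MemoryError("pool full")
        new_locs = sorted(self.free)[:len(chunks)]  # width follows the data
        for loc, chunk in zip(new_locs, chunks):
            self.blocks[loc] = chunk
        self.free -= set(new_locs)
        old_locs = self.live.get(name, [])
        self.live[name] = new_locs   # the commit point: one pointer switch
        self.free |= set(old_locs)   # old version is released only afterward
        return new_locs

    def read(self, name):
        return [self.blocks[i] for i in self.live[name]]


pool = TinyPool(8)
first = pool.write("report", [b"a", b"b"])
second = pool.write("report", [b"x", b"y", b"z"])  # a wider write, new location
print(first, second)
print(pool.read("report"))
```

If a crash happened before the pointer switch, the previous version would still be intact and still referenced. That is also why, in this model, the layer choosing where blocks go has to be the same layer that tracks which blocks are live.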

Rather than deviating from the traditional RAID concept he said most people are familiar with from the 1988 paper "A Case for Redundant Arrays of Inexpensive Disks" by computer science professors David Patterson, Garth Gibson, and Randy Katz, Bonwick said RAID-Z is redefining the accepted notion of RAID.

The RAID concept, which has since swapped the word inexpensive for independent as companies began to build the systems for sale to other companies and organizations, was built on the idea that several devices could be linked together to deliver faster performance and greater availability than a single device offered at the time.

Bonwick said RAID-Z and ZFS are in line with the idea put forward in that paper because while the most frequent interpretation of RAID has been to stitch together a group of block devices, Patterson's definition was broader. The real objective of RAID, he said, is that if one of your disks dies you can reconstruct your data from the disks that remain. He added, however, that an inexpensive disk array can be used with Sun's software because it does not require the hardware that now dominates this area of the data solution industry.

Because ZFS is software, releasing it under an open source model with OpenSolaris has already begun to pay off, Bonwick said. Sun has received its first bug report with a full analysis and root cause from a developer outside the company, which means to him that at this stage "the open source model has already paid off for us."

In addition to contributions from the OpenSolaris community beginning to trickle in, Bonwick said there is work ahead for the improvement and continued development of ZFS and RAID-Z.

"There are certainly things we haven't gotten to yet," he said. "We haven't seen any architectural mistakes yet. Undoubtedly we have a lot of performance work to do, but that's an ongoing thing -- you're never really done with performance. And then there are also some features I think are going to be [added in the next version]."

Bonwick said Sun is looking to improve and add to the software when it releases the new version with Solaris 10.2, which is due out sometime in mid-2006. The new version is expected to include support for hot spares, device removal, and encryption, which he said is one of the biggest things Sun would like to add to ZFS.

Additionally, Bonwick said developers are working on a delegated administration model that will allow individual users to create their own filesystems. Functions available to users, without involving the system administrator, will include taking snapshots and creating clones of files, among others, within a given space to manage data on an individual workstation.
