A tutorial on self-healing data storage systems

Data storage systems are increasingly offering the ability to repair hard disk drive failures without user intervention. Learn about self-healing storage in this tutorial.

Beth Pariseau, Senior News Writer

Data growth waits for no man and no economic downturn -- the amount of data storage administrators must manage is growing by the day, and shows no sign of slowing.

As data sets continue to mushroom in size and storage systems grow larger and more complex, rebuilding data from every failed disk using RAID parity calculations can become too time-consuming and can degrade performance. In response, some storage vendors are offering automated recovery capabilities to help users keep up with managing ever-growing infrastructures.

These capabilities are often referred to as self-healing storage. While the definition generally refers to systems that automate recovery processes, it is difficult to draw a line between new self-healing features and features that have long been part of enterprise external storage arrays, like automatic RAID rebuilds.

In this tutorial, learn about self-healing storage systems, vendors in the space, heal-in-place systems versus fail-in-place systems, and whether or not self-healing data storage is right for your enterprise.



What is self-healing storage? 
Heal-in-place storage systems
Fail-in-place systems
 What users are saying about self-healing storage 

 What is self-healing storage?

According to Dragon Slayer Consulting president Marc Staimer in his June 2009 Storage magazine feature, Storage, heal thyself, "Self-healing storage is [most] accurately defined as transparently restoring both the data and storage from a failure. That might seem like splitting hairs, but it's not. It's the difference between treating the symptoms and fixing the cause."

Staimer further identifies three categories of self-healing storage: end-to-end error detection and correction; "heal-in-place" systems, which try to fix a failed disk before initiating RAID rebuilds; and the newest class of system, "fail-in-place," which does not require failed hard drives to be replaced during the warranty life of the system. This piece will focus on products in the latter two categories.

 Heal-in-place storage systems

DataDirect Networks S2A series

DataDirect Networks' (DDN) S2A product line includes the S2A6620, 9700 and 9900 hardware models. The self-healing features reside mostly in the S2A operating system. Like BlueArc Corp.'s Titan NAS heads, DDN's products put some of the heavy-duty processing of data, such as parity calculations, into silicon by way of field-programmable gate arrays (FPGAs). This, along with parallelization, means that the arrays can write and read at the same rate.

That sets up the ability to calculate two-disk parity on every read, as well as every write, without performance degradation. So, theoretically the system never goes into rebuild mode, because it operates in that mode all the time.
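To make the idea of checking parity on every read concrete, here is a minimal Python sketch. It is a simplification, not DDN's implementation: it uses single XOR parity rather than the two-disk parity described above (which needs a second, independent code), and the function names `xor_blocks` and `read_stripe` are made up for illustration.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def read_stripe(data_blocks, parity_block):
    """Verify parity on a read; rebuild at most one missing block (None)."""
    missing = [i for i, block in enumerate(data_blocks) if block is None]
    if len(missing) > 1:
        raise ValueError("single parity can rebuild only one missing block")
    result = list(data_blocks)
    if missing:
        # XOR of the surviving data blocks and the parity yields the lost block.
        survivors = [b for b in data_blocks if b is not None]
        result[missing[0]] = xor_blocks(survivors + [parity_block])
    elif xor_blocks(data_blocks) != parity_block:
        raise OSError("parity mismatch: stripe needs repair")
    return result

# Hypothetical three-block stripe with one unreadable block.
stripe = [b"self", b"heal", b"disk"]
parity = xor_blocks(stripe)
recovered = read_stripe([stripe[0], None, stripe[2]], parity)
```

Because the same parity math runs whether or not a block is missing, a read from a degraded stripe follows the same path as a healthy read, which is the intuition behind "operating in rebuild mode all the time."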

DDN's systems perform disk scrubbing, as well as isolation of failed disks for diagnosis and attempted repair. The product can conduct low-level formatting of drives, power-cycle individual drives if they become unresponsive, correct data using checksums on the fly, rewrite corrected data back to the disk, and use S.M.A.R.T. diagnostics on SATA disks to determine if the drives need to be replaced.
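Disk scrubbing with checksums can be sketched in a few lines of Python. This is an illustrative model only: the `scrub` function and its `repair` callback are hypothetical, CRC32 stands in for whatever checksum a real array uses, and the callback stands in for reconstruction from parity or another redundant copy.

```python
import zlib

def scrub(blocks, checksums, repair):
    """Check each block against its stored CRC32 and rewrite any that fail.

    Returns the indexes of the blocks that were repaired in place.
    """
    repaired = []
    for i, block in enumerate(blocks):
        if zlib.crc32(block) != checksums[i]:
            blocks[i] = repair(i)  # corrected data is written back
            repaired.append(i)
    return repaired

# Hypothetical data: block 1 has been silently corrupted on disk.
good = [b"alpha", b"bravo", b"delta"]
checksums = [zlib.crc32(b) for b in good]
on_disk = [b"alpha", b"brXvo", b"delta"]
fixed = scrub(on_disk, checksums, repair=lambda i: good[i])
```

Note that a checksum alone only detects the bad block; correcting it always depends on some redundant source of the data.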

DDN's S2A does not seal drive bays, and users are expected to replace failed disks during the life of the array.

NEC Corp. of America D-Series with Phoenix Technology

NEC's D-Series arrays are the most like conventional enterprise storage area network (SAN) systems of this group. D-Series offers dual-parity RAID 6 protection, and performs RAID rebuilds on disks in some cases. Users are also expected to swap out truly failed disks.

However, according to Data Mobility Group analyst Robin Harris, "Seventy percent of the time, just power-cycling a drive fixes it," if it doesn't respond to the storage controller. Sometimes the lack of response, which can cause conventional systems to declare a drive failed, is a temporary performance degradation, and not a real problem with the hardware.

D-Series comes with what NEC calls Phoenix Technology, which separates unresponsive drives from their RAID groups. The RAID groups continue operating while Phoenix puts the problem hard drive "under supervision" to detect if it has a bad sector or has truly failed. "In case the diagnosis result is temporary performance degradation, the HDD in question is returned to … ordinary use," according to NEC's product documentation. "If there are sector errors, the data is recovered on backup sectors, and the HDD is put back in RAID." NEC claims this process can stave off 30 percent to 50 percent of RAID rebuilds.
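The triage described in NEC's documentation can be pictured as a small decision function. The Python sketch below is an assumption-laden illustration, not NEC's logic: the `Verdict` names, the `triage_drive` function and its three inputs are all hypothetical.

```python
from enum import Enum

class Verdict(Enum):
    RETURN_TO_RAID = "temporary degradation: return drive to ordinary use"
    REPAIR_SECTORS = "recover data on backup sectors, then return to RAID"
    REPLACE_DRIVE = "drive has truly failed: rebuild and replace"

def triage_drive(responds_under_supervision, bad_sectors, backup_sectors):
    """Decide what to do with a drive that was pulled from its RAID group."""
    if not responds_under_supervision:
        return Verdict.REPLACE_DRIVE
    if bad_sectors == 0:
        return Verdict.RETURN_TO_RAID
    if bad_sectors <= backup_sectors:
        return Verdict.REPAIR_SECTORS
    # Assumption: more bad sectors than backup sectors means the drive is done.
    return Verdict.REPLACE_DRIVE

print(triage_drive(True, 0, 100).value)
print(triage_drive(True, 5, 100).value)
```

Only the last outcome triggers a conventional RAID rebuild, which is how a process like this could avoid a share of rebuilds.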

Panasas Inc. ActiveStor with ActiveScan technology

ActiveScan is a feature of Panasas Inc.'s ActiveStor systems, part of the Panasas ActiveScale operating system. ActiveScan continuously monitors data objects, RAID parity, and underlying disk devices to catch developing problems before they cause failures. If a problem is found, as with Phoenix, ActiveScan can copy the data off the problem disk, avoiding reconstruction in most cases.

Unlike D-Series, Panasas' ActiveStor systems offer parallel scale-out storage on a series of blades; should a reconstruction be required, all blades in the system will combine processing power to speed the process.

 Fail-in-place systems

Atrato Inc. Velocity1000 (V1000)

Atrato's Velocity1000 (V1000) is a 3U box containing sealed subunits of disk drives, each of which holds a set of four mini units inside. Each subunit and mini unit is referred to as a SAID -- a Self-managing Array of Identical Disks. The sealed disk canisters contain enough individual hard drives to offer parity protection for data in virtual RAID groups and to automatically swap in spares if one hard drive should fail. Atrato claims V1000 can go without requiring drive replacements or hardware maintenance for three to five years.

One-third of the V1000 box is taken up by dual controllers running Atrato's proprietary virtual RAID software that distributes software-based virtual RAID over the hard drives in clusters of SAID units. The software also detects disk drive failures and can fix most drive problems on the fly, avoiding RAID rebuilds by copying data off a problematic drive, diagnosing whether the drive has really failed or not, refurbishing bad sectors if possible and copying the data back, either to a new drive or to the repaired drive. Atrato also uses S.M.A.R.T., among other diagnostic and error-correction mechanisms, to identify potential drive failures, and can work around a bad drive head by storing data on the remaining good sectors of the drive.

The 3U, 160-drive enclosure contains 2.5-inch enterprise SATA disks connected to four floor boards. The controller is external, and customers can pick from a 2U controller by IBM or a more powerful 3U controller from SGI. Atrato is positioning its product for specific vertical markets with high capacity and uniform workloads.

Xiotech Corp. Emprise 5000 and 7000 systems

Xiotech's Emprise systems are built from a building block it calls the Intelligent Storage Element (ISE). ISE is based on Advanced Storage Architecture technology, which Xiotech acquired from disk drive maker Seagate Technology in November 2007. According to Xiotech, an ISE reduces the two greatest causes of drive failure -- heat and vibration -- to provide more than 100 times the reliability of a regular disk drive enclosed in a typical storage system drive bay.

Like Atrato, Xiotech's product can power-cycle disks, perform diagnostics and error correction on bad drive sectors, and write 'around' them if necessary. Xiotech also claims its product will incur zero service events in five years of operation, and guarantees this under warranty.

Xiotech's system comes in three models. The dual-controller Emprise 7000 SAN system supports up to 64 ISEs and includes the same management features as the Xiotech Magnitude 3D 4000 platform, including intelligent provisioning and a replication suite. Like the Magnitude 3D, the Emprise 7000 supports Fibre Channel or iSCSI. It scales to 1 PB.

The single-controller Emprise 7000 Edge targets branch offices and the midmarket. It supports up to 10 ISEs for a total maximum capacity of 160 TB. The Emprise 5000 is a DAS system that consists of one ISE. It supports Fibre Channel only. Both the 7000 Edge and the 5000 can be upgraded to a Model 7000.
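The capacity figures quoted for the two 7000 models are consistent with each other, as a quick calculation shows. The sketch below assumes every ISE holds the same capacity in both models and treats 1 PB as 1,024 TB; neither assumption comes from Xiotech.

```python
# Figures quoted in the article.
max_ise_edge = 10       # ISEs supported by the Emprise 7000 Edge
edge_capacity_tb = 160  # maximum capacity of the 7000 Edge, in TB
max_ise_7000 = 64       # ISEs supported by the dual-controller Emprise 7000

# Assumption: each ISE contributes the same capacity in both models.
tb_per_ise = edge_capacity_tb / max_ise_edge
capacity_7000_tb = max_ise_7000 * tb_per_ise

print(f"{tb_per_ise:.0f} TB per ISE")
print(f"{capacity_7000_tb:.0f} TB across {max_ise_7000} ISEs")
```

At 16 TB per ISE, 64 ISEs come to 1,024 TB, which matches the stated 1 PB ceiling for the Emprise 7000.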

 What users are saying about self-healing storage

Since Xiotech introduced ISE at Storage Networking World in the spring of 2008, the new technology has rapidly come to account for more than 80 percent of the company's revenue. Scott Ladewig, manager of networking and operations for Washington University in St. Louis, traded in an older Xiotech model, the Magnitude 3000, for an Emprise 7000 last summer.

"In the past, it's not like we've spent hundreds of man-hours on drives, our whole SAN is 24 TB or so," he said. "But if a drive failed, we'd have to spend a Saturday night watching it rebuild, and drives are growing larger and larger, and taking longer periods of time to rebuild, when they're vulnerable to a double disk failure."
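Ladewig's point about larger drives taking longer to rebuild follows from simple arithmetic: rebuild time is roughly drive capacity divided by sustained rebuild rate. The Python sketch below uses made-up numbers (a 50 MB/s rebuild rate and 1 TB and 2 TB drives) purely for illustration; real rates vary widely with workload and RAID level.

```python
def rebuild_hours(capacity_tb, rate_mb_per_s):
    """Estimate hours to rebuild a drive at a steady rate (decimal units)."""
    seconds = capacity_tb * 1e12 / (rate_mb_per_s * 1e6)
    return seconds / 3600

# Hypothetical drive sizes at an assumed 50 MB/s sustained rebuild rate.
for capacity_tb in (1, 2):
    hours = rebuild_hours(capacity_tb, 50)
    print(f"{capacity_tb} TB drive: about {hours:.1f} hours")
```

At a fixed rate, doubling the drive size doubles the window during which a second disk failure could cause data loss.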

The five-year warranty that Xiotech includes in the cost of the ISE system proved irresistible to Rick Young, network systems manager at Texas A&M College of Veterinary Medicine, who also replaced a Magnitude 3D 3000 with an Emprise 5000. "Right now we spend around $11,000 a year to maintain disk trays on the 3D," he said. "Multiply that by five years on the Emprise system, and it's no small amount of savings. We can put what we would've spent on maintenance towards our next refresh."

For Richard Alcala, chief engineer of New Hat LLC, a post-production firm in Santa Monica, Calif., more typical scale-out storage products with clustered file systems proved too cumbersome to manage in a performance-intensive environment. "The highest priority for us is the number of real-time streams" the system can feed to artists working on videos, he said. With the older system, "we spent a lot of time doing maintenance, trying to heal the system and recover data," he said. "Once every three months we'd spend about four hours running diagnostics." Alcala replaced that system with DataDirect Networks' S2A 9900.

Some users in the enterprise data storage market have balked at the idea of highly automated systems, as they take away some of the control users have over all the elements of the system. But Atrato user Shawn Mills, president of Green House Data, a wind-powered green colocation and hosting facility in Wyoming, said he likes the attention he gets from a smaller company. "One of the primary reasons we chose Atrato turned into the primary reason we stayed with them," he said. "Access to Atrato's engineering team was one of the differentiators that led us to them. After the sale, the access did not stop. They have been extremely responsive to our requests."

Xiotech has been "going gangbusters" in the enterprise with ISE, according to Data Mobility Group's Harris, but generally, Harris said he thinks the most advanced self-healing storage products will get the most traction in specialized vertical markets like media and entertainment and high-performance computing (HPC). "That said, how hard is it to power-cycle a hard drive?" he said. "That ought to be SOP for every disk array out there."

