Thursday, August 22, 2013

NetApp failed disk rebuild time factors

If a disk fails outright (a truly dead drive due to some hardware component failure), its contents must be reconstructed using the parity information stored across the rest of the RAID group. This is the lifeblood of RAID-DP, NetApp's dual-parity (RAID 6 style) implementation. On a busy system, a rebuild takes longer because the reconstruction process runs at lower priority than real work, so that it does not impose additional performance overhead on foreground I/O. Many factors influence the rebuild time. Large RAID groups take longer because there are that many more blocks to read and recalculate parity for, so you can count on group size affecting the time to rebuild. The number of back-end FC-AL loops matters a great deal if you have a large aggregate on a busy system: on an FC-AL filer, the more loops you have, the more bandwidth is available to handle the extra I/O a rebuild generates. FC-AL tops out around 400-500 MB/s per loop, and that bandwidth has to be shared with foreground traffic. Other factors include the storage controller model, the type of disk that failed, and even the Data ONTAP version, all of which can impact rebuild times.
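The bandwidth-sharing argument above can be sketched as a back-of-the-envelope model. Everything in this sketch is an assumption for illustration only (the rates, the priority fraction, the loop-sharing arithmetic, and the function name); it is not Data ONTAP's actual rebuild scheduler:

```python
# Illustrative model: rebuild time scales with disk capacity and shrinks
# with available loop bandwidth; foreground load and low rebuild priority
# both cut into what the reconstruction process can use.
# All parameters here are hypothetical, chosen only to show the shape of
# the trade-off described in the text.

def estimate_rebuild_hours(disk_gb, loop_mb_s=400.0, loops=1,
                           foreground_load=0.5, rebuild_priority=0.3):
    """Rough hours to reconstruct one failed disk.

    disk_gb          -- capacity of the failed disk in GB
    loop_mb_s        -- usable bandwidth of one FC-AL loop (MB/s)
    loops            -- back-end loops carrying rebuild traffic
    foreground_load  -- fraction of bandwidth consumed by client I/O
    rebuild_priority -- fraction of the remaining bandwidth the low-priority
                        rebuild is allowed to use
    """
    available_mb_s = loop_mb_s * loops * (1.0 - foreground_load) * rebuild_priority
    seconds = (disk_gb * 1024.0) / available_mb_s
    return seconds / 3600.0

# A 1 TB disk on one busy loop vs. two lightly loaded loops:
busy = estimate_rebuild_hours(1000, loops=1, foreground_load=0.8)
quiet = estimate_rebuild_hours(1000, loops=2, foreground_load=0.2)
```

Even with made-up numbers, the model makes the point: the same disk rebuilds many times faster when more loops are available and foreground load is light.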
Most NetApp disk failures are seen as soft failures, where too many blocks have been flagged as bad, rather than a fatal hardware failure. This is the old "bad block table is full" kind of mess. From Data ONTAP 7.1 onwards, these failing drives take advantage of Rapid RAID Recovery: the good blocks are copied directly from the failing drive to a spare, and only the bad blocks need parity reconstruction. This speeds up rebuild and recovery time, in some cases significantly.
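The difference between full parity reconstruction and copy-based recovery can be shown with a toy example. This is a sketch only: it uses simple single-parity XOR rather than RAID-DP's two parity disks with row and diagonal parity, and the function names are made up for illustration:

```python
# Toy contrast: rebuilding a lost block from parity requires reading the
# same stripe position on every surviving disk, while copy-based recovery
# (Rapid RAID Recovery style) copies good blocks straight off the failing
# drive and only reconstructs the blocks flagged bad.

from functools import reduce

def reconstruct_block(surviving_blocks):
    """XOR the same stripe position across all surviving disks (data +
    parity) to recover the lost block. Every disk must be read."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)),
                  surviving_blocks)

def rapid_recover(failing_disk, bad_indexes, stripe_reader):
    """Copy readable blocks from the failing disk; fall back to parity
    reconstruction only for the blocks flagged bad."""
    spare = []
    for i, block in enumerate(failing_disk):
        if i in bad_indexes:
            spare.append(reconstruct_block(stripe_reader(i)))  # expensive
        else:
            spare.append(block)  # cheap sequential copy
    return spare

# Demo: three data disks plus single XOR parity, four blocks each.
disk0 = [bytes([i]) * 4 for i in range(4)]          # the soft-failing disk
disk1 = [bytes([i + 10]) * 4 for i in range(4)]
disk2 = [bytes([i + 20]) * 4 for i in range(4)]
parity = [reconstruct_block([disk0[i], disk1[i], disk2[i]]) for i in range(4)]

# Soft failure: only block 2 of disk0 is unreadable.
rebuilt = rapid_recover(disk0, {2},
                        lambda i: [disk1[i], disk2[i], parity[i]])
```

In the soft-failure case only one of the four blocks triggers the expensive read-everything-and-XOR path, which is exactly why copy-based recovery finishes so much faster than a full reconstruction.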
