Hard Drive error recovery control

Although I use Western Digital Red hard drives in my NAS, I use various desktop-grade drives in my Proxmox hypervisor.

For my use case, desktop hard drives perform perfectly adequately. However, they generally lack several features found in enterprise-class drives.

One of these features is Error Recovery Control (ERC), also known as Time-Limited Error Recovery (TLER) or Command Completion Time Limit (CCTL). I will continue to refer to it as TLER in this post as that is the term I am most familiar with.

This feature limits how long a hard drive may spend trying to recover from a read or write error before it gives up and reports the error to the host.

When using a hardware RAID controller, TLER support is essential, as most RAID controllers will mark a drive as failed if it does not respond within a set time, often 8 seconds.

There is a lot of debate as to whether TLER is as important with software RAID solutions such as ZFS.

Many software RAID solutions will not mark a drive as failed and will instead wait while the drive retries its read or write.

During this time the virtual machine or application running on the array will be experiencing very bad performance and may even hang or crash. I have personally seen guest operating systems kernel panic due to this issue.

Check for TLER support

TLER is commonly disabled by default on desktop-class drives, but it is often configurable.

Currently I have the following drives in my Proxmox Hypervisor:

1 X HP - VB0250EAVER
2 X Toshiba - DT01ACA100
1 X Toshiba - HDWD110 (P300)

I was able to use smartctl to check the TLER status as follows:

~# smartctl -l scterc /dev/sda

SCT Error Recovery Control:
           Read: Disabled
          Write: Disabled

This showed that TLER was disabled for the hard drive in question. I ran the same command for all the drives in my Hypervisor and found that all had TLER support disabled.
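Checking several drives one at a time gets tedious, so the same query can be wrapped in a small loop. This is only a sketch: the function name report_scterc and the device names in the example call are placeholders, and it assumes smartctl is on the PATH and that you are running as root.

```shell
#!/bin/sh
# Print the SCT ERC (TLER) read/write lines for each device given as an argument.
# report_scterc is a hypothetical helper name, not part of smartmontools.
report_scterc() {
    for dev in "$@"; do
        printf '== %s ==\n' "$dev"
        # Keep only the Read:/Write: lines; fall back to a note if none appear
        # (for example when the drive does not support SCT ERC at all).
        smartctl -l scterc "$dev" | grep -E 'Read:|Write:' \
            || echo "  no SCT ERC values reported"
    done
}

# Example call; substitute your own device names.
# report_scterc /dev/sda /dev/sdb /dev/sdc /dev/sdd
```

A drive that prints the fallback note either lacks the feature or rejected the command, so it is worth running the plain smartctl command on it to see the full message.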

To find a good time to set for TLER I ran the same command on my FreeNAS system to see what my Western Digital Reds were set to:

~# smartctl -l scterc /dev/ada0

SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)

This showed that the timeout was set to 70 tenths of a second, i.e. 7 seconds, for both reads and writes.

Enabling TLER support

I then set the same timeout on each of the drives in my Hypervisor and verified that it had been set as follows:

~# smartctl -l scterc,70,70 /dev/sda

SCT Error Recovery Control set to:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)

~# smartctl -l scterc /dev/sda

SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)
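To apply the same limit to several drives in one go, the set and verify steps above can be combined. Again this is a sketch rather than a tested tool: set_scterc is a made-up helper name, the device names are placeholders, and it takes the limit in whole seconds and multiplies by ten because smartctl expects tenths of a second.

```shell
#!/bin/sh
# Set the same SCT ERC read/write limit on several drives, then read it back.
# set_scterc is a hypothetical helper name; usage: set_scterc SECONDS DEVICE...
set_scterc() {
    seconds=$1
    shift
    # smartctl takes the limits in tenths of a second, so 7 seconds becomes 70.
    limit=$((seconds * 10))
    for dev in "$@"; do
        printf '== %s ==\n' "$dev"
        smartctl -l "scterc,${limit},${limit}" "$dev" > /dev/null
        # Query the drive again so the value shown is what it actually reports.
        smartctl -l scterc "$dev" | grep -E 'Read:|Write:' \
            || echo "  no SCT ERC values reported"
    done
}

# Example call; substitute your own device names.
# set_scterc 7 /dev/sda /dev/sdb /dev/sdc /dev/sdd
```

Reading the value back rather than trusting the output of the set command guards against a drive that accepts the command but ignores it.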

I have not yet been able to test whether the TLER setting persists on these drives after a reboot, and I will update this post if and when I find out.
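If the setting turns out not to survive a reboot or power cycle (an assumption at this point, since it is untested on these drives), one option would be to reapply it at boot. The sketch below is hypothetical: the script path, the apply_scterc name and the idea of calling it from a root "@reboot" crontab entry are all illustrative choices, not something I have deployed.

```shell
#!/bin/sh
# Hypothetical boot-time script, e.g. saved as /usr/local/sbin/set-tler.sh and
# run from a root crontab "@reboot" line. The path and names are assumptions.
LIMIT=70   # tenths of a second, matching the 7 second WD Red value

apply_scterc() {
    for dev in "$@"; do
        # Skip paths that do not exist, such as an unmatched glob.
        [ -e "$dev" ] || continue
        # -q errorsonly keeps the boot log quiet unless something goes wrong.
        smartctl -q errorsonly -l "scterc,${LIMIT},${LIMIT}" "$dev"
    done
}

# Example call; adjust the glob or list the drives explicitly.
# apply_scterc /dev/sd?
```

Listing the drives explicitly is the safer choice if some devices matched by the glob should be left alone.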

I highly recommend reading the Wikipedia article on TLER as well as the OpenZFS wiki for further information.

Many thanks to Benjamin Bryan and Andy Smith for their blog posts on the subject, which were of great help. Thanks also to jgreco on the FreeNAS forums, whose post was very informative.