For my use case, desktop hard drives perform perfectly adequately. However, they generally lack several features found in enterprise-class drives.
One of these features is Error Recovery Control (ERC), also known as Time-Limited Error Recovery (TLER) or Command Completion Time Limit (CCTL). I will refer to it as TLER in this post, as that is the term I am most familiar with.
This feature sets a time limit for how long a hard drive can attempt to recover from read or write errors.
When using a hardware RAID controller, TLER support is essential, as most RAID controllers will mark a drive as failed if it does not respond within a set time, often 8 seconds.
There is a lot of debate as to whether TLER is as important with software RAID solutions such as ZFS.
Many software RAID solutions will not mark a drive as failed and will instead wait while the drive retries its read or write.
During this time, the virtual machine or application running on the array may perform very poorly and may even hang or crash. I have personally seen guest operating systems kernel panic due to this issue.
Check for TLER support
TLER is disabled by default on most desktop-class drives, but it is often configurable.
Currently I have the following drives in my Proxmox Hypervisor:
1 X HP - VB0250EAVER
2 X Toshiba - DT01ACA100
1 X Toshiba - HDWD110 (P300)
I was able to check the TLER status of each drive with smartctl.
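The exact command from the original run is not reproduced here. A minimal sketch of the check, assuming the smartmontools package is installed, that it is run as root, and that `/dev/sda` is a placeholder for the real device path (`check_tler` is a hypothetical helper name):

```shell
# Print the SCT Error Recovery Control (TLER) settings for one drive.
# The device path is passed as the first argument.
check_tler() {
    smartctl -l scterc "$1"
}

# Example (hypothetical device path, run as root):
#   check_tler /dev/sda
# A drive with the feature turned off typically reports both
# Read and Write as "Disabled".
```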
This showed that TLER was disabled for the hard drive in question. I ran the same command for all the drives in my Hypervisor and found that all had TLER support disabled.
To find a sensible timeout value, I ran the same command on my FreeNAS system to see what my Western Digital Reds were set to.
This showed that the timeout was set to 7 seconds.
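smartctl reports and accepts this timeout in units of 100 ms rather than seconds, so a 7 second limit corresponds to a value of 70. The conversion is trivial, but easy to get wrong by a factor of ten; a small sketch (`seconds_to_scterc` is a hypothetical helper name):

```shell
# Convert whole seconds to the 100 ms units used by smartctl's scterc option.
seconds_to_scterc() {
    echo $(( $1 * 10 ))
}

seconds_to_scterc 7   # prints 70
```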
Enabling TLER support
I then set the same timeout on each of the drives in my Hypervisor and verified that it had been applied.
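The original commands are not reproduced here. A minimal sketch of setting and then re-reading the value, assuming a 7 second limit for both reads and writes (70 in smartctl's 100 ms units) and a placeholder device path (`set_tler` is a hypothetical helper name):

```shell
# Set the read and write recovery limits on one drive, then read them back.
# $1 = device path, $2 = limit in 100 ms units (defaults to 70, i.e. 7 seconds).
set_tler() {
    smartctl -l scterc,"${2:-70}","${2:-70}" "$1"
    smartctl -l scterc "$1"
}

# Example (hypothetical device path, run as root):
#   set_tler /dev/sda
```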
I have not yet been able to test whether the TLER setting will persist on these drives after a reboot and will update this post if and when I find out.
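In case the setting turns out not to survive a reboot, one option would be to reapply it at every boot. This is only a sketch of that idea, not something tested here: the device paths are placeholders, `apply_tler_all` is a hypothetical helper name, and how the script gets run at boot (a cron `@reboot` entry, a systemd unit, or similar) is left open.

```shell
# Reapply a 7 second TLER limit (70 x 100 ms) to every drive given as an argument.
# A failure on one drive is reported on stderr and does not stop the loop.
apply_tler_all() {
    for dev in "$@"; do
        smartctl -l scterc,70,70 "$dev" || echo "could not set TLER on $dev" >&2
    done
}

# Example (hypothetical device paths, run as root):
#   apply_tler_all /dev/sda /dev/sdb /dev/sdc /dev/sdd
```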