Replacing a failing hard drive in FreeNAS

Feb 5, 2019. Tags: freenas, hardware, howto, truenas

Just a matter of time

Recently I started receiving notifications from FreeNAS indicating SMART errors on one of my hard drives. A quick check of the system log confirmed a number of SMART errors on /dev/da4.

root@freenas:/var/log # grep smart /var/log/messages
Feb  1 03:39:51 freenas smartd[2739]: Device: /dev/da4 [SAT], previous self-test completed with error (read test element)
Feb  1 03:39:51 freenas smartd[2739]: Device: /dev/da4 [SAT], Self-Test Log error count increased from 0 to 1
Feb  4 21:46:05 freenas smartd[48304]: Device: /dev/da4 [SAT], previous self-test completed with error (read test element)
Feb  4 21:46:05 freenas smartd[48304]: Device: /dev/da4 [SAT], previous self-test completed with error (read test element)
Feb  4 21:47:00 freenas smartd[48927]: Device: /dev/da4 [SAT], previous self-test completed with error (read test element)
Feb  4 21:47:00 freenas smartd[48927]: Device: /dev/da4 [SAT], previous self-test completed with error (read test element)
Feb  5 06:17:00 freenas smartd[48927]: Device: /dev/da4 [SAT], previous self-test completed with error (read test element)
Feb  5 06:17:00 freenas smartd[48927]: Device: /dev/da4 [SAT], Self-Test Log error count increased from 1 to 2
Feb  5 06:17:00 freenas smartd[48927]: Device: /dev/da4 [SAT], previous self-test completed with error (read test element)
Feb  5 06:17:00 freenas smartd[48927]: Device: /dev/da4 [SAT], Self-Test Log error count increased from 1 to 2
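When several disks are logging at once, a per-device tally can be easier to read than the raw grep output. The following is a minimal sketch, assuming the smartd log lines look like the ones above; the function name smartd_error_summary is made up for illustration.

```shell
# Count smartd log lines that mention an error, grouped by device.
# Reads log text on stdin and prints "<device> <count>" per device.
smartd_error_summary() {
    awk '/smartd\[/ && /error/ {
        # The device path is the field right after the "Device:" token.
        for (i = 1; i < NF; i++)
            if ($i == "Device:") { count[$(i + 1)]++; break }
    }
    END { for (d in count) print d, count[d] }' | sort
}

# Example usage (log path taken from the grep command above):
# smartd_error_summary < /var/log/messages
```

A device whose count keeps climbing between checks is the one to look at first with smartctl.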

Pulling up the SMART information available for /dev/da4 confirms the recent read failures: the two most recent self-tests both stopped at the same LBA.

root@freenas:~ # smartctl -a /dev/da4
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   140   140   021    Pre-fail  Always       -       15000
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       163
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   005   005   000    Old_age   Always       -       69621
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       128
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       79
193 Load_Cycle_Count        0x0032   170   170   000    Old_age   Always       -       92692
194 Temperature_Celsius     0x0022   114   099   000    Old_age   Always       -       38
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       28

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       10%      4094         3907016230
# 2  Short offline       Completed: read failure       60%      3995         3907016230
# 3  Short offline       Completed without error       00%      3271         -
# 4  Short offline       Completed without error       00%      2545         -
# 5  Short offline       Completed without error       00%      1826         -
# 6  Short offline       Completed without error       00%      1091         -
# 7  Short offline       Completed without error       00%       371         -
# 8  Short offline       Completed without error       00%     65219         -
# 9  Short offline       Completed without error       00%     64476         -
#10  Short offline       Completed without error       00%     63757         -
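The self-test log is the part of this output that matters here, and smartctl can print it on its own with -l selftest. As a rough sketch, assuming the column layout shown above, the failing entries can be filtered out like this (selftest_failures is a hypothetical helper name):

```shell
# Print the lifetime hours and first-error LBA of every failed self-test.
# Reads the self-test log on stdin.
selftest_failures() {
    awk '/^# *[0-9]+ / && /failure/ {
        # The last two columns are LifeTime(hours) and LBA_of_first_error.
        printf "hours=%s lba=%s\n", $(NF - 1), $NF
    }'
}

# Example usage:
# smartctl -l selftest /dev/da4 | selftest_failures
```

Repeated failures at the same LBA, as seen above, point to a bad spot on the platter rather than a one-off glitch.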

Time to replace it

To see all available disks in the system, run camcontrol devlist:

root@freenas:~ # camcontrol devlist
<ATA HGST HDN726040AL W7JH>        at scbus0 target 1 lun 0 (pass0,da0)
<ATA HGST HDN726040AL W7JH>        at scbus0 target 2 lun 0 (pass1,da1)
<ATA HGST HDN726040AL W7JH>        at scbus0 target 3 lun 0 (pass2,da2)
<ATA WDC WD2002FAEX-0 1D05>        at scbus0 target 4 lun 0 (pass3,da3)
<ATA WDC WD2002FAEX-0 1L01>        at scbus0 target 5 lun 0 (da4,pass4)
<SanDisk Ultra Fit 1.00>           at scbus8 target 0 lun 0 (pass5,da5)
<SanDisk Ultra Fit 1.00>           at scbus9 target 0 lun 0 (pass6,da6)
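Several of these disks share a model string, so it may be worth noting the serial number of the failing disk before touching any hardware and matching it against the label on the drive. A small sketch, assuming smartctl -i prints a "Serial Number:" line as it normally does for ATA disks (serial_of is a hypothetical helper name):

```shell
# Extract the serial number from "smartctl -i" output read on stdin.
serial_of() {
    awk -F': *' '/^Serial Number:/ { print $2 }'
}

# Example usage:
# smartctl -i /dev/da4 | serial_of
```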

The first step is to shut down this disk in preparation for replacement.

root@freenas:~ # camcontrol stop da4
Unit stopped successfully

Now that the drive has been stopped, simply pull the drive sled and let hot swap do its thing.

dmesg output shows the device being disconnected from the system.

root@freenas:~ # dmesg | grep da4
da4 at mps0 bus 0 scbus0 target 6 lun 0
da4: <ATA WDC WD2001FASS-0 1D05> Fixed Direct Access SPC-4 SCSI device
da4: Serial Number WD-WMAY00665824
da4: 300.000MB/s transfers
da4: Command Queueing enabled
da4: 1907729MB (3907029168 512 byte sectors)
        (da4:mps0:0:6:0): WRITE(10). CDB: 2a 00 d7 62 b4 b0 00 00 10 00 length 8192 SMID 304 terminated ioc 804b loginfo 31110d00 scsi 0 state c xfer 0
(da4:mps0:0:6:0): WRITE(10). CDB: 2a 00 d7 62 b4 b0 00 00 10 00
(da4:mps0:0:6:0): CAM status: CCB request completed with an error
(da4:mps0:0:6:0): Retrying command
da4 at mps0 bus 0 scbus0 target 6 lun 0
mps0: da4: <ATA WDC WD2001FASS-0 1D05> s/n WD-WMAY00665824 detached
(da4:mps0:0:6:0): WRITE(10). CDB: 2a 00 d7 62 b4 b0 00 00 10 00
(da4:mps0:0:6:0): CAM status: CCB request aborted by the host
(da4:mps0:0:6:0): Error 5, Periph was invalidated
GEOM_MIRROR: Device swap0: provider da4p1 disconnected.
(da4:mps0:0:6:0): Periph destroyed

FreeNAS will now show an alert indicating a disk has been removed.

Remove the failing hard drive from the drive sled and add in the replacement drive.

Once the drive sled is reinstalled into the hot swap bay, dmesg can once again be used to confirm the system has detected the new drive.

root@freenas:~ # dmesg | grep da4
da4 at mps0 bus 0 scbus0 target 5 lun 0
da4: <ATA WDC WD2002FAEX-0 1L01> Fixed Direct Access SPC-4 SCSI device
da4: Serial Number WD-WCC1P0987690
da4: 600.000MB/s transfers
da4: Command Queueing enabled
da4: 1907729MB (3907029168 512 byte sectors)
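ZFS will only accept a replacement disk that is at least as large as the one it replaces, so the sector count in the dmesg output is worth a glance. A minimal sketch, assuming the "(N 512 byte sectors)" line format shown above (sector_counts is a hypothetical helper name):

```shell
# Print the device name and sector count for each capacity line in dmesg.
# Reads dmesg text on stdin.
sector_counts() {
    awk '/byte sectors\)/ {
        n = $3
        sub(/^\(/, "", n)   # drop the leading "(" from "(3907029168"
        print $1, n
    }'
}

# Example usage:
# dmesg | grep da4 | sector_counts
```

In this case the old and new drives both report 3907029168 sectors, so the sizes match.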

To complete the replacement process, log in to FreeNAS and click the Storage button. Select the volume with the degraded status, then click the Volume Status button at the bottom.

In the Volume Status window, select the disk with the removed status. Click the Replace button at the bottom.

Now select the new disk as the member disk and click the Replace Disk button.

If all went well, the new disk will be added to the volume with an online status. The volume status will appear as Resilver with a status of In progress.

Lastly, FreeNAS will show one more Critical “everything is ok” alert. Let the resilver process complete and this alert will clear on its own.
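The resilver can also be followed from the console with zpool status. The sketch below pulls out just the scan line; it assumes the usual "scan:" line in the zpool status output, and both the helper name resilver_state and the pool name tank are placeholders.

```shell
# Print the text of the "scan:" line from zpool status output read on stdin.
resilver_state() {
    awk '$1 == "scan:" {
        sub(/^[ \t]*scan:[ \t]*/, "")   # strip the label, keep the rest
        print
        exit
    }'
}

# Example usage ("tank" stands in for the real pool name):
# zpool status tank | resilver_state
```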

Cliffs Notes

Console commands:

See all SMART errors in the system log: grep smart /var/log/messages
See all available SMART information for a specific device: smartctl -a /dev/da4
See a list of all hard drive devices in the system: camcontrol devlist
Stop a specific hard disk: camcontrol stop da4
See system information after removing or adding a disk: dmesg | grep da4

FreeNAS web interface steps:

Click Storage.
Select the degraded volume.
Click the Volume Status button at the bottom.
Select the disk with the removed status.
Click the Replace button at the bottom.
Select the new member disk.
Click the Replace Disk button.