Having a faulty drive can be stressful. You hope that you’ve been backing up all the data you need from it regularly. And while you source another drive of the same or higher capacity, you hope that your other drives don’t also fail.

A real issue is knowing when a drive is about to fail. Have you been regularly running smartctl checks? Even if you have, chances are that you won’t detect any errors until the drive is about to fail completely. That gives you very little time to rectify the issue.

ZFS can help with some of the above. It can detect when a drive is failing and notify you as soon as errors are detected. Your ZFS pool has you covered (if you have enough disks for redundancy) while you source a replacement drive. Replacing the drive once you get it is “easy” once you know what to do, hence this article documenting those steps.

To complicate things a little more, I’m running ZFS on an array of disks inside a TerraMaster D4-300 connected via USB-C. The host machine is a Zimaboard 832.

TerraMaster D4-300

One thing ZFS doesn’t do for you is back up your data. It’s recommended that you run regular backups of your data. Having destroyed a ZFS pool in the past, I can attest that “ZFS is not a backup” is a valid statement.

Workflow

If a hardware fault occurs and you regularly schedule ZFS scrubs, ZFS will detect the fault and mark the drive as FAULTED, OFFLINE or DEGRADED. It’s important to note that if you don’t run regular scrubs, it can take a while for the error to be detected. I run scrubs three times a day, roughly every 8 hours.
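
For reference, scheduling regular scrubs can be as simple as a cron entry. The sketch below is an assumption on my part, not a prescription: it presumes a Debian-style /etc/cron.d layout, zpool at /usr/sbin/zpool, and your own pool name substituted in.

# /etc/cron.d/zfs-scrub (hypothetical file): start a scrub every 8 hours
# zpool scrub errors out if a scrub is already running; cron simply ignores that
0 */8 * * * root /usr/sbin/zpool scrub <YOUR POOL NAME>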

Verify the status of the drive

Run a zpool status <YOUR POOL NAME> and verify the status of the faulty drive. You should see something like this:

  pool: YOUR POOL NAME
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub in progress since Sun Nov 26 05:30:02 2023
        1013G scanned at 1.20G/s, 297G issued at 359M/s, 1.41T total
        0B repaired, 20.51% done, 00:54:38 to go
config:

        NAME                                     STATE     READ WRITE CKSUM
        YOUR POOL NAME                           DEGRADED     0     0     0
          raidz1-0                               DEGRADED     0     0     0
            scsi-STDAS_TerraMaster_9DB202102202  ONLINE       0     0     0
            scsi-STDAS_TerraMaster_ADB202102202  FAULTED     97     0     0  too many errors
            scsi-STDAS_TerraMaster_BDB202102202  ONLINE       0     0     0
            scsi-STDAS_TerraMaster_CDB202102202  ONLINE       0     0     0

Notice that state is DEGRADED and STDAS_TerraMaster_ADB202102202 is FAULTED with too many errors.

The action section tells us what to do:

Replace the faulted device, or use ‘zpool clear’ to mark the device repaired.

Note down the faulted drive id scsi-STDAS_TerraMaster_ADB202102202 for use when replacing the drive.
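
If you’d rather pull that id out programmatically, here’s a minimal sketch against the zpool status output above (it assumes the device state is the second whitespace-separated field, as shown):

# Print the names of any devices zpool reports as FAULTED
zpool status <YOUR POOL NAME> | awk '$2 == "FAULTED" {print $1}'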

Finding information about the failed drive

The following details will help you identify the physical drive to replace in the enclosure.

Find the drive id and mappings

Find the drive id and drive mappings for the failed drive with lsblk:

lsblk -o NAME,SIZE,SERIAL,LABEL,FSTYPE,PATH,MODEL

which will produce output similar to:

NAME              SIZE SERIAL       LABEL  FSTYPE      PATH                              MODEL
loop0            63.3M                     squashfs    /dev/loop0
loop1            63.3M                     squashfs    /dev/loop1
loop2           111.9M                     squashfs    /dev/loop2
loop3            49.8M                     squashfs    /dev/loop3
loop4            53.2M                     squashfs    /dev/loop4
sda             931.5G 2250E693A93F                    /dev/sda                          CT1000BX500SSD1
├─sda1              1G                     vfat        /dev/sda1
├─sda2              2G                     ext4        /dev/sda2
└─sda3          928.5G                     LVM2_member /dev/sda3
  └─ubuntu--vg-ubuntu--lv
                928.5G                     ext4        /dev/mapper/ubuntu--vg-ubuntu--lv
sdb               2.7T 9DB202102202                    /dev/sdb                          TerraMaster
├─sdb1            2.7T              fspool zfs_member  /dev/sdb1
└─sdb9              8M                                 /dev/sdb9
sdc               3.6T ADB202102202                    /dev/sdc                          TerraMaster
├─sdc1            3.6T              fspool zfs_member  /dev/sdc1
└─sdc9              8M                                 /dev/sdc9
sdd               2.7T CDB202102202                    /dev/sdd                          TerraMaster
├─sdd1            2.7T              fspool zfs_member  /dev/sdd1
└─sdd9              8M                                 /dev/sdd9
sde               3.6T BDB202102202                    /dev/sde                          TerraMaster
mmcblk0          29.1G 0x4efa2725                      /dev/mmcblk0
├─mmcblk0p1       512M                     vfat        /dev/mmcblk0p1
├─mmcblk0p2      27.7G                     ext4        /dev/mmcblk0p2
└─mmcblk0p3       977M                     swap        /dev/mmcblk0p3
mmcblk0boot0        4M 0x4efa2725                      /dev/mmcblk0boot0
mmcblk0boot1        4M 0x4efa2725                      /dev/mmcblk0boot1

Finding the drive make, model and serial number

To find out the drive information, use smartctl:

smartctl -i "$DRIVE_PATH" | grep 'Model\|Capacity\|Serial'

where DRIVE_PATH is the path to the failed drive, e.g. /dev/sdc.

You should get some output of the form:

Model Family:     Western Digital Red
Device Model:     WDC WD30EFRX-68AX9N0
Serial Number:    WD-WCC1T0934440
User Capacity:    3,000,592,982,016 bytes [3.00 TB]

Finding the disk id

You need to find the disk id of the drive that failed. You can do that by listing /dev/disk/by-id/:

DRIVE_NAME="YOUR_DRIVE_NAME"
# Keep whole-disk scsi- entries for the drive, dropping partition links
DRIVE_ID=$(ls -l /dev/disk/by-id/ | grep "$DRIVE_NAME" | grep -v '\-part' | grep 'scsi-' | awk '{print $9}')
echo "Drive Id:"
while IFS= read -r d; do
  echo "$d"
done <<< "$DRIVE_ID"

Be sure to replace YOUR_DRIVE_NAME with your drive name, e.g. sdc.

You should get an output of the form:

Drive Id:
scsi-35000000000000001
scsi-STDAS_TerraMaster_ADB202102202

One of these drive ids should match the failed drive id you noted down in Verify the status of the drive.
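
As a quick cross-check, you can grep the by-id directory for the serial you noted down (the serial here is the one from the example above):

# The scsi-STDAS_... link (and its partition links) should show up
ls -l /dev/disk/by-id/ | grep 'ADB202102202'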

Replacing the physical drive

  1. The first thing to do is offline the drive. That ensures that ZFS takes it out of the pool. You can offline the drive with:
sudo zpool offline <YOUR POOL NAME> <DRIVE_ID>

For example, in the above scenario scsi-STDAS_TerraMaster_ADB202102202 was the faulty drive:

sudo zpool offline YOUR_POOL_NAME scsi-STDAS_TerraMaster_ADB202102202
  2. Shut down the machine that has the ZFS pool.

In the past I’ve seen ZFS get confused when you change drives while everything is online. This is an optional step, but I do it for my peace of mind.

Note: After a reboot your drive mappings can change, e.g. /dev/sdd can become /dev/sde, but your drive id will stay the same: STDAS_TerraMaster_ADB202102202.
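
That’s why the commands in this article use the drive id; you can always resolve it to whatever device name it currently maps to:

# Resolve the persistent by-id link to the current kernel device name
readlink -f /dev/disk/by-id/scsi-STDAS_TerraMaster_ADB202102202
# prints e.g. /dev/sdc (the letter may differ after a reboot)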

  3. Turn off the TerraMaster D4-300.
  4. Remove the faulty drive from the enclosure.

How do you know which drive to remove?

I have a rudimentary labelling scheme where I map each drive’s make, model, serial number and size to its slot on the TerraMaster D4-300. I keep these as Post-it notes on top of the enclosure. You can generate the details for the labels with the sketch after the example below.

For example:

Slot 1: Western Digital Red, WDC WD30EFRX-68AX9N0, WD-WCC1T0934440, 3TB

Slot 2: ..

Slot 3: ..

Slot 4: ..
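
To generate the details for those labels in one go, a loop like this sketch works (it assumes all your data drives appear as /dev/sd? and that smartmontools is installed):

# Print identifying info for each whole disk
for d in /dev/sd?; do
  echo "== $d =="
  sudo smartctl -i "$d" | grep 'Model\|Capacity\|Serial'
done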

  5. Insert the replacement drive into a caddy and put it into the enclosure.
  6. Turn on the TerraMaster D4-300.

Another note: I can’t seem to cold boot the TerraMaster D4-300 with all the drives inserted. I am not sure if this is an issue with the TerraMaster D4-300 itself or whether I’ve got some other issue with power etc. What I end up doing here is taking all the drives out and inserting them one at a time, waiting until I hear each one spin up before inserting the next.

  7. Turn your machine back on.
  8. Once your machine boots (and if you got this far, things are looking good), find out some information about the replacement drive, such as its drive mappings and drive id.

Get information about the replacement drive

Find out details about the replacement drive to ensure you’re replacing with the correct drive.

Find the drive id and mappings

Find the drive id for the new drive with lsblk:

lsblk -o NAME,SIZE,SERIAL,LABEL,FSTYPE,PATH,MODEL

which will produce output similar to:

NAME              SIZE SERIAL       LABEL  FSTYPE      PATH                              MODEL
loop0            63.3M                     squashfs    /dev/loop0
loop1            63.3M                     squashfs    /dev/loop1
loop2           111.9M                     squashfs    /dev/loop2
loop3            49.8M                     squashfs    /dev/loop3
loop4            53.2M                     squashfs    /dev/loop4
sda             931.5G 2250E693A93F                    /dev/sda                          CT1000BX500SSD1
├─sda1              1G                     vfat        /dev/sda1
├─sda2              2G                     ext4        /dev/sda2
└─sda3          928.5G                     LVM2_member /dev/sda3
  └─ubuntu--vg-ubuntu--lv
                928.5G                     ext4        /dev/mapper/ubuntu--vg-ubuntu--lv
sdb               2.7T 9DB202102202                    /dev/sdb                          TerraMaster
├─sdb1            2.7T              fspool zfs_member  /dev/sdb1
└─sdb9              8M                                 /dev/sdb9
sdc               3.6T ADB202102202                    /dev/sdc                          TerraMaster
├─sdc1            3.6T              fspool zfs_member  /dev/sdc1
└─sdc9              8M                                 /dev/sdc9
sdd               2.7T CDB202102202                    /dev/sdd                          TerraMaster
├─sdd1            2.7T              fspool zfs_member  /dev/sdd1
└─sdd9              8M                                 /dev/sdd9
sde               3.6T BDB202102202                    /dev/sde                          TerraMaster
mmcblk0          29.1G 0x4efa2725                      /dev/mmcblk0
├─mmcblk0p1       512M                     vfat        /dev/mmcblk0p1
├─mmcblk0p2      27.7G                     ext4        /dev/mmcblk0p2
└─mmcblk0p3       977M                     swap        /dev/mmcblk0p3
mmcblk0boot0        4M 0x4efa2725                      /dev/mmcblk0boot0
mmcblk0boot1        4M 0x4efa2725                      /dev/mmcblk0boot1

Finding the drive make, model and serial number

To ensure you’re replacing the correct drive, find out the drive information with smartctl:

smartctl -i "$DRIVE_PATH" | grep 'Model\|Capacity\|Serial'

where DRIVE_PATH is the path to the new drive, e.g. /dev/sdc.

You should get some output of the form:

Model Family:     Western Digital Red
Device Model:     WDC WD30EFRX-68AX9N0
Serial Number:    WD-WCC1T0934440
User Capacity:    3,000,592,982,016 bytes [3.00 TB]

Finding the disk id

You need to find the disk id of the drive you want to replace. You can do that by listing /dev/disk/by-id/:

DRIVE_NAME="YOUR_DRIVE_NAME"
# Keep whole-disk scsi- entries for the drive, dropping partition links
DRIVE_ID=$(ls -l /dev/disk/by-id/ | grep "$DRIVE_NAME" | grep -v '\-part' | grep 'scsi-' | awk '{print $9}')
echo "Drive Id:"
while IFS= read -r d; do
  echo "$d"
done <<< "$DRIVE_ID"

Be sure to replace YOUR_DRIVE_NAME with your drive name, e.g. sdc.

You should get an output of the form:

Drive Id:
scsi-35000000000000001
scsi-STDAS_TerraMaster_ADB202102202

You can use any of the ids returned to do the ZFS drive replacement. I usually use the scsi-STDAS_TerraMaster_ version.

Replacing the drive in the ZFS Pool

Now that we have all the information we need, we can finally replace the drive in the ZFS pool with zpool replace:

sudo zpool replace <YOUR POOL NAME> <OLD DRIVE ID> <NEW DRIVE ID>

For example:

sudo zpool replace <YOUR POOL NAME> scsi-STDAS_TerraMaster_ADB202102202 /dev/disk/by-id/scsi-STDAS_TerraMaster_ADB202102202

If all goes well, when you run zpool status <YOUR POOL NAME> you should see:

  pool: <YOUR POOL NAME>
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sun Nov 26 06:05:37 2023
        374G scanned at 46.8G/s, 16.0M issued at 2.00M/s, 1.41T total
        0B resilvered, 0.00% done, no estimated completion time
config:

        NAME                                       STATE     READ WRITE CKSUM
        YOUR POOL NAME                             DEGRADED     0     0     0
          raidz1-0                                 DEGRADED     0     0     0
            scsi-STDAS_TerraMaster_9DB202102202    ONLINE       0     0     0
            replacing-1                            DEGRADED     0     0     0
              old                                  OFFLINE      0     0     0
              scsi-STDAS_TerraMaster_ADB202102202  ONLINE       0     0     0
            scsi-STDAS_TerraMaster_BDB202102202    ONLINE       0     0     0
            scsi-STDAS_TerraMaster_CDB202102202    ONLINE       0     0     0

errors: No known data errors

Note that the status states that:

One or more devices is currently being resilvered

and the scan states that:

resilver in progress

and the config section shows that a replacement is active:

            replacing-1                            DEGRADED     0     0     0
              old                                  OFFLINE      0     0     0
              scsi-STDAS_TerraMaster_ADB202102202  ONLINE       0     0     0

Now, in a few hours, depending on how much data you have, you should be all good!
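
If you want to keep an eye on the resilver while it runs, a periodic status check does the job (the 60-second interval is arbitrary):

# Re-run zpool status every 60 seconds; Ctrl-C to stop
watch -n 60 zpool status <YOUR POOL NAME>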