Replacement of a failed drive in a zpool
July 30, 2025
Introduction
ZFS (Zettabyte File System) is a robust, enterprise-grade file system that combines the features of a file system and a volume manager. One of its most powerful attributes is its ability to manage and maintain large storage pools with built-in support for RAID-like configurations, such as RAID 0, RAID 1, RAIDZ, and RAID 10. Among these, RAID 10 is a hybrid configuration combining striping (RAID 0) and mirroring (RAID 1), offering both performance and redundancy. However, like any storage setup, drives can and do fail. Replacing a failed drive in a RAID 10 zpool is a critical process that must be performed with precision to maintain data integrity and system uptime.

Unlike traditional hardware RAID, ZFS manages redundancy through its own software-based system. A typical RAID 10 zpool in ZFS is built by creating multiple mirrored vdevs (virtual devices) that are then striped together. This setup allows for high read/write performance and redundancy, as each mirror can tolerate the failure of one disk without data loss.
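For illustration, a four-disk RAID 10 style pool is simply two mirror vdevs striped together. A minimal sketch of how such a pool is created follows; the pool name tank and the device names are placeholders, not the devices of the system described below:

zpool create tank mirror da0 da1 mirror da2 da3
zpool status tank

ZFS stripes writes across the two mirrors, and the pool keeps working as long as at least one disk in each mirror remains healthy.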
The rXg leverages the robust ZFS file system, specifically through zpools, to ensure high availability (HA) and data redundancy. This configuration is critical for maintaining continuous operation and protecting valuable data from drive failures, a common occurrence in any storage environment. By using ZFS mirrors within a zpool, the rXg platform can withstand the loss of individual drives without impacting data accessibility or integrity, allowing for hot-swapping and resilvering of new drives to restore full redundancy.
Zpool status with a failed drive
Detecting a failed drive is the first step in the replacement process. Symptoms may include:
- Warnings from monitoring tools or ZFS itself.
- Slow performance or intermittent read/write errors.
- Messages in system logs about I/O errors or device timeouts.
- The zpool status showing a degraded or faulted state.
You can check the health of your zpool by running:
zpool status
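For scripted or periodic checks, the -x option is convenient: it prints the full status only for pools that have problems and otherwise reports a single "all pools are healthy" line.

zpool status -x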
The example below shows a zpool with one failed drive (da3p4), which reduces the pool's resilience to further drive errors. Note that thanks to the RAID 10 configuration, the rXg operating system and all applications remain fully accessible.
[root@rxg /space/rxg]# zpool status
  pool: zroot
 state: DEGRADED
status: One or more devices could not be opened. Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-2Q
config:

        NAME        STATE     READ WRITE CKSUM
        zroot       DEGRADED     0     0     0
          mirror-0  ONLINE       0     0     0
            da0p4   ONLINE       0     0     0
            da1p4   ONLINE       0     0     0
          mirror-1  DEGRADED     0     0     0
            da2p4   ONLINE       0     0     0
            da3p4   UNAVAIL      0     0     0  cannot open

errors: No known data errors
Replacing the failed drive
Once you have identified the failed drive (in this case, da3p4, located in slot 4 of the host platform):
- Power down the server if hot-swapping is not supported.
- Replace the failed drive with a new one, ideally of the same size and performance rating.
- Power the system back on and allow the new drive to be fully discovered.
It is advisable to record the drive serial numbers before the swap is performed, so that the operation can be double-checked afterwards, for example using the following combination of shell commands:
[root@rxg /space/rxg]# camcontrol devlist
<AHCI SGPIO Enclosure 2.00 0001>  at scbus4 target 0 lun 0 (ses0,pass0)
<AHCI SGPIO Enclosure 2.00 0001>  at scbus9 target 0 lun 0 (ses1,pass1)
<SAMSUNG MZILG800HCHQAD3 DWG9>    at scbus10 target 0 lun 0 (pass2,da0)
<SAMSUNG MZILG800HCHQAD3 DWG9>    at scbus10 target 1 lun 0 (pass3,da1)
<SAMSUNG MZILG800HCHQAD3 DWG9>    at scbus10 target 2 lun 0 (pass4,da2)
<SAMSUNG MZILG800HCHQAD3 DWG9>    at scbus10 target 3 lun 0 (pass5,da3)
<DP BP_PSV 7.10>                  at scbus10 target 4 lun 0 (ses2,pass6)
<PNY USB 3.2.1 FD PMAP>           at scbus11 target 0 lun 0 (da4,pass7)
[root@rxg /space/rxg]# diskinfo -s /dev/da0
S6LANA0XC02694
[root@rxg /space/rxg]# diskinfo -s /dev/da1
S6LANA0XC02693
[root@rxg /space/rxg]# diskinfo -s /dev/da2
S6LANA0XC02692
[root@rxg /space/rxg]# diskinfo -s /dev/da3
S6LANA0XC02695
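To collect all the serial numbers in one step, the individual diskinfo calls can be wrapped in a small loop. A sketch, assuming the data drives are da0 through da3 as in this example and that it is run from /bin/sh:

for d in da0 da1 da2 da3; do printf '%s: ' $d; diskinfo -s /dev/$d; done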
If your system supports hot-swapping, you can remove and replace the drive without shutting down. After replacing the physical drive, the new disk will be detected by the operating system.
The listing below shows the available drives and their partitions after the physical replacement was made. The new drive appears as /dev/da3, which can be confirmed by checking its serial number:
[root@rxg /space/rxg]# ls -lah /dev/da*
crw-r-----  1 root operator 0xfb  Jul 21 13:05 /dev/da0
crw-r-----  1 root operator 0x103 Jul 21 13:05 /dev/da0p1
crw-r-----  1 root operator 0x104 Jul 21 13:05 /dev/da0p2
crw-r-----  1 root operator 0x105 Jul 21 13:05 /dev/da0p3
crw-r-----  1 root operator 0x106 Jul 21 13:05 /dev/da0p4
crw-r-----  1 root operator 0xfa  Jul 21 13:05 /dev/da1
crw-r-----  1 root operator 0xfe  Jul 21 13:05 /dev/da1p1
crw-r-----  1 root operator 0xff  Jul 21 13:05 /dev/da1p2
crw-r-----  1 root operator 0x100 Jul 21 13:05 /dev/da1p3
crw-r-----  1 root operator 0x102 Jul 21 13:05 /dev/da1p4
crw-r-----  1 root operator 0xfc  Jul 21 13:05 /dev/da2
crw-r-----  1 root operator 0x107 Jul 21 13:05 /dev/da2p1
crw-r-----  1 root operator 0x109 Jul 21 13:05 /dev/da2p2
crw-r-----  1 root operator 0x10b Jul 21 13:05 /dev/da2p3
crw-r-----  1 root operator 0x10d Jul 21 13:05 /dev/da2p4
crw-r-----  1 root operator 0xf8  Jul 21 13:05 /dev/da3
[root@rxg /space/rxg]# diskinfo -s /dev/da3
8EX0A0BX0KX3
The replacement command, executed in the shell as root, is:
zpool replace zroot da3p4 /dev/da3
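The general form of the command is:

zpool replace <pool> <failed-device> <new-device>

If the replacement disk appears under the same device node as the failed one, the last argument can be omitted. In this example the failed member was the partition da3p4, and the whole new disk /dev/da3 was supplied as its replacement.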
ZFS will then begin a process known as "resilvering," where it rebuilds the data on the new disk using its mirror.
The zpool status immediately afterwards shows the new drive replacing the failed one:
[root@rxg /space/rxg]# zpool status
  pool: zroot
 state: DEGRADED
  scan: resilvered 6.42G in 00:00:08 with 0 errors on Mon Jul 21 13:15:47 2025
config:

        NAME             STATE     READ WRITE CKSUM
        zroot            DEGRADED     0     0     0
          mirror-0       ONLINE       0     0     0
            da0p4        ONLINE       0     0     0
            da1p4        ONLINE       0     0     0
          mirror-1       DEGRADED     0     0     0
            da2p4        ONLINE       0     0     0
            replacing-1  DEGRADED     0     0     0
              da3p4      UNAVAIL      0     0     0  cannot open
              da3        ONLINE       0     0     0

errors: No known data errors
Pool status after the replacement was completed
[root@rxg /space/rxg]# zpool status
  pool: zroot
 state: ONLINE
  scan: resilvered 6.42G in 00:00:08 with 0 errors on Mon Jul 21 13:15:47 2025
config:

        NAME        STATE     READ WRITE CKSUM
        zroot       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            da0p4   ONLINE       0     0     0
            da1p4   ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            da2p4   ONLINE       0     0     0
            da3     ONLINE       0     0     0

errors: No known data errors
Note that the drive was adopted into the pool and properly resilvered. Given the amount of data involved and the speed of the drive, the resilvering process was likely too quick to be observed in progress, but with slower or fuller drives the following output can be seen in zpool status:
scan: resilver in progress since ... 5.21G scanned out of 100G at 100M/s, 10m to go
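If you want to follow a longer resilver while it runs, the status can simply be polled. A minimal sketch, assuming the pool is named zroot as above and run from /bin/sh; the 10-second interval is arbitrary:

while true; do zpool status zroot | grep -E 'state:|resilver'; sleep 10; done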
Sometimes, ZFS may refuse to use a disk that still contains old partition tables or metadata. On a Linux host, you can wipe such signatures from the new disk with:
wipefs -a /dev/sdX
Replace /dev/sdX with the device identifier of the new disk.
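On a FreeBSD-based platform such as the rXg used in this example, wipefs is not available; old ZFS labels and partition tables can be cleared with ZFS and gpart instead. A sketch, assuming the new disk is /dev/da3 as above (double check the device name first, as both commands destroy data on the target disk):

zpool labelclear -f /dev/da3
gpart destroy -F da3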
Best Practices for ZFS Drive Pools
To ensure long-term reliability and performance of your zpool, consider the following best practices:
- Use matched drives: Always use drives of the same size and performance characteristics, ideally the recommended replacement models.
- Monitor regularly: Use monitoring tools such as Zabbix or Prometheus, or even a simple cron job built around zpool status, to catch failures early (see the sketch after this list).
- Test drive replacements: In a lab environment, practice the drive replacement process so you're prepared when it happens in production.
- Maintain spares: Keep spare drives on hand for quick replacement, especially if the host is mission critical and must recover from a drive failure as quickly as possible.
- Enable alerts: Configure email or logging alerts to notify you of any zpool degradation or errors.
- Use persistent device names: Instead of bare device nodes such as /dev/daX or /dev/sdX, use stable identifiers (GPT labels or /dev/diskid/ entries on FreeBSD, /dev/disk/by-id/ on Linux) or rely on ZFS's GUID-based device identification to avoid confusion when device names change. This is especially visible in the example above, where the failed drive and its replacement were assigned the same device name.
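As a minimal sketch of the cron-based monitoring mentioned above, the /etc/crontab entry below mails the pool status only when something is wrong; the schedule, paths, and recipient address are placeholders (zpool status -x prints "all pools are healthy" when no action is needed):

0 * * * * root /sbin/zpool status -x | grep -qv 'all pools are healthy' && /sbin/zpool status | mail -s "zpool alert on $(hostname)" admin@example.com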