zpool mirror to raidz

5 July 2011.

I use ZFS on FreeBSD. I have two disks in a mirror. I've bought two more disks and want to move to a four-disk raidz, and I want to do this in-place. Sadly, ZFS doesn't let me convert a mirror into a raidz, so the plan is:

Break the mirror.
Create a raidz of three disks and one fake device.
Offline the fake device.
Copy all the data over.
Delete the mirror.
Replace the offlined device with the fourth disk.

This process totally works.

Here's the longer version:

Step 0: Burn-in.

Last time I bought two disks from the local shop, half of them failed within a few days. Opinions vary on how, and for how long, you should burn in disks. I spent about a week reading, writing, and generally spinning the new disks until I decided I wanted to move on.

Step 1: Backups.

This method worked for me. You might not be so lucky. If you lose all your data, it's not my fault.

Do you have backups? You should.

If I had proper backups, I probably wouldn't bother with all the contortions needed to upgrade in-place. I'd blow away the mirror, create a new pool, and just restore.

Step 2: Prepare the new disks.

I have /dev/ada0 and ada1.
I bought ada2 and ada3.

On the existing disks, I have UFS root and boot filesystems, and also swap. So the first thing I'm going to do is copy my existing labels/partitions to the new disks:

# fdisk -BI /dev/ada2
# bsdlabel -w -B /dev/ada2s1
# bsdlabel -e /dev/ada2s1
:r !bsdlabel /dev/ada0s1
# gmirror insert boot ada2s1a
# gmirror insert root ada2s1b
# gmirror insert swap ada2s1d

(and again for ada3)

Yes, these are now 4-way mirrors. This topology has excellent availability, unless your fans fail and all four disks overheat simultaneously (this happened to my dad, except with six disks).

Step 3: Dry run.

You know you can create a zpool backed by files? This is pretty handy for testing. I was going to use a file as a bogus disk to create the degraded raidz, so I decided to test this approach first:

# mkdir /dry
# truncate -s 2T /dry/d{0,1,2,3,-broken}
# zpool create planner raidz /dry/d{0,-broken,2,3}
# zpool offline planner /dry/d-broken # ← kernel panic!

At the time I did this (FreeBSD 8.2-RELEASE), offlining a file-backed vdev reliably caused a kernel panic. My workaround was to use a vnode-backed md, but I'm glad I found this out prior to breaking my zpool.

Step 4: Create the new zpool.

Create md0, backed by a sparse file. (You can determine the exact size from the appropriate geom module's list command.)

# truncate -s 1993327292416 /bad-disk-backing
# mdconfig -a -t vnode -S 4096 -f /bad-disk-backing
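
To avoid copying the size by hand, here is a minimal sketch that pulls the Mediasize of one provider out of a geom class's list output. The function name mediasize_of is made up for illustration, and I'm assuming the usual "N. Name: ..." / "Mediasize: ..." layout and that gpart is the right class for your partitioning scheme:

```sh
# mediasize_of PROVIDER: read geom "list" output on stdin and print the
# Mediasize (in bytes) of the named provider.
mediasize_of() {
    awk -v want="$1" '
        $2 == "Name:" { current = $3 }      # remember the most recent provider name
        $1 == "Mediasize:" && current == want { print $2; exit }
    '
}

# Hypothetical usage (the geom class and names depend on your setup):
#   gpart list ada0s1 | mediasize_of ada0s1e
#   truncate -s "$(gpart list ada0s1 | mediasize_of ada0s1e)" /bad-disk-backing
```

The fake device should be exactly as big as the real partitions, so it is worth checking the number against each of the disks going into the raidz.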

I currently have:

# zpool status
  pool: tank
 state: ONLINE
 scrub: none requested
config:

        NAME             STATE     READ WRITE CKSUM
        tank             ONLINE       0     0     0
          mirror         ONLINE       0     0     0
            ada0s1e      ONLINE       0     0     0
            ada1s1e      ONLINE       0     0     0

Break the mirror:

# zpool detach tank ada0s1e

At this point you have no redundancy. If you get disk errors between now and when the resilver finishes at the end of this exercise, you are going to have real data loss.

Create the new zpool:

# zpool create tank2 raidz /dev/{ada0s1e,md0,ada2s1e,ada3s1e}
# zpool offline tank2 md0 # ← IMPORTANT
# zpool status
  pool: tank
 state: ONLINE
 scrub: none requested
config:

        NAME           STATE     READ WRITE CKSUM
        tank           ONLINE       0     0     0
          ada1s1e      ONLINE       0     0     0

errors: No known data errors

  pool: tank2
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
 scrub: none requested
config:

        NAME             STATE     READ WRITE CKSUM
        tank2            DEGRADED     0     0     0
          raidz1         DEGRADED     0     0     0
            ada0s1e      ONLINE       0     0     0
            md0          OFFLINE      0     0     0
            ada2s1e      ONLINE       0     0     0
            ada3s1e      ONLINE       0     0     0

errors: No known data errors

I changed some ZFS settings at this point, but I don't think this was necessary because zfs recv would have sorted it out. I'm including them for completeness:

# zfs set compression=on tank2
# zfs set atime=off tank2
# zfs set setuid=off tank2

Step 5: Copy everything.

If you have a script that rotates snapshots, at this point you should consider stopping it, or at least disabling the code that deletes snapshots. You should also consider firing up tmux or screen because the copy is going to take a long time and you might need to detach.

Drop a new snapshot and use it to copy all ZFSes and snapshots from the first pool:

# zfs get all >/zfs-get-all.txt
# zfs snapshot -r tank@copier1
# zfs send -R tank@copier1 | zfs recv -v -F -d tank2
receiving full stream of tank@2011-07-03-1441 into tank2@2011-07-03-1441
received 260KB stream in 2 seconds (130KB/sec)
receiving incremental stream of tank@copier1 into tank2@copier1
received 312B stream in 1 seconds (312B/sec)
[...]

It's safe to keep using the original pool and even creating new snapshots. We'll catch up later with a second copy operation.

It occurred to me later that I hadn't created any zvols. My guess is they would copy over just like everything else.

Step 6: Copy everything else.

At this point, stop all background processes, unmount all NFS mounts, and stop writing to anything on the original zpool. Basically, go into single user mode. Then do a final catch-up copy:

# zfs snapshot -r tank@copier2
# zfs rollback tank2@copier1
# zfs send -R -I tank@copier1 tank@copier2 | zfs recv -v -d tank2

Note that the capital -I sends every intermediate snapshot taken between copier1 and copier2, not just the difference between the two.
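
Before switching over, it may be worth confirming that every snapshot actually arrived. A minimal sketch, assuming you have saved the snapshot names of each pool to a file; the function names, the /tmp filenames, and any dataset names other than the pool roots are hypothetical:

```sh
# strip_pool POOL: read dataset/snapshot names on stdin, drop the leading
# pool name so names from two pools can be compared, and sort them.
strip_pool() {
    sed "s|^$1\([/@]\)|\1|" | sort
}

# missing_snapshots SRCPOOL SRCLIST DSTPOOL DSTLIST: print the snapshots
# that appear in SRCLIST but have no counterpart in DSTLIST.
missing_snapshots() {
    src=$(mktemp)
    dst=$(mktemp)
    strip_pool "$1" < "$2" > "$src"
    strip_pool "$3" < "$4" > "$dst"
    comm -23 "$src" "$dst"      # lines only in the source list
    rm -f "$src" "$dst"
}

# Hypothetical usage:
#   zfs list -H -t snapshot -o name -r tank  >/tmp/snaps-tank
#   zfs list -H -t snapshot -o name -r tank2 >/tmp/snaps-tank2
#   missing_snapshots tank /tmp/snaps-tank tank2 /tmp/snaps-tank2
```

No output means every snapshot name on the old pool also exists on the new one. This only compares names, not contents.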

Step 7: Switch over.

I was super paranoid, so I did this much more cautiously than what I think was strictly necessary. For completeness, here's what I did, followed by what I think I could have done:

# umount -f /usr/local # ← this was a zfs on /tank
# zpool export tank
# zpool import tank oldtank

# zpool export tank2
# mdconfig -d -u 0 # ← don't want it to resilver
# zpool import tank2 tank

(realized I outsmarted myself here,
have to bounce oldtank so I can destroy it)

# zpool export oldtank
# zpool import oldtank
# zpool destroy oldtank

(next, pick up all the mountpoints from new tank)

# zfs mount -a

(spot check to make sure everything looks ok)

# reboot

---8<--- CUT HERE ---8<---

Here's what I think would have been sufficient:

# zpool destroy tank
# zpool export tank2
# mdconfig -d -u 0
# zpool import tank2 tank

Step 8: Resilver.

Finally, replace the offline component with the remaining disk:

# zpool replace tank md0 ada1s1e
# zpool status
  pool: tank
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress for 0h1m, 0.68% done, 4h8m to go
config:

        NAME               STATE     READ WRITE CKSUM
        tank               DEGRADED     0     0     0
          raidz1           DEGRADED     0     0     0
            ada0s1e        ONLINE       0     0     0
            replacing      DEGRADED     0     0     0
              md0          OFFLINE      0     0     0
              ada1s1e      ONLINE       0     0     0  2.86G resilvered
            ada2s1e        ONLINE       0     0     0
            ada3s1e        ONLINE       0     0     0

errors: No known data errors

The resilver takes less time than the copy because it only has to write out a third as much data, and can read in parallel.

I compared zfs get all before and after, and there were no differences. The copying process takes care of all the appropriate ZFS settings.
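
If you'd rather compare only the properties that were set by hand, here is a minimal sketch that filters saved zfs get all output down to the rows whose SOURCE column is local. The function name local_props and the /zfs-get-all-after.txt filename are made up for illustration, and it assumes the default four-column output (NAME, PROPERTY, VALUE, SOURCE):

```sh
# local_props [FILE...]: print "name property value" for every row of
# saved `zfs get all` output whose last column (SOURCE) is "local".
# Reads stdin when no file is given; column spacing collapses to single spaces.
local_props() {
    awk '$NF == "local" { $NF = ""; sub(/ +$/, ""); print }' "$@"
}

# Hypothetical usage, after saving a second listing from the new pool:
#   zfs get all >/zfs-get-all-after.txt
#   local_props /zfs-get-all.txt       >/tmp/props-before
#   local_props /zfs-get-all-after.txt >/tmp/props-after
#   diff /tmp/props-before /tmp/props-after
```

Filtering on local skips rows such as used, available, and creation, which describe the pool's current state rather than anything you configured.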