Linux Software RAID Mirror In-Place Upgrade

Ran out of space on an old CentOS 6.8 server over the weekend, so I had to upgrade the main data mirror from a pair of 2TB Hitachi HDDs to a pair of 4TB WD Reds I had lying around.

The volume was using mdadm, aka Linux Software RAID, and is a simple mirror (RAID1) with LVM volumes on top. The safest upgrade path is to build a new mirror on the new disks and sync the data across, but there weren't any free SATA ports on the motherboard, so instead I opted for an in-place upgrade. I hadn't done this for a good few years, and hit a couple of wrinkles along the way, so here are the notes from the journey.

Below, the physical disk partitions are /dev/sdb1 and /dev/sdd1, the mirror is /dev/md1, and the LVM volume group is extra.

1. Backup your data (or check you have known good rock-solid backups in place), because this is a complex process with plenty that could go wrong. You want an exit strategy.
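
Before going further, it's also worth capturing the current state of the array and the LVM stack on top of it, so you have a known-good reference to compare against later:

# Record the current state for reference
cat /proc/mdstat
mdadm --detail /dev/md1
pvdisplay /dev/md1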

2. Break the mirror, failing and removing one of the old disks:

mdadm --manage /dev/md1 --fail /dev/sdb1
mdadm --manage /dev/md1 --remove /dev/sdb1
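# Confirm the array is now running degraded: expect something like [2/1] [_U]
cat /proc/mdstat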

3. Shut down the server, remove the disk you've just failed, and insert your replacement. Boot up again.
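
One thing to watch: device names can shuffle around after a disk swap, so before partitioning anything it's worth double-checking that the new disk really did come up as /dev/sdb on your system:

# Check the new disk's device node and size
cat /proc/partitions
# The surviving member should still show up in the array
mdadm --detail /dev/md1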

4. Since this is a brand new disk, we need to partition it. And since it's a 4TB disk, it needs a GPT partition table, which means using parted rather than the older fdisk (MBR tops out at 2TiB).

parted /dev/sdb
print
mklabel gpt
# Create one big partition, leaving 1MB clear at the beginning and end
mkpart primary 1 -1
unit s
print
# Not sure if this flag is required, but whatever
set 1 raid on
quit
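
Before adding the new partition into the array, it doesn't hurt to sanity-check what parted created. The align-check subcommand should be available if your parted is version 2.1 or newer:

# Re-check the new partition table in sectors
parted /dev/sdb unit s print
# Verify partition 1 is optimally aligned (parted 2.1+)
parted /dev/sdb align-check optimal 1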

5. Then add the new partition into the mirror. Although the new partition is much bigger than the old one, the array will just resync at the old size, which is what we want for now.

mdadm --manage /dev/md1 --add /dev/sdb1
# This will take a few hours to resync, so let's keep an eye on progress
watch -n5 cat /proc/mdstat
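
If the resync is crawling, note that the kernel's rebuild speed limits are tunable. Something like the following should let it run faster, at the cost of more I/O load (values are in KiB/s per device):

# Check the current resync speed limits
sysctl dev.raid.speed_limit_min dev.raid.speed_limit_max
# Raise the floor if the resync is being throttled too aggressively
sysctl -w dev.raid.speed_limit_min=50000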

6. Once it's all resynced, rinse and repeat with the other disk: fail and remove /dev/sdd1, shut down and swap the new disk in, boot up again, partition the new disk, and add the new partition into the mirror.

7. Once everything has resynced again, you'll be back where you started: a nice stable mirror of your original size, but with shiny new hardware underneath. Now we can grow the mirror to take advantage of all this new space.

mdadm --grow /dev/md1 --size=max
mdadm: component size of /dev/md1 has been set to 2147479552K

Oops! That size doesn't look right: that's 2TB, but these are 4TB disks?! It turns out there's a 2TB limit with mdadm metadata version 0.90, which this mirror is using, as documented at https://raid.wiki.kernel.org/index.php/RAID_superblock_formats#The_version-0.90_Superblock_Format.

mdadm --detail /dev/md1
/dev/md1:
        Version : 0.90
  Creation Time : Thu Aug 26 21:03:47 2010
     Raid Level : raid1
     Array Size : 2147483648 (2048.00 GiB 2199.02 GB)
  Used Dev Size : -1
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 1
    Persistence : Superblock is persistent

    Update Time : Mon Nov 27 11:49:44 2017
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           UUID : f76c75fb:7506bc25:dab805d9:e8e5d879
         Events : 0.1438

    Number   Major   Minor   RaidDevice State
       0       8       17        0      active sync   /dev/sdb1
       1       8       49        1      active sync   /dev/sdd1

Unfortunately, mdadm doesn't support upgrading the metadata version in place. But there is a workaround documented on that wiki page, so let's try that:

mdadm --detail /dev/md1
# (as above)
# Stop/remove the mirror
mdadm --stop /dev/md1
mdadm: Cannot get exclusive access to /dev/md1:Perhaps a running process, mounted filesystem or active volume group?
# Okay, deactivate our volume group first
vgchange --activate n extra
# Retry stop
mdadm --stop /dev/md1
mdadm: stopped /dev/md1
# Recreate the mirror with 1.0 metadata (you can't go to 1.1 or 1.2, because
# those superblocks live at the start of the device rather than the end, so
# the existing data would no longer line up)
# Note that you should specify all your parameters in case the defaults have changed
mdadm --create /dev/md1 -l1 -n2 --metadata=1.0 --assume-clean --size=2147483648 /dev/sdb1 /dev/sdd1

That outputs:

mdadm: /dev/sdb1 appears to be part of a raid array:
      level=raid1 devices=2 ctime=Thu Aug 26 21:03:47 2010
mdadm: /dev/sdd1 appears to be part of a raid array:
      level=raid1 devices=2 ctime=Thu Aug 26 21:03:47 2010
mdadm: largest drive (/dev/sdb1) exceeds size (2147483648K) by more than 1%
Continue creating array? y
mdadm: array /dev/md1 started.

Success! Now let's reactivate that volume group again:

vgchange --activate y extra
3 logical volume(s) in volume group "extra" now active
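
Before going any further, a quick paranoia check that the data on top survived the recreate is in order. The data volume name below is hypothetical; use whatever logical volumes you actually have in the extra group:

# Sanity-check the LVM stack on top of the recreated mirror
pvdisplay /dev/md1
lvs extra
# Read-only fsck of one of the (still unmounted) volumes
fsck -n /dev/extra/data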

Another wrinkle is that recreating the mirror will have changed the array's UUID, so we need to update the stale UUID in /etc/mdadm.conf:

# Double-check the metadata version, and record the new array UUID
mdadm --detail /dev/md1
# Update the /dev/md1 entry UUID in /etc/mdadm.conf
$EDITOR /etc/mdadm.conf
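
An easy way to generate the replacement entry is mdadm's scan mode, which prints an up-to-date ARRAY line (including the new UUID) for each running array:

# Print current ARRAY lines to merge into /etc/mdadm.conf
mdadm --detail --scan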

So now, let's try that mdadm --grow command again:

mdadm --grow /dev/md1 --size=max
mdadm: component size of /dev/md1 has been set to 3907016564K
# Much better! This will take a while to resync again now:
watch -n5 cat /proc/mdstat

8. (You can wait for the resync to finish first, but that's optional.) Now we need to let LVM know that the physical volume underneath it has grown:

# Check our starting point
pvdisplay /dev/md1
--- Physical volume ---
PV Name               /dev/md1
VG Name               extra
PV Size               1.82 TiB / not usable 14.50 MiB
Allocatable           yes 
PE Size               64.00 MiB
Total PE              29808
Free PE               1072
Allocated PE          28736
PV UUID               mzLeMW-USCr-WmkC-552k-FqNk-96N0-bPh8ip
# Resize the LVM physical volume
pvresize /dev/md1
Read-only locking type set. Write locks are prohibited.
Can't get lock for system
Cannot process volume group system
Read-only locking type set. Write locks are prohibited.
Can't get lock for extra
Cannot process volume group extra
Read-only locking type set. Write locks are prohibited.
Can't get lock for #orphans_lvm1
Cannot process standalone physical volumes
Read-only locking type set. Write locks are prohibited.
Can't get lock for #orphans_pool
Cannot process standalone physical volumes
Read-only locking type set. Write locks are prohibited.
Can't get lock for #orphans_lvm2
Cannot process standalone physical volumes
Read-only locking type set. Write locks are prohibited.
Can't get lock for system
Cannot process volume group system
Read-only locking type set. Write locks are prohibited.
Can't get lock for extra
Cannot process volume group extra
Read-only locking type set. Write locks are prohibited.
Can't get lock for #orphans_lvm1
Cannot process standalone physical volumes
Read-only locking type set. Write locks are prohibited.
Can't get lock for #orphans_pool
Cannot process standalone physical volumes
Read-only locking type set. Write locks are prohibited.
Can't get lock for #orphans_lvm2
Cannot process standalone physical volumes
Failed to find physical volume "/dev/md1".
0 physical volume(s) resized / 0 physical volume(s) not resized

Oops - that doesn't look good. But it turns out it's just an unhelpful locking type default. If we tell pvresize it can use local filesystem write locks (locking_type 1; cf. /etc/lvm/lvm.conf) we should be good:
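
You can check what the host is actually configured with before overriding it:

# Show the configured locking type (1 = local file-based locking)
grep 'locking_type' /etc/lvm/lvm.conf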

# Let's try that again...
pvresize --config 'global {locking_type=1}' /dev/md1
Physical volume "/dev/md1" changed
1 physical volume(s) resized / 0 physical volume(s) not resized
# Double-check the new PV Size
pvdisplay /dev/md1
--- Physical volume ---
PV Name               /dev/md1
VG Name               extra
PV Size               3.64 TiB / not usable 21.68 MiB
Allocatable           yes 
PE Size               64.00 MiB
Total PE              59616
Free PE               30880
Allocated PE          28736
PV UUID               mzLeMW-USCr-WmkC-552k-FqNk-96N0-bPh8ip

Success!

Finally, you can now resize your logical volumes using lvresize as you usually would.
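
For example, to hand all the new free space to one of the logical volumes and grow its filesystem in the same step, something like this should do the trick. The data volume name is hypothetical, and the -r flag assumes an lvresize new enough to drive the filesystem resize via fsadm:

# Grow the (hypothetical) 'data' LV into all remaining free space,
# resizing the filesystem on top at the same time
lvresize -r -l +100%FREE /dev/extra/data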

pvmove Disk Migrations

Lots of people make use of Linux's LVM (Logical Volume Manager) to provide services such as disk volume resizing and snapshotting. But few people seem to know about the little pvmove utility, which offers a very powerful facility for migrating data between disk volumes on the fly.

Let's say, for example, that you have a disk volume you need to rebuild for some reason. Perhaps you want to change the RAID type you're using on it; perhaps you want to rebuild it using larger disks. Whatever the reason, you need to migrate all your data onto a temporary disk volume so you can rebuild the original one.

The standard way of doing this is probably to create a new filesystem on the new disk volume, and then copy or rsync all the data across. But how do you verify that you have all the data at the end of the copy, and that nothing changed on the original disk after the copy started? If a second rsync copies nothing new across, and the disk usage totals match exactly, and you remember to unmount the original disk straight away, you might have an exact copy. But if the data on the original disk is changing at all, getting a good copy of a large disk volume can be pretty tricky.

The elegant lvm/pvmove solution to this problem is this: instead of doing a userspace migration between disk volumes, you add the new volume into the existing volume group, and then tell lvm to move all the physical extents off the old physical volume. Under the hood, pvmove sets up a temporary mirror between the old extents and the new ones and drops the old side once the copy is complete, so the migration is handled entirely by lvm, without even needing to unmount the logical volume!

# Volume group 'extra' exists on physical volume /dev/sdc1
$ lvs
  LV   VG     Attr   LSize   Origin Snap%  Move Log Copy%  Convert
  data extra  -wi-ao 100.00G

# Add new physical volume /dev/sdd1 into the volume group
$ vgextend extra /dev/sdd1
  Volume group "extra" successfully extended
$ vgs
  VG     #PV #LV #SN Attr   VSize   VFree
  extra    2   1   0 wz--n- 200.00G 100.00G

# Use pvmove to move all physical extents off the old /dev/sdc1 (verbose mode)
$ pvmove -v /dev/sdc1
# Lots of output in verbose mode ...

# Done - drop the old disk out of the volume group, then wipe its PV label
$ vgreduce extra /dev/sdc1
  Removed "/dev/sdc1" from volume group "extra"
$ pvremove /dev/sdc1
  Labels on physical volume "/dev/sdc1" successfully wiped
$ vgs
  VG     #PV #LV #SN Attr   VSize   VFree
  extra    1   1   0 wz--n- 100.00G      0
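
One more property worth knowing about: pvmove checkpoints its progress in the LVM metadata as it goes, so an interrupted move isn't fatal:

# Resume any interrupted pvmove
$ pvmove
# Or abandon the move and roll the extents back to the source volume
$ pvmove --abort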

The joys of Linux.