Robbat2 (robbat2) wrote,
Linux MD RAID devices and moving spares to missing slots

Setting up the storage on my new machine, I just ran into something really interesting: what seems to be deliberate, usable, and useful, but completely undocumented, functionality in the MD RAID layer.

It's possible to create RAID devices with 'missing' slots in the initial array, and then add the devices for those missing slots later. RAID1 lets you have one or more missing devices, RAID5 only one, RAID6 one or two, and RAID10 up to half of the total. That functionality is documented both in the kernel's Documentation/md.txt and in the manpage for mdadm.
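Those per-level limits can be summarized in a few lines of code. This is my own summary of the rules stated above, not anything shipped with mdadm:

```python
def max_missing(level: str, n: int) -> int:
    """Maximum number of 'missing' slots allowed at creation time for
    an n-device array, per my reading of the md/mdadm docs; this table
    is illustrative, not authoritative."""
    limits = {
        "raid1": n - 1,    # one or more: all but one device may be missing
        "raid5": 1,        # only one
        "raid6": 2,        # one or two
        "raid10": n // 2,  # up to half of the total
    }
    return limits[level]

print(max_missing("raid10", 4))  # prints 2: a 4-device RAID10 can start with 2 missing
```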

What isn't documented is how, when you later add devices, to get them to take up the 'missing' slots rather than remain as spares. Nothing in md(7), mdadm(8), or Documentation/md.txt covers it. Nothing I tried with mdadm could do it either, leaving only the sysfs interface for the RAID device.

Documentation/md.txt does describe the sysfs interface in detail, but it seems to have some omissions and outdated material - the code has moved on, and the documentation hasn't caught up yet.

So, below the jump, I present my small HOWTO on creating a RAID10 with missing devices and how to later add them properly.

MD with missing devices HOWTO

We're going to create /dev/md10 as a RAID10, starting with two missing devices. In the example here, I use 4 loopback devices of 512MiB each: /dev/loop[1-4], but you should just substitute your real devices.

# mdadm --create /dev/md10 --level 10 -n 4 /dev/loop1 missing /dev/loop3 missing -x 0
mdadm: array /dev/md10 started.
# cat /proc/mdstat 
Personalities : [raid1] [raid10] [raid0] [raid6] [raid5] [raid4] 
md10 : active raid10 loop3[2] loop1[0]
      1048448 blocks 64K chunks 2 near-copies [4/2] [U_U_]
# mdadm --manage --add /dev/md10 /dev/loop2 /dev/loop4
mdadm: added /dev/loop2
mdadm: added /dev/loop4
# cat /proc/mdstat 
Personalities : [raid1] [raid10] [raid0] [raid6] [raid5] [raid4] 
md10 : active raid10 loop4[4](S) loop2[5](S) loop3[2] loop1[0]
      1048448 blocks 64K chunks 2 near-copies [4/2] [U_U_]

Notice that the two new devices have been added as spares [denoted by the "(S)"], and that the array remains degraded [denoted by the underscores in the "[U_U_]"]. Now it's time to break out the sysfs interface.

# cd /sys/block/md10/md/
# grep . dev-loop*/{slot,state}
dev-loop1/slot:0
dev-loop2/slot:none
dev-loop3/slot:2
dev-loop4/slot:none
dev-loop1/state:in_sync
dev-loop2/state:spare
dev-loop3/state:in_sync
dev-loop4/state:spare

Now, a short foray into how MD RAID sees component devices. For an array with N devices total, there are slots numbered 0 to N-1. If all the devices are present, there are no empty slots. The presence or absence of a device in each slot is shown by /proc/mdstat: [U_U_] means we have devices in slots 0 and 2, and nothing in slots 1 and 3. The mdstat output also includes a slot number after each device on the listing line: md10 : active raid10 loop4[4](S) loop2[5](S) loop3[2] loop1[0]. loop4 and loop2 are in slots 4 and 5, both spare; loop1 and loop3 are in slots 0 and 2. The slot numbers greater than N-1 seem to be extraneous; I'm not sure if they are just an mdadm abstraction, or exist only in the kernel internals.
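The slot bookkeeping above can be made concrete with a small parser for the two mdstat fields. This is an illustrative sketch of my own, not anything mdadm provides:

```python
import re

def parse_devices(device_list: str):
    """Parse an mdstat device list like 'loop4[4](S) loop1[0]' into
    (name, slot, is_spare) tuples."""
    return [(m.group(1), int(m.group(2)), m.group(3) == "(S)")
            for m in re.finditer(r"(\w+)\[(\d+)\](\(S\))?", device_list)]

def missing_slots(status: str):
    """Return the empty slot indices from a status string like '[U_U_]'."""
    return [i for i, c in enumerate(status.strip("[]")) if c == "_"]

# The degraded array from the transcript above:
print(parse_devices("loop4[4](S) loop2[5](S) loop3[2] loop1[0]"))
# spares sit in slots 4 and 5, beyond the array's real slots 0..3
print(missing_slots("[U_U_]"))
# prints [1, 3]: the slots with no device
```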

Now we want to fix up the array by promoting both spares into the missing slots. This is the first place where Documentation/md.txt is really wrong. The description for the slot sysfs node says: "This can only be set while assembling an array." That is incorrect: we CAN write to it and fix our array.

# echo 1 >dev-loop2/slot
# echo 3 >dev-loop4/slot
# grep . dev-loop*/slot
dev-loop1/slot:0
dev-loop2/slot:1
dev-loop3/slot:2
dev-loop4/slot:3
# cat /proc/mdstat
Personalities : [raid1] [raid10] [raid0] [raid6] [raid5] [raid4] 
md10 : active raid10 loop4[4] loop2[5] loop3[2] loop1[0]
      1048448 blocks 64K chunks 2 near-copies [4/2] [U_U_]

The slot numbers have changed in both the mdstat output and sysfs, but they no longer match at all. The spare marker "(S)" has also vanished. Now we can follow the sysfs documentation and force a rebuild using the sync_action node.
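If you have more than a couple of spares to promote, the pairing can be scripted. A sketch that emits the sysfs writes rather than performing them, so it is safe to inspect first; the device names, slot numbers, and sysfs path are assumptions matching the example array:

```python
def promotion_commands(spares, missing, sysfs="/sys/block/md10/md"):
    """Pair each spare device with a missing slot and return the shell
    commands that would promote it, in the same form used interactively
    above. Illustrative only; check the pairing before running anything."""
    if len(spares) > len(missing):
        raise ValueError("more spares than missing slots")
    return [f"echo {slot} > {sysfs}/dev-{dev}/slot"
            for dev, slot in zip(spares, missing)]

for cmd in promotion_commands(["loop2", "loop4"], [1, 3]):
    print(cmd)
# echo 1 > /sys/block/md10/md/dev-loop2/slot
# echo 3 > /sys/block/md10/md/dev-loop4/slot
```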

In theory, the mdadm monitor daemon, if running, should have detected that the array was degraded and had valid spares; I don't know why it didn't. Perhaps another bug to trace down later.

# echo repair >sync_action 
(wait a moment)
# cat /proc/mdstat
Personalities : [raid1] [raid10] [raid0] [raid6] [raid5] [raid4] 
md10 : active raid10 loop4[4] loop2[5] loop3[2] loop1[0]
      1048448 blocks 64K chunks 2 near-copies [4/2] [U_U_]
      [=============>.......]  recovery = 65.6% (344064/524224) finish=0.1min speed=22937K/sec

The slot numbers still aren't what we set them to, but the array is still busy rebuilding.

# cat /proc/mdstat 
Personalities : [raid1] [raid10] [raid0] [raid6] [raid5] [raid4] 
md10 : active raid10 loop4[3] loop2[1] loop3[2] loop1[0]
      1048448 blocks 64K chunks 2 near-copies [4/4] [UUUU]

Now that the rebuild is complete, the slot numbers have flipped to their correct values.

Bonus: regular maintenance ideas

While we can regularly check individual disks with the daemon part of smartmontools, issuing short and long disk tests, there is also a way to check entire arrays for consistency.

The only way to do it with mdadm is to force a rebuild, but that isn't really a nice proposition if it picks a disk that was about to fail as one of the 'good' disks. sysfs to the rescue again: there is a non-destructive way to test an array, and only promote to repair mode if there is an issue.

# echo check >sync_action 
(wait a moment)
# cat /proc/mdstat
Personalities : [raid1] [raid10] [raid0] [raid6] [raid5] [raid4] 
md10 : active raid10 loop4[3] loop2[1] loop3[2] loop1[0]
      1048448 blocks 64K chunks 2 near-copies [4/4] [UUUU]
      [============>........]  check = 62.8% (660224/1048448) finish=0.0min speed=110037K/sec

Either make a cronjob to do it, or put the functionality into mdadm. You can safely issue the check command to multiple md devices at once; the kernel will ensure that it doesn't simultaneously check arrays that share the same disks.
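The cronjob idea above could be sketched like this. It takes the sysfs root as a parameter, so you can point it at a scratch directory to see what it would do before running it as root against the real /sys; the "idle" guard and the path layout are my assumptions about typical md sysfs behaviour:

```python
import glob
import os

def check_all_arrays(sysfs_root="/sys/block"):
    """Kick off a non-destructive 'check' on every md array found under
    sysfs_root, skipping arrays that are already busy syncing. Returns
    the sync_action paths that were written. Sketch of the cronjob idea
    above; run as root on a real system."""
    started = []
    pattern = os.path.join(sysfs_root, "md*", "md", "sync_action")
    for action in sorted(glob.glob(pattern)):
        with open(action) as f:
            if f.read().strip() != "idle":
                continue  # already checking/repairing/resyncing
        with open(action, "w") as f:
            f.write("check")
        started.append(action)
    return started
```

To dry-run it, build a fake tree such as scratch/md0/md/sync_action containing "idle" and point sysfs_root at it.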

Tags: gentoo, linux, raid
Comments
You should submit that to lkml for inclusion into the docs!
mdadm seems to hot add things in for me on raid1:

mythtv test $ mdadm --create -n 2 -l 1 /dev/md2 /dev/loop1 missing
mdadm: array /dev/md2 started.
mythtv test $ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md2 : active raid1 loop1[0]
102336 blocks [2/1] [U_]

unused devices: <none>
mythtv test $ mdadm --add /dev/md2 /dev/loop2
mdadm: added /dev/loop2
mythtv test $ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md2 : active raid1 loop2[1] loop1[0]
102336 blocks [2/2] [UU]

unused devices: <none>
mythtv test $
Ok, testing with RAID1 appears to auto-rebuild, but RAID10 does not.

# for i in 0 1 2 3 ; do dd if=/dev/zero of=/block.$i bs=1M count=128 ; losetup /dev/loop${i} /block.$i ; done ;
# mdadm --create -n 4 -l 10 /dev/md99 missing /dev/loop1 missing /dev/loop3
# mdadm --add /dev/md99 /dev/loop0
# grep '(S)' /proc/mdstat
md99 : active raid10 loop1[4](S) loop4[3] loop2[1]
I migrated a real-world server from JBOD to RAID5 by installing 2 hard drives the same size as the primary and initializing them as a RAID5 with one disk missing. I then copied the data across, verified it, remounted the filesystem, and hot-added the old primary to the new RAID5.

Anonymous, November 8 2008, 03:24:05 UTC:

Looks like a known bug.

http://bugzilla.kernel.org/show_bug.cgi?id=11967

Helluva workaround though..
Fun, upstream dismissed it as being a bug originally - and I just moved to 3ware hardware for myself instead.
This looks like exactly the solution to the problem I'm currently dealing with, only there is no 'md' node in my /sys/block/md10 dir. All I find there are files called 'dev', 'range', 'size' and 'stat'.

I fear that this server might have too old a version of sysfs in it.
What kernel is on that server?
It's running a 3.6.3 Mandriva kernel.
uname -a please, the distro version says nothing.
Oops. Sorry. Typo above. I was trying to say 2.6.3 kernel. Specifically uname -a says '2.6.3-4mdk'

Ok, that's absolutely ancient. What was the build date in the uname -a string?
2.6.3 was released in Feb 2004.
Machine isn't running right now, but I think it was bought in 2000 and probably hasn't been upgraded since around 2005. That's when Mandrake became Mandriva and the upgrade was known to be problematical, so it was never done.

I wonder if there is a live distro out there with a sufficiently recent kernel, and raid support that I could use to do the trick above? If I understand correctly, once I've gotten the right values into the superblocks, rebuilt the array and resynched, I should be able to boot up on my old kernel and have things still work.

Then I can look into upgrading to a newer kernel.


Very interesting post.
I'm trying to "revive" a missing raid6 following your procedure and other ones.
But when I try to 'echo 1 >dev-sda1/slot' I get a "no space left on device" message and a write error.
Any idea why I cannot write to this file?
This is kernel 2.6.27.19-5-default #1 SMP 2009-02-28 04:40:21 +0100 x86_64 x86_64 x86_64 GNU/Linux
It was a silly thing. Just use echo -n.

Finally, I recovered the raid6.
# echo -n 1 >dev-sda1/slot (the first disk was out, so sda1 is in slot 1, not 0)
The rest of the disks automatically occupied the rest of the slots.
# echo -n clean >array_state
And the raid6 is running again.
Now I'm trying to mount the fs stored in lvm...