Identify and replace a failed disk in Linux RAID
Posted on 2015-03-17 20:58:07:
EDIT 05/05/2020 - I realised this guide was a little confusing, so hopefully this version is easier to follow.
How Do I Tell If A Hard Disk Has Failed?
If a disk has failed, you will probably find a lot of error messages in the log files, e.g.
/var/log/messages
or /var/log/syslog
and if you have set up email alerts, you will also receive an email notification.
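If you have not set that up yet, mdadm's monitor mode can send the mail itself; a minimal sketch, assuming a Debian-style config file and an illustrative address, is to add a MAILADDR line to /etc/mdadm/mdadm.conf:
MAILADDR admin@example.com
and then test delivery with
mdadm --monitor --scan --oneshot --test
which sends a test alert for each array it finds.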
You can also run
cat /proc/mdstat
and instead of the string [UU]
you will see [U_]
if you have a degraded RAID1 array.
The steps are: fail the disk, remove it, install the new disk, partition it to match the old one, and then add it back to the array using the
mdadm
utility.
Step 1:
For this example, assume /dev/sdb has failed, and that it has two partitions, /dev/sdb1 and /dev/sdb2.
Using the 'mdadm' utility, fail the first partition (/dev/sdb1) which is part of /dev/md0:
mdadm --manage /dev/md0 --fail /dev/sdb1
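The command should print a short confirmation, something like:
mdadm: set /dev/sdb1 faulty in /dev/md0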
Step 2:
Using the 'mdadm' utility, fail the second partition (/dev/sdb2) which is part of /dev/md1:
mdadm --manage /dev/md1 --fail /dev/sdb2
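At this point the partitions are still listed in the arrays but are marked faulty; you can confirm with
cat /proc/mdstat
which will show them with an (F) marker, e.g. sdb1[2](F).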
Step 3:
Using the 'mdadm' utility, remove the failed partitions from their arrays.
mdadm --manage /dev/md0 --remove /dev/sdb1
Output:
mdadm: hot removed /dev/sdb1
mdadm --manage /dev/md1 --remove /dev/sdb2
Output:
mdadm: hot removed /dev/sdb2
Step 4:
Replace the physical disk, and partition it exactly the same as the disk you removed, or simply copy the layout of the remaining disk, as shown below.
Set the new partitions to the 'Linux raid autodetect' type.
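For MBR disks like the ones in this example, one way to copy the layout is the same sfdisk pipe the old method below uses, which also carries the 'Linux raid autodetect' (type fd) partition type over to the new disk:
sfdisk -d /dev/sda | sfdisk /dev/sdb
(For GPT disks, sgdisk has an equivalent replicate option, but that is outside the scope of this example.)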
Step 5:
Using the 'mdadm' utility, add the new disk / partitions to the array.
mdadm --manage /dev/md0 --add /dev/sdb1
Output:
mdadm: re-added /dev/sdb1
mdadm --manage /dev/md1 --add /dev/sdb2
Output:
mdadm: re-added /dev/sdb2
Once you have added the partitions back to the array, the rebuild process will automatically begin.
You can monitor this process with the command
cat /proc/mdstat
which will show the rebuild progress. Only one partition will rebuild at a time, so you will have to wait for both to complete before the job is finished.
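If you would rather not keep re-running the command by hand, watch will refresh it for you at an interval of your choosing:
watch -n 5 cat /proc/mdstat
and mdadm --detail /dev/md0 gives a more verbose per-array view of the same progress.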
If the RAID is bootable, you will need to reinstall grub - ensure it is installed to *both* disks.
On a Debian system, the best way to do this is to use 'dpkg-reconfigure', like this:
dpkg-reconfigure grub-pc
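On other distributions, or if you prefer to do it by hand, running grub-install against each disk should achieve the same result (adjust the device names to your system):
grub-install /dev/sda
grub-install /dev/sdb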
One final note - don't forget to ensure the new partition you created has the bootable flag set.
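If the flag is missing, parted can set it; a sketch assuming partition 1 is the boot partition, as in this example:
parted /dev/sdb set 1 boot on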
----------------------------------------------------------------------------------------------------------------------------------
Old method below.
----------------------------------------------------------------------------------------------------------------------------------
1 Preliminary Note
In this example I have two hard drives, /dev/sda and /dev/sdb, with the partitions /dev/sda1 and /dev/sda2 as well as /dev/sdb1 and /dev/sdb2.
/dev/sda1 and /dev/sdb1 make up the RAID1 array /dev/md0. /dev/sda2 and /dev/sdb2 make up the RAID1 array /dev/md1.
/dev/sda1 + /dev/sdb1 = /dev/md0
/dev/sda2 + /dev/sdb2 = /dev/md1
/dev/sdb has failed, and we want to replace it.
2 How Do I Tell If A Hard Disk Has Failed?
If a disk has failed, you will probably find a lot of error messages in the log files, e.g. /var/log/messages or /var/log/syslog.
You can also run
cat /proc/mdstat
and instead of the string [UU] you will see [U_] if you have a degraded RAID1 array.
3 Removing The Failed Disk
To remove /dev/sdb, we will mark /dev/sdb1 and /dev/sdb2 as failed and remove them from their respective RAID arrays (/dev/md0 and /dev/md1).
First we mark /dev/sdb1 as failed:
mdadm --manage /dev/md0 --fail /dev/sdb1
The output of
cat /proc/mdstat
should look like this:
server1:~# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid5] [raid4] [raid6] [raid10]
md0 : active raid1 sda1[0] sdb1[2](F)
      24418688 blocks [2/1] [U_]

md1 : active raid1 sda2[0] sdb2[1]
      24418688 blocks [2/2] [UU]

unused devices: <none>
Then we remove /dev/sdb1 from /dev/md0:
mdadm --manage /dev/md0 --remove /dev/sdb1
The output should be like this:
server1:~# mdadm --manage /dev/md0 --remove /dev/sdb1 mdadm: hot removed /dev/sdb1
And
cat /proc/mdstat
Should show this:
server1:~# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid5] [raid4] [raid6] [raid10]
md0 : active raid1 sda1[0]
      24418688 blocks [2/1] [U_]

md1 : active raid1 sda2[0] sdb2[1]
      24418688 blocks [2/2] [UU]

unused devices: <none>
Now we do the same steps again for /dev/sdb2 (which is part of /dev/md1):
mdadm --manage /dev/md1 --fail /dev/sdb2
Examine the RAID status with 'cat /proc/mdstat'
server1:~# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid5] [raid4] [raid6] [raid10]
md0 : active raid1 sda1[0]
      24418688 blocks [2/1] [U_]

md1 : active raid1 sda2[0] sdb2[2](F)
      24418688 blocks [2/1] [U_]

unused devices: <none>
Then we remove /dev/sdb2 from /dev/md1:
mdadm --manage /dev/md1 --remove /dev/sdb2
server1:~# mdadm --manage /dev/md1 --remove /dev/sdb2
mdadm: hot removed /dev/sdb2
Once again examine the RAID status with 'cat /proc/mdstat'
server1:~# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid5] [raid4] [raid6] [raid10]
md0 : active raid1 sda1[0]
      24418688 blocks [2/1] [U_]

md1 : active raid1 sda2[0]
      24418688 blocks [2/1] [U_]

unused devices: <none>
Now power down the system:
shutdown -h now
and replace the old /dev/sdb hard drive with a new one (it must have at least the same size as the old one - if it's only a few MB smaller than the old one then rebuilding the arrays will fail).
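Once the new disk is in and the system is back up, it is worth confirming that it really is at least as big as the surviving disk; blockdev prints the raw size in bytes:
blockdev --getsize64 /dev/sda
blockdev --getsize64 /dev/sdb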
4 Adding The New Hard Disk
After you have changed the hard disk /dev/sdb, boot the system.
The first thing we must do now is to create the exact same partitioning as on /dev/sda. We can do this with one simple command:
sfdisk -d /dev/sda | sfdisk /dev/sdb
You can run
fdisk -l
to check if both hard drives have the same partitioning now.
Next we add /dev/sdb1 to /dev/md0 and /dev/sdb2 to /dev/md1:
mdadm --manage /dev/md0 --add /dev/sdb1
server1:~# mdadm --manage /dev/md0 --add /dev/sdb1
mdadm: re-added /dev/sdb1
mdadm --manage /dev/md1 --add /dev/sdb2
server1:~# mdadm --manage /dev/md1 --add /dev/sdb2
mdadm: re-added /dev/sdb2
Now both arrays (/dev/md0 and /dev/md1) will be synchronized. Run
cat /proc/mdstat
to see when it's finished.
During the synchronization the output will look like this:
server1:~# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid5] [raid4] [raid6] [raid10]
md0 : active raid1 sda1[0] sdb1[1]
      24418688 blocks [2/1] [U_]
      [=>...................]  recovery =  9.9% (2423168/24418688) finish=2.8min speed=127535K/sec

md1 : active raid1 sda2[0] sdb2[1]
      24418688 blocks [2/1] [U_]
      [=>...................]  recovery =  6.4% (1572096/24418688) finish=1.9min speed=196512K/sec

unused devices: <none>
When the synchronization is finished, the output will look like this:
server1:~# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid5] [raid4] [raid6] [raid10]
md0 : active raid1 sda1[0] sdb1[1]
      24418688 blocks [2/2] [UU]

md1 : active raid1 sda2[0] sdb2[1]
      24418688 blocks [2/2] [UU]

unused devices: <none>
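For a more detailed confirmation than /proc/mdstat, mdadm --detail summarises each array and should report a clean state with both member devices listed:
mdadm --detail /dev/md0
mdadm --detail /dev/md1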
That's it, you have successfully replaced /dev/sdb!
Tags: Storage, Raid, Linux