
Solaris 9 Volume Manager Problems

steloflute 2014. 3. 21. 22:58

If cc is not available, you can use gcc instead:

 

gcc -fPIC -c assert.c

 

The fixed assert.c should look like this:

 

#include <stdio.h>

/* Replacement __assert() that prints the assertion message but does not abort. */
void __assert(int a) {
    fprintf(stderr, "assertion failed: mdrcp->colnamep->start_blk <= rcp->un_orig_devstart\n");
}
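
The article below links the object file with Sun's ld; with gcc, the remaining link and install steps might look like the following (a sketch only, the flags are assumptions and were not verified on Solaris 9):

# build the shared object with the soname libassert.so.1
gcc -shared -Wl,-h,libassert.so.1 -o libassert.so.1 assert.o
# install it where LD_PRELOAD will look for it
cp libassert.so.1 /usr/lib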

 

 

 

http://www.adap.org/~edsel/blog/archives/59

 

 

Yesterday at work, I noticed a file system filling up. This file system happened to be on a RAID5 volume on a Solaris 9 system (using Solaris Volume Manager). After freeing up some space I decided to check the health of the RAID5 volume (I know I should have this automated) using metastat. To my surprise, one of the slices in the 3-slice RAID5 metadevice was in maintenance state. System messages indicated that the drive had failed.

I wasn’t overly concerned. I had a hotspare pool that kicked in to support that RAID5 volume. I could endure two more drive failures on that volume before suffering any data loss. But I needed to replace the failed drive.

The failed drive was an old 36 GB Seagate SCSI drive which was no longer available. I hunted for a spare, but only found a 146 GB Seagate SCSI drive, so I thought I could use that. As long as the slice allocated for the RAID5 volume was the same size, I should be OK (or so I thought).

I took out the broken disk, and replaced it with the new larger drive. The system recognized it without any problems so I proceeded to partition the drive using Solaris’ format utility. The slice used in the original drive was slice 0. I partitioned the new drive with an identically sized slice 0.
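
To double-check that the new slice 0 matched the original allocation, the VTOC can be inspected and the sector counts compared (an illustrative aside; c1t11d0 is the device name used later in this article):

prtvtoc /dev/rdsk/c1t11d0s2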

There were also s6 and s7 slices on the original drive, used for metadbs, so I created those as well. After replacing the drive, those metadbs were naturally corrupted (metadb reported a "W" flag beside them, indicating "device has write errors"). I was able to delete those metadbs using metadb -d.
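
For reference, that cleanup might look roughly like this (a sketch; the slice names assume the replicas lived on c1t11d0s6 and c1t11d0s7, and re-adding replicas on the new drive is optional):

# delete the replicas that now report write errors
metadb -d c1t11d0s6 c1t11d0s7
# optionally re-create two replicas on each slice of the new drive
metadb -a -c 2 c1t11d0s6 c1t11d0s7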

Now I wanted to re-attach slice 0 of the new drive to the RAID5 volume. Simple procedure, right? metareplace -e d30 c1t11d0s0. But before doing that, why not check the status of the RAID5 volume first: metastat d30.

I was greeted by this unexpected message:

# metastat d30
Assertion failed: mdrcp->colnamep->start_blk <= rcp->un_orig_devstart, file ../common/meta_raid.c, line 151
metastat: Abort
Abort (core dumped)

Now every other meta* command I tried for fixing the problem caused the same "Assertion failed" error message. However, when I ran metastat as a non-root user, I did not get the assertion failure, and the command ran successfully, giving me a report of all my metadevices.

I googled this for hours trying to find someone who had experienced and fixed this. I found one relevant entry that suggested metadevadm. Perhaps the md subsystem still thought I had the old drive, so its idea of the drive's size did not match the new drive. metadevadm did not core dump, and it did update the "Device Relocation Information" to reflect the description of the new drive. But metastat as root continued to report assertion failures.
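
Assuming the replaced disk is c1t11d0 (taken from the slice name used elsewhere in this article), the device ID update would be along these lines:

metadevadm -u c1t11d0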

Perhaps if I deleted the whole RAID5 metadevice I could re-create it successfully. The RAID5 array was still accessible, so I backed up its contents into another slice on the new drive, then remounted my file systems using the backup. I attempted to rebuild the RAID5 array using metareplace -e, but that, too, resulted in assertion failures. Attempting to delete the RAID5 device using metaclear reported the same assertion failures. Basically, I could not delete the array because it was in a strange state.
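
A backup along those lines might look like this (purely illustrative; the spare slice c1t11d0s3, the mount point, and the use of ufsdump/ufsrestore are assumptions, not details given in the article):

# create and mount a scratch file system on a spare slice of the new drive
newfs /dev/rdsk/c1t11d0s3
mount /dev/dsk/c1t11d0s3 /backup
# copy the contents of the RAID5 metadevice into it
ufsdump 0f - /dev/md/rdsk/d30 | (cd /backup && ufsrestore rf -)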

Rebooting the system did not clear the problem. I did notice that after rebooting, the device information reported by iostat -E correctly identified the disk (before the reboot it still showed the vendor, model, and serial number of the old disk). Installing the latest md patch did not fix the problem either. It appeared my RAID5 array would be in maintenance state forever.

I needed to find a way to delete the RAID5 volume so I could re-create it. How could I do that when metaclear died with the same assertion error? Since metastat succeeded as a normal user, perhaps the assertion could be safely ignored.

My solution was to use LD_PRELOAD. If I could create my own "assert" function which does not cause a core dump, and insert it into the application, then I might succeed in running metaclear.

Source for assert.c:

#include <stdio.h>

/* Replacement __assert() that prints the assertion message but does not abort. */
void __assert(int a)
{
    fprintf(stderr, "assertion failed: mdrcp->colnamep->start_blk <= rcp->un_orig_devstart\n");
    return;
}

compile that with

cc -Kpic -c assert.c
ld -zdefs -G -h libassert.so.1 -o libassert.so.1 assert.o -lc

copy the resulting libassert.so.1 to /usr/lib

Then as root:

# LD_PRELOAD=/usr/lib/libassert.so.1
# export LD_PRELOAD
# metastat d30

Metastat no longer core dumped! This was a good sign. So I attempted to delete the metadevice using metaclear:
metaclear -f d30

Success!!

At this point I can unset my LD_PRELOAD. The easiest thing to do is to exit the shell.
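
Alternatively, the variable can simply be cleared in the same Bourne/Korn shell session:

# unset LD_PRELOAD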

Now I can re-create the RAID5 volume and copy the data back into it.
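
The re-creation step is not shown in the original post; a sketch of what it might look like follows (the first two slice names and the mount point are hypothetical, since only c1t11d0s0 is named in the article, and c1t11d0s3 is the hypothetical backup slice from the sketch above):

# rebuild the 3-slice RAID5 metadevice
metainit d30 -r c1t9d0s0 c1t10d0s0 c1t11d0s0
# once initialization finishes, make a file system and restore the backup
newfs /dev/md/rdsk/d30
mount /dev/md/dsk/d30 /export
ufsdump 0f - /dev/rdsk/c1t11d0s3 | (cd /export && ufsrestore rf -)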

Disclaimer: This solution works for me. I do not guarantee that it will work for you. I will not be held liable or responsible for data loss or any other kind of damages if you attempt this solution.

 

 

 
