The pic I should have taken
myself instead of filching from
some other site

The plan was to go camping in wonderful Goblin Valley. My brother, his wife and four other friends had made plans to leave Friday night after work. It’s a short three-hour drive from my place. The weather was looking good and I was stoked. I took Friday off to run some errands and to pack.

Then it came…

The email, webmail, DNS, gateway and firewall server of a company I have been consulting with for over ten years had crashed. The hardware ATA RAID system had experienced a total failure.

Both disks in the mirrored array had errors. One was not being recognized at all by the system; the other had bad sectors and could be read from but not written to. I knew this would not be something I could take care of over the phone, regardless of how much the more-than-willing people on the other end of the phone offered to be talked through fixing it. Ha! Impossible, I thought.

Being pretty upset about what I was realizing was a quickly fading vacation, I started making a few phone calls to let the others know that I would not be going and that, since I was one of the drivers, they would now have to make other arrangements for transportation to Goblin Valley. Bummer.

Arriving on-site about 45 minutes later, I settled in for what I began to realize was going to be a multi-day extended working weekend. I was able to duplicate the readable disk to a new hard drive and then copy that one to a second new hard drive. Despite the new hard drives, the system still had errors on the filesystem, and the root filesystem fell into a degraded, read-only mode when booting….

After a few frustrating hours of working on the system, transferring data and getting the system booting, it became clear that the filesystem errors on the original “readable” drive were faithfully reproduced on the new drives by the “Copy” function of the hardware RAID card. Nice going 8^). I give them credit of course. The errors were not in the hardware but in the EXT3 filesystem. Reformatting the new partition, then executing (cd /old/drive && tar cf - .) | (cd /new/drive && tar xvfp -) solved the filesystem corruption.
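For anyone who wants to try the tar-pipe copy before trusting it with real data, here is a minimal sketch that runs it on throwaway directories. The helper name copy_tree and the sample file are my own placeholders, not anything from the actual server, and I dropped the v flag to keep the output quiet.

```shell
#!/bin/sh
# Sketch: copy a directory tree with a tar pipe, preserving permissions.
# copy_tree is a hypothetical helper name; all paths here are placeholders.
copy_tree() {
    src=$1
    dst=$2
    # "." tells tar to archive everything under $src; the p flag on
    # extraction keeps the original permission bits.
    (cd "$src" && tar cf - .) | (cd "$dst" && tar xpf -)
}

# Small demonstration on temporary directories.
old=$(mktemp -d)
new=$(mktemp -d)
mkdir -p "$old/etc"
echo "hello" > "$old/etc/motd"
chmod 640 "$old/etc/motd"

copy_tree "$old" "$new"
ls -lR "$new"
```

Because the files are recreated one by one on a freshly made filesystem, only the file contents and metadata come across, not the on-disk structures of the old filesystem, which is why this kind of copy can leave filesystem-level corruption behind where a block-level duplicate cannot.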

Mayuko and I took a break and grabbed dinner at Little World (one of the best Chinese kitchens in Utah). I worked on getting the SuSE 8.1 Linux system converted from hardware RAID to software RAID, since the device driver for the Promise Technologies Fastrack 100 would not load and is not available in newer 2.6.x kernels. I felt that the more portable solution was to go with software RAID, since it’s available in newer kernels and will allow for access to the data when upgrading. It was nearly 11:00pm. Time to head home for some sleep.

Saturday came quickly enough… The transition to software RAID went smoothly enough and all data was readable, but the system still was not booting from the LILO boot manager (GRUB was not stable at the time 8.1 was released). I needed to make a new initrd that loaded the software RAID1 Linux kernel module. It turned out that LILO was writing to the MBR on the new hard disks, but the boot sector of the hardware RAID device was still being used for booting, so the new MBR was never read. A quick decision to get a new Adaptec PCI ATA133 card and not use the Promise Fastrack 100 ATA RAID controller was aided by the fact that with the RAID array defined, I could not boot using the MBR of the new hard drives, and deleting the array resulted in neither drive being seen by the system. I bought the new card, installed it and turned off the on-board card. Bingo – we’re booting from the MBR and we’re up and running – sort of.

Only one thing left to do – when booting it stopped after loading the initrd, unable to mount the root filesystem on /dev/md0. I knew I was close and realised that the linuxrc in the initrd was missing a command to start the RAID devices. I really needed a net connection to google for the answer. I grabbed the system to take home with me, where I had a net connection (this server was their net connection and it wasn’t running). I called it a day at 9:30pm.

The next morning I found the answer in a few minutes – raidautorun needed to be run after all the modules had been loaded. A quick edit and a new initrd and I had a booting system. Nice.
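Since the whole fix came down to ordering, here is a rough sketch of a sanity check for an unpacked linuxrc: it confirms that raidautorun is called, and only after the last insmod line. The helper name check_linuxrc and the sample file are made up for illustration and are not SuSE’s actual linuxrc; the check is a plain text match, so commented-out lines would fool it.

```shell
#!/bin/sh
# Sketch: verify that a linuxrc script calls raidautorun after its last
# insmod line. check_linuxrc is a hypothetical helper name.
check_linuxrc() {
    rc=$1
    # Line number of the last insmod call and of the first raidautorun call.
    last_insmod=$(grep -n 'insmod' "$rc" | tail -n 1 | cut -d: -f1)
    first_raid=$(grep -n 'raidautorun' "$rc" | head -n 1 | cut -d: -f1)

    if [ -z "$first_raid" ]; then
        echo "raidautorun is never called"
        return 1
    fi
    if [ -n "$last_insmod" ] && [ "$first_raid" -lt "$last_insmod" ]; then
        echo "raidautorun runs before the last insmod"
        return 1
    fi
    echo "ok"
}

# Illustrative sample only; the module names are assumptions.
sample=$(mktemp)
cat > "$sample" <<'EOF'
#!/bin/sh
insmod md
insmod raid1
raidautorun
EOF

check_linuxrc "$sample"   # prints "ok"
```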

I took the system back Sunday afternoon and had it running immediately. I spent the next three hours or so configuring some backup and “safety” software, as well as running the YaST Online Update (YOU) to get the latest updates for SuSE 8.1 and doing some housecleaning on the system.

So, that’s how my Goblin Valley trip went. How was your weekend?