
Sunday, April 25th, 2004

distilled stress

It took most of two days, but I managed to get that failing disk drive replaced. By all rights I should be relieved.

I spent most of Saturday practicing RAID rebuilds on my test bench. I did nearly a complete dry run: booted with one functional drive, rewrote the partition map of the “new” drive, synchronized the RAID. It worked beautifully. Of course, it was just a demo.
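
For the curious, the dry run boiled down to something like this (not the literal commands, and the device names here are invented; mdadm is shown, but the older raidtools would do the same job):

    # copy the partition table from the surviving drive (sda) to the replacement (sdb)
    sfdisk -d /dev/sda | sfdisk /dev/sdb

    # re-add the replacement's partitions to each RAID-1 array; this starts the resync
    mdadm /dev/md0 --add /dev/sdb1
    mdadm /dev/md1 --add /dev/sdb2
    # ...and so on for the remaining arrays

    # watch the rebuild progress
    cat /proc/mdstat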

The drive replacement on this server began at 10:00 AM today. We started with the usual problems: figuring out how to get the drives out of the server’s 1U rackmount chassis… determining that SCSI ID 0 must be the failing drive, only to reboot with ID 1 alone and discover that ID 0 was the good one after all… magic, inexplicable SCSI ID failures, such that an ID configuration identical to the one that had worked for eight months now refused to boot…

After about two hours, we began rebuilding the RAID. There were six partitions, the biggest of which took 30 minutes to synchronize. So we were standing around, buffeted by the sound of 1000 cooling fans and the 65° colo breeze, for a long time. We checked status frequently; the command ‘cat /proc/mdstat’ shows the estimated time to completion for the synchronization.
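
If you’ve never watched one of these, the progress report looks something like this; every number below is invented, but the finish= field is the estimate we kept staring at:

    # cat /proc/mdstat
    Personalities : [raid1]
    md3 : active raid1 sdb5[2] sda5[0]
          24418688 blocks [2/1] [U_]
          [==============>.....]  recovery = 71.5% (17466624/24418688) finish=8.4min speed=13720K/sec
    unused devices: <none>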

The process was on the 6th of 6 partitions, at about 99% complete. We were literally one minute away from being past the risky part: in one minute, we’d be able to put the lid back on, re-rack the server, reboot, test, and walk away. It would have been a by-the-book RAID repair.

But the ‘cat /proc/mdstat’ command hung. I’d become sensitive to the timing of things, because any pause usually meant trouble. Sure enough, after a few seconds the screen filled with RAID error messages. Not only did the last partition fail to synch, but the RAID software knocked three other partitions offline too, effectively erasing the past hour’s work. We shut down quickly and began the tedious process of investigation and repair.

Our best guess is that the RAID software got confused, maybe due to memory corruption, maybe due to the fact that the swap partition is also on RAID-1. (I didn’t set it up that way, and wouldn’t recommend it.) So, when we rebuilt the /var partition after rebuilding the swap partition, maybe the RAID software had a memory fault. Or maybe it was due to re-synchronizing 37 GB worth of disk space. I’ll likely never know.

Upon reboot, /var was hosed. The data was corrupt. This was my worst nightmare — that repairing a redundant system, designed to prevent data loss, would in fact be the cause of data loss. We spent two hours recovering, and for most of those two hours, never really knew if we’d succeed. Fortunately, the other member partition of the RAID held a virgin copy of the missing data.

Ultimately, we erased and reformatted (mke2fs) the bad RAID partition, and copied the contents of /var onto it from a backup. This introduced some problems, but nothing that can’t be fixed. I assume.
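
In rough strokes it went like this; the device and path names below are made up, and the actual restore involved far more babysitting than three commands suggest:

    # give up on the damaged filesystem and start fresh on the md device
    mke2fs /dev/md3

    # mount the new filesystem and restore /var from the backup copy
    mkdir -p /mnt/var
    mount /dev/md3 /mnt/var
    cp -a /backup/var/. /mnt/var/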

Then we had to rebuild all the RAID partitions again. We went to lunch this time. I have, burned into visual memory, the image of approaching the system after lunch, wondering whether I’d find that the RAID rebuild had succeeded or failed, whether it had hosed a few more partitions on its way down. That was a turning point. The process had worked. (We’d opted not to rebuild the swap partition, just in case that had caused the initial failure.)

We buttoned everything up, racked the server, rebooted. The boot process hung on the network init. We realized we’d plugged the colo’s ethernet feed into the wrong port on the server, so I moved it. Ten seconds later, the server froze.

I’d never seen a Linux machine freeze before, but this one was locked up tight. The screen froze at the login prompt. The keyboard was nonresponsive. We couldn’t access the machine over the wire. Despair, never far since the initial RAID rebuild failure, descended like a rusty axe toward my jugular.

I realize I’ve gotten too worked up about all this. I can imagine a thousand problems worse than a disk drive failure. But I don’t have to imagine the feeling of being dipped in stress, accompanied by a cold sweat and a stomach full of acid, because I had that already. I guess the bottom line is that if all else failed I could just go buy a new 1U server, copy data there from my backups, spend a day reconfiguring, and I’d be back online. Somehow that vision didn’t comfort me.

We used the reset button. The system rebooted successfully. Of course all the RAID partitions had been marked as out-of-synch, and the system began recovery on its own. So we left it. There didn’t seem to be any point standing around for yet another hour to watch. We’d been working for six hours on a simple (hah!) drive replacement; it was time to go home. And all the way home, I wondered if the system would be online when I got there.

Whenever I’m working on computer hardware, I notice a trend: either things are making sense, or they aren’t, and that trend maps directly onto my confidence in the repair. For example, if a RAID rebuilds successfully, as it did on my test bench, I’d be very confident in the result. But if it pukes at the 99% point, I lose confidence. Having it ultimately work, whether by chance, or by careful avoidance of the things that might have triggered the failure, rebuilds some of that confidence… but if the machine later mysteriously freezes, I’m back to square zero.

And so it was that I fixed the problem that I’d been losing sleep over for a week, only to have even less confidence afterwards that the server was stable.

A long time ago I mentioned an article on web-writing tips at A List Apart. The article recommends writing during the bad times as well as the good.

Well, there you have it.


posted to channel: Personal
updated: 2004-04-28 18:05:37
