I think we all remember the “Inconceivable!” routine from The Princess Bride. That’s about how I feel after the last few weeks. I have an extremely high-end RAID controller in a box, a 20-port (16 internal, 4 external) device. It’s in a monster SuperMicro case, with 2 small drives for the OS and fourteen 1 TB drives for storage. Over the last few weeks, we’ve been seeing a host of bizarre behavior, from the pair of new 2 TB drives magically disappearing and reappearing on the controller every 11 minutes (on the dot) to the system lighting LEDs and sounding sirens as if there were a severe failure while the RAID controller software showed a 100% optimal state.
Update: Just to make it clear, we are (and have been) working with the vendor on this issue. When we find out precisely what the issue is, I’ll post a new item. Also, I forgot one angle of this when I first posted the story. These bizarre failures (the alarms, not the two 2 TB drives) turned out to have been caused by (get this) bad sectors on the drives. But those drives (or the controller) are supposed to automatically handle and work around bad sector errors! And why would that kind of error blow out the RAID to the point where the controller is sounding alarms, but not to the point where the software is aware? “Inconceivable!”
Tonight took the cake, though. My on-site person deliberately broke the RAID. We had planned to do this; we wanted to take one of the mirrored drives and put it into our backup chassis to help diagnose a problem with the backup unit. One of my “Inconceivable!” moments two weeks ago came when we tried to move to the backup chassis and the system went into an endless reboot cycle, even though it had worked fine a few months prior and hadn’t been touched since. The plan was simple: pull the drive, put one of our spares in, and let the RAID (it’s a RAID 1) sync. No big deal. Well, the system decides to BSOD, in a definite “Inconceivable!” moment. Let’s get this straight. A RAID controller that we paid $1,500 – $2,000 for (I can’t recall the number offhand) decides to panic so badly that the entire OS comes crashing down, over a simple hot swap of hard drives? Inconceivable!
After the reboot, users start complaining that they can’t get their email, so I get a call. Yet another “Inconceivable!” moment… I had just sat down at a restaurant to celebrate my wife’s birthday with about TWENTY friends and family. We look at the Exchange server (a VM on the machine that BSOD’ed). After some diagnosis, it looks like the Exchange databases managed to get corrupted and refuse to recover themselves. Once again… “Inconceivable!” I spent the entire dinner (including bathroom breaks, ordering, and eating) on the phone. My only break was when we ran some repairs that took a while, just long enough to have a few moments of conversation and sing Happy Birthday. I’m on the phone throughout the goodbyes. And of course, Thursday is the night when I usually do the food shopping. To make matters worse, I deliberately ran out of food this afternoon, so skipping the food shopping was not an option. This is valuable troubleshooting time, and I need to be in the food store. On top of that, I can’t stay up all night and sleep in, because I’ve been watching our son in the morning now that my wife has returned to work, and he wakes up early. “Inconceivable!” So I must get this resolved before, say, midnight.
I eventually give up with my on-site guy, and resign myself to a very long night. My boss calls while I am in the food store, and he decides to give the recovery another try. See, the previous recovery attempts failed, with error codes that were not found on Google or Bing. “Inconceivable!” Well, the new recovery attempts all fail too. At the last moment, we decide to try a different command line switch. The whole thing took an hour and a half, but it finished (after fixing the corrupted database file), and after a restart of the Information Store service, Exchange was working just fine again.
So, to add up all of the “inconceivable” events:
- The enterprise-grade RAID controller wet its pants over a simple hot swap
- A hard drive hot swap BSOD’ed Windows
- A “power off” failure put the Exchange database in a state that it could not automatically recover from
- This all happened on one of the three nights a year that I cannot be at my home office for, oh, four hours, and during the one three-week period of the year that I can’t stay up all night and sleep during the day
- None of the obvious recovery choices worked
Obviously, “Inconceivable” does not mean what I think it does!