
“Inconceivable!”

I think we all remember the “Inconceivable!” routine from The Princess Bride. That’s about how I feel after the last few weeks. I have an extremely high-end RAID controller in a box, a 20 port (16 internal, 4 external) device. It’s in a monster SuperMicro case, with 2 small drives for the OS and fourteen 1 TB drives for storage. Over the last few weeks, we’ve been seeing a host of bizarre behavior, from the pair of new 2 TB drives magically disappearing and reappearing on the controller every 11 minutes (on the dot) to the system lighting LEDs and sounding sirens as if there were a severe failure while the RAID controller software showed a 100% optimal state.

Update: Just to make it clear, we are (and have been) working with the vendor on this issue. When we find out precisely what the issue is, I’ll post a new item. Also, I forgot one angle of this when I first posted the story. These bizarre failings (not the two 2 TB drives, the other issue) turned out to have been caused by (get this) bad sectors on the drives. But those drives (or the controller) are supposed to automatically handle and work around bad sector errors! And why would that kind of error blow out the RAID to the point where the controller is sounding alarms, but not to the point where the software is aware? “Inconceivable!”

Tonight took the cake, though. My on-site person deliberately broke the RAID. We had planned to do this; we wanted to take one of the mirrored drives and put it into our backup chassis to help diagnose a problem with the backup unit. One of my “Inconceivable!” moments two weeks ago was that when we tried to move to the backup chassis, the system went into an endless reboot cycle, even though it had worked fine a few months prior and hadn’t been touched since. The plan was simple: pull the drive, put one of our spares in, and let the RAID (it’s a RAID 1) sync. No big deal. Well, the system decides to BSOD, in a definite “Inconceivable!” moment. Let’s get this straight: a RAID controller that we paid $1,500 – $2,000 for (I can’t recall the number offhand) decides to panic so badly that the entire OS comes crashing down, over a simple hot swap of hard drives? Inconceivable!

After the reboot, users start complaining that they can’t get their email, so I get a call. Yet another “Inconceivable!” moment… I had just sat down at a restaurant to celebrate my wife’s birthday with about TWENTY friends and family. We look at the Exchange server (a VM on the machine that BSOD’ed). After some diagnosis, it looks like the Exchange databases managed to get corrupted and refuse to recover themselves. Once again… “Inconceivable!” I spent the entire dinner (including bathroom breaks, ordering, and eating) on the phone. My only break was when we ran some repairs that took a while, just long enough to have a few moments of conversation and sing Happy Birthday. I’m on the phone throughout the goodbyes. And of course, Thursday is the night when I usually do the food shopping. To make matters worse, I deliberately ran us out of food this afternoon, so skipping the shopping was not an option. Valuable troubleshooting time, and I need to be in the food store. On top of that, I can’t stay up all night and sleep in, because I’ve been watching our son in the morning now that my wife has returned to work, and he wakes up early. “Inconceivable!” So I must get this resolved before, say, midnight.

I eventually give up with my on-site guy and resign myself to a very long night. My boss calls while I am in the food store, and he decides to give the recovery another try. See, the previous recovery attempts failed with error codes that were not found on Google or Bing. “Inconceivable!” Well, the new recovery attempts all fail. At the last moment, we decide to try a different command line switch. The whole thing took an hour and a half to finish up (after fixing the corrupted database file), and after a restart of the Information Store service, Exchange was working just fine again.
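
For reference, the general shape of an Exchange database recovery looks something like the commands below. This is only a rough sketch, assuming an Exchange 2003-style setup: the database name and paths are placeholders (not our actual ones), and I’m deliberately not claiming these were the exact switches that finally did the trick for us.

    eseutil /mh "priv1.edb"
        (check the database header; a “Dirty Shutdown” state means logs still need to be replayed, or the database needs repair)
    eseutil /r E00 /l "D:\Exchsrvr\Logs" /d "D:\Exchsrvr\MDBDATA"
        (soft recovery: replay the transaction logs against the database; log prefix and paths are placeholders)
    eseutil /p "priv1.edb"
        (hard repair, the last resort: it discards anything it cannot salvage)
    isinteg -s SERVERNAME -fix -test alltests
        (clean up logical, application-level corruption after a hard repair)
    net stop MSExchangeIS
    net start MSExchangeIS
        (restart the Information Store service so the stores remount)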

So, to add up all of the “inconceivable” events:

  • The enterprise grade RAID controller wet its pants over a simple hot swap
  • A hard drive hot swap BSOD’ed Windows
  • A “power off” failure put the Exchange database in a state that it could not automatically recover from
  • This all happened on one of the three nights a year that I cannot be at my home office for, oh, four hours, and during the one three week period of the year that I can’t stay up all night and sleep during the day
  • None of the obvious recovery choices worked

Obviously, “Inconceivable” does not mean what I think it does!

J.Ja

Categories: Microsoft Exchange, Storage
  1. August 13th, 2009 at 22:41 | #1

    What controller was this? We just want to know so we can avoid it like the plague. Seems like an awfully expensive RAID controller. Did you install the optional battery? Was writeback cache enabled? Without that, write performance is absolute shit on a RAID controller.

    Honestly, I’d go for something like this http://www.buy.com/prod/supermicro-aoc-usas-l8i-8-port-sas-raid-controller-1-x-sas-x4-sas-300/q/loc/101/206007293.html and I’d use the built-in 8 ports I get on most server boards.

    Yes, I know this is “soft-RAID” and you do not get the benefit of a battery backup, but those battery backups never come with the expensive controllers unless you pay extra. Most people I know don’t bother buying the battery backups, so they can’t really enable write-back cache mode without risking data corruption during power outages or server shutdowns. At least I know the simple/cheap stuff works.

  2. August 14th, 2009 at 02:41 | #2

    I like stories with happy endings. That was nice. Thank you J.Ja.

  3. August 14th, 2009 at 04:53 | #3

    @George Ou
    I’m not going to publicize the RAID make/model yet, until I have a definitive explanation of its behavior lately and a resolution to the issues (why bad-mouth something if it turns out to be my fault?). That being said, there are not many 20 port, true RAID, SATA RAID controllers on the market, so it should be a fairly simple task to figure out which one I have. :)

    It is also very important for me to state that the manufacturer’s tech support is top flight. Their people have dedicated insane amounts of time to us over the last year. It looks like most customers for this card are very “enterprise”-like companies, so they don’t move very fast with a lot of stuff. As a result, the tech support folks have shared that many of the things we are doing with it (like hooking up 2 TB drives) are things that their other customers are not doing, so we are often discovering new bugs for them to fix in firmware (a while back, we bought some drives that had just come on the market, and they discovered that those drives had a firmware which didn’t “like” the controller, so they patched the firmware on the controller). I can’t say enough good things about their support staff; it’s not like we’re being left in the cold here.

    We do *not* have the battery installed, though, which is why I disabled the write-back caching, to avoid these precise kinds of issues. I know that it hurts the write performance, but as you say, using the write caching without the battery is a risky move!

    I will say, though, this controller has been a major, major problem for us, and nothing changed to make it start acting weird. We pulled the old controller out and swapped it, but the problems persisted. The original looked like it had some thermal grease that had squeezed out from under an onboard chip and dried and caked up, but again, the swap didn’t help. We did discover a few days ago that two of the drives that were blowing up periodically actually had bad sectors (whoops, need to update the post, because those drives are supposed to auto-correct for the bad sectors, and haven’t been!). Most of our problems seem to come from swapping drives (or having them drop out) while the unit is on, which is a major problem to me.
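
    (For anyone who wants to watch for this on their own drives: the SMART counters that flag it are Reallocated_Sector_Ct and Current_Pending_Sector. Below is only a rough sketch using smartmontools; the device name is the generic example, and whether the tool can even see individual drives behind a hardware RAID controller depends entirely on the controller and may require a controller-specific -d option, so treat it as an illustration rather than what we actually ran.)

        smartctl -a /dev/sda
            (dump the full SMART attribute table; a non-zero Reallocated_Sector_Ct or Current_Pending_Sector means the drive is remapping, or failing to remap, bad sectors)
        smartctl -t long /dev/sda
            (start an extended offline self-test; check the results later with another -a)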

    J.Ja

  4. August 14th, 2009 at 04:54 | #4

    @Dietrich Schmitz

    I liked the happy ending too! I got to sleep on time and even had enough time to watch the show I had on the DVR. :)

    J.Ja

  5. nucrash
    August 14th, 2009 at 05:06 | #5

    I am a big fan of write cache batteries and hardware RAID. The problem is, I have had batteries fail without notifying me, and when the server bounced due to a power outage, I lost the entire RAID because of it.

  6. Allan MacKinnon
    August 14th, 2009 at 10:42 | #6

    How about just using ZFS instead? Glad things worked out…

  7. August 14th, 2009 at 17:36 | #7

    @Allan MacKinnon

    ZFS doesn’t address the hardware problem of getting 16 disks into one box; there are few devices that do that, sadly. Besides, we are a Windows shop, for better or for worse.

    J.Ja

  8. DEK46656
    August 15th, 2009 at 12:15 | #8

    First – You broke one of the cardinal rules of system administration: don’t start anything on a Friday (well, in this case it was a vacation day / celebration night).

    Second – As much as I like SATA, I don’t know if I would really trust it for tier 1 / critical operations as much as tier 3 (archival storage) in a production server.

    I don’t have practical experience with SATA RAID, but “back in the day” I worked with the original SCSI (now known as SCSI-1) and mirroring (RAID 1). I was installing a voicemail system in Thailand with a lot of SCSI drives in it, redundant power supplies, etc. I had some people accessing the system over speaker phone and another person running activity over the console. I systematically killed different parts of the system with everyone watching, and there were no “blips” in operation. That turned out to be the best demonstration I could have ever done.

    When RAID is working, it’s the best thing in the world.

  9. DEK46656
    August 15th, 2009 at 12:19 | #9

    PS: The Princess Bride is one of our (wife and I) favorite movies. We liked it so much that we had a medieval wedding, with the minister starting off with the same speech impediment that the bishop had in the movie…

  10. August 15th, 2009 at 12:44 | #10

    @DEK46656

    I didn’t decide to do this work at that time… it was supposed to have been done the night before, and then when it *was* done, I didn’t know it was happening! If the person on-site had called me, I would have reminded him that the RAID wasn’t trustworthy and to shut down the system before doing the drive swap…

    The only difference between SATA RAID and SCSI RAID right now is the drives themselves. Some folks swear that SCSI drives are built to higher tolerances. Heck, a lot of folks are now using SAS drives (not sure if you group them in the same bucket as “SCSI” or not), which are extremely similar to SATA drives in many ways. Indeed, both the backplane in the server and the RAID controller itself are dual SATA/SAS devices.

    So in this case, other than the drives themselves (which might explain the bad sectors, but nothing else) and my not using a battery on the controller, this setup is using the highest level of enterprise equipment possible with “off-the-shelf” parts.

    J.Ja

  11. August 15th, 2009 at 12:53 | #11

    @Justin James
    Without write-back cache, we’re talking about cutting write throughput down 5-fold or more. It’s a brutal performance penalty. The main reason I would spend big money on a storage controller as opposed to the cheaper software-based controllers is the ability to have the battery backup to enable write-back caching safely (or at least more safely). The other reason is RAID-6 capability.

  12. August 16th, 2009 at 15:16 | #12

    Does WTF seem like a better fit than inconceivable?

  13. TS
    August 17th, 2009 at 15:44 | #13

    @George Ou

    Everything has changed recently in the storage world. I will give you two links; I think they are self-explanatory:

    http://blogs.sun.com/brendan/entry/l2arc_screenshots
    http://blogs.sun.com/brendan/entry/slog_screenshots
