10/7/07

Finally, The Horrors of Exchange, part 1

Last week, on Wednesday, September 26, 2007, I was assigned to an on-site contract for a local Owosso business. On day 1, I focused on getting oriented with the network there, mostly just familiarizing myself with the five on-site servers and the one off-site server.

The network is interesting: it's international, it has two Exchange servers, and there's a whole kaboodle of other traits that give the system a personality all its own.

Standard IT practice for managing a network, whether you've been administering it since day one or are coming in as a new administrator, is to start at the domain controllers: fire up Event Viewer and find out what errors the system has been having.
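
On Windows Server 2003 you can pull the same list from the command line with the built-in eventquery.vbs script; a rough sketch, with MSExchangeIS as just one example source:

    cscript //nologo %SystemRoot%\system32\eventquery.vbs /l application /fi "type eq error" /fi "source eq MSExchangeIS"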

I noticed several errors labeled as "MSExchange..." This is one of my nightmares. Exchange is really sensitive, and it's easy to break it or lose data if it isn't handled according to Microsoft's guidelines. I'm not sure why Microsoft made Exchange in such a sensitive way, but they must have had a reason for it, seeing as how they've been doing it for over a decade now [Microsoft Exchange Server, Wikipedia]. One of the interesting design concepts of Exchange is how messages are stored on the server. Exchange drops messages into one of several database files: priv1.edb, priv1.stm, pub1.edb, and pub1.stm, plus an ever-increasing number of transaction log files. But one of the annoying features of Exchange is that it requires everything to be 100% operational for most of its own functions to work correctly.
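
As an aside, a quick way to see what shape those files are in is to dump the database header with eseutil (which lives in the Exchsrvr\bin folder). A rough sketch, assuming a D:\Exchsrvr\mdbdata path that you'd swap for the real one:

    eseutil /mh D:\Exchsrvr\mdbdata\priv1.edb

The State line in the output tells you whether the database was shut down cleanly (Clean Shutdown) or still needs its logs replayed (Dirty Shutdown).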

For example, while on site, I found that Exchange wasn't backing up correctly each night and had been failing for several months. After doing some tinkering, I also noticed that it was running out of space. The Exchange database (the database being priv1.edb plus priv1.stm) was 15.9 GB, while Microsoft has set the maximum capacity of the Exchange database at 16 GB. Email doesn't exactly come in at an extreme speed on this network, so it wasn't flagged as a major issue, and we scheduled Exchange database maintenance for the upcoming Saturday. That would give me plenty of time to check which commands I needed to run against the database and what contingencies I should be prepared to deal with (cough-disaster recovery-cough).
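
If you want to keep an eye on the same numbers, checking the database size against the free space on the drive is a one-liner each; a rough sketch, again assuming the D:\Exchsrvr\mdbdata path:

    dir D:\Exchsrvr\mdbdata\priv1.edb D:\Exchsrvr\mdbdata\priv1.stm
    fsutil volume diskfree d: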

One day went by without major events, but then on Friday I arrived on site to find that Exchange had crashed and been brought back up. It had crashed, according to Event Viewer, because the hard drive had dropped below 10 MB of free space (thankfully, Exchange was hosted on the D drive instead of on C). However, when I fired up My Computer to see how much space was left on the drive and found it to be under 512 KB, I immediately reported to my on-site boss that the Exchange server was going to go down any minute, since the drive was filling up again, and that I was going to have to shut Exchange down to prevent it. He quickly remoted into the server to take a look at the data, and noticed that the drive had 2 MB free and was growing?? And then several people immediately reported that they had suddenly lost email... Yup, Exchange had crashed again.
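
If you ever need to beat Exchange to the punch like that, stopping the Information Store service dismounts the stores gracefully instead of letting them crash. A minimal sketch using the standard service name:

    net stop MSExchangeIS
    rem ...free up some space, then bring it back:
    net start MSExchangeIS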

Now, this isn't a huge issue by itself; Exchange is designed to protect itself in situations like this by shutting down and displaying error messages so we avoid data loss. However, now that the drive had less than 10 MB free, we were in a bit of a bind. Exchange needed to be defragged (that would be the command eseutil /d). The offline defrag rebuilds the database without the space left behind by items users have deleted and mailboxes administrators have deleted, which should free up a decent bit of room in the database. The procedure Microsoft recommends is to take a backup of Exchange and then run the defrag command. While it isn't very common for the defrag command to cause problems with the database, it is still a possibility, especially if we have undetected hardware problems.
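
For anyone curious, the plan itself is only a few commands. A rough sketch of what I had queued up for Saturday (the paths are assumptions, and note that the offline defrag needs temp space of roughly 110% of the database size, which is why /t points the temporary database at a drive with actual room on it):

    net stop MSExchangeIS
    eseutil /d D:\Exchsrvr\mdbdata\priv1.edb /tE:\Temp\tempdfrg.edb
    net start MSExchangeIS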

So started the second fiasco: backing up Exchange. We've known for a while that the Veritas Backup Exec jobs on that server have been failing, but they were backing up the entire server, and while the job was throwing an error on Exchange, it wasn't clear whether the Exchange errors were causing the backups to fail or whether it was more of a tape/tape drive issue. So I decided to just do an NTBackup.exe backup and save it to an external drive. That gave us enough redundancy for our situation.
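
NTBackup is Exchange-aware, so the Information Store can be backed up straight from the command line; a rough sketch, where the server name, storage group name, and external drive letter are all assumptions you'd swap out for your own:

    ntbackup backup "\\SERVERNAME\Microsoft Information Store\First Storage Group" /j "Exchange IS backup" /f E:\ExchangeIS.bkf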

The backup started, and 2.5 hours later it switched over to data verification without having displayed a single error message. 2.5 hours after that (a total of 5 hours after starting the backup), it returned "Backup failed." I checked the logs, and they said "\Mailstore (%servername%)" had failed: the database may be corrupt or inaccessible, and the file will not restore correctly. This was becoming a major stressor for me. I have to defrag the database, and Microsoft highly recommends a backup prior to the defrag, yet I'm unable to take that backup because the database is corrupt.
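
Before going any further down that road, it's worth confirming the corruption directly instead of trusting the backup log alone; eseutil has an integrity-check mode for that. A rough sketch (the store needs to be dismounted first, and the path is once again an assumption):

    eseutil /g D:\Exchsrvr\mdbdata\priv1.edb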

To keep this story from getting horribly long: I eventually was able to run the defrag on priv1.edb, pub1.edb, and pub1.stm, which freed up nearly 6 GB of space. However, I was not able to defrag priv1.stm, which seems to be corrupt. I spent a lot of time with other variations of the eseutil command, but wasn't able to get the database back to a healthy state. I could run eseutil /p [database] /i, which repairs the database while ignoring the mismatch between priv1.edb and priv1.stm; however, the /p switch will not only repair the database, it will also delete anything it determines to be corrupt, which frequently causes other problems with the database. But there is one nice thing: even if it does delete data that it shouldn't, we can use the *.log files to replay transactions in Exchange and get it back up and running with the correct information.
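
For reference, the sequence Microsoft generally recommends after a hard repair is repair, then defrag, then an isinteg pass to clean up the store-level tables. A rough sketch (paths and server name are assumptions, and the store has to be dismounted first):

    eseutil /p D:\Exchsrvr\mdbdata\priv1.edb /i
    eseutil /d D:\Exchsrvr\mdbdata\priv1.edb
    isinteg -s SERVERNAME -fix -test alltests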

Once again, however, my hopes were dashed. To save space, some of the log files had been deleted, so there is no guarantee that we'll be able to use the logs to get the database up and running again. Hopes dashed. And upon further research, I'm stuck between a rock and a hard place; there aren't many options left. The only ones I've got are to break the RAID array and tinker with one half of the mirrored image to see what happens (no guarantee of success, and it's possible to damage the RAID array in the process), or to call Microsoft and open a support ticket. My boss at CyberMedics suggests the latter; after talking with the guys on site, we'll make the final decision come Monday.
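
For the record, replaying the logs is what eseutil calls a soft recovery. Had the full log set still been there, it would have looked roughly like this, assuming the default first-storage-group log prefix of E00 and the same assumed paths:

    eseutil /r E00 /lD:\Exchsrvr\mdbdata /dD:\Exchsrvr\mdbdata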

More details as the story progresses.
