Thursday, August 10, 2006

Exchange 2003 Server Disaster

On 29 July 2006, we had a terrible disaster on our Exchange 2003 server. One of the databases in one of the storage groups became corrupted, and this was logged in the event log.

Source: ESE
Category: Database Corruption
Event ID: 447

Information Store (8872) SG1: A bad page link (error -338) has been detected in a B-Tree (ObjectId: 397, PgnoRoot: 7917) of database 'E:\Exchsrvr\SG1\MS4\MStore4.edb' (0 => 7917, 538976288).
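For triage, the useful parts of that message are the error code, the B-Tree identifiers and the database path. Here is a minimal Python sketch that pulls them out. The pattern is modelled only on the single message quoted above, and the function name `parse_bad_page_link` is my own, so treat it as an illustration rather than a general ESE event parser.

```python
import re

# Pattern modelled on the one event 447 message quoted in this post.
# Other ESE messages may be worded differently (assumption).
BAD_PAGE_LINK = re.compile(
    r"\(error (?P<error>-?\d+)\).*?"
    r"ObjectId: (?P<object_id>\d+), PgnoRoot: (?P<pgno_root>\d+)\).*?"
    r"database '(?P<database>[^']+)'"
)


def parse_bad_page_link(message):
    """Return the error code, B-Tree ids and database path, or None."""
    match = BAD_PAGE_LINK.search(message)
    if match is None:
        return None
    return {
        "error": int(match.group("error")),
        "object_id": int(match.group("object_id")),
        "pgno_root": int(match.group("pgno_root")),
        "database": match.group("database"),
    }


# The message text exactly as it appeared in the event log.
SAMPLE = (
    "Information Store (8872) SG1: A bad page link (error -338) has been "
    "detected in a B-Tree (ObjectId: 397, PgnoRoot: 7917) of database "
    r"'E:\Exchsrvr\SG1\MS4\MStore4.edb' (0 => 7917, 538976288)."
)

if __name__ == "__main__":
    print(parse_bad_page_link(SAMPLE))
```

Knowing the exact database path straight away tells you which store in the storage group to restore first.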


This prevented all the other stores in the same storage group from mounting. Whenever I tried to mount them, the mount failed and the event below was logged. I had thought that a corrupted database within a storage group should not affect the other databases in that storage group. I guess this has to do with them sharing the same set of transaction logs.

Source: ESE
Category: Logging/Recovery
Event ID: 517

eseutil (9356) Database recovery failed with error -551 because it encountered references to a database, 'E:\Exchsrvr\SG1\MS4\MStore4.edb', which does not match the current set of logs. The database engine will not permit recovery to complete for this instance until the mismatching database is re-instated. If the database is truly no longer available or no longer required, procedures for recovering from this error are available in the Microsoft Knowledge Base or by following the "more information" link at the bottom of this message.

Anyway, the Help and Support Center suggested that I restore the corrupted database from backup. Luckily for me, the corruption happened after the full backup. I did the restore, but when it reached the hard recovery stage, where it tries to replay the logs, it failed with the following error.

Operation terminated with error -567 (JET_errDbTimeTooNew, dbtime on page in advance of the dbtimeBefore in record) after 25.734 seconds.

I tried eseutil /cc "c:\temp\sg1" and eseutil /cc "c:\temp\sg1" /t, but no luck. eseutil /mh showed that the restored database was in the "dirty shutdown" state. I really don't understand what could cause the timestamp mismatch between the log files and the database.
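Since the shutdown state decides whether a database can be mounted or swapped, it helps to check it mechanically rather than by eye. Below is a small Python sketch that reads the state out of saved eseutil /mh output. It assumes the dump contains a line of the form "State: ..." and that you have already captured the output as text; the function names and the one-line sample are mine, not real tool output.

```python
import re


def database_state(header_dump):
    """Return the text after 'State:' in a header dump, or None if absent."""
    match = re.search(r"^\s*State:\s*(.+?)\s*$", header_dump, re.MULTILINE)
    return match.group(1) if match else None


def is_clean_shutdown(header_dump):
    """True only when the dump reports a clean shutdown."""
    state = database_state(header_dump)
    return state is not None and state.lower() == "clean shutdown"


# Hypothetical one-line excerpt for illustration; the exact layout of the
# real dump is an assumption.
SAMPLE_DUMP = "            State: Dirty Shutdown\n"

if __name__ == "__main__":
    print(database_state(SAMPLE_DUMP))
    print(is_clean_shutdown(SAMPLE_DUMP))
```

A check like this could be run against every database in the storage group before attempting a mount.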

After trying several other recovery methods with no luck, I decided to go for a dial-tone database, since the server had been down for almost a day. I moved all the database files (.edb and .stm) and log files in that storage group out to a temp folder. Although the MS guide says not to move the log files, ESM just would not let me create a new database without removing them.

I started with the mailbox store that was corrupted. Once the dial-tone database was created and mounted, I saw users start connecting again. I restored the backup (the same one I had used earlier) to a Recovery Storage Group, and this time it worked. I dismounted both databases, made sure both were in the "clean shutdown" state, swapped them around and mounted them. I then used ExMerge to export the mailboxes in the RSG to PST files and merged them into the recovered database.

For the rest of the databases, I restored them directly without using the RSG and manually ran eseutil /cc /t to perform hard recovery. With that I was able to mount all the databases. I restarted the server to make sure that everything was okay. For the original databases (all in the "dirty shutdown" state), I did a hard repair and defrag. If someone complains about missing mail, I can still get it back from the original databases.
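The step of moving the storage group files aside is easy to get wrong by hand under pressure, so here is a rough Python sketch of it. It moves files by extension into a holding folder and keeps the subfolder layout, so they can be put back later. The function name, the folder arguments and the default extension list (.edb, .stm, .log, taken from the description above) are my assumptions. Whether other files in the folder should move too is not something I can confirm, which is why the extensions are a parameter. The stores should be dismounted first.

```python
import shutil
from pathlib import Path


def move_storage_group_files(source_dir, holding_dir,
                             extensions=(".edb", ".stm", ".log")):
    """Move matching files to holding_dir, preserving relative paths.

    Returns the relative paths that were moved. holding_dir is assumed
    to be outside source_dir.
    """
    source = Path(source_dir)
    holding = Path(holding_dir)
    holding.mkdir(parents=True, exist_ok=True)

    moved = []
    # Build the full list first so moving files does not disturb the walk.
    for path in sorted(source.rglob("*")):
        if path.is_file() and path.suffix.lower() in extensions:
            relative = path.relative_to(source)
            target = holding / relative
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(path), str(target))
            moved.append(relative)
    return moved
```

Keeping the relative paths means the originals stay grouped per store, which made it simple to run repair and defrag on them afterwards.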

Well, that was my terrible weekend. Although I ultimately managed to recover the databases, I still have doubts about the recovery process, such as the timestamp mismatch and the possible causes of this kind of corruption. I am also thinking of redistributing the databases across more storage groups (I have 2 spare storage groups to use) so that a disaster like this will affect fewer users.
