I often get frustrated, when facing I.T. outages (e.g. my home broadband going down), by the lack of detail in the information put out. So this is an attempt to provide an overview of the recent Exchange outage, with the aim of giving an idea of the scale of the problem and why it has been taking so long to restore service.
ISS I.T. Services have been working to migrate off our aging Exchange 2010 environment onto a new Exchange 2016 environment, with the eventual aim of moving the majority of users to Office 365. To do this, new HPE server hardware was purchased in a “Best Practice” configuration for the new Exchange 2016 environment. These servers were set up and added to the Exchange environment, and migration of students proceeded without any problems. Migration of the staff accounts then followed and reached about 15% of staff before the problems started.
On 23rd October, one of the 4 servers crashed and rebooted with a “Blue Screen of Death”. The reboot triggered an automatic “repair” of the file systems on the server – normal behaviour for servers these days. However, a combination of this repair and the reboot resulted in damage to some of the mailbox database stores, which prevented them from mounting.
Normally, when this happens the Exchange system fails over to a secondary copy of the database (for fault tolerance we keep two copies and a nightly backup). However, it seems that the database damage somehow replicated to the copy, and hence both “live” copies became unavailable. This resulted in a large number of mailboxes being unavailable to end users.
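The fault-tolerance model described above can be sketched roughly as follows. This is a conceptual illustration only, not actual Exchange code; the function and field names are made up for the example:

```python
# Conceptual sketch of database-copy failover: each database has two copies,
# and a failure of the active copy normally activates the healthy passive one.
# In this outage the damage replicated, so neither copy could be mounted.

def activate_best_copy(copies):
    """Return the first mountable copy, or None if every copy is damaged."""
    for copy in copies:
        if copy["healthy"]:
            return copy
    return None  # no healthy copy left -> mailboxes unavailable

# Normal case: the active copy fails, the secondary copy takes over.
db_normal = [{"name": "DB01-A", "healthy": False},
             {"name": "DB01-B", "healthy": True}]
assert activate_best_copy(db_normal)["name"] == "DB01-B"

# What happened here: the corruption reached both copies.
db_outage = [{"name": "DB02-A", "healthy": False},
             {"name": "DB02-B", "healthy": False}]
assert activate_best_copy(db_outage) is None
```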
Whilst we were trying to recover the databases from backups, other servers in the 4-server cluster also “Blue Screened”, damaging further databases. We also ran into issues when trying to restore databases from backups…
Our 3rd-party support company, having been pulled in to help, carried out a “Root Cause Analysis” and concluded that the disk systems on the servers could not keep up. This lack of performance was picked up by Exchange’s “Health Service” as a problem, which triggered the “Blue Screen” in order to protect the system; that in turn caused the disk corruption and hence the database corruption. Once recovery is complete we will be examining the systems to determine the exact cause of the disk performance problem.
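The watchdog behaviour described above can be sketched as follows. This is a hedged, conceptual illustration only; the threshold and names are invented for the example and do not reflect Exchange’s actual health-monitoring internals:

```python
# Conceptual sketch of a health watchdog: if the disks cannot respond within
# a threshold for a sustained run of checks, the server deliberately takes
# itself down ("bugcheck" / Blue Screen) so that another server can take over.

DISK_LATENCY_THRESHOLD_MS = 1000  # illustrative threshold, not Exchange's

def health_check(latencies_ms, threshold=DISK_LATENCY_THRESHOLD_MS):
    """Return 'bugcheck' when every recent disk check exceeded the threshold."""
    if latencies_ms and all(l > threshold for l in latencies_ms):
        return "bugcheck"  # force a reboot to protect the system
    return "healthy"

assert health_check([5, 12, 8]) == "healthy"          # disks keeping up
assert health_check([1500, 2100, 1800]) == "bugcheck" # disks overwhelmed
```

The point of the design is that a deliberate crash-and-failover is normally safer than limping along on failing storage; in this case the failover target was affected too, so the protection mechanism compounded the damage.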
Having now got a number of damaged databases, all on servers which could best be described as “delicate” when hit with lots of disk traffic, we and our support company looked to recover the data to our old 2010 service – a “rollback”.
Several methods have been used, involving backups, database repairs, restores, mailbox moves etc. Some of these have proven extremely slow – for example, two restores for student mailboxes (approx. 4,000 mailboxes each) have taken over a week and are still going; these users are the ones who will be missing a week or two of data. Some of the newer methods have had more success and, since they work with the recovered damaged databases, are recovering all data. The disk I/O bottleneck limits the number of activities we can progress at any one time.
Given the time taken to get the old data back, affected users will have a “dial tone” mailbox – i.e. an empty one – which enables them to send and receive mail straight away, and into which recovered mail will be merged as and when we can.
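Conceptually, the dial-tone approach works like this (an illustrative sketch in Python, not how Exchange actually merges mailboxes; the field names are made up):

```python
# Conceptual sketch of dial-tone recovery: the user keeps working in an
# initially empty mailbox, and messages recovered from the damaged databases
# are merged back in later, interleaved with the new mail by date.

def merge_recovered_mail(recovered_mail, new_mail):
    """Combine recovered and new messages into one mailbox, oldest first."""
    return sorted(recovered_mail + new_mail, key=lambda m: m["date"])

new_mail = [{"date": "2017-11-01", "subject": "Sent during the outage"}]
recovered = [{"date": "2017-10-20", "subject": "Pre-outage message"}]

mailbox = merge_recovered_mail(recovered, new_mail)
assert [m["subject"] for m in mailbox] == ["Pre-outage message",
                                           "Sent during the outage"]
```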
Once everything is rolled back to Exchange 2010 we will be looking at options going forward, such as a revised Exchange 2016 rebuild or a move to Office 365, and we will be engaging some hardware-independent Microsoft consultants to help us make the right decisions.