Monday afternoon we had a critical failure of an Oracle database at work. Within a few minutes of the fault occurring, I started seeing block corruption errors whilst reviewing some information in the production environment. At that stage I thought we might have dropped a disk in the SAN, but referred it on to our database administrator to rectify.
As is quite common, our environment consists of multiple Oracle 10g RAC nodes connected to a shared data source. The shared data source in this instance is a SAN, where we have a whole bunch of disks configured into groups for redundancy and performance. As soon as the database administrator became involved, it became apparent that we hadn't dropped a single disk but had in fact lost access to an entire group of disks within the SAN.
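For anyone wanting to confirm that sort of failure quickly: if the shared storage is presented through Oracle ASM (a sketch under that assumption; your configuration may differ), a couple of queries from any RAC node will show whether a disk group or its member disks have disappeared:

```
-- Hypothetical sketch, assuming the SAN disks are managed as ASM disk groups.
-- Mount state and redundancy type of each disk group:
SELECT name, state, type, total_mb, free_mb
FROM   v$asm_diskgroup;

-- Individual member disks; missing or offlined disks stand out here:
SELECT group_number, path, state, mode_status
FROM   v$asm_disk
ORDER  BY group_number, path;
```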
Due to the manner in which the SAN and Oracle are configured, we were not in a position where running in a RAID environment was going to help. If we had dropped a single disk or a subset of disks from any group within the SAN, everything would have been fine; unfortunately we dropped an entire disk group. The end result was that we were forced to roll back our database to the previous night's backup.
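For those who haven't had the pleasure, the restore itself boils down to something like the following RMAN session. This is a minimal sketch only; the UNTIL TIME value is a placeholder for the backup completion time, and the exact steps depend on which files survived:

```
# Minimal sketch of an incomplete (point-in-time) recovery with RMAN.
STARTUP MOUNT;
RUN {
  # Placeholder time: roll forward only as far as the previous night's backup
  SET UNTIL TIME "TO_DATE('2008-03-10 02:00', 'YYYY-MM-DD HH24:MI')";
  RESTORE DATABASE;
  RECOVER DATABASE;
}
# An incomplete recovery means the database must be opened with RESETLOGS
ALTER DATABASE OPEN RESETLOGS;
```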
The following days have been spent recovering the lost day's data through various checks and balances, but it takes a lot of time and energy from everyone involved to make that happen. We've been fortunate enough to trade for several years without ever needing to roll back our production database due to a significant event, which I suppose we should be thankful for.
After three years without performing a production disaster recovery, had we become complacent about data restoration and recovery because we hadn't really needed it before? I believe that since we haven't had to perform a disaster recovery in some three years, our data recovery guidelines have become out of date. Whilst a daily backup may have been more than sufficient for this particular database two or three years ago, the business has undergone significant growth since then. The daily changeset for this database is now large enough that, whilst having a daily backup remains critical, it is no longer enough on its own: recovering a full day's worth of changes takes a significant amount of work to complete in a moderate time frame.
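One obvious improvement, and I'll stress this is a sketch of a fairly standard RMAN approach rather than anything we've settled on, is to supplement the nightly full backup with incremental backups, so the amount of change to re-apply after a restore is measured in hours rather than a full day:

```
# Sketch of a standard RMAN incremental strategy; the file path is hypothetical.
# One-off: have RMAN track changed blocks so level 1 backups stay fast.
SQL "ALTER DATABASE ENABLE BLOCK CHANGE TRACKING
     USING FILE ''/u01/oradata/prod/change_tracking.f''";

# Weekly baseline:
BACKUP INCREMENTAL LEVEL 0 DATABASE PLUS ARCHIVELOG;

# Nightly (or more frequent) deltas on top of the baseline:
BACKUP INCREMENTAL LEVEL 1 DATABASE PLUS ARCHIVELOG;
```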
As a direct result of this disaster, we're going to be reviewing our data recovery policies shortly. The outcome of that discussion will most likely be that we require higher levels of redundancy in our environment to reduce the impact of a failure. Whilst it would be ideal to have an entire copy of our production hardware, it probably isn't going to be a cost-effective solution. I'm open to suggestions about what sort of data recovery we implement; however, I think that having some sort of independent warm spare may win out.
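To put a little flesh on that: Oracle already ships the machinery for a warm spare in the form of Data Guard, where a physical standby on modest hardware keeps an up-to-date copy of the database without duplicating the entire production environment. As a rough, hypothetical sketch of the primary-side initialisation parameters (the service and unique names are made up):

```
# Hypothetical primary-side init.ora/spfile settings for a physical standby.
# 'standby_db' is a made-up Oracle Net service name for the warm spare.
log_archive_dest_2='SERVICE=standby_db LGWR ASYNC VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=standby_db'
log_archive_dest_state_2='ENABLE'
standby_file_management='AUTO'   # mirror datafile additions on the standby automatically
```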
What have we learned from this whole event?
- daily backup of data is mandatory
- daily backup of data may not be sufficient
- verify that your backup sets are valid; an invalid backup isn't worth the media it's stored on (see the RMAN sketch after this list)
- be vigilant about keeping data recovery strategies in step with business growth and expectations
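As promised in the list above, here's roughly what validating backups looks like in RMAN. A sketch only, and none of these commands restores anything; they just read and check:

```
# Confirm the backup pieces RMAN has catalogued still exist on the media:
CROSSCHECK BACKUP;

# Read every block of the backups needed for a restore, without restoring:
RESTORE DATABASE VALIDATE;

# Check the live datafiles for physical and logical corruption:
BACKUP VALIDATE CHECK LOGICAL DATABASE;
```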
Maybe periodic disasters are actually healthy for a business? Whilst every business strives to avoid any sort of downtime, I expect that because certain systems are typically highly available, disaster recovery isn't put through its paces often or rigorously enough, which may result in longer downtimes or complete loss of data when an actual disaster recovery is required.