Why is it important to break mirrors as part of your DR test?
Mirroring is the process of replicating data between two disk stores, usually across a local storage interconnect. Its cousin, replication, involves the same process but across distance using a WAN. Mirroring is usually viewed as synchronous, while replication is asynchronous (distance-induced latency creates data deltas, or differences between the state of the local and remote data stores).
Although mirroring runs in close to real time when done correctly, many storage arrays copy after the write rather than during it. In effect, data is written to disk on array #1, then it is copied by on-array software to an identical array #2 located nearby using a high-speed/high-bandwidth link. This approach also drives up the cost of mirroring by locking you into the same vendor's gear on both the primary and mirrored array.
But the real problem is one that a little software company, 21st Century Software, has been showing off in its presentations for a couple of years. They show actual screenshots of multiple customers who thought their hardware was mirroring the right data, only to discover that no data was actually being mirrored. This usually happens when the volumes in the array that hold data are reorganized by the vendor service tech or by a storage administrator, but the DR coordinator is not informed of the change. Without periodically breaking the mirror during the disaster recovery testing process, you might not notice the problem … until an actual disaster occurs.
So, why don't folks break their mirrors and test? Simple: It is a hassle to quiesce applications, flush caches to volume 1, replicate to the mirrored volume 2, then shut everything down to do a file-by-file comparison between the two. It takes time, both operator time and time taken away from production application work, and there is no certainty that systems will restart and mirroring will resume properly.
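The file-by-file comparison step can be scripted, which takes some of the operator time out of the test. What follows is only a minimal sketch in Python, under assumptions that are not in the original: applications are already quiesced, the mirror has been broken, and both volumes are mounted as ordinary file system paths (the paths /mnt/primary and /mnt/mirror in the usage note are hypothetical). The function names are illustrative and are not part of any vendor's tooling.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of one file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def tree_digests(root):
    """Map each file's path (relative to root) to its content digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in root.rglob("*")
        if p.is_file()
    }


def compare_trees(primary_root, mirror_root):
    """Compare two mounted volumes file by file.

    Assumes both volumes are quiesced and mounted; this is an
    illustrative check, not a vendor-supplied verification tool.
    """
    primary = tree_digests(primary_root)
    mirror = tree_digests(mirror_root)
    common = primary.keys() & mirror.keys()
    return {
        "missing_on_mirror": sorted(primary.keys() - mirror.keys()),
        "extra_on_mirror": sorted(mirror.keys() - primary.keys()),
        "content_differs": sorted(k for k in common if primary[k] != mirror[k]),
    }
```

Calling compare_trees("/mnt/primary", "/mnt/mirror") returns three lists. If all three are empty, the mirror held the same files with the same contents at the moment it was broken. Note that if the primary volume is unexpectedly empty, the comparison can still come back clean, so it is worth confirming that the files you expect are actually present.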
Remote replication has some of the same problems, but it may be possible to test for data deltas without quiescing the replication process. There, you also face a host of other issues related to latency and jitter. You need to be vigilant to ensure that you have a data store that you can recover.
This was first published in July 2013