Monitoring and managing recovery time objectives (RTOs) and recovery point objectives (RPOs)

Jeff Boles talks about how to monitor and manage recovery time objectives (RTOs) and recovery point objectives (RPOs) in this Ask the Expert response.

How can I monitor and manage recovery time objectives (RTOs) and recovery point objectives (RPOs)?
There are a number of tools on the market built to help you monitor and manage recovery time objectives (RTOs) and recovery point objectives (RPOs), but if you're an SMB, you may not be able to afford one of those tools, or they may be much more than you need.

The first step in identifying your RPO/RTO is to pay attention to design and then to post-implementation testing. These elements sound simple, but I recommend you invest some decent time here. Attention to design and testing is more necessary than ever because the complexity of the data protection infrastructure is increasing, and consequently, so is the risk of a single component making your plan blow up. The chance of miscalculation in any step of the backup or restore process is rising, especially with optimized disk-to-disk (D2D) storage systems, deduplication in the backup software, continuous data protection (CDP) technologies, application backups, snapshots, replicas, image- versus file-level backups and restores, and multi-tiered backup storage. Given this complexity, I recommend approaching your backup design with the following simplified process:

  • Identify and group the different types of recovery events you routinely run into, as well as those that you foresee as being important. This should be a fundamental component of setting recovery time objective and recovery point objective in the first place.
  • Target RPO/RTO levels and groupings. With these different groupings in mind, you can target the RPO/RTO levels you want to design for. You don't yet know what these levels will cost, so you'll start to evaluate that in the next steps. As an example, you may want to recover an email server to within five minutes of a crash (RPO), and you may want to get it there in an hour (RTO). But for file servers, or SharePoint servers, the story may be entirely different. If you have 10 or 50 applications, you probably don't want to design to 50 different RPOs and RTOs, so figure out what makes sensible groupings for you so you're only designing to a few sets of RTOs/RPOs.
  • Use published speeds and feeds to design your backup infrastructure to deliver to your most aggressive RTO/RPO goals.
  • Design your backup jobs and assess their impact on your recovery point objectives. For example, if you're trying to protect data on two email servers offsite with a one-hour RPO, then your bandwidth will have to be sufficient to replicate your hourly data change within an hour, and write it to disk. If you're using post-process deduplication, and the data doesn't replicate until after it is fully deduped, then you may need a solution that can replicate the full data set in 30 minutes, because it may spend the other 30 minutes performing deduplication. You have to carefully review speeds and feeds, and draw out how your data is moved to a fully protected state so you can assess each step of the process.
  • Use your groupings of recovery events and RTO/RPO goals to examine each recovery scenario. Look at how they impact each other and what happens if those recovery tasks have to run simultaneously. For example, if your email server and file server both crash, and you run your tape infrastructure at full speed for an hour to recover your mail server, then you're not going to be able to use that tape drive to restore your file server until after that hour.
  • Evaluate how the solution you designed actually performs with a copy of your data. Where your vendor makes this impossible to do short of a full-blown implementation, you should look for third-party validations. You want something from a credible and well-rounded third party that has the expertise to understand and translate results into something meaningful. Also, the third party must be unwilling to bend to the demands of a vendor to inflate results or claims. Understand that this will be synthetic testing, but the right third party will make these test validations representative of, or easily translated to, a number of different situations.
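
The arithmetic behind the one-hour RPO example and the shared tape drive example above can be sketched in a few lines of Python. This is only a rough illustration under stated assumptions: the function names, the 20 GB change figure and the restore durations are hypothetical, throughput is treated as constant, and decimal units (1 GB = 8,000 megabits) are used.

```python
def required_replication_mbps(change_gb, rpo_minutes, dedupe_minutes=0.0):
    """Megabits per second needed to replicate one RPO interval's worth of
    changed data, after reserving time for a post-process step such as
    deduplication that must finish before replication starts."""
    window_minutes = rpo_minutes - dedupe_minutes
    if window_minutes <= 0:
        raise ValueError("post-process step consumes the entire RPO window")
    megabits = change_gb * 8 * 1000  # decimal GB -> megabits
    return megabits / (window_minutes * 60)


def serialized_restore_results(restores):
    """restores: list of (name, restore_hours, rto_hours) tuples, in the order
    they would run on one shared device (for example, a single tape drive).
    Returns a list of (name, finish_hours, meets_rto) tuples."""
    elapsed = 0.0
    results = []
    for name, restore_hours, rto_hours in restores:
        elapsed += restore_hours  # each restore waits for the ones before it
        results.append((name, elapsed, elapsed <= rto_hours))
    return results


# Hypothetical figures: 20 GB of change per hour, one-hour RPO.
no_dedupe = required_replication_mbps(20, 60)
with_dedupe = required_replication_mbps(20, 60, dedupe_minutes=30)

# Hypothetical figures: mail takes 1 hour, file server takes 2 hours.
plan = serialized_restore_results([("mail", 1.0, 1.0), ("file", 2.0, 2.0)])
```

With these made-up numbers, reserving 30 minutes for deduplication doubles the required replication throughput, and the file server misses a two-hour RTO because it cannot start restoring until the mail server restore has released the drive.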

With all of this in hand, you'll have a pretty good understanding of what your starting recovery time objective/recovery point objective is, and you can use this as a baseline to understand the impact of data growth and/or changes in processes over time. Irrespective of the size of your organization, you should document this and restrain yourself from making changes to backup jobs, execution order, or other aspects of your protection processes without doing a full review of what might be impacted, and then documenting the new state of affairs. By doing so, you can keep a pretty good eye on what type of recovery objectives you are capable of meeting as your organization grows and changes.
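
One way to put a documented baseline to work is to project how long it will keep meeting your RTO as data grows. The sketch below is an assumption-laden illustration rather than a method from any particular product: the function name and all figures are hypothetical, and it assumes compound monthly growth and a restore throughput that stays constant.

```python
def months_until_rto_breach(data_gb, restore_gb_per_hour, rto_hours,
                            monthly_growth, max_months=600):
    """Return the first month (0 = today) in which a full restore would take
    longer than the RTO, or None if that does not happen within max_months."""
    restorable_gb = restore_gb_per_hour * rto_hours  # most data restorable in time
    size_gb = data_gb
    for month in range(max_months + 1):
        if size_gb > restorable_gb:
            return month
        size_gb *= 1 + monthly_growth  # compound growth, assumed steady
    return None


# Hypothetical baseline: 1,000 GB today, 400 GB/hour measured restore rate,
# four-hour RTO, 5% growth per month.
breach_month = months_until_rto_breach(1000, 400, 4, 0.05)
```

With these hypothetical inputs the baseline holds for nine months and is breached in month 10, which is the kind of result worth recording next to the documented baseline and revisiting whenever backup jobs or infrastructure change.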

With regard to the tools to use, the tried and true data protection management (DPM) products all play here -- including Aptare Inc., Bocada Inc., EMC Corp.'s WysDM and Tek-Tools. All of those products can offer baseline assessments of your key data protection metrics, and will deliver pretty good capabilities around assessing RTOs. There may be more differentiation between them in how well they handle RPO, especially if you're using multiple protection technologies like snapshots, replicas, virtual server proxies and more. In my assessment, they will vary most in how proactively they can manage the data protection practice, including predicting errors and issues before they happen, and in making recommendations on how to correct errors.
