Having a disaster recovery plan in place for your business matters more now than ever, given the dramatic 44% increase in people working remotely and cybercrime costing companies billions of dollars. Resuming data service is an integral part of a business’s ability to recover and should be a key part of any disaster plan. Recovery Time Objective (RTO) and Recovery Point Objective (RPO) can help you with that. RTO and RPO are two important terms in business continuity and disaster recovery. Together, they express your business’s risk tolerance for downtime and data loss. To accurately determine your RTO and RPO, you need to fully grasp the two concepts. That’s why we’re demystifying them so you can better understand how to balance the two objectives. Ultimately, this will help you recover your systems and your data swiftly and safely if a disaster strikes your business.
What Exactly are RTO and RPO?
The easiest way to understand the difference between the two is that RTO looks toward the future and RPO looks to the past. RTO (Recovery Time Objective) refers to your tolerance for downtime: the maximum amount of time it can take to restore systems so that resources are available for use again. How long can your business function without its systems and data available? If you and your staff can manage critical business functions without email access for 24 hours, then your RTO for email is 24 hours.
RPO (Recovery Point Objective) refers to the amount of data you can afford to lose. In terms of time, it’s the point from which the data needs to be restored. The harder it is to recover or recreate the data, the shorter the RPO needs to be, and the more frequently you need to back up. Your RPO can be determined by answering the question, “How much data loss can I afford?” If you had a site loss incident, like an unexpected ransomware attack, how long would it take to bring the systems back up (RTO), and how old would the recovered data be (RPO)?
You might be thinking that RTO and RPO seem like really similar terms, so why do you need to determine both in your disaster recovery plan? Let us try to explain. A backup is only taken at specific intervals; depending on the business, that might be once an hour from 8 am to 6 pm. If recovery starts at 11 pm and it takes 6 hours to recover the data, the two numbers diverge: the most recent backup is from 6 pm, so the restored data is already 5 hours old (the RPO side), while the systems are not back until 5 am (the RTO side). Luckily, in every single case, technology can fill these gaps one way or another. So, there’s never a forced risk, because there are solutions for keeping your RPO and RTO low. Of course, as with most technology options, budget does come into play here.
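To make the arithmetic in that example concrete, here is a minimal Python sketch. It assumes the incident and the start of recovery both fall at 11 pm, and the calendar date and the helper name `last_backup_before` are made up for illustration; only the schedule and the 6-hour recovery come from the example above.

```python
from datetime import datetime, timedelta

# Schedule from the example above: hourly backups from 8 am to 6 pm.
BACKUP_HOURS = range(8, 19)  # 8, 9, ..., 18


def last_backup_before(moment: datetime) -> datetime:
    """Return the most recent scheduled backup at or before `moment`."""
    day = moment.replace(minute=0, second=0, microsecond=0)
    candidates = [day.replace(hour=h) for h in BACKUP_HOURS]
    # Include the previous day's backups in case `moment` is before 8 am.
    candidates += [c - timedelta(days=1) for c in candidates]
    return max(c for c in candidates if c <= moment)


recovery_start = datetime(2024, 1, 15, 23, 0)  # 11 pm; the date is arbitrary
recovery_duration = timedelta(hours=6)

last_backup = last_backup_before(recovery_start)
data_age = recovery_start - last_backup           # what RPO is measured against
back_online = recovery_start + recovery_duration  # what RTO is measured against

print(f"Last backup taken at: {last_backup:%I:%M %p}")
print(f"Age of restored data: {data_age}")
print(f"Systems back online:  {back_online:%I:%M %p}")
```

If the data age or the time back online exceeds what the business can tolerate, that points to backing up more often or investing in faster recovery.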
Responsibilities & Expectations
So who is responsible for what part of recovery planning? In our opinion, it’s the IT department’s responsibility to make sure the leadership team is informed and aware of what to expect in disaster recovery, so that if the time comes there are no surprises. It’s the leadership team’s responsibility to communicate its recovery expectations and any hard-line requirements. To complete the circle, it’s the responsibility of the IT team to communicate back to leadership on what they can (or possibly cannot) deliver. A strong communication link between these two teams must be established and encouraged to ensure the recovery goes smoothly and as planned. Any questions or concerns about the process and the business’s RTO and RPO should be addressed before an incident occurs, not during one.
RTOs and RPOs are not one-size-fits-all. You can assign different objectives to different processes depending on how critical they are. For example, marketing may care more about the messaging platform than about shipping, and the warehouse manager may have the opposite opinion. It’s the responsibility of the leadership team to decide what comes back online first and what the priority order is. If the entire system is down, what is most urgent in order to keep the company running? Maybe you need email up right away, because that is where order confirmations go out from, and you’re less worried about the accounting system, as it supports a process that is only worked through once a month. A staggered approach costs less than bringing up everything right away. So, keep this in mind when determining the priority order: which systems are critical for business operations, and which are just nice to have because they make life easier.
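One way to record per-system objectives and derive a staggered recovery order is sketched below in Python. The `SystemObjective` class and every number in it are assumptions chosen for illustration, not recommended values; the real figures have to come from your leadership team.

```python
from dataclasses import dataclass


@dataclass
class SystemObjective:
    name: str
    rto_hours: float  # how long the business can run without the system
    rpo_hours: float  # how much recent data the business can afford to lose


# Illustrative values only; every number below is an assumption.
objectives = [
    SystemObjective("accounting", rto_hours=72, rpo_hours=24),
    SystemObjective("email", rto_hours=4, rpo_hours=1),
    SystemObjective("shipping", rto_hours=8, rpo_hours=1),
]

# Sorting by shortest RTO first yields a staggered recovery order.
recovery_order = sorted(objectives, key=lambda s: s.rto_hours)

for position, system in enumerate(recovery_order, start=1):
    print(f"{position}. {system.name}: RTO {system.rto_hours}h, RPO {system.rpo_hours}h")
```

Keeping the objectives in one list like this also gives IT and leadership a single document to review and sign off on together.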
Example of Prioritization
Power outages are much more common and expensive than people realize. According to one IBM Global Services study, the average revenue cost of an unplanned application outage was estimated at over $400,000 per hour, with 35% of the organizations in the study experiencing an outage monthly. To help offset these costs, consider falling back on older, more manual, or redundant systems or processes during an outage. Take, for example, a hotel. If the power goes down, access to the booking system also goes down, but the hotel may still have printouts and a manual credit card swiper available. While this isn’t ideal because it’s not electronic and takes a bit more time, it’s a manual system that can be used in an emergency. So, perhaps the booking system isn’t the first priority to bring back up. The hotel can take manual payment for now and enter the guest’s credit card information into the system later, once it’s back up again.
What may be more of a priority would be the hotel’s proximity card system because this system locks doors and controls access to rooms. When this system is down, guests can’t enter their hotel rooms or lock their doors and staff can’t access certain restricted areas of the hotel. The proximity card system would be categorized as critical in this example because it’s a physical safety issue and must be brought back online as soon as possible following an outage.
Single Loss Expectancy
There is also the Single Loss Expectancy (SLE), which you can use to quantify how much money or revenue you lose per event and per process. As an example, you might estimate that losing shipping for one day costs you $1,000,000, while losing email for a day costs only $1,000. This will determine the true loss expectancy of an incident and help you prioritize your recovery time objectives. Look at your processes like this and you might discover something interesting you hadn’t thought about. Another example is a burst water main, which we’ve actually had happen to a client. It was after hours so no staff were in danger, but a water main exploded right under a server rack with such force that it cracked the concrete foundation and, of course, destroyed the facility. What would the true SLE be in this situation? While it’s a massive clean-up and a scary event, if everything is in the cloud, maybe it wouldn’t be that bad in terms of your SLE.
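The shipping and email figures above can be turned into a simple ranking, as in the Python sketch below. It assumes losses scale linearly with the length of the outage, which is a simplification, and the helper name `estimated_loss` is made up for this example.

```python
# Estimated loss per full day of downtime, taken from the example above.
loss_per_day = {
    "shipping": 1_000_000,
    "email": 1_000,
}


def estimated_loss(process: str, outage_hours: float) -> float:
    """Scale the daily estimate linearly to the outage length (an assumption)."""
    return loss_per_day[process] * outage_hours / 24


# Rank processes by what a full-day outage would cost, most expensive first.
ranked = sorted(loss_per_day, key=loss_per_day.get, reverse=True)

for process in ranked:
    full_day = estimated_loss(process, 24)
    six_hours = estimated_loss(process, 6)
    print(f"{process}: ${full_day:,.0f} per day, ${six_hours:,.0f} for a 6-hour outage")
```

The process at the top of the ranking is the one where a shorter RTO is most likely to pay for itself.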
Your RTO and RPO are a baseline and should be compared to what you can actually deliver, so you can identify what needs improvement to mitigate the costs. SLE can give further insight into which systems to prioritize and how you can improve your RTO and RPO. BT Partners’ Managed Services team can work with you to combine RTO, RPO, and SLE so you’re fully prepared in the event of a system-wide outage or disaster. Stay tuned for our next blog on this same topic, where we’ll discuss in further detail how you can easily determine your RTO and RPO.