Disaster recovery in the cloud

Underestimate the complexity and cost of cloud computing backup and recovery and you risk missing data and lost customers. Here’s what to do.

It’s late on a Friday. You get a call from your CIO that data has been removed from an XYZ public cloud server, and they need it back ASAP.

It gets worse. First, there is no current backup copy of the data. The backups you expected your cloud provider to perform on your behalf cover only the provider’s core systems. That means they’re functionally unusable for restoring your data. Second, there is no business continuity/disaster recovery (BCDR) strategy, procedure, or playbook in place to deal with breaches or disasters. Everyone assumed the cloud was doing that automatically. That’s why we’re in the cloud, right?

These are common misconceptions. Equally common is the assumption that those charged with keeping cloud systems working and safe would have a handle on this problem by now. There are too many cases where that assumption is incorrect. In other words, you’re probably doing cloud BCDR wrong and need to do something about it.

The traditional approach to BCDR focuses on physical infrastructure and on-premises solutions. Now companies must adapt to the dynamic nature of cloud computing. Cloud-based systems offer more flexibility, scalability, and cost efficiency, but they also introduce new complexities and vulnerabilities. Organizations must adopt modern BCDR practices that align with the unique characteristics of the cloud.

The cloud provider won’t help you

Most cloud providers have financial incentives to care about your systems’ uptime and have automated BCDR processes and mechanisms in place. However, it’s still a “shared responsibility” model, meaning they will keep things running, but you’re responsible for protecting your own data. This also applies to security, governance, and disaster recovery, which can be topics for other articles.

How can you do cloud BCDR right? Most enterprises need to work on the following:

Testing and validation. This is the most common problem I encounter. Enterprises often overlook the importance of regular testing and validation of their disaster recovery plans in a cloud environment. Testing ensures that the disaster recovery mechanisms function as expected and that recovery objectives can be met within the desired timeframe. Neglecting proper testing may lead to false assumptions about recovery capabilities and the inability to effectively restore services during a disaster.

Ask for BCDR test plans and testing results to ensure everything is protected. If neither exists, it’s usually because everyone just assumes that everything works. That assumption is beyond risky.
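One piece of a BCDR test can be automated: restore a backup into a scratch location and confirm that every file came back intact. The following is a minimal sketch of that idea, not a complete test plan. The function names, the sample files, and the use of local directories in place of real cloud storage are all assumptions made for illustration.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(source_dir: Path, restored_dir: Path) -> list[str]:
    """Return a finding for each source file that is missing or altered after a restore."""
    problems = []
    for source_file in sorted(p for p in source_dir.rglob("*") if p.is_file()):
        relative = source_file.relative_to(source_dir)
        restored_file = restored_dir / relative
        if not restored_file.is_file():
            problems.append(f"missing: {relative.as_posix()}")
        elif sha256_of(source_file) != sha256_of(restored_file):
            problems.append(f"corrupt: {relative.as_posix()}")
    return problems


def run_demo() -> list[str]:
    """Hypothetical drill: local folders stand in for production data and a restored backup."""
    with tempfile.TemporaryDirectory() as workspace:
        source = Path(workspace) / "source"
        restored = Path(workspace) / "restored"
        source.mkdir()
        (source / "orders.csv").write_text("id,total\n1,9.99\n")
        (source / "customers.csv").write_text("id,name\n1,Ada\n")
        shutil.copytree(source, restored)
        # Simulate a bad restore: one file truncated, one file lost.
        (restored / "orders.csv").write_text("id,total\n")
        (restored / "customers.csv").unlink()
        return verify_restore(source, restored)


if __name__ == "__main__":
    for finding in run_demo():
        print(finding)
```

An empty result from verify_restore is the evidence you want attached to a test report. Anything else means the backup would have failed you in a real disaster.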

Data replication and backup. Enterprises may neglect to implement robust data replication and backup strategies in the cloud. Relying solely on the cloud provider’s infrastructure for data redundancy is dangerous. It is important to have proper backups, including offsite ones, to safeguard against data loss or corruption.

It’s rarely true that data stored on cloud-based systems is already protected. Assume that protection does not exist (even though there is normally some recovery mechanism) and that backup and recovery are your responsibility alone. You’ll be much safer.

Geographic redundancy. Cloud providers offer geographically distributed data centers, but some enterprises fail to take advantage of this feature. Running in a single region, with no redundant deployment elsewhere, leaves that region as a single point of failure, most often exposed by a natural disaster. With geographic redundancy, enterprises can enhance resilience and mitigate the impact of regional disruptions.

It’s often better to choose SaaS-based backup and recovery systems that move data to a redundant remote server. They also manage geographical redundancy for you.
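Whether you manage redundancy yourself or hand it to a service, it is worth checking that every data set really does have copies in more than one region. Here is a small sketch of that check. The inventory format, the data set names, and the region names are hypothetical; in practice the list would come from your backup tool’s reporting.

```python
from collections import defaultdict


def datasets_lacking_redundancy(copies, minimum_regions=2):
    """Return the data sets whose backup copies span fewer than minimum_regions regions."""
    regions_by_dataset = defaultdict(set)
    for dataset, region in copies:
        regions_by_dataset[dataset].add(region)
    return sorted(
        name
        for name, regions in regions_by_dataset.items()
        if len(regions) < minimum_regions
    )


# Hypothetical inventory of (data set, region) pairs, one entry per backup copy.
inventory = [
    ("orders-db", "region-east"),
    ("orders-db", "region-west"),
    ("customer-files", "region-east"),
    ("customer-files", "region-east"),  # a second copy, but in the same region
]

at_risk = datasets_lacking_redundancy(inventory)
print(at_risk)
```

Note that two copies in the same region count as one. A regional outage takes out both.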

Recovery time objectives (RTO) and recovery point objectives (RPO). Neglecting to define and align RTOs and RPOs with cloud capabilities can result in inadequate recovery strategies. It’s crucial for enterprises to understand the time it takes to recover their applications and systems in the cloud, as well as the amount of data that might be lost in the event of a disruption. Aligning RTOs and RPOs with cloud capabilities helps set realistic expectations and allows for appropriate recovery planning.
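The arithmetic behind these objectives is simple enough to sketch. In the worst case, a failure hits just before the next backup runs, so the data at risk equals one full backup interval. That interval has to fit inside the RPO, and the restore time you measured in testing has to fit inside the RTO. The class, the function, and the numbers below are assumptions chosen for illustration, not figures from any real system.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class RecoveryTargets:
    rto: timedelta  # longest acceptable time to restore service
    rpo: timedelta  # largest acceptable window of lost data


def check_targets(targets, backup_interval, measured_restore_time):
    """Compare the backup schedule and a measured restore time against the stated targets."""
    findings = []
    # Worst case: everything written since the last backup is lost.
    if backup_interval > targets.rpo:
        findings.append(
            f"RPO at risk: backups every {backup_interval} "
            f"can lose more than the {targets.rpo} target"
        )
    if measured_restore_time > targets.rto:
        findings.append(
            f"RTO at risk: restore took {measured_restore_time}, "
            f"over the {targets.rto} target"
        )
    return findings


# Hypothetical example: the business wants at most 1 hour of lost data and
# 4 hours of downtime, but backups run nightly and the last drill took 3 hours.
targets = RecoveryTargets(rto=timedelta(hours=4), rpo=timedelta(hours=1))
findings = check_targets(
    targets,
    backup_interval=timedelta(hours=24),
    measured_restore_time=timedelta(hours=3),
)
for finding in findings:
    print(finding)
```

In this made-up case the restore is fast enough, but a nightly backup cannot support a one-hour RPO. Either the backup frequency goes up or the objective gets renegotiated.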

Communication and stakeholder management. Effective communication during a disaster is critical. Enterprises must establish clear communication channels and protocols for informing stakeholders, employees, and customers about the impact of disruptions and the steps being taken for recovery. This plan should be in writing.

Neglected communication is usually much more costly than any damage done by a data breach or loss. There needs to be a solid playbook that details who is contacted and for what purpose, and how this will be messaged inside and outside of the company.

Cheap fixes

Everything I’ve listed is inexpensive to deploy. Considering the cost of the risk, it’s the best bargain you can find these days. In fact, once these fixes are implemented and some operational costs removed (such as having to do backup and recovery intra-cloud), it’s usually cheaper to use outside SaaS-based backup and recovery services. You’ll wonder why these things were not fixed years ago because it’s always better to work smarter, not harder.

David S. Linthicum is an internationally recognized industry expert and thought leader. He has authored 13 books on computing, the latest of which is An Insider’s Guide to Cloud Computing. David’s views are his own.
