Redsauce's Software QA Blog

The Data Center is on fire! Keep calm and DRP

Posted by Pablo Gómez


I'm going to tell you a story that happened to us on the morning of Wednesday, March 10, 2021. Computers and phones started flooding with alerts from our infrastructure monitoring systems. Several machines had stopped responding and providing service, and time seemed to be running much faster than usual. But what the hell happened?


As the morning progressed, we learned that at dawn in Strasbourg a data center belonging to OVH (where we had several machines hosted) had burned down completely, apparently due to an electrical fault. The fire took a long time to put out for several reasons: the building's construction helped the flames spread, and the batteries and emergency generators, which kicked in automatically when the mains were cut, kept supplying the very power the firefighters were trying to shut off.


And this was no ordinary data center. It is estimated that at the peak of the fire, about 464,000 domain names went offline, affecting 3.6 million websites. Fortunately, no one was injured, but it remains the biggest disaster to have occurred in a data center (as far as we know).


Going back to our case, several clients' web services and a handful of internal applications were affected. How did we handle the incident? What did we do well, what did we do wrong, and what lessons did we learn for the future? I'll tell you. To begin with, we were grateful to have our DRP at hand.

What is a DRP (Disaster Recovery Plan)?

It is usually a document and the first place you should turn to when facing a serious problem in your systems. Think of a fire, flood, coffee machine breakdown, indefinite power outage, disk failure… It should be an important part of your Information Security Management System (ISMS).


RECOMMENDATION 1: Above all, have copies of this document in different locations. Additionally, it must be easily accessible—you can’t waste hours looking for it. It should also be self-contained: if it includes links to other resources and those are unavailable, crucial information will be missing. The day of the disaster, we used it extensively, and although it wasn't perfect, it served as a solid reference.

What should a DRP contain?

In our case, following intuition and cybersecurity best practices, the document included an introduction explaining its purpose, the different types of backups, redundancy strategies, recovery time objectives… In the end, what we had in our hands while the data center was burning was a document of over 50 pages. So:


RECOMMENDATION 2: Split your DRP into two parts. One can be called the "Operational Plan" for use in the moment of crisis—clear, concise, and quick to follow. The other can be the "Strategic Plan," containing useful background information that isn't immediately practical during an emergency. This distinction helped us later to refine the document and reduce unnecessary stress.

Operational Plan of a DRP

As mentioned, this is designed to save time and get straight to the point. It can include:

  1. How to communicate the emergency: Who to notify, through what channels (phone, Slack, Discord…), which groups or rooms, etc. Each person or team should have clear responsibilities to avoid overlap and ensure parallel workflows.

  2. Immediate actions, which will depend on the specific issue: shutting down power, stopping servers, activating backup tasks…

  3. Restoring critical services: From highest to lowest priority, the plan should specify which systems need to be restored first and how to do it (a minimal sketch follows at the end of this list).


    RECOMMENDATION 3: Be generous with details—don't skimp on clarity. The steps should be so well-documented that a high school student could execute them without hesitation. It's not uncommon to find missing steps for accessing a repository, a container, or a tunnel because documentation assumes things are already set up. That day, we realized that several procedures required public keys that were assumed to be pre-installed—but they weren’t, delaying our recovery.

  4. Communication protocols: What to say to the team and clients throughout the process. Avoid lying, exaggerating, or downplaying the situation—your team and clients will appreciate honesty.
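
As an illustration of point 3, here is a minimal Python sketch of how a priority-ordered restore list can be kept both readable and executable. The service names and the restore_db.sh/deploy.sh scripts are hypothetical placeholders, not our actual tooling:

    import subprocess

    # Hypothetical restore order, highest priority first.
    # Each entry pairs a service with the (illustrative) command that restores it.
    RESTORE_PLAN = [
        ("client-webapp-db", ["bash", "restore_db.sh", "client-webapp"]),
        ("client-webapp", ["bash", "deploy.sh", "client-webapp"]),
        ("internal-crm", ["bash", "deploy.sh", "internal-crm"]),
    ]

    def restore_in_priority_order() -> None:
        for service, command in RESTORE_PLAN:
            print(f"Restoring {service} ...")
            if subprocess.run(command).returncode != 0:
                # Stop here: the operational plan should say who to escalate to.
                raise SystemExit(f"Restore of {service} failed, escalate before continuing")
            print(f"{service} is back online")

    if __name__ == "__main__":
        restore_in_priority_order()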

Strategic Plan of a DRP

This document is generally more detailed and contains everything that supports the operational plan. It should include topics such as IT infrastructure, types of backups, backup strategy, risk analysis, recovery objectives and timelines, DRP maintenance, inventory of hardware and software, and a record of disaster drills.


I won’t go into detail on each point because there's plenty of literature online, including responses from ChatGPT or DeepSeek. However, I do want to emphasize the importance of conducting regular disaster recovery drills.


RECOMMENDATION 4: Scheduling periodic drills helps keep the team prepared. Using masked backup data to recreate pre-production environments serves two purposes at once: validating the backups and keeping a fresh environment available. This can be automated with CI/CD servers such as Jenkins, GitLab, or GitHub Actions, among others.
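
As an illustration (our actual pipelines differ), such a drill can be reduced to a script that a scheduled CI job runs. The helper scripts and the pre-production URL below are hypothetical placeholders: the job restores the latest masked dump into a pre-production database and then smoke-tests the result. A failing drill is a finding in itself, because it means either the backup or the procedure needs fixing.

    import subprocess
    import sys

    # Hypothetical drill: restore the latest masked backup into pre-production
    # and check that the application still answers. All names are placeholders.
    STEPS = [
        ["bash", "fetch_latest_masked_dump.sh", "/tmp/masked.dump"],      # pull the newest masked dump
        ["bash", "restore_dump.sh", "preprod-db", "/tmp/masked.dump"],    # load it into pre-production
        ["bash", "smoke_test.sh", "https://preprod.example.com/health"],  # basic health check
    ]

    def run_drill() -> int:
        for step in STEPS:
            print("Running:", " ".join(step))
            if subprocess.run(step).returncode != 0:
                print("Drill failed at:", " ".join(step), file=sys.stderr)
                return 1
        print("Drill passed: backup restored and pre-production is healthy")
        return 0

    if __name__ == "__main__":
        sys.exit(run_drill())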

About Backups

Each system is different and has its own requirements. I won’t go into backup frequency, backup types (full, incremental, differential), encryption, or retention periods.


However, here are some key lessons we've learned over the years:


RECOMMENDATION 5:

  • Always have at least two backups in two completely separate locations. Having two machines in the same data center doesn’t count—especially if the entire data center catches fire! The Operational Plan should account for both copies.

  • Having container snapshots with full machines (OS, code, and database) is convenient—if disaster strikes, you restore the entire container and you’re good to go. But what if the problem is a virus and you don’t know when the infection started? In such cases, separating database backups from code backups is crucial.

  • Automate, automate, automate. Backup and restore processes should be as automated as possible to minimize human error and recovery time (a minimal sketch follows this list). Knowing that automation is in place, that backups exist, and that they work lets me sleep much better at night.
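
Pulling these points together, a minimal sketch of a nightly backup job might look like the following Python script. The database name, hosts, and paths are hypothetical, and a real setup would add encryption, retention, and alerting; the point is that the database dump travels to two independent locations and stays separate from code or OS snapshots:

    import datetime
    import subprocess

    # Two independent destinations: different providers or regions,
    # never two machines in the same data center. (Hypothetical hosts.)
    DESTINATIONS = [
        "backup-a.example.com:/backups/db/",
        "backup-b.example.org:/backups/db/",
    ]

    def nightly_backup() -> None:
        stamp = datetime.date.today().isoformat()
        dump_file = f"/tmp/app_db_{stamp}.sql.gz"

        # Dump only the database; code and OS snapshots are handled separately,
        # so an infected machine image does not poison every copy.
        subprocess.run(
            ["bash", "-c", f"set -o pipefail; pg_dump app_db | gzip > {dump_file}"],
            check=True,
        )

        # Ship the same artifact to both locations; any failure aborts loudly.
        for destination in DESTINATIONS:
            subprocess.run(["rsync", "-av", dump_file, destination], check=True)

    if __name__ == "__main__":
        nightly_backup()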

Postmortem

After the storm comes the calm. Within 24 hours of the fire, 95% of our clients’ services were back online, and within 48 hours, everything was fully restored. Our internal applications took a few extra hours over the weekend to fine-tune.


In the end, we all kept our jobs, and it was time to look back nostalgically at those moments—those long (often late-night) hours of camaraderie, coffee, and teamwork.


RECOMMENDATION 6: Write a postmortem report detailing the root cause of the disaster, what was done well, and, most importantly, what could be improved for the future. Include a list of action items with assigned owners to improve the process. In our case, this led us to further automate some backup restorations that relied on manual steps and to simplify our infrastructure.


Pilots have emergency procedures that guide them step by step through an incident to keep their passengers safe. The DRP reminds me of those procedures: worked on daily behind the scenes and, thankfully, rarely needed, because they achieve their goal of keeping us safe.


Our experience using the DRP as a survival guide that week was, all things considered, very positive. The thought of facing it without one is truly daunting. Maybe it’s time to check yours, review it, update it, and who knows… perhaps even put it into practice tomorrow?

About us

You have reached the blog of Redsauce, a team of experts in QA and software development. Here we will talk about agile testing, automation, programming, cybersecurity… Welcome!