Service Outage - Oslo
Incident Report for Dedia AS
Postmortem

Reason for Outage Report, Oslo Location Digiplex (Post-mortem)

Primary Outage
On Saturday, April 30 at 21:43 CEST, our ISP's utility provider suffered an outage on one of their transformers.

Alarms were quickly raised on our end as well as on our ISP's end. Our ISP's technicians immediately checked that backup power generation was running and were able to confirm that it was. Staff were dispatched to ensure a successful failover to backup power generation. When they arrived at the data center at 22:45 CEST, they found that power to the site had been lost a few minutes earlier.

Upon investigating, they found that the device responsible for switching power from utility to backup generation had failed, even though backup generation was running. Power was restored by manually switching the system from utility power to backup power.

Subsequent effects
After power was restored, we found that network connectivity was still down for several customers. As it turned out, some network devices had trouble coming back up: some had lost their configuration completely, and others simply did not boot.

Work began immediately to replace failed switches or restore their configurations. The majority of our ISP's time restoring services was spent troubleshooting the network and performing manual interventions on servers that did not come back up.

Improving networking
Our ISP will look at ways of improving their routing infrastructure so that the loss of a single site will not cause downtime for other sites.
In our data centers we make use of a common spine/leaf architecture. For many services we deploy a leaf switch that communicates with multiple redundant spines. Most of the switches we ended up having issues with were leaf switches, and customers connected to those switches were effectively offline. A minimal sketch of why a leaf failure has this effect follows.
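The sketch below is purely illustrative and uses hypothetical names rather than any real inventory: it models a customer port that depends on its single leaf switch, with redundancy only on the spine side.

```python
# Illustrative sketch only (hypothetical names, not real inventory): a customer
# port hangs off a single leaf switch, while the leaf itself has redundant
# uplinks to several spines.

def customer_has_connectivity(leaf_up: bool, spines_up: list[bool]) -> bool:
    """Reachable only if the leaf works and at least one spine uplink works."""
    return leaf_up and any(spines_up)

# Losing one of the redundant spines is survivable...
print(customer_has_connectivity(leaf_up=True, spines_up=[False, True]))   # True

# ...but losing the single leaf takes its customers offline, regardless of the spines.
print(customer_has_connectivity(leaf_up=False, spines_up=[True, True]))   # False
```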

Our ISP found that even though they had backups of all configuration data, resolving these issues took a lot of engineering time that could otherwise have been spent resolving individual customer issues.

In order to improve this, our ISP will be looking at switch automation and zero-touch provisioning. This means that if a switch were to fail, we could quickly pull it out and have the replacement boot into the correct configuration automatically, with little manual intervention beyond updating the device ID in our provisioning system. A rough sketch of this idea follows.
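The sketch below is a hypothetical illustration of the provisioning lookup described above, not the ISP's actual tooling; the role names, device IDs, and function are all assumptions.

```python
# Hypothetical illustration of the zero-touch provisioning idea: a booting
# switch reports its device ID, and the provisioning system returns the stored
# configuration for the role that device fills. Names and IDs are made up.

# Backed-up configuration per network role (e.g. one entry per leaf switch).
ROLE_CONFIG = {
    "leaf-oslo-03": "hostname leaf-oslo-03\ninterface uplink1 ...\ninterface uplink2 ...",
}

# Which physical device currently fills which role.
DEVICE_TO_ROLE = {
    "SN-AAA111": "leaf-oslo-03",
}

def ztp_lookup(device_id: str) -> str:
    """Return the configuration a booting switch should apply."""
    return ROLE_CONFIG[DEVICE_TO_ROLE[device_id]]

# Replacing a failed switch then becomes a one-line change: point the role at
# the new device ID and let the replacement boot itself into the old config.
DEVICE_TO_ROLE["SN-BBB222"] = DEVICE_TO_ROLE.pop("SN-AAA111")
print(ztp_lookup("SN-BBB222"))
```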

Improving communications
During an event like this, communication is always difficult. Our customers are understandably worried about their services and about when they will be back up and running. On our end, we came to realise that our website and our single point of contact, the online support forms, were in effect offline because they were hosted in the same data center, and the offline cached version did not fully work. We have now deployed a redundant solution so that this will not occur again.

Furthermore, we will work on improving our communication routines to make sure customers have as much information as we can provide via our third-party status portal: https://dedia.statuspage.io/

Improving power resiliency
In order to understand why power was lost, it is helpful to understand how backup power generation works. When utility power is lost, microcontrollers communicate with the backup generators and with the breakers in the distribution panels.

When power is lost, these controllers ask the backup generators to start. Once the backup system has started, they turn off the utility power breakers and turn on the backup power breakers.

This transitions the power feed from utility to backup power; a simplified sketch of the intended sequence is shown below. In our case the transition did not happen. Our ISP is working with their electrical contractor to determine why, and to prevent a similar occurrence in the future.
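The sketch below is only our reading of the intended behaviour, with hypothetical Generator and Breaker stand-ins; the real controllers are the electrical contractor's equipment, not software we operate.

```python
# Illustrative sketch of the intended transfer sequence. The Generator and
# Breaker classes are hypothetical stand-ins, not real controller firmware.

import time

class Generator:
    def __init__(self):
        self.running = False
    def start(self):
        self.running = True
    def is_ready(self) -> bool:
        return self.running          # real controllers wait for stable output

class Breaker:
    def __init__(self, closed: bool):
        self.closed = closed
    def open(self):
        self.closed = False
    def close(self):
        self.closed = True

def transfer_to_backup(generator, utility_breakers, backup_breakers):
    """Start backup generation, then swap the breaker states.

    In this incident the generators started, but the breaker transition
    below never happened and power had to be switched over manually."""
    generator.start()
    while not generator.is_ready():
        time.sleep(1)                # wait for backup power to come up
    for breaker in utility_breakers:
        breaker.open()               # disconnect the dead utility feed
    for breaker in backup_breakers:
        breaker.close()              # connect the site to backup power

# Normal state: fed from utility, backup breakers open.
transfer_to_backup(Generator(), [Breaker(closed=True)], [Breaker(closed=False)])
```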

Furthermore, they will look at making the breakers remotely controllable, which they currently are not.

Summary
Last, but certainly not least, we want to apologize. We know that our services are critical to our customers' businesses. Over the coming weeks, our ISP and their engineers will spend a lot of time investigating the chain of events in detail, in order to improve our understanding of what happened and the reliability of the infrastructure.

Posted May 06, 2022 - 17:14 CEST

Resolved
All services, including VPS, web hosting, and e-mail, are now running as normal.
Posted May 01, 2022 - 21:02 CEST
Update
We are still working on restoring access to our services.
Posted May 01, 2022 - 17:37 CEST
Update
We are still working on restoring services.
Posted May 01, 2022 - 13:45 CEST
Identified
Some services have now been restored. Work is still ongoing to restore the rest.
Posted May 01, 2022 - 09:39 CEST
Investigating
We are currently experiencing an outage affecting a number of services hosted at our data center in Oslo. Troubleshooting is in progress.
Posted Apr 30, 2022 - 22:42 CEST
This incident affected: Services (cPanel (Web hosting and E-mail), VPS Cloud, Acronis Cloud Backup, Weebly Website Builder), Data centers (Digiplex, Oslo, Norway), and Websites and portals (Dedia.no, cPanel control panel, cPanel webmail, VPS Cloud control panel).