Info: Overview of the latest outages and solutions
Dear customers and partners,
As you all know, edpnet has recently gone through a number of serious outages. The initial power outage had a major impact on our core infrastructure, which unfortunately caused a great deal of stress for our customers and for ourselves. We understand that it is hard to remain confident in us under such circumstances, and we apologize for all the inconvenience. Strong measures have already been taken to reduce the risk of further outages and to ensure the stability of our services.
As open communication is very important to us, we would like to give you a detailed overview of exactly what happened, and of what we have done and are going to do to prevent such situations in the future and to improve our service. Please find the overview below.
Monday 26/09/2016 4:49: the edpnet data center in Sint-Niklaas lost power; both the UPS system and the power generator failed.
Monday 26/09/2016 10:19: the power was restored and systems were back online. Because of the many problems in getting the power back online, we provided backup power from our office power feed to the most critical systems.
Monday 26/09/2016 11:00: the mail load balancers had failed, probably due to the intermittent power cuts. These devices were replaced.
Monday 26/09/2016 20:00: power was lost again in the data center. Critical applications such as voice services and the main router kept running, but the interruption caused problems with incoming calls.
Monday 26/09/2016 20:15: the root cause of the power issue was found: the UPS itself was causing the power cuts. We bypassed the UPS and restored power.
Monday 26/09/2016 21:30: incoming calls were working normally again.
After the power cut, we noticed some unusual behavior on Voice switch 2. Several maintenance interventions were performed to stabilize the system, and no further issues were noticed afterwards.
Sunday 2/10/2016 23:30: a DDoS attack started, causing connectivity problems.
Monday 3/10/2016 9:30: the target of the DDoS attack was identified and disabled, and connectivity was restored.
Tuesday 4/10/2016 0:00: planned maintenance to replace the broken UPS and to install an external bypass system, so that future UPS maintenance should not cause an interruption.
Wednesday 5/10/2016 7:44: a database issue occurred on Voice switch 2.
Wednesday 5/10/2016 8:06: the database was restored and voice services were operational again.
Thursday 6/10/2016 1:46: Voice switch 2 crashed and became completely unavailable. Voice switch 1 did not take over, so incoming calls did not arrive and outgoing calls could not be made.
Thursday 6/10/2016 6:43: we manually switched all voice traffic to Voice switch 1; incoming and outgoing calls seemed to work fine.
Thursday 6/10/2016 9:00: complaints arrived that incoming calls were failing again; it appeared that Voice switch 1 was receiving only 50% of the incoming calls.
Thursday 6/10/2016 10:57: Voice switch 2 was restored on spare hardware. All calls were arriving and there were no more voice issues.
Thursday 6/10/2016 11:30: Proximus and the supplier identified the cause of the failing redundancy between the voice switches. A new maintenance will be planned soon to make the necessary improvements.
List of improvements to be made:
- UPS replacement (done)
- External bypass system to avoid outages during UPS maintenance (done)
- Move certain critical services to our data center in Interxion Brussels (ongoing)
- Install a permanent B-power feed in our data center in Sint-Niklaas (ongoing)
- Adapt voice service for correct redundancy (done)
If you have any further questions or remarks, please do not hesitate to contact us.
Best regards,
The edpnet team