Telehouse North Data Centre Outage
Incident Report for I.T Communications Limited
Postmortem

OVERVIEW
A failed line card in an MLXe device in Telehouse North (THN) resulted in a loss of
connectivity. For the vast majority of customers, the network recovered by routing around
the affected site as expected within minutes. Leased line customers with circuits
terminating in THN, and traffic for a limited number of customers routing to certain IP
ranges, saw a larger impact on their connectivity.

CAUSE OF INCIDENT
Background:
The device in question was a chassis-based Brocade MLXe. These were originally
chosen because the chassis-based design offered a modular approach that enabled easy
expansion and fault tolerance. However, we had been disappointed with the real-world
performance of the Brocade hardware and had just undertaken a long programme of
works to replace this Brocade hardware at Layer 2 with dedicated, best-of-breed Juniper
devices. This upgrade work was the result of intensive work with Brocade in an attempt to
remediate the long convergence times their products were delivering. After much
investigation by the most senior Brocade engineers and a programme of rolling updates,
convergence times were reduced from approximately 15 minutes to approximately 6
minutes. This was still clearly suboptimal, and Brocade themselves concluded that their
network hardware was incapable of yielding any further improvement. This necessitated
our investment in building a brand-new core network, not based on Brocade. The first
phase of the new core network was recently completed, with Juniper devices now handling
Layer 2 traffic on behalf of our Brocade core. Phase two, which we are partway through, is
to physically move all customer interconnects from Brocade to Juniper. Phase three is to
replace the MLXe and other Brocade devices on our core network with dedicated Juniper
hardware.

Incident:
Our 24/7 monitoring team detected the issue and we immediately arranged for the router
to be rebooted by remote staff, while senior engineers began travelling to site to
investigate further. When the reboot failed, a Priority 1 case was raised with Brocade and
methodical fault finding identified a failed line card. A replacement part was immediately
ordered from Brocade and, as a backup, additional engineers simultaneously brought to
site the three spare line cards we keep in stock. Installing the replacement line card did
not immediately resolve the issue, necessitating detailed, meticulous troubleshooting with
senior Brocade engineers. In parallel with these vendor-led investigations, our engineers
acquired and installed a Brocade CER router to operate in place of the MLXe, should
Brocade be unable to identify the fault. Switching to the CER router would have required
6+ hours of careful manual configuration updates to restore services for all customers, so
repairing the existing MLXe hardware was prioritised as the quickest route to restoring
services. Our senior network engineers worked through the night, identified the issue with
the replacement line card, and restored the MLXe to normal operation.

Timeline:
21:49 Issue began and was detected by our 24/7 monitoring team

21:55 Initial network convergence complete, restoring the majority of services. Intermittent
connectivity remained for a limited number of customers accessing certain IP ranges, and
for leased line customers terminating at THN

22:00 Problem device and location identified

23:00 Physical reboot of router, engineers travelling to THN and SC1, and P1 case
raised with Brocade.

23:00–00:00 Engineers arrive on site, diagnostics begin, line card identified, replacement
ordered from vendor

01:00 Engineer brings 3 stocked line cards from SC1 to THN

02:00 Line card replaced, but would not integrate owing to a firmware issue, which
we escalated to Brocade

04:00 Identified and implemented a fix for certain traffic that was being blackholed
by the line card failure (an illustrative reachability check is sketched after this timeline).
Also continued working with senior Brocade engineers on resolving the fault with
integrating the replacement line cards with the chassis

05:00 Made use of our MLXe in Telehouse East for enhanced diagnostics with
Brocade. Also instigated Plan B of using Brocade CER routers in place of the MLXe in THN.

06:00 Engineer delivers CER from SC1 to THN. Also continued working with senior
Brocade engineers on resolving the fault with integrating the replacement line cards with
the chassis

10:00 CER racked up and preliminary config begins whilst continuing to work with
senior Brocade engineers on integrating replacement line cards

12:00 Cause of line card not integrating identified by our engineers and resolved

12:10 Card and ports come online, affected services begin to return to normal

12:12 Affected services restored
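
To make the 04:00 interim fix above concrete, the check below is a minimal, illustrative
sketch of how reachability into the affected IP ranges could be re-verified once traffic was
no longer being blackholed. It is not the tooling used on the night: the prefixes shown are
RFC 5737 documentation ranges standing in for the real affected ranges, and Linux ping
flags are assumed.

#!/usr/bin/env python3
"""Illustrative only: probe one representative address in each affected IP
range to confirm that traffic previously blackholed by the failed line card
is reachable again. The prefixes are RFC 5737 documentation ranges standing
in for the real affected ranges."""

import subprocess

# Hypothetical placeholders: one test address per affected prefix.
TEST_ADDRESSES = {
    "192.0.2.0/24": "192.0.2.1",
    "198.51.100.0/24": "198.51.100.1",
    "203.0.113.0/24": "203.0.113.1",
}


def is_reachable(address: str, count: int = 3, timeout_s: int = 2) -> bool:
    """Return True if at least one ICMP echo reply is received (Linux ping flags)."""
    result = subprocess.run(
        ["ping", "-c", str(count), "-W", str(timeout_s), address],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def main() -> None:
    for prefix, address in TEST_ADDRESSES.items():
        state = "reachable" if is_reachable(address) else "still unreachable"
        print(f"{prefix:<18} via {address:<15} {state}")


if __name__ == "__main__":
    main()

Run from a vantage point outside THN, a probe of this sort gives a quick per-prefix answer
on whether a blackhole has cleared, without waiting for customer reports.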

RESOLUTION DETAILS
Once the issues with the line card were resolved, it was quickly installed and power
cycled, which restored normal service.

FOLLOW UP ACTIONS
We will be reviewing our stocked-spares policy, including regular config and firmware
reviews, to ensure that replacing components (such as line cards) can happen as quickly
as possible in future. In addition, we are continuing our ongoing phase two plans and are
expediting phase three plans, which will remove all Brocade hardware from our core
network. The MLXes will be replaced by best-of-breed Juniper hardware, deployed in a
resilient configuration. With separate devices handling Layer 3 and Layer 2, and with
dedicated cold spares in strategic locations for rapid deployment, we will be extremely
resilient to repetitions of issues of this nature.
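
As an illustration of the stocked-spares review described above, the sketch below shows
one possible shape for a regular firmware audit: record the firmware flashed onto each
spare line card and compare it against the version running in the production chassis it
would replace, flagging any spare that would need an upgrade before it could be fitted.
The inventory data, model names and version strings are hypothetical placeholders rather
than a description of our actual estate.

"""Illustrative sketch of a periodic spares audit: flag any stocked spare line
card whose firmware does not match the baseline of a chassis it might need to
slot into. All data below is hypothetical."""

from dataclasses import dataclass


@dataclass
class SpareCard:
    location: str   # where the spare is stocked, e.g. "SC1"
    model: str      # line card model
    firmware: str   # firmware version last flashed onto the spare


# Hypothetical production baseline: (chassis, card model) -> required firmware.
PRODUCTION_BASELINE = {
    ("THN-MLXe", "24x10G"): "6.0.0d",
    ("THE-MLXe", "24x10G"): "6.0.0d",
}

# Hypothetical stocked spares.
SPARES = [
    SpareCard(location="SC1", model="24x10G", firmware="6.0.0d"),
    SpareCard(location="SC1", model="24x10G", firmware="5.9.0b"),  # stale spare
]


def audit(spares, baseline):
    """Yield (spare, chassis, required) for each firmware mismatch."""
    for spare in spares:
        for (chassis, model), required in baseline.items():
            if spare.model == model and spare.firmware != required:
                yield spare, chassis, required


if __name__ == "__main__":
    findings = list(audit(SPARES, PRODUCTION_BASELINE))
    if not findings:
        print("All stocked spares match production firmware.")
    for spare, chassis, required in findings:
        print(f"Spare {spare.model} at {spare.location} runs {spare.firmware}; "
              f"{chassis} requires {required}: schedule an upgrade.")

Run on a regular schedule, and after every production firmware change, a check of this
kind is intended to surface a spare/chassis firmware mismatch before a spare is needed in
an emergency rather than during one.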

Posted Jul 18, 2019 - 11:12 BST

Resolved
This incident has been resolved.
Posted Jul 12, 2019 - 13:17 BST
Update
Engineers are still working with Brocade in an attempt to resolve the issue.

Additionally, they are still working in parallel to bring a new device online to help resolve the remaining issues.

We apologise that we do not have anything more concrete at this time, but can assure you that the team are treating this as their number one priority.
Posted Jul 12, 2019 - 11:14 BST
Update
In an attempt to address the issue, we are taking two paths of action.

Firstly, onsite staff are working with Brocade (the router vendor) to fix the problem. They previously sent a replacement line card for the router, but adding this proved fruitless.

Secondly, onsite staff are installing a completely new device in case the work with Brocade doesn't fix the issue. This is a time-consuming process as they don't want to disrupt the wider network.

The second item will take at least another hour to complete.
Posted Jul 12, 2019 - 10:03 BST
Update
Work is continuing to resolve the issue on the failed device that is causing the problems.

We apologise for the trouble this is causing.
Posted Jul 12, 2019 - 09:07 BST
Update
We are continuing to work on a fix for this issue.
Posted Jul 12, 2019 - 08:57 BST
Update
Most services are now running as usual.

We’re working to bring back the remaining affected leased lines and hope to have these back online before 9am.
Posted Jul 12, 2019 - 07:44 BST
Identified
We have confirmed an issue with a router in Telehouse North. This will affect any site-to-site services terminating in this site, as well as any that use it to interlink with other providers.

In addition to this, a number of leased line circuits are affected. On-site engineers have investigated the issue, and this has been escalated to two additional engineers who are en route to the site. This page will be updated with more information as it becomes available.

Onsite engineers and the vendor support team are still working on bringing the new line card into service.
Posted Jul 12, 2019 - 06:49 BST
This incident affected: Customer Internet Access (Customer Leased Lines).