OVERVIEW
A failed line card in an MLXe device in Telehouse North (THN) resulted in a loss of
connectivity. For the vast majority of customers, the network recovered by routing around
the affected site as expected within minutes. Leased line customers with locations
terminating in THN, and traffic for a limited number of customers routing to certain IP
ranges, experienced a greater impact to their connectivity.
CAUSE OF INCIDENT
Background:
The device in question was a chassis-based Brocade MLXe. These were originally
chosen because their chassis-based, modular design enabled easy expansion and fault
tolerance. However, we had been disappointed with the real-world performance of the
Brocade hardware and had just undertaken a long programme of works to replace this
Brocade hardware at Layer 2 with dedicated, best-of-breed Juniper devices. This upgrade
work was the result of intensive work with Brocade in an attempt to remediate the long
convergence times their products were delivering. After much investigation by the most
senior Brocade engineers and a programme of rolling updates, convergence times were
reduced from approximately 15 minutes to approximately 6 minutes. This was still clearly
suboptimal, and Brocade themselves concluded that their network hardware was incapable
of yielding any further improvement. This necessitated our investment in building a
brand-new core network, not based on Brocade. The first phase of the new core network
was recently completed, with Juniper devices now handling Layer 2 traffic on behalf of
our Brocade core. Phase two, which we are part-way through, is to physically move all
customer interconnects from Brocade to Juniper. Phase three is to replace the MLXe and
other Brocade devices on our core network with dedicated Juniper hardware.
Incident:
Our 24/7 monitoring team detected the issue and we immediately arranged for the router
to be rebooted by remote staff, while senior engineers began travelling to site to
investigate further. When the reboot failed, a Priority 1 case was raised with Brocade,
and methodical fault finding identified a failed line card. A replacement part was
immediately ordered from Brocade and, as a backup, additional engineers simultaneously
brought to site the 3 spare line cards we keep in stock. Installing a replacement line
card did not immediately resolve the issue, necessitating detailed, meticulous
troubleshooting with senior Brocade engineers. In parallel to these vendor-led
investigations, our engineers acquired and installed a Brocade CER router to operate in
place of the MLXe, should Brocade be unable to identify the fault. Switching to the CER
router would have required 6+ hours of careful manual configuration updates to restore
services for all customers, so fixing the existing MLXe hardware was focussed on as the
quickest route to restoring services. Our senior network engineers worked through the
night, identified the issue with the replacement line card, and restored the MLXe to
normal operation.
Timeline:
21:49 Issue began and was detected by our 24/7 monitoring team
21:55 Initial network convergence complete, restoring the majority of services. Intermittent
connectivity remained for a limited number of customers accessing certain IP ranges, and
for leased line customers terminating at THN
22:00 Problem device and location identified
23:00 Physical reboot of the router; engineers travelling to THN and SC1; P1 case
raised with Brocade
23:00–00:00 Engineers arrive on site, diagnostics begin, failed line card identified,
replacement ordered from vendor
01:00 Engineer brings the 3 stocked line cards from SC1 to THN
02:00 Line card replaced, but would not integrate owing to a firmware issue, which
we escalated to Brocade
04:00 Identified and implemented a fix for certain traffic that was being blackholed
by the line card failure. Also continued working with senior Brocade engineers on resolving
the fault with integrating the replacement line cards with the chassis
05:00 Made use of our MLXe in Telehouse East for enhanced diagnostics with
Brocade. Also instigated Plan B of using Brocade CER routers in place of the MLXe in THN.
06:00 Engineer delivers CER from SC1 to THN. Also continued working with senior
Brocade engineers on resolving the fault with integrating the replacement line cards with
the chassis
10:00 CER racked up and preliminary config begins whilst continuing to work with
senior Brocade engineers on integrating replacement line cards
12:00 Cause of line card not integrating identified by our engineers and resolved
12:10 Card and ports come online, affected services begin to return to normal
12:12 Affected services restored
RESOLUTION DETAILS
Once the issues with the replacement line card were resolved, it was quickly installed
and power cycled, restoring normal service.
FOLLOW UP ACTIONS
We will be reviewing our spare-stocking policy, including regular config and firmware
reviews, to ensure that replacing components (such as line cards) can happen as quickly
as possible in future. In addition, we are continuing our ongoing phase two plans and
expediting phase three plans, which will remove all Brocade hardware from our core
network. The MLXes will be replaced by best-of-breed Juniper hardware, deployed in a
resilient configuration. With separate devices handling Layer 3 and Layer 2, and with
dedicated cold spares in strategic locations for rapid deployment, we will be extremely
resilient to recurrences of issues of this nature.