Service Issue
Incident Report for I.T Communications Limited
Postmortem

Event Report
ER-240203-54248

SUMMARY
During the maintenance event on Saturday, 3rd February, three of the five main switchboards entered a control loop that prevented the main breakers from closing after planned maintenance operations on the HV switchgear.

Because of the time required to clear the control loop, the UPS systems fed by these switchboards drained completely and dropped their load. The control loop was caused by unforeseen damage to battery chargers and batteries within the main switchboards.

One of the five switchboards failed three additional times: first on Saturday evening at approximately 23:30 GMT, and again on Monday morning at approximately 03:30 GMT and 06:10 GMT. Each time, the UPS system fed by this switchboard drained completely and dropped its load.

Each failure was caused by a spurious trip of the restricted earth fault system.

TIMELINE

All times are approximate (GMT).

Saturday, 03 Feb 2024 – UPS A / UPS B Outage During Maintenance

09:50 – Start of maintenance operations.
10:00 – Change of HV system: from normal operations to the HV system fed exclusively from Transformer Tx B. All switchboards fed by HV A lose power for approximately 5 seconds, as planned. Switchboard control power is normally backed by battery during the loss of control power.
10:15 – Site checked; all systems functioning as usual.
10:40 – Loss of SCADA visibility. Three of five LV switchboards experience an unexpected failure of control batteries and associated chargers, and lose control power for five seconds. The loss of control power initiates a control loop on three of the five switchboards, causing the mains breakers to remain open. The cause of the mains breakers remaining open was not immediately evident, so a decision was made to revert to normal HV operations and abort project works.
10:52 – Attempt made to reverse out of the maintenance condition and restore power to HV A. This does not change the status of the control loop.
11:02 – Loss of power on UPS A and UPS B due to drained batteries.
13:10 – Cause of the control loop determined.
13:25 – Relays causing the control loop overridden.
13:28 – Power restored to affected switchboards. Decision made to abort further planned maintenance activities. UPS systems operating in a normal condition.

Saturday, 03 Feb 2024 – UPS B Outage #1

23:27 – UPS B system fails due to a trip of the Restricted Earth Fault (REF) system within the main switchboard feeding UPS B. Investigation initiated into the cause of the trip to ensure there was no danger in reclosing the affected circuit breaker. The investigation could not be completed before the UPS B system batteries drained. Loss of power to systems fed by UPS B.
03:15 – Investigation determines that the trip of the REF system was not a true event and was caused by the damage to the control systems within the switchboard feeding UPS B.
03:42 – Protection relay reset once it was determined that it was safe to do so. Power restored to the switchboard feeding UPS B. UPS B reset and power reestablished to the UPS systems. UPS systems operating in a normal condition.

Monday, 05 Feb 2024 – UPS B Outage #2

03:30 – UPS B system fails due to a trip of the Restricted Earth Fault system within the main switchboard feeding UPS B. Investigation confirmed that this trip was spurious, similar to the previous event on Saturday evening. Procedure initiated to reset the protection relay and reestablish power to the switchboard feeding UPS B; this could not be completed before the UPS B batteries were depleted. Loss of power to systems fed by UPS B.
03:44 – Power restored to the switchboard feeding UPS B. UPS B reset and power reestablished to the UPS systems. UPS systems operating in a normal condition.

Monday, 05 Feb 2024 – UPS B Outage #3

06:11 – UPS B system again fails due to a trip of the Restricted Earth Fault system within the main switchboard feeding UPS B. Investigation confirmed that this trip was spurious, similar to the previous two events. Procedure initiated to reset the protection relay and reestablish power to the switchboard feeding UPS B; this could not be completed before the UPS B batteries were depleted. Loss of power to systems fed by UPS B.
06:16 – Power restored to the switchboard feeding UPS B. UPS B reset and power reestablished to the UPS systems. UPS systems operating in a normal condition.

Monday, 05 Feb 2024 – Control Power System Repairs

18:48 – Work on temporary repairs to the switchboard Mech B control power system begins.
19:26 – Work on the switchboard Mech B control power system complete.
20:22 – Work on temporary repairs to the switchboard UPS B control power system begins.
20:49 – Work on the switchboard UPS B control power system complete.
21:20 – Work on temporary repairs to the switchboard UPS A control power system begins.
21:30 – Work on the switchboard UPS A control power system complete.

Current Status

  • UPS systems remain operational under normal conditions.
  • Switchboard control power systems are fully functional.
  • Full SCADA visibility restored.
  • Data centre operations remain stable.

ACTIONS

  • Permanent fix for LV switchboard controls to be determined, planned and communicated to customers. This will be a controlled maintenance activity.

  • Thorough investigation required for the switchboard feeding UPS B. While we believe that the system is safe and not subject to an earth fault condition, the repeated failure of that system is not fully explained and the switchboard must be thoroughly inspected. It is likely that this inspection and retest of the affected relays will require a shutdown of the switchboard, including a shutdown of the systems fed by UPS B. This will be a controlled maintenance activity.

  • Damaged HV tripping batteries to be replaced. Downtime is not expected to be required, but works will be planned and communicated. This will be a controlled maintenance activity.

  • The incident communications process is under review and improvement.

Posted Feb 06, 2024 - 18:14 GMT

Resolved
Power has been fully restored, SAN storage has been restored, and all servers are now back online.

We will follow up on Monday with the Data Centre to understand the reason for the power failure.

We have been located within the Volta Data Centre since February 2016 and have never had an issue like this before, so it is rare.

We are sorry for the outage today. The full RFO will be published once it has been received from the Data Centre.
Posted Feb 03, 2024 - 15:50 GMT
Update
We have now engaged our SAN storage provider (NetApp) due to an issue caused by the power failure, which is preventing the storage service from starting.

A NetApp engineer has now been assigned and is working with us to resolve the issue.
Posted Feb 03, 2024 - 14:43 GMT
Update
Power has now been restored. We are waiting for our SAN storage to come back online. Once the SAN is fully working, we will start up all the servers, which will restore the service.
Posted Feb 03, 2024 - 13:59 GMT
Update
We do not have an ETR at this time. The DC currently has two UPS systems offline due to a controls issue.
Posted Feb 03, 2024 - 13:09 GMT
Update
Power to Data Hall 4BL is back up (this is where our core routers are located).
Power to Data Hall 4BH is still down.
Power to Data Hall 4BD is still down (this is where VoIP services are located).

Once we have power back at 4BD, we will look to restore service.
Posted Feb 03, 2024 - 13:02 GMT
Update
We are awaiting updates from the Data Centre.
Posted Feb 03, 2024 - 12:28 GMT
Identified
We have experienced a full loss of power at the Volta Data Centre. On-site engineers are investigating; a further update will follow.
Posted Feb 03, 2024 - 11:57 GMT
Investigating
Hi. We are currently investigating an issue which may be affecting some customers.

A further update will be provided ASAP.
Posted Feb 03, 2024 - 11:44 GMT