Optus will consider running larger-scale simulations of network outages in the future, after being surprised by the scale of its outage on November 8.
Optus managing director of networks Lambo Kanagaratnam (left), Optus CEO Kelly Bayer Rosmarin (right).
The telco's redundancy measures had prepared it for more localised failures, such as fibre cuts and the loss of an exchange, and it had also simulated the disconnection of an entire state.
However, it had not previously contemplated a larger-scale loss of network infrastructure.
“We did do a network outage exercise, but it wasn’t for a full outage on the network,” Optus managing director of networks Lambo Kanagaratnam said on Friday at a Senate inquiry.
“We ran a scenario where we lost one of the states – Western Australia – and also – within the same scenario – we did an assessment of a potential attack on one of our exchanges in South Australia.”
No plans or practised scenarios involved events that could shut down Optus’ entire network, such as a failure to manage changes to routing information from international peering networks, which the telco has said caused the November 8 outage.
“We didn’t have a plan in place for that specific scale of outage. I think it was unexpected,” he said.
“We will take such exercises into consideration in the future.”
Kanagaratnam said that the 12-hour, nationwide operation that restored the telco’s network was significantly larger and more complex than its simulations of localised outages.
For example, Optus had to coordinate with an overseas network management partner and attend 14 locations across Australia to reconnect routers that it was unable to remotely reboot while its network was down.
“Nokia is our managed services partner for our network, and they were involved from the very beginning in managing the incident and recovering the network; their staff are based in India in two locations,” Kanagaratnam said.
“We had 90 devices across 14 locations that were impacted that we had to reboot. In addition, for some of those – likely about half of them – we had to reboot another 50 other network elements to restore connectivity.
“So there was a total of 100 devices across 14 locations that we had to reboot to restore connectivity.”
‘No ability’ to ‘test’ Singtel update
Optus chief executive officer Kelly Bayer Rosmarin said that the telco did not properly plan for a routine update to the Singtel Internet Exchange (STiX).
“The reality is that our network should have coped with this change, but on this occasion, it did not,” she said during her two-hour appearance.
“Our network has to be designed to cope with the redirection or the diversion away from where the upgrade is to an alternative link.”
However, the alternative link “was [also] being upgraded” and could not act as a “backup and redundancy option.”
“What was coming through that link needed to be diverted to another link – which happened to be configured differently – and then propagated to our network in a way that triggered failsafes in each of the different routers.”
‘High-level redundancy’ no use against a full-scale outage
Kanagaratnam said that Optus had invested in many layers of parallel emergency infrastructure to preserve connectivity when parts of its network were down, but these were of no use during the outage.
“We have high levels of redundancy and it’s [a full-scale outage] not something that we expect to happen.”
Kanagaratnam noted that Optus’ layers of redundancy were dependent on the core network infrastructure, including the provider edge routers, being operational.
“We should have in place defence mechanisms to ensure that any change on a [partner’s] network does not impact our network, and we didn’t have that on that day.”
Kanagaratnam said that Optus’ three main layers of redundancy included “exchanges or ‘sites’ across the country,” which “segregate different parts of the network…so at any time if one exchange is isolated, we only impact a certain amount of customers” within that exchange.
Secondly, “within each exchange…each of the services that we [Optus] connect” is backed up with at least one emergency router that switches on when a regular connection fails.
“And then what we do as well for mobile voice data and fixed voice is we provide geographical redundancy so that traffic can switch seamlessly across the country.”