In recent days, Optus and Singtel have been keen to point out the distinction between the trigger event for the outage and its “root cause”.
In its submission, Optus attempts to make this clearer, noting the software upgrade on the Singtel Internet Exchange – and subsequent diversion of traffic while it was under way – was a trigger event for the outage. It says the Optus network being unable to handle the significant chunk of new routing information was the root cause of its network being overloaded and crashing.
Optus said its network operations centre observed a loss of connectivity affecting its consumer network about 4.05am on November 8, the day of the incident.
In the initial stages of the outage, Optus said it prioritised the restoration of services as soon as possible, which required re-establishing connectivity to key elements of the network.
“It is now understood that the outage occurred due to approximately 90 PE routers [provider edge routers, which operate between one network service provider’s area and areas administered by other network providers] automatically self-isolating in order to protect themselves from an overload of IP routing information,” the Optus submission says.
“These self-protection limits are default settings provided by the relevant global equipment vendor (Cisco).”
Optus said this “unexpected overload” of routing information occurred after a software upgrade on the Singtel Internet Exchange network, specifically at one of Singtel’s exchanges in North America.
“During the upgrade, the Optus network received changes in routing information from an alternate Singtel peering router,” it says.
“These routing changes were propagated through multiple layers of our IP Core network. As a result, at around 4:05am (AEDT), the pre-set safety limits on a significant number of Optus network routers were exceeded. Although the software upgrade resulted in the change in routing information, it was not the cause of the incident.”
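The mechanism Optus describes, routers cutting themselves off once a pre-set limit on routing information is exceeded, can be illustrated with a toy model. The sketch below is purely hypothetical: the `Router` class, the `propagate` helper, the limit of 1,000 prefixes and the address ranges are all invented for illustration and do not reflect Cisco's defaults or Optus's actual configuration.

```python
from dataclasses import dataclass, field


@dataclass
class Router:
    """Toy model of a router with a maximum-prefix safety limit (hypothetical)."""
    name: str
    max_prefixes: int  # illustrative limit, not a real vendor default
    prefixes: set = field(default_factory=set)
    isolated: bool = False

    def receive(self, announced):
        """Accept announced prefixes unless the total would exceed the limit."""
        if self.isolated:
            return  # an isolated router ignores further updates
        candidate = self.prefixes | set(announced)
        if len(candidate) > self.max_prefixes:
            # Limit exceeded: drop off the network rather than risk overload.
            self.isolated = True
            self.prefixes.clear()
        else:
            self.prefixes = candidate


def propagate(routers, announced):
    """Send one announcement to every router; return the names now isolated."""
    for router in routers:
        router.receive(announced)
    return [router.name for router in routers if router.isolated]


routers = [Router(f"pe{i}", max_prefixes=1000) for i in range(3)]

# A routine update stays well under the limit.
normal = [f"10.0.{i}.0/24" for i in range(200)]
# A sudden flood of new routes pushes every router past it.
surge = [f"172.16.{i // 256}.{i % 256}/32" for i in range(2000)]

after_normal = propagate(routers, normal)
after_surge = propagate(routers, surge)
print(after_normal, after_surge)
```

In this model every router shares the same limit and receives the same update, so a single surge isolates all of them at once, which is broadly the pattern the submission describes across about 90 routers.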
Optus said restoration required “a large-scale effort across more than 100 devices in 14 sites nationwide to facilitate the recovery (site by site).
“This recovery was performed remotely and also required physical access to several sites.”
Approximately 150 engineers, technicians and field technicians were in the core group of personnel working on resolution, Optus said.
“That core group was augmented by 250 additional personnel, providing further support and monitoring. We also worked with five leading international vendors who assisted us with resolution and advice.”