Optus has given its fullest account yet of what it thinks caused the November 8 outage: default settings in its Cisco provider edge (PE) routers that led to around 90 of them shutting down nationwide.
The attribution is an evolution of its previous explanation that an “international peering network” had fed it bad data.
News reports this week identified that peer as the Singtel internet exchange (STiX), and partly attributed the cause to a software upgrade on Singtel’s end.
Singtel disputed that account on Thursday, instead identifying “preset failsafe” mechanisms in Optus’ routers as the cause. That explanation appears closer to the mark: Optus confirmed it in a submission filed late on Thursday, ahead of a senate appearance on Friday.
“It is now understood that the outage occurred due to approximately 90 PE [provider edge] routers automatically self-isolating in order to protect themselves from an overload of IP routing information,” Optus said. [pdf]
“These self-protection limits are default settings provided by the relevant global equipment vendor (Cisco).”
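Optus has not published the exact configuration involved, but its description matches a BGP maximum-prefix style limit, under which a router isolates or tears down a peering session once the number of routes it receives crosses a configured ceiling. The Python below is a minimal conceptual sketch of that behaviour; the peer name, limit, threshold percentage and prefix counts are illustrative assumptions, not values from Optus or Cisco.

    # Conceptual sketch of a "maximum prefix" self-protection limit.
    # All names and numbers are illustrative assumptions, not Optus or Cisco values.

    class BGPSession:
        def __init__(self, peer, max_prefixes, warning_pct=75):
            self.peer = peer
            self.max_prefixes = max_prefixes    # hard ceiling on accepted routes
            self.warning_pct = warning_pct      # alert threshold, as a percentage
            self.prefix_count = 0
            self.established = True

        def receive_prefixes(self, count):
            """Accept routes from the peer until the configured ceiling is crossed."""
            if not self.established:
                return
            self.prefix_count += count
            if self.prefix_count > self.max_prefixes:
                # Common default behaviour: drop the session ("self-isolate")
                # rather than keep absorbing routes.
                self.established = False
                print(f"{self.peer}: limit of {self.max_prefixes} prefixes exceeded, session shut down")
            elif self.prefix_count > self.max_prefixes * self.warning_pct // 100:
                print(f"{self.peer}: {self.prefix_count} prefixes past the {self.warning_pct}% threshold alert")

    # Hypothetical example: a routine update, then an unexpected flood of routes.
    session = BGPSession("upstream-peer", max_prefixes=10_000)
    session.receive_prefixes(8_000)   # crosses the warning threshold -> alert
    session.receive_prefixes(5_000)   # crosses the hard limit -> router self-isolates the session

On real equipment the ceiling, the warning threshold and the action taken when the limit is exceeded are typically configurable rather than fixed.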
Optus said the “unexpected overload” of routing information came via “an alternate Singtel peering router”, seemingly because the primary router hardware from which Optus usually took route information was under maintenance.
The telco said an unspecified software upgrade was being performed at one STiX location in North America.
Optus suggested the upgrade led to the bad route information being propagated, but said this “was not the cause of the incident” in Australia.
Instead, it put the blame on the edge routers’ “safety” defaults. It did not say why the default settings were used, to what extent it was able to tweak them, or how long the routers had operated with those defaults in place.
Optus said a team of 150 engineers and technicians were directly involved in the investigation and restoration, supported by another 250 staff and five vendors.
Six theories
For the first six hours or so, the engineers pursued six different possible explanations for the large-scale outage.
These included whether overnight works by Optus itself were the cause; the telco rolled back those changes but this did not resolve the outage.
Other possibilities explored in parallel included a DDoS attack, a network authentication issue, and problems with other vendors such as its content delivery network provider.
One explanation, however, became the “leading hypothesis for network restoration” after equipment logs and alerts “showed multiple Border Gateway Protocol (BGP) IPv6 prefixes exceeding threshold alerts.”
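Optus has not described its tooling, but a first-pass triage of such alerts amounts to checking how widely the same condition is being reported. The sketch below works over simplified, hypothetical alert records; the field names and router names are assumptions for illustration, not real equipment logs.

    # Triage sketch: how widely are IPv6 prefix-threshold alerts being reported?
    # The records below are simplified, hypothetical stand-ins for equipment logs.

    alerts = [
        {"router": "pe-01", "family": "ipv6", "type": "prefix-threshold-exceeded"},
        {"router": "pe-02", "family": "ipv6", "type": "prefix-threshold-exceeded"},
        {"router": "pe-02", "family": "ipv4", "type": "interface-flap"},
        {"router": "pe-03", "family": "ipv6", "type": "prefix-threshold-exceeded"},
    ]

    # Many devices raising the same alert at once points to a routing-table event
    # rather than a fault in any single box.
    affected = {a["router"] for a in alerts
                if a["family"] == "ipv6" and a["type"] == "prefix-threshold-exceeded"}
    print(f"{len(affected)} routers reported IPv6 prefix-threshold alerts: {sorted(affected)}")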
“We identified that resetting routing connectivity addressed the loss of network services. This occurred at 10:21am,” Optus said.
Engineers then set about “resetting and clearing routing connectivity on network elements which had disconnected themselves from the network, physically rebooting and reconnecting some network elements to restore connectivity, [and] carefully and methodically re-introducing traffic onto the mobile data and voice core to avoid a signalling surge on the network,” it said.
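The last of those steps, staging traffic back on rather than re-admitting everything at once, is a common way to stop a reconnection storm from swamping the signalling plane. The sketch below illustrates the idea; the stages, pacing and health check are hypothetical, not Optus’ published procedure.

    # Sketch of staged traffic re-introduction; stages, pacing and the health
    # check are hypothetical assumptions for illustration.

    import time

    def reintroduce_traffic(stages=(10, 25, 50, 75, 100), soak_seconds=1,
                            signalling_ok=lambda: True):
        """Admit traffic to the core in increasing stages, pausing between steps
        and holding back if signalling load looks unhealthy."""
        for pct in stages:
            print(f"Admitting {pct}% of traffic to the mobile data and voice core")
            time.sleep(soak_seconds)      # let signalling load settle before the next step
            if not signalling_ok():
                print(f"Signalling surge detected at {pct}%, holding at this stage")
                return
        print("All traffic re-introduced")

    reintroduce_traffic()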
Engineers performed unspecified “resiliency” works on the network between resolution on November 8 and the following Monday, November 13.
Optus foreshadowed more work to come.
“We are committed to learning from this event and continue to invest heavily, working with our international vendors and partners, to increase the resilience of our network,” it said.
“We will also support and will fully cooperate with the reviews being undertaken by the government and the senate.”
More to come