Edge router default settings caused massive Optus outage

Optus has submitted to the Senate Standing Committee on Environment and Communications a full account of what it believes caused the massive outage on 8 November that cut off its subscribers: about 90 Cisco provider edge (PE) routers automatically self-isolated to protect themselves from an overload of IP routing information.

Previously, the telco had attributed the outage to a routine software upgrade that went wrong and disrupted its network.

It claimed that an “international peering network” had fed it bad data.

Last Monday, the company’s spokesperson said: “At around 4:05am Wednesday morning, the Optus network received changes to routing information from an international peering network following a routine software upgrade.”




“These routing information changes propagated through multiple layers in our network and exceeded preset safety levels on key routers which could not handle these. This resulted in those routers disconnecting from the Optus IP Core network to protect themselves.”

According to news reports, the unnamed “international peering network” that contributed to the outage is run by Optus’ Singaporean parent company, Singtel.

But Singtel disputed the claim, saying the outage was caused by Optus’ own safety mechanisms and not by a routine software upgrade.

Optus confirmed Singtel’s account in a submission filed late on Thursday, ahead of a senate appearance today.

The Optus report reads: “It is now understood that the outage occurred due to approximately 90 PE routers automatically self-isolating in order to protect themselves from an overload of IP routing information.”

“These self-protection limits are default settings provided by the relevant global equipment vendor (Cisco).”
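The submission does not describe how these limits work internally. As a rough illustration only, the sketch below assumes the limit behaves like a cap on the number of routes a router will accept over one session: once the cap is exceeded, the session shuts itself down and stays down until someone resets it. The class name, the cap of 5 and the example prefixes are all hypothetical and are not taken from Optus or Cisco.

```python
from dataclasses import dataclass, field


@dataclass
class PeerSession:
    """Toy model of a routing session with a maximum-route safety limit."""

    max_prefixes: int  # hypothetical cap, not a real Optus or Cisco value
    prefixes: set = field(default_factory=set)
    isolated: bool = False

    def receive(self, announced):
        """Learn announced routes; self-isolate if the cap is exceeded."""
        if self.isolated:
            return  # stays down until an operator resets the session
        self.prefixes.update(announced)
        if len(self.prefixes) > self.max_prefixes:
            # Cap exceeded: discard learned routes and shut the session
            self.prefixes.clear()
            self.isolated = True

    def reset(self):
        """Operator action: clear state and bring the session back up."""
        self.prefixes.clear()
        self.isolated = False


session = PeerSession(max_prefixes=5)
session.receive({f"10.0.{i}.0/24" for i in range(4)})   # within the cap
normal_state = session.isolated
session.receive({f"10.1.{i}.0/24" for i in range(10)})  # sudden flood
flooded_state = session.isolated
session.reset()
print(normal_state, flooded_state, session.isolated)
```

In this toy model the router protects itself at the cost of dropping off the network entirely, which is the trade-off the submission describes.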

Optus said the unexpected overload of IP routing information occurred after a “software upgrade at one of the Singtel internet exchange” in North America.

During the upgrade, Optus said its network received changes in routing information from an alternate Singtel peering router.

“These routing changes were propagated through multiple layers of our IP Core network.”

Optus suggests the upgrade led to the bad routing information being propagated, but it did not explain further and now says this “was not the cause of the incident in Australia.”

One hundred and fifty engineers and technicians worked to restore the system, supported by another 250 staff and five vendors, according to the document.

For the first six hours of the outage, engineers worked through possible explanations for the large-scale failure.

Theories included whether overnight works by Optus itself were the cause, but pursuing that line brought no resolution.

It also explored the possibility of a DDoS attack, a network authentication issue, or problems with other vendors such as its content delivery network provider.

But its IPv6 line of enquiry became its “leading hypothesis for network restoration.”

“Through this process, we identified that resetting routing connectivity addressed the loss of network services. This occurred at 10:21am.”

Optus’ engineers then performed the following steps:

1. Resetting and clearing routing connectivity on network elements which had disconnected themselves from the network.

2. Physically rebooting and reconnecting some network elements to restore connectivity.

3. Carefully and methodically re-introducing traffic onto the mobile data and voice core to avoid a signalling surge on the network.
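The third step, bringing traffic back gradually rather than all at once, can be pictured as a staged ramp. The sketch below is an assumption about the general technique, not Optus’ actual procedure: the function name, the 1,000 sessions and the four stages are hypothetical.

```python
def staged_ramp(total_sessions: int, stages: int) -> list[int]:
    """Return the cumulative number of sessions admitted after each stage."""
    if stages < 1:
        raise ValueError("stages must be at least 1")
    # Admit an equal share per stage so reconnections do not arrive in one surge
    return [total_sessions * stage // stages for stage in range(1, stages + 1)]


print(staged_ramp(1000, 4))  # [250, 500, 750, 1000]
```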

Optus continued to investigate, performing unspecified “resiliency” works on the network between 8 and 13 November.

The submission also defends Optus’ customer communications on the day of the outage, a defence that is likely to be challenged at the senate inquiry.

Optus has offered 200GB of data to affected consumers, but some say this is not enough.

The data does not compensate businesses that lost money during the outage. The telco, however, argued that businesses are responsible for having backup connectivity in case their service goes down.

In its submission, Optus argued that making a telco pay financial compensation for business losses is not a precedent that should be set.

“However, there is no precedent for compensation being paid by telecommunications providers to all business customers who suffer a loss of business as a result of an outage of the kind that occurred on 8 November, either here or overseas.”

It argued that there is no precedent for essential services such as electricity providers paying compensation for business losses when there is an outage.

Electricity networks do not compensate business customers for “consequential losses” such as wages. It cited Ausgrid, an NSW-based electricity provider, which stated: “There is no compensation granted for consequential loss such as wages, productivity or trade.”

Optus downplayed the outage, saying “it isn’t the first to suffer a sizeable outage in Australia” and that the 8 November outage would not be the last incident of its type.

“While every communications network provider wants to avoid such outcomes, it is an unfortunate reality in our reliant digital age that no communications network can completely protect against, nor prevent, these types of occurrences from ever happening – despite the investments made or resiliency efforts undertaken.”


