The cause behind the Optus outage explained

Optus has revealed the cause behind last Wednesday’s network outage, which affected about 10 million customers Australia-wide for more than 12 hours.

So, for those of us who don’t speak tech, what does it mean?

Let’s look first at the (quite technical) statement issued by Optus on Monday, then figure out what it means.

The company said its network was affected by “changes to routing information” at around 4:05am AEDT last Wednesday “from an international peering network following a routine software upgrade”.

“These routing information changes propagated through multiple layers in our network and exceeded preset safety levels on key routers which could not handle these. This resulted in those routers disconnecting from the Optus IP Core network to protect themselves,” the company said.

“The restoration required a large-scale effort of the team and in some cases required Optus to reconnect or reboot routers physically, requiring the dispatch of people across a number of sites in Australia. This is why restoration was progressive over the afternoon.”

Not particularly enlightening, but the statement does confirm last week’s speculation that the outage was caused not by a fault in physical infrastructure, but by software.

Dr Mark Gregory, an associate professor in the School of Engineering at RMIT University, says the outage resulted from “human error” that triggered a “cascading failure”.

“The Optus statement is poorly worded, but it appears that a routine software upgrade to one or more key routers was the cause of the outage,” explains Gregory.

A software upgrade is not the same thing as a software update. An update enhances the current version of the software, whereas an upgrade replaces it with a completely new version.

“A cascading failure occurred when routing information from an international peering network was received and exceeded preset safety levels on key routers,” says Gregory.

Routing information is used to find the best path between one location on the internet, the source, and another, the destination network. Internet peering is the mutual exchange of traffic between networks and a router is a device that manages the flow of this traffic.
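For readers who want to see the idea in code, here is a minimal Python sketch of how a router might pick a path from its routing information. It assumes a toy routing table and uses only one rule, "the most specific matching route wins". The prefixes and next-hop names are invented for illustration and have nothing to do with Optus's real configuration, and real routers weigh many more factors.

```python
import ipaddress

# Hypothetical routing table: destination network -> next hop.
# All prefixes and names are made up for illustration.
ROUTES = {
    ipaddress.ip_network("0.0.0.0/0"): "international-peer",
    ipaddress.ip_network("203.0.113.0/24"): "sydney-core",
    ipaddress.ip_network("203.0.113.128/25"): "melbourne-edge",
}

def best_route(destination: str) -> str:
    """Return the next hop for the most specific route containing the address."""
    addr = ipaddress.ip_address(destination)
    # Collect every route whose network contains the destination address.
    matches = [net for net in ROUTES if addr in net]
    # The longest prefix is the most specific match.
    most_specific = max(matches, key=lambda net: net.prefixlen)
    return ROUTES[most_specific]

print(best_route("203.0.113.200"))
print(best_route("198.51.100.1"))
```

The catch-all `0.0.0.0/0` entry plays the role of a default route, so every destination matches at least one entry.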

Too many of these “routing information changes” overwhelmed the key routers, which Gregory says then “disconnected from the Optus IP Core network, bringing down the entire network.”
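The failure mode Gregory describes can be sketched as a toy simulation. Optus has not published its actual thresholds or router behaviour, so the `Router` class, the limit of 100 routes, and the router names below are all assumptions made purely to illustrate how one flood of routing changes could trip the same safety limit on every key router at once.

```python
class Router:
    """Toy router with a hypothetical preset safety limit on routes."""

    def __init__(self, name: str, max_routes: int):
        self.name = name
        self.max_routes = max_routes  # assumed safety level, not a real Optus value
        self.routes = set()
        self.connected = True

    def receive(self, prefixes):
        """Accept route changes; disconnect if the safety limit is exceeded."""
        if not self.connected:
            return
        self.routes.update(prefixes)
        if len(self.routes) > self.max_routes:
            # Self-protection: drop off the core network and forget all routes.
            self.connected = False
            self.routes.clear()

def propagate(routers, prefixes):
    """Send the same route changes to every router; return the names now offline."""
    for router in routers:
        router.receive(prefixes)
    return [router.name for router in routers if not router.connected]

core = [Router("core-a", 100), Router("core-b", 100)]
normal_update = {f"10.0.{i}.0/24" for i in range(50)}
flood = {f"10.1.{i}.0/24" for i in range(200)}

print(propagate(core, normal_update))  # within the limit
print(propagate(core, flood))          # exceeds the limit on every router
```

Because every router in this sketch shares the same limit and receives the same changes, they all disconnect together, which is the essence of a cascading failure.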

So, should this outage have been prevented?

“Optus has not explained what went wrong with the test process that should have occurred before the routing software upgrade occurred,” says Gregory.

“Also, there is no explanation as to why there appears to have been a lack of redundancy of the key routers, so that if there was a problem the key routers would swap to the redundant routers, which you would expect to be running the previous iteration of software.

“There remains a number of open questions that Optus has failed to explain.”

Mark Stewart, a research fellow at the Centre for Defence Communications and Information Networking at The University of Adelaide, agrees.

“A major telco should have a disaster recovery plan which is more sophisticated than your average corporate network. At a minimum, they should have had a plan to revert the changes, or remotely reboot their systems,” Stewart says.

“The statement from Optus in no way clarifies how this event was exceptional, or what preventative measures they had in place to mitigate the impact.”

The failure of the Optus network highlights the fragility of Australia’s telecommunication systems, which many services – such as hospitals, public transport, and EFTPOS transactions – rely on.

Graeme Hughes, director of the Griffith Business Lab at Griffith University, adds: “In an era where society heavily depends on interconnected technology, establishing trust in service providers is crucial from a consumer standpoint.”

For instance, Optus landlines were unable to dial 000.

“One surprising outcome is that, in this case, mobile phones proved more reliable than landlines for emergency calls. The mobile phone standards have provisions for using any company’s network to make an emergency call. So, phones automatically switched from Optus to Telstra, or Vodafone,” explains Hughes.

“The Australian Government is already working on mobile roaming between carriers during natural disasters. This could be extended to cover other network outages.”

But, Hughes says, it would require some difficult commercial and regulatory negotiations to implement in Australia.

“For government, business, and domestic users of internet and phone services there are some clear lessons from the Optus outage. Don’t have all your phones and Internet provided by the one company. If you are providing safety critical services, have connections to multiple networks.”

Optus faces a Senate inquiry and a separate Federal Government post-incident telecommunications review to examine the major impacts of the network failure and how it could be prevented from occurring again.




