
Optus identifies cause of nationwide outage, says ‘changes to routing information’ after software upgrade to blame



Optus says “changes to routing information” after a “routine software upgrade” was behind last week’s nationwide outage, affecting 10.2 million Australians and impacting 400,000 businesses.

In a statement released on Monday afternoon, Optus says its network was affected by “changes to routing information from an international peering network” around 4:05am AEDT last Wednesday, “following a routine software upgrade”.

“These routing information changes propagated through multiple layers in our network and exceeded preset safety levels on key routers which could not handle these,” the company said.

“This resulted in those routers disconnecting from the Optus IP Core network to protect themselves.”
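Optus's statement does not name the mechanism behind these "preset safety levels". One common safeguard of this kind is a cap on how many routes a router will accept before it shuts the session down. The sketch below is a toy model of that behaviour under that assumption; the `Router` class, the `max_routes` limit and the router names are hypothetical and are not drawn from Optus's network.

```python
from dataclasses import dataclass, field


@dataclass
class Router:
    name: str
    max_routes: int  # hypothetical "preset safety level"
    routes: set = field(default_factory=set)
    connected: bool = True

    def receive(self, prefixes):
        """Accept route announcements; disconnect if the limit is exceeded."""
        if not self.connected:
            return  # a disconnected router ignores further updates
        self.routes.update(prefixes)
        if len(self.routes) > self.max_routes:
            # Self-protection: leave the core rather than exhaust resources.
            self.connected = False
            self.routes.clear()


def propagate(routers, prefixes):
    """Send the same routing update to every router; return those that dropped off."""
    for router in routers:
        router.receive(prefixes)
    return [router.name for router in routers if not router.connected]


if __name__ == "__main__":
    core = [Router("edge-a", 1000), Router("edge-b", 1000)]
    # A sudden flood of 1,500 distinct routes, more than either router allows.
    flood = {f"10.{i // 256}.{i % 256}.0/24" for i in range(1500)}
    print(propagate(core, flood))
```

In this model every router that shares the same limit and receives the same update disconnects at once, which is one way a single routing change could take many routers offline together. Nothing in the model brings a router back automatically, consistent with the manual reconnection Optus describes.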

The scale of the outage meant Optus technicians had to physically reconnect or reboot the system, the telco said, and also meant the investigation into the cause “took longer than we would have liked”.

“The restoration required a large-scale effort of the team and in some cases required Optus to reconnect or reboot routers physically, requiring the dispatch of people across a number of sites in Australia,” an Optus spokesperson said.

“This is why restoration was progressive over the afternoon.

“Given the widespread impact of the outage, investigations into the issue took longer than we would have liked as we examined several different paths to restoration.

“The restoration of the network was at all times our priority and we subsequently established the cause working together with our partners.”

Optus says it has since made changes to its network to address the issue so it does not occur again, and will “continue to invest” to improve its network’s resiliency and services.

It comes after Optus made available an extra 200GB of data to customers from Monday as compensation for last Wednesday’s outage.

Software upgrade was ‘highly unlikely’ to be the cause, CEO said last week

Before Monday’s disclosure by Optus, experts had theorised the outage was likely a “regular software upgrade gone wrong”.

“The problem is too widespread to be due to a cable break or equipment failure,” said Tom Worthington, a senior lecturer in computer science from the Australian National University in Canberra.

The software upgrade theory floated by telecommunications analysts and experts last Wednesday was put to Optus CEO Kelly Bayer Rosmarin, who rejected the suggestion.

“It’s highly unlikely, our systems are actually very stable,” she told ABC Radio Sydney last Wednesday morning.

“We provide great coverage to customers, this is a very, very rare occurrence.”

Optus has offered free data to affected customers to make up for the inconvenience the outage caused. (AAP: Dean Lewins)

On Monday afternoon, Mr Worthington said it was “no surprise that a software upgrade caused the Optus outage”, and the issue would still have occurred if there was redundancy.

“This is a similar problem which took out the Australian Population Census in 2016,” he said.

“It would be possible to replicate all the hardware, but that would double the cost of services to customers and would not stop a systematic failure of this sort.

“There are some clear lessons from the Optus outage: Don’t have all your phones and internet provided by the one company, [and] if you are providing safety critical services, have connections to multiple networks.”

Associate Professor Mark Gregory from RMIT University said the cause identified by Optus was “human error” that resulted in a “cascading failure”.

“It appears that a routine software upgrade to one or more key routers was the cause of the outage,” he said.

“Optus has not explained what went wrong with the test process that should have occurred before the routing software upgrade occurred.

“Also, there is no explanation as to why there appears to have been a lack of redundancy of the key routers, so that if there was a problem the key routers would swap to the redundant routers, which you would expect to be running the previous iteration of software.”

Research fellow at the Centre for Defence Communications and Information Networking at the University of Adelaide, Mark Stewart, said the reason for the outage is “predictable” and common with software updates.

“Network instabilities resulting from changes to the routing information are a well-known and predictable problem, which are commonly associated with software updates,” he said.

“A major telco should have a disaster recovery plan which is more sophisticated than your average corporate network.

“At a minimum they should have had a plan to revert the changes, or remotely reboot their systems.

“The statement from Optus in no way clarifies how this event was exceptional, or what preventative measures they had in place to mitigate the impact.”

Graeme Hughes, the director of the Business Lab at Griffith University, said it was fortunate from an emergency communication perspective that the outage occurred when it did.

“Had the outage occurred a week earlier in the peak of raging bushfires, the impact would have been catastrophic,” he said.

Optus customers were without service for 14 hours last Wednesday. (ABC News: Dannielle Maguire)

Optus boss to face Senate on Friday

Optus is facing a number of inquiries and investigations as a result of the outage, including a Senate inquiry that will hold its first public hearings on Friday.

Ms Bayer Rosmarin is currently the only witness to confirm her attendance. 

The telco said in a statement that it supports and will “fully cooperate” with the reviews being done by the government and the Senate.

The disclosure of the outage's cause follows the federal government's announcement earlier on Monday that it would require telecommunications companies in Australia to report on their cybersecurity measures, to avoid a repeat of last year's Optus cyber hack.

Under the laws, telecommunications companies would be classified as “critical infrastructure” that would require their company boards to report to the government on their cybersecurity strategies in the same way energy companies, hospitals and ports do.
