In a statement on Thursday, Optus said: “In common with major global telecommunication networks, the Optus network is designed with multiple layers of fallback and redundancy. At the heart of this is a modern intelligent router network developed with the world’s leading vendors.
“Despite this, a network event yesterday triggered a cascading failure which resulted in the shutdown of services to our customers,” Optus said, without elaborating on what the network event was.
Routing tables
The outage does appear to have been caused by one of Optus’ routers, which control how data is shuttled across the internet, and which receive instructions on how to do so via regular updates to their internal databases, known as routing tables.
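To make the idea of a routing table concrete, here is a minimal sketch of how a router might pick a path from one. It is an illustration only, not Optus' configuration: the table entries and next-hop names are invented, and real routers do this in specialised hardware. The rule shown, choosing the most specific matching prefix, is the standard "longest prefix match".

```python
import ipaddress

def lookup(routing_table, destination):
    """Return the next hop for the most specific prefix containing destination."""
    addr = ipaddress.IPv4Address(destination)
    best_network = None
    best_next_hop = None
    for prefix, next_hop in routing_table.items():
        network = ipaddress.IPv4Network(prefix)
        if addr not in network:
            continue
        # Prefer the longest (most specific) matching prefix.
        if best_network is None or network.prefixlen > best_network.prefixlen:
            best_network = network
            best_next_hop = next_hop
    return best_next_hop  # None when no route matches

# Hypothetical table: prefixes and next-hop names are made up for illustration.
table = {
    "0.0.0.0/0": "upstream",
    "10.0.0.0/8": "core",
    "10.1.0.0/16": "edge",
}
```

With this table, a packet for 10.1.2.3 matches all three entries but is sent to "edge", because /16 is the most specific match. A bad update that replaces or floods these entries changes where traffic goes.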
Updates are done via a global network of routers running what is known as the Border Gateway Protocol. The router that appears to have brought down Optus’ entire network receives BGP updates from Optus’ Singtel headquarters, China Telecom and the global cloud provider Akamai, as well as a UK telecommunications company and two other Optus routers.
BGP updates are routinely checked for errors before they are deployed, however. Last year, when a telecommunications company in Russia released a faulty routing table update that took Twitter offline across parts of Europe, some internet companies accepted and propagated the incorrect update, “while other ISPs continued to use the previous, valid routes”, an analysis by Cisco Systems found.
It was unlikely Optus completely failed to check an update to its routing tables, because, if that were its practice, its network would be going down “much more often”, Mr Tett said. The company probably had multiple layers of checks, which the update somehow got past.
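The kind of layered checking described above can be sketched in a few lines. This is a hypothetical illustration, not a description of the checks Optus or any vendor actually runs: the function name, the two limits and their values are assumptions chosen for the example. The point it shows is the behaviour Cisco described, where a router that rejects a bad update keeps using its previous, valid routes.

```python
import ipaddress

MAX_PREFIXES = 5     # assumed cap on table size, tiny for illustration
MAX_PREFIX_LEN = 24  # assumed policy: reject IPv4 routes more specific than /24

def validate_update(current_table, update):
    """Apply update only if every check passes; otherwise keep the old table."""
    candidate = dict(current_table)
    for prefix, next_hop in update.items():
        # Check 1: the prefix must be a well-formed IPv4 network.
        try:
            network = ipaddress.IPv4Network(prefix)
        except ValueError:
            return current_table, f"rejected: malformed prefix {prefix!r}"
        # Check 2: the prefix must not be more specific than policy allows.
        if network.prefixlen > MAX_PREFIX_LEN:
            return current_table, f"rejected: {prefix} is too specific"
        candidate[str(network)] = next_hop
    # Check 3: the resulting table must stay under the size cap.
    if len(candidate) > MAX_PREFIXES:
        return current_table, "rejected: too many prefixes"
    return candidate, "accepted"
```

An update that slips past every check is applied in full, which is why a fault that the checks were not written to catch can still spread.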
“It could happen to everyone who runs BGP,” he told The Australian Financial Review.
As well as bringing down Optus’ landline, mobile phone and internet services, the outage also appears to have brought down the internal network used by Optus to manage its network, forcing technicians to travel to affected locations and attempt to fix problems in person.
Industry insiders said it was “unusual” for a telecommunications company not to keep that management network “out of band”, or completely separate from its main backbone, so devices can still be managed remotely when the main network is down.
One insider who asked not to be named said that running the management network “in-band” – on the same backbone it is meant to manage – was “not a choice I would make”.
But Optus may simply have done a risk assessment and decided that the odds of complete network failure were so low that an in-band management connection to its BGP routers was “worth the risk”, Mr Tett said.
Ian Martin, a telecommunications analyst at New Street Research, said there was no reason to think the outage was exacerbated by Optus failing to invest adequately in its network, however.
The company had increased its spending on network infrastructure in the last several years, and its expenditure was now roughly in line with that of its rivals when judged as a proportion of market share, he said.
Indeed, its increased spending meant that Optus now had more free network capacity than Telstra, and was just beginning to win new customers thanks to that added capacity when the outage happened, he said.
Last year, too, Optus was just beginning to reap the rewards of its network investment when it was hacked and potential customers shied away, Mr Martin said.
“And now they’ve gone and shot themselves in the foot again.”