How the Optus network failed, causing Australia’s biggest outage



However, some of the smartest people in Australia, who know how these large networks work, have pieced together enough publicly available data to work out the most likely scenario.

That scenario appears broadly correct today, with Optus announcing that the cause of the issue lay within the routing of its network.

Optus still hasn’t confirmed the official cause of the outage. (Dominic Lorrimer)
I draw my understanding in part from the reporting of our colleagues at the Sydney Morning Herald and also from industry experts posting online, such as Rob Thomas.

At your home, you have devices – your phone, your laptop, your TV – that connect to your Wi-Fi network. 

They connect to your router, which then sends the traffic out through your modem and internet connection, and any responses come back along that same path.

If you attach a new device to your Wi-Fi, it gets given an internal address by the router, so it can also receive traffic when needed.
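To make that concrete, here is a minimal sketch of a router handing out internal addresses. It is a toy model only: the `HomeRouter` class, the device names and the `192.168.0.0/24` address range are assumptions chosen for illustration, and a real router uses a protocol called DHCP with far more detail than this.

```python
import ipaddress


class HomeRouter:
    """Toy router that hands out internal addresses (hypothetical, for illustration)."""

    def __init__(self, network="192.168.0.0/24"):
        self.network = ipaddress.ip_network(network)
        hosts = iter(self.network.hosts())
        self.router_ip = next(hosts)  # the router keeps the first usable address
        self._free = hosts            # remaining addresses, handed out in order
        self.leases = {}              # device name -> internal address

    def connect(self, device):
        # A device the router already knows keeps the address it was given.
        if device not in self.leases:
            self.leases[device] = next(self._free)
        return self.leases[device]


router = HomeRouter()
for name in ["phone", "laptop", "tv"]:
    print(name, router.connect(name))
```

Each new device simply gets the next free address, and reconnecting the same device returns the address it already had.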

But Optus has millions upon millions of devices on its network, so things are far more complex.

Essentially, its routers can handle far more traffic, but to keep that highly efficient, they use “route reflectors”.

These share the directions for traffic to flow around the internet within the Optus network, and make it possible for us all to be online.

All of this is done using a thing called the Border Gateway Protocol (BGP), the language routers use to swap those directions with each other, and life is good.
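Why bother with route reflectors at all? Inside a single network, BGP routers without reflectors each need a session with every other router, and that number grows very quickly. The rough arithmetic below compares the two designs. The router counts and the choice of two reflectors are made-up figures for illustration, not details of the Optus network.

```python
def full_mesh_sessions(routers):
    # Every router pairs up with every other router exactly once.
    return routers * (routers - 1) // 2


def reflector_sessions(clients, reflectors):
    # Each client talks to every reflector, and the reflectors
    # talk to each other in a small full mesh of their own.
    return clients * reflectors + full_mesh_sessions(reflectors)


for total in (10, 100, 1000):
    mesh = full_mesh_sessions(total)
    reflected = reflector_sessions(clients=total - 2, reflectors=2)
    print(f"{total} routers: {mesh} sessions in a full mesh, {reflected} with 2 reflectors")
```

With 1000 routers, the full mesh needs 499,500 sessions while the two-reflector design needs 1,997, which is why large networks lean so heavily on a handful of reflectors.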

However, much of the speculation suggests that something was changed on one of those route reflectors.

Not normally an issue, because Optus would have lots of them.

But, if there was a problem with how that change was implemented, you need to know fast.

Tech experts have speculated on the potential cause. (Photo by Brendon Thorne / Getty Images)

If another route reflector was upgraded at the same or a similar time, it could cause panic among all of the routers, which basically start shouting new instructions at each other for how to get from A to E and G to T on the internet.

When that level of aggressive “route broadcasting” starts happening, the routers running BGP can be overwhelmed or flooded.
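A toy simulation shows why a flood like that is so damaging. It assumes a router can process a fixed number of route updates per tick and queues the rest. The function name, the capacity and the arrival rates are all invented for illustration and say nothing about real Optus equipment.

```python
def simulate_backlog(arrivals_per_tick, capacity_per_tick):
    """Return the number of unprocessed updates left after each tick."""
    backlog = 0
    history = []
    for arriving in arrivals_per_tick:
        # Anything the router cannot process this tick stays in the queue.
        backlog = max(0, backlog + arriving - capacity_per_tick)
        history.append(backlog)
    return history


normal = simulate_backlog([50] * 5, capacity_per_tick=100)
flood = simulate_backlog([500] * 5, capacity_per_tick=100)
print("normal day:", normal)  # [0, 0, 0, 0, 0]
print("flood:     ", flood)   # [400, 800, 1200, 1600, 2000]
```

On a normal day the router keeps up and the queue stays empty. Once updates arrive faster than they can be processed, the backlog only grows, and the router falls further behind with every tick.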

We can see the issue. Why did it take so long to fix?

Well, this goes to your staffing plan. 

Normally, a technically proficient engineer will open up their laptop while lying in bed, log into the network, roll back the changes and then go back to sleep once things are sorted.

Not yesterday, because the BGP failure meant their network was down, and when that happens you can’t remotely access the servers.

Okay, it’s a bit of a pain, but head into the office and plug in a computer. 

What if all your main network technical staff are outsourced to a third party overseas, probably one managing much more of your parent company Singtel’s businesses?

Did Optus fly someone in to fix it, or did they grab a Telstra or Vodafone hotspot and use FaceTime to get instructions?

We will never know. And we don’t need to.

But we do need to hear from Optus that it knows what failed and why, and how it will prevent it from happening again.

Optus CEO Kelly Bayer Rosmarin. (Nine)

I could have interpreted some of this technical detail wrongly, but the principle of the issue is the same, and I thank the IT community for sharing their views on this.

This explanation is a simplified view of what the likely “routing” issues were yesterday, and we may never get more detailed information from Optus than just the “routing” issue.


