
When something goes wrong in a network, the old adage goes, “It’s DNS.” This time, the Domain Name System (DNS) looks to be a symptom rather than the root cause of the global Facebook outage. The underlying problem is that there are no active Border Gateway Protocol (BGP) routes into Facebook’s servers.

BGP is the standardized exterior gateway protocol used to exchange routing and reachability information between autonomous systems (AS) on the internet. Most people, and even most network administrators, will never have to deal with BGP.

Many people noticed that Facebook was no longer listed in DNS. There were even joke posts offering to sell the domain name Facebook.com.

Cloudflare VP Dane Knecht was the first to report the underlying BGP issue. As Kevin Beaumont, Microsoft’s former Head of Security Operations Center, put it on Twitter, “DNS breaks down when you don’t have BGP announcements for your DNS name servers, which means no one can find you on the internet. By the way, WhatsApp is the same way. Facebook has effectively deplatformed itself from its own platform.”
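The dependency Beaumont describes can be illustrated with a toy model. The sketch below is only an illustration, not how real resolvers or routers are implemented: the prefixes, addresses, and names are hypothetical values taken from documentation ranges, and `has_route` and `resolve` are made-up helpers. It shows that once the prefix covering the authoritative name server is withdrawn, lookups fail even though the DNS records themselves are unchanged.

```python
import ipaddress

# Hypothetical data from documentation ranges, not Facebook's real prefixes or names.
announced_prefixes = {ipaddress.ip_network("192.0.2.0/24")}
name_servers = {"example.com": ipaddress.ip_address("192.0.2.53")}
zone_records = {"www.example.com": "192.0.2.80"}


def has_route(address, prefixes):
    """Return True if any announced prefix covers the address."""
    return any(address in prefix for prefix in prefixes)


def resolve(hostname, prefixes):
    """Answer a lookup only if the authoritative name server is reachable."""
    domain = ".".join(hostname.split(".")[-2:])
    server = name_servers.get(domain)
    if server is None or not has_route(server, prefixes):
        return None  # no route to the name server, so the lookup fails
    return zone_records.get(hostname)


before = resolve("www.example.com", announced_prefixes)
after = resolve("www.example.com", set())  # every route withdrawn
print(before, after)  # prints: 192.0.2.80 None
```

In this simplified picture the zone data never changes. Only the route to the name server disappears, which is why the DNS failure is a symptom and not the cause.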

This may irritate you, but it may irritate Facebook staff even more. According to reports, Facebook employees cannot get into their buildings because their “smart” badges and doors were disabled by the network outage. If this is true, Facebook employees cannot physically enter the buildings to make repairs.

Meanwhile, Reddit user u/ramenporn, who claimed to be a Facebook employee working on bringing the social network back up, reported, before deleting his account and messages, that “DNS for FB services has been affected and this is likely a symptom of the actual issue, and that’s that BGP peering with Facebook peering routers has gone down, very likely due to a configuration clerical error.”

He went on to say, “People are now attempting to gain physical access to the peering routers in order to implement fixes, but those with physical access are distinct from those with knowledge of how to authenticate to the systems and those who know what to do, posing a logistical challenge in unifying all of that knowledge. Part of this is due to reduced staffing in data centers as a result of pandemic preparedness efforts.”

Ramenporn also indicated that it wasn’t an attack, but rather a mistaken configuration change made through a web interface. What’s worse, and why Facebook is still down hours later, is that because both BGP and DNS are down, “connection to the outside world is down, remote access to those tools no longer exists, so the emergency procedure is to gain physical access to the peering routers and do all the configuration locally.” Of course, the on-site technicians don’t know how to do that, and senior network administrators aren’t available. In a nutshell, this is a huge problem.
