On Monday, Facebook went down completely, along with Instagram and WhatsApp (as well as a few other websites). Many people have assumed the issue was caused by BGP, or Border Gateway Protocol, citing sources inside Facebook, traffic analysis, and the gut feeling that “it’s always DNS or BGP.” Facebook is on the mend, yet all of this raises the question:

What is BGP?

BGP, at its most basic level, is one of the internet’s methods for getting your traffic where it needs to go as quickly as possible. Because there are so many different internet service providers, backbone routers, and servers involved in getting your data to, say, Facebook, your packets could end up taking any of a number of different paths. BGP’s role is to show them the way and keep them on the right track.

BGP has been described as a post office system, an air traffic controller, and more, but my favorite analogy compares it to a map. Think of BGP as a group of people who create and update maps that show you how to get to places like YouTube and Facebook.

When it comes to BGP, the internet is divided into large networks known as autonomous systems. Each is a network managed by a single entity, which could be an ISP like Comcast, a company like Facebook, or another large institution like a government or a major university. Think of them as islands: because it would be incredibly difficult to build bridges linking every island to every other one, BGP is in charge of telling you which islands (or autonomous systems) you must pass through to reach your destination.

Because the internet is always changing, the maps must be updated as well; you don’t want your ISP to direct you down a road that no longer leads to Google. Because mapping the entire internet all of the time would be a huge job, autonomous systems share their maps. Every so often, they check in with their island neighbors to see and copy any map changes those neighbors have made.
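To make the map-sharing idea a little more concrete, here is a minimal sketch in Python. The island names and their connections are made up for illustration, and real BGP is far more involved (it sends incremental updates and applies each network's own policies), so treat this as a toy model of neighbors copying map entries from one another rather than a description of any real router.

```python
from collections import deque

# Hypothetical autonomous systems ("islands") and their neighbors.
# These names and links are invented for this example.
NEIGHBORS = {
    "YourISP": ["BackboneA"],
    "BackboneA": ["YourISP", "BackboneB", "Facebook"],
    "BackboneB": ["BackboneA", "Facebook"],
    "Facebook": ["BackboneA", "BackboneB"],
}

def propagate(origin):
    """Spread an announcement from `origin`; return each island's path to it."""
    paths = {origin: [origin]}
    queue = deque([origin])
    while queue:
        current = queue.popleft()
        for neighbor in NEIGHBORS[current]:
            # An island that already has a map entry keeps the one it heard first.
            if neighbor in paths:
                continue
            # The neighbor copies the entry and adds itself to the front of the path.
            paths[neighbor] = [neighbor] + paths[current]
            queue.append(neighbor)
    return paths

maps = propagate("Facebook")
print(" -> ".join(maps["YourISP"]))
```

Note that "YourISP" never talks to Facebook directly. It only learns the way from its neighbor, which is why one island's map changes can end up on everyone else's map.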

With maps as the foundation, it’s easy to see how things might go wrong. When consumers first got access to GPS, there were always jokes about it driving you off a cliff or into the middle of the desert. The same thing can happen with BGP: if someone makes a mistake, traffic will be directed somewhere it shouldn’t be, causing problems. If that error isn’t caught, it will end up on everyone’s map. Other things can go wrong, but we’ll get to those in a minute.

So BGP is like a set of maps that detail the fastest routes from you to a website?

Right! Unfortunately, it gets more complicated, because the shortest path is not always the best one. There are a variety of reasons why a routing algorithm would prefer one path over another, including cost, with some networks charging others to route traffic through them.
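Here is a rough sketch of what "shortest is not always best" can look like. It assumes a simplified two-step rule: prefer the route the operator has marked as more desirable (BGP calls this local preference), and only then prefer the route that crosses fewer islands. The network names and preference values are hypothetical, and real route selection has many more tie-breakers than this.

```python
from dataclasses import dataclass

@dataclass
class Route:
    as_path: list    # islands to cross, in order
    local_pref: int  # operator-assigned preference; higher wins

def best_route(routes):
    # Highest local preference first, then the fewest islands to cross.
    return min(routes, key=lambda r: (-r.local_pref, len(r.as_path)))

# Hypothetical candidates: a short route over a paid link,
# and a longer one over links the operator would rather use.
candidates = [
    Route(as_path=["PaidTransit", "Facebook"], local_pref=100),
    Route(as_path=["FreePeerA", "FreePeerB", "Facebook"], local_pref=200),
]

chosen = best_route(candidates)
print(chosen.as_path)
```

In this toy example the longer route wins, because the operator's preference is consulted before path length.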

As for Facebook: according to a report released earlier this year, it has built its own BGP system, which allows it to conduct “rapid incremental upgrades.” However, the method described there is intended for communication within data centers. At this point it’s difficult to tell what caused Facebook’s troubles on Monday, and it would take someone wiser than me to say whether Facebook’s data center communications could cause this kind of problem. The outage, according to cybersecurity reporter Brian Krebs, was triggered by a “regular BGP upgrade.”

What does DNS have to do with all this?

DNS tells you where you’re going, while BGP tells you how to get there, as Cloudflare explains. DNS is how computers figure out what IP address a website or other resource has, but knowing the address isn’t enough on its own. If you ask a buddy where their house is, you’ll almost certainly still need GPS to get there.
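The two steps can be sketched as a pair of lookups. Everything here is a stand-in: the hostname, the address (taken from a range reserved for documentation), and the island names are invented, and real DNS and BGP obviously don't live in two Python dictionaries.

```python
import ipaddress

# Hypothetical data, invented for this example.
DNS_RECORDS = {"example.test": "192.0.2.10"}
ROUTES = {"192.0.2.0/24": ["YourISP", "BackboneA", "ExampleSite"]}

def reach(hostname):
    # Step 1 (DNS): where am I going?
    address = DNS_RECORDS.get(hostname)
    if address is None:
        return "no address found"
    # Step 2 (BGP): how do I get there?
    for prefix, path in ROUTES.items():
        if ipaddress.ip_address(address) in ipaddress.ip_network(prefix):
            return " -> ".join(path)
    return "address known, but no route"

print(reach("example.test"))
```

If the entry in `ROUTES` disappears, the address is still known but there is no way to get to it, which is the "you have the street address but no directions" situation.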

Cloudflare also has a fantastic technical summary of how BGP faults can interfere with DNS requests. The post is specific to Monday’s Facebook problem, so it’s worth a read if you’re looking for an explanation of what the outage looked like from the perspective of an autonomous system.

What can go wrong with BGP?

Quite a few things. Two famous cases, according to Cloudflare, include a Turkish ISP inadvertently telling the entire internet to route its traffic through its service in 2004 and a Pakistani ISP inadvertently blocking YouTube globally after attempting to do so only for its own users. Because BGP information propagates from autonomous system to autonomous system (which, as a reminder, is one of the things that makes it so dang useful), one group’s mistake can cascade.
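One way a bad announcement can pull traffic away from its rightful owner is that routers prefer the most specific matching address range. The sketch below illustrates that rule only; I'm not claiming it reproduces the exact details of either incident above. The address ranges come from blocks reserved for documentation, and the owner names are made up.

```python
import ipaddress

def pick_route(table, address):
    """Return the owner of the most specific prefix covering `address`."""
    ip = ipaddress.ip_address(address)
    matches = [(net, owner) for net, owner in table.items() if ip in net]
    if not matches:
        return None
    # The longest prefix (the narrowest range) wins.
    return max(matches, key=lambda m: m[0].prefixlen)[1]

# Documentation-only address ranges, not anyone's real prefixes.
table = {ipaddress.ip_network("203.0.113.0/24"): "VideoSite"}
before = pick_route(table, "203.0.113.10")

# A different network mistakenly announces a narrower slice of the same range.
table[ipaddress.ip_network("203.0.113.0/25")] = "MistakenISP"
after = pick_route(table, "203.0.113.10")

print(before, "->", after)
```

The original entry is still on the map, but the narrower one takes priority for any address it covers.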

In 2018, attackers were able to redirect queries meant for Amazon’s DNS service and steal thousands of dollars in Ethereum by compromising a separate ISP’s BGP servers. Amazon itself was not hacked, but traffic intended for it was diverted elsewhere.

Or, with a faulty BGP update, you may muck it up and take your entire service offline. BGP is affectionately known as the internet’s duct tape, but no adhesive is flawless.

So what happened to Facebook?

For whatever reason, Facebook’s servers told everyone to remove them from their maps. If we want to know exactly what happened to Facebook’s BGP configuration and why it was changed, we’ll have to wait for a report from the company. However, Cloudflare’s CTO says that shortly before Facebook went dark, Cloudflare received a slew of BGP updates from Facebook (the majority of which were route withdrawals, or erasing lines on the map pointing to Facebook). One of Fastly’s tech leaders tweeted that when Facebook went offline, it stopped providing routes to Fastly, and KrebsOnSecurity backs up the theory that it was a change to Facebook’s BGP that caused the company’s services to go down.
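To picture what a flood of route withdrawals does to a neighbor's map, here is a small sketch. The update format, the address ranges (documentation-only blocks), and the island names are all invented for illustration; this is not a reconstruction of the actual updates Facebook sent.

```python
def apply_update(table, update):
    """Apply one simplified BGP-style update to a routing table (a dict)."""
    kind, prefix, path = update
    if kind == "announce":
        table[prefix] = path        # draw a line on the map
    elif kind == "withdraw":
        table.pop(prefix, None)     # erase that line from the map

# Hypothetical sequence: routes are announced, then all of them are withdrawn.
table = {}
updates = [
    ("announce", "192.0.2.0/24", ["BackboneA", "Facebook"]),
    ("announce", "198.51.100.0/24", ["BackboneA", "Facebook"]),
    ("withdraw", "192.0.2.0/24", None),
    ("withdraw", "198.51.100.0/24", None),
]
for update in updates:
    apply_update(table, update)

reachable = len(table) > 0
print("routes left:", len(table))
```

Once the last route is withdrawn, nothing in the table points to the destination anymore, so traffic for it has nowhere to go.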

If BGP was the problem, how does Facebook fix it?

Given the length of the outage, the answer appears to be “not easily.” Facebook needed to make sure it was advertising the right records and that they were being picked up by the rest of the internet. To put it another way, it needed to make sure its maps were accurate and visible to everyone.

That is easier said than done, however. Facebook employees have reportedly been locked out of badge-protected doors and have had difficulty communicating. In circumstances like these, you must not only determine who has the necessary knowledge and authorization to fix the problem, but also figure out how to put those people in touch with one another. And when your entire company is down, that’s no simple task. techtalkarena received reports of engineers being physically dispatched to a Facebook data center in California to try to resolve the issue.
