Understand how Facebook disappeared from the Internet

Time:2021-11-28

Original text:[https://blog.cloudflare.com/o…]


“Facebook can’t be down, can it?” we wondered for a moment.

Today, at 16:51 UTC on October 4, 2021, we opened an internal incident titled “Facebook DNS lookup returning SERVFAIL” because we were worried that something was wrong with our DNS resolver 1.1.1.1. But as we were about to post on our [public status] page, we realized that something more serious was going on.

Social media quickly lit up with reports, and our engineers confirmed them: Facebook and its affiliated services WhatsApp and Instagram were all down. Their DNS names stopped resolving and their infrastructure IPs were unreachable. It was as if someone had “unplugged” all of their data centers at once, making them disappear from the Internet.

How did this happen?

Meet BGP

BGP stands for Border Gateway Protocol. It is a protocol for exchanging routing information between autonomous systems (AS) on the Internet. The large routers that make the Internet work keep huge, constantly updated lists of the possible routes that can be used to deliver each network packet to its destination. Without BGP, Internet routers would not know what to do, and the Internet would not work.

The Internet is literally a network of networks, bound together by BGP. BGP allows one network (here, Facebook) to advertise its presence to the other networks that make up the Internet. As mentioned above, Facebook is not advertising its presence, so ISPs and other networks cannot find Facebook’s network, and it is unavailable.

Each individual network has an ASN (Autonomous System Number). An autonomous system (AS) is an individual network with a unified internal routing policy. An AS can originate prefixes (stating that it controls a group of IP addresses) and it can also transit prefixes (stating that it knows how to reach a specific group of IP addresses).

Cloudflare’s ASN is AS13335. Every ASN needs to announce its prefix routes to the Internet using BGP; otherwise, no one will know how to connect to us or where to find us.

Our [learning center] has good information on how [BGP] and [ASN] work.

In this simplified diagram, you can see six autonomous systems on the Internet and two possible routes that a packet can take from the start point to the end point. AS1 -> AS2 -> AS3 is the fastest, and AS1 -> AS6 -> AS5 -> AS4 -> AS3 is the slowest, but it can still be used if the first route fails.
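The failover behaviour in the diagram can be sketched as a small graph search. This is only an illustration: the link table below contains just the links on the two routes named in the text, "fastest" is approximated as "fewest AS hops", and the function name `shortest_as_path` is made up for this example. Real BGP route selection also depends on policy, not only on path length.

```python
from collections import deque

# Links between the six autonomous systems in the simplified diagram.
# Assumption: only the links on the two routes named in the text exist.
AS_LINKS = {
    "AS1": ["AS2", "AS6"],
    "AS2": ["AS1", "AS3"],
    "AS3": ["AS2", "AS4"],
    "AS4": ["AS3", "AS5"],
    "AS5": ["AS4", "AS6"],
    "AS6": ["AS1", "AS5"],
}


def shortest_as_path(links, src, dst, down=frozenset()):
    """Breadth-first search for the route with the fewest AS hops.

    Any AS listed in `down` is treated as unavailable and skipped.
    Returns the route as a list of AS names, or None if no route exists.
    """
    if src in down or dst in down:
        return None
    queue = deque([[src]])
    seen = {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for neighbor in links[path[-1]]:
            if neighbor not in seen and neighbor not in down:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


# Normal case: the short route through AS2 is chosen.
print(shortest_as_path(AS_LINKS, "AS1", "AS3"))
# If AS2 has a problem, traffic falls back to the long way around.
print(shortest_as_path(AS_LINKS, "AS1", "AS3", down={"AS2"}))
```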
At 16:58 UTC, we noticed that Facebook had stopped announcing the routes to its DNS prefixes. This meant that, at the very least, Facebook’s DNS servers were unavailable. As a result, Cloudflare’s 1.1.1.1 DNS resolver could no longer answer queries for the IP address of facebook.com or instagram.com.
route-views>show ip bgp 185.89.218.0/23
% Network not in table
route-views>

route-views>show ip bgp 129.134.30.0/23
% Network not in table
route-views>
Meanwhile, other Facebook IP addresses remained routable, but they were of little use: without DNS, Facebook and its related services were effectively unavailable:
route-views>show ip bgp 129.134.30.0
BGP routing table entry for 129.134.0.0/17, version 1025798334
Paths: (24 available, best #14, table default)
Not advertised to any peer
Refresh Epoch 2
3303 6453 32934

217.192.89.50 from 217.192.89.50 (138.187.128.158)
  Origin IGP, localpref 100, valid, external
  Community: 3303:1004 3303:1006 3303:3075 6453:3000 6453:3400 6453:3402
  path 7FE1408ED9C8 RPKI State not found
  rx pathid: 0, tx pathid: 0

Refresh Epoch 1
route-views>
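
The last query asks for a single address rather than an exact prefix, so the router answers with the most specific route that still covers it. With the /23 gone, that is the /17. Below is a minimal sketch of that longest-prefix lookup using Python's standard `ipaddress` module. The prefixes are the ones shown in the output above; the "before" table is an assumption about what was announced prior to the withdrawals, and `longest_prefix_match` is a name made up for this example.

```python
import ipaddress


def longest_prefix_match(table, address):
    """Return the most specific prefix in `table` that contains `address`, or None."""
    ip = ipaddress.ip_address(address)
    matches = [net for net in table if ip in net]
    return max(matches, key=lambda net: net.prefixlen, default=None)


# Assumed state before the withdrawals: the covering /17 plus the two /23s.
before = [
    ipaddress.ip_network(p)
    for p in ("129.134.0.0/17", "129.134.30.0/23", "185.89.218.0/23")
]
# State seen in the output above: only the /17 is still in the table.
after = [ipaddress.ip_network("129.134.0.0/17")]

print(longest_prefix_match(before, "129.134.30.0"))
print(longest_prefix_match(after, "129.134.30.0"))
print(longest_prefix_match(after, "185.89.218.1"))
```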

We keep track of all the BGP updates and announcements we see across our global network. At our scale, the data we collect gives us a view of how the Internet is connected and where traffic is meant to flow, from and to everywhere in the world.

A BGP UPDATE message tells a router about any change made to a prefix advertisement, or withdraws the prefix entirely. When we check our time-series BGP database, we can clearly see the number of updates we received from Facebook. Normally this chart is quiet: Facebook does not make many changes to its network from minute to minute.

But at 15:40 UTC, we saw a spike in routing changes from Facebook. This is when the problem began.
If we split this view into route announcements and withdrawals, the picture becomes even clearer. Routes were withdrawn and Facebook’s DNS servers went offline. One minute after the problem occurred, Cloudflare engineers were in a room trying to work out why 1.1.1.1 could not resolve facebook.com, and worrying that it was a fault in our own systems.
With those withdrawals, Facebook and its sites were effectively disconnected from the Internet.
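
Splitting a stream of BGP updates into announcements and withdrawals per minute can be sketched as follows. The records are hypothetical, invented purely to show the bucketing, and are not Cloudflare's data; the record layout and the name `per_minute` are likewise assumptions for this example.

```python
from collections import Counter
from datetime import datetime

# Hypothetical update feed: (timestamp, kind, prefix).
# These records are made up for illustration only.
updates = [
    ("2021-10-04T15:39:10", "announce", "129.134.0.0/17"),
    ("2021-10-04T15:40:05", "withdraw", "185.89.218.0/23"),
    ("2021-10-04T15:40:20", "withdraw", "129.134.30.0/23"),
    ("2021-10-04T15:40:48", "announce", "129.134.0.0/17"),
]


def per_minute(records):
    """Count updates per (minute, kind) bucket."""
    counts = Counter()
    for timestamp, kind, _prefix in records:
        minute = datetime.fromisoformat(timestamp).strftime("%H:%M")
        counts[(minute, kind)] += 1
    return counts


counts = per_minute(updates)
for (minute, kind), n in sorted(counts.items()):
    print(minute, kind, n)
```

Plotting the "announce" and "withdraw" series separately is what makes a burst of withdrawals stand out from routine churn.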

DNS gets affected

As a direct consequence, DNS resolvers all over the world stopped resolving Facebook’s domain names.
➜ ~ dig @1.1.1.1 facebook.com
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 31322
;facebook.com. IN A
➜ ~ dig @1.1.1.1 whatsapp.com
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 31322
;whatsapp.com. IN A
➜ ~ dig @8.8.8.8 facebook.com
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 31322
;facebook.com. IN A
➜ ~ dig @8.8.8.8 whatsapp.com
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 31322
;whatsapp.com. IN A
This happens because DNS, like many other systems on the Internet, has its own routing mechanism. When someone types https://facebook.com into the browser, the DNS resolver, which is responsible for translating the domain name into the IP address to connect to, first checks whether it has an answer in its cache and uses it if so. If not, it tries to fetch the answer from the domain’s nameservers, which are typically hosted by the entity that owns the domain.

If the nameservers are unreachable or fail to respond for some other reason, a SERVFAIL is returned and the browser shows an error to the user.
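
The lookup order described above (cache first, then the domain's nameservers, SERVFAIL if they cannot be reached) can be sketched with a toy resolver. This is a simplified model, not how 1.1.1.1 is implemented: `TinyResolver`, `Unreachable` and the `upstream` stand-in are hypothetical names, the address is from the documentation range, and the TTL is made up.

```python
import time


class Unreachable(Exception):
    """Raised by the stand-in upstream when the nameservers cannot be reached."""


class TinyResolver:
    def __init__(self, query_authoritative, clock=time.monotonic):
        self.query_authoritative = query_authoritative  # callable: name -> (ip, ttl)
        self.clock = clock
        self.cache = {}  # name -> (ip, expires_at)

    def resolve(self, name):
        # 1. Use the cached answer while its TTL has not expired.
        entry = self.cache.get(name)
        if entry and entry[1] > self.clock():
            return "NOERROR", entry[0]
        # 2. Otherwise ask the domain's nameservers.
        try:
            ip, ttl = self.query_authoritative(name)
        except Unreachable:
            # 3. Nameservers unreachable: answer with SERVFAIL.
            return "SERVFAIL", None
        self.cache[name] = (ip, self.clock() + ttl)
        return "NOERROR", ip


# Stand-ins so the example runs without any network access.
now = [0.0]
reachable = [True]


def upstream(name):
    if not reachable[0]:
        raise Unreachable(name)
    return "192.0.2.1", 60  # documentation address, made-up TTL in seconds


resolver = TinyResolver(upstream, clock=lambda: now[0])
first = resolver.resolve("example.com")   # answered by the nameservers
reachable[0] = False                      # routes to the nameservers disappear
now[0] = 30
cached = resolver.resolve("example.com")  # still answered from the cache
now[0] = 61
failed = resolver.resolve("example.com")  # TTL expired, nameservers unreachable
print(first, cached, failed)
```

The sketch also shows why an outage does not hit every user at the same instant: cached answers keep working until their TTL runs out.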

Similarly, our learning center provides an explanation of how DNS works.
When Facebook stopped announcing its DNS prefix routes through BGP, our DNS resolvers and everyone else’s had no way to connect to its nameservers. As a result, 1.1.1.1, 8.8.8.8 and other major public DNS resolvers began issuing (and caching) SERVFAIL responses.

But that’s not all. Human behavior and application logic now kick in and cause another exponential effect: a tsunami of additional DNS traffic.

This happened partly because apps do not accept an error as an answer and start retrying, and partly because end users do not accept an error either: they reload the page, or kill and relaunch the app, which also generates a large number of requests.

This is the increase in traffic we see on 1.1.1.1:

So now, because Facebook and its sites are so large, our DNS resolvers are handling 30 times more queries than usual, which can cause latency and timeouts for other platforms.

Fortunately, 1.1.1.1 was built to be free, fast (as the independent DNS monitoring tool DNSPerf can attest) and scalable, and we were able to keep serving our users with minimal impact.

The vast majority of our DNS requests kept resolving in under 10ms. At the same time, the p95 and p99 percentiles showed increased response times, most likely because expired TTLs forced new requests to the Facebook nameservers, which then timed out. The 10-second DNS timeout limit is a default well known among engineers.
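
To see why the typical request can stay fast while p95 and p99 rise, consider a nearest-rank percentile over a sample in which a small share of lookups hit the 10-second timeout. The latency values below are hypothetical and chosen only to show the effect; `percentile` is a helper written for this example.

```python
def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    ordered = sorted(samples)
    # Integer ceiling of p * n / 100, with a minimum rank of 1.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds: 940 fast answers plus 60 lookups
# that ran into the 10-second timeout. Illustrative numbers only.
latencies = [8] * 940 + [10_000] * 60

for p in (50, 95, 99):
    print(f"p{p}: {percentile(latencies, p)} ms")
```

In this made-up sample the median stays under 10ms, while the tail percentiles jump to the timeout value because more than 5% of the lookups timed out.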

Impact on other services

People turn to other services to find out what is going on. When Facebook became unreachable, we saw an increase in DNS queries for Twitter, Signal and other messaging and social media platforms.
We can also see another side effect of this outage in our WARP traffic to and from Facebook’s affected ASN 32934. This chart shows how traffic in each country changed between 15:45 UTC and 16:45 UTC compared with three hours earlier. All over the world, WARP traffic to and from Facebook’s network simply disappeared.

The Internet

Today’s events are a reminder that the Internet is a very complex system made up of hundreds of independent systems and protocols. Trust, standardization and cooperation between entities are what allow 5 billion active users around the world to connect.

Update

At about 21:00 UTC, we saw renewed BGP update activity from Facebook’s network, which peaked at 21:17 UTC.

This chart shows the availability of the DNS name facebook.com on Cloudflare’s DNS resolver 1.1.1.1. It became unavailable at approximately 15:50 UTC and came back at 21:20 UTC.

Facebook, WhatsApp and Instagram will undoubtedly need more time to come fully back online, but as of 21:28 UTC Facebook appears to be reconnected to the global Internet and DNS is working again.


This article comes from the WeChat official account “malt bread”, Zhu Kunrong’s official account, account ID “darkjune_think”.

Developer / science fiction enthusiast / hardcore console gamer / amateur translator
Please credit the source when reprinting.

Weibo: Zhu Kunrong
Bilibili: https://space.bilibili.com/23…

Communication email:[email protected]