Facebook's Long Crash Could Be Linked to Deleted Registrar Records
Earlier today, Facebook and its related apps suffered a serious crash. The external-facing websites for Facebook, WhatsApp, and Instagram became inoperable.
As if that weren't enough cause for worry, the company's own internal servers went down entirely.
This was the worst outage for the tech giant since 2008, when an error pushed Facebook offline for roughly one day, leaving 80 million users without service. Today, it has roughly 3 billion users.
Service returned shortly before 6:00 PM EDT, but considerable damage, and fretting, had been done over the course of the more-than-five-hour outage.
Since 2016, Facebook has used its own domain registrar, but it may no longer have the records for it, which would have made its attempts to get back online especially difficult.
Facebook may have pulled its own BGP routes to block 'attackers'
Just before noon EDT, Ars Technica reporter Jim Salter reported in a tweet that Facebook's DNS was broken: "So, @facebook's DNS is broken this morning... TL;DR: Google anycast DNS returns SERVFAIL for Facebook queries; querying a.ns.facebook.com directly times out." All four of Facebook's DNS servers appeared unresponsive, with cached zones timing out at larger providers like Google, Cloudflare, and Level3. Other major apps under the Mark Zuckerberg umbrella were also hit, with Instagram and WhatsApp reported down. This was confusing, since Instagram does not use the conventional Facebook DNS, which pointed to a deeper problem.
DNS, the Domain Name System, translates human-readable hostnames (like interestingengineering.com) into numeric IP addresses. Without a functional DNS, your browser cannot find the servers that host the website you want to see. Instagram's outage was more puzzling still, since Instagram's DNS servers are hosted on Amazon rather than by Facebook, and nothing was wrong with Amazon. You could reach Instagram and WhatsApp, but they returned HTTP 503 errors, meaning no backend server was available to handle the request.
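You can see this name-to-address translation, and what a DNS failure looks like to software, with a few lines of Python using the standard library's resolver (a minimal sketch; the hostnames below are just illustrations):

```python
import socket

def resolve(hostname):
    """Translate a human-readable hostname into an IP address via DNS."""
    try:
        return socket.gethostbyname(hostname)
    except socket.gaierror:
        # The resolver could not map the name to an address. This is the
        # kind of failure browsers hit during the Facebook outage.
        return None

print(resolve("localhost"))                 # an address such as 127.0.0.1
print(resolve("no-such-host.invalid"))      # None: DNS lookup failed
```

When a site's name servers stop answering, every lookup for that name ends up in the failure branch, and the site becomes unreachable even if its web servers are running.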
Facebook crash was probably not a malicious external attack
At some point, Facebook's entire network became unreachable because all of its Border Gateway Protocol (BGP) routes were withdrawn, according to the Ars Technica journalist's thread, citing @Cloudflare's explanation. BGP propagation errors are not uncommon. They can sometimes result from direct attacks on the system, but "those usually leave SOME pockets of the world functioning."
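Conceptually, withdrawing a BGP route removes the advertised path to an IP prefix, so other networks simply no longer know how to reach it. A toy sketch of that idea in Python (an illustration only, not a real BGP implementation; the prefix and AS number shown are Facebook's publicly known values):

```python
# Toy model of BGP route advertisement and withdrawal.
routing_table = {}

def advertise(prefix, origin_as):
    """Announce a route: traffic for this prefix can reach this network."""
    routing_table[prefix] = origin_as

def withdraw(prefix):
    """Withdraw the route; the prefix becomes unreachable to peers."""
    routing_table.pop(prefix, None)

def lookup(prefix):
    """Return the origin AS if a route exists, else None (unreachable)."""
    return routing_table.get(prefix)

advertise("157.240.0.0/16", "AS32934")  # Facebook prefix announced as usual
print(lookup("157.240.0.0/16"))         # AS32934: reachable
withdraw("157.240.0.0/16")              # routes pulled, as during the outage
print(lookup("157.240.0.0/16"))         # None: peers have no path to it
```

This is why the outage was total rather than regional: once the routes were gone everywhere, no pocket of the internet retained a path to Facebook's servers, DNS included.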
"FB COULD have deliberately pulled those routes themselves, [for example,] to limit attackers' ability to access compromised systems," read a follow-up tweet from Salter, speculating on the cause. Reddit user u/ramenporn, who claimed to work on Facebook's recovery team, said the Monday outage was probably caused by Facebook network engineers executing a configuration change that accidentally locked them out of the system. That left only technicians physically present at the routers able to attempt a fix, which suggests the great Facebook crash was probably not caused by a malicious external hack on Zuckerberg's social media infrastructure.
This was a developing story and was regularly updated as new information became available.