Facebook, Instagram and WhatsApp are back online after the long outage that began at 17:30 yesterday, October 4, at first with no known cause. During the evening some hypotheses emerged, partly following statements by a mysterious Reddit user whose posts have since been deleted. One of those hypotheses turned out to be true: Facebook confirmed that the outage is linked to a configuration change in the company's peering routers. Santosh Janardhan, vice president of the engineering team, explained what happened in a post on the official Facebook blog. But what really happened? And why did it take so long to fix the problem?

The BGP protocol

Yesterday, October 4, will go down in history for one of the longest outages ever recorded for Facebook, Instagram and WhatsApp. The disruption hit the whole world and sparked panic among users, who flocked to Twitter to exchange jokes and try to understand what had happened. At first it looked like a temporary glitch, not the first for Facebook and the other platforms, but after a couple of hours the situation had become very serious. Among the various rumors, some fake and some more credible, one possible cause emerged that was later confirmed: the outage originated from a change in the configuration of the routers behind the services.

Our engineering team found that configuration changes to the backbone routers, which coordinate network traffic between our data centers, caused the communication to fail. The suspension of traffic caused a cascade effect on the data centers, stopping all services.

Santosh Janardhan, Vice President of Engineering and Infrastructure

The backbone routers mentioned by Janardhan handle communication between different sub-networks. In Facebook's case, the backbone network formed by these routers manages the traffic between the company's data centers. More precisely, the "culprit" behind the outage is the configuration of the Border Gateway Protocol (BGP). This protocol connects routers (called "border routers") that belong to distinct autonomous systems, each of which is in turn a group of routers under a single administration. BGP is responsible for choosing the best path for moving packets from one system to another, and it is the foundation of modern Internet communication.
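As a rough illustration of what "choosing the best path" means, here is a minimal Python sketch. It assumes a single criterion, the shortest AS path, whereas real BGP implementations apply a longer sequence of tie-breakers (local preference, origin type and others). The `Route` class, the `best_route` function, the AS numbers and the prefix are all hypothetical and taken from ranges reserved for documentation; none of them describe Facebook's actual network.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    prefix: str      # destination network in CIDR notation (documentation range)
    as_path: tuple   # ASNs the announcement has crossed, nearest neighbor first


def best_route(candidates):
    """Return the candidate with the shortest AS path, or None if there are none.

    Simplification: real BGP evaluates several attributes before and after
    AS-path length; this sketch looks at path length only.
    """
    if not candidates:
        return None
    return min(candidates, key=lambda route: len(route.as_path))


# Two hypothetical announcements for the same destination prefix.
routes = [
    Route("203.0.113.0/24", (64500, 64501, 64502)),
    Route("203.0.113.0/24", (64510, 64502)),
]

chosen = best_route(routes)
print(chosen.as_path)
```

With these two candidates the second route is selected, because its announcement crossed fewer autonomous systems.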

The Facebook, Instagram and WhatsApp outage: the cause

So what happened? And what does BGP have to do with it? For the protocol to work and for the systems to communicate, each of them must announce its presence so that it can be identified. Each system is identified by an Autonomous System Number, or ASN, and has its own routing policy; it announces to its neighbors the ranges of IP addresses (prefixes) that can be reached within its network. This information is exchanged over BGP to build the network that links the systems together.

BGP allows the interconnection of autonomous systems.

According to the Cloudflare blog, Facebook stopped announcing its routing information, effectively cutting its connections with the other systems. The configuration update that the web was talking about yesterday was a change to the BGP information, and it made Facebook's domains unreachable. In practice, the platforms' services "disconnected" themselves from the Internet: once the three services stopped announcing their routes, DNS resolvers could no longer reach the company's nameservers.
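The chain of events described above, from withdrawn routes to failed name lookups, can be sketched in a few lines of Python. This is a simplified model built on stated assumptions, not Facebook's or Cloudflare's software: the `RoutingTable` class and the `resolve` function are hypothetical, the prefix, the nameserver address and the AS number come from ranges reserved for documentation, and a real resolver does far more than check whether a route exists.

```python
import ipaddress


class RoutingTable:
    """Toy view of the routes one network has learned from BGP announcements."""

    def __init__(self):
        self._routes = {}  # ip_network -> origin ASN

    def announce(self, prefix, origin_asn):
        # A neighbor tells us that `prefix` is reachable via `origin_asn`.
        self._routes[ipaddress.ip_network(prefix)] = origin_asn

    def withdraw(self, prefix):
        # The announcement is retracted; the prefix is no longer reachable.
        self._routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address):
        # Longest-prefix match: the most specific covering network wins.
        addr = ipaddress.ip_address(address)
        matches = [net for net in self._routes if addr in net]
        if not matches:
            return None
        best = max(matches, key=lambda net: net.prefixlen)
        return self._routes[best]


def resolve(table, nameserver_ip):
    """Hypothetical resolver step: it can only answer if the nameserver is routable."""
    if table.lookup(nameserver_ip) is None:
        return "SERVFAIL"  # no route to the authoritative nameserver
    return "ANSWER"


table = RoutingTable()
table.announce("192.0.2.0/24", 64500)    # hypothetical prefix hosting the nameserver
before = resolve(table, "192.0.2.53")

table.withdraw("192.0.2.0/24")           # the configuration change removes the route
after = resolve(table, "192.0.2.53")

print(before, after)
```

The nameserver itself never changes in this sketch. Only the route toward it disappears, which is enough to make every lookup that depends on it fail.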

To make matters worse, traffic toward Facebook's servers and toward DNS resolvers increased. Both apps and users began generating a huge volume of requests to DNS resolvers in an attempt to refresh their feeds. As many users noticed, there were timeout problems on other platforms as well, because resolvers around the world had to cope with 30 times more requests than normal.

Why the fix took so long

The BGP configuration update also disrupted the company's internal communication systems. This made it much harder to fix the problem promptly, because Facebook's engineers struggled to communicate both with each other and with the affected systems. The only solution was to gain physical access to those systems, with the delays that the logistics involved. On top of that, in most cases the people who had physical access to the systems did not have the knowledge to fix the problem, and vice versa. Furthermore, employees who arrived on site reported problems with their access badges and were effectively locked out of the offices.

It took all night to bring the services back online around the world. Santosh Janardhan warns that there may still be some minor disruptions while the recovery is completed. Having ruled out a hacker attack, Facebook is keen to reassure users about the integrity of their data, stating that it has found no evidence of any compromise.

