A little-known overhaul behind WeChat: how did they “put the elephant into the refrigerator”?

In mid-January 2020, less than a week before the Spring Festival, Stephen Liu, the head of WeChat technical architecture, was very anxious. New Year's Eve is the busiest time of the year for WeChat: hundreds of millions of users send New Year's greetings and WeChat red envelopes at the same moment, and the load on WeChat's servers grows larger every year.

To make sure everyone receives their New Year's greetings on time and can grab WeChat red envelopes, the WeChat technical team enters "Spring Festival Guarantee" mode at the end of each year and stress-tests its servers so that WeChat does not fail at the critical moment.

A very difficult problem at the most critical moment

But in the testing phase of this Spring Festival, something went wrong.

WeChat red envelopes, launched in 2014, suffered a major outage during that year's Spring Festival: for a time, some users could neither receive red envelopes nor see the amounts in them. The following year, WeChat won the interactive advertising partnership for the Spring Festival Gala. On New Year's Eve that year, WeChat red envelopes were sent and received 1.01 billion times in China, and users "shook" their phones 11 billion times during the Gala. That year WeChat was well prepared and generally stable, with only occasional small-scale outages.

So problems surfacing in testing just before the Spring Festival were not a good sign.

Stephen Liu said:

At that time, the target we wanted the stress test to reach was probably on the order of billions of messages sent per minute, but the load we actually measured was only half of the target, and the Spring Festival was only two weeks away.

A stress test works like this: first expand WeChat's online server capacity. Once the expansion is complete, run an aggressive simulation of the peak at midnight on New Year's Eve: estimate how much this year's peak may grow over last year's, then generate that full amount of traffic and apply it to the system.

Put simply, it is like a website launching a DDoS attack against itself to find out how many people can visit at the same time without bringing it down. A more everyday analogy is a restaurant serving customers. In the off-season, its seats and chefs can serve only 100 guests at a time, but in peak season 300 guests may want to eat at once. The restaurant then has to expand and hire chefs in advance, which corresponds to "capacity expansion", or, if that is not possible, let guests queue outside.

WeChat, however, cannot make users queue to send and receive messages.

The problem the WeChat technical team and Stephen Liu ran into this time was as if the restaurant had clearly been expanded and more chefs hired, yet it could still serve only 150 guests at a time, short of the target of 300. The chefs were still quite busy, many seats were empty, and people were queuing outside.

The WeChat technical team spent a week or two investigating before it finally located the problem: the performance of the network card. To extend the analogy, it was as if the receptionist at the restaurant entrance were slacking off and not showing guests in, so the restaurant was never full while customers queued outside.
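
The stress test and the restaurant analogy can be sketched in a few lines of Python. This is only an illustrative model, not WeChat's tooling: the function names and all the numbers are assumptions chosen to mirror the analogy (a target of 300 guests and a receptionist who admits only 150). It ramps the offered load up to the target and shows that the measured throughput plateaus at the capacity of the slowest stage.

```python
def served_per_minute(offered, stage_capacities):
    """Throughput is capped by the offered load and by the slowest stage."""
    return min(offered, *stage_capacities.values())


def stress_test(target, stage_capacities, steps=10):
    """Ramp the offered load up to the target; record (offered, served) pairs."""
    results = []
    for step in range(1, steps + 1):
        offered = target * step // steps
        served = served_per_minute(offered, stage_capacities)
        results.append((offered, served))
    return results


def bottleneck(stage_capacities):
    """Name of the stage with the smallest capacity."""
    return min(stage_capacities, key=stage_capacities.get)


# Hypothetical capacities: the kitchen and dining room were expanded,
# but the receptionist at the door can only admit 150 guests.
capacities = {"receptionist": 150, "chefs": 320, "seats": 350}
results = stress_test(target=300, stage_capacities=capacities)
peak = max(served for _, served in results)
print(f"peak served: {peak}, bottleneck: {bottleneck(capacities)}")
```

In this toy model, adding chefs or seats changes nothing: the measured peak moves only when the narrowest stage is widened, which is why locating the bottleneck comes before any further expansion.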

Behind the problem: a huge, little-known change at WeChat

The reason stress tests had gone smoothly in previous years but failed this year lies in a great change behind WeChat: moving its self-developed business onto the cloud.

This great change began with Tencent's "930 reform" in 2018. On September 30, 2018, Tencent once again made a major adjustment to its corporate structure. The original seven business groups were reorganized and consolidated, and two new ones were established: the Cloud and Smart Industry Business Group (CSIG) and the Platform and Content Business Group (PCG). CSIG carries Tencent's grand ambitions for the enterprise (ToB) market, while the WeChat Business Group (WXG) connects the largest number of consumer (C-end) users.

Cloud had become a strategic pillar for Tencent. From this point on, moving its own self-developed businesses onto the cloud became an important part of the restructuring, and moving WeChat onto the cloud was the top priority.

Before the 930 reform, Tencent did not provide unified cloud infrastructure for its internal self-developed businesses; they ran on physical servers instead. From a macro perspective, given WeChat's huge user base and business volume, moving to the cloud can bring large cost and efficiency gains that benefit both WeChat and Tencent Cloud.

At the micro level, however, a business with more than 1 billion users has to go through this drastic change without users noticing anything. It is like changing the wheels on a car travelling at high speed: the car cannot stop, and the ride cannot even get bumpy.

The problem with the stress test described earlier occurred during this wheel change.

In fact, it really was time to change the wheels. Stephen Liu said:

In 2014 WeChat was just one department. When the company proposed the idea of cost optimization, we were quite nervous, because the department was small, only three or four hundred people at the time. Before 2014, all of WeChat's manpower went into feature iteration and constantly polishing new features, so less attention was paid to how the back-end servers were used and how well the architecture was designed.

Then the company raised this requirement. It later arranged for people to review how each business department was doing, and it picked very experienced people to do so; the one leading the team at the time was a company VP. I remember it very well, because he criticized me many times, saying that WeChat's costs were very high and that we were not using our servers well.

▲ WeChat's previous report PPT

This demand to cut costs and raise efficiency prompted the WeChat team to optimize its server architecture for the first time, adopting a system architecture called YARD.

This time, however, the move to the cloud has to stay consistent with the rest of Tencent, so the open-source K8S (Kubernetes) architecture was adopted. Compared with YARD, K8S is more open and has inherent advantages in working with artificial intelligence and big data frameworks. Many WeChat features now rely on artificial intelligence and big data, such as voice-to-text and text translation.

In other words, WeChat adopted the YARD architecture in 2014 for a very simple purpose: to schedule server resources flexibly and save costs. It was not designed for greater complexity or the longer term, and K8S had not yet been open-sourced at that time.

As the business grew, the advantages of the K8S architecture gradually came to outweigh the pain of migrating, and this coincided with Tencent's business transformation, making the change imperative.

Edsel Wang, a WeChat infrastructure engineer, outlined for Aifaner the broad steps of WeChat's migration to the cloud:

For the WeChat team, moving to the cloud can be understood in a narrow and a broad sense. In the narrow sense, it dates from the 930 reform in 2018: after the reform, the company pushed its self-developed businesses onto the cloud, and WeChat began to use the unified cloud infrastructure the company provides. In the broad sense, it means that WeChat has gradually made its entire R&D model cloud-native. That is not just moving some back-end services from physical machines to the cloud; it also includes integrating the whole R&D process with the cloud.

Since the 930 reform in 2018, the company's push to move self-developed businesses onto the cloud has gone through two stages. The first stage ran from 2018 to 2020, when the company mainly changed how it supplied servers, from physical machines to CVMs (Cloud Virtual Machines). The second stage began in 2020, when the company further required each business department to move its internal scheduling systems to K8S; for us, that meant migrating from YARD to K8S. In the first stage, because we had designed YARD as the scheduling layer, our main work was adapting YARD to the cloud: YARD originally supported physical machines, and once it supported CVMs as well, the business layer did not need to change much.

In the second stage, the WeChat team's task is to adopt K8S, that is, to replace the self-developed YARD platform with the scheduling capabilities of the K8S clusters provided by Tencent Cloud. To make the migration smoother, we planned three steps for replacing YARD with K8S. The first step is to establish whether WeChat's programs can run on K8S at all. The second step is to carry the experience accumulated in YARD over to K8S, so that K8S matches YARD's original capabilities and everything YARD used to provide remains available. The third step, now that the first two have reproduced what YARD offered, is to make full use of K8S's own capabilities, which shows up mainly in cost and efficiency.

We completed the first two steps before 2020. From the second half of 2020 we began to use K8S at scale, and in 2021 we entered the third step. So far, our cost and R&D efficiency have improved further compared with the original YARD. In the broad sense of moving to the cloud, the WeChat team also reached a milestone while rolling out CVMs: the storage team made its own breakthrough. WeChat has always used a self-developed storage system, and over the past ten years we have gone through many different DBs (databases) and KV (key-value) stores, finally achieving cloud-ready storage with infinityKV. infinityKV was launched in the second half of 2020, and about 80% of the data in WeChat's back end is now stored in it.

This is the process of WeChat moving to the cloud that I mentioned, that is, the several steps it takes to put the elephant into the refrigerator.

Edsel Wang went on to describe YARD's gradually emerging limitations. For one thing, the industry's definition of a cloud platform was not very clear in 2014. For another, Tencent's hardware environment then was quite different from today's cloud hardware. YARD was designed and developed for that environment, so it lacks some core capabilities, such as virtualization of disks and network cards.

The stress-test problem that surfaced early in WeChat's migration to the cloud was traced to the network card. The cause was that Tencent Cloud was using a new machine model at the time, and the CVM operating system was not yet well adapted to the hardware.

In the end, the WeChat technical architecture team took a roundabout route to work around the problem of a lightly loaded CPU paired with a saturated network card. Put simply, suppose the original server has 180 CPU cores and, after slicing, every 90 cores share one network card: the network card is fully loaded while the CPU load is only about 20%. The team re-sliced the CPU cores so that 48 cores correspond to one network card. The CPU load then rises to more than half, and the network card is no longer a bottleneck even though its performance is fully used.
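
The effect of re-slicing can be illustrated with some back-of-the-envelope arithmetic. The sketch below is a deliberately simple model, not the team's actual analysis: it assumes every busy core generates the same amount of network traffic, and the function and variable names are hypothetical. Only the figures of 90 cores, 48 cores, and roughly 20% CPU load come from the article.

```python
def cpu_load_when_nic_is_full(cores_per_nic, busy_cores_nic_can_feed):
    """CPU utilization of one slice at the moment its network card saturates.

    Assumes traffic grows linearly with the number of busy cores.
    """
    return min(1.0, busy_cores_nic_can_feed / cores_per_nic)


# From the article: with 90 cores per network card, the card is full at
# about 20% CPU load. Under the linear assumption, one card can therefore
# keep roughly 18 cores' worth of work supplied with traffic.
busy_cores_per_nic = 0.20 * 90

before = cpu_load_when_nic_is_full(90, busy_cores_per_nic)
after = cpu_load_when_nic_is_full(48, busy_cores_per_nic)
print(f"CPU ceiling with 90 cores per card: {before:.1%}")
print(f"CPU ceiling with 48 cores per card: {after:.1%}")
```

Under this linear assumption the CPU ceiling rises from 20% to 37.5%, so the model only shows the direction of the effect. The article reports a CPU load of more than half after re-slicing, which suggests the real relationship between cores and traffic was not this simple.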

This treats the symptoms; the root-cause fix is for CVM to optimize its network card scheduling. That optimization, together with the migration to K8S, lets the WeChat back end control network traffic more effectively, further improving the flexibility and stability of back-end deployment.

Change is not scary; not changing is

In 2013, WeChat suffered its longest outage. An excavator severed a communications fiber-optic cable, and the service requests of the East China data center were redirected to South China and North China, which paralyzed WeChat services for more than five hours.

When the YARD architecture was deployed the following year, WeChat therefore built in an important capability: three-campus support. Three data centers (campuses) are built in each city, each with independent network and power. Even if the fiber to one of them is cut, the other two can carry the load.
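
The capacity reasoning behind a three-campus layout can be sketched as follows. This is a simplified illustration under stated assumptions, not WeChat's routing logic: the function names and load figures are hypothetical, and a failed campus's traffic is assumed to be split evenly between the survivors.

```python
def redistribute(load_by_campus, failed):
    """Spread a failed campus's load evenly over the surviving campuses."""
    survivors = [c for c in load_by_campus if c != failed]
    if not survivors:
        raise RuntimeError("no campus left to take the traffic")
    extra = load_by_campus[failed] / len(survivors)
    return {c: load_by_campus[c] + extra for c in survivors}


def survives_single_failure(load_by_campus, capacity_by_campus):
    """True if any one campus can fail without overloading the others."""
    for failed in load_by_campus:
        new_load = redistribute(load_by_campus, failed)
        if any(new_load[c] > capacity_by_campus[c] for c in new_load):
            return False
    return True


# Hypothetical figures: three campuses, each able to handle 100 units.
capacity = {"campus_a": 100, "campus_b": 100, "campus_c": 100}
light = {"campus_a": 60, "campus_b": 60, "campus_c": 60}
heavy = {"campus_a": 70, "campus_b": 70, "campus_c": 70}

print(survives_single_failure(light, capacity))
print(survives_single_failure(heavy, capacity))
```

With three equal campuses, the check passes only while each one runs at no more than two thirds of its capacity, since the two survivors must absorb half of the lost campus's load each.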

This is the concept of "redundancy" that is common in server deployments today.

Now, after the move to the cloud, not only are server resources virtualized, but the new K8S architecture goes a step further: server resources are pooled across the whole of Tencent. It is like taking out a loan: previously WeChat borrowed from a city branch, and now it borrows from the provincial head office.

Over WeChat's 11-year history so far, the definition of WeChat has kept changing. Milestone features such as Moments, WeChat red envelopes, Mini Programs, and video accounts have expanded that definition again and again. It is a social network, a payment tool, and a content platform.

The server infrastructure behind WeChat faces the same constant change.

Earlier, the first snowfall of the season in Beijing sent local users posting to Moments in a frenzy, which caused an instantaneous jump in server demand, and capacity had to be expanded quickly in response.

Local weather and the user behavior it triggers are unpredictable. The collective sending of red envelopes at midnight on New Year's Eve, by contrast, is a certainty, and there are many similar predictable events. For example, when a Jay Chou concert is streamed live on video accounts, tens of millions of viewers put WeChat's servers to a huge test, but it is a test that can be stress-tested and provisioned for in advance.

Recalling one live broadcast in September last year, Bok Zhou, a back-end development engineer for video accounts, still feels his heart race.

He said that, thanks to the move to the cloud, the WeChat team can bring more server resources online faster when faced with such an unexpected surge in traffic, so that users are not left unable to watch the live broadcast.
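
One common way to express this kind of rapid scale-out is the replica-count rule documented for the Kubernetes Horizontal Pod Autoscaler: the desired number of replicas is the current number multiplied by the ratio of the observed metric to its target, rounded up. The article does not say whether WeChat uses this autoscaler, so the sketch below is only an illustration; the function name, the upper bound, and all the numbers are assumptions, and the real autoscaler applies further rules (such as a tolerance band) that are omitted here.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric, max_replicas):
    """Simplified scale-out rule: ceil(current * observed / target), bounded."""
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(1, min(raw, max_replicas))


# Hypothetical surge: 100 replicas running at 90% CPU with a 60% target.
print(desired_replicas(100, current_metric=90, target_metric=60, max_replicas=500))

# The same surge when the shared pool only allows 120 replicas.
print(desired_replicas(100, current_metric=90, target_metric=60, max_replicas=120))
```

The upper bound stands in for how much spare capacity the shared pool can lend at that moment, which is where pooling resources across the whole company makes a difference.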

Moving to the cloud is also a long-term, constantly evolving process, and its advantages will emerge gradually. The process is far from over, but some of the advantages and the vision are already visible.

Stephen Liu, head of WeChat technical architecture, said:

More than a year ago I shared a view with the team, using the levels of autonomous driving as an analogy. Level 0 is human driving with no automation at all. Level 1 has some driver assistance, Level 2 has stronger driver assistance, Level 3 already has a certain degree of self-driving ability, and then come Level 4 and Level 5.

One of my hopes is that we can reach the same kind of self-driving in the future, so that Spring Festival guarantee work is driven entirely by machines. A few years ago we were probably at Level 0. Later, with YARD, we reached Level 1. After exploring the capabilities of K8S throughout 2021, I think we are now at Level 2. I hope we can reach Level 3 next, with relatively complete automated driving.

