Ordinary people's first stop in the metaverse: who will be the guide?

For movie lovers, facial capture is a concept both familiar and unfamiliar. Familiar, because motion capture and facial capture are staple technologies of fantasy filmmaking, and many classic non-human characters could not exist without them. It is technological progress that lets us witness the magnificent Middle-earth and its diverse races in "The Lord of the Rings", the alien wonders of "Avatar", and the connection between humans and other species on screen.

Unfamiliar, because for most people facial capture is a technology they have heard of but never experienced. We have been spectators of it, not users.

But the curve of technological progress often has two branches: one pushes forward, reaching ever further; the other pushes downward, from high cost to low cost, from touching a few people to benefiting millions. Sometimes the two run side by side and intertwine, amplifying each other.

Facial capture in the metaverse is hard

In the 2016 film "Warcraft", motion capture and facial capture let Daniel Wu (Wu Yanzu), widely hailed as an Asian heartthrob, play the hideous, malevolent orc Gul'dan.

The field of motion and facial capture even has its own superstar, Andy Serkis, who played the pivotal character Gollum in the "Lord of the Rings" trilogy and Caesar, the chimpanzee protagonist of the rebooted "Planet of the Apes" trilogy.

▲ Metaverse demo launched by Microsoft last year

Movies are a virtual world we watch from the outside; the metaverse is a virtual world we may one day live inside. Yet anyone who has tried a metaverse VR application has probably realized that the "I" in there is far from the real "I". The avatars are crudely modeled; early versions could not even render the user's legs, let alone rich facial expressions.

So as an early adopter I sometimes envy the facial capture used in film performances, and hope that in the metaverse I will not be a crude cartoon figure out of QQ Show, but a convincing traveler across fantasy worlds like Middle-earth, Azeroth, or the planet Pandora.

There are exceptions, though. "Vowel Adventure", iQIYI's first reality show to break through into a virtual game world, brought plenty of cool technology to its production, sending the guests into a virtual world, the Vowel Continent, for a comedy-filled adventure.

It may be one of the few pieces of metaverse content with a genuine sense of polish. That polish comes from the "likeness in spirit" between each virtual character and the star behind it, and behind that likeness lies the facial capture technology mentioned above.

Industrial-grade, movie-level facial capture available to only a few is obviously not an inclusive technology for the metaverse. Ideally, facial capture should be achievable with nothing more than a mobile phone.

The leap from industrial grade to consumer grade, however, is anything but easy.

In today's mature film industry, accurate facial capture follows an almost iron law: high input, high-quality output.

▲ Before and after special effects production of "Avatar"

The input includes both time and money. Take "Avatar", the film that once brought us visual spectacle: it took director James Cameron ten years to go from the idea to a greenlit production.

In the behind-the-scenes footage, each actor's face is marked with black dots tracked by a camera mounted in front of it, while several more cameras placed around the set capture body movements.

And when all the scenes are shot, production is far from over: fitting the facial expressions and body movements collected by the cameras onto the virtual characters takes roughly as long as the shoot itself, or even twice as long.

▲ Before and after special effects production of "Avatar"

A sufficiently stunning result also demands a huge post-production team working together. This traditional industrial pipeline delivers excellent, precise results, but it takes the route of painstaking craft at the expense of speed.

Speed and accuracy in facial capture are like the fish and the bear's paw of the Chinese idiom: you can rarely have both. In AI algorithm design they form, together with power consumption, an "impossible triangle", a trilemma in which optimizing any two comes at the cost of the third.

The human face has 43 muscles working together to express our emotions. Many expressions are complex and subtle, and the difference between two expressions is often a fine line.

To express genuine feeling in the metaverse, that is, to carry real-world facial micro-expressions into the virtual world, capture accuracy must reach a certain level: hundreds of facial feature points have to be tracked precisely and then reconstructed by the model's algorithms.

Note, too, that the "metaverse" we are talking about is not a film or TV production, where a degree of post-production is always available. For the experience to feel immersive, facial capture, computation, and transmission must all happen in sync, with real-time feedback.

Even at cinema's standard of 24 frames per second, that means processing 24 high-precision frames every second in real time, extracting the key points from hundreds of feature points, and reconstructing the expression.
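To make that frame-budget arithmetic concrete, here is a minimal sketch using the publicly available MediaPipe Face Mesh model (468 landmarks) as a stand-in. It is not the show's actual pipeline, just an illustration of what tracking hundreds of points inside a 24 fps budget (about 41.7 ms per frame) looks like:

```python
import time

import cv2
import mediapipe as mp

# Minimal real-time landmark loop. MediaPipe Face Mesh tracks 468 facial
# landmarks; the show's pipeline (300 points on Snapdragon) differs, but the
# per-frame budget math is the same: 24 fps leaves ~41.7 ms per frame.
FRAME_BUDGET_MS = 1000 / 24

cap = cv2.VideoCapture(0)  # default webcam
with mp.solutions.face_mesh.FaceMesh(
        max_num_faces=1,
        refine_landmarks=True,          # extra landmarks around eyes and lips
        min_detection_confidence=0.5,
        min_tracking_confidence=0.5) as face_mesh:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        start = time.perf_counter()
        # MediaPipe expects RGB; OpenCV captures BGR
        results = face_mesh.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        elapsed_ms = (time.perf_counter() - start) * 1000
        if results.multi_face_landmarks:
            pts = results.multi_face_landmarks[0].landmark
            print(f"{len(pts)} landmarks in {elapsed_ms:.1f} ms "
                  f"({'within' if elapsed_ms < FRAME_BUDGET_MS else 'over'} budget)")
cap.release()
```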

Capturing a complex, constantly moving human face is already an enormous workload. Beyond the face itself, a great many external factors and unexpected events also affect capture quality, and in the metaverse we cannot set up a film studio, professional lighting, and a post-processing workstation to do the work.

Everything happens, and must be captured, in the moment.

So for better results, the algorithm must also cope with objective and subjective factors such as changing light and shadow, vibration from worn equipment like helmets and cameras, and partial occlusion of the face.

In short, facial capture may sound like nothing more than an image-capture technology, but in practice it has to account for every information point on the face, along with micro-expression changes, the lighting environment, and other factors.

Its job is not to transplant facial muscle movements into the virtual world one by one, but to convey real emotion accurately and in real time.
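One common way capture models are made robust to exactly these disturbances is to bake them into the training data. The sketch below is a generic example, not Xiangxin's or Qualcomm's actual recipe, using torchvision transforms to simulate lighting shifts, camera shake, and partial occlusion on face crops:

```python
import torch
from PIL import Image
from torchvision import transforms

# Generic training-time augmentation simulating the disturbances listed above.
augment = transforms.Compose([
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.2),  # light/shadow changes
    transforms.RandomAffine(degrees=5, translate=(0.02, 0.02)),            # helmet/camera vibration
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.5, scale=(0.02, 0.15)),                   # partial occlusion, e.g. a microphone
])

face_crop = Image.new("RGB", (256, 256), "gray")  # placeholder for a real face image
batch = torch.stack([augment(face_crop) for _ in range(8)])  # 8 randomized variants
print(batch.shape)  # torch.Size([8, 3, 256, 256])
```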

Why can Snapdragon pull off facial capture in the metaverse?

In fact, applications that record and present facial expressions already exist all around us: the "animated emoji" that most phone makers have built into their chat apps.

These are entertainment features that liven up chatting. They do not demand high precision, can only record a handful of distinctive expressions, and struggle to present anything subtle.

For iQIYI's "Vowel Adventure", animated-emoji-level capture falls far short.

The challenge is to let the facial capture algorithm have both the fish and the bear's paw, and capturing human faces is harder than capturing human body motion or animal faces.

Whether Snapdragon can do facial capture well is therefore decided on three levels: hardware, software, and the hardware's support for the software. It takes both the powerful underlying computing of the mobile chip platform and the backing of neural-network algorithms.

Long before the metaverse concept took off, Snapdragon's imaging algorithms could already recognize certain facial data and apply targeted optimizations through the corresponding algorithms.

Full facial capture, however, or rather putting facial capture technology to work in the production of "Vowel Adventure", was a first.

The first step was to tune the algorithm on top of the existing technology: start with accuracy, train a complex, computation-hungry model that covers as many possible expressions as it can, then compare and debug repeatedly until it meets the needs of recording the show.

Next, to shrink the computational load, comes "cropping the calculation", in other words lightening the burden, on the premise that computation falls while facial capture accuracy stays at a usable level.
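"Cropping the calculation" most plausibly refers to model pruning or compression. As a hedged illustration (the layer sizes and the 300-point output head below are invented, not the production model), PyTorch's pruning utilities show the basic move, zeroing out the smallest weights while keeping the network's shape:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Hypothetical landmark-regression head; sizes are illustrative only.
model = nn.Sequential(
    nn.Linear(512, 256), nn.ReLU(),
    nn.Linear(256, 300 * 2),  # 300 feature points, (x, y) for each
)

def nonzero_params(m: nn.Module) -> int:
    return sum(int(p.count_nonzero()) for p in m.parameters())

before = nonzero_params(model)
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)  # drop 50% smallest weights
        prune.remove(module, "weight")                            # make the pruning permanent
print(f"nonzero parameters: {before} -> {nonzero_params(model)}")
# Accuracy must be re-checked (and usually fine-tuned) after pruning.
```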

Previously, the AI algorithm that crunched facial data ran on the phone's CPU, which could sustain only 30 fps at acceptable accuracy. Prolonged high-frequency computation also built up heat in the device, and under complex lighting or expressions the output would stutter.

To solve the power-consumption and battery-life problem, Qualcomm optimized the algorithm with its SNPE toolkit (Snapdragon Neural Processing Engine, runtime software for accelerating deep neural networks on Snapdragon) and enabled the Qualcomm AI Engine.

With that, the same AI algorithm runs at 60 fps and keeps running for three hours straight, all but solving the precision-versus-speed problem: you can, in fact, have both.
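SNPE itself ships as an on-device toolkit with C++/Java APIs, so it cannot be reproduced verbatim here; but the underlying lever, running the network in a lower-precision format that dedicated hardware executes cheaply, can be sketched with PyTorch's dynamic quantization. The network below is an invented stand-in, not the show's model:

```python
import time

import torch
import torch.nn as nn

# Invented stand-in for an expression-reconstruction net: landmark coordinates
# in, blendshape-style expression weights out.
model = nn.Sequential(
    nn.Linear(600, 1024), nn.ReLU(),   # 600 = 300 points x (x, y)
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 52),               # e.g. 52 expression coefficients
).eval()

# Convert Linear layers to int8 with dynamically computed activation scales.
quantized = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

def throughput_fps(m: nn.Module, runs: int = 500) -> float:
    x = torch.randn(1, 600)
    with torch.no_grad():
        start = time.perf_counter()
        for _ in range(runs):
            m(x)
    return runs / (time.perf_counter() - start)

print(f"fp32: {throughput_fps(model):.0f} inferences/s, "
      f"int8: {throughput_fps(quantized):.0f} inferences/s")
```

On a Snapdragon device the same idea goes further: instead of the CPU, the compiled model runs on the chip's dedicated AI hardware, which is where the jump from 30 fps to 60 fps and the cooler, longer-running behavior described above come from.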

More striking still, when the show was recorded this solution was built on the previous-generation Snapdragon 8+ chip, paired with the previous-generation AI Engine.

Another detail: in the pre-released recording highlights, the faces of the participating stars carry none of the traditional dense sampling markers; each performer wears only a helmet rig with an Android phone fixed to it.

No special markers, no multi-angle recording from multiple cameras: a Snapdragon-based Android phone collects 300 feature points from the face on its own and runs the complex AI algorithm on the device's AI Engine for real-time rendering.

For facial capture, accuracy and speed are finally achieved together through the algorithms, the AI Engine, and NPU hardware acceleration. For "Vowel Adventure" in turn, Snapdragon's technical strength is what carried the show from concept to reality.

As for the many kinds of interference beyond the face itself, Qualcomm Snapdragon and Xiangxin Technology also made technical breakthroughs on detail after detail.

For example, when a performer sings, the microphone sits so close to the face that it badly occludes the capture, which the technical design had to anticipate. In the end, even with the mouth partially blocked, the system still captures mouth movements stably and keeps the virtual face steady, avoiding the twitching and shaking that insufficient capture would otherwise inflict on the live result.

The first stop in the metaverse, with Snapdragon as the guide

"Vowel Adventure" shows that in the future we will be able to use a phone on the Snapdragon 8 series mobile platform to do facial capture the way the stars do, and to mirror and express ourselves in the metaverse. Take Wang Linkai (stage name "Xiaogui"): his avatar is a quirky clown, but the expressions on its face are his own emotions.


In the past we could clearly feel each advance in mobile SoCs: from single-core to multi-core CPUs, so phones stopped stuttering; GPU progress took playable games from "Angry Birds" to the desktop-class "Genshin Impact" and pushed mobile frame rates from 30 fps toward 120 fps; networking likewise, as modem progress lifted speeds from kilobits to megabits to today's gigabit class.

More importantly, as noted earlier, technology must move not only forward but also downward. If today's mobile operating systems still demanded typed commands instead of a graphical touch interface, then no matter how powerful the Snapdragon chip's computing became, shipping hundreds of millions of devices a year would be out of reach.

When Snapdragon and Xiangxin Technology teamed up to deliver ultra-low-threshold facial capture for a metaverse-themed variety show, the task was not just overcoming technical difficulties but making the technology simple, easy to use, intelligent, and stable enough, because its users are not engineers and developers but film and television production teams and performers.

Behind every ordinary falling apple lies a complex theory of gravity. Likewise, behind the progress that lets Snapdragon support facial capture, who is doing the work?

The answer is the Qualcomm AI Engine.

Compared with a processor's CPU and GPU, the AI compute engine keeps a much lower profile. Even though each generation's AI computing power grows by leaps and bounds, our perception of it seems oddly faint. Why?

This fast, high-quality facial capture is a case in point: the AI engine in ordinary devices has already reached a considerable level.

In everyday use, the AI engine's computing power keeps climbing generation after generation, and every action you take, unlocking the phone, opening the camera, waking the voice assistant, is quietly attended by it at every moment.

The high computing power of the AI engine makes these operations respond faster, so you stop noticing the technology itself and are simply surrounded by better human-computer interaction.


The Qualcomm AI Engine shines well beyond the facial capture and avatar creation in "Vowel Adventure". For the metaverse to be truly immersive, it first needs the same senses as the real world, above all vision and hearing.

Accurate facial and motion capture fall under vision; for hearing, Snapdragon Sound technology delivers a low-latency, high-quality audio experience.

The Qualcomm AI Engine has a hand in each of these technologies, and its help has become a key enabler behind the metaverse.

Compare the avatars on various so-called metaverse platforms at home and abroad, such as Meta's Horizon, and you will find that only the characters presented in "Vowel Adventure" hold real appeal for ordinary people. The visual gap is nothing less than the gap between a 2G and a 4G network.

The "I" in the virtual world is closer to the real "I", so that the metaverse is possible.

Technology, here, is the link connecting the "I" of the two worlds. In that other, virtual world, "I" has just landed, toddling out of the novice village. At this first stop in the metaverse, the Snapdragon mobile platform is, without question, the guide.
