The age of avatars is here: Apple unveils new AI technology that creates your “digital avatar” in 30 minutes

While the technology giants compete fiercely on the generative AI track, Apple has seemed comparatively quiet on the sidelines.

Today, Apple released a research paper on generative AI, offering a rare look at its latest progress in the field.

The paper details a generative AI technique called HUGS (Human Gaussian Splats). In short, with this technology, a short video is enough to create a human “digital avatar.”

Getting back to the point, let's take a look at the demo results.

According to Apple, although neural rendering has made significant gains in training and rendering speed in recent years, the technology has mainly focused on photogrammetry of static scenes and is difficult to apply to freely moving human subjects.

To solve this problem, Apple's machine learning research team and the Max Planck Institute for Intelligent Systems collaborated on an AI framework called HUGS. Within about 30 minutes of training, HUGS can automatically separate a video into a static background and a fully animatable digital avatar.

How exactly is it done?

Their core idea is to use 3D Gaussian Splatting (3DGS) to represent both the person and the scene. You can think of each 3D Gaussian as a parameterized three-dimensional "bell" with a center position, a size, and a rotation.

If we place many of these bells at different locations in a room, adjust their positions, sizes, and orientations, and combine them, we can reconstruct both the structure of the room and the people in it. Gaussians are very fast to train and render, which is the biggest advantage of this approach.
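To make the idea concrete, here is a minimal sketch, assuming numpy and entirely illustrative names, of what a single 3D Gaussian primitive might carry. This is not Apple's implementation, just the standard "center + scale + rotation" construction commonly used in Gaussian splatting.

```python
# A minimal sketch (not Apple's code) of a single 3D Gaussian primitive
# in a splatting-style scene. All names here are illustrative assumptions.
from dataclasses import dataclass
import numpy as np

@dataclass
class Gaussian3D:
    center: np.ndarray    # (3,) position of the "bell" in space
    scale: np.ndarray     # (3,) per-axis extent (how fat the bell is)
    rotation: np.ndarray  # (4,) unit quaternion (w, x, y, z) orientation
    color: np.ndarray     # (3,) RGB appearance
    opacity: float        # how strongly it contributes when splatted

    def covariance(self) -> np.ndarray:
        """Covariance = R S S^T R^T, the usual 3DGS construction."""
        w, x, y, z = self.rotation / np.linalg.norm(self.rotation)
        R = np.array([
            [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
            [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
            [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
        ])
        S = np.diag(self.scale)
        return R @ S @ S.T @ R.T

# A "scene" is then simply a large collection of such primitives:
scene = [Gaussian3D(np.random.randn(3), np.full(3, 0.05),
                    np.array([1.0, 0, 0, 0]), np.random.rand(3), 0.8)
         for _ in range(1000)]
```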

The next problem is that an individual Gaussian is fairly simple, and it is hard to accurately model the complex structure of the human body just by piling Gaussians together.

Therefore, they start from a human body model called SMPL, a widely used and relatively simple body shape model. It gives the Gaussians a starting point by anchoring the basic shape and pose of the body.

Although SMPL provides the basic body shape, it is not very accurate for details such as clothing folds and hairstyles, so the Gaussians are allowed to deviate from and correct the SMPL surface to a certain extent.

In this way, they can adjust the model more flexibly, better capture and simulate these details, and give the final digital avatar a more realistic appearance.
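As a rough illustration of this "anchor on a body template, then let the Gaussians deviate" idea, here is a hedged sketch. The array `template_vertices` merely stands in for SMPL's mesh vertices (loading the real SMPL model is out of scope here), and every name below is our own, not from the paper.

```python
# A hedged sketch of initializing one Gaussian per body-template vertex,
# plus a learnable offset that training can use for details (clothing
# folds, hair) the template itself cannot capture.
import numpy as np

def init_gaussians_from_template(template_vertices: np.ndarray,
                                 init_scale: float = 0.01):
    n = template_vertices.shape[0]
    return {
        "anchors": template_vertices.copy(),          # fixed template anchor points
        "offsets": np.zeros((n, 3)),                  # learned deviation from the body surface
        "scales": np.full((n, 3), init_scale),        # learned per-axis extents
        "rotations": np.tile([1.0, 0, 0, 0], (n, 1)), # learned quaternions
    }

def gaussian_centers(params):
    # Final center = body-template anchor + learned offset.
    return params["anchors"] + params["offsets"]

# Example with a stand-in template of 6,890 random points
# (6,890 is the vertex count of the standard SMPL mesh):
params = init_gaussians_from_template(np.random.rand(6890, 3))
centers = gaussian_centers(params)
```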

Separating the person from the scene is only the first step; the reconstructed human model also needs to move. To this end, they designed a deformation network that learns, for each Gaussian making up the human model, its motion weights under different skeletal poses, the so-called LBS (linear blend skinning) weights.

These weights tell the system how each Gaussian should move when the skeleton moves, so that the result simulates realistic motion.
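Linear blend skinning itself is a standard technique, so a small sketch may help. Here the per-Gaussian weights are simply passed in as an array, whereas in HUGS they are predicted by the learned deformation network; the function and variable names are ours, not Apple's.

```python
# A minimal linear-blend-skinning (LBS) sketch: each Gaussian center is
# moved by a weighted mix of per-joint rigid transforms.
import numpy as np

def lbs_deform(centers: np.ndarray,            # (N, 3) Gaussian centers in rest pose
               joint_transforms: np.ndarray,   # (J, 4, 4) rigid transform per bone
               lbs_weights: np.ndarray):       # (N, J) weights, each row sums to 1
    homo = np.concatenate([centers, np.ones((len(centers), 1))], axis=1)  # (N, 4)
    # Blend the 4x4 bone transforms per Gaussian, then apply them.
    blended = np.einsum("nj,jab->nab", lbs_weights, joint_transforms)     # (N, 4, 4)
    deformed = np.einsum("nab,nb->na", blended, homo)                     # (N, 4)
    return deformed[:, :3]

# Toy usage: 2 bones, identity vs. a small translation along x.
T = np.stack([np.eye(4), np.eye(4)])
T[1, :3, 3] = [0.1, 0.0, 0.0]
centers = np.random.rand(5, 3)
weights = np.random.dirichlet([1, 1], size=5)   # rows sum to 1
moved = lbs_deform(centers, T, weights)
```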

They did not stop at designing the network: the avatar's Gaussians, the scene's Gaussians, and the deformation network are all jointly optimized against real videos of people in motion. This way, the digital avatar adapts better to different scenes and actions and looks more realistic.
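To show the shape of such a joint optimization, here is a deliberately toy, self-contained stand-in (assuming PyTorch). The "renderer" below just soft-splats 2D centers onto an image and is nowhere near a real 3DGS rasterizer, and the random target image stands in for a video frame; it only illustrates how Gaussian parameters and a deformation network can be optimized together against a photometric loss.

```python
# Toy joint optimization: learnable "Gaussians" + a tiny deformation net,
# fitted to a target image with a photometric loss. Not Apple's method.
import torch

H = W = 32
ys, xs = torch.meshgrid(torch.linspace(0, 1, H), torch.linspace(0, 1, W), indexing="ij")
pix = torch.stack([xs, ys], -1).reshape(-1, 2)              # (H*W, 2) pixel coordinates

def toy_splat(centers2d, colors, sigma=0.05):
    # Soft splat: every pixel is a distance-weighted mix of all primitives.
    d2 = ((pix[:, None, :] - centers2d[None]) ** 2).sum(-1)  # (H*W, N)
    w = torch.softmax(-d2 / (2 * sigma ** 2), dim=1)
    return (w @ colors).reshape(H, W, 3)

# Learnable avatar primitives (2D for the toy) and a tiny deformation net.
centers = torch.rand(64, 2, requires_grad=True)
colors = torch.rand(64, 3, requires_grad=True)
deform_net = torch.nn.Linear(1, 2)                # maps a pose scalar to a global offset

target = torch.rand(H, W, 3)                      # stand-in for one observed video frame
opt = torch.optim.Adam([centers, colors, *deform_net.parameters()], lr=1e-2)

for step in range(200):
    pose = torch.tensor([[0.5]])                  # stand-in pose input
    moved = centers + deform_net(pose)            # deformation applied to the centers
    loss = (toy_splat(moved, colors) - target).abs().mean()  # photometric loss
    opt.zero_grad(); loss.backward(); opt.step()
```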

Compared with previous methods, training is dramatically faster, at least 100 times faster, and the result can be rendered as high-definition video at 60 frames per second.

More importantly, the new method trains more efficiently and at lower computational cost, sometimes needing only 50-100 frames of video, that is, just 2-4 seconds of 24 fps footage.

Reactions to the release were polarized.

Tech blogger @mmmryo marveled at how well the model captures skin, clothing, hair, and other details, and speculated that the technology may well be designed specifically for the iPhone or Vision Pro.

Samsung scientist Kosta Derpanis showed up in the replies to Apple researcher Anurag Ranjan's post and gave the work high praise.

However, some users weren't convinced. X user @EddyRobinson, for example, questioned how good the actual generated results are.

Apple says it will release the model's code, but as of press time the official code link Apple provided returns only a "404".

Other netizens offered more measured takes.

It is worth mentioning that among the paper's authors is a familiar Chinese face.

The core author of the paper, Jen-Hao Rick Chang, is from Taiwan, China. Before joining Apple in 2020, he received his PhD from the ECE Department at Carnegie Mellon University.

Chang's academic path is quite remarkable. At Carnegie Mellon he studied under Professor Vijayakumar Bhagavatula and Professor Aswin Sankaranarayanan, both leading figures in image processing.

After spending his first three years on machine learning, Chang, driven by research interest, resolutely switched to the very different field of optics. Since then he has published a number of papers at top venues such as SIGGRAPH (computer graphics and interactive techniques) and ICML (machine learning).

This Apple paper is his latest co-authored work. For more details, see the paper at the link below.

https://arxiv.org/abs/2311.17910

It has to be said that this year's AI video generation race has been relentless. Runway brought generative AI into the hallowed halls of cinema, and Everything Everywhere All at Once, which leaned on Runway's tools, shows off the magic of AI video generation to the fullest.

Then Pika Labs' Pika 1.0 took AI video generation out of the exclusive hands of professional creators: with simpler text prompts, approachable video editing, and higher-quality generation, everyone gets the chance to be their own director.

Professional or amateur, you can also have fun with the MagicAnimate human animation generator: feed it a picture of a person plus a predefined motion sequence and it generates an animated video.

The animated protagonist can be your selfie, your pet, or a famous painting; with a bit of imagination, anything can be set in motion.

Of course, perhaps even more eye-catching is VideoPoet, the video generation model Google announced today, which supports a range of video generation capabilities as well as audio generation, and can even let a large language model drive the entire video generation process.

Not only can it generate 10-second clips in one go, VideoPoet also tackles the current difficulty of generating videos with large motions. It is an all-rounder in video generation; the only drawback may be that, for now, it "lives" only in Google's blog post.

By comparison, Apple's latest result points toward the currently popular idea of AI presenters: a short video of just a few seconds can generate your "digital avatar." Seeing may no longer be believing, and how to prove that "I am really me" may become something to worry about all over again.

Vision Pro launches in the United States next year, and this paper's results may well be an Easter egg planted in advance.
