Thanks to it, Musk and the Terracotta Warriors are dancing "Subject Three"

A single photo is enough to get Musk, Messi, and other celebrities dancing, and even "Subject Three", the dance that has swept the Chinese internet, is on the menu.

No advanced AI skills are needed: the new "National Dance King" feature in the mobile version of Alibaba's Tongyi Qianwen does it all. There are 12 popular dance templates to choose from, including Subject Three, DJ slow sway, ghost-step dance, and the bliss dance. Take your pick.

Type a trigger phrase such as "National Dance King" or "Tongyi Dance King" into Tongyi Qianwen, select your favorite dance on the page it opens, and upload a full-body photo. In about ten minutes you get a dance that captures both the body and the spirit of the subject; a dance king, minted in record time.

Who knew that Einstein, with his bushy brows and earnest face, could turn into a trendsetter in an instant, with a sense of rhythm that is off the charts.

▲ Picture from: Simon_Awen

The Terracotta Warriors are just one photo away from becoming dance kings, and nobody out-poses them.

With the dance king of the figurine world reigning supreme, how could "Asian Dance King" Nicholas Zhao Si stay on the sidelines?

▲ Picture from: Gongfu Finance

The little figures I doodled all dance better than I do. Looks like I need to sign up for a dance class.

▲ Picture from: Brother Dao Hu Kan

Crayon Shin-chan strikes his signature cheeky poses, and childhood comes rushing back in an instant.

▲ Picture from: Panhua dog

AI magic that makes photos "come alive"

So how did Alibaba’s AI research team make photos move?

The Tongyi Dance King feature is, in fact, a concrete application of the Animate Anyone technology.

According to the paper released by Alibaba's AI research team, diffusion models are currently the mainstream of visual generation research. In image-to-video generation, however, problems such as local distortion, blurred details, and inter-frame jitter remain.

To address this, Alibaba's AI research team proposed Animate Anyone, a new diffusion-based algorithm that converts a static character image into an animated video, while the character's movements can be precisely controlled by an input pose sequence.

▲ The flip-book principle. Picture from: @Andymation

Note that in video, and especially in animation, character movement is built up frame by frame. The principle is the same as the flip-books many of us played with as kids: each page is a static drawing, and flipping through them quickly makes the picture move by exploiting the "persistence of vision" quirk of the human eye.
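The flip-book idea can be sketched in a few lines of toy code (my own illustration, not anything from the paper): treat each "page" as a static pose, and generate the in-between pages by simple linear interpolation between two key poses.

```python
# Toy flip-book: each "frame" is a list of 2D joint positions, and the
# pages between two key poses are filled in by linear interpolation.

def interpolate_pose(pose_a, pose_b, t):
    """Blend two poses; t=0 gives pose_a, t=1 gives pose_b."""
    return [
        (xa + (xb - xa) * t, ya + (yb - ya) * t)
        for (xa, ya), (xb, yb) in zip(pose_a, pose_b)
    ]

def flipbook(pose_a, pose_b, num_pages):
    """Generate all the pages of the flip-book, key poses included."""
    return [
        interpolate_pose(pose_a, pose_b, i / (num_pages - 1))
        for i in range(num_pages)
    ]

# Two key poses for a two-joint "arm": lowered, then raised.
arm_down = [(0.0, 0.0), (0.0, -1.0)]   # (shoulder, hand)
arm_up = [(0.0, 0.0), (0.0, 1.0)]

pages = flipbook(arm_down, arm_up, num_pages=5)
for page in pages:
    print(page)
```

Flipped through fast enough, those five pages would read as a single arm-raising motion; the generative models below have to invent such in-between frames without ever being shown them.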

The hardest part of making a single picture move is "imagining" the actions and scenes that follow, with no earlier or later frames to reference. That is why, in the official comparisons, the earlier method DisCo is repeatedly held up as a negative example: its output is severely distorted, and while it does get the subject moving, the warped body shapes and bizarre motion hardly qualify as usable results.

To solve the problem of character-appearance consistency across video frames, the team introduced ReferenceNet, a reference-image network that captures the spatial detail of the reference image.

They then combined ReferenceNet with the denoising UNet, so that when generating each target frame, the UNet knows where details should appear and what they should look like. The generated image is thus denoised as a whole while the key details of the reference image are preserved, keeping the character's appearance consistent.

Beyond capturing detail, pose controllability must also be guaranteed. To that end, the Alibaba team designed Pose Guider, a lightweight module that injects pose-control signals into the denoising process, ensuring the generated animation follows the specified pose sequence.

For video stability, they also introduced a temporal module that lets the model learn the relationships between frames, so the generated video is smooth and coherent rather than disjointed, while high-resolution detail is preserved for better, more stable image quality.
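Very loosely, the components described above could fit together as in the following sketch. This is a hypothetical toy, not Alibaba's implementation: "images" are plain lists of floats, and each network is a stand-in that mimics only the role the paper assigns it.

```python
# Hypothetical sketch of the Animate Anyone pipeline roles (toy stand-ins).

def reference_net(reference_image):
    # ReferenceNet stand-in: extract "spatial detail" features
    # from the reference image (here: just a copy of it).
    return list(reference_image)

def pose_guider(pose):
    # Pose Guider stand-in: turn a pose into a lightweight control
    # signal that is added to the latent before denoising.
    return [p * 0.1 for p in pose]

def unet_denoise(latent, ref_features):
    # UNet stand-in: "denoise" by pulling the latent toward the
    # reference features, preserving the character's appearance.
    return [0.5 * l + 0.5 * r for l, r in zip(latent, ref_features)]

def temporal_smooth(frames):
    # Temporal-module stand-in: average each frame with its
    # predecessor so consecutive frames stay coherent.
    smoothed = [frames[0]]
    for frame in frames[1:]:
        prev = smoothed[-1]
        smoothed.append([(a + b) / 2 for a, b in zip(prev, frame)])
    return smoothed

def animate(reference_image, pose_sequence, noise):
    # One denoising pass per pose, conditioned on the reference image,
    # followed by temporal smoothing across the whole clip.
    ref = reference_net(reference_image)
    frames = []
    for pose in pose_sequence:
        latent = [n + c for n, c in zip(noise, pose_guider(pose))]
        frames.append(unet_denoise(latent, ref))
    return temporal_smooth(frames)

video = animate(
    reference_image=[1.0, 2.0],
    pose_sequence=[[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]],
    noise=[0.0, 0.0],
)
print(len(video), "frames")
```

The point of the sketch is the division of labor: appearance comes from the reference branch, motion from the pose signal, and coherence from the temporal pass, mirroring the three problems (detail, controllability, stability) the paper tackles in turn.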

Compared with previous methods, this approach effectively keeps a character's appearance consistent, with no artifacts such as clothes changing color; the video stays smooth and sharp, free of flicker and jitter; and it supports animating arbitrary characters.

For example, Messi busts out the swaying moves beloved by middle-aged and elderly dancers and raises a hand to say hello.

Anime characters step out of their still frames, and their house dance is every bit as impressive as a real person's.

Even Iron Man joins in the fun, stretching and flexing without a glitch in sight.

Alibaba's accumulated work in AI video generation goes further still. Last month, it released another video generation technology, DreaMoving: a diffusion-based, controllable video generation framework for producing high-quality customized portrait videos.

The advantage of this technology is that no deep knowledge of video production is required. Given a little guidance, such as a piece of text or a reference image, DreaMoving can create highly realistic videos.

In other words, given a target identity and a pose sequence, DreaMoving can generate a video of any person or object dancing anywhere, following that pose sequence.

Put simply, DreaMoving turns simple inputs such as a face image, an action sequence, and text into customized character videos, with precise control over the result.

The steps break down as follows: first, input a facial image to generate the person's full-body appearance in the video; next, input a pose sequence to precisely control the character's movements; finally, input text to steer the overall look of the generated video.
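The three-stage input described above could be modeled as a simple request object, purely as an illustration (this is not the real DreaMoving interface; all names here are hypothetical):

```python
# Hypothetical request object mirroring DreaMoving's three inputs:
# identity (face image), motion (pose sequence), and style (text).

from dataclasses import dataclass, field

@dataclass
class VideoRequest:
    face_image: str                                     # step 1: who appears
    pose_sequence: list = field(default_factory=list)   # step 2: how they move
    text_prompt: str = ""                               # step 3: scene and style

    def describe(self):
        return (f"identity from {self.face_image}, "
                f"{len(self.pose_sequence)} pose frames, "
                f"styled by: {self.text_prompt or 'default'}")

req = VideoRequest(
    face_image="girl_face.png",
    pose_sequence=["pose_001", "pose_002", "pose_003"],
    text_prompt="a girl smiling, standing on the beach, light yellow dress",
)
print(req.describe())
```

Each layer of input narrows the output further: identity alone fixes who is in the frame, the pose sequence fixes the choreography, and the text prompt fixes everything else about the scene.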

For example: a girl, smiling, standing on a beach by the sea, wearing a light yellow long-sleeved dress.

A man dances in front of the Pyramid of Egypt, wearing a suit and blue tie.

A girl in a light blue dress, smiling and dancing in a French town.

AI video generation is racing ahead

In fact, AI video generation got off to a fairly early start within generative AI. Even before ChatGPT appeared, manufacturers such as Microsoft and Google had placed bets on the track with their own video generation tools, but to little effect.

Building on the industry's long-term accumulation, the arrival of diffusion models showed manufacturers the real potential of AI video generation. Compared with earlier approaches such as RNN-based models, diffusion models generate more coherent, sharper images and video sequences, accelerating the iteration of video generation.

Mainstream tools on the market have built heavily on this foundation, stirring up the AI video generation race once again and producing a genuinely explosive trend.

At the end of last year, Runway Gen-2 received a major update: resolution rose to 4K, with major breakthroughs in the fidelity and consistency of generated video. A week later came the Motion Brush feature, which sets static objects in motion with a single brush stroke.

Soon after, Stability AI, the mainstay of text-to-image generation, released Stable Video Diffusion, adding further fuel to the field.

Pika 1.0, meanwhile, has won over many Silicon Valley heavyweights since its debut, thanks to simpler video generation, intuitive local video editing, and higher-quality output. From generation to post-production, everything can be done in one place.

The WALT model, launched by Fei-Fei Li's team in collaboration with Google, can likewise generate realistic 2D/3D videos or animations from natural-language or image prompts, with results comparable to those of Runway, Pika, and other leaders.

These AI video generation tools have advanced mainly along two dimensions: quality and quantity. On quality, products keep adopting more powerful model architectures and training on larger, higher-quality data, steadily improving the image quality, fluency, and fidelity of AI-generated video.

On quantity, generated video lengths keep being pushed upward, breaking into double-digit seconds, and the combinations of scenes and events keep growing richer. With further gains in computing power, generating high-quality videos lasting hours may become possible.

Technology in the clouds eventually lands in products, and the rise of AI video generation is opening up a huge blue-ocean market. Built on deep technical accumulation, Tongyi Qianwen's "National Dance King" is another product following this business logic.

This will not only sharpen competition among Alibaba and its rivals, accelerating the whole industry, but also give the rest of us more chances to experience what AI video generation makes possible.


Ai Faner