"Making videos with your mouth" is really here! Meta's new app is astonishing

This year has been one of great strides for AI in image and video generation.

An AI-generated image took a digital art award, beating a field of human artists; apps like TikTok can turn text prompts into images and use them as green-screen backgrounds for short videos; and now a new product can generate video directly from text, truly letting you "make videos with your mouth."

This time the product comes from Meta, a company that has invested in artificial intelligence for many years and was widely mocked over the metaverse not long ago.

▲ Meta's metaverse has been widely mocked

This time, though, it's hard to mock, because Meta really has made a small breakthrough.

Text-to-video: what it can do

Now you can make a video just by moving your mouth.

That's a bit of an exaggeration, but Meta's Make-A-Video really is moving toward that goal.

What Make-A-Video can currently do:

  • Text-to-video: turn your imagination into unique, one-of-a-kind videos
  • Image-to-video: set a single image, or a pair of images, in natural motion
  • Video-to-video: input a video to create variants of it

When it comes to generating video directly from text, Make-A-Video already outdoes many animation students: it can handle almost any style, and production costs are very low.

The official site doesn't yet let you generate videos yourself, but you can submit your information and the Make-A-Video team will share updates with you first.

Not many examples are available so far, and the ones on the official site still look odd in the details. Still, turning text directly into video is an advance in itself.

A teddy bear paints a self-portrait; look closely, and the shadow the bear's hand casts on the paper is unnatural.

Robots dance in Times Square.

A cat holds a TV remote and changes channels. Its paws look uncannily like human hands, which makes it a little unsettling to watch.

And a furry sloth in an orange knitted hat fiddles with a laptop, the glow of the screen reflected in its eyes.

The examples above are surreal in style; the more realistic ones expose their flaws more easily.

Make-A-Video's realistic examples hold up well when they stay on a tight shot: a close-up of an artist painting on canvas, a horse drinking water, small fish swimming through a coral reef.

But a more realistic clip of a young couple walking in heavy rain is very strange: the upper bodies are fine, while the feet flicker and occasionally stretch, like something out of a ghost movie.

There are also painterly-style videos: a spaceship landing on Mars, a couple in formal dress caught in a downpour, sunlight falling on a table, a panda doll in motion. The details aren't perfect, but judged purely as a new feat of AI text-to-video, they are still stunning.

Make-A-Video can also animate static images: a boat rides the big waves.

Turtles swim in the sea; the opening frames look natural, but later the clip starts to resemble an unnatural green-screen cutout.

A yoga instructor stretches in the rising sun, but the yoga mat changes as the video plays. Here the AI can't beat film-production students: it isn't good at holding the unchanging parts of a scene steady.

Finally, there are three examples of feeding in a video and imitating its style to create variants.

One of the transformations is rougher: a video of an astronaut drifting in space was turned into four cruder, less attractive variants.

The dancing bear video changes in surprising ways; at the very least, the dance moves are different.

As for the last example, a rabbit eating grass, the variants blend in best: it's hard to tell which of the five videos is the original, and they all look natural together.

Text-to-image has barely made its breakthrough, and video is already here

In " After AlphaGo, it completely subverts human cognition again ", we once introduced the image generation application DALL·E. Someone has used it to create images to compete with human artists and eventually win.

The Make-A-Video we see now is, in effect, an early video version of DALL·E: like the DALL·E of 18 months ago, it represents a huge breakthrough, even if the current results may not fully satisfy everyone.

▲ An extended painting (outpainting) created with DALL·E

You could even say it stands on the shoulders of the giant that is DALL·E: compared with text-to-image generation, Make-A-Video doesn't change much on the back end.

"We saw that models describing text-generated pictures were also surprisingly effective at generating short videos," the researchers said in their paper.

▲ The award-winning work, an image generated from a text description

At present, videos produced by Make-A-Video claim three advantages:

  1. Training of the T2V (text-to-video) model is accelerated
  2. No paired text-video data is needed (a rough sketch of how this works follows the list)
  3. The generated video inherits the style of the source image/video
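How does a video model get away without paired text-video data? As the paper describes it, Make-A-Video reuses a pretrained text-to-image network and adds newly initialized temporal layers on top, training only those layers on unlabeled video. Below is a minimal PyTorch sketch of one such "pseudo-3D" convolution; the class name, shapes, and identity initialization are our illustration of the idea, not Meta's actual code.

```python
import torch
import torch.nn as nn

class Pseudo3DConv(nn.Module):
    """Sketch of a 'pseudo-3D' convolution: a spatial 2D conv (which would
    reuse pretrained text-to-image weights) followed by a temporal 1D conv
    initialized as an identity, so the stack starts out behaving exactly
    like the image model."""

    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        self.spatial = nn.Conv2d(dim, dim, kernel_size, padding=pad)
        self.temporal = nn.Conv1d(dim, dim, kernel_size, padding=pad)
        # Identity init: until trained on (unlabeled) video, the temporal
        # conv passes each frame through unchanged.
        nn.init.dirac_(self.temporal.weight)
        nn.init.zeros_(self.temporal.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, f, h, w = x.shape  # batch, channels, frames, height, width
        x = x.permute(0, 2, 1, 3, 4).reshape(b * f, c, h, w)
        x = self.spatial(x)      # per-frame spatial convolution
        x = x.reshape(b, f, c, h, w).permute(0, 3, 4, 2, 1).reshape(b * h * w, c, f)
        x = self.temporal(x)     # per-pixel convolution across frames
        return x.reshape(b, h, w, c, f).permute(0, 3, 4, 1, 2)

video = torch.randn(1, 64, 8, 32, 32)  # 8 frames of 64-channel 32x32 features
print(Pseudo3DConv(64)(video).shape)   # torch.Size([1, 64, 8, 32, 32])
```

Because only the temporal half is new, the spatial half keeps everything the image model already learned about text, which is what makes paired text-video data unnecessary.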

These videos certainly have drawbacks, and the unnaturalness described above is real. They also don't look like videos made in this era: the picture quality is blurry, the motion is stiff, there is no sound, clips run no longer than 5 seconds, and the resolution is just 64 × 64 px.

▲ In a few frames of this video, the dog's tongue and paws look very odd

CogVideo, the first model able to synthesize video directly from text, released a few months ago by a research team from Tsinghua University and the Beijing Academy of Artificial Intelligence (BAAI), has the same problem. Built on a large-scale pretrained Transformer, it proposes a multi-frame-rate hierarchical training strategy that aligns text and video clips efficiently, but its output doesn't stand up to scrutiny either. The hierarchical idea itself is simple, as the sketch below shows.
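Here is a hedged illustration of that multi-frame-rate hierarchy as we read the CogVideo paper: one model generates sparse keyframes from the text, and a second recursively doubles the frame rate by interpolating a frame between each neighboring pair. The two helper functions below are dummy stand-ins, not the real pretrained Transformers.

```python
def generate_keyframes(prompt: str, n: int = 5) -> list[str]:
    """Stand-in for the sequential stage: n keyframes generated from text."""
    return [f"keyframe_{i}({prompt})" for i in range(n)]

def interpolate(prompt: str, a: str, b: str) -> str:
    """Stand-in for the interpolation stage: one new frame between a and b."""
    return f"mid({a}|{b})"

def make_video(prompt: str, doublings: int = 2) -> list[str]:
    frames = generate_keyframes(prompt)   # stage 1: low frame rate
    for _ in range(doublings):            # stage 2: recursively double it
        denser = [frames[0]]
        for a, b in zip(frames, frames[1:]):
            denser += [interpolate(prompt, a, b), b]
        frames = denser
    return frames

print(len(make_video("a lion is drinking water")))  # 5 -> 9 -> 17 frames
```

Each doubling turns n frames into 2n − 1, so five keyframes become seventeen frames after two rounds, which is how the model keeps text and long clips aligned without generating every frame in one pass.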

But who's to say that, 18 months from now, Make-A-Video and CogVideo won't be making better videos than most people can?

▲ A video generated by CogVideo, which currently only supports Chinese prompts

Few text-to-video tools have been released so far, but many are on the way. After Make-A-Video's release, the developers of the startup Stability AI publicly stated: "Our (text-to-video application) will be faster and better, and will work for more people."

Competition makes things better, and the increasingly realistic output of text-to-image is the best proof.


