A deep dive into Sora, the most powerful video generation model yet: how does OpenAI deliver a full minute in a single take?

Early this morning, OpenAI pulled the AI video generation tool Sora out of its arsenal, instantly dominating the news headlines.

Even Musk, who has long been at odds with OpenAI, was willing to acknowledge Sora's power and praise it: "In the next few years, humans will create outstanding works with the help of AI."

The power of Sora lies in its ability to generate coherent, smooth videos of up to 60 seconds from a text description, featuring intricate scenes, vivid character expressions, and complex camera movements.

Compared with other tools that can only generate clips a few seconds long, Sora's one-minute duration is nothing short of table-turning.

More importantly, Sora leads the field in video realism, length, stability, consistency, resolution, and text understanding. Let us first enjoy the officially released demo clips.

Prompt: Beautiful, snowy Tokyo city is bustling. The camera moves through the bustling city street, following several people enjoying the beautiful snowy weather and shopping at nearby stalls. Gorgeous sakura petals are flying through the wind along with snowflakes.

In this video, a couple is seen from a drone's perspective walking through a busy city street, with beautiful cherry blossom petals dancing in the air accompanied by snowflakes.

While other tools are still struggling to keep a single shot stable, Sora smoothly handles seamless cuts between multiple shots, and its coherence across cuts and consistency of objects put it far ahead of the pack; it is a true dimensionality-reduction strike.

▲From @gabor

In the past, shooting such a video might have required a great deal of time and effort on tedious tasks such as scriptwriting and shot design. Now, a short text description is enough for Sora to generate a scene on this scale, and practitioners in related fields may already be starting to tremble.

Netizen @debarghya_das created this 20+ second trailer in 15 minutes, combining clips from OpenAI's Sora, a David Attenborough voice-over from Eleven Labs, and some nature-soundtrack samples from YouTube, edited together in iMovie.

How does Sora achieve its powerful effects?

OpenAI also released a detailed technical report on Sora, introducing the technical principles and applications behind it.

So, how did Sora achieve this breakthrough? Inspired by the successful practice of LLMs, OpenAI introduces visual patches (patch embeddings), a highly scalable and effective representation of visual data that greatly improves a generative model's ability to handle diverse video and image data.

Raw video lives in a very high-dimensional space, so OpenAI first compresses the video data into a low-dimensional latent space and then decomposes it into spacetime patches, converting each video into a sequence of encoded blocks.

Next, OpenAI trained a network specifically designed to reduce the dimensionality of visual data. The network takes a raw video as input and outputs a latent representation that is compressed in both time and space. Sora is trained on, and generates videos in, this compressed latent space.

Additionally, OpenAI trained a decoder model that can restore these latent representations to pixel-level video images.
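
To make the idea concrete, here is a minimal, purely illustrative sketch in PyTorch of such a spatiotemporal compressor and decoder. The layer choices, channel counts, and downsampling factors are assumptions for illustration only, not Sora's actual architecture.

```python
import torch
import torch.nn as nn

class VideoAutoencoder(nn.Module):
    """Toy spatiotemporal autoencoder: compresses a video in time and space,
    then reconstructs it. Purely illustrative; not OpenAI's actual network."""

    def __init__(self, in_channels=3, latent_channels=16):
        super().__init__()
        # Encoder: raw video -> latent compressed 4x in height/width, 2x in time.
        self.encoder = nn.Sequential(
            nn.Conv3d(in_channels, 64, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.SiLU(),
            nn.Conv3d(64, latent_channels, kernel_size=3, stride=(2, 2, 2), padding=1),
        )
        # Decoder: latent -> pixel-space video of the original size.
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(latent_channels, 64, kernel_size=4, stride=(2, 2, 2), padding=1),
            nn.SiLU(),
            nn.ConvTranspose3d(64, in_channels, kernel_size=(3, 4, 4), stride=(1, 2, 2), padding=1),
        )

    def forward(self, video):  # video: (batch, channels, frames, height, width)
        latent = self.encoder(video)
        recon = self.decoder(latent)
        return latent, recon

# Example: a 16-frame 256x256 clip is squeezed into a much smaller latent tensor.
clip = torch.randn(1, 3, 16, 256, 256)
latent, recon = VideoAutoencoder()(clip)
print(latent.shape, recon.shape)  # (1, 16, 8, 64, 64) and (1, 3, 16, 256, 256)
```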

By processing the compressed video input, the researchers can extract a series of spacetime patches, which play a role in the model similar to tokens in a Transformer.

Thanks to this patch-based representation, Sora can adapt to videos and images of different resolutions, durations, and aspect ratios. When generating new video content, randomly initialized patches are arranged into a grid of the required size, which controls the size and shape of the final video.
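
A rough sketch of both ideas follows: cutting a compressed latent video into spacetime patches that serve as the model's tokens, and choosing the output's resolution, duration, and aspect ratio simply by deciding how many noise patches to lay out in the grid. The patch sizes and latent shapes here are assumptions for illustration, not Sora's real settings.

```python
import torch

def patchify(latent, pt=2, ph=4, pw=4):
    """Cut a latent video (C, T, H, W) into a sequence of spacetime patches.
    Each patch covers pt latent frames and a ph x pw spatial region."""
    c, t, h, w = latent.shape
    patches = (
        latent.reshape(c, t // pt, pt, h // ph, ph, w // pw, pw)
        .permute(1, 3, 5, 0, 2, 4, 6)       # (nT, nH, nW, C, pt, ph, pw)
        .reshape(-1, c * pt * ph * pw)      # one flat token per patch
    )
    return patches

def noise_grid(frames, height, width, channels=16, pt=2, ph=4, pw=4):
    """At sampling time, the output size is chosen simply by deciding how many
    noise patches to lay out in the (time, height, width) grid."""
    latent = torch.randn(channels, frames, height, width)
    return patchify(latent, pt, ph, pw)

tokens = noise_grid(frames=8, height=64, width=64)   # a short square clip
print(tokens.shape)   # (1024, 512): 1024 noise patches to be denoised

wide = noise_grid(frames=8, height=36, width=64)     # a wider, 16:9-ish clip
print(wide.shape)     # fewer rows of patches -> a shorter, wider frame
```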

Although the principle above sounds complicated, the new technique OpenAI uses, visual patches, is a bit like sorting a jumbled pile of building blocks into small boxes: even when there are a great many blocks, you can quickly find the ones you need as long as you know which box to open.

Because the video data has been converted into patches, when OpenAI gives Sora a new video task, it first extracts patches containing temporal and spatial information from the video. These patches are then handed to Sora, which generates new videos based on this information.

In this way, a video can be pieced back together like a puzzle. The benefit is that the model can learn from and process many different kinds of images and videos more efficiently.

As training went deeper, OpenAI's researchers found that sample quality improved significantly as training compute increased. They also found that training directly on data at its native size has several advantages:

  • Sora's training material is never cropped, so it can create content directly at the native aspect ratios of different devices.
  • Training on the native aspect ratio of the video can significantly improve the composition and layout quality of the video.

In addition, Sora has the following features:

Training a text-to-video generation system requires a large number of videos with text captions. OpenAI applies the re-captioning technique introduced with DALL·E 3 to video.

Similar to DALL·E 3, OpenAI uses GPT to convert the user's short prompts into longer detailed instructions and then sends them to the video model, allowing Sora to generate high-quality videos.
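
A rough sketch of this prompt-expansion step might look like the following, using the standard OpenAI Python client; the model name and system instruction are placeholders rather than whatever Sora's pipeline actually uses.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def expand_prompt(short_prompt: str) -> str:
    """Turn a user's short prompt into a longer, detailed caption before it
    is handed to the video model. Illustrative only; the system instruction
    and model name are assumptions, not Sora's actual pipeline."""
    response = client.chat.completions.create(
        model="gpt-4",  # placeholder model name
        messages=[
            {
                "role": "system",
                "content": (
                    "Rewrite the user's idea as a richly detailed video caption: "
                    "describe the scene, subjects, lighting, and camera movement."
                ),
            },
            {"role": "user", "content": short_prompt},
        ],
    )
    return response.choices[0].message.content

detailed = expand_prompt("a corgi surfing at sunset")
# The detailed caption, not the original few words, conditions the video model.
```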

In addition to converting from text, Sora can also accept input from images or existing videos. This feature allows Sora to complete a variety of image and video editing tasks, such as making seamless loop videos, adding animation effects to static images, extending the playback time of videos, etc.

A realistic image of clouds forming the word "SORA".

In a richly decorated historical hall, a huge wave is about to crash. Two surfers seize the moment and masterfully ride the wave.

Sora can change the style and environment in a video without any prior examples. Even two videos with completely different styles can be connected smoothly.

Sora can also generate images. The research team creates images of various sizes by arranging patches of Gaussian noise in a spatial grid with a temporal extent of just one frame, at resolutions up to 2048×2048.
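
In terms of the illustrative noise_grid helper from the earlier sketch, an image is simply a grid whose temporal extent is one frame; the 8x pixel-to-latent downsampling factor below is likewise an assumption, not Sora's real value.

```python
# An image is just a one-frame "video": lay out a single temporal slice of
# noise patches and let the model denoise it.
image_tokens = noise_grid(frames=1, height=256, width=256, pt=1)
print(image_tokens.shape)  # (4096, 256) patches -> a 2048x2048 image if the
                           # latent is 8x smaller than pixel space
```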

To its credit, OpenAI also frankly acknowledged Sora's current limitations, such as its inability to simulate the physics of complex scenes or to understand certain causal relationships. For example, it cannot accurately simulate a basic physical interaction like glass shattering.

▲Running in the opposite direction

But OpenAI firmly believes that Sora's current capabilities show that continued expansion of video models is a promising path toward developing capable simulators that can simulate the physical and digital worlds and the objects, animals, and humans within them.

World models, the next direction of AI?

OpenAI found that when trained at scale, Sora exhibits a compelling set of emergent capabilities that can simulate real-world people, animals, and environments to a certain extent.

These capabilities are not based on specific presets of three-dimensional space or objects, but are driven by large-scale data.

  • Coherence in three-dimensional space
    Sora can generate videos with dynamic perspective changes. When the camera position and angle change, the characters and scene elements in the video can move coherently in the three-dimensional space.
  • Long-range coherence and object permanence
    Sora maintains continuity over long stretches of video, even when people, animals, or objects are occluded or leave the frame. Likewise, it can show the same character multiple times in one video sample while keeping their appearance consistent.
  • Simulation of the digital world
    Sora can also simulate digital processes such as video games; simply mentioning the word "Minecraft" is enough to activate the relevant capability.

OpenAI regards Sora as "the foundation of models that can understand and simulate the real world" and believes that its capabilities "will be an important milestone in the realization of AGI."

Regarding the arrival of Sora, NVIDIA senior scientist Jim Fan said:

If you think OpenAI's Sora is a tool for creative experimentation, like DALL·E, you may want to reconsider.

Sora is actually a data-driven physics simulation engine that can simulate real or fictional worlds. The simulator learns intricate image rendering, "intuitive" physics, long-horizon planning, and semantic-level understanding through denoising and gradient computation.

Underlying this capability is the general world model, an artificial intelligence system whose goal is to build a neural network module that maintains and updates an internal state in order to memorize and model the environment.

This model is able to predict the next possible observation based on current observations (such as images, states, etc.) and upcoming actions. It simulates possible future events in the environment by learning the laws and common sense of the world.

In fact, the world model is not a new concept. As early as last December, Runway, a leader in AI video generation, officially announced plans to build a general world model, aiming to create AI systems that, unlike today's LLMs, can simulate the real world more faithfully.

Specifically, the core idea of the world model is to learn how the world operates by memorizing historical experience, and then predict events that may occur in the future. For example, from a video of a falling object, the model can predict the next frame based on the current picture, thereby learning the physical laws of object movement.
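
A toy sketch of this idea in PyTorch: a module that keeps an internal state, consumes the current observation and action, and predicts the next observation. All dimensions and layers here are arbitrary placeholders, nothing like a production world model.

```python
import torch
import torch.nn as nn

class ToyWorldModel(nn.Module):
    """Minimal illustration of the world-model idea: keep an internal state,
    and predict the next observation from the current observation and action.
    A toy sketch only; real world models (or Sora) are far more elaborate."""

    def __init__(self, obs_dim=64, action_dim=4, state_dim=128):
        super().__init__()
        self.rnn = nn.GRUCell(obs_dim + action_dim, state_dim)  # memorizes history
        self.predict_next_obs = nn.Linear(state_dim, obs_dim)   # next-frame guess

    def forward(self, obs, action, state):
        state = self.rnn(torch.cat([obs, action], dim=-1), state)
        return self.predict_next_obs(state), state

# Roll the model forward: each step consumes an observation and an action,
# updates its memory of the environment, and predicts what comes next.
model = ToyWorldModel()
state = torch.zeros(1, 128)
obs = torch.randn(1, 64)          # e.g. an encoded frame of a falling object
for _ in range(10):
    action = torch.zeros(1, 4)    # a passive observer takes no action
    obs, state = model(obs, action, state)
```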

Turing Award winner Yann LeCun has proposed a similar concept and has criticized autoregressive, probabilistic generative models such as GPT, arguing that they cannot solve the hallucination problem. LeCun and his team even predict that models like GPT may be obsolete within the next five years.

World models can be seen as a research direction in the field of artificial intelligence that attempts to create AI closer to the level of human intelligence. By simulating and learning from real-world environments and events, world models have the potential to drive AI toward higher levels of simulation and prediction capabilities.

In February, Justine Moore, a partner at the well-known venture capital firm a16z, published an in-depth analysis of the state of AI video generation. In the two years since generative AI entered the public eye, the field has flourished, with a hundred flowers blooming and a hundred schools of thought contending.

With the addition of OpenAI Sora, the field of AI video generation will make huge waves, and existing mainstream platforms such as Runway, Pika and Stable Video Diffusion may be affected.

At the same time, the rules of the game for independent creators will change completely. Anyone with creativity and ideas can use Sora to generate their own video content, and this lower barrier to creation means a golden age for independent creators.

As said in "The Three-Body Problem", "It doesn't matter." Regardless of the current competitive situation, the field of AI video generation may be subverted by new technologies and innovations. And Sora's entry is just the beginning, far from the end.
