How Sora works, the new frontier of OpenAI video generation

Artificial intelligence is entering a new era: that of simulating the physical world in motion. Among the pioneers of this advance is OpenAI's Sora, an artificial intelligence model that promises to revolutionize the way we generate videos.

A breakthrough in AI: Sora spreads its wings

As announced on OpenAI's official website, Sora is not just a text-to-video generation model; it is an ambitious project that aims to teach artificial intelligence to understand and simulate the physical world in motion. This opens previously closed doors to real-world problems that require interaction with the physical world, with a clear goal: to help people solve complex practical problems.

Key Features of Sora

  • Visual Quality and Prompt Adherence: Sora can generate videos up to one minute long while maintaining excellent visual quality and closely adhering to user prompts.
  • Professional Feedback: Currently available to red teamers assessing critical areas for risks or harms, Sora is also accessible to a select number of visual artists, designers, and filmmakers, with the aim of gathering feedback to improve the model for the benefit of creative professionals.
  • Open Research: Research progress is being shared early in order to collaborate with and receive feedback from people outside of OpenAI, giving the public a preview of the AI capabilities on the horizon.

Sora's main competitor

Sora's main competitor is Runway's Gen-2, another cutting-edge generative artificial intelligence technology that specializes in creating videos from textual input, images, or other videos. The platform stands out for its ability to interpret and transform various types of input into dynamic, customizable video content, spanning creative modes such as text to video, text and image to video, and image to video.

At first glance, it might seem that Sora has a distinct advantage over Gen-2. However, an objective comparison based on solid criteria will have to wait until Sora is available to the public. That evaluation will allow us to fully understand the capabilities and performance of both systems in real-world contexts of use.

Unprecedented accuracy

From the available videos we can see that Sora is capable of generating complex scenes with multiple characters, specific types of movement, and accurate details of both subject and background. The videos shared by OpenAI, each accompanied by the prompt that generated it, testify to the power of this new tool. OpenAI emphasizes that the model understands not only what is asked in the prompt, but also how those elements exist in the physical world. In particular, OpenAI highlights two characteristics:

  • Understanding of Language: The model has a deep understanding of language, allowing it to interpret prompts accurately and generate characters that express vibrant emotions.
  • Visual Persistence: Sora can create multiple shots within a single generated video, accurately maintaining characters and visual style.

Sora's challenges

Despite its impressive capabilities, Sora has some limitations:

  • Physics Simulation: It may have difficulty accurately simulating the physics of a complex scene; for example, a bitten cookie may show no bite mark.
  • Spatial and Temporal Details: The model may confuse spatial details, such as reversing left and right, and may struggle with precise descriptions of events that unfold over time.

Safety and innovation: Sora's steps forward

While Sora opens new frontiers in video generation via artificial intelligence, safety remains a central pillar of its development, according to the company. OpenAI is taking important safety measures before making Sora available in its products, proactively addressing challenges related to misinformation, hateful content, and bias.

Strategic collaborations for safety and new tools for content veracity

According to OpenAI, collaboration with red teamers (cybersecurity professionals who specialize in simulating attacks on an organization's IT systems to evaluate its security and defenses) and with experts in domains such as disinformation, hateful content, and bias is a fundamental step. These professionals are tasked with adversarially testing the model, ensuring a critical assessment of its capabilities and potential risk areas.

OpenAI is developing tools dedicated to detecting misleading content, including a detection classifier capable of identifying videos generated by Sora. In the future, OpenAI plans to include C2PA metadata in products that use Sora, further improving transparency and trust.

Inherited and new safety techniques

The safety methodologies developed for DALL·E 3 (the text-to-image tool available in the ChatGPT Plus suite) are also applied to Sora, along with new techniques being prepared for its deployment. Once Sora is integrated into an OpenAI product, a text classifier will examine and reject prompts that violate usage policies, such as requests for extreme violence or sexual content. Image classifiers will review every frame of each generated video, ensuring adherence to usage guidelines before it is shown to the user.
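As a rough illustration of the prompt-screening stage described above, the sketch below implements a toy keyword filter. To be clear, this is purely hypothetical: OpenAI's actual classifiers are learned models, not keyword lists, and the function name and blocked terms here are invented for illustration.

```python
# Hypothetical, deliberately simple stand-in for a policy classifier.
# A production system would use a trained model, not substring matching.
BLOCKED_TERMS = {"extreme violence", "sexual content"}

def screen_prompt(prompt: str) -> bool:
    """Return True if the prompt passes the (toy) policy check."""
    lowered = prompt.lower()
    return not any(term in lowered for term in BLOCKED_TERMS)
```

The point is only the pipeline shape: text is checked before generation begins, and a rejection short-circuits the request entirely, while a separate per-frame check would run on the generated output.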

Engagement with global policymakers, educators, and artists is essential to understand concerns and identify positive use cases for this new technology. Despite extensive research and testing, it is impossible to predict all the beneficial or harmful ways in which the technology will be used. Learning from real-world use is therefore considered a critical component of building and releasing increasingly safe AI systems over time.

Sora research and development techniques

Sora uses a diffusion model: it starts with a video resembling static noise and gradually transforms it by removing the noise over many steps. Capable of generating entire videos at once or extending existing videos, Sora leverages a transformer architecture similar to that of the GPT models, which gives it superior scaling performance.
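The many-step denoising loop can be sketched in a few lines. This is a minimal caricature, not Sora's actual sampler: the "denoiser" below is a hand-written function that pulls the sample toward a fixed target, standing in for a learned neural network, and all names and shapes are invented for illustration.

```python
import numpy as np

def toy_denoiser(x, t):
    """Hypothetical stand-in for a learned model: nudges the sample
    toward a fixed clean target (here, all zeros) at each step."""
    target = np.zeros_like(x)
    return x + (target - x) / t  # remove a fraction of the noise per step

def generate_video(shape=(8, 16, 16), steps=50, seed=0):
    """Start from pure Gaussian noise shaped like a tiny video
    (frames, height, width) and iteratively denoise it."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)
    for t in range(steps, 0, -1):  # countdown mirrors noise removal
        x = toy_denoiser(x, t)
    return x
```

The essential structure matches the description above: generation begins from noise and the result emerges gradually, one small denoising step at a time.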

By representing videos and images as collections of smaller units of data called patches, similar to tokens in GPT, Sora unifies how visual data is represented. This allows training on a wider range of visual data, spanning different durations, resolutions, and aspect ratios. Building on previous research in the DALL·E and GPT models, Sora represents a foundation for models capable of understanding and simulating the real world, a milestone on the path to AGI (artificial general intelligence).
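The patches-as-tokens idea can be made concrete with a small sketch: cut a video tensor into spacetime blocks and flatten each block into one row of a sequence, the way text is split into tokens. The patch sizes and array layout below are assumptions chosen for illustration, not Sora's actual parameters.

```python
import numpy as np

def patchify(video, pt=2, ph=4, pw=4):
    """Split a (frames, height, width) array into flattened spacetime
    patches of size pt x ph x pw, one row per patch."""
    f, h, w = video.shape
    assert f % pt == 0 and h % ph == 0 and w % pw == 0
    return (video
            .reshape(f // pt, pt, h // ph, ph, w // pw, pw)
            .transpose(0, 2, 4, 1, 3, 5)   # group the patch axes together
            .reshape(-1, pt * ph * pw))    # sequence of patch "tokens"

# A tiny 8-frame, 16x16 "video" becomes a sequence of 64 patches,
# each holding 2*4*4 = 32 values.
video = np.arange(8 * 16 * 16, dtype=float).reshape(8, 16, 16)
tokens = patchify(video)
print(tokens.shape)  # (64, 32)
```

Because any duration, resolution, or aspect ratio that divides evenly into patches yields the same kind of sequence, a single transformer can, in principle, train on all of them, which is the unification the paragraph above describes.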

The advent of Sora marks an important step forward in the generation of visual content through artificial intelligence. While challenges remain, the path taken opens up new creative and professional possibilities and promises to transform the landscape of video production. We now have to wait for the public release in order to verify the potential of this new tool.

The article How Sora works, the new frontier of OpenAI video generation was written on: Tech CuE | Close-up Engineering.