OpenAI changes the world in 26 minutes! A free GPT-4-level model is here, and voice and video interaction fast-forwards us into a science-fiction movie

Early this morning, a 26-minute press conference once again reshaped the AI industry and our future lives, and it will leave countless AI startups miserable.

That really is not clickbait, because this was an OpenAI press conference.

Just now, OpenAI officially released GPT-4o, where the "o" stands for "omni" (all-encompassing). The model handles text, images, video, and voice; you could even think of it as an unfinished version of GPT-5.

What's more, this GPT-4-level model will be available to all users for free and will be rolled out to ChatGPT Plus users over the coming weeks.

Let us first summarize the highlights of the event in one go; read on below for a closer look at each feature.

Key points of the press conference

  1. A new GPT-4o model: it accepts any combination of text, audio, and image input and can generate output directly across those modalities, with no intermediate conversion
  2. GPT-4o greatly reduces voice latency: it can respond to audio input in as little as 232 milliseconds, with an average of 320 milliseconds, similar to human response times in conversation
  3. GPT-4-level intelligence is free and open to all users
  4. The GPT-4o API is 2x faster than GPT-4 Turbo and 50% cheaper
  5. A stunning real-time voice assistant demo: conversations feel far more human, with real-time translation, facial-expression recognition, and the ability to see through the camera or the screen, help write code, and analyze charts
  6. A new, cleaner ChatGPT UI
  7. A new ChatGPT desktop app for macOS, with a Windows version coming later this year

These features had been described by Altman as "feeling like magic" back in the warm-up stage. With AI models around the world busy "catching up with GPT-4", OpenAI needed to pull something genuinely new out of its arsenal.

A free and widely available GPT-4o is here, but that is not its biggest highlight

In fact, the day before the event, we noticed that OpenAI had quietly changed GPT-4's description from "the most advanced model" to simply "advanced".

This was to make way for GPT-4o. Its power lies in the fact that it can accept any combination of text, audio, and images as input and directly generate output in those same modalities.

This means that human-computer interaction will be closer to natural communication between people.

GPT-4o can respond to audio input in as little as 232 milliseconds, 320 milliseconds on average, close to human reaction times in conversation. Previously, when using voice mode to talk to ChatGPT, the average latency was 2.8 seconds (GPT-3.5) or 5.4 seconds (GPT-4).

It matches GPT-4 Turbo's performance on English and code, improves significantly on non-English text, and is faster and 50% cheaper in the API.

Compared with existing models, GPT-4o performs particularly well in visual and audio understanding.

  • You can interrupt it at any time during a conversation
  • It can generate a range of tones to suit the scene, with human-like moods and emotions
  • You can have a live video call with the AI and let it answer questions about what it sees

Judging from the benchmark numbers, GPT-4o's core capabilities are basically on par with GPT-4 Turbo, currently OpenAI's strongest model.

In the past, our experience with Siri and other voice assistants was never ideal. Essentially, a conversation with a voice assistant went through three stages (a rough sketch of this cascade follows the list):

  1. Speech recognition, or "ASR": audio -> text, similar to Whisper;
  2. An LLM plans what to say next: text 1 -> text 2;
  3. Text-to-speech, or "TTS": text 2 -> audio, think ElevenLabs or VALL-E.
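
To make the latency problem concrete, here is a minimal Python sketch of that cascaded pipeline. The three stage functions are hypothetical placeholders rather than real APIs; the point is simply that every stage adds its own delay and throws away audio-level information before it ever reaches the LLM.

```python
import time

def transcribe(audio: bytes) -> str:
    """Stage 1, ASR: audio -> text (think of a Whisper-style model)."""
    return "hello, what's the weather like today?"  # placeholder result

def plan_reply(user_text: str) -> str:
    """Stage 2, LLM: text in -> text out."""
    return "It looks sunny where you are."  # placeholder result

def synthesize(reply_text: str) -> bytes:
    """Stage 3, TTS: text -> audio (think ElevenLabs or VALL-E)."""
    return b"\x00" * 16000  # placeholder waveform

def cascaded_assistant(audio_in: bytes) -> bytes:
    """The traditional voice-assistant cascade: each hop adds latency, and
    intonation, emotion, and background sound never reach the LLM."""
    start = time.perf_counter()
    text_in = transcribe(audio_in)     # delay source 1
    text_out = plan_reply(text_in)     # delay source 2
    audio_out = synthesize(text_out)   # delay source 3
    print(f"round-trip latency: {time.perf_counter() - start:.3f}s")
    return audio_out

cascaded_assistant(b"")  # with real models, these hops add up to seconds
```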

However, our everyday natural conversations look more like this:

  • Think about what to say next while listening and speaking;
  • Insert "yeah", "um", and other fillers at the appropriate moments;
  • Anticipate when the other party will finish speaking and take over immediately;
  • Decide to interrupt naturally, without causing resentment;
  • Handle being interrupted gracefully.

Previous AI voice assistants could not handle any of this well, and each of the three stages added noticeable delay, making for a poor experience. A great deal of information was also lost along the way: the model could not directly perceive intonation, multiple speakers, or background noise, and it could not output laughter, singing, or emotion.

When audio input can directly produce audio, image, and text output, the whole experience improves by leaps and bounds.

GPT-4o is a brand-new model that OpenAI trained for exactly this purpose: converting directly across text, vision, and audio means that all inputs and outputs are processed by the same neural network.
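
At the API level, the same model name covers these mixed inputs. As a minimal sketch using the OpenAI Python SDK (the chart URL is a placeholder; at launch the public API accepted text and image input with text output, while the audio features were demoed in ChatGPT first), a request might look like this:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# One request mixing text and an image; the chart URL is a placeholder.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "What kind of chart is this, and what trend does it show?"},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/chart.png"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```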

What's even more surprising is that ChatGPT's free users can use GPT-4o to do the following:

  • Experience GPT-4-level intelligence
  • Get responses that draw on both the model and the web
  • Analyze data and create charts
  • Chat about photos you take
  • Upload files for help with summarizing, writing, or analysis
  • Use GPTs and the GPT Store
  • Build more helpful experiences with Memory

And once you have watched the following GPT-4o demonstrations, your feelings may grow more complicated still.

ChatGPT version "Jarvis", everyone has it

ChatGPT could already speak, listen, and even see; none of that is new in itself. But this brand-new version of ChatGPT still surprised me.

Bedtime companion

Take a concrete everyday scene: ask ChatGPT to tell a bedtime story about robots and love, and it spins an emotional, dramatic tale almost without pausing to think.

It can even sing the story, making it a genuine sleep companion.

Problem-solving master

Or take the demonstration at the press conference of how to solve the linear equation 3x + 1 = 4: it guided the presenter step by step to the correct answer.
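
For reference, the working it talks the presenter through is only a couple of steps: subtract 1 from both sides, then divide by 3.

```latex
\begin{aligned}
3x + 1 &= 4 \\
3x &= 4 - 1 = 3 \\
x &= \tfrac{3}{3} = 1
\end{aligned}
```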

Of course, all of the above is still child's play; the live coding tasks were the real test. Yet GPT-4o dispatched them in no time.

With ChatGPT's "vision", it can view everything on the computer screen, such as interacting with the code base and viewing the charts generated by the code. Huh, something's wrong? Then won’t our privacy be seen clearly in the future?

Real-time translation

The live audience also threw some tricky challenges at ChatGPT.

Translating from English to Italian and back from Italian to English, the voice assistant handled everything with ease, however hard it was pushed. It seems there is no longer much reason to spend money on a dedicated translation device; in the future, ChatGPT may well prove more reliable than a human interpreter at your side.


▲ Real-time translation (official website case)
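
The official demo runs over voice, but the behavior is easy to approximate in text with the same model. Here is a hypothetical sketch of a two-way interpreter built on the chat API; the system prompt wording is our own, not OpenAI's.

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical interpreter prompt; the wording is ours, not from the demo.
SYSTEM_PROMPT = (
    "You are a real-time interpreter between two speakers. "
    "When you receive English, reply with only the Italian translation; "
    "when you receive Italian, reply with only the English translation."
)

def interpret(utterance: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": utterance},
        ],
    )
    return response.choices[0].message.content

print(interpret("How has your week been going?"))
```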

Perceiving the emotion in speech is only the first step; ChatGPT can also read the emotions on a human face.

At the press conference, ChatGPT at first mistook the face captured by the camera for a table. Just as everyone thought the demo was about to fall flat, it turned out that the front-facing camera, when first switched on, had been pointed at the table.

In the end, though, it accurately described the emotion in the selfie and correctly identified the "bright" smile on the presenter's face.

Interestingly, at the end of the press conference, the presenter did not forget to call out the "strong support" from NVIDIA and its founder Jensen Huang. Very tactful indeed.

The idea of a conversational language interface now looks incredibly prophetic.

Altman has said in previous interviews that he hopes to eventually build an AI assistant like the one in the film "Her", and the voice assistant OpenAI released today is indeed bringing that vision closer to reality.

Brad Lightcap, chief operating officer of OpenAI, predicted not long ago that in the future we will talk to AI chatbots the same way we talk to humans, treating them as part of a team.

In hindsight, that remark not only paved the way for today's event; it also reads like a vivid footnote to the next ten years of our lives.

Apple has struggled with AI voice assistants for thirteen years without finding its way out of the maze, yet OpenAI seems to have found the exit overnight. It is easy to imagine that in the near future, Iron Man's "Jarvis" will no longer be a fantasy.

"She is coming

Although Sam Altman did not appear on stage, he published a blog post after the event and posted a single word on X: her.

This is obviously an allusion to the classic science-fiction film "Her", and it was the first image that came to my mind while watching the presentation.

Samantha in the movie "Her" is not just a product, she even understands humans better than humans and is more like humans themselves. You can really gradually forget that she is an AI when communicating with her.

This suggests that human-computer interaction may be getting its first truly revolutionary update since the graphical interface, as Sam Altman wrote in his blog:

The new voice (and video) modes are the best computer interface I've ever used. It feels like an AI from a movie; and I'm still a little surprised it's real. Reaching human-level response times and expressiveness turns out to be a big change.

The previous ChatGPT gave us a first glimpse of a natural user interface, whose rule is simplicity above all else: complexity is the enemy of a natural interface, and every interaction should be self-explanatory, requiring no instruction manual.

But the GPT-4o released today goes much further. It responds with almost no delay, and it is smart, fun, and practical. Our interaction with computers has never felt this natural and fluid.

Huge possibilities are still hidden here. Once more personalized features and deeper collaboration with different devices are supported, we will be able to use phones, computers, smart glasses, and other computing terminals to do many things that were simply not possible before.

Dedicated AI hardware will no longer have to piece an experience together. What is more exciting is that if Apple officially announces its cooperation with OpenAI at WWDC next month, the iPhone experience may improve more than it has after any Apple event in recent years.

Jim Fan, a senior research scientist at NVIDIA, believes the cooperation between OpenAI and iOS 18, billed as the largest update in the system's history, could play out at three levels:

  • Ditch Siri: OpenAI distills a smaller GPT-4o for iOS that runs purely on device, with an optional paid upgrade to cloud services.
  • Native features that feed camera or screen streams into the model, with chip-level support for neural audio and video codecs.
  • Integration with iOS's system-level action APIs and smart-home APIs. Hardly anyone uses Siri Shortcuts, but it is time for a revival. This could become an AI agent product with a billion users from day one, a Tesla-style, full-scale data flywheel for smartphones.

Speaking of which, I have to feel sorry for Google, which will hold a press conference tomorrow.

Authors: Li Chaofan and Mo Chongyu

