OpenAI-powered robot bursts onto the scene! ChatGPT finally has a body that can talk, see and do housework

Large models such as GPT-4 have supplied the brain; the next step is a robot body that can carry it.

Late last night, the high-profile humanoid robot company Figure AI released a striking video showing its robot, Figure 01, carrying out a series of conversational interactions powered by an OpenAI model.

The robot in the video responds and handles objects nimbly, and it converses with humans almost as fluently as a real person.

The video comes less than half a month after Figure AI received investment from OpenAI, Microsoft, Nvidia and other companies. It also offers a look at what OpenAI's most powerful multimodal large model might be like with a body.

Figure 01, the humanoid robot that understands you best?

Thanks to OpenAI's multimodal large model, Figure 01 is now adept at recognizing objects on the table. Apples, drying racks, cups and plates are all a piece of cake for it!

When you say you are hungry and want something to eat, it understands immediately and smoothly hands you an apple.

It can also pick up the trash you have discarded while explaining why it just gave you the apple: with the help of the large model, Figure 01 understands that the apple is the only food on the table.

At the command of a human, Figure 01 can also do housework and put away the dishes. This robot is simply the best partner in family life.

After seeing this stunning video, netizens had a variety of reactions.

Some netizens could not wait to assign tasks to Figure 01, although somehow movies about its robot predecessors ended up in the task list.

Are competitors watching this anxiously and quietly gearing up for a major technical race?

More excited netizens said that the dawn of AGI seems to be just around the corner.

Of course, there are always some critical voices. Some netizens complained, why is this robot stammering?

Netizens also didn’t miss the opportunity to make jokes.

Brett Adcock, the head of Figure AI, did not stay on the sidelines either and offered his own commentary on X.

The video demonstrates the application of end-to-end neural networks. No remote control (teleop) is used during this process. Video is shot at actual speed (1.0x speed) and is continuous.

As you can see in the video, the speed of the robots has improved significantly and we are gradually reaching speeds similar to humans.

No remote control required, self-taught

So how does Figure 01 do it?

Figure AI team leader Corey Lynch explained it on X.

Specifically, all behaviors demonstrated in the video were learned (not remotely controlled) and performed at realistic speed (1.0x speed).

Figure AI feeds images captured by the robot's camera and voice-transcribed text recorded through the onboard microphone into a multimodal model trained by OpenAI that can understand both image and text information.

The model processes the entire conversation history, including past images, to generate a verbal response and speak back to the human via text-to-speech. The same model is also responsible for deciding which learned closed-loop behavior to execute in response to a given command. It loads specific neural network weights onto the GPU and executes the corresponding policy.
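
The loop described above can be sketched roughly as follows. This is only an illustration of the data flow, not Figure's or OpenAI's actual code: `MultimodalModel`, `Turn`, `POLICY_LIBRARY`, the weight file paths and the keyword rule inside `respond` are all hypothetical stand-ins, since the real model and policies are not public.

```python
from dataclasses import dataclass, field

# Hypothetical names of learned closed-loop behaviors mapped to weight files.
POLICY_LIBRARY = {
    "hand_over_apple": "weights/hand_over_apple.pt",
    "place_dishes_in_rack": "weights/place_dishes_in_rack.pt",
    "idle": "weights/idle.pt",
}


@dataclass
class Turn:
    """One step of the conversation: a camera image plus transcribed speech."""
    image: bytes
    transcript: str


@dataclass
class MultimodalModel:
    """Stand-in for the image-and-text model; the real one is not public."""
    history: list = field(default_factory=list)

    def respond(self, turn: Turn) -> tuple[str, str]:
        # The model sees the whole conversation history, including past images.
        self.history.append(turn)
        text = turn.transcript.lower()
        # Toy keyword rule in place of real multimodal reasoning.
        if "hungry" in text or "eat" in text:
            return "Sure thing.", "hand_over_apple"
        if "dishes" in text:
            return "Of course.", "place_dishes_in_rack"
        return "I see the table in front of me.", "idle"


def step(model: MultimodalModel, turn: Turn) -> dict:
    """Run one perceive -> reason -> speak/act cycle and report what happened."""
    reply, behavior = model.respond(turn)
    return {
        "speech": reply,                      # would be sent to text-to-speech
        "behavior": behavior,                 # which learned policy to execute
        "weights": POLICY_LIBRARY[behavior],  # would be loaded onto the GPU
    }


model = MultimodalModel()
result = step(model, Turn(image=b"", transcript="I'm hungry."))
print(result["behavior"])
```

The point of the sketch is that a single model call yields both the spoken reply and the choice of behavior, while the behavior itself is carried out by a separate learned policy.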

Connecting Figure 01 to a large pre-trained multi-modal model brings many interesting new features to it.

Now, Figure 01 + OpenAI can:

  • Describe its surroundings.
  • Use common sense reasoning when making decisions. For example, "The items on the table, like that plate and cup, will most likely be placed on the drying rack next."
  • Convert vague high-level instructions, such as "I'm hungry," into situationally appropriate behaviors, such as "Pass that person an apple."
  • Explain in plain English why it performs a specific action. For example, "This is the only edible item I can offer from the table."

The fine manipulation skills that Figure 01 has learned rest on a set of fairly intricate mechanisms.

All behaviors are driven by neural network visuomotor transformer policies, which map image pixels directly to actions. These networks take in images from the robot's onboard cameras at 10 frames per second and generate 24-degree-of-freedom actions (including wrist poses and finger joint angles) 200 times per second.

These actions serve as high-rate "setpoints" that an even higher-rate whole-body controller tracks, ensuring precise execution of movements.
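
To make the two rates concrete: 200 actions per second against 10 images per second means 20 actions are produced for every camera frame, each a vector of 24 values. The sketch below simulates one second of that schedule. How the real policy bridges the two rates is not stated here; the code assumes, purely for illustration, that each inference emits a chunk of 20 setpoints, and `fake_policy` is a hypothetical placeholder that returns zeros.

```python
IMAGE_HZ = 10    # camera frames consumed per second (from the article)
ACTION_HZ = 200  # action setpoints produced per second (from the article)
DOF = 24         # values per action: wrist poses and finger joint angles

ACTIONS_PER_FRAME = ACTION_HZ // IMAGE_HZ  # 20 setpoints per camera frame
SETPOINT_INTERVAL_MS = 1000 / ACTION_HZ    # 5 ms between setpoints


def fake_policy(frame_index: int) -> list[list[float]]:
    """Hypothetical placeholder: one chunk of setpoints for one camera frame."""
    return [[0.0] * DOF for _ in range(ACTIONS_PER_FRAME)]


def simulate(seconds: int) -> list[list[float]]:
    """Collect every setpoint the policy would emit over the given duration."""
    setpoints = []
    for frame_index in range(seconds * IMAGE_HZ):
        setpoints.extend(fake_policy(frame_index))
    return setpoints


setpoints = simulate(1)
print(len(setpoints), "setpoints of", len(setpoints[0]), "values each")
```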

This design achieves effective separation of concerns:

  • Internet-pretrained models perform common-sense reasoning over images and text to produce a high-level plan.
  • The learned visuomotor policies execute this plan, performing fast, reactive behaviors that are difficult to specify manually, such as manipulating a deformable bag in any position.
  • Meanwhile, the whole-body controller is responsible for keeping movements safe and stable, for example by maintaining the robot's balance.
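
As a rough picture of the lowest layer in this split, the sketch below shows a controller that tracks setpoints from a policy while enforcing its own bounds, so an out-of-range request from above never reaches the joints unchecked. It is a toy under stated assumptions, not Figure's controller: `JOINT_LIMIT`, `MAX_STEP` and the two-joint example are hypothetical, and a real whole-body controller would also handle balance and dynamics.

```python
JOINT_LIMIT = 1.0  # hypothetical symmetric joint limit, in radians
MAX_STEP = 0.05    # hypothetical maximum change per control tick, in radians


def clamp(value: float, low: float, high: float) -> float:
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


def track(current: list[float], setpoint: list[float]) -> list[float]:
    """Move each joint toward its setpoint without exceeding the safety bounds."""
    command = []
    for pos, target in zip(current, setpoint):
        target = clamp(target, -JOINT_LIMIT, JOINT_LIMIT)  # respect joint limits
        delta = clamp(target - pos, -MAX_STEP, MAX_STEP)   # limit per-tick motion
        command.append(pos + delta)
    return command


joints = [0.0, 0.0]
for _ in range(30):
    # The first target is deliberately out of range to show the clamping.
    joints = track(joints, [2.0, -0.1])
print(joints)
```

Because the bounds live in the controller rather than in the policy, the upper layers can be retrained or swapped without touching the safety logic.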

Reflecting on the great progress made by Figure 01, Corey Lynch remarked:

Just a few years ago, I would have thought that having a full conversation with a humanoid robot capable of autonomously planning and executing learned behaviors would be decades in the future. Clearly, many things have changed dramatically.

Could this be humanoid robots’ GPT moment?

It has to be said that Figure 01 has been developing with the accelerator pressed to the floor.

In January this year, Figure 01 mastered the skill of making coffee. This achievement was due to the introduction of an end-to-end neural network, allowing the robot to learn and correct errors autonomously, with only 10 hours of training.

A month later, Figure 01 had learned the new skill of lifting boxes and delivering them to a conveyor belt, albeit at only 16.7% of human speed (roughly one sixth).

Figure AI's commercialization has kept pace throughout. It has signed a commercial agreement with BMW Manufacturing Company to bring AI and robotics technology to the automobile production line, with its robots to be deployed at BMW's factory.

Then, just two weeks ago, Figure announced the completion of a $675 million Series B round of financing, with the company's valuation soaring to $2.6 billion.

The investors span almost half of Silicon Valley: Microsoft, OpenAI Startup Fund, NVIDIA, Jeff Bezos, Parkway Venture Capital, Intel Capital, Align Ventures and others.

At that time, OpenAI and Figure also announced that they would jointly develop the next generation humanoid robot AI model. OpenAI's multi-modal model will be extended to robot perception, reasoning and interaction.

Now, in Figure 01, we seem to glimpse a sketch of future life.

Before large models, robots were specialized equipment. With the general capabilities of large models, general-purpose robots are beginning to appear. We now need not only ChatGPT but also WorkGPT.

These developments point to a clearly visible trajectory: once large AI models take root, they will eventually enter the real world, and embodied intelligence is the most promising route.

Nvidia founder Jensen Huang, who has been active on the front line of AI, once said: "Embodied intelligence will lead the next wave of artificial intelligence."

Integrating OpenAI's large model into Figure 01 is likewise a deliberate strategic move.

Mature AI large models act as artificial brains, simulating the complex neural network of the human brain, realizing cognitive functions such as language understanding, visual recognition, and situational reasoning, and solving higher-level cognitive and decision-making problems for robots.

At the same time, various sensors, actuators, and computing units are integrated into the robot body to realize perception and interaction with the environment. For example, vision systems can capture images and videos, and tactile sensors can sense the shape and texture of objects.

Figure AI founder Brett Adcock previously stated in an interview that over the next one to two years, Figure AI will focus on developing landmark products and expects to show the public its humanoid robot research and development results, covering AI systems, low-level control and more, culminating in a robot that can prove its worth in daily life.

On cost, he noted that a humanoid robot has about 1,000 parts and weighs about 150 pounds (68 kg), while an electric car may have about 10,000 parts and weigh 4,000-5,000 pounds (roughly 1,800-2,270 kg). In the long term, therefore, a humanoid robot is expected to cost less than a cheap electric car, depending on the cost of actuators, motor components, sensors and computing.
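
The figures in this comparison can be checked with a few lines of arithmetic. The sketch below only converts the quoted weights and computes the ratios between the two machines; it says nothing about actual prices, which the comparison leaves dependent on component costs.

```python
LB_TO_KG = 0.45359237  # kilograms per pound (exact by definition)

robot_parts, robot_lb = 1_000, 150
car_parts, car_lb_low, car_lb_high = 10_000, 4_000, 5_000

robot_kg = robot_lb * LB_TO_KG
car_kg_low = car_lb_low * LB_TO_KG
car_kg_high = car_lb_high * LB_TO_KG

parts_ratio = car_parts / robot_parts          # how many times more parts a car has
weight_ratio_low = car_lb_low / robot_lb       # lighter car vs. robot
weight_ratio_high = car_lb_high / robot_lb     # heavier car vs. robot

print(f"robot: {robot_kg:.0f} kg, car: {car_kg_low:.0f}-{car_kg_high:.0f} kg")
print(f"parts ratio: {parts_ratio:.0f}x")
print(f"weight ratio: {weight_ratio_low:.1f}x to {weight_ratio_high:.1f}x")
```

By these numbers the car has ten times as many parts and is roughly 27 to 33 times heavier, which is the basis for the expectation that the robot could eventually be the cheaper product.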

Robotics expert Eric Jang once observed: "Although many AI researchers believe that it will take decades for general-purpose robots to become popular, don't forget that ChatGPT was born almost overnight."

One year ago today, OpenAI released GPT-4, proving to the world the power of large models.

Today, one year later, GPT-5 has not arrived, but Figure 01 has. Will this be the GPT-4 moment for humanoid robots?
