Does Gemini, the most powerful model in Google’s history, really “crush” GPT-4?

Late last night, Google abruptly released its blockbuster AI trump card: Gemini.

Multimodal Gemini can understand, manipulate and combine different types of information, including text, code, audio, images and video.

Less than two weeks after ChatGPT's release last year, Google had already sounded an internal "red alert" to meet the challenge. But Bard, rushed out in response, fumbled a fact in its debut demo, wiping roughly US$100 billion off Google's market value overnight.

Over the past year, chatbots built on large models have drawn more than 2 billion monthly visits, with ChatGPT far in the lead. Google Bard technically ranks second, but its share is small enough that it is better lumped in with "other" alongside several competing products.

▲ Picture from: The Information

Gemini has therefore long carried high hopes of catching up with ChatGPT. Succeed or fail, it is the fruit of Google's all-out, year-long push on large AI models.

Able to see, speak and reason

Gemini 1.0 officially comes in three sizes; think coffee cups: medium, large, and extra large.

Medium: Gemini Nano – the most efficient model, for on-device tasks
Large: Gemini Pro – the best model for scaling across a wide range of tasks
Extra Large: Gemini Ultra – the largest and most capable model, for highly complex tasks

Setting aside the parameter details for now, a few examples give a good overall sense of what Gemini can do.

Sketch a duck freehand and Gemini tracks it stroke by stroke, from the first curve to the finished outline. Add a wavy line beneath it and Gemini gets the idea, correctly identifying the scene as a duck swimming on water.

It can also do a passable duck call, human-style, and even render the quack in fluent Mandarin.

If you are bored, you can play a game with Gemini: point your finger at a spot on a map, and it will tell you about that country and the things it is known for.

Next, the classic cups-and-ball shuffle ("three immortals return to the cave"): guess which cup the paper ball is under. No matter how fast your hands move, nothing escapes Gemini's "eyes."

Handed a ball of yarn with no clue what to make? Don't worry: the moment Gemini sees the yarn, it has a finished project in mind for you; all you need to do is copy along.

Recognizing images is just Gemini's baseline. Show it musical instruments and it can also generate music that matches the mood of the scene.

Logic puzzles, image-sequence analysis, interpreting magic tricks, memory tests: Gemini handles all of them, and handles them well.

Google has also published a text walkthrough of the demo. If you would rather not watch the video, you can read it at https://developers.googleblog.com/2023/12/how-its-made-gemini-multimodal-prompting.html.

Perhaps because the video is so striking, some netizens have questioned whether it is "faked." Gemini will soon be open to the public in Google AI Studio, though, and everyone will be able to judge its authenticity for themselves.

Multimodal Gemini vs. GPT-4

According to Google, from natural image, audio, and video understanding to mathematical reasoning, Gemini Ultra exceeded current state-of-the-art results on 30 of the 32 academic benchmarks widely used in large language model (LLM) research and development.

Judging from the results Google released, Gemini all but sweeps OpenAI's GPT-4 across text, general reasoning, mathematics, and coding.

MMLU (Massive Multitask Language Understanding) is one of the most popular tests of an AI model's knowledge and problem-solving ability. Gemini Ultra scored 90.0%, becoming the first model to surpass human experts on this benchmark; GPT-4, by comparison, scored 86.4%.
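For a sense of what that 90.0% means mechanically, here is a minimal sketch of how an MMLU-style multiple-choice evaluation is scored. The `ask_model` helper and the sample item are hypothetical stand-ins, not Google's actual evaluation harness.

```python
# Minimal sketch of MMLU-style scoring: the model picks one of four
# choices per question, and accuracy is the fraction answered correctly.
# `ask_model` and the sample item are hypothetical stand-ins.

QUESTIONS = [
    {
        "question": "Which planet is closest to the Sun?",
        "choices": ["A. Venus", "B. Mercury", "C. Mars", "D. Earth"],
        "answer": "B",
    },
    # ... the real benchmark has ~14,000 questions across 57 subjects
]

def ask_model(question: str, choices: list[str]) -> str:
    """Placeholder: a real harness prompts the model and parses out A/B/C/D."""
    return "B"

def accuracy(items: list[dict]) -> float:
    correct = sum(
        ask_model(it["question"], it["choices"]) == it["answer"] for it in items
    )
    return correct / len(items)

print(f"{accuracy(QUESTIONS):.1%}")  # Gemini Ultra reportedly reaches 90.0%
```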

The newer MMMU benchmark covers multimodal tasks across many disciplines and is a tougher test of large multimodal models; Gemini Ultra still posted a leading score of 59.4%.

In an interview with MIT Technology Review, Google CEO Sundar Pichai said that one important reason Gemini stands out is that it is multimodal from the ground up: like a person, it learns not only from text but also from video, audio, and code.

That native multimodality is the capability Gemini has spent the most time polishing. Gemini 1.0 recognizes and understands text, images, audio, and other inputs simultaneously, giving it a stronger grasp of information and letting it answer questions on complex topics with ease. On the multimodal benchmarks, its image, video, and audio scores are again well ahead of the prior state of the art.
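To make "natively multimodal" concrete, here is a minimal sketch of sending an image and a text question in a single request, using the `google-generativeai` Python SDK that Google shipped alongside Gemini. The API key and image path are placeholders, and model names may vary by release.

```python
# Minimal sketch: one request carrying both an image and a text question,
# via Google's `google-generativeai` SDK (pip install google-generativeai).
# API key and image path are placeholders.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")

model = genai.GenerativeModel("gemini-pro-vision")  # image+text variant
duck = Image.open("duck_sketch.png")

# Text and image travel together; the model reasons over both at once.
response = model.generate_content(
    ["What animal is this, and what is it doing?", duck]
)
print(response.text)
```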

Coding is one of the key yardsticks for large models. Gemini 1.0's ability to work across languages and reason about complex information is a strength: it can understand and generate high-quality code in Python, Java, C++, and more. Two years ago, Google launched AlphaCode, the first AI code-generation system to reach a competitive level in programming contests.

Now AlphaCode has a second generation: AlphaCode 2, a competitive-programming model fine-tuned from Gemini. Evaluated on the same platform as the original, AlphaCode 2 outperformed 87% of human competitors, versus roughly 46% for its predecessor.

AlphaCode 2 technical report: https://storage.googleapis.com/deepmind-media/AlphaCode2/AlphaCode2_Tech_Report.pdf

In the report, Google DeepMind, which built AlphaCode 2, shares extensive detail on its inference-time search, filtering, and re-ranking system. Jim Fan, a senior scientist at NVIDIA, hailed the results as Google's "Q*" (shorthand, roughly, for a major AI breakthrough).
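At a high level, that recipe is: sample many candidate programs, filter them against the problem's example tests, then re-rank the survivors and submit the best few. Below is a minimal sketch of the loop; every helper here is a hypothetical stand-in, not DeepMind's implementation.

```python
# Minimal sketch of AlphaCode-style inference-time search: sample many
# candidate programs, filter on the problem's example tests, then re-rank
# and keep the best survivors. All helpers are hypothetical stand-ins.
import random

def generate_candidate(problem: str) -> str:
    """Placeholder for sampling one program from the code model."""
    return f"# candidate solution {random.randint(0, 1_000_000)}"

def passes_examples(program: str, problem: str) -> bool:
    """Placeholder: run the program on the problem's example tests."""
    return random.random() < 0.05  # in practice, most samples fail

def score(program: str, problem: str) -> float:
    """Placeholder for a learned re-ranking model's score."""
    return random.random()

def solve(problem: str, n_samples: int = 1000, n_submissions: int = 10):
    candidates = (generate_candidate(problem) for _ in range(n_samples))
    survivors = [c for c in candidates if passes_examples(c, problem)]
    survivors.sort(key=lambda c: score(c, problem), reverse=True)
    return survivors[:n_submissions]  # best-ranked programs to submit

print(solve("Example competitive-programming problem"))
```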

thehiredai CEO Arman made a bold prediction: "Gemini AI just killed ChatGPT!"

It is worth mentioning that Google also announced its most powerful, efficient, and scalable TPU system to date: Cloud TPU v5p.

▲ Cloud TPU v5p

Gemini 1.0 was trained on Google's AI-optimized infrastructure built around its in-house Tensor Processing Units (TPU) v4 and v5e.

Google Cloud CEO Thomas Kurian did not hold back in praising the product: "Cloud TPU v5p is our most powerful and scalable TPU accelerator to date, training models 2.8 times faster than its predecessor."

A new player in on-device models

Phones are a key channel for new technology to reach the mainstream. If Gemini is to enter everyday life at scale, the Pixel 8 is its natural first stop.

Billed as the first phone with AI built in, the Pixel 8 Pro has already earned a solid reputation for bringing high-end technology to everyday users, and early owner feedback suggests Google has integrated AI well into its on-device apps.

Building on that, Google officially announced that Gemini Nano, the on-device model of the trio, will run on the Pixel 8 Pro starting today.

As soon as the news broke, PassionateGenius CTO Morimoto couldn't wait to try running a large model on the Pixel 8.

As the first smartphone engineered for Gemini Nano, the Pixel 8 Pro gets two exclusive features in upcoming updates: Recorder summaries and Gboard Smart Reply.

Even without a network connection, the Recorder app can summarize recorded conversations, interviews, presentations, and more. Capable phone hardware underpins the feature, and an optimized on-device model keeps it working entirely offline.

Smart Reply resembles the canned auto-responses we already know, but instead of fixed text, Gemini Nano reads the incoming message and drafts replies tailored to it, in language natural and friendly enough to pass for a celebrity's social media team answering fans in the moment.

Both features currently support English only, which admittedly matters little to those of us who cannot buy Google phones in the first place. Pixel 8 Pro owners in non-English-speaking countries, however, will have to wait a while longer.

On productivity features, the Pixel across the ocean has finally caught up to what is baseline on phones in China.

AI photo and video editing has been a signature of Google's phones since launch, and continued refinement now effectively puts a "professional editor" inside the device.

A new clean-up feature removes smudges, stains, and creases from scanned documents; a few swipes in the photo album and the blemishes are gone.

Leveraging the Google Tensor G3, the Pixel 8 Pro's video enhancement can adjust color, lighting, stabilization, and grain, with the processing done in the cloud.

In Google's official before-and-after comparison, the enhanced video looks as though a "vivid" filter had been applied: fuller color and stronger contrast between light and dark. The effect of the AI processing is most obvious in dim night scenes.

Photo touch-up is probably what more people are waiting for than video editing. Shots of moving subjects in particular tend to come out blurry and leave you with regrets when you browse them later; the upgraded AI editing in Google Photos can now remove that blur.

From now on, you can capture your pet's best moments without fretting over missed focus.

Google has also improved how its devices work together. The Pixel Watch can now serve as another way to unlock your phone, help you dismiss unwanted calls, or show who is calling and why before you answer.

If the Pixel 8 Pro is available where you live, or you already own a Google phone, it is worth checking whether these new features give you a reason to buy in or stay.

Starting today, Bard runs on the newly upgraded Gemini Pro, gaining more advanced reasoning, planning, and understanding. It is available in English in more than 170 countries and regions.

In an interview with MIT Technology Review, Sundar Pichai also said: "Gemini Pro performed very well in benchmark tests, and I can personally feel its advantages now that it is integrated into Bard. We've been testing it and seeing significant improvements across all categories of tasks; we're calling it one of our biggest upgrades yet."

▲ Bard now runs on Gemini Pro. Picture from X user @gijigae

Over the next few months, Gemini will gradually roll out across more Google products and services, including Search, Ads, Chrome, and Duet AI.

Starting December 13, developers and enterprise customers can access Gemini Pro through the Gemini API in Google AI Studio or Google Cloud Vertex AI.
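For the Google AI Studio route, a first call looks roughly like this minimal sketch with the `google-generativeai` Python SDK; the API key is a placeholder for one generated in AI Studio.

```python
# Minimal sketch of calling Gemini Pro through the Gemini API using the
# `google-generativeai` SDK. The API key is a placeholder from AI Studio.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

model = genai.GenerativeModel("gemini-pro")
response = model.generate_content(
    "Summarize in two sentences why native multimodality matters."
)
print(response.text)
```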

Gemini Ultra is currently in internal testing, with a rollout to developers and enterprise users planned for early next year. Around the same time, Google will also launch Bard Advanced, giving ordinary users access to its most capable model.

Google CEO Sundar Pichai said when launching Gemini:

Every technological shift is an opportunity to advance scientific discovery, accelerate human progress, and improve lives.
I believe the transformation we are seeing now related to AI will be the most profound in our lifetimes, far greater than the transformations of mobile or the web that came before it.

Achieving AGI (artificial general intelligence) requires an AI that can handle complex tasks across domains and modalities as composedly as a human. Beyond fundamentals like computation and reasoning, its multimodal capabilities across text, images, and video have to keep pace.

DeepMind has proposed a framework for AGI evaluation and classification. The first two stages are:

AGI-0: Basic artificial intelligence that displays intelligence within specific domains and tasks, such as image recognition or natural language processing, but cannot learn or reason across domains and modalities, cannot communicate and collaborate naturally and effectively with humans or other AIs, and cannot perceive or express emotions and values.

AGI-1: Elementary general artificial intelligence that displays intelligence across multiple domains and tasks, such as question answering, summarization, translation, and dialogue; it can learn and reason across domains and modalities, manage basic communication and collaboration with humans and other AIs, and perceive and express simple emotions and values.

Gemini's demo video shows a deep grasp of interaction across modalities: it can see, speak, reason, and even perceive and express simple emotions and values. In it, we glimpse the potential of AGI-1.

This article was co-written by Li Chaofan, Xiao Fanbo, and Mo Chongyu
