ModelBest (Mianbi Intelligence) closes a new funding round worth hundreds of millions of yuan and releases the second generation of MiniCPM, its high-performance "little steel cannon"

The inspirational story of small beginnings leading to big results plays out not only in entrepreneurship, but also in on-device large models.

In February this year, ModelBest officially released MiniCPM, its flagship 2B-parameter on-device large model. It not only surpassed the benchmark model from Mistral AI (the company dubbed "the European OpenAI"), but also led Google's Gemma at the 2B level overall, and even outperformed far larger 7B- and 13B-class models such as Llama 2-13B.

Recently, ModelBest also completed a new funding round worth several hundred million yuan, led by Chunhua Venture Capital and Huawei Hubble, with participation from the Beijing Artificial Intelligence Industry Investment Fund and others. Zhihu, a strategic shareholder, continued to invest its support. The funds are committed to accelerating efficient large-model training and rapid application deployment.

Today, pressing its advantage, the on-device MiniCPM "little steel cannon" returns with a second release: a four-model salvo under the theme "small but strong, small but complete."

Among them, the multimodal model MiniCPM-V 2.0 significantly strengthens its OCR capability, setting a new best OCR result among open-source models; on general-scene text it is comparable to Gemini Pro and surpasses the entire 13B class.

On Object HalBench, a benchmark that evaluates hallucination in large models, MiniCPM-V 2.0 performs almost on par with GPT-4V.

On the OpenCompass leaderboard, which aggregates 11 mainstream evaluation benchmarks, MiniCPM-V 2.0 scores 55.0 in general multimodal capability, surpassing larger models such as Qwen-VL-Chat-10B, CogVLM-Chat-17B, and Yi-VL-34B.

In an official demonstration, when asked to describe the same picture in detail, GPT-4V produced six hallucinations in its response, while MiniCPM-V 2.0 produced only three.

In addition, the MiniCPM-V 2.0 team has begun an in-depth collaboration with Tsinghua University to jointly study a treasure of the Tsinghua University Museum: the Tsinghua Bamboo Slips.

Thanks to its powerful multimodal recognition and reasoning capabilities, MiniCPM-V 2.0 handles the ancient characters with ease, from simple glyphs to highly complex ones.

In head-to-head comparisons with comparable Chinese multimodal large models, MiniCPM-V 2.0's recognition accuracy is far ahead.

Recognizing fine details places high demands on image resolution, yet conventional multimodal large models can usually handle only small images of 448×448 pixels. Once an image is compressed to fit, information is lost and the model can no longer read it.

This is no problem for MiniCPM-V 2.0, however. In an official demonstration on an ordinary urban street scene, it captured the key information at a glance, easily picking out a "FamilyMart" sign that is barely visible to the naked eye.

Long, scrolling images contain rich textual information that multimodal models often fail to recognize, but MiniCPM-V 2.0 firmly grasps their key content.

From 448×448 pixels to 1.8-megapixel high-definition images, and even extreme 1:9 aspect ratios (448×4032), MiniCPM-V 2.0 achieves lossless recognition.

Reportedly, the efficient encoding of high-definition images in MiniCPM-V 2.0 is powered by its proprietary LLaVA-UHD technique:

  • Modular visual encoding: the native-resolution image is divided into variable-sized slices, fully adapting to the original resolution without pixel padding or image distortion.
  • Visual compression module: a shared perceiver resampler layer compresses the visual tokens of each image slice, so the token count stays affordable at any resolution and the computational cost is lower.
  • Spatial schema: simple natural-language symbols tell the model the relative positions of the image slices.
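As a rough illustration of the first two steps, the toy sketch below shows how a native-resolution image might be split into a best-fitting slice grid, and how a fixed per-slice token budget keeps cost bounded by slice count rather than pixel count. All names and numbers here (the nine-slice cap, 64 tokens per slice) are illustrative assumptions, not details disclosed in the article:

```python
import math

def choose_grid(width, height, slice_size=448, max_slices=9):
    """Pick a rows x cols slice grid whose slice aspect ratio best
    matches the image, so slices need no padding or distortion.
    (Sketch of LLaVA-UHD-style modular slicing; the real heuristics
    live in the paper and its implementation.)"""
    ideal = (width * height) / (slice_size ** 2)
    n = max(1, min(max_slices, round(ideal)))
    best = None
    for cols in range(1, n + 1):
        if n % cols:
            continue
        rows = n // cols
        # distance between slice aspect ratio and 1:1, on a log scale
        err = abs(math.log((width / cols) / (height / rows)))
        if best is None or err < best[0]:
            best = (err, rows, cols)
    return best[1], best[2]

def token_budget(rows, cols, tokens_per_slice=64):
    """With a shared resampler compressing every slice to a fixed
    number of visual tokens, total tokens grow with slice count,
    not with raw resolution (+1 slice for a global overview image)."""
    return (rows * cols + 1) * tokens_per_slice
```

For the article's extreme 1:9 example (448×4032), this heuristic picks a 9×1 grid of undistorted 448×448 slices, and the whole image still costs only a few hundred visual tokens.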

In Chinese OCR, MiniCPM-V 2.0 also clearly surpasses GPT-4V. Where GPT-4V comes up empty-handed, MiniCPM-V 2.0's accurate recognition is all the more valuable.

Behind this capability is cross-modal, cross-language generalization technology, which works around the shortage of high-quality, large-scale multimodal data in Chinese.

The ability to process long text has long been an important yardstick for measuring a model.

Although 128K long-context capability is nothing new in itself, for a model of only 2B parameters like MiniCPM-2B-128K it is definitely worth applauding.

MiniCPM-2B-128K, the smallest 128K long-text model, extends the original 4K context window to 128K, surpassing a number of 7B models, such as Yarn-Mistral-7B-128K, on the InfiniteBench leaderboard.
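The article does not say how the context window was stretched from 4K to 128K. One widely used family of techniques is RoPE position interpolation, sketched below purely as background; the 32× scale is just 128K/4K, not a confirmed MiniCPM detail:

```python
def rope_inv_freq(dim, base=10000.0):
    """Inverse frequencies of rotary position embeddings (RoPE),
    one per pair of hidden dimensions."""
    return [base ** (-2.0 * i / dim) for i in range(dim // 2)]

def rotary_angle(position, inv_freq_i, scale=1.0):
    """Linear position interpolation: dividing positions by `scale`
    maps a long context back into the position range the model saw
    during pre-training (e.g. scale = 128K / 4K = 32)."""
    return (position / scale) * inv_freq_i
```

With scale=32, token position 131072 produces the same rotary angle that position 4096 did in the original 4K window, so attention stays within the distribution the model was trained on.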

The newly released MiniCPM-MoE-8x2B introduces a Mixture-of-Experts (MoE) architecture, improving performance by an average of 4.5% and surpassing the entire 7B class as well as larger models such as Llama 2-34B, while its inference cost is only 69.7% that of Gemma-7B.
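As background on why MoE cuts inference cost, here is a toy sketch of sparse top-k routing: a gate scores every token against all experts, but only the top two actually run, so compute scales with 2 experts rather than 8. The gate and the tiny linear "experts" below are stand-ins I made up, not MiniCPM-MoE-8x2B internals:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_forward(token, gate_weights, experts, top_k=2):
    """Sparse MoE layer: score all experts, run only the top_k,
    and mix their outputs by renormalized gate probabilities."""
    # gate logits: one dot product per expert
    logits = [sum(w * x for w, x in zip(row, token)) for row in gate_weights]
    probs = softmax(logits)
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    out = [0.0] * len(token)
    for i in chosen:  # only top_k experts are ever evaluated
        y = experts[i](token)
        out = [o + (probs[i] / norm) * yi for o, yi in zip(out, y)]
    return out, chosen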

MiniCPM-1.2B proves that "small" and "powerful" are not mutually exclusive.

Although its parameter count is roughly halved, MiniCPM-1.2B still retains 87% of the comprehensive performance of the previous-generation 2.4B model. On multiple public benchmark leaderboards, the 1.2B model punches well above its weight, with overall results exceeding Qwen-1.8B, Llama 2-7B, and even Llama 2-13B.

In a screen-recorded demonstration of the MiniCPM-1.2B model on an iPhone 15, inference speed is up 38%, reaching 25 tokens per second, roughly 15 to 25 times human speaking speed. Meanwhile, memory usage drops by 51.9% and cost by 60%: the deployed model is smaller, yet its usage scenarios multiply.

While the industry chases ever-larger parameter counts, ModelBest has chosen a distinctive technical path: building models that are as small as possible and as strong as possible.

The outstanding performance of the MiniCPM "little steel cannon" fully demonstrates that "small" and "strong", "small" and "complete" are not mutually exclusive attributes but can coexist. We look forward to more models like it.


Ai Faner | Original link