1. Background information
1.1 Why are large models important?
As a large language model (LLM), OpenAI's GPT (Generative Pre-trained Transformer) demonstrates the machine intelligence closest to that of human beings today. GPT's defining feature is that, trained with trillions of parameters on the text data of essentially the entire Internet, the model exhibits an emergence of intelligence. In physics, emergence generally refers to stable patterns that recur within chaotic phenomena, and it is among the most challenging categories in understanding complex natural phenomena. From recent interviews with OpenAI CEO Sam Altman and chief scientist Ilya Sutskever, we can glimpse how they tame a language model whose intelligence has emerged, and how they make it serve people stably and safely (alignment). OpenAI has not fully mastered effective debugging methods; to some extent, GPT is still a black box.
We need to ask why a language model can change the world. ChatGPT is indeed more obedient and better at conversation, but what is so great about an intelligent text-interaction tool? There are two reasons. First, artificial intelligence has multiple modalities, and research in the different modalities penetrates and competes with the others; the best-performing modality will be the first to define the trajectory of future AI applications. ChatGPT shows that text-modality intelligence will dominate the near-term development of AI. Second, text is critically important as the entrance to human society.
On the first point, multimodality: artificial intelligence using images as its modality has been developed for many years. Computer vision, which has repeatedly scored great achievements in image recognition and autonomous driving, is another entrance to artificial intelligence. The past decade has seen an explosion of papers at the three top conferences, CVPR/ICCV/ECCV; even discounting the low-quality ones, it is the epitome of the boom in image intelligence. The rise of OpenAI has upended the prominent position of image intelligence in AI applications, and even its development trajectory: when Meta released Segment Anything (a vision model that segments the different objects in a picture), the model showed zero-shot transfer power similar to what GPT shows in the text modality, and some people exclaimed that traditional CV is dead (an exaggeration).
ChatGPT's influence on the text modality needs no elaboration: it goes beyond mere scientific value and redefines the intelligence and commercial potential of text. OpenAI's product DALL·E also provides image intelligence beyond the text modality. The open-source Stable Diffusion and the closed-source Midjourney, the overlords of text-to-image generation, have likewise prompted countless "death of the creative industry" warnings. All in all, AI in the text and image modalities penetrates each other and pushes the boundary of intelligence through mutual competition.
– OpenAI's text + image modalities: ChatGPT + DALL·E
– Stable Diffusion Web UI (image modality): stable-diffusion
– Midjourney (image modality): Midjourney
On the second point, the text modality is the entrance to human society. See the views of Yuval Noah Harari (author of "Sapiens: A Brief History of Humankind") in an exclusive interview with The Economist: he argues that language is the operating system of human society, that artificial intelligence has hacked into this system, and that by changing language, the operating system itself, AI will thoroughly change human history. Because large models intrude on the language system, their scheduling of human behavior and of social feedback will have an enormous impact.
For more information, please refer to: yuval-noah-harari-argues-that-ai-has-hacked-the-operating-system-of-human-civilisation
1.2 Cost of LLM
How much does it cost to train a truly large model?
First, setting images and videos aside, a large model requires at least the text data of the entire Internet; tens of thousands of A100s just to get started; electricity for compute becomes a cost that cannot be ignored; and the cost of trial and error is hard to control, with training time measured in months on top of labor costs. The methods for training and precisely fine-tuning the model are unknown or unpublished, and the large model remains a black box. Together these reasons mean that only a handful of companies on the planet can own a large model, because doing so requires extremely deep pockets and extremely high risk tolerance. Junior players cannot participate, and large companies lacking the spirit of adventure are not fit to own one either.
In a recent interview, Elon Musk estimated that training a GPT-5-level model might take 30,000 to 50,000 H100 chips, the latest technical architecture, and top AI researchers (OpenAI, for reference, has roughly 200+). Musk also gave a starting cost for a large model that is a useful reference against the recent valuations of large-model startups: US$250 million.
At Tencent's 2023 shareholders' meeting, CEO Ma Huateng responded to questions about ChatGPT and AI: "At first we thought (artificial intelligence) was a once-in-a-decade Internet opportunity, but the more we thought about it, the more it felt like a once-in-a-century opportunity, similar to the industrial revolution that brought electricity." Ma Huateng said that Internet companies have accumulated a great deal in AI and that Tencent is also immersed in R&D, but is in no hurry to finish early and show off half-finished products: "For an industrial revolution, bringing out the light bulb a month earlier does not matter much in the long run. The key is to do solid work on the underlying algorithms, computing power and data, and more importantly on the implementation of scenarios; at present (we) are still thinking things through. I feel many companies are too hasty now, as if it were to boost the stock price; that is not our style."
To sum up: there is no need to rush to boost the stock price, and no need to rush to innovate, for the road ahead is long. Large models are not just new applications; they are the revolution itself.
I have some opinions of my own. The difference between bringing out the light bulb a month late and a month early is whether you end up as Edison or as the unknown second person to invent the light bulb. Still, despite the amazing capabilities of large models, taming and improving them remains an arduous challenge. We are at the stage of having just built an airplane: to fly safely and stably, we still need many hard lessons from failure to learn where the red lines are. OpenAI has launched Plugins, a potential route to productization, but their commercial performance is so far unclear, and the "App Store moment" that Plugins were expected to trigger has yet to materialize; how to turn GPT into a commercially valuable product is still unknown. Over the years the "Goose Factory" (Tencent's nickname) has been a master of second-mover advantage, with micro-innovation as its trump card; playing to those strengths is not necessarily unreasonable.
2. GPT causes changes in the human-computer interaction layer (HCI/UI)
User Interface, abbreviated UI, is the user interaction interface. Today everyone lives in an ocean of UIs. Many Internet practitioners believe UI ≈ web + app design; this understanding greatly narrows what UI means. A more professional term for UI is HCI, the Human-Computer Interface. Over nearly a century of development, people have designed several generations of distinctive UIs matched to the computing power and intelligence of the machines of their time. We are now in the transition from GUI to NLI.
- PCI: Punched Card Interface
- CLI: Command Line Interface
- GUI: Graphical User Interface
- NLI: Natural Language Interface
- BCI: Brain-Computer Interface
2.1 PCI, Punched Card Interface
Above: A stack of punched cards holding a program.
Below: U.S. clerks in 1950 making punch cards containing a section of U.S. Census data.
2.2 CLI, Command Line Interface
As programming languages were further abstracted and display devices appeared, the command line became the computer's most important interactive interface. CLI operation is efficient and powerful.
2.3 GUI, Graphical User Interface
The GUI, the graphical user interface that Jobs "stole" from Xerox, started the personal computer revolution.
This interface layer was extremely influential, and the world's first killer application, the spreadsheet VisiCalc, arrived in the same personal-computer wave (on the Apple II, not the Macintosh); it was an ancestor of modern spreadsheets such as Excel.
To this day, the Mac's beautiful, smooth UI remains one of its most attractive product features.
2.4 NLI, Natural Language Interface
1. Text to Text https://openai.com/chatgpt
2. Text to Image https://openai.com/dall-e-2
3. Text to Video
Runway: Advancing creativity with artificial intelligence.
One sentence is enough: "Generate a beautiful living room concept render."
4. Text to Action
Adept's goal is to build an all-round intelligent assistant through software automation; in the future, natural language will be the only form of interaction an Adept user needs.
2.5 BCI, Brain Computer Interface
Thought to Action: from human thinking to machine behavior. Neuralink, heavily publicized last year, let a monkey play the game Pong with its thoughts, and people can already use brain-computer interfaces to control simple games and mechanical prostheses. At this stage, the more meaningful brain-computer products mainly help disabled people control prosthetics and regain the ability to live independently. For today's brain-computer technology, it is still a little early to talk about a revolutionary human-computer interface.
- The concept of UI needs to be broadened
Communication between machines and humans requires a layer of interactive medium, which governs the boundaries of input and output in human-computer interaction. The medium filters and converts human input, making that strange human input safe and recognizable to the machine; likewise, the results the machine returns are filtered and converted by the medium, making them safe, usable and valuable to humans.
This layer of interactive media connecting humans and machines is the definition of UI.
In the Internet revolution of the past two decades, the GUI standardized everything a person might want to tell the machine into a limited set of operations: buttons, dragging, scroll wheels, pinch-to-zoom, multi-finger gestures, shaking, flipping, hardware keys, and so on. This standardized input is understood by the machine and returned as standardized output. The PC and mobile Internet revolutions equated UI with GUI, but UI is in fact far richer than the GUI's existing interaction methods.
The emergence of GPT directly shattered this balance. For products, the most important consequence of machines becoming smarter is that the computer's fault tolerance for natural language has improved enormously. It no longer needs a filter that accepts only very limited input in order to understand people; it can take the natural language we speak every day, mixed as it is with all kinds of logic, hints, sarcasm and mistakes. This improved tolerance for natural language will inevitably disrupt the current interaction layer in which GUI stands for UI:
1. A great transformation in user experience (UX): users shift from the old primary interactions of clicking, sliding and dragging with finger and mouse to interaction through natural language.
2. Will the current GUI disappear? No, for two reasons. First, while models are not yet accurate enough and AI productization is immature, the GUI's polished look and experience still attract users, and interacting with finger and mouse often costs far less effort than natural language. Second, look at the earlier stages of UI development: is the black command line obsolete? No. When the interactive interface of a previous era is more efficient for a task, that form of interaction survives even if its learning threshold is high.
3. The command line (CLI) is still the most efficient way to operate a computer in depth. In the smart future, someone needing deep control of an application may say "open your GUI," just as programmers today say "open your terminal."
4. Human-computer interfaces evolve toward shallower machine control and lower thresholds of use, and the change large models are about to trigger is no different. The trend is visible in the sequence CLI → GUI → NLI → BCI: each step gives up some depth of control over the computer while lowering the threshold for users.
5. The best computer engineers cannot be replaced, thanks to their deep understanding of machines and deep operating ability; but perhaps only the best will survive.
The diagram below shows more clearly why GPT will cause huge changes in product UI: machine languages of the past were harsh, with extremely low fault tolerance. A single punctuation error in a programming language can render an entire program inoperable. The most important magic of large models is to dramatically raise the machine's tolerance for human natural language (NL). In sum, the future natural-language interface starts from the text input box and aims at multimodal, highly dynamic interaction.
Usage threshold (the closer to humans, the lower the threshold): command line CLI > graphical interface GUI > natural language NLI > brain-computer interface BCI
Control efficiency (the further from the machine, the lower the efficiency): command line CLI > graphical interface GUI > natural language NLI > brain-computer interface BCI
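The fault-tolerance gap is easy to demonstrate with a minimal Python sketch (the JSON string and its field names are invented for illustration): one stray comma makes the input unreadable to a strict parser, while a human reader, or a language model, recovers the intent instantly.

```python
import json

# A trailing comma: trivial to a human reader, fatal to the parser.
strict_input = '{"name": "demo", "count": 3,}'

try:
    json.loads(strict_input)
except json.JSONDecodeError:
    print("rejected: the machine interface tolerates zero deviation")

# An LLM-backed interface can instead recover the obvious intent.
recovered = json.loads(strict_input.replace(",}", "}"))
print(recovered)  # {'name': 'demo', 'count': 3}
```

The same rigidity holds for every classic machine interface, from compilers to shells; the natural-language interface is the first one that absorbs deviation instead of rejecting it.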
-The evolution of NLI
·Start: text input box
·Development: a multimodal input box — voice, image, video
·Goal: input of multimodal text, voice, images and video => return of useful text, voice, images and video, plus useful software behavior.
Consider the modalities Vision Pro can provide: three-dimensional interaction, gestures, gravity, rotation, voice and text, static images, and real-time video.
·Future: talking to humans is only the starting point for large models' understanding of the world. An LLM as the brain, cameras as eyes, and robotic arms as limbs is a new interface for AI to interact with the physical world.
-Timberter, a "timber counting" application based on visual algorithms, has been around for many years. What would happen if we added a large model's reasoning capabilities and a robotic arm that can do the carrying?
-A robot controlled by voice? OpenAI GPT-4 + Whisper as the voice interface
3. AI Ecosystem
3.1 Forbes AI 50
In recent years Forbes has selected the year's 50 most promising AI companies. Unlike previous years, this year's list is not limited to North America: the 50 most promising and most highly valued companies were chosen from more than 800 companies worldwide, from the United States, Canada, Israel, the United Kingdom and Japan.
Below is the complete list I compiled, including OpenAI, Jasper, Hugging Face, Adept… all the AI startups you are familiar with. Interested readers can go to the Forbes website and read it for themselves; I won't expand on it here. Forbes AI 50
3.2 More AI Startups
Usage scenarios concentrate on the consumer side: generative text, audio, image and video + search + automated copilots. B-side applications are mostly integration-based and sit in specific industries: law, medicine and health, academic research (biology, physics, mathematics), and intelligent analytics. There is also AI infrastructure: vector databases, large AI models, AI security, DevOps, and automated copilots.
The picture below lists more AI-generation companies (March 2023, from the perspective of American VCs). Interested readers can explore them on their own.
4. Integration of large models and products
4.1 Integration costs
Integration cost here is not just the development cost of building AI into a product; it also includes the learning and time costs users incur to complete work of the same quality as before using the AI application. An AI application is valuable only when its integration cost is significantly less than the original cost (development and operations cost + user cost).
Integration cost = AI product development cost + the user's cost of using the AI application to complete work of the same quality as before
Two examples illustrate the significance of integration cost.
Positive case: AIGC generates filler assets/materials for game design.
Game design and development involves labor-intensive work: preparing filler assets, NPC dialogue, style variations, edge scenes… This work does not demand much originality, yet its time cost cannot be significantly reduced.
If AI tools generate such non-critical materials and experienced designers make the final adjustments, results of the same quality as before are completely achievable.
Integration cost of AI-generated non-critical materials <<< preparation cost of traditional materials
AIGC deserves to be promoted in the context of game assets.
Negative case: a one-stop AI solution for generating high-end advertisements.
Although AI advertising solutions appear to cut the cost of the text, images and video in the traditional ad-creation process, a truly attractive, high-end advertisement usually requires an enormous amount of customized creation and secondary modification.
So when real users (advertising agencies using AI creation, or advertisers who want to cut out the agencies) make secondary adjustments to AI-generated ad content, reaching the quality of past advertising takes great effort; the adjustment cost is far higher than with traditional methods, and in most cases the same quality cannot be reached at all.
AI integration cost of high-end advertising >>> original cost of traditional advertising.
One-stop AI for high-end customized advertising creation is not realistic today.
Of course, as model performance improves, and even as market methods and preferences change (precision marketing, customized preferences), today's negative case may become a positive one, and the positive case may turn negative.
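The decision rule running through both examples can be written down as a tiny sketch. All numbers below are hypothetical effort units invented for illustration, not real data.

```python
def worth_integrating(ai_dev_cost: float, ai_user_cost: float,
                      original_cost: float) -> bool:
    """AI integration pays off only when the integration cost
    (development cost + the user's effort to reach equal quality)
    is below the original cost."""
    return ai_dev_cost + ai_user_cost < original_cost

# Hypothetical effort units:
print(worth_integrating(10, 5, 100))    # game filler assets -> True
print(worth_integrating(10, 300, 100))  # high-end ad creative -> False
```

As the section notes, the inputs drift over time: a better model shrinks `ai_user_cost`, which is exactly how a negative case flips positive.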
4.2 Integration methods
Large models will drive two types of products. A brand-new product built around AI capabilities can be called AI-native. The other type adds AI features to traditional software for intelligent transformation, and can be called AI-upgrade. The two approaches lead to different product UIs.
The most important factors in integrating AI products:
1. Performance of AI model
2. The cost of developing and operating AI products + the cost of users using AI applications (integration costs).
5. AI tool information
The composition of an AI application: infrastructure (Infra) + middleware + application (these layers can overlap).
5.1 Applications
– Search engines: New Bing, Google Bard
– Chat Q&A: ChatGPT, Jasper, various smart chat applications
– Text-to-image: Midjourney, Stable Diffusion
– Text-to-video: Runway
More tool references (domestic): AI toolbox | AI tool collection | AI website navigation
5.2 Middleware
A large model is a foundation model: it has the broadest knowledge and strong generalization, but its accuracy in precise scenarios is insufficient, and this is the main challenge in applying large models. The significance of middleware is to organize the knowledge of professional scenarios, extend the model's knowledge base, improve AI accuracy, and finally expose convenient, usable interfaces to upper-layer applications. Because natural-language interfaces are very cheap, much middleware directly ships an application interface plus a chat window, as the early AgentGPT did.
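A minimal sketch of what such middleware does, with a toy bag-of-words "embedding" standing in for a real embedding model (the knowledge snippets, function names, and prompt template are all invented for illustration):

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; real middleware would call an embedding model.
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Domain knowledge the base model may answer imprecisely about.
knowledge_base = [
    "Milvus is an open-source vector database from Zilliz.",
    "MLC-LLM compiles large models so they run on local devices.",
]

def build_prompt(question: str) -> str:
    # Retrieve the most relevant snippet and prepend it as context for the LLM.
    best = max(knowledge_base, key=lambda doc: cosine(embed(question), embed(doc)))
    return f"Context: {best}\nQuestion: {question}\nAnswer:"

print(build_prompt("What is Milvus?"))
```

The middleware's value lies in this glue: the upper-layer application sends plain questions, and the middleware decides which professional knowledge to staple onto the prompt before the model ever sees it.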
More middleware tools:
There are many similar middlewares, so I won’t list them all.
5.3 Infrastructure (Infra)
Large models: OpenAI GPT, Google Bard, Anthropic, Wenxin Yiyan (ERNIE Bot), Baichuan Intelligence…
Open source model:
Stable Diffusion (image modality): https://github.com/AUTOMATIC1111/stable-diffusion-webui
See the vector databases recommended by OpenAI: https://platform.openai.com/docs/guides/embeddings/how-can-i-retrieve-k-nearest-embedding-vectors-quickly
Zilliz's open-source product Milvus: https://github.com/milvus-io/milvus
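The core operation behind these vector databases, k-nearest-neighbour search over embeddings, fits in a few lines. A brute-force sketch with random stand-in vectors (a real system like Milvus scales this with approximate indexes such as IVF or HNSW):

```python
import numpy as np

def top_k(query: np.ndarray, index: np.ndarray, k: int = 3) -> np.ndarray:
    # Cosine similarity = dot product of L2-normalized vectors.
    q = query / np.linalg.norm(query)
    m = index / np.linalg.norm(index, axis=1, keepdims=True)
    return np.argsort(-(m @ q))[:k]  # indices of the k most similar rows

rng = np.random.default_rng(0)
index = rng.normal(size=(1000, 64))             # stand-in for stored embeddings
query = index[42] + 0.01 * rng.normal(size=64)  # near-duplicate of row 42

print(top_k(query, index)[0])  # row 42 ranks first
```

Brute force is O(n·d) per query, fine for thousands of vectors; the dedicated databases exist because real knowledge bases hold millions.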
– Compilation & DevOps
Running large models on local and low-end devices is an obstacle to distributing AI capability.
MLC-LLM (Machine Learning Compilation for LLMs) is a compilation toolchain for ML that enables large models to run locally. https://mlc.ai/mlc-llm/
The usage experience: install the mlc-chat-cli-nightly tool through conda in the local environment, download a model from Hugging Face, and run the large model locally on a Mac for Q&A:
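A sketch of those steps as shell commands. Package, repository and command names reflect MLC-LLM as of mid-2023 and may have changed; the model weights repo is an example, so check https://mlc.ai/mlc-llm/ for the current interface.

```shell
# Install the nightly chat CLI into a fresh conda environment.
conda create -n mlc-chat python
conda activate mlc-chat
conda install -c mlc-ai -c conda-forge mlc-chat-cli-nightly

# Fetch prebuilt model weights from Hugging Face (example repo).
git lfs install
git clone https://huggingface.co/mlc-ai/demo-vicuna-v1-7b-int3 dist/vicuna-v1-7b

# Chat with the model entirely on the local machine.
mlc_chat_cli --local-id vicuna-v1-7b
```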
Run large models locally on mobile devices:
Large models: for closed-source models such as GPT, the challenges are scenario adaptation, data security, accuracy, debugging efficiency, prompt engineering, and integration with engineering interfaces. Self-deployed open-source models are, strictly speaking, not that large; their challenges are speed, performance, and benchmark results.
Middleware: connects the model layer and the application layer, provides knowledge plug-ins in specific fields; expands application scenarios and quickly provides application interfaces; reduces development and operation and maintenance costs.
Application layer: tolerance for model performance in the usage scenario; benefit = usage value – integration cost; hazards to handle: AI hallucination, AI safety.
The user interface is the strong glue between people and computers, and product design happens on this interface. The revolution GPT has triggered will have a huge impact on product UI. This article is my research and compilation of AI-related information over the past few months. The evolutionary route must account for imperfect model performance: start from text interaction, expand into rich and diverse multimodality, and use new interactive experiences to meet needs both ancient and brand-new.
The core business question of the AI revolution will always be, What's That Interface?
Let me end with words attributed to John Lennon: "Everything will be OK in the end. If it's not OK, it's not the end."