
Li Zhifei: Eight observations on GPT-4, the multi-modal large model competition begins

青灯夜游 | 2023-03-31

GPT-4 outperforms previous models on standardized tests and other benchmarks, works across dozens of languages, and accepts images as input, meaning it can understand the intention and logic of a photo or diagram in the context of a chat.

Ever since Microsoft released its own multi-modal model Kosmos-1 in early March, it has also been testing and tuning OpenAI's multi-modal model to make it work better with Microsoft's own products.

Sure enough, riding on the release of GPT-4, Microsoft officially showed its hand: the new Bing is already running on GPT-4.


The language model behind ChatGPT is GPT-3.5. Explaining how GPT-4 improves on the previous version, OpenAI said that although the two can seem similar in casual conversation, "the difference comes out when the complexity of the task reaches a sufficient threshold": GPT-4 is more reliable, more creative, and able to handle much more nuanced instructions.

The king is crowned? Eight observations about GPT-4

1. Stunning again, better than humans

If the GPT-3 series proved to everyone that AI can handle many different tasks within a single model and pointed out a path toward AGI, then GPT-4 has reached human level on many tasks and even performs better than humans. On many professional and academic exams, GPT-4 surpasses 90% of human test takers; in a simulated bar exam, for example, its score falls in the top 10%. How should primary and secondary schools, universities, and professional education respond to this?

2. "Scientific" Alchemy

Although OpenAI did not announce specific parameter counts this time, you can guess that GPT-4 is not small, and a larger model means higher training costs. At the same time, training a model is a lot like "refining an elixir": it requires many experiments, and if every experiment were run at full scale, few could bear the cost.

To this end, OpenAI ingeniously developed so-called "predictable scaling": in short, using roughly one ten-thousandth of the compute to predict the outcome of each experiment (loss and human eval). In this way, large-scale "lucky" alchemy is upgraded to "semi-scientific" alchemy.
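The report does not disclose the exact fitting procedure, but the idea can be illustrated with a minimal sketch: fit a power law with an irreducible-loss term (the functional form OpenAI describes) on a handful of cheap small-scale runs, then extrapolate the loss at the full training budget. All numbers below are made up for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical results from small-scale runs (all numbers are made up).
# Compute is normalised to the smallest run to keep the fit numerically stable.
compute = np.array([1.0, 1e1, 1e2, 1e3, 1e4])       # relative compute
loss    = np.array([3.70, 3.12, 2.70, 2.41, 2.20])  # final training loss

# Power law plus an irreducible-loss term: L(C) = a * C^b + irreducible.
def scaling_law(c, a, b, irreducible):
    return a * np.power(c, b) + irreducible

params, _ = curve_fit(scaling_law, compute, loss, p0=(2.0, -0.1, 1.5))
a, b, irreducible = params

# Extrapolate ~10,000x beyond the largest small run, i.e. the full training budget.
full_budget = 1e8
print(f"fit: a={a:.2f}, b={b:.3f}, irreducible={irreducible:.2f}")
print(f"predicted final loss at full scale: {scaling_law(full_budget, *params):.2f}")
```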

3. Crowdsourcing evaluation, killing two birds with one stone

This time OpenAI also open-sourced OpenAI Evals in a very "smart" way: through crowdsourcing, developers and enthusiasts are invited to use Evals to test the models, which also binds them into the developer ecosystem. This not only gives everyone a sense of participation, but also gets people to help evaluate and improve the system for free, while OpenAI collects test cases and feedback directly. Two birds with one stone.
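What such a crowdsourced eval boils down to is simple: a set of prompts with ideal answers plus a scoring rule. The sketch below is a minimal, self-contained illustration of that idea; it does not use the real OpenAI Evals framework, whose registry format and API differ.

```python
# Minimal illustration of an exact-match eval: contributors submit samples,
# and the harness scores any model against them. Not the real OpenAI Evals API.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Sample:
    prompt: str
    ideal: str

def run_eval(samples: List[Sample], model: Callable[[str], str]) -> float:
    """Return the fraction of samples where the model's answer matches the ideal."""
    correct = sum(model(s.prompt).strip() == s.ideal.strip() for s in samples)
    return correct / len(samples)

# Contributors submit samples like these; the maintainer gets a growing test suite for free.
samples = [
    Sample("What is 17 * 24? Answer with just the number.", "408"),
    Sample("Name the capital of Australia. Answer with one word.", "Canberra"),
]

if __name__ == "__main__":
    dummy_model = lambda prompt: "408"  # stand-in for a real model call
    print(f"accuracy: {run_eval(samples, dummy_model):.2f}")
```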


4. Plugging leaks with engineering

This time a System Card was also released: an open "patching" document for finding vulnerabilities and reducing the "nonsense" (hallucination) problem of language models. Various pre-processing and post-processing patches have been applied to the system, and the code will reportedly be opened up later so that patching can be crowdsourced, letting everyone help OpenAI in the future. This marks the point where the LLM finally moves from an elegant, simple next-token-prediction task into a pile of messy engineering hacks.
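OpenAI has not released the actual pre/post-processing code, so the following is only a conceptual sketch of what such a patch looks like: a policy check wrapped around the model call. The is_flagged classifier is a hypothetical placeholder, not OpenAI's real moderation pipeline.

```python
# Conceptual sketch of a pre/post-processing "patch" around a model call.
REFUSAL = "Sorry, I can't help with that."

def is_flagged(text: str) -> bool:
    """Hypothetical policy classifier: return True if the text violates policy."""
    banned = ("how to make a weapon", "credit card numbers")
    return any(phrase in text.lower() for phrase in banned)

def guarded_generate(prompt: str, model) -> str:
    # Pre-processing patch: refuse clearly disallowed requests before generation.
    if is_flagged(prompt):
        return REFUSAL
    completion = model(prompt)
    # Post-processing patch: scan the output and suppress it if something slips through.
    if is_flagged(completion):
        return REFUSAL
    return completion
```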

5. Multi-modal

Ever since Microsoft Germany revealed last week that GPT-4 would be multi-modal, public anticipation has been high.

The long-rumored multi-modality of GPT-4, hyped as "comparable to the human brain," is actually not that different from the multi-modal capabilities described in many current papers. The main difference is that it combines the few-shot ability of the text model with chain-of-thought (CoT) reasoning. The premise is a text LLM with strong basic capabilities; combine that with multi-modality and you get good results.
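One way to picture this combination is a single prompt that interleaves a few worked examples (few-shot), an explicit step-by-step instruction (chain of thought), and the image itself. The message layout and call_model in the sketch below are illustrative placeholders, not GPT-4's actual API.

```python
# Illustrative prompt structure for few-shot + chain-of-thought over an image.
few_shot_examples = [
    {"question": "The chart shows sales doubling each year from 10 in 2020. Sales in 2022?",
     "reasoning": "2020: 10. 2021: 10 * 2 = 20. 2022: 20 * 2 = 40.",
     "answer": "40"},
]

def build_messages(image_bytes: bytes, question: str) -> list:
    messages = [{"role": "system",
                 "content": "Answer questions about the image. Think step by step."}]
    for ex in few_shot_examples:  # few-shot: show the model worked examples
        messages.append({"role": "user", "content": ex["question"]})
        messages.append({"role": "assistant",
                         "content": f"{ex['reasoning']} Answer: {ex['answer']}"})
    # chain of thought plus the actual image question, interleaved in one turn
    messages.append({"role": "user",
                     "content": [{"type": "image", "data": image_bytes},
                                 {"type": "text", "text": question}]})
    return messages

# answer = call_model(build_messages(open("chart.png", "rb").read(),
#                                    "What is the logical flaw in this diagram?"))
```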


6. Dropping the "king bomb" according to plan

According to OpenAI's GPT-4 demo video, GPT-4 finished training as early as August last year but was only released now; the intervening time was spent on extensive testing, bug fixing, and, most importantly, on preventing the generation of dangerous content.

While everyone was still marveling at ChatGPT's generation capabilities, OpenAI had already wrapped up GPT-4. This round, Google's engineers will probably have to stay up late catching up again.

7. OpenAI is no longer Open

The public paper mentions no parameter counts or data sizes (rumors circulating online put GPT-4 at 100 trillion parameters) and gives no technical details. The stated explanation is that it is for the public good: the fear that once everyone learns how to build a GPT-4, someone will use it for evil and set off something uncontrollable. I personally do not buy this at all; it reads like the idiom "no 300 taels of silver buried here", a denial that gives the game away.

8. Concentrate your efforts on big things

Besides the various displays of skill, the paper devotes three pages to listing everyone who contributed to the different systems behind GPT-4, roughly estimated at more than a hundred people, which once again reflects the unity and high degree of collaboration inside OpenAI. By comparison, are other companies' teams falling a bit behind in their ability to pull together?

Multi-modal large models have now become the trend and an important direction for the development of large AI models as a whole. In this large-model AI "arms race", technology giants such as Google, Microsoft, and DeepMind are actively launching multi-modal large models (MLLM) or large language models (LLM).

A new round of the arms race begins: multi-modal large models


Microsoft: Kosmos-1

Microsoft released Kosmos-1, a multi-modal model with 1.6 billion parameters, in early March. Its network structure is a Transformer-based causal language model, with the Transformer decoder serving as a universal interface for multi-modal input.
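A minimal PyTorch-flavored sketch of that "decoder as universal interface" idea: project vision-encoder features into the same embedding space as text tokens and feed one interleaved sequence to a causal Transformer. The dimensions and modules are illustrative, not Kosmos-1's actual code.

```python
import torch
import torch.nn as nn

# Illustrative sizes only.
d_model, vocab_size = 768, 32000

text_embed = nn.Embedding(vocab_size, d_model)
image_proj = nn.Linear(1024, d_model)      # maps vision-encoder features into token space
decoder_layer = nn.TransformerEncoderLayer(d_model, nhead=12, batch_first=True)
decoder = nn.TransformerEncoder(decoder_layer, num_layers=12)
lm_head = nn.Linear(d_model, vocab_size)

text_ids = torch.randint(0, vocab_size, (1, 16))   # e.g. "<s> An image of"
image_feats = torch.randn(1, 49, 1024)             # patch features from a vision encoder

# Interleave text-token embeddings and projected image embeddings in one sequence.
seq = torch.cat([text_embed(text_ids), image_proj(image_feats)], dim=1)
causal_mask = nn.Transformer.generate_square_subsequent_mask(seq.size(1))

hidden = decoder(seq, mask=causal_mask)
next_token_logits = lm_head(hidden[:, -1])         # predict the next token as usual
```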

In addition to various natural-language tasks, Kosmos-1 can natively handle a wide range of perception-intensive tasks, such as visual dialogue, visual explanation, visual question answering, image captioning, simple mathematical equations, OCR, and zero-shot image classification with descriptions.


Google: PaLM-E

In early March, a research team from Google and the Technical University of Berlin launched the largest vision-language model to date, PaLM-E, with up to 562 billion parameters (PaLM-540B + ViT-22B).

PaLM-E is a large decoder-only model that generates text completions autoregressively given a prefix or prompt. By adding an encoder, it can map images or sensor data to a sequence of vectors of the same size as the language token embeddings and feed these in as input for next-token prediction, trained end to end.

DeepMind: Flamingo

DeepMind launched the Flamingo vision-language model in April last year. The model takes images, videos, and text as prompts and outputs the corresponding language; only a small number of specific examples are needed to solve many problems, without additional training.

The model is trained on interleaved image (or video) and text inputs, giving it few-shot multi-modal sequence-reasoning capabilities and enabling tasks such as text-description completion and VQA/Text-VQA.

Multi-modal large models are already showing broader application possibilities. Beyond relatively mature text-to-image generation, a large number of applications such as human-computer interaction, robot control, image search, and speech generation have emerged one after another.

Taken together, GPT-4 is not yet AGI, but multi-modal large models are already a clear and definite development direction. Building unified, cross-scenario, multi-task multi-modal foundation models will become one of the mainstream trends in the development of artificial intelligence.

As Hugo said, "When science reaches its final stage, it meets imagination." The future of multi-modal large models may well be beyond human imagination.


Statement: This article is reproduced from 51cto.com. If there is any infringement, please contact admin@php.cn for deletion.