
Li Zhifei: Eight observations on GPT-4, the multi-modal large model competition begins

青灯夜游 | 2023-03-31

GPT-4 outperforms previous models on standardized tests and other benchmarks, works across dozens of languages, and accepts images as input, meaning it can understand the intention and logic of a photo or diagram in the context of a chat.

Ever since Microsoft released its own multi-modal model Kosmos-1 in early March, it has also been testing and tuning OpenAI's multi-modal model to make it work better with Microsoft's own products.

Sure enough, riding on the release of GPT-4, Microsoft officially showed its hand: the new Bing is already running on GPT-4.


The language model behind ChatGPT is GPT-3.5. Explaining how GPT-4 improves on the previous version, OpenAI said that although the two can seem similar in casual conversation, "the difference comes out when the complexity of the task reaches a sufficient threshold": GPT-4 is more reliable, more creative, and able to handle much more nuanced instructions.

The king is crowned? Eight observations about GPT-4

1. Stunning again, better than humans

If the GPT-3 series proved to everyone that AI can handle many different tasks within a single model and pointed out a path toward AGI, then GPT-4 has reached human level on many tasks and even performs better than humans. On many professional and academic exams, GPT-4 surpasses 90% of human test takers; in a simulated bar exam, for example, its score falls in the top 10%. How should primary and secondary schools, universities, and professional education respond to this?

2. "Scientific" Alchemy

Although OpenAI did not announce specific parameter counts this time, you can guess that GPT-4 is not small, and a larger model means higher training costs. At the same time, training a model is a lot like "refining an elixir": it requires many experiments, and if every experiment were run at full scale, few could bear the cost.

To this end, OpenAI ingeniously developed so-called "predictable scaling": in short, using roughly one ten-thousandth of the compute to predict the outcome of each experiment (loss and human eval). In this way, large-scale "lucky" alchemy is upgraded to "semi-scientific" alchemy.
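The report does not disclose the exact fitting procedure, but the idea can be illustrated with a minimal sketch: fit a power law with an irreducible-loss term (the functional form OpenAI describes) on a handful of cheap small-scale runs, then extrapolate the loss at the full training budget. All numbers below are made up for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical results from small-scale runs (all numbers are made up).
# Compute is normalised to the smallest run to keep the fit numerically stable.
compute = np.array([1.0, 1e1, 1e2, 1e3, 1e4])       # relative compute
loss    = np.array([3.70, 3.12, 2.70, 2.41, 2.20])  # final training loss

# Power law plus an irreducible-loss term: L(C) = a * C^b + irreducible.
def scaling_law(c, a, b, irreducible):
    return a * np.power(c, b) + irreducible

params, _ = curve_fit(scaling_law, compute, loss, p0=(2.0, -0.1, 1.5))
a, b, irreducible = params

# Extrapolate ~10,000x beyond the largest small run, i.e. the full training budget.
full_budget = 1e8
print(f"fit: a={a:.2f}, b={b:.3f}, irreducible={irreducible:.2f}")
print(f"predicted final loss at full scale: {scaling_law(full_budget, *params):.2f}")
```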

3. Crowdsourcing evaluation, killing two birds with one stone

This time OpenAI also open-sourced OpenAI Evals in a very "smart" way: through crowdsourcing, developers and enthusiasts are invited to use Evals to test the models, which also binds them into the developer ecosystem. This not only gives everyone a sense of participation, but also gets people to help evaluate and improve the system for free, while OpenAI collects test cases and feedback directly. Two birds with one stone.
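What such a crowdsourced eval boils down to is simple: a set of prompts with ideal answers plus a scoring rule. The sketch below is a minimal, self-contained illustration of that idea; it does not use the real OpenAI Evals framework, whose registry format and API differ.

```python
# Minimal illustration of an exact-match eval: contributors submit samples,
# and the harness scores any model against them. Not the real OpenAI Evals API.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Sample:
    prompt: str
    ideal: str

def run_eval(samples: List[Sample], model: Callable[[str], str]) -> float:
    """Return the fraction of samples where the model's answer matches the ideal."""
    correct = sum(model(s.prompt).strip() == s.ideal.strip() for s in samples)
    return correct / len(samples)

# Contributors submit samples like these; the maintainer gets a growing test suite for free.
samples = [
    Sample("What is 17 * 24? Answer with just the number.", "408"),
    Sample("Name the capital of Australia. Answer with one word.", "Canberra"),
]

if __name__ == "__main__":
    dummy_model = lambda prompt: "408"  # stand-in for a real model call
    print(f"accuracy: {run_eval(samples, dummy_model):.2f}")
```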


4. Plugging leaks with engineering

This time a System Card was also released: an open "patching" document for finding vulnerabilities and reducing the "nonsense" (hallucination) problem of language models. Various pre-processing and post-processing patches have been applied to the system, and the code will reportedly be opened up later so that patching can be crowdsourced, letting everyone help OpenAI in the future. This marks the point where the LLM finally moves from an elegant, simple next-token-prediction task into a pile of messy engineering hacks.
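OpenAI has not released the actual pre/post-processing code, so the following is only a conceptual sketch of what such a patch looks like: a policy check wrapped around the model call. The is_flagged classifier is a hypothetical placeholder, not OpenAI's real moderation pipeline.

```python
# Conceptual sketch of a pre/post-processing "patch" around a model call.
REFUSAL = "Sorry, I can't help with that."

def is_flagged(text: str) -> bool:
    """Hypothetical policy classifier: return True if the text violates policy."""
    banned = ("how to make a weapon", "credit card numbers")
    return any(phrase in text.lower() for phrase in banned)

def guarded_generate(prompt: str, model) -> str:
    # Pre-processing patch: refuse clearly disallowed requests before generation.
    if is_flagged(prompt):
        return REFUSAL
    completion = model(prompt)
    # Post-processing patch: scan the output and suppress it if something slips through.
    if is_flagged(completion):
        return REFUSAL
    return completion
```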

5. Multi-modal

Ever since Microsoft Germany revealed last week that GPT-4 would be multi-modal, public anticipation has been high.

The long-rumored multi-modality of GPT-4, hyped as "comparable to the human brain," is actually not that different from the multi-modal capabilities described in many current papers. The main difference is that it combines the few-shot ability of the text model with chain-of-thought (CoT) reasoning. The premise is a text LLM with strong basic capabilities; combine that with multi-modality and you get good results.
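One way to picture this combination is a single prompt that interleaves a few worked examples (few-shot), an explicit step-by-step instruction (chain of thought), and the image itself. The message layout and call_model in the sketch below are illustrative placeholders, not GPT-4's actual API.

```python
# Illustrative prompt structure for few-shot + chain-of-thought over an image.
few_shot_examples = [
    {"question": "The chart shows sales doubling each year from 10 in 2020. Sales in 2022?",
     "reasoning": "2020: 10. 2021: 10 * 2 = 20. 2022: 20 * 2 = 40.",
     "answer": "40"},
]

def build_messages(image_bytes: bytes, question: str) -> list:
    messages = [{"role": "system",
                 "content": "Answer questions about the image. Think step by step."}]
    for ex in few_shot_examples:  # few-shot: show the model worked examples
        messages.append({"role": "user", "content": ex["question"]})
        messages.append({"role": "assistant",
                         "content": f"{ex['reasoning']} Answer: {ex['answer']}"})
    # chain of thought plus the actual image question, interleaved in one turn
    messages.append({"role": "user",
                     "content": [{"type": "image", "data": image_bytes},
                                 {"type": "text", "text": question}]})
    return messages

# answer = call_model(build_messages(open("chart.png", "rb").read(),
#                                    "What is the logical flaw in this diagram?"))
```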


6. Dropping the "king bomb" according to plan

According to OpenAI's GPT-4 demo video, GPT-4 finished training as early as August last year but was only released now; the intervening time was spent on extensive testing, bug fixing, and, most importantly, on preventing the generation of dangerous content.

While everyone was still marveling at ChatGPT's generation capabilities, OpenAI had already wrapped up GPT-4. This round, Google's engineers will probably have to stay up late catching up again.

7. OpenAI is no longer Open

The public paper mentions no parameter counts or data sizes (rumors circulating online put GPT-4 at 100 trillion parameters) and gives no technical details. The stated explanation is that it is for the public good: the fear that once everyone learns how to build a GPT-4, someone will use it for evil and set off something uncontrollable. I personally do not buy this at all; it reads like the idiom "no 300 taels of silver buried here", a denial that gives the game away.

8. Concentrate your efforts on big things

Besides the various displays of skill, the paper devotes three pages to listing everyone who contributed to the different systems behind GPT-4, roughly estimated at more than a hundred people, which once again reflects the unity and high degree of collaboration inside OpenAI. By comparison, are other companies' teams falling a bit behind in their ability to pull together?

Multi-modal large models have now become the trend and an important direction for the development of large AI models as a whole. In this large-model AI "arms race", technology giants such as Google, Microsoft, and DeepMind are actively launching multi-modal large models (MLLM) or large language models (LLM).

A new round of the arms race begins: multi-modal large models


Microsoft: Kosmos-1

Microsoft released Kosmos-1, a multi-modal model with 1.6 billion parameters, in early March. Its network structure is a Transformer-based causal language model, with the Transformer decoder serving as a universal interface for multi-modal input.
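A minimal PyTorch-flavored sketch of that "decoder as universal interface" idea: project vision-encoder features into the same embedding space as text tokens and feed one interleaved sequence to a causal Transformer. The dimensions and modules are illustrative, not Kosmos-1's actual code.

```python
import torch
import torch.nn as nn

# Illustrative sizes only.
d_model, vocab_size = 768, 32000

text_embed = nn.Embedding(vocab_size, d_model)
image_proj = nn.Linear(1024, d_model)      # maps vision-encoder features into token space
decoder_layer = nn.TransformerEncoderLayer(d_model, nhead=12, batch_first=True)
decoder = nn.TransformerEncoder(decoder_layer, num_layers=12)
lm_head = nn.Linear(d_model, vocab_size)

text_ids = torch.randint(0, vocab_size, (1, 16))   # e.g. "<s> An image of"
image_feats = torch.randn(1, 49, 1024)             # patch features from a vision encoder

# Interleave text-token embeddings and projected image embeddings in one sequence.
seq = torch.cat([text_embed(text_ids), image_proj(image_feats)], dim=1)
causal_mask = nn.Transformer.generate_square_subsequent_mask(seq.size(1))

hidden = decoder(seq, mask=causal_mask)
next_token_logits = lm_head(hidden[:, -1])         # predict the next token as usual
```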

In addition to various natural-language tasks, Kosmos-1 can natively handle a wide range of perception-intensive tasks, such as visual dialogue, visual explanation, visual question answering, image captioning, simple mathematical equations, OCR, and zero-shot image classification with descriptions.


Google: PaLM-E

In early March, a research team from Google and the Technical University of Berlin launched the largest vision-language model to date, PaLM-E, with up to 562 billion parameters (PaLM-540B + ViT-22B).

PaLM-E is a large decoder-only model that generates text completions autoregressively given a prefix or prompt. By adding an encoder, it can map images or sensor data to a sequence of vectors of the same size as the language token embeddings and feed these in as input for next-token prediction, trained end to end.

DeepMind: Flamingo

DeepMind launched the Flamingo vision-language model in April last year. The model takes images, videos, and text as prompts and outputs the corresponding language; only a small number of specific examples are needed to solve many problems, without additional training.

The model is trained on interleaved image (or video) and text inputs, giving it few-shot multi-modal sequence-reasoning capabilities and enabling tasks such as text-description completion and VQA/Text-VQA.

Multi-modal large models are already showing broader application possibilities. Beyond relatively mature text-to-image generation, a large number of applications such as human-computer interaction, robot control, image search, and speech generation have emerged one after another.

Taken together, GPT-4 is not yet AGI, but multi-modal large models are already a clear and definite development direction. Building unified, cross-scenario, multi-task multi-modal foundation models will become one of the mainstream trends in the development of artificial intelligence.

As Hugo said, "When science reaches its final stage, it meets imagination." The future of multi-modal large models may well be beyond human imagination.


Statement: This article is reproduced from 51cto.com. If there is any infringement, please contact admin@php.cn for deletion.