Lu Zhiwu, Researcher at Renmin University of China: The Important Impact of ChatGPT on Multimodal Generative Models

The following is a transcript of Professor Lu Zhiwu's speech at the ChatGPT and Large Model Technology Conference held by Heart of the Machine, edited and organized by Heart of the Machine without changing the original meaning:


Hello everyone, I am Lu Zhiwu from Renmin University of China. The title of my report today is "Important Lessons from ChatGPT for Multimodal Generative Models", and it consists of four parts.


First, ChatGPT offers some lessons about innovation in research paradigms. The first is the use of "big models and big data", a research paradigm that has been validated repeatedly and is also the basic research paradigm behind ChatGPT. It is particularly worth emphasizing that only when a model reaches a certain scale does it exhibit emergent capabilities, such as in-context learning and chain-of-thought (CoT) reasoning. These capabilities are quite amazing.
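To make these two capabilities concrete, here is a minimal illustration; the prompts below are invented examples, not from the talk:

```python
# Illustrative prompts only; the tasks and numbers are made up.

# In-context learning: the model infers the task from a few examples
# in the prompt alone, with no gradient updates.
icl_prompt = """Translate English to French.
sea otter -> loutre de mer
cheese -> fromage
plush giraffe ->"""

# Chain-of-thought (CoT) reasoning: the prompt asks the model to write
# out intermediate steps before the final answer.
cot_prompt = """Q: Roger has 5 tennis balls. He buys 2 cans with 3 balls each.
How many tennis balls does he have now?
A: Roger starts with 5 balls. 2 cans of 3 balls is 6 balls. 5 + 6 = 11.
The answer is 11."""

print(icl_prompt, "\n")
print(cot_prompt)
```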


The second point is to insist on "large model reasoning". This is the point that impressed me most about ChatGPT, because in machine learning and artificial intelligence, reasoning is recognized as the hardest problem, and ChatGPT has made a breakthrough here. Of course, ChatGPT's reasoning ability may come mainly from training on code, but whether there is a necessary connection is not yet certain. On reasoning, we should put more effort into figuring out where the ability comes from, and into whether other training methods can strengthen it further.


The third point is that large models must be aligned with humans. This is the important lesson ChatGPT offers from an engineering or deployment perspective. If a model is not aligned with humans, it will generate a great deal of harmful information and become unusable. This third point does not raise the upper limit of a model's capability, but a model's reliability and safety are indeed very important.


The advent of ChatGPT has had a great impact on many fields, and on me personally. Having worked on multimodality for several years, I began to reflect on why we had not built such a powerful model.

ChatGPT is a general-purpose generative model for language and text, so let us look at the latest progress in general-purpose multimodal generation. Multimodal pre-trained models have begun to evolve into general multimodal generative models, and there have been some preliminary explorations. First, consider the Flamingo model proposed by DeepMind in 2022. The figure below shows its architecture.

[Figure: Flamingo model architecture]

The main body of the Flamingo architecture is the decoder of a large language model, shown as the blue modules on the right of the figure above. Adapter layers are inserted between the blue modules, and a Vision Encoder plus a Perceiver Resampler are added on the visual side on the left. The overall design encodes visual inputs, converts them through the adapters, and aligns them with language, so that the model can automatically generate text descriptions for images.

What are the benefits of such a design? First, the blue modules in the figure, including the language model decoder, are frozen, while the parameter count of the pink modules is controllable, so the number of parameters Flamingo actually trains is very small. So do not assume that general multimodal generative models are hard to build; the situation is not that pessimistic. The trained Flamingo model can perform many common text-generation tasks with multimodal input, such as video description, visual question answering, and multimodal dialogue. From this perspective, Flamingo can be regarded as a general generative model.
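As a rough illustration of this frozen-backbone design, here is a minimal PyTorch sketch with toy dimensions and stand-in modules; the gated cross-attention adapter follows the general Flamingo idea, but none of this is the actual implementation:

```python
import torch
import torch.nn as nn

D = 512  # hidden size (toy value)

class GatedCrossAttentionAdapter(nn.Module):
    """Adapter inserted between frozen LM blocks: text attends to vision."""
    def __init__(self, dim):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # tanh gate, starts closed

    def forward(self, text, vision):
        attended, _ = self.attn(text, vision, vision)
        return text + torch.tanh(self.gate) * attended

vision_encoder = nn.Linear(768, D)  # stand-in for the frozen vision encoder
lm_block = nn.TransformerEncoderLayer(D, nhead=8, batch_first=True)  # "blue"
adapter = GatedCrossAttentionAdapter(D)                              # "pink"

# Freeze the blue parts; only the adapter receives gradients.
for module in (vision_encoder, lm_block):
    for p in module.parameters():
        p.requires_grad = False

img_feats = vision_encoder(torch.randn(1, 64, 768))  # frozen visual path
text_h = torch.randn(1, 16, D)                       # token hidden states
text_h = adapter(text_h, img_feats)                  # trainable fusion
text_h = lm_block(text_h)                            # frozen LM block

trainable = sum(p.numel() for p in adapter.parameters())
print(f"trainable adapter parameters: {trainable}")
```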

The second example is the recently released BLIP-2 model, which improves on BLIP-1. Its architecture is very similar to Flamingo: the image encoder and the large language model decoder are both frozen, and a Q-Former is added in the middle to act as a converter from vision to language. So the only part of BLIP-2 that really needs training is the Q-Former.

As shown in the figure below, a picture is first fed into the Image Encoder. The text in the middle is the user's question or instruction; after Q-Former encoding, everything is passed into the large language model, which finally generates the answer. That is roughly the generation process.

[Figure: BLIP-2 model architecture]
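A rough sketch of this data flow, with stand-in modules and made-up sizes (the real Q-Former is a full transformer with learned queries; here it is reduced to one cross-attention step):

```python
import torch
import torch.nn as nn

D_IMG, D_LM, N_QUERY = 768, 512, 32  # toy sizes

image_encoder = nn.Linear(1024, D_IMG)  # stand-in, frozen in BLIP-2
qformer_queries = nn.Parameter(torch.randn(1, N_QUERY, D_IMG))  # learned
qformer_attn = nn.MultiheadAttention(D_IMG, num_heads=8, batch_first=True)
to_lm = nn.Linear(D_IMG, D_LM)  # projection into the LM's embedding space
llm_block = nn.TransformerEncoderLayer(D_LM, nhead=8, batch_first=True)

# Freeze the image encoder and the LLM; only the bridge is trainable.
for module in (image_encoder, llm_block):
    for p in module.parameters():
        p.requires_grad = False

img_feats = image_encoder(torch.randn(1, 256, 1024))  # one image's patches

# The learned queries cross-attend to image features: vision -> language.
vis_tokens, _ = qformer_attn(qformer_queries, img_feats, img_feats)
vis_tokens = to_lm(vis_tokens)

# Prepend visual tokens to the embedded user question; the frozen LLM
# then generates the answer from the combined sequence.
question_emb = torch.randn(1, 12, D_LM)
hidden = llm_block(torch.cat([vis_tokens, question_emb], dim=1))
print(hidden.shape)  # torch.Size([1, 44, 512])
```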

The shortcomings of these two models are obvious: because they appeared relatively early, or only just appeared, they do not consider the engineering techniques used by ChatGPT. At the very least, there is no instruction fine-tuning for image-text dialogue or multimodal dialogue, so their overall generation quality is unsatisfactory.

The third example is Kosmos-1, recently released by Microsoft. Its structure is very simple, and it is trained only on image-text pairs, so its multimodal data is relatively limited. The biggest difference from the two models above is that their large language models are frozen, whereas the large language model in Kosmos-1 is itself trained. As a result, Kosmos-1 has only 1.6 billion parameters, and a 1.6-billion-parameter model may not have emergent abilities. Kosmos-1 also does not consider instruction fine-tuning for image-text dialogue, which causes it to sometimes talk nonsense.
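The practical difference between the two regimes can be summarized in a few lines; the model and sizes below are toys, chosen only to show where the trainable parameters sit:

```python
import torch.nn as nn

# Toy LM block; sizes are made up.
lm = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)

def trainable_params(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

# Frozen-LM regime (Flamingo / BLIP-2 style): the LM contributes nothing
# to the training budget.
for p in lm.parameters():
    p.requires_grad = False
print("frozen LM, trainable:", trainable_params(lm))      # 0

# End-to-end regime (Kosmos-1 style): every LM parameter is trained,
# which is one reason to keep the model small.
for p in lm.parameters():
    p.requires_grad = True
print("end-to-end LM, trainable:", trainable_params(lm))  # all parameters
```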


The next example is Google's multimodal embodied visual language model PaLM-E. PaLM-E is similar to the first three examples: it also combines a ViT vision encoder with the PaLM large language model. Its biggest breakthrough is that it finally explores deploying multimodal large language models in robotics. PaLM-E takes the first exploratory step, but the types of robot tasks it considers are very limited, so it cannot yet be called truly general.


The last example is GPT-4. It gives particularly amazing results on standard datasets, often even better than current SOTA models fine-tuned on those very datasets. This may come as a shock, but it does not actually mean much. When we were building multimodal large models two years ago, we found that the capabilities of large models cannot be evaluated on standard datasets: good performance on a standard dataset does not imply good results in actual use, and there is a big gap between the two. For this reason I am slightly disappointed with the current GPT-4, which only reports results on standard datasets. Moreover, the currently available GPT-4 is the pure text version, not the visual version.


The models above all target general language generation with multimodal input. The next two models are different: they target not only general language generation but also visual generation, producing both language and images.

The first is Microsoft's Visual ChatGPT; let me briefly evaluate it. The idea is very simple and is more of a product-design consideration. There are many kinds of visual generation models, as well as visual detection models, and the inputs and instructions for these different tasks vary widely. The problem is how to cover all of these tasks with one model, so Microsoft designed a Prompt Manager with OpenAI's ChatGPT at its core. It is equivalent to using ChatGPT to translate instructions for the different visual generation tasks: the user's question is an instruction described in natural language, and ChatGPT translates it into an instruction the machine can understand.


That is all Visual ChatGPT does. It is really good from a product perspective, but there is nothing new from a model-design perspective; as a model, it is a patchwork. There is no unified model training, so the different modalities do not reinforce each other. The reason we do multimodality at all is that we believe data from different modalities should help each other. Visual ChatGPT also does not consider instruction fine-tuning for multimodal generation; its instruction fine-tuning relies entirely on ChatGPT itself.
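A highly simplified sketch of the prompt-manager idea just described: a language model routes a natural-language request to one of several visual tools. The tool names and the keyword-based routing rule below are invented stand-ins for the real ChatGPT-driven dispatch:

```python
from typing import Callable, Dict

# Invented tool registry; the real Visual ChatGPT wraps many vision models.
TOOLS: Dict[str, Callable[[str], str]] = {
    "image_captioning": lambda x: f"a caption for {x}",
    "image_generation": lambda x: f"an image generated from '{x}'",
    "object_detection": lambda x: f"bounding boxes found in {x}",
}

def route(user_request: str) -> str:
    """Stand-in for the ChatGPT call that picks a tool from the request."""
    text = user_request.lower()
    if "generate" in text or "draw" in text:
        return "image_generation"
    if "detect" in text or "find" in text:
        return "object_detection"
    return "image_captioning"

request = "please generate a picture with two rainbows"
print(TOOLS[route(request)](request))
```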

The next example is the UniDiffuser model released by Professor Zhu Jun's team at Tsinghua University. From an academic perspective, this model can truly generate both text and visual content from multimodal input. This is due to their transformer-based network architecture U-ViT, which plays a role similar to U-Net, the core component of Stable Diffusion, and unifies image generation and text generation in one framework. The work itself is very meaningful, but it is still relatively early: for example, it only considers captioning and VQA tasks, does not consider multi-turn dialogue, and does no instruction fine-tuning for multimodal generation.
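A very loose sketch of the unifying idea, assuming invented shapes and a trivial noise step; this is not the actual U-ViT, only the notion that one shared network denoises image tokens and text tokens together:

```python
import torch
import torch.nn as nn

D = 256  # toy token dimension
backbone = nn.TransformerEncoderLayer(D, nhead=8, batch_first=True)

img_tokens = torch.randn(1, 64, D)  # latent image patches
txt_tokens = torch.randn(1, 16, D)  # embedded text tokens

# Perturb both modalities (UniDiffuser lets the two noise levels differ,
# which is how one network covers text-to-image, image-to-text, and joint
# generation; a single shared level is used here for simplicity).
joint = torch.cat([img_tokens, txt_tokens], dim=1)
noisy = joint + 0.1 * torch.randn_like(joint)

pred = backbone(noisy)  # one joint denoising pass through the shared net
img_pred, txt_pred = pred[:, :64], pred[:, 64:]
print(img_pred.shape, txt_pred.shape)
```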


After all these comments, I should mention that we also built a product, called ChatImg, shown in the figure below. Broadly speaking, ChatImg consists of an image encoder, a multimodal image-text encoder, and a text decoder. It is similar to Flamingo and BLIP-2, but we consider more aspects, and there are detailed differences in the implementation.

[Figure: ChatImg model architecture]

One of the biggest advantages of ChatImg is that it can accept video input. We pay special attention to general multimodal generation, including text generation, image generation, and video generation. We hope to implement a variety of generation tasks in this framework, and ultimately to support text-to-video generation.

Second, we pay special attention to real user data. We hope to continuously optimize the generative model and improve its capabilities once we obtain real user data, which is why we released the ChatImg application.
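Purely as a speculative sketch of the three-part pipeline named above (image encoder, multimodal image-text encoder, text decoder) with video input, here is a toy version; every module and the frame-flattening step are my assumptions, not the actual ChatImg design:

```python
import torch
import torch.nn as nn

D = 384  # toy hidden size
image_encoder = nn.Linear(768, D)  # assumption: simple per-patch encoder
multimodal_encoder = nn.TransformerEncoderLayer(D, nhead=8, batch_first=True)
text_decoder = nn.TransformerEncoderLayer(D, nhead=8, batch_first=True)

# Video input: encode every frame, then flatten frames into one sequence
# (this pooling strategy is an assumption, not the published design).
frames = torch.randn(1, 8, 49, 768)                 # 8 frames x 49 patches
frame_feats = image_encoder(frames).flatten(1, 2)   # (1, 8*49, D)

text_emb = torch.randn(1, 10, D)                    # embedded user prompt
fused = multimodal_encoder(torch.cat([frame_feats, text_emb], dim=1))
out = text_decoder(fused)                           # states for generation
print(out.shape)  # torch.Size([1, 402, 384])
```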

The pictures below show some examples from our tests. Although, as an early model, it still falls short in places, on the whole ChatImg can understand images. For example, it can describe paintings in conversation and can also do some in-context learning.

[Figure: ChatImg test examples]

The first example in the figure describes the painting "The Starry Night". In its description, ChatImg said Van Gogh was an American painter; when you tell it this is wrong, it corrects itself immediately. In the second example, ChatImg makes physical inferences about the objects in the picture. The third example is a photo I took myself, containing two rainbows, which it recognized accurately.

We noticed that the third and fourth examples in the figure involve emotion. This is actually related to the work we plan to do next: we want to connect ChatImg to robots. Today's robots are usually passive, with all instructions preset, which makes them seem very rigid. We hope that robots connected to ChatImg can communicate with people proactively. How can this be done? First, the robot must be able to perceive people, whether that means objectively seeing the state of the world and people's emotions or obtaining feedback; then the robot can understand people and communicate with them proactively. These two examples make me feel this goal is achievable.


Finally, let me summarize today's report. First, ChatGPT and GPT-4 have brought innovation to the research paradigm, and all of us should actively embrace this change. We cannot complain or use lack of resources as an excuse; as long as we face the change, there are always ways to overcome the difficulties. Multimodal research does not even require machines with hundreds of GPUs; with the right strategies, a small number of machines can produce good work. Second, existing multimodal generative models all have their own problems, and GPT-4 does not yet have an open visual version, so there is still a chance for all of us. Moreover, I think GPT-4 leaves open the question of what a multimodal generative model should ultimately look like; it does not give a perfect answer (in fact, it does not reveal any details of GPT-4). This is actually a good thing: people all over the world are very smart, everyone has their own ideas, and this may create a new research situation where a hundred flowers bloom. That is all for my speech. Thank you.

