
Cambridge, Tencent AI Lab, and others propose PandaGPT: one large language model unifies six modalities

WBOY
2023-06-05 12:19:51

Researchers from Cambridge, NAIST, and Tencent AI Lab recently released PandaGPT, a technique for aligning and binding a large language model to encoders of different modalities so that it can follow instructions across modalities. PandaGPT can perform complex tasks such as generating detailed image descriptions, writing stories from videos, and answering questions about audio. It can also receive inputs from several modalities simultaneously and combine their semantics naturally.


  • Project homepage: https://panda-gpt.github.io/
  • Code: https://github.com/yxuansu/PandaGPT
  • Paper: http://arxiv.org/abs/2305.16355
  • Online demo: https://huggingface.co/spaces/GMFTBY/PandaGPT

[Figure: PandaGPT architecture, connecting ImageBind's multi-modal encoder to the Vicuna language model]


To achieve instruction-following capabilities across six modalities (image & video, text, audio, heat map, depth map, and IMU readings), PandaGPT combines ImageBind's multi-modal encoder with the Vicuna large language model (as shown in the figure above).
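A minimal sketch of this wiring, assuming stub modules in place of the real ImageBind and Vicuna weights; the class name and the dimensions are our illustrative assumptions (1024-d for ImageBind's joint embedding, 5120-d for Vicuna-13B's hidden size), not details stated in this article:

```python
import torch
import torch.nn as nn

# Illustrative dimensions (assumptions, see lead-in above).
IMAGEBIND_DIM, VICUNA_DIM = 1024, 5120

class PandaGPTSketch(nn.Module):
    """Conceptual wiring: frozen multi-modal encoder -> trainable linear
    projection -> projected vector prepended to the LLM's input embeddings."""

    def __init__(self, encoder: nn.Module, llm: nn.Module):
        super().__init__()
        self.encoder = encoder  # stands in for the frozen ImageBind encoder
        self.llm = llm          # stands in for the frozen Vicuna model
        self.proj = nn.Linear(IMAGEBIND_DIM, VICUNA_DIM)  # trainable

    def forward(self, modality_input, text_embeds):
        with torch.no_grad():                 # the encoder is not updated
            z = self.encoder(modality_input)  # (batch, IMAGEBIND_DIM)
        prefix = self.proj(z).unsqueeze(1)    # (batch, 1, VICUNA_DIM)
        # Insert the projected representation ahead of the text tokens.
        return self.llm(torch.cat([prefix, text_embeds], dim=1))
```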

To align the feature space of ImageBind's multi-modal encoder with that of the Vicuna language model, PandaGPT is trained on a combined total of 160k image-grounded instruction-following examples released with LLaVA and MiniGPT-4. Each training instance consists of an image and a corresponding multi-turn dialogue.
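For a concrete picture, one instance of LLaVA/MiniGPT-4-style instruction data might look roughly like this; the field names, file path, and dialogue content are illustrative assumptions, not the exact released schema:

```python
# One illustrative training instance: an image plus a multi-turn dialogue.
instance = {
    "image": "coco/train2017/000000123456.jpg",  # hypothetical path
    "conversations": [
        {"from": "human", "value": "What is unusual about this image?"},
        {"from": "gpt",   "value": "A man is ironing clothes on the roof of a moving taxi."},
        {"from": "human", "value": "Is this safe?"},
        {"from": "gpt",   "value": "No, ironing while standing on a moving vehicle is dangerous."},
    ],
}
```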

To avoid destroying ImageBind's own multi-modal alignment properties and to reduce training cost, PandaGPT updates only the following modules (a code sketch follows below):

  1. A linear projection matrix added on top of ImageBind's encoding output, which converts the representations produced by ImageBind before they are inserted into Vicuna's input sequence;
  2. Additional LoRA weights added to Vicuna's attention modules.

Together, these two components account for about 0.4% of Vicuna's parameters. The training objective is a conventional language-modeling loss. Notably, during training the loss is computed only over the model-output portion of each instance; the user-input portion is excluded. The full training run takes about 7 hours on 8×A100 (40G) GPUs.

It is worth emphasizing that the current version of PandaGPT is trained only on aligned image-text data, yet it inherits the ImageBind encoder's understanding of all six modalities (image/video, text, audio, depth, heat map, and IMU) and the alignment properties among them, enabling cross-modal capabilities across all modalities.
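Below is a hedged sketch of the two trainable pieces and the loss masking described above. The LoRA wiring and the -100 label convention follow common PyTorch practice (e.g., the convention used by Hugging Face-style trainers), not the exact PandaGPT code; the rank and alpha values are placeholders:

```python
import torch.nn as nn
import torch.nn.functional as F

class LoRALinear(nn.Module):
    """Minimal LoRA: a frozen base weight plus a low-rank trainable delta,
    as typically added to a model's attention projections."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # Vicuna weights stay frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)      # delta starts at zero
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

def language_modeling_loss(logits, labels):
    """Standard next-token prediction loss. Positions that belong to the
    user's input are set to -100 beforehand, so only the model-output
    portion of each instance contributes gradient."""
    shift_logits = logits[:, :-1].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,
    )
```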

In experiments, the authors demonstrate PandaGPT's ability to understand different modalities, including image/video-based question answering, image/video-based creative writing, reasoning over visual and auditory information, and more. Here are some examples:

Image:


Audio:


Video:

Compared with other multi-modal language models, PandaGPT's most distinctive strength is its ability to understand information from different modalities and combine it naturally.
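This article does not spell out the combination mechanics, but one plausible reading (an assumption on our part) is that each modality is encoded and projected separately, and the resulting prefix vectors are placed together ahead of the text tokens:

```python
import torch

def build_multimodal_prefix(encoder, proj, modality_inputs):
    """Hypothetical combination step: encode each modality separately,
    project each embedding into the LLM space, and concatenate the
    resulting prefixes (e.g., [video, audio]). Concatenation is an
    assumption here, not a detail stated in this article."""
    prefixes = [proj(encoder(x)).unsqueeze(1) for x in modality_inputs]
    return torch.cat(prefixes, dim=1)  # (batch, n_modalities, hidden)
```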

Video + audio:



Image + audio:


Summary

The authors also summarize PandaGPT's current limitations and possible future directions. Although PandaGPT shows an impressive ability to handle multiple modalities and their combinations, there remain many ways in which its performance could be substantially improved:

  1. PandaGPT's understanding of modalities other than images could be further improved by training on alignment data for those modalities, for example ASR and TTS data for the audio-text pair.
  2. Modalities other than text are each represented by only a single embedding vector, so the language model cannot access fine-grained information in those modalities. More research on fine-grained feature extraction, such as cross-modal attention mechanisms, may help improve performance.
  3. PandaGPT currently accepts non-text modalities only as input. In the future, a model of this kind has the potential to unify all of AIGC within a single model, i.e., one model that simultaneously handles image & video generation, speech synthesis, and text generation.
  4. New benchmarks are needed to evaluate the ability to combine multi-modal inputs.
  5. PandaGPT may also exhibit common pitfalls of existing language models, including hallucination, toxicity, and stereotyping.
Finally, the authors emphasize that PandaGPT is only a research prototype and is not yet sufficient for direct application in a production environment.

