Adding comprehensive audio-visual capabilities to large language models: DAMO Academy open-sources Video-LLaMA

WBOY
2023-06-09 21:28:04

Video plays an increasingly important role in today's social media and Internet culture. Douyin, Kuaishou, Bilibili and similar platforms attract hundreds of millions of users, who share life moments, creative works, and other content through video to interact and communicate with others.

Recently, large language models have demonstrated impressive capabilities. Can we equip large models with “eyes” and “ears” so that they can understand videos and interact with users?

Starting from this question, researchers from DAMO Academy proposed Video-LLaMA, a large model with comprehensive audio-visual capabilities. Video-LLaMA can perceive and understand both the visual and the audio signals in a video, and can follow user instructions to complete a range of complex audio- and video-based tasks, such as audio/video description, writing, and question answering. The paper, code, and interactive demo are all publicly available. In addition, the research team provides a Chinese version of the model on the Video-LLaMA project homepage to give Chinese users a smoother experience.

  • Paper link: https://arxiv.org/abs/2306.02858
  • Code address: https://github.com/DAMO-NLP-SG/Video-LLaMA


Video-LLaMA follows a modular design: visual and audio modality information is mapped into the input space of a large language model, which gives the model the ability to follow cross-modal instructions. Unlike previous large-model research (MiniGPT-4, LLaVA) that focused on static image understanding, Video-LLaMA faces two challenges in video understanding: capturing dynamic scene changes in the visual stream and integrating audio-visual signals.

To capture dynamic scene changes in videos, Video-LLaMA introduces a pluggable vision-language branch. This branch first uses the pre-trained image encoder from BLIP-2 to extract features for each frame individually, and then adds the corresponding frame position embedding. All frame features are sent to the Video Q-Former, which aggregates the frame-level representations and generates a fixed-length video representation. Finally, a linear layer aligns the video representation with the embedding space of the large language model.
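
The data flow described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the authors' implementation: a single cross-attention step stands in for the multi-layer Video Q-Former, all weights are random, and the names `aggregate_video`, `random_matrix`, and the sizes `D`, `D_LLM`, `K`, `N` are hypothetical. The point it shows is that the output length depends on the number of learned queries, not on the number of frames.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def aggregate_video(frame_feats, frame_pos, queries, proj):
    """frame_feats: T frames x N tokens x D, frame_pos: (>= T) x D,
    queries: K x D, proj: D_llm x D. Returns K vectors of size D_llm."""
    # 1) Add each frame's position embedding to every token of that frame.
    tokens = [
        [v + p for v, p in zip(tok, frame_pos[t])]
        for t, frame in enumerate(frame_feats)
        for tok in frame
    ]
    # 2) Simplified stand-in for the Q-Former: each learned query attends
    #    over all frame tokens (one single-head cross-attention step).
    d = len(queries[0])
    video_tokens = []
    for q in queries:
        weights = softmax([dot(q, tok) / math.sqrt(d) for tok in tokens])
        video_tokens.append(
            [sum(w * tok[i] for w, tok in zip(weights, tokens)) for i in range(d)]
        )
    # 3) Linear layer that maps into the (hypothetical) LLM embedding size.
    return [[dot(row, vt) for row in proj] for vt in video_tokens]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
D, D_LLM, K, N = 8, 16, 4, 3             # toy sizes, chosen for illustration
queries = random_matrix(K, D, rng)
proj = random_matrix(D_LLM, D, rng)
frame_pos = random_matrix(32, D, rng)    # one embedding per frame index

for T in (4, 16):
    frames = [random_matrix(N, D, rng) for _ in range(T)]
    out = aggregate_video(frames, frame_pos, queries, proj)
    print(T, len(out), len(out[0]))      # output is K x D_LLM for any T
```

Whether the video has 4 or 16 frames, the sketch returns `K` vectors of size `D_LLM`, which is what lets a language model consume videos of different lengths through a fixed number of input positions.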

The sound in the video is handled by the audio-language branch. First, multiple two-second audio clips are uniformly sampled from the original video, and each clip is converted into a mel spectrogram with 128 mel bins. ImageBind is then used as the audio encoder to extract features from each clip individually. After learnable positional embeddings are added, the Audio Q-Former aggregates the clip features as a whole and generates fixed-length audio features. As in the vision-language branch, a linear layer finally aligns the audio representation with the embedding space of the large language model.
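
The clip-sampling step can be illustrated with a few lines of Python. The article does not say how many clips are taken or exactly how they are spaced, so this sketch assumes evenly spaced start times across the whole track; the function name `sample_clip_windows` and its parameters are hypothetical.

```python
def sample_clip_windows(duration_s, num_clips, clip_len_s=2.0):
    """Return num_clips (start, end) windows, in seconds, spread evenly
    over an audio track of length duration_s.

    Assumption: starts are evenly spaced between 0 and the last position
    where a full clip still fits. Tracks shorter than clip_len_s yield
    shortened windows; a real pipeline would likely pad them instead.
    """
    if num_clips < 1:
        raise ValueError("num_clips must be at least 1")
    last_start = max(duration_s - clip_len_s, 0.0)
    if num_clips == 1:
        return [(0.0, min(clip_len_s, duration_s))]
    step = last_start / (num_clips - 1)
    return [
        (i * step, min(i * step + clip_len_s, duration_s))
        for i in range(num_clips)
    ]

# Each window would then be turned into a mel spectrogram and passed
# to the audio encoder; that part is omitted here.
for start, end in sample_clip_windows(10.0, 5):
    print(f"{start:.1f}s to {end:.1f}s")
```

For a 10-second track and 5 clips, the windows tile the track back to back; with a longer track and the same clip count, gaps appear between windows, which is the trade-off of sampling a fixed number of clips.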

To reduce training costs, Video-LLaMA freezes the pre-trained image and audio encoders and updates only the following parameters in the visual and audio branches: the Video/Audio Q-Former, the positional embedding layer, and the linear layer (shown in Figure 1).
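
A toy sketch of what freezing means for the optimizer step is given below. It uses scalar stand-ins for whole modules, and the parameter names and the `Param`, `build_branch`, and `sgd_step` helpers are hypothetical. In a framework such as PyTorch the same effect is usually obtained by disabling gradients on the frozen modules.

```python
from dataclasses import dataclass

@dataclass
class Param:
    value: float
    grad: float = 0.0
    trainable: bool = True

def build_branch():
    """One branch with a frozen encoder and three trainable parts
    (scalar placeholders, not real weights)."""
    return {
        "encoder.weight": Param(1.0, trainable=False),  # pre-trained, frozen
        "qformer.weight": Param(1.0),
        "position_embedding.weight": Param(1.0),
        "linear.weight": Param(1.0),
    }

def sgd_step(params, lr=0.5):
    """Plain gradient step that skips every frozen parameter."""
    for p in params.values():
        if p.trainable:
            p.value -= lr * p.grad

params = build_branch()
for p in params.values():
    p.grad = 1.0            # pretend backpropagation produced these gradients
sgd_step(params)

for name, p in params.items():
    print(name, p.value, "trainable" if p.trainable else "frozen")
```

Only the Q-Former, positional embedding, and linear-layer placeholders change; the encoder keeps its pre-trained value, so no optimizer state or updates are spent on it.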

To learn the alignment between vision and text, the authors first pre-trained the vision branch on a large-scale video-text dataset (WebVid-2M) and an image-text dataset (CC-595K). They then fine-tuned it on the image instruction datasets from MiniGPT-4 and LLaVA and the video instruction dataset from Video-Chat to achieve better cross-modal instruction following.

For audio-text alignment, large-scale, high-quality audio-text data is lacking, so the authors adopted a workaround. The goal of the learnable parameters in the audio-language branch can be understood as aligning the output of the audio encoder with the embedding space of the LLM. The audio encoder, ImageBind, has strong multi-modal alignment capability: it embeds different modalities into a common space. The authors therefore train the audio-language branch on visual-text data, aligning ImageBind's common embedding space with the text embedding space of the LLM, which in turn aligns the audio modality with that space. In this way, Video-LLaMA can understand audio at inference time even though it has never been trained on audio data.
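
The reasoning behind this workaround can be shown with a deliberately tiny numerical sketch. Everything in it is an assumption made for illustration: the shared space is 2-dimensional, the trainable branch is reduced to a single linear map fitted by gradient descent, and the "audio" embedding is simply placed close to the visual embedding of the same concept, which idealizes what a shared-space encoder such as ImageBind provides.

```python
import math

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def train_projection(inputs, targets, steps=200, lr=0.1):
    """Fit W so that W @ x approximates y, by gradient descent on the
    summed squared error."""
    dim_out, dim_in = len(targets[0]), len(inputs[0])
    W = [[0.0] * dim_in for _ in range(dim_out)]
    for _ in range(steps):
        grad = [[0.0] * dim_in for _ in range(dim_out)]
        for x, y in zip(inputs, targets):
            pred = matvec(W, x)
            for i in range(dim_out):
                err = pred[i] - y[i]
                for j in range(dim_in):
                    grad[i][j] += 2.0 * err * x[j]
        for i in range(dim_out):
            for j in range(dim_in):
                W[i][j] -= lr * grad[i][j]
    return W

# Toy shared-space embeddings of three *visual* inputs and the toy
# "LLM text" embeddings they should map to.
visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text = [[2.0, 1.0, 0.0], [0.0, 1.0, -1.0], [2.0, 2.0, -1.0]]

W = train_projection(visual, text)       # trained on visual-text pairs only

# An *audio* input of the first concept: never seen in training, but the
# shared encoder is assumed to place it near the matching visual embedding.
audio = [0.98, 0.03]
projected = matvec(W, audio)
distance = math.dist(projected, text[0])
print([round(v, 2) for v in projected], round(distance, 3))
```

Because the projection was learned on the shared space rather than on one modality, the unseen audio embedding lands close to the text embedding of its concept. How well this transfers in practice depends on how tightly the real encoder aligns audio with vision.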

Examples

The authors show several examples of Video-LLaMA dialogues based on video, audio, and images.

(1) The following two examples demonstrate Video-LLaMA's combined audio-visual perception; both conversations are about videos that contain sound. In Example 2, only the performer appears on screen, while the audio consists of the audience's cheers and applause. A model that received only the visual signal could not infer the audience's positive response. Conversely, the audio contains no instrument sound, but a saxophone is visible in the frame; a model that received only the audio could not know that the performer played the saxophone.

(2) Video-LLaMA also shows strong perception and understanding of static images, and can complete tasks such as image description and question answering.

(3) Surprisingly, Video-LLaMA can identify famous landmarks and people and answer common-sense questions. In the example below, Video-LLaMA recognizes the White House and gives some background on it. In another example, given a still of the "Mother of Dragons" (Daenerys Targaryen) and Jon Snow, characters in the classic TV series "Game of Thrones", Video-LLaMA not only identifies them but also describes their tangled relationship.

(4) Video-LLaMA also captures dynamic events in videos well, such as a shushing gesture and the direction in which a boat is moving.

Summary

Audio-visual understanding remains a very complex problem, and there is no mature solution yet. Although Video-LLaMA has shown impressive capabilities, the authors also note several limitations.

(1) Limited perceptual ability: Video-LLaMA's visual and auditory abilities are still relatively rudimentary, and it struggles to recognize complex visual and sound information. Part of the reason is that existing datasets are limited in both quality and scale. The research group is working to build a high-quality audio-video-text alignment dataset to improve the model's perceptual capabilities.

(2) Difficulty processing long videos: Long videos (such as movies and TV shows) contain a large amount of information, which places high demands on the model's reasoning ability and on computing resources.

(3) The inherent hallucination problem of language models still exists in Video-LLaMA.

Overall, Video-LLaMA, as a large model with comprehensive audio-visual capabilities, has achieved impressive results in audio and video understanding. With continued research, the challenges above are expected to be addressed one by one, giving audio-visual understanding models broad practical value.


Statement:
This article is reproduced from 51cto.com. If there is any infringement, please contact admin@php.cn to have it removed.