In 12 video understanding tasks, Mamba first defeated Transformer
May 01, 2024

This site publishes columns with academic and technical content. In recent years, the site's AIxiv column has received more than 2,000 reports, covering top laboratories at major universities and companies around the world, effectively promoting academic exchange and dissemination. If you have excellent work to share, feel free to submit or contact us for coverage. Submission email: liyazhou@jiqizhixin.com; zhaoyunfeng@jiqizhixin.com.


Explore a new frontier of video understanding: the Mamba model is leading a new trend in computer vision research. The limitations of traditional architectures have been broken; with its unique advantages in long-sequence processing, the state space model Mamba has brought sweeping changes to the field of video understanding.

A research team from Nanjing University, Shanghai Artificial Intelligence Laboratory, Fudan University, and Zhejiang University has released a groundbreaking work: a comprehensive look at Mamba's multiple roles in video modeling. They propose the Video Mamba Suite of 14 models/modules and evaluate it in depth on 12 video understanding tasks. The results are encouraging: Mamba shows strong potential on both video-only and video-language tasks, achieving a good balance of efficiency and performance. This is not only a technological leap, but also a strong impetus for future video understanding research.

  • Paper title: Video Mamba Suite: State Space Model as a Versatile Alternative for Video Understanding
  • Paper link: https://arxiv.org/abs/2403.09626
  • Code link: https://github.com/OpenGVLab/video-mamba-suite

In today's rapidly developing field of computer vision, video understanding has become one of the key driving forces of industry progress. Many researchers are exploring and optimizing deep learning architectures to achieve deeper analysis of video content. From early recurrent neural networks (RNNs) and three-dimensional convolutional neural networks (3D CNNs) to the currently much-watched Transformer, each technological leap has greatly broadened our understanding and application of video data.

In particular, the Transformer model has achieved remarkable results in many areas of video understanding, including but not limited to object detection, image segmentation, and multi-modal question answering. However, facing the inherently ultra-long sequences of video data, the Transformer also exposes an inherent limitation: because its computational complexity grows quadratically with sequence length, directly modeling ultra-long video sequences becomes extremely expensive.

Against this backdrop, the state space model architecture, represented by Mamba, has emerged. With its linear computational complexity, it shows powerful potential for processing long sequence data and offers a possible replacement for the Transformer. Even so, current applications of state space models in video understanding remain limited: first, they focus mainly on global video understanding tasks such as classification and retrieval; second, they mainly explore direct spatiotemporal modeling, while more diverse modeling schemes remain under-explored.
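To make the complexity contrast concrete, here is a toy sketch (not from the paper; all names and constants are illustrative) comparing the multiply-add count of self-attention with a one-pass linear state space recurrence:

```python
import numpy as np

def attention_ops(seq_len: int, dim: int) -> int:
    # Q @ K^T and attn @ V each cost about seq_len^2 * dim multiply-adds.
    return 2 * seq_len * seq_len * dim

def ssm_scan(x: np.ndarray, A: float, B: float, C: float) -> np.ndarray:
    # Linear recurrence: h_t = A*h_{t-1} + B*x_t,  y_t = C*h_t.
    # One pass over the sequence, so cost grows linearly with length.
    h, ys = 0.0, []
    for x_t in x:
        h = A * h + B * x_t
        ys.append(C * h)
    return np.array(ys)

y = ssm_scan(np.ones(8), A=0.5, B=1.0, C=2.0)
print(y[:2])                      # first outputs of the scan
print(attention_ops(8192, 64))    # attention cost blows up quadratically
```

The recurrence visits each token once, which is why modeling thousands of video frames stays tractable where quadratic attention does not.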

To overcome these limitations and fully evaluate Mamba's potential in video understanding, the research team built the Video Mamba Suite (video-mamba-suite). The suite complements existing research, exploring Mamba's diverse roles and potential benefits in video understanding through a series of in-depth experiments and analyses.

The research team divided the application of the Mamba model into four distinct roles and accordingly built a Video Mamba Suite containing 14 models/modules. Comprehensive evaluation on 12 video understanding tasks reveals not only Mamba's great potential on video and video-language tasks, but also its excellent balance between efficiency and performance. The authors hope this work provides reference resources and insights for future research in video understanding.

Research background

Video understanding is a fundamental problem in computer vision research. Its core is to capture the spatiotemporal dynamics in video in order to recognize activities and infer how they evolve. Current architecture exploration for video understanding falls into three main directions.

First, frame-based feature encoding methods model temporal dependencies with recurrent networks (such as GRU and LSTM), but this separated spatiotemporal modeling struggles to capture joint spatiotemporal information. Second, three-dimensional convolution kernels let convolutional neural networks consider spatial and temporal correlations simultaneously.
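As a sketch of the first direction, the toy NumPy code below (illustrative only: `encode_frame` and the simplified recurrent cell stand in for a real 2D CNN encoder and a GRU/LSTM) aggregates per-frame features frame by frame, which captures order but keeps spatial and temporal modeling separate:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_frame(frame: np.ndarray, W: np.ndarray) -> np.ndarray:
    # Stand-in per-frame encoder (a real pipeline would use a 2D CNN).
    return np.tanh(frame @ W)

def recurrent_video_feature(frames, W_enc, W_h, W_x):
    # Frame-by-frame temporal aggregation with a simplified recurrent cell:
    # the spatial encoding of each frame happens before, and independently
    # of, the temporal update.
    h = np.zeros(W_h.shape[0])
    for f in frames:
        x = encode_frame(f, W_enc)
        h = np.tanh(W_h @ h + W_x @ x)
    return h

frames = [rng.standard_normal(16) for _ in range(10)]
W_enc = rng.standard_normal((16, 8))
W_h = rng.standard_normal((8, 8)) * 0.1
W_x = rng.standard_normal((8, 8)) * 0.1
print(recurrent_video_feature(frames, W_enc, W_h, W_x).shape)  # (8,)
```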

With the great success of Transformer models in language and vision, the video Transformer has also made significant progress in video understanding, surpassing RNNs and 3D CNNs. A video Transformer encapsulates the video as a sequence of tokens and uses the attention mechanism for global context interaction and data-dependent dynamic computation, handling temporal or spatiotemporal information in a unified way.

However, because the video Transformer's computational efficiency is limited on long videos, variant models have emerged that trade off speed and performance. Recently, state space models (SSMs) have demonstrated their advantages in natural language processing (NLP). Modern SSMs exhibit strong representational ability in long-sequence modeling while maintaining linear time complexity, because their selection mechanism removes the need to store the complete context. Mamba, in particular, incorporates time-varying parameters into the SSM and proposes a hardware-aware algorithm for efficient training and inference. Mamba's excellent scaling behavior indicates that it can be a promising alternative to the Transformer.
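The selection idea can be sketched in miniature. In the toy scan below (a deliberate simplification, not Mamba's actual implementation, which uses learned projections, discretization, gating, and a hardware-aware parallel scan), the write and read projections depend on the current input, so the state keeps only what the input deems relevant:

```python
import numpy as np

def selective_scan(x, W_B, W_C, a: float = 0.9):
    # Toy version of the selection mechanism: B_t and C_t are computed
    # from the current input, so the model chooses per step what to
    # write into and read out of the recurrent state.
    h = np.zeros_like(W_B)
    ys = []
    for x_t in x:
        B_t = W_B * x_t            # input-dependent write
        C_t = W_C * x_t            # input-dependent read
        h = a * h + B_t
        ys.append(float(C_t @ h))
    return np.array(ys)

# A zero input writes and reads nothing, so that step contributes no output.
print(selective_scan(np.array([1.0, 0.0]), np.ones(4), np.ones(4)))
```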

At the same time, Mamba's performance and efficiency make it well suited to video understanding tasks. Although there have been initial attempts to apply Mamba to image and video modeling, its effectiveness in video understanding remains unclear, and the lack of comprehensive study limits further exploration of its capabilities across diverse video-related tasks.

In response, the research team explored Mamba's potential in video understanding, aiming to evaluate whether Mamba can be a viable alternative to the Transformer in this field. They first asked how to frame Mamba's different roles in video understanding, and on that basis studied on which tasks Mamba performs better.

The paper divides Mamba's roles in video modeling into four categories: 1) temporal model, 2) temporal module, 3) multi-modal interaction network, 4) spatiotemporal model. For each role, the research team studied its video modeling ability on different video understanding tasks. To fairly pit Mamba against the Transformer, they carefully selected comparison models based on standard or modified Transformer architectures. From this they obtained a Video Mamba Suite containing 14 models/modules covering 12 video understanding tasks, which they hope can become a basic resource for exploring SSM-based video understanding models in the future.

Four roles

Mamba as a video temporal model

Tasks and Data: The research team evaluated Mamba's performance on five video temporal tasks: temporal action localization (HACS Segment), temporal action segmentation (GTEA), dense video captioning (ActivityNet, YouCook), video paragraph captioning (ActivityNet, YouCook), and action anticipation (Epic-Kitchens-100).
Baseline and Challenger: For each task, the research team selected a Transformer-based model as the baseline, including ActionFormer, ASFormer, TeSTra, and PDVC. To build a Mamba challenger, they replaced the Transformer module in each baseline with a Mamba-based module, using three module variants: the original Mamba block (a), the ViM block (b), and the DBM block (c) designed by the research team. Notably, for the action anticipation task, which involves causal inference, the paper compares the baseline against the original Mamba module.
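The swap pattern itself can be sketched as follows. The class and function names are hypothetical stand-ins, not the suite's real API; the point is that the task head and overall architecture stay fixed while only the sequence-encoder block changes:

```python
from typing import Callable, List, Sequence

class HeadWithEncoder:
    """Illustrative skeleton: a detection/captioning head wrapping a
    swappable sequence encoder (Transformer or Mamba-based)."""
    def __init__(self, encoder: Callable[[Sequence[float]], List[float]]):
        self.encoder = encoder

    def predict(self, feats: Sequence[float]) -> List[float]:
        return [2.0 * v for v in self.encoder(feats)]  # stand-in head

def transformer_block(feats):    # placeholder for the baseline's encoder
    return list(feats)

def mamba_block(feats):          # placeholder for a Mamba / ViM / DBM block
    return list(feats)

baseline = HeadWithEncoder(transformer_block)
challenger = HeadWithEncoder(mamba_block)   # drop-in replacement
print(challenger.predict([1.0, 2.0]))       # [2.0, 4.0]
```

Keeping the head identical is what makes the Mamba-vs-Transformer comparison fair: any performance difference comes from the encoder alone.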

Results and Analysis: The paper reports comparisons of the different models on these tasks. Overall, even though some Transformer-based baselines incorporate attention variants to improve performance, the results show the Mamba series outperforming the existing Transformer-based methods.

Mamba for multi-modal interaction

The research team did not only focus on single-modal tasks; they also evaluated Mamba's performance on cross-modal interaction, using the video temporal grounding (VTG) task. The datasets covered include QvHighlight and Charade-STA.

Baseline and Challenger: The research team built the Mamba-based VTG model on UniVTG, which adopts a Transformer as its multi-modal interaction network. Given video features and text features, learnable position embeddings and modality-type embeddings are first added to each modality to preserve position and modality information. The text and video tokens are then concatenated into a joint input and fed into the multi-modal Transformer encoder. Finally, the text-enhanced video features are extracted and fed into the prediction head. To create a cross-modal Mamba competitor, the team stacked bidirectional Mamba blocks into a multi-modal Mamba encoder that replaces the Transformer baseline.
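A minimal sketch of this joint-input construction, using random stand-ins for the learnable position and modality-type embeddings (names and shapes are illustrative, not UniVTG's actual code):

```python
import numpy as np

def build_joint_input(video_tokens, text_tokens, d):
    # Per the description: add position + modality-type embeddings to each
    # modality, then concatenate text and video tokens for the encoder.
    rng = np.random.default_rng(0)
    pos_v = rng.standard_normal((len(video_tokens), d)) * 0.02
    pos_t = rng.standard_normal((len(text_tokens), d)) * 0.02
    type_v = rng.standard_normal(d) * 0.02   # one vector per modality
    type_t = rng.standard_normal(d) * 0.02
    v = video_tokens + pos_v + type_v
    t = text_tokens + pos_t + type_t
    return np.concatenate([t, v], axis=0)    # text placed left of video

video = np.zeros((4, 8))   # 4 video tokens, dim 8
text = np.zeros((2, 8))    # 2 text tokens
print(build_joint_input(video, text, d=8).shape)  # (6, 8)
```

The concatenation order matters for a scan-based encoder, which is exactly what the token-arrangement experiment below investigates.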

Results and Analysis: The paper tested multiple models on QvHighlight. The Mamba-based model reaches an average mAP of 44.74, a significant improvement over the Transformer. On Charade-STA, the Mamba-based method is similarly competitive with the Transformer. This suggests that Mamba can effectively integrate multiple modalities.

Considering that Mamba is based on linear scanning while the Transformer is based on global token interaction, the research team conjectured that the position of the text in the token sequence may affect multi-modal aggregation. To investigate this, they compared different text-visual fusion methods across four different token arrangements. The conclusion: the best results come from fusing the text condition to the left of the visual features. QvHighlight is less affected by the fusion position, while Charade-STA is particularly sensitive to it, which may be attributed to characteristics of the datasets.

Mamba as a video temporal adapter

In addition to evaluating Mamba as a standalone temporal model, the research team examined its usability as a video temporal adapter. They pre-trained a two-tower model with video-text contrastive learning on egocentric data containing 4 million video clips with fine-grained narrations.

Tasks and Data: The research team evaluated the adapted models on zero-shot and fine-tuned multi-instance retrieval and action recognition (Epic-Kitchens-100), and on zero-shot long-form video question answering (EgoSchema).

Baseline and Challenger: TimeSformer uses divided space-time attention blocks that model spatial and temporal relationships separately. The research team replaced the original temporal self-attention with a bidirectional Mamba block serving as a temporal adapter, improving the divided spatiotemporal interaction. For a fair comparison, the spatial attention layers in TimeSformer remain unchanged. Here, the team used ViM blocks as the temporal module and called the resulting model TimeMamba.
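The divided design can be sketched as below. The temporal and spatial mixers are crude numerical stand-ins (a bidirectional running mean and per-frame centering), not real ViM or attention blocks, but the residual structure mirrors the description: a temporal pass along the frame axis, then the unchanged spatial pass within each frame:

```python
import numpy as np

def temporal_mix_bidirectional(x):
    # Stand-in for a bidirectional Mamba (ViM) temporal block: a forward
    # and a backward running mean over the frame axis, averaged.
    T = x.shape[0]
    denom = np.arange(1, T + 1)[:, None, None]
    fwd = np.cumsum(x, axis=0) / denom
    bwd = (np.cumsum(x[::-1], axis=0) / denom)[::-1]
    return (fwd + bwd) / 2.0

def spatial_attention_stub(x):
    # Frozen pre-trained spatial mixing within each frame (stand-in).
    return x - x.mean(axis=1, keepdims=True)

def divided_block(x):
    # TimeSformer-style divided block with temporal self-attention replaced
    # by the temporal module; the spatial part is left unchanged.
    x = x + temporal_mix_bidirectional(x)   # residual temporal adapter
    x = x + spatial_attention_stub(x)
    return x

x = np.random.default_rng(0).standard_normal((4, 9, 16))  # (T, S, C)
print(divided_block(x).shape)  # (4, 9, 16)
```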

It is worth noting that the standard ViM block has slightly more parameters than the self-attention block. Therefore, the paper sets the ViM block's expansion ratio E to 1, reducing its parameter count for a fair comparison. In addition to the ordinary residual connection form used by TimeSformer, the research team also explored Frozen-style adaptation, yielding five adapter structures in total.
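One common reading of Frozen-style adaptation, sketched here as an assumption (whether this matches all five structures in the paper is not confirmed by the text): zero-initialize the adapter's projection so that adaptation starts as an exact identity and preserves the pre-trained spatial model's behavior at step zero.

```python
import numpy as np

def adapted(x, W):
    # Residual temporal adaptation: output = input + temporal_module(input).
    return x + x @ W

x = np.random.default_rng(1).standard_normal((4, 8))
W_plain = np.random.default_rng(2).standard_normal((8, 8)) * 0.1
W_frozen = np.zeros((8, 8))   # zero-initialized projection

# The plain residual changes the features immediately; the zero-initialized
# variant is an exact identity at the start of adaptation.
print(np.allclose(adapted(x, W_frozen), x))   # True
print(np.allclose(adapted(x, W_plain), x))    # False
```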

Results and Analysis

1. Zero-shot multi-instance retrieval. The research team first compared models with divided spatiotemporal interaction and found that the Frozen-style residual connection reproduced in the paper matches that of LaViLa. Comparing the plain and Frozen styles, the Frozen style consistently produces better results. Furthermore, under the same adaptation method, the ViM-based temporal module consistently outperforms the attention-based temporal module.

Notably, the ViM temporal block used in the paper has fewer parameters than the temporal self-attention block, highlighting the better parameter utilization and information-extraction ability of Mamba's selective scan.

In addition, the research team verified a spatiotemporal ViM block, which replaces the temporal ViM block with joint spatiotemporal modeling over the entire video sequence. Surprisingly, despite introducing global modeling, the spatiotemporal ViM block actually degrades performance. The team speculates that scan-based joint spatiotemporal modeling disturbs the spatial feature distributions produced by the pre-trained spatial attention blocks.

2. Fine-tuned multi-instance retrieval and action recognition. The research team then fine-tuned the pre-trained models with 16 frames on the Epic-Kitchens-100 dataset for multi-instance retrieval and action recognition. The results show TimeMamba significantly outperforming TimeSformer on verb recognition, by more than 2.8 percentage points, indicating that TimeMamba can effectively model fine-grained temporal structure.

3. Zero-shot long-form video question answering. The research team further evaluated long-video question answering on the EgoSchema dataset.

After pre-training on Ego4D, both TimeSformer and TimeMamba exceed the performance of large-scale pre-trained models such as InternVideo. The team also kept increasing the number of test frames, sampled from the video at a fixed FPS, to probe the long-video temporal modeling ability of ViM blocks. Although both models were pre-trained with 4 frames, the performance of TimeMamba and TimeSformer improves steadily as the frame count grows, with significant gains observed at 8192 frames. When the input exceeds 32 frames, TimeMamba generally benefits more from additional frames than TimeSformer, indicating the advantage of the temporal ViM block over temporal self-attention.

Mamba for spatiotemporal modeling

Tasks and Data: The paper also evaluates Mamba's ability in spatiotemporal modeling, specifically zero-shot multi-instance retrieval on the Epic-Kitchens-100 dataset.

Baselines and Competitors: ViViT and TimeSformer study how to convert a ViT with spatial attention into a model with joint spatiotemporal attention. Following this, the research team extended the ViM model's spatial selective scan to a spatiotemporal selective scan, naming the extended model ViViM. They initialized it with a ViM model pre-trained on ImageNet-1K; that model contains a cls token inserted into the middle of the flattened token sequence.

The conversion from ViM to ViViM works as follows. For an input of M frames, a cls token is inserted into the middle of the token sequence of each frame. The team also adds a temporal position embedding per frame, initialized to zero. The flattened video sequence is then fed into the ViViM model, whose output is the average of the per-frame cls tokens.
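The token layout described above can be sketched as follows (shapes and helper names are illustrative, not the suite's actual code):

```python
import numpy as np

def build_vivim_sequence(frame_tokens, cls_token, temporal_pos):
    # Per the description: for each of M frames, insert the cls token into
    # the middle of that frame's token sequence, add a (zero-initialized)
    # temporal position embedding, then flatten frames into one sequence.
    seqs = []
    for m, toks in enumerate(frame_tokens):
        mid = len(toks) // 2
        seq = np.concatenate([toks[:mid], cls_token[None], toks[mid:]], axis=0)
        seqs.append(seq + temporal_pos[m])
    return np.concatenate(seqs, axis=0)

M, N, d = 3, 4, 8                     # 3 frames, 4 patch tokens each
frames = [np.zeros((N, d)) for _ in range(M)]
cls = np.ones(d)
t_pos = np.zeros((M, d))              # zero-initialized, as in the text
flat = build_vivim_sequence(frames, cls, t_pos)
print(flat.shape)  # (15, 8): M * (N + 1) tokens
```

The model's output would then be the mean over the M cls-token positions of this flattened sequence.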

Results and Analysis: The paper further studies ViViM on zero-shot multi-instance retrieval.

The results compare different spatiotemporal models on zero-shot multi-instance retrieval. Comparing ViT and ViViM, both pre-trained on ImageNet-1K, ViViM outperforms ViT. Interestingly, although the gap between ViT-S and ViM-S on ImageNet-1K is small (79.8 vs. 80.5), ViViM-S brings a significant improvement on zero-shot multi-instance retrieval (+2.1 mAP@Avg), showing that ViViM is very effective at modeling long sequences, which translates into better performance.

Conclusion

This paper comprehensively evaluates Mamba in the field of video understanding and demonstrates its potential as a viable alternative to the traditional Transformer. Through the Video Mamba Suite, 14 models/modules covering 12 video understanding tasks, the research team showed Mamba's ability to efficiently handle complex spatiotemporal dynamics. Mamba not only delivers strong performance, but also achieves a better efficiency-performance balance. These findings highlight Mamba's suitability for video analysis tasks and open new avenues for its application in computer vision. Future work can further explore Mamba's adaptability and extend its utility to more complex multimodal video understanding challenges.

Statement
This article is reproduced from 机器之心 (Jiqizhixin).