
To make the video pose Transformer fast, Peking University proposes an efficient 3D human pose estimation framework HoT

王林
2024-04-01

Video Pose Transformers (VPTs) currently achieve state-of-the-art performance in video-based 3D human pose estimation. In recent years, however, the computational cost of these models has grown enormous, which limits further progress in the field and is unfriendly to researchers with limited computing resources. For example, training a 243-frame VPT model typically takes several days, seriously slowing the pace of research and becoming a major pain point that urgently needs to be solved.

So, how can the efficiency of VPTs be improved with almost no loss of accuracy?

A team from Peking University proposes HoT, an efficient 3D human pose estimation framework based on an hourglass tokenizer, to address the high computational demand of existing VPTs. The framework is plug-and-play and can be seamlessly integrated into models such as MHFormer, MixSTE, and MotionBERT, cutting the model's computation by nearly 40% without losing accuracy. The code has been open-sourced.

  • Paper title: Hourglass Tokenizer for Efficient Transformer-Based 3D Human Pose Estimation
  • Paper address: https://arxiv.org/abs/2311.12028
  • Code address: https://github.com/NationalGAILab/HoT

Research Motivation

In VPT models, each video frame is typically processed as an independent Pose Token, and superior performance is achieved by processing sequences of hundreds of frames (typically 243 to 351) while maintaining the full-length sequence representation across all Transformer layers. However, since the computational complexity of self-attention grows quadratically with the number of tokens (i.e., the number of video frames), these models inevitably incur huge computational overhead when processing videos at higher temporal resolution, making them difficult to deploy in practical applications with limited computing resources. Moreover, processing the entire sequence ignores the redundancy within it, especially between consecutive frames with little visual change; duplicating this information not only adds unnecessary computation but contributes little to model performance.
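To make the scaling concrete, the following minimal sketch (not from the paper; the hidden dimension of 512 is an illustrative assumption) estimates the per-layer self-attention cost as the number of frame tokens grows:

def attention_flops(num_tokens: int, dim: int) -> int:
    # The QK^T product and the attention-weighted sum over V are each
    # roughly num_tokens^2 * dim multiply-accumulates.
    return 2 * num_tokens * num_tokens * dim

for n in (81, 243, 351):
    print(f"{n:3d} frames -> {attention_flops(n, dim=512) / 1e6:7.1f} MFLOPs per layer")

Tripling the sequence length from 81 to 243 frames multiplies the attention cost by roughly nine.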

Therefore, the authors argue that two factors must first be considered to achieve an efficient VPT:

  • The temporal receptive field must be large: Although directly shortening the input sequence improves VPT efficiency, doing so shrinks the model's temporal receptive field, limiting its ability to capture rich spatiotemporal information and constraining performance. Maintaining a large temporal receptive field is therefore crucial for accurate estimation when pursuing efficient designs.


  • Video redundancy must be removed: Because actions in adjacent frames are often similar, videos contain large amounts of redundant information. Moreover, existing research has shown that in Transformer architectures, tokens become increasingly similar as the layers deepen. It can therefore be inferred that keeping full-length Pose Tokens in the deep layers of the Transformer introduces redundant computation that contributes little to the final estimate.

Based on these two observations, the authors propose pruning the Pose Tokens in the deep Transformer layers to reduce video-frame redundancy and improve the overall efficiency of VPT. However, pruning raises a new challenge: it reduces the number of tokens, so the model can no longer directly produce a 3D pose estimate for every frame of the original video. In a traditional VPT, each token corresponds to one video frame, and the pruned sequence no longer covers all frames, which becomes a significant obstacle when estimating the 3D human pose of every frame. Therefore, to achieve an efficient VPT, a third factor must also be considered:

  • The model must support seq2seq inference: A practical 3D human pose estimation system should perform fast seq2seq inference, i.e., estimate the 3D poses of all frames of the input video at once. To integrate seamlessly with existing VPT frameworks and enable fast inference, the integrity of the token sequence must be ensured, i.e., a full-length token sequence equal in length to the input video must be recovered.

Based on these three considerations, the authors propose an efficient 3D human pose estimation framework with an hourglass structure: ⏳ Hourglass Tokenizer (HoT). Overall, the method has two major highlights:

  • A simple baseline: a general and efficient Transformer-based framework

HoT is the first plug-and-play framework for efficient Transformer-based 3D human pose estimation. As shown in the figure below, a traditional VPT follows a "rectangular" paradigm, keeping the full-length Pose Token sequence in every layer of the model, which incurs high computational cost and feature redundancy. HoT instead first prunes the redundant tokens and then recovers the full-length token sequence (resembling an "hourglass"), so that only a small number of tokens are kept in the intermediate Transformer layers, effectively improving the model's efficiency. HoT is also highly versatile: it can be seamlessly integrated not only into conventional VPT models, whether seq2seq- or seq2frame-based, but also with various token pruning and recovery strategies.

[Figure: the "rectangular" paradigm of traditional VPTs versus the hourglass paradigm of HoT]


  • Both high efficiency and high accuracy

HoT reveals that maintaining a full-length pose sequence is redundant: using the Pose Tokens of a small number of representative frames achieves both high efficiency and high accuracy. Compared with traditional VPT models, HoT not only significantly improves processing efficiency but also delivers highly competitive or even better results. For example, it reduces MotionBERT's FLOPs by nearly 50% without sacrificing performance, and reduces MixSTE's FLOPs by nearly 40% with only a slight performance drop of 0.2%.

[Figure: efficiency and accuracy comparison with existing VPT models]

Method

The overall HoT framework is shown in the figure below. To perform token pruning and recovery effectively, the paper proposes two modules: Token Pruning Cluster (TPC) and Token Recovering Attention (TRA). The TPC module dynamically selects a small number of representative tokens with high semantic diversity, mitigating video-frame redundancy; the TRA module restores detailed spatiotemporal information from the selected tokens, extending the network output to the original full-length temporal resolution for fast inference.

[Figure: overall framework of HoT]
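The data flow of this hourglass design can be summarized in a short PyTorch sketch. The layer count, token dimension, pruning position, and the 17-joint regression head below are illustrative assumptions, and the tpc and tra callables stand in for the two modules described next:

import torch
import torch.nn as nn

class HoTSketch(nn.Module):
    # Hourglass token flow for seq2seq inference:
    # full-length tokens -> pruned tokens in deep layers -> full length recovered.
    def __init__(self, dim=256, depth=8, prune_at=4, num_joints=17):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
             for _ in range(depth)]
        )
        self.prune_at = prune_at                  # layer where TPC prunes (assumed)
        self.head = nn.Linear(dim, num_joints * 3)

    def forward(self, tokens, tpc, tra):
        # tokens: (batch, frames, dim) pose tokens from the embedding module
        for i, blk in enumerate(self.blocks):
            if i == self.prune_at:
                tokens = tpc(tokens)              # keep a few representative tokens
            tokens = blk(tokens)
        tokens = tra(tokens)                      # recover the full temporal length
        return self.head(tokens)                  # (batch, frames, num_joints * 3)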

Token Pruning Cluster Module

Selecting a small number of information-rich Pose Tokens for accurate 3D human pose estimation is a difficult problem.

To solve it, the paper argues that the key is to select representative tokens with high semantic diversity, since such tokens retain the necessary information while reducing video redundancy. Based on this idea, the paper proposes the Token Pruning Cluster (TPC), a simple and effective module that requires no additional parameters. Its core is to identify and remove tokens that contribute little semantically, focusing on those that provide key information for the final 3D pose estimate. Using a clustering algorithm, TPC dynamically selects cluster centers as representative tokens, exploiting the cluster centers to retain the rich semantics of the original data.

The structure of TPC is shown in the figure below. It first pools the input Pose Tokens along the spatial dimension, then clusters the input tokens according to the feature similarity of the pooled tokens and selects the cluster centers as representative tokens.

[Figure: structure of the TPC module]
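A minimal sketch of this selection step is given below, scoring tokens with a DPC-kNN-style density-peaks criterion from the token-clustering literature; the exact clustering algorithm, neighborhood size, and tie handling in the authors' TPC may differ, so treat this as illustrative:

import torch

def tpc_sketch(tokens: torch.Tensor, k_clusters: int, k_nn: int = 5) -> torch.Tensor:
    # tokens: (frames, dim), pose tokens already pooled over the joint dimension.
    # Score each token by local density times distance to the nearest denser
    # token, then keep the top scorers as cluster centers (representative tokens).
    dist = torch.cdist(tokens, tokens)                     # (frames, frames)
    knn_dist, _ = dist.topk(k_nn + 1, largest=False)       # ascending; column 0 is self
    density = (-knn_dist[:, 1:].pow(2).mean(dim=1)).exp()  # local density per token
    denser = density[None, :] > density[:, None]           # [i, j]: token j denser than i
    delta = dist.masked_fill(~denser, float("inf")).min(dim=1).values
    delta[density.argmax()] = dist.max()                   # densest token gets max delta
    idx = (density * delta).topk(k_clusters).indices
    return tokens[idx.sort().values]                       # keep temporal order

The number of representative tokens kept is the knob that trades estimation accuracy against FLOPs.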

Token Recovering Attention Module

The TPC module effectively reduces the number of Pose Tokens, but the drop in temporal resolution caused by pruning prevents the VPT from performing fast seq2seq inference, so the tokens need to be recovered. For efficiency, the recovery module should also be lightweight to minimize its impact on the overall computational cost of the model.

To address this, the paper designs a lightweight Token Recovering Attention (TRA) module that recovers detailed spatiotemporal information from the selected tokens. It expands the low temporal resolution caused by pruning back to the original full temporal resolution, allowing the network to estimate the 3D human pose sequence of all frames at once and thereby achieve fast seq2seq inference.

The structure of the TRA module is shown in the figure below. It uses the representative tokens from the last Transformer layer together with zero-initialized learnable tokens, and recovers the complete token sequence through a simple cross-attention mechanism.

[Figure: structure of the TRA module]
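A minimal sketch of this recovery step, using PyTorch's built-in multi-head attention; the head count and the absence of extra projections or normalization layers are assumptions:

import torch
import torch.nn as nn

class TRASketch(nn.Module):
    # Zero-initialized learnable tokens query the representative tokens through
    # a single cross-attention, producing one output token per original frame.
    def __init__(self, dim: int, full_len: int, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.zeros(1, full_len, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, rep_tokens: torch.Tensor) -> torch.Tensor:
        # rep_tokens: (batch, num_representative, dim) from the last Transformer layer
        q = self.queries.expand(rep_tokens.size(0), -1, -1)
        out, _ = self.cross_attn(q, rep_tokens, rep_tokens)
        return out                                # (batch, full_len, dim)

Because the cross-attention runs over only a handful of representative tokens, the recovery cost stays small relative to the Transformer backbone.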

Applying HoT to Existing VPTs

Before describing how the proposed method is applied to existing VPTs, the paper first summarizes the existing VPT architecture. As shown in the figure below, a VPT consists of three components: a pose embedding module that encodes the spatial and temporal information of the pose sequence, a multi-layer Transformer that learns global spatiotemporal representations, and a regression head that regresses the 3D human pose results.

[Figure: the three components of existing VPT architectures]

Based on the number of output frames, existing VPTs follow one of two inference pipelines: seq2frame and seq2seq. In the seq2seq pipeline, the output covers all frames of the input video, so the original full-length temporal resolution must be recovered; as in the HoT framework diagram, both the TPC and TRA modules are embedded in the VPT. In the seq2frame pipeline, the output is the 3D pose of the center frame, so the TRA module is unnecessary and only the TPC module is integrated into the VPT, as shown in the figure below.

[Figure: HoT under the seq2frame pipeline (TPC only)]
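The difference between the two pipelines can be sketched as follows; shallow_layers, deep_layers, head, and the mean pooling used to obtain the center-frame estimate are hypothetical names and simplifications, not the authors' code:

def seq2seq_forward(vpt, tpc, tra, tokens):
    # One 3D pose per input frame: TPC prunes, TRA recovers full length.
    rep = vpt.deep_layers(tpc(vpt.shallow_layers(tokens)))
    return vpt.head(tra(rep))

def seq2frame_forward(vpt, tpc, tokens):
    # Only the center frame's 3D pose is needed, so TRA is omitted.
    rep = vpt.deep_layers(tpc(vpt.shallow_layers(tokens)))
    return vpt.head(rep.mean(dim=1))  # e.g., pool the representative tokens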

Experimental Results

Ablation Study

The table below compares the seq2seq (*) and seq2frame (†) inference pipelines. The results show that applying the proposed method to existing VPTs significantly reduces FLOPs and significantly improves FPS while keeping the number of model parameters almost unchanged, with performance essentially on par with or better than the original models.

[Table: ablation results under the seq2seq (*) and seq2frame (†) pipelines]

The paper also compares different token pruning strategies, including attention-score pruning, uniform sampling, and a motion-based strategy that keeps the top-k tokens with the largest motion. The proposed TPC achieves the best performance.

[Table: comparison of token pruning strategies]
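For reference, two of these baselines can be written in a few lines each; these are illustrative implementations, and the paper's exact scoring for the motion-based variant may differ:

import torch

def uniform_sampling(tokens: torch.Tensor, k: int) -> torch.Tensor:
    # Keep k evenly spaced frame tokens; tokens: (frames, dim).
    idx = torch.linspace(0, tokens.size(0) - 1, k).long()
    return tokens[idx]

def motion_topk(tokens: torch.Tensor, k: int) -> torch.Tensor:
    # Keep the k tokens whose features change most between adjacent frames.
    motion = (tokens[1:] - tokens[:-1]).norm(dim=1)
    motion = torch.cat([motion[:1], motion])      # pad so every frame has a score
    idx = motion.topk(k).indices.sort().values    # keep temporal order
    return tokens[idx]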

The paper also compares different token recovery strategies, including nearest-neighbor interpolation and linear interpolation. The proposed TRA achieves the best performance.

[Table: comparison of token recovery strategies]
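Both interpolation baselines are one call to PyTorch's F.interpolate; treating the representative tokens as evenly spaced in time is a simplifying assumption of this sketch:

import torch
import torch.nn.functional as F

def interpolate_recovery(rep_tokens: torch.Tensor, full_len: int,
                         mode: str = "linear") -> torch.Tensor:
    # Baseline recovery by temporal interpolation; rep_tokens: (batch, n, dim).
    x = rep_tokens.transpose(1, 2)                # (batch, dim, n), as interpolate expects
    if mode == "linear":
        x = F.interpolate(x, size=full_len, mode="linear", align_corners=True)
    else:
        x = F.interpolate(x, size=full_len, mode="nearest")
    return x.transpose(1, 2)                      # (batch, full_len, dim)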

Comparison with SOTA Methods

Currently, the leading 3D human pose estimation methods on the Human3.6M dataset all adopt Transformer-based architectures. To verify the effectiveness of the method, the authors apply it to three recent VPT models, MHFormer, MixSTE, and MotionBERT, and compare parameter count, FLOPs, and MPJPE.

As shown in the table below, the method significantly reduces the computation of SOTA VPT models while maintaining their original accuracy. These results not only verify the method's effectiveness and efficiency, but also reveal computational redundancy in existing VPT models that contributes little to the final estimation performance and may even degrade it. The method eliminates these unnecessary computations while achieving highly competitive or even better performance.

[Table: comparison with SOTA VPT models on Human3.6M]

Running the Demo

The authors also provide a demo (https://github.com/NationalGAILab/HoT) that integrates the YOLOv3 human detector, the HRNet 2D pose detector, and HoT with MixSTE as the 2D-to-3D pose lifter. Simply download the pretrained models provided by the authors and feed in a short video containing a person, and a 3D human pose estimation demo can be produced with a single command:

python demo/vis.py --video sample_video.mp4

Results obtained by running the sample video:

[Figure: 3D pose estimation demo on the sample video]

Summary

This article presents Hourglass Tokenizer (HoT), a plug-and-play token pruning and recovery framework for efficient Transformer-based 3D human pose estimation from videos, addressing the high computational cost of existing Video Pose Transformers (VPTs). The study finds that maintaining full-length pose sequences in a VPT is unnecessary: using the Pose Tokens of a few representative frames achieves both high accuracy and high efficiency. Extensive experiments verify the method's high compatibility and broad applicability: it can be easily integrated into common VPT models, whether seq2seq- or seq2frame-based, and adapts effectively to a variety of token pruning and recovery strategies, demonstrating great potential. The authors expect HoT to drive the development of stronger and faster VPTs.
