What are the effective methods and common baseline methods for pedestrian trajectory prediction? Top conference paper sharing!
Trajectory prediction has been in the limelight over the past two years, but most work focuses on vehicle trajectory prediction. Today, the Heart of Autonomous Driving shares a NeurIPS algorithm for pedestrian trajectory prediction, SHENet. In constrained scenes, human movement patterns usually conform to a limited set of laws to some extent; based on this assumption, SHENet predicts a person's future trajectory by learning implicit scene rules. This article is published with the authorization of Heart of Autonomous Driving.
Due to the randomness and subjectivity of human movement, predicting a person's future trajectory remains a challenging problem. However, because of scene constraints (e.g., floor plans, roads, and obstacles) and human-human or human-object interactions, movement patterns in constrained scenes usually conform to a limited set of laws, and in such cases an individual's trajectory should follow one of them. In other words, a person's subsequent trajectory has likely already been traveled by others. Based on this assumption, the algorithm in this article (SHENet) predicts a person's future trajectory by learning implicit scene rules. Specifically, we refer to the regularities inherent in the past dynamics of people and the environment in a scene as scene history, and divide this information into two categories: historical group trajectories and individual-surroundings interactions. To exploit both types of information for trajectory prediction, the paper proposes a novel framework, the Scene History Mining Network (SHENet), in which scene history is exploited in a simple yet effective way. In particular, it has two designed components: a group trajectory bank module, which extracts representative group trajectories as candidates for future paths, and a cross-modal interaction module, which models the interaction between an individual's past trajectory and its surroundings for trajectory refinement. In addition, to alleviate the uncertainty in ground-truth trajectories caused by the aforementioned randomness and subjectivity of human movement, SHENet incorporates smoothness into both the training process and the evaluation metrics. Finally, experiments on different datasets demonstrate superior performance compared with SOTA methods.
Human Trajectory Prediction (HTP) aims to predict the future path of a target person from video clips. This is crucial for smart transportation, as it enables vehicles to sense pedestrian status in advance and thereby avoid potential collisions; monitoring systems with HTP capability can also assist security personnel in predicting the possible escape routes of suspects. Although much work has been done in recent years, few methods are sufficiently reliable and generalizable for real-world applications, mainly due to two challenges of the task: the randomness and subjectivity of human motion. However, in constrained real-world scenarios these challenges are not intractable. As shown in Figure 1, given previously captured videos of the scene, the target person's future trajectory (red box) becomes more predictable, because human movement patterns usually conform to several basic laws that the target person in this scene will also follow. Therefore, to predict trajectories, we first need to understand these patterns. We argue that these regularities are implicitly encoded in historical human trajectories (Figure 1, left), and in individual past trajectories, the surrounding environment, and the interactions between them (Figure 1, right), which we refer to as scene history.
Figure 1: Schematic diagram of utilizing scene history, i.e., historical group trajectories and individual-surroundings interactions, for human trajectory prediction.
We divide historical information into two categories: historical group trajectories (HGT) and individual-surroundings interactions (ISI). HGT refers to the group-level representation of all historical trajectories in a scene. The reason for using HGT is that, given a new target person in the scene, his/her path is more likely to be similar to one of the group trajectories than to any single historical trajectory instance, due to the aforementioned randomness and subjectivity. However, group trajectories are less tied to an individual's past states and the corresponding surroundings, which also affect the individual's future trajectory; ISI is therefore needed to exploit historical information more fully by extracting contextual information. Existing methods rarely consider the similarity between an individual's past trajectory and historical trajectories. Most attempts only explore the interaction between the individual and the environment, spending much effort on modeling the individual trajectory, the semantic information of the environment, and the relationship between them. Although MANTRA models similarity with an encoder trained in a reconstruction manner, and MemoNet simplifies similarity by storing the intents of historical trajectories, both perform similarity computation at the instance level rather than the group level, which makes them sensitive to the capability of the trained encoder. Based on the above analysis, we propose a simple yet effective framework, the Scene History Mining Network (SHENet), which jointly utilizes HGT and ISI for HTP. The framework consists of two main components: (i) the Group Trajectory Bank (GTB) module, and (ii) the Cross-Modal Interaction (CMI) module. GTB constructs representative group trajectories from all historical individual trajectories and provides candidate paths for future trajectory prediction. CMI encodes the observed individual trajectory and the surrounding environment separately and models their interaction with a cross-modal transformer to refine the retrieved candidate trajectory.
Furthermore, to alleviate the uncertainty caused by the above two characteristics (i.e., randomness and subjectivity), we introduce curve smoothing (CS) into the training process and into the current evaluation metrics, average and final displacement error (ADE and FDE), obtaining two new metrics, CS-ADE and CS-FDE. To facilitate the development of HTP research, we also collected a new, challenging dataset with diverse movement patterns, named PAV, by selecting videos with fixed camera views and complex human motion from the MOT15 dataset.
The contributions of this work can be summarized as follows: 1) we introduce group history for retrieving individual trajectories in HTP; 2) we propose a simple yet effective framework, SHENet, that jointly utilizes two types of scene history (i.e., historical group trajectories and individual-surroundings interactions) for HTP; 3) we construct a new challenging dataset, PAV, and, considering the randomness and subjectivity of human movement patterns, propose a novel loss function and two new metrics to provide better HTP baselines; 4) we conduct comprehensive experiments on ETH, UCY, and PAV to demonstrate the superior performance of SHENet and the efficacy of each component.
Unimodal methods. Unimodal methods rely on learning the regularity of individual movements from past trajectories to predict future trajectories. For example, Social LSTM models the interactions between individual trajectories through a social pooling module. STGAT uses an attention module to learn spatial interactions and assign reasonable importance to neighbors. PIE uses a temporal attention module to compute the importance of the observed trajectory at each time step.
Multimodal methods. In addition to trajectories, multimodal methods also examine the impact of environmental information on HTP. SS-LSTM proposes a scene interaction module to capture global scene information. Trajectron uses graph structures to model trajectories and their interactions with environmental information and other agents. MANTRA leverages an external memory to model long-term dependencies: it stores historical single-agent trajectories in memory and encodes environmental information to refine trajectories retrieved from this memory.
Differences from previous work. Both unimodal and multimodal methods use only single or partial aspects of scene history, ignoring historical group trajectories. In our work, we integrate scene history information in a more comprehensive way and propose dedicated modules to handle the different types of information. The main differences between our method and previous work, especially memory-based and clustering-based methods, are as follows: i) MANTRA and MemoNet consider historical individual trajectories, while our proposed SHENet focuses on historical group trajectories, which are more universal across different scenarios; ii) some works group a person's neighbors for trajectory prediction, or cluster trajectories into a fixed number of categories for trajectory classification, whereas our SHENet generates representative trajectories as references for individual trajectory prediction.
The architecture of the proposed Scene History Mining Network (SHENet) is shown in Figure 2. It consists of two main components: the Group Trajectory Bank (GTB) module and the Cross-Modal Interaction (CMI) module. Formally, given all trajectories observed in videos of the scene, the scene image, and the target person's past trajectory X = {x_t | t = 1, ..., T_obs} over the last T_obs time steps, where x_t denotes the person's position at time step t, SHENet predicts the pedestrian's future positions Ŷ = {ŷ_t | t = T_obs+1, ..., T_obs+T_pred} over the next T_pred frames so that they are as close as possible to the ground-truth trajectory Y. The proposed GTB first compresses all historical trajectories into representative group trajectories. The observed trajectory X is then used as a key to retrieve the closest representative group trajectory, whose future part Ŷ_c serves as the candidate future trajectory. At the same time, the past trajectory and the scene image are passed to a trajectory encoder and a scene encoder, respectively, to produce trajectory features and scene features. The encoded features are fed into a cross-modal transformer to learn the offset ΔŶ from the ground-truth trajectory. By adding ΔŶ to Ŷ_c, we obtain the final prediction Ŷ. During training, if the distance between Ŷ and Y exceeds a threshold θ, the person's trajectory (i.e., X and Y) is added to the trajectory bank. After training, the bank is fixed for inference.
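To make the data flow concrete, the following is a minimal Python sketch of this forward pass. Every name in it (bank, traj_encoder, cmt, and so on) is a placeholder of ours, not the authors' API; the shapes follow the notation above.

```python
# Minimal sketch of SHENet's forward pass; all names are our own placeholders,
# not the authors' code.
def shenet_forward(past_traj, scene_img, bank, traj_encoder, scene_encoder, cmt):
    candidate = bank.search(past_traj)    # GTB: future part of the closest group trajectory
    f_traj = traj_encoder(past_traj)      # (T_obs, D) motion features
    f_scene = scene_encoder(scene_img)    # (C, D) semantic scene features
    offset = cmt(f_traj, f_scene)         # (T_pred, 2) refinement offset
    return candidate + offset             # final prediction Y_hat
    # During training: if error(Y_hat, Y) > theta, add (X, Y) to the bank.
```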
Figure 2: SHENet's architecture consists of two components: the Group Trajectory Bank (GTB) module and the Cross-Modal Interaction (CMI) module. GTB clusters all historical trajectories into a set of representative group trajectories and provides candidates for the final trajectory prediction. In the training phase, GTB can incorporate the target person's trajectory into the bank based on the error of the predicted trajectory, to expand its expressive capability. CMI feeds the target person's past trajectory and the observed scene into the trajectory encoder and scene encoder, respectively, for feature extraction, then models the interaction between the past trajectory and its surroundings through the cross-modal transformer and refines the candidate trajectory.
Figure 3: Illustration of the cross-modal transformer. Trajectory features and scene features are fed into the cross-modal transformer to learn the offset between the retrieved trajectory and the ground-truth trajectory.
The Group Trajectory Bank (GTB) module builds representative group trajectories for the scene. Its core functions are bank initialization, trajectory search, and trajectory update.
Trajectory bank initialization. Because the large number of recorded trajectories is highly redundant, we do not use them directly; instead, we generate a sparse set of representative trajectories as the initial value of the trajectory bank B. Specifically, we denote the trajectories in the training data as T = {T_1, ..., T_N} and divide each T_i into a pair of observed trajectory X_i and future trajectory Y_i, so that T is split into an observation set and the corresponding future set. We then compute the Euclidean distance between each pair of trajectories in T and obtain K trajectory clusters via the K-medoids clustering algorithm. The initial members of B are the averages of the trajectories belonging to each cluster (see Algorithm 1, step 1). Each trajectory in B represents the movement pattern of one group of people.
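The initialization step can be sketched as follows. This is a minimal NumPy rendering under our own naming: the text above specifies only K-medoids clustering under Euclidean distance with cluster means as the initial bank members, so the loop below is a plain PAM-style iteration and omits any refinements the authors may use.

```python
import numpy as np

def init_bank(trajs, K, n_iters=50, seed=0):
    """Bank initialization (sketch): cluster whole trajectories (observed +
    future, flattened) with a simple K-medoids loop under Euclidean distance,
    then store each cluster's mean trajectory as an initial bank entry
    (Algorithm 1, step 1). Empty clusters keep their previous medoid."""
    rng = np.random.default_rng(seed)
    X = trajs.reshape(len(trajs), -1)                     # (N, (T_obs+T_pred)*2)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)  # (N, N) pairwise distances
    medoids = rng.choice(len(X), size=K, replace=False)
    for _ in range(n_iters):
        labels = D[:, medoids].argmin(axis=1)             # nearest-medoid assignment
        new_medoids = medoids.copy()
        for k in range(K):
            members = np.flatnonzero(labels == k)
            if len(members):                              # pick member minimizing
                within = D[np.ix_(members, members)].sum(axis=1)  # within-cluster cost
                new_medoids[k] = members[within.argmin()]
        if np.array_equal(new_medoids, medoids):          # converged
            break
        medoids = new_medoids
    labels = D[:, medoids].argmin(axis=1)
    # Each bank entry is the mean of the trajectories in its cluster.
    return np.stack([trajs[labels == k].mean(axis=0) for k in range(K)])
```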
Trajectory search and update. In the group trajectory bank, each trajectory can be viewed as a past-future pair: B = {(X^b_k, Y^b_k)}, k = 1, ..., N_B, where each entry combines a past trajectory with a future trajectory and N_B is the number of past-future pairs in B. Given a trajectory with observed part X, we use X as a key to compute its similarity score against the past parts stored in B and retrieve the representative trajectory with the maximum similarity score (see Algorithm 1, step 2). The similarity function is computed between the observed trajectory and each stored past trajectory.
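Since the exact similarity function is not reproduced above, the sketch below assumes similarity is the negative Euclidean distance between observed trajectories, consistent with the distance used for clustering; the paper's actual equation may differ.

```python
import numpy as np

def search(bank, obs):
    """Bank search (sketch): use the observed trajectory as the key, score each
    past-future pair by the similarity of its past part, and return the future
    part of the best match. Similarity = negative Euclidean distance is our
    assumption, matching the distance used for clustering."""
    t_obs = obs.shape[0]
    past = bank[:, :t_obs]                                  # (N_B, T_obs, 2) past parts
    scores = -np.linalg.norm((past - obs).reshape(len(bank), -1), axis=1)
    best = scores.argmax()                                  # maximum-similarity entry
    return bank[best, t_obs:]                               # candidate future trajectory
```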
By adding the offset ΔŶ (see Equation 2) to the retrieved representative trajectory Ŷ_c, we obtain the predicted trajectory Ŷ for the observed person (see Figure 2). Although the initial trajectory bank works well in most cases, to improve the generalization of the bank we decide whether to update it based on a distance threshold θ (see Algorithm 1, step 3).
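The training-time update rule then reduces to a threshold test on the prediction error. A sketch, again with our own naming and an ADE-style error as an assumption:

```python
import numpy as np

def maybe_update(bank, obs, future, pred, theta):
    """Bank update (sketch of Algorithm 1, step 3): if the prediction error
    w.r.t. the ground truth exceeds the threshold theta, append the person's
    own past-future pair as a new group trajectory. The ADE-style error is
    our assumption."""
    err = np.linalg.norm(pred - future, axis=-1).mean()     # mean per-step distance
    if err > theta:
        new_entry = np.concatenate([obs, future], axis=0)[None]
        bank = np.concatenate([bank, new_entry], axis=0)
    return bank
```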
This module focuses on the interaction between an individual's past trajectory and environmental information. It consists of two unimodal encoders, which learn human motion and scene information respectively, and a cross-modal transformer that models their interaction.
Trajectory encoder. The trajectory encoder uses the multi-head attention structure of the Transformer network, built from self-attention (SA) layers. The SA layers capture human motion across the T_obs time steps (input size T_obs × 2) and project the motion features from dimension 2 to D, where D is the embedding dimension of the trajectory encoder. The trajectory encoder thus yields the human motion representation F_traj ∈ R^(T_obs × D).
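A minimal PyTorch sketch of such an encoder, assuming a learned linear embedding from (x, y) to D = 512 and the 4 SA layers mentioned in the implementation details; positional encodings and other details are omitted.

```python
import torch
import torch.nn as nn

class TrajEncoder(nn.Module):
    """Trajectory encoder (sketch): embed 2-D positions into D = 512 and apply
    4 self-attention layers. Layer counts follow the implementation details;
    the linear embedding is our assumption."""
    def __init__(self, d_model=512, n_layers=4, n_heads=8):
        super().__init__()
        self.embed = nn.Linear(2, d_model)                  # (x, y) -> D
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, traj):                                # traj: (B, T_obs, 2)
        return self.encoder(self.embed(traj))               # (B, T_obs, D)
```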
Scene encoder. Since the pre-trained Swin Transformer has compelling feature-representation performance, we adopt it as the scene encoder. It extracts scene semantic features of size C × H × W, where C (150 in the pre-trained scene encoder) is the number of semantic classes, such as person and road, and H and W are the spatial resolutions. To let subsequent modules easily fuse the motion representation with the environmental information, we reshape the semantic features from C × H × W to C × HW and project them from dimension C × HW to C × D through multi-layer perceptron layers. The scene encoder thus yields the scene representation F_scene ∈ R^(C × D).
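The reshape-and-project step can be sketched as follows; the Swin backbone itself is omitted (it is pre-trained on ADE20K and frozen), and the two-layer MLP depth is our assumption, since the text only says "multi-layer perceptron layers".

```python
import torch.nn as nn

class SceneProjector(nn.Module):
    """Scene-feature projection (sketch): reshape frozen-backbone semantic
    features (150, 56, 56) -> (150, 3136), then project to (150, 512) with an
    MLP. The two-layer MLP is our assumption."""
    def __init__(self, hw=56 * 56, d_model=512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(hw, d_model), nn.ReLU(),
                                 nn.Linear(d_model, d_model))

    def forward(self, feats):                               # feats: (B, 150, 56, 56)
        flat = feats.flatten(2)                             # (B, 150, 3136)
        return self.mlp(flat)                               # (B, 150, 512)
```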
Cross-modal transformer. The unimodal encoders extract features from their own modalities and ignore the interaction between human motion and environmental information. A cross-modal transformer with L layers aims to refine the candidate trajectory by learning this interaction (see Section 3.2). We adopt a two-stream structure: one stream captures the important human motions constrained by environmental information, and the other picks out the environmental information relevant to human motion. Cross-attention (CA) and self-attention (SA) layers are the main components of the cross-modal transformer (see Figure 3). To capture environment-affected human motion and motion-related environmental information, each CA layer treats one modality as the query and the other modality as the key and value, letting the two modalities interact. The SA layers promote better internal connections by computing the similarity between each element (query) and the other elements (keys) within the scene-constrained motion or the motion-related environmental information. We thus obtain the multimodal representations H_traj and H_scene from the cross-modal transformer. To predict the offset between the retrieved trajectory Ŷ_c and the ground-truth trajectory Y, we take the last element (LE) of H_traj and the output of a global pooling layer (GPL) over H_scene. The offset can be expressed as:
ΔŶ = MLP([LE(H_traj); GPL(H_scene)]), where [;] denotes vector concatenation and MLP is a multi-layer perceptron. We train the overall SHENet framework end-to-end to minimize the objective function. During training, since the scene encoder has been pre-trained on ADE20K, we freeze its segmentation part and update only the parameters of the MLP head (see Section 3.3). Following existing work, on the ETH/UCY datasets we compute the mean squared error (MSE) between predicted and ground-truth trajectories: L_MSE = (1/T_pred) Σ_t ||ŷ_t − y_t||².
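A sketch of one two-stream layer and the offset head in PyTorch. Layer norms, feed-forward sublayers, and other standard transformer details are omitted for brevity; T_pred = 12 is shown as in ETH/UCY and is configurable.

```python
import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    """One two-stream cross-modal layer (sketch): each stream cross-attends to
    the other modality, then self-attends. Norms/FFNs omitted for brevity."""
    def __init__(self, d=512, h=8):
        super().__init__()
        self.ca_t = nn.MultiheadAttention(d, h, batch_first=True)  # traj queries scene
        self.ca_s = nn.MultiheadAttention(d, h, batch_first=True)  # scene queries traj
        self.sa_t = nn.MultiheadAttention(d, h, batch_first=True)
        self.sa_s = nn.MultiheadAttention(d, h, batch_first=True)

    def forward(self, ft, fs):
        ft = ft + self.ca_t(ft, fs, fs)[0]                  # scene-constrained motion
        fs = fs + self.ca_s(fs, ft, ft)[0]                  # motion-related environment
        ft = ft + self.sa_t(ft, ft, ft)[0]                  # internal connections
        fs = fs + self.sa_s(fs, fs, fs)[0]
        return ft, fs

class OffsetHead(nn.Module):
    """Offset head (sketch): concatenate the last element (LE) of the motion
    stream with the globally pooled (GPL) scene stream, then regress the
    T_pred x 2 offset with an MLP."""
    def __init__(self, d=512, t_pred=12):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(),
                                 nn.Linear(d, t_pred * 2))
        self.t_pred = t_pred

    def forward(self, ht, hs):                              # ht: (B, T_obs, d), hs: (B, C, d)
        z = torch.cat([ht[:, -1], hs.mean(dim=1)], dim=-1)  # [LE; GPL]
        return self.mlp(z).view(-1, self.t_pred, 2)         # offset delta-Y_hat
```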
On the more challenging PAV dataset, we instead use a curve-smoothing (CS) regression loss, which helps reduce the impact of individual bias. It computes the MSE after smoothing the ground-truth trajectory: L_CS = (1/T_pred) Σ_t ||ŷ_t − CS(y)_t||², where CS denotes the curve-smoothing function [2].
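A sketch of this loss in Python. The paper cites [2] for the smoothing function, but its exact form is not reproduced here, so we use a Savitzky-Golay filter as a stand-in smoother.

```python
import numpy as np
from scipy.signal import savgol_filter

def cs_loss(pred, gt, window=5, poly=2):
    """Curve-smoothing loss (sketch): MSE against a smoothed ground-truth
    trajectory. The Savitzky-Golay filter is a stand-in; the paper's exact
    CS function [2] may differ. pred, gt: (T_pred, 2) arrays."""
    gt_smooth = savgol_filter(gt, window, poly, axis=0)     # smooth along time
    return np.mean((pred - gt_smooth) ** 2)
```

The same smoothing underlies the CS-ADE and CS-FDE metrics described below: replace the squared error with the L2 displacement, averaged over all steps or taken at the final step.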
Datasets. We evaluate our method on the ETH, UCY, PAV, and Stanford Drone Dataset (SDD) datasets. Unimodal methods use only trajectory data, whereas multimodal methods also consider scene information. Compared with the ETH/UCY datasets, PAV is more challenging, with multiple motion modes; it comprises PETS09-S2L1 (PETS), ADL-Rundle-6 (ADL), and Venice-2 (VENICE). These sequences are captured by static cameras and provide sufficient trajectories for HTP tasks. We split the videos into a training set (80%) and a test set (20%); PETS/ADL/VENICE contain 2,370/2,935/4,200 training sequences and 664/306/650 test sequences, respectively. We use the observed frames to predict the future frames, allowing the long-term prediction results of different methods to be compared.
Unlike the ETH/UCY and PAV datasets, SDD is a large-scale dataset captured from a bird's-eye view of a university campus. It consists of multiple interacting agents (e.g. pedestrians, cyclists and cars) and different scenarios (e.g. sidewalks and intersections). Following previous work, we use the past 8 frames to predict the future 12 frames.
Figure 4: Illustration of our proposed metrics CS-ADE and CS-FDE.
Figure 5: Visualization of some samples after curve smoothing.
Evaluation metrics. For the ETH and UCY datasets, we use the standard HTP metrics: average displacement error (ADE) and final displacement error (FDE). ADE is the average error between the predicted and ground-truth trajectories over all time steps, and FDE is their error at the final time step. The trajectories in PAV contain jitter (e.g., sharp turns), so under the traditional ADE/FDE metrics a reasonable prediction may yield approximately the same error as an unrealistic one (see Figure 7(a)). To focus on the pattern and shape of the trajectory itself and reduce the impact of randomness and subjectivity, we propose the CS metrics CS-ADE and CS-FDE (Figure 4). CS-ADE computes the average displacement error after smoothing: CS-ADE = (1/T_pred) Σ_t ||ŷ_t − CS(y)_t||,
where CS is the curve-smoothing function, defined as in the L_CS loss of Section 3.4. Similarly, CS-FDE computes the final displacement error after smoothing the ground-truth trajectory.

Implementation details. In SHENet, the initial size of the group trajectory bank equals the number of clusters K (32 in our experiments; see the ablation below). Both the trajectory encoder and the scene encoder have 4 self-attention (SA) layers, and the cross-modal transformer has 6 layers of cross-attention (CA) and self-attention (SA). All embedding dimensions are set to 512. The trajectory encoder learns human motion information of size T_obs × 512 (T_obs differs between ETH/UCY and PAV). The scene encoder outputs semantic features of size 150 × 56 × 56; we reshape them to 150 × 3136 and project them to 150 × 512. We train the model for 100 epochs on 4 NVIDIA Quadro RTX 6000 GPUs, using the Adam optimizer with a fixed learning rate of 1e-5.

Ablation experiments. In Table 1, we evaluate each component of SHENet: the Group Trajectory Bank (GTB) module and the Cross-Modal Interaction (CMI) module, the latter comprising the trajectory encoder (TE), the scene encoder (SE), and the cross-modal transformer (CMT).
Impact of GTB. We first study the performance of GTB. Compared with CMI alone (i.e., TE, SE, and CMT), adding GTB improves FDE on PETS by 21.2%, a significant improvement that illustrates the importance of GTB. However, GTB alone (Table 1, row 1) is not sufficient, and even performs slightly worse than CMI. We therefore explore the role of each part of the CMI module.
Influence of TE and SE. To evaluate TE and SE, we concatenate the trajectory features extracted by TE with the scene features extracted by SE (Table 1, row 3); this improves performance on ADL and VENICE, which have smaller movements, compared with TE alone. It shows that incorporating environmental information into trajectory prediction can improve accuracy.
Effect of CMT. Compared with the third row of Table 1, CMT (Table 1, row 4) significantly improves model performance. Notably, it outperforms the simple concatenation of TE and SE on PETS, improving ADE by 7.4%. Compared with GTB alone, the full CMI improves ADE by an average of 12.2%.
We compare our model with state-of-the-art methods on the ETH/UCY dataset: SS-LSTM, Social-STGCNN, MANTRA, AgentFormer, and YNet. The results are summarized in Table 2. Our model reduces the average FDE from 0.39 to 0.36, a 7.7% improvement over the state-of-the-art method YNet. In particular, when trajectories undergo large movements, our model significantly outperforms previous methods on ETH, improving ADE and FDE by 12.8% and 15.3%, respectively.
Table 2: Comparison with state-of-the-art (SOTA) methods on the ETH/UCY dataset. * indicates that a smaller set is used than in the unimodal methods. Evaluation uses the best of the top 20 predictions.
Table 3: Comparison with SOTA methods on the PAV dataset.
To evaluate the performance of our model in long-term prediction, we conduct experiments on PAV with the longer observation and prediction horizons described above. Table 3 compares against previous HTP methods: SS-LSTM, Social-STGCNN, Next, MANTRA, and YNet. Compared with the latest results of YNet, the proposed SHENet achieves average improvements of 3.3% in CS-ADE and 10.5% in CS-FDE. Since YNet predicts trajectory heatmaps, it performs better when trajectories have small movements (e.g., VENICE). Nonetheless, our method remains competitive on VENICE and is significantly better than other methods on PETS, which has larger motions and intersections; in particular, it improves CS-FDE by 16.2% over YNet on PETS. We also obtain substantial improvements under the traditional ADE/FDE metrics.
The distance threshold θ determines when the trajectory bank is updated. Typical values of θ depend on the trajectory length: the absolute prediction error is generally larger when the ground-truth trajectory is longer in pixels, but the relative errors are comparable. Therefore, once the errors converge, θ is set to 75% of the training error. In our experiments, we set θ = 25 on PETS and θ = 6 on ADL. The "75% of training error" rule is derived from the experimental results, as shown in Table 4.
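As a sketch, the rule amounts to the following; averaging over a trailing window of iterations is our own assumption, since the text only says the errors should have converged.

```python
import numpy as np

def pick_theta(train_errors, frac=0.75, window=100):
    """Threshold rule (sketch): once training errors have converged, set the
    bank-update threshold to 75% of the training error. The trailing-window
    average is our assumption."""
    return frac * float(np.mean(train_errors[-window:]))
```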
Table 4: Comparison of different thresholds θ on the PAV dataset. The results are averaged over the three scenes.
Table 5: Comparison of the initial number of clusters K on the PAV dataset.
Number of clusters K in K-medoids. We study the effect of different initial cluster numbers K, as shown in Table 5. The prediction results are not sensitive to the initial number of clusters, especially in the range 24-36. We therefore set K to 32 in our experiments.
Bank complexity analysis. The time complexity of search and update is O(N_B) and O(1), respectively, and their space complexity is O(N_B), where the number of group trajectories N_B ≤ 1000. The clustering process has time complexity O(K(N − K)² I) and space complexity O(N²) for a PAM-style K-medoids, where N is the number of clustered trajectories, K is the number of clusters, and I is the number of iterations of the clustering method.
Figure 6: Qualitative visualization of our approach and state-of-the-art methods. The blue line is the observed trajectory; the red and green lines show the predicted and ground-truth trajectories, respectively.
Figure 7: Qualitative visualization without/with CS.
Figure 6 shows qualitative results of SHENet and other methods. Notably, in the extremely challenging case where a person walks to the roadside and then turns back (green curve), all other methods fail to handle it, while our proposed SHENet still can. We attribute this to our specially designed historical group trajectory bank. Furthermore, in contrast to the memory-based method MANTRA [20], we search over group trajectories rather than individual ones, which is more versatile and applicable to more challenging scenarios. Figure 7 shows qualitative results for YNet and our SHENet without/with curve smoothing (CS). The first row shows results trained with the MSE loss: affected by noisy past trajectories (e.g., sudden sharp turns), YNet's predicted trajectory points cluster together without a clear direction, whereas our method provides a plausible path based on historical group trajectories. The two predictions are visually different, yet their numerical errors (ADE/FDE) are approximately the same. In contrast, the second row of Figure 7 shows qualitative results with our proposed CS loss: CS significantly reduces the impact of randomness and subjectivity and produces reasonable predictions for both YNet and our method.
Original link: https://mp.weixin.qq.com/s/GE-t4LarwXJu2MC9njBInQ