ICLR'24, a new map-free idea! LaneSegNet: map learning based on lane segment perception
Maps are key information for downstream applications of an autonomous driving system and are usually represented by lanelines or centerlines. However, the existing map learning literature mainly focuses on either detecting geometry-based lanelines or perceiving the topological relationships of centerlines. Both approaches ignore the intrinsic relationship between lanelines and centerlines, namely that lanelines bind centerlines. While simply predicting both types of lanes in one model leads to mutually exclusive learning objectives, this paper advocates the lane segment as a new representation that seamlessly combines geometric and topological information, and accordingly proposes LaneSegNet, the first end-to-end mapping network that generates lane segments to obtain a complete representation of the road structure. LaneSegNet features two key modifications. One is the lane attention module, which captures pivotal region details within a long-range feature space. The other is an identical initialization strategy for reference points, which enhances the learning of positional priors for lane attention. On the OpenLane-V2 dataset, LaneSegNet shows substantial gains over previous counterparts across three tasks, namely map element detection (+4.8 mAP), centerline perception (+6.9 DET_l), and the newly defined lane segment perception (+5.6 mAP). In addition, it achieves a real-time inference speed of 14.7 FPS.
Open source link: https://github.com/OpenDriveLab/LaneSegNet
In summary, the main contributions of this article are as follows:
1) Lane segment perception is advocated as a new map learning formulation that seamlessly combines the geometric and topological information of lanelines and centerlines.
2) LaneSegNet, the first end-to-end network that directly generates lane segments, is proposed, featuring a lane attention module with a heads-to-regions mechanism and an identical initialization strategy for reference points.
3) On the OpenLane-V2 dataset, LaneSegNet achieves substantial gains over previous counterparts on map element detection, centerline perception, and the newly defined lane segment perception, while running in real time at 14.7 FPS.
Centerline Perception: Centerline perception from vehicle-mounted sensor data (equivalent to the lane map learning in this paper) has attracted a great deal of attention recently. STSU proposed a DETR-like network to detect centerlines, followed by a multilayer perceptron (MLP) module to determine their connectivity. Building on STSU, Can et al. introduced an additional minimal-cycle query to ensure the correct order of overlapping lines. CenterLineDet treats centerlines as vertices and designs a graph-update model trained through imitation learning. Notably, Tesla proposed the concept of a "language of lanes" to express the lane map as a sentence; their attention-based model recursively predicts lanes and their connectivity. Different from these piece-wise methods, LaneGAP introduces a path-wise method that recovers the lane map with an additional conversion algorithm. TopoNet targets complete and diverse driving scene graphs, explicitly models the connectivity of centerlines within the network, and incorporates traffic elements into the task. In this work, we adopt the piece-wise approach to construct the lane graph. However, we differ from previous methods in that we model lane segments rather than taking the centerline as the vertex of the lane graph, which allows convenient integration of segment-level geometric and semantic information.
Map element detection: Earlier work focused on lifting map element detection from the camera plane to 3D space to overcome projection errors. With the rise of BEV perception, recent works learn HD maps using segmentation and vectorization methods. Map segmentation predicts the semantics of each BEV grid cell, such as lanes, crosswalks, and drivable areas; these works differ mainly in their perspective-view (PV) to BEV transformation modules. However, segmentation maps cannot directly provide the information needed by downstream modules. HDMapNet addresses this problem by grouping and vectorizing the segmentation maps with complex post-processing.
Although dense segmentation provides pixel-level information, it still cannot capture the complex relationships among overlapping elements. VectorMapNet proposes to represent each map element directly as a sequence of points, using coarse keypoints to sequentially decode lane locations. MapTR explores a unified permutation-based point-sequence modeling approach to eliminate modeling ambiguity and improve performance and efficiency. PivotNet further models map elements with a pivot-based representation in a set prediction framework to reduce redundancy and improve accuracy. StreamMapNet utilizes multi-point attention and temporal information to improve the stability of long-range map element detection. In fact, since vectorization also provides the direction information of lanes, vectorization-based methods can be easily adapted to centerline perception by changing their supervision. In this work, we propose a unified, easy-to-learn representation, the lane segment, for all HD map elements on the road.
An instance of a lane segment contains both the geometric and semantic aspects of the road. In terms of geometry, it can be represented as a vectorized centerline together with its corresponding left and right lane boundaries, each defined as an ordered collection of points in 3D space. Alternatively, the geometry can be described as a closed polygon that outlines the drivable area within that lane.
In terms of semantics, a lane segment includes its category (e.g., lane segment, pedestrian crossing) and the line type of its left and right lane boundaries (e.g., non-visible, solid, dashed). These attributes provide autonomous vehicles with important cues about deceleration requirements and the feasibility of lane changes.
In addition, topology information plays a crucial role in path planning. To represent it, a lane graph G = (V, E) is constructed over lane segments: each lane segment is a node in the set V, and the edges in the set E describe the connectivity between lane segments. We store this lane graph as an adjacency matrix, where entry (i, j) is set to 1 only when the j-th lane segment follows the i-th lane segment, and remains 0 otherwise. An illustrative sketch of this representation is given below.
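To make the representation concrete, here is a minimal illustrative sketch (not the authors' code; the field names and the successor dictionary are assumptions) of a lane segment record and of building the adjacency matrix described above:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class LaneSegment:
    """Illustrative container for one lane segment (field names are assumptions)."""
    centerline: np.ndarray   # (N, 3) ordered points in 3D
    left_bound: np.ndarray   # (N, 3) left lane boundary
    right_bound: np.ndarray  # (N, 3) right lane boundary
    category: int            # e.g. 0 = lane segment, 1 = pedestrian crossing
    left_type: int           # e.g. 0 = non-visible, 1 = solid, 2 = dashed
    right_type: int

def build_adjacency(successors: dict, num_segments: int) -> np.ndarray:
    """Adjacency matrix A where A[i, j] = 1 iff segment j follows segment i."""
    A = np.zeros((num_segments, num_segments), dtype=np.int64)
    for i, js in successors.items():
        for j in js:
            A[i, j] = 1
    return A

# Example: segment 1 follows segment 0; segments 2 and 3 follow segment 1.
A = build_adjacency({0: [1], 1: [2, 3]}, num_segments=4)
```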
The overall framework of LaneSegNet is shown in Figure 2. LaneSegNet takes surround-view images as input and perceives lane segments within a specified BEV range. In this section, we first briefly introduce the LaneSeg encoder used to generate BEV features. We then describe the LaneSeg decoder and the lane attention module. Finally, we present the LaneSeg predictor along with the training losses.
LaneSeg Encoder
The encoder converts the surround-view images into BEV features for lane segment extraction. We use a standard ResNet-50 backbone to extract feature maps from the raw images, and a BEVFormer-style PV-to-BEV module then performs the view transformation. A simplified structural sketch is given below.
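The following is a heavily simplified stand-in for this encoder, assuming a single cross-attention from learnable BEV queries to flattened multi-camera features; the actual BEVFormer module uses spatial deformable cross-attention with camera geometry and temporal self-attention, which are omitted here:

```python
import torch
import torch.nn as nn
import torchvision

class TinyBEVEncoder(nn.Module):
    """Structural sketch only: ResNet-50 features + one cross-attention layer
    standing in for the BEVFormer PV-to-BEV module (shapes are assumptions)."""
    def __init__(self, bev_h=50, bev_w=100, dim=256):
        super().__init__()
        backbone = torchvision.models.resnet50(weights=None)
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])  # C5 feature map
        self.reduce = nn.Conv2d(2048, dim, kernel_size=1)
        self.bev_queries = nn.Parameter(torch.randn(bev_h * bev_w, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.bev_h, self.bev_w = bev_h, bev_w

    def forward(self, imgs):                                  # imgs: (B, n_cam, 3, H, W)
        B, N, C, H, W = imgs.shape
        feats = self.reduce(self.backbone(imgs.flatten(0, 1)))  # (B*N, D, h, w)
        feats = feats.flatten(2).transpose(1, 2)                 # (B*N, h*w, D)
        feats = feats.reshape(B, -1, feats.shape[-1])            # (B, N*h*w, D)
        q = self.bev_queries.unsqueeze(0).expand(B, -1, -1)
        bev, _ = self.cross_attn(q, feats, feats)                # (B, bev_h*bev_w, D)
        return bev.transpose(1, 2).reshape(B, -1, self.bev_h, self.bev_w)
```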
LaneSeg Decoder
Transformer-based detection methods use a decoder to aggregate information from the BEV features and update the decoder queries through multiple layers. Each decoder layer applies self-attention, cross-attention, and a feed-forward network to update the queries, and learnable positional queries are also employed. The updated queries are then output and fed to the next stage.
Due to the complex and elongated geometry of map elements, collecting long-range BEV features is crucial for online mapping. Previous work uses hierarchical (instance-point) decoder queries with deformable attention to extract local features for each point query. However, this design hinders the capture of long-range information and comes with a high computational cost due to the increased number of queries.
As an instance-level representation for constructing the scene graph, the lane segment has desirable properties at the instance level. Our goal is to represent each lane segment with a single instance query rather than multiple point queries. The core challenge is therefore how a single instance query can cross-attend to the global BEV features.
Lane Attention: In object detection, deformable attention exploits the positional prior of the target and attends only to a small set of values near the target's reference point as a pre-filter, which greatly accelerates convergence. Across decoder layers, a reference point is placed at the center of the predicted target to refine the sampling locations, which are dispersed around the reference point via learnable sampling offsets. The deliberate initialization of the sampling offsets encodes a geometric prior of the 2D target. In this way, the multi-head mechanism can capture features in each direction well, as shown in Figure 3a.
In the context of map learning, Li et al. used naive deformable attention to predict centerlines. However, as shown in Figure 3b, the naive placement of the reference points may prevent it from obtaining long-range attention. Furthermore, due to the elongated shape of the targets and the complex visual cues (e.g., accurately predicting the breakpoint between a solid and a dashed line), our task requires additional adaptive design. Considering all these characteristics, the network needs the ability not only to attend to long-range contextual information but also to accurately extract local details. It is therefore desirable to distribute the sampling locations over a large area to effectively perceive long-range information; on the other hand, local details should remain easily distinguishable so that key regions can be identified. Notably, although value features compete within a single attention head, value features from different heads are preserved through the attention process. It is thus promising to explicitly exploit this property to promote attention to the local features of specific regions.
To this end, we propose a heads-to-regions mechanism. We first distribute multiple reference points evenly within the lane segment region. The sampling locations are then initialized around each reference point in its local area. To preserve complex local details, we use the multi-head mechanism, where each head focuses on a specific set of sampling locations within a local region, as shown in Figure 3c. A sketch of this reference point distribution is given below.
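Below is a minimal sketch of how such a heads-to-regions initialization could look (the shapes, the ring-shaped offset pattern, and the interpolation scheme are our assumptions, not the paper's exact implementation):

```python
import torch

def distribute_reference_points(centerline: torch.Tensor, num_heads: int) -> torch.Tensor:
    """Evenly place one reference point per head along a predicted centerline.
    centerline: (N, 2) BEV points of one lane segment prediction (assumed input).
    Returns (num_heads, 2) reference points obtained by linear interpolation."""
    N = centerline.shape[0]
    t = torch.linspace(0, N - 1, num_heads)          # fractional vertex indices
    lo = t.floor().long().clamp(max=N - 2)
    frac = (t - lo.float()).unsqueeze(-1)
    return centerline[lo] * (1 - frac) + centerline[lo + 1] * frac

def init_local_sampling_offsets(num_heads: int, num_points: int, radius: float = 1.0) -> torch.Tensor:
    """Initial sampling offsets for each head, spread in a small ring around that
    head's own reference point (the heads-to-regions idea sketched in Fig. 3c)."""
    angles = torch.arange(num_points) * (2 * torch.pi / num_points)
    ring = torch.stack([angles.cos(), angles.sin()], dim=-1) * radius  # (num_points, 2)
    return ring.unsqueeze(0).expand(num_heads, -1, -1)                 # (H, K, 2)

# Sampling location for head h, point k: refs[h] + offsets[h, k]
# (the offsets are learnable in practice; this only shows the initialization).
```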
A mathematical description of the lane attention module is now provided. Given the BEV features, the i-th lane segment query feature q_i, and a set of reference points p_i as input, lane attention is calculated as follows:
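The formula itself was not preserved in this article. A deformable-attention-style reconstruction consistent with the heads-to-regions description above, with H heads, K sampling points per head, and our own symbols W_h, W'_h, A_ihk, and Δp_ihk, would read:

```latex
\mathrm{LaneAttn}\big(q_i, \{p_i^h\}, X\big) =
  \sum_{h=1}^{H} W_h \Big[ \sum_{k=1}^{K} A_{ihk} \, W'_h \, X\!\big(p_i^h + \Delta p_{ihk}\big) \Big]
```

where p_i^h is the reference point assigned to head h, Δp_ihk are learnable sampling offsets, A_ihk are attention weights predicted from q_i, and X(·) denotes bilinear sampling of the BEV features at the given location.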
Identical initialization of reference points: The locations of the reference points largely determine how well the lane attention module functions. To align the region of interest of each instance query with its actual geometry and location, the reference points p of each instance query are distributed according to the lane segment predicted by the previous layer, as shown in Figure 3c, and the predictions are refined iteratively.
Previous work initializes the reference points fed to the first layer individually, with learnable priors derived from the positional query embeddings. However, since the positional query is independent of the input image, this initialization may in turn limit the model's ability to memorize geometric and positional priors, and poorly generated initial locations can also hinder training.
Therefore, for the first layer of the LaneSeg decoder, we propose an identical initialization strategy: every head takes the same reference point generated from the positional query. Compared with the distributed initialization of reference points in previous methods (i.e., initializing multiple reference points for each query), identical initialization makes the learning of positional priors more stable by filtering out the interference of complex geometries. Although this may seem counter-intuitive, it has been observed to work well in practice. A sketch of both the first-layer and subsequent-layer initialization is given below.
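A compact sketch of the two initialization modes, reusing the distribute_reference_points helper from the heads-to-regions sketch above (layer shapes and names are ours, not the released code):

```python
import torch
import torch.nn as nn

class ReferencePointInit(nn.Module):
    """Sketch of the identical-initialization idea (names/shapes are assumptions).
    Layer 0: every head of a query shares ONE reference point decoded from its
    positional embedding. Later layers: reference points are re-distributed along
    the previous layer's lane segment prediction."""
    def __init__(self, embed_dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.to_point = nn.Linear(embed_dim, 2)  # positional query -> (x, y) in BEV

    def first_layer(self, pos_query: torch.Tensor) -> torch.Tensor:
        # pos_query: (num_queries, embed_dim) -> (num_queries, num_heads, 2)
        p = self.to_point(pos_query).sigmoid()                 # normalized BEV coordinates
        return p.unsqueeze(1).expand(-1, self.num_heads, -1)   # identical across heads

    def later_layer(self, prev_centerlines: torch.Tensor) -> torch.Tensor:
        # prev_centerlines: (num_queries, N, 2) predictions from the previous layer
        return torch.stack(
            [distribute_reference_points(c, self.num_heads) for c in prev_centerlines]
        )
```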
LaneSeg Predictor
We use MLPs in multiple prediction branches to generate the final lane segment predictions from the lane segment queries, covering the geometric, semantic, and topological aspects.
For geometry, we first design a centerline regression branch that regresses the vectorized positions of the centerline points in 3D coordinates, i.e., an ordered set of 3D points. Owing to the symmetry of the left and right lane boundaries about the centerline, we introduce an offset branch that predicts a per-point offset in the same format. The left and right lane boundary coordinates can then be obtained by adding the offset to, and subtracting it from, the centerline, as sketched below.
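For illustration, this symmetric decoding can be written as a one-liner (tensor shapes are assumptions):

```python
import torch

def lane_boundaries_from_offsets(centerline: torch.Tensor, offset: torch.Tensor):
    """Illustrative decoding of the symmetric boundary parameterization.
    centerline, offset: (..., N, 3) point sets predicted by the two branches;
    left/right boundaries are assumed to be centerline +/- offset."""
    left = centerline + offset
    right = centerline - offset
    return left, right
```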
Since lane segments can also be conceptualized as drivable areas, we additionally integrate an instance segmentation branch into the predictor. In terms of semantics, three classification branches predict, in parallel, the category score and the line types of the left and right lane boundaries. The topology branch takes the updated query features as input and outputs a weighted adjacency matrix of the lane graph G through an MLP. A sketch of these prediction heads is given below.
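A minimal sketch of the semantic and topology heads (dimensions and names are assumptions; the instance segmentation branch is omitted for brevity):

```python
import torch
import torch.nn as nn

class SemanticTopologyHeads(nn.Module):
    """Sketch of the classification and topology heads (not the released code)."""
    def __init__(self, dim=256, num_classes=2, num_line_types=3):
        super().__init__()
        self.cls_head        = nn.Linear(dim, num_classes)     # e.g. lane segment / pedestrian crossing
        self.left_type_head  = nn.Linear(dim, num_line_types)  # e.g. non-visible / solid / dashed
        self.right_type_head = nn.Linear(dim, num_line_types)
        # Topology head scores every ordered pair of queries -> weighted adjacency matrix.
        self.topo_head = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, queries: torch.Tensor) -> dict:
        # queries: (Q, dim) updated lane segment query features
        Q, D = queries.shape
        pair = torch.cat([queries.unsqueeze(1).expand(Q, Q, D),
                          queries.unsqueeze(0).expand(Q, Q, D)], dim=-1)   # (Q, Q, 2D)
        return {
            "cls": self.cls_head(queries),
            "left_type": self.left_type_head(queries),
            "right_type": self.right_type_head(queries),
            "adjacency": self.topo_head(pair).squeeze(-1),                 # (Q, Q) logits
        }
```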
Training Loss
LaneSegNet adopts a DETR-like paradigm, using the Hungarian algorithm to efficiently compute a one-to-one optimal assignment between predictions and ground truth. The training loss is then calculated based on the assignment results. The loss function consists of four parts: a geometric loss, a classification loss, a laneline classification loss, and a topological loss.
The geometric loss supervises the geometric structure of each predicted lane segment. According to the bipartite matching result, a ground-truth lane segment is assigned to each predicted vectorized lane segment, and the vectorized geometric loss is defined as the Manhattan distance between the matched lane segment pairs. A simplified sketch of the assignment and this loss is shown below.
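The following sketch uses only the geometric term in the matching cost, whereas the full matching cost also includes classification and other terms (shapes and names are assumptions):

```python
import torch
from scipy.optimize import linear_sum_assignment

def match_and_geometric_loss(pred_lines: torch.Tensor, gt_lines: torch.Tensor) -> torch.Tensor:
    """Simplified Hungarian assignment + Manhattan (L1) geometric loss.
    pred_lines: (Q, N, 3) predicted vectorized lane segments; gt_lines: (M, N, 3)."""
    # Pairwise L1 distance between every prediction and every ground truth.
    cost = (pred_lines.unsqueeze(1) - gt_lines.unsqueeze(0)).abs().sum(dim=(-1, -2))  # (Q, M)
    row, col = linear_sum_assignment(cost.detach().cpu().numpy())  # one-to-one assignment
    return cost[row, col].mean()                                   # loss over matched pairs
```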
Lane Segment perception: In Table 1, we compare LaneSegNet with several state-of-the-art methods, MapTR, MapTRv2, and TopoNet, on the newly introduced lane segment perception benchmark; their models are retrained with our lane segment labels. LaneSegNet outperforms the other methods by up to 9.6% in mAP, and its average distance error is relatively reduced by 12.5%. LaneSegNet-mini also outperforms previous methods while running at a higher 16.2 FPS.
Qualitative results are shown in Figure 4.
Map element detection: For a fairer comparison with map element detection methods, we decompose LaneSegNet's predicted lane segments into pairs of lanelines and then evaluate against state-of-the-art methods using the map element detection metrics. We feed the decomposed laneline and crosswalk labels into several state-of-the-art methods for retraining. The experimental results in Table 2 show that LaneSegNet consistently outperforms the other methods on the map element detection task. Under this fair comparison, LaneSegNet recovers road geometry better thanks to the additional supervision, which shows that the lane segment representation is good at capturing road geometric information.
Centerline perception: We also compare LaneSegNet with state-of-the-art centerline perception methods in Table 3. For consistency, the centerlines are extracted from the lane segments for retraining. LaneSegNet performs significantly better than the other methods on the lane graph perception task. With the additional geometric supervision, LaneSegNet also demonstrates superior topological reasoning, indicating that reasoning ability is closely tied to strong localization and detection capability.
Lane Segment formulation: In Table 4, we provide ablations to verify the design advantages and training efficiency of our proposed lane segment learning formulation. Compared with the separately trained models in the first two rows, jointly training centerlines and map elements brings an overall average improvement of 1.3 on the two main metrics, as shown in row 4, demonstrating the feasibility of multi-task training. However, the common approach of training centerlines and map elements in a single branch by adding extra categories leads to significant performance degradation. Compared with this naive single-branch method, our model trained with lane segment labels obtains a significant performance gain (7.2 OLS and 4.4 mAP, comparing rows 3 and 5), which verifies the positive interaction between the different types of road information in our map learning formulation. Our model even outperforms the multi-branch method, especially on centerline perception (by 4.8 OLS); this shows that geometry can guide topological reasoning in our formulation, whereas the multi-branch model only slightly outperforms the CL-only model (0.6 OLS, rows 1 vs. 4). The small decrease comes from converting our prediction results into the required output format and is caused by line classification errors.
Lane Attention Module: The ablation of the attention module is shown in Table 5. For a fair comparison, we replace the lane attention module in our framework with alternative attention designs. With our careful design, LaneSegNet with lane attention significantly outperforms these alternatives (mAP improved by 3.9 and TOP_ll by 1.2 compared with row 1). Furthermore, decoder latency is further reduced (from 23.45 ms to 20.96 ms) thanks to the smaller number of queries compared with the hierarchical query design.
This paper proposes lane segment perception as a new map learning formulation and presents LaneSegNet, an end-to-end network tailored to this problem. Beyond the network itself, two innovative enhancements are proposed: a lane attention module that employs a heads-to-regions mechanism to capture long-range attention, and an identical initialization strategy for reference points that enhances the learning of positional priors in lane attention. Experimental results on the OpenLane-V2 dataset demonstrate the effectiveness of our design.
Limitations and Future Work. Due to computational limitations, we do not extend the proposed LaneSegNet to additional backbones. The lane segment perception formulation and LaneSegNet may benefit downstream tasks, which is worth exploring in the future.