[3D object detection] BEVFormer

Published 2023-08-11 17:18:46 · Author: ldfm

paper: BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers, 2022

1. Grid-shaped BEV queries

We predefine a group of grid-shaped learnable parameters \(Q \in \mathbb{R}^{H\times W\times C}\) as the queries of BEVFormer, where H, W are the spatial shape of the BEV plane. To be specific, the query \(Q_p \in \mathbb{R}^{1\times C}\) located at p = (x, y) of Q is responsible for the corresponding grid cell region in the BEV plane. Each grid cell in the BEV plane corresponds to a real-world size of s meters. The center of the BEV features corresponds to the position of the ego car by default (on the nuScenes dataset the perception range can be [-40m, -40m, -1m, 40m, 40m, 5.4m], which is symmetric around the ego vehicle). Following common practices [14], we add a learnable positional embedding to the BEV queries Q before inputting them to BEVFormer.
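A minimal PyTorch sketch of how such grid-shaped queries could be set up. The module name, initialization scale, and forward signature are illustrative assumptions; H = W = 200 and C = 256 are a common setting but not the only one.

```python
import torch
import torch.nn as nn

class BEVQueries(nn.Module):
    """Grid-shaped learnable BEV queries plus a learnable positional embedding (sketch)."""

    def __init__(self, H=200, W=200, C=256):
        super().__init__()
        # Q in R^{H x W x C}: one learnable query vector per BEV grid cell, stored flattened
        self.bev_queries = nn.Parameter(torch.randn(H * W, C) * 0.02)
        # learnable positional embedding added before entering the BEVFormer encoder
        self.bev_pos = nn.Parameter(torch.randn(H * W, C) * 0.02)
        self.H, self.W, self.C = H, W, C

    def forward(self, batch_size):
        # (B, H*W, C): queries with positional embedding, ready to attend to image features
        q = self.bev_queries + self.bev_pos
        return q.unsqueeze(0).expand(batch_size, -1, -1)

# usage
queries = BEVQueries()(batch_size=1)   # -> torch.Size([1, 40000, 256])
```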

2. Spatial cross-attention

Each BEV query only interacts with image features in its regions of interest.
Steps:

  1. First lift each query on the BEV plane to a pillar-like query and sample \(N_{ref}\) 3D reference points from the pillar.
  2. Project these points to the 2D views as reference points (a sketch of the lifting and projection is given after this list).
  • The real-world location (x′, y′) corresponding to the query \(Q_p\) located at p = (x, y) of Q:
    convert the grid coordinates into ego-vehicle-centered world coordinates.

\[\begin{cases} x^{'} = (x - \frac{W}{2})\cdot s \\ y^{'} = (y - \frac{H}{2})\cdot s \\ \end{cases} \]

  • The objects located at (x′, y′) will appear at height z′ on the z-axis, so we predefine a set of anchor heights \(\{z^{'}_j\}^{N_{ref}}_{j=1}\) to make sure we can capture clues that appear at different heights. In this way, for each query \(Q_p\), we obtain a pillar of 3D reference points \(\{(x^{'}, y^{'}, z^{'}_j)\}^{N_{ref}}_{j=1}\).
  • Project the 3D reference points to the different image views through the projection matrices of the cameras, where \(T_i \in \mathbb{R}^{3\times 4}\) is the known projection matrix of the i-th camera.

\[z_{ij}\cdot [x_{ij}, y_{ij}, 1]^T = T_i\cdot [x^{'}, y^{'}, z^{'}_j, 1]^T \]

  3. Sample the features from the hit views \(V_{hit}\) around these reference points.
  4. Perform a weighted sum of the sampled features as the output of spatial cross-attention.
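A minimal PyTorch sketch of the reference-point lifting and projection in steps 1–2. The function names, the ego-centered coordinate convention, and the positive-depth "hit" test are assumptions for illustration; the actual implementation additionally normalizes the projected points to feature-map coordinates and samples the hit views with deformable attention.

```python
import torch

def make_3d_reference_points(H, W, s, z_anchors):
    """Lift each BEV query location to a pillar of 3D reference points (sketch).

    H, W      : spatial shape of the BEV grid
    s         : real-world size (meters) of one grid cell
    z_anchors : predefined anchor heights z'_j in meters (assumed list)
    Returns (H*W, N_ref, 3) points in ego-centered world coordinates.
    """
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32),
                            indexing="ij")
    # x' = (x - W/2) * s,  y' = (y - H/2) * s  (grid -> ego-centered meters)
    xw = (xs - W / 2.0) * s
    yw = (ys - H / 2.0) * s
    xy = torch.stack([xw, yw], dim=-1).reshape(-1, 2)            # (H*W, 2)
    z = torch.tensor(z_anchors, dtype=torch.float32)             # (N_ref,)
    xy = xy.unsqueeze(1).expand(-1, len(z_anchors), -1)          # (H*W, N_ref, 2)
    zz = z.view(1, -1, 1).expand(xy.shape[0], -1, -1)            # (H*W, N_ref, 1)
    return torch.cat([xy, zz], dim=-1)                           # (H*W, N_ref, 3)

def project_to_image(points_3d, T_i):
    """Project 3D reference points into the i-th camera view (sketch).

    points_3d : (..., 3) ego-centered world points
    T_i       : (3, 4) known projection matrix of camera i
    Returns pixel coordinates (..., 2) and a 'hit' mask (point in front of the camera).
    """
    ones = torch.ones_like(points_3d[..., :1])
    homo = torch.cat([points_3d, ones], dim=-1)                  # (..., 4) homogeneous points
    cam = homo @ T_i.T                                           # (..., 3) = z_ij * (x_ij, y_ij, 1)
    z = cam[..., 2:3]
    uv = cam[..., :2] / z.clamp(min=1e-5)                        # divide by depth
    hit = z.squeeze(-1) > 1e-5                                   # reference point falls in this view
    return uv, hit
```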

3. Temporal self-attention

Each BEV query interacts with two features: the BEV queries at the current timestamp and the BEV features at the previous timestamp.
Steps:

  1. Given the BEV queries Q at the current timestamp t and the history BEV features \(B_{t-1}\) preserved at timestamp t−1, we first align \(B_{t-1}\) to Q according to ego-motion, so that the features at the same grid position correspond to the same real-world location.
    Align the previous frame's BEV features to the ego-centered coordinate frame of Q according to the ego-motion (a minimal alignment sketch is given after this list).
  2. Since it is challenging to construct a precise association of the same objects between BEV features from different timestamps (objects move between frames), model the temporal connection between the features through the temporal self-attention (TSA) layer.
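A minimal sketch of the ego-motion alignment in step 1, assuming a known planar ego translation and rotation between t−1 and t. The argument names, sign conventions, and the use of a single affine warp are illustrative assumptions; the real model then stacks the aligned \(B_{t-1}\) with the current queries Q inside temporal self-attention rather than relying on the warp alone.

```python
import torch
import torch.nn.functional as F

def align_prev_bev(prev_bev, delta_xy, delta_yaw, s):
    """Warp the previous BEV features B_{t-1} toward the current ego frame (sketch).

    prev_bev  : (B, C, H, W) BEV features from timestamp t-1
    delta_xy  : (B, 2) ego translation from t-1 to t, in meters (assumed known)
    delta_yaw : (B,) ego rotation from t-1 to t, in radians (assumed known)
    s         : grid-cell size in meters
    """
    B, _, H, W = prev_bev.shape
    cos, sin = torch.cos(delta_yaw), torch.sin(delta_yaw)
    # 2x3 affine matrix mapping current-frame normalized grid coordinates to
    # sampling locations in the previous BEV map; the exact signs depend on the
    # ego-pose convention used (assumption in this sketch).
    theta = torch.zeros(B, 2, 3, device=prev_bev.device)
    theta[:, 0, 0], theta[:, 0, 1] = cos, -sin
    theta[:, 1, 0], theta[:, 1, 1] = sin, cos
    # translation expressed in grid cells, then normalized to [-1, 1]
    theta[:, 0, 2] = delta_xy[:, 0] / s / (W / 2.0)
    theta[:, 1, 2] = delta_xy[:, 1] / s / (H / 2.0)
    grid = F.affine_grid(theta, prev_bev.shape, align_corners=False)
    return F.grid_sample(prev_bev, grid, align_corners=False)
```

After this alignment, each current query can still attend to a small neighborhood in both Q and the aligned \(B_{t-1}\), which is what lets TSA compensate for objects that have moved between the two frames.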