Paper: BEVDet: High-Performance Multi-Camera 3D Object Detection in Bird-Eye-View
Code: https://github.com/HuangJunJie2017/BEVDet
1. Introduction
BEVDet is another important piece of BEV detection work following LSS; at its core it is still a 3D object detection algorithm that performs the 2D-to-3D forward projection into the bird's-eye view (BEV). Accordingly, BEVDet mostly reuses existing modules (Image-view Encoder, View Transformer, BEV Encoder, Task-specific Head) to build its framework (inherited from LSS).
To improve the robustness of the model, BEVDet proposes two additional strategies:
(1) To prevent overfitting, extra data augmentation is performed in BEV space (flipping, scaling and rotation applied directly in the BEV space, improving robustness to these transformations; a conceptual sketch follows this list).
(2) NMS is upgraded (Scale-NMS) to better suit 3D scenes.
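As a rough illustration of idea (1), BEV-space augmentation samples a random flip/rotation/scaling and applies it consistently to the lifted point coordinates (via the bda matrix that appears later in get_lidar_coor) and to the ground-truth boxes. The sketch below is conceptual, not the BEVDet implementation (which, as noted in the summary, I could not locate in the repo); the parameter ranges are illustrative.

import math
import torch

def sample_bda_matrix(rot_range=(-22.5, 22.5), scale_range=(0.95, 1.05), flip_prob=0.5):
    """Conceptual sketch: sample a random BEV-space augmentation as a 3x3 matrix.
    The same matrix would be applied to the lifted point coordinates and to the
    ground-truth boxes, so features and labels stay aligned."""
    angle = math.radians(float(torch.empty(1).uniform_(*rot_range)))
    scale = float(torch.empty(1).uniform_(*scale_range))
    rot = torch.tensor([[math.cos(angle), -math.sin(angle), 0.0],
                        [math.sin(angle),  math.cos(angle), 0.0],
                        [0.0,              0.0,             1.0]])
    flip = torch.eye(3)
    if torch.rand(1) < flip_prob:
        flip[0, 0] = -1.0            # random flip along the x axis
    return scale * flip @ rot        # compose scaling, flip and rotation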
Overall pipeline (a minimal end-to-end sketch follows this list):
(1) Image-view Encoder: a feature-extraction backbone (ResNet, Swin Transformer, DenseNet, HRNet, etc.) extracts features from each camera view, and a multi-resolution fusion neck (FPN, FPN-LSS) performs multi-scale fusion per view, yielding the fused image features of the 6 cameras.
(2) View Transformer: reuses the depth-estimation idea from LSS. It takes the per-view image features as input, builds a frustum over the feature map, predicts a depth distribution per pixel, generates a pseudo point cloud from the predicted depth and the image features, and finally pools along the vertical direction to obtain the BEV feature.
(3) BEV Encoder: a ResNet-style backbone (as in the image-view encoder) builds the feature extractor in BEV space, and FPN-LSS fuses the BEV features at different scales to refine them further (similar to the Image-view Encoder, but it can perceive key cues such as scale, orientation and velocity with high precision).
(4) Task-specific Head: the output head is designed for the task at hand (3D object detection aims to estimate the position, scale, orientation and velocity of movable objects such as pedestrians, vehicles and obstacles). BEVDet directly adopts the first-stage 3D detection head of CenterPoint; the second (refinement) stage of CenterPoint is not used.
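To make the four stages concrete, the following is a minimal sketch of the end-to-end forward pass. The class and its interfaces are illustrative placeholders (the actual repo wires these modules together through mmdet3d configs), assuming the view transformer returns (bev_feat, depth) as in the code walked through below.

import torch.nn as nn

class BEVDetSketch(nn.Module):
    """Minimal sketch of the BEVDet pipeline; not the repo's actual class."""

    def __init__(self, img_encoder, view_transformer, bev_encoder, head):
        super().__init__()
        self.img_encoder = img_encoder            # image backbone + neck, shared by all cameras
        self.view_transformer = view_transformer  # LSS-style 2D->3D lifting
        self.bev_encoder = bev_encoder            # ResNet + FPN-LSS on the BEV feature
        self.head = head                          # CenterPoint first-stage head

    def forward(self, imgs, cam_params):
        # imgs: [B, N, 3, H, W] multi-camera images; cam_params: calibration tensors
        x, _ = self.img_encoder(imgs)                              # [B, N, C, h, w]
        bev_feat, depth = self.view_transformer([x, *cam_params])  # [B, C_bev, 128, 128]
        bev_feat = self.bev_encoder(bev_feat)                      # refined BEV features
        return self.head(bev_feat)                                 # 3D boxes / heatmaps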
2. Pipeline
2.1 Image-view Encoder
2.1.1 Principle:
The image_encoder module uses a backbone to extract features from the 6 camera views. In this walkthrough, ResNet-50 is applied to an input of shape [8,6,3,256,704] and produces two feature maps, [48,1024,16,44] and [48,2048,8,22]. A standard detection neck then merges the two maps into a single [48,256,16,44] feature map, which is finally reshaped to [8,6,256,16,44].
2.1.2 Code:
def image_encoder(self, img, stereo=False):
    imgs = img  # [8,6,3,256,704]
    B, N, C, imH, imW = imgs.shape
    imgs = imgs.view(B * N, C, imH, imW)  # [8*6,3,256,704]
    if self.grid_mask is not None:
        imgs = self.grid_mask(imgs)  # [8*6,3,256,704]
    x = self.img_backbone(imgs)  # [[48,1024,16,44], [48,2048,8,22]], feature extraction with ResNet-50
    stereo_feat = None
    if stereo:
        stereo_feat = x[0]
        x = x[1:]
    if self.with_img_neck:
        x = self.img_neck(x)  # [48,256,16,44] (the neck maps [48,1024,16,44] and [48,2048,8,22] and fuses them by concatenation)
    if type(x) in [list, tuple]:
        x = x[0]
    _, output_dim, ouput_H, output_W = x.shape  # [48,256,16,44]
    x = x.view(B, N, output_dim, ouput_H, output_W)  # [8,6,256,16,44]
    return x, stereo_feat
2.2 View Transformer
2.2.1 Principle:
This subsection walks through the code logic and basic principle of the 2D-to-3D projection. Figure 2 shows the code logic of the View Transformer module.
The main functions in Figure 2 are described below, followed by the code:
(1) self.depth_net: a depth-estimation network (several Conv+BN+ReLU layers) that predicts, along the channel dimension, both the depth logits and the context features. The output has shape [48,123,16,44], i.e. [B*N, C, H, W], where the 123 channels are 59 depth bins plus 64 context channels.
(2) self.view_transform (self.view_transform_core): the 2D-to-3D view transformation function.
(3) self.get_lidar_coor: converts the frustum into a point cloud using the camera intrinsics/extrinsics and the camera-to-lidar transform.
(4) self.voxel_pooling_v2: extracts the BEV feature from the point coordinates coor, the estimated depth, and the features feat.
(5) self.voxel_pooling_prepare_v2: prepares the data for voxel pooling. It takes the point coordinates coor in the lidar space and returns, for each point, the rank of the voxel it belongs to, its kept index into the depth tensor, and its kept index into the feature tensor.
(6) bev_pool_v2: accumulates the depth-weighted features into the BEV grid using the voxel ranks, the depth distribution, the features, etc.
2.2.2 Code:
The forward function
def forward(self, input, depth_from_lidar=None):
    """Transform image-view feature into bird-eye-view feature.

    Args:
        input (list(torch.tensor)): of (image-view feature, rots, trans,
            intrins, post_rots, post_trans)

    Returns:
        torch.tensor: Bird-eye-view feature in shape (B, C, H_BEV, W_BEV)
    """
    x = input[0]  # [8,6,256,16,44]
    B, N, C, H, W = x.shape  # [8,6,256,16,44]
    x = x.view(B * N, C, H, W)  # [8*6,256,16,44]
    if self.with_depth_from_lidar:
        assert depth_from_lidar is not None
        if isinstance(depth_from_lidar, list):
            assert len(depth_from_lidar) == 1
            depth_from_lidar = depth_from_lidar[0]
        h_img, w_img = depth_from_lidar.shape[2:]
        depth_from_lidar = depth_from_lidar.view(B * N, 1, h_img, w_img)
        depth_from_lidar = self.lidar_input_net(depth_from_lidar)
        x = torch.cat([x, depth_from_lidar], dim=1)
    if self.with_cp:
        x = checkpoint(self.depth_net, x)
    else:
        x = self.depth_net(x)  # [48,123,16,44]
    depth_digit = x[:, :self.D, ...]  # [48,59,16,44] depth logits
    tran_feat = x[:, self.D:self.D + self.out_channels, ...]  # [48,64,16,44] context features
    depth = depth_digit.softmax(dim=1)  # [48,59,16,44] softmax over dim=1: a distribution over the 59 depth bins
    return self.view_transform(input, depth, tran_feat)
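Conceptually, the predicted depth distribution and the context features are combined by a per-pixel outer product (the "Lift" step of LSS). bev_pool_v2 fuses this weighting with the pooling so that the huge intermediate tensor is never materialized; a naive equivalent, written out only for intuition with stand-in tensors, would look like this:

import torch

# shapes as in the walkthrough: B*N = 48, D = 59 depth bins, C = 64 context channels
depth = torch.rand(48, 59, 16, 44).softmax(dim=1)  # stand-in for the predicted depth distribution
tran_feat = torch.rand(48, 64, 16, 44)             # stand-in for the context features
# naive Lift: every pixel contributes its C-dim feature at D candidate depths,
# weighted by the probability of that depth bin
volume = depth.unsqueeze(1) * tran_feat.unsqueeze(2)  # [48, 64, 59, 16, 44], ~128M elements
# this tensor is exactly what bev_pool_v2 avoids allocating: it sums the weighted
# features directly into BEV cells instead.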
The self.view_transform function
def view_transform(self, input, depth, tran_feat):
    for shape_id in range(3):
        assert depth.shape[shape_id + 1] == self.frustum.shape[shape_id]
    if self.accelerate:
        self.pre_compute(input)
    return self.view_transform_core(input, depth, tran_feat)
The self.view_transform_core function
def view_transform_core(self, input, depth, tran_feat):
    # input:     [[8,6,256,16,44], [8,6,4,4], [8,6,4,4], [8,6,3,3], [8,6,3,3], ...]
    # depth:     [48,59,16,44] depth distribution predicted from the images (pseudo point cloud)
    # tran_feat: [48,64,16,44]
    B, N, C, H, W = input[0].shape  # [8,6,256,16,44]
    # Lift-Splat
    if self.accelerate:
        feat = tran_feat.view(B, N, self.out_channels, H, W)
        feat = feat.permute(0, 1, 3, 4, 2)
        depth = depth.view(B, N, self.D, H, W)
        bev_feat_shape = (depth.shape[0], int(self.grid_size[2]),
                          int(self.grid_size[1]), int(self.grid_size[0]),
                          feat.shape[-1])  # (B, Z, Y, X, C)
        bev_feat = bev_pool_v2(depth, feat, self.ranks_depth, self.ranks_feat,
                               self.ranks_bev, bev_feat_shape,
                               self.interval_starts, self.interval_lengths)
        bev_feat = bev_feat.squeeze(2)
    else:
        # frustum points in the ego/lidar coordinate system
        coor = self.get_lidar_coor(*input[1:7])  # [8,6,59,16,44,3]
        # voxelize and collapse the height axis to obtain the BEV feature [8,64,128,128]
        bev_feat = self.voxel_pooling_v2(coor, depth.view(B, N, self.D, H, W),
                                         tran_feat.view(B, N, self.out_channels, H, W))
    return bev_feat, depth
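The self.frustum checked and consumed above is built once at initialization: for every feature-map pixel (u, v) and every depth bin d it stores the point (u, v, d) in (augmented) image coordinates. Below is a sketch of the LSS-style construction; the function name and arguments are illustrative and may differ in detail from the repo's own frustum-building code.

import torch

def create_frustum(depth_range=(1.0, 60.0, 1.0), input_size=(256, 704), downsample=16):
    """Build a (D, fH, fW, 3) grid of (u, v, d) frustum points.

    depth_range: (d_min, d_max, d_step); (1, 60, 1) gives D = 59 depth bins.
    input_size:  (H, W) of the augmented input image, e.g. (256, 704).
    downsample:  feature stride, e.g. 16 -> fH, fW = 16, 44.
    """
    H_in, W_in = input_size
    fH, fW = H_in // downsample, W_in // downsample
    d = torch.arange(*depth_range, dtype=torch.float).view(-1, 1, 1).expand(-1, fH, fW)  # [D,fH,fW]
    D = d.shape[0]
    u = torch.linspace(0, W_in - 1, fW, dtype=torch.float).view(1, 1, fW).expand(D, fH, fW)  # pixel x
    v = torch.linspace(0, H_in - 1, fH, dtype=torch.float).view(1, fH, 1).expand(D, fH, fW)  # pixel y
    return torch.stack((u, v, d), dim=-1)  # [D, fH, fW, 3], matching the (59,16,44,3) shapes above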
The self.get_lidar_coor function
def get_lidar_coor(self, sensor2ego, ego2global, cam2imgs, post_rots,
                   post_trans, bda):
    """Calculate the locations of the frustum points in the lidar
    coordinate system.

    Args:
        rots (torch.Tensor): Rotation from camera coordinate system to
            lidar coordinate system in shape (B, N_cams, 3, 3).
        trans (torch.Tensor): Translation from camera coordinate system to
            lidar coordinate system in shape (B, N_cams, 3).
        cam2imgs (torch.Tensor): Camera intrinsic matrixes in shape
            (B, N_cams, 3, 3).
        post_rots (torch.Tensor): Rotation in camera coordinate system in
            shape (B, N_cams, 3, 3). It is derived from the image view
            augmentation.
        post_trans (torch.Tensor): Translation in camera coordinate system
            derived from image view augmentation in shape (B, N_cams, 3).

    Returns:
        torch.tensor: Point coordinates in shape (B, N_cams, D, H, W, 3)
    """
    B, N, _, _ = sensor2ego.shape  # [8,6,4,4]
    # post-transformation: undo the image-view augmentation
    # B x N x D x H x W x 3
    # .to(sensor2ego) moves self.frustum onto the same device/dtype as sensor2ego
    points = self.frustum.to(sensor2ego) - post_trans.view(B, N, 1, 1, 1, 3)  # [8,6,59,16,44,3]
    points = torch.inverse(post_rots).view(B, N, 1, 1, 1, 3, 3).matmul(points.unsqueeze(-1))  # [8,6,59,16,44,3,1] invert the image-view augmentation
    # cam_to_ego coordinate transform
    points = torch.cat((points[..., :2, :] * points[..., 2:3, :], points[..., 2:3, :]), 5)  # [8,6,59,16,44,3,1] scale (u, v) by depth -> (u*d, v*d, d)
    combine = sensor2ego[:, :, :3, :3].matmul(torch.inverse(cam2imgs))  # [8,6,3,3]
    points = combine.view(B, N, 1, 1, 1, 3, 3).matmul(points).squeeze(-1)  # [8,6,59,16,44,3]
    points += sensor2ego[:, :, :3, 3].view(B, N, 1, 1, 1, 3)  # [8,6,59,16,44,3]
    # bda is the BEV-space augmentation matrix (identity when no BEV augmentation is applied)
    # explanation: https://github.com/Megvii-BaseDetection/BEVDepth/issues/44
    points = bda[:, :3, :3].view(B, 1, 1, 1, 1, 3, 3).matmul(points.unsqueeze(-1)).squeeze(-1)  # [8,6,59,16,44,3]
    points += bda[:, :3, 3].view(B, 1, 1, 1, 1, 3)  # [8,6,59,16,44,3]
    return points
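Stripped of the batch/camera broadcasting and of the augmentation terms (post_rots/post_trans and bda), what the code above does to a single frustum point (u, v, d) is the standard inverse projection followed by the camera-to-ego transform. A minimal single-point sketch for intuition, where K, R, t are illustrative names for cam2imgs, the rotation block of sensor2ego and its translation:

import torch

def pixel_to_ego(u, v, d, K, R, t):
    """Back-project one pixel (u, v) with depth d into the ego/lidar frame.

    u, v, d: Python floats; K: [3,3] intrinsics; R: [3,3], t: [3] camera-to-ego.
    """
    uvd = torch.tensor([u * d, v * d, d])   # (u*d, v*d, d), as built by the torch.cat line above
    p_cam = torch.inverse(K) @ uvd          # point in the camera coordinate system
    return R @ p_cam + t                    # point in the ego/lidar coordinate system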
The self.voxel_pooling_v2 function
def voxel_pooling_v2(self, coor, depth, feat):
    # coor:  [8,6,59,16,44,3]
    # depth: [8,6,59,16,44]
    # feat:  [8,6,64,16,44]
    # Prepare the data for voxel pooling: for every frustum point, get the rank of the
    # voxel it falls into plus its kept indices into the depth and feature tensors.
    ranks_bev, ranks_depth, ranks_feat, interval_starts, interval_lengths = \
        self.voxel_pooling_prepare_v2(coor)
    # ranks_bev:        [1502983]
    # ranks_depth:      [1502983]
    # ranks_feat:       [1502983]
    # interval_starts:  [112405]
    # interval_lengths: [112405]
    if ranks_feat is None:
        print('warning ---> no points within the predefined '
              'bev receptive field')
        dummy = torch.zeros(size=[
            feat.shape[0], feat.shape[2],
            int(self.grid_size[2]),
            int(self.grid_size[0]),
            int(self.grid_size[1])]).to(feat)
        dummy = torch.cat(dummy.unbind(dim=2), 1)
        return dummy
    feat = feat.permute(0, 1, 3, 4, 2)  # [8,6,16,44,64]
    bev_feat_shape = (depth.shape[0], int(self.grid_size[2]),
                      int(self.grid_size[1]), int(self.grid_size[0]),
                      feat.shape[-1])  # (B, Z, Y, X, C) [8,1,128,128,64]
    # [8,64,1,128,128]
    bev_feat = bev_pool_v2(depth, feat, ranks_depth, ranks_feat, ranks_bev,
                           bev_feat_shape, interval_starts, interval_lengths)
    # collapse Z
    if self.collapse_z:
        bev_feat = torch.cat(bev_feat.unbind(dim=2), 1)
    return bev_feat
The self.voxel_pooling_prepare_v2 function
def voxel_pooling_prepare_v2(self, coor):
    """Data preparation for voxel pooling.

    Args:
        coor (torch.tensor): Coordinate of points in the lidar space in
            shape (B, N, D, H, W, 3).

    Returns:
        tuple[torch.tensor]: Rank of the voxel that a point is belong to
            in shape (N_Points); Reserved index of points in the depth
            space in shape (N_Points). Reserved index of points in the
            feature space in shape (N_Points).
    """
    # Data preparation before voxel pooling: for every frustum point, compute the rank
    # of the voxel it belongs to, its index into the depth tensor, and its index into
    # the feature tensor.
    B, N, D, H, W, _ = coor.shape  # [8,6,59,16,44,3]
    num_points = B * N * D * H * W  # 8*6*59*16*44 = 1993728
    # record the index of selected points for acceleration purpose
    ranks_depth = torch.range(0, num_points - 1, dtype=torch.int, device=coor.device)  # [1993728] index of every point in the depth tensor
    ranks_feat = torch.range(0, num_points // D - 1, dtype=torch.int, device=coor.device)  # [33792] index of every pixel in the feature tensor
    ranks_feat = ranks_feat.reshape(B, N, 1, H, W)  # [8,6,1,16,44]
    ranks_feat = ranks_feat.expand(B, N, D, H, W).flatten()  # [8,6,59,16,44] -> [1993728]
    # convert coordinates into the voxel space:
    # shift the origin to the lower bound of the grid and rescale to voxel units
    coor = ((coor - self.grid_lower_bound.to(coor)) / self.grid_interval.to(coor))  # [8,6,59,16,44,3]
    coor = coor.long().view(num_points, 3)  # [1993728,3]
    # record which batch each frustum point belongs to
    batch_idx = torch.range(0, B - 1).reshape(B, 1).expand(
        B, num_points // B).reshape(num_points, 1).to(coor)  # [1993728,1]
    coor = torch.cat((coor, batch_idx), 1)  # [1993728,4]
    # filter out points that are outside the BEV grid
    kept = (coor[:, 0] >= 0) & (coor[:, 0] < self.grid_size[0]) & \
           (coor[:, 1] >= 0) & (coor[:, 1] < self.grid_size[1]) & \
           (coor[:, 2] >= 0) & (coor[:, 2] < self.grid_size[2])  # [1993728]
    if len(kept) == 0:
        return None, None, None, None, None
    # keep only the points inside the grid
    coor, ranks_depth, ranks_feat = coor[kept], ranks_depth[kept], ranks_feat[kept]
    # get tensors from the same voxel next to each other:
    # encode (batch, z, y, x) into a single rank so points in the same BEV cell share it
    ranks_bev = coor[:, 3] * (self.grid_size[2] * self.grid_size[1] * self.grid_size[0])  # [1502983]
    ranks_bev += coor[:, 2] * (self.grid_size[1] * self.grid_size[0])  # [1502983]
    ranks_bev += coor[:, 1] * self.grid_size[0] + coor[:, 0]  # [1502983]
    # sort the points by their voxel rank
    order = ranks_bev.argsort()  # [1502983]
    ranks_bev, ranks_depth, ranks_feat = ranks_bev[order], ranks_depth[order], ranks_feat[order]
    # find the start index of each voxel
    kept = torch.ones(ranks_bev.shape[0], device=ranks_bev.device, dtype=torch.bool)  # [1502983]
    # shifted comparison: only the first point of each run of equal ranks stays True
    kept[1:] = ranks_bev[1:] != ranks_bev[:-1]
    interval_starts = torch.where(kept)[0].int()  # [112405]
    # handle the case where no voxel is occupied
    if len(interval_starts) == 0:
        return None, None, None, None, None
    # compute the number of points in each voxel
    interval_lengths = torch.zeros_like(interval_starts)  # [112405]
    interval_lengths[:-1] = interval_starts[1:] - interval_starts[:-1]  # [112405]
    interval_lengths[-1] = ranks_bev.shape[0] - interval_starts[-1]  # [112405]
    return ranks_bev.int().contiguous(), ranks_depth.int().contiguous(), \
        ranks_feat.int().contiguous(), interval_starts.int().contiguous(), \
        interval_lengths.int().contiguous()
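The interval bookkeeping is easiest to see on a toy example: if the sorted voxel ranks of five kept points are [3, 3, 7, 7, 7], the shifted comparison marks only the first point of each run, so interval_starts = [0, 2] and interval_lengths = [2, 3], and bev_pool_v2 can sum points 0-1 into cell 3 and points 2-4 into cell 7. A self-contained reproduction of that logic:

import torch

ranks_bev = torch.tensor([3, 3, 7, 7, 7])            # already sorted voxel ranks
kept = torch.ones(ranks_bev.shape[0], dtype=torch.bool)
kept[1:] = ranks_bev[1:] != ranks_bev[:-1]            # True at the start of each run
interval_starts = torch.where(kept)[0].int()          # tensor([0, 2])
interval_lengths = torch.zeros_like(interval_starts)
interval_lengths[:-1] = interval_starts[1:] - interval_starts[:-1]
interval_lengths[-1] = ranks_bev.shape[0] - interval_starts[-1]
print(interval_starts.tolist(), interval_lengths.tolist())  # [0, 2] [2, 3]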
2.3 BEV Encoder
The BEV Encoder is again a ResNet-style backbone plus a multi-scale fusion neck; structurally it is essentially the same as a common 2D detector, so it is not discussed in detail here (a minimal sketch follows).
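For completeness, here is a minimal sketch of what such a BEV encoder can look like: a couple of ResNet stages on the BEV map followed by an FPN-LSS-style upsample-and-concat fusion. Layer counts and channel widths are illustrative, not the exact BEVDet configuration.

import torch
import torch.nn as nn
from torchvision.models.resnet import BasicBlock

class BEVEncoderSketch(nn.Module):
    def __init__(self, in_channels=64):
        super().__init__()
        # two downsampling ResNet-style stages on the BEV feature map
        self.stage1 = self._make_stage(in_channels, in_channels * 2)
        self.stage2 = self._make_stage(in_channels * 2, in_channels * 4)
        # FPN-LSS style neck: upsample the deep feature and concat with the input
        self.up = nn.Upsample(scale_factor=4, mode='bilinear', align_corners=True)
        self.fuse = nn.Sequential(
            nn.Conv2d(in_channels + in_channels * 4, in_channels * 2, 3, padding=1, bias=False),
            nn.BatchNorm2d(in_channels * 2),
            nn.ReLU(inplace=True))

    @staticmethod
    def _make_stage(c_in, c_out):
        downsample = nn.Sequential(
            nn.Conv2d(c_in, c_out, 1, stride=2, bias=False), nn.BatchNorm2d(c_out))
        return nn.Sequential(
            BasicBlock(c_in, c_out, stride=2, downsample=downsample),
            BasicBlock(c_out, c_out))

    def forward(self, x):          # x: [B, 64, 128, 128] BEV feature from the View Transformer
        x1 = self.stage1(x)        # [B, 128, 64, 64]
        x2 = self.stage2(x1)       # [B, 256, 32, 32]
        return self.fuse(torch.cat([x, self.up(x2)], dim=1))  # [B, 128, 128, 128]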
Summary
1. The BEVDet code is largely the same as LSS; for the most part it repackages LSS inside the mmdet3d framework.
2. After reading through the whole BEVDet codebase, I was unable to locate the code for the two contributions highlighted in the paper, BEV-space data augmentation and Scale-NMS (if anyone has found it, please share!).
3. In particular, the paper discusses an overfitting issue: the img_backbone receives plenty of training signal (images from 6 cameras), while the BEV_encoder side sees far less data (the 6 camera images produce only a single BEV map). I could not see where the code addresses this.