Fig. 3

BEV feature generation from multi-view camera inputs. Six surround-view images are first processed by a ResNet+FPN backbone to extract 2D image features. A depth network then lifts each pixel into a 3D frustum using the camera intrinsics and extrinsics. The lifted points are aggregated into a voxel grid and pooled along the vertical axis to produce the final BEV representation.
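
The lift and pooling steps in the caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and several details are assumptions made for illustration: the depth network is taken to output a discrete per-pixel distribution over candidate depths, pooling is a plain sum, the intrinsics `K` are expressed at feature-map resolution, and the function names (`lift_to_frustum`, `splat_to_bev`, `bev_from_cameras`) and the per-camera dictionary layout are hypothetical.

```python
import numpy as np

def lift_to_frustum(K, cam_to_ego, H, W, depths):
    """Back-project every pixel at every candidate depth into the ego frame.

    K: (3, 3) intrinsics, cam_to_ego: (4, 4) extrinsics, depths: (D,).
    Returns frustum points of shape (D, H, W, 3).
    """
    us, vs = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)  # pixel centres
    pix = np.stack([us, vs, np.ones_like(us)], axis=-1)           # (H, W, 3)
    rays = pix @ np.linalg.inv(K).T                               # camera rays, z = 1
    pts_cam = depths[:, None, None, None] * rays[None]            # (D, H, W, 3)
    R, t = cam_to_ego[:3, :3], cam_to_ego[:3, 3]
    return pts_cam @ R.T + t                                      # camera -> ego

def splat_to_bev(points, feats, depth_prob, grid_min, voxel_size, grid_shape):
    """Accumulate depth-weighted features into voxels, then pool over height.

    points: (D, H, W, 3), feats: (H, W, C), depth_prob: (D, H, W).
    Returns a BEV map of shape (X, Y, C).
    """
    C = feats.shape[-1]
    weighted = depth_prob[..., None] * feats[None]                # (D, H, W, C)
    idx = np.floor((points - grid_min) / voxel_size).astype(int)  # voxel indices
    inside = np.all((idx >= 0) & (idx < np.array(grid_shape)), axis=-1)
    idx, weighted = idx[inside], weighted[inside]                 # drop out-of-grid points
    voxels = np.zeros((*grid_shape, C))
    # np.add.at accumulates correctly when several points hit the same voxel.
    np.add.at(voxels, (idx[:, 0], idx[:, 1], idx[:, 2]), weighted)
    return voxels.sum(axis=2)                                     # assumed sum pooling over z

def bev_from_cameras(cams, depths, grid_min, voxel_size, grid_shape):
    """Sum the BEV contributions of all cameras (six in the figure)."""
    bev = 0
    for cam in cams:
        H, W, _ = cam["feats"].shape
        pts = lift_to_frustum(cam["K"], cam["cam_to_ego"], H, W, depths)
        bev = bev + splat_to_bev(pts, cam["feats"], cam["depth_prob"],
                                 grid_min, voxel_size, grid_shape)
    return bev
```

In this sketch `feats` stands in for the ResNet+FPN output and `depth_prob` for the depth network output. Both would come from learned modules in the actual pipeline.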