Tsinghua Science and Technology


Keywords: scene understanding, point cloud semantic segmentation, multi-modal information fusion, deep learning


Abstract: This paper addresses the semantic segmentation of large-scale 3D road scenes by combining the complementary advantages of point clouds and images. To make full use of both geometric and visual information, it extracts 3D geometric features from a point cloud using a deep neural network for 3D semantic segmentation, and 2D visual features from images using a Convolutional Neural Network (CNN) for 2D semantic segmentation. To bridge the two modalities, superpoints serve as an intermediate representation that connects the 2D features with the 3D features, and a superpoint-based pooling method is proposed to fuse the features of the two modalities for joint learning. To evaluate the approach, 3D scenes are generated from the Virtual KITTI dataset. The experimental results demonstrate that the proposed approach segments large-scale 3D road scenes based on compact and semantically homogeneous superpoints, and that it achieves considerable improvements over 2D image and 3D point cloud semantic segmentation methods.
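The abstract does not specify the exact pooling operator, so the following is only a minimal sketch of how a superpoint-based fusion step could look: per-point 3D geometric features and per-point 2D visual features (already gathered at each point's image projection) are pooled within each superpoint and concatenated. The function name `superpoint_pool`, the use of max pooling, and the feature shapes are all assumptions for illustration, not the paper's actual method.

```python
import numpy as np

def superpoint_pool(point_feats, pixel_feats, superpoint_ids):
    """Fuse per-point 3D features with projected 2D image features by
    pooling each modality within a superpoint and concatenating.

    point_feats    : (N, D3) geometric features from the 3D network
    pixel_feats    : (N, D2) visual features gathered at each point's
                     image projection (2D network output)
    superpoint_ids : (N,) superpoint index of each point
    Returns a dict mapping superpoint id -> fused (D3 + D2,) feature.
    """
    fused = {}
    for sp in np.unique(superpoint_ids):
        mask = superpoint_ids == sp
        # max-pool each modality over the superpoint's member points
        # (max pooling is an assumption; any symmetric pooling works)
        g = point_feats[mask].max(axis=0)
        v = pixel_feats[mask].max(axis=0)
        fused[sp] = np.concatenate([g, v])
    return fused

# toy example: 5 points, 2 superpoints, 3-D geometric + 2-D visual features
pts = np.arange(15, dtype=float).reshape(5, 3)
pix = np.arange(10, dtype=float).reshape(5, 2)
ids = np.array([0, 0, 1, 1, 1])
out = superpoint_pool(pts, pix, ids)
print(out[0])  # fused feature of superpoint 0: [3. 4. 5. 2. 3.]
```

Because pooling is order-invariant within a superpoint, the fused feature is a fixed-length descriptor regardless of how many points the superpoint contains, which is what makes superpoints a convenient bridge between the irregular point cloud and the regular image grid.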


Tsinghua University Press