Introduction

The rapid advancement of autonomous driving technology has thrust scene understanding into the forefront, garnering considerable attention1,2. Scene understanding involves swiftly and accurately providing precise location and classification details of environmental objects3. It is typically performed with visual cameras or 3D LIDAR. Compared with conventional visual cameras, 3D LIDAR has emerged as a favored sensor for scene comprehension because of its robustness and its ability to capture highly precise environmental data4,5. The main task of semantic scene understanding based on 3D LIDAR is to assign predicted class information to the captured environmental point cloud6.

In recent years, with the continuous evolution of deep learning, research on LIDAR-based point cloud semantic segmentation has also progressed. In the early stages, the 3D point cloud was projected onto a 2D plane and then processed with 2D CNNs7. However, this approach inevitably discarded the structural geometric features of the point cloud. In parallel, some studies converted the unordered point cloud into regular voxels and processed them directly with 3D CNNs. While this voxel-based representation captured richer spatial features and was more efficient, it also lost contextual feature information, significantly impacting segmentation accuracy8. Consequently, the research focus gradually shifted towards direct point processing. PointNet9 emerged as the pioneering method for directly processing and segmenting points. Although it largely retains structural geometric information and better captures spatial location context, the disordered and sparse distribution of point clouds makes such processing computationally expensive10.

Due to the limitations of various methods, researchers have increasingly explored combining different approaches to enhance the extraction of complex geometric and potentially related features in point cloud semantic segmentation. Among these, the fusion method that combines point-based and voxel-based approaches has shown superior segmentation performance11. This method integrates the high-precision context features extracted by the point-based approach with the global structural features provided by the voxel-based approach12, leading to more accurate point cloud segmentation. However, while existing methods combining point and voxel-based approaches have achieved high segmentation accuracy, they still face significant challenges. Sparse 3D convolution methods, for instance, tend to expand all sparse features within a voxel, resulting in excessive redundant features. Additionally, while integrating point features with voxel features can leverage their complementary strengths, straightforward feature addition often leads to losing crucial information. To address these issues, this paper proposes an adaptive fusion point cloud semantic segmentation method using multi-scale sparse three-dimensional convolution and point convolution. The main contributions of this work are as follows:

  • To address the limitations of existing sparse 3D convolution methods, which tend to extract excessive redundant features while neglecting the structural features in each direction, we propose an asymmetric importance of space locations (IoSL) sparse 3D convolution module. This module enhances the sparse learning of voxel features by prioritizing the prediction of point positions. Additionally, it preserves the features of the point cloud along both the vertical and horizontal directions, leading to improved efficiency and accuracy in the segmentation process;

  • To address the limitations of existing point and voxel feature fusion methods, which typically fuse features directly at a single scale and overlook important contextual information across different scales, we propose a multi-scale feature fusion cross-gating module. This module enhances the fusion accuracy between different receptive fields and adapts to the distinct propagation features of points and voxels, thereby improving the overall feature fusion performance;

  • Extensive experiments validate the effectiveness of the proposed point cloud semantic segmentation method, demonstrating its superior performance compared to alternative methods.

The remaining sections of this paper are organized as follows. The second section reviews related research. The third section elaborates on the components of the proposed method. The fourth section presents the details of the experiments conducted. Finally, the fifth section concludes the paper.

Related work

In this section, we will briefly introduce existing research methods on point cloud semantic segmentation. Existing methods can be categorized into four main types based on their research approaches: projection-based, point-based, voxel-based, and fusion-based methods.

Projection-based segmentation

Projection-based methods offer an indirect approach to point cloud semantic segmentation by transforming 3D point clouds into 2D representations for processing. Alonso et al. introduced 3D-MiniNet13, a framework combining 3D and 2D learning layers to capture both local and global features. These features are fed into an FCNN for semantic segmentation. The resulting 2D labels are then re-projected into 3D space and refined with post-processing modules. Similarly, Wang et al. proposed SwinURNet14, which employs spherical projection to transform the point cloud into a distance image. Extracted 2D features are encoded and decoded using a non-square transformer, while a multi-dimensional information fusion module balances semantic differences between 2D feature maps and the 3D feature space. Massa et al. developed a Bayesian multi-projection fusion method15 that enhances robustness by weighting classification probabilities from a basic classifier and further refining results through KNN processing and multi-projection fusion. Yuan et al. proposed a segmentation method16 for large-scale point clouds using a transformer and slot attention mechanism. Their approach projects the point cloud into multi-channel images, where a transformer-based context feature aggregation module extracts global features. The slot attention mechanism then learns the relative relationships between multi-channel features, improving semantic understanding and segmentation accuracy. Although projection-based methods often achieve faster processing speeds, they have limitations. These include reduced capacity for preserving 3D spatial features and challenges such as distortion and occlusion during projection, which can hinder segmentation performance.

Point-based segmentation

Point-based methods focus on capturing feature information at the individual point. Recent advancements in applying neural networks to point clouds have shown significant promise in this domain. Qi et al. pioneered this approach with PointNet9, the first semantic segmentation method for directly processing raw point clouds. PointNet employs MLP and max-pooling layers to extract both point-specific and global features from irregular point sets. However, its ability to capture local feature information is limited. To address this, Li et al. proposed a semantic segmentation method17 using attention transfer learning, which enhances local feature extraction through an attention pooling module. Transfer learning is then applied to optimize segmentation performance with minimal data requirements. Zhan et al. introduced FA-ResNet18, which leverages a residual MLP to deeply aggregate local shape relationships and point features, significantly improving the accuracy of point feature extraction. Inspired by the success of convolution operators in image processing, methods based on point convolution have also emerged. Thomas et al. developed KPConv19, a kernel point convolution method that uses multiple local 3D filters for flexible and effective convolution operations on point clouds. This approach enables unrestricted feature description and learning capabilities. Building on this, Zhao et al. proposed a semantic segmentation network20 using dual-attention KPConv, which integrates channel and spatial attention blocks to refine features adaptively. The method employs multiple attention gates during the decoding process to fuse upsample features, yielding multi-scale perceptual insights. Compared to MLP-based methods, kernel convolution approaches demonstrate superior performance by effectively extracting point-wise features and neighborhood relationships, making them more adept at capturing local and global features in point clouds.

Voxel-based segmentation

Voxel-based methods represent 3D spaces by converting voxelized point clouds into a structured voxel grid. This approach enables the estimation of geometric and feature information within each voxel, providing a robust framework for semantic segmentation. Zhu et al. introduced a model21 leveraging 3D CNNs that utilizes a cylindrical partition representation of 3D point clouds. Their approach integrates asymmetric residual blocks and dimensional decomposition-based context modeling modules to extract higher-order contextual features while minimizing computational overhead. Similarly, Cheng et al. proposed the Bilateral Voxel Transformer (BVT)22, which employs a dual-branch structure to query voxel locations and dynamically update key features through weighted geometric sensing sampling. Bao et al. presented GLSNet++23, a context-dependent voxel segmentation method that combines global and local feature manifold streams to capture multi-scale context and structural information. The method employs spatial context-dependent feature fusion to accurately resolve class membership relationships near voxel boundaries. Building on these advancements, Yang et al. proposed V-SIM24, an approach designed to mitigate the challenges of point density imbalance. V-SIM enhances feature interactions through voxel slicing and uses an interactive attention mechanism to achieve adaptive, self-enhancing voxel feature extraction. While voxel-based methods excel in accuracy and precision, their efficiency is limited by the computational burden of processing empty voxels, which significantly hinders inference speed.

Fusion-based segmentation

To address the limitations of individual methods, researchers have explored fusion-based approaches that combine multiple techniques to enhance point cloud semantic segmentation. Dai et al. proposed PCE25, a fusion method that integrates 2D projection context-embedded features with 3D voxel features. By employing an embedding entanglement strategy, PCE effectively combines these features to improve prediction accuracy. Park et al. presented a fusion method12 that combines sparse 3D convolution and point convolution, using the latter for feature extraction and the former for efficient feature propagation. Their approach further optimizes discretization errors through cross-entropy loss and position-perception loss. Li et al. introduced a multi-scale fusion method26 that incorporates voxel and point data alongside a pyramid decoding module, enhancing feature representation across multiple scales. Similarly, Zhang et al. developed PVCFormer27, which processes voxel data at varying resolutions simultaneously to expand the receptive field. This method employs a cross-attention mechanism to fuse point and voxel features, significantly improving segmentation accuracy and computational efficiency. Xie et al. proposed a fusion method28 leveraging axial movement operations, capturing richer geometric features through upward axial movements around the voxel. This approach integrates point features using an attention mechanism to achieve a more robust feature representation. Fusion-based methods are highly effective at extracting diverse feature information in parallel and integrating it through specialized backbones29. Building on this, this paper proposes an adaptive fusion point cloud semantic segmentation method that leverages multi-scale sparse three-dimensional convolution and point convolution.

Fig. 1
figure 1

The overall network structure of the method proposed in this paper.

Methods

The paper introduces a novel point cloud semantic segmentation method designed for LIDAR, with the overall network architecture depicted in Fig. 1. Initially, the input LIDAR point cloud undergoes preprocessing to generate voxel data based on a multi-scale cylindrical grid partitioning approach. Subsequently, the original point cloud and multi-scale voxel data are fed into two independent feature encoder branches, allowing for parallel processing. In the point cloud processing branch, KPConv kernel point convolution is employed to capture fine-grained point features. Meanwhile, the voxel processing branch utilizes asymmetric IoSL sparse 3D convolution to extract voxel features, obtaining coarse-grained voxel representations at different scales. After feature extraction in both branches, a multi-scale feature fusion cross-gating module is applied. This module incorporates a voxel feature fusion gating mechanism and a cross self-attention fusion mechanism to combine features of varying types and scales adaptively. Finally, point-wise refinement using an MLP is performed to produce the final point cloud semantic segmentation results. The subsequent sections provide a detailed description of each module.

Data pre-processing

The performance of point cloud semantic segmentation is closely related to the size of the receptive field30, where a larger receptive field can capture more extensive global contextual information31. Meanwhile, because LIDAR point clouds have a sparse and uneven density distribution, uniform grid partitioning leads to an unbalanced distribution of points across cells. Therefore, inspired by the cylindrical voxel partitioning method21, we propose a multi-scale cylindrical-coordinate point cloud partitioning pre-processing step, which obtains layered cylindrical grid voxels by processing the point cloud at different scale resolutions.

Fig. 2
figure 2

Multi-scale point cloud cylindrical partition.

In the cylindrical grid division, the mesh size increases with distance in the cylindrical coordinate system so that more distant point clouds are covered while the proportion of non-empty voxels remains roughly uniform21. This ensures a more balanced distribution of point clouds at varying distances. The multi-scale point cloud cylindrical partitioning process is illustrated in Fig. 2. First, we apply the farthest point sampling (FPS) algorithm32 three times to the input point cloud, obtaining three point cloud subsets with different resolutions that contain \(N_1\), \(N_2\), and \(N_3\) points, respectively. Next, the points of the three subsets are transformed from Cartesian to cylindrical coordinates, i.e., \((x,y,z)\) is converted to \((\rho,\theta,z)\), where \(\rho\) is the radius (the distance from the origin to the projection of the point onto the \(xy\)-plane) and \(\theta\) is the azimuthal angle between the \(x\)-axis and that projection. The three-dimensional space is then partitioned into cylindrical volumes; notably, the size of each grid cell grows proportionally with its distance from the origin. Finally, we use the mapping functions \(P\to V\) and \(V\to P\) to convert between points and cylindrical voxels. These mappings enable the transformation of features in both directions, so the extracted coarse-grained voxel features can complement the fine-grained point features and thereby mitigate feature loss and deviation.
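As a concrete illustration, the following minimal Python sketch maps Cartesian points to cylindrical voxel indices (the \(P\to V\) direction). The 480×360×32 grid matches the resolution used in our experiments, but the radial and height ranges, and the hypothetical fps helper in the usage comment, are illustrative assumptions rather than the exact implementation.

```python
# Minimal sketch of the cylindrical partition step, assuming a 480x360x32 grid
# and illustrative range bounds for radius and height.
import numpy as np

def cylindrical_voxel_indices(points_xyz,
                              grid_size=(480, 360, 32),
                              rho_range=(0.0, 50.0),
                              z_range=(-4.0, 2.0)):
    """Map Cartesian points (N, 3) to cylindrical voxel indices (N, 3)."""
    x, y, z = points_xyz[:, 0], points_xyz[:, 1], points_xyz[:, 2]
    rho = np.sqrt(x ** 2 + y ** 2)                  # radius on the xy-plane
    theta = np.arctan2(y, x)                        # azimuth in [-pi, pi]

    cyl = np.stack([rho, theta, z], axis=1)
    lower = np.array([rho_range[0], -np.pi, z_range[0]])
    upper = np.array([rho_range[1],  np.pi, z_range[1]])
    cell = (upper - lower) / np.array(grid_size)    # per-axis voxel size

    # P -> V mapping: each point gets the index of its cylindrical voxel
    idx = np.clip(((cyl - lower) / cell).astype(np.int64),
                  0, np.array(grid_size) - 1)
    return idx                                      # reused later for the V -> P scatter

# Example: the three FPS-subsampled resolutions share the same mapping function.
# pts_n1, pts_n2, pts_n3 = fps(points, N1), fps(points, N2), fps(points, N3)  # hypothetical helper
# voxel_idx_1 = cylindrical_voxel_indices(pts_n1)
```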

Point processing

In the fine-grained point processing branch, we utilize the KPConv19 kernel point convolution to extract feature information from the input point cloud \(P\). Following the methodology established in prior research, we consider an input point \(x_i\in P\) with its corresponding input feature \(f_i\in F_P\). The convolution at a point \(x\in\mathbb{R}^{3}\), with kernel points \(\{x_k \mid k<K\}\subset\mathbb{R}^{K\times 3}\), is defined as:

$$g\left(x\right)=\sum_{x_i\in\mathcal{N}_x}\sum_{k<K} h\left(x_i-x,\,x_k\right)\, f_i\, W_k$$
(1)

Where \(\mathcal{N}_x\) denotes the neighborhood of the central point \(x\), and \(W_k\in\mathbb{R}^{D_{in}\times D_{out}}\) represents the associated convolution weight that maps the input dimension to the output dimension. Here, \(h\left(\cdot\right)\) measures the correlation between a neighboring point \(x_i\) and the kernel point \(x_k\). Notably, the closer a neighboring point lies to a kernel point, the larger the output of the correlation function \(h\). Consequently, the linear correlation function \(h\left(\cdot\right)\) can be expressed as follows:

$$h\left(x_{i}-x,\,x_{k}\right)=\max\left(0,\;1-\frac{\left\|\left(x_{i}-x\right)-x_{k}\right\|}{\sigma}\right)$$
(2)

Where \(\sigma\) represents the influence distance of the kernel point, which is selected according to the density of the input point cloud. In deformable kernel learning, this linear correlation keeps gradient backpropagation well behaved.
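For clarity, a minimal NumPy sketch of Eqs. (1) and (2) at a single query point is given below. The neighborhood search, kernel point positions, and the value of \(\sigma\) are assumed inputs; this is an illustrative sketch rather than the exact KPConv implementation.

```python
# Minimal sketch of the KPConv correlation of Eqs. (1)-(2); kernel point
# positions, sigma, and the neighborhood search are illustrative assumptions.
import numpy as np

def kpconv_at_point(x, neighbors, neighbor_feats, kernel_pts, weights, sigma=0.3):
    """x: (3,) query point; neighbors: (M, 3); neighbor_feats: (M, D_in);
    kernel_pts: (K, 3); weights: (K, D_in, D_out)."""
    rel = neighbors - x                                   # (M, 3) relative positions
    # Eq. (2): linear correlation between each neighbor and each kernel point
    dist = np.linalg.norm(rel[:, None, :] - kernel_pts[None, :, :], axis=-1)  # (M, K)
    h = np.maximum(0.0, 1.0 - dist / sigma)               # (M, K)
    # Eq. (1): weight neighbor features by h and the per-kernel-point matrices
    out = np.einsum('mk,md,kde->e', h, neighbor_feats, weights)
    return out                                            # (D_out,) output feature at x
```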

Voxel processing

For the coarse-grained cylindrical-partition voxel processing branch, this paper employs feature extraction based on three-dimensional sparse convolution. After pre-processing the point cloud, three different resolutions of cylindrical voxel partitioning are obtained, which are used to extract voxel features at small, medium, and large scales, respectively. At each scale, an asymmetric IoSL sparse 3D convolution module is employed for feature extraction. Next, we introduce the IoSL sparse 3D convolution and the asymmetric IoSL sparse 3D convolution module in turn.

Fig. 3
figure 3

Process of IoSL sparse 3D convolution.

IoSL sparse 3D convolution

Sparse 3D convolution is the preferred method for voxel feature extraction. However, existing regular and submanifold sparse convolutions33 process all input sparse data, resulting in a significant increase in sparse features and generating redundant candidate features that blur valuable information34. These limitations can decrease the performance of the model. To address this issue, inspired by focal sparse 3D convolution35, this paper proposes the IoSL sparse 3D convolution based on the importance of space locations.

The IoSL sparse 3D convolution selects feature position inputs by predicting the importance of sparse feature space locations36. The IoSL sparse 3D convolution process is shown in Fig. 3. It begins by computing the importance map \(I^{p}\) for all sparse features within the space. For a given input sparse feature \(T_{p}\), situated at position \(p\) in three-dimensional space with \(c_{in}\) feature channels, the input and output feature spaces are denoted \(P_{in}\) and \(P_{out}\), respectively. When \(P_{in}=P_{out}\), the importance map \(I^{p}\) is obtained by convolving the input features with the kernel weights \(w\in\mathbb{R}^{K^{d}\times c_{in}\times c_{out}}\) and applying a sigmoid; the specific calculation is as follows:

$$I^{p} = \mathrm{Sigmoid}\left(y_{p\in P_{out}}\right) = \mathrm{Sigmoid}\left(\sum_{k\in K^{d}\left(p,P_{in}\right)} w_{k}\cdot T_{\bar{p}_{k}}\right)$$
(3)

Where \(I^{p}\) signifies the feature importance map; \(k\) denotes a discrete position within the kernel space \(K^{d}\); and \(\bar{p}_{k}=p+k\) is the position offset from the feature center \(p\) by \(k\). \(K^{d}(p,P_{in})\) refers to the subset of \(K^{d}\) with vacant positions eliminated, determined by the position \(p\) and the input feature space \(P_{in}\):

$$K^{d}\left(p,P_{in}\right)=\left\{k \mid p+k\in P_{in},\; k\in K^{d}\right\}$$
(4)

After obtaining the feature importance mapping \(\:{I}^{p}\), the model selects significant input features based on the threshold value \(\tau\). The important input feature space \(\:{P}_{im}\) constitutes a subset of the input feature space \(\:{P}_{in}\), encompassing the positions of the most crucial input features, outlined as follows:

$$P_{im}=\left\{p \mid I_{0}^{p}\ge\tau,\; p\in P_{in}\right\}$$
(5)

Where \(\:{I}_{0}^{p}\) denotes the center of the importance prediction map \(\:{I}^{p}\) at position \(\:p\); \(\tau\) represents the set threshold. When \(\tau\) equals 0 or 1, the formula respectively transforms into regular or submanifold sparse convolution.

Within the important input feature space \(P_{im}\), features are expanded dynamically: the output positions around a feature position \(p\) are determined by the dynamic output set \(K_{im}^{d}\left(p\right)\).

$$K_{im}^{d}\left(p\right)=\left\{k \mid p+k\in P_{in},\; I_{k}^{p}\ge\tau,\; k\in K^{d}\right\}$$
(6)

For the remaining unimportant features, the output positions are fixed to their input positions, i.e., they are processed as in submanifold convolution. Deleting them outright, or treating them fully dynamically without preservation, could destabilize training. In this scheme, the output feature space \(P_{out}\) is the union of all dynamically expanded regions of important features and the remaining unimportant positions; the extent of each expanded region is determined adaptively from its input position.

$$P_{out}=\left(\bigcup_{p\in P_{im}} P\left(p,K_{im}^{d}\left(p\right)\right)\right)\cup P_{in/im}$$
(7)

After delineating the significant feature input space \(\:{P}_{im}\) and the feature output space \(\:{P}_{out}\), the explicit calculation formula for IoSL sparse 3D convolution is presented as follows:

$$y_{p\in P_{out}}=\sum_{k\in K_{im}^{d}\left(p\right)} w_{k}\cdot T_{\bar{p}_{k}}$$
(8)
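The following dense-tensor sketch illustrates the selection step of Eqs. (3) and (5) in PyTorch. A real implementation operates on sparse voxel tensors; the 3×3×3 importance kernel, the dense layout, and the threshold value are illustrative assumptions.

```python
# Dense-tensor sketch of the IoSL selection step (Eqs. (3) and (5)); a real
# implementation would operate on sparse voxel tensors.
import torch
import torch.nn.functional as F

def iosl_select(voxel_feats, importance_weight, tau=0.5):
    """voxel_feats: (1, C, D, H, W) with zeros at empty voxels;
    importance_weight: (1, C, 3, 3, 3) convolution kernel."""
    # Eq. (3): importance map from a convolution followed by a sigmoid
    importance = torch.sigmoid(F.conv3d(voxel_feats, importance_weight, padding=1))
    occupied = voxel_feats.abs().sum(dim=1, keepdim=True) > 0   # non-empty voxels only
    # Eq. (5): keep positions whose predicted importance exceeds tau
    important_mask = (importance > tau) & occupied
    return important_mask  # important positions get dilated outputs; the rest stay submanifold

# tau = 0 dilates every occupied voxel (regular sparse convolution behaviour);
# tau = 1 keeps outputs only at input positions (submanifold behaviour).
```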

Asymmetric IoSL sparse 3D convolution module

In practical applications, objects in natural scenes exhibit irregular shapes and contours, so accurately segmenting object boundaries remains a major challenge37,38. Therefore, to segment boundaries accurately in every direction, we use asymmetric upsample and downsample convolution blocks in the cylindrical-partition voxel processing branch, extracting as many directional structural geometric features as possible through left-right asymmetric convolution operations. Figure 4 illustrates the network structure of the asymmetric IoSL sparse 3D convolution module; the AD and AU blocks depict the downsample and upsample networks, respectively. Within the downsample network, the voxel input from the cylinder partition splits into two branches: one applies a 3×1×3 convolution kernel followed by a 1×3×3 kernel, while the other applies a 1×3×3 kernel followed by a 3×1×3 kernel, after which their outputs are aggregated. The aggregated result is then processed by the downsample IoSL sparse 3D convolution. In the upsample network, inputs first pass through the upsample IoSL sparse 3D convolution to keep the number of feature channels consistent between the upsample and downsample layers. The output of the downsample network is then aggregated with the corresponding layer and fed into the residual branch for further processing.

Fig. 4
figure 4

The network structure of the asymmetric IoSL sparse 3D convolution module.

The asymmetric IoSL sparse 3D convolution module preserves the feature information in each direction to the maximum extent by constructing different convolution branches in the horizontal and vertical directions, thereby enhancing the feature extraction capability. In addition, compared with conventional convolution modules, the asymmetric IoSL sparse 3D convolution module greatly reduces computational cost and improves segmentation efficiency.
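To make the branch layout in Fig. 4 concrete, the sketch below approximates the asymmetric branch pair with dense PyTorch convolutions. The actual module uses IoSL sparse 3D convolutions, so this only illustrates the 3×1×3 / 1×3×3 kernel ordering and the branch aggregation, with assumed channel sizes.

```python
# Dense approximation of the asymmetric branch pair in Fig. 4; the paper uses
# sparse convolutions, so this nn.Conv3d sketch is only illustrative.
import torch
import torch.nn as nn

class AsymmetricBranchSketch(nn.Module):
    def __init__(self, c_in, c_out):
        super().__init__()
        # Branch A: 3x1x3 followed by 1x3x3
        self.branch_a = nn.Sequential(
            nn.Conv3d(c_in, c_out, kernel_size=(3, 1, 3), padding=(1, 0, 1)),
            nn.Conv3d(c_out, c_out, kernel_size=(1, 3, 3), padding=(0, 1, 1)),
        )
        # Branch B: 1x3x3 followed by 3x1x3 (the mirrored order)
        self.branch_b = nn.Sequential(
            nn.Conv3d(c_in, c_out, kernel_size=(1, 3, 3), padding=(0, 1, 1)),
            nn.Conv3d(c_out, c_out, kernel_size=(3, 1, 3), padding=(1, 0, 1)),
        )

    def forward(self, x):
        # Aggregate the two asymmetric branches before the downsample IoSL step
        return self.branch_a(x) + self.branch_b(x)
```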

Fig. 5
figure 5

The structure of multi-scale voxel feature fusion gating mechanism.

Multi-scale feature fusion cross-gating module

After point processing and voxel processing produce point features and voxel features at different scales, all features are fed into the multi-scale fusion module for feature fusion. The proposed multi-scale feature fusion cross-gating module consists of two main parts: the multi-scale voxel feature fusion gating mechanism and the cross self-attention fusion mechanism. The multi-scale voxel feature fusion gating mechanism fuses the coarse-grained voxel features across scales, while the cross self-attention fusion mechanism fuses the coarse-grained voxel features with the fine-grained point features. Next, we introduce these two mechanisms separately.

Multi-scale voxel feature fusion gating mechanism

Figure 5 illustrates the network structure of the multi-scale voxel feature fusion gating mechanism. At each scale, the feature mapping \(\:{T}_{i}\) corresponds to the voxel feature \(\:{F}_{{V}_{i}}\). The calculation formula for feature mapping \(\:{T}_{i}\) is as follows:

$$T_{i}=\mathrm{sigmoid}\left(W^{T_{i}}\cdot F_{V_{i}}\right)$$
(9)

Where \(W^{T_{i}}\) denotes the learnable weight matrix of a linear function, and \(T_{i}\in\left[0,1\right]\) gates the transfer of voxel features at each scale. For the \(I\) voxel scales, the fused features are calculated as follows:

$$\:\widetilde{F}_{V} = \sum\limits_{{i = 1}}^{I} {split\left[ {softmax\left( {\sum\limits_{{i = 1}}^{I} {T_{i} } } \right)} \right]_{i} \cdot \:F_{{V_{i} }} }$$
(10)

Where \(\widetilde{F}_{V}\) represents the fused voxel feature and \(\cdot\) denotes element-wise multiplication. The probability weights of the feature channels at each scale are obtained through softmax normalization; these weights are then multiplied with the corresponding feature channels and accumulated to derive the final fused voxel features.
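A minimal PyTorch sketch of the gating of Eqs. (9) and (10) is shown below, assuming the multi-scale voxel features have already been aligned to a common set of positions with a shared channel dimension; the exact split arrangement of Eq. (10) is simplified here to a per-scale softmax.

```python
# Minimal sketch of the voxel feature fusion gating of Eqs. (9)-(10);
# feature alignment across scales and the channel size are assumptions.
import torch
import torch.nn as nn

class VoxelGateFusionSketch(nn.Module):
    def __init__(self, channels, num_scales=3):
        super().__init__()
        # One linear gate W^{T_i} per scale, Eq. (9)
        self.gates = nn.ModuleList([nn.Linear(channels, channels) for _ in range(num_scales)])

    def forward(self, voxel_feats):
        """voxel_feats: list of I tensors, each (N, C), mapped to a common grid."""
        t = torch.stack([torch.sigmoid(g(f)) for g, f in zip(self.gates, voxel_feats)])  # (I, N, C)
        # Eq. (10): softmax over scales gives per-channel probability weights
        weights = torch.softmax(t, dim=0)                          # (I, N, C)
        fused = sum(w * f for w, f in zip(weights, voxel_feats))   # element-wise, summed over scales
        return fused                                               # (N, C) fused voxel feature
```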

Cross self-attention fusion mechanism

Fusing fine-grained point features with coarse-grained voxel features can enhance the model’s accuracy and robustness. However, direct fusion may result in erroneous feature fusion because the two feature types propagate differently39. To address this, we employ a feature fusion mechanism based on cross self-attention40 to enhance the fusion of point and voxel feature information.

Fig. 6
figure 6

The structure of cross self-attention fusion mechanism.

Figure 6 illustrates the network structure of cross self-attention fusion mechanism. Initially, the matrices \(\:Q\), \(\:K\), and \(\:V\) are computed separately by employing a linear function on the fused feature \(\:{F}_{PV}\), where \(\:F_{{PV}} = F_{P} \oplus \:\widetilde{F}_{V}\). Subsequently, the \(\:R\) matrix is derived by embedding the local point coding features41. The calculation formula is as follows:

$$\:\left(Q,K,V\right)={F}_{PV}\cdot\:\left({W}^{Q},{W}^{K},{W}^{V}\right)$$
(11)
$$\:R={R}_{ec}\cdot\:{W}^{R}$$
(12)

Where \(\:Q\), \(\:K\), and \(\:V\) represent the query matrix, key matrix, and value matrix, respectively, while \(\:R\) denotes the point coding matrix. \(\:{W}^{Q},{\:W}^{K},{\:W}^{V}\), and \(\:{W}^{R}\) denote the weights of their respective matrices, and \(\:{R}_{ec}\) represents the embedded local point coding feature.

Next, the attention score is computed by multiplying the \(Q\) matrix with both the \(K\) matrix and the \(R\) matrix. The attention weights are then obtained using the softmax function and multiplied with the \(V\) matrix to derive the final fused feature.

$$\:\widetilde{F}_{{PV}} = mlp\left( {\left( {softmax\left( {\frac{{Q \cdot \:R^{T} + Q \cdot \:K^{T} }}{{\sqrt {dt} }}} \right)} \right) \cdot \:V} \right)$$
(13)

Where \(dt\) denotes the feature dimension of the key vectors, and \(\widetilde{F}_{PV}\) represents the final fused feature of the point and voxel branches.
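A minimal PyTorch sketch of Eqs. (11) to (13) is given below. It treats \(\oplus\) as element-wise addition, takes the embedded point coding \(R_{ec}\) as a pre-computed tensor, and uses full attention over all points purely for clarity; a practical implementation would restrict attention to local neighborhoods.

```python
# Minimal sketch of the cross self-attention fusion of Eqs. (11)-(13);
# the construction of the point-coding feature R_ec is assumed elsewhere.
import torch
import torch.nn as nn

class CrossAttentionFusionSketch(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.w_q = nn.Linear(dim, dim)
        self.w_k = nn.Linear(dim, dim)
        self.w_v = nn.Linear(dim, dim)
        self.w_r = nn.Linear(dim, dim)   # weight for the point-coding matrix R
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, f_p, f_v_fused, r_ec):
        """f_p: (N, C) point features; f_v_fused: (N, C) fused voxel features
        gathered to the points; r_ec: (N, C) embedded local point coding."""
        f_pv = f_p + f_v_fused                                      # F_PV (oplus as addition)
        q, k, v = self.w_q(f_pv), self.w_k(f_pv), self.w_v(f_pv)    # Eq. (11)
        r = self.w_r(r_ec)                                          # Eq. (12)
        dt = q.shape[-1]
        # Eq. (13): scores combine Q.R^T and Q.K^T before softmax scaling
        scores = (q @ r.transpose(0, 1) + q @ k.transpose(0, 1)) / dt ** 0.5
        attn = torch.softmax(scores, dim=-1)
        return self.mlp(attn @ v)                                   # (N, C) final fused feature
```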

Point-wise refinement module

After obtaining the fused feature \(\widetilde{F}_{PV}\), it is projected onto the corresponding points using the voxel mapping function. At this stage, each point’s features encompass abundant point-specific details as well as extensive global context information. An MLP layer42 then computes the probability of each category at every point from the fused feature, and each point is assigned the category with the maximum probability, thereby achieving point-by-point semantic segmentation of the point cloud.
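A minimal sketch of such a refinement head is shown below; the two-layer MLP structure and the channel sizes are assumptions, since the text only specifies that an MLP predicts per-point class probabilities.

```python
# Minimal sketch of the point-wise refinement head, assuming C feature channels
# and num_classes output categories.
import torch
import torch.nn as nn

class PointRefinementHead(nn.Module):
    def __init__(self, channels, num_classes):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels), nn.ReLU(),
            nn.Linear(channels, num_classes),
        )

    def forward(self, fused_point_feats):
        logits = self.mlp(fused_point_feats)         # (N, num_classes)
        probs = torch.softmax(logits, dim=-1)        # per-class probability for each point
        return probs.argmax(dim=-1)                  # per-point semantic label
```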

Experiments results and analyses

In this section, we will evaluate the performance of our approach using different datasets. We begin by describing the experimental details and evaluation indexes. Next, we present a comparative analysis of the proposed method and other approaches using the SemanticKITTI and nuScenes datasets. Finally, we provide an in-depth analysis of the results, offering insights into the proposed method’s performance.

Implementation details and evaluation indexes

These experiments were conducted using an NVIDIA RTX 3090 GPU. For both the SemanticKITTI and nuScenes datasets, we segmented the input point cloud into a three-dimensional cylindrical partition with dimensions of 480×360×32. During the training phase, the model was optimized using the Adam optimizer with default parameter settings. The maximum number of epochs was set to 150, and the initial learning rate was 0.001.

To evaluate the performance of the proposed method, we use the mean intersection over union (mIoU)43 as the main evaluation metric. The specific calculation of mIoU is as follows:

$$\:mIoU=\frac{1}{C}\sum\:_{c=1}^{C}\frac{{TP}_{c}}{{TP}_{c}+{FP}_{c}+{FN}_{c}}$$
(14)

Where \(C\) is the number of semantic categories, and \(TP_{c}\), \(FP_{c}\), and \(FN_{c}\) represent the true positives, false positives, and false negatives for semantic category \(c\), respectively. A higher mIoU value indicates better model performance.
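The following sketch computes Eq. (14) directly from per-point predicted and ground-truth label arrays; skipping classes that appear in neither array is a common convention and an assumption here.

```python
# Minimal sketch of the mIoU metric in Eq. (14), computed from predicted and
# ground-truth label arrays (class indices 0..C-1).
import numpy as np

def mean_iou(pred, target, num_classes):
    ious = []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (target == c))    # true positives for class c
        fp = np.sum((pred == c) & (target != c))    # false positives for class c
        fn = np.sum((pred != c) & (target == c))    # false negatives for class c
        denom = tp + fp + fn
        if denom > 0:                               # skip classes absent from both arrays
            ious.append(tp / denom)
    return float(np.mean(ious))
```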

Dataset

SemanticKITTI

The SemanticKITTI dataset, collected using the Velodyne HDL-64 LIDAR, is a large-scale driving scene dataset designed for point cloud segmentation. It is tailored for specific tasks such as semantic segmentation and panoramic segmentation. The dataset comprises 22 sequences and 19 semantic classes. Sequences 00 to 10 constitute the training set (with sequence 08 held out as the validation set), and sequences 11 to 21 constitute the test set.

nuScenes

The nuScenes dataset is collected using the Velodyne HDL-32E LIDAR and is specifically designed for autonomous driving with sparse point cloud segmentation, which introduces additional challenges for LIDAR semantic segmentation tasks. The dataset comprises 1,000 scene sequences categorized into 16 standard semantic classes. Out of these, the first 700 scene sequences are designated for the training set, the next 150 scene sequences for the validation set, and the remaining 150 scene sequences for the test set.

Comparative experimental results

Comparison on SemanticKITTI

The comparative experimental results between our method and others on the SemanticKITTI dataset are presented in Table 1. The selected comparison methods include projection-based, point-based, voxel-based, and fusion-based methods. From the data in the table, it is evident that our method achieves the most accurate semantic segmentation results on the SemanticKITTI dataset, outperforming other methods in terms of the mIoU evaluation metric. Compared to projection-based methods such as 3D-MiniNet13 and SalsaNext45, our method demonstrates a significant improvement in mIoU, with increases of 9.8% and 6.1%, respectively. Similarly, compared to point-based methods like LG-Net46 and HCT-Kpconv47, our approach achieves mIoU improvements of 9.3% and 1.5%, respectively. Furthermore, our method also surpasses voxel-based and fusion-based techniques, including MarS3D48, Memoryseg49, FusionNet50, and PCSCNet12, with mIoU improvements of 12.9%, 7.3%, 4.3%, and 2.9%, respectively. Particularly noteworthy is that our method demonstrates significantly better segmentation performance for dynamic objects such as cars and trucks compared to other methods.

Table 1 The method proposed in this paper compares with the results of existing segmentation methods on the SemanticKITTI test set. Please note that all results are from the corresponding literature.

To more intuitively reflect the point cloud semantic segmentation quality of our proposed method, we present a visual comparison of segmentation results in Fig. 7. The test sequence selected for visualizing the segmentation results is the validation sequence of the SemanticKITTI dataset. In Fig. 7, column (a) depicts the ground-truth segmentation of the point cloud, column (b) displays the segmentation results obtained using the FusionNet method, column (c) shows the segmentation results of the PCSCNet method, and column (d) presents the point cloud segmentation produced by our method. As shown in Fig. 7, both FusionNet and PCSCNet exhibit poor segmentation performance, particularly when dealing with complex real-world environments, where significant segmentation errors occur. These methods struggle to accurately segment visually similar objects, such as traffic poles and trees, leading to many mis-segmentation cases. This issue arises because traffic poles and trees often share similar characteristics in color and texture, making it challenging to distinguish between them. Additionally, dynamic objects, such as moving vehicles, are frequently misclassified as part of the static environment, further exacerbating segmentation errors. In contrast, the method presented in this paper demonstrates clear advantages in segmentation accuracy when applied to various targets in real-world environments. Specifically, for segmentation tasks involving moving vehicles, trees, roads, and other objects, the proposed method segments effectively and provides more precise results. Of particular note is its strong performance in segmenting dynamic targets, as it can accurately capture the motion and morphological characteristics of vehicles. This ability allows the method to minimize interference from the static background and significantly improve segmentation accuracy. These advantages enable the proposed method to reduce error rates and enhance overall performance when dealing with segmentation tasks in complex scenes.

Fig. 7
figure 7

Comparison of segmentation visual results of the sequence in SemanticKITTI dataset: (a) Visualization of ground truth values; (b) the segmentation visualization results of FusionNet; (c) the segmentation visualization results of PCSCNet; (d) Visualization of segmentation results of our method.

Comparison on nuScenes

The comparative experimental results between the proposed method and others on the nuScenes dataset are presented in Table 2. It is evident from the data that the proposed method demonstrates superior performance on the nuScenes dataset, with the highest mIoU value. Our method’s mIoU shows a notable increase of 7.5%, 2.0%, 1.0%, 0.2%, and 1.1% compared to RangeNet++51, PolarNet52, PCSCNet12, ACPNet53, and SalsaNext45, respectively. Notably, the method presented in this paper excels at segmenting both dynamic and static targets, particularly those that are difficult for other methods to handle, such as motorcycles, buses, and bicycles. When compared to alternative approaches, the IoU of these moving object categories shows a significant increase, further demonstrating the advantages and robustness of the proposed method in managing complex scenes. In addition, it is worth noting that the nuScenes dataset features a sparser point cloud distribution than the SemanticKITTI dataset. Despite this, the proposed method still achieves the best segmentation performance, outpacing other point cloud segmentation methods, which struggle in sparse point cloud environments. This result further demonstrates that the proposed method can maintain high semantic segmentation performance even under low point cloud density.

Table 2 The results of the proposed method and other segmentation methods on nuScenes validation set.

Efficiency studies

In this section, we analyze the computational complexity and runtime performance of the proposed method. Evaluating efficiency provides insights into the method’s ability to handle large-scale point cloud data and required computational resources for deployment54. Table 3 presents a comparison of the efficiency of our method against other approaches on the SemanticKITTI dataset.

Table 3 compares the parameters, multiply-accumulate operations (MACs), latency, and mIoU of various methods on the SemanticKITTI dataset under consistent evaluation criteria. The data highlights key trade-offs between model size, computational efficiency, and segmentation performance. Compared to small models with low latency, such as PolarNet52, SalsaNext45, and LG-Net46, the proposed method shows a significant improvement in mIoU, with increases of 11.3%, 6.1%, and 9.3%, respectively. Additionally, it uses significantly fewer parameters than larger models like SqueezeSegV355, which has 26.2 M parameters, thanks to the asymmetric IoSL sparse 3D convolution module. Furthermore, while the average MACs of other comparison methods is 164.4G, the proposed method only requires 92.7G, a reduction of 71.7G compared to the average. This demonstrates that the proposed method consumes fewer computing resources. Finally, while the proposed method is comparable to PCSCNet12 in terms of model size and runtime, it achieves a notable improvement in segmentation performance, with a mIoU increase of 2.9%.

The comparative analysis of operational efficiency demonstrates that the proposed method achieves the highest segmentation accuracy while maintaining low computational complexity. This makes it well-suited for deployment in resource-constrained environments, as it requires significantly fewer computational resources compared to other methods.

Table 3 The efficiency analysis results of each method on the SemanticKITTI test set. Please note that all results are from the corresponding literature.

Ablation experimental results

In this section, we will conduct comprehensive ablation experiments on various components of the proposed method to investigate their impact on the overall method performance.

Table 4 Ablation experiment results (check marks in the table indicate that the method includes this module).

Asymmetric IoSL sparse 3D convolution module

To evaluate the impact of the proposed asymmetric IoSL sparse 3D convolution module on segmentation performance, we ablated this module while retaining the other components of our method: the traditional submanifold sparse 3D convolution module was applied directly to extract voxel features from the cylinder grid partition. This modified method is denoted M1. The results of ablation experiment M1 on the SemanticKITTI dataset are presented in Table 4. The comparative data clearly show that integrating the asymmetric IoSL sparse 3D convolution module not only improves the mIoU by 2.7%, but also reduces the number of parameters in the network model by 2.4 M. In addition, the table lists the three categories most affected by the asymmetric IoSL sparse 3D convolution module, namely car, person, and pole. Compared with M1, which lacks the asymmetric IoSL sparse 3D convolution module, the IoU of the proposed method increases by 3.1%, 5.5%, and 5.3%, respectively. This further shows that the asymmetric IoSL sparse 3D convolution module not only reduces the extraction of redundant features but also enhances the extraction of key features.

To facilitate an intuitive comparison of ablation performance, we selected partial field point clouds from sequence 08 in the SemanticKITTI dataset for visual segmentation comparison, depicted in Fig. 8. The visual comparison clearly shows that M1, which excludes the asymmetric IoSL sparse 3D convolution module, exhibits limited segmentation capability for distant objects like trees and cars within the environment. Ablation experiment M1 underscored the effectiveness of the proposed asymmetric IoSL sparse 3D convolution module. This module enhances the characterization of target distributions by extracting asymmetric sparse features horizontally and vertically, prioritizing critical features through feature importance analysis to mitigate the redundancy of irrelevant features.

Multi-scale feature fusion cross-gating module

To better understand how the multi-scale feature fusion cross-gating module influences the overall segmentation performance of the method, we conducted ablation experiments on two specific mechanisms: M2, which focuses on the multi-scale voxel feature fusion gating mechanism, and M3, which explores the cross self-attention point and voxel feature fusion mechanism within the module.

In ablation experiment M2, we retained all other modules of the method while ablating the multi-scale voxel feature fusion gating mechanism. Instead, we directly fused the extracted voxel features of the cylinder mesh at three different scales. The results of this ablation are presented in Table 4. The ablation study results reveal the significant impact of the multi-scale voxel feature fusion gating mechanism on the method’s performance. Removing this mechanism led to a 2.9% reduction in mIoU, underscoring its importance. For dynamic objects with relatively distinct spatial structures, such as cars and pedestrians, the mechanism effectively aggregated key features across different scales, resulting in IoU improvements of 1.7% and 1.4%, respectively. For fixed objects with elongated spatial structures, such as bar-like classes, the mechanism maintained segmentation accuracy and integrity by enhancing the distinction between features of different classes, achieving a 2.4% increase in IoU. Additionally, the number of parameters was reduced by 1.3 M, demonstrating improved efficiency. These findings validate the effectiveness of the multi-scale voxel feature fusion gating mechanism. By addressing feature disparities across scale resolutions and selectively integrating critical multi-scale features, it significantly enhances the semantic segmentation performance of point clouds.

In ablation experiment M3, we ablated the cross self-attention point and voxel feature fusion mechanism. Specifically, we directly fused the corresponding point features with the voxel features, leaving the other modules unchanged. The results, presented in Table 4, demonstrate that the proposed fusion mechanism significantly enhances the aggregation accuracy of point and voxel features, leading to a 1.7% increase in mIoU. Notably, in three representative categories (cars, pedestrians, and poles), the mechanism achieved IoU improvements of 0.8%, 1.1%, and 1.6%, respectively. Despite these accuracy gains, the number of parameters remains largely unchanged. The findings from ablation experiment M3 further validate the effectiveness of the fusion mechanism. By accounting for the distinct propagation characteristics of point and voxel features, the mechanism significantly improves integration accuracy while minimizing the risk of erroneous fusion. This highlights its contribution to the overall performance of the method.

Fig. 8
figure 8

Comparison of segmentation visual results of sequence 06 in SemanticKITTI dataset: (a) Visualization of segmentation results of M1; (b) Visualization of segmentation results of our method.

Fig. 9
figure 9

Comparison of segmentation visual results of sequence 06 in SemanticKITTI dataset: (a) Visualization of segmentation results of M2; (b) Visualization of segmentation results of M3; (c) Visualization of segmentation results of our method.

To better compare the performance of the ablation experiments M2 and M3, selected partial field point clouds from sequence 08 in the SemanticKITTI dataset were visually compared for segmentation results, as depicted in Fig. 9. The comparison clearly illustrates that segmentation performance declines when the multi-scale feature fusion cross-gating module is omitted. This omission results in varying degrees of false segmentation, particularly affecting dynamic targets like cars and smaller targets such as street trees. Therefore, it can be concluded that the multi-scale feature fusion cross-gating module enhances the accuracy of feature fusion across different types and scales, thereby improving the overall accuracy and robustness of the method.

Experimental discussion and analysis

The superior performance of the proposed method is demonstrated through the comparison and ablation results on the SemanticKITTI and nuScenes datasets, which show that the proposed method outperforms other methods on both datasets. Firstly, on the SemanticKITTI dataset, the proposed method achieves high-precision and robust segmentation, especially in complex environments. This improvement is largely attributed to the multi-scale feature fusion cross-gating module, which enhances the segmentation of dynamic objects with distinct spatial structures, such as cars and people. Furthermore, for static objects with elongated spatial structures, the gating mechanism plays a crucial role in effectively aggregating key semantic features, thereby preserving the structural integrity of the environment, particularly for objects like trees and poles. Additionally, the experiments on the SemanticKITTI dataset demonstrate that the proposed method significantly improves LIDAR point cloud segmentation, even in complex scenes. Despite challenges such as illumination variations and dynamic targets, our method consistently outperforms current mainstream methods in terms of accuracy. Secondly, the comparative experimental results on the nuScenes dataset clearly demonstrate that the proposed method outperforms other methods in high-precision segmentation, even in complex environments with sparse point cloud distributions. This is due to the asymmetric IoSL sparse 3D convolution module, which effectively addresses the challenges posed by sparse point clouds. Specifically, the module fuses global and local context features at different scales, significantly alleviating the feature loss problem common in sparse point clouds. Therefore, when applied to the nuScenes dataset, with its sparse point cloud distribution and complex scenes, the proposed method still achieves high segmentation performance. Finally, the efficiency comparison between the proposed method and other approaches shows that the proposed method achieves the highest segmentation accuracy while keeping computational complexity low. This advantage primarily arises from the IoSL sparse network, which identifies key features by pre-computing their location importance. This pre-selection process reduces the number of network parameters, further improving computational efficiency. As a result, the proposed method is easy to deploy in complex environments and demonstrates greater robustness.

While the proposed method demonstrates excellent segmentation performance and robustness, it has some potential limitations. Specifically, it has only been tested on public datasets, so its performance in real-world scenarios has not been verified. Additionally, as segmentation accuracy improves, the complexity and computational cost of the network significantly increase. This presents challenges in real-world environments where computational resources may be limited, potentially affecting both efficiency and accuracy. Nevertheless, the experimental results of this paper indicate that the proposed method generalizes well to unknown data. When compared to the most advanced methods, our approach shows significant improvements in both accuracy and robustness.

Discussion

In this paper, we propose a LIDAR point cloud semantic segmentation method based on point convolution and sparse 3D convolution adaptive fusion. The method utilizes an asymmetric IoSL sparse 3D convolution module, which enhances the sparse learning performance of key voxel features by prioritizing the prediction of feature location importance. Following this, a multi-scale feature fusion cross-gating module is employed to adaptively fuse fine-grained point features with coarse-grained voxel features. This improves the accuracy of feature fusion across different types and scales, effectively extracting contextual semantic information. Experimental validation on the SemanticKITTI and nuScenes datasets demonstrates that, compared to other methods, the proposed model achieves state-of-the-art segmentation accuracy.

In the future, we will continue to explore more precise and reliable improvements, utilizing model compression and lightweight techniques to reduce computational complexity and enhance algorithm efficiency. Furthermore, we plan to extend the method from public datasets to real-world environments, ensuring optimal performance while significantly improving runtime. This will contribute to advancing the real-time processing applications of large-scale LIDAR point cloud semantic segmentation.