TY  - EJOU
AU  - Wu, Aihua
AU  - Huang, Chenlu

TI  - PointNMSA: An Improved PointNeXt Network with Non-Local Multi-Scale Aggregation for 3D Point Cloud Semantic Segmentation
T2  - Computers, Materials & Continua

PY  - 
VL  - 
IS  - 
SN  - 1546-2226

AB  - Three-dimensional (3D) point cloud semantic segmentation is a core task in indoor scene understanding, providing detailed semantic information about spatial structures and object categories in indoor environments. Although deep learning methods have made steady progress in recent years, accurately segmenting complex indoor scenes remains challenging because point clouds are unordered and exhibit large variations in scale. Most existing networks have limited capability for multi-scale feature aggregation and struggle to balance local geometric details with global semantic context. Hierarchical downsampling exacerbates these issues, since it often discards fine-grained structural information. Moreover, restricting feature interaction to local neighborhoods may limit the capture of non-local semantic dependencies in complex indoor scenes. To address these limitations, we propose PointNMSA (PointNeXt with Non-local Multi-Scale Aggregation), an improved semantic segmentation network built upon the PointNeXt backbone. A Multi-Scale Feature Enhancement (MSFE) module, introduced in the decoding stage, fuses features from different encoding levels and then refines the fused features to produce more stable multi-scale representations that preserve geometric details across scales. In addition, a Convolution-Attention Mixing (CA-Mix) module integrates local spatial structures and non-local contextual dependencies via dual-stream aggregation and multi-dimensional attention fusion, thereby enabling more discriminative feature representations. Experiments on the Stanford Large-Scale 3D Indoor Spaces (S3DIS) benchmark demonstrate the effectiveness of PointNMSA. On the Area 5 test split, PointNMSA achieves a mean intersection over union (mIoU) of 65.10%, outperforming the PointNeXt baseline by 1.59 percentage points. The parameter count grows noticeably, from 3.16M to 8.67M, but inference latency rises only from 42.24 ms to 45.18 ms, indicating a favorable trade-off between segmentation accuracy and computational efficiency. Additional cross-dataset experiments on ScanNet further verify that PointNMSA maintains stable gains under different indoor scene distributions. These gains suggest that PointNMSA provides a more robust and generalizable solution for semantic segmentation in large-scale indoor environments with complex structural layouts.
KW  - 3D point cloud semantic segmentation; indoor scene understanding; multi-scale feature aggregation; non-local context integration; PointNeXt

DO  - 10.32604/cmc.2026.078692