Open Access
ARTICLE
Attention and Mamba Based Iterative Registration Network for Low-Overlap and Large-Scale Point Cloud
1 School of Astronomy and Space Science, University of Science and Technology of China, Hefei, China
2 Nanjing Astronomical Instruments Research Center, Chinese Academy of Sciences, Nanjing, China
3 CAS Nanjing Astronomical Instruments Co., Ltd., Nanjing, China
* Corresponding Author: Qingsheng Zhu. Email:
Computers, Materials & Continua 2026, 88(2), 55 https://doi.org/10.32604/cmc.2026.081695
Received 06 March 2026; Accepted 27 April 2026; Issue published 15 June 2026
Abstract
Point Cloud Registration (PCR) is a basic task in computer vision, mobile robotics, and autonomous driving. PCR faces two main challenges: insufficient registration performance in low-overlap scenarios and high computational resource consumption in large-scale point cloud scenarios. Most recent PCR methods are Transformer-based, and Transformers have quadratic computational complexity, so their computational cost grows rapidly with large-scale point cloud data. To address these problems, an iterative PCR method named Attention and Mamba Based Iterative Registration Network (AMBIR) is proposed.
Point cloud registration (PCR) is a basic task in computer vision, mobile robotics [1] and autonomous driving [2]. The task of PCR is to estimate an optimal transformation in the special Euclidean group $SE(3)$ that aligns two point clouds.
With the development of deep learning, especially the proposal of the attention mechanism [5], the Transformer model has been widely used in the PCR task. An early Transformer-based method is CoFiNet [6], which proposes a coarse-to-fine registration network. At the coarse scale, it learns to match down-sampled nodes whose vicinity points share more overlap; at the fine scale, it refines the correspondences within the overlap area of the matched patches through an adaptive matching module. Further work, such as GeoTransformer [7], improves the ability of super-point discrimination by leveraging geometric relationships. CAST [8] designs a consistency-aware spot-guided Transformer, including a spot-guided cross-attention module to avoid interference from irrelevant areas and a consistency-aware self-attention module to enhance matching capabilities with geometrically consistent correspondences. The Transformer model can effectively model global spatial relationships and feature correspondences between point clouds, outperforming traditional methods based on geometric optimization and conventional CNNs.
While these Transformer or attention-based methods have shown strong performance, they still face two major shortcomings:
(1) In low-overlap scenarios, the small overlap area makes it more likely that features which appear similar but actually belong to different areas are extracted and matched during registration, leading to a decline in registration performance;
(2) Transformer-based methods inherently have quadratic computational complexity, so their computational cost increases rapidly on large-scale point cloud data.
Recent works try to address shortcoming (1) using an iterative method. PEAL [9] introduces a post-processing method to refine the registration results, establishing an iterative approach by employing the same network as [7] repeatedly to demonstrate enhanced performance in low-overlap scenarios. AMR [10] considers the fact that the priors become increasingly accurate throughout the refinement steps, and proposes an iterative refinement network to leverage the knowledge of the overlap area, tailored for the low-overlap challenge in PCR. However, the iterative method requires multiple training iterations, which severely increases computational overhead.
State space models (SSMs) [11], especially the Mamba model [12], demonstrate extraordinary performance at efficiently capturing long-range contextual dependencies in sequence modeling tasks. Mamba leverages a linear-complexity state-space model to approximate global context, enhancing efficiency and scalability for long sequences and enabling it to address shortcoming (2). Its global receptive field and linear runtime enable fast, low-cost inference, ideal for large-scale or real-time applications. However, Mamba obtains its linear complexity only on sequential data. As 3-D data, point clouds exhibit spatial disorder and irregularity, so they require serialization before processing with Mamba.
To solve both shortcomings simultaneously and make the PCR network applicable to both low-overlap and large-scale point cloud scenarios, the Attention and Mamba Based Iterative Registration Network (AMBIR) is proposed, inspired by CAST [8], AMR [10] and Mamba [12]. AMBIR leverages attention mechanisms to suppress interference from irrelevant regions, an iterative model to extract features in low-overlap scenarios, and the Mamba model to reduce its computational complexity to linear.
The effectiveness of this work stems from the following contributions:
• An iterative model that progressively learns overlap knowledge from prior registration to ground-truth alignment is incorporated, overcoming the performance degradation in low-overlap registration scenarios.
• A serialization method that converts 3-D point clouds into linear sequences is proposed to apply unordered and spatially irregular 3-D point cloud data to Mamba. While achieving linear computational complexity, serialization ensures that points at the same position in the sequence correspond spatially by leveraging prior information to pre-align and uniformly sort the two point clouds, thereby improving the subsequent registration performance.
• A Mamba model with an overlap-driven soft gating mechanism and a bidirectional architecture is introduced. This model mitigates the computational resource consumption of partial attention mechanisms by achieving linear complexity. While achieving efficient global long-range feature aggregation, it endows the model with strong robustness to noise in non-overlapping regions through an implicit filtering mechanism.
2.1 Transformer-Based PCR Method
Transformer-based PCR methods leverage the strong data-driven ability of the Transformer architecture for PCR. Numerous studies have incorporated encoder-decoder frameworks and attention mechanisms, significantly improving registration accuracy. In addition to the methods discussed in Section 1, OIF-Net [13] proposed a singular-intrinsic-point-based positional encoding approach for PCR networks. It employed a differentiable optimal transport layer to establish correspondences, which were then used to normalize each point for positional encoding, effectively eliminating issues arising from differing reference frames between the two point clouds. Additionally, it mitigated feature ambiguity and related problems by learning spatial consistency. RoITr [14] proposes a local-level attention mechanism embedded with point-pair feature coordinates to describe pose-invariant geometric structures. Based on this, it constructs a novel attention-based encoder-decoder architecture. At the global level, it introduces a global Transformer that learns rotation-invariant cross-frame spatial perception via a self-attention mechanism. This significantly enhances the feature discriminability and improves the model’s robustness in low-overlap scenarios. SIRA-PCR [15] proposes the first method to explore simulation-to-reality adaptation in PCR. The framework incorporated an adaptive resampling module to address the domain gap between simulated and real point cloud patterns and constructed a synthetic scene-level PCR dataset that employed both physics-based and randomized strategies to arrange diverse objects. RegFormer [16] introduces a feature extraction Transformer and a bijective association Transformer, which capture long-range dependencies and filter outliers via global point feature extraction. This ensures high efficiency even in large-scale scenes while enabling the regression of initial transformations.
The methods reviewed above leverage the advantages of Transformer models from different perspectives, improving the efficiency and speed of point cloud registration. However, none of these works overcomes the inherent quadratic computational complexity of the Transformer.
An iterative-based model gradually optimizes results, approaches targets, or solves problems by repeatedly executing fixed steps, using the output of the previous iteration as input for the next iteration. The well-known methods for the PCR task based on iterative models are PEAL [9] and AMR [10]. In addition, IFNet [17] proposes a novel iterative feedback network for unsupervised PCR, in which the representation of low-level features is efficiently enriched by rerouting subsequent high-level features. Besides the PCR task, iterative models have also been applied in many areas. In 3-D reconstruction, MSDER-MVS [18] optimizes depth estimation iteratively using residuals and the Jacobian without additional parameters. In point cloud completion, PMP-Net [19] achieves iterative refinement through shape deformation and builds point-level correspondences.
A drawback of the iterative-based model is that it introduces additional computational overhead due to the multiple iterative processes. Let the resource consumption of a single iteration be
To apply the Mamba model to PCR, two issues need to be addressed: (1) how to convert 3-D point clouds into 1-D sequences; (2) how to extract global features. Recently, MT-PCR [20] performs Z-order-based spatial serialization on 3-D point cloud data, replaces the self-attention module in the CAST [8] backbone with a Mamba encoder, and constructs a hierarchical framework. This framework combines Mamba’s global modeling capability with local attention and cross-scale optimization, reducing the VRAM usage of the registration network. E2MNet [21] replaces the feature extraction Transformer and bijective association Transformer in the RegFormer [16] backbone with a feature extraction Mamba2 module and a spatio-temporal fusion module, respectively. It comprehensively captures the local and global features of point clouds and efficiently accomplishes large-scale PCR tasks. MaGo-I2P [22] proposes the first Mamba-based image-to-point-cloud registration framework. It recovers the geometric structure of images through depth estimation, thereby constructing an implicit 3-D representation of the image scene to alleviate the modality gap between images and point clouds and facilitate cross-modal feature extraction. In specialized domains, AeroMamba [23] leverages the Mamba architecture and a Hilbert curve to address the challenges posed by large-scale, featureless point clouds in aircraft assembly.
However, the aforementioned Mamba-based PCR methods are either optimized only for specific domains or designed only to address the registration of either large-scale or featureless point clouds. There is still no unified PCR method that can simultaneously address challenges in both large-scale and low-overlap scenarios. In addition, unlike permutation-invariant Transformers, the autoregressive nature of SSMs makes Mamba highly sensitive to sequence ordering. Consequently, rather than being a universal replacement, the
The task of PCR is to transform point clouds of the same object or environment acquired under different coordinate systems into a single coordinate system by estimating an optimal special Euclidean group
Attention Mechanism [5] originated from research on the human visual system. When observing things, humans do not focus equally on all information; instead, they selectively concentrate on interesting or important parts, quickly capturing key information while ignoring irrelevant details. This mechanism was introduced into deep learning to improve the efficiency and accuracy of models when processing complex data. The attention mechanism includes self-attention, cross-attention, and multi-head attention.
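Transformers are constructed by stacking self-attention and cross-attention modules, enabling effective modeling of global dependencies and feature correspondences. The self-attention mechanism computes attention weights over the same set of points to capture internal feature interactions, whereas cross-attention identifies correspondences between two different point sets. Formally, given a query $Q$, a key $K$, and a value $V$, attention is computed as in [5]:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$
where $d_k$ is the feature dimension of the keys.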
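As a concrete illustration of the self-attention and cross-attention described above, the following minimal NumPy sketch computes scaled dot-product attention in the standard form of [5]. The function name `attention`, the feature sizes, and the random toy features are assumptions made for illustration and are not taken from the AMBIR implementation.

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention: softmax(q k^T / sqrt(d)) v."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                   # (Nq, Nk) pairwise similarities
    scores -= scores.max(axis=-1, keepdims=True)    # subtract row max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
src = rng.normal(size=(6, 8))   # toy features of 6 source points (hypothetical)
tgt = rng.normal(size=(4, 8))   # toy features of 4 target points (hypothetical)

self_out, self_w = attention(src, src, src)     # self-attention within one cloud
cross_out, cross_w = attention(src, tgt, tgt)   # cross-attention between clouds
```

The explicit (Nq, Nk) weight matrix is the source of the quadratic cost: doubling the number of points quadruples the number of pairwise scores.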
Most existing PCR networks are based on a single training session. In contrast, iterative models undergo multiple training rounds. Specifically, a training count
The currently most effective iterative model is AMR [10]. Formally, taking AMR as an example, let
which means that each model learns from the overlapping prior knowledge contained in the rigid transformation of the previous model. Eq. (3) follows the adaptive refinement paradigm established in [10], treating registration as a progressive residual update
For iterative models, the registration accuracy varies at each step. Therefore, to train models for different steps separately, a transition function
Since rotation matrices are nonlinear, the spherical interpolation function is used:
where
Linear interpolation is employed for the translation vectors:
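The two interpolation rules can be sketched as follows. This is a minimal illustration, assuming an interpolation weight `s` in [0, 1] that moves from a starting pose `(R0, t0)` at `s = 0` to an end pose `(R1, t1)` at `s = 1`; the helper names and this parameterization are assumptions, not the notation of [10].

```python
import numpy as np

def so3_log(R):
    """Rotation matrix -> axis-angle vector (assumes rotation angle < pi)."""
    cos = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    angle = np.arccos(cos)
    if angle < 1e-8:
        return np.zeros(3)
    axis = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    return angle * axis / (2.0 * np.sin(angle))

def so3_exp(w):
    """Axis-angle vector -> rotation matrix (Rodrigues formula)."""
    angle = np.linalg.norm(w)
    if angle < 1e-8:
        return np.eye(3)
    k = w / angle
    K = np.array([[0.0, -k[2], k[1]], [k[2], 0.0, -k[0]], [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def interpolate_pose(R0, t0, R1, t1, s):
    """Spherical interpolation for rotation, linear interpolation for translation."""
    R = R0 @ so3_exp(s * so3_log(R0.T @ R1))  # move along the geodesic from R0 to R1
    t = (1.0 - s) * t0 + s * t1               # straight line from t0 to t1
    return R, t

Rz90 = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
R_half, t_half = interpolate_pose(np.eye(3), np.zeros(3),
                                  Rz90, np.array([2.0, 0.0, 0.0]), s=0.5)
```

Interpolating the rotation on the geodesic keeps every intermediate result a valid rotation matrix, which element-wise linear blending of two rotation matrices would not guarantee.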
An SSM consists of two equations, the state equation and the observation equation:
where
Computers process discrete signals, so the zero-order hold (ZOH) from modern control theory is used to convert the continuous-time model into a discrete-time state space model. Denote the sampling period as
Therefore, the discrete-time state space model can be expressed as:
where
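To make the discretization step concrete, the sketch below applies the zero-order hold to a small SSM with a diagonal state matrix and then runs the resulting linear recurrence. It uses the standard textbook forms (discrete state matrix exp(ΔA), and discrete input matrix (ΔA)^-1 (exp(ΔA) - I) ΔB); the variable names, the diagonal restriction, and the toy values are assumptions for illustration.

```python
import numpy as np

def zoh_discretize(a, b, delta):
    """Zero-order hold for a diagonal SSM; a and b are 1-D arrays of shape (n,)."""
    a_bar = np.exp(delta * a)
    b_bar = (a_bar - 1.0) / a * b   # diagonal case of (dA)^-1 (exp(dA) - I) dB
    return a_bar, b_bar

def ssm_scan(a_bar, b_bar, c, x):
    """Run h_k = a_bar * h_{k-1} + b_bar * x_k and y_k = c . h_k over a scalar sequence."""
    h = np.zeros_like(a_bar)
    ys = []
    for x_k in x:                   # a single pass over the sequence
        h = a_bar * h + b_bar * x_k
        ys.append(float(c @ h))
    return np.array(ys)

a = np.array([-1.0, -2.0])          # stable diagonal state matrix (toy values)
b = np.array([1.0, 1.0])
c = np.array([1.0, 0.5])
a_bar, b_bar = zoh_discretize(a, b, delta=0.1)
y = ssm_scan(a_bar, b_bar, c, x=np.ones(50))   # response to a constant input
```

Because each step only updates a fixed-size state, the cost of the scan grows linearly with the sequence length, in contrast to the pairwise weight matrix of attention.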
The Mamba model, proposed in [12] and inspired by SSMs, extends them into selective state space models (Selective SSMs). In Mamba, the parameters
where
Each of the three components addresses a different problem. Mamba leverages its linear computational complexity to reduce the computational burden caused by the Transformer architecture. The serialization method converts 3-D point clouds into linear sequences so that they can be processed by the Mamba model. The iterative model continuously learns knowledge from overlapping priors and ground truth. By combining the strengths of these three components, AMBIR can effectively reduce the computational cost of point cloud registration while addressing large-scale and low-overlap point cloud alignment.
To combine these three models, AMBIR has two parts: iteration backbone and iteration process. The iteration backbone refers to the method used at each step of the overall network. After this iteration is completed, the process proceeds to the next iteration in accordance with the iteration process. Therefore, for the entire registration network to operate efficiently, these two components need to work in tandem. For the iteration backbone, it must achieve high registration performance while maintaining low resource usage. This ensures that the model’s performance improves after multiple iterations without excessive resource consumption. For the iteration process, appropriate learning rules need to be designed so that the model can learn the registration knowledge from the previous iteration at each step.
As shown in Fig. 1, the iteration backbone of AMBIR consists of feature extraction, hybrid coarse registration, and sparse-to-dense fine registration in sequence.

Figure 1: Overview of the iterative backbone of AMBIR.
Feature Extraction. Feature extraction is the process of deriving low-dimensional representations with discriminability, invariance, and compactness from raw point clouds. Herein, FA-KPConv [24] is utilized to encode the input point clouds into multi-scale feature representations. Let the feature map of the original point clouds be denoted as
Hybrid Coarse Registration. First, to enhance the semi-dense feature, a linear cross-attention [25] is adopted before subsequent modules. Both semi-dense feature and coarse feature superpoints need to be serialized for subsequent processing. Thus, Prior-Informed Co-aligned Serialization (PICOS) is applied to bridge the gap between unordered point clouds and sequential models. Utilizing the transformation estimate
For coarse feature, these synchronized sequences are passed through an encoder composed of H stacked Mamba blocks to extract hierarchical geometric features. Each block consists of layer normalization (LN), a selective state space model (SelectiveSSM) [12], depth-wise separable convolutions (DW) [27], and residual connections. The architecture is illustrated in Fig. 2, and the
where

Figure 2: Architecture of the Mamba encoder and Mamba block. Left: Mamba Encoder with residual connections and feedforward neural networks (FNNs). Right: Mamba block centering around the SelectiveSSM.
For semi-dense features, sequences are fed into the Consistency-Aware Mamba Encoder (CAME). As a lightweight alternative to the computationally expensive self-attention mechanism, CAME employs a bi-directional SSM to aggregate global geometric context with linear complexity. Crucially, an overlap-driven soft-gating mechanism is integrated to implicitly suppress features from non-overlapping regions, enhancing robustness against outliers. Finally, the enhanced semi-dense features
where
Sparse-to-Dense Fine Registration. Inspired by the hierarchical strategy [8], the fine registration module employs a lightweight sparse-to-dense mechanism to achieve precise alignment without computational bottlenecks. Distinct keypoints are extracted from local patches centered at semi-dense nodes of
Notably, although the sparse-to-dense fine registration introduces local attention and matching costs, these are computed only within local neighborhoods or on dynamically down-sampled point sets. As a result, the module avoids the quadratic time complexity introduced by global attention.
As shown in Fig. 3, the registration pipeline is structured as a cascade of K adaptive refinement stages, where each stage employs an identical network architecture based on the proposed backbone but possesses independent trainable parameters tailored to specific noise distributions. To foster adaptivity, synthetic prior transformations spanning T discrete accuracy levels are generated and linearly partitioned into K groups (K < T), with the specific model index

Figure 3: Iterative process of AMBIR.
3.4 Prior-Informed Co-Aligned Serialization
SSMs, particularly Mamba, rely on autoregressive modeling of 1-D sequences to capture global context with linear complexity. Bridging the dimensional gap between unordered 3-D point clouds and ordered 1-D sequences is a fundamental prerequisite for the proposed architecture. Thus, a serialization strategy that transforms the challenging global registration problem into a manageable local sequence matching task is needed.
Standard point cloud serialization typically employs Space-Filling Curves (SFCs), such as the Hilbert curve or Z-order curve, to map 3-D coordinates onto a 1-D manifold while preserving local neighborhood structures [20]. While effective for static tasks such as semantic segmentation, where the coordinate frame is fixed, SFCs exhibit a critical limitation in registration scenarios due to their rotational sensitivity. SFCs are strictly coordinate-dependent. A rigid transformation

Figure 4: Architecture of PICOS.
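The rotational sensitivity of space-filling curves discussed above can be reproduced with a toy example. The sketch below uses a Z-order (Morton) key on a coarse grid; the function names, the 2-bit grid, and the two sample points are assumptions chosen so that the effect is easy to verify by hand.

```python
import numpy as np

def morton_key(ijk, bits):
    """Interleave the bits of integer grid coordinates (i, j, k) into one Z-order key."""
    key = 0
    for bit in range(bits):
        for axis in range(3):
            key |= ((int(ijk[axis]) >> bit) & 1) << (3 * bit + axis)
    return key

def z_order(points, lo, hi, bits=2):
    """Permutation that sorts points along a Z-order curve inside the box [lo, hi]."""
    cells = (points - lo) / (hi - lo) * (2**bits - 1)
    cells = np.clip(cells, 0, 2**bits - 1).astype(np.int64)
    keys = np.array([morton_key(c, bits) for c in cells])
    return np.argsort(keys, kind="stable")

points = np.array([[1.0, -1.0, 0.0],
                   [-1.0, 1.0, 0.0]])
Rz90 = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
rotated = points @ Rz90.T            # the same two points after a rigid rotation

order_before = z_order(points, lo=-1.0, hi=1.0)
order_after = z_order(rotated, lo=-1.0, hi=1.0)
```

Here the same two physical points are visited in opposite order before and after a 90 degree rotation about the z axis, so a sequence model would see them at different sequence positions in the two frames.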
3.4.1 Proxy Alignment and Shared Projection
To resolve this bottleneck, the strategy decouples the spatial ordering from the feature representation. The transformation estimate
Formally, the process begins by constructing a proxy point cloud
It is crucial to note that
Next, a shared canonical space
Because
Finally, obtain the permutation indices
It is worth noting that while the argsort operation introduces a theoretical time complexity of
The input features
where
Eq. (16) creates a siamese sequence pair where the
In summary, PICOS enables the full selective scanning capability of the Mamba architecture to focus on comparing local features rather than learning global rotation invariance in
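The overall idea of co-aligned serialization can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the PICOS implementation: a Z-order (Morton) key is swapped in for the space-filling curve because it is shorter to write, the shared canonical space is taken to be the joint bounding box, and the function names and the exact prior used in the demo are hypothetical.

```python
import numpy as np

def morton_keys(cells, bits):
    """Vectorized Z-order keys for integer grid cells of shape (N, 3)."""
    keys = np.zeros(len(cells), dtype=np.int64)
    for bit in range(bits):
        for axis in range(3):
            keys |= ((cells[:, axis] >> bit) & 1) << (3 * bit + axis)
    return keys

def co_aligned_serialize(src_xyz, tgt_xyz, R_prior, t_prior, bits=10):
    """Return permutation indices for source and target under one shared ordering."""
    proxy = src_xyz @ R_prior.T + t_prior         # proxy cloud: used for ordering only
    both = np.vstack([proxy, tgt_xyz])
    lo, hi = both.min(axis=0), both.max(axis=0)   # shared bounding box for both clouds
    scale = 2**bits / np.maximum(hi - lo, 1e-9)

    def order(xyz):
        cells = np.clip(((xyz - lo) * scale).astype(np.int64), 0, 2**bits - 1)
        return np.argsort(morton_keys(cells, bits), kind="stable")

    return order(proxy), order(tgt_xyz)

rng = np.random.default_rng(0)
tgt = rng.integers(0, 100, size=(32, 3)).astype(float)
R = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])  # 90 degrees about z
t = np.array([5.0, -3.0, 2.0])
src = ((tgt - t) @ R)[rng.permutation(len(tgt))]  # same scene, other frame, shuffled

order_src, order_tgt = co_aligned_serialize(src, tgt, R, t)
aligned_src = (src @ R.T + t)[order_src]
sorted_tgt = tgt[order_tgt]
```

With an exact prior, index i of the serialized source and index i of the serialized target refer to the same location, so features gathered with `order_src` and `order_tgt` form position-matched sequences. With an imperfect prior the match is only approximate, which is the situation the later iterations are meant to improve.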
3.5 Consistency-Aware Mamba Encoder
While the serialization (Section 3.4) provides a geometrically aligned token sequence, standard SSMs treat all tokens equally during the recurrent state update. In the context of registration, however, points in non-overlapping yet similar-but-distinct regions (e.g., flat walls and desktops) act as noise, potentially contaminating the global context. The original CAST [8] addressed this by using a graph-based sampling strategy that applied sparse self-attention only to consistent nodes. Although effective, constructing compatibility graphs incurs

Figure 5: Architecture of CAME. Note: Some relevant mathematical symbols representing both the source point cloud
3.5.1 Coordinate-Injected Embedding
Although the input features are ordered via the Hilbert curve, the standard Mamba architecture processes sequences based strictly on relative positions within the 1-D array, lacking explicit awareness of the underlying 3-D metric space. To compensate for this loss of metric information during serialization, geometric embedding is injected prior to feature aggregation.
Let
This operation ensures that the SSM implicit states can leverage both the sequential context and the absolute spatial distribution, facilitating the learning of distance-dependent geometric dependencies.
3.5.2 Consistency-Guided Soft Gating
To emulate the outlier-rejection capability of graph-based sampling without incurring quadratic computational costs, a soft gating mechanism is introduced. This mechanism dynamically modulates the information flow into the SSM based on the estimated reliability of each point.
Let
where
For theoretical insight, the core recurrence of an SSM is governed by the state equation
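The filtering effect of a soft gate on a recurrent state can be illustrated with a toy recurrence. In the sketch below, each token is scaled by a confidence in (0, 1) before entering a scalar-decay recurrence; the per-point `overlap_logits`, the sigmoid gate, and the fixed decay values are assumptions for illustration and do not reproduce the gate definition used in CAME.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def soft_gate(features, overlap_logits):
    """Scale each token by a confidence in (0, 1) before it enters the recurrence."""
    gate = sigmoid(overlap_logits)[:, None]
    return gate * features, gate

def scan(x, a_bar=0.9, b_bar=0.1):
    """Toy recurrence h_k = a_bar * h_{k-1} + b_bar * x_k, applied per channel."""
    h = np.zeros(x.shape[1])
    out = []
    for x_k in x:
        h = a_bar * h + b_bar * x_k
        out.append(h.copy())
    return np.stack(out)

features = np.ones((5, 2))
features[2] = 100.0                                        # token 2 is an outlier
overlap_logits = np.array([10.0, 10.0, -10.0, 10.0, 10.0])  # hypothetical confidences

gated, gate = soft_gate(features, overlap_logits)
states_raw = scan(features)    # the outlier is written into the state at full strength
states_gated = scan(gated)     # the outlier is almost entirely suppressed
```

Without gating, the outlier dominates the hidden state and keeps influencing all later tokens as it decays; with gating, its contribution is close to zero, so the state carries mostly information from high-confidence tokens.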
3.5.3 Bi-Directional Aggregation
Space-filling curves impose a fixed traversal direction, which introduces a directional bias in information propagation. To capture the full geometric context and ensure isotropic feature learning, a Bi-directional Mamba strategy is employed. The gated sequences
here,
3.5.4 Feature Fusion and Restoration
The context-enriched features from both directions are fused via element-wise addition. To facilitate gradient flow and preserve original semantic information, a residual connection with the pre-gated input is applied, followed by normalization:
Finally, to maintain compatibility with downstream modules that rely on the original point cloud indexing (such as the explicit cross-attention module), the output sequences
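The bi-directional aggregation, fusion, and order restoration steps can be sketched together. This is a minimal illustration with assumptions: a fixed scalar-decay recurrence stands in for the selective SSM, both directions share the same toy parameters, the plain input sequence stands in for the pre-gated input, and all function names are hypothetical.

```python
import numpy as np

def scan(x, a_bar=0.9, b_bar=0.1):
    """Toy recurrence standing in for the SSM: h_k = a_bar * h_{k-1} + b_bar * x_k."""
    h = np.zeros(x.shape[1])
    out = np.empty_like(x)
    for k, x_k in enumerate(x):
        h = a_bar * h + b_bar * x_k
        out[k] = h
    return out

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def bidirectional_block(x_seq):
    fwd = scan(x_seq)                      # start -> end
    bwd = scan(x_seq[::-1])[::-1]          # end -> start, flipped back to sequence order
    return layer_norm(fwd + bwd + x_seq)   # element-wise fusion plus residual

def restore_order(y_seq, perm):
    """Undo serialization: y_seq[i] belongs to the original point with index perm[i]."""
    out = np.empty_like(y_seq)
    out[perm] = y_seq
    return out

rng = np.random.default_rng(0)
feats = rng.normal(size=(16, 4))      # features in original point order
perm = rng.permutation(16)            # serialization order (hypothetical)
x_seq = feats[perm]
y = restore_order(bidirectional_block(x_seq), perm)
```

Summing the forward and backward passes makes the block independent of the traversal direction: reversing the input sequence simply reverses the output. The inverse permutation then returns each enriched feature to the index of its original point.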
In summary, the overall effective theoretical complexity of AMBIR across multiple iterations can be rigorously expressed as
To supervise the iterative refinement framework, a multi-task objective function is designed, structured into four components: keypoint detection
where
Inspired by USIP [29], the loss function of keypoint detection is defined as:
where
This component supervises the hybrid coarse registration module, ensuring both the validity of the Mamba encoder and the accuracy of the Transformer interaction. It consists of spot matching loss and coarse matching loss.
Spot Matching Loss
where
Furthermore, when the patch centered at point
where
Coarse Matching Loss
where
The total coarse matching loss is defined as
Three losses are employed to supervise similarity calculation, correspondence prediction, and consistency filtering, respectively.
Similarity Calculation Loss
where
Correspondence Prediction Loss
Consistency Filtering Loss
where
The total keypoint matching loss is
The dense registration module is supervised using the translation loss
where the subscript $F$ denotes the Frobenius norm.
The total dense registration loss is defined as
To evaluate the performance of AMBIR and its advantages over other state-of-the-art methods, the experiments adopt two types of datasets: the indoor point cloud dataset 3DMatch [31] together with its low-overlap counterpart 3DLoMatch [32], and the large-scale outdoor point cloud dataset KITTI [33].
For the indoor datasets 3DMatch and low-overlap 3DLoMatch, the experiments adopt the evaluation metrics as follows:
• Registration Recall (RR): Measures the percentage of point cloud pairs successfully aligned within a specified Root Mean Square Error (RMSE < 0.2 m);
• Inlier Ratio (IR): Quantifies the proportion of correspondences within a certain residual threshold under the ground-truth transformation;
• Feature Matching Recall (FMR): Evaluates the percentage of point cloud pairs with an IR exceeding 5%.
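The IR and FMR definitions above translate directly into code. The sketch below is a minimal illustration; the function names, the residual threshold `tau`, and the toy correspondences are assumptions, while the 5% FMR threshold follows the definition given in the list.

```python
import numpy as np

def inlier_ratio(src_pts, tgt_pts, R_gt, t_gt, tau):
    """Fraction of correspondences whose residual under the ground truth is below tau."""
    residual = np.linalg.norm(src_pts @ R_gt.T + t_gt - tgt_pts, axis=1)
    return float((residual < tau).mean())

def feature_matching_recall(inlier_ratios, min_ratio=0.05):
    """Fraction of point cloud pairs whose inlier ratio exceeds min_ratio (5%)."""
    return float(np.mean(np.asarray(inlier_ratios) > min_ratio))

src_pts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
tgt_pts = src_pts.copy()
tgt_pts[3] += np.array([1.0, 0.0, 0.0])   # one wrong correspondence, 1 m off

ir = inlier_ratio(src_pts, tgt_pts, np.eye(3), np.zeros(3), tau=0.1)  # tau is a placeholder
fmr = feature_matching_recall([ir, 0.02, 0.30])
```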
For the large-scale outdoor dataset KITTI, the experiments adopt the evaluation metrics from Predator [32], namely:
• Relative Rotation Error (RRE): The geodesic distance between the estimated and ground-truth rotation matrices;
• Relative Translation Error (RTE): The Euclidean distance between the estimated and ground-truth translation vectors;
• Registration Recall (RR): Represents the proportion of point cloud pairs where both RRE and RTE are below specific thresholds (RRE <
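The three outdoor metrics can be computed as in the following sketch. The function names are hypothetical, and the thresholds passed to `registration_recall` in the demo are placeholders for illustration only, not the values used in the experiments.

```python
import numpy as np

def relative_rotation_error(R_est, R_gt):
    """Geodesic distance between two rotation matrices, in degrees."""
    cos = (np.trace(R_gt.T @ R_est) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def relative_translation_error(t_est, t_gt):
    """Euclidean distance between two translation vectors."""
    return float(np.linalg.norm(t_est - t_gt))

def registration_recall(errors, rre_max, rte_max):
    """Fraction of pairs whose RRE and RTE are both below the given thresholds."""
    ok = [rre < rre_max and rte < rte_max for rre, rte in errors]
    return sum(ok) / len(ok)

theta = np.radians(10.0)
R_est = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                  [np.sin(theta), np.cos(theta), 0.0],
                  [0.0, 0.0, 1.0]])
rre = relative_rotation_error(R_est, np.eye(3))
rte = relative_translation_error(np.array([3.0, 4.0, 0.0]), np.zeros(3))
rr = registration_recall([(1.0, 0.5), (rre, 0.5)], rre_max=5.0, rte_max=2.0)  # placeholders
```

Clipping the cosine before `arccos` guards against values slightly outside [-1, 1] caused by floating-point rounding when the two rotations are nearly identical.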
4.2 Environment and Parameters
4.2.1 Experimental Environment
For a fair comparison, all models involved in the experiments were executed in the same environment, which was equipped with a 14-core Intel Xeon (R) Platinum 8362 CPU and a single NVIDIA RTX 3090 GPU with 24 GB of VRAM. All code was compiled on the Linux Ubuntu 22.04 operating system with 32 GB of RAM allocated.
AMBIR is trained using the Muon [34] optimizer with a batch size of 1, an initial learning rate of
4.3.1 Result on 3DMatch and 3DLoMatch Datasets
As shown in Table 1, 5000, 2500, 1000, 500, and 250 points are sampled from the 3DMatch and 3DLoMatch datasets. Among them, 5000 and 2500 are categorized as dense point clouds, 1000 as medium-density, and 500 and 250 as sparse.

On the 3DMatch dataset, AMBIR achieves state-of-the-art (SOTA) performance in both RR and IR across various sampling numbers, while remaining on par with SOTA levels for FMR. Fig. 6 shows the qualitative registration results on the 3DMatch dataset. The two best open-source SOTA methods, AMR and CAST, both exhibit local mismatches due to similar positions within the red-boxed region, leading to overall misalignment. In contrast, AMBIR does not suffer from this issue.

Figure 6: Qualitative registration results on 3DMatch dataset.
As shown in Table 2, in the 3DLoMatch dataset, the RR and IR of AMBIR significantly outperform current non-iterative models and achieve performance comparable to the AMR [10] iterative model across most sampling numbers. Regarding FMR, it ranks second only to the leading RoITr [14] model and maintains performance similar to other SOTA models. These results indicate the effectiveness of AMBIR in handling challenging low-overlap registration scenarios.

As shown in Table 3, AMBIR achieves 100% RR on KITTI, matching the SOTA performance of recent years. In terms of RTE and RRE, AMBIR outperforms both iterative models, PEAL [9] and AMR [10], while also reaching SOTA levels. This demonstrates that AMBIR delivers strong performance on large-scale point clouds while remaining effective in low-overlap scenarios. Fig. 7 shows the qualitative registration results of AMBIR on the KITTI dataset, where the alignment produced by AMBIR is very close to the ground truth.


Figure 7: Qualitative registration results of AMBIR on KITTI dataset.
Benefiting from the linear complexity of Mamba, as shown in Table 4, AMBIR achieves the shortest average runtime, the lowest VRAM consumption, and a reduced number of FLOPs among all compared methods. It is worth emphasizing that although AMBIR is an iterative approach, the linearization provided by Mamba enables it to achieve, or even surpass, SOTA registration performance while maintaining a relatively small resource footprint. Consequently, AMBIR uses Mamba modules to reduce computational overhead for large-scale PCR, while its iterative framework keeps it effective for low-overlap registration tasks.

4.3.4 Sensitivity and Stability Analysis of Iteration Rounds
K is set to 5 in Section 4.2.2. To verify the rationality of this value, a sensitivity and stability analysis of the hyperparameter K is added in this section, investigating the influence of

To verify the registration robustness of AMBIR under imperfect conditions, tests with point cloud noise and unequal point cloud density are conducted. As shown in Table 6, the integral smoothing property of the Mamba encoder acts as a spatial low-pass filter that suppresses high-frequency noise [36], and the soft-gating mechanism of CAME dynamically assigns low confidence to regions with mismatched geometric densities, preventing sparse artifacts from contaminating global feature aggregation [37]. These mechanisms enable AMBIR to maintain favorable RR values even under severe noise and density variations, providing protection against vulnerabilities under adversarial or noisy conditions.

4.4.1 Ablation Studies of 3DMatch and KITTI with Comparative Analysis
In Table 7, “w/o” indicates “without,” referring to the ablated model lacking the respective module. AMBIR consists of four essential modules: PICOS, Vanilla Mamba, CAME, and Iteration. The ablation reports RR, average runtime (ART), and VRAM usage on 3DMatch, and RTE, RRE, RR, ART, and VRAM usage on KITTI (all with 1000 sampled points). The results show that removing any module degrades the RTE, RRE, and RR metrics. Although removing PICOS saves a negligible amount of VRAM, it severely compromises registration performance, which is not worthwhile. Notably, while removing the iteration reduces ART and VRAM usage, the iteration is essential for learning overlap priors in low-overlap scenarios, and its removal causes a drastic drop in registration accuracy under such conditions. These results demonstrate that each component is indispensable to the complete AMBIR registration network.

For comparative analysis between indoor and outdoor datasets, the differences in the action mechanisms of each core module under varying scale and scene conditions are elaborated. PICOS ensures consistent semantic alignment across scales. The iterative framework uncovers hidden overlaps in occluded indoor scenes and progressively reduces large translational errors in outdoor scenes. Additionally, the soft-gating of CAME dynamically suppresses repetitive indoor clutter and filters vast featureless outdoor backgrounds. Crucially, the linear
4.4.2 Ablation Study on Serialization Strategy
To compare the impact of different serialization strategies on registration performance, the AMBIR serialization strategy is replaced, and the effectiveness of various strategies is evaluated on the 3DMatch Dataset, as shown in Table 8. It can be observed that, due to the absence of pre-alignment provided by PICOS, both the Hilbert and Z-order curve methods consume substantial network capacity when learning global rigid invariance from extremely long sequences, which severely degrades RR. The pre-alignment mechanism can leverage GPU parallel computing, requiring only a modest increase in runtime and FLOPs.

To address the challenges that current Transformer-based PCR frameworks suffer from quadratic computational complexity, leading to excessive resource consumption in large-scale scenarios and suboptimal performance in low-overlap environments, an iterative PCR network, AMBIR, fusing Attention and Mamba, is proposed. Specifically, an iterative network architecture is incorporated into the backbone to learn overlap information from prior registration results, thereby enhancing registration performance by leveraging knowledge from the preceding step. To convert 3-D point cloud data to linear data for the Mamba encoder, Prior-Informed Co-aligned Serialization is proposed to ensure that points with adjacent indices after serialization are spatial neighbors, thereby improving the efficiency and robustness of the subsequent registration process. After that, a Consistency-Aware Mamba Encoder is introduced to leverage its advantage in linear computational complexity, making the method more suitable for large-scale point clouds. Overall, AMBIR integrates the advantages of iterative networks in low-overlap scenarios with the benefits of the linear complexity of the Mamba model. It simultaneously resolves the PCR challenges in both scenarios, achieving a balance between performance and efficiency.
Future work will apply AMBIR to industrial and scientific instrument tasks, such as deformation monitoring of large astronomical telescope surfaces, to further broaden its scope of applications. In addition, investigating noise-resistant models under artificial or extreme conditions, as well as mechanisms to prevent noise-induced errors from propagating and amplifying across iterations, is also a worthwhile research direction.
Acknowledgement: None.
Funding Statement: This work was supported by the National Natural Science Foundation of China (Grant No. 12141304).
Author Contributions: The authors confirm contribution to the paper as follows: Conceptualization, Haotian Cao; methodology, Haotian Cao; software, Haotian Cao; validation, Haotian Cao; formal analysis, Haotian Cao; investigation, Haotian Cao; resources, Qingsheng Zhu; data curation, Haotian Cao; writing—original draft preparation, Haotian Cao and Qingsheng Zhu; writing—review and editing, Haotian Cao and Qingsheng Zhu; visualization, Haotian Cao; supervision, Qingsheng Zhu; project administration, Qingsheng Zhu; funding acquisition, Qingsheng Zhu. All authors reviewed and approved the final version of the manuscript.
Availability of Data and Materials: The 3DMatch Dataset used in this study is publicly available at https://3dmatch.cs.princeton.edu (accessed on 6 March 2026). The KITTI Dataset used in this study is publicly available at https://www.cvlibs.net/datasets/kitti (accessed on 6 March 2026). The source code and model weights of the study are available from the authors upon reasonable request.
Ethics Approval: Not applicable.
Conflicts of Interest: The authors declare no conflicts of interest.
References
1. Pomerleau F, Colas F, Siegwart R. A review of point cloud registration algorithms for mobile robotics. Found Trends Robot. 2015;4(1):1–104.
2. Tao Y, Yang X, Wang H, Wang J, Li Z, Liang H. Lsreg-net: an end-to-end registration network for large-scale lidar point cloud in autonomous driving. IEEE Sens J. 2025;25(11):20675–86. doi:10.1109/jsen.2025.3562916.
3. Zhang YX, Gui J, Yu B, Cong X, Gong X, Tao W, et al. Deep learning-based point cloud registration: a comprehensive survey and taxonomy. arXiv:2404.13830. 2024.
4. Besl PJ, McKay ND. A method for registration of 3-D shapes. IEEE Trans Pattern Anal Mach Intell. 1992;14(2):239–56.
5. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. In: Proceedings of the 31st International Conference on Neural Information Processing Systems; 2017 Dec 4–9; Long Beach, CA, USA. p. 6000–10.
6. Yu H, Li F, Saleh M, Busam B, Ilic S. Cofinet: reliable coarse-to-fine correspondences for robust pointcloud registration. In: Proceedings of the 35th International Conference on Neural Information Processing Systems; 2021 Dec 6–14; Online. p. 23872–84.
7. Qin Z, Yu H, Wang C, Guo Y, Peng Y, Xu K. Geometric transformer for fast and robust point cloud registration. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2022 Jun 18–24; New Orleans, LA, USA. p. 11143–52.
8. Huang R, Tang Y, Chen J, Li L. A consistency-aware spot-guided transformer for versatile and hierarchical point cloud registration. In: Proceedings of the 38th International Conference on Neural Information Processing Systems; 2024 Dec 10–15; Vancouver, BC, Canada. p. 70230–58.
9. Yu J, Ren L, Zhang Y, Zhou W, Lin L, Dai G. PEAL: prior-embedded explicit attention learning for low-overlap point cloud registration. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2023 Jun 17–24; Vancouver, BC, Canada. p. 17702–11.
10. Chen Z, Ren Y, Zhang T, Dang Z, Tao W, Susstrunk S, et al. Adaptive multi-step refinement network for robust point cloud registration. arXiv:2312.03053. 2023.
11. Kalman RE. A new approach to linear filtering and prediction problems. J Basic Eng. 1960;82(1):35–45. doi:10.1115/1.3662552.
12. Gu A, Dao T. Mamba: linear-time sequence modeling with selective state spaces. arXiv:2312.00752. 2023.
13. Yang F, Guo L, Chen Z, Tao W. One-inlier is first: towards efficient position encoding for point cloud registration. Adv Neural Inf Process Syst. 2022;35:6982–95.
14. Yu H, Qin Z, Hou J, Saleh M, Li D, Busam B, et al. Rotation-invariant transformer for point cloud matching. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2023 Jun 17–24; Vancouver, BC, Canada. p. 5384–93.
15. Chen S, Xu H, Li R, Liu G, Fu CW, Liu S. SIRA-PCR: sim-to-real adaptation for 3d point cloud registration. In: Proceedings of the IEEE/CVF International Conference on Computer Vision; 2023 Oct 1–6; Paris, France. p. 14394–405.
16. Liu J, Wang G, Liu Z, Jiang C, Pollefeys M, Wang H. Regformer: an efficient projection-aware transformer network for large-scale point cloud registration. In: Proceedings of the IEEE/CVF International Conference on Computer Vision; 2023 Oct 1–6; Paris, France. p. 8451–60.
17. Xie Y, Wang B, Li S, Zhu J. Iterative feedback network for unsupervised point cloud registration. IEEE Robot Autom Lett. 2024;9(3):2327–34. doi:10.1109/lra.2024.3355784.
18. Ding Y, Li K, Zhang G, Zhu Z, Wang P, Wang Z, et al. Multi-step depth enhancement refine network with multi-view stereo. PLoS One. 2025;20(2):1–17. doi:10.1371/journal.pone.0314418.
19. Wen X, Xiang P, Han Z, Cao YP, Wan P, Zheng W, et al. PMP-Net: point cloud completion by learning multi-step point moving paths. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2021 Jun 20–25; Nashville, TN, USA. p. 7443–52.
20. Liu B, Liu A, Chen H, Cui J, Wang Y, Zhang H. MT-PCR: a hybrid mamba-transformer with spatial serialization for hierarchical point cloud registration. arXiv:2506.13183. 2025.
21. Chen C, Li K, Xing K, Wang Y. E2MNet: an end-to-end large-scale point cloud registration network based on Mamba. J Electron Imaging. 2025;34(3):033045. doi:10.1117/1.jei.34.3.033045.
22. Sun Y, Zhang L. MaGo-I2P: image-to-point cloud registration with mamba and geometry recovery. In: Proceedings of the 2025 International Conference on Multimedia Retrieval; 2025 Jun 30–Jul 3; Chicago, IL, USA. p. 1237–45.
23. Li Q, Jiang Y, Cheng J, Chen W, Zhao P, Qiao X, et al. AeroMamba: an efficient mamba-based approach for large-scale point cloud registration in aircraft assembly. In: Proceedings of the 2025 IEEE/ASME International Conference on Advanced Intelligent Mechatronics (AIM); 2025 Jul 14–18; Hangzhou, China. p. 1–8.
24. Alawieh A, Condurache AP. FA-KPConv: introducing euclidean symmetries to KPConv via frame averaging. arXiv:2505.04485. 2025.
25. Katharopoulos A, Vyas A, Pappas N, Fleuret F. Transformers are RNNs: fast autoregressive transformers with linear attention. In: Proceedings of the 2020 12th International Conference on Machine Learning; 2020 Feb 15–17; Shenzhen, China. p. 5156–65.
26. Qi CR, Su H, Mo K, Guibas LJ. Pointnet: deep learning on point sets for 3d classification and segmentation. In: Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition; 2017 Jul 21–26; Honolulu, HI, USA. p. 652–60.
27. Chollet F. Xception: deep learning with depthwise separable convolutions. In: Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition; 2017 Jul 21–26; Honolulu, HI, USA. p. 1251–8.
28. Hendrycks D. Gaussian error linear units (Gelus). arXiv:1606.08415. 2016.
29. Li J, Lee GH. USIP: unsupervised stable interest point detection from 3d point clouds. In: Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision; 2019 Oct 27–Nov 2; Seoul, Republic of Korea. p. 361–70.
30. Oord A, Li Y, Vinyals O. Representation learning with contrastive predictive coding. arXiv:1807.03748. 2018.
31. Zeng A, Song S, Nießner M, Fisher M, Xiao J, Funkhouser T. 3DMatch: learning local geometric descriptors from RGB-D reconstructions. In: Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition; 2017 Jul 21–26; Honolulu, HI, USA. p. 1802–11.
32. Huang S, Gojcic Z, Usvyatsov M, Wieser A, Schindler K. Predator: registration of 3D point clouds with low overlap. In: Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2021 Jun 20–25; Nashville, TN, USA. p. 4267–76.
33. Geiger A, Lenz P, Urtasun R. Are we ready for autonomous driving? The kitti vision benchmark suite. In: Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition; 2012 Jun 16–21; Providence, RI, USA. p. 3354–61.
34. Jordan K, Jin Y, Boza V, You J, Cesista F, Newhouse L, et al. Muon: an optimizer for hidden layers in neural networks. 2024 [cited 2024 Dec 20]. Available from: https://github.com/KellerJordan/muon.
35. Fischler MA, Bolles RC. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun ACM. 1981;24(6):381–95. doi:10.1145/358669.358692.
36. Wu Z, Duan Y, Wang H, Fan Q, Guibas LJ. If-defense: 3d adversarial point cloud defense via implicit function based restoration. arXiv:2010.05272. 2020.
37. Yang H, Shi J, Carlone L. Teaser: fast and certifiable point cloud registration. IEEE Trans Robot. 2020;37(2):314–33.
Copyright © 2026 The Author(s). Published by Tech Science Press. This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

