Open Access
ARTICLE
CF2-SLAM: Conformal-Calibrated Foundation-Factor Graph SLAM across Modalities and Domains
College of Engineering, Pennsylvania State University, University Park, PA, USA
* Corresponding Author: Xiangqin Chen. Email:
Computers, Materials & Continua 2026, 88(2), 39 https://doi.org/10.32604/cmc.2026.079663
Received 26 January 2026; Accepted 10 April 2026; Issue published 15 June 2026
Abstract
Simultaneous localization and mapping (SLAM) must remain reliable when sensing suites and operating conditions vary across platforms and deployments. Beyond correspondence degradation, a dominant deployment failure mode is misweighted constraints: under distribution shift, uncertainty estimates can become miscalibrated, allowing a small set of overconfident factors to dominate iterative optimization and destabilize inference. This article presents conformal-calibrated foundation-factor graph SLAM (
Keywords
Simultaneous localization and mapping (SLAM) is a core state-estimation module for embodied platforms ranging from aerial robots to autonomous vehicles and augmented reality/virtual reality (AR/VR) devices. In practice, deployments vary in sensing configuration (monocular/stereo/red-green-blue-depth (RGB-D)/inertial measurement unit (IMU)) and operating conditions (illumination, weather, scene layout, motion, and dynamics), inducing distribution shifts that degrade robustness.
Classical geometric pipelines remain attractive due to transparent objectives and auditable components, but they rely on brittle data association and typically assume stationary noise models. Learned SLAM mitigates perceptual brittleness via learned representations and stronger visual priors [1,2], yet a less visible failure mode often dominates in deployment: miscalibrated confidence under shift. In iterative back-ends (bundle adjustment or factor graphs), relative factor weighting controls conditioning and convergence. A small subset of overconfident, incorrect constraints can dominate the normal equations and cause divergence or persistent bias, especially when mixing factor types of different dimensions and noise profiles.
The target of this work is a unified SLAM framework that operates across modalities while retaining the interpretability of a probabilistic back-end. Two principles guide the design. First, frozen foundation models can provide more transferable representations than task-specific encoders [3]. Second, if learned components emit explicit residuals and covariances, inference can be posed as a classical factor-graph maximum a posteriori (MAP) problem [4], enabling principled fusion. However, transferability alone does not prevent systematic misweighting under shift. An online conformal-style calibration mechanism is therefore introduced to adjust factor covariance magnitudes using residual quantiles [5], aiming to support more reliable optimization behavior over time under distribution shift.
Because widely used SLAM datasets differ in available sensor fields and calibration metadata, Section 5 documents modality availability and Section 6.2 fixes evaluation protocols to reduce inadvertent modality leakage.
Contributions.
This article makes three contributions: (1)
Classical geometric SLAM and visual-inertial odometry (VIO) remain strong baselines when sensing assumptions are matched to deployment. Feature-based systems such as ORB-SLAM3 [6] remain highly competitive, while recent geometry-aware learned systems such as Photo-SLAM [1] and IBD-SLAM [2] illustrate the continued value of combining stronger visual representations with explicit optimization back-ends. However, their accuracy and stability still depend strongly on data association quality and on appropriately tuned noise models across visual and inertial factors.
Learned front-ends improve correspondence robustness under viewpoint, illumination, and texture variation. Representative recent examples include DINOv2 [3], LightGlue [7], RoMa [8], and IBD-SLAM [2]. These methods show that learned representations and learned residual models can substantially strengthen SLAM front-ends, but they also expose a recurring weakness: confidence or covariance estimates that are reliable in-domain can become miscalibrated under cross-dataset, cross-sensor, or synthetic-to-real transfer.
Loop closure is commonly structured as candidate retrieval followed by geometric validation before graph insertion. In this setting, the proposed descriptor-topological module follows the same principle: foundation descriptors are used only for candidate proposal, while accepted loop constraints are instantiated as verified LC factors after dense geometric checking. This is distinct from metric-semantic SLAM in the sense of explicit object- or scene-level labeling, and it is also distinct from recent dense mapping systems such as Gaussian Splatting SLAM [9] and SplaTAM [10], which emphasize reconstruction quality and often assume RGB-D inputs and higher compute budgets.
Recent work on post-hoc neural calibration and conformal prediction has strengthened uncertainty quantification in supervised learning [5,11,12], but the role of these methods in SLAM is less straightforward because SLAM involves sequential correlation, heterogeneous factor dimensions, and solver-level sensitivity to relative factor weighting. The emphasis here is therefore not only on uncertainty reporting but also on solver conditioning under distribution shift. A related perspective on resilience under changing operating conditions appears in graph-structured logistics routing, where learned spatiotemporal risk prediction is integrated with dynamic edge weighting to maintain robust decision making under congestion and demand fluctuations [13]. The proposed online conformal layer complements standard robust losses: robust losses suppress large instantaneous outliers, whereas conformal rescaling corrects systematic covariance scale mismatch across factor families so that no single miscalibrated modality dominates the normal equations after transfer.
This work considers heterogeneous sensor streams (monocular/stereo/RGB-D/IMU) and estimates a trajectory (and optionally map parameters) over a horizon. Let the state at time
where
A factor graph over
where factor

Figure 1: System overview of conformal-calibrated foundation-factor graph SLAM (
4.2 Foundation Representations
A frozen foundation model
4.3 Graph Construction and Edge Selection
Optimization is performed over a sliding window of N states with optional loop-closure edges. Temporal edges ensure local observability; sparse covisibility edges add redundancy without quadratic connectivity. Loop candidates are proposed by retrieving neighbors in descriptor space and inserted only after geometric verification (Section 4.5), since false high-confidence loops can bias the entire graph.
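The edge-selection policy described above can be illustrated with a minimal sketch. The function name `select_edges`, the covisibility score dictionary, and the parameters `k_covis` and `min_score` are hypothetical and are not taken from the article; the sketch only shows how temporal chaining plus a per-state cap on covisibility partners keeps the edge count linear in the window size. Loop-closure edges are omitted because they are inserted only after verification.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[int, int]


def select_edges(
    window: List[int],
    covis: Dict[Edge, float],
    k_covis: int = 2,
    min_score: float = 0.3,
) -> Set[Edge]:
    """Return undirected edges (i, j) with i < j over the states in `window`.

    `covis` maps state pairs to an assumed covisibility score in [0, 1].
    """
    # Temporal edges: chain consecutive states so the window stays connected.
    temporal = {(min(a, b), max(a, b)) for a, b in zip(window, window[1:])}

    # Covisibility edges: each state nominates at most k_covis partners,
    # so at most N * k_covis extra edges are added instead of O(N^2).
    in_window = set(window)
    extra: Set[Edge] = set()
    for i in window:
        partners = [
            (score, (min(a, b), max(a, b)))
            for (a, b), score in covis.items()
            if i in (a, b)
            and a != b
            and a in in_window
            and b in in_window
            and score >= min_score
            and (min(a, b), max(a, b)) not in temporal
        ]
        partners.sort(reverse=True)  # best-scoring partners first
        extra.update(edge for _, edge in partners[:k_covis])

    return temporal | extra
```

With a window of N states, the result contains N - 1 temporal edges and at most N * `k_covis` covisibility edges, which is the sparsity property the text relies on.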
4.4 Probabilistic Learned Factors
Factor types and modality dependencies are summarized in Fig. 2; only factors supported by dataset fields (Section 5) are enabled to avoid modality leakage.

Figure 2: Unified factor graph for
Four factor families are used: a foundation-feature (FF) factor, a depth/disparity (D) factor when depth is present, an IMU preintegration factor when IMU exists [14], and a loop-closure (LC) factor after verification. The descriptor-topological loop-closure module described below is therefore not a fifth factor family; rather, it is the proposal-and-verification pipeline whose accepted outputs are instantiated as LC factors in the graph. For FF, given an estimated relative pose
For RGB-D depth consistency:
and for stereo an analogous disparity form applies. For loop closure, after verification a relative pose constraint is added in
The IMU family uses the standard preintegrated residual over pose, velocity, and bias states;
Each factor head outputs a positive-definite covariance
4.5 Descriptor-Topological Loop Closure with Geometric Verification
Loop closure separates candidate proposal from constraint insertion (Fig. 3). In this article, “descriptor-topological” refers to this retrieval-and-verification pipeline, while the actual graph element added after a successful check is the LC factor defined above. Candidates are retrieved using

Figure 3: Descriptor-topological loop closure. Foundation descriptors propose candidates; dense matching verifies geometry; accepted closures become loop-closure (LC) factors in the graph.
To balance accuracy and efficiency, proposal and verification are decoupled. Descriptor retrieval generates hypotheses, and dense matching with robust pose estimation acts as a conservative gate that is executed sparsely (e.g., on keyframes) with a capped number of candidates per query. The reported frame rate, measured in frames per second (FPS), includes the amortized verification cost. Descriptor retrieval alone is high-recall but insufficiently conservative for graph insertion under perceptual aliasing and repeated structures, so dense geometric verification is required before adding a loop-closure factor.
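The two-stage structure can be sketched as follows. This is an illustrative outline under stated assumptions, not the article's implementation: descriptors are plain vectors compared by cosine similarity, `verify` is a hypothetical callback standing in for dense matching with robust pose estimation and returning an inlier count, and `top_k`, `min_gap`, and `min_inliers` are assumed parameter names.

```python
import math
from typing import Callable, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two descriptor vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def propose(
    query_idx: int,
    descriptors: List[Sequence[float]],
    top_k: int = 3,
    min_gap: int = 30,
) -> List[int]:
    """Stage 1: retrieve up to top_k earlier keyframes by descriptor similarity.

    Keyframes closer than min_gap to the query are skipped so that
    temporally adjacent frames are not proposed as loops.
    """
    last = max(0, query_idx - min_gap + 1)
    scored = [(cosine(descriptors[query_idx], descriptors[j]), j) for j in range(last)]
    scored.sort(reverse=True)
    return [j for _, j in scored[:top_k]]


def close_loops(
    query_idx: int,
    descriptors: List[Sequence[float]],
    verify: Callable[[int, int], int],
    top_k: int = 3,
    min_gap: int = 30,
    min_inliers: int = 50,
) -> List[Tuple[int, int]]:
    """Stage 2: keep only candidates that pass the geometric gate."""
    accepted = []
    for j in propose(query_idx, descriptors, top_k, min_gap):
        # `verify` is hypothetical: dense matching + robust pose estimation,
        # returning the number of geometrically consistent correspondences.
        if verify(query_idx, j) >= min_inliers:
            accepted.append((query_idx, j))
    return accepted
```

Capping `top_k` bounds the number of expensive `verify` calls per query, which is what makes the verification cost amortizable when the gate runs only on keyframes.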
In dynamic scenes, non-stationarity mainly impacts retrieval and correspondence; the verification gate rejects inconsistent hypotheses before graph insertion, and dynamic-aware masking or temporal-consistency checks can be incorporated when needed.
4.6 Online Conformal Calibration of Factor Covariances
Under domain shift, predicted uncertainties can become systematically miscalibrated, altering factor influence in Gauss-Newton/Levenberg-Marquardt (GN/LM) updates. Covariances are therefore rescaled online using residual statistics (Fig. 4).

Figure 4: Online conformal calibration. For each factor type, match observed residual quantiles to target quantiles and rescale covariances accordingly.
For factor
For each factor type
This increases covariance when residuals are larger-than-expected and decreases it when residuals are smaller-than-expected. Warm-up and caps on
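One plausible form of such a per-factor-type rescaling is sketched below. It is an assumption-laden illustration rather than the article's exact update rule: scores are squared Mahalanobis norms computed with the uncalibrated covariance, the target quantile is that of a chi-square distribution (the closed form shown is valid only for 2 degrees of freedom), and the class name `ConformalScaler` and the parameters `window`, `warmup`, `lo`, and `hi` are hypothetical.

```python
import math
from collections import deque


def mahalanobis_sq(residual, variances):
    """Squared Mahalanobis norm for a diagonal covariance (assumed here)."""
    return sum(r * r / v for r, v in zip(residual, variances))


def chi2_quantile_2dof(q):
    """Closed-form q-quantile of a chi-square variable with 2 degrees of freedom."""
    return -2.0 * math.log(1.0 - q)


class ConformalScaler:
    """Tracks residual scores for one factor type and returns a covariance scale."""

    def __init__(self, target, q=0.9, window=200, warmup=20, lo=0.25, hi=16.0):
        self.target = target                 # quantile expected if calibrated
        self.q = q                           # quantile level being matched
        self.scores = deque(maxlen=window)   # bounded sliding window
        self.warmup = warmup
        self.lo, self.hi = lo, hi            # caps on the scale factor

    def update(self, score):
        self.scores.append(score)

    def alpha(self):
        n = len(self.scores)
        if n < self.warmup:
            return 1.0  # too few samples: leave the predicted covariance as is
        ordered = sorted(self.scores)
        # Conservative empirical quantile: ceil(q * (n + 1))-th order statistic.
        k = min(n, math.ceil(self.q * (n + 1))) - 1
        ratio = ordered[k] / self.target
        return min(self.hi, max(self.lo, ratio))
```

In this sketch the calibrated covariance would be `alpha()` times the predicted covariance. A ratio above 1 means observed residuals are larger than the predicted covariance implies, so the covariance is inflated; a ratio below 1 deflates it, and the caps keep a short burst of outliers from swinging the factor weights.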
Factor heads are trained in two stages, with pretraining on TartanAir to initialize residual and covariance behavior. Optimization uses AdamW (learning rate
where
The inference procedure is summarized in Algorithm 1.

Sensor fields are documented to support cross-modal comparisons, and only factors supported by available measurements are enabled. Evaluation is conducted on seven public benchmarks that jointly cover outdoor driving (KITTI Odometry, KITTI-360), aerial visual-inertial SLAM (EuRoC MAV), and indoor RGB-D tracking, relocalization, and mapping (TUM RGB-D, ScanNet, 7-Scenes), with TartanAir used for broad pretraining and stress-testing under diverse simulated conditions. Specifically, KITTI and KITTI-360 provide rectified stereo driving sequences for odometry and loop closure under appearance change; EuRoC provides synchronized stereo + IMU data for VIO evaluation under aggressive motion; TUM RGB-D and 7-Scenes emphasize indoor tracking and relocalization with depth; ScanNet supports large-scale indoor RGB-D tracking and dense reconstruction; and TartanAir supplies diverse environments and modalities to initialize factor heads for transfer. Dataset fields and typical tasks are summarized in Table 1.
Baselines and modality requirements are summarized in Table 2. Comparison is made only where required fields exist (Table 1), and runtime/hardware is reported when available.
6.2 Evaluation Protocol and Reporting
Each sequence is evaluated over R independent runs; mean

Reported runtime/frames per second (FPS) includes feature extraction, factor construction, windowed optimization, and the amortized loop-closure cost under the configured proposal/verification schedule. Per-stage timing for retrieval and verification is additionally logged to support reproducibility.
Absolute trajectory error (ATE)/relative pose error (RPE) [20], loop-closure and relocalization precision/recall, mapping quality where depth exists, uncertainty metrics, and efficiency (FPS/memory) are reported. For uncertainty, let
with constant C shared across methods. Expected calibration error (ECE) follows standard binning of nominal vs. empirical coverage; reliability curves and score distributions are reported in Section 7.5. False loops per kilometer use traveled distance along the reference trajectory.
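As a concrete reading of "nominal vs. empirical coverage", the following sketch computes a coverage-based calibration error for scalar residuals. It assumes Gaussian predictive uncertainties, normalized residuals z = r / sigma, and an unweighted average over nominal levels; the article's exact binning may differ, and the function name `coverage_ece` is hypothetical.

```python
from statistics import NormalDist


def coverage_ece(z, levels=tuple(i / 10 for i in range(1, 10))):
    """Mean absolute gap between nominal and empirical coverage.

    z: normalized scalar residuals (residual divided by predicted std).
    levels: nominal central-interval coverage levels in (0, 1).
    """
    std_normal = NormalDist()
    gaps = []
    for p in levels:
        # Half-width of the central interval that should contain mass p.
        half_width = std_normal.inv_cdf(0.5 + p / 2.0)
        empirical = sum(abs(v) <= half_width for v in z) / len(z)
        gaps.append(abs(empirical - p))
    return sum(gaps) / len(gaps)
```

Under this definition, an overconfident model (predicted standard deviations too small) yields empirical coverage below nominal at every level, which is also the pattern a reliability curve would show as a curve lying below the diagonal.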
7.1 Outdoor Odometry (KITTI/KITTI-360)
Outdoor trajectory estimation is evaluated under Section 6.2. Results are summarized in Table 4, and representative trajectories are visualized in Fig. 5. On KITTI-360,


Figure 5: Qualitative trajectories on (a) KITTI-360 Sequence 00, (b) KITTI Odometry Sequence 02, and (c) KITTI Odometry Sequence 05. Alignment follows Section 6.2.
7.2 Visual-Inertial SLAM (EuRoC MAV)
On EuRoC with stereo + IMU,

7.3 Indoor RGB-D Tracking and Auxiliary Mapping Evidence (TUM, ScanNet, 7-Scenes)
Indoor RGB-D scenes contain textureless regions and perceptual aliasing. Across TUM, ScanNet, and 7-Scenes (Table 6),

As auxiliary evidence of graph consistency on RGB-D data, quantitative dense mapping results on ScanNet are reported in Table 7, and qualitative examples are shown in Fig. 6.


Figure 6: Qualitative dense reconstruction on ScanNet. Top: reconstructed meshes. Bottom: distance-to-mesh error heatmaps (shared scale).
7.4 Loop Closure and Relocalization
Table 8 reports loop closure precision/recall and relocalization. Geometric verification filters most spurious retrieval candidates, while descriptor-based proposal improves recall under large viewpoint/appearance changes. To directly isolate the necessity of the second stage, an additional comparison between descriptor-only loop closure and descriptor retrieval followed by geometric verification is reported in Table 9. The verification stage is intended as a conservative gate before loop-factor insertion, since even a small number of false loop constraints can bias subsequent graph optimization. Qualitative verified examples are shown in Fig. 7.
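The loop-closure metrics can be made concrete with a short sketch. It assumes that ground-truth loops are available as a set of keyframe index pairs (for example, pairs whose reference poses lie within a distance threshold, a criterion not specified here) and that traveled distance is the summed segment length of the reference trajectory; the function names are hypothetical.

```python
import math


def path_length_m(positions):
    """Traveled distance along a reference trajectory given as position tuples."""
    return sum(math.dist(a, b) for a, b in zip(positions, positions[1:]))


def loop_metrics(accepted, true_loops, positions):
    """Precision, recall, and false loops per kilometer for inserted loop factors."""
    accepted, true_loops = set(accepted), set(true_loops)
    tp = len(accepted & true_loops)   # inserted and correct
    fp = len(accepted - true_loops)   # inserted but wrong
    fn = len(true_loops - accepted)   # correct but missed
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    km = path_length_m(positions) / 1000.0
    false_per_km = fp / km if km > 0 else float("nan")
    return precision, recall, false_per_km
```

Normalizing false loops by traveled distance makes sequences of different lengths comparable, and it reflects the point made above that even a few false loop constraints can bias the graph.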



Figure 7: Loop-closure examples passing geometric verification: query/retrieved frames, verified correspondences, and inserted loop constraint.
7.5 Uncertainty, Robustness, and Cross-Sensor Shift
Uncertainty is evaluated with negative log-likelihood (NLL)/expected calibration error (ECE) and robustness with failure rate under zero-shot transfer, with emphasis on how conformal calibration behaves under cross-dataset and cross-sensor shift. The transfer from KITTI to KITTI-360 mainly changes appearance statistics, scene layout, and long-horizon driving context under the same stereo sensing regime, whereas the transfer from TartanAir to EuRoC additionally introduces synthetic-to-real, motion-regime, and stereo + IMU VIO differences. In fixed-noise SLAM systems, such shifts can mis-scale visual, depth, and inertial factor families, causing some modalities to dominate the normal equations. The learned heads provide modality-conditioned initial covariances, and the conformal layer further corrects residual scale mismatch online per factor family. Table 10 and Fig. 8 show that conformal calibration reduces ECE and is accompanied by lower failure rates across both transfer settings, consistent with more reliable factor weighting under shift rather than merely better in-domain fitting.


Figure 8: Uncertainty calibration diagnostics. Top: reliability diagrams. Bottom: score distributions under transfer.
Key components are ablated under the same protocol, with particular emphasis on calibration under shift and on the loop-closure design. Table 11 shows that removing conformal calibration increases both ATE and failure rate (A1 vs. A0), replacing the foundation backbone degrades transfer robustness (A2), and geometric-only loop closure substantially increases drift (A3), indicating the value of descriptor-topological proposal. Together with the cross-shift results in Table 10, these ablations support the claim that conformal reweighting is the primary mechanism improving robustness under transfer. Table 9 further isolates the contribution of the geometric verification stage beyond descriptor retrieval alone.

Solver stability in SLAM is tightly coupled to factor weighting. In this work, “stability” is used in an operational sense: fewer failed runs, fewer catastrophic drifts/divergences, and less sensitivity to misweighted factor families during sequential GN/LM updates under transfer. Under domain shift, miscalibrated uncertainties can overweight unreliable constraints and degrade conditioning, producing drift or divergence. The conformal rescaling rule in Eq. (7) provides a simple online mechanism to adjust covariance magnitudes using observed residual statistics, without retraining, and the empirical evidence in Tables 10 and 11 should be interpreted in this operational sense rather than as a stand-alone spectral conditioning proof. A representative divergence-vs.-recovery comparison is shown in Fig. 9.
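A scalar toy problem illustrates the operational point about weighting. The numbers are invented for illustration only: two factors measure the same scalar state whose true value is 0, and the second factor is both wrong and overconfident. In the linear case, one GN step reduces to inverse-variance weighted averaging.

```python
def fuse(measurements):
    """Weighted least-squares estimate from (value, variance) pairs.

    Each factor contributes information 1 / variance to the normal equation.
    """
    information = sum(1.0 / var for _, var in measurements)
    return sum(val / var for val, var in measurements) / information


# Factor A: accurate, honest uncertainty (std 0.1).
# Factor B: off by 1.0 but claims std 0.01, so it carries 100x the weight of A.
miscalibrated = fuse([(0.1, 0.1 ** 2), (1.0, 0.01 ** 2)])

# Inflating B's variance to match its actual error scale (std 1.0)
# restores a sensible balance between the two factors.
rescaled = fuse([(0.1, 0.1 ** 2), (1.0, 1.0 ** 2)])
```

Here `miscalibrated` is about 0.991, so the estimate follows the overconfident factor almost entirely, while `rescaled` is about 0.109. No factor was removed and no robust loss was applied; only the covariance scale changed, which is the effect the conformal layer targets.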

Figure 9: Failure case analysis on KITTI
Limitations include: (i) foundation backbones increase compute relative to compact convolutional neural network (CNN) front-ends, motivating distillation or reduced token resolution for real-time deployment; (ii) calibration relies on bounded windows and approximate stationarity, so abrupt regime changes can challenge residual statistics despite warm-up and caps; (iii) the conformal-style update is empirical and does not provide strict exchangeability-based coverage guarantees in sequential SLAM; and (iv) loop closure remains subject to the retrieval/verification trade-off under severe viewpoint changes and repeated structures. In highly dynamic scenes with large non-rigid occluders, retrieval and correspondences may degrade; integrating motion- or instance-aware masking with dynamic-aware matching is a natural extension that preserves the current solver formulation. The multi-stage loop-closure pipeline also introduces controllable overhead; sustained real-time operation depends on scheduling choices such as keyframe triggering, candidate caps, and (when available) asynchronous verification.
This article introduced
Acknowledgement: Not applicable.
Funding Statement: The author received no specific funding for this study.
Availability of Data and Materials: The datasets analyzed in this study are publicly available:
• KITTI Odometry: https://www.cvlibs.net/datasets/kitti/eval_odometry.php
• KITTI-360: https://www.cvlibs.net/datasets/kitti-360/
• EuRoC MAV: https://ethz-asl.github.io/datasets/
• TUM RGB-D: https://cvg.cit.tum.de/data/datasets/rgbd-dataset
• ScanNet: https://github.com/ScanNet/ScanNet
• 7-Scenes: https://www.microsoft.com/en-us/research/project/rgb-d-dataset-7-scenes/
• TartanAir: https://theairlab.org/tartanair-dataset/
Ethics Approval: Not applicable.
Conflicts of Interest: The author declares no conflicts of interest.
References
1. Huang H, Li L, Cheng H, Yeung SK. Photo-SLAM: real-time simultaneous localization and photorealistic mapping for monocular, stereo, and RGB-D cameras. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway, NJ, USA: IEEE; 2024. p. 21584–93.
2. Yin M, Wu S, Han K. IBD-SLAM: learning image-based depth fusion for generalizable SLAM. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway, NJ, USA: IEEE; 2024. p. 10563–73.
3. Oquab M, Darcet T, Moutakanni T, Vo H, Szafraniec M, Khalidov V, et al. DINOv2: learning robust visual features without supervision. arXiv:2304.07193. 2023.
4. Abdelkarim A, Voos H, Görges D. Factor graphs in optimization-based robotic control—a tutorial and review. IEEE Access. 2025;13(23):28315–34. doi:10.1109/access.2025.3534993.
5. Gibbs I, Candès EJ. Conformal inference for online prediction with arbitrary distribution shifts. J Mach Learn Res. 2024;25(162):1–36.
6. Campos C, Elvira R, Rodríguez JJG, Montiel JM, Tardós JD. ORB-SLAM3: an accurate open-source library for visual, visual-inertial, and multimap SLAM. IEEE Trans Robot. 2021;37(6):1874–90.
7. Lindenberger P, Sarlin PE, Pollefeys M. LightGlue: local feature matching at light speed. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. Piscataway, NJ, USA: IEEE; 2023. p. 17627–38.
8. Edstedt J, Sun Q, Bökman G, Wadenbäck M, Felsberg M. RoMa: robust dense feature matching. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway, NJ, USA: IEEE; 2024. p. 19790–800.
9. Matsuki H, Murai R, Kelly PHJ, Davison AJ. Gaussian splatting SLAM. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway, NJ, USA: IEEE; 2024. p. 18039–48.
10. Keetha N, Karhade J, Jatavallabhula KM, Yang G, Scherer S, Ramanan D, et al. SplaTAM: splat track & map 3D Gaussians for dense RGB-D SLAM. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway, NJ, USA: IEEE; 2024. p. 21357–66.
11. Clarté L, Loureiro B, Krzakala F, Zdeborová L. Expectation consistency for calibration of neural networks. In: Proceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence. Vol. 216. London, UK: PMLR; 2023. p. 443–53.
12. Oliveira RI, Orenstein P, Ramos T, Romano JV. Split conformal prediction and non-exchangeable data. J Mach Learn Res. 2024;25(225):1–38.
13. Xue Z, Zhao S, Qi Y, Zeng X, Yu Z. Resilient routing: risk-aware dynamic routing in smart logistics via spatiotemporal graph learning. arXiv:2601.13632. 2026.
14. Qin T, Li P, Shen S. VINS-Mono: a robust and versatile monocular visual-inertial state estimator. IEEE Trans Robot. 2018;34(4):1004–20.
15. Sun J, Shen Z, Wang Y, Bao H, Zhou X. LoFTR: detector-free local feature matching with transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway, NJ, USA: IEEE; 2021. p. 8922–31.
16. Geiger A, Lenz P, Urtasun R. Are we ready for autonomous driving? The KITTI vision benchmark suite. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway, NJ, USA: IEEE; 2012. p. 3354–61.
17. Geiger A, Lenz P, Stiller C, Urtasun R. Vision meets robotics: the KITTI dataset. Int J Robot Res. 2013;32(11):1231–7.
18. Liao Y, Xie J, Geiger A. KITTI-360: a novel dataset and benchmarks for urban scene understanding in 2D and 3D. IEEE Trans Pattern Anal Mach Intell. 2022;45(3):3292–310.
19. Burri M, Nikolic J, Gohl P, Schneider T, Rehder J, Omari S, et al. The EuRoC micro aerial vehicle datasets. Int J Robot Res. 2016;35(10):1157–63. doi:10.1177/0278364915620033.
20. Sturm J, Engelhard N, Endres F, Burgard W, Cremers D. A benchmark for the evaluation of RGB-D SLAM systems. In: 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems. Piscataway, NJ, USA: IEEE; 2012. p. 573–80.
21. Dai A, Chang AX, Savva M, Halber M, Funkhouser T, Nießner M. ScanNet: richly-annotated 3D reconstructions of indoor scenes. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway, NJ, USA: IEEE; 2017. p. 5828–39.
22. Engel J, Koltun V, Cremers D. Direct sparse odometry. IEEE Trans Pattern Anal Mach Intell. 2017;40(3):611–25. doi:10.1109/tpami.2017.2658577.
23. Forster C, Zhang Z, Gassner M, Werlberger M, Scaramuzza D. SVO: semidirect visual odometry for monocular and multicamera systems. IEEE Trans Robot. 2016;33(2):249–65.
24. Leutenegger S, Lynen S, Bosse M, Siegwart R, Furgale P. Keyframe-based visual-inertial odometry using nonlinear optimization. Int J Robot Res. 2015;34(3):314–34. doi:10.1177/0278364914554813.
25. Teed Z, Deng J. DROID-SLAM: deep visual SLAM for monocular, stereo, and RGB-D cameras. Adv Neural Inf Process Syst. 2021;34:16558–69.
26. Dai A, Nießner M, Zollhöfer M, Izadi S, Theobalt C. BundleFusion: real-time globally consistent 3D reconstruction using on-the-fly surface reintegration. ACM Trans Graph. 2017;36(4):1.
27. Zhu Z, Peng S, Larsson V, Xu W, Bao H, Cui Z, et al. NICE-SLAM: neural implicit scalable encoding for SLAM. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway, NJ, USA: IEEE; 2022. p. 12786–96.
28. DeTone D, Malisiewicz T, Rabinovich A. SuperPoint: self-supervised interest point detection and description. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. Piscataway, NJ, USA: IEEE; 2018. p. 224–36.
Copyright © 2026 The Author(s). Published by Tech Science Press. This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

