Introduction

Simultaneous Localization and Mapping (SLAM) serves as a fundamental technique for autonomous robot navigation. SLAM systems are categorized into LiDAR and visual variants based on the type of sensor used. Compared with LiDAR, visual SLAM provides advantages such as compact sensors, lower cost, and richer texture information. These features make it well-suited for applications in low-cost robotics, unmanned aerial vehicles (UAVs), and augmented reality.

Despite these benefits, visual SLAM faces several challenges in real-world scenarios. First, to meet lightweight and low-cost requirements, many consumer platforms are equipped with only monocular cameras. This prevents the recovery of absolute scale and limits localization accuracy in GNSS-denied environments. Second, dynamic objects often interfere with feature extraction and matching, causing pose estimation drift. Third, dense maps are crucial for advanced robotic functions, such as obstacle avoidance and autonomous navigation. However, most visual SLAM systems produce only sparse maps, which are inadequate for such tasks.

Various approaches have been proposed to address these issues. Monocular depth estimation using deep learning techniques has been proposed as a solution to scale ambiguity. These methods enable absolute scale recovery and dense reconstruction without the need for external sensors1,2,3,4. However, most of them assume static scenes, ignore dynamic objects, and typically support only monocular input. In parallel, multiple strategies5,6,7,8,9,10,11,12,13,14 have been developed to improve robustness in dynamic environments by eliminating dynamic regions. Yet, most existing methods are limited to RGB-D inputs. Even systems that support multiple input types, such as DynaSLAM11, fail to recover absolute scale or perform dense mapping in the monocular case.

To address these challenges, this paper presents SDMFusion, a comprehensive framework enabling real-time, scale-aware dense mapping with enhanced dynamic robustness. It is compatible with monocular, stereo, and RGB-D cameras. The primary contributions of this study are outlined as follows:

  1. A scale-depth optimization method that leverages DepthAnythingV2 to recover precise absolute scale for monocular and refine depth maps for stereo and RGB-D.

  2. A dynamic feature rejection strategy that incorporates YOLO11s-seg, geometric constraints, and a feature-level moving consistency check to accurately reject dynamic features.

  3. A generalized framework named SDMFusion that enables high-quality, real-time, scale-aware dense mapping with dynamic robustness across all camera types.

The structure of this paper is as follows. Section 2 provides a review of related work. Section 3 introduces the proposed methodology. Section 4 presents the analysis of experimental results, and Section 5 offers the concluding remarks.

Related work

Monocular depth estimation for SLAM

Scale uncertainty is a common issue in monocular SLAM. To address this, various studies15,16,17 have introduced auxiliary sensors such as IMUs, stereo cameras, or depth sensors to obtain absolute scale. However, for cost and size considerations, recent research has increasingly focused on recovering absolute scale using monocular images alone. Among these efforts, monocular depth estimation has become a key technique.

Early works, such as CNN-SLAM1, introduced CNN-predicted dense depth maps into monocular SLAM. These predictions were combined with SLAM measurements to enhance performance, particularly in low-texture regions. In CNN-SVO18, predicted depth initialized the feature depth variance and mean during mapping. DVSO19 incorporated a two-stage optimization framework and fused predicted depth as virtual stereo measurements into DSO20. Pose, uncertainty, and predicted depth were incorporated into D3VO21 to improve both tracking and optimization. Several other methods further explored this direction. Steenbeek et al.2 combined SLAM with CNN-based depth estimation to achieve real-time dense mapping and scale calibration on UAVs. Yin et al.22 employed deep convolutional neural fields for scale recovery and depth estimation. Tiwari et al.23 explored joint optimization of depth and pose. Sun et al.24 enhanced visual odometry using monocular depth estimation, but provided only relative depth. Luo et al.3 fused monocular SLAM with an adaptive online depth predictor to improve heterogeneous scene reconstruction. DRM-SLAM4 proposed a depth fusion scheme using CNNs for robust depth prediction and absolute scale recovery.

Despite promising results, many of these monocular depth estimation methods exhibit limited absolute accuracy, especially in complex scenes. To improve this, recent models25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41 have used techniques such as multi-scale modeling, ranked regression, domain adaptation, diffusion mechanisms, and multi-task learning. Among them, DepthAnythingV241 has emerged as a leading model due to its high accuracy, strong generalization ability, and inference efficiency. In this work, a scale-depth optimization module is constructed based on DepthAnythingV2 to recover absolute scale and refine depth. Moreover, most existing methods still assume static scenes, leading to degraded performance in dynamic environments.

Dynamic SLAM

Conventional SLAM methods typically assume a static environment. However, in real-world settings with pedestrians, vehicles, or animals, this assumption is often violated. Dynamic objects contaminate feature extraction and reduce localization accuracy. Thus, precise identification and removal of dynamic regions are essential for robust and accurate SLAM.

Dynamic SLAM techniques are generally divided into geometric and deep learning-based approaches. Geometric methods identify dynamic features without deep learning. Some5,7 utilize moving consistency, recognizing dynamic features by evaluating deviations from camera motion. However, these rely on static assumptions during pose estimation and require accurate motion models. This leads to a well-known “chicken-or-egg” problem8. Besides, Sun et al.42 used the homography matrix between frames alongside RANSAC for dynamic object segmentation. However, when dynamic objects dominate the frame, static features are often mistakenly removed. To address this, DM-SLAM43 proposed DLRSAC, assuming that static features are spatially well-distributed, applying a grid-based model to distinguish dynamics. Similarly, Sun et al.44 created a foreground motion model to separate dynamic and static features. ReFusion6 employed alignment residuals to detect dynamic regions and perform dense reconstruction in such environments.

Dynamic SLAM performance has markedly improved with the rise of deep learning. Compared with traditional techniques, deep learning models offer greater scene adaptability and more accurate dynamic object extraction. For example, DS-SLAM10 integrated moving consistency check and SegNet45, enabling dynamic feature removal. Although effective for highly dynamic objects, its performance depends heavily on motion detection and can mistakenly eliminate static features. DynaSLAM11 employed Mask-RCNN46 and multi-view geometry for dynamic feature identification and removal. It performs well in complex scenarios but is computationally intensive. RDS-SLAM47 introduced parallel semantic and optimization threads into ORB-SLAM317, supporting SegNet and Mask-RCNN to filter dynamics. Although near real-time performance is achieved, the localization accuracy remains relatively low. Anebarassane et al.48 integrated YOLOv8-seg49 into ORB-SLAM3, optimizing segmentation for real-time operation. Detect-SLAM12 used SSD50 for object detection with ORB-SLAM216, processing only keyframes for efficiency. DO-SLAM13 incorporated YOLOv549 and polar geometric constraints into ORB-SLAM2. YOLO-SLAM14 enhanced ORB-SLAM2 using a lightweight YOLOv349 variant and new geometric constraints. Liu et al.51 combined YOLOv5 and geometric constraints within the ORB-SLAM2 framework to mitigate the impact of dynamic objects. Liu et al.52 incorporated a depth-aware point-line attentional graph neural network and RGB-D sensing into ORB-SLAM3 to enhance robustness against dynamic objects and low-texture regions. Despite the promising results of the above methods, most are centered around RGB-D inputs. Some support monocular or stereo but lack absolute scale recovery and dense reconstruction for monocular, limiting their practical application.

Methods

Framework

SDMFusion is a scale-aware dense SLAM framework with dynamic robustness built upon ORB-SLAM317. The system incorporates three modules: the scale-depth optimization module, the dynamic feature rejection module, and the real-time anti-dynamic dense reconstruction module. As illustrated in Fig. 1, the system adopts a modular design for flexible integration. Firstly, in the scale-depth optimization module, DepthAnythingV241 is used to generate dense depth maps from RGB or grayscale images. For monocular, these depth maps provide the absolute scale necessary for accurate localization and dense reconstruction. For RGB-D and stereo, the raw depth data can be refined and completed. Secondly, the dynamic feature rejection module employs YOLO11s-seg49 for real-time instance segmentation. Dynamic masks are then generated and passed to the tracking module based on object category and spatial relationships. Based on the masks, ORB features53 are categorized into static and dynamic sets. The dynamic features are then evaluated through a moving consistency check, and only truly dynamic features are rejected. Finally, the real-time anti-dynamic dense reconstruction module constructs dense maps using keyframes’ optimized depth maps, camera poses, and dynamic masks. Detailed technical aspects of each module are described below.

Fig. 1 The architecture of SDMFusion.

Scale-depth optimization

This module provides an absolute scale for monocular and refines depth maps for RGB-D and stereo. The module is built on DepthAnythingV241, which enhances accuracy and robustness using synthetic data, larger model capacity, and large-scale pseudo-labeled images.

The original implementation of DepthAnythingV2 is based on the PyTorch framework. To ensure real-time operation, the pre-trained model is optimized with NVIDIA TensorRT (TensorRT 8.6, https://github.com/NVIDIA/TensorRT), and model inference is re-implemented with TensorRT's inference engine. Following open-source implementations54, we first export an Open Neural Network Exchange (ONNX) model from the official DepthAnythingV2 codebase and then compile it into a TensorRT engine. The resulting engine conducts inference in FP16 precision at an input resolution of 518 × 518 pixels. To adapt to varying scenarios, both the indoor and outdoor pre-trained metric models are converted and deployed separately.

DepthAnythingV2 processes RGB or grayscale images and outputs corresponding depth maps. In monocular mode, the predicted depth is used directly for absolute scale recovery and dense reconstruction. The method for recovering the absolute scale is intentionally simple: the predicted depth map is treated as pseudo-depth for RGB-D initialization. Specifically, for each input frame, the system first predicts a depth map and generates a corresponding segmentation mask. Both are then combined with the RGB image and passed to the RGB-D tracking function for map initialization and subsequent pose optimization. The RGB-D tracking function first converts the RGB image into a grayscale image and extracts ORB features from it. Dynamic features are then filtered out by combining the segmentation mask with the moving consistency check, and the remaining static features are rectified. Using the predicted depth map, the horizontal coordinate of the matching feature in a hypothetical right image is computed for each feature. These features are then assigned to an image grid, and map initialization is performed. For each static feature, the corresponding 3D world coordinates are obtained by back-projecting the feature with its associated predicted depth value, as defined in Eq. (3). This process constructs the initial 3D map and thereby recovers the absolute scale for monocular SLAM from the predicted depth map.
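The mapping from predicted depth to the hypothetical right-image coordinate can be sketched in a few lines. This is a minimal NumPy illustration assuming the standard stereo relation u_R = u − f_x·b/d used by ORB-SLAM-style RGB-D tracking; the focal length, virtual baseline, and depth values below are illustrative, not the system's actual parameters:

```python
import numpy as np

def virtual_right_coords(u, depth, fx, baseline):
    """Map a feature's column u and predicted depth d to the column of the
    matching feature in a hypothetical right image: u_R = u - fx * b / d."""
    u = np.asarray(u, dtype=float)
    depth = np.asarray(depth, dtype=float)
    disparity = fx * baseline / depth  # stereo disparity in pixels
    return u - disparity

# Illustrative values: fx = 500 px, virtual baseline b = 0.08 m.
u_right = virtual_right_coords(u=[320.0, 100.0], depth=[4.0, 2.0],
                               fx=500.0, baseline=0.08)
```

Features closer to the camera receive a larger disparity, so the virtual right-image coordinate shifts further left, exactly as a physical stereo pair would behave.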

Fig. 2 Comparison of raw and refined depth maps.

In stereo mode, the Semi-Global Block Matching (SGBM)55 algorithm is first used to generate the initial depth image. A global scale factor \(S\) is then computed as the sum of predicted depth divided by the sum of initial depth over the valid regions, as defined in Eq. (1). The valid regions are defined as the intersection of valid pixels in the initial and predicted depth maps. For stereo outdoor scenes, the valid depth ranges are 3 m to 50 m for the initial map and 0.05 m to 80 m for the predicted map. Invalid regions in the initial depth map are then filled by multiplying the predicted depth values by the global scale factor \(S\), as defined in Eq. (2). In RGB-D mode, the sensor provides raw depth maps, which are often noisy or incomplete, and the same refinement method is applied as in stereo mode. For RGB-D indoor scenes, the valid depth ranges are 0.05 m to 8 m for the raw map and 0.05 m to 20 m for the predicted map. Assuming that the initial, predicted, and refined depth maps are denoted as \({D}_{init}\), \({D}_{pred}\), and \({D}_{refined}\), respectively, then

$$S=\frac{\mathrm{sum}\left({D}_{pred}\left[valid\left({D}_{init}\right)\cap valid\left({D}_{pred}\right)\right]\right)}{\mathrm{sum}\left({D}_{init}\left[valid\left({D}_{init}\right)\cap valid\left({D}_{pred}\right)\right]\right)}$$
(1)
$${D}_{refined}\left[p\right]=\begin{cases}{D}_{init}\left[p\right] & \text{if}\ p\in valid\left({D}_{init}\right)\\ S\cdot {D}_{pred}\left[p\right] & \text{otherwise}\end{cases}$$
(2)

Where \(S\) represents the scale factor, \(\mathrm{sum}(\cdot)\) refers to accumulation, \(valid(\cdot)\) denotes the extraction of valid values, \([\cdot]\) stands for regional indexing, and \(p\) denotes a pixel coordinate. Figure 2 shows that the refined depth is more complete and smoother, with clearer details.
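The refinement of Eqs. (1) and (2) can be sketched in a few lines of NumPy; the arrays and range bounds in the usage example are illustrative, not data from the experiments:

```python
import numpy as np

def refine_depth(d_init, d_pred, init_range, pred_range):
    """Fill invalid regions of an initial depth map with globally scaled
    predicted depth, following Eqs. (1)-(2). Each range is (min, max) in
    metres and defines the valid pixels of the corresponding map."""
    valid_init = (d_init > init_range[0]) & (d_init < init_range[1])
    valid_pred = (d_pred > pred_range[0]) & (d_pred < pred_range[1])
    both = valid_init & valid_pred
    # Global scale factor S: ratio of summed depths over the overlap (Eq. 1).
    s = d_pred[both].sum() / d_init[both].sum()
    # Keep valid initial depth; elsewhere substitute scaled prediction (Eq. 2).
    refined = np.where(valid_init, d_init, s * d_pred)
    return refined, s

# Toy 2x2 maps: zeros in d_init are invalid and get filled.
d_init = np.array([[2.0, 0.0], [4.0, 0.0]])
d_pred = np.array([[1.0, 3.0], [2.0, 5.0]])
refined, s = refine_depth(d_init, d_pred, (0.05, 8.0), (0.05, 20.0))
# s == 0.5 here, since the predicted depths are half the initial ones.
```

Because \(S\) is a single global ratio, the fill is cheap and keeps the completed regions metrically consistent with the trusted initial depth.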

Dynamic feature rejection

YOLO49 has gained widespread use in segmentation and detection due to its efficiency and high accuracy. To handle dynamic objects efficiently, YOLO11s-seg is introduced. The open-source YOLO11s-seg pre-trained model, trained on the COCO-Seg dataset covering 80 classes, was used directly with its released configuration. Specifically, the model was configured with an input resolution of 640 × 640 pixels, a confidence threshold of 0.25, an IoU threshold of 0.7, and non-maximum suppression (NMS) disabled. With low computational overhead, YOLO11s-seg ensures high accuracy while meeting the real-time processing needs of SLAM systems.

To align with dynamic SLAM, the original category labels of YOLO11 have been reclassified according to dynamic attributes. Most existing methods concentrate only on objects that are intrinsically dynamic, such as pedestrians and vehicles. However, potentially dynamic objects like chairs and keyboards can also be displaced by external forces, and leaving them unprocessed may degrade robustness and accuracy. Therefore, a three-level dynamic attribute classification mechanism is proposed in this paper to achieve accurate dynamic feature rejection. As shown in Table 1, objects are categorized as static, potentially dynamic, or dynamic based on their dynamic attributes. Dynamic objects are defined as those capable of active motion, including people, various cars, and animals. Potentially dynamic objects are those that can be moved by dynamic objects, such as chairs, laptops, and cell phones. Static objects are those that generally remain stationary, like refrigerators, microwaves, and toilets. After this initial classification, potentially dynamic objects are further examined based on their spatial positions: any potentially dynamic object located near a dynamic object is classified as dynamic, while the others are treated as static. Based on this rule, the final dynamic mask is generated to identify the dynamic regions, and the tracking thread receives this mask for the removal of dynamic features.
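The promotion rule for potentially dynamic objects can be sketched as follows. The category sets and the 0.5 m proximity threshold are illustrative stand-ins for the full Table 1 mapping, and object positions are abstracted to 2D centroids:

```python
import math

# Illustrative subsets of the three-level classification (see Table 1).
DYNAMIC = {"person", "car", "dog"}
POTENTIALLY_DYNAMIC = {"chair", "laptop", "cell phone"}

def final_labels(detections, near_dist=0.5):
    """detections: list of (class_name, (x, y)) mask centroids.
    A potentially dynamic object near any dynamic object is promoted
    to dynamic; everything else falls back to static."""
    dyn_centres = [c for name, c in detections if name in DYNAMIC]
    labels = {}
    for name, centre in detections:
        if name in DYNAMIC:
            labels[(name, centre)] = "dynamic"
        elif name in POTENTIALLY_DYNAMIC and any(
                math.dist(centre, d) < near_dist for d in dyn_centres):
            labels[(name, centre)] = "dynamic"
        else:
            labels[(name, centre)] = "static"
    return labels
```

For instance, a chair adjacent to a person would be masked out, while an identical chair across the room keeps contributing static features.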

Table 1 Categories of objects.

After the masks are passed to the tracking thread, dynamic features are further filtered. The core mechanism is a feature-level moving consistency check consisting of five main steps. Firstly, feature correspondences between frames are computed with Lucas-Kanade (LK) pyramidal optical flow, with the pyramid level set to 5. Secondly, matching pairs at the image edges or with excessively large pixel differences in a 3 × 3 neighborhood are eliminated. Specifically, features located within 5 pixels of the image border, or whose sum of absolute pixel differences in a 3 × 3 neighborhood exceeds 2120, are designated as outliers. Thirdly, the RANSAC algorithm is applied to estimate the fundamental matrix from the inlier matching pairs, configured with a distance threshold of 0.1 pixels and a confidence of 0.99. Fourthly, based on the matrix, the epipolar line is generated for each feature. Finally, if the distance of a feature from its epipolar line exceeds 1 pixel, it is classified as dynamic. Unlike region-level rejection, this method allows retention of static features within potentially dynamic or dynamic objects. For example, if a person turns their head, the head may exhibit motion while the torso remains static. Indiscriminately eliminating the whole person would discard numerous stable features, degrading tracking accuracy and robustness. Therefore, to reject only features exhibiting motion, the moving consistency of each feature within dynamic regions is checked individually. This strategy effectively suppresses dynamic interference while maximizing both the quantity and quality of static features. However, dynamic objects may dominate most of the image, causing the fundamental matrix estimation to fail. In such cases, our algorithm directly rejects all dynamic features based on the dynamic masks.
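The final epipolar test (steps four and five) can be sketched in NumPy. In the actual pipeline the fundamental matrix would come from RANSAC estimation (step three, e.g. OpenCV's findFundamentalMat); here it is assumed given, and the 1-pixel threshold follows the text:

```python
import numpy as np

def epipolar_distances(f_mat, pts_prev, pts_cur):
    """Distance of each current-frame feature from the epipolar line
    induced by its previous-frame match (steps 4-5 of the check)."""
    ones = np.ones((len(pts_prev), 1))
    x1 = np.hstack([pts_prev, ones])           # homogeneous previous points
    x2 = np.hstack([pts_cur, ones])            # homogeneous current points
    lines = x1 @ f_mat.T                       # epipolar lines l_i = F x1_i
    num = np.abs(np.sum(lines * x2, axis=1))   # |x2^T F x1|
    den = np.sqrt(lines[:, 0] ** 2 + lines[:, 1] ** 2)
    return num / den

def is_truly_dynamic(f_mat, pts_prev, pts_cur, thresh=1.0):
    """Features farther than `thresh` pixels from their epipolar line are
    classified as dynamic; others inside dynamic masks stay static."""
    return epipolar_distances(f_mat, pts_prev, pts_cur) > thresh
```

Because the test runs per feature, a static torso inside a person's mask passes (small distance) while a moving head fails, matching the behavior described above.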

Figure 3 presents a comparison of feature extraction, validating the effectiveness and accuracy of the proposed approach. YOLO11-SLAM3 denotes ORB-SLAM3 augmented with YOLO11s-seg-based region-level dynamic feature rejection. It can be observed that the original ORB-SLAM3 retains many dynamic features. YOLO11-SLAM3 removes dynamic features but also erroneously eliminates stable features within dynamic objects. In contrast, the proposed method accurately distinguishes between dynamic and static features. Static features, such as those on the static torso of the person on the left, are successfully preserved. Only truly dynamic features, such as those on the person on the right and the head of the person on the left, are rejected.

Fig. 3 The features of the three different methods.

Real-time anti-dynamic dense reconstruction

Most visual SLAM methods produce only sparse maps, which are insufficient for complex applications like obstacle avoidance or navigation. Moreover, repeated observations of moving objects across multiple frames in dynamic environments often result in ghosting or artifacts. To address these issues, a static region-driven dense mapping mechanism is proposed in this paper. Dynamic pixels are excluded using dynamic masks before reconstruction, and only the static parts are reconstructed.

The reconstruction in this module proceeds as follows. Firstly, only keyframes are used for reconstruction to ensure map integrity and to control data size. Static pixel filtering is then applied to each keyframe, retaining only pixels labeled as static. Subsequently, 3D projection and reconstruction are carried out based on the static pixels. Using the extrinsic parameters, color images, and depth maps, static pixels are transformed into a point cloud, which is then integrated into the global map. By merging point clouds from all keyframes, a complete reconstruction of the static environment is obtained. Finally, voxel filtering is applied to the global map to reduce data redundancy, preserve structures, and improve mapping efficiency. The back-projection of static pixels to the point cloud is defined in Eq. (3).

$${X}_{w}={T}_{wc}\cdot {K}^{-1}\cdot {\left[u,v,1\right]}^{T}\cdot D\left(u,v\right)$$
(3)

Where \(\left(u,v\right)\) represents the pixel coordinates, \(D\left(u,v\right)\) denotes the depth at pixel \(\left(u,v\right)\), \({X}_{w}\) stands for the corresponding world coordinates, \(K\) denotes the camera intrinsic matrix, and \({T}_{wc}\) denotes the camera extrinsic matrix. \(K\) is a 3 × 3 matrix that describes the camera's focal lengths and principal point coordinates; combined with the depth value, it transforms pixel coordinates into the camera coordinate system. \({T}_{wc}\) is a 4 × 4 homogeneous transformation matrix, composed of a 3 × 3 rotation matrix and a 3 × 1 translation vector, representing the transformation from the camera coordinate system to the world coordinate system. The parameters of our experimental setup were as follows. For the outdoor KITTI dataset, the point density was set to 10 cm; for the indoor BONN RGB-D and TUM RGB-D datasets, a density of 1 cm was employed. Accordingly, the leaf size of the voxel filter was configured to 10 cm and 1 cm for the outdoor and indoor scenarios, respectively. For the statistical outlier removal filter, MeanK and the standard deviation multiplier threshold were set to 50 and 1.
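The back-projection of Eq. (3) and the subsequent voxel filtering can be sketched as follows. The centroid-based voxel filter is a simplified stand-in for the actual implementation, with `leaf` corresponding to the 10 cm / 1 cm leaf sizes above; the intrinsics in the comments are illustrative:

```python
import numpy as np

def backproject_static(u, v, depth, k_mat, t_wc):
    """Back-project static pixels (u, v) with depth D(u, v) to world
    coordinates per Eq. (3): X_w = T_wc * K^-1 * [u, v, 1]^T * D(u, v)."""
    pix = np.stack([u, v, np.ones_like(u)], axis=0).astype(float)  # 3xN
    cam = (np.linalg.inv(k_mat) @ pix) * depth       # camera frame, 3xN
    cam_h = np.vstack([cam, np.ones((1, cam.shape[1]))])
    world = t_wc @ cam_h                             # world frame, 4xN
    return world[:3].T                               # Nx3 point cloud

def voxel_downsample(points, leaf):
    """Keep one point (the centroid) per cubic voxel of side `leaf`."""
    keys = np.floor(points / leaf).astype(np.int64)
    _, inv = np.unique(keys, axis=0, return_inverse=True)
    inv = inv.reshape(-1)
    counts = np.bincount(inv).astype(float)
    out = np.zeros((inv.max() + 1, 3))
    for dim in range(3):
        out[:, dim] = np.bincount(inv, weights=points[:, dim]) / counts
    return out
```

With, say, fx = fy = 500 px and principal point (320, 240), the principal-point pixel at 2 m depth lands at (0, 0, 2) in a world frame whose extrinsic matrix is the identity.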

Results

Implementation and experiment setup

Datasets and metrics

Three representative datasets are selected in this paper for evaluation. The first is KITTI56, a low-dynamic outdoor dataset that includes natural dynamic conditions such as vehicle movement. The other two are highly dynamic indoor datasets, TUM RGB-D57 and BONN RGB-D6. They contain a variety of scenarios involving human activity, occlusions, and non-rigid body changes, posing greater challenges. The Absolute Trajectory Error (ATE) and Relative Pose Error (RPE) are employed as the primary accuracy metrics, where the RPE comprises both translational and rotational components. Accordingly, the Root Mean Square Error (RMSE) of the ATE, translational RPE, and rotational RPE is used to measure the trajectory's accuracy and stability. The ATE RMSE is defined in Eq. (4).

$${RMSE}_{ATE}=\sqrt{\frac{1}{N}\sum_{i=1}^{N}{\left\|trans\left({T}_{gt,i}^{-1}\cdot {T}_{est,i}\right)\right\|}^{2}}$$
(4)

Where \(N\) represents the number of frames, \({T}_{gt,i}\) represents the ground truth pose of the \(i\)-th frame, \({T}_{est,i}\) denotes the estimated pose of the \(i\)-th frame, and \(trans(\cdot)\) refers to taking the translation part of the pose. The translational and rotational RPE RMSE are defined in Eqs. (5) and (6).

$${RMSE}_{RPE\left(trans\right)}=\sqrt{\frac{1}{n-\varDelta t}\sum_{i=1}^{n-\varDelta t}{\left\|trans\left({\left({Q}_{i}^{-1}{Q}_{i+\varDelta t}\right)}^{-1}\left({P}_{i}^{-1}{P}_{i+\varDelta t}\right)\right)\right\|}^{2}}$$
(5)
$${RMSE}_{RPE\left(rot\right)}=\sqrt{\frac{1}{n-\varDelta t}\sum_{i=1}^{n-\varDelta t}{\left\|angle\left({\left({Q}_{i}^{-1}{Q}_{i+\varDelta t}\right)}^{-1}\left({P}_{i}^{-1}{P}_{i+\varDelta t}\right)\right)\right\|}^{2}}$$
(6)

Where \({P}_{i}\) denotes the estimated pose at timestamp \(i\), \({Q}_{i}\) represents the ground truth pose at timestamp \(i\), \(\varDelta t\) denotes the time interval, \(trans(\cdot)\) refers to taking the translational component of the pose, and \(angle(\cdot)\) refers to taking the rotational component of the pose. The TUM evaluation tool57 (https://cvg.cit.tum.de/data/datasets/rgbd-dataset/tools) was used for evaluations. Since ground truth dense maps are unavailable in public datasets, qualitative visual comparisons are used to evaluate reconstruction quality.
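Eq. (4) can be computed directly from 4 × 4 pose matrices; consistent with the evaluation protocol described below, this sketch applies no scale or trajectory alignment. The translational RPE of Eq. (5) is analogous, operating on relative poses over an interval:

```python
import numpy as np

def ate_rmse(t_gt, t_est):
    """ATE RMSE per Eq. (4): t_gt and t_est are equal-length lists of 4x4
    homogeneous poses; the per-frame error is the translation magnitude
    of T_gt^-1 * T_est. No alignment or scale correction is applied."""
    errs = []
    for g, e in zip(t_gt, t_est):
        rel = np.linalg.inv(g) @ e             # relative pose error
        errs.append(np.linalg.norm(rel[:3, 3]))  # translation part
    return float(np.sqrt(np.mean(np.square(errs))))
```

For example, an estimated trajectory offset from the ground truth by a constant (3, 0, 4) m translation yields an ATE RMSE of exactly 5 m.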

Implementation details

All experiments were performed on a desktop computer with an Intel Core i9-14900KF CPU, an NVIDIA RTX 4080 SUPER GPU, and 32 GB of memory. Timing experiments were also conducted on a Jetson AGX Orin. The system was built on ORB-SLAM317, using both C++ and Python. Our method was compared against four representative SLAM approaches: ORB-SLAM3, DS-SLAM10, DynaSLAM11, and RDS-SLAM47. For ORB-SLAM3, DS-SLAM, DynaSLAM, and our method, we executed five independent runs for each sequence and calculated both the mean and standard deviation (S.D.) of the RMSE. We also evaluated the robustness of each method by calculating success rates; a run was considered successful if the trajectory data were stored and the corresponding evaluation results were output. For RDS-SLAM, we directly cite the results from the publication. Note that the RDS-SLAM paper conducted experiments on a GeForce RTX 2080Ti GPU, while our experiments were performed on an NVIDIA RTX 4080 SUPER GPU. This difference limits a direct comparison with the results reported in their paper, mainly due to variations in processing speed. RDS-SLAM includes a semantic thread and a semantic-based optimization thread, which run in parallel with the other threads, so the tracking thread does not need to wait for semantic information. This design can improve tracking speed, but it cannot guarantee that semantic information is available for every frame. In general, higher computational performance enables the acquisition of more semantic information; if RDS-SLAM were executed in our environment, it would likely obtain more semantic information and achieve higher tracking accuracy. Unfortunately, we were unable to deploy RDS-SLAM on our system, so we can only cite the results from the original publication. The mean and S.D. are reported in this paper as indicators of the robustness and stability of each system.
For the mean values, the best results are highlighted in bold in the tables. Notably, all trajectories are evaluated without post-hoc scale correction to better reflect each method’s scale recovery capabilities. Avoiding post-hoc scale correction is essential to preserve each algorithm’s inherent scale estimation, rather than aligning it to the ground truth. This ensures that the scale-aware capabilities of different methods can be fairly evaluated.

KITTI dataset

The KITTI dataset56 includes stereo sequences captured by in-vehicle equipment in outdoor environments. The setup is characterized by a baseline of 54 cm, 10 Hz operating frequency, and 1241 × 376 resolution. Dynamic elements are included in the sequences but occupy a small portion of frames. Tables 2, 3 and 4 present the quantitative comparison results of ORB-SLAM317, DynaSLAM11, and our method for 11 sequences with publicly available ground truth. All results were obtained through our own evaluations. To adapt to the TUM evaluation tool57, the ground truth and timestamp files were converted to TUM format.

As we can see from Tables 2, 3 and 4, for monocular, SDMFusion significantly reduces ATE and Translational RPE RMSE across all sequences. Additionally, in sequence 05, which contains numerous dynamic objects, SDMFusion successfully rejects dynamic features and avoids trajectory drift. For Rotational RPE RMSE, ORB-SLAM3, DynaSLAM, and our method each have their own strengths and limitations. These improvements are primarily attributed to the scale-depth optimization and dynamic feature rejection strategies. In terms of Stereo ATE and Translational RPE RMSE, ORB-SLAM3, DynaSLAM, and SDMFusion achieved the best performance on 1, 5, and 5 sequences, respectively. For Rotational RPE RMSE, DynaSLAM and SDMFusion performed best on 2 and 9 sequences, respectively. This validates the efficacy and precision of the proposed dynamic feature rejection method. In terms of success rates, only the monocular configuration of DynaSLAM failed once on sequence 02, resulting in an 80% success rate; all other configurations achieved a 100% success rate. In summary, while both our method and DynaSLAM demonstrate high accuracy, the proposed approach exhibits significantly superior robustness. Furthermore, Fig. 4 illustrates the ATE trajectory comparison between SDMFusion and ORB-SLAM3 across four KITTI sequences.

Table 2 ATE RMSE on KITTI Dataset (m).
Table 3 Translational RPE RMSE on KITTI Dataset (m).
Table 4 Rotational RPE RMSE on KITTI Dataset (deg).
Fig. 4 ATE for ORB-SLAM3 and SDMFusion on four sequences of the KITTI dataset.

Fig. 5 Dense maps for ORB-SLAM3 and SDMFusion on three sequences of the KITTI dataset.

Figure 4 shows that for stereo, SDMFusion provides slight improvements over ORB-SLAM3 in low-dynamic outdoor environments. However, for sequence 05, our method achieves a more significant improvement because of more dynamic elements. In the monocular case, our method yields clearly visible improvements. Furthermore, Fig. 5 shows the dense reconstruction results using both our method and ORB-SLAM3. To ensure a fair comparison, the SGBM55 stereo dense reconstruction method was incorporated into ORB-SLAM3. The results indicate that SDMFusion can accurately remove dynamic regions, preserving only static regions for reconstruction. Consequently, our maps are denser and more complete.

TUM RGB-D dataset

The TUM RGB-D Dataset57 is a popular indoor dataset for dynamic SLAM, with all sequences recorded at 640 × 480 resolution. In this paper, the dynamic sequences are chosen for experiments. Our method is compared with ORB-SLAM317 and several advanced dynamic SLAM methods, including DS-SLAM10, DynaSLAM11, and RDS-SLAM47. Tables 5, 6 and 7 show the quantitative comparison results. The results of ORB-SLAM3, DS-SLAM, and DynaSLAM were obtained through our evaluations, and the results of RDS-SLAM are cited from the paper.

The results show that in the monocular setting, for ATE and Translational RPE RMSE, SDMFusion achieves the best performance on 8 sequences. DynaSLAM's best performance on the fr3_sitting_rpy sequence is misleading: it stems from an incomplete trajectory, as evidenced by a keyframe file only one-fifth the size of ORB-SLAM3's or our method's. This indicates tracking failures, further supported by two crashes in five runs. Moreover, it should be noted that for some sequences, such as fr3_sitting_rpy and fr3_sitting_static, ORB-SLAM3 achieves satisfactory performance. However, ORB-SLAM3 does not actually recover trajectories with absolute scale; the limited camera motion and the small recovered scale passively suppress the errors. For Rotational RPE RMSE, ORB-SLAM3, DynaSLAM, and SDMFusion each achieved the best performance on three sequences. The overall best performance of SDMFusion demonstrates the efficacy and accuracy of our dynamic feature rejection and absolute scale recovery strategies.

Table 5 ATE RMSE on TUM RGB-D Dataset (m).
Table 6 Translational RPE RMSE on TUM RGB-D Dataset (m).
Table 7 Rotational RPE RMSE on TUM RGB-D Dataset (deg).

Under the RGB-D configuration, for ATE and Translational RPE RMSE, SDMFusion achieved the best performance on 2 sequences and the second-best on 4 sequences. Furthermore, it attained the smallest sum of mean errors across the nine sequences, indicating its overall superior performance. ORB-SLAM3 performed optimally on two low-dynamic sequences but exhibited significantly poor accuracy in high-dynamic scenarios. While DS-SLAM achieved the best result on one low-dynamic sequence, it yielded exceedingly large errors on the fr3_walking_rpy and fr3_walking_xyz sequences. DynaSLAM demonstrated accuracy comparable to SDMFusion, achieving optimal results on four sequences. However, its overall inferior performance on low-dynamic sequences resulted in slightly worse overall performance compared to our method. For Rotational RPE RMSE, ORB-SLAM3, DS-SLAM, DynaSLAM, and our method achieved the best performance on 2, 1, 2, and 4 sequences, respectively. In terms of robustness, the monocular configuration of DynaSLAM achieved success rates of only 60%, 0%, and 20% on the fr3_sitting_rpy, fr3_sitting_static, and fr3_walking_static sequences, respectively. The RGB-D configuration of DynaSLAM attained an 80% success rate on both the fr2_desk_with_person and fr3_walking_halfsphere sequences. All other tested configurations achieved a 100% success rate. In summary, while both our method and DynaSLAM demonstrate high accuracy, the proposed approach exhibits significantly superior robustness.

Fig. 6 ATE for ORB-SLAM3 and SDMFusion on four sequences of the TUM RGB-D dataset.

Moreover, Fig. 6 presents the ATE trajectory comparison between SDMFusion and ORB-SLAM3 on four high-dynamic sequences. The trajectories of SDMFusion are closer to the ground truth in both the monocular and RGB-D configurations, indicating superior accuracy and consistency. The reconstruction results of ORB-SLAM3 and SDMFusion on the same four sequences are presented in Fig. 7. For a fair comparison, an RGB-D dense mapping module was added to ORB-SLAM3, and the RGB-D reconstruction results were compared. ORB-SLAM3 produces ghosting artifacts from moving objects: the static scene is heavily occluded, and the scene geometry drifts. In contrast, SDMFusion constructs a complete dense map of the static environment without ghosting or drift. The reconstructed map exhibits higher completeness, more accurate geometric preservation, and significantly improved visual quality.

Fig. 7 Dense maps for SDMFusion and ORB-SLAM3 on four sequences of the TUM RGB-D dataset.

BONN RGB-D dataset

The BONN RGB-D dataset6 is another benchmark widely adopted in dynamic SLAM research. It comprises 2 static and 24 dynamic sequences at a resolution of 640 × 480, encompassing diverse motions such as box lifting and balloon interaction. Each sequence is accompanied by a high-precision ground-truth trajectory acquired with the Optitrack Prime 13 motion capture system. The evaluation results are summarized in Tables 8, 9 and 10; all results were obtained through our own evaluations. In these tables, obstructing_box and nonobstructing_box are abbreviated as o_box and no_box.

Table 8 ATE RMSE on BONN RGB-D Dataset (m).

The results show that under the monocular configuration, ORB-SLAM3, DynaSLAM, and our method achieved the best ATE RMSE on 1, 4, and 19 sequences, respectively. For Translational RPE RMSE, ORB-SLAM3, DynaSLAM, and our method attained optimal performance on 7, 16, and 1 sequence(s), respectively; for Rotational RPE RMSE, the corresponding counts were 2, 10, and 12 sequences. Under the RGB-D configuration, DS-SLAM, DynaSLAM, and our method achieved the best ATE RMSE on 1, 13, and 10 sequences, respectively; the best Translational RPE RMSE on 1, 12, and 13 sequences; and the best Rotational RPE RMSE on 3, 7, and 14 sequences.

Notably, our method does not perform well on the moving_o_box series. In these sequences, the moving box occupies more than 80% of the image area across consecutive frames, and no other dynamic objects are present. As a result, our method incorrectly interprets the entire scene as static, fails to perform dynamic rejection, and consequently suffers tracking drift. The dynamic rejection therefore becomes ineffective when no dynamic objects can be reliably identified within the image.

In terms of robustness, the monocular configuration of DynaSLAM achieved a success rate of only 20% on the balloon, balloon2, and crowd3 sequences, and success rates of 40%, 60%, 0%, and 80% on the crowd2, person_tracking, synchronous, and synchronous2 sequences, respectively. Its RGB-D configuration attained an 80% success rate on the balloon, moving_o_box2, placing_no_box3, and placing_o_box sequences. All other tested configurations achieved a 100% success rate. In summary, while both our method and DynaSLAM demonstrate high accuracy, the proposed approach exhibits significantly superior robustness.

Table 9 Translational RPE RMSE on BONN RGB-D Dataset (m).
Table 10 Rotational RPE RMSE on BONN RGB-D Dataset (deg).
Fig. 8 ATE for SDMFusion and ORB-SLAM3 on four sequences of the BONN RGB-D dataset.

Moreover, Fig. 8 presents the ATE for SDMFusion and ORB-SLAM3 on four representative sequences. The proposed method achieves more accurate trajectories in both the monocular and RGB-D configurations. To further compare reconstruction quality, the RGB-D dense mapping module was again integrated into ORB-SLAM3; Fig. 9 shows the reconstruction results of ORB-SLAM3 and our method across four sequences. ORB-SLAM3's reconstruction exhibits obvious human ghosting, misplacement of static objects, and distortion of the scene geometry; for example, the position of the small yellow car clearly deviates. In contrast, the proposed method constructs a clear and complete dense map without ghosting or drift. The scene geometry is fully restored and finer details are recovered, such as the lattice structure on the left.

Fig. 9 Dense maps for SDMFusion and ORB-SLAM3 on four sequences of the BONN RGB-D dataset.

Fig. 10 Test device and real environments.

Real dataset

To further assess performance in real-world dynamic scenes, two sequences were captured with a micro-drone equipped with an Intel RealSense D435i camera. The experiments were conducted via remote-control flight within a small office at the School of Aeronautics and Astronautics, Zhejiang University, and the two sequences featured varying human activities and motions. The hardware and real environments used for the experiments are shown in Fig. 10. Furthermore, Fig. 11 compares the reconstruction results of ORB-SLAM317 and SDMFusion. The reconstruction produced by ORB-SLAM3 retains numerous human silhouettes and exhibits noticeable trajectory drift; in particular, several spurious red points appear on the right, indicating incorrect mapping of the environment. In contrast, our method accurately excludes dynamic elements and reconstructs only the static environment. Moreover, no drift is observed in the map, demonstrating the enhanced stability and robustness of our method in real-world dynamic environments.

Fig. 11 Dense maps for ORB-SLAM3 and SDMFusion on the real dataset.

Ablation study

To clarify the specific contribution of each module, ablation experiments are designed in this section. Dynamic sequences from the TUM RGB-D dataset57 are selected. Since monocular depth estimation mainly benefits monocular localization, only the monocular results are presented. Five configurations are compared. w/o Depth removes the monocular depth estimation module. w/o Segment removes instance segmentation, retaining only the moving consistency check for dynamic feature rejection. w/o Check removes the moving consistency check, relying solely on instance segmentation. w/o Spatial-Reclass removes the spatial-proximity reclassification. Full is the complete version of the proposed method. Tables 11, 12 and 13 present the results, with bold font denoting the best accuracy.

The results show that Full achieves the highest overall accuracy, verifying that each module is essential for tracking performance. For ATE and Translational RPE RMSE, the Full configuration achieved the best performance on 8 sequences and the second-best on 1 sequence; for Rotational RPE RMSE, it attained the best performance on 4 sequences and the second-best on 5. Specifically, w/o Depth cannot recover the absolute scale, leading to a significant accuracy reduction, especially when the camera displacement is substantial. Although it yields satisfactory results on the fr3_sitting_static and fr3_walking_static sequences, this is primarily attributable to the limited camera movement. w/o Segment significantly degrades performance on highly dynamic sequences, indicating that traditional methods alone are inadequate for complex dynamic scenes. w/o Check and w/o Spatial-Reclass come closest to Full overall; however, the occasional deletion of static features and retention of moving features cause slight degradation in both accuracy and robustness. Moreover, in scenarios with drastic dynamic changes, intermittent tracking interruptions occur due to the scarcity of static features. Therefore, all modules are important to the system.

Table 11 ATE RMSE for several variants of SDMFusion (Monocular) (m).
Table 12 Translational RPE RMSE for several variants of SDMFusion (Monocular) (m).
Table 13 Rotational RPE RMSE for several variants of SDMFusion (Monocular) (deg).

Runtime analysis

To validate the real-time performance, we measured the per-frame latency of the core modules and the end-to-end pipeline on the desktop computer and the Jetson AGX Orin, with ORB-SLAM3 again serving as the baseline. For the monocular and RGB-D configurations, the fr3_walking_xyz sequence was used for timing; for the stereo configuration, sequence 01 was used. The results are shown in Table 14. On the desktop computer, the end-to-end pipeline required approximately 61–67 ms per frame, corresponding to about 15 FPS. While less efficient than ORB-SLAM3, our method nevertheless achieved real-time performance. On the Jetson AGX Orin, however, the processing time increased to approximately 250–280 ms per frame (about 4 FPS). Consequently, our current method does not achieve real-time performance on the embedded platform, which represents a key direction for future improvement; by contrast, ORB-SLAM3 maintained a real-time rate of 12.5–28 FPS.
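The FPS figures above follow directly from the per-frame latencies. A minimal sketch, where the latencies passed in are assumed midpoints of the reported ranges rather than exact measured values:

```python
def fps(latency_ms):
    """Convert a per-frame latency in milliseconds to frames per second."""
    return 1000.0 / latency_ms

desktop_fps = fps(66.0)    # assumed midpoint of 61-67 ms -> roughly 15 FPS
jetson_fps = fps(265.0)    # assumed midpoint of 250-280 ms -> roughly 4 FPS
```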

Furthermore, a comparative analysis of the depth estimation module was conducted to evaluate its performance before and after TensorRT acceleration; the results are shown in Table 15. The inference time of the original DepthAnythingV2 is significantly influenced by image resolution: relative to the BONN RGB-D and TUM RGB-D datasets, the inference time on KITTI increases by approximately 50 ms on the desktop and 1180 ms on the Jetson. After TensorRT acceleration, the disparity in inference time across image resolutions is reduced to less than 2 ms, indicating nearly resolution-invariant performance. On the desktop, TensorRT acceleration reduces inference time by 7.27 ms and 56.76 ms, corresponding to efficiency improvements of 25.71% and 72.98%, respectively. On the Jetson platform, the inference time is reduced by 265.99 ms and 1442.95 ms, achieving efficiency gains of 65.90% and 91.29%, respectively.
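The efficiency gains quoted above are the absolute latency reduction expressed as a percentage of the pre-acceleration latency. A minimal sketch; the 28.28 ms baseline in the example is back-computed from the reported figures (7.27 ms reduction at 25.71%), not a measured value:

```python
def efficiency_gain(before_ms, after_ms):
    """Percentage reduction in inference time relative to the
    pre-acceleration (baseline) latency."""
    return 100.0 * (before_ms - after_ms) / before_ms

# e.g. an (approximate, back-computed) 28.28 ms desktop baseline cut by
# 7.27 ms yields roughly the reported 25.71% gain
desktop_gain = efficiency_gain(28.28, 28.28 - 7.27)
```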

Table 14 Time evaluation (ms).
Table 15 The effect of TensorRT acceleration on performance (ms).

Discussion

This paper introduces SDMFusion, a comprehensive framework enabling real-time, scale-aware dense mapping with enhanced dynamic robustness. It is compatible with monocular, stereo, and RGB-D cameras. Built on ORB-SLAM3, the absolute scale for monocular is first obtained by incorporating DepthAnythingV2, which also provides refined depth for stereo and RGB-D. Subsequently, YOLO11s-seg, geometric constraints, and moving consistency check are combined to enable efficient and accurate dynamic feature rejection. Finally, a real-time anti-dynamic dense reconstruction module is integrated to generate dynamic-interference-free dense maps in all modes. Extensive experiments demonstrated that SDMFusion can achieve real-time, high-precision, and scale-aware dense reconstruction of static environments in various dynamic scenarios. These experimental results confirm the generality, robustness, and practical value of the proposed method.

Nevertheless, several issues remain that warrant further investigation. Firstly, a performance gap still exists between monocular and stereo/RGB-D, primarily due to limitations in depth estimation accuracy. Future improvements may be explored by integrating enhanced monocular depth prediction algorithms or introducing alternative scale recovery schemes. Secondly, semantically dense mapping methods could be studied to support more advanced navigation and interaction. Thirdly, the proposed method continues to face challenges in scenarios dominated by moving objects. Feature detection strategies emphasizing static regions and matching algorithms reliant on sparse features can be developed. Lastly, efforts will be made to adapt the system for deployment on edge devices like NVIDIA Jetson Orin NX, enabling real-world applications in mobile robotics.