Introduction

In the field of intelligent robotics, significant progress has been made in developing autonomous systems capable of operating in complex environments1. One of the challenging tasks for autonomous robots is object search, where they must navigate through unpredictable and dynamic environments while quickly detecting and locating target objects2. Despite advances in this area, much of the existing research has primarily focused on object search within small, structured indoor environments3,4. In large-scale outdoor environments such as industrial parks or construction sites, object search plays a critical role in tasks like equipment monitoring, inventory management, and security monitoring. However, most current methods are not well-suited to these large-scale, unstructured, and feature-sparse environments, where object search continues to face significant challenges.

Object search in large-scale, unstructured, and dynamically changing environments presents significant challenges in achieving both accurate and memory-efficient modeling. As the environment grows, the memory required by dense volumetric models grows with its volume, placing extreme pressure on storage and querying. Existing methods either minimize memory usage at the expense of detail5,6, or preserve detail at significantly higher resource cost7. Furthermore, the complexity and unpredictability of such environments raise the demands on the robot’s decision-making, requiring thorough analysis and understanding of its surroundings for informed and adaptive choices. Additionally, for repetitive tasks, robots must adapt to environmental changes and learn from past experiences to improve future search efficiency. However, most search frameworks3,4,5,6,7 lack mechanisms to integrate knowledge from previous tasks, relying on predetermined strategies that show little improvement after repeated executions.

To address these challenges, this paper introduces GMM-Searcher, a novel framework for efficient object search in large-scale environments that also equips the robot with lifelong learning capabilities to adapt to changing tasks and environments. GMM-Searcher is particularly well-suited to objects with characteristic spatial distributions, such as vehicles being located outdoors rather than indoors, or bus stops standing along the roadside rather than in a garden. The framework, shown in Fig. 1, integrates ARTG-GMM, the combination of an adaptive-resolution topological graph (ARTG) and Gaussian mixture models (GMMs), which models both spatial layouts and semantic information. This hybrid approach retains the simplicity of topological graphs6 while leveraging the GMM’s capacity to represent complex objects8,9, significantly reducing memory requirements while maintaining high-detail representation. Its incremental construction enables rapid adaptation to environmental changes. By incorporating the reasoning capabilities of large language models (LLMs)10,11,12, the robot can make more informed and efficient decisions, particularly in dynamic and unfamiliar environments. To enhance performance in repetitive tasks, the GMM captures and stores experiences from past successful searches, continually expanding and merging Gaussian components to form a knowledge base. These experiences not only sharpen the LLM’s reasoning for continuous improvement but also equip the robot with lifelong learning abilities. Real-world experiments validate that GMM-Searcher significantly enhances object search efficiency in large-scale environments: it reduces unnecessary search effort, improves adaptability to environmental changes, and increases the robot’s performance in repetitive tasks.

The main contributions of this paper are as follows:

  • A novel framework for object search that leverages the reasoning capabilities of LLMs to enhance the efficiency of robot target searches in large-scale environments.

  • An efficient environmental representation method that combines an adaptive-resolution topological graph with GMMs, enabling incremental updates.

  • An explicit method for recording historical search experiences, improving performance in repetitive tasks, and supporting the robot’s lifelong learning ability.

  • Validation of the proposed methods through real-world experiments, demonstrating improved efficiency and adaptability in large-scale object search tasks.

Related work

In object search tasks, traditional approaches primarily depend on predefined search strategies, utilizing detailed path planning and grid-based mapping techniques3,13. Dang et al.13 introduced a sampling-based semantic exploration method that accelerates the search by maximizing the gain from exploring new spaces and improving observation resolution in previously mapped regions with detected objects of interest. Kim et al.3 designed efficient visitation strategies that sequentially explore frontiers, enabling rapid exploration of unknown environments to locate target objects. While these methods are efficient, they rely heavily on the precision of sensor inputs13 and the accuracy of underlying environmental models3. This dependence often leads to rigid search strategies that struggle to adapt to dynamic or changing targets. As a result, these methods exhibit limited flexibility and reduced performance in large-scale environments, where unpredictability and environmental complexity present additional challenges.

Data-driven approaches, such as Convolutional Neural Networks (CNNs)14,15,16 and Reinforcement Learning (RL)17,18,19,20, have substantially improved the efficiency of robotic object search tasks. Pandey et al.14 propose a behavior-based neural network (BNN) that enhances mobile robot navigation by improving path efficiency and collision detection. Ye et al.15 develop a hierarchical policy learning framework, integrating a high-level policy for sub-goal planning with a low-level policy for exploration. These models, trained on large datasets, effectively identify patterns and predict object locations, thus streamlining search strategies. However, despite these advancements, the application of such techniques is often restricted to indoor or small-scale environments. In larger, more complex settings, challenges like sparse rewards and limited generalization hinder their effectiveness, making it difficult to achieve optimal results in dynamic, expansive environments.

Vision-based and semantic-integrated strategies have also emerged, allowing robots to integrate visual and semantic data, thereby enhancing their environmental understanding and making efficient search decisions20,21,22,23. By leveraging reasoning capabilities to predict necessary actions, these approaches improve task execution in similar environments24,25. The introduction of Large Language Models (LLMs) has further enhanced these tasks by offering sophisticated reasoning skills that help guide decisions in complex environments. For instance, Zhang et al.4 introduced a framework that uses natural language to model relationships between objects and environments, allowing robots to locate targets more efficiently. Schmalstieg et al.26 explored interactive search tasks, guiding robots through complex sequences, such as navigating cabinets and doors, to find objects. Honerkamp et al.27 extended these ideas to large-scale environments, using open-vocabulary scene graphs to boost LLMs’ semantic reasoning in expansive object searches. Even in the case of obscured objects, LLMs can reason about likely locations, helping robots make informed navigation choices28. Nevertheless, most existing research is limited to small-scale or indoor environments, or depends on pre-built maps. There is less focus on dynamic, large-scale outdoor scenarios where unpredictable elements, such as pedestrians and moving vehicles, introduce additional challenges.

In robotic exploration and search tasks, several common methods are used for environmental representation, including grid maps, topological maps, and hybrid metric-topological maps. Grid maps29,30 provide highly accurate spatial details but can become computationally expensive when applied to large-scale environments. Nvblox31, based on grid maps, proposes an incremental Euclidean distance field construction method, where each cell stores the distance to the nearest obstacle. Octomap32,33 addresses this issue by optimizing memory use through a hierarchical voxel structure, which efficiently represents free space. Topological maps34,35, typically used in 2D environments, represent space as a graph of nodes and edges, offering significant memory efficiency. However, they lack the detailed geometric information necessary for more complex tasks. Recent advances, like semantic maps20 or semantic Octomap36, augment these representations with object or area labels, improving the robot’s comprehension of its surroundings, though these methods often require sophisticated perception systems. Hybrid metric-topological maps23,27,37,38 combine semantic information with a topological framework, offering a balance between efficiency and detail. However, they can struggle to accurately capture object size and geometry, limiting their application in environments with high levels of complexity.

Fig. 1

The proposed GMM-Searcher framework consists of three main components. First, the perception module captures and processes panoramas and point clouds, providing comprehensive environmental data. These data serve two purposes: real-time observations for the LLM and storage within the adaptive-resolution topological graph with GMMs (ARTG-GMM). The LLM utilizes both real-time environmental observations and historical observations, applying its reasoning capabilities to formulate well-informed search strategies. For repeated tasks, the GMM stores past task experiences, progressively enhancing the LLM’s decision-making accuracy over time.

In contrast to these limitations, our GMM-Searcher framework addresses large-scale object search by unifying adaptive-resolution topological graphs and GMM to model spatial-semantic distributions. Unlike static-map-dependent methods, our hybrid system dynamically constructs environment representations, minimizing memory usage without sacrificing detail. Integrated LLM reasoning leverages stored search experiences encoded as GMM components to continuously refine future task efficiency and enhance adaptability to dynamic environmental changes.

ARTG-GMM environment representation

In large-scale scenes, effective environmental representation is crucial for search tasks. This paper introduces the ARTG-GMM, comprising a 2.5D ARTG and multiple GMMs. The ARTG indicates the traversable areas for the robot in 3D space, while the GMMs model the objects within the environment. This section outlines the construction of the ARTG-GMM.

Adaptive-resolution topological graph

The environment consists largely of free space with relatively few obstacles, making traditional map representations, such as grid maps or voxel maps, inefficient. To mitigate memory consumption, this paper analyzes the robot’s traversability in 3D space and maps it onto a 2.5D ARTG. The ARTG is modeled as an undirected graph \(\mathscr {G}=\{\mathscr {V}, \mathscr {E} \}\), where \(\mathscr {V}\) represents the vertices and \(\mathscr {E}\) represents the edges connecting adjacent vertices.

Fig. 2

3D space robot traversability analysis: Obstacles, narrow passages, and steep slopes mapped as non-traversable areas.

When the robot operates in complex environments, it must account for both ground obstacles and overhead obstructions, such as overhanging branches. To ensure safety, the ARTG maps obstacles, narrow passages, and steep slopes as collision areas, as shown in Fig. 2. The ARTG is generated by randomly sampling the free space and is incrementally updated as the robot moves. For precise environmental modeling, the vertex sampling density is higher near collision areas and sparser in open areas, as illustrated in Fig. 3: the minimum vertex spacing \(\alpha d_{\text {obs}}\), with \(\alpha > 0\), is proportional to the distance \(d_{\text {obs}}\) from the nearest obstacle in the region. An edge between two vertices is established when the distance between them is less than \(d_{\text {con}}\) and the edge does not pass through any collision area. The connection distance \(d_{\text {con}}\) is defined as:

$$\begin{aligned} d_{\text {con}} = \min (\beta d_{\text {obs}}, d_{\text {max}}), \end{aligned}$$
(1)

where \(\beta > \alpha > 0\) is a scaling factor, and \(d_{\text {max}}\) is the maximum allowable distance for constructing an edge. The denser graph structure near collision areas captures fine details of their boundaries, while the sparser structure in open areas maintains simplicity. Additionally, to preserve the sparsity of the ARTG, any newly sampled vertex located closer than \(\alpha d_{\text {obs}}\) to the nearest vertex is pruned.
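As a concrete illustration, the sampling and connection rules above can be sketched as follows; `dist_to_obstacle`, the parameter defaults, and the omission of edge collision checking are illustrative assumptions, not the paper's implementation:

```python
import math

def try_add_vertex(vertices, p, dist_to_obstacle, alpha=0.5, beta=2.0, d_max=1.2):
    """Insert sampled point p into the ARTG if it respects the minimum
    spacing alpha * d_obs; return the new edges created by the adaptive
    connection rule d_con = min(beta * d_obs, d_max) (Eq. 1).
    dist_to_obstacle(p) stands in for a distance-field lookup; collision
    checking along each edge is omitted in this sketch."""
    d_obs = dist_to_obstacle(p)
    # prune samples that crowd an existing vertex (sparsity rule)
    if any(math.dist(p, q) < alpha * d_obs for q in vertices):
        return []
    vertices.append(p)
    radius = min(beta * d_obs, d_max)  # Eq. (1)
    return [(p, q) for q in vertices[:-1] if math.dist(p, q) <= radius]
```

Because the spacing and connection radius both scale with \(d_{\text {obs}}\), vertices and edges automatically densify near obstacles and thin out in open space.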

Gaussian mixture model

For objects of various shapes in the environment, GMM is well-suited to capture complex object layouts through a mixture of Gaussian distributions, enabling the representation of diverse object types at different locations. Its compact structure efficiently reduces storage requirements, making it an effective method for spatial representation39.

The GMM is a probabilistic model that represents a combination of several Gaussian distributions. Each component corresponds to an individual Gaussian, capturing the probability of encountering a target at specific locations. The probability density function for each component is defined as:

$$\begin{aligned} \mathscr {N}(\textbf{x}|\mu _i, \Sigma _i) = \frac{1}{\sqrt{(2\pi )^k |\Sigma _i|}} e^{-\frac{1}{2} (\textbf{x} - \mu _i)^T \Sigma _i^{-1} (\textbf{x} - \mu _i)}, \end{aligned}$$
(2)

where \(\textbf{x} \in \mathbb {R}^k\) is the location, \(\mu _i\) is the mean vector of the Gaussian component, and \(\Sigma _i\) is the covariance matrix, quantifying the uncertainty in position. The probability density of the GMM with m components is expressed as:

$$\begin{aligned} p(\textbf{x} | \{\pi _i, \mu _i, \Sigma _i \}_{i=1}^{m}) = \sum _{i=1}^{m} \pi _i \mathscr {N}(\textbf{x}|\mu _i, \Sigma _i), \end{aligned}$$
(3)

where \(\pi _i\) is the mixture weight of the i-th Gaussian component, with \(\sum _{i=1}^{m} \pi _i = 1\).

For each object, a GMM with N Gaussian components is used to model its structure, as shown in Fig. 3. The parameters \(\{\pi _i, \mu _i, \Sigma _i \}_{i=1}^N\) of the GMM are optimized using the Expectation-Maximization (EM) algorithm, ensuring unique identification of the object. EM is an iterative process: during the E-step, the probability of each data point belonging to each Gaussian component is calculated based on the current parameter estimates. In the M-step, the parameters are updated to maximize the likelihood given these assignments. This process repeats until convergence, as detailed in40.
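The E- and M-steps can be sketched in one dimension as follows; the deterministic initialization at order statistics and the fixed iteration count are simplifications for illustration, not the paper's implementation:

```python
import math

def gauss_pdf(x, mu, var):
    """One-dimensional instance of the Gaussian density in Eq. (2)."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def em_fit(data, k=2, iters=50):
    """Minimal 1-D EM for a k-component GMM (Eq. 3).  Means start at
    evenly spaced order statistics for determinism; real implementations
    use k-means initialization and a convergence test."""
    srt = sorted(data)
    mus = [srt[i * (len(srt) - 1) // (k - 1)] for i in range(k)]
    vars_ = [1.0] * k
    pis = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        resp = []
        for x in data:
            w = [pis[i] * gauss_pdf(x, mus[i], vars_[i]) for i in range(k)]
            s = sum(w)
            resp.append([wi / s for wi in w])
        # M-step: weighted re-estimation of pi, mu, Sigma
        for i in range(k):
            ni = sum(r[i] for r in resp)
            pis[i] = ni / len(data)
            mus[i] = sum(r[i] * x for r, x in zip(resp, data)) / ni
            vars_[i] = max(sum(r[i] * (x - mus[i]) ** 2
                               for r, x in zip(resp, data)) / ni, 1e-6)
    return pis, mus, vars_
```

Run on two well-separated clusters, the fitted means converge to the cluster centers and the weights to the cluster proportions.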

Fig. 3

The schematic of the ARTG-GMM environmental representation. The ARTG represents free space with a higher density of vertices near obstacles, and GMMs model spatial obstacles.

Each object \(O_j\) is ultimately represented by a collection \(O_j = \{ \{\pi _i, \mu _i, \Sigma _i \}_{i=1}^N, label, color \}\). Here, the label identifies the object’s semantic category, and color corresponds to the semantic point cloud coloring. Both label and color are updated using Bayesian probability41, enhancing the model’s robustness against scene segmentation and point cloud semantic matching errors. Additionally, as the robot moves, the object’s semantic point cloud is progressively refined, and the parameters \(\{\pi _i, \mu _i, \Sigma _i \}_{i=1}^N\) are updated accordingly.
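The Bayesian label fusion can be sketched as a categorical update; the symmetric observation likelihood and the function name are assumptions rather than the paper's exact rule:

```python
def update_label(belief, observed, likelihood=0.8):
    """Categorical Bayesian update of an object's label belief: the
    observed label is weighted by `likelihood`, the other labels share
    the complementary mass uniformly, and the posterior is renormalized.
    `belief` maps each candidate label to its current probability."""
    miss = (1.0 - likelihood) / (len(belief) - 1)
    post = {l: p * (likelihood if l == observed else miss)
            for l, p in belief.items()}
    z = sum(post.values())
    return {l: p / z for l, p in post.items()}
```

Repeated consistent observations drive the belief toward the observed label, while a single segmentation error is damped by the prior, which is the robustness property the text describes.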

Finally, the ARTG-GMM is represented as \(\{\mathscr {G}, \mathscr {O} \}\), where \(\mathscr {G}\) is incrementally updated as the robot moves, and the object set \(\mathscr {O} = \{O_1, O_2, \cdots , O_n \}\) expands as new objects are discovered. The ARTG-GMM supports incremental additions and local updates, making it suitable for large-scale environments. Its unique structure not only reduces storage and querying pressure on onboard platforms but also maintains high-precision modeling of free space.

LLM-based search strategy

LLMs excel at understanding and reasoning across multimodal data, making them particularly advantageous for search tasks in complex environments. This section describes how LLMs enhance robotic search efficiency through Chain-of-Thought (CoT)42 reasoning and how robots are endowed with lifelong learning capabilities.

Searching through CoT

CoT reasoning allows LLMs to decompose complex tasks into structured, sequential steps, enhancing decision-making in real-world scenarios. In search tasks, this ability enables the robot to incrementally make decisions for the next steps by continuously analyzing the environment, inferring potential target locations, and adapting its strategy accordingly.

Fig. 4

A demonstration of using CoT to guide the LLM’s reasoning and decision-making process. The LLM infers potential target locations based on task descriptions and refines these inferences by incorporating environmental and historical observations. The robot selects a direction within the predefined region for its next movement, iterating this process until the target is located.

In the proposed framework, GMM-Searcher employs CoT to guide the LLM’s reasoning process, enabling the incremental refinement of search strategies in complex environments. A demonstration of this process is shown in Fig. 4. By leveraging task descriptions, the LLM integrates its world knowledge with environmental constraints to identify promising search areas, as illustrated in Fig. 4a and b. Subsequently, the current environmental observation \(\mathscr {V}_e\), historical observation \(\mathscr {V}_h\), and the corresponding prompt \(\mathscr {P}_h\) for the historical observation are sequentially fed into the LLM, allowing it to fully understand the surrounding environment and historical movement. Finally, the LLM integrates all the information and aligns it with the task requirements, generating a search strategy to guide the robot in completing the search task.

Specifically, for the current environmental observation \(\mathscr {V}_e\) captured via a panoramic camera, it is discretized into N regions, with each region corresponding to a decision \(D_i\), where \(D_i \in \mathscr {D} = \{D_1, D_2, \dots , D_N \}\). This regional division facilitates the LLM’s decision-making process, allowing it to select discrete decisions from the continuous environmental observation. Through iterative reasoning, the LLM evaluates new observations and integrates them with prior knowledge to select the optimal decision \(D^* \in \mathscr {D}\). This iterative process allows the LLM to continuously refine its understanding and improve its search strategies based on the most recent observations.
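One plausible way to realize this discretization, assuming regions of equal angular width on the panorama (the paper does not specify the partition), is:

```python
import math

def heading_to_region(yaw, n_regions=11):
    """Map a heading angle in [-pi, pi) on the panorama to one of
    n_regions equal angular sectors, with region 0 starting at -pi.
    The decision D* selected by the LLM then corresponds to one
    such sector index."""
    frac = (yaw + math.pi) / (2.0 * math.pi)
    return min(int(frac * n_regions), n_regions - 1)
```

With the 11 regions used in the experiments, the forward direction (yaw 0) falls in the central region, giving the LLM a fixed, orientation-independent indexing of the continuous observation.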

Despite the rich information provided by the environmental observation \(\mathscr {V}_e\), distortion from cylindrical mapping poses interpretation challenges, even for experienced observers. Furthermore, although the LLM has memory capabilities, it is not sensitive to depth in three-dimensional space, which makes it difficult for the model to recall areas previously traversed by the robot, potentially leading to the robot wandering in a specific region. To address this, both the environmental observation \(\mathscr {V}_e\) and ARTG-GMM are simultaneously input to the LLM, where ARTG-GMM offers a more intuitive historical observation \(\mathscr {V}_h\) in the form of a 2D top-down image. The associated prompt describes various elements using a standardized { ’label’, ’color’ } format. As shown in Fig. 4c, each element is assigned a unique label and color code, ensuring clarity and structure in the input data for LLM-based reasoning. Additionally, the historical observation \(\mathscr {V}_h\) is divided into regions that correspond one-to-one with the regions in the environmental observation \(\mathscr {V}_e\), allowing the LLM to intuitively relate the two without needing to consider the robot’s current orientation. This standardized prompt and division of regions ensure consistent identification of key objects, aiding the LLM in effectively processing environmental input. By continuously incorporating new environmental and historical observations, a series of search strategies is generated to guide the robot in completing the search task.
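The standardized prompt entries might be rendered as follows; the function name and the exact string layout are illustrative:

```python
def build_history_prompt(elements):
    """Render ARTG-GMM elements into the standardized { 'label', 'color' }
    prompt format described in the text, one entry per line.  `elements`
    is a list of (label, color) pairs drawn from the object set."""
    return "\n".join("{ 'label': '%s', 'color': '%s' }" % (label, color)
                     for label, color in elements)
```

Keeping every element in the same two-field schema is what lets the LLM match colored regions in the top-down image \(\mathscr {V}_h\) against named objects without free-form parsing.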

Notably, while the LLM’s reasoned search strategy facilitates target search, it does not guarantee success in all scenarios. For instance, a target may be overlooked due to occlusion or missed because it sits in a corner, and once the LLM has initially ignored such areas, it rarely returns to explore them. Therefore, when the ARTG-GMM has covered more than a threshold fraction (e.g., 70%) of the search area without locating the target, GMM-Searcher adopts a strategy similar to TARE43 to ensure complete coverage. Specifically, frontiers are generated at the boundaries of the regions not yet covered by the ARTG-GMM, and a Traveling Salesman Problem (TSP) is formulated to produce a route that visits the frontiers in sequence. This strategy guides the robot to areas missed during LLM reasoning, ensuring that the entire area is thoroughly covered and no potential object location is left unsearched.
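The fallback coverage route can be sketched with a nearest-neighbor heuristic as a stand-in for a full TSP solver; function and variable names are illustrative:

```python
import math

def frontier_tour(start, frontiers):
    """Order the uncovered frontiers into a visiting route from the
    robot's current position.  The paper formulates a TSP; this sketch
    greedily visits the nearest remaining frontier, a common heuristic
    stand-in for an exact solver."""
    tour, current, remaining = [], start, list(frontiers)
    while remaining:
        nxt = min(remaining, key=lambda f: math.dist(current, f))
        remaining.remove(nxt)
        tour.append(nxt)
        current = nxt
    return tour
```

An exact or near-optimal TSP solver (e.g., LKH-style local search) would replace the greedy step in practice; the interface of "positions in, visiting order out" stays the same.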

Lifelong learning

LLMs possess an inherent ability to adapt and learn across tasks, whether through varied task exposure or specialized training. In search tasks, this capability manifests as increasingly accurate reasoning about object locations that better matches real-world expectations. A significant limitation, however, is that an LLM cannot retain the historical positions of all objects once the search target changes. Although its reasoning improves in coherence and accuracy over time, the resulting efficiency gains in task execution remain limited.

To address this limitation, GMM-Searcher explicitly stores task execution experiences using GMMs. This approach allows the robot to recall and leverage past experiences when encountering similar tasks or environments. Specifically, for each search object, a corresponding GMM is constructed. When an object is found at a location \(\textbf{p}\), with target size \(\textbf{s}\), the corresponding GMM update steps are as follows:

First, a new component \(G_{new}\) is created with parameters initialized as follows:

$$\begin{aligned} \begin{aligned} \mu _{\text {new}}&= \textbf{p}, \\ \Sigma _{\text {new}}&= \text {diag}(\textbf{s}^2), \\ \pi _{\text {new}}&= \frac{1}{N+1}, \\ \end{aligned} \end{aligned}$$
(4)

where N is the number of components in the GMM.

Next, the new component \(G_{\text {new}}\) attempts to merge with existing components within the GMM. It is only merged with the closest component \(G_i=\{\pi _i, \mu _i, \Sigma _i\}\) if the distance \(d(\mu _{\text {new}}, \mu _i) < d_{\text {max}}\) is satisfied, where \(d(\cdot ,\cdot )\) is the Mahalanobis distance and \(d_{\text {max}}\) is the distance threshold. If merged, the parameters of \(G_i'\) are updated as follows:

$$\begin{aligned} \begin{aligned} \mu _{i}'&= \frac{\pi _{i} \cdot \mu _{i} + \pi _{\text {new}} \cdot \mu _{\text {new}}}{\pi _{i} + \pi _{\text {new}}}, \\ \Sigma _{i}'&= \frac{\pi _i}{\pi _i + \pi _{\text {new}}} \left( \Sigma _i + (\mu _i - \mu _i')(\mu _i - \mu _i')^T \right) + \frac{\pi _{\text {new}}}{\pi _i + \pi _{\text {new}}} \left( \Sigma _{\text {new}} + (\mu _{\text {new}} - \mu _i')(\mu _{\text {new}} - \mu _i')^T \right) ,\\ \pi _{i}'&= \pi _{i} + \pi _{\text {new}}. \end{aligned} \end{aligned}$$
(5)

Finally, the weights of all components in the GMM are normalized to sum to one, \(\pi _i \leftarrow \pi _i / \sum _{j} \pi _j\).
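The update of Eqs. (4)-(5) plus the normalization can be sketched as follows, assuming diagonal covariances for brevity (the paper's full-covariance merge includes the off-diagonal terms of the outer products); the dict-based component representation and the merge threshold value are illustrative:

```python
import math

def update_experience(gmm, p, s, d_merge=3.0):
    """Experience update of Eqs. (4)-(5) with diagonal covariances:
    `gmm` is a list of components {'pi', 'mu', 'var'} (var = diag(Sigma)),
    p the location where the target was found, s the target size."""
    new = {"pi": 1.0 / (len(gmm) + 1),          # Eq. (4)
           "mu": list(p),
           "var": [si ** 2 for si in s]}
    # closest existing component by (diagonal) Mahalanobis distance
    best, best_d = None, float("inf")
    for comp in gmm:
        d = math.sqrt(sum((a - m) ** 2 / v
                          for a, m, v in zip(p, comp["mu"], comp["var"])))
        if d < best_d:
            best, best_d = comp, d
    if best is not None and best_d < d_merge:
        # moment-matched merge (Eq. 5), diagonal terms only
        w = best["pi"] + new["pi"]
        mu = [(best["pi"] * m1 + new["pi"] * m2) / w
              for m1, m2 in zip(best["mu"], new["mu"])]
        best["var"] = [best["pi"] / w * (v1 + (m1 - c) ** 2)
                       + new["pi"] / w * (v2 + (m2 - c) ** 2)
                       for v1, v2, m1, m2, c in
                       zip(best["var"], new["var"], best["mu"], new["mu"], mu)]
        best["mu"], best["pi"] = mu, w
    else:
        gmm.append(new)
    total = sum(c["pi"] for c in gmm)  # renormalize the mixture weights
    for c in gmm:
        c["pi"] /= total
    return gmm
```

Repeated finds near the same spot thus reinforce one component (its weight grows and its covariance tightens around the observations), while finds in new areas spawn fresh components.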

As the robot repeats similar tasks, the accumulated experiences are rendered into the historical observation \(\mathscr {V}_h\) in the same way objects are, using the same { ’label’, ’color’ } prompt format. The key difference is that areas with a higher probability of containing the target are drawn with lower transparency in their associated ’color’ values. Additionally, for search tasks the robot has not previously encountered, if the ARTG-GMM contains relevant information about the target, a corresponding GMM is constructed to bootstrap the search. This ensures that even for unfamiliar tasks, the robot can efficiently leverage existing knowledge from similar tasks or environments, improving overall search efficiency.

Path optimization

Once a search strategy is determined, it must be optimized for effective robotic navigation. A local target is selected within the region corresponding to decision \(D^*\), ensuring it is both reachable and free from obstacles, while maintaining a safe distance to avoid potential collisions. Additionally, the local target should be positioned far enough from the current location to enable continuous movement during the decision-making process, preventing the robot from halting due to an overly short travel distance. Then, path optimization44 is applied to ensure smooth and safe navigation. As new decisions are made, path optimization guarantees a seamless transition to the updated target, allowing the robot to maintain smooth and adaptive movement throughout.

Experiments

This section describes the experimental setup and assesses the search efficiency using multiple metrics to validate the proposed framework’s effectiveness.

Fig. 5

The two-wheeled self-balancing robot used in the experiments.

Implementation details

The effectiveness of the proposed framework is evaluated using a two-wheeled self-balancing robot equipped with an Insta360 X4 camera and a Mid-360 LiDAR, as shown in Fig. 5. The camera captures 360-degree panoramas at 30 fps, while the LiDAR operates at 10 Hz. Network connectivity is provided by a 4G radio, enabling upload speeds of up to 50 Mbps and download speeds of up to 150 Mbps. All algorithms are executed on an onboard NVIDIA Jetson Orin NX, with FAST-LIO245 used for precise localization.

Fig. 6

Two large-scale experimental scenes: Scene 1 is outdoors, and Scene 2 is a semi-open indoor environment.

The experiments take place in two real-world scenarios, depicted in Fig. 6. The first scenario is a 320m \(\times\) 210m outdoor campus area, where moving vehicles and pedestrians present additional challenges for the experiment. The second scenario involves a semi-enclosed 200m \(\times\) 100m area, surrounded by four academic buildings with regularly distributed classrooms. During the search process, the robot operates at a maximum speed of 1.0 m/s. The ARTG-GMM functions with a minimum distance of \(d_{\text {min}} = 0.2\) m and a maximum distance of \(d_{\text {max}} = 1.2\) m. Grounded-SAM46 with RAM47 is applied for panoramic semantic segmentation, utilizing the onboard GPU. GPT-448 serves as the LLM, capable of handling both image and text inputs concurrently. To assist in decision-making, both environmental and historical observations are segmented into 11 regions. Path optimization uses the recommended parameters44, with a maximum acceleration of 1.5 \(\hbox {m/s}^2\) and a maximum angular velocity of 1 rad/s.

Exploration decision analysis

Understanding historical observations is crucial for LLMs to gain a comprehensive understanding of their environment and utilize historical information, enabling them to make more accurate decisions. To evaluate the understanding capability of LLMs regarding historical observations presented in image form, including ARTG-GMM and historical movement trajectories (illustrated in Fig. 4c), an experiment is designed. This involves inputting both environmental observations obtained during exploration and their corresponding historical observations into various LLMs. The models are then manually queried about the objects contained within each region, and the accuracy of their responses is recorded. If errors occur, the correct answers are provided to the LLMs to facilitate self-learning. The experiment is conducted using models such as GPT-4o, GPT-4o-mini48, LearnLM 1.549, and Moonshot AI50, all of which support multimodal inputs of images and text.

The results of these experiments are illustrated in Fig. 7, displaying the accuracy of reasoning after multiple iterations of errors and corrections. As shown, all models demonstrate a steady improvement in accuracy with an increasing number of reasoning optimization steps. Notably, GPT-4o exhibits higher accuracy compared to the other models, reaching over 90% accuracy after 30 optimization steps. Initially, the interquartile range (IQR) of the box plots is relatively large, indicating a wider spread in the accuracy. However, as the number of optimization steps increases, the IQR becomes narrower, suggesting that the data distribution becomes more uniform and consistent. This trend indicates that the models not only improve in accuracy but also become more robust and reliable as they undergo iterative learning and correction. The consistent improvements across models highlight the effectiveness of iterative feedback and correction in enhancing the LLMs’ ability to reason about complex scenes.

Fig. 7

Reasoning accuracy of the tested LLMs on historical-observation queries, shown as box plots over successive optimization steps.

Overall, the experimental outcomes suggest that the ARTG-GMM representation is compatible with the reasoning strengths of the tested large models, facilitating effective decision-making and understanding of historical information in complex scenarios.

Fig. 8

The robot’s decision-making process during the search task involves dividing each environmental observation into 11 regions. The green-highlighted region represents the next direction of movement, chosen by the robot based on both environmental and historical observations.

Building on the LLMs’ improved reasoning capabilities with historical observations, Fig. 8 illustrates the decision-making process during the first search for a target car in Scene 1, showing several environmental observations. The green-highlighted segment represents the decision made by the LLM. In Fig. 8a, the vehicle is detected, and the LLM predicts that the region may be an open parking lot, indicating the target car is likely located there, prompting the robot to move toward it. When the target car is not found in this area, the robot selects an unexplored region, as shown in Fig. 8b. In Fig. 8c, no car is found nearby, and no likely parking areas are identified; however, with a junction detected in region 5, the robot moves toward that direction. Finally, in Fig. 8d, the robot successfully locates the target car in region 6.

Environmental representation analysis

Fig. 9

Search task details in Scene 1 using the proposed algorithm. The local ESDF map (colored regions) shows distance from obstacles for trajectory optimization, with the red line indicating the robot’s path and red dots as ARTG-GMM vertices. The colorful point cloud is reconstructed via GMM in ARTG-GMM.

Figure 9 illustrates the details of a robot’s search task in Scene 1 using the proposed algorithm. In the figure, the Euclidean signed distance field (ESDF)31 is colored by distance from obstacles and is used for trajectory optimization. The red line indicates the robot’s movement path, while the red dots represent the vertices of the ARTG-GMM. The colorful point cloud is reconstructed from the GMMs in the ARTG-GMM.

Table 1 presents a comparison of the proposed ARTG-GMM with several commonly used robotic mapping methods, where the bold values in the table indicate optimal performance metrics. The comparison is made based on two metrics: memory usage (Memory) and average update time (Time) across two scenes with different methods. The proposed ARTG-GMM method consistently outperforms benchmarks, such as Octomap51, Voxblox52, Nvblox31, and UFOmap53, in terms of both memory usage and processing time. Since the resolution in the ARTG part of ARTG-GMM is adaptive, its performance remains constant across different resolutions.

Table 1 Comparisons of environmental representations of two scenes.

At the 0.1 m resolution, ARTG-GMM demonstrates the lowest memory usage in both scenes (12.84 MB for Scene 1 and 9.17 MB for Scene 2). Specifically, this represents a 99.8% reduction compared to Nvblox, a 99.1% reduction compared to Octomap, and a 99.3% reduction compared to Voxblox in Scene 1. Similarly, in Scene 2, ARTG-GMM achieves a 99.9% reduction compared to Nvblox, a 99.2% reduction compared to Octomap, and a 99.2% reduction compared to Voxblox. ARTG-GMM also exhibits the fastest update times, with 46.98 ms for Scene 1 and 48.09 ms for Scene 2. While benchmarks perform well during the initial stages of mapping, their update times increase as scene size and memory usage grow. In contrast, ARTG-GMM maintains consistently fast and stable update times, with most of the computational overhead attributed to the EM parameter fitting process.
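The reduction figures follow the usual formula, reduction = 100 × (1 − ours/baseline), which can also be inverted to recover the baseline size a reported reduction implies (subject to rounding in the reported percentages):

```python
def percent_reduction(ours_mb, baseline_mb):
    """Memory saving relative to a baseline, in percent."""
    return 100.0 * (1.0 - ours_mb / baseline_mb)


def implied_baseline(ours_mb, reduction_pct):
    """Invert the formula: baseline size implied by a reported reduction."""
    return ours_mb / (1.0 - reduction_pct / 100.0)
```

For example, the 99.1% reduction against Octomap in Scene 1 implies an Octomap footprint of roughly 12.84 / (1 − 0.991) ≈ 1426.7 MB at 0.1 m resolution.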

At the 0.3 m resolution, the memory usage and update times of the benchmarks are significantly reduced; however, ARTG-GMM continues to demonstrate the lowest memory usage. This represents a 98.8% reduction compared to Nvblox, a 74.0% reduction compared to Octomap, and a 90.7% reduction compared to Voxblox in Scene 1. In Scene 2, ARTG-GMM achieves reductions of 98.3% compared to Nvblox, 55.4% compared to Octomap, and 88.0% compared to Voxblox. Although ARTG-GMM’s update time remains highly competitive, it is slightly slower than UFOmap.

This analysis highlights that ARTG-GMM offers low memory usage and efficient processing times, making it highly suitable for large-scale environments with limited computational resources.

Fig. 10

Comparison of the proposed algorithm’s search trajectories against benchmarks. (a) and (b) show first-time searches without prior experience, while (c) and (d) illustrate repeated searches utilizing historical experience, with variations in both the starting position and the search object. The colored point clouds represent data captured by the robot while using the proposed algorithm.

Searching path analysis

To evaluate the performance of the proposed algorithm, we compare it against several benchmarks6,43,54 across two scenes, as shown in Fig. 10. All algorithms are configured identically, with benchmark parameters sourced from their respective open-source projects. In Fig. 10, the search trajectories for each algorithm are distinguished by different colors, and the colored point cloud represents the data collected by the robot using the proposed method. The ellipsoids in Fig. 10c and d visualize the GMM-based historical search experience, where darker colors indicate areas with a higher probability of locating the target. Figure 10a and b show search trajectories without prior knowledge, while Fig. 10c and d illustrate trajectories after leveraging accumulated search experience, with both starting points and target locations changed. Figure 10a and c focus on car searches, while Fig. 10b and d focus on classroom searches.

As shown in Fig. 10a and b, GMM-Searcher consistently achieves the shortest search trajectory compared to the benchmarks during first-time searches. This efficiency is largely due to the LLM’s ability to interpret the environment and infer potential object locations, allowing the robot to bypass areas where the object is unlikely to be found. For example, in Scene 1, while searching for a vehicle, GMM-Searcher avoids narrow roads and open spaces near the boundaries, which are less likely to contain the vehicle. Similarly, in Scene 2, when searching for a classroom in Building A, GMM-Searcher identifies that the robot is near Building D and intelligently directs it towards Building A before beginning the search, whereas the benchmarks commence searching immediately without this spatial awareness, as seen in Fig. 10b.

After accumulating historical search experience, as shown in Fig. 10c and d, GMM-Searcher demonstrates more informed search strategies, even when the starting point or target location is altered. For example, in Scene 1, GMM-Searcher efficiently guides the robot to areas with a higher probability of containing the target, leading to successful searches, as illustrated in Fig. 10c.
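One way to picture how stored experience biases later searches is the sketch below: successful target locations are kept as Gaussian components, and candidate regions are ranked by mixture density, matching the darker high-probability ellipsoids in Fig. 10c and d. This is a simplified illustration; the class name, the isotropic fixed-variance components, and the mean-density scoring are assumptions for brevity, whereas the paper's GMM expands and merges full components.

```python
import numpy as np


class SearchExperience:
    """Simplified experience store: one isotropic 2D Gaussian per success."""

    def __init__(self, sigma=5.0):
        self.sigma = sigma   # assumed fixed spread of each component, in meters
        self.means = []      # (x, y) locations of past successful finds

    def add_success(self, location):
        """Record where the target was found after a successful search."""
        self.means.append(np.asarray(location, dtype=float))

    def score(self, point):
        """Average Gaussian density of the stored components at a point."""
        if not self.means:
            return 0.0
        p = np.asarray(point, dtype=float)
        var = self.sigma ** 2
        d2 = np.array([np.sum((p - m) ** 2) for m in self.means])
        dens = np.exp(-d2 / (2.0 * var)) / (2.0 * np.pi * var)
        return float(np.mean(dens))

    def best_region(self, candidates):
        """Pick the candidate position with the highest experience density."""
        return max(candidates, key=self.score)
```

With even a single recorded success, `best_region` steers the robot toward previously fruitful areas first, which is the behavior visible in the shortened trajectories of Fig. 10c.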

Table 2 Comparison of experiments in two scenes.

Although all methods achieve a 100% success rate in searching for the object across multiple experiments, there are still notable differences in the search paths between the methods, as shown in Table 2. This table provides a detailed comparison of different methods across multiple experiments. The term “First-time Search” refers to search tasks conducted without any prior information. The “Repeated Search - Same Task” indicates searches performed after accumulating historical search experience, where both the starting and target locations remain unchanged. In contrast, “Repeated Search - Changed Task” refers to searches conducted after accumulating historical search experience, with both the starting and target locations altered.

In the first-time search, where no prior task data is available, the proposed method significantly reduces the search path length and time compared to the benchmarks. In Scene 1, the average path length is 426.29 m, representing reductions of approximately 17.1%, 36.0%, and 36.5% compared to EFP, TARE, and SEER, respectively. The time required is also reduced to 450.31 s, offering reductions of about 21.5%, 39.9%, and 39.2%, respectively. Similarly, in Scene 2, the proposed method achieves a path length of 160.63 m and a time of 167.32 s, marking significant improvements over the benchmarks.

The benefits of historical experience are apparent in the repeated search tasks with the same start and target (Repeated Search - Same Task). In Scene 1, the path length is reduced to 401.78 m, with a further drop in standard deviation, indicating more stable and efficient performance. The time decreases to 409.97 s, highlighting the method’s capability to learn from past searches. Meanwhile, benchmarks do not show any improvement, as noted by the "-" symbol in Table 2.

When the start and target locations change (Repeated Search - Changed Task), the proposed method still demonstrates significant superiority. In Scene 1, the path length is reduced to 575.12 m, achieving reductions of approximately 60.2%, 57.1%, and 59.6% compared to EFP, TARE, and SEER, respectively. The time decreases to 599.08 s, offering reductions of about 60.5%, 59.8%, and 59.9%, respectively. In Scene 2, the proposed method achieves a path length of 201.11 m and a time of 209.49 s. These improvements underscore the method’s adaptability and its ability to learn over time, maintaining high efficiency even as search conditions change.

Time consumption

Table 3 Average time consumption of each module.

Table 3 summarizes the time consumption of each module in the proposed framework. Despite some modules, such as panoramic segmentation and GPT decision-making, taking longer to process, the framework remains efficient overall. This efficiency is achieved by leveraging the strong environmental understanding and inference capabilities of the LLM, which offsets the additional time required for these more resource-intensive tasks. As a result, the system effectively balances precision and speed across all modules.

Conclusion

This paper introduces GMM-Searcher, a novel framework for autonomous robotic search tasks in large-scale environments. The framework utilizes ARTG-GMM to efficiently manage the storage of environmental data, while the GMM component effectively captures and utilizes historical search experiences. By leveraging the deep understanding and reasoning capabilities of LLMs, the system intelligently predicts potential target locations and guides the robot in making more efficient search decisions. This method consistently outperforms traditional benchmarks in both path length and time consumption across various search scenarios. The results demonstrate the framework’s adaptability to dynamic conditions and its strong long-term learning abilities, establishing it as a highly efficient solution for autonomous search tasks.

Future work will explore ways to further enhance the system’s capabilities, including developing more advanced techniques to capture and reason about higher-level semantic information, as well as investigating methods to handle situations where the spatial distribution of targets is less predictable. By expanding the framework’s understanding of the environment and search context, we aim to further improve the robustness and generalization of the GMM-Searcher approach.