Introduction

In the field of intelligent robotics, significant progress has been made in developing autonomous systems capable of operating in complex environments1. One of the challenging tasks for autonomous robots is object search, where they must navigate through unpredictable and dynamic environments while quickly detecting and locating target objects2. Despite advances in this area, much of the existing research has primarily focused on object search within small, structured indoor environments3,4. In large-scale outdoor environments such as industrial parks or construction sites, object search plays a critical role in tasks like equipment monitoring, inventory management, and security monitoring. However, most current methods are not well-suited to these large-scale, unstructured, and feature-sparse environments, where object search continues to face significant challenges.

Object search in large-scale, unstructured, and dynamically changing environments presents significant challenges in achieving both accurate and memory-efficient modeling. As the environment grows, the memory required by dense volumetric models grows with its volume, placing extreme pressure on storage and querying. Existing methods either minimize memory usage at the expense of detail5,6, or preserve detail at significantly higher resource cost7. Furthermore, the complexity and unpredictability of such environments raise the demands on the robot’s decision-making, requiring thorough analysis and understanding of its surroundings for informed and adaptive choices. Additionally, for repetitive tasks, robots must adapt to environmental changes and learn from past experiences to improve future search efficiency. However, most search frameworks3,4,5,6,7 lack mechanisms to integrate knowledge from previous tasks, relying on predetermined strategies that show little improvement after repeated executions.

To address these challenges, this paper introduces GMM-Searcher, a novel framework for efficient object search in large-scale environments that also equips the robot with lifelong learning capabilities to adapt to changing tasks and environments. GMM-Searcher is particularly well-suited to objects with characteristic spatial distributions, such as vehicles being located outdoors rather than indoors, or bus stops standing along the roadside rather than in a garden. The framework, shown in Fig. 1, integrates ARTG-GMM, the combination of an adaptive-resolution topological graph (ARTG) and Gaussian mixture models (GMMs), which models both spatial layouts and semantic information. This hybrid approach retains the simplicity of topological graphs6 while leveraging the GMM’s capacity to represent complex objects8,9, significantly reducing memory requirements while maintaining high-detail representation. Its incremental construction enables rapid adaptation to environmental changes. By incorporating the reasoning capabilities of large language models (LLMs)10,11,12, the robot can make more informed and efficient decisions, particularly in dynamic and unfamiliar environments. To enhance performance in repetitive tasks, the GMM captures and stores experiences from past successful searches, continually expanding and merging Gaussian components to form a knowledge base. These experiences not only sharpen the LLM’s reasoning for continuous improvement but also equip the robot with lifelong learning abilities. Real-world experiments validate that GMM-Searcher significantly enhances object search efficiency in large-scale environments: it reduces unnecessary search effort, improves adaptability to environmental changes, and increases the robot’s performance in repetitive tasks.

The main contributions of this paper are as follows:

  • A novel framework for object search that leverages the reasoning capabilities of LLMs to enhance the efficiency of robot target searches in large-scale environments.

  • An efficient environmental representation method that combines an adaptive-resolution topological graph with GMMs, enabling incremental updates.

  • An explicit method for recording historical search experiences, improving performance in repetitive tasks, and supporting the robot’s lifelong learning ability.

  • Validation of the proposed methods through real-world experiments, demonstrating improved efficiency and adaptability in large-scale object search tasks.

Related work

In object search tasks, traditional approaches primarily depend on predefined search strategies, utilizing detailed path planning and grid-based mapping techniques3,13. Dang et al.13 introduced a sampling-based semantic exploration method that accelerates the search by maximizing the gain from exploring new spaces and improving observation resolution in previously mapped regions with detected objects of interest. Kim et al.3 designed efficient visitation strategies that sequentially explore frontiers, enabling rapid exploration of unknown environments to locate target objects. While these methods are efficient, they rely heavily on the precision of sensor inputs13 and the accuracy of underlying environmental models3. This dependence often leads to rigid search strategies that struggle to adapt to dynamic or changing targets. As a result, these methods exhibit limited flexibility and reduced performance in large-scale environments, where unpredictability and environmental complexity present additional challenges.

Data-driven approaches, such as Convolutional Neural Networks (CNNs)14,15,16 and Reinforcement Learning (RL)17,18,19,20, have substantially improved the efficiency of robotic object search tasks. Pandey et al.14 propose a behavior-based neural network (BNN) that enhances mobile robot navigation by improving path efficiency and collision detection. Ye et al.15 develop a hierarchical policy learning framework, integrating a high-level policy for sub-goal planning with a low-level policy for exploration. These models, trained on large datasets, effectively identify patterns and predict object locations, thus streamlining search strategies. However, despite these advancements, the application of such techniques is often restricted to indoor or small-scale environments. In larger, more complex settings, challenges like sparse rewards and limited generalization hinder their effectiveness, making it difficult to achieve optimal results in dynamic, expansive environments.

Vision-based and semantic-integrated strategies have also emerged, allowing robots to integrate visual and semantic data, thereby enhancing their environmental understanding and making efficient search decisions20,21,22,23. By leveraging reasoning capabilities to predict necessary actions, these approaches improve task execution in similar environments24,25. The introduction of Large Language Models (LLMs) has further enhanced these tasks by offering sophisticated reasoning skills that help guide decisions in complex environments. For instance, Zhang et al.4 introduced a framework that uses natural language to model relationships between objects and environments, allowing robots to locate targets more efficiently. Schmalstieg et al.26 explored interactive search tasks, guiding robots through complex sequences, such as navigating cabinets and doors, to find objects. Honerkamp et al.27 extended these ideas to large-scale environments, using open-vocabulary scene graphs to boost LLMs’ semantic reasoning in expansive object searches. Even in the case of obscured objects, LLMs can reason about likely locations, helping robots make informed navigation choices28. Nevertheless, most existing research is limited to small-scale or indoor environments, or depends on pre-built maps. There is less focus on dynamic, large-scale outdoor scenarios where unpredictable elements, such as pedestrians and moving vehicles, introduce additional challenges.

In robotic exploration and search tasks, several common methods are used for environmental representation, including grid maps, topological maps, and hybrid metric-topological maps. Grid maps29,30 provide highly accurate spatial details but can become computationally expensive when applied to large-scale environments. Nvblox31, based on grid maps, proposes an incremental Euclidean distance field construction method, where each cell stores the distance to the nearest obstacle. Octomap32,33 addresses this issue by optimizing memory use through a hierarchical voxel structure, which efficiently represents free space. Topological maps34,35, typically used in 2D environments, represent space as a graph of nodes and edges, offering significant memory efficiency. However, they lack the detailed geometric information necessary for more complex tasks. Recent advances, like semantic maps20 or semantic Octomap36, augment these representations with object or area labels, improving the robot’s comprehension of its surroundings, though these methods often require sophisticated perception systems. Hybrid metric-topological maps23,27,37,38 combine semantic information with a topological framework, offering a balance between efficiency and detail. However, they can struggle to accurately capture object size and geometry, limiting their application in environments with high levels of complexity.

Fig. 1

The proposed GMM-Searcher framework consists of three main components. First, the perception module captures and processes panoramas and point clouds, providing comprehensive environmental data. These data serve two purposes: real-time observations for the LLM and storage within the adaptive-resolution topological graph with GMMs (ARTG-GMM). The LLM utilizes both real-time environmental observations and historical observations, applying its reasoning capabilities to formulate well-informed search strategies. For repeated tasks, the GMM stores past task experiences, progressively enhancing the LLM’s decision-making accuracy over time.

In contrast to these limitations, our GMM-Searcher framework addresses large-scale object search by unifying adaptive-resolution topological graphs and GMM to model spatial-semantic distributions. Unlike static-map-dependent methods, our hybrid system dynamically constructs environment representations, minimizing memory usage without sacrificing detail. Integrated LLM reasoning leverages stored search experiences encoded as GMM components to continuously refine future task efficiency and enhance adaptability to dynamic environmental changes.

ARTG-GMM environment representation

In large-scale scenes, effective environmental representation is crucial for search tasks. This paper introduces the ARTG-GMM, comprising a 2.5D ARTG and multiple GMMs. The ARTG indicates the traversable areas for the robot in 3D space, while the GMMs model the objects within the environment. This section outlines the construction of the ARTG-GMM.

Adaptive-resolution topological graph

The environment consists largely of free space with relatively few obstacles, making traditional map representations, such as grid maps or voxel maps, inefficient. To mitigate memory consumption, this paper analyzes the robot’s traversability in 3D space and maps it onto a 2.5D ARTG. The ARTG is modeled as an undirected graph \(\mathscr {G}=\{\mathscr {V}, \mathscr {E} \}\), where \(\mathscr {V}\) represents the vertices and \(\mathscr {E}\) represents the edges connecting adjacent vertices.

Fig. 2

3D space robot traversability analysis: Obstacles, narrow passages, and steep slopes mapped as non-traversable areas.

When the robot operates in complex environments, it must account for both ground obstacles and overhead obstructions, such as overhanging branches. To ensure safety, the ARTG maps obstacles, narrow passages, and steep slopes as collision areas, as shown in Fig. 2. The ARTG is generated by randomly sampling the free space and is incrementally updated as the robot moves. For precise environmental modeling, the vertex sampling density is higher near collision areas and sparser in open areas, as illustrated in Fig. 3: the minimum vertex spacing \(\alpha d_{\text {obs}}\), with \(\alpha > 0\), is proportional to the distance \(d_{\text {obs}}\) from the nearest obstacle in the region. An edge between two vertices is established when the distance between them is less than \(d_{\text {con}}\) and the edge does not pass through any collision area. The connection distance \(d_{\text {con}}\) is defined as:

$$\begin{aligned} d_{\text {con}} = \min (\beta d_{\text {obs}}, d_{\text {max}}), \end{aligned}$$
(1)

where \(\beta > \alpha > 0\) is a scaling factor, and \(d_{\text {max}}\) is the maximum allowable distance for constructing an edge. The denser graph structure near collision areas captures fine details of their boundaries, while the sparser structure in open areas maintains simplicity. Additionally, to preserve the sparsity of the ARTG, any newly sampled vertex located closer than \(\alpha d_{\text {obs}}\) to the nearest vertex is pruned.
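As a concrete illustration, the sampling and connection rules above can be sketched as follows; `dist_to_obstacle`, the parameter defaults, and the omission of edge collision checking are illustrative assumptions, not the paper's implementation:

```python
import math

def try_add_vertex(vertices, p, dist_to_obstacle, alpha=0.5, beta=2.0, d_max=1.2):
    """Insert sampled point p into the ARTG if it respects the minimum
    spacing alpha * d_obs; return the new edges created by the adaptive
    connection rule d_con = min(beta * d_obs, d_max) (Eq. 1).
    dist_to_obstacle(p) stands in for a distance-field lookup; collision
    checking along each edge is omitted in this sketch."""
    d_obs = dist_to_obstacle(p)
    # prune samples that crowd an existing vertex (sparsity rule)
    if any(math.dist(p, q) < alpha * d_obs for q in vertices):
        return []
    vertices.append(p)
    radius = min(beta * d_obs, d_max)  # Eq. (1)
    return [(p, q) for q in vertices[:-1] if math.dist(p, q) <= radius]
```

Because the spacing and connection radius both scale with \(d_{\text {obs}}\), vertices and edges automatically densify near obstacles and thin out in open space.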

Gaussian mixture model

For objects of various shapes in the environment, GMM is well-suited to capture complex object layouts through a mixture of Gaussian distributions, enabling the representation of diverse object types at different locations. Its compact structure efficiently reduces storage requirements, making it an effective method for spatial representation39.

The GMM is a probabilistic model that represents a combination of several Gaussian distributions. Each component corresponds to an individual Gaussian, capturing the probability of encountering a target at specific locations. The probability density function for each component is defined as:

$$\begin{aligned} \mathscr {N}(\textbf{x}|\mu _i, \Sigma _i) = \frac{1}{\sqrt{(2\pi )^k |\Sigma _i|}} e^{-\frac{1}{2} (\textbf{x} - \mu _i)^T \Sigma _i^{-1} (\textbf{x} - \mu _i)}, \end{aligned}$$
(2)

where \(\textbf{x} \in \mathbb {R}^k\) is the location, \(\mu _i\) is the mean vector of the Gaussian component, and \(\Sigma _i\) is the covariance matrix, quantifying the uncertainty in position. The probability density of the GMM with m components is expressed as:

$$\begin{aligned} p(\textbf{x} | \{\pi _i, \mu _i, \Sigma _i \}_{i=1}^{m}) = \sum _{i=1}^{m} \pi _i \mathscr {N}(\textbf{x}|\mu _i, \Sigma _i), \end{aligned}$$
(3)

where \(\pi _i\) is the mixture weight of the i-th Gaussian component, with \(\sum _{i=1}^{m} \pi _i = 1\).

For each object, a GMM with N Gaussian components is used to model its structure, as shown in Fig. 3. The parameters \(\{\pi _i, \mu _i, \Sigma _i \}_{i=1}^N\) of the GMM are optimized using the Expectation-Maximization (EM) algorithm, ensuring unique identification of the object. EM is an iterative process: during the E-step, the probability of each data point belonging to each Gaussian component is calculated based on the current parameter estimates. In the M-step, the parameters are updated to maximize the likelihood given these assignments. This process repeats until convergence, as detailed in40.
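The E- and M-steps can be sketched in one dimension as follows; the deterministic initialization at order statistics and the fixed iteration count are simplifications for illustration, not the paper's implementation:

```python
import math

def gauss_pdf(x, mu, var):
    """One-dimensional instance of the Gaussian density in Eq. (2)."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def em_fit(data, k=2, iters=50):
    """Minimal 1-D EM for a k-component GMM (Eq. 3).  Means start at
    evenly spaced order statistics for determinism; real implementations
    use k-means initialization and a convergence test."""
    srt = sorted(data)
    mus = [srt[i * (len(srt) - 1) // (k - 1)] for i in range(k)]
    vars_ = [1.0] * k
    pis = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        resp = []
        for x in data:
            w = [pis[i] * gauss_pdf(x, mus[i], vars_[i]) for i in range(k)]
            s = sum(w)
            resp.append([wi / s for wi in w])
        # M-step: weighted re-estimation of pi, mu, Sigma
        for i in range(k):
            ni = sum(r[i] for r in resp)
            pis[i] = ni / len(data)
            mus[i] = sum(r[i] * x for r, x in zip(resp, data)) / ni
            vars_[i] = max(sum(r[i] * (x - mus[i]) ** 2
                               for r, x in zip(resp, data)) / ni, 1e-6)
    return pis, mus, vars_
```

Run on two well-separated clusters, the fitted means converge to the cluster centers and the weights to the cluster proportions.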

Fig. 3

The schematic of the ARTG-GMM environmental representation. The ARTG represents free space with a higher density of vertices near obstacles, and GMMs model spatial obstacles.

Each object \(O_j\) is ultimately represented by a collection \(O_j = \{ \{\pi _i, \mu _i, \Sigma _i \}_{i=1}^N, label, color \}\). Here, the label identifies the object’s semantic category, and color corresponds to the semantic point cloud coloring. Both label and color are updated using Bayesian probability41, enhancing the model’s robustness against scene segmentation and point cloud semantic matching errors. Additionally, as the robot moves, the object’s semantic point cloud is progressively refined, and the parameters \(\{\pi _i, \mu _i, \Sigma _i \}_{i=1}^N\) are updated accordingly.
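The Bayesian label fusion can be sketched as a categorical update; the symmetric observation likelihood and the function name are assumptions rather than the paper's exact rule:

```python
def update_label(belief, observed, likelihood=0.8):
    """Categorical Bayesian update of an object's label belief: the
    observed label is weighted by `likelihood`, the other labels share
    the complementary mass uniformly, and the posterior is renormalized.
    `belief` maps each candidate label to its current probability."""
    miss = (1.0 - likelihood) / (len(belief) - 1)
    post = {l: p * (likelihood if l == observed else miss)
            for l, p in belief.items()}
    z = sum(post.values())
    return {l: p / z for l, p in post.items()}
```

Repeated consistent observations drive the belief toward the observed label, while a single segmentation error is damped by the prior, which is the robustness property the text describes.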

Finally, the ARTG-GMM is represented as \(\{\mathscr {G}, \mathscr {O} \}\), where \(\mathscr {G}\) is incrementally updated as the robot moves, and the object set \(\mathscr {O} = \{O_1, O_2, \cdots , O_n \}\) expands as new objects are discovered. The ARTG-GMM supports incremental additions and local updates, making it suitable for large-scale environments. Its unique structure not only reduces storage and querying pressure on onboard platforms but also maintains high-precision modeling of free space.

LLM-based search strategy

LLMs excel at understanding and reasoning across multimodal data, making them particularly advantageous for search tasks in complex environments. This section describes how LLMs enhance robotic search efficiency through Chain-of-Thought (CoT)42 reasoning and how robots are endowed with lifelong learning capabilities.

Searching through CoT

CoT reasoning allows LLMs to decompose complex tasks into structured, sequential steps, enhancing decision-making in real-world scenarios. In search tasks, this ability enables the robot to incrementally make decisions for the next steps by continuously analyzing the environment, inferring potential target locations, and adapting its strategy accordingly.

Fig. 4

A demonstration of using CoT to guide the LLM’s reasoning and decision-making process. The LLM infers potential target locations based on task descriptions and refines these inferences by incorporating environmental and historical observations. The robot selects a direction within the predefined region for its next movement, iterating this process until the target is located.

In the proposed framework, GMM-Searcher employs CoT to guide the LLM’s reasoning process, enabling the incremental refinement of search strategies in complex environments. A demonstration of this process is shown in Fig. 4. By leveraging task descriptions, the LLM integrates its world knowledge with environmental constraints to identify promising search areas, as illustrated in Fig. 4a and b. Subsequently, the current environmental observation \(\mathscr {V}_e\), historical observation \(\mathscr {V}_h\), and the corresponding prompt \(\mathscr {P}_h\) for the historical observation are sequentially fed into the LLM, allowing it to fully understand the surrounding environment and historical movement. Finally, the LLM integrates all the information and aligns it with the task requirements, generating a search strategy to guide the robot in completing the search task.

Specifically, for the current environmental observation \(\mathscr {V}_e\) captured via a panoramic camera, it is discretized into N regions, with each region corresponding to a decision \(D_i\), where \(D_i \in \mathscr {D} = \{D_1, D_2, \dots , D_N \}\). This regional division facilitates the LLM’s decision-making process, allowing it to select discrete decisions from the continuous environmental observation. Through iterative reasoning, the LLM evaluates new observations and integrates them with prior knowledge to select the optimal decision \(D^* \in \mathscr {D}\). This iterative process allows the LLM to continuously refine its understanding and improve its search strategies based on the most recent observations.
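One plausible way to realize this discretization, assuming regions of equal angular width on the panorama (the paper does not specify the partition), is:

```python
import math

def heading_to_region(yaw, n_regions=11):
    """Map a heading angle in [-pi, pi) on the panorama to one of
    n_regions equal angular sectors, with region 0 starting at -pi.
    The decision D* selected by the LLM then corresponds to one
    such sector index."""
    frac = (yaw + math.pi) / (2.0 * math.pi)
    return min(int(frac * n_regions), n_regions - 1)
```

With the 11 regions used in the experiments, the forward direction (yaw 0) falls in the central region, giving the LLM a fixed, orientation-independent indexing of the continuous observation.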

Despite the rich information provided by the environmental observation \(\mathscr {V}_e\), distortion from cylindrical mapping poses interpretation challenges, even for experienced observers. Furthermore, although the LLM has memory capabilities, it is not sensitive to depth in three-dimensional space, which makes it difficult for the model to recall areas previously traversed by the robot, potentially leading to the robot wandering in a specific region. To address this, both the environmental observation \(\mathscr {V}_e\) and ARTG-GMM are simultaneously input to the LLM, where ARTG-GMM offers a more intuitive historical observation \(\mathscr {V}_h\) in the form of a 2D top-down image. The associated prompt describes various elements using a standardized { ’label’, ’color’ } format. As shown in Fig. 4c, each element is assigned a unique label and color code, ensuring clarity and structure in the input data for LLM-based reasoning. Additionally, the historical observation \(\mathscr {V}_h\) is divided into regions that correspond one-to-one with the regions in the environmental observation \(\mathscr {V}_e\), allowing the LLM to intuitively relate the two without needing to consider the robot’s current orientation. This standardized prompt and division of regions ensure consistent identification of key objects, aiding the LLM in effectively processing environmental input. By continuously incorporating new environmental and historical observations, a series of search strategies is generated to guide the robot in completing the search task.
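The standardized prompt entries might be rendered as follows; the function name and the exact string layout are illustrative:

```python
def build_history_prompt(elements):
    """Render ARTG-GMM elements into the standardized { 'label', 'color' }
    prompt format described in the text, one entry per line.  `elements`
    is a list of (label, color) pairs drawn from the object set."""
    return "\n".join("{ 'label': '%s', 'color': '%s' }" % (label, color)
                     for label, color in elements)
```

Keeping every element in the same two-field schema is what lets the LLM match colored regions in the top-down image \(\mathscr {V}_h\) against named objects without free-form parsing.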

Notably, while the LLM’s reasoned search strategy facilitates target search, it does not guarantee success in all scenarios. For instance, a target may be overlooked due to occlusion or missed because it sits in a corner, and once the LLM has initially ignored such areas, it rarely returns to explore them. Therefore, when the ARTG-GMM has covered more than a threshold fraction (e.g., 70%) of the search area without locating the target, GMM-Searcher adopts a strategy similar to TARE43 to ensure complete coverage. Specifically, frontiers are generated at the boundaries of the regions not yet covered by the ARTG-GMM, and a Traveling Salesman Problem (TSP) is formulated to produce a route that visits the frontiers in sequence. This strategy guides the robot to areas missed during LLM reasoning, ensuring that the entire area is thoroughly covered and no potential object location is left unsearched.
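The fallback coverage route can be sketched with a nearest-neighbor heuristic as a stand-in for a full TSP solver; function and variable names are illustrative:

```python
import math

def frontier_tour(start, frontiers):
    """Order the uncovered frontiers into a visiting route from the
    robot's current position.  The paper formulates a TSP; this sketch
    greedily visits the nearest remaining frontier, a common heuristic
    stand-in for an exact solver."""
    tour, current, remaining = [], start, list(frontiers)
    while remaining:
        nxt = min(remaining, key=lambda f: math.dist(current, f))
        remaining.remove(nxt)
        tour.append(nxt)
        current = nxt
    return tour
```

An exact or near-optimal TSP solver (e.g., LKH-style local search) would replace the greedy step in practice; the interface of "positions in, visiting order out" stays the same.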

Lifelong learning

LLMs possess an inherent ability to adapt and learn across tasks, whether through varied task exposure or specialized training. In search tasks, this capability manifests as increasingly accurate reasoning about object locations that better matches real-world expectations. A significant limitation, however, is that an LLM cannot retain the historical positions of all objects once the search target changes. Although its reasoning improves in coherence and accuracy over time, the resulting efficiency gains in task execution remain limited.

To address this limitation, GMM-Searcher explicitly stores task execution experiences using GMMs. This approach allows the robot to recall and leverage past experiences when encountering similar tasks or environments. Specifically, for each search object, a corresponding GMM is constructed. When an object is found at a location \(\textbf{p}\), with target size \(\textbf{s}\), the corresponding GMM update steps are as follows:

First, a new component \(G_{new}\) is created with parameters initialized as follows:

$$\begin{aligned} \begin{aligned} \mu _{\text {new}}&= \textbf{p}, \\ \Sigma _{\text {new}}&= \text {diag}(\textbf{s}^2), \\ \pi _{\text {new}}&= \frac{1}{N+1}, \\ \end{aligned} \end{aligned}$$
(4)

where N is the number of components in the GMM.

Next, the new component \(G_{\text {new}}\) attempts to merge with existing components within the GMM. It is only merged with the closest component \(G_i=\{\pi _i, \mu _i, \Sigma _i\}\) if the distance \(d(\mu _{\text {new}}, \mu _i) < d_{\text {max}}\) is satisfied, where \(d(\cdot ,\cdot )\) is the Mahalanobis distance and \(d_{\text {max}}\) is the distance threshold. If merged, the parameters of \(G_i'\) are updated as follows:

$$\begin{aligned} \begin{aligned} \mu _{i}'&= \frac{\pi _{i} \cdot \mu _{i} + \pi _{\text {new}} \cdot \mu _{\text {new}}}{\pi _{i} + \pi _{\text {new}}}, \\ \Sigma _{i}'&= \frac{\pi _i}{\pi _i + \pi _{\text {new}}} \left( \Sigma _i + (\mu _i - \mu _i')(\mu _i - \mu _i')^T \right) + \frac{\pi _{\text {new}}}{\pi _i + \pi _{\text {new}}} \left( \Sigma _{\text {new}} + (\mu _{\text {new}} - \mu _i')(\mu _{\text {new}} - \mu _i')^T \right) ,\\ \pi _{i}'&= \pi _{i} + \pi _{\text {new}}. \end{aligned} \end{aligned}$$
(5)

Finally, the weights of all components in the GMM are normalized to sum to one, \(\pi _i \leftarrow \pi _i / \sum _{j} \pi _j\).
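The update of Eqs. (4)-(5) plus the normalization can be sketched as follows, assuming diagonal covariances for brevity (the paper's full-covariance merge includes the off-diagonal terms of the outer products); the dict-based component representation and the merge threshold value are illustrative:

```python
import math

def update_experience(gmm, p, s, d_merge=3.0):
    """Experience update of Eqs. (4)-(5) with diagonal covariances:
    `gmm` is a list of components {'pi', 'mu', 'var'} (var = diag(Sigma)),
    p the location where the target was found, s the target size."""
    new = {"pi": 1.0 / (len(gmm) + 1),          # Eq. (4)
           "mu": list(p),
           "var": [si ** 2 for si in s]}
    # closest existing component by (diagonal) Mahalanobis distance
    best, best_d = None, float("inf")
    for comp in gmm:
        d = math.sqrt(sum((a - m) ** 2 / v
                          for a, m, v in zip(p, comp["mu"], comp["var"])))
        if d < best_d:
            best, best_d = comp, d
    if best is not None and best_d < d_merge:
        # moment-matched merge (Eq. 5), diagonal terms only
        w = best["pi"] + new["pi"]
        mu = [(best["pi"] * m1 + new["pi"] * m2) / w
              for m1, m2 in zip(best["mu"], new["mu"])]
        best["var"] = [best["pi"] / w * (v1 + (m1 - c) ** 2)
                       + new["pi"] / w * (v2 + (m2 - c) ** 2)
                       for v1, v2, m1, m2, c in
                       zip(best["var"], new["var"], best["mu"], new["mu"], mu)]
        best["mu"], best["pi"] = mu, w
    else:
        gmm.append(new)
    total = sum(c["pi"] for c in gmm)  # renormalize the mixture weights
    for c in gmm:
        c["pi"] /= total
    return gmm
```

Repeated finds near the same spot thus reinforce one component (its weight grows and its covariance tightens around the observations), while finds in new areas spawn fresh components.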

As the robot repeats similar tasks, the accumulated experiences are rendered into the historical observation \(\mathscr {V}_h\) in the same way objects are, using the same { ’label’, ’color’ } prompt format. The key difference is that areas with a higher probability of containing the target are drawn with lower transparency in their associated ’color’ values. Additionally, for search tasks the robot has not previously encountered, if the ARTG-GMM contains relevant information about the target, a corresponding GMM is constructed to bootstrap the search. This ensures that even for unfamiliar tasks, the robot can efficiently leverage existing knowledge from similar tasks or environments, improving overall search efficiency.

Path optimization

Once a search strategy is determined, it must be optimized for effective robotic navigation. A local target is selected within the region corresponding to decision \(D^*\), ensuring it is both reachable and free from obstacles, while maintaining a safe distance to avoid potential collisions. Additionally, the local target should be positioned far enough from the current location to enable continuous movement during the decision-making process, preventing the robot from halting due to an overly short travel distance. Then, path optimization44 is applied to ensure smooth and safe navigation. As new decisions are made, path optimization guarantees a seamless transition to the updated target, allowing the robot to maintain smooth and adaptive movement throughout.

Experiments

This section describes the experimental setup and assesses the search efficiency using multiple metrics to validate the proposed framework’s effectiveness.

Fig. 5

The two-wheeled self-balancing robot used in the experiments.

Implementation details

The effectiveness of the proposed framework is evaluated using a two-wheeled self-balancing robot equipped with an Insta360 X4 camera and a Mid-360 LiDAR, as shown in Fig. 5. The camera captures 360-degree panoramas at 30 fps, while the LiDAR operates at 10 Hz. Network connectivity is provided by a 4G radio, enabling upload speeds of up to 50 Mbps and download speeds of up to 150 Mbps. All algorithms are executed on an onboard NVIDIA Jetson Orin NX, with FAST-LIO245 used for precise localization.

Fig. 6

Two large-scale experimental scenes: Scene 1 is outdoors, and Scene 2 is a semi-open indoor environment.

The experiments take place in two real-world scenarios, depicted in Fig. 6. The first scenario is a 320m \(\times\) 210m outdoor campus area, where moving vehicles and pedestrians present additional challenges for the experiment. The second scenario involves a semi-enclosed 200m \(\times\) 100m area, surrounded by four academic buildings with regularly distributed classrooms. During the search process, the robot operates at a maximum speed of 1.0 m/s. The ARTG-GMM functions with a minimum distance of \(d_{\text {min}} = 0.2\) m and a maximum distance of \(d_{\text {max}} = 1.2\) m. Grounded-SAM46 with RAM47 is applied for panoramic semantic segmentation, utilizing the onboard GPU. GPT-448 serves as the LLM, capable of handling both image and text inputs concurrently. To assist in decision-making, both environmental and historical observations are segmented into 11 regions. Path optimization uses the recommended parameters44, with a maximum acceleration of 1.5 \(\hbox {m/s}^2\) and a maximum angular velocity of 1 rad/s.

Exploration decision analysis

Understanding historical observations is crucial for LLMs to gain a comprehensive understanding of their environment and utilize historical information, enabling them to make more accurate decisions. To evaluate the understanding capability of LLMs regarding historical observations presented in image form, including ARTG-GMM and historical movement trajectories (illustrated in Fig. 4c), an experiment is designed. This involves inputting both environmental observations obtained during exploration and their corresponding historical observations into various LLMs. The models are then manually queried about the objects contained within each region, and the accuracy of their responses is recorded. If errors occur, the correct answers are provided to the LLMs to facilitate self-learning. The experiment is conducted using models such as GPT-4o, GPT-4o-mini48, LearnLM 1.549, and Moonshot AI50, all of which support multimodal inputs of images and text.

The results of these experiments are illustrated in Fig. 7, displaying the accuracy of reasoning after multiple iterations of errors and corrections. As shown, all models demonstrate a steady improvement in accuracy with an increasing number of reasoning optimization steps. Notably, GPT-4o exhibits higher accuracy compared to the other models, reaching over 90% accuracy after 30 optimization steps. Initially, the interquartile range (IQR) of the box plots is relatively large, indicating a wider spread in the accuracy. However, as the number of optimization steps increases, the IQR becomes narrower, suggesting that the data distribution becomes more uniform and consistent. This trend indicates that the models not only improve in accuracy but also become more robust and reliable as they undergo iterative learning and correction. The consistent improvements across models highlight the effectiveness of iterative feedback and correction in enhancing the LLMs’ ability to reason about complex scenes.

Fig. 7

Reasoning accuracy of the tested LLMs on historical-observation queries, shown as box plots over successive optimization steps.

Overall, the experimental outcomes suggest that the ARTG-GMM representation is compatible with the reasoning strengths of the tested large models, facilitating effective decision-making and understanding of historical information in complex scenarios.

Fig. 8

The robot’s decision-making process during the search task involves dividing each environmental observation into 11 regions. The green-highlighted region represents the next direction of movement, chosen by the robot based on both environmental and historical observations.

Building on the LLMs’ improved reasoning capabilities with historical observations, Fig. 8 illustrates the decision-making process during the first search for a target car in Scene 1, showing several environmental observations. The green-highlighted segment represents the decision made by the LLM. In Fig. 8a, the vehicle is detected, and the LLM predicts that the region may be an open parking lot, indicating the target car is likely located there, prompting the robot to move toward it. When the target car is not found in this area, the robot selects an unexplored region, as shown in Fig. 8b. In Fig. 8c, no car is found nearby, and no likely parking areas are identified; however, with a junction detected in region 5, the robot moves toward that direction. Finally, in Fig. 8d, the robot successfully locates the target car in region 6.

Environmental representation analysis

Fig. 9

Search task details in Scene 1 using the proposed algorithm. The local ESDF map (colored regions) shows distance from obstacles for trajectory optimization, with the red line indicating the robot’s path and red dots as ARTG-GMM vertices. The colorful point cloud is reconstructed via GMM in ARTG-GMM.

Figure 9 illustrates the details of a robot’s search task in Scene 1 using the proposed algorithm. In the figure, the Euclidean signed distance field (ESDF)31 is colored by distance from obstacles and is used for trajectory optimization. The red line indicates the robot’s movement path, while the red dots represent the vertices of the ARTG-GMM. The colorful point cloud is reconstructed from the GMMs in the ARTG-GMM.

Table 1 presents a comparison of the proposed ARTG-GMM with several commonly used robotic mapping methods, where the bold values in the table indicate optimal performance metrics. The comparison is made based on two metrics: memory usage (Memory) and average update time (Time) across two scenes with different methods. The proposed ARTG-GMM method consistently outperforms benchmarks, such as Octomap51, Voxblox52, Nvblox31, and UFOmap53, in terms of both memory usage and processing time. Since the resolution in the ARTG part of ARTG-GMM is adaptive, its performance remains constant across different resolutions.

Table 1 Comparisons of environmental representations of two scenes.

At the 0.1 m resolution, ARTG-GMM demonstrates the lowest memory usage in both scenes (12.84 MB for Scene 1 and 9.17 MB for Scene 2). Specifically, this represents a 99.8% reduction compared to Nvblox, a 99.1% reduction compared to Octomap, and a 99.3% reduction compared to Voxblox in Scene 1. Similarly, in Scene 2, ARTG-GMM achieves a 99.9% reduction compared to Nvblox, a 99.2% reduction compared to Octomap, and a 99.2% reduction compared to Voxblox. ARTG-GMM also exhibits the fastest update times, with 46.98 ms for Scene 1 and 48.09 ms for Scene 2. While benchmarks perform well during the initial stages of mapping, their update times increase as scene size and memory usage grow. In contrast, ARTG-GMM maintains consistently fast and stable update times, with most of the computational overhead attributed to the EM parameter fitting process.
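The reduction figures follow the usual formula, reduction = 100 × (1 − ours/baseline), which can also be inverted to recover the baseline size a reported reduction implies (subject to rounding in the reported percentages):

```python
def percent_reduction(ours_mb, baseline_mb):
    """Memory saving relative to a baseline, in percent."""
    return 100.0 * (1.0 - ours_mb / baseline_mb)


def implied_baseline(ours_mb, reduction_pct):
    """Invert the formula: baseline size implied by a reported reduction."""
    return ours_mb / (1.0 - reduction_pct / 100.0)
```

For example, the 99.1% reduction against Octomap in Scene 1 implies an Octomap footprint of roughly 12.84 / (1 − 0.991) ≈ 1426.7 MB at 0.1 m resolution.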

At the 0.3 m resolution, the memory usage and update times of the benchmarks are significantly reduced; however, ARTG-GMM continues to demonstrate the lowest memory usage. This represents a 98.8% reduction compared to Nvblox, a 74.0% reduction compared to Octomap, and a 90.7% reduction compared to Voxblox in Scene 1. In Scene 2, ARTG-GMM achieves reductions of 98.3% compared to Nvblox, 55.4% compared to Octomap, and 88.0% compared to Voxblox. Although ARTG-GMM’s update time remains highly competitive, it is slightly slower than UFOmap.

This analysis highlights that ARTG-GMM offers low memory usage and efficient processing times, making it highly suitable for large-scale environments with limited computational resources.

Fig. 10

Comparison of the proposed algorithm’s search trajectories against benchmarks. (a) and (b) show first-time searches without prior experience, while (c) and (d) illustrate repeated searches utilizing historical experience, with variations in both the starting position and the search object. The colored point clouds represent data captured by the robot while using the proposed algorithm.

Searching path analysis

To evaluate the performance of the proposed algorithm, we compare it against several benchmarks6,43,54 across two scenes, as shown in Fig. 10. All algorithms are configured identically, with benchmark parameters sourced from their respective open-source projects. In Fig. 10, the search trajectories for each algorithm are distinguished by different colors, and the colored point cloud represents the data collected by the robot using the proposed method. The ellipsoids in Fig. 10c and d visualize the GMM-based historical search experience, where darker colors indicate areas with a higher probability of locating the target. Figure 10a and b show search trajectories without prior knowledge, while Fig. 10c and d illustrate trajectories after leveraging accumulated search experience, with both starting points and target locations changed. Figure 10a and c focus on car searches, while Fig. 10b and d focus on classroom searches.

As shown in Fig. 10a and b, GMM-Searcher consistently achieves the shortest search trajectory compared to the benchmarks during first-time searches. This efficiency is largely due to the LLM’s ability to interpret the environment and infer potential object locations, allowing the robot to bypass areas where the object is unlikely to be found. For example, in Scene 1, while searching for a vehicle, GMM-Searcher avoids narrow roads and open spaces near the boundaries, which are less likely to contain the vehicle. Similarly, in Scene 2, when searching for a classroom in Building A, GMM-Searcher identifies that the robot is near Building D and intelligently directs it towards Building A before beginning the search, whereas the benchmarks commence searching immediately without this spatial awareness, as seen in Fig. 10b.

After accumulating historical search experience, as shown in Fig. 10c and d, GMM-Searcher demonstrates more informed search strategies, even when the starting point or target location is altered. For example, in Scene 1, GMM-Searcher efficiently guides the robot to areas with a higher probability of containing the target, leading to successful searches, as illustrated in Fig. 10c.
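One way to picture how stored experience biases later searches is the sketch below: successful target locations are kept as Gaussian components, and candidate regions are ranked by mixture density, matching the darker high-probability ellipsoids in Fig. 10c and d. This is a simplified illustration; the class name, the isotropic fixed-variance components, and the mean-density scoring are assumptions for brevity, whereas the paper's GMM expands and merges full components.

```python
import numpy as np


class SearchExperience:
    """Simplified experience store: one isotropic 2D Gaussian per success."""

    def __init__(self, sigma=5.0):
        self.sigma = sigma   # assumed fixed spread of each component, in meters
        self.means = []      # (x, y) locations of past successful finds

    def add_success(self, location):
        """Record where the target was found after a successful search."""
        self.means.append(np.asarray(location, dtype=float))

    def score(self, point):
        """Average Gaussian density of the stored components at a point."""
        if not self.means:
            return 0.0
        p = np.asarray(point, dtype=float)
        var = self.sigma ** 2
        d2 = np.array([np.sum((p - m) ** 2) for m in self.means])
        dens = np.exp(-d2 / (2.0 * var)) / (2.0 * np.pi * var)
        return float(np.mean(dens))

    def best_region(self, candidates):
        """Pick the candidate position with the highest experience density."""
        return max(candidates, key=self.score)
```

With even a single recorded success, `best_region` steers the robot toward previously fruitful areas first, which is the behavior visible in the shortened trajectories of Fig. 10c.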

Table 2 Comparison of experiments in two scenes.

Although all methods achieve a 100% success rate in searching for the object across multiple experiments, there are still notable differences in the search paths between the methods, as shown in Table 2. This table provides a detailed comparison of different methods across multiple experiments. The term “First-time Search” refers to search tasks conducted without any prior information. The “Repeated Search - Same Task” indicates searches performed after accumulating historical search experience, where both the starting and target locations remain unchanged. In contrast, “Repeated Search - Changed Task” refers to searches conducted after accumulating historical search experience, with both the starting and target locations altered.

In the first-time search, where no prior task data is available, the proposed method significantly reduces the search path length and time compared to the benchmarks. In Scene 1, the average path length is 426.29 m, representing reductions of approximately 17.1%, 36.0%, and 36.5% compared to EFP, TARE, and SEER, respectively. The time required is also reduced to 450.31 s, offering reductions of about 21.5%, 39.9%, and 39.2%, respectively. Similarly, in Scene 2, the proposed method achieves a path length of 160.63 m and a time of 167.32 s, marking significant improvements over the benchmarks.

The benefits of historical experience are apparent in the repeated search tasks with the same start and target (Repeated Search - Same Task). In Scene 1, the path length is reduced to 401.78 m, with a further drop in standard deviation, indicating more stable and efficient performance. The time decreases to 409.97 s, highlighting the method’s capability to learn from past searches. Meanwhile, benchmarks do not show any improvement, as noted by the "-" symbol in Table 2.

When the start and target locations change (Repeated Search - Changed Task), the proposed method still demonstrates significant superiority. In Scene 1, the path length is reduced to 575.12 m, achieving reductions of approximately 60.2%, 57.1%, and 59.6% compared to EFP, TARE, and SEER, respectively. The time decreases to 599.08 s, offering reductions of about 60.5%, 59.8%, and 59.9%, respectively. In Scene 2, the proposed method achieves a path length of 201.11 m and a time of 209.49 s. These improvements underscore the method’s adaptability and its ability to learn over time, maintaining high efficiency even as search conditions change.

Time consumption

Table 3 Average time consumption of each module.

Table 3 summarizes the time consumption of each module in the proposed framework. Despite some modules, such as panoramic segmentation and GPT decision-making, taking longer to process, the framework remains efficient overall. This efficiency is achieved by leveraging the strong environmental understanding and inference capabilities of the LLM, which offsets the additional time required for these more resource-intensive tasks. As a result, the system effectively balances precision and speed across all modules.

Conclusion

This paper introduces GMM-Searcher, a novel framework for autonomous robotic search tasks in large-scale environments. The framework utilizes ARTG-GMM to efficiently manage the storage of environmental data, while the GMM component effectively captures and utilizes historical search experiences. By leveraging the deep understanding and reasoning capabilities of LLMs, the system intelligently predicts potential target locations and guides the robot in making more efficient search decisions. This method consistently outperforms traditional benchmarks in both path length and time consumption across various search scenarios. The results demonstrate the framework’s adaptability to dynamic conditions and its strong long-term learning abilities, establishing it as a highly efficient solution for autonomous search tasks.

Future work will explore ways to further enhance the system’s capabilities, including developing more advanced techniques to capture and reason about higher-level semantic information, as well as investigating methods to handle situations where the spatial distribution of targets is less predictable. By expanding the framework’s understanding of the environment and search context, we aim to further improve the robustness and generalization of the GMM-Searcher approach.