Introduction

As technology rapidly advances, robots are no longer confined to specific fields such as industry1, agriculture2, and service sectors3; they are progressively integrating into many aspects of daily life. Developing general-purpose robotic frameworks that can adapt to diverse environments and tasks has therefore become a crucial research focus in robotics. However, current robotic frameworks typically depend on large amounts of training data for specific scenarios, limiting their adaptability to uncertain environments and unknown tasks4. This limitation is especially evident in tasks requiring multimodal perception and manipulation5, where robots often struggle to operate effectively in unstructured environments. Building robotic frameworks for diverse scenarios often requires an extensive training database, which is time-consuming, labour-intensive, and costly6. Additionally, in uncertain environments, language-only instructions are often insufficient for accurately conveying task requirements; multimodal user instructions and contextual information are also necessary but challenging to collect and process. In recent years, zero-shot models have shown exceptional generalisation abilities in the language, vision, and auditory domains, offering new approaches to the manipulation challenges faced by robots in uncertain environments.

The ability of robots to generalise in uncertain environments and unknown tasks has consistently been a central topic in robotics research. In methods based on reinforcement learning, Gupta et al.7 proposed the learning of invariant feature spaces to achieve zero-shot generalisation, while Finn et al.8 enhanced generalisation by modelling the uncertainty in state value functions. Search-based methods, such as Monte Carlo Tree Search (MCTS), have also been successfully utilized in AlphaGo9. Chua et al.10 suggested that model-based predictive reinforcement learning can achieve zero-shot generalisation. In the realm of imitation learning, various approaches have adapted to new scenarios using robot demonstrations11,12, human videos13,14, language instructions15,16, and target images17. Ho and Ermon18 improved the adaptability of imitation learning by reframing it as an adversarial generation problem. Recently, few-shot methods based on meta-learning and transfer learning, such as Model-Agnostic Meta-Learning (MAML)19, domain adaptation networks20, and transfer reinforcement learning21, have been used to enhance robots’ generalisation capabilities. Unlike these previous methods, a key aspect of our approach is leveraging large zero-shot models trained on a wider range of data than what the robot typically encounters.

To enhance the capability of robots to perform tasks in uncertain environments, researchers have recently started exploring the application of zero-shot models in robotic control. Google’s Saycan22 leverages the PaLM model23 for robotic operations, while PaLM-E24 combines the PaLM model23, with 540 billion parameters, and the ViT model25, with 22 billion parameters, to create a comprehensive visual-language model. The RT-2 large visual-language-action (VLA) model26 exhibits stronger generalisation and emergent capabilities, learning from both internet and robotic data and converting this knowledge into control instructions. In addition to specialized zero-shot models for robots, researchers have also integrated existing visual or language zero-shot models into robotic frameworks for tasks like object classification27, detection28, and segmentation29. For example, CLIPORT30 utilizes the CLIP model27 for encoding semantic understanding and object manipulation in robots, with extensions into the 3D domain by Mohit et al31. CaP32 uses specific context prompts to steer the output of LLMs, Socratic Models33 add perceptual information to LLMs, and LID34 employs LLMs for sequential decision-making. R3M35 enhances the learning of downstream robotic tasks using diverse human video data36, while DALL-E-Bot37 employs Stable Diffusion to generate target scene images for guiding robot actions. Instruct2Act38 directs robots in visual tasks through API calls to foundational visual modules. Our proposed method differs from existing research by not being confined to a single visual or auditory modality. Instead, it offers an open library of zero-shot models and robotic action modules. The framework dynamically determines how to combine these models and actions based on given instructions, thereby managing instructions across various modalities. This approach significantly improves the framework’s flexibility and adaptability, enabling it to handle a broader range of task scenarios.

Imagine a scenario where a robot framework is instructed to “turn off the alarm clock placed on the Harry Potter book.” To execute this instruction successfully, the robot must first comprehend the specific meaning of the instruction. It then needs to perform several tasks, including scanning its environment, identifying the book, and recognizing the distinctive features of the Harry Potter cover. Simultaneously, the robot must use its microphone to detect and locate the sound of the alarm clock. By integrating these visual and auditory inputs, the robot can precisely locate the alarm clock on the book and turn it off. While current technology can manage these tasks individually, combining them into a single framework capable of functioning based on natural language or multimodal instructions presents a significant challenge. This complexity surpasses the capabilities of traditional end-to-end training frameworks, necessitating a more advanced approach to tackle the issue of robot task execution in uncertain environments.

This study introduces an innovative robotic framework called “Panda Act”, which uses a multi-layer modular design to handle the entire process from receiving natural language and multimodal instructions to the precise execution of tasks. The core feature of the framework is its ability to generate a series of operational steps from the given instructions, specifically in the form of a Python script for the robot. Each line of this script invokes the framework’s supported modules, including visual zero-shot models, auditory zero-shot models, and robotic action control modules. These modules operate in a hierarchical sequence, where each module uses the output from the previous one as input and produces intermediate results for the next module.

Fig. 1

A robotic task is executed by invoking multiple modules within the “Panda Act” framework. The LLM in “Panda Act” autonomously selects which framework modules to call based on task instructions. The green modules represent the zero-shot models currently included in the framework (Text-Davinci-00339, Llama340, GPT-441, SAM42, HQ-SAM43, Mobile-SAM44, CLIP27, Open-CLIP45, ImageBind46).

Figure 1 illustrates the workflow of the “Panda Act” framework. Initially, the framework leverages the semantic parsing capabilities of the GPT-4 model to request users to clarify any semantically ambiguous instructions, ensuring accurate and comprehensive task information. Based on the clarified task content, the framework then selects suitable models for processing. For instance, it employs the “Segment Anything Model” (SAM) for segmenting environmental images and the ImageBind model to match environmental sounds with images, thereby precisely locating the target object. Finally, the framework generates a robot operation sequence from the recognition results, enabling the robot to accurately execute the task. This modular design enhances the framework’s flexibility and scalability, allowing it to adapt to uncertain task environments. By integrating various zero-shot models and robot control modules, the “Panda Act” framework exhibits unique advantages in processing multimodal instructions and executing unknown tasks.

Unlike existing methods that directly generate robot task code (such as ChatGPT for Robotics47), “Panda Act” uses an improved approach to robot program generation. Instead of directly outputting robot control code, it controls robot behaviour by invoking independent perception and action modules. This method enhances the success rate and reliability of executing unknown tasks. Specifically, LLMs like GPT-4 first parse natural language and multimodal instructions, then dynamically determine the necessary framework modules based on task requirements, and finally generate Python code to call the relevant modules for executing robot tasks. This reduces the burden on individual modules and makes the framework more modular and scalable.
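For illustration, the following sketch shows the kind of module-calling program the LLM might emit for an instruction such as “Put the apples in the fruit basket”; the module names are hypothetical stand-ins for the framework's wrapped zero-shot models and action primitives rather than its actual API.

```python
# Hypothetical module library; names are illustrative, not the framework's real API.
from panda_act_modules import capture_top_view, segment_scene, match_text, pick_and_place

scene = capture_top_view()                         # RGB image from the top-mounted camera
crops = segment_scene(scene, model="sam")          # visual segmentation layer
apples = match_text(crops, "apples")               # cross-modal matching (CLIP text-image)
basket = match_text(crops, "a fruit basket")
pick_and_place(source=apples.position, target=basket.position)   # robot action control layer
```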

We extensively evaluated the “Panda Act” framework in two environments: a simulated environment using PyBullet48 and a real-world environment featuring a Dobot robotic arm and an Intel RealSense D435i camera. The evaluation focused on the framework’s zero-shot manipulation performance in both environments, its language and multimodal interaction capabilities, and its ability to perform unknown tasks. Results showed that even in entirely zero-shot scenarios, the “Panda Act” framework exhibited strong manipulation abilities, significantly outperforming methods that require learning tasks from scratch. This validates the effectiveness of integrating zero-shot multimodal models to enhance robotic manipulation capabilities.

The innovations and unique contributions of this paper can be summarised as follows:

  • The “Panda Act” framework differs from existing robotic frameworks that rely heavily on extensive training data for specific scenarios49,50,51. Our framework leverages the generalisation capabilities of zero-shot models, significantly enhancing the adaptability of robotic frameworks in uncertain environments. With its multi-layer modular design, integrating linguistic, visual, and auditory zero-shot models, “Panda Act” can perform operations without requiring additional task-specific training.

  • While most current robotic frameworks can only handle single text instructions52,53,54, our framework can process various user inputs, including pure-language instruction, language-image instruction, language-sound instruction, and directed-enhanced instruction. This multimodal interaction approach increases the flexibility for users to describe unknown tasks and significantly improves the efficiency and accuracy of task execution in uncertain environments.

  • Compared to most existing research which focuses on robotic performance in simulated environments38,48, our work provides a more comprehensive evaluation by including both a PyBullet simulation and a real-world setting with a Dobot robotic arm and an Intel RealSense D435i camera. Experimental results demonstrate that the “Panda Act” framework exhibits excellent manipulation capabilities in both settings, significantly outperforming methods that learn tasks from scratch. This provides new insights and directions for integrating zero-shot multimodal models into robotic frameworks.

The rest of the paper is organized as follows: The second section introduces the methodology, including the “Panda Act” framework architecture, multimodal interaction modes, framework design details, and module integration methods. The third section presents the experimental results and performance evaluation in the PyBullet simulation environment. In the fourth section, we validate the effectiveness and adaptability of the proposed methods through a series of real-world experiments conducted on the Dobot robotic arm. The fifth section provides the conclusion and future work.

Methodology

Framework overview

We propose a new multi-layer modular architecture designed to enhance the flexibility and efficiency of robot task execution. As shown in Fig. 2, this framework comprises four functional layers: the task instruction understanding layer, the visual image segmentation layer, the cross-modal matching layer, and the robot action control layer. The framework integrates nine zero-shot model modules and four basic robot action modules, enabling it to execute unknown tasks based on the user's natural language and multimodal instructions, along with images captured by a top-mounted camera.

Firstly, the LLM interacts with the user to obtain detailed task information. Based on the task requirements, the language model then automatically selects the appropriate zero-shot model modules and robot action modules, generating complete executable code. To ensure scalability and ease of use, all modules are encapsulated as functions. Moreover, we provide the language model with detailed function descriptions and example contexts to guide its output, thereby enhancing the accuracy and efficiency of task execution.

Fig. 2

“Panda Act” framework overview. Based on natural language and multimodal task instructions, the LLM selects appropriate zero-shot model modules and robot action modules. It then generates executable code, which ultimately drives the robot to complete the task.

Prompts for “Panda Act”

To improve task execution precision within our framework, we developed a comprehensive set of guiding prompt strategies to aid the LLM in understanding user instructions and generating precise robot control code. The core of this strategy involves creating a structured decision-making environment for the model. Firstly, we establish clear role definitions, such as “You are a robotic arm named ‘Panda Act’,” ensuring consistent contextual awareness in every interaction. We also provide the model with a detailed list of accessible functions and example task descriptions, clearly outlining the tools and methods available during code generation. To ensure the framework’s operational safety, we set explicit operational boundaries. Considering the potential ambiguity in user inputs, we introduce a “Question” tagging mechanism that allows the model to request additional information proactively, thereby improving the accuracy of task comprehension. Finally, we instruct the model to generate code in a specific format, making it easier for the framework to use regular expressions for subsequent code extraction and processing.
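A minimal sketch of such a prompt is given below; the wording, function list, and tags are illustrative assumptions rather than the exact prompts used by the framework.

```python
# Illustrative system prompt; the actual prompt text and function names in "Panda Act" may differ.
SYSTEM_PROMPT = """You are a robotic arm named 'Panda Act'.
Only call the functions listed below; never invent new ones, and never move
outside the workspace boundaries.

Available functions:
  segment_scene(image, model)    # model: 'sam' | 'mobile_sam' | 'hq_sam'
  match_text(crops, phrase)      # CLIP text-image matching
  match_image(crops, ref_path)   # Open-CLIP image-image matching
  match_sound(crops, audio)      # ImageBind image-sound matching
  pick_and_place(source, target) ; rotate(target, degrees) ; push(target, direction)

If the instruction is ambiguous, reply with a single line starting with
'Question:' to ask the user for clarification.
Otherwise, return only a fenced Python code block so the framework can extract
it with a regular expression.
"""
```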

How to combine modules in “Panda Act”

To flexibly utilize diverse modules for unknown robotic tasks, we devised a comprehensive robotic task processing pipeline with standardised input and output parameters at each layer, guiding robots to execute tasks based on natural language and multimodal instructions.

Task instruction understanding layer: The standardised input encompasses task-oriented natural language and multimodal directives, and the outputs are the textual features \(T\), image features \(I\), and audio features \(A\) extracted by the LLM. Depending on the task requirements, different LLMs can be utilized. For example, tasks demanding straightforward, rapid, and cost-efficient processing can use the Text-Davinci-003 model; tasks requiring higher levels of reasoning can employ the more expensive but more powerful GPT-4 model; and tasks that must run locally to preserve privacy should use the Llama-3 model.

Visual image segmentation layer: The standard input is the environmental image \(I_e\), and the output is the set of cropped object image features \(I_{ei}\). Once the robot captures a scene image, the visual segmentation model delineates masks for potential objects, and the image is then cropped at these mask locations to obtain the environment image features \(I_{ei}\).
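For illustration, this layer could be implemented along the following lines, assuming the open-source segment-anything package and a locally downloaded ViT-H checkpoint; the file names are placeholders.

```python
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

# Load the scene image captured by the top-mounted camera (placeholder path).
image = cv2.cvtColor(cv2.imread("scene.png"), cv2.COLOR_BGR2RGB)

# Generate candidate object masks with SAM (checkpoint path is a placeholder).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
masks = SamAutomaticMaskGenerator(sam).generate(image)

# Crop each masked region to obtain the environment image features I_ei.
crops = []
for m in masks:
    x, y, w, h = m["bbox"]                  # bounding box in (x, y, width, height) format
    crops.append(image[y:y + h, x:x + w])
```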

Cross-modal matching layer: With the task features and environmental image features as standard inputs, the output is the task's target image \(I_{ti}\). The textual features \(T\), image features \(I\), and audio features \(A\), alongside the environmental image features \(I_{ei}\), are routed to their respective matching models, which yield the corresponding matched images. Currently, the CLIP model is used for image-text matching, the Open-CLIP model for image-image matching, and the ImageBind model for image-sound matching.

Robotic action control layer: The standard input is the task's target image \(I_{ti}\), and the output is the robot action. Using the robot's hand-eye calibration module, the centre point of the target image is mapped to its actual position in the robot's coordinate frame and then relayed to the relevant action module to initiate the task.
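The coordinate mapping in this layer amounts to a standard back-projection through the camera intrinsics followed by the calibrated camera-to-base transform, as sketched below; the intrinsic matrix, transform, and pixel/depth values are placeholder assumptions rather than the calibration used in this work.

```python
import numpy as np

K = np.array([[615.0,   0.0, 640.0],
              [  0.0, 615.0, 360.0],
              [  0.0,   0.0,   1.0]])   # camera intrinsics (placeholder values)
T_base_cam = np.eye(4)                   # camera-to-robot-base transform from hand-eye calibration

def pixel_to_robot(u: float, v: float, depth_m: float) -> np.ndarray:
    """Back-project a pixel with known depth (metres) into the robot base frame."""
    xyz_cam = depth_m * (np.linalg.inv(K) @ np.array([u, v, 1.0]))
    return (T_base_cam @ np.append(xyz_cam, 1.0))[:3]

# Centre pixel of the matched target crop -> grasp position for the action module.
grasp_xyz = pixel_to_robot(860, 410, 0.42)
```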

However, the contemporary visual zero-shot segmentation model, the “Segment Anything Model” (SAM), suffers from limited segmentation accuracy and long response times. This frequently leaves the framework with incomplete scene segmentations or excessive operation durations. To mitigate these issues, we integrated the Mobile-SAM and HQ-SAM models into the framework. When tasks are time-sensitive, the LLM selects the Mobile-SAM model, which is five times faster44; when tasks are precision-sensitive, it selects the HQ-SAM model, which offers higher accuracy.

The LLM serves as the central decision-making component that analyzes task instructions and determines the appropriate module selection and execution sequence. The process involves three key steps: (1) the LLM classifies the input into one of four interaction modes (pure-language, language-image, language-sound, or directed-enhanced), (2) based on keywords such as “hurry up” or “precisely”, it selects the appropriate segmentation model (Mobile-SAM for speed, HQ-SAM for precision, or standard SAM for balanced performance), and (3) it constructs and executes the processing pipeline where each module’s output serves as input to the next module in the sequence. This systematic approach ensures consistent task execution while maintaining framework flexibility across diverse scenarios.
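These decision steps can be rendered schematically as follows; in the actual framework the choices are made by the LLM from the prompt rather than by hand-written rules, and the keyword lists and function names are illustrative.

```python
def select_segmenter(instruction: str) -> str:
    """Pick a segmentation model from speed/precision cues (illustrative keywords)."""
    text = instruction.lower()
    if any(k in text for k in ("hurry", "quick", "fast")):
        return "mobile_sam"          # time-sensitive tasks
    if any(k in text for k in ("precise", "precisely", "carefully")):
        return "hq_sam"              # precision-sensitive tasks
    return "sam"                     # balanced default

def classify_mode(instruction: str, image_paths: list, has_sound_cue: bool) -> str:
    """Map an instruction to one of the four interaction modes."""
    text = instruction.lower()
    if "don't know how to describe" in text:
        return "directed-enhanced"
    if image_paths:
        return "language-image"
    if has_sound_cue:
        return "language-sound"
    return "pure-language"
```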

Flexible modal inputs in “Panda Act”

The “Panda Act” framework is capable of flexibly handling various types of input. Based on the natural language instructions provided by users, our LLM can autonomously determine specific interaction patterns. To accommodate different interaction needs, we have designed four interaction modes for the “Panda Act”: pure-language interaction, language-image interaction, language-sound interaction, and directed-enhanced interaction.

Pure-language instruction: The LLM extracts key information from the user's descriptive instruction, including requirements for segmentation speed and accuracy, the characteristics of the objects involved, and the intended manipulation. For instance, when a user inputs an instruction such as “Hurry up and put the apples in the fruit basket,” the framework infers that the task involves rapidly locating and grasping the apples, then placing them into the fruit basket. To achieve this, the framework calls on the faster Mobile-SAM model to segment the image. It then uses the CLIP model to encode the text features of “apples” and “fruit basket”, compares them with the cropped image features, and computes their similarity to precisely locate the apples and the fruit basket. Finally, the framework calls on the pick-and-place action, which controls the robotic arm to grasp the apples and place them at the specified position in the fruit basket.
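For illustration, the text-image matching step in this example could be sketched with the open-source CLIP package as follows; here `crops` is assumed to be the list of cropped object images produced by the segmentation layer, and the query phrases are illustrative.

```python
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Encode every object crop and the two query phrases, then compare by cosine similarity.
images = torch.stack([preprocess(Image.fromarray(c)) for c in crops]).to(device)
text = clip.tokenize(["a photo of apples", "a photo of a fruit basket"]).to(device)

with torch.no_grad():
    img_feat = model.encode_image(images)
    txt_feat = model.encode_text(text)
img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
similarity = img_feat @ txt_feat.T          # rows: crops, columns: "apples", "fruit basket"

apple_crop = int(similarity[:, 0].argmax())    # crop most similar to "apples"
basket_crop = int(similarity[:, 1].argmax())   # crop most similar to "fruit basket"
```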

Language-image instruction: In this mode, instructions describe the target object by referring to images, guiding the framework to complete the robot task. For example, when the input instruction is “Place the object in image 1 on image 2,” where image 1 and image 2 are the file paths of the sample images, the LLM identifies this as an image-based interaction task. If the user states no explicit speed or accuracy requirements, the framework defaults to the standard SAM model for image segmentation. The framework then inputs both the sample image and the current scene image into the Open-CLIP model, determining the correspondence between the sample image and the target object in the scene from the similarity of their feature vectors.

Language-sound instruction: Under this mode, the instructions require the robot to act according to the surrounding sounds. For example, when the instruction is “rotate the ringing alarm clock by 90 degrees”, the specific object “ringing alarm clock” is explicitly mentioned in the instruction. The framework automatically records the sound of the environment for approximately 10 seconds. Then, the current scenario image and the recorded environmental sound are input into the ImageBind model to pinpoint the sound-emitting object by comparing feature embeddings.
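This matching step could be sketched with the open-source ImageBind package, following the usage pattern in its public repository; `crop_paths` (the saved object crops) and `recording.wav` (the roughly 10-second environment recording) are illustrative assumptions.

```python
import torch
from imagebind import data
from imagebind.models import imagebind_model
from imagebind.models.imagebind_model import ModalityType

device = "cuda" if torch.cuda.is_available() else "cpu"
model = imagebind_model.imagebind_huge(pretrained=True).eval().to(device)

crop_paths = ["crop_0.png", "crop_1.png", "crop_2.png"]    # saved object crops (placeholder paths)
inputs = {
    ModalityType.VISION: data.load_and_transform_vision_data(crop_paths, device),
    ModalityType.AUDIO: data.load_and_transform_audio_data(["recording.wav"], device),
}
with torch.no_grad():
    emb = model(inputs)

# Similarity between every crop embedding and the recorded sound embedding.
scores = emb[ModalityType.VISION] @ emb[ModalityType.AUDIO].T
ringing_crop = int(scores.argmax())        # crop most likely to be the ringing alarm clock
```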

Directed-enhanced instruction: When the target object cannot be described in words or through multimodal information, directed-enhanced instructions provide an effective alternative. For this mode, we designed a separate GUI through which users can click on or encircle the target object. When the instruction contains phrases such as “I don’t know how to describe this object”, the framework automatically launches this GUI. The click or selection coordinates are then transmitted to the HQ-SAM model, which segments and localises the selected object.
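A minimal sketch of the click-to-segmentation step, assuming the SAM predictor interface (which the HQ-SAM release also follows), is given below; the checkpoint path, click coordinates, and `scene_rgb` variable are placeholders.

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# scene_rgb is assumed to be the HxWx3 uint8 scene image from the top camera.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)
predictor.set_image(scene_rgb)

click = np.array([[562, 318]])                 # (x, y) pixel reported by the GUI (placeholder)
masks, scores, _ = predictor.predict(
    point_coords=click,
    point_labels=np.array([1]),                # 1 marks the click as a foreground point
    multimask_output=True,
)
target_mask = masks[scores.argmax()]           # keep the highest-scoring segmentation
```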

Simulations

Simulation environment construction

A virtual simulation environment was established using PyBullet and the Universal Robot UR5 robotic arm. The operational workspace of this environment spans 0.5 m × 1 m, and the environment employs the VIMABench simulation suite48. As depicted in Fig. 3, this suite comprises an extensible collection of 3D objects and textures.

Fig. 3

3D objects and textures in the simulated environment.

Within the virtual simulation environment, two observational perspectives are provided: a frontal view and a top-down view, with the latter being primarily employed in this study. The end-effector of the robotic arm utilizes a suction cup. Moreover, the simulation environment incorporates fundamental operational actions such as “pick and place”, “rotation”, and “pushing”.

Figure 4 provides a detailed illustration of the virtual simulation environment constructed in this paper. On the left, the robotic arm operation scenario is depicted, which encompasses a Universal Robot UR5 arm, an operation console, and various objects of diverse shapes. The UR5 arm has six rotational joints, enabling it to perform a wide range of operations in the simulated environment. The right side of Fig. 4 shows the camera perspective in the virtual environment, which uses a top-down viewpoint. This perspective clearly captures the geometry, texture, and relative positioning of each object, giving the framework a comprehensive understanding of the scene's layout.

Fig. 4

Simulated experimental environment. The left image is the simulated environment, and the right is the top camera view.

Zero-shot robotic tasks evaluation

The zero-shot robotic tasks performance of the framework is evaluated using the VIMABench suite of tasks48. VIMABench encompasses six categories and 17 task templates, with PyBullet selected as the backend and the default renderer for the evaluation benchmarks. As illustrated in Fig. 5, the paper selects representative meta-tasks from the suite, such as visual manipulation, scene comprehension, and rotational operations, to comprehensively assess the “Panda Act” framework’s zero-shot robotic task capabilities.

Fig. 5

Evaluate the “Panda Act” framework with the VIMABench evaluation suite.

We selected four representative meta-tasks (Task 1, Task 2, Task 3, and Task 4) from VIMABench, encompassing everything from simple object manipulation to visual reasoning, thoroughly assessing the performance of the “Panda Act” framework in zero-shot robotic tasks. The following is a detailed description of these tasks:

  • Task 1: Identify and select a designated object, subsequently placing it into a specific container. The task is deemed successful only when all specified objects are situated within the container.

  • Task 2: Insert objects with distinct textures into containers of a specified hue. Success is achieved once all objects bearing the designated texture are housed within containers of the stipulated colour.

  • Task 3: Rotate an object clockwise along the Z-axis to a specified angle. The task is considered successful only when the object’s position aligns with its original, and its orientation corresponds with the predetermined post-rotation angle.

  • Task 4: Learn the relationships of novel vocabulary terms. The task is successful when all target objects lie within the region of the container.

For our experiments, we employed GPT-4 as the language model. The framework autonomously chooses the appropriate image segmentation and multimodal perception zero-shot models based on task prompts. The framework controls the model’s output solely through limited prompts to GPT-4, without any training or fine-tuning for the tasks.

To evaluate the performance of the “Panda Act” framework, we conducted experiments on the VIMABench benchmark and compared the results with current methods that learn tasks from scratch. Table 1 presents the evaluation results of the “Panda Act” framework against other baseline methods on VIMABench. To ensure a fair comparison, we selected the variants of these methods with the largest number of parameters, specifically 200 million parameters. Our comparison baselines include the following methods that learn tasks from scratch:

Gato is a decoder-only model trained with pure supervised learning. It predicts actions autoregressively, with different tasks specified by providing the model with the corresponding initial token sequence55.

Flamingo is a vision-language model that embeds a variable number of prompt images into a fixed number of tokens through the Perceiver Resampler module and associates them with the language model through cross-attention to the encoded prompts56.

GPT is a behaviour-cloning agent based on the GPT architecture, conditioned on tokenised multimodal prompts. It decodes the next action autoregressively based on the multimodal prompts and interaction history57.

Table 1 Simulation results for all methods (Success Rates in %).

We employed four meta-tasks to assess these techniques, with each evaluation conducted on 150 instances and the task success rate used as the evaluation metric. Success was determined automatically by the VIMABench simulator based on the task configuration. Table 1 shows the evaluation results of the “Panda Act” framework on VIMABench, where each reported success rate is the mean over multiple evaluation runs, with standard deviations of ±2.1% for Task 1, ±1.8% for Task 2, ±1.5% for Task 3, and ±2.3% for Task 4. Baseline results are taken directly from48. The results indicate that our approach significantly outperforms the three other strategies, underscoring the “Panda Act” framework’s ability to comprehend intricate instructions and execute precise operations. Notably, our framework is entirely zero-shot: it underwent no task-specific training and relies solely on task information and prompts to an LLM, without other technologies or data. To further validate our methodology, we conducted an ablation study on the “Panda Act” framework.

Ablation analysis

LLMs

In the “Panda Act” framework, the language model serves as the central component. To examine the efficacy of LLMs within the framework, this study compares the GPT-4, Llama-3, and Text-Davinci-003 models in testing experiments on the VIMABench task suite. The outcomes are depicted in Fig. 6a.

Fig. 6

Comparison of success rates across different models.

The results reveal that the GPT-4 model performs better than the Llama-3 and Text-Davinci-003 models, indicating that the choice of LLM plays a pivotal role in the success of the experiments. Upon further analysis of the code generated by the three language models, this study identifies two primary factors behind the disparities:

  • Hallucination: In certain experiments, Llama-3 and Text-Davinci-003 models produced outputs that were either irrelevant to the actual task or logically incongruent, a phenomenon attributed to the hallucination tendencies of LLMs.

  • Omission: The tests indicate that, in comparison to the GPT-4 model, Llama-3 and Text-Davinci-003 models are more prone to overlooking or forgetting crucial operational steps or information, leading to the framework’s inability to execute tasks accurately.

To provide a more comprehensive evaluation, we also analysed the response time performance of different LLMs across all meta-tasks, as shown in Table 2.

Table 2 Response time comparison of different LLMs (Seconds).

The response time analysis reveals interesting trade-offs between different models. Text-Davinci-003 demonstrates the fastest response times with an average of 3.1 s, followed by GPT-4 at 3.7 s, while Llama-3 shows the slowest performance at 5.1 s. However, this speed advantage of Text-Davinci-003 comes at the cost of reduced task accuracy due to the omission issues mentioned above. GPT-4 provides the optimal balance between response time and task completion accuracy. The slower response time of Llama-3 can be attributed to its increased computational complexity and the additional processing overhead required for local execution.

Zero-shot models

The operation of the “Panda Act” framework is also intricately linked to its zero-shot models. To elucidate the impact of these zero-shot models on the framework’s overarching performance, this study conducted a comprehensive ablation experiment.

As depicted in Fig. 6b and c, this study further contrasts the success rates of different zero-shot visual models and multimodal perceptual models in the experiments. The findings underscore that the choice of zero-shot model has a significant influence on success rates: there is a positive correlation between the performance of the zero-shot models and the success rate of the experiments. Compared with the multimodal perceptual models, the quality of image segmentation from the visual segmentation models has a more pronounced impact on the framework’s overall performance. This is likely because the segmentation model’s outputs feed directly into the multimodal perceptual models, so errors propagate through this sequential dependency.

Experiment I: tests of interaction modes

In this section, we evaluated our framework in a real-world environment. Our real-world platform comprises a Dobot Magician robotic arm and an Intel RealSense D435i depth camera capturing RGB-D images at a resolution of 1280 × 720.

Fig. 7

Tests of interaction modes.

Case I: pure-language interaction

As depicted in Fig. 7a, this study tested and validated the pure-language interaction mode under real-world conditions. The test environment consisted of a robotic arm, dishes, and green vegetables. The framework captured scene images via a RealSense camera installed at the top.

The test employed “Put the greens on the plate” as the language instruction.

The robot's operation during testing is divided into three stages:

  1. Firstly, the LLM extracts the segmentation speed and accuracy requirements, the information about the objects involved, and the intended action from the user's descriptive sentence.

  2. Then, the LLM calls the segmentation base model, such as SAM, based on “greens” and “plate”, and the image-text matching base model, such as CLIP, based on “put” and “on”.

  3. Finally, the LLM decides to use the Pick and Place action based on “put” and “on”, generates executable Python code, segments the image, obtains the object positions, and sends them to the robot for execution.

Through natural language instructions, users can operate this framework without any additional learning or training. Furthermore, the GPT-4 model can provide feedback to users, enabling human-in-the-loop control, which improves the smoothness of the interaction and the user experience. This test validates that the framework can correctly understand and execute pure-language instructions in a real environment.

Case II: language-image interaction

As illustrated in Fig. 7b, this study conducted an empirical test of the language-image instruction in the real environment. The test scenario included a pizza and a banana. To validate this interaction pattern, we adopted “Place the Image1 (banana image) on the Image2 (pizza image)” as the interaction instruction, where Image1 is the local file path of the banana image and Image2 is that of the pizza image.

The robot's operation during testing is divided into the following stages:

  1. Firstly, by parsing “image1” and “image2”, the LLM identifies that the user's intention is image-based interaction.

  2. Next, it extracts the file-path information of the example images from the user input.

  3. If the user does not specify speed or precision requirements, the framework defaults to a general image segmentation model, such as the SAM model.

  4. In image interaction mode, the LLM invokes the image-image matching base model, such as Open-CLIP.

  5. Based on “Place” and “on” in the instruction, the LLM decides to execute the Pick and Place action.

  6. Finally, the LLM generates executable Python code for image segmentation and object location acquisition and transmits the result to the robot to execute the corresponding actions.

This interaction pattern significantly reduces the challenges faced by users when describing the target, reduces misunderstandings caused by semantic ambiguity, and ensures that the framework can accurately identify target objects in uncertain environments.

Case III: language-sound interaction

As shown in Fig. 7c, we designed a language-sound interaction mode to address scenarios that require sound localisation. In the experiment, the test scenario included a robotic arm, a plate, and a ringing alarm clock.

The test employed “Put the ringing alarm clock on a plate” as the interaction instruction. The framework automatically recognized this instruction as a language-sound interaction task and recorded the environmental audio for 10 seconds.

The robot's operation during testing is divided into the following stages:

  1. Firstly, the LLM identified the task as language-sound interaction by analysing the instruction and subsequently recorded 10 seconds of environmental sound for analysis.

  2. Subsequently, the LLM employed the default segmentation model, such as the SAM model, to extract visual features of the target object.

  3. The LLM then leveraged the image-sound matching model, such as ImageBind, to precisely locate the sound source.

  4. Based on “Put” and “On” in the instruction, the LLM determined to adopt the “Pick and place” action sequence.

  5. Ultimately, the LLM generated executable Python code for image segmentation and object location recognition and sent the corresponding instructions to the robot to perform the required actions.

Equipping robots with the ability to perceive sound allows them to respond flexibly to various uncertain operational environments, especially when sound information is more important than visual information.

Case IV: directed-enhanced interaction

As illustrated in Fig. 7d, the test environment is composed of a series of disorganised objects. When the user indicates that they cannot accurately describe the target through language or images, the framework immediately identifies this input as a directed-enhanced task and opens a GUI that guides the user to identify the target object by clicking or encircling it, as shown in Fig. 8. Given the clarity of the task objective, the framework does not activate the multimodal perception module; instead, it directly segments the target object based on the selected area and drives the robotic arm to execute the required operation, significantly enhancing the robot's operational performance in uncertain environments.

Fig. 8

GUI interface for directed-enhanced interaction.

The robot's operation during testing is divided into the following stages:

  1. Firstly, the user expresses their need through natural language, such as “I don’t know how to describe this object” or “Can I click/select the object for operation?”

  2. Subsequently, the LLM identifies the user's need, decides to use the directed-enhanced mode, and launches the click/select interface.

  3. The user clicks on or encircles the object they want to manipulate.

  4. The user's click/select coordinates are transferred to the segmentation model, which generates a high-quality segmentation of the object and determines its location.

  5. Finally, executable instructions are generated from the position information and sent to the underlying controller, and the robot executes the corresponding action.

The directed-enhanced interaction mode provides a more intuitive and flexible operation method for the user and is particularly suitable for uncertain scenarios that are difficult to describe through language. This mode uses only the high-precision segmentation model, thus significantly enhancing the robot's response speed and accuracy. To evaluate the usability of this interaction mode, we conducted preliminary tests with 5 research team members, each performing 8–10 trials of object selection and manipulation tasks. The results showed consistent performance across users, with success rates ranging from 85 to 95% (mean: 91.2%, standard deviation: ±3.8%), indicating the framework's robustness to different user interaction styles.

Experiment II: tests of adaptability

Case I: zero-shot robotic tasks

We constructed a sample set of 50 one-shot English instructions for three typical basic tasks: placing, rotating, and picking, with each task type tested 15 times to calculate reliable success rates and analyse the reasons for task failure. The reported success rates (76% for placing, 84% for rotating, 80% for picking) represent mean values across these trials with standard deviations of ±4.2%, ±3.8%, and ±5.1% respectively.

As shown in Fig. 9, our approach shows good generalisation ability across the different tasks, with an average success rate between 75% and 85%. Note that our approach is transferred directly from the simulated environment to the real one without any training or data fine-tuning. Further analysis of the experimental results shows that code generation errors are the main cause of failure, which we attribute mainly to the uncontrollable nature of LLM output. To provide concrete insights into these failures, we present typical code generation error examples below.


Example 1. Missing essential processing steps.


Example 2. Incorrect parameter matching.

Fig. 9

Test of zero-shot robotic tasks.

Error Example 1 demonstrates critical pipeline incompleteness: the LLM bypasses essential processing steps, including image segmentation, object cropping, and coordinate transformation, so that raw image data is passed directly to CLIP and then to the action module without proper object localisation. Error Example 2 shows parameter type mismatching: the LLM incorrectly assigns audio data to the CLIP module, which is designed for text-image matching, instead of using the ImageBind module for audio-visual correspondence.
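The two failure patterns can be illustrated with hypothetical reconstructions that reuse the illustrative module names from the earlier sketch; they are not the literal programs produced in the experiments.

```python
# Hypothetical module library, as in the earlier sketch.
from panda_act_modules import (capture_top_view, record_environment,
                               segment_scene, match_text, pick_and_place)

# Error pattern 1: missing essential processing steps -- the raw scene image is
# handed straight to the matching and action modules, skipping segmentation,
# cropping, and coordinate transformation.
scene = capture_top_view()
target = match_text(scene, "the ringing alarm clock")   # expects crops, not a raw image
pick_and_place(source=target, target=scene)             # no valid coordinates to act on

# Error pattern 2: incorrect parameter matching -- audio data is routed to the
# text-image model instead of the image-sound model.
audio = record_environment(seconds=10)
crops = segment_scene(scene, model="sam")
target = match_text(crops, audio)                       # should be match_sound(crops, audio)
```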

Potential strategies to address these limitations include pipeline integrity checking, parameter type validation, and template-based error recovery mechanisms with human-in-the-loop intervention when automatic correction fails. We also found that the number of actions in a robot task affects the success rate: tasks with fewer actions usually succeed more often, possibly because the generated code is less complex.
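One such safeguard could be a lightweight type check on generated calls before execution, sketched below with a hypothetical module registry.

```python
# Expected query modality of each matching module (hypothetical registry).
EXPECTED_QUERY = {
    "match_text": "text",
    "match_image": "image_path",
    "match_sound": "audio",
}

def validate_call(func_name: str, query_modality: str) -> bool:
    """Reject generated calls whose query modality does not fit the module."""
    expected = EXPECTED_QUERY.get(func_name)
    return expected is None or expected == query_modality

# The parameter-mismatch failure above would be caught before execution:
assert validate_call("match_sound", "audio")
assert not validate_call("match_text", "audio")
```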

Case II: colour understanding tasks

To evaluate the framework’s capacity to understand colours, this paper constructed a scenario comprising five differently coloured cubes, as depicted in Fig. 10. The framework was given the instructions “Pick up the yellow cube” and “Pick up the blue cube”. It successfully identified the cubes corresponding to the specified colours and accomplished the tasks, indicating that the framework can comprehend the association between colour terms and the corresponding objects in the scene and respond effectively to task directives.

Fig. 10

Test of colour understanding tasks.

Case III: new conceptual understanding tasks

This paper also assessed the framework’s understanding of novel concepts and its ability to generalise them to tasks. As depicted in Fig. 11, the task instructions were “The red cube is for ketchup and the green cube is for vegetables. Now please put the ketchup on the burger.” and “The red cube is for flame, the blue cube is for water. Now, please extinguish the flame.” The framework was required to learn that the red cube represented ketchup or flame, the green cube represented vegetables, and the blue cube represented water.

Fig. 11

Test of new conceptual understanding tasks.

Safety and ethical considerations

Unlike rule-based systems, LLMs may occasionally produce outputs that are syntactically correct but semantically inconsistent, or misinterpret user intentions in ambiguous instructions. In real-world robotic control, such issues could lead to undesired behaviours, unintended motion, or in extreme cases, hardware damage or safety risks.

To mitigate these risks, our current framework enforces strict constraints on executable functions by limiting LLM outputs to a predefined and verified function library. All experiments were conducted under human supervision in controlled environments using non-critical objects and low-force robotic arms.

Limitations and future work

While the “Panda Act” framework demonstrates significant advantages in zero-shot robotic manipulation, several key limitations must be acknowledged that provide directions for future research.

Computational and Performance Limitations: The framework faces substantial computational demands due to its reliance on multiple large-scale models, including GPT-4, various zero-shot vision models (SAM, HQ-SAM, Mobile-SAM), and multimodal perception models (CLIP, Open-CLIP, ImageBind). The sequential processing of these models creates inherent latency with computational bottlenecks primarily occurring in LLM reasoning, code generation, and visual segmentation processes. Additionally, dependency on cloud-based LLMs introduces network latency and reliability concerns for industrial applications.

Experimental scope limitations: Our current real-world experiments primarily focus on static environments with clearly visible objects, which limits the demonstration of the framework’s capabilities in more complex practical scenarios.

Interactive capabilities limitations: While our framework supports initial user interaction through multimodal instructions and GUI-based object selection, it lacks mechanisms for online re-planning and partial code re-generation during execution, limiting its adaptability to unexpected situations or real-time user corrections.

Future research directions: To address these limitations, future work will focus on: (1) computational efficiency improvements through model compression techniques, heterogeneous computing architectures, and edge computing integration; (2) expanded experimental validation including dynamic environments, occlusion handling, and interactive user feedback loops; (3) enhanced human-robot interaction capabilities with execution state monitoring, real-time error correction, and dynamic re-planning mechanisms; and (4) comprehensive user studies with formal experimental protocols to systematically evaluate user experience across different populations.

Conclusions

Traditional robot frameworks often rely heavily on extensive training for specific tasks and environments, limiting their generalisation ability. Recently, large language models (LLMs) and zero-shot models have shown strong generalisation across domains, offering new avenues to address this limitation. In this study, we explore the use of multiple zero-shot models to solve the generalisation problem of robots in uncertain environments. We built a robot framework named “Panda Act”, integrating language, vision, and auditory zero-shot models. This framework is not constrained by the environment and does not require specific scene learning, but rather utilizes the generalisation capacity of multimodal zero-shot models. The framework flexibly processes language and multimodal instructions, generating executable code that dynamically invokes the appropriate zero-shot models based on parsed task intent. This approach avoids specific task training and allows the framework to understand and execute instructions and scenes it has never seen before, significantly outperforming methods that require learning tasks from scratch. Notably, our method also exhibits good generalisation in real robot environments and enables the execution of tasks with complex semantic meanings.