Background & Summary

The development of large models, including language models, Vision Language Models (VLMs)1,2, and vision neural networks3, has transformed human-robot interaction, allowing robots to perform tasks in a more natural manner4. However, a significant limitation remains: the lack of integrated multimodal data restricts robots’ capacity to perform fine manipulation tasks that require precise, closed-loop feedback and delicate hand-eye coordination, such as opening a condiment cup or unscrewing a can5. These tasks depend on haptics to regulate force, on a mix of sensory inputs such as visual attention and hand movements, and on the ability to adjust and re-plan in real time in response to small changes in environmental images and sounds.

Recognizing the limitations of unimodal data, it becomes evident that many existing datasets, initially gathered through unimodal means, are insufficient for the evolving needs of these models. In previous research, human grasping data were typically captured using RGB cameras6.

Unimodal grasping

For instance, Sun et al.7 provided a taxonomy of multi-object grasping, but their dataset lacks fine-grained motion capture and multimodal integration, limiting its utility for tasks requiring precise bimanual manipulation analysis. Bullock et al.8 proposed a dataset focused on scene grasping in home and machining environments. Brahmbhatt et al.9,10 collected images of grasped household objects and modeled their textures using 3D printing. While capturing the textures of 3D-printed objects is straightforward, noticeable differences remain between these and the textures of real objects. To address this problem, Dreher et al.11 proposed extracting symbolic spatial object relations from raw RGB-D video data captured from robotic viewpoints. This method constructs a graph-based scene representation and trains a graph network classifier on these representations and action labels to learn object-action relationships. Hasson et al.12 focused on reconstructing hand and object poses from vision data; however, their dataset does not include physical interaction data, such as forces and torques. Taheri et al.13 captured the complete grasping action of a person with an object, including detailed 3D body shape and pose, object shape, and body-object contact, without using RGB images. Nicora et al.14 provided motion-capture data and multi-view video sequences from a cooking scene to explore perspective-invariant action properties. Aganian et al.15 presented a dataset encompassing 95.2k annotated fine-grained actions, recorded over 51.6 hours by three cameras representing the potential viewpoints of a cooperating robot. This dataset uniquely annotates the actions performed by each hand separately, reflecting the simultaneous use of both hands in assembly tasks. Lastly, Fan et al.16 created a dataset in which two hands manipulate articulated objects, such as scissors or laptops, emphasizing the dexterity required to handle such items.

Multimodal grasping

Visual data alone struggles to provide sufficient information for embodied intelligence, so recent research has concentrated on constructing multimodal human datasets. The richness of multimodal information improves understanding of human grasping behavior, supports downstream cross-modal tasks, and aids the development of large-scale multimodal models. Multimodal data can also facilitate fine-grained distinctions, such as using vision, audio, and force to distinguish the material properties of food products17. These datasets are useful for addressing specific tasks. Wang et al.18 provide a comprehensive dataset integrating visual and tactile data from robotic grasping experiments, enabling researchers to analyze grasping processes and detect slip events. Han et al.19 provide a multimodal resource designed to model human grasp intent for prosthetic hands, incorporating paired eye-view and hand-view images, EMG signals, and IMU data to facilitate research on intent inference and motion planning through multimodal integration. Lastrico et al.20 propose a pick-and-place-focused dataset recorded with multiple cameras, motion capture systems, and wrist-worn inertial measurement units that observe movements from various angles.

Hand manipulation

Some grasping datasets focus on hand-object interaction, usually bimodal and based on RGB-D cameras. Rohrbach et al.21 introduced a fine-grained cooking activity dataset that includes 65 activities recorded in realistic kitchen environments. Huang et al.22 introduced a comprehensive dataset focusing on daily interactive manipulation, capturing 3D position, orientation, force, and torque data of objects during fine manipulation tasks. Garcia et al.23 proposed a first-person model of hand-object interaction using only a 3D model of the hand. Hampali et al.24 provided 3D annotations for hand and object poses but did not integrate dynamic temporal data or multimodal signals. Chao et al.25 collected a grasping dataset for three downstream tasks: 2D detection, 6D object pose estimation, and 3D hand posture estimation; however, the study lacked task orientation and theoretical depth for real-world applications. Alia et al.26 introduced a dataset combining macro- and micro-level activity recognition during cooking sessions. Pereira et al.27 introduced a novel dataset of 2,866 food-flipping movements involving different foods and utensils, capturing 3D trajectories, forces, torques, and subject gaze, providing valuable data for advancing robotics, human activity recognition, and bio-inspired control systems. Elangovan et al.28 collected a dataset of object grasping and manipulation in a kitchen environment, but the sample size was small, the statistical analysis was narrow and lacked depth, and the data required better hand-object modeling. Mastinu et al.29 proposed a first-person capture of detailed grasping data based on radar and time-of-flight sensors, including hand kinematics and proximity vision data during grasping actions. Zhao et al.30 proposed a dataset that captures bimanual manipulation using low-cost hardware; however, it focuses on a narrow set of tasks and lacks emphasis on multimodal fusion.

Whole-body manipulation

Manipulation involves not only the hand but the entire body, and objects often contact multiple body parts. However, existing datasets fail to capture the complexity of these interactions. Human grasping is a whole-body movement, and body posture varies across grasp types. Some studies therefore collect hand-specific grasping details alongside whole-body movements. Krebs et al.31 created a motion-capture database of whole-body furniture manipulation tasks; however, it covers only a single, generic scene. Similarly, Liu et al.32 proposed a first-person, two-handed interaction dataset for object manipulation that combines camera and depth data, but the scene arrangement is overly homogeneous and the dataset lacks rich semantic information. Kwon et al.33 built a first-person RGB-D dataset of hand-object manipulation for first-person interaction recognition; it provides 3D poses of both hands and the manipulated objects, shape-rich annotations, and detailed information such as action labels, camera poses, scene point clouds, and object meshes. DelPreto et al.34 proposed a multimodal dataset for kitchen tasks; however, it focuses on wearable sensors and lacks fine-grained annotations.

Therefore, existing datasets are often limited by their homogeneity in modality, primarily focusing on single visual modalities and lacking motion mechanics information. This limitation restricts their applicability to general artificial intelligence or autonomous learning in unstructured environments35. Additionally, these datasets lack a unified sensor collection framework, leading to inconsistencies in data quality and type across studies. Furthermore, the reproducibility of these datasets is limited by the specialized nature of their collection environments and conditions. Most datasets are also narrowly focused on downstream tasks, with insufficient integration and synthesis of the collected data, limiting the depth of insights that can be drawn from them. Many datasets cover only a narrow set of tasks or lack the rich multimodal data essential for developing more advanced, large-scale models. In contrast, our dataset, the Kaiwu kitchen dataset, covers a broader range of tasks and incorporates multiple modalities, such as tactile, visual, and auditory data.

This paper proposes a multimodal data collection strategy within real-world kitchen environments. By establishing a highly centralized framework that integrates the existing equipment in the Kaiwu Lab, this approach aims to enhance the data’s diversity, utility, and reproducibility, facilitating more comprehensive and robust human-robot interaction and embodied intelligence research. The contributions of this paper are as follows:

Whole-body multimodal data

This dataset covers the critical sensing modalities closely related to the manipulation learning process, including multi-view data of human activities, hand muscle activity, first-person vision, eye-tracking data, scene audio during kitchen tasks, hand kinematics for precise actions, and dynamic full-body motion capture data, surpassing the other datasets listed (Table 1). It provides a comprehensive understanding of human actions in kitchen environments, supplying detailed data for robots to acquire multimodal skills.

Table 1 Comparison of the proposed dataset with existing datasets on hand interaction.

Fine-grained data labeling

In this study, 680 regions of interest were labeled for first-person attention data, and 536,467 objects were annotated in multi-view video data. Additionally, 14,511 motion segmentation events and 4,254 gesture segmentation events for left- and right-hand fine operations were meticulously recorded, along with multimodal myoelectric data and cross-modal synchronization between movements and gestures. This expert annotation enhances the dataset’s utility for complex analyses, significantly advancing research in human-robot interaction and embodied intelligence.

Methods

In this section, we highlight the main pipelines of our data collection. First, the design of the study and the setup, including its equipment and technical characteristics, are described. Next, the preprocessing steps applied to the data for cleaning and synchronization are discussed.

Design

The selected activities and use cases in our dataset provide essential insights into embodied intelligence, enhancing robotic design through an understanding of real-world interactions. To broaden this type of dataset, we recorded activities related to each manipulated object. Each object has three primary tasks: Grasp, Carry, and Place. In addition to the primary tasks, manipulation tasks (A1 and A2) are introduced to highlight object-specific functionality. For example, tasks such as cutting fruit, stirring ingredients, or rolling dough illustrate the specific manipulations enabled by different tools. To account for temporal and spatial variations and differences among individuals, we engaged 20 volunteers to perform diverse whole-body movements in an unstructured kitchen setting. Each movement was performed twice to ensure consistency. To explore human interactions with kitchen objects more deeply, we designed 17 object manipulations involving common kitchen items, using a refrigerator cabinet as the scene. These tasks include single- and multi-object grasping and single-handed manipulations, detailed in Table 2. The recordings capture primary tasks and manipulation tasks when the object is the primary focus. We developed 30 task scenarios from these 17 general action categories, initially defined coarsely and later annotated with the fine-grained details necessary for task completion.

Table 2 Breakdown of long-sequence task movements.

The choice of a kitchen setting for our data collection is particularly significant because kitchens are environments rich in varied and complex human-object interactions. They involve frequent manipulation of everyday objects, requiring fine and gross motor skills essential for understanding and replicating embodied intelligence in robotics. The kitchen is a natural setting where people perform various activities, such as cooking, cleaning, and organizing, which involve manipulating objects of different sizes, shapes, and weights. This variety makes the kitchen an ideal environment to study how humans interact with objects in a real-world context, providing critical insights that can be generalized to other settings.

Human interaction with everyday objects involves three-dimensional movements of the entire body. For example, retrieving vegetables from a low fridge requires crouching. Accordingly, a series of full-body grasping movements for kitchen tasks has been established. The terms Place and Put back describe distinct actions: Place involves setting an item on a counter or table, whereas Put back refers to returning it to its original location, such as placing fruit back into the refrigerator’s crisper. The dataset includes 17 objects, allowing volunteers to perform various actions freely, thus ensuring their autonomy. These actions span single and multiple-object manipulations and one- and two-handed tasks.

To record everyday activities in the kitchen (Fig. 1), a wearable sensor suite equipped with environmental sensors was used. These wearable devices can capture detailed human movements from different perspectives. The suite is scalable to various work environments, such as conference and assembly rooms. The multiple views provide additional information, mitigating issues caused by occlusions. However, the motion capture suit may impede some of the volunteers’ movements during the tasks, while the data glove may hinder their dexterity.

Fig. 1 Data capture system. The project focuses on a human-centered perspective, capturing multimodal information from wearable and external devices in a kitchen scenario.

Equipment

Optical motion capture system

The optical motion capture system captures the details of full-body movement in the kitchen with high accuracy. Participants were instructed to wear a motion capture suit with 53 markers before the experiment began; the model is shown in Fig. 2. The NOKOVTM optical 3D motion capture system uses 20 motion-capture cameras to model the human body by identifying the markers on the suit worn by volunteers, enabling accurate capture and estimation of 3D spatial coordinates (x, y, z) and orientation over time. Prior to the experiment, each volunteer was required to perform a cross-motion calibration. Additionally, marker positions on the suit were adjusted to account for height differences among subjects. Collecting a large number of human motion-capture demonstrations of manipulation skills can greatly improve a robot’s operating ability36.

Fig. 2 Human body 53-marker schematic. (a) Human model. (b) Motion capture model. Volunteers wear the motion capture suit with attached markers, whose positions are adjusted to each volunteer’s body type; the system recognizes the model in real time.

Multiview cameras

Three multi-view KinectTM cameras were arranged at the experimental site. To improve coverage of the experimenter’s elbow movements, back movements, and partially occluded objects at the sides, the cameras were positioned to the front-left, front-right, and behind the experimenter. Videos were recorded at 1920 × 1080 resolution to obtain RGB and depth data, which were used to generate point cloud data in post-processing. Such multi-view RGB-D setups have been shown to enhance robotic skill acquisition, for example by improving object manipulation and grasp planning in cluttered environments37.
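As a concrete illustration of the point cloud post-processing mentioned above, the sketch below back-projects a depth frame through the pinhole model. This is a minimal sketch, not the authors' pipeline: the intrinsics, depth resolution, and synthetic input frame are placeholder assumptions, and real values should come from each camera's calibration.

    # Minimal sketch: back-project a depth frame (millimetres) into a
    # camera-frame point cloud. Intrinsics below are placeholders.
    import numpy as np

    def depth_to_point_cloud(depth_mm, fx, fy, cx, cy):
        """Convert an HxW depth image (mm) to an Nx3 XYZ array (metres)."""
        h, w = depth_mm.shape
        u, v = np.meshgrid(np.arange(w), np.arange(h))
        z = depth_mm.astype(np.float64) / 1000.0      # mm -> m
        x = (u - cx) * z / fx
        y = (v - cy) * z / fy
        points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
        return points[points[:, 2] > 0]               # drop invalid pixels

    # Synthetic stand-in frame; real frames come from the Kinect recordings.
    cloud = depth_to_point_cloud(
        np.full((576, 640), 1500, dtype=np.uint16),
        fx=502.0, fy=502.0, cx=320.0, cy=288.0)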

Eye tracker

The TobiiTM device includes a first-person RGB camera, an eye movement sensor, and an IMU to collect information about the wearer’s eye position, providing stable and accurate eye-tracking data. Calibration was performed before the experiment. During the experiment, the device transmitted data via network sockets and stored the recordings in the master file at the end of the session. Collecting eye attention data from a first-person perspective can help robots better predict areas of human attention, which is important for automatically understanding human intentions and actions. Eye-tracking technology enhances robotic grasping performance by enabling robots to identify areas of human attention, allowing for more natural and precise manipulation tasks38.

Electromyography

The DelsysTM EMG sensor measures muscle activity. The double-differential surface electrodes record the EMG signal at 2 kHz, and each sensor also includes a built-in 3-axis accelerometer, gyroscope, and magnetometer. Data from all channels are transmitted wirelessly and accessed through the Python API. The four forearm muscles selected on both the left and right arms are the flexor digitorum superficialis, the flexor carpi radialis, the extensor carpi radialis, and the extensor digitorum superficialis. For the circular sensor array, eight equidistant positions were selected on the left hand (positions 1–8), and the right-hand sensors were placed at positions 9–16 (Fig. 3). Myoelectric control recognizes human movement intentions through surface electromyographic signals, enabling human-computer interaction. EMG signals have been extensively used in human-robot interaction (HRI) systems to decode motion intention, improve robotic manipulation, and facilitate tasks such as prosthesis control and rehabilitation training39.

Fig. 3 Wearable configuration of electromyography. The numbers (1–16) correspond to the rows in the associated dataset’s Excel file, where each row contains data collected from the respective numbered sensor shown in the figure.
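For intent-decoding work with the 16-channel, 2 kHz stream described above, a moving-RMS envelope is a common first processing step. The sketch below is illustrative only: the window length and the synthetic input array are assumptions, and the real channels would be read from the released recordings.

    # Minimal sketch: per-channel moving-RMS envelope of a
    # (channels, samples) EMG array sampled at the reported 2 kHz.
    import numpy as np

    FS = 2000  # EMG sampling rate (Hz), as reported for the Delsys sensors

    def moving_rms(emg, win_ms=100.0):
        """Moving RMS envelope; window length (ms) is an assumption."""
        win = max(1, int(FS * win_ms / 1000.0))
        squared = emg.astype(np.float64) ** 2
        kernel = np.ones(win) / win
        return np.sqrt(np.apply_along_axis(
            lambda ch: np.convolve(ch, kernel, mode="same"), 1, squared))

    emg = np.random.randn(16, 4 * FS)   # stand-in for 4 s of recorded data
    envelope = moving_rms(emg)          # shape (16, 8000)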

Haptic and motion integrated data gloves

The Wise-gloveTM integrates finger and upper-limb motion capture with grip force acquisition. Nineteen fiber-optic angle sensors distributed on the back of the fingers measure the hand’s movement posture in real time (Fig. 4), while 19 miniature grip force sensors distributed across the palm and the undersides of the fingers measure the grip force of each functional area of the hand in real time. Position sensors on the arm and hand joints capture arm movement in real time. Collecting the angle changes and trajectories of the human hand during actions helps a robot’s multi-fingered hand mimic human movement patterns, and the haptic data record the forces humans exert during tasks, helping robots adjust their contact forces to avoid damaging objects or causing grip slippage. Systems such as the MimicTouch framework demonstrate the usefulness of haptic data for improving robotic manipulation in contact-rich tasks, enabling robots to better replicate human-like tactile-guided control strategies40.

Fig. 4 Glove sensor distribution. (a) Angle sensors. (b) Force sensors.
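A minimal sketch of summarising one hand's 19 force channels follows. The CSV path and the "force_*" column naming are hypothetical; the released files should be inspected for the exact header and layout.

    # Minimal sketch: load one hand's glove recording and summarise the
    # force channels. Path and column names are assumptions.
    import pandas as pd

    df = pd.read_csv("sub1/C6/glove_right.csv")        # hypothetical path
    force_cols = [c for c in df.columns if c.lower().startswith("force")]
    print(df[force_cols].describe().loc[["mean", "50%", "max"]])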

Microphones

Experiments used portable and robust MOMATM microphones, placed according to the likely locations of sound sources during kitchen operations. Microphones were placed above and below the refrigerator (positions 1 and 2), to the left and right of the kitchen countertop (positions 3 and 4), above and below the cupboard (positions 5 and 6), and clipped to the left and right hands (positions 7 and 8) (Fig. 5). The microphones help capture the details of the operator’s movements; for example, by analyzing sound-source information, the system can identify specific actions such as opening the fridge, chopping vegetables, or putting down items. Microphones can enhance robotic manipulation performance, particularly in occluded scenarios, by enabling robots to use audio cues for tasks such as locating objects and detecting contact events41.

Fig. 5 Microphone installation. Four pairs of microphones are placed near the refrigerator, hands, cupboard, and table. This image is best viewed in colour.
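One simple way to exploit such audio for contact-event detection is short-time energy thresholding. The sketch below is a hedged illustration: the WAV path, 16-bit PCM assumption, frame length, and threshold are all illustrative choices, not part of the released tooling.

    # Minimal sketch: flag candidate contact events (e.g., an item set
    # down) in one microphone channel via short-time energy.
    import numpy as np
    from scipy.io import wavfile

    fs, audio = wavfile.read("sub1/C1/mic_4.wav")   # hypothetical path
    if audio.ndim > 1:                              # keep a single channel
        audio = audio[:, 0]
    audio = audio.astype(np.float64) / np.iinfo(np.int16).max  # assumes 16-bit PCM

    win = int(0.02 * fs)                            # 20 ms frames
    frames = audio[: len(audio) // win * win].reshape(-1, win)
    energy = (frames ** 2).mean(axis=1)

    threshold = energy.mean() + 3 * energy.std()    # illustrative threshold
    event_frames = np.flatnonzero(energy > threshold)
    print("candidate contact events (s):", event_frames * win / fs)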

Synchronization

To ensure the accuracy and usability of the collected sensor data, each volunteer underwent calibration of the motion capture system and eye tracker before the experiment. The person in charge guided the start of each recording, which included glove pose calibration. Customized data acquisition software, developed in Python on top of the devices’ SDK libraries, synchronized the initial and final moments of each recording.
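Conceptually, streams recorded on different device clocks can be aligned after the fact by nearest-timestamp matching. The sketch below illustrates this idea under assumed rates (a 30 fps camera and the 2 kHz EMG); it is not the acquisition software itself.

    # Minimal sketch: for each reference timestamp, find the index of
    # the nearest sample in another stream (both sorted ascending).
    import numpy as np

    def align_nearest(t_ref, t_other):
        """Indices into t_other of the samples closest to each t_ref."""
        idx = np.searchsorted(t_other, t_ref)
        idx = np.clip(idx, 1, len(t_other) - 1)
        left_closer = (t_ref - t_other[idx - 1]) < (t_other[idx] - t_ref)
        return idx - left_closer.astype(int)

    t_video = np.arange(0.0, 10.0, 1 / 30)    # assumed 30 fps camera clock
    t_emg = np.arange(0.0, 10.0, 1 / 2000)    # 2 kHz EMG clock
    emg_index_per_frame = align_nearest(t_video, t_emg)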

Data annotation

During the kitchen activity, a human expert segmented the actions into fine-grained primitives by observing the RGB video stream (Fig. 6). The expert also determined the start and end frames of each action and mapped these events to the other sensors involved (such as the data gloves and EMG devices). Grasping actions are annotated at a fine-grained level. Table 2 presents a comprehensive list of fine-grained actions, with the corresponding annotations detailed in the accompanying annotation file.

Fig. 6 Two-hand segmentation of the chopping-board task. Top: motion visualization of the chopping-board task. Bottom: segmentation trajectories for both hands (left and right). The motions include rest, approach, catch, move, put, and pend, plus additional motions before the grasping task and after carrying and placing.

Hand grasping was annotated by observing the RGB video alongside the data glove’s 3D model stream. The frame in which the subject’s hand made contact with the object was identified as the start of the grasp. Grasp events identified in the action camera video were mapped to the corresponding depth and inertial motion data frames. Each grasp’s start and end frames were recorded separately, and their time spans were mapped to the EMG data acquisition table.
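The frame-to-sample mapping described above reduces to a simple rate conversion. The sketch below assumes a 30 fps reference video (the actual rate should be read from the recordings) and the 2 kHz EMG rate; the example segment values are hypothetical.

    # Minimal sketch: project an annotated video-frame span onto the
    # EMG sample axis via the two sampling rates.
    VIDEO_FPS = 30.0   # assumed reference video rate
    EMG_FS = 2000.0    # EMG sampling rate reported in Methods

    def frames_to_samples(start_frame, end_frame,
                          fps=VIDEO_FPS, fs=EMG_FS):
        """Map an inclusive video-frame span to the matching sample span."""
        t0, t1 = start_frame / fps, (end_frame + 1) / fps
        return int(t0 * fs), int(t1 * fs)

    s0, s1 = frames_to_samples(142, 305)  # a hypothetical 'grasp' event
    print(f"EMG samples {s0}..{s1}")      # -> EMG samples 9466..20400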

Given the original KinectTM RGB video, annotators manually labeled the moving objects and the 2D masks of the manipulated objects at a fixed sampling frequency (Fig. 7). Eye-movement annotation outlines the area-of-interest (AOI) region of the target object (Fig. 8).

Fig. 7 Semantic segmentation in everyday operations.

Fig. 8 Areas of observational interest (AOI) in expert-labeled human operations.

Ethical approval

All shared data were completely anonymized. All participants signed informed consent for data and media acquisition and for public release. Ethical approval was provided by the Ethical Committee of Tongji University (approval no. tjdxsr055).

Data Records

The overall structure of the dataset is presented in Fig. 9; further details of the field contents are described below. Kitchen activity data for each subject can be found in the folders [sub1-sub20], where the field encodes the subject number (sub1, sub2, sub3, … for the 20 subjects). Within each sensor’s folder, the sub-folders [C1-C17] correspond to the 17 objects in Table 2. Each experiment was repeated twice to account for spatiotemporal variations in human behavior at different operational locations.

Fig. 9 Structure of the provided dataset. The corresponding data have been made available and are accessible through ScienceDB43 (https://doi.org/10.57760/sciencedb.13080).

Under each subject’s folder are the Delsys, Wise Glove, Nokov, Tobii, Kinect, and MOMA sensor data, each with recorded timestamps.

The DelsysTM folder contains accelerometer and EMG data for the 16 EMG channels, with corresponding timestamps, split by action.

The glove folder contains two CSV files, one each for the left and right hands, including data from 20 position sensors and 19 force sensors, as well as a pre-processed video of the exported hand model.

The NokovTM folder contains TRB and CAP files, which can be opened with the official software and exported as TRC files, as well as exported common 3D files. A TRC file can be opened with software such as Microsoft Excel and contains the frame number, time, timestamp, and the XYZ coordinates of each marker.
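Since TRC is a tab-separated text format whose first rows hold metadata and marker names, it can also be read programmatically. The sketch below is a hedged example: the number of header rows varies between exporters and is left as a parameter, and the file path is hypothetical.

    # Minimal sketch: read an exported TRC file into a DataFrame.
    import pandas as pd

    def load_trc(path, header_rows=5):
        """Return rows of frame number, time, and per-marker X/Y/Z columns."""
        return pd.read_csv(path, sep="\t", skiprows=header_rows, header=None)

    trc = load_trc("sub1/C1/nokov_export.trc")   # hypothetical path
    frame_ids, times = trc.iloc[:, 0], trc.iloc[:, 1]
    marker_xyz = trc.iloc[:, 2:]                 # 53 markers -> 159 columns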

The TobiiTM folder contains the raw eye tracker recordings: pupil and gaze data, IMU data, G3 video with gaze points, and scene video without gaze points. It also includes the post-processed labeled AOIs and CSV files exported from Tobii Pro Lab.

The KinectTM folder includes depth data from multiple viewpoints and RGB images, as well as the generated masks and corresponding segmentation JSON files.

The voice file includes the timestamps and corresponding audio recordings.
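Putting the layout together, the sketch below enumerates the released folder structure as described above. The local root and the per-device folder names are assumptions based on this description and should be checked against the actual download.

    # Minimal sketch: walk subjects (sub1..sub20) and tasks (C1..C17)
    # and list the per-device recording folders found in each.
    from pathlib import Path

    ROOT = Path("kaiwu_dataset")                 # hypothetical local root

    for subject in sorted(ROOT.glob("sub*")):
        for task in sorted(subject.glob("C*")):
            devices = [p.name for p in task.iterdir() if p.is_dir()]
            print(f"{subject.name}/{task.name}: {devices}")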

Technical Validation

To minimize sensor displacement, participants wore form-fitting attire during the experiment. The wearable sensors, including IMUs and an eye-tracker, were positioned following the manufacturer’s guidelines outlined in the Methods section. Additionally, the signal integrity of each sensor was manually checked using the acquisition software before each trial. To ensure high-quality eye tracker data, we performed the initial calibration phase before each participant began the experiment. Participants were instructed to perform a defined task in a kitchen setup, which was recorded using a high-resolution eye tracker. The data collected included types of eye movements and their respective durations, along with the spatial distribution of fixations across the visual field.
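For readers who want to reclassify the gaze stream themselves rather than rely on the Tobii Pro Lab export, dispersion-threshold identification (I-DT) is one common fixation/saccade separation method. The sketch below is a generic I-DT implementation, not the pipeline used here; the dispersion and duration thresholds and the input format are illustrative assumptions.

    # Minimal sketch: dispersion-threshold (I-DT) fixation detection.
    import numpy as np

    def idt_fixations(t, gaze, max_disp=0.02, min_dur=0.10):
        """Return (start_time, end_time) fixation spans.

        t        : (N,) timestamps in seconds, sorted ascending
        gaze     : (N, 2) normalised gaze coordinates
        max_disp : dispersion threshold ((max-min) over x plus y)
        min_dur  : minimum fixation duration in seconds
        """
        fixations, i, n = [], 0, len(t)
        while i < n:
            j = np.searchsorted(t, t[i] + min_dur)   # minimum-duration window
            if j >= n:
                break
            window = gaze[i:j + 1]
            disp = np.ptp(window[:, 0]) + np.ptp(window[:, 1])
            if disp <= max_disp:
                # grow the window while dispersion stays under threshold
                while j + 1 < n:
                    window = gaze[i:j + 2]
                    if np.ptp(window[:, 0]) + np.ptp(window[:, 1]) > max_disp:
                        break
                    j += 1
                fixations.append((t[i], t[j]))
                i = j + 1
            else:
                i += 1
        return fixations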

The eye movement trajectory of a person grasping a chopping board is recorded, and the AOIs are labeled by an expert (Fig. 10(a,b)).

Fig. 10 Eye-tracking data. (a) Eye gaze trajectory. (b) Observed areas of interest. (c) Types of eye gaze. (d) Heat map of eye gaze. The figure shows the attentional line of sight of subjects wearing the eye-tracking device.

Fixation durations are significantly longer than saccade durations, suggesting deep visual processing at specific points of interest; saccades show shorter durations, indicative of rapid gaze shifts (Fig. 10(c)).

Unspecified movements were minimal, indicating clean data collection with few artifacts such as blinks or sensor errors.

The heatmap shows that when 20 participants completed task C1, their attention was highly concentrated in the middle-right area of the visual field. This area corresponds to where participants operate the cutting board (Fig. 10(d)). This concentration indicates that most participants focus on the object while operating it, and the data are consistent with expected behavior.

Table 3 lists the eye-tracking features computed over the AOIs, detailed as follows.

Table 3 AOI-related features.

The heatmap illustrates the proportion of time spent by 20 volunteers performing various action categories across 17 tasks (C1–C17). The vertical axis represents the tasks (C1–C17), while the horizontal axis denotes the action categories, such as Approach, Grab, and Cut. The color gradient ranges from yellow (low proportion) to dark blue (high proportion), corresponding to the normalized values between 0.0 and 0.4 (Fig. 11). Actions such as Grab and Approach are widely distributed across multiple tasks, as evidenced by the darker color along several rows. This suggests that these movements serve as fundamental components in a wide variety of tasks. Conversely, actions like Spray or Roll occur infrequently and are restricted to a limited number of tasks. This pattern suggests that tasks are influenced by the availability of specific kitchen tools or objects. Certain tasks, such as C10, exhibit a high concentration of a specific action, represented by a prominent dark blue cell. This indicates that the corresponding action is dominant in the completion of that task. The heatmap thus highlights the variability in action distribution across tasks, offering insights into task composition and action relevance in different scenarios.

Fig. 11 Action-task correlation.
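The normalised proportions behind a heatmap like Fig. 11 can be derived directly from a table of annotated segments. The sketch below uses a synthetic stand-in table; the column names are assumptions about the annotation export, not its documented schema.

    # Minimal sketch: build a task x action time-proportion matrix
    # (rows normalised to sum to 1) from annotated segments.
    import pandas as pd

    seg = pd.DataFrame({                    # stand-in for the annotation file
        "task":     ["C1", "C1", "C1", "C10"],
        "action":   ["Approach", "Grab", "Cut", "Grab"],
        "duration": [0.8, 1.2, 5.5, 1.0],   # seconds
    })

    totals = seg.pivot_table(index="task", columns="action",
                             values="duration", aggfunc="sum", fill_value=0)
    proportions = totals.div(totals.sum(axis=1), axis=0)
    print(proportions.round(2))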

The annotated segments cover a variety of common kitchen actions, such as Approach, Take, and Put Down, separated by left- and right-hand use (Fig. 12(a)). Completion times were recorded for the 20 participants (sub1-sub20) performing the C17 task (Fig. 12(b)).

Fig. 12 Action segmentation. (a) Left- and right-hand movement segmentation statistics. (b) Sub1 C17 action segmentation. Right- and left-hand time histograms for all actions, with left- and right-hand box plots for the C17 task.

Certain movements, such as Grab and Lift Up, show large temporal differences between the right and left hands, pointing to different demands on coordination and strength control during their execution. The median time for the right hand is generally lower than that for the left hand in most movements, which may reflect the right-handed dominance of most participants. Movements that are complex or require fine manipulation (e.g., Unscrew and Tighten) showed greater time variance and more outliers, implying that execution is affected by factors such as differences in individual skills and strategies. Such data can help optimize the design and programming of robot arms in robots or automated systems involving fine hand manipulation.

The effects of various experimental factors on force measurement distribution are analyzed to ensure the data’s reliability and real-world applicability. These factors include the specific joint or measurement point (since each corresponds to a distinct sensor), the type of hand movement, the individual subject, and the repetition of the movement. This analysis is conducted separately for each hand (Fig. 13). The resulting box plots offer a detailed view of the force distribution across 19 measurement points on both the right and left hands. The variability observed across different sensors underscores the importance of considering individual differences in muscle strength and hand dynamics when analyzing this data. This approach ensures that the dataset comprehensively reflects the diverse factors at play.

Fig. 13 Box plots of the glove force data from 20 participants performing the C6 task.
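A per-sensor box plot of this kind is straightforward to reproduce from the glove force channels. The sketch below uses synthetic stand-in data to show the plotting pattern; real values would be loaded from the released CSV files, whose layout is an assumption here.

    # Minimal sketch: per-sensor grip-force box plot (cf. Fig. 13),
    # drawn from synthetic stand-in data for 19 force sensors.
    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    forces = [np.abs(rng.normal(0.5, 0.3, 500)) for _ in range(19)]

    fig, ax = plt.subplots(figsize=(10, 4))
    ax.boxplot(forces, showfliers=True)
    ax.set_xlabel("glove force sensor (1-19)")
    ax.set_ylabel("force (N)")
    ax.set_title("Per-sensor grip force distribution, right hand (sketch)")
    plt.tight_layout()
    plt.show()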

For most tasks, the median forces are consistently below 1 Newton, with a few outliers indicating occasional higher force application. The interquartile range is narrow for most tasks, suggesting that the applied forces are relatively consistent.

The data indicate a clear distinction in the roles of each hand during tasks: the right hand often performs precision work requiring less force, while the left hand engages in more strength-oriented tasks. For robotic systems, especially those designed for dual-handed operations, programming should incorporate adaptive force modulation to mimic the observed asymmetrical force distribution. This can enhance a robot’s ability to perform complex, multi-component tasks effectively. Multimodal sensory integration in robots, such as combining tactile feedback, force sensors, and visual input, could be optimized based on these human force application patterns to improve robotic precision and efficiency in similar tasks.

To validate the applicability of our dataset for enhancing embodied intelligence in robotic systems, we conduct an in-depth analysis of joint angle variability. The distribution of joint angles across different segments of both hands when performing the same task is analyzed (Fig. 14). Specifically, the thumb joints display lower median angles and less variability, while the index, middle, ring, and little fingers exhibit higher median angles. The middle and ring fingers show exceptionally high angles, with varying degrees of interquartile range (IQR) that indicate different levels of movement variability. The metacarpophalangeal (MCP) joint has a high median angle with significant variability, underscoring its crucial role in hand movements for the task.

Fig. 14 The boxes show the range between the first quartile (25th percentile) and the third quartile (75th percentile); the extent of each box indicates the spread of the middle 50% of the data.

The analysis targets the mechanical range of motion captured by sensors located at strategic points, including the base of the thumb, the fingertips, and the knuckles. The variability in joint angles (Fig. 14) reflects the natural range of motion of the human hand, which is essential for designing robots capable of mimicking complex hand movements. For example, segments like middle1 and ring1 exhibit a wide range of motion, highlighting their importance in grasping and manipulating objects, a critical aspect of robot design for achieving human-like dexterity.

Based on whether the palm is involved, the degree of thumb involvement, and the degree of involvement of the side of the finger42, we classify nine grasping methods for the experts to annotate (Table 4).

Table 4 Hand actions.

Across tasks C1 through C17, the Lumbrical grasp occurs most frequently, as each action typically begins by picking up the object, most often with a finger-contact grasp (Fig. 15). Tasks C11 and C12, which involve squeezing detergent and spraying alcohol, necessarily elicit the Cyl (cylindrical) grasp, consistent with common sense.

Fig. 15 Heat maps of the nine grips for the left and right hands.

These data, along with their meticulous analysis, significantly advance the development of advanced robotic systems. They are instrumental in fostering embodied intelligence that mirrors human-like agility and precision. This ensures that the dataset is not only robust and reliable but also accurately represents human behavior in real-world scenarios.