Background & Summary

The development of large models, including language models, Vision Language Models (VLMs)1,2, and vision neural networks3, has transformed human-robot interaction, allowing robots to perform tasks in a more natural manner4. However, a significant limitation remains: the lack of integrated multimodal data restricts robots’ capacity to perform fine manipulation tasks that require precise, closed-loop feedback and delicate hand-eye coordination, such as opening a condiment cup or unscrewing a can5. These tasks depend on haptics to regulate force, on a mix of sensory inputs such as visual attention and hand movements, and on the ability to adjust and re-plan in real time in response to small changes in environmental images and sounds.

Recognizing the limitations of unimodal data, it becomes evident that many existing datasets, initially gathered through unimodal means, are insufficient for the evolving needs of these models. In previous research, human grasping data were typically captured using RGB cameras6.

Unimodal grasping

For instance, Sun et al.7 provided a taxonomy of multi-object grasping, but their dataset lacks fine-grained motion capture and multimodal integration, limiting its utility for tasks requiring precise bimanual manipulation analysis. Bullock et al.8 proposed a dataset focused on scene grasping in home and machining environments. Brahmbhatt et al.9,10 collected images of grasped household objects and modeled their textures using 3D printing. While capturing the textures of 3D-printed objects is straightforward, noticeable differences remain between these and the textures of real objects. To address this problem, Dreher et al.11 proposed extracting symbolic spatial object relations from raw RGB-D video data captured from robotic viewpoints. This method constructs a graph-based scene representation and trains a graph network classifier on these representations and action labels to learn object-action relationships. Hasson et al.12 focused on reconstructing hand and object poses from vision data; however, their dataset does not include physical interaction data, such as forces and torques. Taheri et al.13 captured the complete grasping action of a person with an object, including detailed 3D body shape and pose, object shape, and body-object contact, without using RGB images. Nicora et al.14 provided motion-capture data and multi-view video sequences from a cooking scene to explore perspective-invariant action properties. Aganian et al.15 presented a dataset encompassing 95.2k annotated fine-grained actions, recorded over 51.6 hours by three cameras representing the potential viewpoints of a cooperating robot. This dataset uniquely annotates the actions performed by each hand separately, reflecting the simultaneous use of both hands in assembly tasks. Lastly, Fan et al.16 created a dataset in which two hands manipulate articulated objects, such as scissors or laptops, emphasizing the dexterity required to handle such items.

Multimodal grasping

Visual data alone struggles to provide sufficient information for embodied intelligence, so recent research has concentrated on constructing multimodal human datasets. The richness of multimodal information improves understanding of human grasping behavior, supports downstream cross-modal tasks, and aids the development of large-scale multimodal models. Multimodal data can also facilitate fine-grained distinctions, such as using vision, audio, and force to distinguish the material properties of food products17. These datasets are useful for addressing specific tasks. Wang et al.18 provide a comprehensive dataset integrating visual and tactile data from robotic grasping experiments, enabling researchers to analyze grasping processes and detect slip events. Han et al.19 provide a multimodal resource designed to model human grasp intent for prosthetic hands, incorporating paired eye-view and hand-view images, EMG signals, and IMU data to facilitate research on intent inference and motion planning through multimodal integration. Lastrico et al.20 propose a pick-and-place-focused dataset recorded with multiple cameras, motion capture systems, and wrist-worn inertial measurement units that observe movements from various angles.

Hand manipulation

Some grasping datasets focus on hand-object interaction, usually bimodal and based on RGB-D cameras. Rohrbach et al.21 introduced a fine-grained cooking activity dataset that includes 65 activities recorded in realistic kitchen environments. Huang et al.22 introduced a comprehensive dataset focusing on daily interactive manipulation, capturing 3D position, orientation, force, and torque data of objects during fine manipulation tasks. Garcia et al.23 proposed a first-person model of hand-object interaction using only a 3D model of the hand. Hampali et al.24 provided 3D annotations for hand and object poses but did not integrate dynamic temporal data or multimodal signals. Chao et al.25 collected a grasping dataset for three downstream tasks: 2D detection, 6D object pose estimation, and 3D hand posture estimation; however, the study lacked task orientation and theoretical depth for real-world applications. Alia et al.26 introduced a dataset combining macro- and micro-level activity recognition during cooking sessions. Pereira et al.27 introduced a novel dataset of 2,866 food-flipping movements involving different foods and utensils, capturing 3D trajectories, forces, torques, and subject gaze, providing valuable data for advancing robotics, human activity recognition, and bio-inspired control systems. Elangovan et al.28 collected a dataset of object grasping and manipulation in a kitchen environment, but the sample size was small, the statistical analysis was narrow and lacked depth, and the data required better hand-object modeling. Mastinu et al.29 proposed a first-person capture of detailed grasping data based on radar and time-of-flight sensors, including hand kinematics and proximity vision data during grasping actions. Zhao et al.30 proposed a dataset that captures bimanual manipulation using low-cost hardware; however, it focuses on a narrow set of tasks and lacks emphasis on multimodal fusion.

Whole-body manipulation

Manipulation involves not only the hand but the entire body, and objects often contact multiple body parts. However, existing datasets fail to capture the complexity of these interactions. Human grasping is a whole-body movement, and body posture varies across grasp types. Some studies therefore collect hand-specific grasping details alongside whole-body movements. Krebs et al.31 created a motion-capture database of whole-body furniture manipulation tasks; however, it covers only a single, generic scene. Similarly, Liu et al.32 proposed a first-person, two-handed interaction dataset for object manipulation that combines camera and depth data, but the scene arrangement is overly homogeneous and the dataset lacks rich semantic information. Kwon et al.33 built a first-person RGB-D dataset of hand-object manipulation for first-person interaction recognition; it provides 3D poses of both hands and the manipulated objects, shape-rich annotations, and detailed information such as action labels, camera poses, scene point clouds, and object meshes. DelPreto et al.34 proposed a multimodal dataset for kitchen tasks; however, it focuses on wearable sensors and lacks fine-grained annotations.

Therefore, existing datasets are often limited by their homogeneity in modality, primarily focusing on single visual modalities and lacking motion mechanics information. This limitation restricts their applicability to general artificial intelligence or autonomous learning in unstructured environments35. Additionally, these datasets lack a unified sensor collection framework, leading to inconsistencies in data quality and type across studies. Furthermore, the reproducibility of these datasets is limited by the specialized nature of their collection environments and conditions. Most datasets are also narrowly focused on downstream tasks, with insufficient integration and synthesis of the collected data, limiting the depth of insights that can be drawn from them. Many datasets cover only a narrow set of tasks or lack the rich multimodal data essential for developing more advanced, large-scale models. In contrast, our dataset, the Kaiwu kitchen dataset, covers a broader range of tasks and incorporates multiple modalities, such as tactile, visual, and auditory data.

This paper proposes a multimodal data collection strategy within real-world kitchen environments. By establishing a highly centralized framework that integrates the existing equipment in the Kaiwu Lab, this approach aims to enhance the data’s diversity, utility, and reproducibility, facilitating more comprehensive and robust human-robot interaction and embodied intelligence research. The contributions of this paper are as follows:

Whole-body multimodal data

This dataset covers the critical sensing modalities closely related to the manipulation learning process, including multi-view data of human activities, hand muscle activity, first-person vision, eye-tracking data, scene audio during kitchen tasks, hand kinematics for precise actions, and dynamic full-body motion capture data, surpassing the other datasets listed (Table 1). It provides a comprehensive understanding of human actions in kitchen environments, supplying detailed data for robots to acquire multimodal skills.

Table 1 Comparison of the proposed dataset with existing datasets on hand interaction.

Fine-grained data labeling

In this study, 680 regions of interest were labeled for first-person attention data, and 536,467 objects were annotated in multi-view video data. Additionally, 14,511 motion segmentation events and 4,254 gesture segmentation events for left- and right-hand fine operations were meticulously recorded, along with multimodal myoelectric data and cross-modal synchronization between movements and gestures. This expert annotation enhances the dataset’s utility for complex analyses, significantly advancing research in human-robot interaction and embodied intelligence.

Methods

In this section, we highlight the main pipelines of our data collection. First, the design of the study and the setup, including its equipment and technical characteristics, are described. Next, the preprocessing steps applied to the data for cleaning and synchronization are discussed.

Design

The selected activities and use cases in our dataset provide essential insights into embodied intelligence, enhancing robotic design through an understanding of real-world interactions. To broaden this type of dataset, we recorded activities related to each manipulated object. Each object has three primary tasks: Grasp, Carry, and Place. In addition to the primary tasks, manipulation tasks (A1 and A2) are introduced to highlight object-specific functionality. For example, tasks such as cutting fruit, stirring ingredients, or rolling dough illustrate the specific manipulations enabled by different tools. To account for temporal and spatial variations and differences among individuals, we engaged 20 volunteers to perform diverse whole-body movements in an unstructured kitchen setting. Each movement was performed twice to ensure consistency. To explore human interactions with kitchen objects more deeply, we designed 17 object manipulations involving common kitchen items, using a refrigerator cabinet as the scene. These tasks include single- and multi-object grasping and single-handed manipulations, detailed in Table 2. The recordings capture primary tasks and manipulation tasks when the object is the primary focus. We developed 30 task scenarios from these 17 general action categories, initially defined coarsely and later annotated with the fine-grained details necessary for task completion.

Table 2 Breakdown of long-sequence task movements.

The choice of a kitchen setting for our data collection is particularly significant because kitchens are environments rich in varied and complex human-object interactions. They involve frequent manipulation of everyday objects, requiring fine and gross motor skills essential for understanding and replicating embodied intelligence in robotics. The kitchen is a natural setting where people perform various activities, such as cooking, cleaning, and organizing, which involve manipulating objects of different sizes, shapes, and weights. This variety makes the kitchen an ideal environment to study how humans interact with objects in a real-world context, providing critical insights that can be generalized to other settings.

Human interaction with everyday objects involves three-dimensional movements of the entire body. For example, retrieving vegetables from a low fridge requires crouching. Accordingly, a series of full-body grasping movements for kitchen tasks has been established. The terms Place and Put back describe distinct actions: Place involves setting an item on a counter or table, whereas Put back refers to returning it to its original location, such as placing fruit back into the refrigerator’s crisper. The dataset includes 17 objects, allowing volunteers to perform various actions freely, thus ensuring their autonomy. These actions span single and multiple-object manipulations and one- and two-handed tasks.

To record everyday activities in the kitchen (Fig. 1), a wearable sensor suite equipped with environmental sensors was used. These wearable devices can capture detailed human movements from different perspectives. The suite is scalable to various work environments, such as conference and assembly rooms. The multiple views provide additional information, mitigating issues caused by occlusions. However, the motion capture suit may impede some of the volunteers’ movements during the tasks, while the data glove may hinder their dexterity.

Fig. 1 Data capture system. The project focuses on a human-centered perspective, capturing multimodal information from wearable and external devices in a kitchen scenario.

Equipment

Optical motion capture system

The optical motion capture system captures the details of full-body movement in the kitchen with high accuracy. Participants were instructed to wear a motion capture suit with 53 markers before the experiment began; the model is shown in Fig. 2. The NOKOVTM optical 3D motion capture system uses 20 motion-capture cameras to model the human body by identifying the markers on the suit worn by volunteers, enabling accurate capture and estimation of 3D spatial coordinates (x, y, z) and orientation over time. Prior to the experiment, each volunteer was required to perform a cross-motion calibration. Additionally, marker positions on the suit were adjusted to account for height differences among subjects. Collecting a large number of human motion-capture demonstrations of manipulation skills can greatly improve a robot’s operating ability36.

Fig. 2 Human body 53-marker schematic. (a) Human model. (b) Motion capture model. Volunteers wear the motion capture suit with attached markers, whose positions are adjusted to each volunteer’s body type; the system recognizes the model in real time.

Multiview cameras

Three multi-view KinectTM cameras were arranged at the experimental site. To improve coverage of the experimenter’s elbow movements, back movements, and partially occluded objects at the sides, the cameras were positioned to the front-left, front-right, and behind the experimenter. Videos were recorded at 1920 × 1080 resolution to obtain RGB and depth data, which were used to generate point cloud data in post-processing. Such multi-view RGB-D setups have been shown to enhance robotic skill acquisition, for example by improving object manipulation and grasp planning in cluttered environments37.
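As a concrete illustration of the point cloud post-processing mentioned above, the sketch below back-projects a depth frame through the pinhole model. This is a minimal sketch, not the authors' pipeline: the intrinsics, depth resolution, and synthetic input frame are placeholder assumptions, and real values should come from each camera's calibration.

    # Minimal sketch: back-project a depth frame (millimetres) into a
    # camera-frame point cloud. Intrinsics below are placeholders.
    import numpy as np

    def depth_to_point_cloud(depth_mm, fx, fy, cx, cy):
        """Convert an HxW depth image (mm) to an Nx3 XYZ array (metres)."""
        h, w = depth_mm.shape
        u, v = np.meshgrid(np.arange(w), np.arange(h))
        z = depth_mm.astype(np.float64) / 1000.0      # mm -> m
        x = (u - cx) * z / fx
        y = (v - cy) * z / fy
        points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
        return points[points[:, 2] > 0]               # drop invalid pixels

    # Synthetic stand-in frame; real frames come from the Kinect recordings.
    cloud = depth_to_point_cloud(
        np.full((576, 640), 1500, dtype=np.uint16),
        fx=502.0, fy=502.0, cx=320.0, cy=288.0)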

Eye tracker

The TobiiTM device includes a first-person RGB camera, an eye movement sensor, and an IMU to collect information about the wearer’s eye position, providing stable and accurate eye-tracking data. Calibration was performed before the experiment. During the experiment, the device transmitted data via network sockets and stored the recordings in the master file at the end of the session. Collecting eye attention data from a first-person perspective can help robots better predict areas of human attention, which is important for automatically understanding human intentions and actions. Eye-tracking technology enhances robotic grasping performance by enabling robots to identify areas of human attention, allowing for more natural and precise manipulation tasks38.

Electromyography

The DelsysTM EMG sensor measures muscle activity. The double-differential surface electrodes record the EMG signal at 2 kHz, and each sensor also includes a built-in 3-axis accelerometer, gyroscope, and magnetometer. Data from all channels are transmitted wirelessly and accessed through the Python API. The four forearm muscles selected on both the left and right arms are the flexor digitorum superficialis, the flexor carpi radialis, the extensor carpi radialis, and the extensor digitorum superficialis. For the circular sensor array, eight equidistant positions were selected on the left hand (positions 1–8), and the right-hand sensors were placed at positions 9–16 (Fig. 3). Myoelectric control recognizes human movement intentions through surface electromyographic signals, enabling human-computer interaction. EMG signals have been extensively used in human-robot interaction (HRI) systems to decode motion intention, improve robotic manipulation, and facilitate tasks such as prosthesis control and rehabilitation training39.

Fig. 3 Wearable configuration of electromyography. The numbers (1–16) correspond to the rows in the associated dataset’s Excel file, where each row contains data collected from the respective numbered sensor shown in the figure.
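For intent-decoding work with the 16-channel, 2 kHz stream described above, a moving-RMS envelope is a common first processing step. The sketch below is illustrative only: the window length and the synthetic input array are assumptions, and the real channels would be read from the released recordings.

    # Minimal sketch: per-channel moving-RMS envelope of a
    # (channels, samples) EMG array sampled at the reported 2 kHz.
    import numpy as np

    FS = 2000  # EMG sampling rate (Hz), as reported for the Delsys sensors

    def moving_rms(emg, win_ms=100.0):
        """Moving RMS envelope; window length (ms) is an assumption."""
        win = max(1, int(FS * win_ms / 1000.0))
        squared = emg.astype(np.float64) ** 2
        kernel = np.ones(win) / win
        return np.sqrt(np.apply_along_axis(
            lambda ch: np.convolve(ch, kernel, mode="same"), 1, squared))

    emg = np.random.randn(16, 4 * FS)   # stand-in for 4 s of recorded data
    envelope = moving_rms(emg)          # shape (16, 8000)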

Haptic and motion integrated data gloves

The Wise-gloveTM integrates finger and upper-limb motion capture with grip force acquisition. Nineteen fiber-optic angle sensors distributed on the back of the fingers measure the hand’s movement posture in real time (Fig. 4), while 19 miniature grip force sensors distributed across the palm and the undersides of the fingers measure the grip force of each functional area of the hand in real time. Position sensors on the arm and hand joints capture arm movement in real time. Collecting the angle changes and trajectories of the human hand during actions helps a robot’s multi-fingered hand mimic human movement patterns, and the haptic data record the forces humans exert during tasks, helping robots adjust their contact forces to avoid damaging objects or causing grip slippage. Systems such as the MimicTouch framework demonstrate the usefulness of haptic data for improving robotic manipulation in contact-rich tasks, enabling robots to better replicate human-like tactile-guided control strategies40.

Fig. 4 Glove sensor distribution. (a) Angle sensors. (b) Force sensors.
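A minimal sketch of summarising one hand's 19 force channels follows. The CSV path and the "force_*" column naming are hypothetical; the released files should be inspected for the exact header and layout.

    # Minimal sketch: load one hand's glove recording and summarise the
    # force channels. Path and column names are assumptions.
    import pandas as pd

    df = pd.read_csv("sub1/C6/glove_right.csv")        # hypothetical path
    force_cols = [c for c in df.columns if c.lower().startswith("force")]
    print(df[force_cols].describe().loc[["mean", "50%", "max"]])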

Microphones

Experiments used portable and robust MOMATM microphones, placed according to the likely locations of sound sources during kitchen operations. Microphones were placed above and below the refrigerator (positions 1 and 2), to the left and right of the kitchen countertop (positions 3 and 4), above and below the cupboard (positions 5 and 6), and clipped to the left and right hands (positions 7 and 8) (Fig. 5). The microphones help capture the details of the operator’s movements; for example, by analyzing sound-source information, the system can identify specific actions such as opening the fridge, chopping vegetables, or putting down items. Microphones can enhance robotic manipulation performance, particularly in occluded scenarios, by enabling robots to use audio cues for tasks such as locating objects and detecting contact events41.

Fig. 5 Microphone installation. Four pairs of microphones are placed near the refrigerator, hands, cupboard, and table. This image is best viewed in colour.
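One simple way to exploit such audio for contact-event detection is short-time energy thresholding. The sketch below is a hedged illustration: the WAV path, 16-bit PCM assumption, frame length, and threshold are all illustrative choices, not part of the released tooling.

    # Minimal sketch: flag candidate contact events (e.g., an item set
    # down) in one microphone channel via short-time energy.
    import numpy as np
    from scipy.io import wavfile

    fs, audio = wavfile.read("sub1/C1/mic_4.wav")   # hypothetical path
    if audio.ndim > 1:                              # keep a single channel
        audio = audio[:, 0]
    audio = audio.astype(np.float64) / np.iinfo(np.int16).max  # assumes 16-bit PCM

    win = int(0.02 * fs)                            # 20 ms frames
    frames = audio[: len(audio) // win * win].reshape(-1, win)
    energy = (frames ** 2).mean(axis=1)

    threshold = energy.mean() + 3 * energy.std()    # illustrative threshold
    event_frames = np.flatnonzero(energy > threshold)
    print("candidate contact events (s):", event_frames * win / fs)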

Synchronization

To ensure the accuracy and usability of the collected sensor data, each volunteer underwent calibration of the motion capture system and eye tracker before the experiment. The person in charge guided the start of each recording, which included glove pose calibration. Customized data acquisition software, developed in Python on top of the devices’ SDK libraries, synchronized the initial and final moments of each recording.
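Conceptually, streams recorded on different device clocks can be aligned after the fact by nearest-timestamp matching. The sketch below illustrates this idea under assumed rates (a 30 fps camera and the 2 kHz EMG); it is not the acquisition software itself.

    # Minimal sketch: for each reference timestamp, find the index of
    # the nearest sample in another stream (both sorted ascending).
    import numpy as np

    def align_nearest(t_ref, t_other):
        """Indices into t_other of the samples closest to each t_ref."""
        idx = np.searchsorted(t_other, t_ref)
        idx = np.clip(idx, 1, len(t_other) - 1)
        left_closer = (t_ref - t_other[idx - 1]) < (t_other[idx] - t_ref)
        return idx - left_closer.astype(int)

    t_video = np.arange(0.0, 10.0, 1 / 30)    # assumed 30 fps camera clock
    t_emg = np.arange(0.0, 10.0, 1 / 2000)    # 2 kHz EMG clock
    emg_index_per_frame = align_nearest(t_video, t_emg)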

Data annotation

During the kitchen activity, a human expert segmented the actions into fine-grained primitives by observing the RGB video stream (Fig. 6). The expert also determined the start and end frames of each action and mapped these events to the other sensors involved (such as the data gloves and EMG devices). Grasping actions are annotated at a fine-grained level. Table 2 presents a comprehensive list of fine-grained actions, with the corresponding annotations detailed in the accompanying annotation file.

Fig. 6 Two-hand segmentation of the chopping-board task. Top: motion visualization of the chopping-board task. Bottom: segmentation trajectories for both hands (left and right). The motions include rest, approach, catch, move, put, and pend, plus additional motions before the grasping task and after carrying and placing.

Hand grasping was annotated by observing the RGB video alongside the data glove’s 3D model stream. The frame in which the subject’s hand made contact with the object was identified as the start of the grasp. Grasp events identified in the action camera video were mapped to the corresponding depth and inertial motion data frames. Each grasp’s start and end frames were recorded separately, and their time spans were mapped to the EMG data acquisition table.
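The frame-to-sample mapping described above reduces to a simple rate conversion. The sketch below assumes a 30 fps reference video (the actual rate should be read from the recordings) and the 2 kHz EMG rate; the example segment values are hypothetical.

    # Minimal sketch: project an annotated video-frame span onto the
    # EMG sample axis via the two sampling rates.
    VIDEO_FPS = 30.0   # assumed reference video rate
    EMG_FS = 2000.0    # EMG sampling rate reported in Methods

    def frames_to_samples(start_frame, end_frame,
                          fps=VIDEO_FPS, fs=EMG_FS):
        """Map an inclusive video-frame span to the matching sample span."""
        t0, t1 = start_frame / fps, (end_frame + 1) / fps
        return int(t0 * fs), int(t1 * fs)

    s0, s1 = frames_to_samples(142, 305)  # a hypothetical 'grasp' event
    print(f"EMG samples {s0}..{s1}")      # -> EMG samples 9466..20400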

Given the original KinectTM RGB video, annotators manually labeled the moving objects and the 2D masks of the manipulated objects at a fixed sampling frequency (Fig. 7). Eye-movement annotation outlines the area-of-interest (AOI) region of the target object (Fig. 8).

Fig. 7 Semantic segmentation in everyday operations.

Fig. 8 Areas of observational interest (AOI) in expert-labeled human operations.

Ethical approval

All shared data were completely anonymized. All participants signed informed consent for data and media acquisition and for public release. Ethical approval was provided by the Ethical Committee of Tongji University (approval no. tjdxsr055).

Data Records

The overall structure of the dataset is presented in Fig. 9; further details of the field contents are described below. Kitchen activity data for each subject can be found in the folders [sub1-sub20], where the field encodes the subject number (sub1, sub2, sub3, … for the 20 subjects). Within each sensor’s folder, the sub-folders [C1-C17] correspond to the 17 objects in Table 2. Each experiment was repeated twice to account for spatiotemporal variations in human behavior at different operational locations.

Fig. 9 Structure of the provided dataset. The corresponding data have been made available and are accessible through ScienceDB43 (https://doi.org/10.57760/sciencedb.13080).

Under each subject’s folder are the Delsys, Wise Glove, Nokov, Tobii, Kinect, and MOMA sensor data, each with recorded timestamps.

The DelsysTM folder contains accelerometer and EMG data for the 16 EMG channels, with corresponding timestamps, split by action.

The glove folder contains two CSV files, one each for the left and right hands, including data from 20 position sensors and 19 force sensors, as well as a pre-processed video of the exported hand model.

The NokovTM folder contains TRB and CAP files, which can be opened with the official software and exported as TRC files, as well as exported common 3D files. A TRC file can be opened with software such as Microsoft Excel and contains the frame number, time, timestamp, and the XYZ coordinates of each marker.
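Since TRC is a tab-separated text format whose first rows hold metadata and marker names, it can also be read programmatically. The sketch below is a hedged example: the number of header rows varies between exporters and is left as a parameter, and the file path is hypothetical.

    # Minimal sketch: read an exported TRC file into a DataFrame.
    import pandas as pd

    def load_trc(path, header_rows=5):
        """Return rows of frame number, time, and per-marker X/Y/Z columns."""
        return pd.read_csv(path, sep="\t", skiprows=header_rows, header=None)

    trc = load_trc("sub1/C1/nokov_export.trc")   # hypothetical path
    frame_ids, times = trc.iloc[:, 0], trc.iloc[:, 1]
    marker_xyz = trc.iloc[:, 2:]                 # 53 markers -> 159 columns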

The TobiiTM folder contains the raw eye tracker recordings: pupil and gaze data, IMU data, G3 video with gaze points, and scene video without gaze points. It also includes the post-processed labeled AOIs and CSV files exported from Tobii Pro Lab.

The KinectTM folder includes depth data from multiple viewpoints and RGB images, as well as the generated masks and corresponding segmentation JSON files.

The voice file includes the timestamps and corresponding audio recordings.
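Putting the layout together, the sketch below enumerates the released folder structure as described above. The local root and the per-device folder names are assumptions based on this description and should be checked against the actual download.

    # Minimal sketch: walk subjects (sub1..sub20) and tasks (C1..C17)
    # and list the per-device recording folders found in each.
    from pathlib import Path

    ROOT = Path("kaiwu_dataset")                 # hypothetical local root

    for subject in sorted(ROOT.glob("sub*")):
        for task in sorted(subject.glob("C*")):
            devices = [p.name for p in task.iterdir() if p.is_dir()]
            print(f"{subject.name}/{task.name}: {devices}")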

Technical Validation

To minimize sensor displacement, participants wore form-fitting attire during the experiment. The wearable sensors, including IMUs and an eye-tracker, were positioned following the manufacturer’s guidelines outlined in the Methods section. Additionally, the signal integrity of each sensor was manually checked using the acquisition software before each trial. To ensure high-quality eye tracker data, we performed the initial calibration phase before each participant began the experiment. Participants were instructed to perform a defined task in a kitchen setup, which was recorded using a high-resolution eye tracker. The data collected included types of eye movements and their respective durations, along with the spatial distribution of fixations across the visual field.
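For readers who want to reclassify the gaze stream themselves rather than rely on the Tobii Pro Lab export, dispersion-threshold identification (I-DT) is one common fixation/saccade separation method. The sketch below is a generic I-DT implementation, not the pipeline used here; the dispersion and duration thresholds and the input format are illustrative assumptions.

    # Minimal sketch: dispersion-threshold (I-DT) fixation detection.
    import numpy as np

    def idt_fixations(t, gaze, max_disp=0.02, min_dur=0.10):
        """Return (start_time, end_time) fixation spans.

        t        : (N,) timestamps in seconds, sorted ascending
        gaze     : (N, 2) normalised gaze coordinates
        max_disp : dispersion threshold ((max-min) over x plus y)
        min_dur  : minimum fixation duration in seconds
        """
        fixations, i, n = [], 0, len(t)
        while i < n:
            j = np.searchsorted(t, t[i] + min_dur)   # minimum-duration window
            if j >= n:
                break
            window = gaze[i:j + 1]
            disp = np.ptp(window[:, 0]) + np.ptp(window[:, 1])
            if disp <= max_disp:
                # grow the window while dispersion stays under threshold
                while j + 1 < n:
                    window = gaze[i:j + 2]
                    if np.ptp(window[:, 0]) + np.ptp(window[:, 1]) > max_disp:
                        break
                    j += 1
                fixations.append((t[i], t[j]))
                i = j + 1
            else:
                i += 1
        return fixations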

The eye movement trajectory of a person grasping a chopping board is recorded, and the AOIs are labeled by an expert (Fig. 10(a,b)).

Fig. 10 Eye-tracking data. (a) Eye gaze trajectory. (b) Observed areas of interest. (c) Types of eye gaze. (d) Heat map of eye gaze. The figure shows the attentional line of sight of subjects wearing the eye-tracking device.

Fixation durations are significantly longer than saccade durations, suggesting deep visual processing at specific points of interest; saccades show shorter durations, indicative of rapid gaze shifts (Fig. 10(c)).

Unspecified movements were minimal, indicating clean data collection with few artifacts such as blinks or sensor errors.

The heatmap shows that when 20 participants completed task C1, their attention was highly concentrated in the middle-right area of the visual field. This area corresponds to where participants operate the cutting board (Fig. 10(d)). This concentration indicates that most participants focus on the object while operating it, and the data are consistent with expected behavior.

Table 3 lists the eye-tracking features computed over the AOIs, detailed as follows.

Table 3 AOI-related features.

The heatmap illustrates the proportion of time spent by 20 volunteers performing various action categories across 17 tasks (C1–C17). The vertical axis represents the tasks (C1–C17), while the horizontal axis denotes the action categories, such as Approach, Grab, and Cut. The color gradient ranges from yellow (low proportion) to dark blue (high proportion), corresponding to the normalized values between 0.0 and 0.4 (Fig. 11). Actions such as Grab and Approach are widely distributed across multiple tasks, as evidenced by the darker color along several rows. This suggests that these movements serve as fundamental components in a wide variety of tasks. Conversely, actions like Spray or Roll occur infrequently and are restricted to a limited number of tasks. This pattern suggests that tasks are influenced by the availability of specific kitchen tools or objects. Certain tasks, such as C10, exhibit a high concentration of a specific action, represented by a prominent dark blue cell. This indicates that the corresponding action is dominant in the completion of that task. The heatmap thus highlights the variability in action distribution across tasks, offering insights into task composition and action relevance in different scenarios.

Fig. 11 Action-task correlation.
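The normalised proportions behind a heatmap like Fig. 11 can be derived directly from a table of annotated segments. The sketch below uses a synthetic stand-in table; the column names are assumptions about the annotation export, not its documented schema.

    # Minimal sketch: build a task x action time-proportion matrix
    # (rows normalised to sum to 1) from annotated segments.
    import pandas as pd

    seg = pd.DataFrame({                    # stand-in for the annotation file
        "task":     ["C1", "C1", "C1", "C10"],
        "action":   ["Approach", "Grab", "Cut", "Grab"],
        "duration": [0.8, 1.2, 5.5, 1.0],   # seconds
    })

    totals = seg.pivot_table(index="task", columns="action",
                             values="duration", aggfunc="sum", fill_value=0)
    proportions = totals.div(totals.sum(axis=1), axis=0)
    print(proportions.round(2))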

The annotated segments cover a variety of common kitchen actions, such as Approach, Take, and Put Down, separated by left- and right-hand use (Fig. 12(a)). Completion times were recorded for the 20 participants (sub1-sub20) performing the C17 task (Fig. 12(b)).

Fig. 12 Action segmentation. (a) Left- and right-hand movement segmentation statistics. (b) Sub1 C17 action segmentation. Right- and left-hand time histograms for all actions, with left- and right-hand box plots for the C17 task.

Certain movements, such as Grab and Lift Up, show large temporal differences between the right and left hands, pointing to different demands on coordination and strength control during their execution. The median time for the right hand is generally lower than that for the left hand in most movements, which may reflect the right-handed dominance of most participants. Movements that are complex or require fine manipulation (e.g., Unscrew and Tighten) showed greater time variance and more outliers, implying that execution is affected by factors such as differences in individual skills and strategies. Such data can help optimize the design and programming of robot arms in robots or automated systems involving fine hand manipulation.

The effects of various experimental factors on force measurement distribution are analyzed to ensure the data’s reliability and real-world applicability. These factors include the specific joint or measurement point (since each corresponds to a distinct sensor), the type of hand movement, the individual subject, and the repetition of the movement. This analysis is conducted separately for each hand (Fig. 13). The resulting box plots offer a detailed view of the force distribution across 19 measurement points on both the right and left hands. The variability observed across different sensors underscores the importance of considering individual differences in muscle strength and hand dynamics when analyzing this data. This approach ensures that the dataset comprehensively reflects the diverse factors at play.

Fig. 13 Box plots of the glove force data from 20 participants performing the C6 task.
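A per-sensor box plot of this kind is straightforward to reproduce from the glove force channels. The sketch below uses synthetic stand-in data to show the plotting pattern; real values would be loaded from the released CSV files, whose layout is an assumption here.

    # Minimal sketch: per-sensor grip-force box plot (cf. Fig. 13),
    # drawn from synthetic stand-in data for 19 force sensors.
    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    forces = [np.abs(rng.normal(0.5, 0.3, 500)) for _ in range(19)]

    fig, ax = plt.subplots(figsize=(10, 4))
    ax.boxplot(forces, showfliers=True)
    ax.set_xlabel("glove force sensor (1-19)")
    ax.set_ylabel("force (N)")
    ax.set_title("Per-sensor grip force distribution, right hand (sketch)")
    plt.tight_layout()
    plt.show()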

For most tasks, the median forces are consistently below 1 Newton, with a few outliers indicating occasional higher force application. The interquartile range is narrow for most tasks, suggesting that the applied forces are relatively consistent.

The data indicate a clear distinction in the roles of each hand during tasks: the right hand often performs precision work requiring less force, while the left hand engages in more strength-oriented tasks. For robotic systems, especially those designed for dual-handed operations, programming should incorporate adaptive force modulation to mimic the observed asymmetrical force distribution. This can enhance a robot’s ability to perform complex, multi-component tasks effectively. Multimodal sensory integration in robots, such as combining tactile feedback, force sensors, and visual input, could be optimized based on these human force application patterns to improve robotic precision and efficiency in similar tasks.

To validate the applicability of our dataset for enhancing embodied intelligence in robotic systems, we conduct an in-depth analysis of joint angle variability. The distribution of joint angles across different segments of both hands when performing the same task is analyzed (Fig. 14). Specifically, the thumb joints display lower median angles and less variability, while the index, middle, ring, and little fingers exhibit higher median angles. The middle and ring fingers show exceptionally high angles, with varying degrees of interquartile range (IQR) that indicate different levels of movement variability. The metacarpophalangeal (MCP) joint has a high median angle with significant variability, underscoring its crucial role in hand movements for the task.

Fig. 14 The boxes show the range between the first quartile (25th percentile) and the third quartile (75th percentile); the extent of each box indicates the spread of the middle 50% of the data.

The analysis targets the mechanical range of motion captured by sensors located at strategic points, including the base of the thumb, the fingertips, and the knuckles. The variability in joint angles (Fig. 14) reflects the natural range of motion of the human hand, which is essential for designing robots capable of mimicking complex hand movements. For example, segments like middle1 and ring1 exhibit a wide range of motion, highlighting their importance in grasping and manipulating objects, a critical aspect of robot design for achieving human-like dexterity.

Based on whether the palm is involved, the degree of thumb involvement, and the degree of involvement of the side of the finger42, we classify nine grasping methods for the experts to annotate (Table 4).

Table 4 Hand actions.

Across tasks C1 through C17, the Lumbrical grasp occurs most frequently, as each action typically begins by picking up the object, most often with a finger-contact grasp (Fig. 15). Tasks C11 and C12, which involve squeezing detergent and spraying alcohol, necessarily elicit the Cyl (cylindrical) grasp, consistent with common sense.

Fig. 15 Heat maps of the nine grips for the left and right hands.

These data, along with their meticulous analysis, significantly advance the development of advanced robotic systems. They are instrumental in fostering embodied intelligence that mirrors human-like agility and precision. This ensures that the dataset is not only robust and reliable but also accurately represents human behavior in real-world scenarios.