Background & Summary

The human-robot collaboration (HRC) paradigm is a promising pathway toward mass customization1: it combines the precision and repeatability of robots with the cognition and flexibility of human operators to achieve ergonomic working conditions and better productivity. In this context, efficient recognition of human assembly intentions by robots is crucial for ensuring both the efficiency and safety of collaborative assembly processes2.

Currently, vision-based human assembly intention recognition approaches have attracted increasing attention, including single-modal approaches3, dual-modal fusion approaches4, and triple-modal fusion approaches5. However, these works unanimously emphasize the absence of sufficient datasets in the industrial context, and most approaches were verified on custom datasets, which is unfavorable for investigating high-performance, cross-scene methods for recognizing operator intentions. The datasets that do exist suffer from three critical shortcomings6,7: (1) limited modalities, impeding full exploitation of spatiotemporal information; (2) confinement to predefined assembly cases, overlooking stochastic operator behaviors; and (3) non-standardized annotation frameworks and protocols, obstructing systematic data utilization. Table 1 presents an overview of several datasets currently available for assembly scenarios. Hence, in this paper, we propose MCV-Intention, a new human assembly intention recognition dataset collected in a real industrial scenario with six modalities and two views, using custom data acquisition software developed by our research team. The main contributions of this paper are as follows:

  • Leveraging our custom data acquisition software, a multimodal dataset encompassing six modalities and two views, MCV-Intention, is collected, with a small satellite serving as the assembly object. The entire data collection process is conducted by fifteen subjects. The dataset comprehensively simulates real-world assembly scenarios, incorporating both unfamiliar (pre-training) assembly and proficient post-training assembly.

  • A comprehensive and detailed annotation protocol is established.

  • A comprehensive suite of benchmark experiments is conducted using state-of-the-art algorithms based on convolutional techniques and attention mechanisms, which can serve as baselines for the research community on this topic.

Table 1 Existing datasets for human assembly intention recognition.

Methods

Dataset acquisition protocol

Most existing datasets are collected under predefined rules, where every assembly action is constrained and no errors are permitted during the procedure. However, operators are very likely to make errors or exhibit unexpected behaviors during task execution, such as pauses, disassembly, or incorrect tool usage. As a result, a human assembly intention model may exhibit excellent performance during the training phase yet demonstrate suboptimal efficacy in practical applications. To remedy this gap, we propose MCV-Intention, a multimodal, cross-view human assembly intention recognition dataset with natural human behaviors. The dataset acquisition process is as follows:

  (1) The subjects are required to perform the assembly process twice, namely pre-training and post-training.

  (2) Before training, the subjects are not aware of the assembly process and obtain information only from the assembly documents; consequently, some abnormal actions occur.

  (3) After training, the subjects perform the assembly tasks under meticulous guidance to ensure the successful and normal completion of the assembly procedure.

Dataset acquisition environment

Here, we describe the dataset acquisition environment in terms of its hardware and software components, as shown in Fig. 1.

Fig. 1 Hardware environment for data acquisition.

Hardware environment

The hardware environment is the execution cell where the assembly process is conducted. It consists of the assembled object (a small satellite), tools, a top-down camera, a front-facing camera, and a server, as shown in Fig. 1. The top-down camera collects the top-down view of the assembly process (the last display zone in Fig. 2), while the remaining modalities are captured by the front-facing camera (display zones 1–6 in Fig. 2).

Fig. 2 Custom data acquisition software.

Software environment

The software was developed by our research team using PyQt and is mainly used to collect data of different modalities and views during the assembly process. The parameter setting area sets the width and height of the captured images and the frame rate (fps), and modal selection is performed in the modal selection zone according to the user's requirements. All captured data are visualized in the data display zone, as shown in Fig. 2. The front and top-down RGB data are captured directly with the cameras; the skeleton and mask modalities are obtained with Google's MediaPipe; the optical flow modality is calculated by the Lucas-Kanade method8; and the depth and infrared modalities are acquired through the camera SDK.
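As background on the optical flow computation, the Lucas-Kanade method estimates per-point motion by solving a small least-squares system over spatial and temporal image gradients. The following is a minimal, self-contained NumPy sketch on a synthetic frame pair; it illustrates the method only and is not the dataset software's implementation:

```python
import numpy as np

def gaussian_image(cx, cy, size=40, sigma=5.0):
    """Synthetic smooth test frame: a Gaussian blob centred at (cx, cy)."""
    y, x = np.mgrid[0:size, 0:size]
    return np.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))

def lucas_kanade_point(frame1, frame2, px, py, win=7):
    """Estimate the flow vector (u, v) at pixel (px, py).

    Solves the Lucas-Kanade least-squares system built from spatial
    gradients Ix, Iy and the temporal difference It inside a
    (2*win+1)^2 window around the point.
    """
    Iy, Ix = np.gradient(frame1)          # np.gradient returns the row (y) axis first
    It = frame2 - frame1
    sl = (slice(py - win, py + win + 1), slice(px - win, px + win + 1))
    ix, iy, it = Ix[sl].ravel(), Iy[sl].ravel(), It[sl].ravel()
    A = np.stack([ix, iy], axis=1)        # brightness constancy: A @ (u, v) = -It
    return np.linalg.solve(A.T @ A, -A.T @ it)

frame1 = gaussian_image(20, 20)
frame2 = gaussian_image(21, 20)           # scene content shifted by +1 px in x
u, v = lucas_kanade_point(frame1, frame2, 20, 20)
# u is close to 1.0 and v close to 0.0 for this synthetic pair
```

In practice, production implementations (e.g., OpenCV's `calcOpticalFlowPyrLK`) add image pyramids and iterative refinement on top of this core step to handle larger displacements.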

Assembled object

Most existing human assembly action recognition datasets are compiled using significantly simplified assemblies, such as toys or simplified gearboxes, and thereby fail to adequately represent the complexities of real-world assembly operations. Therefore, we use a small satellite as the assembled object, as shown in Fig. 3. It consists of 9 major parts (radiator, inner plane, holder, solar plane, battery holder, chassis, battery, payload, and top plane), 23 submodules, and 100 screws. The small satellite has overall dimensions of 400 mm × 400 mm × 400 mm, weighs 35 kg, and was fabricated using additive manufacturing (3D printing) upon design completion. The entire assembly typically requires approximately thirty minutes when performed by a skilled worker. The assembly process is depicted in Fig. 4 and comprises the following steps: (1) battery module assembly, (2) left plane assembly, (3) right plane assembly, (4) left solar plane assembly, (5) right solar plane assembly, (6) front solar plane assembly, (7) back solar plane assembly, (8) heat dissipation assembly, (9) final assembly. The detailed assembly documents and process can be found on our dataset website.

Fig. 3 The small satellite. Constructed with the open-source software FreeCAD. The original model can be downloaded here.

Fig. 4 Assembly procedures of the small satellite.

Annotation process

The purpose of dataset annotation is to differentiate the various actions. Annotation involves manual frame-by-frame labeling, followed by verification by three additional researchers to ensure accuracy. Different actions are denoted by distinct IDs, encompassing both normal and abnormal actions. Annotation examples are presented in Table 2.
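As an illustration of how such ID-based labels can be consumed downstream, the sketch below expands segment annotations into per-frame labels. The (start_frame, end_frame, action_id) tuple format is hypothetical, not the protocol's actual file layout:

```python
def segments_to_frame_labels(segments, n_frames, background_id=0):
    """Expand segment annotations into one action ID per video frame.

    `segments` is a hypothetical list of (start_frame, end_frame, action_id)
    tuples with inclusive bounds; frames not covered by any segment keep
    `background_id`.
    """
    labels = [background_id] * n_frames
    for start, end, action_id in segments:
        for frame in range(start, end + 1):
            labels[frame] = action_id
    return labels

# e.g. frames 0-4 carry action ID 3, frames 5-9 carry action ID 7
labels = segments_to_frame_labels([(0, 4, 3), (5, 9, 7)], n_frames=12)
# -> [3, 3, 3, 3, 3, 7, 7, 7, 7, 7, 0, 0]
```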

Table 2 Dataset annotation details.

Ethical approval

This research was conducted under protocols reviewed and approved by the Institutional Review Board (IRB) of The First Affiliated Hospital of Xi’an Jiaotong University (IRB No: XJTU1AF2025LSYY-036). Participants were recruited on a voluntary basis from the university community. We established specific inclusion and exclusion criteria to ensure participant suitability. Inclusion criteria required healthy adults with the normal vision, cognitive function, and motor skills necessary for the assembly task. Individuals with any physical or cognitive impairments that could affect task performance, or those with prior familiarity with the assembly process, were excluded. Prior to their involvement, all participants provided written informed consent. The consent form explicitly detailed the study’s procedures, potential minimal risks (including minor physical fatigue and discomfort from being filmed), and the data handling policy. Crucially, participants were informed and consented that their fully anonymized data, including video recordings, would be made publicly available for research purposes through scientific data repositories like Zenodo. The research team adhered to all approved guidelines for data collection, cleaning, storage, and dissemination.

Data Records

To minimize data redundancy, we set the image height and width to 480 and 640 pixels, respectively, with a frame rate of 15 fps. Data collection comprises three phases: (1) In the pre-training phase, all 15 subjects, initially unfamiliar with the assembly procedures, were provided with process documentation to guide their attempts; their lack of familiarity led to various assembly issues, capturing their pre-training state. (2) In the training phase, an experienced assembly worker trains the subjects until they are familiar with the entire assembly process and can complete the assembly without referring to the manual. (3) In the post-training phase, subjects perform the assembly again, where an error-free process is required for each assembly step; if any error occurs, the subject is asked to repeat the step. Data collection is conducted on a per-assembly-step basis. In total, 32 GB of MCV-Intention data were collected following this protocol. The dataset has been organized and uploaded to the Zenodo database9. Dataset examples are shown in Table 3. There are six modalities in total (RGB, depth, skeleton, optical flow, infrared, mask) and two views (front view and top-down view). The skeleton modality is stored in .txt format, primarily containing three-dimensional coordinates (x, y, z) for 33 human skeletal joints; the other modalities are saved in .jpg format. The dataset is organized as shown in Fig. 5.
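To give a sense of how the skeleton files can be consumed, the sketch below parses one frame into a joint array. The assumed layout of 33 × 3 whitespace-separated floats per line should be verified against the files in the Zenodo record:

```python
import numpy as np

def parse_skeleton_line(line, n_joints=33):
    """Parse one skeleton frame into an (n_joints, 3) array of (x, y, z).

    Assumes each .txt line stores n_joints * 3 whitespace-separated floats
    in MediaPipe Pose landmark order; check this layout against the actual
    dataset files before use.
    """
    values = np.array(line.split(), dtype=float)
    if values.size != n_joints * 3:
        raise ValueError(f"expected {n_joints * 3} values, got {values.size}")
    return values.reshape(n_joints, 3)

# synthetic example line: 33 joints, each at (0.1, 0.2, 0.3)
line = " ".join(["0.1 0.2 0.3"] * 33)
joints = parse_skeleton_line(line)  # shape (33, 3)
```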

Table 3 Dataset examples. OF denotes optical flow and IR denotes infrared.
Fig. 5 Dataset structure. Subject i denotes different volunteers; 1–9 denote the assembly steps in order.

Technical Validation

Dataset analysis

The dataset distribution is shown in Fig. 6, where abnormal and normal correspond to pre-training and post-training, respectively; abnormal and normal frames are counted separately. The duration for completing assembly tasks varies among subjects, reflecting their diverse levels of experience with such operations. Moreover, during pre-training, all subjects committed errors, such as arbitrary pauses, incorrect installations, and omissions, thereby extending the time required for the assembly tasks, as depicted in Fig. 6(a). In addition, the time expenditure for each step varies with the complexity of the installation process. As shown in Fig. 6(b), installing the battery required the least time, whereas installing the heat dissipation module consumed the most: the battery assembly procedure is intuitive and simple by design, while heat dissipation installation requires specialized tools in a narrow space.

Fig. 6 Total frames of normal and abnormal assembly processes (examples with the RGB modality).

Here, to evaluate the quality and utility of MCV-Intention, we reproduce eight state-of-the-art algorithms for human action recognition based on convolution and attention mechanisms, namely CorrNet10, CSN11, ResNet3D12, SlowFast13, ViViT14, MViT15, AcT16, and SR2M4. Owing to the informational efficacy of the RGB and skeleton modalities, they are frequently employed as source modalities for assembly intention recognition17; therefore, these two modalities are used to demonstrate the utility of the MCV-Intention dataset. Detailed information on the above-mentioned models is as follows.

CorrNet, short for correlation network, leverages a learnable correlation operator to establish frame-to-frame correspondences across convolutional feature maps at various network layers. CSN is a 3D channel-separated network in which all convolutional operations are separated into either pointwise 1 × 1 × 1 or depthwise 3 × 3 × 3 convolutions. ResNet3D was proposed for human assembly recognition based on context information; it combines a 34-layer residual convolutional neural network with a long short-term memory recurrent neural network. SlowFast contains a slow pathway and a fast pathway for feature extraction: the slow pathway extracts spatial semantics at a low frame rate, while the fast pathway captures motion at fine temporal resolution using a high frame rate. ViViT, the video vision transformer, is a pure-transformer video classification network. MViT is a multiview transformer network, also built on the transformer architecture; it uses separate transformer encoders to represent different views of an input video, with lateral connections to integrate features across views. AcT, the action transformer, is a simple, fully self-attentional architecture for action recognition based on the skeleton modality. SR2M denotes the skeleton-RGB integrated network for human assembly action prediction, in which a 3D ResNet model18 extracts features from the RGB modality, while a multi-scale graph convolution model19 handles feature extraction for the skeleton modality.
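Of these, the SlowFast sampling scheme is easy to illustrate: both pathways read the same clip but at different temporal strides. The sketch below uses a slow-pathway stride τ = 16 and speed ratio α = 8, the defaults from the original SlowFast paper; frame indices stand in for decoded frames:

```python
def slowfast_sample(num_frames, tau=16, alpha=8):
    """Frame indices for the two SlowFast pathways.

    The slow pathway samples one frame every `tau` frames; the fast
    pathway runs `alpha` times faster, i.e. with stride tau // alpha.
    """
    slow = list(range(0, num_frames, tau))
    fast = list(range(0, num_frames, tau // alpha))
    return slow, fast

slow, fast = slowfast_sample(64)
# slow -> [0, 16, 32, 48]; fast holds alpha (= 8) times as many indices
```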

The training settings are listed in Table 4.

Table 4 Settings of the benchmark experiments.

We conduct benchmark experiments on two widely used modalities (RGB and skeleton). Here, model accuracy refers to video classification accuracy, defined as the proportion of correctly classified samples of a specific action category relative to the total number of samples in that category. As shown in Table 5, for single-modality input under the normal schema, the highest accuracy achieved is only 90.25%, by SlowFast, while attention-based models generally exhibit mediocre performance: ViViT achieves an accuracy of merely 65.24%, and MViT reaches 81.58%. Similarly, the skeleton-based AcT model records a notably lower accuracy of 60.00%, potentially attributable to the limited contextual information the skeleton modality provides, which hinders recognition of complex assembly processes. In contrast, the multimodal SR2M model demonstrates a superior accuracy of 91.20%. For abnormal assembly actions, their inherent high uncertainty poses significant challenges, particularly for attention-based algorithms: ViViT achieves an accuracy of only 31.22%, while AcT attains a mere 30.14%. Convolution-based algorithms outperform their attention-based counterparts, with SlowFast achieving an accuracy of 66.12%. Furthermore, the multimodal fusion algorithm SR2M demonstrates a clear advantage in handling uncertainty, reaching an accuracy of 70.23%.
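The per-category accuracy defined above can be computed as follows; this is a generic sketch, not code from the benchmark suite:

```python
from collections import defaultdict

def per_class_accuracy(y_true, y_pred):
    """Video classification accuracy per action category.

    Returns, for each class appearing in y_true, the fraction of its
    samples that were predicted correctly.
    """
    totals, correct = defaultdict(int), defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        totals[truth] += 1
        correct[truth] += int(truth == pred)
    return {cls: correct[cls] / totals[cls] for cls in totals}

# e.g. class 0 gets 1 of 2 samples right, class 1 gets 2 of 2 right
acc = per_class_accuracy([0, 0, 1, 1], [0, 1, 1, 1])
# -> {0: 0.5, 1: 1.0}
```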

Table 5 Benchmark experiments on MCV-Intention with state-of-the-art methods.

It can be concluded from the benchmark experiments that these state-of-the-art algorithms perform suboptimally on the abnormal assembly process, with a maximum accuracy of 70.23%, whereas the maximum accuracy on the normal assembly process exceeds 90%. This falls short of the standards required for practical application. In contrast to a normal assembly process, an abnormal assembly process encompasses a variety of aberrant operator behaviors, including repeated adjustments to unfamiliar assembly tasks, the use of inappropriate tools, assembly errors followed by disassembly and reassembly, and intermittent pauses, as shown in Fig. 7. These behaviors significantly increase the uncertainty of the system. The distinct habits of individual assemblers contribute to an anomalous distribution within the dataset, which may be the primary reason for the reduced accuracy observed on the abnormal data. However, real-world assembly scenarios more closely resemble abnormal conditions; therefore, more efficient algorithms are needed.

Fig. 7 Abnormal behaviors of operators.

In addition, compared to models based on convolutional operations, models leveraging attention mechanisms exhibit inferior performance, potentially attributable to their limited inductive bias. Furthermore, owing to the complexity of the scenes, these algorithms may fail to adequately extract and integrate contextual information, resulting in diminished system robustness. Multimodal fusion demonstrates superior performance, suggesting that aligning different modalities can reduce uncertainty and enhance overall system efficacy20. Therefore, future research on assembly intention recognition should not focus solely on accuracy; instead, it should fully leverage multimodal datasets to extract temporal and spatial information at different resolutions while quantifying the uncertainty of human operators. This approach will enhance the overall robustness of HRC assembly systems.