Introduction

A surgeon’s technical proficiency is critical and has been linked to postoperative adverse events, morbidity, and mortality1,2. In extracranial-intracranial (EC-IC) bypass surgery, the fragility of cerebral arteries necessitates advanced microsurgical skills to achieve complete intimal approximation between the recipient and donor arteries3,4,5. This demanding technique requires precise instrument handling, smooth and efficient movements, and gentle tissue manipulation to prevent vessel wall tears6,7. A comprehensive, objective assessment of microsurgical skills provides a reliable means to identify deficiencies in surgical trainees through constructive feedback and validates surgeon proficiency, ultimately enhancing patient safety8.

Traditionally, microsurgical skill assessments have relied on subjective evaluations by master surgeons9. Although various criteria-based scoring systems have been developed to reduce subjectivity, they require substantial human and time resources, making real-time feedback impractical10,11,12,13. Quantitative methodologies, such as force and motion sensors affixed to surgical instruments, have been explored14,15,16; however, their reliance on specialized sensors and equipment limits their widespread adoption and raises concerns regarding generalizability and reproducibility17,18.

With the increasing availability of surgical video recordings, video analytics is gaining traction for skill assessment across various procedures19,20,21. Artificial intelligence (AI)-driven video analysis is also being increasingly applied in surgical skill evaluation22,23. Building on this trend, we previously developed two AI models for assessing microvascular anastomosis performance: one incorporating a semantic segmentation algorithm to evaluate vessel area (VA) changes24 and another using an object detection algorithm to analyze surgical instrument-tip motion25.

This study examines whether combining these AI models improves the accuracy of surgeon performance assessment and identifies which aspects of microsurgical skills correlate with AI-derived parameters.

Methods

Combined AI model

Two AI models were developed as described previously: a semantic segmentation algorithm for the VA and a trajectory tracking algorithm for the instrument tip24,25. These models were based on the Residual Network 50 (ResNet-50) and You Only Look Once version 2 (YOLOv2), which were trained on clinical microsurgical videos and microvascular anastomosis practice videos.

ResNet-50 is a 50-layer deep convolutional neural network that utilizes residual learning, enabling the training of very deep architectures, and is widely applied to tasks such as segmentation and classification in medical imaging26. YOLOv2 is a real-time, deep learning-based object detection algorithm designed to detect objects quickly and accurately in a single pass. It divides the input image into grids and simultaneously predicts the presence of objects within each grid cell, making it ideal for high-speed, real-time applications such as surveillance systems for detecting people and vehicles, and identifying anatomical structures in medical images27. Both models were implemented in MATLAB (MathWorks, Natick, MA, USA). Detailed training procedures for each model can be found in our previous studies24,25.
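The single-pass grid scheme described above can be illustrated with a short Python sketch (illustrative only, not the authors' MATLAB implementation): each detected box center, given in normalized image coordinates, is assigned to the grid cell responsible for predicting it. A 13 × 13 grid is YOLOv2's default for a 416 × 416 input.

```python
def responsible_cell(cx: float, cy: float, s: int = 13) -> tuple[int, int]:
    """Return the (row, col) of the grid cell responsible for a detection
    whose box center lies at normalized image coordinates (cx, cy).

    YOLO-style detectors divide the image into an s x s grid and let the
    cell containing the box center predict that object in a single pass.
    """
    col = min(int(cx * s), s - 1)  # clamp so cx == 1.0 stays inside the grid
    row = min(int(cy * s), s - 1)
    return row, col
```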

The accuracy of these models was confirmed by an Intersection over Union of 0.93 and a mean Dice similarity coefficient of 0.8724,25. We integrated both AI models to comprehensively analyze microsurgical performance, as shown in Fig. 1.

Fig. 1

Graphical user interface of the custom-built software, showcasing the artificial intelligence (AI)-powered vessel segmentation and instrument tracking models. The interface provides real-time visualization of vessel area changes and instrument motion for microsurgical performance assessment.
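For reference, the two segmentation accuracy metrics quoted above can be computed from binary masks as in the following Python sketch (illustrative, independent of the authors' MATLAB implementation):

```python
import numpy as np

def iou(pred: np.ndarray, truth: np.ndarray) -> float:
    """Intersection over Union for binary segmentation masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    intersection = np.logical_and(pred, truth).sum()
    union = np.logical_or(pred, truth).sum()
    return intersection / union if union else 1.0

def dice(pred: np.ndarray, truth: np.ndarray) -> float:
    """Dice similarity coefficient: 2|A ∩ B| / (|A| + |B|)."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    intersection = np.logical_and(pred, truth).sum()
    total = pred.sum() + truth.sum()
    return 2 * intersection / total if total else 1.0
```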

Participants

This study adhered to the SQUIRE guidelines and the Declaration of Helsinki and was conducted with institutional approval from Hokkaido University Hospital (No. 018-0291). As our facility regularly holds off-the-job microvascular anastomosis training sessions for educational purposes, surgeon participants, including both instructors and trainees, were recruited from these sessions. Fourteen surgeons with varying levels of microsurgical experience, ranging from postgraduate year (PGY) 1 to 28, participated in the experimental surgical performance analysis. Table 1 summarizes the characteristics of the participating surgeons. All participants were right-handed.

Table 1 Characteristics of participating surgeons.

Microsurgical task

Each surgeon was assigned to perform interrupted suturing following two stay sutures in end-to-side anastomosis using artificial blood vessels (2.0 mm blood vessel model, WetLab Incorporated, Shiga, Japan) and 10-0 nylon monofilament threads (C26-004-01, Muranaka Medical Instruments Co., Ltd, Osaka, Japan), which simulated the actual EC-IC bypass procedure (Fig. 2). To standardize the anastomosis procedure and minimize procedural variability, tasks such as stabilizing the two vessels, cutting the donor artery, performing arteriotomy on the recipient vessel, and preparing two stay sutures were performed and confirmed by a single instructor. A surgical trial was defined as the completion of a single suturing process, consisting of four phases: Phase A, grasping and inserting the needle; Phase B, pushing and extracting the needle; Phase C, pulling the threads to the first knot; and Phase D, tying three knots and cutting the threads25. To minimize the learning effect of repeated trials, only the first two trials from each surgeon were selected for the performance assessment.

Fig. 2

(a) Microscope, video recording device, and instruments. (b) Single-stitch suturing task for end-to-side anastomosis.

Criteria-based objective assessment

The Stanford Microsurgery and Resident Training Scale was used to assess each surgeon’s performance10,11. This rating scale consists of nine technical categories: (1) instrument handling, (2) respect for tissue, (3) efficiency, (4) suture handling, (5) suturing technique, (6) quality of the knot, (7) final product, (8) operation flow, and (9) overall performance. Three experts independently rated all surgeons’ performances on a scale of 1 to 5 for each technical category in a blinded manner, ensuring that the participants’ identities remained concealed. The scores from the first two trials were averaged across the three raters to obtain a representative score for each surgeon.

Video parameters of dual AI model

Table 2 presents the parameters generated by the combined AI model. These include the coefficient of variation (CV) of all measured VA values (CV-VA), the relative change in VA over time (ΔVA), the maximum absolute value of ΔVA during the procedure (Max-ΔVA), the number of tissue deformation errors (No. of TDE), the path distance (PD) for the right and left forceps tips, and the normalized jerk index (NJI) for the right and left forceps tips24,25. As the mean ± 1.96 × SD of ΔVA for all trials by all surgeons was calculated as ± 1.13 in a previous study, this threshold was used for the definition of TDE24.

Table 2 Definitions of parameters provided by the artificial intelligence (AI) model.

We analyzed videos from all surgeons’ trials and calculated each parameter. The average of the first two trials was used as the representative parameter for each surgeon.
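The vessel-area parameters defined in Table 2 can be sketched in Python as follows. This is illustrative only: the exact definition of ΔVA follows the previous study24, and here it is assumed to be each frame's VA change relative to the series mean, with the ±1.13 threshold from the text defining a TDE.

```python
import numpy as np

def va_parameters(va: np.ndarray, tde_threshold: float = 1.13) -> dict:
    """Illustrative computation of the vessel-area (VA) parameters.

    va: per-frame VA values from the segmentation model.
    Assumes dVA is each frame's VA change relative to the series mean;
    the exact definition follows the previous study.
    """
    mean_va = va.mean()
    delta_va = (va - mean_va) / mean_va                     # dVA per frame
    return {
        "CV-VA": va.std(ddof=1) / mean_va,                  # coefficient of variation
        "Max-dVA": float(np.abs(delta_va).max()),           # maximum |dVA|
        "No. of TDE": int((np.abs(delta_va) > tde_threshold).sum()),
    }
```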

Statistical analysis

The results are expressed as the mean ± standard deviation. The interrater reliability of the criteria-based objective rating scale was assessed using Cronbach’s α coefficient among the three raters. Correlations between the AI-based parameters and each category of the rating scale were analyzed using Spearman’s rank correlation coefficient (ρ).
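The two statistics above can be sketched with NumPy alone (a simplified Spearman's ρ without tie correction; the study itself used JMP Pro):

```python
import numpy as np

def cronbach_alpha(ratings: np.ndarray) -> float:
    """Cronbach's alpha for an (n_subjects, n_raters) score matrix."""
    k = ratings.shape[1]
    item_variances = ratings.var(axis=0, ddof=1).sum()   # per-rater variances
    total_variance = ratings.sum(axis=1).var(ddof=1)     # variance of subject totals
    return k / (k - 1) * (1 - item_variances / total_variance)

def spearman_rho(x, y) -> float:
    """Spearman's rank correlation; assumes no tied values."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return float(np.corrcoef(rank_x, rank_y)[0, 1])
```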

To assess the discriminative ability of each AI model’s parameters, we conducted discriminant analyses between the good- and poor-performance groups stratified by the distribution of the criteria-based rating scale scores of the surgeons. For each AI model, we selected the parameter that showed a significant correlation (p < 0.05) with the highest number of technical categories and included it in discriminant analysis (Models 1 and 2). Finally, to evaluate the combined use of both AI models, all the parameters from Models 1 and 2 were included in the discriminant analysis (Model 3).
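The two-group discriminant analysis described above can be illustrated with a minimal Fisher linear discriminant in NumPy (a sketch assuming a pooled within-group scatter matrix; the study used JMP Pro's discriminant analysis):

```python
import numpy as np

def fisher_discriminant(x_good: np.ndarray, x_poor: np.ndarray):
    """Fisher's linear discriminant for two groups of parameter vectors.

    Returns the projection direction w (proportional to Sw^-1 (m1 - m2))
    and a midpoint threshold; projected scores above the threshold are
    classified as 'good'.
    """
    m1, m2 = x_good.mean(axis=0), x_poor.mean(axis=0)
    s1 = np.cov(x_good, rowvar=False) * (len(x_good) - 1)
    s2 = np.cov(x_poor, rowvar=False) * (len(x_poor) - 1)
    w = np.linalg.solve(s1 + s2, m1 - m2)     # pooled within-group scatter
    threshold = w @ (m1 + m2) / 2             # midpoint between projected means
    return w, threshold
```

For a new trial's parameter vector x, the sign of `x @ w - threshold` assigns the group.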

Statistical analyses were performed using JMP Pro (version 17.0.0; SAS Institute Inc., Cary, North Carolina, USA). Statistical significance was set at p < 0.05.

Results

Criteria-based scale

High interrater reliability of the criteria-based rating scale was confirmed, with Cronbach’s α coefficients of 0.88–0.97 across categories (Supplementary Table 1). The original criteria-based scale scores for each surgeon are provided in Table 1 and detailed in Supplementary Table 2.

The total score of the criteria-based rating scale correlated significantly with surgeon experience (Fig. 3). As the total score across the 14 surgeons exhibited a bimodal distribution, performances that scored over 35 points were regarded as good, whereas those that scored under 35 were regarded as poor (Fig. 3).

Fig. 3

Distribution of criteria-based rating scale scores across surgeons with varying experience levels, along with the criteria used to define good and poor performance. PGY postgraduate year, N number.

Correlation between AI-based and criteria-based performance analysis

Supplementary Table 3 provides the AI-derived performance parameters for each surgeon. Table 3 presents the p-values from Spearman’s rank correlation analyses, and Supplementary Table 4 provides the corresponding correlation coefficients (ρ).

Table 3 Spearman’s rank correlation analysis (p-values) between the criteria-based scale and parameters provided by the AI model, including selected parameters for discriminant analysis.

The No. of TDE for all phases and Phase C were significantly correlated with instrument handling, respect for tissue, efficiency, and overall performance. In addition, the No. of TDE for Phase C was significantly correlated with suturing technique and final product. Although CV-VA and Max-ΔVA for all phases did not show significant correlations with technical categories, Max-ΔVA for Phase B was significantly correlated with instrument handling and respect for tissue. Therefore, the No. of TDE for Phase C and Max-ΔVA for Phase B were included in Models 1 and 3 for subsequent discriminant analysis.

The PD of the right forceps (Rt-PD) and the NJI of the left forceps (Lt-NJI) for all phases were significantly correlated with almost all performance categories. The NJI of the right forceps (Rt-NJI) for all phases was significantly correlated with five technical categories, while the Rt-NJI for Phase C and Phase D were significantly correlated with seven categories. The PD of the left forceps (Lt-PD) was significantly correlated with efficiency, suture handling, suturing technique, and operation flow. Therefore, Rt-PD, Lt-PD, and Lt-NJI for all phases and Rt-NJI for Phase C were included in Models 2 and 3 for the subsequent discriminant analysis.

Discriminative abilities of each model for surgical performance

Supplementary Table 5 provides the discriminant functions of Models 1–3.

The receiver operating characteristic (ROC) curves for Models 1–3 used to distinguish between good and poor performance are shown in Fig. 4. The area under the curve (AUC) values for Models 1 and 2 were 0.85 (95% confidence interval [CI]: 0.53–0.97) and 0.96 (95% CI: 0.67–1.00), respectively. Model 3 demonstrated the highest AUC value of 1.00.

Fig. 4

Receiver operating characteristic curves for Models 1–3, illustrating their ability to differentiate between good and poor performance. AUC area under the curve.
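The AUC values above admit a simple probabilistic reading: the probability that a randomly chosen good-performance trial receives a higher discriminant score than a randomly chosen poor-performance trial. A minimal Python sketch, equivalent to a scaled Mann-Whitney U statistic:

```python
def auc(scores_good, scores_poor) -> float:
    """AUC as the fraction of (good, poor) score pairs ranked correctly;
    ties count as half. Equals the Mann-Whitney U statistic divided by
    len(scores_good) * len(scores_poor)."""
    wins = 0.0
    for g in scores_good:
        for p in scores_poor:
            wins += 1.0 if g > p else 0.5 if g == p else 0.0
    return wins / (len(scores_good) * len(scores_poor))
```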

Discussion

We employed a combined AI-based video analysis approach to assess the microvascular anastomosis performance by integrating VA changes and instrument motion. By comparing technical category scores with AI-generated parameters, we demonstrated that the parameters from both AI models encompassed a wide range of technical skills required for microvascular anastomosis. Furthermore, ROC curve analysis indicated that integrating parameters from both AI models improved the ability to distinguish surgical performance compared to using a single AI model. A distinctive feature of this study was the integration of multiple AI models that incorporated both tools and tissue elements.

AI-based technical analytic approach

Traditional criteria-based scoring by multiple blinded expert surgeons was a highly reliable method for assessing surgeon performance with minimal interrater bias (Fig. 3 and Supplementary Table 1). However, the significant demand for human expertise and time makes real-time feedback impractical during surgery and training10,11,18. A recent study demonstrated that self-directed learning using digital instructional materials provides non-inferior outcomes in the initial stages of microsurgical skill acquisition compared to traditional instructor-led training28. However, direct feedback from an instructor continues to play a critical role when progressing toward more advanced skill levels and actual clinical practice.

AI technology can rapidly analyze vast amounts of clinical data generated in modern operating theaters, offering real-time feedback capabilities. The proposed method’s reliance on surgical video analysis makes it highly applicable in clinical settings18. Moreover, the manner in which AI is utilized in this study addresses concerns regarding transparency, explainability, and interpretability, which are fundamental risks associated with AI adoption. One anticipated application is AI-assisted devices that can promptly provide feedback on technical challenges, allowing trainees to refine their surgical skills more effectively29,30. Additionally, an objective assessment of microsurgical skills could facilitate surgeon certification and credentialing processes within the medical community.

Theoretically, this approach could help implement a real-time warning system, alerting surgeons or other staff when instrument motion or tissue deformation exceeds a predefined safety threshold, thereby enhancing patient safety17,31. However, a large dataset of clinical cases involving adverse events such as vascular injury, bypass occlusion, and ischemic stroke would be required. For real-time clinical applications, further data collection and computational optimization are necessary to reduce processing latency and enhance practical usability. Given that our AI model can be applied to clinical surgical videos, future research could explore its utility in this context.

Related works: AI-integrated instrument tracking

To contextualize our results, we compared our AI-integrated approach with recent methods implementing instrument tracking in microsurgical practice. Franco-González et al. compared stereoscopic marker-based tracking with a YOLOv8-based deep learning method, reporting high accuracy and real-time capability32. Similarly, Magro et al. proposed a robust dual-instrument Kalman-based tracker, effectively mitigating tracking errors due to occlusion or motion blur33. Koskinen et al. utilized YOLOv5 for real-time tracking of microsurgical instruments, demonstrating its effectiveness in monitoring instrument kinematics and eye-hand coordination34.

Our integrated AI model employs semantic segmentation (ResNet-50) for vessel deformation analysis and a trajectory-tracking algorithm (YOLOv2) for assessment of instrument motion. The major advantage of our approach is its comprehensive and simultaneous evaluation of tissue deformation and instrument handling smoothness, enabling robust and objective skill assessment even under challenging conditions, such as variable illumination and partial occlusion. YOLO was selected due to its computational speed and precision in real-time object detection, making it particularly suitable for live microsurgical video analysis. ResNet was chosen for its effectiveness in detailed image segmentation, facilitating accurate quantification of tissue deformation. However, unlike three-dimensional (3D) tracking methods32, our current method relies solely on 2D imaging, potentially limiting depth perception accuracy.

These comparisons highlight both the strengths and limitations of our approach, emphasizing the necessity of future studies incorporating 3D tracking technologies and expanded datasets to further validate and refine AI-driven microsurgical skill assessment methodologies.

Future challenges

Microvascular anastomosis tasks typically consist of distinct phases, including vessel preparation, needle insertion, suture placement, thread pulling, and knot tying. As demonstrated by our video parameters for each surgical phase (phases A–D), a separate analysis of each surgical phase is essential to enhance skill evaluation and training efficiency. However, our current AI model does not have the capability to automatically distinguish these surgical phases.

Previous studies utilizing convolutional neural networks (CNN) and recurrent neural networks (RNN) have demonstrated high accuracy in recognizing surgical phases and steps, particularly through the analysis of intraoperative video data35,36. Khan et al. successfully applied a combined CNN-RNN model to achieve accurate automated recognition of surgical workflows during endoscopic pituitary surgery, despite significant variability in surgical procedures and video appearances35. Similarly, automated operative phase and step recognition in vestibular schwannoma surgery further highlights the ability of these models to handle complex and lengthy surgical tasks36. Such methods could be integrated into our current AI framework to segment and individually evaluate each distinct phase of microvascular anastomosis, enabling detailed performance analytics and precise feedback.

Furthermore, establishing global standards for video recording is critical for broadly implementing and enhancing computer vision techniques in surgical settings. Developing guidelines for video recording that standardize resolution, frame rate, camera angle, illumination, and surgical field coverage can significantly reduce algorithmic misclassification issues caused by shadows or instrument occlusion18,37. Such standardization ensures consistent data quality, crucial for training accurate and widely applicable AI models across diverse clinical settings37. These guidelines would facilitate large-scale data sharing and collaboration, substantially improving the reliability and effectiveness of AI-based surgical assessment tools globally.

Technical consideration

The semantic segmentation AI models were designed to assess respect for tissue during the needle manipulation process24. As expected, the Max-ΔVA correlated with respect for tissue in Phase B (from needle insertion to extraction). Proper needle extraction requires following its natural curve to avoid tearing the vessel wall6,7, and these technical nuances were well captured by these parameters. Additionally, the No. of TDE correlated with respect for tissue in Phase C, indicating that even while pulling the threads, surgeons must exercise caution to prevent thread-induced vessel wall injury6,7. These parameters also correlated with instrument handling, efficiency, suturing technique, and overall performance; this is expected, as proper instrument handling and suturing technique are fundamental to respecting tissue. Thus, the technical categories are interrelated and mutually influential.

Trajectory-tracking AI models were designed to assess motion economy and the smoothness of surgical instrument movements25. Motion economy can be represented by the PD during a procedure. The smoothness and coordination of movement are frequently assessed using jerk-based metrics, where jerk is defined as the time derivative of acceleration. Since these jerk indices are influenced by both movement duration and amplitude, we utilized the NJI, first proposed by Flash and Hogan38. The NJI is calculated by multiplying the jerk index by [(duration interval)^5/(path length)^2], with lower values indicating smoother movements. The dimensionless NJI has been used as a quantitative metric to evaluate movement irregularities in various contexts, such as jaw movements during chewing39,40, laparoscopic skills41, and microsurgical skills16,25. In this study, the Rt-PD and Lt-NJI correlated with a broad range of technical categories. Despite their distinct roles in microvascular anastomosis, coordinated bimanual manipulation is essential for optimal surgical performance6,7. With regard to Rt-NJI, these trends were particularly evident in Phases C and D, highlighting the importance of motion smoothness in thread pulling and knot tying in determining overall surgical proficiency.
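The NJI described above can be sketched for a 2D instrument-tip trajectory as follows; this is an illustrative Python version (the study's MATLAB implementation and exact constants may differ):

```python
import numpy as np

def normalized_jerk_index(xy: np.ndarray, dt: float) -> float:
    """Normalized jerk index for an (n_frames, 2) tip trajectory.

    Jerk is the third time derivative of position; the jerk index
    (time-integral of squared jerk magnitude) is made dimensionless by
    the factor duration^5 / path_length^2, so lower values indicate
    smoother movements.
    """
    velocity = np.gradient(xy, dt, axis=0)
    acceleration = np.gradient(velocity, dt, axis=0)
    jerk = np.gradient(acceleration, dt, axis=0)
    jerk_index = (jerk ** 2).sum(axis=1).sum() * dt     # integral of |jerk|^2
    duration = dt * (len(xy) - 1)
    path_length = np.linalg.norm(np.diff(xy, axis=0), axis=1).sum()
    return float(jerk_index * duration ** 5 / path_length ** 2)
```

A straight constant-velocity path yields an NJI of essentially zero; any tremor or hesitation raises it.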

Overall, integrating these parameters enabled a comprehensive assessment of complex microsurgical skills, as each parameter captured different technical aspects. Despite its effectiveness, the model still exhibited some degree of misclassification when differentiating between good and poor performance. Notably, procedural time, a key determinant of surgical performance24,25, was intentionally excluded from the analysis. Although further exploration of additional parameters remains essential, integrating procedural time could significantly improve classification accuracy.

This study employed the Stanford Microsurgery and Resident Training scale10,11 as a criteria-based objective assessment tool, as it covers a wide range of microsurgical technical aspects. Future research incorporating leakage tests or the Anastomosis Lapse Index13, which identifies ten distinct types of anastomotic errors, could provide deeper insights into the relationship between the quality of the final product and various technical factors.

Limitations

As mentioned above, a fundamental technical limitation of this analytical approach is the lack of 3D kinematic data, particularly in the absence of depth information. Another constraint was that when the surgical tool was outside the microscope’s visual field, kinematic data of the surgical instrument could not be captured25. Additionally, the semantic segmentation model occasionally misclassified images containing shadows from surgical instruments or hands24. To mitigate this issue, future studies should expand the training dataset to include shadowed images, thereby improving model robustness. Given that the AI model in this study utilized the ResNet-50 and YOLOv2 networks, further investigation is warranted to optimize network architecture selection. Exploring alternative deep learning models or fine-tuning existing architectures could further improve the accuracy and generalizability of surgical video analysis18.

Our study had a relatively small sample size with respect to the number of participating surgeons, although it included surgeons with a diverse range of skills. Moreover, we did not evaluate the data from repeated training sessions to estimate the learning curve or determine whether feedback could enhance training efficacy. Future studies should evaluate the impact of AI-assisted feedback on the learning curve of surgical trainees and assess whether real-time performance tracking leads to more efficient skill acquisition.

Conclusion

A combined AI-based video analysis approach incorporating VA changes and instrument motion effectively captured a broad spectrum of microsurgical technical skills and evaluated microvascular anastomosis performance. Moreover, this approach is highly adaptable to clinical applications, can advance computer-assisted surgical education, and contributes to the improvement of patient safety.