Introduction

Drilling into bone along a planned trajectory is one of the most common yet most challenging tasks in medical interventions. Its applications span multiple regions of the body, from the cranium to the jaw, shoulder, spine, pelvis, and extremities1,2,3,4,5,6,7,8,9,10. However, it is not without risk, as important structures can be irreparably injured during drilling. Although these procedures are all conducted on bone tissue and adhere to universal biomechanical principles11,12, surgeons still often face significant challenges in maintaining drilling accuracy13, owing to limited visibility of the underlying anatomy and to plunging, especially in minimally invasive surgery14. Limited visibility can obscure critical anatomical landmarks, making it difficult for the surgeon to define the entry point and to navigate the instrument during drilling15. Plunging is defined as the penetration of the drill bit beyond the far cortex8; even experienced surgeons have been reported to plunge beyond the far cortex by an average of 6.33 mm, causing inadvertent iatrogenic damage to adjacent tissue16. The continued improvement and spread of a variety of imaging modalities have paved the way for the establishment of image-guided optical navigation systems, enhancing the accuracy and safety of such interventions3,17.

As a result, image-based navigation is increasingly used to improve the drilling accuracy in different surgical applications, such as pedicle screw placement in spine surgery17,18, total shoulder arthroplasty and fracture fixation in orthopedic surgery3,19,20, implant socket drilling in dental surgery21, or zygomatic implant placement in oral and maxillofacial surgery22. A systematic review showed that reported rates of pedicle screw misplacement ranged from 6–31% for the freehand technique, while with conventional navigation systems (CNS) misplacement rates were reduced to 0–11%17.

While CNSs have been shown to increase accuracy, they are limited by the cognitive challenge of integrating 2D imagery with 3D spatial understanding and by issues related to hand-eye coordination23,24. This cognitive discontinuity requires surgeons to mentally reconstruct a 3D surgical space from multiple 2D images, a task that relies heavily on their spatial cognitive abilities24. Furthermore, hand-eye coordination problems may occur if the direction of movement shown on the monitor does not match the actual movement of the surgeon’s hands. This can place a significant cognitive burden on surgeons, who have to constantly shift their attention between the monitor and the surgical field to check anatomical locations and landmarks and to synchronize their hand movements with what they see23. These limitations compromise procedural efficiency and increase the likelihood of surgical complications by disrupting the surgeon’s spatial orientation and attention to critical anatomical details25,26.

Augmented reality (AR), particularly through head-mounted displays (HMDs), provides a solution by integrating critical data and guidance directly into the surgeon's field of view27, thus potentially improving surgical navigation by enhancing accuracy23,28 and reducing cognitive workload29. Indeed, studies on AR-based navigation systems (ARNS) have demonstrated their technical feasibility in trajectory drilling. They mainly focused on specific scenarios, for example, pedicle screw placement27,28,30, fracture fixation31,32, total shoulder arthroplasty33, or dental implant placement34, although the latter is not generalizable due to the use of a dental contra-angle handpiece instead of a surgical drill and very short trajectories. In addition, some studies have demonstrated the clinical feasibility of ARNS, particularly in the context of pedicle screw placement, either comparing it to a freehand approach or evaluating the results using a clinical score35,36, or in pilot studies for dental implant placement37. However, statements about the actual accuracy or generalizability of the navigation systems are constrained by the studies’ methodologies, such as a limited scope of application, a limited number of participants, a frequent lack of comparison to CNS with optical tracking, and the use of a tracked sleeve, which precludes real-time depth information. Only a very limited number of studies have directly compared the surgical drilling performance of ARNS with that of CNS (Table 1), none in a randomized controlled trial (RCT).

Table 1 Overview of studies comparing ARNS to CNS

Given these limitations, our study aimed to comprehensively compare the accuracy and efficiency of ARNS and CNS in drilling trajectories in a crossover RCT design (Fig. 1). The primary endpoint was the maximum projected translational deviation between the planned and performed trajectories, and the secondary endpoints were the translational deviation at the entry point in 3D Euclidean distance, projected translational deviation at the entry and end points, angular deviation, depth deviation, time, workload using NASA-TLX, and overall user experience using the System Usability Scale (SUS).

Fig. 1: Experimental setup.

a ARNS using HoloLens 2: virtual twin representations of the block (blue) and drill (yellow) are displayed next to the phantom. b The ARNS presented a virtual block with a blue boundary, accompanied by a drill and guidance. The guidance consisted of two tori for the translational deviation of the drill tip (tip indicator) and tail (tail indicator) from the planned trajectory, while a middle, elliptical torus shows the angular deviation. The colors of all indicators turn green if the translational and angular errors are within 1 mm and 1°. The virtual drill turns from yellow to green when it reaches the planned depth and red if it goes deeper (>1 mm). c The CNS setting displayed drilling navigation on an external 2D monitor (right). The optical tracker is not shown. d The graphical user interface of the CNS displays three sub-windows with orthogonal CT planes of the phantom block, illustrating the spatial alignment of the drill with the trajectory. A compensatory view in the lower right sub-window provided a transverse trajectory perspective that enhanced the accuracy of the drill alignment. The circle (representing the drill tip) and middle ring (tail) change from yellow to green when the translational error is ≤1 mm. A depth slider to the left visualizes the current depth (yellow) relative to the planned depth (purple). The slider and the outer ring change from yellow to green at the planned depth (0–1 mm) and turn red if the drill exceeds that depth by >1 mm.

Results

Cohort

36 participants (10 female and 26 male) from three groups (surgeons, medical/dental students, and engineers) each with 12 participants were successfully included. The mean age of participants was 31.2 ± 9.6 (mean ± s.d.; range: 20–59). The average clinical experience of surgeons was 11.0 ± 8.8 years, the average clinical semesters of medical/dental students were 3.7 ± 2.1 (1.8 ± 1.0 years) and the average work experience of engineers was 5.5 ± 9.7 years. Eleven participants had experience with CNS, most of whom were surgeons. Nine participants had experience with AR-HMD, most of whom were engineers (Table 2).

Table 2 Characteristics of the cohort

Drilling accuracy and efficiency

In the postoperative CBCT scans, all 360 trajectories were evaluated (without the 72 for familiarization). ARNS and CNS demonstrated comparable accuracy in translational alignment. The primary endpoint, the maximum projected translational deviation between the executed and planned trajectories, showed no significant difference between ARNS with 1.11 ± 0.47 mm and CNS with 1.04 ± 0.47 mm (LMM, p = 0.152; Fig. 2a). Both systems also exhibited similar projected accuracy at entry and endpoints with deviation at entry points for ARNS and CNS being 0.93 ± 0.41 mm and 0.97 ± 0.45 mm (LMM, p = 0.381; Fig. 2b) and at endpoints 1.00 ± 0.52 mm for ARNS compared to 0.92 ± 0.51 mm for CNS (LMM, p = 0.128; Fig. 2c). Additionally, the deviations remained comparable with the translational deviation at the entry point measured in 3D Euclidean distance. ARNS showed 0.95 ± 0.42 mm and CNS 0.98 ± 0.46 mm, again with no significant difference (LMM, p = 0.413; Fig. 2d).

Fig. 2: Comparison of accuracy by method.

a–f Comparison of accuracy metrics between ARNS (blue) and CNS (yellow) methods (x-axis). Each violin plot (colored) includes a boxplot (white) with a red point marking the mean value. The black points are outliers. The results represent the average evaluations of the segmented meshes by two independent investigators. P values are from the corresponding LMM model. a Maximum projected translational deviation in mm (y-axis). b Projected translational deviation at the entry points in mm (y-axis). c Projected translational deviation at the endpoints in mm (y-axis). d Translational deviation in 3D Euclidean distance at entry points in mm (y-axis). e Angular deviation in degrees (y-axis). f Depth deviation in mm (y-axis).

Among the other secondary endpoints, the angular deviation was 1.11 ± 0.61° with ARNS, significantly higher than with CNS at 0.73 ± 0.36° (LMM, p < 0.001; Fig. 2e). Similarly, the depth deviation with ARNS was 1.27 ± 1.59 mm, significantly higher than with CNS at 0.52 ± 0.82 mm (LMM, p < 0.001; Fig. 2f). Interestingly, the subgroup analysis showed slight differences between surgeons, students and engineers. According to the LMMs, the engineers were worse at translational deviations and the students were better at depth deviation (LMM, p < 0.05; Fig. 3). Depth deviation outliers occurred primarily with ARNS and rapid drilling (Fig. 4a), mostly among surgeons and engineers (Fig. 4b). Furthermore, we observed a correlation between age and depth deviation (LMM, p = 0.010; Fig. 4c). An additional analysis of the primary endpoint and secondary endpoints by profession is available in the Supplementary Information (Supplementary Table 1).

Fig. 3: Comparison of accuracy by profession.

a–f Comparison of accuracy metrics between the surgeon (green), student (purple) and engineer (gray) groups (x-axis). Each violin plot (colored) includes a boxplot (white) with a red point marking the mean value. The black points are outliers. The results represent the average evaluations of the segmented meshes by two independent investigators. P values are from the corresponding LMM model. a Maximum projected translational deviation in mm (y-axis). b Projected translational deviation at the entry points in mm (y-axis). c Projected translational deviation at the endpoints in mm (y-axis). d Translational deviation in 3D Euclidean distance at entry points in mm (y-axis). e Angular deviation in degrees (y-axis). f Depth deviation in mm (y-axis).

Fig. 4: Additional investigation of depth deviation.

a, b The relationship between drilling time in seconds (x-axis) and depth deviation in mm (y-axis); the blue line represents the LOESS (locally estimated scatterplot smoothing) function, while the dots indicate individual measurements. The vertical dashed line marks the drilling time of 5 seconds, while the horizontal dashed line marks the depth deviation of 1 mm and the solid line marks the depth deviation of 0 mm. a ARNS (left) and CNS (right); individual measurement points are shown in blue (ARNS) and yellow (CNS). b Surgeons (left), students (middle), and engineers (right); individual measurement points are shown in green (surgeons), purple (students), and gray (engineers). c Relationship between participants' age in years (x-axis) and their drilling depth deviation in mm (y-axis); the blue line represents the linear model (LM) function, and the dots indicate individual measurements. d Visualization showing the calculation of the accuracy metrics used. The planned drill hole (25 mm long) is shown in red, and the conducted drill hole is shown in gray. Angular deviation was calculated as the angle between the two vectors (dashed lines). Projected translational deviation at entry (red line, C1 to planned vector) and end (red line, C2 to C3/planned vector) was calculated as orthogonal distance. The maximum projected translational deviation (primary endpoint) is the larger value of the two projected translational errors. Euclidean deviation (red line, C1 to P1) was measured on the surface and depth deviation (red curly brackets) from P2 to C3.
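The metrics described for panel d can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the evaluation pipeline used in the study: it assumes P1 and P2 are the planned entry and end points, C1 and C2 the entry and end points of the conducted drill hole, and C3 the orthogonal projection of C2 onto the planned axis. The function names and the sign convention for depth (positive means deeper than planned) are hypothetical.

```python
import math


def sub(a, b):
    """Component-wise difference a - b of two 3D points."""
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def point_to_line_distance(p, origin, direction):
    """Orthogonal distance from p to the infinite line origin + t * direction."""
    v = sub(p, origin)
    t = dot(v, direction) / dot(direction, direction)
    foot = tuple(o + t * d for o, d in zip(origin, direction))
    return norm(sub(p, foot))


def drilling_metrics(p1, p2, c1, c2):
    """Accuracy metrics for one drill hole (all coordinates in mm).

    p1, p2: planned entry and end point (assumed meaning of P1, P2).
    c1, c2: conducted entry and end point (assumed meaning of C1, C2).
    """
    planned = sub(p2, p1)
    conducted = sub(c2, c1)

    # Angle between the planned and the conducted drilling vector.
    cos_a = dot(planned, conducted) / (norm(planned) * norm(conducted))
    angular = math.degrees(math.acos(max(-1.0, min(1.0, cos_a))))

    # Orthogonal distances of the conducted entry/end points to the planned axis.
    entry_proj = point_to_line_distance(c1, p1, planned)
    end_proj = point_to_line_distance(c2, p1, planned)

    # Assumed sign convention: distance from P2 to the projection of C2
    # on the planned axis, positive when the drill went deeper than planned.
    depth = dot(sub(c2, p2), planned) / norm(planned)

    return {
        "angular_deg": angular,
        "entry_projected_mm": entry_proj,
        "end_projected_mm": end_proj,
        "max_projected_mm": max(entry_proj, end_proj),
        "entry_euclidean_mm": norm(sub(c1, p1)),
        "depth_mm": depth,
    }


# Example: a hole parallel to a 25 mm plan, shifted 1 mm sideways, 1 mm too deep.
example = drilling_metrics((0, 0, 0), (0, 0, 25), (1, 0, 0), (1, 0, 26))
```

Because the Euclidean entry deviation also contains any offset along the planned axis, it can never be smaller than the projected entry deviation for the same hole, which matches the ordering of the reported means.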

Nine out of 720 drilling time periods were excluded from the analysis due to invalid logging. There was no significant difference in the time to locate the entry point between ARNS and CNS, recorded at 22.9 ± 11.4 s and 22.5 ± 11.6 s, respectively (Mann–Whitney U test, p = 0.737; Fig. 5a). However, there was a significant difference in drilling completion time, with ARNS taking 12.2 ± 11.2 s and CNS taking 15.3 ± 16.0 s (Mann–Whitney U test, p < 0.001; Fig. 5b). There was no significant difference in NASA-TLX workload score between ARNS at 49.67 ± 15.76 and CNS at 55.40 ± 14.87, with a mean difference of 5.73 (unpaired t-test, p = 0.117; Fig. 5c; Supplementary Table 1). However, the SUS score for overall user experience was significantly higher for ARNS at 77.64 ± 15.78 than for CNS at 64.65 ± 19.17 (Mann–Whitney U test, p = 0.003; Fig. 5d).

Fig. 5: Comparison of efficiency.

a–d Comparison of time and subjective scales between ARNS (blue) and CNS (yellow) methods (x-axis). Each violin plot (colored) includes a boxplot (white) with a red point marking the mean value. The black points are outliers. a Results of the time to find the entry point with the ARNS and CNS methods (y-axis). b Results of the time to drill until reaching the planned depth (y-axis). c Results of subjective workload by NASA-TLX, where higher values correspond to greater workload (y-axis). d Results of subjective assessments of overall user experience by the System Usability Scale, where higher values correspond to better user experience (y-axis).

Questionnaires

The Likert-type questions (scored from 1 to 4, with 1 indicating strong disagreement and 4 indicating strong agreement) revealed that ARNS was generally preferred over CNS. Specifically, ARNS was rated higher than CNS for finding the entry point (3.4 vs. 2.8), setting the orientation (3.4 vs. 2.9), being easy to use (3.4 vs. 2.9), being beneficial for drilling (3.5 vs. 2.9), integrating (3.4 vs. 2.8), being intuitive (3.4 vs. 2.8), and drilling more accurately (2.9 vs. 2.5). However, ARNS did not show clear advantages in safety perception (3.2 vs. 2.8), systematic complexity (1.6 vs. 1.9), drilling interference (1.5 vs. 1.8), need for technical assistance (2.1 vs. 2.1), and need for learning (2.0 vs. 2.4); this did not change the overall more favorable rating of ARNS compared with CNS on the other items (Table 3).

Table 3 Likert Questionnaire

In addition, different positive and negative feedback about the ARNS and CNS were provided (Table 4). Correspondingly, in total 27 (75%) participants preferred the ARNS, while 9 (25%) participants preferred the CNS (Table 2). An additional analysis of the questionnaires by profession is available in the Supplementary Information (Supplementary Tables 1 and 2).

Table 4 Open questions summarized

Discussion

To our knowledge, this is the first crossover RCT dedicated to systematically comparing ARNS and CNS with respect to the accuracy and efficiency of trajectory drilling. The principal findings revealed no significant difference in translational deviation. However, CNS demonstrated better performance in angular and depth deviation compared to ARNS, although these differences generally did not have significant clinical implications, except for cases involving outliers. For instance, during the placement of pedicle screws, the differences between the two systems should not result in a deviation exceeding 2 mm, thus allowing both systems to be categorized under the same classification38. Notably, drilling was significantly faster with ARNS, but without clinical relevance. In practical terms, ARNS and CNS were effectively comparable in their drilling performance. NASA-TLX workload was comparable between the two methods, while ARNS markedly outperformed CNS in terms of overall user experience. In general, 75% of the participants preferred the ARNS, while only 25% of the participants preferred the CNS.

The need for such a confirmatory study arises from the limitations observed in previous research, which employed exploratory designs27,28,30,31,32,33,39, including limited participants and/or lack of control groups, thereby compromising the reliability and validity27,28,31,33,39. On the contrary, our study adopted a confirmatory design, with pre-registration of a study protocol, properly powered based on sample size calculation and predefined effects, ensuring reliable and valid results in the systematic comparison of an ARNS with a CNS as control. This rigorous approach is key to definitively assessing the accuracy and efficiency of ARNS in trajectory drilling and providing solid insights into the field.

In the past, drilling guided by CNS with optical tracking has demonstrated good accuracy in many surgical fields17, including pedicle screw placement40 or zygomatic implant surgery41. According to the findings of various studies, the range of translational and angular deviation has been reported to be between 1.27 and 6.43 mm and between 2.68 and 3.09° for the CNS22,40,41,42. Our CNS evaluated by post-CT showed good results among the above-mentioned studies, thereby making it a valid control. Similarly, the feasibility of ARNS was demonstrated by a number of studies, yet without a comparison to CNS. The range of translational and angular deviation was reported to be between 1.4 mm and 2.77 mm, and between 3.0° and 3.8°, respectively28,32,33,39. In contrast, our ARNS obtained results in translational deviation of 0.95 ± 0.42 mm, which together with angular deviation of 1.11 ± 0.61° outperformed the ARNS in the above-mentioned feasibility studies.

Nevertheless, only two other studies besides ours compared ARNS with CNS for surgical drilling (Table 1). Yet, it is necessary to consider the individual settings and factors (registration, tracking system used, drill skiving, type of visualization, etc.) that inevitably differ between studies. Mueller et al. found no significant difference between ARNS (3.4 ± 1.6 mm/4.3 ± 2.3°) and CNS (3.2 ± 2.0 mm/3.5 ± 1.4°). However, the study of Mueller et al. was limited by the low accuracy of marker-based tracking (>2.0 mm/>2.0°) in ARNS and the limited number of participants (1 experienced surgeon)27. Wolf et al. claimed that the ARNS (virtual twin: median angular deviation 0.9/1.2°) outperformed the CNS (median: 1.7/1.8°) for the two study groups (experts/novices). In comparison, our ARNS and CNS achieved better accuracy, with median angular deviations of 0.97° and 0.68°, respectively. Although the results of Wolf et al. provide good insight into optimal ARNS visualization, the validity of the results is limited because their evaluation was based solely on tracking data (4% of measurements had to be excluded due to incompleteness by Wolf et al.30). Evaluating tracking data instead of postoperative CTs potentially introduces additional errors30. This may compromise the translation of findings into real-world clinical settings. In addition, only angular deviations were evaluated30; depth and translational information was lacking due to the exclusive use of a tracked sleeve, the relevance of which depends to some extent on the surgical scenario.

Our ARNS benefits from incorporating the virtual twin, state-of-the-art imaging-based registration (IBR), and optical tracking. Many studies used procedural registration methods to map the image data to the patient, such as Paired Point Registration (PPR)30,31 and Iterative Closest Point (ICP)32,33,39. However, these methods are prone to human error and inaccuracy, which reduces the overall accuracy during navigation. In contrast, other studies, like ours, used IBR, which is widely used in surgical navigation systems43,44. Compared to procedural registration such as ICP or PPR, IBR has been shown to be more accurate, improving the accuracy of both of our methods, but at the cost of additional radiation exposure45. Apart from registration, tracking itself is another possible source of inaccuracy. In contrast to optical tracking, 2D fiducial marker-based tracking has demonstrated a wide range of root mean square error (RMSE), from 0.87 mm to more than 10 mm46, although it is commonly used in studies27,33,39. Most of the optical tracking cameras available on the market, however, achieve an RMSE of less than 0.5 mm; the fusionTrack 500 used in our study stands out among them with an RMSE of 0.09 mm (at distances of up to 2 meters)47.

Although superimposition is often perceived as the intuitive way of AR visualization in healthcare applications and is used in many studies, it has many drawbacks, especially when using optical see-through (OST) HMDs48. The holographic overlay can be prone to focus rivalry and vergence-accommodation conflict, which can disrupt the surgeon's depth perception, distort the observation of important anatomical structures and increase cognitive load. Moreover, superimposition introduces additional errors in the registration between virtual and physical counterparts. In the literature, the reported registration accuracy was in the range of 0.62 to 6.93 mm in translation, and the average angular accuracy was in the range of 1.32° to 6.80°49, which could have a significant impact on the final navigation accuracy. Overall, this may subsequently lead to a reduction in accuracy50,51. Our ARNS, however, adopted a “virtual twin”, which is free from this registration error in OST, as everything takes place in virtual space. Correspondingly, Wolf et al. reported that the “superimposition” visualization was the most distracting of five possible ARNS visualization approaches, making it difficult for users to locate the tooltip. In contrast, a virtual twin ARNS visualization performed better in terms of orientation and was rated higher in terms of overall user experience and cognitive load30.

Another source of error may be skiving, which may have occurred in the aforementioned studies. Skiving is the displacement of the drill during drilling caused by the geometry of the drill tip52,53. Skiving can be reduced by using a sharper drill tip53. Therefore, the tip was sharpened from 118° to 90° in this study. Furthermore, the solely navigated sleeve configuration in some studies could also account for the limited accuracy27,28,30,39, where the lack of real-time depth information and tolerance introduces error into the final results54. By using sleeves, tolerance errors and increased friction from sleeves could compromise accuracy54. The use of sleeves raises temperatures, which could potentially result in thermal osteonecrosis11,55. If the sleeve is tracked instead of the drill, real-time depth information is not provided by the navigation system. If the sleeve does not limit the allowable depth, this could result in nerve damage in spine and dental implant surgery and cortical bone penetration in orthopedic surgery17,56,57.

Furthermore, due to limitations of the trial design in the aforementioned studies, confounding factors were not addressed. Therefore, the observed differences between ARNS and CNS could be caused by bias. In contrast, our RCT, which addressed confounding, shows that there is no significant difference in translational deviation. However, we found a significant difference in the mean angular deviation of the ARNS compared to the CNS (1.11° vs. 0.73°). In terms of depth, our ARNS was found to overdrill by an average of 1.27 mm compared to the CNS with an average of 0.52 mm. Although the ARNS results showed a mean difference in the sub-millimeter/-degree range for depth deviation (0.75 mm) and angular deviation (0.39°) compared to CNS, from a clinical perspective and excluding outliers, this difference would not be clinically relevant in the vast majority of surgical scenarios. For example, in pedicle screw placement surgery, a breach of less than 2 mm is considered acceptable38.

Increased deviations for target orientation and target depth with ARNS compared to CNS may have been caused by differences in the graphical user interface (GUI). GUI differences are the result of the different display types (i.e., external 2D view, first-person 3D view, compensatory top-down display), which may also result in different scaling of the navigation information. However, we did not find any significant differences in translation errors. Yet, we found other possible explanations for the observed differences. For example, the depth guidance in ARNS was based on the spatial position of the virtual drill and the planned position, as well as a color transition from yellow to green (correct depth) and then to red (too deep). Interestingly, four participants reported that depending on the positioning of the HL2 on the head, the colors were distorted (chromatic aberration)58. This caused yellow objects to appear green, making it difficult for the study participants to judge when the correct target depth had been reached, and may explain some of the outliers in depth for the ARNS. This is a technical limitation of HL2, which only became evident after the inclusion of a larger number of participants.

Although stereoscopic visualization is seen as an advantage for hand-eye coordination, it could be a limitation in this task. In the compensatory display of the CNS, the angular deviation was seen from a top-down view, so that deviations in all spatial directions were detected immediately. In the ARNS, however, the view was almost always from one direction, depending on the position of the user. In addition, the stereoscopic vision of the participants (despite passing the Lang Stereotest II) may have introduced a further error. Both could have been reasons for the reduced angular accuracy.

In the subgroup analysis of accuracy, there was no significant difference between the student and surgeon groups (Fig. 3a–e, but not f). However, the engineer group performed significantly worse in translational deviation than the other two groups (Fig. 3a–d). One reason may be the superior hand-eye coordination of the surgeons resulting from their accumulated experience in surgical training. Interestingly, the students performed best in depth deviation, probably due to the younger age of this group and the associated faster motor reaction time (LMM adjusted for age, age p = 0.010; Fig. 4c). This finding is consistent with studies of motor reaction time and age in the literature59.

Nevertheless, these possible causes of sub-millimeter/-degree differences in angular and depth deviation between ARNS and CNS could probably be resolved in the future by optimizing the GUI and HMD hardware used, approaching the limits of the technical accuracy of the registration method and the optical tracking system. In summary, although the ARNS is slightly inferior to the CNS on angular and depth deviations, it is clinically close to the specifications of the CNS. Yet, ARNS was preferred more often and had better user experience regardless of professional background, probably due to its easy to understand and intuitive visualization. However, only 58.3% of the surgeons preferred ARNS while 41.7% preferred CNS. The reason for this was that all surgeons (n = 3) with no previous experience in CNS preferred the CNS. Among surgeons with previous experience in CNS (n = 9), 77.8% (n = 7) preferred the ARNS, while surgeons in general (including those without experience in surgical navigation) had mixed opinions. Furthermore, the average surgical experience of the surgeons who preferred ARNS to CNS was slightly higher (12.0 years) compared to those who preferred CNS (10.4 years).

A limitation of this study is that, although we were able to demonstrate the technical performance of ARNS on a phantom using postoperative CTs as well as its usability benefit, the potential benefit in a real-life scenario such as zygomatic implantation or pedicle screw placement still needs to be demonstrated. Such studies must account for real operating room settings, in which target surgical sites have complex structures instead of a flat surface.

Finally, all studies, including ours, have been conducted as superiority studies. For clinical use, a non-inferiority study (calculating confidence intervals instead of inference statistics) in particular could be one of the leading aspects for integration into the clinical workflow. In this regard, the findings and results of our study can be used for a first non-inferiority study in the future.

In this RCT, we were able to provide, for the first time, compelling evidence that ARNS and CNS have comparable accuracy in translational error. The observed differences in angular and depth deviation are probably due to limited stereoscopic vision, hardware and setup limitations, and the design concept of the ARNS. In the future, these factors could be addressed by adapting the hardware and guidance. Nevertheless, the ARNS was preferred over the CNS by most participants and by the majority of surgeons with previous navigation experience, with a significantly better overall user experience. Altogether, depending on the accuracy required, ARNS could be a viable alternative to CNS for guided trajectory drilling.

Methods

36 subjects with different professional backgrounds and different levels of manual dexterity, who were right-handed and passed the Lang Stereotest II (assessment of spatial vision) (LANG-STEREOTEST AG, Switzerland), were recruited and performed trajectory drilling on a block phantom in a randomized cross-over order with ARNS and CNS (Fig. 6).

Fig. 6: CONSORT flow diagram.

CONSORT flow diagram illustrating the flow of participants from enrollment, through allocation, including cross-over, to follow-up and analysis. This diagram was created according to the requirements of the CONSORT reporting guidelines and modified for crossover studies.

This study was approved by the local ethical commission of the University Hospital RWTH Aachen (EK 23-011; Chairman Prof. G. Schmalzing; approval date, 31.01.2023), has been registered in advance in the German Clinical Trial Register (DRKS00031357) with study protocol and followed the CONSORT 2010 guidelines and its extension for crossover studies60,61. Informed consent was obtained from all subjects involved in the study.

Navigation systems

The ARNS system was developed through an interdisciplinary collaboration between medical engineers, medical informaticians, and multiple clinicians at a university hospital with extensive experience in surgical navigation systems. The ARNS software was programmed using C# in Unity (v2021.3.30f1) and was deployed on a HoloLens 2 (HL2) (Microsoft Corporation, Redmond, WA, USA). Visualization of a virtual twin was implemented as suggested by Wolff et al.30, with a holographic scene manually placed adjacent to the surgical field in a standardized manner (directly next to the phantom block on the opposite side of the patient; see Fig. 1a). The holographic scene displayed the phantom block as a 3D block outlined in blue, anchored stationary in the physical environment by the inherent function of the HL2. The virtual drill (yellow) was positioned relative to the virtual phantom block. This was based on the relative coordinates of the real drill and the real phantom block obtained in real time by the optical tracking system (Fig. 1a). Due to the use of relative coordinates, it was not necessary to transfer them to HL2 space. For user guidance, the translational and the depth deviation have been designed to be limited to 1 mm and the angle deviation to 1 °, respectively, which was considered safe and acceptable range38. To achieve this, three elements were present at the tip, middle, and tail of the drill, denoted by the lower torus, middle torus, and upper torus respectively. The adoption of the three tori was motivated by the study of Tu et al.31. The lower and upper tori adjusted their radius based on the shortest distance to the drill, with the inner radius having the exact diameter of the drill when the distance was ≤2 mm and changing accordingly when the distance was >2 mm. In addition, the color of the tori changed from yellow to green when the distance was ≤1 mm. 
The color of the virtual drill changed from yellow to green when the depth exceeded the target depth by 0–1 mm and to red when it exceeded the target depth by >1 mm (Fig. 1b, Supplementary Movie 1). The choice of color feedback was motivated by color associations62.
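The feedback thresholds described above can be summarized in a short Python sketch. The function names and the encoding of colors as strings are illustrative assumptions and do not reproduce the ARNS implementation (which was written in C# in Unity); only the 1 mm thresholds, the 25 mm target depth, and the color transitions are taken from the text. The sketch further assumes that the drill stays yellow until the target depth is reached.

```python
def torus_color(distance_mm: float) -> str:
    """Color of a guidance torus given its shortest distance to the drill.

    Hypothetical helper: green within the 1 mm tolerance, yellow otherwise.
    """
    return "green" if distance_mm <= 1.0 else "yellow"


def drill_color(depth_mm: float, target_depth_mm: float = 25.0) -> str:
    """Color of the virtual drill given the current drilling depth.

    Assumption: the drill is yellow while the target depth is not yet reached.
    """
    overshoot_mm = depth_mm - target_depth_mm
    if overshoot_mm < 0.0:
        return "yellow"  # target depth not reached yet
    if overshoot_mm <= 1.0:
        return "green"   # 0-1 mm beyond the target depth
    return "red"         # more than 1 mm beyond the target depth
```

For example, `drill_color(25.5)` falls into the 0–1 mm window and yields green, whereas `drill_color(26.5)` yields red.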

The CNS software was based on existing C++ software developed at the Chair of Medical Engineering (mediTEC). It was adjusted to meet the requirements of this study and to follow a guidance logic similar to that of the ARNS software. The CNS displayed the navigation information from the same optical tracking camera as the ARNS on an external 2D monitor (Fig. 1c). Like many other CNSs used in the OR, it featured a quad-display interface, with three windows showing orthogonal multiplanar reconstructions (MPR) of cone-beam computed tomography (CBCT) scans of the phantom block, accompanied by a fourth compensatory display designed for precise trajectory correction. The MPR views depicted the real-time position of the drill (yellow) relative to the preplanned trajectory (purple) and nearby structures of the phantom model, which, along with optical tracking, helped to precisely locate the entry point and perform accurate drilling. The compensatory display introduced a dual-circle visualization consisting of a sphere and a ring, representing the tip and the tail of the drill, respectively, to assist alignment of the drill with the trajectory in a top-down view. In addition, a depth slider visualized the current depth of the drill relative to the planned depth, and a peripheral ring indicated depth. This setup adopted the same color-coded feedback mechanisms as the ARNS: if either the central ring or the sphere had an alignment error ≤1 mm, it turned from yellow to green. To ensure depth accuracy, the depth indicator of the peripheral ring turned green to denote depth discrepancies within the 0–1 mm range and shifted to red for deviations >1 mm, thereby offering a clear, intuitive real-time cue for depth (Fig. 1d, Supplementary Movie 1).

The registration process between the tracker on the phantom and the CBCT data was identical for both systems. The CBCT scan was acquired using the Surgivisio imaging system (eCential Robotics, Gières, France), which performed auto-registration. The auto-registration was based on a fixed patient reference with an attached calibration phantom, as described in the literature43,63. The trajectories were planned in the coordinate system of the patient reference. A server software written in C++ sent the tracking data from an external optical tracking camera (fusionTrack 500, Atracsys LLC, Switzerland) via Wi-Fi to the ARNS and CNS clients. Both clients computed the spatial transformation in exactly the same way and then displayed the result using their respective visualization methods.
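The relative-coordinate computation shared by both clients can be illustrated with a minimal Python sketch. It assumes that the tracking camera reports each marker pose as a rotation matrix and a translation vector in camera coordinates; the pose representation and all function names are hypothetical and are not taken from the study software.

```python
from typing import List, Tuple

Matrix3 = List[List[float]]
Vector3 = List[float]
Pose = Tuple[Matrix3, Vector3]  # (R, t) such that x_camera = R * x_local + t


def mat_vec(m: Matrix3, v: Vector3) -> Vector3:
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(m[i][k] * v[k] for k in range(3)) for i in range(3)]


def mat_mat(a: Matrix3, b: Matrix3) -> Matrix3:
    """Multiply two 3x3 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def invert(pose: Pose) -> Pose:
    """Invert a rigid transform: (R, t)^-1 = (R^T, -R^T t)."""
    r, t = pose
    r_inv = [[r[j][i] for j in range(3)] for i in range(3)]
    return r_inv, [-x for x in mat_vec(r_inv, t)]


def compose(a: Pose, b: Pose) -> Pose:
    """Return the transform that applies b first and then a."""
    (ra, ta), (rb, tb) = a, b
    return mat_mat(ra, rb), [x + y for x, y in zip(mat_vec(ra, tb), ta)]


def drill_in_phantom(cam_T_phantom: Pose, cam_T_drill: Pose) -> Pose:
    """Pose of the drill expressed in the phantom reference frame."""
    return compose(invert(cam_T_phantom), cam_T_drill)
```

Because the camera pose cancels out in `drill_in_phantom`, the result is unchanged if both marker poses are expressed in any other common frame, which is consistent with the relative coordinates not having to be transferred to the coordinate space of the display device.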

Sample size calculation

The sample size calculation was performed in R (version 4.3.1, www.r-project.org). The data needed to calculate the sample size were taken from a preliminary test with six participants. After excluding the first trajectory for practice, 60 drilled trajectories were evaluated, one of which was excluded due to technical failure of the system. The CNS achieved a maximum projected translational deviation of 1.45 ± 0.70 mm (mean ± SD) and the ARNS one of 1.75 ± 1.28 mm. A simulation-based power analysis was performed to plan the number of cases for a linear mixed-effects model (LMM) with the lmerTest package64. This LMM assessed the maximum projected translational deviation, with navigation system, starting navigation system (ARNS or CNS), and group (surgeon, student, engineer) as fixed effects, and sequence (the order in which the trajectories were drilled) and subject as random effects. With a significance level of α = 0.05, this resulted in 36 participants (including 4 to account for dropout) to achieve a power of 80%.

Trial

The study was conducted in a surgical setting using a phantom resembling cortical bone (PUR modeling board M330, Sika AG, Baar, Switzerland). This phantom was attached to a tracked 3D-printed frame, and the assembly was placed on a Vertebroplasty Trunk model (Sawbones Corporate, Washington, U.S.) situated on a table. The scenario consisted of six trajectories (25 mm length, 3 mm diameter), which were pre-planned and registered with the Surgivisio CBCT imaging system (eCential Robotics, Gières, France). The trajectories were angled at 10–15 degrees relative to the normal of the phantom’s upper surface. Each subject conducted the drilling with a tracked cordless handheld drill (Colibri II, DePuy Synthes, Indiana, US). To reduce skiving during drilling53, a 3.0 mm diameter drill bit with a 90° tip angle was used (Craftomat metal drill HSS-R Speed, BAHAG AG, Germany).

Each subject, stratified by profession, was randomly assigned to start with one of two navigation systems according to an urn randomization rule (sampling without replacement). The two blocks were balanced to 18 participants each, 6 in each professional group. The random allocation was planned and performed by B.P.
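A stratified urn allocation of this kind can be sketched in Python as follows. This is a hypothetical illustration, not the procedure actually used for the allocation: the function name, the seed handling, and the representation of the urn as a shuffled list are assumptions, while the group labels and the balance of 6 participants per starting system and professional group follow the text.

```python
import random
from typing import Dict, List, Optional, Sequence


def allocate_starting_system(
    groups: Sequence[str],
    per_arm: int = 6,
    seed: Optional[int] = None,
) -> Dict[str, List[str]]:
    """Draw the starting system for every participant, stratified by group.

    Each professional group gets its own urn holding `per_arm` tokens per
    system. Reading a shuffled urn in order is equivalent to drawing the
    tokens one by one without replacement.
    """
    rng = random.Random(seed)
    allocation: Dict[str, List[str]] = {}
    for group in groups:
        urn = ["ARNS"] * per_arm + ["CNS"] * per_arm
        rng.shuffle(urn)
        allocation[group] = urn  # i-th entry: start system of i-th participant
    return allocation


if __name__ == "__main__":
    plan = allocate_starting_system(["surgeon", "student", "engineer"], seed=2023)
    for group, order in plan.items():
        print(group, order)
```

By construction, every group ends up with exactly 6 participants starting with each system, so both blocks contain 18 participants in total.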

Participants received a brief introduction to the respective system before proceeding with the drilling tasks. The first drilled hole was for practice and was not included in the later evaluation. Afterward, participants drilled five consecutive holes, with the system automatically displaying the next planned trajectory each time the drill was removed from the drill hole. In this way, two time periods were automatically recorded for each drilling task. The first period lasted from the start of each drilling task until the drill had penetrated to a depth of 3 mm and was referred to as the time to find the entry point. The second period lasted from the moment the depth reached 3 mm until the target depth (25 mm) was reached and was referred to as the drilling time. If the target depth was not reached, the start time of the next trajectory was used as the end time of the current trajectory.
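The split into the two recorded periods can be sketched in Python as shown below. The log format (a time-sorted list of timestamp and depth pairs whose first sample marks the start of the task) and the function name are assumptions made for illustration; the 3 mm and 25 mm thresholds and the fallback to the start time of the next trajectory follow the description above.

```python
from typing import Optional, Sequence, Tuple


def split_task_times(
    samples: Sequence[Tuple[float, float]],
    next_start_s: Optional[float] = None,
    entry_depth_mm: float = 3.0,
    target_depth_mm: float = 25.0,
) -> Tuple[float, float]:
    """Return (time to find the entry point, drilling time) in seconds.

    `samples` holds (time_s, depth_mm) pairs sorted by time; the first
    sample is assumed to mark the start of the drilling task.
    """
    start_s = samples[0][0]
    # First moment at which the drill has penetrated to the entry depth.
    entry_s = next((t for t, d in samples if d >= entry_depth_mm), None)
    if entry_s is None:
        raise ValueError("entry depth was never reached")
    # First moment at which the target depth has been reached.
    target_s = next((t for t, d in samples if d >= target_depth_mm), None)
    if target_s is None:
        # Fallback from the text: use the start of the next trajectory.
        if next_start_s is None:
            raise ValueError("target depth not reached and no next start given")
        target_s = next_start_s
    return entry_s - start_s, target_s - entry_s
```

For a log in which 3 mm is reached after 6 s and 25 mm after 14 s, the function returns 6 s for finding the entry point and 8 s of drilling time.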

Upon completion of the six drilling tasks with one navigation system, the subject completed a weighted NASA-TLX questionnaire to measure workload, the SUS, and a Likert questionnaire for qualitative assessment. The other navigation system was then tested using the same procedure. Finally, the subject completed an open-ended questionnaire to qualitatively compare the two systems.

Evaluation

After the trial, all phantom blocks were scanned with the abovementioned CBCT scanner. To enhance the visibility of the drilling trajectories in the CT scan, rods 3D printed by a Prusa SLS SPEED were inserted. The rods' tips were designed to be flat, without the 1.5 mm long conical sharp tip (90°), to allow for maximum insertion into the drill hole, as debris could obstruct proper placement. The acquired CT scans were analyzed in 3D Slicer (v5.2.2). Two independent investigators (Y.L. and P.B.) evaluated the conducted trajectories before comparing them to the ground truth.

The rods were segmented using the SegmentEditorExtraEffects in 3D Slicer by outlining their path and then exported as OBJ meshes. If the results of the two investigators differed by ≥0.5 mm in translational deviation or by ≥1.0° in angular deviation, B.P. checked the discrepancy, and Y.L. and P.B. then repeated the segmentations to reduce segmentation bias. Supplementary Fig. 1a, b show the conducted and planned trajectories, which were all displayed with the trajectory mesh for clear comparison.

Afterward, the models were automatically compared to the planned trajectories using a Python script (v3.6.1) with VTK packages. The calculations were performed as follows: \({P}_{1}\) and \({P}_{2}\) represented the entry point and the endpoint of the planned trajectory, whereas those of the conducted trajectory were denoted as \({C}_{1}\) and \({C}_{2}\). The entry points of both the planned and the conducted trajectories were defined as the points where the trajectory mesh penetrated the block's upper surface. \({C}_{2}\) was identified as the furthest point along the central axis plus an additional 1.5 mm offset to account for the flat ends of the rods (Fig. 2d).
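The determination of the endpoint, including the 1.5 mm offset, can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions and not the evaluation script of the study: it takes the entry point, the axis direction, and the rod mesh vertices as given inputs instead of reading them with VTK, and the function name is hypothetical.

```python
import math
from typing import List, Sequence


def endpoint_with_offset(
    entry: Sequence[float],
    direction: Sequence[float],
    vertices: Sequence[Sequence[float]],
    tip_offset_mm: float = 1.5,
) -> List[float]:
    """Estimate the endpoint of a conducted trajectory from rod vertices.

    The rod vertices are projected onto the central axis (through `entry`,
    pointing along `direction` into the block). The furthest projection is
    extended by `tip_offset_mm` to account for the flat end of the rod.
    """
    length = math.sqrt(sum(c * c for c in direction))
    unit = [c / length for c in direction]
    # Signed distance of every vertex from the entry point along the axis.
    along = [sum((v[i] - entry[i]) * unit[i] for i in range(3)) for v in vertices]
    depth_mm = max(along) + tip_offset_mm
    return [entry[i] + depth_mm * unit[i] for i in range(3)]
```

For a rod whose deepest vertex lies 23.5 mm below the entry point along the axis, the estimated endpoint is located at a depth of 25 mm.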

The maximum projected translational deviation (Fig. 2a) was the larger of the projected translational deviations at the two ends (Fig. 2b, c), defined as the minimum radius of the inclusion cylinder around the planned trajectory that covers the conducted trajectory. The projected translational deviation was the shortest distance from \({C}_{1}\) or \({C}_{2}\), respectively, to the central axis of the planned trajectory (Fig. 2b, c). In addition, the translational deviation at the entry point was evaluated as the 3D Euclidean distance between \({P}_{1}\) and \({C}_{1}\) (Fig. 2d). The angular deviation in degrees was calculated as the angle between the planned and conducted trajectories, using the following formula: \({Angular\; deviation}={\cos }^{-1}\left(\frac{\vec{A}\cdot \vec{B}}{\Vert \vec{A}\Vert \, \Vert \vec{B}\Vert }\right)\), where \(\vec{A}\) and \(\vec{B}\) were the vectors along the central axis of the planned and the conducted trajectory, respectively (Fig. 2e). The depth deviation was calculated as the distance between \({P}_{2}\) and \({C}_{3}\), where \({C}_{3}\) is \({C}_{2}\) projected onto the planned trajectory axis (Fig. 2f).
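The deviation measures defined above can be reproduced with a short Python sketch. It is an illustrative re-implementation based only on the definitions in the text, not the original evaluation script; the function names and dictionary keys are hypothetical, and the points are assumed to be given as 3D coordinates in millimeters.

```python
import math
from typing import Dict, List, Sequence

Point = Sequence[float]


def sub(a: Point, b: Point) -> List[float]:
    return [a[i] - b[i] for i in range(3)]


def dot(a: Point, b: Point) -> float:
    return sum(a[i] * b[i] for i in range(3))


def norm(a: Point) -> float:
    return math.sqrt(dot(a, a))


def project_onto_line(point: Point, start: Point, end: Point) -> List[float]:
    """Orthogonal projection of `point` onto the line through start and end."""
    axis = sub(end, start)
    s = dot(sub(point, start), axis) / dot(axis, axis)
    return [start[i] + s * axis[i] for i in range(3)]


def deviations(p1: Point, p2: Point, c1: Point, c2: Point) -> Dict[str, float]:
    """Deviation measures between planned (P1, P2) and conducted (C1, C2)."""
    proj_entry = norm(sub(c1, project_onto_line(c1, p1, p2)))
    proj_end = norm(sub(c2, project_onto_line(c2, p1, p2)))
    a, b = sub(p2, p1), sub(c2, c1)
    # Clamp to [-1, 1] so rounding errors cannot break acos.
    cos_angle = max(-1.0, min(1.0, dot(a, b) / (norm(a) * norm(b))))
    c3 = project_onto_line(c2, p1, p2)  # C2 projected onto the planned axis
    return {
        "projected_entry_mm": proj_entry,
        "projected_end_mm": proj_end,
        "max_projected_mm": max(proj_entry, proj_end),
        "entry_mm": norm(sub(p1, c1)),
        "angle_deg": math.degrees(math.acos(cos_angle)),
        "depth_mm": norm(sub(p2, c3)),
    }
```

For a conducted trajectory that runs parallel to the planned one at a lateral offset of 1 mm and stops 1 mm short, all translational measures and the depth deviation equal 1 mm, and the angular deviation is 0°.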

Statistical analysis

The statistical analysis was also performed in R. The translational deviations between the planned and conducted entry points and endpoints, the angular deviation, and the depth deviation were evaluated with an LMM as described above for the sample size calculation. The perceived workload in NASA-TLX was evaluated by an unpaired t-test. Time and SUS scores were compared between systems using Mann–Whitney U tests. Normal distribution was tested using the Shapiro–Wilk test. A p-value < 0.05 was considered significant.