Introduction

Artificial intelligence (AI)-based tools are becoming increasingly prevalent in clinical practice. In obstetrics and perinatology, they frequently appear as software in ultrasonographs, aiding in the diagnosis of the fetus and the reproductive organs. During fetal diagnostics, such software can assist in probe guidance, perform automatic measurements once the plane has been determined, and delineate appropriate planes from an acquired 3D volume1,2. Deep learning-based software can also locate anatomical structures in the fetal heart and central nervous system and extract planes from three-dimensional volumes of these organs3. Currently available software can also classify scans of the central nervous system as abnormal4.

SonoCNS (GE Healthcare, Chicago, IL, USA) is an artificial intelligence-based tool for the automatic measurement of fetal head structures. After acquiring the volume of the fetal head (acquisition is conducted upon visualization of the thalamic plane of the fetal head) [Fig. 1], the software automatically determines the appropriate planes and measures the dimensions of the central nervous system structures. The system allows for measurement of: biparietal diameter (BPD), occipitofrontal diameter (OFD), head circumference (HC), cisterna magna (CM), transverse cerebellar diameter (TCD), and posterior horn of the lateral ventricle diameter (Vp) [Fig. 2]. Additionally, the planes created by the software can be useful in assessing midline structures of the central nervous system such as the corpus callosum and the cavum septi pellucidi. In the mid-sagittal section, the system also delineates the fetal profile, on which manual measurements of facial features, such as the nasal bone or frontomaxillary angle, can be performed.

The examination begins by obtaining a thalamic plane of the fetus with the angle of the ultrasonographic beam close to 90 degrees in relation to the midline of the brain. Then, using a 3D gate, the area containing the entire obtained plane is marked. The system automatically retrieves the volume and then delineates additional planes – sagittal, transcerebral, and transventricular. This is the moment when the operator can verify the accuracy of the delineated planes. After pressing the “start measurements” button, the system locates the specified structures and measures them. This can then be confirmed by the operator and added to the final examination report.

Fig. 1

Transthalamic ultrasound scan of the fetal head with the area marked for 3D volume acquisition.

Fig. 2

The SonoCNS software autonomously delineates the appropriate planes from the acquired volume and applies automatic measurements.

Aim

The aim of the study was to assess the repeatability of measurements of the central nervous system structures when using SonoCNS and to test the reproducibility during the simultaneous manual and automatic assessment. An additional goal was to compare the performance of the software in the second and third trimesters of pregnancy and the timing of automatic and manual measurements.

Materials and methods

The study included patients undergoing routine midtrimester and third-trimester screening at the Department of Obstetrics and Gynecology, Provincial Combined Hospital in Kielce. We obtained approval for the study from the Bioethics Committee at the Jan Kochanowski University in Kielce (25/2022). All methods were performed in accordance with relevant local regulations and the guidelines of the ethics committee. All participants provided informed consent for ultrasonographic diagnosis and participation in the study. The examinations were conducted using a Voluson E10 ultrasonographic device (General Electric), software version BT20EC350, with an RM6C-D volume probe.

The examinations were performed by two sonographers, each with over seven years of experience in performing screening ultrasound examinations in pregnancy. In addition to manual measurements of CNS structures, the sonographers also performed automatic measurements twice using the SonoCNS software. We assessed repeatability and reproducibility by calculating intraclass correlation coefficients (ICCs) between the manual measurement and the mean of the automatic measurements for intraobserver variability, and between the two automatic measurements for interobserver variability. ICC values were calculated separately for each parameter, both for all measurements obtained in the study and separately for second- and third-trimester examinations. In cases where the automatic measurement failed to obtain a given value, the measurement was not repeated, and this occurrence was marked in the database to quantify the number of such instances.
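The ICC model used by the statistical packages is not specified here. As a minimal sketch, assuming a two-way random-effects, absolute-agreement, single-measurement ICC (often written ICC(2,1)), the coefficient can be derived from the ANOVA mean squares as shown below. The function name `icc_2_1` and the input layout (one row per fetus, one column per measurement) are illustrative assumptions and not part of the study protocol.

```python
def icc_2_1(ratings):
    """Two-way random-effects, absolute-agreement, single-measurement ICC.

    `ratings` is a list of rows, one per subject; each row holds one value
    per measurement (for example [manual, automatic]).
    """
    n = len(ratings)        # number of subjects
    k = len(ratings[0])     # number of measurements per subject
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Sums of squares for subjects (rows), measurements (columns), residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))

    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )
```

Other ICC forms (consistency rather than absolute agreement, or average rather than single measurements) use the same mean squares with a different final ratio, so the values reported by SPSS or Statistica depend on the model selected there.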

Another variable assessed in the study was the measurement time. An assistant present during the examination (responsible for entering measurement data into the result-generating system) measured the examination time in the case of manual and automatic measurement.

Data analysis was performed using Statistica 13.1 (Tibco Software, Palo Alto) and SPSS 26 (IBM). Demographic data were presented as medians for quantitative variables and as percentages for qualitative variables. We used the interquartile range as a measure of dispersion. The Mann-Whitney U test was used to compare quantitative data. To assess intra- and interobserver variability, we calculated intraclass correlation coefficients (ICCs) along with 95% confidence intervals. ICC values were classified as follows: >0.9, excellent; 0.75–0.9, good; 0.5–0.75, moderate; <0.5, poor5. Based on the available literature, we assumed that interrater and intrarater agreement already exists (no study of ICC analysis found a consistency result of less than 0.5 for the main fetal biometry parameters obtained in the study, such as BPD and HC6,7). The minimum number of patients to be included in the study, calculated from a table developed by Bujang et al.8, was 83. Differences were considered statistically significant at p < 0.05.
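The classification thresholds above can be expressed as a small helper. This is a sketch only: the function names are hypothetical, and the handling of values falling exactly on a boundary (0.9, 0.75, 0.5) is an assumption, since the text does not state which category they belong to.

```python
def classify_icc(icc):
    """Map an ICC value to the reliability categories listed in the text.

    Assumption: a value exactly on a boundary is assigned to the category
    whose range starts at that boundary, except 0.9, which counts as "good"
    because "excellent" is defined as strictly greater than 0.9.
    """
    if icc > 0.9:
        return "excellent"
    if icc >= 0.75:
        return "good"
    if icc >= 0.5:
        return "moderate"
    return "poor"


def ci_contains(ci_low, ci_high, value=0.5):
    """Return True if a confidence interval includes `value`.

    When the 95% CI of an ICC includes 0.5, the ICC cannot be
    distinguished from 0.5 at the 5% level.
    """
    return ci_low <= value <= ci_high
```

Checking the confidence interval as well as the point estimate matters for parameters whose estimate sits near a category boundary, because the interval may span two categories.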

Results

The study included 381 patients, with 270 (70%) examined during the second trimester of pregnancy (gestational age 18–22 weeks) and 111 examined during the third trimester (gestational age 28–33 weeks). Operator A conducted 31% of the examinations (119 patients—80 in the second trimester and 29 in the third trimester), while Operator B conducted 69% (266 patients—190 in the second trimester and 82 in the third trimester).

The demographic characteristics of the study group are presented in Table 1. Overall repeatability and reproducibility for the entire cohort of patients, as well as separately for the second and third trimesters, are shown in Table 2. The ICC values for interobserver and intraobserver variability for BPD, HC, and OFD ranged from good to excellent in the general population and subgroups. Measurements of CM and Vp in both the general population and subgroups fell into the moderate and poor reproducibility categories. For some of these measurements in specific groups, the ICC was not significantly different from 0.5. TCD measurements were more varied, ranging from moderate to good reproducibility.

Automatic measurement failed in at least one instance in 11% of CM measurements, 8% of Vp measurements, 6% of TCD measurements, and 1% of BPD, HC, and OFD measurements. For parameters where a cutoff value is used to determine abnormality (CM and Vp), we counted the automatic measurements that were abnormal (> 10 mm) when the manual measurement was within the normal range. False positive results were observed in 1 case for Vp (0.2% of all obtained results, excluding cases where the system failed to measure) and 9 cases for CM (1.8%).
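The counting described above can be sketched as follows. The record layout (pairs of manual and automatic values, with `None` marking a failed automatic measurement) and the function name are hypothetical and do not describe the study database; the choice of denominators is likewise an assumption made for illustration.

```python
CUTOFF_MM = 10.0  # abnormality threshold for CM and Vp, as stated in the text


def summarize_automatic(pairs, cutoff=CUTOFF_MM):
    """Summarize failures and false positives of automatic measurements.

    `pairs` is a list of (manual_mm, automatic_mm) tuples, where
    automatic_mm is None if the system failed to measure.
    Assumptions: the failure rate is taken over all pairs, and the
    false positive rate over pairs with a successful automatic value.
    """
    failed = sum(1 for _, auto in pairs if auto is None)
    measured = [(m, a) for m, a in pairs if a is not None]
    # False positive: automatic value abnormal while manual value is normal.
    false_pos = sum(1 for m, a in measured if m <= cutoff and a > cutoff)
    return {
        "failure_rate": failed / len(pairs),
        "false_positive_rate": false_pos / len(measured) if measured else 0.0,
    }
```

Excluding failed measurements from the false positive denominator mirrors the wording "excluding cases where the system failed to measure".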

The average time for manual measurement was 63 s compared to 14 s for automatic measurement (the difference in time for data collection was significant, p < 0.001).
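For readers who wish to reproduce this kind of comparison, the following is a minimal sketch of a two-sided Mann-Whitney U test using the normal approximation with tie correction. It is an assumption that this matches the variant applied by the statistical packages, which may use exact or continuity-corrected p-values; the function name is illustrative.

```python
import math
from collections import Counter


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected).

    Returns (U, p), where U is the smaller of the two U statistics.
    """
    n1, n2 = len(x), len(y)
    pooled = sorted(list(x) + list(y))

    # Assign the average 1-based rank to each distinct value.
    ranks = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        ranks[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j

    r1 = sum(ranks[v] for v in x)
    u1 = r1 - n1 * (n1 + 1) / 2
    u2 = n1 * n2 - u1
    u = min(u1, u2)

    # Variance of U with correction for tied values.
    n = n1 + n2
    ties = sum(t ** 3 - t for t in Counter(pooled).values())
    sigma = math.sqrt(n1 * n2 / 12 * ((n + 1) - ties / (n * (n - 1))))
    z = (u - n1 * n2 / 2) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))
    return u, p
```

The function would be called with two lists of examination times in seconds, one for manual and one for automatic measurement. It does not handle the degenerate case in which all pooled values are identical.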

Table 1 Baseline characteristics of the study group.
Table 2 Obtained ICC values for the entire population and for specific pregnancy trimesters (CI: confidence interval).

Discussion

SonoCNS™ appears to be a helpful tool for assessing the fetal head and CNS structures. The results of our study indicate high reliability of the tool for the biometric measurements commonly used in most models for estimating fetal mass and size. However, results for Vp and CM were less favorable, with ICCs in some cases not significantly different from 0.5, indicating poor performance. A characteristic feature of Vp and CM is that they are fluid spaces whose boundaries in the ultrasonographic image are delineated by fetal CNS structures. Neither structure is located in the basic (thalamic) plane identified by the operator; both lie in planes automatically delineated by the software. During image analysis, it may become apparent that, despite good visibility of intracranial structures in the thalamic plane, the planes derived by the system from the obtained 3D volume do not provide clear boundaries for Vp and CM because of ultrasonographic artifacts. The positioning of these structures also makes them susceptible to being obscured by the acoustic shadow from the fetal skull bones. Both structures are also relatively small, typically up to 10 mm in normal cases. In contrast, the structures with better reproducibility and repeatability have larger dimensions and are mostly located in the peripheral parts of the fetal head.

In the literature, there have been few studies evaluating SonoCNS™ in terms of performance. The largest among them, published in 2021, involved 143 patients undergoing midtrimester screening6. This study compared intraobserver variability in cases of manual measurement and measurement using SonoCNS™. The results of that study significantly differed from ours, with only the HC and BPD parameters in the published study having an ICC greater than 0.8. In other cases, the ICC was in the poor range (< 0.5). The cited study employed a different protocol, involved more examiners (eight doctors), and was conducted in the United States with a study group characterized by a slightly higher BMI. These factors might contribute to the obtained differences; however, the trends in the results are similar to those in our work, with the lowest ICC achieved for measurements of CM, Vp, and TCD.

In another study, published by Gembicki et al.9, the mean error for parameters ranged from 1.26 mm (standard deviation (SD) = 1.6) for BPD to 0.87 (4.22) mm for HC, and 0.55 (0.82), 0.16 (0.82) for 0.16 (1.34) for TCD, and 0.13 (0.67) for CM. However, it is important to remember that presenting results in this way does not reflect the relative (percentage) error in individual cases, and measurements of TCD, CM, and Vp have significantly smaller absolute values than HC and BPD. This makes these results difficult to compare with the findings of our study.
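To illustrate why absolute errors are hard to compare across structures of different size, the sketch below expresses an absolute error as a percentage of the measured dimension. The function name and the example sizes are hypothetical and are not taken from either study.

```python
def relative_error_percent(abs_error_mm, size_mm):
    """Absolute error expressed as a percentage of the structure's size."""
    if size_mm <= 0:
        raise ValueError("size_mm must be positive")
    return 100.0 * abs_error_mm / size_mm


# Hypothetical sizes: the same 1 mm error on a large and a small structure.
large = relative_error_percent(1.0, 200.0)  # circumference-scale dimension
small = relative_error_percent(1.0, 5.0)    # fluid-space-scale dimension
```

With these illustrative inputs, the same absolute error corresponds to a much larger relative error for the small structure, which is why mean absolute errors alone may understate the difficulty of measuring CM and Vp.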

Another study reported a detection rate for CNS structures of 75% between the 18th and 34th weeks of pregnancy and 85% for pregnancies before the 28th week. Differences between the percentages of manual and automatic identification were also greatest for intracranial structures10. These variations highlight the need for careful consideration and contextual understanding when implementing SonoCNS™ in clinical practice, particularly for specific structures and patient demographics.

High repeatability is not equally important for all structures measured during a fetal head examination using SonoCNS™. The precision of measurement appears to be more crucial for dimensions that have a strong correlation with the estimated fetal weight and are part of algorithms used for its assessment (HC, BPD, TAD). Lower repeatability may be acceptable for structures where the absolute measurement value is not critical, but rather whether it falls below the threshold for abnormalities (e.g., Vp < 10 mm). In this situation, the more important question is whether automatic measurements produce false positive findings. In our study, the percentage of such measurements was minimal: 0.2% for Vp and 1.8% for CM.

A problem that arises during ultrasound image acquisition is that the system sometimes fails to produce specific measurements, which may be due to blurring of the boundaries of the relevant structures. In our work, we calculated the percentage of such situations. We observed that in some cases, this resulted from incorrect angling of the automatically delineated fetal CNS transthalamic plane. Such situations occurred in difficult examination conditions where the original, perfect thalamic projection necessary for volume acquisition was challenging to obtain or partially obscured, requiring the probe to be moved to another bony window to visualize all structures of a given plane. Examples include positioning of the fetal head in the lesser pelvis or an acoustic shadow cast by a fetal limb, causing the projection obtained through the abdomen to be slightly oblique. The amount of adipose tissue can also affect repeatability6. We also observed that the repeatability of some fluid space measurements, such as Vp and CM, is greater in the third trimester of pregnancy. Intuitively, we expected the opposite result: we anticipated that the increased calcification of bones in the third trimester would produce acoustic shadows that reduce the quality of the imaging planes. However, repeatability was higher, likely due to better delineation of the aforementioned structures. We believe that bone calcification may still contribute to situations where automatic measurement of structures is impossible; 28 out of 42 (67%) cases in which automatic CM measurement failed occurred in the third trimester.

The range of examination times for automatic measurements was narrow and depended only on the time needed to obtain the thalamic projection and set the gate for volume acquisition. For manual measurements, this time was more varied due to the need to visualize three planes: transventricular, transthalamic, and transcerebellar. Appropriate measurements had to be made on each obtained plane. Our study only involved fetuses with normal intracranial anatomy; we did not test the system in cases where an abnormality was diagnosed. However, the system may be useful in patients where abnormal CNS anatomy is associated with abnormal dimensions of structures, such as microcephaly, ventriculomegaly, or Dandy-Walker malformation. Given the low prevalence of some of these complications, it may be challenging to verify the system’s performance in practice.

Despite the need for improvements in the measurement of intracranial structures, it is important to recognize a significant advantage of SonoCNS™ software beyond biometric measurements. The system delineates all planes of the fetal CNS recommended in the 20 + 2 methodology by the International Society of Ultrasound in Obstetrics and Gynecology11,12 and additionally delineates the fetal profile, whose assessment is also recommended during fetal anatomy examination. This allows the appropriate structures to be assessed on all planes, including the profile. Under good examination conditions, the corpus callosum and midline CNS structures of the fetus can also be visualized in this plane. This may be beneficial for novice sonographers who may struggle with visualizing this plane due to the need for proper probe rotation. Therefore, this tool can be particularly useful in the hands of sonographers with the appropriate theoretical background but lower technical skills, potentially shortening the learning curve.

Undoubtedly, AI-based systems are the future of fetal screening studies. Current literature indicates that such algorithms can locate appropriate planes and classify sections as abnormal, even in real time13. Modern real-time software is also capable of recognizing standard planes, saving them in the device’s memory, and checking their compliance with standard diagnostic planes (e.g., SonoLyst, Voluson, GE)14. Such algorithms are also being developed for diagnostics in the first trimester of pregnancy15, with AI used for automatic measurements of non-inheritable chromosomal aberration markers in the first trimester (such as nuchal translucency and the presence of the nasal bone). Descriptions of such software exist in the literature16. Software capable of identifying all necessary fetal head structures between 10 and 14 weeks of pregnancy, such as the thalami, midbrain, palate, 4th ventricle, cisterna magna, nuchal translucency (NT), nasal tip, nasal skin, and nasal bone, is also described17. Based on brain images, software can also estimate gestational age with high accuracy18,19. Software can also recognize fetal facial expressions such as eye blinking, mouthing, a neutral face, scowling, and yawning20. While SonoCNS™ is available on devices from one manufacturer, other ultrasound device manufacturers also offer software capable of performing similar functions, such as 5DCNS+™ (Samsung Medison, Seoul, Republic of Korea) and Smart Planes CNS™ (Mindray, Shenzhen, China).

Our study’s limitation is the lack of analysis considering potential confounding factors such as subcutaneous fat thickness and fetal positioning. We aimed to reflect the general population without overly complicating the results. The study is also limited to specific periods during pregnancy, but this aligns with the recommended times for fetal anatomy screening studies. The software also allows for marker adjustment before confirming measurements or for manual marking on system-delineated planes in the absence of automatic measurements; however, in our study protocol, we allowed the software to operate without external control. Additionally, patients were recruited for the study only when the participating operator was available in the prenatal testing lab. Yet, we believe the study population represents the general population attending screening examinations.

Conclusion

The SonoCNS™ software is characterized by good to excellent reproducibility and repeatability in the measurement of fetal skull biometry (BPD, HC, and OFD). However, the software shows poor to moderate reproducibility and repeatability in measurements of fluid-filled intracranial structures (CM, Vp), and moderate to good reproducibility for TCD (transverse cerebellar diameter). Beyond biometric measurements, the software is clinically useful for identifying appropriate planes from the acquired fetal head volume and has the potential to reduce examination time.