Abstract
Ultrasound imaging plays an important role in evaluating fetal growth and maternal-fetal health, but its interpretation is challenging due to the complicated anatomy of the fetus and fluctuations in image quality. Although deep learning methods, including convolutional neural networks (CNNs), have shown promise, they have largely been limited to individual tasks such as the segmentation or detection of fetal structures, and thus lack an integrated solution that accounts for the intricate interplay between anatomical structures. To overcome these limitations, Fetal-Net, a new deep learning architecture that integrates multi-scale CNNs and transformer layers, was developed. The model was trained on a large, expertly annotated set of more than 12,000 ultrasound images across different anatomical planes for effective identification of fetal structures and anomaly detection. Fetal-Net achieved excellent performance in anomaly detection, with precision of 96.5%, accuracy of 97.5%, and recall of 97.8%, and showed robustness across various imaging settings, making it a potent means of augmenting prenatal care through refined ultrasound image interpretation.
Introduction
Ultrasound imaging has become an integral tool in obstetrics and gynecology, as it provides information necessary for assessing fetal development and health1,2. Non-invasive procedures allow clinicians to monitor fetal development3,4, diagnose probable abnormalities5,6, and make important decisions regarding prenatal care7,8. Although vital, correct interpretation of ultrasound images remains challenging due to the complexity of fetal anatomy and variability in image quality9,10. The need for accurate and automatic analysis of such images has led the scientific community to explore deep learning methodologies11,12,13. In particular, convolutional neural networks (CNNs) have shown great promise in a diverse range of medical imaging problems14,15. Over the last few years, advances in deep learning have produced more complex models that can handle the complexity of ultrasound images16,17. Existing solutions address individual tasks such as segmentation, fetal structure identification18,19,20, and gestational age assessment21,22,23. Although promising, these techniques do not provide the comprehensive interpretation actually needed for effective prenatal care, especially regarding the interactions among different anatomical structures24,25. Problems such as noise, anatomical variation, and uneven image quality make a perfect analysis nearly impossible, which can lead to an improper diagnosis or to missing critical anomalies26,27,28,29. While significant progress has been accomplished in automatic ultrasound interpretation, several critical challenges remain. Current approaches tend to focus on individual aspects of the fetal examination and view the image from a rather narrow perspective, without considering the intricate interactions between different anatomical structures. Issues such as noise, anatomical variation, and inconsistent resolution continue to hinder accurate detection and classification of fetal structures. There is a need for a model that not only concentrates on particular tasks but also incorporates the overall context within ultrasound images to give a more precise and trustworthy interpretation.
The intricate interactions refer to dependencies such as the spatial proximity between the fetal brain and ventricles, or between the thorax and femur for gestational assessment. These relationships affect detection reliability and require a model that captures both region-specific features and broader anatomical context achieved via multi-scale and transformer layers in Fetal-Net.
To overcome these challenges, the goals of this study are:
-
To develop a deep learning model that integrates multi-scale convolutional neural networks (CNNs) with transformer layers for comprehensive ultrasound image interpretation.
-
To accurately detect various fetal structures across multiple anatomical planes using a large, diverse dataset.
-
To improve anomaly detection in maternal-fetal ultrasound images by capturing both local and global features.
-
To enhance the robustness and generalization of the model across different imaging conditions and anatomical variations.
This paper provides a significant contribution to the body of research. The novel contributions of this study include:
a) Introduction of Fetal-Net, a novel deep learning framework that combines multi-scale CNNs with transformer layers to provide a comprehensive interpretation of maternal-fetal ultrasound images.
b) Development of a robust model trained on over 12,000 ultrasound images from various anatomical planes, achieving high accuracy, precision, and recall in both anatomical structure detection and anomaly identification.
c) Demonstration of the model’s adaptability and consistency across different imaging conditions, making it a valuable tool for improving prenatal care.
d) Enhancement of interpretability through the use of transformer layers, which allow for the generation of attention maps that provide visual insights into the model’s decision-making process.
To bridge the gap between accurate fetal structure interpretation and the limitations of current deep learning models, we introduce Fetal-Net—a unified architecture tailored to handle the complexity of ultrasound imaging. By integrating multi-scale convolutional processing with transformer-based contextual modeling, Fetal-Net is designed to overcome key challenges such as anatomical variability, poor image quality, and lack of interpretability.
-
The model captures both local and global features using multi-scale CNNs and Transformer layers.
-
Attention maps aid interpretability, addressing the black-box issue in medical AI.
-
Robustness across machines (Voluson E6, Aloka, etc.) ensures generalizability under variable clinical conditions.
This paper is organized into five sections: Introduction, Literature Review, Methodology, Results and Discussion, and Conclusion. The introduction provides a thorough background on the importance of correct maternal-fetal ultrasound interpretation and lays the groundwork for the research. The literature review offers a comprehensive examination of current methodologies in maternal-fetal ultrasonography interpretation. The methodology section describes in depth the proposed method, Fetal-Net, for improving the interpretation of maternal-fetal ultrasound. The results and discussion section presents the extensive training and evaluation of Fetal-Net on maternal-fetal ultrasound images. Key findings, contributions, and implications are summarised in the conclusion.
Literature review
Recent advancements in deep learning have significantly influenced the field of maternal-fetal ultrasound analysis, improving the ability to segment, classify, and detect anatomical structures and anomalies. However, limitations remain, particularly regarding robustness to noise, anatomical variability, and generalizability across different imaging conditions. This section organizes related works into three primary categories: segmentation models, anatomical classification approaches, and anomaly/biometric measurement techniques.
Segmentation models
DeepLab3 was introduced for semantic segmentation using atrous convolution and deep networks, demonstrating promising results in extracting fetal structures from ultrasound images. Similarly, DRINet4 applied deep residual learning for medical segmentation tasks, offering an effective way to localize fetal features. Fetal-Net’s segmentation design was influenced by cascaded CNNs like DW-Net25, which were previously used to segment echocardiographic views. Despite these successes, Rueda et al.15 highlighted persistent issues with image noise and anatomical variations that limit segmentation accuracy.
Anatomical structure classification
Various methods have been developed for identifying specific fetal anatomical regions. Chen et al.4 demonstrated CNN-based circumference detection in 2D ultrasound images, while Krishna and Kokil10 proposed a stacked ensemble deep learning model for standard plane classification, yielding high accuracy. Yang et al.26 introduced SSR-Net for stagewise regression, applicable in fetal age estimation and structure recognition. Szegedy et al.19 also improved classification via deep architecture refinements (ILSVRC2014), contributing to architectural elements in Fetal-Net.
Anomaly detection and biometric measurement
Fetal anomaly detection has been approached with increasingly complex architectures. Cho et al.6 integrated deep learning with system-on-chip technology for real-time biometric measurements, while Liu et al.11 used CNN-based segmentation for first trimester biometric prediction with high accuracy. Thomas and Harikumar23 employed ensemble learning to improve plane identification in fetal imaging. Additionally, bottleneck transformers were introduced by Srinivas et al.18 to enhance feature extraction in visual identification, laying the foundation for transformer usage in Fetal-Net.
These prior works underscore the fragmented nature of existing solutions, which often focus on single tasks such as segmentation or classification. Fetal-Net differentiates itself by integrating multi-scale CNNs with transformer layers into a unified model capable of capturing both local and global interactions in ultrasound images. It also builds on normalization strategies24 and transformer integration techniques27,28 to enhance robustness and interpretability. A comparison of existing methods is presented in Table 1.
Finally, by combining transformers and multi-scale convolutional neural networks, Fetal-Net has the potential to substantially change how maternal-fetal ultrasounds are interpreted. Improving the standard of care for pregnant women and their unborn children is the ultimate goal of this approach, which combines deep learning techniques and complementary network topologies to provide efficient and precise analysis.
The literature review demonstrates the need for further research into the use of transformers and multi-scale convolutional neural networks to enhance the interpretation of maternal-fetal ultrasound. While many studies have focused on automated fetal structure detection, gestational age calculation, and segmentation using deep learning methods, there is a lack of comprehensive research that integrates multi-scale CNNs with transformers for improved accuracy (see, for example16,17). In addition, the existing literature on maternal-fetal ultrasonography analysis is somewhat sparse, with the majority of studies concentrating on specific fetal structures. This research aims to address that gap by introducing a novel method for interpreting maternal-fetal ultrasound images using transformers and multi-scale convolutional neural networks (CNNs).
Most prior works are designed for isolated tasks like segmentation or classification and often overlook anatomical context. Fetal-Net fills this gap by modeling interrelated structures through its fusion of local (CNN) and contextual (Transformer) features, which has not been previously achieved in an end-to-end model.
Methodology
In this section, we present the Fetal-Net framework and describe the proposed methodology. The process begins with the extraction of multi-scale features \(F_i\) from the maternal-fetal ultrasound image \(I\) using multi-scale CNNs. Next, the modelling of contextual dependencies \(R\) using a Transformer encoder-decoder module is detailed. The fusion technique that integrates the multi-scale features and contextual dependencies into comprehensive features \(F_{fused}\) is then presented. Optimisation of the model parameters \(\Theta\) to minimise the composite loss \(L_{composite}\) is introduced as the performance-improvement step. The generation of attention maps \(A\) to highlight areas of interest within ultrasound images is described, along with the steps taken to improve interpretability. Finally, the model's robustness and generalizability are discussed, including the steps used during training to reduce the data loss \(L_{data}\) together with a regularisation term \(R_{reg}\) across a variety of datasets.
Automatic analysis of maternal-fetal ultrasound images is achieved using our approach, which makes use of a multi-scale convolutional neural network (CNN) with transformers. Data from two hospitals in Barcelona, Spain, Hospital Clínic and Hospital Sant Joan de Déu, are part of a large dataset obtained from BCNatal. Carefully selected from 1,792 pregnant women who underwent standard prenatal examinations in the second or third trimester, the dataset contains more than 12,000 ultrasound images and is referred to as the BCNatal Maternal-Fetal Ultrasound Dataset.
The process of harmonizing transformers and multi-scale convolutional neural networks (CNNs) for better maternal-fetal ultrasound interpretation is still ongoing. Previous studies have largely failed to address the interaction between the various fetal structures within the maternal context, with most studies only examining individual domains of the maternal-fetal ultrasound examination9. There is a gap for a concrete framework to handle this interaction12. Addressing this gap, the present research proposes a novel method that combines transformers and multi-scale CNNs in a complementary fashion13. Identifying, segmenting, and analysing fetal structures while taking their interrelations into account requires a model that can efficiently capture both global and local information in ultrasound images14.
Feature extraction and fusion
Given a maternal-fetal ultrasound image \(I\) with spatial dimensions \(H \times W\) and a set of scales \(\{S_1, S_2, \ldots, S_n\}\), the problem is to extract multi-scale features \(F_i\) using Multi-Scale Convolutional Neural Networks (CNNs) with learnable parameters \(\theta_i\):
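In a representative functional form (the exact convolutional parameterization is described in the architecture section), each scale-specific branch can be written as:
\[ F_i = \mathrm{CNN}_{\theta_i}\big(I;\, S_i\big), \qquad i = 1, \ldots, n \]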
The challenge is to integrate the multi-scale features \(F\) and contextual dependencies \(R\) to derive comprehensive features \(F_{fused}\) using learnable fusion weights \(\alpha_i\):
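One plausible form of this fusion, consistent with the learnable weights \(\alpha_i\) defined above, is a weighted combination of the per-scale features merged (for example, by concatenation or addition) with the contextual dependencies \(R\):
\[ F_{fused} = \mathrm{Fuse}\!\left( \sum_{i=1}^{n} \alpha_i F_i,\; R \right) \]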
Contextual dependency
Given the multi-scale features \(F = \{F_1, F_2, \ldots, F_n\}\), the objective is to model contextual dependencies \(R\) using a Transformer encoder-decoder module. Let \(F_{embed}\) be the embedding of \(F\) through a learnable linear transformation \(E\):
The context matrix \(C\) is computed by applying a self-attention mechanism with learnable weights \(W_q, W_k, W_v\):
The transformed context \(T\) is generated by a linear transformation \(D\) followed by a non-linearity \(\phi\):
The contextual dependencies \(R\) are obtained by passing \(T\) through another linear transformation \(L\):
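Written out with the operators defined above, and assuming the standard scaled dot-product form of self-attention, these four steps can be summarized as:
\[ F_{embed} = E(F), \qquad C = \mathrm{softmax}\!\left( \frac{(F_{embed} W_q)(F_{embed} W_k)^{\top}}{\sqrt{d_k}} \right) (F_{embed} W_v) \]
\[ T = \phi\big( D(C) \big), \qquad R = L(T) \]
where \(d_k\) denotes the dimensionality of the key projection.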
Performance optimization and interpretability
The problem is to optimize the model parameters \(\Theta\) to minimize the composite loss \(L_{composite}\), accounting for tasks like fetal biometric measurements \(L_{bio}\) and anomaly detection \(L_{anom}\):
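A plausible form of the composite objective, with task-weighting coefficients \(\lambda_{bio}\) and \(\lambda_{anom}\) introduced here for illustration, is:
\[ L_{composite} = \lambda_{bio}\, L_{bio} + \lambda_{anom}\, L_{anom} \]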
The challenge is to train a model \(M\) that minimizes a combination of a data loss \(L_{data}\) and a regularization term \(R_{reg}\) across a diverse dataset of maternal-fetal ultrasound images \(D\):
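In general form, this training objective can be written as:
\[ \Theta^{*} = \arg\min_{\Theta} \; \sum_{(I,\, y) \in D} L_{data}\big(M(I; \Theta),\, y\big) \;+\; \lambda_{reg}\, R_{reg}(\Theta) \]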
where \(\lambda_{reg}\) controls the strength of regularization.
Performance improvement
Given a maternal-fetal ultrasound image \(I\) and the trained model \(M\), the task is to generate attention maps \(A\) that emphasize regions of interest within the image, leveraging attention weights \(W\):
Dataset
Exclusions from the dataset include multiple pregnancies, congenital abnormalities, and aneuploidies, and the gestational age range is 18–40 weeks. An expert in maternal-fetal medicine labelled each image in the collection with the name of its anatomical plane. The dataset is organised into six separate groups: five main maternal-fetal anatomical planes and one additional category for other variations.
Representative images of anatomical plane classification (e.g., Fetal Brain, Thorax) and anomaly detection (e.g., cervical shortening)30.
Dataset composition
The dataset comprises more than 12,000 ultrasound scans acquired from 1,792 patients who attended routine check-ups in the second and third trimesters.
Labeling and categories
A skilled maternal-fetal physician carefully labelled each anatomical plane, classifying the images as follows: Fetal Abdomen, Fetal Brain, Trans-thalamic, Trans-cerebellum, Trans-ventricular, Fetal Femur, Fetal Thorax, Maternal Cervix, and Other.
Dataset distribution
Detailed information on the dataset’s distribution across several anatomical planes is presented in Table 2, which also details the number of patients, images, and clinical importance.
The distribution of patients across the different anatomical planes is shown in the pie chart in Fig. 1. The Fetal Brain category accounts for the highest proportion (60.3%), followed by Other (41.0%) and Maternal Cervix (16.3%).
Similarly, the distribution of images across anatomical planes is illustrated in Fig. 2. The relevance of the Fetal Brain category in the dataset is highlighted by the fact that it has the highest proportion (24.9%).
The ‘Fetal Brain’ category includes general neurodevelopmental views, while the Trans-thalamic, Trans-cerebellar, and Trans-ventricular planes represent specific slices within the brain that are essential for biometric measurement and anomaly detection.
Distribution of images across machines and operators
Table 3 shows the distribution across machines and operators; the majority of the images in the collection come from the Voluson E6, Voluson S10, and Aloka systems.
Figure 3 shows, as a pie chart, the relative contributions of the ultrasound machines to the overall image distribution: Voluson E6 (51.7%), Aloka (28.7%), Voluson S10 (9.1%), and other machines (10.5%).
Data preprocessing
The ultrasound images undergo a number of preprocessing stages to prepare them for analysis and model training. These stages are described below.
Image resizing
Resizing images to a consistent resolution while preserving their aspect ratio reduces variation in image size. This scaling operation is defined as:
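In functional form, with resize denoting the scaling operation:
\[ \text{Resized Image} = \mathrm{resize}(\text{Original Image},\ \text{Target Resolution}) \]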
Where:
-
Original Image refers to the raw ultrasound image.
-
Target Resolution is the desired resolution for the resized image.
Image enhancement
To improve contrast and uncover hidden information, ultrasound images are enhanced using techniques such as histogram equalisation. The enhanced image is the result of the histogram equalisation operation:
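Using the histeq operation defined below:
\[ \text{Enhanced Image} = \mathrm{histeq}(\text{Resized Image}) \]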
Where:
-
Resized Image is the image after resizing.
-
histeq denotes the histogram equalization operation.
Normalization
To train dependable models, min-max normalisation is used to bring pixel values into a uniform range. Normalisation is defined as:
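The standard min-max form, applied to the enhanced image, is:
\[ \text{Normalized Image} = \frac{\text{Enhanced Image} - \min(\text{Enhanced Image})}{\max(\text{Enhanced Image}) - \min(\text{Enhanced Image})} \]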
Where:
Enhanced Image represents the image after enhancement.
Data augmentation
Rotation, flipping, and zooming are data augmentation techniques used to improve model generalisation and avoid overfitting. The augmentation procedure produces an augmented image:
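Using the augment operation defined below:
\[ \text{Augmented Image} = \mathrm{augment}(\text{Normalized Image}) \]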
Where:
-
Normalized Image is the image after normalization.
-
augment denotes the data augmentation operation.
Label encoding
To make training the model easier, each image is assigned a numerical label that represents its anatomical plane. Label encoding is defined as:
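Using the encode operation defined below:
\[ \text{Encoded Label} = \mathrm{encode}(\text{Anatomical Plane Label}) \]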
Where:
-
Anatomical Plane Label is the categorical label associated with the image.
-
encode represents the label encoding operation.
The dataset is imbalanced, with over-representation of brain views due to clinical emphasis on neurodevelopment. To address this, we used class-wise data augmentation and introduced weighted loss functions to ensure balanced learning.
This preprocessing pipeline must be in place before any analysis or model training can take place on the fetal ultrasound image dataset.

Algorithm 1: Ultrasound Fetal Images Data Preprocessing.
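A minimal Python sketch of this preprocessing pipeline is shown below, using OpenCV and NumPy. The target resolution, augmentation ranges, and label mapping are illustrative assumptions rather than the exact published configuration.

```python
import cv2
import numpy as np

# Anatomical plane categories named in the dataset, mapped to integer labels (order assumed).
PLANE_LABELS = {
    "Fetal Abdomen": 0, "Fetal Brain": 1, "Trans-thalamic": 2,
    "Trans-cerebellum": 3, "Trans-ventricular": 4, "Fetal Femur": 5,
    "Fetal Thorax": 6, "Maternal Cervix": 7, "Other": 8,
}

def preprocess(image, plane, target=224, rng=None):
    """Resize, equalize, normalize, augment, and label-encode one grayscale ultrasound image."""
    rng = rng or np.random.default_rng()

    # 1. Resizing: pad to a square so the aspect ratio is preserved, then scale.
    h, w = image.shape[:2]
    side = max(h, w)
    padded = np.zeros((side, side), dtype=np.uint8)
    padded[:h, :w] = image
    resized = cv2.resize(padded, (target, target))

    # 2. Enhancement: histogram equalization to improve contrast.
    enhanced = cv2.equalizeHist(resized)

    # 3. Normalization: min-max scaling of pixel values to [0, 1].
    enhanced = enhanced.astype(np.float32)
    normalized = (enhanced - enhanced.min()) / (enhanced.max() - enhanced.min() + 1e-8)

    # 4. Augmentation: random 90-degree rotation, horizontal flip, and mild zoom.
    augmented = np.rot90(normalized, k=rng.integers(0, 4))
    if rng.random() < 0.5:
        augmented = np.fliplr(augmented)
    zoom = rng.uniform(0.9, 1.1)
    zoomed = cv2.resize(np.ascontiguousarray(augmented), None, fx=zoom, fy=zoom)
    augmented = cv2.resize(zoomed, (target, target))

    # 5. Label encoding: categorical plane name -> integer class index.
    return augmented, PLANE_LABELS[plane]
```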
Proposed architecture and model implementation
The multi-scale CNN blocks allow extraction of spatially localized features at varying levels of granularity, which is vital for identifying fine-grained fetal structures. Transformer layers are used to capture long-range contextual relationships between regions, modeling dependencies across anatomical structures. This synergy enables both local detail recognition and global contextual reasoning.
Here we present our proposed model, Fetal-Net, a transformer-based multi-scale convolutional neural network (CNN). This design improves the detection and segmentation of fetal structures in maternal-fetal ultrasound images.
Model architecture
The essential components of the Fetal-Net design are the multi-scale CNN blocks and the transformer layers. To represent anatomical details accurately, the model is built to capture hierarchical features at various scales.
Multi-Scale CNN blocks
The Multi-Scale CNN Blocks are composed of \(L\) convolutional layers, each operating at a different scale. Let \(X_{in}\) be the input tensor, and \(X_{out}\) be the output tensor after passing through the multi-scale CNN blocks. The operation at each scale \(l\) can be defined as
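Using the terms defined below, the per-filter operation at scale \(l\) takes the form:
\[ X_{out,i}^{(l)} = \sigma\big( W_i^{(l)} * X_{in,i}^{(l)} + b_i^{(l)} \big), \qquad i = 1, \ldots, N_l \]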
Where:
\(N_l\) is the number of filters at scale \(l\),
\(X_{in,i}^{(l)}\) is the input feature map at scale \(l\) for filter \(i\),
\(W_i^{(l)}\) is the weight tensor for filter \(i\) at scale \(l\),
\(b_i^{(l)}\) is the bias term for filter \(i\) at scale \(l\),
\(\sigma\) is the activation function (e.g., ReLU),
\(*\) denotes the convolution operation.
By merging the feature maps from each scale, the output of the multi-scale CNN blocks is generated:
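Assuming concatenation as the merging operation (channel-wise merging is one common choice), the block output is:
\[ X_{out} = \mathrm{Concat}\big( X_{out}^{(1)}, X_{out}^{(2)}, \ldots, X_{out}^{(L)} \big) \]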
This method allows the model to learn representations at different scales by capturing hierarchical characteristics at different levels.
Transformer layers
The Transformer layers play a crucial role in the proposed Fetal-Net model. Through self-attention mechanisms and feed-forward neural networks, each Transformer layer allows the model to capture complex spatial relationships and long-range interdependencies.
Let \(H_{in}^{(l)}\) represent the input hidden states at layer \(l\), and \(H_{out}^{(l)}\) denote the output hidden states after the Transformer layer. The self-attention mechanism is defined as follows:
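With the projection matrices defined below, the standard scaled dot-product formulation (the intermediate states \(Q^{(l)}, K^{(l)}, V^{(l)}\) and \(H_{att}^{(l)}\) are named here for clarity) is:
\[ Q^{(l)} = H_{in}^{(l)} W_Q^{(l)}, \quad K^{(l)} = H_{in}^{(l)} W_K^{(l)}, \quad V^{(l)} = H_{in}^{(l)} W_V^{(l)} \]
\[ H_{att}^{(l)} = \mathrm{LayerNorm}\!\left( H_{in}^{(l)} + \mathrm{softmax}\!\left( \frac{Q^{(l)} {K^{(l)}}^{\top}}{\sqrt{d_k}} \right) V^{(l)} \right) \]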
Where:
\(W_Q^{(l)}\), \(W_K^{(l)}\), and \(W_V^{(l)}\) are learnable weight matrices for the query, key, and value projections; \(d_k\) is the dimensionality of the key vectors; softmax is the softmax function applied along the sequence dimension; LayerNorm is the layer normalization operation.
The feed-forward neural network in each Transformer layer is defined as:
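Consistent with the single weight matrix and bias defined below, a one-layer form of the feed-forward network is:
\[ \mathrm{FFN}\big(H_{att}^{(l)}\big) = \sigma\big( H_{att}^{(l)} W_{ff}^{(l)} + b_{ff}^{(l)} \big) \]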
Where:
\(W_{ff}^{(l)}\) is the weight matrix for the feed-forward layer,
\(b_{ff}^{(l)}\) is the bias term for the feed-forward layer.
After the Transformer layer, the last hidden states are acquired by adding the output of the feed-forward and self-attention branches:
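Using the attention output \(H_{att}^{(l)}\) named above, this residual combination can be written as (an additional layer normalization may follow):
\[ H_{out}^{(l)} = H_{att}^{(l)} + \mathrm{FFN}\big( H_{att}^{(l)} \big) \]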
This combination enables the model to learn intricate spatial representations and capture global dependencies within the input sequence.
Multi-Scale integration
The Multi-Scale Integration module in our proposed Fetal-Net model allows the model to collect features at different granularities by fusing inputs from multiple scales.
Let \(H_{MSI}^{(l)}\) denote the output hidden states after the Multi-Scale Integration module at layer \(l\). The input to this module consists of the hidden states from the Transformer layers at different scales, represented as \(H_{out}^{(l-s)}\), \(H_{out}^{(l)}\), and \(H_{out}^{(l+s)}\), where \(s\) is the scale difference.
The Multi-Scale Integration operation is defined as:
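Assuming the scale-wise hidden states are concatenated before the convolution, the operation can be written as:
\[ H_{MSI}^{(l)} = \mathrm{Conv1D}\!\left( \big[ H_{out}^{(l-s)};\; H_{out}^{(l)};\; H_{out}^{(l+s)} \big];\; W_{MSI}^{(l)} \right) + b_{MSI}^{(l)} \]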
Where:
Conv1D is the one-dimensional convolution operation,
\(W_{MSI}^{(l)}\) is the learnable weight matrix for the convolution,
\(b_{MSI}^{(l)}\) is the bias term for the convolution.
This process uses a convolutional operation to combine data from several scales, which lets the model develop hierarchical representations and capture features at many levels of abstraction.
Model details
Each CNN block uses convolutional filters of sizes 3, 5, and 7 to extract features at multiple scales. The transformer layers apply 8-head self-attention to model inter-region dependencies. Multi-scale integration combines outputs using 1D convolution and normalization, creating a fused feature map that encapsulates both fine and coarse details across anatomical planes.
Figure 4 illustrates the Fetal-Net architecture, which consists of input image, multi-scale CNN blocks, transformer layers, multi-scale integration, and output labels.
Input image
It begins with an input image of ultrasound, which is the major source of data for analysis.
Multi-Scale CNN blocks
Four blocks of CNN (CNN Block 1 to CNN Block 4) extract spatial and hierarchical information from the input image. The use of multi-scale processing enables the network to identify fine as well as coarse details.
Transformer layers
The CNN features are fed into six transformer layers (Transformer 1 to Transformer 6), allowing global contextual understanding and improving feature representation by self-attention mechanisms.
Multi-Scale integration
Features from the transformer layers and CNN are integrated in this step to take advantage of both local spatial details and global context.
Output labels
The processed features are finally used for segmentation, classification, or other predictive tasks to yield the output labels.
This CNN-transformer hybrid model successfully blends CNNs for spatial feature learning and transformers for long-distance relationships, and therefore it is capable of handling advanced ultrasound image processing.
Table 4 summarises the model parameters and offers the specific equations and values for each layer in the Fetal Net design.
To clarify the technical depth, each CNN block operates at a distinct receptive field using filters of sizes 3, 5, and 7. The model includes six transformer layers employing 8-head attention to ensure diverse contextual encoding. The Multi-Scale Integration module further consolidates hierarchical representations via 1D convolutions and layer normalization, providing deeper abstraction than a naive CNN-Transformer stack.
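A condensed PyTorch-style sketch of this architecture is given below. It follows the description above (parallel filters of sizes 3, 5, and 7 in each block, six 8-head transformer layers, 1D-convolutional integration, and two output heads), but the channel widths, token dimension, and input resolution are illustrative assumptions rather than the exact published configuration.

```python
import torch
import torch.nn as nn

class MultiScaleBlock(nn.Module):
    """Parallel convolutions with kernel sizes 3, 5 and 7 whose outputs are concatenated."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Conv2d(in_ch, out_ch, k, padding=k // 2),
                          nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
            for k in (3, 5, 7)
        ])
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        return self.pool(torch.cat([b(x) for b in self.branches], dim=1))

class FetalNetSketch(nn.Module):
    """Illustrative CNN-Transformer hybrid: four multi-scale blocks, six 8-head transformer layers."""
    def __init__(self, n_planes=9, d_model=256, n_heads=8, n_layers=6, img_size=224):
        super().__init__()
        self.blocks = nn.Sequential(
            MultiScaleBlock(1, 32),       # grayscale input -> 96 channels
            MultiScaleBlock(96, 64),      # -> 192 channels
            MultiScaleBlock(192, 96),     # -> 288 channels
            MultiScaleBlock(288, 128),    # -> 384 channels
        )
        self.project = nn.Conv2d(384, d_model, kernel_size=1)
        n_tokens = (img_size // 16) ** 2  # four 2x pooling stages
        self.pos_emb = nn.Parameter(torch.zeros(1, n_tokens, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)
        # Multi-scale integration: 1D convolution over the token sequence plus normalization.
        self.msi = nn.Sequential(nn.Conv1d(d_model, d_model, kernel_size=3, padding=1),
                                 nn.ReLU(inplace=True))
        self.norm = nn.LayerNorm(d_model)
        self.plane_head = nn.Linear(d_model, n_planes)   # anatomical plane classifier
        self.anomaly_head = nn.Linear(d_model, 1)        # binary anomaly logit

    def forward(self, x):                                # x: (B, 1, H, W)
        f = self.project(self.blocks(x))                 # (B, d_model, H/16, W/16)
        tokens = f.flatten(2).transpose(1, 2)            # (B, n_tokens, d_model)
        tokens = self.transformer(tokens + self.pos_emb)
        fused = self.msi(tokens.transpose(1, 2)).transpose(1, 2)
        fused = self.norm(fused).mean(dim=1)             # pooled fused representation
        return self.plane_head(fused), torch.sigmoid(self.anomaly_head(fused))
```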

Algorithm 2: Fetal-Net Mathematical Model.
Performance evaluation
Precisely measuring the performance of medical image segmentation models, including the proposed Fetal-Net, requires strong quantitative measures that capture both technical accuracy and clinical applicability. In this work, we evaluate Fetal-Net using well-established metrics specific to fetal imaging tasks: Intersection over Union (IoU), Dice Coefficient, Sensitivity, Specificity, and F1 score. Together, these metrics address the essential requirements of prenatal diagnostic procedures, in which accurate delineation of fetal anatomical structures is essential to prevent misdiagnosis.
Intersection over union (IoU)
The Intersection over Union (IoU), or Jaccard Index, measures how closely the predicted and actual segmentations agree by quantifying the overlap between the predicted segmentation mask and the ground-truth segmentation. It is defined as:
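In terms of the counts defined below:
\[ \mathrm{IoU} = \frac{TP}{TP + FP + FN} \]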
where TP is the number of true positives, FP is the number of false positives, and FN is the number of false negatives.
Dice coefficient
The Dice Coefficient is another metric for the degree to which the predicted and actual segments overlap in space. It quantifies how well the model identifies subtle or small structures and is defined as:
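Using the same counts:
\[ \mathrm{Dice} = \frac{2\,TP}{2\,TP + FP + FN} \]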
Where TP is true positive, FN is false negative, and FP is false positive.
Sensitivity and specificity
How well the model can detect positive and negative instances is measured by sensitivity (recall) and specificity, respectively. They are defined as:
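In the standard formulation:
\[ \mathrm{Sensitivity} = \frac{TP}{TP + FN}, \qquad \mathrm{Specificity} = \frac{TN}{TN + FP} \]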
where TN is the number of true negatives, TP the number of true positives, FN the number of false negatives, and FP the number of false positives.
F1 score
The F1 Score is the harmonic mean of precision and recall.
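In terms of precision and recall:
\[ F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \]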
These metrics provide a comprehensive evaluation of the model’s performance in different aspects of image segmentation.
Inclusion criteria
The dataset used in this study was curated under strict inclusion guidelines to ensure consistency and clinical relevance. Only ultrasound images from singleton pregnancies were selected, specifically those recorded between 18 and 40 weeks of gestation, which corresponds to the standard period for detailed fetal anatomical assessment. Each image included in the dataset represented a clearly identifiable standard anatomical plane, such as the fetal brain, femur, or thorax, as determined by expert sonographers. Images with severe motion artifacts, poor contrast, or incomplete anatomical visibility were excluded to maintain high data quality. These criteria ensured that the training and evaluation of Fetal-Net were based on clinically interpretable, high-resolution images aligned with routine prenatal screening practices.
Anomaly detection module
The anomaly detection component of Fetal-Net is integrated into the shared network architecture and operates as a binary classification head branching from the fused feature representation \(F_{fused}\). After the multi-scale CNN and Transformer layers generate spatial and contextual embeddings, these are merged through the multi-scale integration module to form \(F_{fused}\), which captures both local anatomical features and global dependencies. For anomaly detection, this fused feature vector is passed through a series of fully connected layers, followed by a sigmoid activation function to produce a binary output indicating the presence or absence of fetal abnormalities. The model is trained using a combined loss function, which includes an anomaly-specific binary cross-entropy loss component. This joint training strategy allows the network to learn both anatomical classification and anomaly detection tasks simultaneously, enabling efficient multi-task learning while leveraging shared representations. During inference, the anomaly detection module provides a probability score for each input image, which can be thresholded to flag potential abnormal conditions, thereby enhancing the clinical utility of the system.
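A minimal Python sketch of this anomaly head and its joint objective is shown below (PyTorch; layer widths, dropout, and the task weight are illustrative assumptions, not the exact published configuration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AnomalyHead(nn.Module):
    """Binary classification head branching from the fused feature vector F_fused."""
    def __init__(self, d_fused=256, d_hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d_fused, d_hidden), nn.ReLU(inplace=True),
                                 nn.Dropout(0.3), nn.Linear(d_hidden, 1))

    def forward(self, f_fused):
        return self.mlp(f_fused).squeeze(-1)   # raw logit; sigmoid is applied in the loss / at inference

def composite_loss(plane_logits, plane_labels, anomaly_logits, anomaly_labels, w_anom=1.0):
    """Joint objective: plane cross-entropy plus anomaly binary cross-entropy."""
    l_plane = F.cross_entropy(plane_logits, plane_labels)
    l_anom = F.binary_cross_entropy_with_logits(anomaly_logits, anomaly_labels.float())
    return l_plane + w_anom * l_anom

def flag_anomaly(anomaly_logits, threshold=0.5):
    """At inference, threshold the sigmoid probability to flag potential abnormalities."""
    return torch.sigmoid(anomaly_logits) > threshold
```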
Results and discussion
Here, we present the findings of our experiments with the Fetal-Net model for anatomical part classification in ultrasound images. The test data cover a variety of maternal-fetal anatomical planes, including the fetal abdomen, brain, trans-thalamic, trans-cerebellum, trans-ventricular, fetal femur, fetal thorax, maternal cervix, and other regions. Our foremost aim is to classify all of these anatomical parts correctly; our second aim is to distinguish between normal and pathological cases within those regions. The classification results make clear how precisely the model can identify each structure in ultrasound images, while the ability to distinguish standard from abnormal cases illuminates the model’s capability in diagnosing issues arising during pregnancy. A thorough assessment of the abnormalities discovered by Fetal-Net is followed by subsections that elaborate on the classification results per anatomical element. Our analysis focuses on the strengths and weaknesses of the model, as well as its potential clinical uses and avenues for improvement.
Performance of Fetal-Net on anatomical parts detection
The performance metrics for the classification of various anatomical parts using Fetal-Net are summarized below:
Figure 5 illustrates the Intersection over Union (IoU) scores achieved by Fetal-Net across training epochs. The consistently high IoU values, stabilizing above 0.94, demonstrate the model’s ability to accurately segment and detect anatomical structures within maternal-fetal ultrasound images.
Figure 6 presents the sensitivity and specificity metrics of Fetal-Net during classification of anatomical regions. High sensitivity indicates the model’s effectiveness in correctly identifying true positives, while high specificity confirms its strength in minimizing false positives across different fetal anatomical planes.
Figure 7 displays the confusion matrix for anatomical part classification using Fetal-Net. It highlights the number of correct and incorrect predictions across different fetal regions, showing strong diagonal dominance and minimal misclassification.
Figure 8 shows the Dice Coefficient scores over training epochs, reflecting Fetal-Net’s high segmentation accuracy. The scores remain consistently above 0.94, confirming reliable overlap between predicted and actual anatomical structures.
Figure 9 summarizes key performance metrics—accuracy, precision, recall, and F1-score—for anatomical structure detection. The high values across all metrics underscore the robustness and effectiveness of the proposed model.
Performance of Fetal-Net in anomaly detection in the fetus
The results of anomaly detection in the fetus using Fetal-Net are shown in Table 5, which illustrates the efficacy of Fetal-Net in fetal anomaly identification by providing an overview of the critical parameters: recall, accuracy, precision, and F1 score. Figure 10 presents these parameters as a bar chart, collectively demonstrating Fetal-Net’s performance in anomaly identification. Figure 11 displays the anomaly detection classification results as a confusion matrix (see also Fig. 12).
Table 6 presents a comprehensive comparison between the proposed Fetal-Net model and several state-of-the-art approaches in fetal ultrasound image analysis. These existing models utilize various architectures, such as CNN-based segmentation for fetal biometric extraction, ensemble deep learning for plane identification, and hybrid models for classification tasks. While many of these methods demonstrate strong performance in specific tasks, they typically focus on either segmentation or classification in isolation. In contrast, Fetal-Net integrates multi-scale CNNs with transformer layers to jointly handle anatomical classification and anomaly detection tasks. With an accuracy of 97.5%, precision of 96.5%, and recall of 97.8%, Fetal-Net outperforms the benchmark models across key evaluation metrics. This improved performance is attributed to the model’s ability to capture both local details and long-range dependencies, enabling more accurate and robust analysis of complex fetal structures within ultrasound images. For instance, Specktor-Fadida et al.21 employed a CNN-based segmentation approach on a whole-body MRI dataset to perform segmentation and fetal weight estimation, demonstrating high repeatability and reproducibility. Similarly, Liu et al.11 focused on deep learning-based segmentation for biometric measurements in the first trimester, achieving high accuracy in biometric parameters. Thomas et al.23 suggested an ensemble deep learning approach to detect fetal planes from a series of ultrasound images to enhance the identification accuracy. Cho et al.6 developed a system-on-chip implementation in conjunction with deep learning for real-time measurement of fetal biometrics, demonstrating strong performance in field tests. Krishna et al.10 applied a stacked ensemble of deep models to classify normal fetal ultrasound planes and obtained higher classification accuracy. Sriraam et al.22 created a contour detection technique to estimate CRL, advancing the evaluation of fetal growth. Ziani29 introduced a hybrid deep learning solution that used multimodal data fusion for fetal ECG classification with high accuracy in classification tasks for ECG. The current proposed Fetal-Net model combines Multi-Scale CNNs with transformer layers and was trained with a vast dataset of 12,000 ultrasound images. Fetal-Net performs well for both anatomical detection and anomaly detection tasks, and performance statistics indicate 97.5% accuracy, 96.5% precision, and 97.8% recall. These results indicate the superior performance of Fetal-Net in managing complicated ultrasound image analysis tasks, providing a more complete and trustworthy solution for prenatal diagnosis.
Clinical applicability was assessed by testing Fetal-Net on real-world scans from BCNatal hospitals. Attention maps were generated to provide visual cues for clinicians, and high agreement with manual annotations indicates strong potential for clinical integration.
Discussion
Results show that Fetal-Net is efficient in both the domain of anatomical part detection and the domain of anomaly detection, for which it has been thoroughly evaluated. The model has consistently demonstrated strong performance metrics across multiple tasks, including classification of anatomical structures and identification of fetal anomalies. Table 5 confirms this performance, where Fetal-Net achieved accuracy of 97.5%, precision of 96.5%, recall of 97.8%, and F1 score of 96.6% in anomaly detection. These results are also visualized in Fig. 10, which shows the performance metrics in a bar chart format, and in Fig. 11, which presents the anomaly detection confusion matrix. These visuals indicate the model’s effectiveness in differentiating between normal and abnormal cases, with minimal false positives and false negatives, which is critical in clinical environments where diagnostic errors can have serious implications.

High values of Intersection over Union (IoU), Dice coefficient, sensitivity, and specificity (as shown in Figs. 5, 6, and 8) further support the model’s capability in segmenting and correctly identifying anatomical regions with precision. In clinical terms, a high Dice score reflects the model’s ability to accurately delineate anatomical boundaries, which is especially important when detecting subtle conditions like ventriculomegaly, abnormal femur length, or cervical shortening. These types of abnormalities often manifest through minor deviations in structure size or shape, which the model must be sensitive enough to detect. The consistently high IoU values indicate that the predicted segmentation maps significantly overlap with the expert-annotated ground truth, a critical factor in ensuring trustworthy analysis in fetal health evaluations.

The ability of Fetal-Net to generalize well across different scenarios can be attributed to the diverse and well-curated dataset, which includes over 12,000 ultrasound images collected from two major hospitals and spans multiple anatomical planes and imaging conditions. This broad representation enables the model to maintain high performance even when exposed to varying machine types (e.g., Voluson E6, Aloka, and others) and operator techniques. Additionally, the use of data preprocessing techniques, such as normalization, histogram equalization, scaling, and data augmentation, has played a crucial role in improving model robustness and reducing overfitting.

Fetal-Net’s architecture plays a pivotal role in its performance. The use of multi-scale CNN blocks ensures the extraction of fine and coarse spatial details from ultrasound images, while the integration of transformer layers enables the capture of long-range dependencies and contextual information between anatomical regions. This combination allows the model to learn hierarchical representations of anatomical structures, enabling it to model intricate spatial relationships, such as the alignment between the fetal brain and ventricular system or between the thorax and femur. These interactions are critical for understanding fetal development patterns and detecting deviations. Furthermore, the model’s attention mechanisms provide interpretability by generating attention maps, which highlight the regions most relevant to the model’s predictions. This enhances the clinical trust in the system, as practitioners can visually verify which regions the model focused on when suggesting a diagnosis or classification.
Such transparency is essential in medical AI applications, where explainability directly affects the model’s acceptance and usability by healthcare professionals. The discussion of the results also reveals the model’s potential to assist in real-time clinical decision-making. The high precision and recall values suggest that the model could reduce diagnostic delays and increase the reliability of prenatal assessments. Moreover, the low false negative rate is particularly valuable for anomaly detection, ensuring that fewer pathological cases are missed during screening.

Despite these promising results, certain limitations remain. The model’s performance could be influenced by data imbalance across classes, such as the over-representation of fetal brain images in the dataset. While strategies like data augmentation and class-weighted loss functions were applied to mitigate this, future work could explore advanced rebalancing techniques or incorporate synthetic data generation. Additionally, while the dataset is extensive, expanding it to include first-trimester scans or rare fetal conditions would further improve generalizability and coverage.

In terms of future directions, the integration of clinical metadata such as maternal history, gestational age, and previous pregnancy outcomes could enrich the input features and enhance diagnostic precision. Incorporating a continual learning mechanism would also allow Fetal-Net to adapt to new imaging protocols and clinical settings without retraining from scratch. Finally, prospective clinical validation studies are essential to assess Fetal-Net’s performance in live clinical environments, including integration into ultrasound machines and workflow systems for real-time assistance.

In conclusion, Fetal-Net represents a robust and comprehensive deep learning framework for maternal-fetal ultrasound interpretation. By effectively combining multi-scale feature extraction and contextual reasoning, it delivers high accuracy in anatomical classification and fetal anomaly detection. With further validation and development, Fetal-Net has the potential to become a valuable tool for obstetricians and sonographers in improving prenatal care and fetal health outcomes.
Conclusion
This conclusion summarises the key points of the study, reflects on its contributions, and outlines directions for future prenatal care research. Fetal-Net’s distinguishing features include the production of interpretable attention maps, its fusion process, and its combined use of multi-scale CNNs and transformers. The contribution of the paper to improving fetal health outcomes and the interpretation of maternal-fetal ultrasonography is emphasized. The paper introduces Fetal-Net, a deep learning network for processing fetal ultrasound images, with a focus on the detection of abnormalities and the recognition of anatomical structures. The model has been thoroughly tested, and the findings show that it excels at both tasks.

In the classification of distinct anatomical sections, Fetal-Net performs strongly, with high accuracy, precision, recall, and F1 scores across all kinds of fetal structures. Its ability to generalise across most anatomical planes and to delineate complicated features makes its prospective success as an aiding tool in prenatal care more probable. Its very high precision, accuracy, recall, and F1 score during anomaly detection signify its competence at discerning ordinary and deviating instances, and the confusion matrix gives further evidence that the model correctly identifies potential threats to fetal health.

Despite its excellent performance, further research and development on Fetal-Net are required. Better model generalisation can be achieved by researching preprocessing methods that further enhance the quality and usability of ultrasound images. Adopting a continual learning strategy would make the model quicker to respond and more adaptive to novel datasets and clinical settings. Including additional clinical data, such as patient demographics and history, within the model would support more personalized healthcare. Improving Fetal-Net’s explainability would also help clinicians better comprehend how the model reaches its findings.

Although Fetal-Net is highly effective, it has several limitations. The performance of the model may depend on the dataset size, and its generalizability can be enhanced by increasing the dataset. We conclude that large-scale clinical validation trials must be performed to evaluate the efficacy of Fetal-Net in actual healthcare environments, together with work on integrating Fetal-Net smoothly into current clinical practice so that it is easy for doctors and nurses to use. Finally, Fetal-Net offers a promising new way of examining maternal-fetal ultrasound scans, with far-reaching potential for prenatal medicine. The model’s performance so far shows its promise in helping medical professionals assess fetal health, but further research and improvement are required.
Ethical considerations
All procedures performed in this study involving human participants were conducted in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki Declaration and its later amendments. Ethical clearance for the use of maternal-fetal ultrasound images was obtained from the institutional review boards (IRBs) of both participating hospitals—Hospital Clínic de Barcelona and Hospital Sant Joan de Déu, Spain. The ultrasound images used in this study were fully anonymized prior to analysis to ensure patient privacy and data confidentiality. No identifiable information was stored or processed at any point during the study. The dataset was compiled exclusively from routine prenatal screening sessions conducted between the second and third trimesters. Informed consent was obtained from all participants as part of standard clinical protocol prior to the use of their anonymized data for research purposes. This study was approved under the BCNatal Research Program, which permits retrospective analysis of de-identified medical imaging data for non-commercial academic purposes. All data handling, storage, and experimental procedures complied with the European General Data Protection Regulation (GDPR) guidelines.
Data availability
The entire ultrasound dataset used in the study is publicly available at: https://zenodo.org/record/3904280.
References
Aji, C. P., Fatoni, M. H. & Sardjono, T. A. Automatic measurement of fetal head circumference from 2-dimensional ultrasound, in 2019 International Conference on Computer Engineering, Network, and Intelligent Multimedia (CENIM), Nov. pp. 1–5. IEEE. (2019).
Ahmad, A. et al. Prediction of fetal brain and heart abnormalities using artificial intelligence algorithms: A review. Am. J. Biomedical Sci. Res. 22 (3), 456–466 (2024).
Chen, L. C., Papandreou, G., Kokkinos, I., Murphy, K. & Yuille, A. L. DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 40 (4), 834–848 (2017).
Chen, L. et al. DRINet for medical image segmentation. IEEE Trans. Med. Imaging. 37 (11), 2453–2462 (2018).
Cootes, T. F., Edwards, G. J. & Taylor, C. J. Active appearance models. IEEE Trans. Pattern Anal. Mach. Intell. 23 (6), 681–685 (2001).
Cho, H. et al. A system-on-chip solution for deep learning-based automatic fetal biometric measurement. Expert Syst. Appl. 237, 121482 (2024).
He, K. et al. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification, 2015. [Online]. Available: https://arxiv.org/abs/1502.01852
van den Heuvel, T., de Bruijn, D., de Korte, C. & van Ginneken, B. Automated measurement of fetal head circumference using 2D ultrasound images, PLoS ONE, 13, p. e0200412, (2018).
Jatmiko, W., Habibie, I., Ma’sum, M., Rahmatullah, R. & Satwika, P. Automated telehealth system for fetal growth detection and approximation of ultrasound images. Int. J. Smart Sens. Intell. Syst. 8, 697–719 (2015).
Krishna, T. B. & Kokil, P. Standard fetal ultrasound plane classification based on stacked ensemble of deep learning models. Expert Syst. Appl. 238, 122153 (2024).
Liu, L. et al. Automatic fetal ultrasound image segmentation of first trimester for measuring biometric parameters based on deep learning. Multimedia Tools Appl. 83 (9), 27283–27304 (2024).
Martínez, C. M., Darnell, A., Escofet, C., Mellado, F. & Corona, M. Fetal magnetic resonance imaging. Ultrasound Rev. Obstet. Gynecol. 4 (3), 214–227 (2004).
Nadiyah, P., Rofiqah, N., Firdaus, Q., Sigit, R. & Yuniarti, H. Automatic detection of fetal head using Haar cascade and fit ellipse, in Proc. Seminar Intell. Technol. Its Appl. (ISITIA), Aug. 2019, pp. 320–324. IEEE. (2019) Int.
Rahayu, K. D., Sigit, R., Agata, D., Pambudi, A. & Istiqomah, N. Automatic gestational age estimation by femur length using integral projection from fetal ultrasonography, in Proc. Seminar Appl. Technol. Inf. Commun. (iSemantic), Sep. 2018, pp. 498–502. IEEE. (2018) Int.
Rueda, S. et al. Evaluation and comparison of current fetal ultrasound image segmentation methods for biometric measurements: A grand challenge. IEEE Trans. Med. Imaging. 33 (4), 797–813 (2013).
Skeika, E., Da Luz, M., Fernandes, B., Siqueira, H. & De Andrade, M. Convolutional neural network to detect and measure fetal skull circumference in ultrasound imaging. IEEE Access. 8, 191519–191529 (2020).
Sobhaninia, Z., Emami, A., Karimi, N. & Samavi, S. Localization of fetal head in ultrasound images by multiscale view and deep neural networks, in Proc. Int. Comput. Conf., Comput. Soc. Iran (CSICC), Tehran, Iran, (2020).
Srinivas, A. et al. Bottleneck transformers for visual recognition, in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 16519–16529. (2021).
Szegedy, C. et al. Going deeper with convolutions, in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 1–9. (2015).
Salini, Y., Mohanty, S. N., Ramesh, J. V. N., Yang, M. & Chalapathi, M. M. V. Cardiotocography data analysis for fetal health classification using machine learning models. IEEE Access 12 (2024).
Specktor-Fadida, B. et al. Deep learning–based segmentation of whole-body fetal MRI and fetal weight estimation: assessing performance, repeatability, and reproducibility. Eur. Radiol. 34 (3), 2072–2083 (2024).
Sriraam, N., Chinta, B., Suresh, S. & Sudharshan, S. Enhanced fetal development assessment via contour detection and CRL estimation, in Proc. IEEE 3rd Int. Conf. Control, Instrum., Energy & Commun. (CIEC), Jan. 2024, pp. 396–400. IEEE. (2024).
Thomas, S. & Harikumar, S. An ensemble deep learning framework for foetal plane identification. Int. J. Inf. Technol. 16 (3), 1377–1386 (2024).
Lee, H., Park, J. & Hwang, J. Y. Channel attention module with multiscale grid average pooling for breast cancer segmentation in an ultrasound image. IEEE Trans. Ultrason. Ferroelectr. Freq. Control. 67 (7), 1344–1353 (Feb. 2020).
Xu, L. et al. DW-Net: A cascaded convolutional neural network for apical four-chamber view segmentation in fetal echocardiography. Comput. Med. Imaging Graph. 80, 101690 (2020).
Yang, T. Y., Huang, Y. H., Lin, Y. Y., Hsiu, P. C. & Chuang, Y. Y. SSR-Net: A compact soft stagewise regression network for age estimation, in Proc. Int. Joint Conf. Artif. Intell. (IJCAI), vol. 5, no. 6, p. 7, July (2018).
Płotka, S. et al. Fetal-Net: Multi-task deep learning framework for fetal ultrasound biometric measurements, in Neural Information Processing: 28th International Conference, ICONIP 2021, Sanur, Bali, Indonesia, December 8–12, 2021, Proceedings, Part VI, vol. 28, Springer International Publishing, pp. 257–265. (2021).
Zhang, T., Qi, G. J., Xiao, B. & Wang, J. Interleaved group convolutions, in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), pp. 4373–4382. (2017).
Ziani, S. Enhancing fetal electrocardiogram classification: A hybrid approach incorporating multimodal data fusion and advanced deep learning models. Multimedia Tools Appl. 83 (18), 55011–55051 (2024).
Burgos-Artizzu, X. P. et al. Evaluation of deep convolutional neural networks for automatic classification of common maternal fetal ultrasound planes. Sci. Rep. 10, 10200. https://doi.org/10.1038/s41598-020-67076-5 (2020).
Acknowledgements
The authors present their appreciation to King Saud University for funding this research through Ongoing Research Funding program, (ORF-2025-206), King Saud University, Riyadh, Saudi Arabia.
Author information
Authors and Affiliations
Contributions
Author Contributions: U.I., Y.A.A., M.A.-R., H.U., M.A.A., and K.M.W. contributed to the conceptualization and overall design of the study. H.U., M.A.A., Z.T., and Y.A.A. handled data curation, formal analysis, and methodology refinement. Y.A.A. and M.A.-R. secured funding and oversaw project administration. U.I., H.U., and K.M.W. provided critical resources and were responsible for software development and model implementation. Y.A.A. and M.A.-R. supervised the research process and provided expert guidance throughout. U.I., H.U., M.A.A., Z.T., and Y.A.A. carried out validation and visualization tasks. The original draft was written collaboratively by U.I., H.U., M.A.A., K.M.W., and Z.T., with all authors contributing to manuscript review and final approval. K.M.W. also coordinated the revised version of the manuscript and now serves as the corresponding author.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Islam, U., Ali, Y.A., Al-Razgan, M. et al. Fetal-Net: enhancing Maternal-Fetal ultrasound interpretation through Multi-Scale convolutional neural networks and Transformers. Sci Rep 15, 25665 (2025). https://doi.org/10.1038/s41598-025-06526-4
DOI: https://doi.org/10.1038/s41598-025-06526-4