Introduction

Ultrasound is relatively underrepresented in machine learning research applied to medical imaging when compared to other modalities such as magnetic resonance imaging (MRI) and X-ray computed tomography (CT). By way of illustration, at a leading international conference in medical image analysis in 2023 (Medical Image Computing and Computer-Assisted Intervention, MICCAI 2023)1, only 51 papers were classed under the ultrasound modality topic. In contrast, 185 papers and 126 papers were classed under the MRI and CT topics, respectively. This observation holds true for earlier years as well: in MICCAI 2022, the equivalent figures are 43 ultrasound, 159 MRI, and 82 CT papers2, and in MICCAI 2021, the figures are 32 ultrasound, 128 MRI, and 95 CT papers3. This holds despite the wide variety of tasks performed on ultrasound imaging data, including image classification4,5,6, image captioning7,8,9,10, video captioning11,12, representation learning13,14,15, and segmentation16,17.

In this paper, we report on a study to collate, analyze and score publicly available ultrasound datasets and models. Our goal was to identify open-source resources available for ultrasound imaging research, including machine learning-based analysis and sonography data science, to encourage the scientific community to answer more research questions pertaining to clinical ultrasound.

There are several related reviews on public datasets for other medical imaging modalities and clinical imaging areas. Khan et al.18 review publicly available ophthalmological imaging datasets and score them based on how much information is made available about each dataset. Magudia et al.19 discuss the challenges of assembling large medical datasets and share best practices on cohort selection, data retrieval, storage, and failure handling. Kohli et al.20 highlight the need for improved collection, annotation, and reuse of medical imaging data, as well as the development of best practices and standards. These works19,20 suggest a need to standardize how datasets are described and what information is tracked when they are collected and prepared. In this paper, we describe the methods used in our search for public ultrasound datasets and open-source models, detail what we found to be available, and note interesting points relating to these public resources.

To build AI-powered clinical solutions that generalize well, researchers need access to datasets on which to train their proposed models. Ideally, they would train on a variety of datasets with different characteristics, such as the country of origin, the type of scanning device used, and the time frame during which the data was collected.

AI-powered clinical solutions that could benefit current medical practice can take many forms. One example is assistive tools that speed up the scanning process (e.g., automatic segmentation of anatomies of interest), directly improving clinician productivity and allowing medical facilities to serve more people in the same window of time.

Results

Findings of the dataset search

The results of the dataset search suggest that a number of different anatomies are represented in open-source ultrasound datasets, as shown in Figs. 1 and 2. It is interesting to note from Table 1 that our search implies that the popularity of ultrasound imaging in clinical practice or clinical research does not translate into the availability of public datasets. An example is the prostate, for which only four publicly available datasets were identified. Fetus and fetal-related anatomies are likewise among the less prevalent anatomies in public US datasets, as shown in Fig. 2. In Figs. 3 and 4, we summarize where and during which years, respectively, publicly available US datasets were acquired. These are but two ways to visualize some of the characteristics of the datasets.

Fig. 1: Human anatomy diagrams showing what anatomy groups are represented in the publicly available US datasets.
figure 1

Left shows male anatomy, right female anatomy. The colors orange, green, black, blue, red, violet, pink, yellow, and peach represent parts of the circulatory system, the brain, the thyroid gland, parts of the digestive system, the kidneys, the lungs and the breasts, lymph nodes, the reproductive system, and the musculoskeletal system, respectively. The fetus is not shown in the anatomy diagrams. The plot was made using the pyanatomogram library180. Anatomogram silhouettes © EMBL-EBI Expression Atlas, CC BY 4.0181.

Fig. 2: Pie chart showing the breakdown of different anatomy groups in the publicly available US datasets.
figure 2

The numbers shown are the number of datasets that contain data samples of this anatomy group. Some datasets contain multiple anatomy groups. The plot was made using the python library Matplotlib182.

Fig. 3: The geographical distribution showing where the reported datasets were acquired.
figure 3

Note that datasets that do not mention where the data is acquired are not included. This map, plotted using the python library plotly183, is provided for illustrative purposes only and does not reflect any legal or political stances. Basemap: Natural Earth (Public Domain).

Fig. 4: A chart showing the acquisition year ranges of the different datasets.
figure 4

Please note this only includes datasets that mention ranges of acquisition. The plot was made using the python library Matplotlib182.

Table 1 Summarized characteristics of the open-sourced ultrasound datasets

In our proposed SonoDQS dataset quality score, there are seven possible tiers or ranks: diamond, platinum, gold, silver, bronze, steel, and unrated. A plurality (40.32%) of the datasets are of similar quality, corresponding to bronze (the fifth of the seven tiers) in the SonoDQS ranking. The classification of the datasets according to the SonoDQS ranks can be found in the pie chart in Fig. 5. Some dataset characteristics are considered more important when showing how complete a dataset is. Fig. 6 shows these characteristics, and it is pleasing to note that most of them are shared and made available with their corresponding datasets.

Fig. 5: A pie chart showing the classification of datasets according to the SonoDQS levels as a percentage of the total number of datasets.
figure 5

The plot was made using the python library Matplotlib182.

Fig. 6: A bar chart showing the completeness of the datasets in terms of the 13 most commonly shared characteristics.
figure 6

100% means this characteristic or information is always shared with datasets. The plot was made using the python library Matplotlib182.

Discussion

We identified 72 public ultrasound datasets covering different anatomies and coming from different parts of the world. We observed that anatomies such as fetal content and prostate did not have as many public datasets as one might have expected, given the widespread use of ultrasound in these clinical areas. We identified 56 open-source models trained on ultrasound data. It is noteworthy that when open-sourcing models, researchers share more comprehensive information about the models than they do when sharing datasets.

Most open-source models (30/56) do not actually make their model weights publicly available; however, they still met our definition of open source because the code, the dataset, and the training details are available.

We recognize that the boundary thresholds for the SonoDQS ranks were set empirically and could undergo refinement in the future. The ranks are nonetheless useful as a guide to help in the selection of datasets to consider and as a way to look at the quality of ultrasound datasets as a whole. A researcher is likely to use the dataset characteristics when deciding which exact datasets are useful for their purpose. The fact that there are few open-sourced fetal ultrasound datasets and prostate ultrasound datasets is particularly noteworthy; the published literature in these areas is based primarily on private data.

The proposed SonoDQS and SonoMQS scores are fundamentally meant to encourage the consistent sharing of comprehensive information on public ultrasound datasets and open-source models in a standardized manner. This consistent, standardized sharing of relevant information is important for clinical decision making and ultrasound diagnostics. A hospital may, for example, be interested in using a readily available model but want one trained on data from its geographic region or acquired with equipment similar to what it has in-house; openly shared details make such a model easier to identify. Being open about such details ensures greater transparency, which is important in the clinical space, and reassures potential users because they know more about the datasets and the models. Being more open about these details might also help, if only slightly, with clinical translation.

We acknowledge that our work has limitations. In terms of datasets, we focus only on publicly available datasets, disregarding in our analyses and reporting private datasets that are publicly known to exist but are not available. The number of SonoDQS ranks (diamond, platinum, etc.) is somewhat subjective, despite being chosen through an iterative process. The ranks exist to make it easy to quickly compare the SonoDQS scores of a large number of datasets.

This paper catalogs open-sourced ultrasound datasets and ranks them based on how open they are and how much information has been shared about the data. Open-source models are also reported and ranked in terms of how openly they have been shared. For the reporting of both the public datasets and the open-sourced models, the cut-off date was September 2025.

Methods

Identification of publicly available ultrasound imaging datasets

Team members searched for information on datasets and models starting in April 2024 and ending in September 2025. The search strategies are described next. Each team member was tasked with searching for and identifying publicly available ultrasound datasets. This was achieved in a number of ways, such as through search engines (most typically Google search21), the website Paperswithcode22, WeChat search23, social media, blogs, Google Scholar24, ChatGPT25 and its ability to conduct searches through Bing26, survey papers27, and Google Dataset Search28. The search terms used to identify and locate open-source ultrasound imaging datasets include “ultrasound imaging”, “public ultrasound datasets”, and “ultrasound”.

One important characteristic is dataset access. A dataset may be available immediately as an open-access dataset or available on request, which may require the signing of an agreement before the dataset can be downloaded. This is similar to what was done by ref. 18, who also noted whether publicly available ophthalmological datasets were open access or available on request.

There are several dataset characteristics that we attempted to identify. We based these on the checklist used by the organizers of the MICCAI 2024 Challenges29. The collected information goes into two tables: the summary table and the detailed table. The 13 more important dataset characteristics of interest are: (1) dataset name, (2) link to dataset, (3) anatomy, (4) scanning mode, (5) multi-modal (text+image), (6) total dataset size, (7) training, validation, and test set sizes, (8) task category, (9) geography of study, (10) year(s) of acquisition, (11) other imaging and metadata, (12) single machine or multiple machines (single center or multiple centers), and (13) open access (OA) or available on request (AR). Please find the interactive table on display on the webpage interface.
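As a concrete illustration of what is tracked per dataset, a single summary-table entry can be represented as a simple record. The following is a hypothetical Python sketch only; the field names are ours and are not part of the published tables.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class SummaryEntry:
    """One illustrative row of the summary table of dataset characteristics."""
    name: str                            # (1) dataset name
    link: str                            # (2) link to dataset
    anatomy: str                         # (3) anatomy
    scanning_mode: str                   # (4) scanning mode
    multimodal_text_image: bool          # (5) multi-modal (text + image)
    total_size: Optional[int]            # (6) total dataset size (None if not reported)
    split_sizes: Optional[Tuple[int, int, int]]  # (7) training/validation/test set sizes
    task_category: str                   # (8) task category
    geography: Optional[str]             # (9) geography of study
    acquisition_years: Optional[str]     # (10) year(s) of acquisition
    other_metadata: Optional[str]        # (11) other imaging and metadata
    multi_center: Optional[bool]         # (12) single vs. multiple machines/centers
    access: str                          # (13) "OA" (open access) or "AR" (available on request)
```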

There are 20 further characteristics that are relatively less important but are still collected and considered. These characteristics are: (1) device(s) used to acquire data (US machine type and transducer), (2) data acquisition protocols, (3) center or institute collecting or providing data, (4) characteristics of the data acquirers (such as their experience level), (5) other characteristics of the training, validation, and testing datasets (such as the data distribution), (6) what is considered a single data sample, (7) number of operators or sonographers, (8) how annotations are made (manually or automatically) and by how many annotators, (9) the instructions given to the annotators, (10) annotation description, (11) how were annotations merged together if required, (12) dataset application cases (such as education), (13) ethics approval statement, (14) data usage agreement, (15) challenge if dataset is associated with a challenge, (16) imaging modalities and techniques used, (17) context information relevant to the spatial or spatio-temporal data, (18) patients’ (or subjects’) details, such as gender, age, and medical history, (19) Doppler imaging (Y/N), and (20) B-mode imaging (Y/N).

For ease of readability of this paper, a reduced version of the table is included as Table 1. This table has four columns: (1) the name of the dataset, (2) the anatomies involved, (3) the total size of the dataset, and (4) whether the dataset is open access or available on request. The full table webpage has been uploaded as additional material in the form of HTML files.

Another characteristic of note is the scanning mode. We have specified five possible scanning modes: (1) 2D ultrasound images from freehand scanning, potentially containing both standard and non-standard planes for the scanned anatomy, (2) 2D ultrasound images meant solely for diagnostics and measurements, typically containing only standard planes for the scanned anatomy, (3) 2D+t content (i.e., ultrasound videos), (4) 3D ultrasound images (there is different physics involved and different probes used compared to 2D imaging), and (5) 3D ultrasound images reconstructed from 2D ultrasound images.
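For bookkeeping during tabulation, these five categories can be encoded as a simple enumeration. The sketch below is illustrative only; the identifier names are ours.

```python
from enum import Enum

class ScanningMode(Enum):
    """The five scanning-mode categories used to label the datasets (names are illustrative)."""
    FREEHAND_2D = 1        # 2D freehand scanning; standard and non-standard planes
    DIAGNOSTIC_2D = 2      # 2D images for diagnostics/measurements; standard planes only
    VIDEO_2D_T = 3         # 2D+t content, i.e., ultrasound videos
    NATIVE_3D = 4          # 3D ultrasound acquired with dedicated 3D probes
    RECONSTRUCTED_3D = 5   # 3D volumes reconstructed from 2D ultrasound images
```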

Dataset characteristics were extracted from the source material by the team member who found the dataset during the data search stage. Subsequently, a second team member verified and, if necessary, corrected the characteristics. This cross-referencing was done as a quality assurance step to ensure the integrity of the content in the paper, the webpage, and their associated tables.

Identification of open-source ultrasound imaging models

On 7 July 2024, in a lecture titled “Collaborate as a Community: Innovations and Trends in Open-Source AI”, presented at the Machine Learning for Health & Bio track of the Oxford Machine Learning Summer School (Oxford, UK), Vincent Moens described four elements of a trained model that relate to its “openness”: (1) the code, (2) the training data, (3) the model weights, and (4) the training procedure details (unpublished). A fully open model will provide details on all four elements, and a closed model provides no detail. Models may also have different degrees of openness between these two extremes. While a fully open model is desirable, a partially open model may still be very useful for community research and allow for model comparisons.

Any deep learning models trained on ultrasound data were eligible for inclusion. The search was conducted in two stages. First, we performed a literature search to identify studies related to ultrasound data analysis published since 2022. This search focused on top journals and conferences in medical imaging, including IEEE Transactions on Medical Imaging, Medical Image Analysis, and the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI). We used keywords describing various types of ultrasound, including ‘ultrasound’, ‘US’, ‘sonography’, and ‘sonographer’. This returned 112 IEEE Transactions on Medical Imaging papers, 170 Medical Image Analysis papers, and 97 MICCAI papers. In the second stage, we conducted a targeted search on GitHub, using similar keywords to identify highly starred repositories published since 2022. We reviewed the top results from the first ten pages after sorting by stars.

For each identified paper or GitHub repository, we checked whether open-source code and model weights were provided. If the training data is open source and the training details and code are shared, we recorded it as an open-source model even if the model weights are not available. We did this because, theoretically, given the code, the training details, and the training data, one could obtain the weights by training the model in the same way on the same data. The search across these two sources was done from August 2024 to December 2024.

The search yielded 52 open-source models that met our criteria. In summary, to find the open-source models, we perused the GitHub pages of many publications from the main conferences of MICCAI 2024, MICCAI 2023, and MICCAI 2022, as well as the journals Medical Image Analysis (MedIA) and IEEE Transactions on Medical Imaging (TMI).

From July 2025 to September 2025, we expanded our search strategy to ensure broader and more systematic coverage of the literature. Specifically, we used Google Scholar to search for ultrasound deep learning models published after 2022 using the keywords “ultrasound” + “deep learning”. We reviewed the first 20 pages of results for each query, corresponding to several hundred candidate works, and selected representative high-quality papers based on (i) publication in top-tier venues, (ii) relevance to ultrasound modeling, and (iii) availability of a link to a GitHub repository. This brought the total number of open-source models found to 56.

There are specific characteristics we report for open-source models, including direct links to the open-source code, the task the model was designed for, the corresponding open-source dataset and its accessibility, training details, availability of model weights, competing interest statements, authorship and contribution statements, and software citations. These characteristics make up the columns of the model table, as shown in Table 2. We categorize the models based on 13 different tasks: image enhancement, generation, pretraining/foundation model, detection, segmentation, estimation, classification, post-processing, report generation, scanning-guide, explainability, reconstruction, and registration. Here, image enhancement means enhancing low-quality ultrasound images into high-quality images. Post-processing refers to matching the clinical post-processing techniques typically found in commercial ultrasound scanners. Estimation means the prediction or quantification of different diagnostic parameters from ultrasound images, such as ejection fraction, biometry, and displacement. Scanning-guide indicates an automated scanning guidance algorithm designed to help sonographers capture high-quality ultrasound images.

Table 2 Open-sourced ultrasound models and some of their characteristics: the availability of their training dataset, weights, and code, task, and score

Scoring datasets and models

We draw inspiration from the data quality score used by Health Data Research UK (HDRUK)30,31 to define an ultrasound dataset quality score. In this context, datasets are not scored based on the quality of the data samples or how well-annotated they are, but rather on how complete the information shared on the dataset is. We have adapted the HDRUK score, and for the sake of clarity, we refer to our modified HDRUK score as SonoDQS.

It is worth noting that the score components are weighted. Different characteristics that contribute to SonoDQS have different weights based on their relative importance. For SonoDQS, the weighting was chosen as follows: the characteristics that we consider more important are those included in the summary table, and these are given twice as much weight as the characteristics appearing only in the detailed table.

$$\text{Weighted Completeness Percent}=\sum (\text{weights of completed fields})$$
(1)
$$\text{SonoDQS}=\frac{\text{Weighted Completeness Percent}}{\text{Total number of characteristics}}$$
(2)

SonoDQS quantifies the completeness of data entries by weighting completed fields, summed in the numerator. The denominator is the total number of dataset characteristics. The result provides a percentage that reflects the extent of completeness, adjusted for the relative importance or weight of the different characteristics.

HDRUK scores are binned into four levels: “bronze”, “silver”, “gold”, and “platinum”30. For SonoDQS, we also use “platinum”, “gold”, “silver”, and “bronze” as rankings. Additionally, we include “unrated”, which denotes datasets that have significant gaps in data reporting or unverifiable details.

Initially, for SonoDQS, we assigned a level of “platinum” for scores ≥0.6, “gold” for scores <0.6 and ≥0.5, “silver” for scores <0.5 and ≥0.4, and “bronze” for scores below 0.4. Level boundaries were subsequently refined empirically as follows.

Firstly, to emphasize the greater quality of platinum and gold, we adjusted the level thresholds as follows: ≥0.7 for the “platinum” level, <0.7 and ≥0.55 for “gold”, <0.55 and ≥0.25 for “silver”, <0.25 and ≥0.1 for “bronze”. Datasets that score <0.1 are considered “unrated”.

The thresholds were further refined so that the threshold for silver became ≥0.57. Two more levels, “diamond” (the highest) and “steel” (above “unrated”), were subsequently introduced. The final levels were fixed at “diamond” ≥0.87, “platinum” <0.875 and ≥0.77, “gold” <0.77 and ≥0.67, “silver” <0.67 and ≥0.57, “bronze” <0.5 and ≥0.45, “steel” <0.45 and ≥0.32, and “unrated” <0.32. The SonoDQS level boundary values were set early, while dataset collection was still ongoing (when no more than 32 datasets had been found), to define wide, non-overlapping ranges even as we expanded from four to five to seven levels. These values were agreed on by our internal team, ensuring coverage of a wide range of values, and then fixed for all subsequent analyses, avoiding post hoc fitting once the search for datasets was complete.
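As an illustration, the final tier assignment can be sketched in a few lines of Python. This is a minimal sketch under the assumption that each tier is identified by the lower bound listed above, checked from the highest tier downwards; the function name is ours and purely illustrative.

```python
def sonodqs_level(score: float) -> str:
    """Map a SonoDQS score (0.0-1.0) to a tier name, using the lower bound of
    each final level described in the text (an illustrative interpretation)."""
    bounds = [
        (0.87, "diamond"),
        (0.77, "platinum"),
        (0.67, "gold"),
        (0.57, "silver"),
        (0.45, "bronze"),
        (0.32, "steel"),
    ]
    for lower, tier in bounds:   # checked from highest to lowest
        if score >= lower:
            return tier
    return "unrated"
```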

Our motivation and justification for the threshold choices was to reward datasets for being more open about their characteristics and how they were collected. Open-sourced datasets can nonetheless be useful for research irrespective of the richness of the metadata, and our naming convention was chosen to recognize the current breadth of metadata available, from zero upwards.

A perfect SonoDQS score (1.0) means that all the details that a majority of researchers and practitioners might want to know about a given dataset have been revealed by the publishers of the dataset. The fewer of those details are available, the lower the score. Those details, or characteristics, were determined in an iterative process. First, we started with the details that MICCAI Challenges32 are expected to include when introducing a dataset for a Challenge. Then, we culled the details that are irrelevant to ultrasound imaging specifically, while adding missing characteristics that are important to know before working on an ultrasound dataset. Finally, we ranked the characteristics into two tiers to reflect their importance to a researcher who has yet to start working on a dataset.

The higher-tier characteristics have twice as much of an effect (value = 1 + 1) on the SonoDQS score as the lower-tier characteristics (value = 1). If the information pertaining to a characteristic is not available, a dataset gets a value of 0 for that characteristic. The values are summed together and normalized, giving us the SonoDQS score. Therefore, assuming we are only working with three characteristics, where one is high-tier and filled in, one is low-tier and filled in, and one is low-tier but not filled in, the SonoDQS score would be ((1 + 1) + 1 + 0)/4.
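To make the weighting concrete, here is a minimal Python sketch of the calculation implied by the worked example above, normalizing the weighted sum of completed characteristics by the maximum achievable weight; the function and variable names are ours.

```python
def sonodqs(filled_high_tier: int, total_high_tier: int,
            filled_low_tier: int, total_low_tier: int) -> float:
    """Weighted completeness score: high-tier characteristics count double."""
    achieved = 2 * filled_high_tier + 1 * filled_low_tier
    maximum = 2 * total_high_tier + 1 * total_low_tier
    return achieved / maximum

# Worked example from the text: one high-tier characteristic filled in, and
# one of two low-tier characteristics filled in -> ((1 + 1) + 1 + 0) / 4 = 0.75
print(sonodqs(filled_high_tier=1, total_high_tier=1,
              filled_low_tier=1, total_low_tier=2))  # 0.75
```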

As with the publicly available datasets, we also rated the open-source models. It is important to emphasize that we do not rate them based on their performance on tasks, but rather on how clearly the authors of the model have explained what the model does, how the model was trained, and what data it was trained on.

Inspired by RipetaScores33, we developed a modified scoring method tailored to open-source ultrasound models, which we call the Sonographic Model Quality Score (SonoMQS). Unlike the RipetaScore, SonoMQS omits certain criteria less relevant to this context, such as biological materials documentation. Instead, it emphasizes transparency and accessibility in deep learning models.

RipetaScores are not specific to machine learning-based models but rather a mechanism to score scientific research33. In a RipetaScore, there are three main considerations: reproducibility, professionalism, and research. Reproducibility focuses on whether the research being scored has enough information associated with it that would allow it to be reproduced, such as the availability of the data the model was trained on. The reproducibility category considers aspects such as the availability and location of data, the availability of code, software citations, and the documentation of biological materials. The professionalism category assesses the presence of author contribution statements, authorship clarity, competing interest declarations, ethical approval details, and funding transparency. The research category evaluates the clarity of study objectives and the presence of well-defined section headers in the manuscript. Together, these components provide a comprehensive measure of a study’s adherence to good scientific practices.

$$\text{RipetaScore}=\text{Research Check (pass/fail)}\times\left[\text{Professionalism }(0\text{–}10)+\text{Reproducibility }(0\text{–}20)\right]$$
(3)
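For clarity, Eq. (3) can be read as follows. This is a minimal sketch, under the assumption that the pass/fail research check acts as a 1/0 gating factor; the function name is ours.

```python
def ripeta_score(research_check_passed: bool,
                 professionalism: float,    # component score in the range 0-10
                 reproducibility: float) -> float:  # component score in the range 0-20
    """RipetaScore as in Eq. (3): the research check gates the sum of the
    professionalism and reproducibility components."""
    gate = 1 if research_check_passed else 0
    return gate * (professionalism + reproducibility)
```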

For SonoMQS, we give a model a point for an affirmative answer (i.e., “yes”) to each of the following twelve questions: (1) training dataset is public, (2) model weights are public, (3) code is shared, (4) training details are mentioned, (5) study objective is mentioned, (6) funding statement is mentioned, (7) ethics approval statement is mentioned, (8) competing interests (conflict of interest) statement is mentioned, (9) authorship is mentioned, (10) authorship contribution statement is mentioned, (11) code availability statement is mentioned, and (12) software citations are included, with questions (1), (2), (3), and (4) being weighted twice as much. The total number of points is then divided by 16 to obtain the normalized score reported in Table 2 as SonoMQS.
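The following is a minimal Python sketch of this calculation; the criterion keys are ours and purely illustrative. With four double-weighted and eight single-weighted criteria, the maximum total is 4 × 2 + 8 × 1 = 16, which is why the sum is divided by 16.

```python
# The twelve SonoMQS criteria; the first four are double-weighted.
DOUBLE_WEIGHTED = ["dataset_public", "weights_public", "code_shared", "training_details"]
SINGLE_WEIGHTED = ["study_objective", "funding", "ethics_approval", "competing_interests",
                   "authorship", "contribution_statement", "code_availability_statement",
                   "software_citations"]

def sonomqs(answers: dict) -> float:
    """Normalized SonoMQS score in [0, 1] from yes/no answers to the twelve criteria."""
    points = sum(2 for k in DOUBLE_WEIGHTED if answers.get(k, False))
    points += sum(1 for k in SINGLE_WEIGHTED if answers.get(k, False))
    return points / 16  # 16 = 4 * 2 + 8 * 1
```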

It is important to emphasize that SonoMQS does not measure a model’s performance in its given task. It is not a typical evaluation metric. It is a score that reflects the quality of the reporting surrounding a given model. For example, a high SonoMQS score for a model that performs segmentation does not mean that the model is excellent at segmentation, but rather that the developers of the model have been forthcoming and open about details such as how the model was trained or what the model was trained on.