Introduction

Female breast and cervical cancer remain major contributors to the global burden of cancer1,2. The World Health Organization (WHO) reported that approximately 2.86 million new cases (14.8% of all cancer cases) and 1.03 million deaths (10.3% of all cancer deaths) were recorded worldwide in 20203. This burden disproportionately affects women in low- and middle-income countries (LMICs), which can largely be attributed to more advanced-stage diagnoses, limited access to early diagnostics, and suboptimal treatment4,5. Population-based cancer screening programs that are effective in high-income countries might not be as effective in LMICs, owing to limited resources for treatment and palliative care6,7. Integrative cancer screening is a complex undertaking that must take biological and social determinants, as well as ethical constraints, into consideration; as is already known, early detection of breast and cervical cancer is associated with improved prognosis and survival8,9. It is therefore vital to select the most accurate and reliable technologies capable of identifying early signs of disease.

Medical imaging plays an essential role in tumor detection, especially within progressively digitized cancer care services. For example, mammography and ultrasound, as well as cytology and colposcopy, are commonly used in clinical practice10,11,12,13,14. However, fragmented health systems in LMICs may lack the infrastructure, and perhaps the trained workforce, required to ensure high-quality screening, diagnosis, and treatment. This limits the reach of the traditional detection technologies mentioned above, all of which require sophisticated training15. Furthermore, substantial inter- and intra-operator variability affects both machine and human performance, leaving the interpretation of medical imaging vulnerable to human error. Experienced doctors tend to be more accurate, although their expertise is not always readily available to marginalized populations or to those living in remote areas. Resource-based testing and the deployment of effective interventions together could reduce cancer morbidity and mortality in LMICs16. In line with this, an ideal detection technology for LMICs should, at a minimum, have low training needs.

Deep learning (DL), a subset of artificial intelligence (AI), can be applied to medical imaging and has shown promise in automated detection17,18. While media headlines tend to overemphasize DL model findings in one direction or the other19, few studies have convincingly demonstrated either inferiority or superiority. Nevertheless, the Food and Drug Administration (FDA) has approved a select number of DL-based diagnostic tools for clinical practice, even though further critical appraisal and independent quality assessment are still pending20,21. To date, few specialty-specific systematic reviews, such as this one, have assessed the diagnostic performance of DL algorithms in medical imaging, particularly for breast and cervical cancer.

Results

Study selection and characteristics

Our search initially identified 2252 records, of which 2028 were screened after removing 224 duplicates. A further 1957 records were excluded as they did not fulfil our predetermined inclusion criteria. We assessed 71 full-text articles, of which 36 were excluded, leaving 35 included studies: 25 focused on breast cancer and 10 on cervical cancer (see Fig. 1). Study characteristics are summarized in Tables 1–3.

Fig. 1: PRISMA flowchart of study selection.

Displayed is the PRISMA (preferred reporting items for systematic reviews and meta-analyses) flow diagram of the search methodology and literature selection process.

Table 1 Study design and basic demographics.
Table 2 Methods of model training and validation.
Table 3 Indicators, algorithms and data sources.

Thirty-three studies utilized retrospective data and only two used prospective data; two studies also used data from open-access sources. No study reported a prespecified sample size calculation. Eight studies excluded low-quality images, while 27 did not report on image quality. Eleven studies performed external validation using an out-of-sample dataset, while the others performed internal validation using an in-sample dataset. Twelve studies compared DL algorithms against human clinicians using the same dataset. Medical imaging modalities were categorized into cytology (n = 4), colposcopy (n = 4), cervicography (n = 1), microendoscopy (n = 1), mammography (n = 12), ultrasound (n = 11), and MRI (n = 2).

Pooled performance of DL algorithms

Among the 35 studies in this sample, 20 provided sufficient data to create contingency tables for calculating diagnostic performance and were therefore included for synthesis at the meta-analysis stage. Hierarchical SROC curves for these studies (i.e. 55 contingency tables) are provided in Fig. 2a. When averaging across studies, the pooled sensitivity and specificity were 88% (95% CI 85–90), and 84% (95% CI 79–87), respectively, with an AUC of 0.92 (95% CI 0.90–0.94) for all DL algorithms.

Fig. 2: Pooled overall performance of DL algorithms.

a Receiver operator characteristic (ROC) curves of all studies included in the meta-analysis (20 studies with 55 tables), and b ROC curves of studies reporting the highest accuracy (20 studies with 20 tables).

Most studies reported diagnostic performance for more than one DL algorithm; we therefore also pooled the highest-accuracy algorithm from each included study, giving 20 contingency tables. In this analysis, the pooled sensitivity and specificity were 89% (86–92%) and 85% (79–90%), respectively, with an AUC of 0.93 (0.91–0.95). Please see Fig. 2b for further details.

Subgroup meta-analyses

Four separate meta-analyses were conducted:

  I.

    Validation types—Fifteen studies with 40 contingency tables were validated with an in-sample dataset and had a pooled sensitivity of 89% (87–91%) and a pooled specificity of 83% (78–86%), with an AUC of 0.93 (0.91–0.95); see Fig. 3a for details. Only eight studies with 15 contingency tables performed an external validation, for which the pooled sensitivity and specificity were 83% (77–88%) and 85% (73–92%), respectively, with an AUC of 0.90 (0.87–0.92); see Fig. 3b.

    Fig. 3: Pooled performance of DL algorithms using different validation types.

    a Receiver operator characteristic (ROC) curves of studies with internal validations (15 studies with 40 tables), b ROC curves of studies with external validations (8 studies with 15 tables).

  II.

    Cancer types—Ten studies with 36 contingency tables targeting breast cancer had a pooled sensitivity of 90% (87–92%) and specificity of 85% (80–89%), with an AUC of 0.94 (0.91–0.96); see Fig. 4a. Ten studies with 19 contingency tables considered cervical cancer, with a pooled sensitivity and specificity of 83% (78–88%) and 80% (70–88%), respectively, and an AUC of 0.89 (0.86–0.91); see Fig. 4b for details.

    Fig. 4: Pooled performance of DL algorithms using different cancer types.

    a Receiver operator characteristic (ROC) curves of studies in detecting breast cancer (10 studies with 36 tables), and b ROC curves of studies in detecting cervical cancer (10 studies with 19 tables).

  III.

    Imaging modalities—Four mammography studies with 15 contingency tables had a pooled sensitivity of 87% (82–91%), a pooled specificity of 88% (79–93%), and an AUC of 0.93 (0.91–0.95); see Fig. 5a. Four ultrasound studies with 17 contingency tables had a pooled sensitivity of 91% (89–93%), a pooled specificity of 85% (80–89%), and an AUC of 0.95 (0.93–0.96); see Fig. 5b. Four cytology studies with six contingency tables had a pooled sensitivity of 87% (82–90%), a pooled specificity of 86% (68–95%), and an AUC of 0.91 (0.88–0.93); see Fig. 5c. Four colposcopy studies with 11 contingency tables had a pooled sensitivity of 78% (69–84%), a pooled specificity of 78% (63–87%), and an AUC of 0.84 (0.81–0.87); see Fig. 5d.

    Fig. 5: Pooled performance of DL algorithms using different imaging modalities.

    a Receiver operator characteristic (ROC) curves of studies using mammography (4 studies with 15 tables), b ROC curves of studies using ultrasound (4 studies with 17 tables), c ROC curves of studies using cytology (4 studies with 6 tables), and d ROC curves of studies using colposcopy (4 studies with 11 tables).

  IV.

    DL algorithms versus human clinicians—Of the 20 included studies, 11 compared diagnostic performance between DL algorithms and human clinicians using the same dataset, yielding 29 contingency tables for DL algorithms and 18 for human clinicians. The pooled sensitivity was 87% (84–90%) for DL algorithms, compared with 88% (81–93%) for human clinicians. The pooled specificity was 83% (76–88%) for DL algorithms and 82% (72–88%) for human clinicians. The AUC was 0.92 (0.89–0.94) for both DL algorithms and human clinicians (Fig. 6a, b).

    Fig. 6: Pooled performance of DL algorithms versus human clinicians using the same sample.

    a Receiver operator characteristic (ROC) curves of studies using DL algorithms (11 studies with 29 tables), and b ROC curves of studies using human clinicians (11 studies with 18 tables).

Heterogeneity analysis

All included studies found that DL algorithms are useful for the detection of breast and cervical cancer using medical imaging when compared with histopathological analysis as the gold standard; however, extreme heterogeneity was observed: sensitivity (SE) had an I² of 97.65% and specificity (SP) an I² of 99.90% (p < 0.0001), see Fig. 7.

Fig. 7: Summary estimate of pooled performance using forest plot.

Forest plot of the studies included in the meta-analysis (20 studies).
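For reference, the I² values quoted above follow the standard Higgins definition based on Cochran's Q, where k is the number of pooled contingency tables:

$$I^{2} = \max\left(0,\; \frac{Q - (k - 1)}{Q}\right) \times 100\%$$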

A funnel plot was produced to assess publication bias. The p value of 0.41 suggests no publication bias, although studies were widely dispersed around the regression line; see Supplementary Fig. 3 for further details. To identify the source(s) of such extreme heterogeneity, we conducted subgroup analyses and found the following:

  I.

    Validation types—internal validation (SE, I² = 97.60%; SP, I² = 99.19%; p < 0.0001) and external validation (SE, I² = 96.15%; SP, I² = 99.96%; p < 0.0001). See Supplementary Fig. 4.

  II.

    Cancer types—breast cancer (SE, I² = 95.84%; SP, I² = 99.86%; p < 0.0001) and cervical cancer (SE, I² = 98.16%; SP, I² = 99.89%; p < 0.0001). Please see Supplementary Fig. 5 for further details.

  III.

    Imaging modalities—mammography (SE, I² = 97.01%; SP, I² = 99.93%; p < 0.0001), ultrasound (SE, I² = 86.49%; SP, I² = 96.06%; p < 0.0001), cytology (SE, I² = 89.97%; SP, I² = 99.90%; p < 0.0001), and colposcopy (SE, I² = 98.12%; SP, I² = 99.59%; p < 0.0001); see Supplementary Fig. 6.

However, the heterogeneity was not attributable to any specific subgroup, nor was it reduced to an acceptable level, as all subgroup I² values remained high. We therefore could not reliably infer whether different validation types, cancer types, or imaging modalities influenced DL algorithm performance in detecting breast and cervical cancer.

To further investigate this finding, we performed a meta-regression analysis with these covariates (see Supplementary Table 1). The results highlighted a statistically significant difference, which is in line with the subgroup and meta-analytical sensitivity analyses.

Quality assessment

The quality of the included studies was assessed using QUADAS-2, and a summary of the findings is provided in the supplementary materials as Supplementary Fig. 1. A detailed assessment of each item across the risk-of-bias and applicability-concern domains is provided as Supplementary Fig. 2. For the patient selection domain of risk of bias, 13 studies were considered to be at high or unclear risk of bias owing to unreported inclusion or exclusion criteria, or improper exclusions. For the index test domain, only one study was considered to be at high or unclear risk of bias because it had no predefined threshold, whereas the others were considered to be at low risk of bias.

For the reference standard domain, three studies were considered to be at high or unclear risk of bias owing to reference standard inconsistencies, with no mention of whether the threshold was determined in advance or whether blinding was implemented. For the flow and timing domain, five studies were considered to be at high or unclear risk of bias because the authors had not stated whether there was an appropriate interval between the index test and the reference standard, or whether assessments were based on the same gold standard.

In the applicability domain, 12 studies were considered to have high or unclear concerns regarding patient selection. One study also had unclear applicability in the reference standard domain, and there were no applicability concerns in the index test domain.

Discussion

Artificial intelligence in medical imaging is without question improving; however, we must subject this emerging knowledge to the same rigorous testing we would apply to any other diagnostic procedure. Deep learning could reduce over-reliance on experienced clinicians and could, with relative ease, be extended to rural communities and LMICs. While this relatively inexpensive approach may help to bridge inequality gaps across healthcare systems generally, evidence is increasingly highlighting the value of deep learning in cancer diagnostics and care. Within the field of female cancer diagnosis, one of the representative technologies is computer-assisted cytology image diagnosis, such as the FDA-approved PAPNET and AutoPap systems, which date back to at least the 1970s22. As AI technology progresses rapidly, DL algorithms are becoming an increasingly important element of automated image-based cytology analysis systems. These technologies have the potential to reduce reading time and improve the quality of cytological interpretation. Here, we attempted to ascertain which is the most accurate and reliable detection technology presently available in the field of breast and cervical cancer diagnostics.

A systematic search for pertinent articles identified three systematic reviews with meta-analyses that investigated DL algorithms in medical imaging. However, these covered diverse domains, which makes direct comparison with the present review difficult. For example, Liu et al. 23 found that DL algorithm performance in medical imaging might be equivalent to that of healthcare professionals. However, only breast and dermatological cancers were analyzed in more than three studies, which not only inhibits generalizability but also highlights the need for further research on DL algorithm performance in medical imaging. In identifying pathologies, Aggarwal et al. 24 found that DL algorithms have high diagnostic performance. However, the authors also found high heterogeneity, which they attributed to the combination of distinct methods and, perhaps, to unspecified terminology. They concluded that we need to be cautious when considering the diagnostic accuracy of DL algorithms and that there is a need to develop (and apply) AI guidelines. This was also apparent in the present study, and we would therefore reiterate this sentiment.

While the findings from the aforementioned studies are valuable, there is at present a need to expand the emerging knowledge base for metastatic tumor diagnosis. The only other review in this field was conducted by Zheng et al. 25, who found that DL algorithms are beneficial in radiological imaging, with equivalent or in some instances better performance than healthcare professionals. Again, however, there were methodological deficiencies which must be considered before we adopt these technologies into clinical practice. We must also strive to identify the best available DL algorithm and then develop it to enhance identification and reduce the number of false positives and false negatives beyond what is humanly possible. As such, we need to continue to use systematic reviews to identify gaps in research, and we should consider not only technology-specific reviews but also disease-specific systematic reviews. Of course, DL algorithms are in an almost constant state of development, but the purpose of this study was to critically appraise potential issues with study methods and reporting standards. By doing so, we hoped to make recommendations and to drive further research in this field so that the most effective technology is adopted into clinical practice sooner rather than later.

This systematic review with meta-analysis suggests that deep learning algorithms can be used for the detection of breast and cervical cancer using medical imaging. The evidence also suggests that, while deep learning algorithms are not yet superior to clinicians, neither are they inferior in terms of performance. Acceptable diagnostic performance with analogous deep learning algorithms was observed in both breast and cervical cancer, despite dissimilar workflows and different imaging modalities. This finding suggests that these algorithms could be deployed across both breast and cervical imaging, and potentially across all types of cancer in which imaging technologies are used to identify cases early. However, we must also critically consider some of the issues which emerged during our systematic analysis of this evidence base.

Overall, there were very few prospective studies and few clinical trials. In fact, most included studies were retrospective, which may reflect the relative newness of DL algorithms in medical imaging. Moreover, the data sources used were either pre-existing electronic medical records or online open-access databases, which were not explicitly intended for algorithmic analysis in real clinical settings. Of course, we must first test these technologies using retrospective datasets to see whether they are appropriate, with a view to modifying them and enhancing accuracy, perhaps for specific populations or specific types of cancer. We also encourage more prospective DL studies in the future. Where possible, more prospective studies should be used to investigate the potential rules underlying breast or cervical images, and to identify possible image-feature correlations and the diagnostic logic for risk prediction. Most studies constructed and trained algorithms using small sets of labeled breast or cervical images, with labels that were rarely quality-checked by a clinical specialist. This design fault is likely to have created ambiguous ground-truth inputs, which may have caused unintended adverse model effects. The knock-on effect is that diagnostic inaccuracies are likely to arise from unidentified biases. This is certainly an issue which should be considered when designing future deep learning-based studies.

It is important to note that no matter how well constructed an algorithm is, its diagnostic performance depends largely upon the volume and quality of the raw data26. Most studies included in this systematic review mentioned a data augmentation method that adopted some form of affine image transformation, e.g., translation, rotation, or flipping, to compensate for data deficiencies. This, one could argue, is symptomatic of the paucity of annotated datasets for model training and of prospective studies for model validation. Fortunately, there has been a substantial increase in the number of openly available datasets for cervical and breast cancer. However, given the necessity for this research, one would like to see institutions collaborating more frequently to establish cloud-sharing platforms, which would increase the availability (and breadth) of annotated datasets. Moreover, training DL algorithms requires reliable, high-quality image inputs, which may not be readily available, as pre-analytical factors such as incorrect specimen preparation and processing, unstandardized image digitization and acquisition, and improper device calibration and maintenance can lower image quality. Complete standardization of all procedures and reagents in clinical practice is required to optimally prepare pre-analytical image inputs and to develop more robust and accurate DL algorithms. Having these would drive developments in this field and would benefit clinical practice, perhaps serving as a cost-effective replacement diagnostic tool or an initial method of risk categorization, although this is beyond the scope of this study and would require further research.
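To make the augmentation strategy concrete, the following is a minimal sketch assuming PyTorch/torchvision; the specific transforms and parameter values are illustrative and are not drawn from any of the included studies.

```python
# Illustrative affine augmentation pipeline (assumes torchvision is installed);
# the degrees/translate values are arbitrary examples, not study settings.
import torchvision.transforms as T

augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),                      # flipping
    T.RandomRotation(degrees=15),                       # rotation
    T.RandomAffine(degrees=0, translate=(0.1, 0.1)),    # translation
    T.ToTensor(),
])

# During training, each PIL image would be passed through `augment`
# on the fly, e.g. augmented = augment(pil_image)
```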

Of the 35 included studies, only 11 performed external validation, meaning that DL model performance was assessed with either an out-of-sample dataset or an open-access dataset. Indeed, most of the studies included here split a single sample, randomly or non-randomly assigning individuals' data from one center to either a development dataset or an internal validation dataset. We found that performance in internally validated studies was higher than in externally validated studies for the early detection of cervical and breast cancer. However, this was to be expected, because an internal validation dataset is more likely to be homogeneous and may lead to overestimated diagnostic performance. This finding highlights the need for out-of-sample external validation of all predictive models. A possible method for improving external validation would be to establish an alliance of institutions wherein trained deep learning algorithms are shared and their performance tested externally. This might provide insight into subgroups and variations between ethnic groups, although we would also need to maintain patient anonymity and data security, as several researchers have previously noted27,28.
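For illustration, a minimal sketch of this splitting scheme is given below, assuming scikit-learn; the arrays stand in for images and labels from two hypothetical centres, and the split proportion is arbitrary.

```python
# Hypothetical two-centre example: centre A supplies the development data
# (with an in-sample, held-back split for *internal* validation), while an
# independent centre B supplies the out-of-sample *external* validation set.
import numpy as np
from sklearn.model_selection import train_test_split

centre_a_x, centre_a_y = np.random.rand(100, 64, 64), np.random.randint(0, 2, 100)
centre_b_x, centre_b_y = np.random.rand(40, 64, 64), np.random.randint(0, 2, 40)

train_x, internal_x, train_y, internal_y = train_test_split(
    centre_a_x, centre_a_y, test_size=0.2, random_state=0)

external_x, external_y = centre_b_x, centre_b_y  # never seen during development
```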

Most of the retrospective studies used narrowly defined binary or multi-class tests, focusing on the diagnostic performance of DL algorithms rather than on clinical practice. This is a direct consequence of poor reporting and the lack of real-world prospective clinical studies, which has resulted in inadequate data availability and may therefore limit our ability to gauge the applicability of these DL algorithms to clinical settings. Accordingly, there is uncertainty around the estimates of diagnostic performance provided in our meta-analysis, and they should be interpreted with caution.

Recently, several AI-related methodological guides have been published, with many more still under development29,30. Most of the included studies were probably conceived or performed before these guidelines became available. It is therefore reasonable to assume that the design features, reporting adequacy, and transparency of studies evaluating the diagnostic performance of DL algorithms will improve in the future. Even though our findings suggest that DL is not inferior to clinicians in terms of performance for the early detection of breast or cervical cancer, this is based on relatively few studies. The uncertainty which exists is therefore, at least in part, due to the in silico context in which clinicians are being evaluated.

We should also acknowledge that most current DL studies report positive results. We must be aware that this may be a form of researcher-based reporting bias (rather than publication-based bias), which is likely to skew the dataset and adds complexity to comparisons between DL algorithms and clinicians31,32. Differences in reference standard definitions, grader capabilities (i.e., degrees of expertise), imaging modalities, and detection thresholds for the classification of early breast or cervical cancer also make direct comparisons between studies and algorithms very difficult. Furthermore, non-trivial applications of DL models in healthcare settings will require clinicians to optimize clinical workflow integration. However, we found only two studies that compared DL versus clinicians versus DL combined with clinicians. This hindered our meta-analysis of DL algorithms but highlighted the need for strict and reliable assessment of DL performance in real clinical settings. Indeed, the scientific discourse should shift from a DL-versus-clinician dichotomy to a more realistic DL-clinician combination, which would improve workflows.

Thirty-five studies met the eligibility criteria for the systematic review, yet only 20 could be used to develop contingency tables. Some DL algorithm studies from computer science journals reported only precision, dice coefficient, F1 score, recall, and competition performance metrics, whereas indicators such as AUC, accuracy, sensitivity, and specificity are more familiar to healthcare professionals25. Bridging the gap between computer science and clinical research would seem prudent if we are to manage interdisciplinary research and the transition to a more digitized healthcare system. Moreover, we found that the term "validation" is used loosely in DL model studies: some authors used it for assessing the diagnostic performance of the final algorithm, while others defined it as a dataset for model tuning during the development process. This confuses readers and makes it difficult to judge the function of datasets. Combining experts' opinions33, we propose distinguishing the datasets used in the development and validation of DL algorithms. In keeping with the language used for nomogram development, the dataset used for training the model should be named the 'training set', while datasets used for tuning should be referred to as the 'tuning set'. Likewise, during the validation phase, the held-back subset split from the entire dataset, drawn from the same conditions and image types as the training set, should be referred to as 'internal' validation, whereas a completely independent dataset used for out-of-sample validation should be referred to as 'external' validation34.
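To make the relationship between the two metric families concrete, the following sketch, using hypothetical counts, shows how the computer-science metrics and the clinically familiar indicators derive from the same contingency table; note that recall is simply another name for sensitivity.

```python
# Hypothetical contingency-table counts (not taken from any included study).
tp, fp, fn, tn = 85, 12, 10, 93

# Indicators familiar to clinicians
sensitivity = tp / (tp + fn)                 # identical to "recall"
specificity = tn / (tn + fp)
accuracy = (tp + tn) / (tp + fp + fn + tn)

# Indicators common in computer-science reports
precision = tp / (tp + fp)
recall = sensitivity
f1 = 2 * precision * recall / (precision + recall)

print(f"sensitivity/recall = {sensitivity:.2f}, specificity = {specificity:.2f}, "
      f"accuracy = {accuracy:.2f}, precision = {precision:.2f}, F1 = {f1:.2f}")
```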

Most of the issues discussed here could be avoided through more robust designs and higher-quality reporting, although several hurdles must be overcome before DL algorithms are used in practice for breast and cervical cancer identification. The black-box nature of DL models, which lack a clear, interpretable basis for their outputs in clinical situations, is a well-recognized challenge. For example, a clinician considering whether breast nodules represent breast cancer on mammographic images applies a series of judgement criteria. A clinician developing a clear rationale for a proposed diagnosis may therefore be the desired state, whereas a DL model that merely states the diagnosis may be viewed with more skepticism. Scientists have actively investigated possible methods for inspecting and explaining algorithmic decisions. An important example is the use of salience or heat maps, which provide the location of salient lesion features within the image rather than defining the lesion characteristics themselves35,36. This raises questions around human-technology interaction, particularly around transparency and patient-practitioner communication, which ought to be studied in conjunction with DL modeling in medical imaging.
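As a simplified illustration only, the sketch below computes a basic input-gradient saliency map in PyTorch; the cited studies may rely on other attribution methods (e.g., class activation mapping), and the trained classifier `model` is assumed rather than taken from any included study.

```python
# Minimal gradient-based saliency sketch (assumes PyTorch and a trained
# image classifier `model` that maps a (1, C, H, W) tensor to class scores).
import torch

def saliency_map(model, image):
    """Return |d(top-class score)/d(pixel)| as an (H, W) map for one image."""
    model.eval()
    x = image.unsqueeze(0).requires_grad_(True)       # add batch dimension
    scores = model(x)
    scores[0, scores.argmax()].backward()             # gradient of the top class
    return x.grad.abs().squeeze(0).max(dim=0).values  # collapse channels
```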

Another common problem limiting DL algorithms is model generalizability. There may be factors in the training data that affect the performance of DL models under different data distributions28. For example, a model trained only in the US may not perform well in Asia, because a model trained using data from predominantly Caucasian patients may not perform well in other ethnic groups. One solution to improve generalizability and reduce bias is to conduct large, multicenter studies which enable analysis across nationalities, ethnicities, hospital characteristics, and population distributions37. Societal biases can also affect the performance of DL models, and bias exists in DL algorithms when a training dataset does not include appropriate proportions of minority groups. For example, a DL algorithm for melanoma diagnosis in a dermatological study may lack diversity in terms of skin color and genomic data, which may cause under-representation of minority groups38. To eliminate embedded prejudice, efforts should be made to carry out DL algorithm research which provides a more realistic representation of global populations.

As we have seen, the included studies were mostly retrospective, with extensive variation in methods and reporting. More high-quality studies, such as prospective studies and clinical trials, are needed to strengthen the current evidence base. We also focused on DL algorithms for breast and cervical cancer detection using medical imaging; therefore, we made no attempt to generalize our findings to other types of AI, such as conventional machine learning models. While there was a reasonable number of studies for this meta-analysis, the number of studies for some imaging modalities, such as cytology or colposcopy, was limited. Therefore, the results of the subgroup analyses by imaging modality need to be interpreted with caution. We also selected only studies in which histopathology was used as the reference standard; consequently, some DL studies that may have shown promise but did not have confirmatory histopathologic results were excluded. Even though publication bias was not identified through funnel plot analysis (Supplementary Fig. 3) based on data extracted from 20 studies, the lack of prospective studies and the potential absence of studies with negative results can cause bias. As such, we would encourage deep learning researchers in medical imaging to report studies which do not reject the null hypothesis, because this will help ensure the evidence clusters around true effect estimates.

It remains necessary to promote deep learning in medical imaging studies for breast and cervical cancer detection. However, we suggest improving breast and cervical data quality and establishing unified standards. DL algorithm development needs to be fed with reliable, high-quality images tagged with appropriate histopathological labels. Likewise, it is important to establish unified standards to improve the quality of digital image production, the collection process, imaging reports, and the final histopathological diagnosis. Combining DL algorithm results with other biomarkers may prove useful for improving risk discrimination in breast or cervical cancer detection. An example would be a DL model for cervical imaging that incorporates additional clinical information, i.e., cytology and HPV typing, which could improve overall diagnostic performance39,40. Secondly, we need to improve error-correction ability and DL algorithm compatibility. Developing DL algorithms that are more generalizable and less susceptible to bias from the earliest phase may require larger, multicenter datasets which incorporate diverse nationalities and ethnicities, as well as different socioeconomic groups, if we are to implement algorithms in real-world settings.

This also highlights the need for international reporting guidelines for DL algorithms in medical imaging. Existing reporting guidelines, such as STARD41 for diagnostic accuracy studies and TRIPOD42 for conventional prediction models, are not directly applicable to DL model studies. The recent publication of the CONSORT-AI43 and SPIRIT-AI44 guidelines is welcome, but we await disease-specific DL guidelines. Furthermore, we would encourage organizations to develop diverse teams combining computer scientists and clinicians to solve clinical problems using DL algorithms. Even though DL algorithms can appear to be black boxes with unexplainable decision-making outputs, their development needs to be discussed openly and informed by additional clinical information45,46. Finally, medical computer vision algorithms do not exist in a vacuum; we must integrate DL algorithms into routine clinical workflows and across entire healthcare systems to assist doctors and augment decision-making. It is therefore crucial that clinicians understand the information each algorithm provides and how this can be integrated into clinical decisions in a way that enhances efficiency without absorbing resources. For any algorithm to be incorporated into existing workflows, it has to be robust and scientifically validated for clinical and personal utility.

We tentatively suggest that DL algorithms could be useful for detecting breast and cervical cancer using medical imaging, with performance equivalent to that of human clinicians in terms of sensitivity and specificity. However, this finding is based on studies with weaknesses in design and reporting, which could lead to bias and an overestimation of algorithmic performance. Standardized guidelines around study methods and reporting are needed to improve the quality of DL model research. This may help to facilitate the transition into clinical practice, although further research is required.

Methods

Protocol registration and study design

The study protocol was registered with the PROSPERO International register of systematic reviews, number CRD42021252379. The study was conducted according to the preferred reporting items for systematic reviews and meta-analyses (PRISMA) guidelines47. No ethical approval or informed consent was required for the current systematic review and meta-analysis.

Search strategy and eligibility criteria

In this study, we searched Medline, Embase, IEEE and the Cochrane library until April 2021. No restrictions were applied around regions, languages, or publication types; however, letters, scientific reports, conference abstracts, and narrative reviews were excluded. The full search strategy for each database was developed in collaboration with a group of experienced clinicians and medical researchers. Please see Supplementary Note 1 for further details.

Eligibility assessment was conducted by two independent investigators, who screened titles and abstracts and selected all relevant citations for full-text review. Disagreements were resolved through discussion with another collaborator. We included studies that reported the diagnostic performance of one or more DL models for the early detection of breast or cervical cancer using medical imaging. Studies reporting any diagnostic outcome, such as accuracy, sensitivity, or specificity, could be included. There was no restriction on participant characteristics, type of imaging modality, or the intended context for using DL models.

Only histopathology was accepted as the study reference standard. As such, imperfect ground truths, such as expert opinion or consensus, and other clinical tests were rejected. Likewise, medical waveform data and investigations of image segmentation performance were excluded because these could not be synthesized with histopathological data. Animal studies and non-human samples were also excluded, and duplicates were removed. The primary outcomes were various diagnostic performance metrics. Secondary analysis included an assessment of study methodologies and reporting standards.

Data extraction

Two investigators independently extracted study characteristics and diagnostic performance data using a predetermined data extraction sheet. Again, uncertainties were resolved by a third investigator. Binary diagnostic accuracy data were extracted directly into contingency tables comprising true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). These were then used to calculate pooled sensitivity, pooled specificity, and other metrics. If a study provided multiple contingency tables for the same or for different DL algorithms, we assumed that they were independent of each other.
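For clarity, each contingency table yields per-study estimates in the standard way:

$$\text{sensitivity} = \frac{TP}{TP + FN}, \qquad \text{specificity} = \frac{TN}{TN + FP}$$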

Quality assessment

The risk of bias and applicability concerns of the included studies were assessed by the three investigators using the quality assessment of diagnostic accuracy studies 2 (QUADAS-2) tool48.

Statistical analysis

Hierarchical summary receiver operating characteristic (SROC) curves were used to assess the diagnostic performance of DL algorithms. 95% confidence intervals (CIs) and prediction regions were generated around the averaged sensitivity, specificity, and AUC estimates in the SROC figures. A further meta-analysis was performed using the contingency tables with the best accuracy in studies reporting multiple DL algorithms. Heterogeneity was assessed using the I² statistic. We also conducted subgroup meta-analyses and meta-regression analyses to explore potential sources of heterogeneity. A random-effects model was implemented because of the assumed differences between studies. Publication bias was assessed visually using funnel plots.
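For intuition only, the sketch below illustrates random-effects pooling and the I² statistic on logit-transformed sensitivities using the DerSimonian-Laird approach with hypothetical counts; the analysis reported here actually fitted hierarchical bivariate SROC models in STATA/SAS, which this simplified univariate example does not reproduce.

```python
# Simplified univariate illustration of random-effects pooling and I^2
# (DerSimonian-Laird on logit sensitivities). Counts are hypothetical.
import numpy as np

tp = np.array([85.0, 40.0, 120.0])   # true positives per study
fn = np.array([10.0, 8.0, 15.0])     # false negatives per study

p = (tp + 0.5) / (tp + fn + 1.0)               # continuity-corrected sensitivity
y = np.log(p / (1 - p))                        # logit transform
v = 1.0 / (tp + 0.5) + 1.0 / (fn + 0.5)        # approximate within-study variance

w = 1.0 / v
q = np.sum(w * (y - np.sum(w * y) / np.sum(w)) ** 2)   # Cochran's Q
df = len(y) - 1
i2 = max(0.0, (q - df) / q) * 100                      # I^2 (%)

tau2 = max(0.0, (q - df) / (np.sum(w) - np.sum(w ** 2) / np.sum(w)))
w_star = 1.0 / (v + tau2)                              # random-effects weights
pooled = 1.0 / (1.0 + np.exp(-np.sum(w_star * y) / np.sum(w_star)))

print(f"I2 = {i2:.1f}%, pooled sensitivity = {pooled:.2f}")
```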

Four separate meta-analyses were conducted: (1) according to validation type, DL algorithms were categorized as either internally or externally validated, where internal validation means the study was validated using an in-sample dataset and external validation means it was validated using an out-of-sample dataset; (2) according to cancer type, i.e., breast or cervical cancer; (3) according to imaging modality, such as mammography, ultrasound, cytology, and colposcopy; and (4) according to the pooled performance of DL algorithms versus human clinicians using the same dataset.

Meta-analysis was only performed where there were three or more original studies. STATA (version 15.1) and SAS (version 9.4) were used for the data analyses. The threshold for statistical significance was set at p < 0.05, and all tests were two-sided.

Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.