Introduction

The survival rates for childhood cancers have significantly improved over the past decades.1 However, oncological treatments – especially anthracycline-based chemotherapy – can have serious cardiotoxic side effects that may severely impair the long-term quality of life of survivors.1 Anthracycline-induced cardiotoxicity is dose-dependent and can result in progressive, often irreversible myocardial damage that eventually leads to dilated cardiomyopathy and clinically manifest heart failure.2 Therefore, survivors of childhood cancer are at a significantly higher risk of developing long-term cardiovascular complications compared to the healthy population.3

In this context, cardiotoxicity is defined broadly as any chemotherapy- or radiotherapy-related cardiac alteration that, if left unaddressed, may progress to permanent structural and functional damage. Overt heart failure and severe cardiomyopathy are late-stage manifestations; the aim of early detection is to recognize potentially reversible changes before that stage is reached.

Current clinical practice for monitoring cardiac damage in cancer patients primarily relies on echocardiographic assessments and serum biomarkers.4 While these methods are helpful, each on its own may not be sufficiently sensitive to detect early, mild, and still reversible injury.5 For example, left ventricular ejection fraction (LVEF) often declines only after significant myocardial damage has already occurred, by which point the injury may be irreversible.5 It is, therefore, crucial to employ diagnostic tools that are both sensitive and specific, capable of identifying the early signs of cardiotoxicity at a stage when preventive or interventional measures can still be effective.

We performed a systematic review and meta-analysis to evaluate the diagnostic accuracy of various modalities for early detection of chemotherapy-related cardiotoxicity in pediatric oncology patients. We aimed to determine whether any available tool can reliably identify subclinical cardiac injury before the development of overt cardiomyopathy and to compare their relative performance. By synthesizing the evidence, we seek to highlight the most promising strategies and the gaps that need to be addressed to better protect childhood cancer survivors from late cardiac effects.

Methods

Search strategy and selection criteria

We adhered to the Cochrane Handbook recommendations for study methodology, and this meta-analysis was conducted according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement.6,7 The protocol is registered in the International Prospective Register of Systematic Reviews (CRD42023485629; details are provided in Supplementary Figs. S1 and S2). A systematic search was performed on November 22, 2023, using three databases: MEDLINE (via PubMed), Embase, and the Cochrane Central Register of Controlled Trials (CENTRAL). The search key is described in detail in Supplementary Appendix 1. No restrictions were applied regarding publication date, language, or other criteria. The reference lists of the eligible studies were searched using an online tool (citationchaser).8

Studies were included if they met the following eligibility criteria. The study population comprised children with cancer who were receiving or had previously received cardiotoxic chemotherapy, or childhood cancer survivors who had cancer before the age of 21 and had received cardiotoxic chemotherapy. The prognostic factors considered in these studies included echocardiographic parameters, serum biomarkers, miRNAs, genetic variants, and AI models.

The assessed outcomes were prognostic and diagnostic accuracy measured with AUC, sensitivity, specificity, and correlation. Cohort, case-control, pilot, cross-sectional, and case series studies were included.

Study selection and assessment of risk for bias

The selection process was performed using the reference management software EndNote 20 (Clarivate Analytics, Philadelphia, PA) and Rayyan.9,10 Following the automatic and manual removal of duplicates, two independent investigators selected articles systematically based on the title, abstract, and full-text content, strictly adhering to predefined eligibility criteria. Cohen’s kappa coefficient was computed at each selection stage to measure agreement between assessors. Disagreements were resolved through discussion with a third participant.
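The agreement statistic mentioned above can be illustrated with a minimal Python sketch. It is not the authors' procedure (the source does not state which software computed kappa); the function name and the screening decisions below are hypothetical and serve only to show how observed agreement is corrected for chance agreement.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labelled the same records."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: share of records with identical decisions.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / n**2
    if p_expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical title/abstract screening decisions, not data from this review.
reviewer_1 = ["include", "exclude", "exclude", "include", "exclude", "exclude"]
reviewer_2 = ["include", "exclude", "include", "include", "exclude", "exclude"]
kappa = cohens_kappa(reviewer_1, reviewer_2)
```

A value of 1 indicates perfect agreement, and a value near 0 indicates agreement no better than chance.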

Data extraction from each article was conducted manually by two independent researchers using an Excel (Office 365, Microsoft, Redmond, WA) data collection form. Both investigators cross-checked the extracted datasets to ensure quality standards. Two independent investigators applied the Quality Assessment of Prognostic Accuracy Studies (QUAPAS) tool to assess the methodological quality of each study, as outlined in Supplementary Figs. S3 and S4. Any disagreements were resolved by consensus with the involvement of a third independent participant.

Data synthesis and statistical analysis

The following data were extracted from each eligible article: first author, year of publication, digital object identifier (DOI), country of origin, study design, population characteristics (age, age at diagnosis, disease status, chemotherapy type, and cancer type), number of patients, cardiotoxicity definition, prognostic factors, number of patients with cardiotoxicity, number of patients without cardiotoxicity, cut-off values of the variables, sensitivity, specificity, AUC, correlation, odds ratio (OR), and confidence intervals (CI).

All the random-effect meta-analyses were carried out using the R statistical software (version 4.1.2) and the R script of the online tool described in Freeman et al.11

To pool the AUC values for GLS reduction, we applied classical inverse variance meta-analysis with the restricted maximum likelihood (REML) tau estimator and the Hartung-Knapp adjustment. In the case of AI AUC values, studies reported more than one AUC value calculated on the same patient population. To account for these correlations, we fitted a multivariate model using the rma.mv() function of the metafor R package. To circumvent the problem caused by the unknown correlations, we supplemented the method with the robust approach implemented in the coef_test() function of the clubSandwich R package.12 Heterogeneity was assessed by calculating the univariate I² measure and its confidence interval and by performing Cochran’s Q test. Even when the pooled estimate was created using the multivariate approach, we calculated the I² values provided by the univariate method. I² values of 25%, 50%, and 75% were considered low, moderate, and high heterogeneity, respectively. The following categories were used to interpret the discriminatory performance of AUC: ≥0.9 = excellent; 0.8–0.9 = good; 0.7–0.8 = fair; 0.6–0.7 = poor; and 0.5–0.6 = fail.13 We visualized the results on forest plots. We calculated pooled results if the outcome was present in at least three studies.
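The pooling logic described above can be sketched in Python under stated assumptions. This is not the authors' R code. For brevity it swaps the REML tau estimator for the closed-form DerSimonian-Laird estimator, so numerical results would differ somewhat from a REML fit, and it does not cover the multivariate rma.mv() model. All function names and input values are hypothetical. The boundary handling in `interpret_auc` (lower bounds inclusive, values below 0.5 labelled "fail") is also an assumption, since the source lists only the ranges.

```python
import math

def se_from_ci(lower, upper, z=1.96):
    """Approximate a standard error from a symmetric 95% confidence interval."""
    return (upper - lower) / (2 * z)

def pool_random_effects(estimates, std_errors, t_crit=None):
    """Inverse-variance random-effects pooling with a Hartung-Knapp variance.

    tau^2 uses the DerSimonian-Laird estimator (a stand-in for REML).
    """
    k = len(estimates)
    if k < 2 or k != len(std_errors):
        raise ValueError("need at least two estimates with matching standard errors")
    w_fixed = [1 / se**2 for se in std_errors]
    sum_w = sum(w_fixed)
    fixed = sum(w * y for w, y in zip(w_fixed, estimates)) / sum_w
    # Cochran's Q: weighted squared deviations from the fixed-effect mean.
    q = sum(w * (y - fixed) ** 2 for w, y in zip(w_fixed, estimates))
    c = sum_w - sum(w**2 for w in w_fixed) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / c)
    # I^2: share of total variability attributed to between-study heterogeneity.
    i2 = max(0.0, (q - (k - 1)) / q) * 100 if q > 0 else 0.0
    w_random = [1 / (se**2 + tau2) for se in std_errors]
    sum_wr = sum(w_random)
    pooled = sum(w * y for w, y in zip(w_random, estimates)) / sum_wr
    # Hartung-Knapp: variance rescaled by the weighted residual spread.
    var_hk = sum(w * (y - pooled) ** 2 for w, y in zip(w_random, estimates)) / ((k - 1) * sum_wr)
    result = {"pooled": pooled, "tau2": tau2, "i2": i2, "q": q,
              "se_hk": math.sqrt(var_hk), "df": k - 1}
    if t_crit is not None:
        # t_crit is the 97.5% quantile of a t distribution with k - 1 df,
        # supplied by the caller (e.g. about 4.303 for df = 2).
        half_width = t_crit * result["se_hk"]
        result["ci"] = (pooled - half_width, pooled + half_width)
    return result

def interpret_auc(auc):
    """Map an AUC to the categories listed above (lower bounds inclusive)."""
    if auc >= 0.9:
        return "excellent"
    if auc >= 0.8:
        return "good"
    if auc >= 0.7:
        return "fair"
    if auc >= 0.6:
        return "poor"
    return "fail"

# Hypothetical AUCs and 95% CIs from three studies, not data from this review.
aucs = [0.70, 0.80, 0.90]
cis = [(0.60, 0.80), (0.70, 0.90), (0.80, 1.00)]
summary = pool_random_effects(aucs, [se_from_ci(lo, hi) for lo, hi in cis], t_crit=4.303)
```

With only a few studies, the t-based Hartung-Knapp interval can become very wide and may extend beyond the admissible 0 to 1 range for an AUC, in which case it would need to be truncated for reporting.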

In the case of NT-proBNP, since the thresholds differed across studies, we fitted the summary receiver operating characteristic (SROC) curve using the non-Bayesian version of the approach.14 For clarity, we note that Harbord et al. show that the employed method is mathematically equivalent to the bivariate model of Reitsma et al. (2005) and Chu and Cole.15,16,17 In the case of troponin, all the studies used the same threshold. We therefore calculated pooled sensitivity and specificity separately using the generalized mixed-effect univariate approach.18 Note that using the bivariate approach was not possible due to the small number of available studies. On a common ROC plot, we plotted the resulting NT-proBNP SROC curve, the troponin summary point, the study-level estimates, and their confidence intervals. The included studies also reported several correlation values between different pairs of variables. Since calculating a pooled result was not possible, we only visualized these results on a forest plot.
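The study-level estimates and confidence intervals mentioned above start from each study's 2×2 table. The following Python sketch shows one common way to obtain them, using a Wald interval on the logit scale. It is an illustration only: it is not the generalized mixed-effect model used for pooling, the function names are hypothetical, the counts are invented, and the 0.5 continuity correction for empty cells is an assumption.

```python
import math

def proportion_with_logit_ci(events, total, z=1.96):
    """Proportion with a Wald confidence interval computed on the logit scale."""
    if total <= 0 or not 0 <= events <= total:
        raise ValueError("events must lie between 0 and a positive total")
    non_events = total - events
    if events == 0 or non_events == 0:
        # Assumed continuity correction so the logit is defined.
        events, non_events = events + 0.5, non_events + 0.5
    estimate = events / (events + non_events)
    logit = math.log(events / non_events)
    se = math.sqrt(1 / events + 1 / non_events)
    lower = 1 / (1 + math.exp(-(logit - z * se)))
    upper = 1 / (1 + math.exp(-(logit + z * se)))
    return estimate, lower, upper

def diagnostic_accuracy(tp, fp, fn, tn):
    """Sensitivity and specificity, each as (estimate, lower, upper)."""
    return {
        "sensitivity": proportion_with_logit_ci(tp, tp + fn),
        "specificity": proportion_with_logit_ci(tn, tn + fp),
    }

# Hypothetical counts for one study at a single biomarker threshold.
accuracy = diagnostic_accuracy(tp=8, fp=5, fn=2, tn=35)
```

Working on the logit scale keeps the interval inside the 0 to 1 range, which is why mixed-effect models for sensitivity and specificity are usually fitted on that scale as well.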

Publication bias analysis was not feasible due to the small number of involved studies.

Results

The results of our search and selection are detailed in the PRISMA flowchart shown in Supplementary Fig. S1. Through November 2023, the electronic literature search identified 4994 studies from the PubMed, EMBASE, and CENTRAL databases. Following an automatic and manual removal of duplicates, 178 articles were screened based on the title and abstract. After full-text selection, 73 articles were eligible for qualitative synthesis, and 12 were included in the quantitative synthesis. No additional articles were found after screening the reference lists of the included papers. All included articles were available in full text and were published in peer-reviewed journals. The study characteristics are presented in Supplementary Table S1.

First, we evaluated the predictive performance of complex prediction models created with AI in three studies,19,20,21 using the area under the ROC curve (AUC) to assess their ability to distinguish between cases and controls, as shown in Fig. 1.

Fig. 1: Forest plot of prediction performance of AI models (refs. 19,20,21).

AUC area under the curve, CI confidence interval, LVEF left ventricular ejection fraction. This figure summarizes the pooled predictive performance of artificial intelligence models across three studies19,20,21 for identifying chemotherapy-induced cardiotoxicity in pediatric patients, showing an overall AUC of 0.80 (95% CI: 0.70–0.90).

These models included clinical data alone, clinical data combined with genetic markers, NT-proBNP levels, different serum protein levels, and electrocardiographic (ECG) parameters. We identified three eligible studies: cases included cardiotoxic patients defined by LVEF values, while controls comprised treated patients without cardiotoxicity.

The high I² value (93%) suggests substantial heterogeneity between studies, indicating that the variability in AUCs is due to differences in the prediction models, populations, or other factors rather than random chance.

The published AUC results suggest that clinical models show moderate predictive performance, 0.59 (95% CI: 0.51–0.67) and 0.69 (95% CI: 0.64–0.74). However, adding extra variables to the clinical model significantly improved the performance, with AUCs reaching up to 0.89 (95% CI: 0.87–0.91).

An overall AUC of 0.80 (95% CI: 0.70–0.90) indicates good discriminatory performance of the models, but this summary was derived from only three studies and showed significant variability.

All three included AI-based studies implemented formal model-fitting and overfitting control procedures. Güntürkün et al.19 used an XGBoost algorithm with genetic algorithm–based hyperparameter optimization and validated their model through five-fold cross-validation. Leerink et al.20 applied an elastic-net regression model with nested cross-validation for hyperparameter tuning and performance assessment on unseen data. Chaix et al.21 employed a random forest classifier with bootstrap-based validation to ensure robustness. Collectively, these approaches demonstrate rigorous model fitting and generalizability testing, supporting the reliability of the pooled AUC estimates.

The results should be interpreted with caution, acknowledging that no definitive reliability can be claimed yet for prediction model tools in isolation, given the limited data and between-study differences.

After analyzing the performance of AI models, we evaluated the predictive capability of the echocardiographic parameter GLS in four studies,22,23,24,25 as illustrated in Fig. 2.

Fig. 2: Forest plot of the prediction performance of GLS.

AUC area under the curve, CI confidence interval, FS fractional shortening, LVEF left ventricular ejection fraction. This figure illustrates the prognostic accuracy of GLS in four studies22,23,24,25 for detecting early myocardial dysfunction, indicating moderate overall performance (AUC≈0.72) and high between-study heterogeneity (I² = 96%).

Studies were divided into subgroups based on criteria defining abnormal GLS values. In the follow-up subgroup, GLS measurements were taken over time to monitor cardiotoxicity, while in the reduction subgroup, changes in GLS between two time points were assessed.

The heterogeneity (I²) was substantial, at 96% (95% CI: 91%, 98%), indicating significant variability across the studies. This variability in the definition of cardiotoxicity and of abnormal GLS values is a crucial factor in interpreting and comparing these results. The follow-up subgroup demonstrated relatively higher and more consistent AUC values, with 0.85 (95% CI: 0.77–0.94) and 0.91 (95% CI: 0.85–0.96), indicating good predictive ability; the reduction subgroup exhibited greater variability. The overall AUC was 0.72 (95% CI: 0.06, 1.00); it is essential to emphasize that the extremely wide confidence interval is partly due to the markedly low AUC value (0.42; 95% CI: 0.30–0.54) reported in the study by Ardelean et al.24

In summary, while GLS has shown reasonable predictive potential in some settings, its results have not been consistently robust. The lack of consistency means GLS alone cannot serve as a reliable stand-alone diagnostic criterion for early cardiotoxicity.

Furthermore, we evaluated the predictive performance of cardiac biomarkers in five studies,26,27,28,29,30 including cardiac troponin T (cTnT) and NT-proBNP, using a hierarchical summary ROC (HSROC) curve, as presented in Fig. 3.

Fig. 3: ROC plot visualizing the diagnostic performance of cTnT and NT-proBNP.

cTnT cardiac troponin T, HSROC hierarchical summary receiver operating characteristic curve, NT-proBNP N-terminal pro-B-type natriuretic peptide. This figure displays the summary ROC curve and sensitivity/specificity estimates from five studies26,27,28,29,30 for cardiac biomarkers used to detect cardiotoxicity. The pooled analysis indicates that NT-proBNP has moderate accuracy, while troponin T is more consistent but less sensitive, with performance strongly influenced by measurement timing and thresholds.

The figure shows the results of the diagnostic meta-analysis assessing the performance of NT-proBNP and cTnT in detecting cardiotoxicity. In the case of NT-proBNP, since the thresholds differed across studies, we fitted an SROC curve showing the trade-off between sensitivity and specificity as the threshold varies. Conversely, for troponin, all studies applied a standardized 0.001 threshold; hence, pooled sensitivity and specificity were calculated corresponding to this specific threshold.

The results highlight that the obtained sensitivity and specificity values are threshold-dependent and, therefore, should be interpreted accordingly. These biomarkers may benefit from being integrated into a broader diagnostic strategy that includes additional biomarkers or imaging modalities to improve diagnostic accuracy.

Additionally, we analyzed the performance of some miRNAs and high-sensitivity cardiac troponin T (hs-cTnT), as detailed in Fig. 4. Cheung et al.31 focused on 39 patients, classifying 22 with cardiotoxicity, defined based on GLS. They examined hs-cTnT and miRNA-1, with both biomarkers showing moderate predictive ability (AUC = 0.62 for both; 95% CI: [0.38, 0.86] and [0.40, 0.84], respectively). In contrast, Leger et al.32 studied 33 patients, using troponin levels to identify 7 with cardiotoxicity. They evaluated 12h-miRNA-499, 6h-miRNA-499, and a combination of 6h-miRNA-29b with miRNA-499, finding better predictive performance (AUC = 0.80, 0.82, and 0.90; 95% CI: [0.59, 1.00], [0.62, 1.00], and [0.74, 1.00], respectively). In the study by Leger et al., miRNA-499, particularly in combination with miRNA-29b, was more effective and consistent across different time points.

Fig. 4: Forest plot of prediction performance of miRNA (refs. 31,32).

AUC area under the curve, CI confidence interval, GLS global longitudinal strain, hs-cTnT high-sensitivity cardiac troponin T, MD mean difference, N number. This figure compares the predictive ability of selected miRNAs (e.g., miRNA-1, miRNA-29b, miRNA-499) and hs-cTnT for cardiotoxicity prediction across two studies,31,32 illustrating study-specific AUCs ranging from 0.38 to 0.90.

This stark difference between studies indicates that miRNA performance can vary greatly depending on which miRNA is measured, the timing of sampling, and the patient population. The small sample sizes and divergent methodologies made it impossible to pool results into a meta-analysis. Consequently, any conclusions about miRNAs must be tentative.

Data extracted from 61 studies (refs. 31–91) that were ineligible for quantitative synthesis are summarized in Supplementary Table S2. Among imaging-based methods, echocardiographic techniques, particularly advanced modalities such as GLS, tissue Doppler indices (Tei index, E/e’ ratio, tricuspid annular plane systolic excursion), and 2D/3D speckle tracking, often revealed subtle systolic or diastolic dysfunction even when LVEF remained within normal limits.

In some cases, cMRI detected myocardial changes or reduced cardiac volumes not captured by echocardiography, suggesting higher sensitivity in certain contexts. Similarly, radionuclide ventriculography (RVG) was reported to identify early functional impairment at lower cumulative anthracycline doses than conventional echocardiography.

Stress-based assessments, such as dobutamine stress echocardiography, yielded mixed findings: some studies reported abnormal systolic and diastolic responses to stress in the majority of patients, while others found no difference from healthy controls.

CPET often revealed impaired exercise capacity despite preserved resting systolic function. CPET-derived measures such as peak VO₂ correlated with reduced left ventricular mass and volume, highlighting its potential to detect latent cardiac dysfunction.

Regarding biochemical monitoring, emerging candidates like growth/differentiation factor 15 (GDF-15) and anti-cardiac autoantibodies may offer additional diagnostic value.

Genetic studies identified variants (SLC28A3, UGT1A6, RARG, SLC22A17, SLC22A7) that significantly modulate anthracycline cardiotoxicity risk. Some authors proposed combined genetic-clinical risk models to identify high-risk individuals more precisely.

Regarding electrocardiographic approaches, traditional ECG generally had limited sensitivity, but specific techniques like signal-averaged ECG and QT dispersion during dobutamine stress demonstrated potential utility in detecting subclinical changes.

Other less common modalities included ¹¹¹In-antimyosin scintigraphy, which in some cases detected myocardial injury even in the absence of echocardiographic changes, and nomogram-based risk tools incorporating cumulative anthracycline dose, echocardiographic parameters, and biomarker levels, which showed promise in stratifying patients by risk. Some studies also emphasized the value of multimodal diagnostic approaches, combining imaging, biomarkers, and functional testing to improve sensitivity for early cardiotoxicity.

Although many of these additional methods showed promise for detecting subclinical cardiac injury, considerable methodological heterogeneity, small sample sizes, and variable study populations prevented their inclusion in our quantitative meta-analysis.

The results of the risk of bias assessment are detailed in Supplementary Figs. S3 and S4. The overall risk of bias was generally high, underscoring the need for cautious interpretation of the findings. Key contributors include variability in the study populations, differences in how index tests are conducted and interpreted across studies, and disparities in outcome definitions and measurements. Addressing these sources is crucial for improving the accuracy and applicability of the review findings.

Discussion

Translational science plays a crucial role in bridging the gap between clinical research and its practical implementation in everyday healthcare.92,93 This systematic review set out to determine which diagnostic tools can detect early, potentially reversible cardiotoxic changes in pediatric oncology patients before the onset of overt heart failure or cardiomyopathy.

Echocardiography remains the clinical cornerstone for cardiotoxicity monitoring due to its non-invasive nature and wide availability.94 However, conventional echocardiographic measures like ejection fraction or fractional shortening often detect myocardial dysfunction only after a significant injury has occurred, limiting their utility for truly early intervention.5

Even advanced echocardiographic techniques such as speckle-tracking GLS, which can reveal subtle systolic impairment earlier than LVEF, are not yet a standalone solution. We fully agree with Kouwenberg et al. that while GLS has shown reasonable predictive ability in several studies, the results have not been consistently robust enough for GLS to serve as a sole diagnostic criterion for early cardiotoxicity.95 This underscores that even the best current imaging parameters alone cannot guarantee sensitive early detection in all cases.

Likewise, circulating biomarkers of myocardial injury or stress have yielded mixed results as early detectors of chemotherapy-related cardiac damage. Cardiac troponins and B-type natriuretic peptides are frequently studied and are even integrated into some monitoring protocols. For instance, the 2022 European Society of Cardiology (ESC) cardio-oncology guidelines recommend serial echocardiography with troponin/B-type natriuretic peptide (BNP) measurements for high-risk patients receiving cardiotoxic therapy.4 In our review, these biomarkers on their own were not sufficiently accurate for early cardiotoxicity identification; using a high-sensitivity troponin T cutoff (~0.01 ng/mL) identified only a subset of children who later developed cardiotoxicity, often detecting injury at a more advanced stage. Similarly, natriuretic peptides sometimes correlate with dysfunction, but no reliable threshold exists to ensure both sensitivity and specificity for early detection. These inconsistencies echo the conclusions of other experts: biomarker performance varies widely, primarily due to inconsistent timing of measurements, variable assay cutoffs, and inter-assay differences.96 Without standardization of assay timing and thresholds, the clinical utility of cardiac biomarkers in predicting chemotherapy-induced cardiotoxicity remains limited.96

Looking toward emerging technologies, AI-based prediction models offer a promising new avenue but remain in preliminary stages. In our analysis, some AI-driven models that integrated a broad array of clinical data achieved notably high predictive accuracy. This suggests that by analyzing complex patterns across multiple variables, these models may identify patients at risk of developing cardiotoxicity.97 Zhou et al. demonstrated that no single clinical or echocardiographic parameter could efficiently predict cardiotoxicity in a large pediatric oncology cohort.98 However, a machine-learning model combining dozens of features improved early risk stratification.98 Importantly, we interpret these AI results with caution. Many of the AI models showing high performance were derived from relatively small or single-center datasets, and their apparent accuracy could diminish when tested in broader populations. Our opinion, supported by Nechita et al., is that further validation on large, prospective pediatric cohorts is essential before AI-driven tools can be adopted clinically.99 These AI algorithms are exploratory – they illustrate what might be possible if we fully leverage big data, but they are not ready to replace conventional monitoring.

Similarly, novel circulating biomarkers such as miRNAs have garnered interest as early indicators of cardiac injury. Early studies have identified specific miRNAs – for example, miR-29b and miR-499 – that rise during anthracycline treatment and correlate with later cardiac dysfunction.32 Some of these candidates have shown impressive diagnostic potential in initial trials. However, the evidence for miRNAs as cardiotoxicity biomarkers is currently too limited to draw firm conclusions. In addition, variability in miRNA assay techniques and normalization methods poses challenges for standardization across studies. Thus, like AI, miRNAs should be viewed as promising research avenues rather than established clinical tools at this time.

At this stage, we must clearly state that neither AI-based models nor miRNA assays have sufficient evidence to be considered reliable standalone solutions for early cardiotoxicity detection. They remain adjuncts under investigation, not replacements for vigilant clinical monitoring.

Our findings align with a consensus that a multimodal approach is likely required to catch cardiotoxicity at a reversible stage.4,96 Relying on any single test may miss early changes; by integrating multiple diagnostics, clinicians can potentially improve overall sensitivity and capture a more complete picture of cardiac health. This concept is supported by other research. Cronin et al., in a state-of-the-art review of cancer therapy-related cardiac dysfunction, likewise advocated for multimodal monitoring, combining various imaging modalities to improve early detection rates.100

A recurring theme in our analysis was the significant heterogeneity of evidence across studies. The need for standardization in this field cannot be overstated. Our review revealed that studies use widely differing definitions of cardiotoxicity – ranging from acute changes during chemotherapy to late-onset cardiomyopathy – which makes it difficult to compare results or establish universal screening guidelines. Harmonizing the definition of early cardiotoxicity would greatly aid both research and clinical practice. Establishing uniform definitions and protocols would enable more direct comparisons between modalities and help to identify which children truly benefit from early cardioprotective interventions.

In summary, our findings reinforce that a multimodal strategy is essential for detecting cardiotoxicity while still reversible. Combining multiple diagnostic tools can improve sensitivity and specificity. At the same time, the marked heterogeneity across studies – particularly in how cardiotoxicity is defined – underscores the urgent need for standardization. Harmonizing definitions and protocols would enhance both research quality and clinical decision-making, ultimately helping to identify which children genuinely benefit from early cardioprotective interventions.

Strengths and limitations

A strength of our study is that we evaluated a wide range of diagnostic modalities specifically in the context of pediatric oncology. By synthesizing data across these domains, we provide an up-to-date overview that can inform clinical surveillance programs and future research directions. We believe these insights contribute meaningfully to the ongoing discussion of how best to protect children’s hearts during cancer therapy, and they highlight critical areas where guidelines could be improved.

This review also has important limitations. Heterogeneity among the included studies was a significant challenge. Studies differed in how they defined cardiotoxicity, the timing of outcome assessment, and their patient populations. This methodological diversity limited our ability to make direct comparisons and to perform quantitative meta-analyses across all modalities. In particular, the lack of a uniform definition for early or subclinical cardiotoxicity made it challenging to determine which diagnostic tool was the most efficient, since each study’s endpoint was slightly different. Another limitation is the generally small sample size of many pediatric studies. Especially for novel approaches like AI and miRNA, most available studies had relatively few patients and cardiotoxic events, raising concerns that the reported accuracies may not generalize to larger populations. Moreover, most studies used a healthy control group rather than an at-risk cancer control group, which can lead to an overestimation of a tool’s diagnostic performance. These limitations suggest caution in interpreting any single finding from this review in isolation.

Implications and future directions

The absence of a clearly superior early diagnostic tool means that, for now, pediatric oncologists and cardiologists must continue using a combination of methods – along with careful clinical judgment – to monitor for cardiotoxicity. There is a pressing need to establish unified definitions and protocols, a consensus on what constitutes an early cardiotoxic event, and a way to consistently measure it. This would allow future studies and clinical trials to directly compare results and identify effective tools. Future research should focus on prospective validation of the most promising modalities identified: tools that have shown retrospective accuracy need to be tested in real-world prospective cohorts to see whether they can reliably flag high-risk patients in advance.

The future likely lies in integrated approaches – perhaps a risk algorithm that incorporates patient risk factors, serial imaging findings, and biomarker trends to generate a composite risk score for cardiotoxicity. Such an approach could be aided by AI algorithms but should be grounded in well-validated clinical evidence. The aim would be to tailor monitoring and intervention to each patient’s risk profile, catching signs of cardiac injury as early as possible without overwhelming patients with unnecessary tests.

Conclusion

In this systematic review, we found that no single method reliably detects early chemotherapy-induced cardiotoxicity in children. Echocardiography often identifies damage late, and biomarkers like troponin and BNP show inconsistent results. Emerging tools such as AI models and miRNAs are promising but lack validation and standardization. A multimodal strategy – combining imaging, biomarkers, and clinical risk factors – remains the most effective approach. Standardized definitions and protocols are urgently needed to improve comparability across studies and guide clinical practice.