Introduction

There is increasing interest in interventions targeting the aging process1,2. The “geroscience hypothesis” posits that a shared pathophysiology of aging shapes most chronic diseases and that interventions targeting aging will confer larger health benefits than those targeting any individual disease3,4. Research into such anti-aging interventions has grown substantially, including trials repurposing commonly used drugs such as metformin5.

Because of the large sample sizes and long trial durations required to demonstrate anti-aging effects, most evidence to date has come from preclinical experiments in non-human animals6. Aging is a universal pathological process in eukaryotes7,8, with conservation of aging pathways across organisms9,10; interventions targeting aging may therefore translate more successfully than interventions for specific diseases, which often rely on artificial disease models11,12.

Given the possible substantial health benefits of slowing aging, the quality of preclinical studies in this area may be especially important. However, alongside the challenges of translating results from one species to another, model organism studies have a long history of shortcomings and design flaws13,14,15.

Here, we systematically analyzed studies from DrugAge, a curated database of preclinical experiments investigating the effects of interventions on aging and lifespan in non-human animals16. We aimed to evaluate the quality of reporting and methodological rigor of this literature, assess the distribution of observed effect sizes, and probe for the presence of diverse biases. We also investigated how these features changed over time.

Results

Of 667 included studies, 617 included only experiments with a single species and a single start time; 29 included experiments with two species, and three included experiments with three species. Eighteen studies included experiments with two start times, yielding a total of 720 experiments. Of these, 364 experiments involved an organism that reproduces sexually; 130 used only males (35.7%), 47 used only females (12.9%), 172 used both sexes (47.3%), and 15 did not report the sex(es) used (4.1%). The median sample size across experiments was 200 animals (IQR: 105–341).

All studies were published in peer-reviewed journals (667, 100.0%) and most stated control of temperature (607, 91.0%). Randomization was mentioned in 133 studies (19.9%), blinding to intervention in 27 (4.0%), blinded assessment of outcomes in 20 (3.0%), and sample size calculations in 40 (6.0%). Compliance with animal welfare regulations was mentioned in 93 studies (13.9%), and conflict of interest statements were included in 347 (52.0%).

The median CAMARADES count across studies was 3 (IQR: 2–3) and varied significantly across species (p < 0.0001); 61 studies (9.1%) had CAMARADES counts > 4. Except for peer-reviewed publication, which was ubiquitous, and blinding, which was rare, all CAMARADES components varied significantly across species (p < 0.0001, Table 1). Caenorhabditis and Drosophila studies almost always stated control of temperature but rarely reported randomization or sample size calculations; studies of mice and rats stated control of temperature less commonly but did better on all other fronts.

Table 1 Mention of CAMARADES components across studies of different species in the database, along with median CAMARADES count (sum of the 8 included CAMARADES elements, minimum 0 and maximum 8)

Of the 667 studies, 153 reported whether the organisms included were from an inbred (genetically homogeneous) line or an outbred/hybrid line; of these, 73 used inbred lines. None of the CAMARADES components differed significantly between inbred and non-inbred studies.

The earliest included study was published in 1948 and the latest in 2024. Over time, there were significant increases in reporting of conflicts of interest, compliance with animal welfare regulations, control of temperature, and sample size calculations (linear regression p < 0.0001 for each). There was no significant increase over time in reporting of randomization (p = 0.60) or of blinding with regard to intervention (p = 0.07) or outcomes (p = 0.011) (Fig. 1). CAMARADES counts increased over time (p < 0.0001).

Fig. 1: Percentage of studies with specific CAMARADES features over time (scatterplot, with local polynomial regression fit curves).
figure 1

Reporting of potential conflicts of interest, animal welfare regulations, control of temperature and sample size calculations increased significantly over time (p < 0.0001 for each).

Across 720 experiments, the median start time of interventions was 5.7% of average lifespan (IQR: 5.3–13.8%) (Fig. 2), with significant variation across species (Table 2). Mammal experiments started relatively later in lifespan than non-mammal experiments (26.2% vs 5.6% of average lifespan, p < 0.0001). Most experiments started “early”, before 20% of average lifespan (n = 592, 82.2%); few started at 50% of average lifespan or later (58, 8.1%).

Fig. 2: Distribution of when an intervention was started in the lifespan of an organism, expressed as a percentage of the median lifespan for that species, across 720 experiments.
figure 2

Blue represents 153 mammal experiments (median = 26.2%), and pink represents 567 non-mammal experiments (median = 5.6%).

Table 2 Timepoint in each organism’s lifespan at which each experiment was started, expressed as a percentage of the median lifespan for that organism. Kruskal-Wallis test for comparison between species p < 0.0001

Of the 720 included experiments, most SMDs were positive (638, 88.6%), indicating a favorable effect of the intervention on lifespan. The median SMD was 0.43 (IQR: 0.24–0.70); the random-effects meta-analysis estimate was 0.57 (95% CI: 0.48–0.66, p < 0.0001), with significant heterogeneity (I2 = 95%, 95% CI: 94–96%, p < 0.0001). As a fraction of average species lifespan, the median percentage increase in lifespan was 13.3% (IQR: 6.8–23.1%); the meta-analysis estimate was 14.3% (95% CI: 12.9–15.7%, p < 0.0001). Table 3 summarizes these results.

Table 3 Median and IQR of SMD for lifespan increase, as well as the median percentage increase in lifespan, for each species in the database, across 720 experiments from 667 studies

Comparing experiments in studies with and without specific CAMARADES components, reporting of randomization was associated with a smaller SMD (0.38 in studies reporting randomization vs 0.45 in those not, Kruskal-Wallis p = 0.0074). Other CAMARADES components were not associated with significant differences: peer-reviewed publication (p = 1.0), control of temperature (p = 0.094), blinded intervention (p = 0.35), blinded assessment of outcome (p = 0.84), sample size calculations (p = 0.17), compliance with animal welfare regulations (p = 0.48), and conflict of interest statement (p = 0.041). However, in multivariable linear regression adjusting also for species, publication year, and sample size, there was no statistically significant association between any CAMARADES component and SMD. There was no significant difference between the 592 early start experiments and the 128 late start experiments (median SMD 0.43 vs 0.40, Kruskal-Wallis p = 0.60). Median SMD did not vary significantly with publication year (p = 0.11) (see Supplementary Fig. 1 for bubble plot). Studies with mammalian species had lower median SMDs than non-mammal studies (0.39 vs 0.44, p = 0.040).

Thirty-five compounds were tested in at least one mammal and at least one non-mammal experiment, allowing within-drug comparisons (see Supplementary Table 1). Of these, 21 showed a significant increase in lifespan (p < 0.005) in non-mammals, but only 7 of the 21 also showed a significant increase in mammal lifespan (curcumin, spermidine, epithalamin, D-glucosamine, estradiol, SKQ and taurine); in contrast, two showed a significant decrease in mammal lifespan (quercetin and butylated hydroxytoluene).

Ten compounds showed a significant increase in mammal lifespan (p < 0.005), and most of these also showed some increase in non-mammals. However, the amount of mammalian evidence for these compounds, other than rapamycin, was limited (total sample sizes: 2761 for rapamycin, 293 for curcumin, 360 for spermidine, 160 for melatonin, 171 for epithalamin, 44 for berberine, 146 for D-glucosamine, 370 for estradiol, 50 for SKQ, and 122 for taurine).

The median absolute percentage error between non-mammal and mammal effects was 83% (IQR: 49–164%). Across compounds, the median percentage increase in mammal lifespan was significantly smaller than that in non-mammal lifespan (7.4% vs 17.5%, paired Wilcoxon p = 0.0006). There was no significant correlation between non-mammal and mammal SMDs (r = 0.26, p = 0.10) or percentage increases (r = 0.25, p = 0.10).

Of the 720 experiments, 638 (88.6%) were associated with an increase in lifespan and 82 (11.4%) with a decrease. Of the 638 experiments showing an increase, 495 had p < 0.05 (77.6%); of the 82 showing a decrease, 51 had p < 0.05 (62.2%). Supplementary Fig. 2 shows the p value distributions.

Across the 720 experiments there was evidence of significant funnel plot asymmetry (Egger’s Z = 11.3, p < 0.0001); see Fig. 3 for the contour-enhanced funnel plot. The expected number of significant findings was 499 of 720, while the observed number was 546, significantly more than expected (test of excess significance χ2 = 14.0, p < 0.0001); similarly, the proportion of statistical significance test yielded Z = 3.74 (p < 0.0001)17,18.

Fig. 3
figure 3

Contour-enhanced funnel plot of SMD from the 720 experiments.

Discussion

Despite growing excitement about the possibility of anti-aging interventions impacting human healthspan and lifespan, most studies of these interventions have been conducted in non-human animals. In this review of 720 experiments from 667 such studies, we found widely varying effect sizes and reporting of factors potentially associated with study quality across species and compounds. Important design features such as randomization, blinding of intervention, blinded assessment of outcome, compliance with animal welfare regulations, and sample size calculations were infrequently reported, despite evidence that the absence of such features can bias experimental results19,20,21,22,23. Only slightly more than half included conflict of interest statements, although all studies were published in peer-reviewed journals, and over 90% reported control of temperature.

Although reporting quality improved somewhat over time, this was mainly due to increases in reporting of compliance with animal welfare regulations and conflict of interest statements; crucial design features such as randomization and blinding did not increase substantially over time. Overall, most studies did not meet standard reporting guidelines for preclinical experiments22.

Preclinical studies on various diseases have also shown infrequent reporting of randomization and blinding. A review of 271 preclinical studies across different diseases found that 13% reported randomization and 14% blinding24. In a similar review of 290 studies, 32% reported randomization and 11% blinding25. Overall, our results are comparable, although reporting of both randomization and blinding appears even less frequent in the DrugAge database.

In addition to the overall low rate of reporting factors associated with study quality, we found significant differences across species: the four most represented species in the database (nematodes, fruit flies, mice and rats) varied widely in reporting of randomization and blinding, as well as in the average effect size found. Over half of mammal studies in the database reported randomization, while less than 10% of non-mammal studies did. The better reporting quality of mammal studies does not alleviate concerns, since even for mammal studies reporting was often suboptimal and most studies in the database were from non-mammals. Additionally, the average effect found in non-mammal studies was significantly larger than that found in mammal studies.

For the 35 compounds with both mammal and non-mammal experiments, only seven showed a significant lifespan increase in both non-mammals and mammals, and the number of experiments and sample sizes behind these results were limited. These results are exploratory and the numbers are small, but they raise hesitation about the direct translation of these results to more complex organisms such as humans.

Furthermore, previous work has suggested that some interventions may have different effects when started late in an organism’s lifespan rather than early26,27, and there is significant interest in discovering interventions that slow aging in older adults28. We found that most preclinical experiments started the anti-aging intervention early in the organism’s lifespan, often prior to sexual maturity, when key senescence mechanisms may lack relevance29. Although we did not find a significant difference in the effect of interventions between early and late start experiments, the sparsity of late start results makes this comparison uncertain. Our study clearly highlights the paucity of late start experiments in the literature, a deficit of evidence that needs to be remedied. We emphasize the need for whole-lifespan aging experiments with a greater diversity of start times, including more starting in middle or late life, as these better reflect the intended translational application of anti-aging interventions and the likely design of future clinical trials investigating proposed interventions.

Overall, the analyzed studies report numerous significant results and many suggest sizeable effect sizes. However, the lack of methodological rigor (at least based on reported information) and the strong suggestion of bias (larger effects in smaller studies and an excess of significant results) prompt skepticism about this overall favorable picture and about prospects for translation to humans. Our work has several limitations. The DrugAge database may not include some compounds that have never shown any promising results. Moreover, we did not extract quantitative data from all 3423 experiments in the database; rather, we extracted a random experiment for each species represented in each study, as well as the earliest and latest start time experiments, obtaining quantitative data from only 21% of the experiments in the database. Although experiments were selected randomly, this method may still have led to biased estimates of quantitative effects in some cases. Nevertheless, our selection process yielded a dataset of largely independent observations, whereas the full database may contain many highly correlated observations and may over-represent specific, rather similar experiments.

It is also possible that some studies that did not mention randomization still carried it out, and the same may apply to other design features. Nevertheless, the large variation in reporting of randomization and blinding is concerning. Notably, reporting of randomization did not differ significantly between studies using genetically homogeneous populations of organisms and those using more heterogeneous populations. Furthermore, while we observed improvements over time in reporting of compliance with animal welfare regulations, control of temperature and sample size calculations, it is unclear whether this represents genuine improvement in experimental practices or simply greater awareness that these features should be reported.

Preclinical experiments investigating anti-aging interventions do not regularly follow reporting guidelines and infrequently report important design features such as randomization and blinding. There are significant differences in the average lifespan effect, as well as study quality, across different species commonly used in preclinical experiments. Non-mammal results do not seem to reliably predict mammal results, raising further concern for translation. Despite the interest in interventions able to slow aging when initiated late in human lifespan, most preclinical experiments started interventions early in organism lifespans. Our work highlights multiple concrete areas for improvement of preclinical anti-aging research, areas that may be critical for successful translation into human trial results.

Methods

We downloaded the fifth build of the DrugAge database on May 1, 2025; it contained 3423 lifespan experiments from 680 unique studies. From these, we excluded 12 studies focusing on replicative aging in the yeast Saccharomyces cerevisiae and one duplicate study, yielding 667 unique studies.

For the 32 studies containing experiments with more than one species, a single experiment was randomly selected for each organism. If a study contained experiments where the same compound was started at different points in the same organism’s lifespan, we included one experiment for the earliest and one experiment for the latest start time. After selecting experiments in this way, our final dataset contained 720 experiments. See Fig. 4 for a flowchart of data extraction and Data Supplement 1 for all included studies and experiments. Overall, the 720 experiments represented 568 different species-drug pairs.

Fig. 4
figure 4

Flowchart of data extraction.

For each study, we extracted eight relevant quality checklist items from CAMARADES (Collaborative Approach to Meta-Analysis and Review of Animal Data in Experimental Studies) (Macleod 2004): (1) whether the study was peer reviewed, and reporting of (2) control of temperature; (3) random allocation to treatment/control; (4) blinded intervention; (5) blinded assessment of outcomes; (6) sample size calculations; (7) adherence to animal welfare regulations; and (8) potential conflicts of interest.

We also extracted the median or mean lifespans of experimental and control groups and whenever possible their confidence intervals (CIs) or standard errors (SEs). When these were not reported, we estimated them from included Kaplan-Meier figures and corresponding log-rank test p values.

The difference between experimental and control groups in lifespan was calculated, as well as the standardized mean difference (SMD) and its SE. Whenever possible, the SEs reported for the experimental and control lifespans were used to calculate the SMD and its SE; when these were not reported, the log-rank p value was used to estimate the SE30. We also calculated the relative increase in lifespan by dividing the difference in experimental and control lifespan by the control lifespan for each experiment, expressing the result as a percentage. For calculations of start times, we estimated the average lifespan of each included species as the median control lifespan in days across all experiments for that species in the database.
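These effect-size calculations can be sketched as follows. This is a minimal illustrative sketch in Python, not the code used in the analyses (which were performed in R); all function and variable names are hypothetical, and the SMD and SE formulas are the standard Cohen's d expressions.

```python
import math
from statistics import NormalDist

def smd_from_means(mean_exp, se_exp, n_exp, mean_ctrl, se_ctrl, n_ctrl):
    """Standardized mean difference (Cohen's d) and its SE from group
    mean lifespans, their standard errors, and group sizes."""
    sd_exp = se_exp * math.sqrt(n_exp)    # convert SE of the mean to SD
    sd_ctrl = se_ctrl * math.sqrt(n_ctrl)
    pooled_sd = math.sqrt(((n_exp - 1) * sd_exp**2 + (n_ctrl - 1) * sd_ctrl**2)
                          / (n_exp + n_ctrl - 2))
    d = (mean_exp - mean_ctrl) / pooled_sd
    se_d = math.sqrt((n_exp + n_ctrl) / (n_exp * n_ctrl)
                     + d**2 / (2 * (n_exp + n_ctrl)))
    return d, se_d

def se_from_p(diff, p_two_sided):
    """Back-calculate the SE of a lifespan difference from a reported
    two-sided p value (e.g. a log-rank p), assuming an approximately
    normal test statistic z = diff / SE."""
    z = NormalDist().inv_cdf(1 - p_two_sided / 2)
    return abs(diff) / z

def pct_increase(mean_exp, mean_ctrl):
    """Relative lifespan increase as a percentage of control lifespan."""
    return 100 * (mean_exp - mean_ctrl) / mean_ctrl
```

For example, an experimental group living 34 days on average versus 30 days for controls corresponds to a 13.3% relative lifespan increase.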

Random-effects meta-analysis was performed using the Sidik-Jonkman estimator31. Heterogeneity was estimated with the I2 statistic and Q test32,33. Meta-analysis calculations used the meta package in R34. Contour-enhanced funnel plots and Egger’s test were used to detect small study effects35, considering all 720 results together in the same funnel plot; funnel plot asymmetry indicates that smaller studies show larger effects than larger studies, possibly reflecting selection biases. We also applied the test of excess significance and the proportion of statistical significance test, which may suggest selective reporting biases17,18. Together, these analytical tools allow estimation of publication selection bias, where processes favoring statistically significant results bias which outcomes are reported, which statistical tests are used, and even which studies are published17.
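The pooling and asymmetry diagnostics can likewise be sketched. The following is an illustrative Python sketch rather than the R meta package code actually used; it implements the standard Sidik-Jonkman between-study variance formula and Egger's regression of standardized effect on precision, with hypothetical function names.

```python
import math

def sidik_jonkman_re(effects, ses):
    """Random-effects pooled estimate with the Sidik-Jonkman tau^2
    estimator, plus the I^2 heterogeneity statistic."""
    k = len(effects)
    v = [s**2 for s in ses]                          # within-study variances
    ybar = sum(effects) / k
    # crude initial tau^2 (guarded against zero to avoid division errors)
    tau0 = max(sum((y - ybar)**2 for y in effects) / k, 1e-12)
    w0 = [1 / (vi / tau0 + 1) for vi in v]
    mu0 = sum(w * y for w, y in zip(w0, effects)) / sum(w0)
    tau2 = sum(w * (y - mu0)**2 for w, y in zip(w0, effects)) / (k - 1)
    w = [1 / (vi + tau2) for vi in v]                # random-effects weights
    mu = sum(wi * y for wi, y in zip(w, effects)) / sum(w)
    se_mu = math.sqrt(1 / sum(w))
    # Cochran's Q with fixed-effect weights, then I^2
    wf = [1 / vi for vi in v]
    mu_fe = sum(wi * y for wi, y in zip(wf, effects)) / sum(wf)
    q = sum(wi * (y - mu_fe)**2 for wi, y in zip(wf, effects))
    i2 = max(0.0, (q - (k - 1)) / q) * 100 if q > 0 else 0.0
    return mu, se_mu, tau2, i2

def egger_intercept(effects, ses):
    """Egger's regression: standardized effect (y/s) on precision (1/s).
    Returns the intercept and its t statistic; a large |t| suggests
    funnel plot asymmetry."""
    x = [1 / s for s in ses]
    yv = [y / s for y, s in zip(effects, ses)]
    n = len(x)
    xbar, ybar = sum(x) / n, sum(yv) / n
    sxx = sum((xi - xbar)**2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, yv))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    rss = sum((yi - intercept - slope * xi)**2 for xi, yi in zip(x, yv))
    se_int = math.sqrt(rss / (n - 2) * (1 / n + xbar**2 / sxx))
    return intercept, intercept / se_int
```

In the absence of small study effects, the Egger intercept is expected to be near zero; a strongly positive intercept, as observed here, indicates that smaller studies report larger effects.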

Results were considered statistically significant for p < 0.005, with p values between 0.005 and 0.05 considered possibly suggestive of significance36. The p < 0.005 threshold has been proposed as a way to diminish the risk of false-positive results, which is high in many fields36. R version 4.1.0 was used for all calculations37.