Introduction

The search for pharmacological interventions that extend the healthy lifespan has increased markedly in recent years, spurred by the discovery of a wide range of compounds, such as rapamycin and acarbose, that lengthen the life of model organisms1,2,3. Whether these life-extending agents act broadly by reducing mortality hazard throughout the lifespan or only affect mortality during part of the life course remains unclear, in part due to the limitations of the statistical tests usually used in aging research. The log-rank test4 is the most commonly used statistical tool to determine whether an intervention, be it pharmacologic, genetic, or nutritional, is life-extending. However, its use as the primary and often only tool for this purpose is questionable for several reasons. First is its requirement for proportional hazards (PH) between compared groups, implying that treatment effects on mortality remain constant over time5,6. This assumption does not align with the evidence that many interventions exert varying impacts at different life stages7. For example, in an earlier analysis of data from the Interventions Testing Program (ITP), we found that many interventions do not adhere to the PH assumption, thus challenging the applicability of the log-rank test in these contexts7. When the PH assumption is not met, there are many analytic tools that can be used6,7,8,9. Our previous approach for these interventions used the Gehan test, which is more robust to violations of the PH assumption and more sensitive to effects during early adulthood6. Despite its strengths, the Gehan test has its own drawback: a diminished sensitivity to effects manifesting at later life stages, when mortality and morbidity rates are highest10.

To assess the effects of interventions on the final phase of the aging process, methods like the Wang-Allison and the Gao-Allison test have been developed to determine if treatments extend the maximum lifespan11,12. However, these tests do not evaluate whether an intervention specifically reduces age-specific mortality in the last phase of life when frailty, cognitive impairment, chronic disease, and other burdens of senescence peak. Although the Gompertz model has been used for evaluating age-specific or time-varying effects, it is limited by its strict parametric assumptions about the shape of the hazard function13. The limitations of these approaches underscore the need for a more flexible tool for evaluating longevity interventions, especially one that accommodates potentially variable impacts of treatments across an organism’s lifespan. Such methods should pinpoint when, for how long, and to what extent an intervention significantly alters the mortality risk. This capability is particularly crucial for identifying interventions that mitigate mortality toward the end of life when the exponential increase in the burden of senescence is greatest. Numerous statistical methods for assessing the time-varying efficacy of drugs, including chemotherapeutics, have been developed and published6,14,15,16,17,18,19,20,21. However, these approaches have not seen widespread adoption in clinical trials, nor have they been applied in longevity studies22. A key barrier to their use is the lack of accessible implementations, coupled with the need for substantial user expertise to effectively tune these models.

In response to these challenges, we introduce a nonparametric method termed the Temporal Efficacy Profiler (TEP), which estimates the time-varying hazard ratio and visualizes age-specific effects on mortality risk. TEP can identify when, for how long, and to what extent an intervention significantly influences mortality risk, thereby overcoming the major limitations of traditional methods like the log-rank test. In our approach, we employ the Rebora method, implemented in the ‘bshazard’ R package, to calculate the age-specific mortality risk for the treatment (\({\lambda }_{{Treatment}}\left(t\right)\)) and control (\({\lambda }_{{Control}}\left(t\right)\)) arms separately23. Rebora et al. utilized B-splines within generalized linear Poisson models, incorporating a robust model selection process that automates tuning and eliminates the need for users to manually select smoothing parameters. TEP is defined as:

$${TEP}\left(t\right)=\frac{{\lambda }_{{Treatment}}\left(t\right)}{{\lambda }_{{Control}}\left(t\right)}$$
(1)
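As a minimal sketch of Eq. (1), suppose the two arm-specific hazards have already been estimated on a common age grid. The hazard values below are made up for illustration, and `tep_ratio` is a hypothetical helper, not part of the 'bshazard' package or the authors' script.

```python
import math

def tep_ratio(hazard_treatment, hazard_control):
    """Return TEP(t) and log TEP(t) at each point of a shared age grid."""
    if len(hazard_treatment) != len(hazard_control):
        raise ValueError("hazards must be evaluated on the same age grid")
    ratios, log_ratios = [], []
    for h_t, h_c in zip(hazard_treatment, hazard_control):
        if h_t <= 0 or h_c <= 0:
            raise ValueError("hazard estimates must be strictly positive")
        ratios.append(h_t / h_c)
        log_ratios.append(math.log(h_t) - math.log(h_c))
    return ratios, log_ratios

# Illustrative (made-up) hazards per day at three ages
ages_days = [400, 700, 1000]
treated = [0.0005, 0.0020, 0.0090]
control = [0.0010, 0.0020, 0.0060]
ratios, log_ratios = tep_ratio(treated, control)
```

A ratio below 1 (negative log ratio) corresponds to lower mortality hazard in the treated arm at that age, and a ratio above 1 to higher hazard.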

We considered two approaches to estimate confidence intervals (CIs) of the hazard ratios: the asymptotic method and the bootstrap method. Although bootstrap methods are broadly applicable, they are computationally intensive. On the other hand, simulation studies demonstrated that the asymptotic method is generally more conservative in terms of the coverage probability, especially at age extremes (“Methods” and Fig. S1). For the asymptotic method, we derived pointwise analytical CIs for the hazard ratio by treating the log hazard ratio as the difference of two asymptotically normal estimates, so that its variance is the sum of the variances of the log hazards for each group, as shown below:

$$V\left(\log \left({TEP}\left(t\right)\right)\right)={V}_{R}\left(\log \left({\lambda }_{{Treatment}}\left(t\right)\right)\right)+{V}_{R}\left(\log \left({\lambda }_{{Control}}\left(t\right)\right)\right)$$
(2)

where \({V}_{R}\) represents the Rebora estimator variance23,24. In addition, we used pointwise bootstrap CIs to describe the time-varying hazard profile25,26. The bootstrap method estimates CIs that are more sensitive to differences at age extremes, but it also has slightly lower coverage probability under the null hypothesis (Methods and Fig. S1). For conciseness, we only present the results from the asymptotic method, and only when corroborated by the bootstrap method. All significant findings, along with their corresponding mortality hazard ratios, are provided in the supplementary data files (Supplementary data files 1, 2, and 3).
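The pointwise asymptotic interval implied by Eq. (2) can be sketched as follows, assuming the variances of the two log hazard estimates at a given age are already available. The function name `asymptotic_ci`, the normal quantile used for a 95% interval, and the numerical inputs are illustrative assumptions rather than the authors' implementation.

```python
import math

def asymptotic_ci(log_hr, var_log_treatment, var_log_control, z=1.959964):
    """Pointwise CI for log TEP(t) and TEP(t) at one age.

    The variance of the log hazard ratio is taken as the sum of the two
    arm-specific variances of the log hazard, as in Eq. (2).
    """
    se = math.sqrt(var_log_treatment + var_log_control)
    lower, upper = log_hr - z * se, log_hr + z * se
    return (lower, upper), (math.exp(lower), math.exp(upper))

# Made-up inputs: log HR = -0.5, variance 0.02 in each arm
(log_lo, log_hi), (hr_lo, hr_hi) = asymptotic_ci(-0.5, 0.02, 0.02)
```

In this made-up case the upper limit stays below zero on the log scale (below 1 on the ratio scale), which is the condition the paper uses to call a beneficial effect significant at that age.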

In addition, we developed a color-coded visualization system to better communicate the statistical results. The code is publicly available, and we provide a user-friendly R script with instructions to facilitate its use by any investigator (GitHub: https://github.com/liu-dada/Temporal-Efficacy-Profiler).

To assess the utility of this approach, we utilized publicly available data from the ITP up to 2022, comprising 42 compounds evaluated in over 27,000 genetically heterogeneous mice at 3 geographically distinct sites27. These agents were tested alone or in combination in 132 trials, examining the effects of sex, dosage, and age of treatment initiation. Ten of these agents have been identified by log-rank testing to significantly extend lifespan in at least one sex28. This is the largest publicly available compendium of mouse survival data from tests of compounds with lifespan-extension potential, an exemplary resource for testing the efficacy of our analytic tool.

In this work, we introduce a new analytical tool, the TEP, and apply it to the ITP database to challenge the conventional proportional hazards assumption. The TEP reveals age-specific treatment effects that the commonly used log-rank test fails to detect, uncovering both beneficial and harmful interventions with far greater sensitivity. By delineating when effects occur, it distinguishes drugs that only reduce mortality in early to mid-adulthood from those that still or only act later in life, when aging-related mortality risk is greatest. These insights can improve drug development and enable more targeted interventions, whether aimed at late-life mortality or spanning a broader portion of the life course.

Results

Development and validation of the TEP to determine the timing and impact of life-extending candidates

Figure 1 illustrates how the TEP identifies age-specific effects of an intervention on the mortality hazard, using the ITP test of green tea extract (GTE) in females as an example29. Details of the analysis are described in Online Methods. It should be noted that GTE had no significant effect on survival by log-rank testing29. Figure 1A shows the Kaplan-Meier survival plots for treatment and control groups. These plots indicate that the proportional hazard assumption is likely violated due to the crossing survival curves, which was confirmed by the z-test7.

Fig. 1: Graphical representation of the TEP. Survival data are from the test of Green Tea Extract in females29.

A Kaplan-Meier survival curves of the GTE-treated female mice (Red) and control female mice (Black); B Age-specific mortality hazards of GTE-treated and control mice groups; C Mortality hazard ratio between GTE-treated and control mice groups and 95% confidence intervals shown as dashed lines; D Life course heat map visualization of the age-specific effects of GTE on the mortality hazard ratio. Vertical dashed lines mark the boundaries of significant effects on the mortality hazard ratio based on the ages when the 95% confidence intervals in panel C cross 0. Source data are provided as a Source data file.

Figure 1B is a graphical representation of the mortality hazards of the control and GTE-treated groups throughout the period of testing, using the Rebora method30,31. The mortality hazard of the GTE-treated group is reduced relative to that of the control group before the median lifespan, but shortly thereafter crosses over, exceeding that of the control group.

Figure 1C shows the application of the TEP to the GTE data. The log ratios of the mortality hazards of GTE-treated and control groups shown in Fig. 1B are calculated based on the mortality hazard estimated by Rebora method23. Negative log hazard ratios indicate beneficial effects of GTE (lower mortality hazard), while positive values suggest detrimental effects. The 95% CIs for the mortality hazard ratio were estimated by asymptotic and bootstrap methods, with the asymptotic CI shown as dashed lines. Significant beneficial effects are marked by upper 95% CIs remaining below zero (marked in green), whereas significant adverse effects are indicated when lower 95% CIs exceed zero (marked in red). The duration (age range) of significance is bounded by the ages when the 95% confidence limit crosses 0, as illustrated. This analysis reveals that GTE reduced mortality hazards during midlife but increased mortality hazards toward the end of life.
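The rule described above (benefit when the upper 95% limit of the log hazard ratio is below zero, harm when the lower limit is above zero) can be sketched as a small routine that groups consecutive grid ages into runs. The function name and the numbers are hypothetical, and the boundaries it reports are only as fine as the age grid supplied.

```python
def significant_intervals(ages, ci_lower, ci_upper):
    """Group consecutive grid ages by the sign of a significant log HR.

    Returns (label, start_age, end_age) tuples, where label is
    'benefit' (upper limit < 0), 'harm' (lower limit > 0) or 'none'.
    """
    def label(lo, hi):
        if hi < 0:
            return "benefit"
        if lo > 0:
            return "harm"
        return "none"

    intervals = []
    for age, lo, hi in zip(ages, ci_lower, ci_upper):
        current = label(lo, hi)
        if intervals and intervals[-1][0] == current:
            intervals[-1] = (current, intervals[-1][1], age)  # extend the run
        else:
            intervals.append((current, age, age))  # start a new run
    return intervals

# Made-up confidence limits on the log hazard ratio at seven ages (days)
ages = [300, 400, 500, 600, 700, 800, 900]
ci_lower = [-0.4, -0.9, -0.8, -0.5, -0.2, 0.1, 0.3]
ci_upper = [0.3, -0.2, -0.1, 0.2, 0.6, 0.9, 1.2]
runs = significant_intervals(ages, ci_lower, ci_upper)
```

With these illustrative limits the routine reports a midlife benefit followed by a late-life detriment, the same qualitative pattern described for GTE in females.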

Figure 1D integrates the features of Fig. 1C into an annotated horizontal heatmap to assist in cross-compound comparisons. The heatmap ranges from birth to the death of the last subject in either control or treated group, starting blank and transitioning to color with the onset of treatment. Gray indicates no significant effect, green marks periods of significant mortality reduction, and red denotes significant increases. The color intensity correlates with the effect size (log HR), allowing for a direct comparison of intervention impacts across different timelines as illustrated in Fig. 2. In this example, TEP complements and adds value to the log-rank test, pinpointing the specific ages and durations over which GTE significantly alters age-specific mortality hazards.

Because we added features to the established ‘bshazard’ function, we ran two simulation scenarios to assess the performance of the TEP. Under both the null and alternative hypotheses, the simulations showed that the confidence intervals maintain accurate coverage probabilities across the lifespan (Online “Methods” and Fig. S1). The first scenario confirmed that the asymptotic confidence intervals are accurate and conservative over the entire lifespan, while the second scenario demonstrated TEP’s ability to correctly indicate treatment effects, outperforming the log-rank test in detecting non-proportional hazards (Online “Methods” and Fig. S1).

Greater sensitivity and precision in identifying mortality-modifying interventions

Figure 2 presents heatmaps of interventions identified by the TEP that significantly reduced or increased the age-specific mortality hazard during treatment. Comprehensive heatmaps generated by both the asymptotic and bootstrap methods are provided in Supplementary Data File 3. The hazard ratio plots, which underlie these heatmaps and were calculated via time-varying hazard ratio analysis, are displayed in Supplementary Data Files 1 and 2 for males and females, respectively.

Fig. 2: Life course heat maps of interventions that significantly modified mortality hazard.

Only interventions confirmed by both bootstrap and asymptotic methods are displayed, with the asymptotic method results used as the representative data. Interventions are ranked by the age when beneficial effects ceased in males (from earliest to latest). The remaining interventions are ranked by cessation age of beneficial effects in females, followed by detrimental effects in females (ranked by cessation age of effect), and the remaining ranked from longest to shortest duration of detrimental effects in males. Each row represents an individual trial of one intervention in a single cohort. Each intervention involved one compound or a combination of two, with dosage and starting age of treatment listed. The color-coded bands denote the temporal significance of drug effects: white indicates the period before treatment onset, gray marks periods with no significant effects, green indicates periods of significant beneficial effects, and red denotes intervals of significant detrimental effects. The solid black triangle indicates the median lifespan of the control group for each trial, and the open triangle marks the age of 90% mortality of the control group. Empty cells indicate that no significant effects were detected by both methods, while crossed-out cells indicate that no trial was conducted. Note that the reader can rearrange trials in any order using an Excel spreadsheet (Supplementary Data File 3). Source data are provided as a Source data file.

Interventions in Fig. 2 are ranked based on the age at which their beneficial effects ceased in males, from earliest to latest. For interventions that were identified differently by the two methods, further investigation may be required. Readers can reorder these data as they see fit using the spreadsheet in Supplementary Data File 3. In this discussion, we focus on the interventions identified to be significant by both methods.

Twenty-eight compounds, consisting either of a single agent or a combination of two agents, at one or more doses, initiated at varying ages, significantly modified the mortality hazard in one or both sexes at one or more periods during the treatment period. This analysis identified 11 new compounds that significantly reduced mortality in at least one sex during treatment but were overlooked by the log-rank test: namely, candesartan cilexetil (CC), caffeic acid phenethyl ester (CAPE), 17-dimethylaminoethylamino-17-demethoxygeldanamycin hydrochloride (DMAG), enalapril, GTE, L-leucine, metformin, oxaloacetic acid (OAA), PB125, syringaresinol (Syr), and ursodeoxycholic acid (UDCA). The new analysis also identified 14 compounds that were detrimental (i.e., increased mortality) in one or both sexes at one or more periods of treatment. The duration of significant benefit or detriment varied markedly from weeks (e.g., 2-(2-hydroxyphenyl)benzoxazole (HBX)) to almost the entire treatment period (e.g., rapamycin + acarbose). Most compounds only reduced mortality or only increased mortality. Two exceptions were CC and GTE in females. Effect sizes, indicated by the color intensity, varied markedly during the periods of benefit and detriment. Acarbose had its greatest benefit at the initiation of treatment, waning progressively thereafter. Effect sizes of other compounds, such as butanediol and captopril in males, and many of the different rapamycin trials in females, peaked during the middle of treatment. A few interventions showed a steady increase in effect with continued treatment (e.g., glycine in males and leucine in females).

Only a fraction of interventions reduced mortality at later ages

A strength of the TEP is its ability to estimate when during the life course and for how long an agent exerts its effect on survival. In males, 16 interventions reduced mortality hazards at some period during the life course (Fig. 2). Of these, 9 compounds only reduced mortality risk in early and mid-adulthood (i.e., before the median lifespan): Syr, (R/S)-1,3-butanediol (BD), CC, captopril, enalapril, UDCA, metformin, DMAG, and nordihydroguaiaretic acid (NDGA) at 800 ppm. The two higher doses of NDGA had a slightly longer period of benefit, but only for a short period beyond the median lifespan. By contrast, in females, of the 11 agents that reduced mortality risk at some stage of life, only GTE reduced mortality during early- to mid-adulthood.

In males, five compounds tested in 11 trials demonstrated reduced mortality after attainment of median lifespan, although these effects vanished before mice attained the 90% mortality benchmark: 17α-estradiol, aspirin at 21 ppm, Protandim, high doses of NDGA, and 3 of 4 late-onset (20 mo) rapamycin treatments. Notably, only 5 of the 17 compounds that reduced mortality in males did so at ages beyond the 90% mortality threshold: canagliflozin, acarbose, 17α-estradiol, glycine, rapamycin, and cocktails of either acarbose or metformin with rapamycin. In females, in contrast to males, 10 of 11 beneficial interventions reduced mortality mainly at ages after attainment of median lifespan. Five compounds reduced mortality after 90% mortality, including most trials involving rapamycin, acarbose, BD, L-leucine, and captopril.

Some compounds have adverse effects on mortality hazards

One goal of the ITP has been to ensure against possible deleterious side effects of potential life-extending interventions, especially those already being marketed. Until now, the ITP has only identified two life-shortening interventions using the log-rank test32. Here the TEP revealed 15 trials with 14 compounds that increased the mortality hazard at one or more periods of treatment: 2 in males (HBX and INT-767) and 12 in females (CC, metformin, DMAG, canagliflozin, 17α-estradiol, GTE, minocycline, geranylgeranyl acetone (GGA), fish oil, nicotinamide riboside (NR), UDCA, and MIF098) (Fig. S2).

Sex differences in the effect of pharmacological interventions

Marked sex differences in the responses to life-extending compounds are one of the key outcomes of the ITP28. The TEP unveiled even more sex differences. It identified 5 additional compounds that only benefited males: Syr, enalapril, metformin, DMAG, and UDCA, and 5 compounds that only reduced mortality in females: OAA, CAPE, PB125, Leu, and GTE. Notably, 6 interventions, UDCA, CC, metformin, DMAG, canagliflozin, and 17α-estradiol, exhibited beneficial effects in males but detrimental effects in females (Fig. S2). More compounds adversely affected survival in females (12) than in males (2). Moreover, most compounds with negative effects exerted their effect on females almost from the beginning of treatment. The detrimental effects waned during the 2nd year of life but sometimes reappeared in the final stage of life.

Discussion

The TEP presented here promises to be broadly useful and impactful for aging research. Rather than repeatedly testing different quantiles (i.e., median, 90th percentile, etc.), we have introduced a descriptive approach that reveals the age-specific effects of interventions on the mortality hazard. This opens the door to more granular insights about the actions of an intervention. Such insights can lead to more targeted application of interventions and a better understanding of the underlying mechanisms. The analysis does this by providing estimates of when and for how long during the life course an intervention reduces (or, in the case of detrimental effects, increases) age-specific mortality. It also provides an estimate of the effect size of an intervention and how the strength of its effect changes over the course of treatment. None of this information is attainable by the log-rank test, the current standard for evaluating longevity interventions.

The TEP can distinguish interventions that specifically reduce mortality during senescence from those that only affect survival during midlife or earlier. This is important in the search for therapeutic interventions that benefit individuals of advanced age when the burdens of senescence are greatest. The TEP is also sensitive to adverse effects, which is critical for pre-clinical models that aim to be translatable. Furthermore, the method is sensitive to sex differences in timing, duration, and efficacy of interventions, providing further impetus to probe the mechanisms underlying the growing number of sexually dimorphic traits in aging. Here, we discuss some of the ways in which the new information provided by this analytic tool can assist drug discovery and the search for the underlying mechanisms that drive aging. Additional applications will likely emerge as its adoption spreads within the geroscience community.

A major discovery using this tool is that most interventions exhibited age-related changes in drug efficacy across the detectable treatment period (Fig. 2). This observation is not readily apparent by visual inspection of most Kaplan–Meier plots and is not obtainable from the log-rank tests. Very few interventions significantly reduced (or increased) mortality through the entire course of treatment. Most were only effective for less than half of the treatment duration. This calls for explanation, and the answers are likely to lead to better interventions and greater insight into the mechanisms of aging. The age-specific decrease, increase, or loss of efficacy of an intervention may reflect age-related changes in pharmacokinetics or pharmacodynamics, leading to suboptimal dosage. This finding opens the door to developing age-specific doses to sustain efficacy for longer periods and raises awareness of the importance of understanding the role of aging in pharmacokinetics. It is plausible that the aging processes or causes of mortality change with age, and the intervention loses efficacy because it no longer targets the underlying pathways. Whatever the reason, this tool has uncovered a critical variable that needs to be considered in interventional geroscience.

Another important outcome of the application of the TEP to longevity data is the finding that only a subset of the interventions in the ITP database affected age-specific mortality rates in the last half of the lifespan, and even fewer affected mortality rates after the age when 90% of the control cohort had died11. Diet restriction has long been considered an example of an intervention that retards aging processes broadly, because it extends the age of 90% mortality, distinguishing it from many interventions that only extend the median lifespan33,34. Many studies, including the ITP, use the Wang-Allison test as a discriminator for interventions that do or do not extend the maximum lifespan based on the 90% mortality measure. However, this test does not distinguish whether an increase in the age of 90% mortality reflects the effects of reduced mortality accumulated during earlier ages from the effects of the age-specific mortality reduction at or near the age of 90% mortality. This distinction is of particular importance to a major goal of geroscience: namely, to identify compounds and discover the underlying mechanisms that extend the maximum lifespan by reducing age-specific mortality during the later stages of life when the burden of senescence is greatest. The TEP provides such a measure by indicating whether the intervention specifically reduces mortality rates in the final stage of life. Only a subset of the interventions reported by the ITP as lifespan extending, using log-rank analysis, reduced mortality hazard after the median lifespan, and even fewer did so at later ages.

Nevertheless, compounds that only reduce mortality during the first half of adult life should not be discounted. Reducing mortality at any stage of life can be impactful, especially when considering translatability to humans. For example, the male mortality disadvantage, compared to females, is greatest in the first half of adult life in both humans and UM-HET3 mice31. It is noteworthy that most of the compounds that are only effective in males are also only effective during the first half of the lifespan. Castration of UM-HET3 males before puberty eliminates this mortality disadvantage30. If any of these compounds eliminates the male mortality disadvantage during this period without interfering with male reproductive function, the societal impact, if clinically translatable, would be great28,35.

Not only is this method more sensitive to agents that reduce age-specific mortality, but it is also more sensitive to those that increase mortality. The ITP has never identified adverse effects using the log-rank test until recently32,36. This new tool revealed 15 trials involving 14 compounds that increased mortality hazards in at least one sex. There was a marked sex difference. Only 2 trials showed detrimental effects in males compared to 13 trials in females. Some compounds, including canagliflozin and high doses of 17α-estradiol, markedly reduced mortality in males but were harmful in females. This finding has been confirmed in a recent ITP trial, where canagliflozin significantly prolonged lifespan in males, but shortened lifespan in females32,36. These findings underscore the need for sex-specific testing of life-extending candidates.

The TEP can detect reversals of the benefit of compounds across the life course. GTE in females reduced mortality before the median lifespan but increased mortality at later ages, another distinction not possible using the log-rank test. There is precedent for this reversal. In humans, individuals reporting the lowest intake of dietary protein had reduced mortality from cardiovascular disease and cancer before 65 years of age, but this relationship reversed after that age37. Mice with reduced branched-chain amino acid intake had extended life when the diet began in early adulthood, but their lifespan was unaffected when the diet was initiated at a later age38. Age-related changes in pharmacokinetics and pharmacodynamics may play a role here. For example, blood levels of canagliflozin, whose beneficial effects in males diminish with age, are 2- to 3-fold higher in older males39.

Another strength of the TEP is its heightened sensitivity to potential life-extending candidates. It identified over twice as many as the log-rank test. This is due in part to its ability to identify age-specific effects on the mortality hazard unimpeded by the requirement of the log-rank test for consistent proportional hazard across the duration of treatment. The newly identified compounds generally have smaller effect sizes and shorter durations of positive effect compared to those identified by the log-rank test. However, given their geroprotective potential and the fact that most trials have only used one dose, they deserve further study. It is important to emphasize that this statistical tool should not be used as a final arbiter of any candidate for mortality reduction and lifespan extension (or adverse effect), but rather should be considered a screening tool for identifying potential candidates that deserve follow-up—for example, with different doses. Type 1 errors (i.e., false positives) during initial screens are more acceptable and preferable to false negatives.

A key strength of TEP lies in its use of the Rebora et al. bshazard method23, which enables data-driven estimation of time-varying hazard without requiring manual tuning. The hazard function is modeled using B-splines, with smoothness determined by second-order differences treated as random effects. The variance of these effects, estimated directly from the data via Extended Quasi-Likelihood, serves as the smoothing parameter, allowing the method to automatically adapt to data complexity. This flexible framework, introduced by Eilers and Marx40 and extended by Lee et al.41, allows TEP to detect both subtle and pronounced age-specific effects, including those missed by traditional approaches such as the log-rank test. Its ability to accommodate non-proportional hazards is central to uncovering temporal patterns of intervention efficacy and risk across the lifespan.

It is important to acknowledge the limitations of this method. Although we employed both asymptotic and bootstrap methods to identify significant effects, and we only present findings that were consistently identified by both methods, effects that emerge at age extremes (after 90% mortality) may require further validation due to the relatively small sample size during that period. For example, the detrimental effects observed in DMAG and metformin treatments in females were only evident during a brief window after 90% mortality. On the other hand, we emphasize that these detrimental effects warrant close attention, especially since many of the compounds studied are readily available over the counter, raising potential safety concerns. Compounds showing detrimental effects, even if detected by only one method, deserve further investigation. Another limitation is that the TEP may require a larger sample size than the traditional log-rank test. While there is no specific sample size requirement for TEP analysis, we recommend using a sample size that meets the requirements of the log-rank test to ensure more reliable interpretation of the results.

The method currently does not explicitly consider uncertainty in the time axis, so the ages at which the treatment effect becomes nonzero are presented as point estimates without confidence intervals. However, this limitation did not preclude consistent findings between similar treatments across several cohorts, such as the early effects of ACE inhibitors (enalapril and captopril) or early effects of different doses of NDGA. Statistically testing whether two different treatments have the same effect relative to control is more complex (testing whether the ratio of two hazard ratios is 1) and may require comparisons across cohorts.

While this methodology facilitates the estimation of time-varying treatment effects in comparison to a control, future enhancements could include explicit testing and quantification of differences between active treatments in terms of both timing and the extent of changes in mortality hazard ratios. It would be particularly insightful to assess different dosages of a single compound to pinpoint the optimal dosage for specific age intervals.

In conclusion, this new analytic tool will lead to a better understanding of the impact of interventions on survival, especially in the field of aging research. Testing interventions on survival across the life course is not only time-consuming but also expensive. From such studies, we should derive not just a p value but also gain insight into the ages when interventions are effective or deleterious. This method can provide a more comprehensive evaluation of lifespan interventions, thereby enhancing our understanding of the mechanisms of aging and age-related risk factors.

Methods

Data availability, mouse model, and husbandry

The datasets employed in this study are sourced from the Mouse Phenome Database (MPD; phenome.jax.org), encompassing all data from the Interventions Testing Program (ITP) spanning from 2004 to 2022. This dataset incorporates 13 distinct cohorts, integrating data across three research facilities to ensure the robustness and reproducibility of the findings. The ITP employed the UM-HET3 mouse line, a genetically heterogeneous model, chosen for its relevance to the genetically diverse human population. UM-HET3 mice are bred according to a specific crossbreeding protocol: BALB/cByJ females are mated with C57BL/6J males to produce F1 hybrid females, which are then bred with F1 hybrid males derived from mating C3H/HeJ females with DBA/2J males. This breeding strategy is designed to maximize genetic diversity within the model, thereby approximating the genetic variability inherent in human populations and increasing the translational value of the research findings. The mice designated for longevity assays were maintained under controlled environmental conditions, with a constant ambient temperature of 25 °C and a regulated photoperiod of 12 h light/12 h darkness. Nutritional needs were met with ad libitum access to the Purina 5LG6 diet, alongside specific drugged food formulations as per experimental requirements. Housing protocols were optimized for social enrichment and welfare, accommodating up to three males or four females per standard laboratory enclosure, in accordance with established ethical guidelines. Rigorous daily health assessments were conducted by trained staff to monitor the well-being of the subjects, promptly identify morbidity signs, and implement early intervention strategies as necessary. This proactive health management approach minimized unnecessary suffering and ensured the reliability of longevity data. 
The specifics of drug administration, including dosage, frequency, and duration, as well as the rationale behind the selection of intervention agents, are detailed in the original published reports, providing a comprehensive overview of the therapeutic strategies explored in this body of research.

Description of the temporal efficacy profiler

TEP adapted the Rebora method (implemented in the “bshazard” package in R23, v1.1) to generate a nonparametric smoothed estimate of the baseline hazard rate separately for the treatment and control groups. We included site as an adjustment covariate in the models. We analyzed males and females as separate groups, since most pro-longevity interventions exhibit significant sex differences, with more than half demonstrating efficacy exclusively in males28. Mortality events occurring before the initiation of treatment were excluded so that the hazard ratio estimates reflect the treatment’s effect on survival and are not biased by pre-treatment mortality.
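The idea of estimating a hazard separately in each group and then taking their ratio can be illustrated with a much cruder estimator than the one used here. The sketch below is a minimal illustration, assuming a piecewise-constant (occurrence/exposure) hazard in fixed age bins in place of the smoothed B-spline estimate from bshazard; it omits the site adjustment, and the function names, bin edges, and toy data are hypothetical.

```python
from typing import List, Sequence, Tuple

Group = Tuple[Sequence[float], Sequence[int]]  # (ages, event flags: 1 = death, 0 = censored)


def binned_hazard(times: Sequence[float], events: Sequence[int],
                  edges: Sequence[float], start: float = 0.0) -> List[float]:
    """Deaths per animal-day at risk in each age bin [edges[i], edges[i+1])."""
    # Drop animals that died or were censored before treatment began.
    kept = [(t, d) for t, d in zip(times, events) if t >= start]
    rates = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        left = max(lo, start)  # exposure only accrues after treatment start
        exposure = sum(max(0.0, min(t, hi) - left) for t, _ in kept)
        deaths = sum(1 for t, d in kept if d == 1 and left <= t < hi)
        rates.append(deaths / exposure if exposure > 0 else float("nan"))
    return rates


def binned_hazard_ratio(treated: Group, control: Group,
                        edges: Sequence[float], start: float = 0.0) -> List[float]:
    """Treated / control hazard in each age bin (NaN where undefined)."""
    h_t = binned_hazard(treated[0], treated[1], edges, start)
    h_c = binned_hazard(control[0], control[1], edges, start)
    return [ht / hc if hc > 0 else float("nan") for ht, hc in zip(h_t, h_c)]


# Toy data (hypothetical): ages in days, one censored treated animal.
control = ([100, 200, 300, 400], [1, 1, 1, 1])
treated = ([150, 300, 450, 450], [1, 1, 1, 0])
print(binned_hazard_ratio(treated, control, edges=[0, 250, 500]))
```

A ratio below 1 in a bin means lower mortality in the treated group at those ages. A smoothed estimator replaces the step function with a continuous curve, but the quantity being compared is the same.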

The confidence intervals for the treatment hazard ratio were estimated using both bootstrap and asymptotic methods. First, we used 1000 bootstrapped replications to estimate the confidence intervals25,26; this method is similar to that reported previously in the analysis of the age-specific effects of sex31. Second, we employed the asymptotic method to derive pointwise analytical confidence intervals (CIs) for the hazard ratio. This was achieved by summing two asymptotically normal estimates based on the variance of the difference in log hazards between the groups, which was estimated by the Rebora method23.
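The two interval constructions can be sketched as follows. This is an illustration under stated assumptions, not the analysis code: it assumes the two groups are independent, so the variance of the difference in log hazards is the sum of the two variances, and it uses a generic percentile bootstrap. The function names, the toy statistic (a ratio of crude death rates), and the data are hypothetical.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple


def asymptotic_hr_ci(h_treat: float, se_log_treat: float,
                     h_ctrl: float, se_log_ctrl: float,
                     z: float = 1.96) -> Tuple[float, float, float]:
    """Pointwise hazard ratio with a normal-theory CI built on the log scale."""
    log_hr = math.log(h_treat) - math.log(h_ctrl)
    # Independent groups: Var(difference) = Var(treated) + Var(control).
    se = math.sqrt(se_log_treat ** 2 + se_log_ctrl ** 2)
    return math.exp(log_hr), math.exp(log_hr - z * se), math.exp(log_hr + z * se)


def bootstrap_percentile_ci(treat: Sequence[float], ctrl: Sequence[float],
                            stat: Callable[[List[float], List[float]], float],
                            n_boot: int = 1000, alpha: float = 0.05,
                            seed: int = 0) -> Tuple[float, float]:
    """Percentile CI: resample each group with replacement and recompute stat."""
    rng = random.Random(seed)
    reps = []
    for _ in range(n_boot):
        t_star = [rng.choice(treat) for _ in treat]
        c_star = [rng.choice(ctrl) for _ in ctrl]
        reps.append(stat(t_star, c_star))
    reps.sort()
    lo = reps[round((alpha / 2) * (n_boot - 1))]
    hi = reps[round((1 - alpha / 2) * (n_boot - 1))]
    return lo, hi


def rate_ratio(t: List[float], c: List[float]) -> float:
    """Toy statistic: deaths per animal-day, treated over control (no censoring)."""
    return (len(t) / sum(t)) / (len(c) / sum(c))


ctrl_ages = [600 + 10 * i for i in range(21)]   # hypothetical ages at death
treat_ages = [700 + 10 * i for i in range(21)]
print(asymptotic_hr_ci(0.5, 0.1, 1.0, 0.1))
print(bootstrap_percentile_ci(treat_ages, ctrl_ages, rate_ratio))
```

In practice the statistic passed to the bootstrap would be the smoothed hazard ratio at a given age, recomputed on each resample, which is why the bootstrap is far more expensive than the analytical interval.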

Data visualization

The visualization method uses a color-coded band to depict treatment effects on hazard ratios, with the pre-treatment phase shown as a blank band. Upon treatment initiation, a gray color indicates no detectable effect, while significant effects are represented by changes in color intensity: beneficial effects cause the band to turn green, with the intensity reflecting the magnitude of negative log hazard ratios, and detrimental effects are shown in red, with intensity corresponding to positive log hazard ratios. The transition points where significant effects begin or end are marked by dashed lines. Additionally, key lifespan metrics for the control group, such as median and maximum lifespan (when 90% have died), are highlighted to facilitate interpretation. All computational analyses were conducted in R (version 4.3, Vienna, Austria). Additional R packages used for data processing, visualization, and survival analysis include: survminer (v0.5.0), ggeasy (v0.1.4), plyr (v1.8.9), dplyr (v1.1.4), survival (v3.8.3), ggplot2 (v3.5.2), tidyverse (v2.0.0), tidyr (v1.3.0), and readxl (v1.4.5).
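The colour rules described above amount to a small classification step applied at each age. The sketch below is one possible reading, assuming that an effect is called significant when the 95% CI for the hazard ratio excludes 1; the function names, colour labels, and example values are hypothetical and do not reproduce the plotting code.

```python
import math
from typing import List, Sequence, Tuple

SIGNIFICANT = {"green", "red"}


def band_colour(age: float, treat_start: float, hr: float,
                ci_low: float, ci_high: float) -> Tuple[str, float]:
    """Return (colour, intensity) for one age on the band."""
    if age < treat_start:
        return ("blank", 0.0)            # pre-treatment phase
    if ci_high < 1.0:
        return ("green", -math.log(hr))  # benefit: intensity = -log(HR)
    if ci_low > 1.0:
        return ("red", math.log(hr))     # harm: intensity = log(HR)
    return ("gray", 0.0)                 # no detectable effect


def transition_ages(ages: Sequence[float], colours: Sequence[str]) -> List[float]:
    """Ages at which a significant effect begins or ends (dashed lines)."""
    out = []
    for i in range(1, len(ages)):
        prev, cur = colours[i - 1], colours[i]
        if prev != cur and (prev in SIGNIFICANT or cur in SIGNIFICANT):
            out.append(ages[i])
    return out


# Hypothetical pointwise estimates: (age, HR, CI low, CI high); treatment starts at 270 days.
points = [(100, 1.0, 0.7, 1.4), (300, 1.1, 0.8, 1.5), (500, 1.6, 1.1, 2.3),
          (700, 0.9, 0.6, 1.3), (900, 0.5, 0.3, 0.8)]
colours = [band_colour(a, 270, hr, lo, hi)[0] for a, hr, lo, hi in points]
print(colours)
print(transition_ages([p[0] for p in points], colours))
```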

Simulation

We conducted two simulation scenarios to validate the performance of the method under the null and alternative hypotheses. The simulations assess the method’s accuracy under known conditions, with the goal of demonstrating that the coverage of the confidence intervals was accurate and sensitive to variation. The first scenario used simulated datasets of similar sample size to the ITP case study, with 300 controls and 150 treated under the null hypothesis (the specific sample size for each group is provided in Supplementary Data File 4). We used the same Gompertz distribution for both the treatment and control groups (Fig. S1A). The specific Gompertz density was $f(t \mid a, b) = b\,e^{at}\exp\left(-\frac{b}{a}\left(e^{at}-1\right)\right)$, where a = log(300)/1200 and b = 0.001/a. These values were chosen to reflect a similar hazard as female mice with a median survival of 741 days and a censoring rate of 10%. We conducted 500 simulations (Fig. S1C) and estimated the TEP(t) hazard over the lifespan with the bootstrap (500 resamples each) and asymptotic 95% confidence intervals, and computed the coverage probabilities of each. The results are shown in Figs. S1E and S1G. The bootstrap confidence intervals exhibit close to nominal coverage until the later part of the lifespan, where the coverage rate goes below 90% (Fig. S1E). The asymptotic confidence intervals have >95% coverage and are conservative over the full lifespan (Fig. S1G). This indicates good accuracy of the TEP method under the null hypothesis for the asymptotic confidence intervals, and their lower computational burden makes them an attractive option.
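Lifetimes from a Gompertz distribution with hazard b·exp(at) can be drawn by inverting the survival function S(t) = exp(−(b/a)(exp(at) − 1)). The sketch below is a minimal illustration of that step. The parameter values are illustrative only and are not the study's values, and the censoring mechanism (each animal independently censored with fixed probability at a uniformly chosen age before its death) is an assumption made for the example.

```python
import math
import random
from typing import List, Tuple


def gompertz_median(a: float, b: float) -> float:
    """Age at which S(t) = 0.5 for hazard b * exp(a * t)."""
    return math.log(1.0 + (a / b) * math.log(2.0)) / a


def sample_gompertz(a: float, b: float, rng: random.Random) -> float:
    """Inverse-transform draw: solve S(t) = u for t."""
    u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
    return math.log(1.0 - (a / b) * math.log(u)) / a


def simulate_group(n: int, a: float, b: float, censor_prob: float,
                   rng: random.Random) -> List[Tuple[float, int]]:
    """Return (age, event) pairs; event = 1 for death, 0 for censoring."""
    data = []
    for _ in range(n):
        t = sample_gompertz(a, b, rng)
        if rng.random() < censor_prob:
            data.append((rng.random() * t, 0))  # assumed censoring mechanism
        else:
            data.append((t, 1))
    return data


A, B = 0.008, 2e-5  # illustrative values, not those of the study
rng = random.Random(2024)
controls = simulate_group(300, A, B, 0.10, rng)
treated = simulate_group(150, A, B, 0.10, rng)  # null hypothesis: same distribution
print(round(gompertz_median(A, B), 1), len(controls), len(treated))
```

Coverage at a given age would then be estimated by repeating this simulation many times and recording how often the interval for the hazard ratio contains the true value, which is 1 under the null hypothesis.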

In the second scenario, we examined the alternative hypothesis where the control group (n = 300) maintained the same Gompertz parameters as previously described, while the treatment group (n = 150) was characterized by parameters a = log(50)/1200, b = 0.0003/a. This resulted in a time-varying treatment effect, with an early hazard ratio greater than 1 (harm) and a later hazard ratio less than 1 (benefit), with the transition occurring around the median lifespan (Figs. S1B and S1D). To evaluate the performance of the TEP and the log-rank test, we analyzed 100 simulated datasets by comparing the proportion of estimated hazard ratio 95% CIs that did not include 1 for TEP, against the proportion of log-rank tests rejecting the null hypothesis at alpha = 0.05. The CIs estimated using both bootstrap and asymptotic methods did not contain HR = 1 and correctly indicated the direction of the treatment effect in the early part of the curve in over 90% of simulations (Figs. S1F and S1H). However, in the later part of the curve, bootstrap CIs did not contain HR = 1 and correctly indicated the treatment effect direction in approximately 60% to 90% of simulations (Fig. S1F), whereas asymptotic CIs did so in about 60% of cases (Fig. S1H). Around the median lifespan, where the hazards crossed, TEP appropriately covered HR = 1 at a near-nominal rate. As expected, the log-rank test exhibited reduced power (21%) to reject the null hypothesis in this scenario of nonproportional, crossing hazards.
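When both groups follow Gompertz hazards of the form b·exp(at), the hazard ratio is HR(t) = (b_trt/b_ctl)·exp((a_trt − a_ctl)t), and the hazards cross where HR(t) = 1, that is at t* = log(b_trt/b_ctl)/(a_ctl − a_trt). The sketch below works through this with illustrative parameters chosen to give early harm and late benefit; they are not the study's values, and the function names are hypothetical.

```python
import math


def gompertz_hazard(t: float, a: float, b: float) -> float:
    """Hazard b * exp(a * t)."""
    return b * math.exp(a * t)


def gompertz_hazard_ratio(t: float, a_trt: float, b_trt: float,
                          a_ctl: float, b_ctl: float) -> float:
    """Treated / control hazard at age t."""
    return gompertz_hazard(t, a_trt, b_trt) / gompertz_hazard(t, a_ctl, b_ctl)


def crossing_age(a_trt: float, b_trt: float, a_ctl: float, b_ctl: float) -> float:
    """Age at which the two hazards are equal (HR = 1)."""
    if a_trt == a_ctl:
        raise ValueError("hazards are proportional and never cross")
    return math.log(b_trt / b_ctl) / (a_ctl - a_trt)


# Illustrative parameters only: higher initial hazard but slower rise under treatment.
A_CTL, B_CTL = 0.008, 2e-5
A_TRT, B_TRT = 0.006, 8e-5

t_star = crossing_age(A_TRT, B_TRT, A_CTL, B_CTL)
for age in (200.0, t_star, 1000.0):
    print(round(age, 1), round(gompertz_hazard_ratio(age, A_TRT, B_TRT, A_CTL, B_CTL), 3))
```

With these values the hazards cross at roughly 693 days. A test that averages over the whole lifespan, such as the log-rank test, partly cancels the early harm against the late benefit, which is consistent with the low power reported above.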

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.