Introduction

The widespread adoption of electronic health records (EHR) offers a rich, longitudinal resource for developing machine learning (ML) models to improve healthcare delivery and patient outcomes1. Clinical ML models are increasingly trained on datasets that span many years2. This holds promise for accurately predicting complex patient trajectories and long-term outcomes, but it necessitates discerning relevant data given current practices and care standards.

A fundamental principle in data science is that the performance of a model is influenced not only by the volume of data but crucially by its relevance. Particularly in non-stationary real-world environments, more data does not necessarily result in better performance3. Selecting the most relevant data, however, often requires careful cohort scoping, feature reduction, and knowledge of how data and practices evolve over time in specific domains. This is further complicated by temporal variability, i.e., fluctuations of data quality and relevance for current-day clinical predictions, which is a critical concern for machine learning in highly dynamic environments4,5.

Clinical pathways in oncology evolve rapidly, influenced by factors such as emerging therapies, integration of new data modalities, and disease classification updates (e.g., in the AJCC Cancer Staging System)6,7,8. This leads to (i) natural drift in features, because therapies previously considered standard of care become obsolete; and (ii) drift in clinical outcomes, e.g., because new therapies might reduce the number of certain adverse events (AE) like hospitalizations due to anemia, while giving rise to autoimmune-related AEs4,5,9,10,11. These shifts in features, patient outcomes, and their joint distribution are often amplified at scale by EHR systems. For example, the introduction of new diagnostic tests and therapies frequently necessitates updated data representations, coding practices (e.g., due to billing and insurance policies), or data storage systems12,13. This is exemplified by the switch from ICD-9 to ICD-10 as the coding standard in 2015, which led to a series of model (re)validation studies14,15,16. Finally, the onset of the COVID-19 pandemic, which led to care disruption and delayed cancer incidence, is a prime example of why temporal data drifts follow not only gradual, incremental, and periodic/seasonal patterns but can also be sudden12,17.

Together, these temporal distribution shifts, often summarized under ‘dataset shift’18, arise as a critical concern for the deployment of clinical ML models, whose robustness depends on the settings and data distribution on which they are originally trained4. This raises the question of which validation strategies thoroughly vet models for future applicability and temporal consistency, thereby ensuring their safety and robustness.

In the literature, it is increasingly recognized that training a model once before deployment is not sufficient to ensure model robustness, requiring local validation and retraining strategies19,20,21. Existing frameworks, therefore, frequently focus on drift detection post deployment21,22. Other strategies integrate domain generalization and data adaptation techniques to enhance temporal robustness23. A growing number of tools harness visualization techniques to understand the temporal evolution of features24. However, there is a lack of comprehensive, user-friendly frameworks that examine the dynamic progression of features and labels over time, analyze data quantity-recency trade-offs, and seamlessly integrate feature reduction and data valuation algorithms for prospective validation—elements that are most impactful when combined synergistically.

This study introduces a model-agnostic diagnostic framework designed for temporal and local validation. Our framework encompasses four domains, each shedding light on different facets of temporal model performance. We systematically implement each element of this framework by applying it to a cohort of cancer patients under systemic antineoplastic therapy. Throughout this study, the framework enables identifying various temporally related performance issues. The insights gained from each analysis provide concrete steps to enhance the stability and applicability of ML models in complex and evolving real-world environments like clinical oncology.

Methods

Study design and population

This retrospective study identifies patients from a comprehensive cancer center at Stanford Health Care Alliance (SHA) using EHR data from January 2010 to December 2022. SHA is an integrated health system, which includes an academic hospital (Stanford Health Care [SHC]), a community hospital (ValleyCare Hospital [ValleyCare]), and a community practice network (University Healthcare Alliance [UHA]). The study was approved by the Stanford University Institutional Review Board. This study involved secondary analysis of existing data, and no human subjects were recruited; therefore, the research was deemed exempt from requiring informed consent (Protocol #47644).

The unit of analysis of this cohort study is the individual patient. We construct a dataset using dense (i.e., highly detailed) and high-dimensional EHR data (Epic Systems Corp.) from a comprehensive cancer center. Eligible patients were diagnosed with a solid cancer and received systemic antineoplastic therapy, i.e., immuno-, targeted, endocrine, and/or cytotoxic chemotherapy, between January 1, 2010, and June 30, 2022. We excluded patients who were below the age of 18 on the first day of therapy, were diagnosed with cancer externally, or had a hematologic or unknown cancer.

Time stamps and features

Each patient record had a timestamp (index date) corresponding to the first day of systemic therapy. This date was determined using medication codes and validated using the local cancer registry (gold standard). The feature set was constructed using demographic information, smoking history, medications, vitals, laboratory results, diagnosis and procedure codes, and systemic treatment regimen information. To standardize feature extraction, we used EHR data solely from the 180 days preceding the initiation of systemic therapy (index date) and used the most recent value recorded for each feature. This provides an efficient, straightforward approach with minimal assumptions while supporting a balanced patient representation, since patients with longer and more complex medical histories are otherwise represented in greater detail in EHR systems. We included the 100 most common diagnosis and medication codes, as well as the 200 most prevalent procedure codes from each year. All categorical variables in the feature matrix were one-hot encoded, indicating whether a specific demographic characteristic, tumor type, procedure, diagnosis code, medication type, etc. was present or administered in the past 180 days. This approach produced a feature matrix where each row represents a single patient. If multiple records for a feature existed, the last value was carried forward (LVCF); if a value was missing entirely for the 180-day window, it was imputed using the sample mean from the training set. We further implemented a k-nearest neighbor-based (KNN) imputation approach for k = 5, 15, 100, 1000, which was fit on the training data and applied to both the training and the test data.
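For illustration, the sketch below builds a one-row-per-patient matrix from a long-format event table and applies mean or KNN imputation fit on the training split only. It is a minimal sketch under assumed column names (`patient_id`, `index_date`, `obs_date`, `feature`, `value`), not the study's actual pipeline.

```python
import pandas as pd
from sklearn.impute import KNNImputer

def build_feature_matrix(events: pd.DataFrame, lookback_days: int = 180) -> pd.DataFrame:
    """Build one row per patient from a long-format EHR event table.

    `events` is assumed to hold columns patient_id, index_date, obs_date,
    feature, value (all names illustrative).
    """
    # Keep only observations from the 180 days preceding the index date
    window = events[
        (events["obs_date"] < events["index_date"])
        & (events["obs_date"] >= events["index_date"] - pd.Timedelta(days=lookback_days))
    ]
    # Last value carried forward: the most recent record per patient/feature
    return (
        window.sort_values("obs_date")
        .groupby(["patient_id", "feature"])["value"]
        .last()
        .unstack()  # wide matrix: one row per patient, one column per feature
    )

def mean_impute(train: pd.DataFrame, test: pd.DataFrame):
    """Mean imputation fit on the training split and applied to both splits."""
    means = train.mean()
    return train.fillna(means), test.fillna(means)

def knn_impute(train: pd.DataFrame, test: pd.DataFrame, k: int = 5):
    """KNN imputation (k = 5, 15, 100, 1000 in the study), fit on training data."""
    imputer = KNNImputer(n_neighbors=k).fit(train)
    return imputer.transform(train), imputer.transform(test)
```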

Labels and censoring

The labels for this study are binary. Positive labels (y = 1) were assigned if two criteria were met. First, an acute care event (ACU), i.e., an emergency department visit or hospitalization, occurred between day 1 and day 180 following the index date. We disregarded events on day 0 (index date) to avoid contamination from in-patient therapy initiations. Second, the ACU event was associated with at least one symptom/diagnosis as defined by CMS’ standardized list of OP-35 criteria for ACU in cancer patients25. To ensure this window was set correctly, the index date was compared against the therapy start date in the local tumor registry (gold standard), and observations were removed if the start date in the EHR deviated by more than 30 days from the start date in the registry. To address censoring and identify patients treated elsewhere (i.e., presenting only for a second opinion), patients were only included if they (i) had at least five encounters in the two years preceding the index date (no left-censoring), and (ii) had at least six encounters under therapy, i.e., in the 180 days following the index date.
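For concreteness, the following sketch expresses the labeling rule and the censoring checks in code. The column names (`event_date`, `op35_qualifying`) and the encounter-date series are illustrative assumptions, not the study's data model.

```python
import pandas as pd

def assign_label(acu_events: pd.DataFrame, index_date: pd.Timestamp) -> int:
    """Binary ACU label: 1 if an OP-35-qualifying ED visit or admission
    occurs between day 1 and day 180 after the index date (day 0 excluded)."""
    days = (acu_events["event_date"] - index_date).dt.days
    hit = acu_events[(days >= 1) & (days <= 180) & acu_events["op35_qualifying"]]
    return int(len(hit) > 0)

def passes_censoring_checks(encounter_dates: pd.Series, index_date: pd.Timestamp,
                            registry_start: pd.Timestamp) -> bool:
    """Apply the cohort filters described above (illustrative sketch)."""
    # Index date must agree with the tumor registry within 30 days
    if abs((index_date - registry_start).days) > 30:
        return False
    # At least five encounters in the two years before the index date
    pre = encounter_dates[(encounter_dates < index_date) &
                          (encounter_dates >= index_date - pd.Timedelta(days=730))]
    # At least six encounters in the 180 days after the index date
    post = encounter_dates[(encounter_dates > index_date) &
                           (encounter_dates <= index_date + pd.Timedelta(days=180))]
    return len(pre) >= 5 and len(post) >= 6
```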

Model training and evaluation

For each experiment, the patient cohort was split into one or multiple training and test sets. Hyperparameters were optimized using nested, 10-fold cross-validation within the training set. The hyperparameters from the model with the best performance were then used to refit the model on the entire training set. The model’s performance was evaluated on two separate, independent test sets: one drawn from the same time period as the training data (internal validation, 90:10 split), and the other originating from subsequent years, constituting a prospective independent validation set. To guarantee balanced representation of each year in the sliding window and retrospective incremental learning experiments, i.e., set-ups in which models are trained on moving blocks of years (see below), we sampled a fixed number (n = 1000) of training samples from each year. For all remaining experiments, the data within the training and test sets were utilized in their entirety, without any sampling, encompassing the full scope of data for the selected time period. To ensure universality of the framework, model performance was evaluated using the Area Under the Receiver Operating Characteristic curve (AUROC), with 95%-bootstrapped confidence intervals (CI).
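A minimal sketch of the evaluation metric is given below: AUROC with a percentile bootstrap 95% confidence interval. The number of resamples (1000) and the percentile method are assumptions, as the text specifies only that CIs were bootstrapped.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auroc(y_true, y_score, n_boot: int = 1000, seed: int = 0):
    """AUROC with a percentile 95% bootstrap CI (n_boot is an assumption)."""
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    aurocs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))  # resample with replacement
        if len(np.unique(y_true[idx])) < 2:              # skip single-class resamples
            continue
        aurocs.append(roc_auc_score(y_true[idx], y_score[idx]))
    point = roc_auc_score(y_true, y_score)
    lo, hi = np.percentile(aurocs, [2.5, 97.5])
    return point, (lo, hi)
```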

Model selection and diagnostic framework

Our framework consists of four steps (Fig. 1), each addressing a distinct issue related to temporal/local validation. The first experiment performs a single split on the data to divide the longitudinal dataset into a ‘retrospective’ training and a ‘prospective’ validation cohort. We trained three machine learning models suitable for predicting acute care events in settings with a sparse feature matrix: logistic Least Absolute Shrinkage and Selection Operator (LASSO)26, Extreme Gradient Boosting (XGBoost)27, and Random Forest (RF)28. Based on this experiment, we select the model with the highest AUROC and a short run time for use in subsequent experiments.
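This comparison can be sketched with scikit-learn and xgboost as shown below; the hyperparameter grids are illustrative assumptions, as the text does not specify the search spaces.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# Candidate models and illustrative (assumed) hyperparameter grids
candidates = {
    "lasso": (LogisticRegression(penalty="l1", solver="liblinear", max_iter=5000),
              {"C": [0.01, 0.1, 1.0]}),
    "xgboost": (XGBClassifier(eval_metric="logloss"),
                {"max_depth": [3, 6], "n_estimators": [200, 500]}),
    "rf": (RandomForestClassifier(),
           {"n_estimators": [200, 500], "max_depth": [None, 10]}),
}

def fit_best_models(X_train, y_train):
    """10-fold CV model selection within the training years, then refit on all of it."""
    fitted = {}
    for name, (estimator, grid) in candidates.items():
        search = GridSearchCV(estimator, grid, scoring="roc_auc", cv=10)
        search.fit(X_train, y_train)      # refit=True refits the best model on the full training set
        fitted[name] = search.best_estimator_
    return fitted
```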

Fig. 1: Framework for prospective and local validation.
figure 1

The figure shows the framework for prospective and local validation and success metrics for different stages. The model-agnostic framework (a) depicts a sequence of experiments that researchers and practitioners can follow to validate a prediction model prospectively before clinical deployment. Each part addresses a separate analysis, including 1. assessment of model performance upon splitting time-stamped data into retrospective training and prospective validation cohorts; 2. temporal evolution of patient outcomes and patient characteristics, ranked by feature importance; 3. longevity of model performance and assessment of trade-offs between the quantity of training data and its recency; and 4. estimated impact of feature reduction and algorithmic data selection strategies. The presence of red flags (b) highlights potential risks/failure modes. Each red flag warrants further investigation and potential application-specific mitigation.

The second experiment (Fig. 1, Evolution of Features and Labels over Time) examines trends in the incidence of systemic therapy initiations and acute care events (labels), stratified by event type in each month. The third experiment (Fig. 1, Sliding Window, Model Longevity, and Progression Analysis) interrogates performance longevity/robustness as well as data quantity-recency trade-offs. This is achieved through two sub-experiments. First, the model is trained on a moving three-year time window, and its performance is then analyzed as the temporal gap (between the last time stamp in the train set and the first time stamp in the test set) widens. Second, we perform a progression analysis that emulates the real-world implementation of the model: the model is implemented from the outset, and its performance is shown both without retraining and with annual retraining using the cumulative data available up to each year.
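The two sub-experiments can be sketched as loops over a patient table with a `year` column, as below. The helpers `fit_rf` and `auroc` are hypothetical stand-ins for the fitting and evaluation routines described above, and the progression sketch is simplified to the next-year (diagonal) reading.

```python
def sliding_window_auroc(df, years, window=3, n_per_year=1000, seed=0):
    """Train on a moving `window`-year block and test on each later year.

    Assumes `df` has a `year` column, a binary label, and feature columns;
    `years` is the full list of calendar years (e.g., 2010-2022).
    """
    results = {}
    for start in range(years[0], years[-1] - window + 1):
        train_years = list(range(start, start + window))
        train = (df[df["year"].isin(train_years)]
                 .groupby("year", group_keys=False)
                 .sample(n=n_per_year, random_state=seed))  # fixed samples per year
        model = fit_rf(train)                                # hypothetical helper
        for test_year in range(start + window, years[-1] + 1):
            results[(tuple(train_years), test_year)] = auroc(model, df[df["year"] == test_year])
    return results

def progression_auroc(df, years):
    """Cumulative annual retraining: train on all data before each test year."""
    results = {}
    for test_year in years[1:]:
        model = fit_rf(df[df["year"] < test_year])
        results[test_year] = auroc(model, df[df["year"] == test_year])
    return results
```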

The final experiment (Fig. 1, Temporally Validated Feature Selection and Data Valuation) estimates the impact of strategies to select features and training samples most valuable for creating a temporally robust model.

Shapley values and feature reduction

To inspect the evolution of the features most important to the model’s predictions over time, we calculated Shapley values using ‘TreeExplainer’ from the Python package ‘SHAP’29. For feature reduction in the final part of the framework, we used scikit-learn to calculate the permutation importance of each feature, i.e., to identify the features most important to the model’s performance. Models were trained on data from 2010 to 2020 and tested on data from 2021, and permutations were repeated 50 times. We then removed features with negative permutation importance due to their putative negative impact on performance. Feature removal was halted once the remaining features had an importance value of zero or above. We finally used a second held-out test set from 2022 to evaluate performance on a future test set.
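A minimal sketch of the Shapley-value step is shown below; it assumes a fitted tree ensemble and a feature matrix held as a pandas DataFrame, and it accounts for the fact that SHAP's output layout for binary classifiers differs across package versions.

```python
import numpy as np
import shap

def top_features_by_shap(model, X_train, k: int = 15):
    """Rank features by mean |SHAP value| using TreeExplainer.

    `model` is a fitted tree ensemble (here, the Random Forest);
    `X_train` is a pandas DataFrame with named columns.
    """
    explainer = shap.TreeExplainer(model)
    sv = explainer.shap_values(X_train)
    if isinstance(sv, list):      # older SHAP versions: list of per-class arrays
        sv = sv[1]
    elif sv.ndim == 3:            # newer versions: (n_samples, n_features, n_classes)
        sv = sv[..., 1]
    importance = np.abs(sv).mean(axis=0)      # mean absolute SHAP value per feature
    order = np.argsort(importance)[::-1][:k]  # top-k features by importance
    return X_train.columns[order]
```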

Data valuation and elimination

For the data valuation and reduction, we used a Random Forest classifier and data valuation algorithms implemented in the Python package ‘OpenDataVal’30. First, the following algorithms were compared using recursive datum elimination (RDE): KNNShapley31, Data-Oob32, Data Banzhaf33, and Leave-One-Out (control). We employed proportional sampling from each class (with and without ACU event) when removing datapoints, as changes in the class (im)balance commonly affect model performance34. We subsequently used the best-performing algorithms from the point-removal experiments to calculate the data value averages for each year and removed the lowest-ranking training years. The residual data set was used to retrain the model and tested on the second held-out independent prospective test set from 2022 (for illustration, see Fig. 1, Temporally Validated Feature Selection and Data Valuation). All analyses were performed using the computational software Python (version 3.9.18) and R (version 4.3.1).
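The study relies on the implementations in OpenDataVal; for intuition only, the following is a from-scratch sketch of the closed-form KNN-Shapley recursion, which assigns one value per training datum averaged over a held-out test set. It is a didactic approximation (with an arbitrary default k), not the package's API.

```python
import numpy as np

def knn_shapley(X_train, y_train, X_test, y_test, k: int = 10):
    """Closed-form KNN-Shapley values (one value per training point)."""
    X_train, X_test = np.asarray(X_train, float), np.asarray(X_test, float)
    y_train, y_test = np.asarray(y_train), np.asarray(y_test)
    n = len(y_train)
    values = np.zeros(n)
    for x, y in zip(X_test, y_test):
        order = np.argsort(np.linalg.norm(X_train - x, axis=1))  # nearest first
        match = (y_train[order] == y).astype(float)               # label agreement
        s = np.zeros(n)
        s[n - 1] = match[n - 1] / n                               # farthest point
        for i in range(n - 2, -1, -1):                            # recursion toward nearest
            s[i] = s[i + 1] + (match[i] - match[i + 1]) / k * min(k, i + 1) / (i + 1)
        values[order] += s        # map back from sorted order to original indices
    return values / len(y_test)   # average contribution across the test set
```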

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Results

We identify a total of 24,077 cancer patients who are eligible for the study. Figure S1 shows the cohort scoping strategy and exclusion criteria for identifying cancer patients undergoing anti-neoplastic therapy. The feature set includes a total of 1050 variables including demographics, medications, vitals, laboratory results, diagnoses and procedures, and systemic treatment regimen information. Among all patients, 5227 (21.71%) had an acute care event within 180 days after the index date, with 4097 (17.02%) hospital admissions and 1958 (8.13%) emergency department (ED) visits. Patients are on average 61.12 (SD = 14.49) years of age, predominantly female (55.4%), white (57.4%), and non-Hispanic (87.28%). The majority of patients received intravenous systemic therapy (57.48%). Table 1 shows the labels and baseline patient characteristics for the retrospective (2010–2018) and prospective (2019–2022) cohorts resulting from a single, temporal data split.

Table 1 Baseline Characteristics of Retrospective and Prospective Cohorts

Prospective validation upon single data split

To assess the temporal stability of an ML model on retrospective data (Fig. 2), we perform a single data split and partition the dataset into a ‘retrospective’ training and a ‘prospective’ validation cohort. We hypothesized that a single temporal data split effectively assesses whether historical data from earlier periods can sustain model performance on future time intervals. The results indicate the performance of the classification models when trained on historical data up to 2018 and hypothetically implemented at the point of care in 2019. Patients who start antineoplastic therapy between 2010 and 2018 are grouped into the retrospective cohort, whereas patients starting therapy after December 2018 are assigned to the prospective validation cohort (Fig. 1). We subsequently select three classification models based on their capacity for outcome prediction from a sparse feature matrix: logistic Least Absolute Shrinkage and Selection Operator (LASSO), eXtreme Gradient Boosting (XGBoost), and Random Forest (RF). All models were then trained in four configurations (Fig. 2a–d) and tested on held-out test sets from either the same time period or a future validation cohort. The results from training and validation on the retrospective cohort (Fig. 2a) show an Area Under the Receiver Operating Characteristic curve (AUROC) of 0.81 [0.79, 0.82] for the LASSO, 0.80 [0.79, 0.82] for the XGBoost, and 0.80 [0.79, 0.81] for the RF. When testing these models on the prospective cohort (Fig. 2b), performance decreases slightly, with an AUROC of 0.76 [0.76, 0.77] for the LASSO, 0.78 [0.78, 0.79] for the XGBoost, and 0.78 [0.78, 0.79] for the RF. This performance is also slightly lower than when training and testing the models on the prospective cohort (Fig. 2c, LASSO: 0.81 [0.79, 0.82]; XGBoost: 0.81 [0.79, 0.83]; and RF: 0.80 [0.78, 0.82]) as well as on data from the entire time span between 2010–2022 (Fig. 2d, LASSO: 0.78 [0.77, 0.79]; XGBoost: 0.79 [0.78, 0.80]; and RF: 0.79 [0.78, 0.80]). Since repeating all subsequent experiments for each classifier is impractical, we proceeded with the RF as our preferred model because of its high performance in the prospective validation experiment and shorter training time (Fig. 2b).

Fig. 2: Prospective validation under single temporal data split.
figure 2

The violin plots highlight the performance of three machine learning models on held-out test sets in terms of Area Under the Receiver Operating Characteristic curve (AUROC). Error bars show 95%-bootstrapped confidence intervals. Panel a shows model performance on the retrospective cohort, i.e., when models are trained and tested on data from the same time period (2010–2018). The second panel (b) highlights performance when testing on future data (2019–2022). The last two panels (c and d) show performance when training and testing on the most recent data (2019–2022) or when training the models on the full time span (2010–2022). The gray arrows indicate from which periods the train and held-out test sets emanate.

Evolution of features and labels over time

Since a decrease in performance can be due to temporal changes in features, labels, or their joint distribution, we first analyzed the evolution of features and labels over time. We hypothesized that if there was, for example, a sudden drift post-2015 (switch from ICD-9 to ICD-10), this could motivate including mostly data from the period post-2015 in the training set. To assess shifts in clinical outcomes, we calculated the ratio of acute care events (labels) over the incidence of therapy initiations in each month (Fig. 3a). The results show that this ratio fluctuates slightly but remains overall stable. Most acute care events take place in the first 50 days after initiating chemotherapy (Fig. 3b). Pain, anemia, fever, and nausea are proportionally the most important diagnoses underlying acute care utilization (Fig. 3c) during 2010–2018 and more recent years (2019–2022). Except for anemia, there is no difference larger than 5 percentage points in the proportional frequency of diagnoses associated with ACU between the two time periods.

Fig. 3: Evolution of labels and diagnoses associated with ACU over time.
figure 3

The smoothed line plot in panel a shows the ratio of acute care events (ACU) over therapy initiations between January 2010 and June 2022. The smoothed line plot in panel b shows the time to event from therapy initiation, with the mean-normalized number of events on each day after therapy initiation (index date) on the y-axis. The incidence of emergency department (ED) visits and admissions is highest in the first 50 days following therapy initiation. The bar plot (panel c) shows the share (frequency in percent) of diagnosis groups associated with acute care utilization, plotted separately for the retrospective cohort from 2010–2018 (blue) and the prospective validation cohort from 2019–2022 (red).

To further investigate the trajectory of features over time, we built on a previously reported approach24 to create a heatmap that shows the evolution of features between 2010 and 2022 (Fig. 4). Such heatmaps provide an intuitive and effective visualization to screen features for their overall variability and longevity, as well as to eliminate obsolete features—those that were previously influential but may be irrelevant for a current model due to discontinued use. We reasoned that understanding feature evolution is important, as data gathering is expensive and unused features might harm model performance5. Since displaying all features is impractical and the screening of more important features should be prioritized, we further refine this approach by combining it with Shapley value-based feature reduction. Shapley values are a model-agnostic approach to measure the importance of each feature for prediction based on its average marginal contribution35. To illustrate this approach, we select the top 15 features using the Shapley values from training the RF on a retrospective cohort (2010–2014). The heatmap displays large variations for almost all top features. For example, the variable that records the intention with which a therapy is administered (‘Therapy Goal Curative’) is frequently used from 2010–2015 and 2018–2022, but barely used during 2015–2018. Similarly, the distribution of the results from the lab test for serum albumin shifts to higher means after 2019. Among the top 15 features, antineoplastic therapy- and laboratory results-related features are most frequent.
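The monthly matrix behind such a heatmap can be computed as sketched below, given the names of the top-ranked features; the resulting frame (features as rows, months as columns) can then be rendered with any heatmap routine. The column names and input layout are illustrative assumptions.

```python
import pandas as pd

def monthly_feature_evolution(df: pd.DataFrame, features, date_col: str = "index_date") -> pd.DataFrame:
    """Monthly, per-feature standardized means for a feature-evolution heatmap.

    `df` holds one row per patient with an index date and the selected
    feature columns (binary features yield monthly frequencies).
    """
    month = df[date_col].dt.to_period("M")                   # calendar month of each index date
    monthly_means = df[features].groupby(month).mean()       # rows: months, columns: features
    standardized = (monthly_means - monthly_means.mean()) / monthly_means.std()
    return standardized.T                                    # rows: features, columns: months
```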

Fig. 4: Evolution of features.
figure 4

The heatmap shows the evolution of the top 15 features (ranked by Shapley values), standardized, and organized into monthly segments. Feature importance is determined using the training set and evolution is monitored beyond this period (white vertical line). For illustration, the white rectangle depicts the row-normalized average albumin serum result of patients initiating systemic therapy in April 2015. Changes in color saturation depict variations in the standardized means (continuous variables) and frequencies (binary variables) of a feature and thus highlight shifts in practices, usage of procedure/diagnosis/laboratory units, or distributions of the patient population. Features that transition to a dark purple hue indicate diminishing usage and potential discontinuation (categorical variables) or a decrease in the mean (continuous variables).

Sliding window, model longevity and progression analysis

As a third step in our framework, we interrogated the performance of the prediction model on different training and test periods to assess model longevity/robustness as well as potential trade-offs between data quantity and data recency. This is achieved through two sub-experiments (Fig. 5). First, the model is trained on a moving three-year time window, and its performance is then analyzed as the temporal gap between train and test set widens (Fig. 5a). Second, we perform a progression analysis that emulates the real-world implementation of the model: the model is implemented from the outset, and its performance is shown both without retraining and when retrained annually using the cumulative data available up to, but excluding, the test set year (Fig. 5b). The results from the sliding window experiment highlight that, depending on which years the model is trained on (fixed training sample size, n = 3000), performance on the future test sets varies considerably, ranging between AUROCs of 0.64 and 0.76. Rows with an overall dark purple hue highlight training years that lead to comparatively poor performance. The heatmap further reveals that, except for training on 2013–2015, all models perform as well as or better on test data from 2022 than on the two years before (2020–2021).

Fig. 5: Sliding window, real-world progression analysis, and retrospective analysis.
figure 5

The heatmaps show different permutations of training and test sets to assess model robustness or degradation over time. Performance is evaluated using the Area Under the Receiver Operating Characteristic curve (AUROC). The color coding shows the model performance for each training-validation pair. For the sliding window experiment (a), the model is trained on a moving three-year data span (vertical axis) and then evaluated as the temporal gap between training and validation years widens (horizontal axis, longevity analysis). The progression analysis (b) simulates a scenario where the model is implemented from the outset, with annual retraining using all available data up to that point, and prediction on a single future year. This approach mimics the real-world process of continuously updating the model as new data emerges each year. Specifically, the triangle heatmaps in a and b offer three readings: first, the values on the diagonal display model performance on the year directly following the training period; second, moving along the horizontal axis highlights changes in model performance as the temporal gap between training and test years widens; third, moving along the vertical axis depicts how model performance is influenced when training the model on increasingly more recent years, but testing on the same data from a single future year. The final heatmap (c) shows incremental learning in the reverse direction, i.e., when adding a fixed set of 1000 training samples from each previous year.

The results from the real-world progression analysis show that adding training years at the beginning leads to higher performance on most test years (vertical reading). At later stages, model performance does not necessarily increase when adding more recent data. For example, model performance on test data from 2022 is comparable for the models trained on the full data from 2010–2019 versus the model trained on full data from 2010–2021. Finally, the experiment of reverse incremental learning (Fig. 5c), i.e., when adding a fixed number of data points (n = 1000) from each previous year retrospectively, shows that the model benefits from integrating more recent data. When adding data from 2013 (vertical reading), performance drops considerably for almost all models.

Temporally validated feature selection and data valuation

Finally, we hypothesized that if features and training points were affected by a shift or were particularly noisy, they could be addressed using feature reduction and data valuation algorithms. In particular, since adding 2013 as a training year results in a considerable performance drop, we hypothesized that the average data value of this year, as well as that of the years during the COVID-19 pandemic, would be much lower than those of other years. In this final experiment, we develop a ‘denoising’ strategy that allows us to selectively narrow down an extensive dataset like ours to temporally robust feature and training sets, while ensuring robust performance on future data. This is achieved in two steps (Fig. 6).

Fig. 6: Temporal data valuation, feature reduction, and data pruning.
figure 6

The heatmap in a displays the normalized datum values averaged for each month, calculated by KNNShapley and Data-Oob on each training datum from 2010 to 2020 and tested on data from 2021. Bins with green hues depict months with, on average, the most valuable training data for predicting on the prospective cohort. A red hue indicates that a month is composed of at least some training data that is less valuable when training a model. b shows the ranked data values grouped by year, based on the data valuation algorithms KNNShapley and Data-Oob for training examples between 2010 and 2020. Data values are converted to rank space for comparability. Error bars denote 95%-bootstrapped confidence intervals. c shows performance evolution under recursive feature elimination based on permutation importance in terms of Area Under the Receiver Operating Characteristic curve (AUROC) with 95%-bootstrapped confidence intervals (shading). The dashed line displays the point at which all features with negative signs are removed. This process is repeated three times.

First, we perform data valuation with validation on a prospective test set using four data valuation algorithms (Supplementary Fig. S2), including KNNShapley31, Data-Oob32, Data Banzhaf33, and Leave-One-Out (control). Data valuation algorithms assign each datapoint in the training set an importance value reflecting its contribution to model performance on a held-out test set30. To select the top-performing data valuation algorithm, we first perform recursive data pruning for each algorithm separately on data from the training period (2010–2020). For all data valuation algorithms, we use the full feature set. From the comparison of all four data valuation algorithms, we identify KNNShapley and Data-Oob as the top-performing algorithms (Supplementary Fig. S2), which are then used for data valuation on a held-out prospective test set (2021). The heatmap in Fig. 6a displays the normalized data values averaged for each month resulting from the KNNShapley and Data-Oob algorithms. The color coding shows strong consistency between both methods. Months with lower data value averages are mainly observed for earlier time periods in the training data. Fig. 6b shows the yearly data value averages determined by the algorithms KNNShapley and Data-Oob for training examples between 2010 and 2020 and the prospective validation set from 2021. The data value averages show a sharp decline for years before 2014. Based on this result, we train our final model (Fig. 6d) on training data from 2014–2020.
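The yearly aggregation and pruning step can be expressed as follows; the sketch assumes per-datum values already computed by the valuation algorithms above and a `year` column on the training table, both of which are illustrative.

```python
import numpy as np
import pandas as pd

def yearly_value_ranks(values: np.ndarray, years: np.ndarray) -> pd.Series:
    """Average per-datum values by training year after converting to rank space."""
    ranks = pd.Series(values).rank()                   # rank space for comparability
    return ranks.groupby(pd.Series(years)).mean().sort_index()

def prune_training_years(df: pd.DataFrame, cutoff_year: int = 2014) -> pd.DataFrame:
    """Keep only training years at or after the chosen cutoff (2014 in the results)."""
    return df[df["year"] >= cutoff_year]
```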

Second, we use permutation importance to determine feature importance among our set of 1050 features. Specifically, permutation importance shuffles a single column (or feature) to measure how important that feature is for predicting a given outcome28. When used on a model trained on retrospective data (2010–2020) and tested on future, held-out data (here 2021), permutation importance is a model-agnostic approach to identify features that truly maintain model performance on future data. In the presence of correlation, the approach can further be combined with hierarchical clustering based on Spearman rank-order correlations. In our study, we train a Random Forest on data from 2010–2020 and use the year 2021 as a first held-out test set. We subsequently perform stepwise (n = 10) recursive feature elimination (Fig. 6c). We then remove all features with negative signs, i.e., without importance for future test data. We repeat this experiment three times, each time removing all features with negative signs. This results in a subset of 405 features. Fig. 6c shows the impact on performance when recursively removing features ranked by permutation importance and at random.
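A condensed sketch of this elimination loop is given below, using scikit-learn's permutation_importance on the future validation year. Hyperparameters are omitted, and the stepwise (n = 10) removal is simplified to dropping all negative-importance features in each of three rounds, as described above.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

def drop_negative_importance_features(X_train, y_train, X_val, y_val,
                                      n_rounds: int = 3, n_repeats: int = 50):
    """Repeatedly drop features whose permutation importance on the
    future validation year (here 2021) is negative."""
    features = list(X_train.columns)
    for _ in range(n_rounds):
        model = RandomForestClassifier().fit(X_train[features], y_train)
        result = permutation_importance(model, X_val[features], y_val,
                                        scoring="roc_auc", n_repeats=n_repeats)
        keep = [f for f, imp in zip(features, result.importances_mean) if imp >= 0]
        if len(keep) == len(features):   # nothing left to remove
            break
        features = keep
    return features
```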

Finally, the model is retrained on the reduced feature set with data from 2014–2020 and tested on the second held-out test set from 2022 that is independent of all previous data valuation and feature selection (for approach, see Fig. 1). Supplementary Fig. S3 shows comparable performance of all three models (LASSO, XGBoost, and RF) in three configurations: when trained (i) on the full data, (ii) upon data pruning, i.e., after removing the years with the lowest data value averages (2010–2013), and (iii) upon data pruning and permutation importance-based feature reduction. Supplementary Fig. S4 shows no clear performance differences between models using a k-nearest neighbor (KNN)-based imputation when trained and tested on held-out data from 2010–2018, as well as when tested prospectively on data from 2019–2022.

Discussion

Our study advances a diagnostic framework aimed at identifying and adjusting for temporal shifts in clinical ML models, crucial for ensuring their robust application over extended periods. This approach, developed through analyzing over a decade of data in oncology, emphasizes the difficult balance between the advantage of leveraging historical data and the disadvantage of including data of diminishing relevance due to changes in medical practice and population characteristics over time. By focusing on the temporal dynamics of clinical data, our framework promotes the development of models that are robust and reflective of current clinical practices, thereby enhancing their predictive accuracy and utility in real-world settings. This is accomplished by delineating a framework comprising essential domains, readily applicable yet frequently overlooked or omitted in local validation processes32. This methodology not only enables inspection of temporal shifts in data distributions but also provides an approach for the validation and implementation of clinical predictive models, ensuring they remain effective and relevant in rapidly evolving medical fields—especially when deployed at the point of care.

The work in this study emphasizes the importance of data timeliness and relevance and advocates for a data-centric approach with rigorous prospective local validation, followed by strategic feature and data selection for model training36,37,38. This is critical to improve data quality, reduce the need for complex architectures, and train ML models with data that reflect contemporary clinical protocols and patient demographics38,39. This approach allows researchers to apply models of their choice within the framework. The results from the feature evolution and incremental learning experiments show that the relevance of data is as important as its volume. Simply increasing the volume of data, i.e., including more data from the past, does not necessarily enhance performance. This challenges the traditional belief that accumulating larger datasets guarantees improved accuracy40,41,42.

The swift progress in fields like oncology, highlighted by breakthroughs in immunotherapies, illustrates how an entire treatment landscape can experience multiple paradigm shifts within a single decade. Thus, relying on extensive historical data spanning 10 to 15 years can result in a model that is narrowly tuned to non-representative, old care episodes, thereby degrading its real-time applicability and performance. In highly dynamic clinical settings, the need for heightened vigilance in identifying data that remain relevant under current practices and care standards becomes paramount39. This complexity is exacerbated as individuals frequently move between various healthcare systems and providers.

Developing a nuanced understanding of how outcomes, features, and health practices evolve over time is critical for selecting the most relevant data for ML models. This is an increasingly important challenge that has been widely acknowledged across fields and is addressed in many dimensions by this framework3,43,44. Models trained on historical data often experience performance deterioration45, a phenomenon that underscores the necessity for periodic retraining or updating20,21,45. As shown in the sliding window experiment, the suitability of certain training years over others can be a decisive factor for the model’s success when deployed at the point of care39. This is relevant beyond baseline model training, as the question of how and when models should be retrained, once deployed, has received much attention in recent years20,21.

The potential for data to become outdated implies a trade-off between the recency and the volume of data, prompting researchers to prioritize more recent datasets. However, some treatment regimens and patient distributions may remain stable for extended periods, rendering data from a broader timespan applicable for model training. Thus, identifying the date beyond which data becomes irrelevant for training ML models is a key challenge46. Understanding patterns of drift is also relevant for data processing steps such as imputation. If drift is strong but gradual, using a KNN approach with a small k might be advantageous, as the imputation is likely to rely primarily on (more recent) neighbors that are less affected by drift. In contrast, when there is moderate gradual drift combined with severe, temporary drift, we might want to increase k. A larger number of neighbors, including those from more distant years, will then be used for imputation and can offset more recent (and potentially sudden) trends. Imputation approaches are particularly important if missingness affects many variables. In this study, more than half of the features were curated using procedure and diagnosis codes, which are prone to high levels of missingness due to shifts in medical practices and coding standards. This highlights the need for an effective imputation strategy in medical applications.

Detecting pivotal junctures poses an intricate task, especially in extensive data sets and without historic knowledge of the data. By methodically evaluating model performance and data relevance, we delineate a strategic cut-off point for the inclusion of training data in such highly dynamic environments. While one might instinctively consider excluding data from 2015 and before due to the known transition from ICD-9 to ICD-10 coding standards, our findings reveal a notable decline in data values and in performance in the incremental learning experiment in 2013. Overall, training the model on older data still yields reasonable performance on future test years. This may be due to the presence of many stable features (with low variation) as illustrated in the heatmap, moderate baseline drift, and/or model robustness. Moreover, sudden external shocks such as the COVID-19 pandemic may introduce drift in more recent data. This may result in additional variability that offsets the relative benefits of adding more recent data to the training process. This research shows that older data can still yield strong model performance, reducing the need for frequent retraining.

Moreover, the utilization of incremental learning techniques also enables us to understand the actual impact of sudden events, such as the COVID-19 pandemic, on model performance. The average data value for training data in 2020 exhibited a moderate decrease compared to the peak values observed in the preceding years of 2017–2019. Our framework thus offers the advantage of contrasting data values and performance during periods of drift with those outside such periods (e.g., just before or after the pandemic). The known care disruption and delayed (cancer) incidence under the COVID-19 pandemic can strongly affect model performance in many areas30. Detecting performance issues and quantifying the value of data during such periods provides more clarity on how to address such years, i.e., whether to remove them as a potential safeguard against model overfitting on unique, non-recurring distributions39.

In addition to external shocks, novel therapies can suddenly lead to changes in the distribution of clinical outcomes, adverse events, and patient demographics. This corroborates the need for local validation strategies that also monitor more recent data and explains why simply excluding old data might not always be enough. Our framework supports the need for adaptive model training and offers the potential to design more nuanced model (re)training schemes for clinical machine learning as well as other disciplines.

Our application included several outcomes (ED visits and hospital admissions), tumor types (and thus diseases), disease stages, and data types (e.g., ranging from demographic data to treatment-specific information), as well as an extended time span capturing both sudden and gradual temporal drift. This complex and multifaceted application was intentional to allow for easy adaptation to less complex scenarios, promoting generalizability.

Our study further advances the management of acute care utilization (ACU) in cancer patients under systemic therapy and demonstrates that ML models effectively discriminate between patients with ACU within 180 days of initiating systemic therapy and those without. Identifying individuals at high risk for acute care use is increasingly recognized in the literature as clinically important for reducing mortality, enhancing the quality of care, and lowering cost47. In this study, we illustrate the methods by which algorithms can undergo local and prospective validation, a crucial step in facilitating their transition to clinical application and bridging the gap between the development and implementation of models48.

The study should be interpreted in the context of its limitations. Our cohort was a population from Northern California and, despite the inclusion of diverse practice sites, may not be generalizable to other populations. While our framework is designed to be a generalizable local validation strategy, its broader utility across various medical domains, data environments, and care settings is yet to be demonstrated. Oncology represents a noisy, highly variable data environment, which is well-suited for studying time-related performance aspects of clinical machine learning models. Extending the framework to other domains, data types (e.g., imaging or genome sequencing data), and care settings with seasonal influences (e.g., infectious diseases) may require additional adjustments to account for different clinical contexts and domain-specific data sources40,42,46.

This framework provides a stepwise approach for prospective validation and leverages data valuation methods to assess temporal drift. Our framework combines a suite of visualization approaches with temporally validated feature reduction and data selection strategies. This offers a strategy to select the most pertinent training data from datasets spanning over a decade30,46. Understanding when relevant drifts occur and which data to select for model training is a challenge that is critical beyond the field of oncology. For example, models to predict cardiovascular outcomes for patients with hypertension might be affected by drift that is caused by the change in hypertension diagnosis criteria (>130/80 mmHg) in 201749. Such changes in diagnostic criteria are likely to result in a shift in the baseline characteristics of patients diagnosed with hypertension and can thus affect model performance. By showcasing the framework’s utility across diverse domains, we will gain further evidence of its broader applicability and generalizability.