Introduction

Chronic obstructive pulmonary disease (COPD) is a chronic respiratory disease that serves as a paradigm for chronic diseases. With a high global prevalence (12.16%)1,2, it represents a significant health burden3. During the course of the disease, patients may experience deterioration of their baseline clinical status (exacerbation), occasionally requiring hospitalization to control such worsening. In COPD, baseline disease severity and hospitalizations due to COPD exacerbations (eCOPD) are the two factors with the greatest impact on direct costs4. Furthermore, hospitalizations also have a profound clinical impact on patients, leading to a loss of pulmonary function5, deterioration of health-related quality of life in the short and long term6,7, heightened risk of cardiovascular events8, increased mortality9, and greater probability of short-term readmission10. This cycle of hospitalization and readmission not only escalates costs but also perpetuates adverse outcomes for patients.

The current situation has spurred various interventions aimed at modifying the sequence of negative events associated with eCOPD. These interventions involve predicting the risk of severe eCOPD (hospitalization)11 or readmissions10,12, as well as early detection of eCOPD13, particularly in patients with a higher likelihood of admission14. This notion of anticipation and the tools that support it represent a new care paradigm for chronic diseases, particularly COPD, which would necessitate a redesign of the chronic disease care model.

In this new scenario, telemonitoring and Machine Learning are expected to play crucial roles in changing clinical practice. Although telemonitoring has been a controversial tool in managing eCOPD13, establishing an adequate patient profile may be what is needed to achieve success14. Additionally, telemonitoring can provide a continuous and well-structured stream of quality data that can be leveraged with Machine Learning techniques, which can learn and uncover relationships and patterns not visible to the traditional methods currently in use.

At Galdakao Hospital, we developed a telemonitoring program (telEPOC)14,15,16 aimed at monitoring COPD patients who are frequently admitted to the hospital for COPD exacerbations. The main goal of this program is to reduce the number of hospital admissions, and its results so far have been very satisfactory (the program has been extended to other hospitals of the Basque Government Health Department). It has also been shown that this program improves several aspects of patient care with respect to a control group14.

In this study, our objective is to develop an early warning system based on Machine Learning (telEPOCML) that can predict when a patient in the telEPOC program is likely to experience a red alarm (i.e., the highest level of alarm; see14 for more details). In other words, we aim to anticipate eCOPD in our cohort of telemonitored patients.

Methods

Telemonitoring data set

The telEPOC dataset comprises the questionnaires submitted daily by patients. More specifically, it contains the following variables:

  • SpO2: Measured by a pulse oximeter.

  • Heart Rate: Measured by a pulse oximeter.

  • Breathing Rate: Manually measured respiratory rate.

  • Number of Steps (previous day): Measured with a pedometer.

  • Temperature.

  • Do you have fatigue? (Yes/No).

  • Do you have more fatigue than usual? (Yes/No).

  • Do you have a cough? (Yes/No).

  • Do you have more coughing than usual? (Yes/No).

  • Do you have sputum? (Yes/No).

  • Is your sputum amount increased, the same, or decreased compared to usual? (Increase/Same/Decrease).

  • What is the color of your sputum? (White / Greenish yellow / With blood).

  • How are you feeling in general? (Better/Equal/Worse).

  • Are you feeling better, equal, or worse than usual? (Better/Equal/Worse).

Based on these reports and a set of rules, an alarm system with three levels is established14. The highest level, which indicates a severe exacerbation, is labeled as a red alarm.

The dataset is subdivided into three subsets as a result of the different software platforms used to collect the data. Table 1 shows the characteristics of each data subset. Due to major changes in the data acquisition process and alarm definitions, subset 1 was excluded from this project.

Table 1 The telEPOC dataset is divided into three subsets.

In total, 166 COPD patients from the telEPOC program were included in this study (74.7% men, 15% current smokers). Table 2 presents baseline characteristics obtained from the medical records available at the hospital for these patients. These characteristics are not part of the telEPOC dataset and were not used to train the machine learning models; they are included to provide additional context on the demographic, clinical, and physiological profiles of the study population, enhancing the interpretability of our results.

Table 2 Baseline characteristics of the study participants.

Data cleaning and harmonization

The data cleaning and harmonization steps consisted of fixing wrongly labeled alarms, harmonizing categorical variable values that changed over time, removing time periods rendered unreliable by back-end migrations, removing duplicate submissions, and removing unrealistic values. Table 3 shows the range of allowed values for each variable.

Table 3 Range of allowed values for each variable.

Sometimes, patients make mistakes while submitting their daily data. In such cases, they send a second submission to correct their errors. During the data cleaning process, we retain only one submission per day, prioritizing the values from the most recent submission. Additionally, variables and categorical values have been modified throughout the program, so we performed the operations necessary to standardize them without compromising the information.
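As an illustration, the deduplication and range-filtering steps can be sketched with pandas as follows. The column names and numeric limits below are illustrative assumptions; the actual allowed ranges are those listed in Table 3.

```python
import pandas as pd

# Illustrative limits only; the actual allowed ranges are those in Table 3.
ALLOWED_RANGES = {
    "spo2": (50, 100),           # %
    "heart_rate": (30, 220),     # beats per minute
    "breathing_rate": (5, 60),   # breaths per minute
    "temperature": (34.0, 42.0), # degrees Celsius
}

def clean_submissions(df: pd.DataFrame) -> pd.DataFrame:
    """Keep one submission per patient-day and mask out-of-range values."""
    df = df.sort_values("submitted_at")
    # Keep only the most recent submission for each patient and day.
    df = df.drop_duplicates(subset=["patient_id", "date"], keep="last")
    # Mask unrealistic values instead of dropping the whole submission.
    for col, (lo, hi) in ALLOWED_RANGES.items():
        df.loc[~df[col].between(lo, hi), col] = pd.NA
    return df
```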

In addition, we excluded samples where the patient was already in a red alarm state. However, we noticed that the model achieved significantly better scores when these samples were retained in the dataset, likely because patients currently in a red alarm state are much more likely to experience another red alarm the following day than those who are not. Nevertheless, we believe that such predictions are of limited value in daily practice: doctors are already aware of these patients’ condition, and consecutive days with a red alarm can be considered part of the same exacerbation period. We therefore concluded that it was more appropriate and realistic to exclude these samples from the model’s training and evaluation.

After these pre-processing steps, we combined the remaining subsets into a single dataset. The final dataset consisted of 149 patients and 159,719 samples, covering the time frame from 2014-09-29 to 2021-05-19.

Model input and target variables

The target variable that we chose to predict is whether a patient will experience a red alarm within the next three days. As part of the data preprocessing, we computed this target variable for each patient submission.

The model uses the patient’s submitted data from the previous days to make this prediction. We experimented with different numbers of previous days, which we refer to as the input window size. After testing, we selected an input window size of 11 days (the current day plus a historical window of 10 days) for our final model, as it yielded the best results among the sizes we tested. It is also consistent with our clinical observations and with findings reported in other studies17.
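A minimal sketch of how the 3-day-ahead target and the lagged input window can be built per patient with pandas, assuming one row per patient per day after cleaning (column names such as `red_alarm` are illustrative):

```python
import pandas as pd

HORIZON = 3   # predict a red alarm within the next 3 days
WINDOW = 11   # current day plus 10 historical days

def add_target(df: pd.DataFrame) -> pd.DataFrame:
    """Label each submission 1 if a red alarm occurs within the next HORIZON days."""
    def per_patient(g):
        g = g.sort_values("date")
        nxt = g["red_alarm"].shift(-1)  # tomorrow's alarm aligned to today
        # Reversed rolling max = max over the *following* HORIZON days.
        future = nxt[::-1].rolling(HORIZON, min_periods=1).max()[::-1]
        g["target"] = future.fillna(0).astype(int)
        return g
    return df.groupby("patient_id", group_keys=False).apply(per_patient)

def add_history(df: pd.DataFrame, features) -> pd.DataFrame:
    """Append lagged copies of each feature for the previous WINDOW - 1 days."""
    for lag in range(1, WINDOW):
        for f in features:
            df[f"{f}_lag{lag}"] = df.groupby("patient_id")[f].shift(lag)
    return df
```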

In addition, the model uses medium-term statistics for each patient at the time of each submission. Specifically, to allow the model to find both patterns in the raw data and deviations from individual baseline values, we add new variables with the mean, median, and standard deviation over the 4 months prior to each submission for the temperature, heart rate, number of steps, and breathing rate variables.
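These baseline statistics can be sketched with a time-based rolling window, where 4 months is approximated as 120 days (an assumption) and closed="left" restricts the window to days strictly before each submission:

```python
import pandas as pd

BASELINE_COLS = ["temperature", "heart_rate", "steps", "breathing_rate"]

def add_baseline_stats(df: pd.DataFrame) -> pd.DataFrame:
    """Add per-patient mean/median/std over the 4 months prior to each row."""
    df = df.sort_values(["patient_id", "date"]).set_index("date")
    for col in BASELINE_COLS:
        roll = df.groupby("patient_id")[col].rolling("120D", closed="left")
        # Row order matches df because df is sorted by (patient_id, date).
        df[f"{col}_mean_4m"] = roll.mean().to_numpy()
        df[f"{col}_median_4m"] = roll.median().to_numpy()
        df[f"{col}_std_4m"] = roll.std().to_numpy()
    return df.reset_index()
```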

Experimental setup

We split the dataset into three subsets along the temporal axis. The first subset, which contains 75% of the samples, is the train set used to train the model. The second subset is the validation set, comprising 15% of the samples, used for model selection. The third subset is the holdout test set, also consisting of 15% of the samples, used to evaluate the selected model’s performance on unseen data.
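In code, such a chronological split can be sketched as follows; the cut points are quantiles of the submission dates, with the reported training and validation fractions and the most recent tail held out for testing (exact boundaries are an assumption):

```python
import pandas as pd

def temporal_split(df: pd.DataFrame, frac_train=0.75, frac_val=0.15):
    """Split chronologically: oldest samples for training, newest for testing."""
    dates = df["date"]
    t_train = dates.quantile(frac_train)            # end of the training period
    t_val = dates.quantile(frac_train + frac_val)   # end of the validation period
    train = df[dates <= t_train]
    val = df[(dates > t_train) & (dates <= t_val)]
    test = df[dates > t_val]                        # most recent, held-out tail
    return train, val, test
```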

Splitting the data along the temporal axis, rather than splitting by patients, ensures that the model is trained on data preceding the validation and test sets, mimicking real-world scenarios. It also enables us to assess the model’s ability to generalize to new data by testing it on the holdout test set, which covers a time period the model has not previously encountered, thus reducing the risk of information leakage from the train set to the test set.

Alternatively, we could divide the dataset based on the patient axis, resulting in each split consisting of a group of patients and their complete temporal sequence. This approach offers the advantage of preventing the model from learning patient-specific patterns, which can enhance its ability to generalize to new patients. However, this setup could potentially leak patterns specific to a particular moment in time to the test set. Furthermore, it is less representative of the real-world use case than the temporal split.

Additionally, a prospective dataset containing 14,253 samples was collected. This extra dataset was used to conduct a real-world validation of the final model, assessing its performance beyond the original train–validation–test split.

Models

We tried multiple machine learning models to find the most suitable one for the red alarm prediction task, specifically gradient tree boosting (CatBoost), feedforward neural networks, and convolutional neural networks. The selection of these three approaches was motivated by their complementary strengths. Gradient tree boosting excels with tabular data and provides feature importance rankings, making it particularly valuable for clinical applications where model interpretability is crucial. Feedforward neural networks can capture complex non-linear relationships and are adept at processing high-dimensional healthcare data. Convolutional neural networks, particularly 1D CNNs, are effective for time series data, automatically extracting relevant features from temporal structures. All three methods have proven track records in healthcare applications, can handle longitudinal data effectively, and have well-established implementations, making them suitable choices for this prediction problem. By employing these diverse algorithms, we aimed to explore different approaches to modeling the complex patterns in COPD exacerbation data, ultimately selecting the best-performing model for our task.

Training and model selection

We conducted a random search of hyperparameters for all models and chose the one that performed the best on the validation set. We evaluated the performance of all models based on both the Area under the ROC curve (AUROC) and the Area under the Precision-Recall curve (AUPRC). We selected the AUPRC as our primary metric for model selection because it better reflects model performance in our heavily imbalanced dataset—one in which non-exacerbations greatly outnumber exacerbations—and it directly captures the trade-off between recall and precision for the positive (exacerbation) class.
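A sketch of this selection loop for the CatBoost model is shown below. The search space and number of trials are illustrative assumptions, `X_train`, `y_train`, `X_val`, and `y_val` come from the temporal split, and scikit-learn's `average_precision_score` is used as the standard estimate of the AUPRC.

```python
import numpy as np
from catboost import CatBoostClassifier
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(42)

def sample_params():
    # Hypothetical search space; the ranges actually explored are not listed here.
    return {
        "depth": int(rng.integers(4, 11)),
        "learning_rate": float(10 ** rng.uniform(-3, -0.5)),
        "l2_leaf_reg": float(10 ** rng.uniform(0, 2)),
        "iterations": 1000,
    }

best_auprc, best_model = -np.inf, None
for _ in range(50):  # number of random trials is an assumption
    model = CatBoostClassifier(**sample_params(), verbose=False)
    model.fit(X_train, y_train, eval_set=(X_val, y_val), early_stopping_rounds=50)
    # Model selection uses AUPRC on the validation set, as described above.
    auprc = average_precision_score(y_val, model.predict_proba(X_val)[:, 1])
    if auprc > best_auprc:
        best_auprc, best_model = auprc, model
```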

While AUROC is a useful measure of overall discrimination, it does not reflect the false positive rate’s impact on model applicability in real-world scenarios. A model with high AUROC can still generate an impractical number of false positives in an imbalanced dataset, which can overwhelm clinical teams, causing alarm fatigue and undermining the system’s utility. In contrast, AUPRC focuses on the balance between identifying true positives (recall) and avoiding false positives (precision), which is essential for a model to be actionable in practice.
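This asymmetry is easy to verify numerically: with uninformative random scores, AUROC stays near 0.5 while AUPRC collapses to the positive prevalence (the 2% prevalence below is illustrative, not our dataset’s actual rate):

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
y = (rng.random(100_000) < 0.02).astype(int)  # ~2% positives: heavily imbalanced
scores = rng.random(100_000)                  # uninformative random scores

print(roc_auc_score(y, scores))               # ~0.50, unaffected by imbalance
print(average_precision_score(y, scores))     # ~0.02, i.e. the positive prevalence
```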

Results

Table 4 shows the performance of each model when using an input window size of 11 days and selecting the hyperparameters that provided the highest AUPRC on the validation set. As can be seen, the CatBoost model outperforms the neural network approaches. We also ran experiments with other gradient tree boosting approaches, such as LightGBM and XGBoost, which provided almost identical but slightly worse results than CatBoost.

Table 4 AUROC and AUPRC in the test set provided by the best hyperparameter configuration that we found for each model type.

We conducted experiments with varying input window sizes to observe how scores change as we alter the number of recent days that the model can use to make predictions. Table 5 displays the CatBoost model scores when utilizing input windows of the current day only, the current day + 10 days, and the current day + 20 days. We observed that this parameter appeared to be more critical for the AUPRC metric, and we chose to employ the current day + 10 days version for the remainder of the study, since it provided the best result for both AUROC and AUPRC.

Table 5 AUROC and AUPRC in the test set for the CatBoost model when using different input window sizes.

The detailed results for the CatBoost model are shown in Table 6, where “AUPRC random” contains the area under the Precision-Recall curve obtained by random predictions (the AUROC for random predictions is always 0.5). Figures 1, 2 and 3 depict the AUROC and AUPRC curves for the train, validation, and test sets, respectively.

Table 6 AUROC and AUPRC scores for the best CatBoost model for each data subset.
Fig. 1
figure 1

AUROC and AUPRC curves for the best CatBoost model on the train set.

Fig. 2
figure 2

AUROC and AUPRC curves for the best CatBoost model on the validation set.

Fig. 3
figure 3

AUROC and AUPRC curves for the best CatBoost model on the test set.

Model interpretability

To achieve a more interpretable model that would allow us to examine the importance of each variable, we trained a model on a simplified dataset without temporal windows. Specifically, we excluded the 10-day history and the 4-month summary, and focused solely on a single submission to predict the probability of a red alarm within the next three days. The experiment was conducted using the same data splits as the previous experiment. Table 7 presents the scores obtained from this experiment. As can be seen, the results are significantly worse than those of the previous experiment, which demonstrates the importance of the 10-day history and the 4-month summary: they provide valuable information for the red alarm prediction task. Figures 4, 5 and 6 depict the AUROC and AUPRC curves for this experiment.

Table 7 AUROC and AUPRC scores for the CatBoost model for each simplified data subset.
Fig. 4
figure 4

AUROC and AUPRC curves for the CatBoost model trained on the simplified train set.

Fig. 5
figure 5

AUROC and AUPRC curves for the CatBoost model on the simplified validation set.

Fig. 6
figure 6

AUROC and AUPRC curves for the CatBoost model on the simplified test set.

To ensure that our model had learned meaningful relationships and to identify the most informative variables for the task at hand, we conducted a SHAP analysis on the simplified dataset. SHAP is a game-theoretic approach that attributes the contribution of each feature to the model output. The results of this analysis are shown in Fig. 7: breathing rate, heart rate, and SpO2 are the three most informative variables for predicting red alarms with the CatBoost model trained on the simplified dataset.

Fig. 7
figure 7

SHAP scores for the CatBoost model on the simplified test set. Each point represents an instance’s SHAP value for a feature. The x-axis shows the SHAP value (impact on prediction), while the color gradient (blue to red) represents the feature’s actual value (low to high).
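For reference, this kind of summary plot can be produced with the shap package roughly as follows, where `model` and `X_test` denote the simplified-dataset CatBoost model and its test-set features (both names are assumptions):

```python
import shap

# TreeExplainer supports CatBoost models natively.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# Beeswarm summary: one point per (instance, feature), colored by feature value,
# as in Fig. 7.
shap.summary_plot(shap_values, X_test)
```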

Prospective analysis and applicability

After training our best CatBoost model, we collected an additional dataset to perform a prospective analysis. Specifically, we used data submitted from 2021-05-20 to 2022-01-30, a total of 14,253 samples.

The scores obtained in this subset were similar to the previous ones. Table 8 shows the scores achieved and Fig. 8 their corresponding curves.

Table 8 AUROC and AUPRC scores for the best CatBoost model on the prospective set.
Fig. 8
figure 8

AUROC and AUPRC curves for the best CatBoost model on the prospective set.

One potential application of this system in the real world is to perform a thorough assessment on patients who, while not experiencing a red alarm, have the highest risk scores according to our model. To evaluate this approach using our prospective dataset, we identified 63 days during the prospective period when at least 30 patients had complete information for model evaluation. Specifically, these were days when no patient currently had a red alarm, the target label could be computed for the next three days, and the patients had no missing data in the previous 10 days. We then examined how many red alarms could have been anticipated by selecting the “n” patients with the highest score predicted by our model on these 63 days. To assess the effectiveness of this approach, we also examined the number of red alarms that would have been anticipated by randomly selecting “n” patients instead. The results of this experiment are presented in Table 9. Had we selected the five patients with the highest predicted scores according to our CatBoost model each day, we would have anticipated 107 red alarms; randomly selecting five patients each day would have anticipated only 23.5 ± 1.30 red alarms. These findings suggest that our model can effectively identify patients at higher risk of experiencing a red alarm, and that this approach could be a valuable tool for healthcare providers seeking to prioritize patient care.

Table 9 Number of red alarms anticipated when selecting the top “n” patients with the highest scores versus “n” random patients.
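The selection experiment described above can be sketched as follows, assuming one DataFrame per eligible day containing a model `score` and the computed `target` per patient (names are illustrative); repeating the random arm over many seeds yields the reported mean and spread.

```python
import numpy as np
import pandas as pd

def alarms_anticipated(day_frames, n=5, seed=0):
    """Red alarms caught by flagging the top-n scores per day vs n random patients."""
    rng = np.random.default_rng(seed)
    top = sum(day.nlargest(n, "score")["target"].sum() for day in day_frames)
    rand = sum(day.sample(n, random_state=rng)["target"].sum() for day in day_frames)
    return top, rand
```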

This study was conducted in accordance with the Declaration of Helsinki and relevant guidelines and regulations. Ethical approval was obtained from the Comité de Ética de la Investigación de Euskadi (Euskadi Research Ethics Committee), with project ID PI2019038. Informed consent was obtained from all participants prior to their inclusion in the study.

Discussion

There is no definitive, universally accepted definition of eCOPD (GOLD23). The definition is inherently subjective, relying primarily on patient-reported data but also on healthcare providers’ judgment, which results in a lack of objective data to support the diagnosis of eCOPD. This is why some experts refer to the diagnosis of eCOPD as a diagnosis of exclusion. In our telEPOC program, we were able to reduce the hospitalization rate for eCOPD by 40% compared to a control hospital14. However, despite this improvement, some patients still required hospitalization. Consequently, our next step was to attempt to anticipate the deterioration of the patients’ condition. In this paper, we aimed to identify the clinical deterioration of symptoms (a “red alarm”), which we used as a surrogate marker for exacerbation. It is important to note that a “red alarm” does not necessarily indicate an exacerbation; rather, it serves as an indicator of potential deterioration and prompts healthcare providers to take appropriate actions within the telEPOC program.

We will center the discussion on articles focused on predicting COPD patient deterioration using solely telemedicine data.

In18, the authors used a telemonitoring system to collect pulse rate, oxygen saturation, and breathing rate, as well as symptoms such as chest tightness, breathlessness, and sputum, for 110 COPD patients monitored for 1 year. They identified stable and prodromal states and built a model to classify patients’ status into one of the two states. Their exacerbation events are defined as changes in medication, and therefore they pursue a different target than the one presented in this work. They achieved an AUROC of 0.682.

In19, the authors use 68,139 self-reports from 2374 patients to define COPD exacerbation events based on the reported symptoms, and then build a model to anticipate events occurring within the next three days. They did not have access to measurements such as SpO2 or temperature. They achieved an AUROC of 0.727.

In20, the authors argue that using classic algorithms to establish eCOPD events leads to too many false-positive alarms, rendering them impractical for daily use. Instead, they focus on forecasting two targets within the next 24 h: hospitalization and initiation of oral steroids. To achieve this, the authors utilized 363 days of telemonitoring data from 135 patients. Their approach resulted in an AUROC of 0.74 for hospitalization prediction, corresponding to a 40% false-positive rate for an 80% recall rate. For anticipating the initiation of steroids, their model resulted in an AUROC of 0.765. The authors demonstrate how their Machine Learning approach produces better results for these tasks than the standard algorithms used to define exacerbations.

As an example of what can be achieved by including clinical markers, in21, the authors developed a model to classify patients as being in a mild state or a severe condition. Their input data included 24 features, such as neutrophil and eosinophil counts from blood test analysis. They experimented with several Machine Learning models and, similar to our study, found that the gradient tree boosting approach performed best, with an AUROC of 0.963. Interestingly, they also discovered that a naive ensemble of their models improved the best model’s score, resulting in an AUROC of 0.968. Additionally, among their 24 variables, the authors reported that the second most informative feature for their ensemble model was the patients’ weight, a variable that could potentially be added to our model as an input.

After reviewing these studies, we found that our dataset contains a larger number of data points than the others, which suggests that the quality and size of our dataset played a significant role in achieving the best AUROC score (0.91) among the presented articles. However, as pointed out in20, a high false-positive rate is the main factor limiting the practicality of these models in daily use. Precision (positive predictive value) is the metric that should be reported to assess the number of false alarms generated in comparison to true alarms. Therefore, we believe that this type of work should report the AUPRC, which reflects the model’s precision and recall performance. It is worth noting that a random model will always achieve an AUROC score of 0.5, regardless of the positive-to-negative ratio, whereas a random model’s AUPRC equals the prevalence of positive labels. Consequently, the lower the prevalence, the harder it is to achieve a high AUPRC score, which makes it difficult to use this metric to compare models trained on datasets with different ratios of positive events. None of the related works we reviewed report this metric. In our case, we achieved an AUPRC score of 0.53, allowing us to identify a threshold at which the model operates with 50% recall and 50% precision.

One limitation of our study is its reliance on data from a specific cohort of patients in the telEPOC program, which may limit the generalizability of our findings to other populations or healthcare settings. Additionally, the use of the red alarm as a surrogate marker for exacerbation, while practical, may not fully capture the complexity of COPD exacerbation events. These factors should be considered when interpreting our results and designing future studies.