Abstract
Chronic Obstructive Pulmonary Disease (COPD) exacerbations pose significant challenges to healthcare systems due to their unpredictable nature and severe impact on patients. Current COPD prediction models often lack real-time capabilities and fail to leverage multi-source data for accurate forecasts. This research proposes a Cyber-Physical System-enabled framework that integrates both primary (clinical) and secondary (online) data sources to predict COPD exacerbations in real-time. The framework employs advanced machine learning techniques, specifically Random Forest and Artificial Neural Networks, for feature selection and prediction accuracy. Statistical validation through ANOVA ensures the harmonization of diverse data sources, enhancing the robustness of the prediction models. Experimental results demonstrate the framework’s effectiveness, with key metrics such as accuracy, precision, recall, F1-score, and AUC showcasing its potential for early COPD detection. The proposed system offers proactive healthcare solutions by delivering timely alerts, forecasting exacerbations, and supporting clinical decision-making, ultimately improving patient outcomes and reducing healthcare costs.
Similar content being viewed by others
Introduction
Chronic Obstructive Pulmonary Disease (COPD) is a long-term lung condition that gradually limits airflow, making it difficult for patients to breathe and often leading to repeated hospital visits and increased mortality1. One of the major challenges in managing COPD is the absence of real-time monitoring tools that can predict sudden health deteriorations, which are often preventable with early intervention. Current systems typically rely on delayed or static data, limiting their usefulness in providing timely care. Cyber-Physical Systems (CPS) have emerged as a transformative approach in healthcare by integrating physical devices with intelligent computing and communication capabilities2. These systems enable real-time data collection, processing, and decision-making, allowing for more responsive and personalized medical care. This research introduces a CPS-based framework designed to improve the management of COPD. The framework collects real-time health data from hospital systems and with relevant online sources. Using machine learning models such as Random Forest and Artificial Neural Networks (ANN), it analyzes this combined data to predict health risks and provide early warnings. This approach aims to shift COPD care from reactive treatment to proactive management.
The proposed framework is structured into two main components: the Physical Layer and the Cyber Layer. The Physical Layer includes medical sensors, monitoring equipment, and data transmission technologies that gather patient health information continuously. This layer ensures timely and accurate data delivery from various environments, including hospitals and online sources. The Cyber Layer processes the collected data, applying techniques such as statistical validation (ANOVA), feature selection, and machine learning. This layer runs prediction models to assess the likelihood of COPD flare-ups, hospital admissions. Random Forest and ANN models are trained to analyze key health indicators, identify patterns, and generate alerts if abnormal conditions are detected. What makes this CPS framework valuable is its ability to operate in real time and adapt to changing patient conditions. By combining the hospital data with validated online health sources, the system improves the accuracy of predictions. This framework has potential applications in smart hospital infrastructures, home-based patient monitoring, and emergency preparedness planning, especially for chronic respiratory care.
Forecasting and early prediction in health are significant needs. It shifts the concept from one of reactive treatment to preventive care, where timeliness of intervention improves outcomes for patients and reduces healthcare costs for the management of resources3. As technologies continue to evolve, forecasting and early prediction will only increase their vital role to take us toward smarter healthcare where proactive care of everyone is considered normal. Forecasting and early prediction find great value in the healthcare industry for better outcomes and resource optimization4. Technologies, such as machine learning and artificial intelligence, make early prediction models possible that can help screen for potential health risks, decades in advance, before symptoms may become critical enough to call for healthcare provider intervention. Forecasting in healthcare is especially important because it can change the one-dimensional care given from a reactive model to a proactive model5. Healthcare systems react to patients when problems have already occurred, hence, late treatment and increasing costs of drugs. Forecasting allows for earlier diagnosis and treatment by acquiring knowledge from the examination of historical and real-time data on the patients that can enable very effective interventions. For instance, the possibility of developing COPD in a patient can be predicted before the disease has reached severe stages, thereby enabling the inauguration of lifestyle changes, prevention, and timely medical treatment to reduce the need for hospitalization and improve the quality of life of such a patient6. Early prediction plays a major role in managing resources within healthcare efficiently. This would significantly improve health outcomes in individual patients by developing personalized care plans. These models also empower patients to take responsibility for their health, given the identification of early signs of diseases or complications in health, enabling them to make informed choices toward better long-term wellness7. Early prediction is also critical in reducing healthcare costs. Healthcare providers can give treatments that are much less invasive and cheaper compared to the treatment of diseases when they have reached their last stages8. Predictive analytics-driven preventive care reduces the overall load on healthcare systems through reductions in hospital admissions, emergency room visits, and the necessity for more extensive medical procedures9.
Problem statement
The problem involves the integration of machine learning models into a Cyber-Physical System (CPS) for enhanced Chronic Obstructive Pulmonary Disease (COPD) management within a cloud computing environment. We are given two distinct datasets, \(D_1\) and \(D_2\), each containing various attributes related to COPD, such as Age, PackHistory, FEV1, and smoking status.
The primary objective is to determine if these datasets can be merged into a unified dataset, \(D_{\text {merged}}\), for subsequent analysis. The challenge is to ensure that this merged dataset is statistically valid for machine learning applications. To address this, we need to perform the following tasks:
-
1.
Standardization of Datasets: Each dataset must be standardized to ensure comparability. Given a feature \(X_i\) in dataset \(D_k\), the standardization process is defined as:
$$X_{i}^{(k)\text {norm}} = \frac{X_{i}^{(k)} - \mu _{X_i}^{(k)}}{\sigma _{X_i}^{(k)}}$$where \(\mu _{X_i}^{(k)}\) and \(\sigma _{X_i}^{(k)}\) are the mean and standard deviation of feature \(X_i\) in dataset \(D_k\), respectively. This transformation ensures that each feature has a mean of 0 and a standard deviation of 1.
-
2.
Merging the Datasets: After standardization, the datasets \(D_1\) and \(D_2\) are combined into a single dataset \(D_{\text {merged}}\). The merging operation is expressed as:
$$D_{\text {merged}} = D_1 \cup D_2$$The datasets must be aligned correctly, ensuring that corresponding features are combined appropriately.
-
3.
Statistical Validation Using ANOVA: To validate the compatibility of the merged dataset, we perform a one-way Analysis of Variance (ANOVA) test on each feature to determine if there is a significant difference between the means of the feature in the two original datasets. The null hypothesis \(H_0\) for each feature \(X_i\) is:
$$H_0: \mu _{X_i}^{(1)} = \mu _{X_i}^{(2)}$$where \(\mu _{X_i}^{(1)}\) and \(\mu _{X_i}^{(2)}\) are the means of feature \(X_i\) in datasets \(D_1\) and \(D_2\), respectively. The ANOVA test statistic \(F\) is calculated as:
$$F = \frac{\text {Between-group variability}}{\text {Within-group variability}}$$where
$$\text {Between-group variability} = \frac{n_1 (\bar{X}_{i}^{(1)} - \bar{X}_{i})^2 + n_2 (\bar{X}_{i}^{(2)} - \bar{X}_{i})^2}{k - 1}$$and
$$\text {Within-group variability} = \frac{\sum _{j=1}^{n_1} (X_{ij}^{(1)} - \bar{X}_{i}^{(1)})^2 + \sum _{j=1}^{n_2} (X_{ij}^{(2)} - \bar{X}_{i}^{(2)})^2}{n_1 + n_2 - k}$$with \(\bar{X}_{i}^{(k)}\) being the mean of feature \(X_i\) in dataset \(D_k\), \(\bar{X}_{i}\) the overall mean, \(n_k\) the number of observations in dataset \(D_k\), and \(k = 2\) as the number of groups.
-
4.
Decision on Merging: The computed \(F\) value is compared with the critical value from the \(F\)-distribution table. If the \(F\) value is less than the critical value for all features, the null hypothesis \(H_0\) is not rejected, suggesting that the datasets can be merged without significant differences. Otherwise, further analysis is required to address any significant discrepancies.
-
5.
Application in Cyber-Physical Systems: Once the datasets are validated and merged, they are utilized within a CPS framework for enhanced COPD management. The integrated dataset supports the development and deployment of machine learning models that are capable of providing actionable insights and automated alerts.
Literature review
Recent advancements in integrating Cyber-Physical Systems (CPS) with Machine Learning (ML) have shown significant promise in the healthcare domain, particularly for chronic disease monitoring and prediction. For instance, Zhang et al.10 explored CPS resilience in smart grids, demonstrating how distributed algorithms can sustain performance even under cyber-attacks. Although this study focused on energy systems, the concept of resilient CPS design has strong parallels in healthcare, where data integrity and uninterrupted system function are equally critical. Similar introduced an energy-efficient CPS transceiver capable of body-coupled communication, showcasing the potential for low-power, real-time health monitoring systems. This aligns with our proposed framework, which emphasizes real-time COPD prediction and ICU alerting in resource-constrained environments. In predictive modeling, presented a hybrid data-processing model for electronic medical records (EMRs) using text-to-SQL parsing, enabling deeper insights into patient history11. This framework advances this direction by not only integrating structured hospital data but also incorporating verified online health sources to improve the robustness of prediction models. Moreover, the work by Lin et al. on emulating brain metabolic activities using optical phantoms illustrates the growing use of biologically-inspired CPS simulations for personalized medicine12. While our work does not simulate physiological states, it addresses real-time prediction based on multi-source physiological data, bridging a similar gap in chronic disease management. Further, the implementation of explainable models in medical AI is emphasized by Zhang et al.13 who propose differentially private mechanisms in patient data processing to protect sensitive information while maintaining prediction accuracy. This framework adopts a similar approach, ensuring HIPAA and GDPR compliance through anonymization and secure data handling protocols. Comparative studies done by Liang et al. on magnetic microrobots also highlight the trend toward embedding intelligence into physical health devices14. While our model is software-centric, its deployment can integrate with such robotic systems for COPD monitoring in future applications. Huang et al.15 explored large language model-based recommendations in medical systems for improved decision-making under uncertain conditions. This model complements the work by offering data-driven insights in critical care settings, specifically targeting COPD exacerbation forecasting.
Comparison with Our Approach While the above works lay the groundwork for CPS and ML integration in various domains, our proposed framework stands out in the following ways: \(\bullet\) It uniquely combines hospital and authenticated online health data sources. \(\bullet\) It implements Random Forest for feature robustness and ANN for learning nonlinear relationships in COPD-specific parameters. \(\bullet\) It ensures real-time performance with explainable outcomes, bridging both clinical relevance and computational efficiency.
Proposed framework
The proposed system is based on COPD exacerbation prediction along with the integration of data from hospitals, which is primary, and from authenticated websites that serve secondary data. The architecture of the proposed COPD prediction system integrates both physical and cyber components, giving a seamless framework supporting the real-time processing of data, its analysis, and ultimately its prediction. The architecture has the data collection layer, where primary data obtained in the hospitals through sensors, patient records, and medical devices, and secondary data is obtained from authenticated healthcare websites. It forms the foundation of the system, as it creates a comprehensive dataset.
Figure 1 shows the Proposed Framework for COPD Management in Cyber-Physical Systems Using Machine Learning. The critical work of cleaning, merging, and preparing the data collected for further analysis is done by the Data machine learning algorithms. Further, the integration of primary and secondary data is done within a single dataset, ensuring consistency of variables-for instance, demographic data on patients, medical history, and measures relating to COPD. The data pre-processing step involves normalization, imputation of missed values, and standardization of parameters of normalization since all these are crucial steps in modeling and analyzing the data. The proposed framework relies upon a few key elements:
Physical layer
The Physical layer contains the following components:
Sensors and data acquisition
These are the physical devices, used to acquire real-time health information relative to the patient, including wearable health monitors, smart inhalers, and environmental sensors that track measures such as respiratory rate, oxygen saturation, and air quality, among others. Through these sensors, constant input will be provided to have an overall view of the health status of a particular patient.
Communication technologies
It is supported by several communication technologies to ensure that data is effectively transmitted and parts of the system are correctly integrated:
Wireless communication
Data Processing Unit receives signals from sensors along with the interaction among the system components. Such technologies include Wi-Fi, Bluetooth, and Zigbee.
ANOVA test
After data collection from different sources, in the merging process, consistency in variable alignment must be assured, such as patient characteristics and levels of COPD severity from both primary and secondary data sources. To validate the statistical significance in the merged data, application of the ANOVA test will be applied. It can be used to ensure whether there is any statistical difference that may exist between groups, such as age groups or levels of COPD severity. It is essential to conduct this ANOVA test, which will ensure that the dataset is coherent and statistically valid. Therefore, one can be confident that data may be used for developing predictive algorithms. This step is important because it gives reliability to the data and supports the correct model predictions in successive phases. COPD Exacerbation Prediction Algorithm: This algorithm will predict any COPD exacerbation for a patient based on his/her data. It uses deep machine learning models trained on both historical and real-time data, including age, smoking history, lung function tests, and various physical performance metrics. These train the algorithm to identify patterns and risk factors from these features that would point towards an imminent exacerbation. Prediction is timely because the system is dynamic and hence can make forecasts any time new patient data arrives. It builds critical thresholds necessary for the expectation to be significant enough to fire an alert. This means the algorithm is ever-improving due to training on constantly updated data, allowing it to learn new trends and maintain high predictive accuracy. Its main function would be to provide timely warnings to healthcare providers that would enable them to take appropriate measures for enhancing the outcomes of their patients. This proactive approach helps in reducing hospitalizations and thus has better management of COPD.
Algorithm for merging COPD datasets and ANOVA test
This algorithm outlines the process of merging two different COPD datasets and using the ANOVA test to determine if they can be combined for further analysis.
Statistical validation through ANOVA
To ensure the compatibility of data collected from hospital records (primary source) and verified online websites (secondary source), Analysis of Variance (ANOVA) was employed. This step was crucial to determine whether the distributions of corresponding features from both datasets differed significantly before integrating them into the final model. The goal of applying ANOVA was to examine if there were statistically significant differences in the mean values of specific health-related attributes between the two sources. A one-way ANOVA test was conducted independently on each selected feature. This helped confirm whether the merging of datasets would introduce bias due to inconsistent data patterns. Table 1 shows statistical analysis results for various features. Yes, p-values and F-statistics were calculated for each feature analyzed. A p-value less than 0.05 was considered significant, suggesting that the feature’s distribution differed between the datasets.
Sample Results:
Data merge
Data merging forms the backbone of creating a viable prediction and early intervention framework. Data merging from different sources is vital to further improve the precision, context, and level of completeness in the prediction models. Merging data allows comprehensive analysis, improves the performance of the model in machine learning, and facilitates both real-time and predictive health care.
Cyber layer
The following tasks are performed in the cyber layer to generate insights and make an informed decision.
Machine learning algorithms
This is the main part of the framework. The machine learning algorithm analyzes and preprocessed data. The models are developed to identify early warning signs of increasing severity of disease or other health-related issues from historical and real-time data.
Data processing unit
This unit is responsible for the basic aggregation and preprocessing of data coming from sensors. The system ensures data is cleaned and normalized, meaning presented in a defined format, to allow further analysis. This may involve edge resources that are utilized to process data locally before transmission back to the central system.
Data cleaning
Data cleaning forms a broad part of preparing machine learning datasets in healthcare. Healthcare data is collected from different devices or manual entries and, therefore, is error-prone, inconsistent, and redundant. Some common issues that need to be sorted out before analysis are the wrong entries, duplicate records, and mislabelling.
Handling missing values
Handling missing values is of key importance, especially when dealing with clinical data and healthcare data in real-time, such as COPD. Missing values occur frequently due to incomplete patient records, faults or sensor malfunctions or errors while collecting data during the integration of hospital and external sources in the CPS framework. Missing values can severely affect machine learning algorithms. If not handled properly, they may feed into biased results, wrong predictions, or degradation of the model’s performance.
Feature selection
Feature scaling is the most important step, which allows features coming from the same dataset to be uniformly scaled. Some features of healthcare data usually have different units and magnitude; examples include age and glucose for patients. If not scaled, features with higher numerical values dominate the learning algorithm and often lead to biased outcomes. To avoid this, one may apply normalization or standardization. Normalization performs rescaling the data in a range of 0 to 1, while standardization makes changes to have the data mean at 0 and standard deviation at 1. These methods are considered fair among all features of the machine learning model, for better learning of meaningful patterns in data and making accurate predictions such as disease classification and patient risk assessment.
Post-hoc testing
To further validate features with significant p-values, Tukey’s HSD and Bonferroni correction methods were applied. These post-hoc tests minimized the risk of false positives during multiple comparisons. Only the features that passed these tests were considered harmonized and retained for model development.
Impact on model construction
Features showing statistical inconsistency or failing post-hoc validation were excluded from the dataset to avoid potential noise and misclassification. This process improved the dataset’s quality and ensured that only statistically aligned features were used, directly contributing to the model’s accuracy and generalizability.
Data balancing
One of the major problems in health care datasets is data imbalance, where some of the outcomes or classes-for example, healthy versus diseased patients-could be presented in extremely big numbers. Machine learning models could thus suffer from biased predictions toward one majority class due to data imbalance, which usually causes poor predictive performance for the minority class. Various techniques have been used to address the problem of data imbalance. Resampling techniques include oversampling of the minority class and the majority class. By balancing the dataset, machine learning algorithms would be able to generalize better and make more accurate predictions for all classes. The model would give appropriate representation to the minority cases, such as rare diseases, when making any decision.
Model training
Model development and training in machine learning are performed to predict exacerbations of COPD. Several algorithms may be applied to this, such as decision trees, Random Forests, Support Vector Machines, and Artificial Neural Networks. Data is divided into a training set and a test set for training the model to identify patterns and validate its performance. The cross-validation techniques are used to learn that the model generalizes well on new data, while hyperparameter tuning is performed to bring out the optimal performance of the models. This step is important in constructing robust predictive models for the forecasting of COPD exacerbations.
Algorithmic solution for COPD prediction
Given the patient attributes \(X\) and the goal of predicting COPD exacerbation \(Y\), the following algorithm outlines the steps to train a machine learning model and make predictions.
Model evaluation
This involves assessing the performance of the various trained models using different metrics: Accuracy, Precision, Recall, and F1-score in predicting COPD exacerbations. A confusion matrix can be used to get a more specific look at the errors of the predictions, including false positives and false negatives.The COPD Exacerbation Prediction Algorithm uses pre-processed data to predict possible COPD exacerbations. The algorithm applies the trained model for analysis of the patient data and delivers the corresponding prediction of the probability of an exacerbation. It decides whether to raise an alert by using suitably set thresholds for the prediction made. This module is critical in giving proper warnings in advance and hence allows for proactive management of COPD to let there be early interventions and, therefore, best care for the patients.
Algorithm for alert generation in COPD positive cases
This algorithm describes the process of generating an alert when a patient is predicted to have a COPD exacerbation.
Results
The CPS integrated with predictive modeling for the exacerbation of COPD provided a seamless framework through which real-time data collection, processing, and analysis of patient data was performed. Continuously, data can be acquired in the CPS framework from both primary data, such as those from hospital records, or secondary data obtained from authenticated healthcare websites. These data are processed by various machine learning models to predict the risk of exacerbation. This system uses the data obtained from preprocessing and the merging of proposed COPD Data Merging and ANOVA Test for feeding into machine learning models. These have been evaluated further in establishing the efficiency in predicting exacerbation in COPD. Integrating CPS with various machine learning models culminated in a highly efficient and reliable system for the prediction of COPD exacerbation. The Random Forest Classifier turned out to be the improved model due to its superior accuracy, which allowed for handling more complicated data relationships, whereas the Artificial Neural Network had the potential to improve the overall system efficiency. These results indicate that the CPS-enabled system can provide an early warning for COPD exacerbation, thus enabling timely proactive interventions by healthcare providers in managing such conditions of the patients.
Random Forest Classifier produces improved results, with accuracy as high as 97.5%. This model showed great performance, as it could support high-level nonlinear relationships that may occur in the dataset. For instance, the interaction of a patient’s age, history of smoking, and lung function metrics can be carried out using FEV1 and FVC. RFC achieved great recall, meaning that it rightly identified most of the patients who were at risk of exacerbation. A high value of recall is indispensable for early intervention in COPD, where overlooking an exacerbation too early may be dangerous. While SVM had efficient interpretability, it did not outperform RFC. Though it achieved a reasonable accuracy, it performed poorly regarding non-linear patterns. Non-medical datasets often have all sorts of factors at work during the progression of any disease. Also, the linear SVM had lower recall as compared to RFC, which could lead to more false negatives, whereby high-risk patients might get identified late enough. After training through more than 50 epochs, the ANN also realized a high accuracy of 97.5%, nearly like the RFC. However, ANN had to be tuned longer and required much more computational power to create such performance. The ability of ANN to capture complex interactions between variables, including age, smoking patterns, and lung function, is an assurance that it could get better with subsequent modifications. In analyses by COPD Severity Levels, results indicated that middle-aged patients, aged between 43 and 58 years, were noted to have a higher prevalence of COPD, especially in its moderate to very severe stages. These trends were then depicted using bar plots and helped the models to be informed of the demographic trends in the progression of COPD. Precisely, the system was able to make predictions based on age-related risk factors. The system further analyzed smoking behavior, it was determined that heavy smoking history greatly increased the risk of severe COPD. The correlation between smoking history and the decline in lung function-for example, FEV1-further supported the ability of the models to provide accurate predictions. smokers of a high pack history typically presented more severe stages of COPD, which the RFC and ANN models capitalized on rather effectively. The investigation into lung function metrics and comorbidity alone showed some deepening insights, such as the association of FVC with diabetes. Patients with lower FVC and the presence of diabetes were more likely to exacerbate fact again highlighting the inclusion of comorbidity in the prediction models. This has given further insight, allowing the system to take into consideration not only respiratory factors but also more general health conditions.
The proposed CPS framework demonstrates the potential to enhance the predictive capabilities and early interventions against COPD, providing actionable insights to healthcare providers and a complete strategy for managing the disease. Testing of the proposed Cyber-Physical System framework for prediction and early interventions of COPD with machine learning models integrated along with statistical validation through ANOVA testing was done. This CPS framework focused on the improvement in accuracy of predicting COPD exacerbations while adding improved monitoring of the disease through real-time data leveraged from hospitals and other sources. It was characterized by evaluating multiple parameters, fine-tuning prediction algorithms, and checking the results with statistical measures. Data Integration and Parameter Selection were the integral parts of the CPS framework. The system used the patient data, which included Age, PackHistory, MWT1Best, FEV1, FVC, CAT, and comorbidity as diabetes and hypertension. These features were validated using ANOVA test to identify a significant contributor to the risk of exacerbation in COPD. Factors such as Age, PackHistory and FEV1 were found to have significant contributions in predicting disease progressions. This framework would favour these parameters leading to improved predicting efficacy. COPD Exacerbation Prediction Algorithm showed strong predictive accuracy based on the selection of supervised machine learning models. On successive rounds of training, its accuracy keeps rising, this is indicative that the model is broadly improving learning. The application of feature selection in using ANOVA tests also helped to streamline the algorithm by limiting it to only those features that are statistically significant. It served not only to refine the precision of the model but also to optimize computational efficiency. For Correlation Analysis, a heatmap depicted the relationship between major parameters that showed such variables like PackHistory, FEV1, and FVC were highly correlated with COPD severity. ANOVA test further verified those relationships for correlated dependencies of variables and ensured that only the most relevant features were used in a prediction model. This kind of analysis of the CPS framework allowed it to focus on high-impact factors and made predictions much more robust and resource-efficient. This analysis of MWT1Best showed a trend toward being less active with the advanced age of patients. The ANOVA test confirmed both the age and the results of the walking test as predictors of exacerbations of COPD. This further empowered the system in showing individualized predictions for those patients who are affected with other health conditions that better improve the management of diseases in general. The Gender-Based and CAT Score Analysis demonstrated that there was quite a presentation of diverse severities of COPD between genders. CAT scores were higher for male patients. The ANOVA test proved that indeed the severity of diseases depended on gender, meaning that gender should be a part of the predictive model. This CPS framework allowed to present of more personalized health interventions according to gender, thus improving predictions and delivering treatment plans appropriately.
The training procedure went on to enhance accuracy and loss metrics in the end. Features analysis with ANOVA largely cancelled out overfitting since only statistically significant features were included. Thus, the model was efficient both on the training data and promising for deployment in real-world COPD management systems. The complexity of the model has been decreased with better prediction accuracy because the statistical significance features have been selected during the feature selection. The system also supports personalization, like patient-specific aspects and the framework supports the adaptations based on that. Besides real-time monitoring data, the CPS framework has been able to alert the healthcare providers in time to improve the outcome of the patient. This was achieved by improvements in the usage of critical parameters, leading to enhanced resource utilization such that the system is fast and effective for large healthcare systems.
Model justification
In this research, the selection of machine learning models is based on their proven efficiency in handling healthcare data characteristics such as missing values, noise, non-linear dependencies, and high dimensionality.
Random forest
Random Forest (RF) is an ensemble learning algorithm known for its robustness and high performance in classification tasks, especially in the presence of noisy and incomplete datasets. It constructs multiple decision trees and aggregates their results to reduce overfitting, thus ensuring consistent predictions. RF is particularly effective in identifying feature importance, which aids in interpreting the influence of various physiological and environmental parameters on COPD progression.
Artificial neural network (ANN)
Artificial Neural Networks are designed to model complex and nonlinear relationships between inputs and outputs, making them well-suited for medical applications where disease progression does not follow simple linear patterns. In this research, ANN is employed to uncover hidden relationships between physiological signals and the likelihood of COPD exacerbation.
Baseline model comparisons
To assess the performance of the proposed models, baseline classifiers such as Support Vector Machine (SVM), Logistic Regression (LR), and K-Nearest Neighbor (KNN) are also implemented. These models serve as reference points to highlight the effectiveness of RF and ANN in capturing complex health data relationships. Table 2 shows model Performance Comparison of different models.
Model architecture and configuration
This research applies two machine learning models−Artificial Neural Network (ANN) and Random Forest (RF)−to predict COPD conditions effectively. These models were chosen based on their ability to handle healthcare data that may contain missing values, noise, and complex patterns.
Artificial neural network (ANN) setup
The ANN was used to detect non-linear relationships among patient features. The structure of the network was designed to ensure a balance between learning ability and computational cost. \(\bullet\) Layers: The network has three hidden layers between the input and output. \(\bullet\) Activation Functions: The ReLU (Rectified Linear Unit) function is used in hidden layers to handle non-linearity. The output layer uses the Sigmoid function because this is a binary classification task (i.e., predicting whether a COPD event will occur). \(\bullet\) Loss Function: Binary crossentropy was selected, which is commonly used for binary outcomes. \(\bullet\) Optimizer: Adam optimizer was used to update weights efficiently during training. \(\bullet\) Hyperparameter Tuning: A grid search method was adopted to find the best combination of parameters like learning rate, number of epochs, batch size, and neurons per layer. \(\bullet\) Validation: To avoid overfitting and ensure model stability, k-fold cross-validation (with k = 10) was used.
Random forest configuration
Random Forest was selected due to its high accuracy, ease of interpretability, and ability to rank feature importance. \(\bullet\) Number of Trees: 200 decision trees were used in the forest. \(\bullet\) Depth of Trees: The maximum depth was set to 20 to control overfitting. \(\bullet\) Split Criterion: Gini index was applied to decide how nodes are split. \(\bullet\) Minimum Samples: At least 4 samples were needed to split a node, and 2 were required at a leaf node. \(\bullet\) Feature Importance: The model calculates how important each feature is for making predictions, which helped in understanding which patient data fields influenced results the most. \(\bullet\) Tuning: A random search technique was employed to test multiple parameter combinations quickly and select the best performing ones. Both models were tested using performance indicators like accuracy, precision, recall, F1-score, and area under the curve (AUC) to judge how well they perform in real-world scenarios. This helped in choosing the more effective model for COPD prediction.
Model evaluation and fairness analysis
-
1.
ROC Curve and AUC Performance To evaluate model discrimination capability, ROC (Receiver Operating Characteristic) curves were generated for all implemented models, including Random Forest (RF), Artificial Neural Network (ANN), Support Vector Machine (SVM), Logistic Regression, and K-Nearest Neighbors (KNN). The Area Under the Curve (AUC) was used to summarize each model’s ability to distinguish between positive and negative COPD cases. Table 3 shows AUC Scores for Various Models
Table 3 AUC scores for various models. Higher AUC values for RF and ANN confirm their stronger diagnostic performance in classifying COPD exacerbation risk compared to traditional models.
-
2.
Confusion Matrix Analysis Confusion matrices were used to measure individual model outcomes in terms of correctly and incorrectly predicted cases. The Random Forest model showed a higher count of true positives and true negatives, confirming its robust predictive capability. ANN followed closely with comparable results. SVM, Logistic Regression, and KNN demonstrated more misclassifications, particularly in identifying positive COPD cases.
-
3.
Statistical Significance Testing To assess whether performance differences across models were meaningful, paired t-tests were conducted using k-fold cross-validation scores (k = 10). Table 4 shows comparison of models using t-statistics and p-values.The results are summarized below.
Table 4 Comparison of models using t-statistics and p-values. The p-values (< 0.05) indicate that the Random Forest model performs significantly better than the others in terms of overall accuracy and generalizability.
-
3.
Class Balancing Strategies
To overcome class imbalance issues, especially due to the lower occurrence of exacerbation-positive cases, the following techniques were applied: \(\bullet\) SMOTE (Synthetic Minority Over-sampling Technique): Used to synthetically generate new instances of the minority class during training. \(\bullet\) Class Weight Adjustment: Implemented in algorithms such as SVM and Logistic Regression to penalize misclassification of underrepresented classes.
Figure 2 shows A heatmap generated from the dataset gives an overall provision for several features related to COPD analysis. It can be used to plot the correlation matrix to find how different variables like AGE, PackHistory, FEV1, FVC, and CAT are related to one another. For instance, it shows the relation of MWT1 values with other measures like FEV1 and FVC, which are important to measure the severity of COPD and the status of patients.
Figure 3 shows the bar plot showing the distribution of COPD severity, considered as COPD = 2, by different age categories. Further segmenting the data into defined age ranges helps in the better display of the prevalence of severity of COPD for each of the age groups. This plot is of high value for identifying the most relevant patterns in COPD severity.
Figure 4 shows the counts of COPD for different ranges of age. The distribution of counts of COPD for different ranges of age is segmented into different levels depending on the severity. By melting the dataset and plotting the counts for the different categories of COPD severity against different age groups, it shows how the prevalence varies with age and severity. The insights from this plot will help elaborate on focused strategies for the management of COPD for different age groups.
Figure 5 shows the subplot visualization displays the relationship existing between Pack History and smoking status in the dataset. Plotting PackHistory across different categories of smoking gives this analysis of the correlation of smoking habits with the extent of tobacco use among the patients. Each subplot, for a different smoking status, allows for comparison by showing how pack history varies between smokers versus non-smokers.
Figure 6 shows the dependence of MWT1Best-the best measurement of the 6-minute walk test-on age in categories. Categorizing values of MWT1Best by age groups and computing the mean over each class yields data on how physical performance is measured using the walk test changes across age groups. Converting the age ranges into string format enables one to present and interpret the results more clearly.
Figure 7 shows the bar chart of the relation between the range of FVCs and the incidence of diabetes among COPD patients. In melting the data set with counts of diabetes across different ranges of FVC, this plot shows the distribution of FVC levels among diabetic and nondiabetic subjects. It is of great value for any insight that it would give on how diabetes may influence lung function, as measured by FVC, for COPD patients.
Figure 8 shows the boxplot of Cat scores with a visual summary of the distribution and variability in symptoms of COPD as reported by patients. This visualization will be important in portraying COPD patients’ overall symptom burden, reflected by their CAT scores. This study reinforces the pattern and variation in symptom severity that establish a basis for more targeted and personalized intervention strategies within a smart health framework.
Figure 9 shows the distribution plot of CAT scores is a detailed representation of the frequency distribution of symptom severity in COPD patients. This visualization will relay how the CAT scores fall within a variety of ranges, hence communicating the prevalence of different levels of symptom severity. The shape and spread of this distribution give an overview of the overall burden of symptoms due to COPD in the patient population. This information will contribute to the development and refinement of predictive models.
Figure 10 shows a bar plot showing the distribution of the continuous variable CAT, which is the COPD Assessment Test score, across different ranges segmented by gender. By melting the dataset into a format appropriate for directly displaying counts for CAT scores for each gender, this analysis will describe differences between male and female patients in symptom severity as measured by CAT scores. This is a valuable visualization that explains the difference in the burden of symptoms concerning COPD between the genders.
Figure 11 shows the box plot of the HAD scores will give a complete view of the distribution and variability of anxiety and depression levels among COPD patients. This visualization will emphasize the median quartiles, and potential outliers of HAD scores by providing insight into the general emotional condition of the patients.
Figure 12 shows the distribution plot of HAD scores shows the frequency distribution of the symptoms of anxiety and depression among COPD patients. This plot describes how the HAD scores are spread over different ranges, thus giving a clear view of the prevalence and intensity of emotional distress in the population of patients. The shape of the distribution curve shows whether the symptoms are concentrated in narrow ranges or widely distributed across all levels.
Figure 13 shows the boxplot of HAD scores is an intensive view showing the distribution in the level of anxiety and depression among COPD patients. It shows the median, quartiles, and possible outliers of HAD scores, showing the range in which patients fall in emotional distress. This visualization depicts the variation in the central tendency of anxiety and depression and identifies any extreme values or outliers that may exist.
Figure 14 shows the bar plot, represents the distribution of the HAD score in variation, by gender. This analysis will melt the dataset and plot counts of HAD scores for each gender, showing the distribution of anxiety and depression across the various score ranges for both male and female COPD patients. This visualization depicts the evolution of emotional distress throughout the disease course for both male and female subjects, reflecting any gender-specific patterns in emotional distress-important information.
Figure 15 shows the bar plot depicting the distribution of the severity level of COPD in female gender patients, further segmented by gender, categorized under the different stages of the severity of COPD, namely, mild, moderate, and severe. The plot compares counts of COPD in various levels of severity in females using the custom ”Paired” colour palette. It shows the distribution of different stages of COPD, such as mild, moderate, and severe, within gender to extract information about the manifestations of the disease in female patients.
Figure 16 represents the performance produced during the training of a machine learning model used in COPD prediction research, focusing on accuracy and loss metrics for several epochs. The left graph represents the model’s accuracy as a function of time; the plot shows an increase in how well the model learns to predict COPD severity as it goes through the epochs. This increase in accuracy across the epochs demonstrates that the model is improving its predictions effectively.
Feature selection process
To enhance the predictive accuracy of the COPD forecasting framework, a structured feature selection process was implemented. This step is essential to eliminate irrelevant or redundant variables, reduce overfitting, and improve computational efficiency during model training and inference. Techniques Applied Three widely accepted techniques were employed to extract the most significant predictors from the dataset: \(\bullet\) Mutual Information (MI): MI was used to quantify the dependency between each feature and the target class. Features with higher MI scores were considered more informative, as they contribute more to understanding the target variable. \(\bullet\) Recursive Feature Elimination (RFE): Using Random Forest as the base model, RFE was conducted to systematically remove less influential features while retaining the most relevant ones based on feature importance scores. \(\bullet\) LASSO Regression: LASSO introduces L1 regularization to penalize less important coefficients, promoting a sparse solution. This method proved particularly useful in shrinking the feature space. Based on LASSO, 12 out of 25 features were retained for final model development. Rationale Behind Selection The use of these complementary techniques provided a robust mechanism to validate feature significance from multiple perspectives−statistical correlation, model-based importance, and regularization-based selection. This combination reduced dimensionality without compromising model performance. Ranked Feature Importance Table 5 highlights the top features, ranked by their average importance across MI and RFE methods:
Future scope
The potential future scopes of this research include inculcating advanced AI techniques to improve real-time decision-making, enhancing the security aspect of data, and scaling the system for global health applications. These directions will not only foster better management of COPD but also transform the treatment of chronic diseases in the context of smart healthcare and CPS. Some possible areas of growth and influence might be:
Integration of advanced machine learning and AI techniques
This study can be further extended by embedding deep learning models and reinforcement learning techniques to make this COPD prediction framework even more accurate and performative. Advanced techniques such as NLP on medical records or computer vision on imaging data could be integrated for advancing the model to a new level of predictive power.
Real-time decision support systems
Integration of predictive models into real time decision support systems for healthcare providers can be thought. Given the flow of constant data from wearable devices and hospital records, the system would enable health care professionals to make quick decisions that might warrant interventions before the patient’s condition worsens.
Enhanced data security and privacy protocols
There are also huge amounts of sensitive health data collected by CPS and machine learning systems, and there will be an intense need for data privacy and security. In the future, it may be all about implementing blockchain technology or federated learning models that allow secure collection, sharing, and analysis of patient data while maintaining strict privacy standards. These will comprise a significant portion of efforts towards acquiring trust among patients and institutions toward large-scale deployment.
Conclusion
This research presents a real-time COPD prediction framework that leverages Cyber-Physical Systems (CPS) to support timely medical decision-making. By combining data from both clinical (primary) and verified websites (secondary) sources, the system utilizes Random Forest and Artificial Neural Network (ANN) models to forecast potential exacerbations with improved accuracy. The use of ANOVA for validating and harmonizing diverse datasets enhances the framework’s reliability and robustness before model training. The developed CPS framework demonstrates practical potential for integration into hospital systems, wearable devices, or remote health-monitoring platforms. Its ability to continuously process health data and generate early alerts enables faster clinical responses, thereby reducing hospitalization rates and improving patient outcomes. A notable contribution of this study is the successful fusion of multi-source data with statistical validation to support real-time, predictive healthcare. In the future, this framework can be adapted for other chronic illnesses and integrated with cloud infrastructure for broader deployment across smart healthcare environments.
Data availability
Data will be available on an author request to the corresponding author.
References
Elshewey, A. M. et al. Optimizing hcv disease prediction in egypt: The hyoptgb framework. Diagnostics 13(22), 3439 (2023).
Alzakari, S. A., Alhussan, A. A., Qenawy, A.-S.T. & Elshewey, A. M. Early detection of potato disease using an enhanced convolutional neural network-long short-term memory deep learning model. Potato Res. https://doi.org/10.1007/s11540-024-09760-x (2024).
Khater, H. M. et al. Empowering healthcare with cyber-physical system-a systematic literature review. IEEE Access https://doi.org/10.1109/ACCESS.2024.3407376 (2024).
Kumar, S. & Bhowmik, B. Emergence, evolution, and applications of cyber-physical systems in smart society. in 2024 Fourth International Conference on Advances in Electrical, Computing, Communication and Sustainable Technologies (ICAECT), pp. 1–8, IEEE, (2024).
Alkhammash, E. H. et al. Application of machine learning to predict covid-19 spread via an optimized bpso model. Biomimetics 8(6), 457 (2023).
Prasad, M. G., Choudhury, T., Kotecha, K., Jain, D. & Davanageri, A. M. A novel framework of smart security system based on machine learning techniques. in Intelligent and Fuzzy Systems: Intelligent Industrial Informatics and Efficient Networks Proceedings of the INFUS 2024 Conference, Volume 3, p. 133, Springer Nature.
Roy, A. & Satija, U. A novel multi-head self-organized operational neural network architecture for chronic obstructive pulmonary disease detection using lung sounds. IEEE/ACM Trans. Audio Speech Language Process. https://doi.org/10.1109/TASLP.2024.3393743 (2024).
Narducci, M. & Aya, J. M. Design of a wearable device for respiratory rate monitoring and detection of anomalous values. IEEE Lat. Am. Trans. 21(3), 513–518 (2023).
Rahim, M., Lalouani, W., Toubal, E. & Emokpae, L. A digital twin-based platform for medical cyber-physical systems. IEEE Access https://doi.org/10.1109/ACCESS.2024.3502077 (2024).
Zhang, H. et al. Event-trigger-based resilient distributed energy management against FDI and dos attack of cyber-physical system of smart grid. IEEE Trans. Syst. Man Cybern. Syst. 54(5), 3220–3230 (2024).
Liang, X. et al. Magnetic microrobots fabricated by photopolymerization and assembly. Cyborg Bionic Syst. 4, 0060 (2023).
Liang, X. et al. Field-controlled microrobots fabricated by photopolymerization. Cyborg Bionic Syst. 4, 0009 (2023).
Zhang, M., Wei, E., Berry, R. & Huang, J. Age-dependent differential privacy. IEEE Trans. Inf. Theory 70(2), 1300–1319 (2023).
Li, D. & Xing, W. A comparative study on sustainable development of online education platforms at home and abroad since the twenty-first century based on big data analysis. Educ. Inf. Technol. https://doi.org/10.1007/s10639-025-13400-3 (2025).
Huang, F. et al. Large language model simulator for cold-start recommendation. in Proceedings of the Eighteenth ACM International Conference on Web Search and Data Mining, pp. 261–270, (2025).
Author information
Authors and Affiliations
Contributions
Navneet Kumar Rajpoot, Prabh Deep Singh, and Bhaskar Pant have contributed to this research work. Navneet Kumar Rajpoot completed the data collection, writing the manuscript, Prabh Deep Singh & Bhaskar Pant contributed significantly to reviewing the manuscript. All authors reviewed, revised, and approved the final manuscript.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Rajpoot, N.K., Singh, P.D. & Pant, B. A novel framework for COPD management in cyber-physical systems using machine learning. Sci Rep 15, 36517 (2025). https://doi.org/10.1038/s41598-025-08932-0
Received:
Accepted:
Published:
Version of record:
DOI: https://doi.org/10.1038/s41598-025-08932-0





















