Introduction

Asthma attacks are the cause of more than 25 deaths per week on average in the UK1. If a patient contacts a health professional promptly following a deterioration in symptoms, short courses of systemic steroids can be prescribed on top of preventative therapy to relieve exacerbations and reduce the need for (transfer to or continuation in) emergency care2,3. Most asthma attacks occur in those who would be classed as having mild-to-moderate asthma4, due to the volume of such patients: only 5–10% of those with asthma are typically classed as having severe asthma5. Many patients in danger of life-threatening deterioration are unsure of how to handle emergency situations6, as highlighted by the 2014 UK National Review of Asthma Deaths, which reported that only 55% of people who died from asthma had called for or received any medical assistance after their attack began7.

Consultations with GPs and asthma nurses in primary care present a key opportunity to jointly evaluate asthma attack risk and to outline plans for action in the event of symptom deterioration8, as recommended by national asthma guidelines9,10. A personalised health promotion tool may be able to promote risk-reducing lifestyle choices, instigate revisions to asthma action plans, improve patient engagement with self-management protocols, and reduce patient anxiety11,12,13. Personalised guidance has been demonstrated to have a greater impact on disease prevention than generic documents8.

Machine learning algorithms can identify complex patterns in data from patients' histories and symptoms that are associated with increased risk of severe outcomes. They often require high volumes of data to make such inferences without guidance from a clinical expert, which makes their application to the wealth of information recorded in routinely collected Electronic Health Records (EHRs) a promising pathway for research14,15. However, even in large national datasets, the infrequency of asthma attacks relative to the prevalence of asthma means that it can be challenging to identify potentially causal relationships that may lead to increased probability of an asthma attack. Furthermore, the imbalance between positive (attack) and negative (no attack) samples in the data poses practical challenges for developing predictive models16,17,18. Consequently, many prediction models report low (e.g., below 50%) sensitivity (the proportion of people who had asthma attacks who were predicted to be at high risk) or positive predictive value (the incidence of asthma attacks among people with high predicted risk)19.

In this paper, we utilised the wealth of data recorded in EHRs and tested multiple methodologies for overcoming this data imbalance problem, to develop, validate, and test a model for predicting whether an asthma attack will occur within one year of a primary care asthma-related consultation.

Methods

Data

The Asthma Learning Healthcare System (ALHS) study recruited over half a million patients from 75 general practices in Scotland, with primary care records linked to national accident and emergency (A&E), hospital, and mortality datasets20. The original study period was between January 2009 and March 2017. The initial data processing report is provided in Supplementary Material A. The linked analysis dataset flow diagram is presented in Fig. 1.

Fig. 1: Linked Analysis Dataset Flow Diagram.

Analysis population

The analysis population for this study was adults (aged 18 and over) diagnosed with asthma in primary care and treated with inhaled corticosteroids (ICS)21. Patients were excluded if age or sex was missing from their primary care registration record.

Each sample in the final analysis dataset was a day on which a primary care consultation related to asthma or a respiratory infection occurred, without an oral corticosteroid (OCS) prescription or secondary care asthma encounter. Therefore, we additionally excluded individuals who had no such event during their follow-up (as defined in Supplementary Material A).

Finally, those with a diagnosis of Chronic Obstructive Pulmonary Disease (COPD) were identified, and the time between first asthma diagnosis and first COPD diagnosis was estimated. A diagnosis of COPD prior to a diagnosis of asthma excluded patients from the primary analyses; however, they were retained for a sensitivity analysis (model testing only, no data included in model training). Similarly, for those with a diagnosis of COPD following their asthma diagnosis, the time (and any samples) after their COPD diagnosis was excluded from model training but retained for sensitivity analysis (model testing only).

Outcome ascertainment

The model's outcome was asthma attacks occurring within one year from the index date of each sample. The joint American Thoracic Society (ATS) and European Respiratory Society (ERS) Task Force definition of a severe exacerbation22 was used to define an asthma attack: a prescription of OCS, an asthma-related A&E visit, or an asthma-related hospital admission (ICD-10 codes J45 and J46). In addition, deaths with asthma as the primary cause were considered indicative of an asthma attack. The identification of asthma-related A&E presentations, inpatient admissions, and deaths is described in Supplementary Material A.

Prescriptions for OCS were considered indicative of an asthma attack if all of the following conditions were also met: 1) they were prescribed to someone with a diagnosis of asthma or receiving asthma treatment, 2) they were prescribed on the same day as an asthma-related consultation, 3) the prescribed strength was greater than or equal to 5 mg per dose, and 4) the total prescribed dose was between 50 and 350 mg.
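This rule is a simple conjunction of the four criteria. The following is a minimal sketch in Python/pandas of how such a filter might be expressed; the column names are hypothetical, and this is not the study's actual extraction code.

```python
import pandas as pd

def flag_attack_indicative_ocs(rx: pd.DataFrame) -> pd.Series:
    """Flag OCS prescriptions meeting all four attack-indication criteria.

    Expects one row per OCS prescription with (hypothetical) columns:
      has_asthma_dx_or_treatment : bool  - criterion 1
      same_day_asthma_consult    : bool  - criterion 2
      dose_mg                    : float - per-dose strength (criterion 3)
      total_dose_mg              : float - total prescribed dose (criterion 4)
    """
    return (
        rx["has_asthma_dx_or_treatment"]
        & rx["same_day_asthma_consult"]
        & (rx["dose_mg"] >= 5)
        & rx["total_dose_mg"].between(50, 350)
    )
```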

Prediction model features

Supplementary Material B describes the full set of risk factors that were included in the analysis, and notes regarding the feature extraction method and missing data handling.

Training data enrichment

As described in the introduction, the low incidence of asthma attacks in the general asthma population complicates model development and can often result in poor model sensitivity. Herein, we tested the utility of the training data enrichment method known as SMOTEing23,24, with three different parameter sets as described in Supplementary Material C.
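For illustration, SMOTE oversamples the minority (attack) class by interpolating between nearest-neighbour minority samples. Below is a minimal sketch using the imbalanced-learn package; the parameters shown are illustrative rather than the three parameter sets tested (Supplementary Material C), and X_train/y_train are assumed feature and label arrays.

```python
from imblearn.over_sampling import SMOTE

# Generate synthetic minority-class (attack) samples by interpolating
# between each minority sample and its nearest minority neighbours.
# sampling_strategy and k_neighbors here are illustrative only.
smote = SMOTE(sampling_strategy=0.5, k_neighbors=5, random_state=0)
X_enriched, y_enriched = smote.fit_resample(X_train, y_train)
```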

Analysis plan

In this analysis, a random partition approach was used to split the population for model training and testing, as shown in Fig. 2. A random 90% partition of the ALHS dataset population was used for hyper-parameter optimisation, model selection, and initial performance reporting (henceforth the derivation subset, n = 523,611 samples), and the remaining 10% was held out for assessing model generalisation (hold-out testing subset, n = 63,331 samples). During the model selection process, the derivation dataset was randomly partitioned a further 100 times: in each iteration, 90% of the data (approximately 471,000 samples) was used to train each candidate combination of algorithm, hyper-parameters, and enrichment method, and the remaining 10% (approximately 52,000 samples) was used for internal validation.
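The partitioning scheme can be sketched as follows, assuming scikit-learn; because no individual appears in both the training and testing partitions (see Strengths and limitations), splitting is grouped by a patient identifier. The names X, y, and patient_ids are illustrative.

```python
from sklearn.model_selection import GroupShuffleSplit

# Outer 90/10 split by patient, so no individual contributes samples
# to both the derivation and hold-out testing subsets.
outer = GroupShuffleSplit(n_splits=1, test_size=0.10, random_state=0)
derive_idx, holdout_idx = next(outer.split(X, y, groups=patient_ids))

# Inner loop: 100 further 90/10 splits of the derivation subset for
# model selection and internal validation.
inner = GroupShuffleSplit(n_splits=100, test_size=0.10, random_state=1)
for train_idx, valid_idx in inner.split(
        X[derive_idx], y[derive_idx], groups=patient_ids[derive_idx]):
    pass  # fit each candidate model on train_idx, evaluate on valid_idx
```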

Fig. 2: Diagram of Dataset Partitioning for Model Training, Model Selection, Internal Validation Performance Reporting, and Testing Partition Performance Reporting.

Four algorithms were tested, namely Naïve Bayes Classifier, Logistic Regression, Random Forests, and Extreme Gradient Boosting; these included variations based on training data enrichment approaches, algorithm hyper-parameters, and classification thresholds. The full model selection process is detailed in Supplementary Material C.
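For concreteness, the four candidate algorithms could be instantiated as below; this sketch assumes scikit-learn and the xgboost Python package, and omits the hyper-parameter grids (Supplementary Material C).

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

# The four algorithm families compared; hyper-parameter values shown
# are library defaults or illustrative, not those of the final model.
candidates = {
    "naive_bayes": GaussianNB(),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=500),
    "xgboost": XGBClassifier(eval_metric="logloss"),
}
```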

For each iteration, model, and enrichment method, the Area Under the Receiver Operating Characteristic curve (AUC) was calculated. The confusion matrix was also recorded, based on the primary classification threshold that optimised the Matthews Correlation Coefficient (MCC; identified using golden-section search optimisation25) in predictions made on the training data partition. The MCC was used as the primary performance measure as it incorporates all four categories of the confusion matrix (i.e., true positives, true negatives, false positives, and false negatives). The other performance measures reported were: sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), accuracy, and balanced accuracy. Across iterations, summary statistics were calculated for each performance measure to provide an estimate of the average performance and the certainty around that estimate.
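The MCC-optimal threshold search can be written as a self-contained golden-section search, as sketched below (assuming scikit-learn; y_train and p_train are illustrative names for the observed outcomes and predicted probabilities on the training partition). Note that the MCC is a step function of the threshold, so the search returns an approximate optimum.

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef

def mcc_at(threshold, y_true, p_pred):
    """MCC of the rule: predict an attack when p_pred >= threshold."""
    return matthews_corrcoef(y_true, (p_pred >= threshold).astype(int))

def golden_section_max(f, lo=0.0, hi=1.0, tol=1e-3):
    """Golden-section search for a maximiser of f on [lo, hi]."""
    invphi = (np.sqrt(5) - 1) / 2  # 1/phi, ~0.618
    a, b = lo, hi
    c, d = b - invphi * (b - a), a + invphi * (b - a)
    while b - a > tol:
        if f(c) > f(d):    # maximum lies in [a, d]
            b, d = d, c
            c = b - invphi * (b - a)
        else:              # maximum lies in [c, b]
            a, c = c, d
            d = a + invphi * (b - a)
    return (a + b) / 2

best_threshold = golden_section_max(lambda t: mcc_at(t, y_train, p_train))
```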

As a sensitivity analysis of the classification threshold selection, a further three approaches were explored: 1) the default classification threshold of probability greater than 0.5 ('Fixed'), 2) the classification threshold closest to the outcome prevalence in the training data partition, to 3 decimal places ('Prevalence'), and 3) the mean of the MCC-optimised ('Variable') and 'Prevalence' values ('Balanced').
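Continuing the sketch above, the three alternative thresholds could be computed as follows (best_threshold is the MCC-optimised 'Variable' threshold from the previous sketch; y_train is illustrative).

```python
import numpy as np

fixed = 0.5                                     # 'Fixed': default p > 0.5 rule
prevalence = round(float(np.mean(y_train)), 3)  # 'Prevalence': outcome rate, 3 d.p.
balanced = (best_threshold + prevalence) / 2    # 'Balanced': mean of the two
```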

The model performance was summarised over the 100 iterations of the split-sample process. Calibration was assessed using the slope and intercept of a logistic regression model between the predicted risk and the observed outcome. Model coefficients were calculated across the models trained from the 100 iterations of derivation data partitioning.
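A minimal sketch of the calibration regression, assuming statsmodels, is given below; one common formulation regresses the observed outcome on the log-odds of predicted risk, with an intercept near 0 and a slope near 1 indicating good calibration. The names p_valid and y_valid are illustrative.

```python
import numpy as np
import statsmodels.api as sm

# Log-odds of the predicted risks (clipped away from 0 and 1 for
# numerical stability).
p = np.clip(p_valid, 1e-6, 1 - 1e-6)
logit_p = np.log(p / (1 - p))

# Logistic regression of the observed outcome on the predicted log-odds.
fit = sm.Logit(y_valid, sm.add_constant(logit_p)).fit(disp=0)
intercept, slope = fit.params
```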

Finally, the model was retrained on the full derivation dataset and tested on the as-yet unseen holdout partition. The performance was also reported with data stratified by various risk factors. These were: (i) history of other comorbid chronic pulmonary disease, (ii) BTS Step (to evaluate the decision to assign the level '0' to periods of non-adherence), (iii) missingness of peak flow and blood eosinophil measurements (to evaluate their added value), (iv) smoking status (to evaluate the utility of assigning the level 'never' having smoked to those with missing smoking status), (v) recent respiratory infections, oral steroid prescriptions, and prior known asthma attacks (to establish the utility of the model in predicting those not known to be prone to attacks), and (vi) if or when in the future the patient was diagnosed with COPD. In the latter case, this information obviously could not be known at the time of prediction; however, we explored it to examine the potential impact of overlapping diagnoses, which might be considered in individuals at high risk of COPD.

Reporting

Deviations from the protocol paper, published in BMJ Open in 201926, are listed in Supplementary Material D.

This work was written in line with guidance from RiGoR (Reporting Guidelines to address common sources of bias in Risk model development, by Kerr et al.27), TRIPOD (Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis, by Collins et al.28), and RECORD (Reporting of studies Conducted using Observational Routinely-collected health Data, by Benchimol et al.29). Checklists are presented in Supplementary Material E.

Results

Analysis population

There were 22,063 patients included in the study, with 723,762 samples, as illustrated in Fig. 3. There were 19,125 individuals in the training data partition, with a total of 584,288 samples (eligible consultations) spanning 115,282.5 person-years (median 7.0 years per person, interquartile range (IQR) 5.1 to 7.4, and range <0.1 to 7.5 years). A further 19,283 samples were excluded because they occurred after a diagnosis of COPD had been made.

Fig. 3: Asthma attack risk prediction model analysis population flow diagram.

There were 2125 individuals in the testing data partition, with a total of 65,985 samples spanning 12,919.0 person-years (median 7.0 years per person, interquartile range 5.2 to 7.4, and range 0.1 to 7.6 years).

Finally, there were 871 individuals in the post-COPD diagnosis testing data partition (sensitivity analyses). Of these, 813 were individuals who had no samples prior to their subsequent COPD diagnosis; post-COPD diagnosis samples from 58 individuals in the testing data partition were also included. There were a total of 54,206 samples spanning 5030.0 person-years (median 7.0 years per person, interquartile range 4.3 to 7.4, and range 0.3 to 7.6 years). The demographics of each data partition at baseline are presented in Table 1. Supplementary Material F shows the demographics (and all other characteristics) by sample in each partition.

Table 1 Demographics of the ALHS analysis population.

There was a median of 34 days between consecutive consultations in the training data partition (IQR 18 to 60 days), and the incidence of asthma attacks in the year after consultation was 8.0% (n = 46,921/584,288).

Model testing

The results of the model selection process are reported in Supplementary Material G. The final selected model used the logistic regression algorithm, with no training data enrichment, and the ‘balanced’ threshold for classification.

In Table 2, the summary statistics of a selection of model performance measures across the 100 iterations of derivation data partitioning are presented.

Table 2 Summary statistics of model performance measures from 100 internal validation data partitions, and the hold-out data partition.

The final threshold used for classification in the holdout partition was 0.199: the median across the 100 derivation partition iterations (range 0.133–0.234, interquartile range 0.174–0.216). As shown in the final column of Table 2, the performance in the holdout partition, following a full retrain of the model on the entire derivation dataset, was consistent with the range observed in the derivation dataset. This internal validation demonstrated the stability of the model performance within these data to perturbations of the sample set and confirmed that the crossover in samples between the model selection and performance reporting subsets did not bias the results.

Around 1 in 3 (PPV = 36.2%) predicted high-risk patients had an attack within one year of consultation, compared to approximately 1 in 16 in the predicted low-risk group (6.7%). The Receiver Operating Characteristic (ROC) curve is presented in Fig. 4. The confusion matrix in the holdout partition is presented in Table 3. The model had good calibration overall, with logistic regression between the predicted risk and observed outcome yielding a calibration intercept of 0.004 and slope of 1.02.

Fig. 4: Receiver Operating Characteristic (ROC) Curve.

Table 3 Confusion matrix for model performance in holdout partition.

Model coefficients

In the final model, a history of asthma attacks recorded in primary care was strongly associated with risk of attacks in the next year: having an attack between 1 and 2 years before the index date was associated with 1.959 times higher odds of a future attack than no record of an attack in the last two years, and more recent historical attacks increased the odds further (Supplementary Material H). A recent history of respiratory infections was also associated with higher odds of attacks, with an infection 1 to 2 years prior to the index date associated with an odds ratio of 1.494 compared to none in the past 2 years (and increasing odds with greater recency). Results were highly consistent between the models trained in iterations of internal testing and the final model tested on the holdout partition.

The odds ratios for several comorbidities in the time window of 'longer than 5 years ago' were significantly lower than the reference categories of 'never recorded', which may be indicative of an artefact. As a sensitivity analysis, these features were recoded to 'longer than 5 years ago or never recorded'. For anxiety/depression there was a trend for more recent records to be associated with higher risk; however, the results were not consistent across other comorbidities (data not shown). Further investigation may be warranted.

Discrimination and calibration in population subgroups

Sensitivity was markedly higher in those with a history of comorbid chronic pulmonary disease (excluding asthma and COPD) than those without: 85.9% versus 29.4% (Table 4). Sensitivity was very high in those with multiple respiratory infections in either the current or previous calendar year (80.8%), those with multiple OCS prescriptions in either the current or previous calendar year (100%), and those with an asthma attack recorded in last 2 years (70.3%). However, the specificity for each of these subgroups was also lower: 52.2, 6.1, and 61.4%, respectively. The calibration in most of the population subgroups was good (plots presented in Supplementary Material I), except for those with recent oral steroid treatment or another concurrent chronic pulmonary disease.

Table 4 Subgroup Analyses.

The model had lower sensitivity and PPV in those diagnosed with COPD more than 5 years from their observation date (22.8%), compared to those with no COPD recorded, or those with COPD recorded within 5 years of the observation date (30.4 and 36.6%, respectively). Those with COPD diagnosed more than five years after the observation date had poor calibration (Supplementary Material I).

Discussion

Summary of results

Our selected model used the logistic regression algorithm with no training data enrichment, and had an AUC of 0.75. The specificity and negative predictive value were high (95% and 93%), highlighting the model's strength at accurately identifying a large proportion of patients at low risk. Around 1 in 3 predicted high-risk patients had an attack within one year of consultation (PPV = 36.2%), compared to approximately 1 in 16 in the predicted low-risk group, and there was good calibration (slope of 1.02). However, the model's sensitivity was poor (30.1%), so most asthma attacks were not predicted in advance. We demonstrated the effect of various classification thresholds in dictating the balance between minimising false negatives (missed opportunities for intervention) and false positives (inefficient resource allocation).

Results in context

Comparing model performance across studies is not straightforward due to differences in outcome definitions, populations, and even simply performance reporting, as highlighted in our recent systematic review19. Several studies in this review did not report both the sensitivity and PPV of their models, or provide the confusion matrix from which they could be calculated. One study which did report their model thoroughly, and achieved strong performance, was that developed by Inselman et al. in the USA30. Their model predicted asthma attacks in the six months following discontinuation of biologic therapy and achieved 81% sensitivity and 84% PPV. However, that study focused on a highly selective population: biologics are often a highly effective and safe alternative to oral steroid treatment, but their high cost means that they are typically reserved for those with an extensive history of previous attacks and high ongoing risk31,32. Our study was also able to achieve high sensitivity (100%) and PPV (49%) in those with oral corticosteroid prescriptions in the last year; however, the specificity was very poor (6%). Furthermore, we posit that while identifying high-risk patients without a history of attacks has proved to be a much more challenging task, these patients may benefit most from health education interventions.

The great potential of EHRs for conducting large-scale, resource-efficient, observational research has been well discussed33,34. However, the question remains whether EHRs, as they are currently generated, are a viable source of data for predictions about individual patients, particularly when it comes to making healthcare decisions on their basis. Firstly, the recording (in coded data) of key features such as diagnoses of comorbidities may be poor, particularly in practices where the coding is often conducted by a non-clinical member of staff35. It is likely that there is a wealth of useful data already being captured in primary care but stored in free-text clinical notes36, which are rarely available for research due to re-identification risk. Many patients with asthma may have a scarcity of coded historical data: in our dataset, 87% of samples were preceded by at most one asthma-related consultation in the previous year, and in only 2% of samples was a peak flow recording from the previous two weeks available. In the future, linkage between primary care records and smart devices or other patient-sourced data may facilitate the leveraging of more regular data for patients who are willing and able to share.

Furthermore, coded primary care diagnoses are an evolving phenomenon, with hypotheses of suspected asthma being examined through formal spirometry tests and observation of outcomes following treatment37,38. Asthma diagnosis is a particularly difficult task, due to the heterogeneity of clinical presentation39, and the range of conditions with similar symptoms, such as COPD. As such, many previous studies have specifically excluded patients with COPD19. In this study, we conducted sensitivity analyses on those with COPD diagnoses recorded at a later date (although the model was not trained on samples from people with a COPD diagnosis at that time) in order to determine whether this was associated with model performance. We found that there was no substantial difference in performance for those with COPD diagnosis recorded within five years of the observation index date, compared to those without COPD ever recorded, however the model had lower sensitivity in those with COPD diagnosis recorded more than five years from their observation date (23% versus 37%). The validity, generalisability, and interpretation of this finding is therefore unclear.

In our analysis, the logistic regression model outperformed the more complex, non-linear algorithms such as the random forest. This may indicate that there were no substantial interactions between model features, and it also means that the model is much easier to interpret. Further work is warranted to determine whether the model can be improved by removing any of the features, or through the addition of expert-determined interaction terms.

Strengths and limitations

This study was able to leverage the wealth of longitudinal data from a large, uncontrolled, representative population, covering the whole geography of Scotland20. We also used an expert-driven guideline-based operational definition of our outcome22, which ensures that it is aligned well with current clinical practice as well as other research studies. We tested four algorithms (Naïve Bayes Classifier, Logistic Regression, Random Forests, and Extreme Gradient Boosting), and a selection of hyperparameters, resulting in twelve different models. We also took care to avoid leakage in the model development: the use of data in model training that would not be available in the deployment setting, which overestimates the model’s predictive performance40. Specifically, there was no crossover of patients between the training and testing partitions, feature scaling in both the training and testing partitions was based on the values observed in the training data only, and linked data were only used for outcomes and not for the features for prediction.

The main limitation of this study is that the model has very poor sensitivity (4%) in those without any history of asthma attacks in the past two years. However, the PPV in this subgroup was 29%, which demonstrates that the model is still able to detect some individuals who might otherwise be considered low risk. Future research should consider explicitly training a model in those with no recent history of attacks, for whom the incidence rate will be very low but the potential for impact especially large.

Overall, the population definition may have been a hindrance to the model's predictive ability. This study employed limited exclusion criteria, meaning that the model would in principle be applicable to those diagnosed with asthma across the spectrum of severity. However, the diversity of the population may have presented too large a challenge for the models to learn from. For example, if certain risk factors were only pertinent to specific asthma phenotypes (e.g. seasonality for those with allergic asthma)41, the model might not be able to detect this pattern due to limitations in sample size, or model parameters such as the depth or number of trees.

While including multiple samples per individual, throughout their years of registration at their general practice, allows us to capture some of the variation in an individual's risk over time, it also means that some population subgroups will be over-represented in the data. For example, if smokers have more consultations than non-smokers, they will contribute more weight to the model than if there were a single sample per person. Furthermore, the performance measures in this analysis are calculated per sample rather than per person, and as such the overall performance is skewed towards that of those with a larger number of encounters. As shown in Table 4, only 13% of samples were from those with multiple encounters in the previous year, but these individuals also had a higher attack rate (30% vs 4%). The sensitivity and PPV were higher in this sub-cohort (61% vs 18%, and 42% vs 30%, respectively).

Conclusions

Our model achieved good calibration between predicted risk of an asthma attack in the year following primary care encounter for asthma or respiratory infection and observed rates of attacks. Our implemented binary classification rule flagged approximately 1 in 10 patients as being high risk, and in this group approximately 1 in 3 had asthma attacks, compared to approximately 1 in 16 in the low-risk group. We demonstrated the effect of various model specifications on the balance between minimising false negatives (missed opportunities for intervention) and false positives (inefficient resource allocation), which allow the model to be tailored according to the desired clinical utility. Building on this analysis, there is a need for user-centred research to explore optimal ways of presenting this information to clinicians so it can be incorporated into routine care. There is also the need to identify additional data sources (e.g., pollen, pollution and weather data) that could potentially be incorporated into future iterations of our risk prediction algorithm.

Key messages

What is already known on this topic:

  • Clinical risk prediction models are increasingly proposed as a solution to improve the efficiency and equality of healthcare. The high prevalence and heterogeneity of asthma, combined with the low incidence of serious outcomes, often results in suboptimal self-management and missed opportunities for clinical intervention. Integration of a model developed on routinely collected data into a primary care clinical decision support tool may be the most appropriate route; however, risk prediction in this setting remains a challenge.

What this study adds:

  • Our model was able to identify low-risk patients, with a specificity of 95% and a negative predictive value of 93%. The high-risk patient group was harder to identify: the model only achieved 30% sensitivity and 36% positive predictive value; however, it had good calibration (slope of 1.02) and an Area Under the Curve of 0.75. The model can be adjusted to best fit clinical need, such as adjusting the classification threshold relative to the estimated misclassification costs of a given intervention, focussing on calibration, or filtering to a narrower asthma population, depending on the desired clinical utility.

How this study might affect research, practice or policy:

  • This study examines the discrimination and calibration of a prediction model for asthma attacks over a 1-year horizon, in a diverse population in a real-world setting. By examining a range of methodological approaches, we highlight possible modifications to the model design to assist with specific clinical tasks, such as health education and efficient resource utilisation.