Introduction

Health-related quality of life (HRQoL) is a multi-dimensional assessment of how disease affects a patient’s health status and perceived well-being1. Commonly used self-report questionnaires can predict patient health outcomes. Evidence from the past two decades suggests that HRQoL is associated with overall morbidity and mortality1. Reduced HRQoL has shown a significant association with coronary heart disease and stroke2, cancer incidence3, long-term outcomes and all-cause mortality among cancer patients4,5, amputation, hospitalization, and mortality in individuals with diabetes6,7,8, disease burden and survival among cystic fibrosis patients9, as well as mortality among community-dwelling populations and older adults10. HRQoL has also demonstrated significant associations with service utilization and healthcare costs11,12. Multiple QoL screening tools have been developed, though methodological and implementation barriers impede their integration into community-based practice workflows.

The Quality of Life General 10-question Survey (QGEN10)13 and Quality of Life Disease Impact Scale (QDIS)14 surveys together form a nationally validated quality of life measure. These measures have demonstrated good psychometric properties and predictive value among chronically ill populations but are not commonly used in primary care (PC). Scoring formulas are not publicly available; therefore, assessment requires prospective data collection and transfer of raw data to the intellectual owner of the survey for scoring. Scored data is returned to the collection site as a series of 15+ norm-based and transformed variables representing aspects of patient mental and physical health. These profiles are not easily translatable as guides to clinical decision-making. Risk scores are returned as stand-alone spreadsheets and are not integrated into clinic systems, which impedes providers’ on-the-ground use. To facilitate meaningful use of QoL screening, clinicians require QoL data that are easy to collect, interpret, and integrate into workflows.

There is movement towards integrating multiple data sources, such as electronic health record (eHR) and social determinants of health data, into QoL risk prediction15,16. Race, ethnicity, age, smoking status, BMI, household size, and income are significantly associated with HRQoL, though current screeners seldom leverage such demographic or other social determinant data17,18,19,20. Since practice sites routinely collect and house this information in their eHR, incorporating demographic factors into risk prediction algorithms may improve the accuracy of HRQoL risk estimates without additional prospective data collection.

Collaboration with PC providers to date has demonstrated a central need for a single risk score that is easy to access and interpret. Supervised machine learning may be a means to generate such a score. Preliminary evidence using machine learning (ML) has shown promise in predicting patients at risk for diminished QoL in some medical specialties21,22,23, but to our knowledge, ML methodologies have not yet been applied to PC populations. Leveraging this methodology to identify and combine the most predictive elements of previously validated measures can minimize patient burden by lowering the requisite number of patient-reported items needed to generate an accurate risk estimate. Simultaneously, one can incorporate potentially predictive patient factors extracted from the eHR to improve accuracy. By reducing the number of items and offering a set of straightforward, transparent decision support guidelines informed by ML, clinicians can build predictive data into routine clinical workflows at the point of service.

Objectives

  1. Develop and evaluate a novel ML algorithm to predict QoL risk level (high/low) among PC patients based on a unique set of demographic factors and previously validated patient self-report quality of life screening questions.

  2. Combine the most predictive elements of existing measures with demographic and SDoH data retrieved from the patient eHR to generate a single, straightforward determination of mental or physical risk while reducing the required number of prospectively collected questions.

Methods

Overview and setting

Initial algorithms were developed using a convenience sample of PC patients from a single pilot site in Western Colorado. This medium-sized, community-based practice serves a diverse population of approximately 2500 pediatric and adult patients each year. Project work was deemed exempt human subjects research, and the Colorado Multiple Institutional Review Board (COMIRB) waived the need to obtain informed consent (protocol #20-2833). All methods were performed in accordance with the relevant guidelines and regulations.

Materials and data collection

The QGEN10 and QDIS screeners are the product of 30+ years of development and refinement stemming from the Short Form 36 (SF-36), created by the RAND Corporation in 1992 as a 36-item survey to measure patient quality of life24. Over three decades, the SF-36 was continuously validated and modified from its original format by the John Ware Research Group25. Survey improvements have reduced the number of Likert-type questions required to generate an accurate risk estimate, and the addition of a chronic condition checklist has improved the validity of risk estimates26. An extensive literature details the development and evolution of the SF-36; for a comprehensive review, we recommend the 2000 article by John Ware titled SF-36 Health Survey Update25.

The SF-36 has become one of the most commonly used generic measures of health27,28. The resulting QGEN10 survey contains ten questions that measure eight health-related quality of life domains. Domain scores for physical functioning, physical role, bodily pain, and general health are combined to produce a single composite score for physical health. Likewise, domain scores for vitality, social functioning, emotional role, and mental health are combined to produce a single mental health composite score. Composite scores are norm-based using population data collected over many years. QGEN10 composite-scored, patient-level risk estimates for physical and mental health were used as ground truth to train new ML algorithms in this study.

The Quality of Life Disease Impact Scale (QDIS) is a 15-question disease-specific survey in which patients identify their chronic conditions and rate each condition’s impact on their QoL. Recent evidence has suggested that a count of chronic conditions combined with their relative impact increases the survey’s ability to discriminate among clinically defined risk groups29. Patient responses are aggregated and weighted to generate a set of standardized scores that can be directly compared, though exact weighting calculations are not publicly available.

The QDIS allows investigators to choose target conditions for inclusion. However, based on previous John Ware Research Group studies, certain disease groups have shown more or less discriminatory capability26. For the current study, investigators chose to include 15 of the most common chronic conditions affecting US populations as published by the Centers for Medicare & Medicaid Services (CMS)30: hypertension (HTN), coronary artery disease (CAD), congestive heart failure (CHF), myocardial infarction (MI), headaches (HO), cerebrovascular accident (CVA), respiratory disease (RES), ulcerative colitis (UC), osteoarthritis upper extremity (OAU), osteoarthritis lower extremity (OAL), osteoarthritis other (OST), sciatica (SCA), diabetes mellitus (DM), major depressive disorder (MDD), and cancer (CA).

All questions from the QGEN10 and QDIS were loaded onto an electronic tablet and provided to patients upon check-in for their clinic visit between February 2021 and January 2023. See Table 1 for QGEN10 questions and response options, which are reproduced with permission from the intellectual owner. For each chronic condition in our checklist, patients were asked to respond yes/no to the following question: “Have you ever been told by a health care professional that you have (X condition)?”. Respondents were then asked to rate each condition’s impact on their quality of life over the last 4 weeks; response options for the impact question ranged on a Likert scale from “not at all” to “extremely”. For most participants, the entire survey took less than 2 min to complete.

Table 1 QGEN10 Abbreviated item stem and response category content.

Participant responses for QGEN10 and QDIS questions were batch-routed to the survey scoring hub to calculate each patient’s mental and physical composite scores. Higher scores represent better quality of life, and lower scores represent poorer quality of life. Composite-scored data was returned to the project team and used as a known outcome to train novel machine learning algorithms.

The following demographic variables were retrospectively extracted from the site eHR for each respondent: date of birth, race, ethnicity, sex, BMI, income, education level, housing status, insurance status, insurance name, evidence of tobacco use, number of distinct providers seen, number of office visits, and number of Emergency Department (ED) visits in the 12 months before the QGEN/QDIS survey date. A random number generator was used to create a unique patient identifier for each respondent to protect patient identity. All analyses for this study were conducted in RStudio using R version 4.1.231.
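A minimal sketch of what the de-identification step might look like in R follows; all object and column names are hypothetical, not the project’s actual code.

```r
# Hypothetical sketch: replace the direct identifier with a randomly
# generated study ID before analysis ('patients' and 'mrn' are invented names).
set.seed(2021)
patients <- data.frame(mrn = c("A001", "A002", "A003"))    # toy eHR extract
patients$study_id <- sample(100000:999999, nrow(patients)) # unique random IDs
patients$mrn <- NULL                                       # drop the identifier
patients
```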

Exploratory data analysis

Descriptive statistics and estimates of missingness were generated for each demographic variable as well as the raw QGEN and QDIS questions. Due to the nature of eHR-based data entry, several demographic variables contained numerous, sparsely populated response categories, which were collapsed into meaningful categories before modeling. Race and ethnicity variables were collapsed using federal classification standards published by the Department of the Interior32.
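For illustration, collapsing sparse categories in base R might look like the following sketch; the raw labels are invented, not the site’s actual eHR values or the federal categories in full.

```r
# Illustrative collapse of sparse eHR response categories into analysis groups.
race_raw <- c("White", "Caucasian", "Black", "African American", "Asian", "Declined")
race <- factor(race_raw)
levels(race) <- list(                       # map many raw labels to few groups
  "White"                     = c("White", "Caucasian"),
  "Black or African American" = c("Black", "African American"),
  "Asian"                     = "Asian",
  "Other/Unknown"             = "Declined"
)
table(race)                                 # verify the collapsed distribution
```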

Classification algorithms

Norm-based mental and physical composite scores serving as the ground truth were collapsed into a binary (high/low) risk structure as follows: composite scores below the population mean minus one standard deviation were classified as high-risk, and the remainder of observations were classified as low-risk. This approach placed 15–20% of the sample into the high-risk group, which our clinical partners advised is a realistic percentage of the patient panel on which they could reasonably intervene. Separate models were trained to predict mental and physical risk.
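As a concrete illustration of this rule, consider the minimal R sketch below; the numeric values are invented, and the study used the survey owner’s norm-based composite scores.

```r
# Dichotomize composite scores: more than one standard deviation below the
# population mean = high-risk, all others = low-risk (toy values shown).
pop_mean  <- 50                           # assumed population norm
pop_sd    <- 7.3                          # assumed population SD
composite <- c(55.1, 48.3, 41.2, 36.8)    # toy composite scores
risk <- ifelse(composite < pop_mean - pop_sd, "high", "low")
risk  # "low" "low" "high" "high"
```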

Tree-based supervised machine learning techniques were used to develop prediction models by training on a dataset where the outcome is known and labeled with a pre-defined class. The trained model is then applied to unseen data, where the outcomes of unlabeled examples are predicted from the relationships learned during training33,34.

Multiple approaches were applied to the pilot data to develop and test risk classification algorithms. A review of the strengths and limitations of common supervised ML approaches was conducted before finalizing the analysis plan. Based on the context and objectives of our research, our team determined that the following algorithms best suited our needs: decision tree, support vector machine, random forest, boosting algorithms, and neural network. Ease of interpretation, small sample size, and low-dimensional data particularly informed our choice; computational expense and speed were not pressing issues for our small dataset. The k-nearest neighbors technique was excluded because it provides no information on which attributes are most effective at classification, which is critical for our objectives. See Table 2 for highlighted strengths and limitations of common supervised ML approaches33,34,35,36,37,38,39,40.

Table 2 Review of supervised machine learning techniques.

A single Classification and Regression Tree (CART) was fit to generate a decision diagram for mental and physical risk. Conducting this analysis first allowed us to visualize the decision logic, or ‘classification rules’33, used to systematically predict whether a patient is at high or low risk for decreased QoL based on the presence or absence of certain features. This simple but powerful cornerstone approach allowed us to identify relevant characteristics and decision thresholds for predicting QoL. However, a single CART model may suffer from overfitting and/or instability35,36. To mitigate these issues, we then applied support vector machines, random forest, gradient boosting, and neural networks to our data to refine classification estimates while addressing model stability, overfitting, and bias. Performance metrics for each approach, along with the relative influence of top predictors, are presented. Variable importance measures can provide valuable insight into the magnitude of the effect of any single predictor41 on QoL risk level. We used the customary threshold, or decision boundary, of 0.5 for all tree-based modeling to classify predicted probabilities. All algorithms used a 70/30 split to train and test classification. Estimates of accuracy, sensitivity, specificity, and balanced accuracy are presented to evaluate performance.
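The workflow just described can be illustrated with a brief, simplified R sketch. It uses simulated data and invented variable names, and shows only two of the five algorithm families; it is not the study’s actual code.

```r
library(caret)         # data partitioning and performance metrics
library(rpart)         # single CART
library(randomForest)  # random forest ensemble

# Simulated stand-in for the study data: two survey items, one eHR variable,
# and a binary risk label loosely driven by the survey items.
set.seed(123)
n   <- 400
dat <- data.frame(q8  = sample(0:5, n, replace = TRUE),
                  q9  = sample(0:5, n, replace = TRUE),
                  age = sample(18:90, n, replace = TRUE))
dat$risk <- factor(ifelse(dat$q8 + dat$q9 + rnorm(n) > 7, "high", "low"))

# 70/30 train/test split, stratified on the outcome
idx   <- createDataPartition(dat$risk, p = 0.7, list = FALSE)
train <- dat[idx, ]
test  <- dat[-idx, ]

cart <- rpart(risk ~ ., data = train, method = "class")  # single decision tree
rf   <- randomForest(risk ~ ., data = train)             # tree ensemble

# Classify using the customary 0.5 decision boundary on predicted probabilities
prob <- predict(rf, test, type = "prob")[, "high"]
pred <- factor(ifelse(prob >= 0.5, "high", "low"), levels = levels(test$risk))

# Accuracy, sensitivity, specificity, and balanced accuracy,
# where balanced accuracy = (sensitivity + specificity) / 2
confusionMatrix(pred, test$risk, positive = "high")

varImp(cart)    # variable importance for the single tree
importance(rf)  # mean decrease in Gini for the forest
```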

Results

Exploratory data analysis

Univariate analysis

A total of 432 patients completed the QGEN and QDIS surveys. The practice site did not track the exact number of patients asked to complete the survey; however, it is estimated that 85–90% of patients who were asked chose to participate. Two duplicate observations were removed, along with three observations from patients under 18 years of age, leaving 427 observations for exploratory analysis. The scoring hub did not return composite scores for 52 observations. Data from these patients is included in the overall variable estimates but could not be stratified by risk level in the exploratory analysis.

Two SDoH variables were inconsistently documented in the eHR and were excluded from further analysis: patient housing status and patient income (each with greater than 70% missingness across the entire sample). The following demographic variables were collapsed into more meaningful response categories before analysis: patient race, ethnicity, weight, tobacco use, and education. Each patient’s date of birth and survey completion date were used to calculate age at the time of the survey.
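A small sketch of the age-at-survey calculation, assuming both dates are available in standard date format (the dates below are invented):

```r
# Compute age at survey from date of birth and survey completion date.
dob    <- as.Date(c("1970-05-01", "1995-11-23"))
survey <- as.Date(c("2021-03-15", "2022-08-02"))
age <- floor(as.numeric(difftime(survey, dob, units = "days")) / 365.25)
age  # 50 26
```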

The majority of patients in our sample identified as White (73%, n = 310) and Non-Hispanic/Latino (63%, n = 267), and most were male (59%, n = 250). The mean age of the sample was 50.5 years. On average, patients had 6.4 office visits and 0.29 emergency room visits in the 12 months before completing the survey, and saw 1.5 distinct providers during that same period. See Table 3 for complete details on the distribution of each demographic variable.

Table 3 Distribution of patient demographic variables stratified by composite score risk level.

QGEN10 answers ranged from 0 to 5 with varying missingness. The mental composite score averaged 50.7 (sd = 7.5, range = 26.8–61.9); 17% (n = 63) met the at-risk criterion (≤ 43.2) by our calculation. The physical composite score averaged 50.2 (sd = 7.8, range = 25.7–61.4); 17% (n = 64) met the at-risk criterion (≤ 42.4). Of patients at risk based on the physical composite score, 31% (n = 20) were also at risk based on the mental score.

The average count of chronic conditions was 1.6 (range 0–8). The most commonly endorsed chronic condition in our sample was major depressive disorder (n = 102, 24%). The highest QDIS impact appeared among patients who endorsed major depressive disorder (mean impact ranking = 2.0, sd = 1.2). See Tables 4 and 5 for complete details on the distribution of each QGEN and QDIS question.

Table 4 Distribution of QGEN10 questions stratified by composite score risk level.
Table 5 Distribution of QDIS questions stratified by composite score risk level.

Bivariate analysis

Several of the demographic factors showed a significant relationship with either the physical or mental composite scores or both. See Table 3 for complete details.

Each of the ten QGEN questions was significantly associated with the mental and physical composite scores (p < 0.001). Three of the chronic conditions in our QDIS checklist demonstrated a significant relationship with mental risk level, and seven were significantly associated with physical risk level. A raw count of chronic conditions was significantly associated with both mental (p = 0.014) and physical composite scores (p < 0.001). See Tables 4 and 5 for complete details.

Impact rankings of major depressive disorder and ulcerative colitis demonstrated a significant relationship with the mental composite score (p = 0.001 and 0.008, respectively). Impact rankings from the following conditions were significantly associated with physical composite scores: lower limb osteoarthritis (p = 0.003), sciatica (p = 0.004), major depressive disorder (p = 0.016), and cancer (p = 0.04). Total impact score was significantly associated with the physical composite outcome (p < 0.001).

Several of the individual QGEN questions demonstrated strong positive relationships with each other and with either or both composite scores. No strong correlations were detected between the eHR-based variables and the individual QGEN questions or composite-scored outcomes. Figure 1 displays strong correlations (|r| ≥ 0.6) among the numeric predictors and outcome variables.

Fig. 1 Correlation matrix of all numeric predictors and outcomes.
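The correlation screen behind Fig. 1 can be sketched as follows; the data are simulated, whereas the study’s actual inputs were the numeric QGEN items, eHR variables, and composite scores.

```r
# Flag variable pairs with |r| >= 0.6 among numeric predictors/outcomes.
set.seed(1)
num_dat <- data.frame(q1 = rnorm(100))
num_dat$q2  <- num_dat$q1 + rnorm(100, sd = 0.5)  # built-in strong correlation
num_dat$age <- rnorm(100, mean = 50, sd = 15)     # unrelated column

cmat   <- cor(num_dat, use = "pairwise.complete.obs")   # tolerate missingness
strong <- which(abs(cmat) >= 0.6 & upper.tri(cmat), arr.ind = TRUE)
data.frame(var1 = rownames(cmat)[strong[, 1]],
           var2 = colnames(cmat)[strong[, 2]],
           r    = round(cmat[strong], 2))
```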

Classification algorithms

All five approaches fit to classify mental risk demonstrated similar results, with QGEN Q8 and Q9 as predictive frontrunners. All algorithms other than XGBoost and the neural network identified Q8 as the feature with the highest importance. Question #8 asks how happy or satisfied one has been with life, with response options ranging from extremely happy to very unhappy. Question #9 asks how easy or hard emotional health makes one’s work activities, with response options ranging from very easy to unable to do. Patient responses of being unhappy/dissatisfied or very unhappy/dissatisfied most of the time were associated with reduced mental QoL, as were responses that it was ‘very hard’ or they were ‘unable’ to conduct their daily activities because of how they feel emotionally. A reduced 2-item model was fit for each algorithm using only questions 8 and 9 as predictors. Performance on the holdout partition demonstrated balanced accuracy between 79% and 98%, depending on the specific approach. The highest balanced accuracy achieved was 0.98, from the random forest and Bernoulli boosting algorithms. All five approaches fared significantly better than the no-information rate (0.83).

Approaches to classify physical risk demonstrated more variability in the top predictive features overall, though Q2, Q3, and Q4 were consistently ranked as the most important. Question #2 asks how easy or hard it has been to do one’s regular physical activity (walk, climb), with response options ranging from very easy to unable to do. Question #3 asks how much pain limited one’s everyday activities or QoL, with response options ranging from no pain to extremely limited. Question #4 asks how easy or hard it has been to do regular work or home activities. Patient responses that their physical health made it ‘very hard’ or they were ‘unable’ to do their physical activities or work/home activities were associated with reduced physical QoL, as were responses indicating that they were ‘limited a lot’ or ‘extremely limited’ by their pain. A 2-item decision model using questions 2 and 3 was applied to the test sample, resulting in balanced accuracy estimates between 76% and 94% for individual algorithms. The highest balanced accuracy achieved was 0.94, from the neural network and AdaBoost algorithms. All algorithms performed better than the no-information rate (0.77). The top 5 predictor variables and individual algorithm performance metrics can be found in Table 6.
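Continuing the simulated objects from the Methods sketch above, a reduced 2-item refit and its comparison with the no-information rate might look like this (illustrative only):

```r
# Reduced 2-item mental-risk model: refit using only QGEN Q8 and Q9
# (reuses the simulated 'train'/'test' data frames from the Methods sketch).
rf2   <- randomForest(risk ~ q8 + q9, data = train)
pred2 <- predict(rf2, test)   # majority vote, equivalent to a 0.5 threshold
cm    <- confusionMatrix(pred2, test$risk, positive = "high")

cm$byClass["Balanced Accuracy"]    # holdout balanced accuracy
max(prop.table(table(test$risk)))  # no-information rate (majority-class share)
```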

Table 6 Variable importance & performance metrics by classifier.

The algorithms predicting mental composite scores generally had higher balanced accuracy. More variability in top predictors, coupled with lower performance of the fitted models, could suggest that physical risk for diminished quality of life is more nuanced than mental risk, perhaps necessitating more complex model parameters.

Discussion

Health-related quality of life is a dynamic concept uniquely impacting each individual. QoL dimensions are among those most often identified by patients when asked about desired health outcomes42. Among patients with chronic medical conditions, self-assessment of function and well-being can predict use of health resources and resulting morbidity and mortality. Thus, HRQoL assessment and intervention make great sense within the primary care setting, where most individuals receive their care.

To that end, an easily initiated and highly predictive measure could greatly benefit PC providers. Preliminary results from the current study suggest that a brief, 4-question measure may predict physical and mental quality of life nearly as accurately as the full 40-question QGEN10 and QDIS surveys combined. If this predictive validity is confirmed in larger samples, it could lead to significant restructuring and simplification of QoL measurement, resulting in increased adoption by primary care providers and integration into everyday workflows.

The causal mechanisms that drive the predictive capability of the most influential questions in our survey warrant further investigation. We cannot state definitively what makes QGEN questions 2, 3, 4, 8, and 9 more predictive than others. However, recent evidence suggests that a single question may be enough to adequately capture some QoL domains43.

Next steps

We plan to generate a rapid-cycle learning community with additional practices to further validate and scale up these predictive algorithms and generalize our methodology. We have enlisted four additional primary care sites to collect QGEN and QDIS data. Patients from these sites are expected to be more representative of the general population. Future studies using a larger, more diverse sample may enhance the generalizability of the results while validating the algorithm in different settings and patient populations. Demographic and SDoH factors were not highly predictive in our pilot study; data from additional, diverse patients may improve their predictive capacity.

Ultimately, we aim to automate a pipeline that uses demographic and eHR data to identify potentially at-risk patients, who can then be contacted electronically to collect the most predictive QGEN questions described above. This reduces the up-front human effort needed to collect data prospectively. Patient responses from the QGEN will be fed to providers through the eHR in a manner customized to their existing workflows, making HRQoL risk data available to clinicians in a way that allows them to make an informed recommendation for patient disposition. This next phase of the work is necessary but not sufficient: concurrent with our psychometric efforts, we must define and implement clinical workflows that identify the requisite care needs of patients in the different risk categories.

Limitations

Algorithms generated for this pilot study are based on a small, convenience sample, which can result in selection and/or non-response bias. Due to missingness and lack of variation in our pilot sample, several patient demographic variables needed to be collapsed into fewer categories before modeling. This reduces the information available to the algorithms and may introduce bias. Additionally, overfitting is a common concern for decision trees. While ensemble methods such as gradient boosting aim to address such issues, our models likely remain overfit.

Furthermore, while our response rate was relatively high and the sample relatively diverse, the results must be interpreted cautiously as they may not represent all PC patients. For example, our sample was largely healthy compared to national norms, and thus our predictive findings may not represent PC patients with multi-morbidities.

Preliminary results from this study are promising; however, prospective validation in real-world clinical settings is needed to ensure that the algorithm performs well in practice and can be effectively integrated into clinical workflows. Such validation is necessary to further delineate the robustness and applicability of our preliminary findings.

Future work needs to focus on the integration of these tools into existing electronic health record systems and clinical workflows to maximize their downstream impact on patient care. There are many barriers and challenges to integrating risk prediction tools into the eHR, and we recommend a more formal investigation of these in the future.

Conclusions

We are fast approaching the use of artificial intelligence as a clinical tool in health care. Supervised machine learning can be used to restructure QoL measurement, reducing the burden of prospective data collection and enabling multi-source risk prediction. Prediction alone, however, is not enough to impact care delivery and outcomes: predictive data must be translated into usable workflows if these efforts are to see regular use, guide and enhance clinical practice, and ultimately improve patient care delivery and outcomes.