Introduction

AI is changing the way we solve problems, and healthcare is not exempt. Saenz et al. (2023) discuss recent FDA approvals of biomedical systems with at least some degree of autonomy for use in clinical and non-clinical settings, highlighting the growing integration of modern AI techniques into different aspects of patient care. Examples include the automatic generation of annotations used by radiologists in diagnostic image interpretation, and the recent triumph of LumineticsCore for timely delivery of insulin in point-of-care diagnostics1. Even over-the-counter biosensing technologies advertise increased capabilities as a result of some form of AI integration2. Nevertheless, the adoption of AI into healthcare is not without its challenges: a complicated and shifting landscape of governmental policies addressing liability, ethics, economics, and other important considerations makes the task of providing clear guidelines to physicians and healthcare technology practitioners especially difficult1,3,4.

Ambulatory and wearable sensing technology for general wellness monitoring has become increasingly popular, and its adoption for early disease detection is a prevailing idea among early-stage medical start-ups, judging by the volume of research publications and conference presentations that extend the usefulness of the smartwatch into a medical device5,6,7,8,9,10. With advances in Large Language Models (LLMs) and the underlying mechanisms for Chain of Thought (CoT) reasoning, which allow biophysics models to be integrated as instrument models in wearable technology, the development of digital health applications is accelerating, especially with the availability of models that focus on consistency checking and resiliency evaluation. Yet for all that we are now able to do, there is surprisingly no corresponding increase in studies that assess the patient's experience of care beyond the general customer satisfaction survey11,12,13,14,15,16. Typically, the ultimate goal of the customer satisfaction survey is profit maximization for the service provider, whereas an assessment of care aims to optimize the psychological factors within the context of the healthcare service. One might argue that a comprehensive study of the patient's experience of care should assess how much of the human qualities of empathy, understanding, gentle touch, and so on is lost in AI-integrated healthcare.

However, there is currently limited data on the general population's perception of the integration of AI into healthcare. A recent study found that the majority of Americans wanted to be notified if AI was involved in their care, with females, older adults, non-Hispanic Whites, and more educated people more likely to express this desire17. In another study on the acceptability of AI, a decisive positive correlation was found with perceived utility, positive attitude, and perceived trustworthiness, and a negative correlation with poor computer literacy and negative attitudes towards computers18,19. Since patients are ultimately the beneficiaries of AI integration in healthcare, directly or indirectly, it is important to ensure that AI is well received. In addition, patient acceptability and public trust are important to ensure patient engagement, wide dissemination, and the successful integration of AI into healthcare.
In this study, we lay foundational work for an AI affinity score, using a mathematical model and prevailing Machine Learning (ML) techniques to build models that leverage a patient's general information and biological data to determine their experience of care strictly on the basis of the degree of AI integration. The AI affinity score is designed to predict the degree of AI integration that will maximize a patient's experience of care, and it was evaluated on data generated from a survey of participants from North America, Asia, and Africa regarding their perceptions and acceptability of AI integration into healthcare.

Distribution of attitudes towards AI integrated Healthcare

We conducted this study using the survey method, with the aim of assessing how the degree of AI integration in digital health systems and general healthcare services impacts the patient's experience of care. The research was approved by the Institutional Review Board (IRB) for CSU San Bernardino under number IRB-FY2025-63. Informed consent was obtained from all subjects, and all methods were performed in accordance with the relevant guidelines and regulations of the IRB. The data collected revealed that most respondents are familiar with AI, with 97% acknowledging awareness of the technology and nearly 60% having used AI-powered tools. This widespread familiarity with AI provided an opportunity to examine variations in perceptions of AI-integrated healthcare. These variations are explored across demographic groups using Kernel Density Estimation (KDE) and box plots. While overall perceptions remain neutral, with most affinity scores centered around 0.5, significant differences are observed across gender, age, education level, and regional factors.
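As an illustration of this exploratory step, the sketch below reproduces the kind of KDE and box-plot comparison used here, assuming a pandas DataFrame with hypothetical column names (affinity_score, gender, education); the actual survey schema may differ.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical survey table; the real data uses the study's own field names.
df = pd.read_csv("survey_responses.csv")

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Normalized KDE of affinity scores, one curve per gender category (cf. Fig. 1).
sns.kdeplot(data=df, x="affinity_score", hue="gender", common_norm=False, ax=axes[0])
axes[0].set_title("KDE of AI affinity scores by gender")

# Box plot of affinity scores by education level (cf. Fig. 2).
sns.boxplot(data=df, x="education", y="affinity_score", ax=axes[1])
axes[1].set_title("AI affinity scores by education level")

plt.tight_layout()
plt.show()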

Fig. 1 Normalized KDE plot for affinity scores across demographic groups.

Figure 1 shows that male and female respondents have similar perceptions of AI-integrated healthcare, as indicated by overlapping affinity scores. However, respondents identifying as “Other” demonstrate consistently lower scores, suggesting less favorable views. Aligning with the KDE analysis, the “Other” category also has a noticeably lower median and narrower inter-quartile range (Fig. 2). Regarding age, while medians are similar, younger respondents exhibit slightly greater variability in their affinity scores, which may indicate more diverse perceptions among younger participants. Education level shows the most significant differences between subgroups, with respondents possessing advanced and moderate education displaying tightly clustered, higher scores, indicative of more favorable and consistent views. Conversely, those with lower education levels exhibit broader and lower scores, as confirmed by the lower median. Regional differences also play a significant role in shaping perceptions. Respondents from Asia show the highest and most consistent affinity scores, as evidenced by a narrow distribution and higher median in both the KDE and box plots. This suggests a more positive and unified perception of AI-integrated healthcare in this region. In contrast, participants from North America and other regions exhibit more variability in their responses, reflecting a diversity of opinions on the topic.

Fig. 2 Box plot for affinity scores across demographic groups.

Noticeable differences emerge when comparing perceptions of digital technology and AI integration across age and regional groups. While respondents generally express positive views toward digital technology, their attitudes toward AI integration are notably more cautious. This is reflected in a higher proportion of respondents adopting a negative or neutral stance on AI integration (Fig. 4) compared to digital technology applications (Fig. 3).

Fig. 3 Digital approval level across age and region groups.

Fig. 4 AI support level across age and region groups.

Older demographics tend to favor digital technology more, whereas younger respondents show slightly greater openness towards AI. Regionally, Asian respondents demonstrate stronger support for AI despite harboring more skepticism towards digital technology. On the surface, perceptions of AI integration in healthcare generally lean towards neutrality, but they vary most significantly with level of education and region, which play pivotal roles in shaping attitudes on AI generally and especially its integration into healthcare systems. It is therefore important to account for these demographic nuances when addressing public perceptions and fostering trust in AI integration.

Predicting a patient’s experience of care in an AI integrated healthcare system

To develop a computational model to reason over general preferences and attitudes around the integration of AI into healthcare, we introduce the AI affinity coefficient \(\alpha \in (-1, 1) \subset \mathbb {R}\) as a measure of the deviation of a response to a survey question, \(Q_{i} \in Q\), from neutrality, which is realized as \(\alpha = 0\). When a response is in favor of AI, \(\alpha \rightarrow 1\); the actual realized value of \(\alpha\) depends on the strength of the sentiment expressed. The reverse is true for a response not in favor of AI: \(\alpha \rightarrow -1\). For each study participant, we calculate an AI affinity score such that for the kth respondent the following holds:

$$\begin{aligned} \textit{AI affinity score}(A_{k}) = \prod _{i=1}^{n}\alpha _{i}^{k}W(Q_{i}) \end{aligned}$$
(1)

We choose \(W(Q_{i}) \in (0, 1)\) such that

$$\begin{aligned} \sum _{i=1}^{n} W(Q_{i}) = 1 \end{aligned}$$
(2)
  • \(\alpha _{i}^{k}\) is the AI affinity coefficient of the kth respondent's response to the ith question

  • \(W(Q_{i})\) is the weight assigned to the ith question

  • \(n\) is the total number of related questions in the survey

In this study, the weights of the survey questions were selected based on expert opinion on their perceived importance or influence (implicit or otherwise) on AI affinity. Subsequently, we present a deep learning model that predicts AI Affinity Scores to determine the degree of AI integration into care that will optimize a patient's experience of care.
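A minimal sketch of Eqs. (1)-(2), implemented as written above (a weighted product over question-level coefficients), is shown below; the coefficients and weights are illustrative and are not the expert-assigned weights used in the study.

import numpy as np

def affinity_score(alphas: np.ndarray, weights: np.ndarray) -> float:
    """Compute A_k for a single respondent.

    alphas  : per-question AI affinity coefficients, each in (-1, 1)
    weights : per-question weights W(Q_i), normalized so they sum to 1 (Eq. 2)
    """
    weights = weights / weights.sum()           # enforce Eq. (2)
    return float(np.prod(alphas * weights))     # Eq. (1) as stated above

# Three hypothetical questions: the respondent leans pro-AI on Q1 and Q3, mildly anti on Q2.
alphas = np.array([0.6, -0.2, 0.8])
weights = np.array([0.5, 0.3, 0.2])             # illustrative expert-assigned importance
print(affinity_score(alphas, weights))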

Supervised learning for predicting AI affinity scores

The dataset we used for model prediction included 24 predictors derived from 320 patient survey responses. The ages of this patient cohort ranged from 18 to over 46 years, with categories of 18–25, 26–35, 36–45, and 46+. These predictors covered demographics such as gender, education, region, and occupation, as well as familiarity with and attitudes toward AI and robotics in healthcare. First, we performed Principal Component Analysis (PCA) to determine the top five relevant features for the prediction, in order to handle the high-dimensional data and minimize redundancy. Specifically, PCA revealed that the most influential predictors included patients' attitudes toward AI integration, their concerns about the use of AI and robot assistants in healthcare and the service industry, their digital health usage behaviors and familiarity, and their level of trust in AI tools. Subsequently, we partitioned the dataset using a 60/20/20 train-test-validation split, i.e., 60% of the data is used to fit the model, 20% is used to evaluate the trained model, and 20% is used to validate the trained model.
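The following sketch illustrates the dimensionality-reduction and partitioning steps under the assumption that the 24 predictors have already been encoded numerically; the arrays here are placeholders rather than the study data.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

X = np.random.rand(320, 24)   # placeholder for the 24 encoded survey predictors
y = np.random.rand(320)       # placeholder for the computed AI Affinity Scores

# Standardize, then project onto the leading principal components.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=10)    # the downstream network consumes the top PCA components
X_pca = pca.fit_transform(X_scaled)
print("explained variance ratio:", pca.explained_variance_ratio_)

# 60/20/20 train/test/validation partition.
X_train, X_tmp, y_train, y_tmp = train_test_split(X_pca, y, test_size=0.4, random_state=42)
X_test, X_val, y_test, y_val = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)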

Model training

The models were trained to support both continuous and categorical prediction of AI Affinity Scores. These include a deep learning regression model, a classification model using the same deep learning architecture, a baseline linear regression model, and a Random Forest classifier for interpretability. We implemented a feedforward neural network (FNN) for the regression model consisting of an input layer that accepts the top 10 PCA-transformed components, two hidden layers with 64 and 32 neurons, each using ReLU activation, and an output layer containing a single neuron with a linear activation function for affinity score prediction. We use dropout layers with a 20% rate after each hidden layer and an Adam optimizer with a learning rate of 0.001. We minimize the average squared difference between observed and predicted affinity scores using the Mean Squared Error (MSE) loss function. The training procedure with the 60/20/20 train-test-validation split was performed for up to 2000 epochs with a batch size of 32. Since the dataset is small (approximately 300 samples), we included an EarlyStopping callback with a patience of 50 epochs to prevent overfitting. With this modification, training stopped automatically once the model's performance on the validation set plateaued, avoiding unnecessary training up to the preset 2000 epochs. The ML model's outcome after training is the Affinity Score, a metric that captures participants' degree of receptiveness to the use of AI and robots as assistants in healthcare.

The classification model shares the same architecture but uses a softmax-activated output layer with three units for the affinity categories (Low = 0, Medium = 1, High = 2). Affinity scores were binned into three ordinal classes using quantile-based cutoffs. Since the distribution of affinity labels is imbalanced, with the "Medium" class dominating, we incorporated class weights during training; the weights were computed from the inverse class frequencies and used to balance the loss contribution across categories.

The third model we trained was a linear regression model using the same PCA features to compare performance. This basic model provides a baseline for continuous prediction that is interpretable and easy to implement. Additionally, we trained a Random Forest classifier using the binned affinity classes. This model is useful for evaluating robustness across categorical prediction tasks and enables further comparison to the neural classification model. All our deep learning models were built and trained using the TensorFlow library, with the Keras API used to define the neural network architectures. Classical models (Random Forest and linear regression) were implemented using scikit-learn.
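A sketch of the regression network described above, continuing from the split in the previous sketch; the architecture and hyperparameters follow the text, while details such as restore_best_weights are assumptions.

from tensorflow import keras
from tensorflow.keras import layers

def build_regressor(n_inputs: int = 10) -> keras.Model:
    """FNN: input -> 64 ReLU -> dropout -> 32 ReLU -> dropout -> linear output."""
    model = keras.Sequential([
        layers.Input(shape=(n_inputs,)),        # top PCA-transformed components
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.2),
        layers.Dense(32, activation="relu"),
        layers.Dropout(0.2),
        layers.Dense(1, activation="linear"),   # continuous affinity score
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
                  loss="mse", metrics=["mae"])
    return model

early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=50,
                                           restore_best_weights=True)

model = build_regressor()
history = model.fit(X_train, y_train,
                    validation_data=(X_val, y_val),
                    epochs=2000, batch_size=32,
                    callbacks=[early_stop], verbose=0)

# The classifier variant replaces the output layer with Dense(3, activation="softmax"),
# uses a categorical cross-entropy loss, and passes class_weight to fit().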

Model evaluation

For the deep learning regression model, evaluation on the test set yielded a low MSE of 0.0020. The \(\text {R}^{2}\) score was 0.9339, indicating strong agreement between predicted and observed values. Figure 5 shows that the sorted squared errors (blue line) follow an increasing trend, indicating a subset of samples for which the model finds accurate prediction harder. The red line, which represents the Mean Squared Error (MSE), serves as a reference for the model's overall performance; samples above this line indicate areas where the model struggles more with prediction. To further evaluate model performance, we applied a paired t-test to determine whether the differences between the actual and predicted affinity scores are statistically significant. The test produced a t-statistic of 0.5043 and a p-value of 0.6158. Since the p-value is well above the commonly used threshold of \(\alpha = 0.05\), we fail to reject the null hypothesis. This indicates no significant difference between the predicted and actual scores, suggesting that the model's predictions are strongly aligned with the ground truth. The Mean Absolute Error (MAE), which quantifies the average magnitude of prediction error, was 0.0356, meaning that, on average, the predicted scores deviate from the true values by just 0.0356 units. We also observed that the deep learning model tends to regress toward the mean, producing predictions close to the average and with reduced variance, which is common among models trained on small datasets. For comparison, we trained a linear regression model; surprisingly, the linear model outperformed the deep learning model, achieving an \(\text {R}^{2}\) of 0.91, a lower MSE of 0.0016, and an MAE of 0.0327.

The second model, the deep learning classifier, achieved a test accuracy of 90%, but the confusion matrix revealed that most predictions fall into the "Medium" category, likely due to class imbalance in the data. To reduce this bias, we applied class weighting during training, which improved sensitivity for the "Low" and "High" classes. However, class imbalance remains a limitation in the overall classification performance.

The third model, the linear regression baseline, performed well, achieving a low Mean Squared Error (MSE) of 0.0017, a Mean Absolute Error (MAE) of 0.0337, and a high \(\text {R}^{2}\) score of 0.9388. This suggests that the model was able to closely approximate the continuous Affinity Scores. These results demonstrate that for small datasets, linear models can provide fair performance with minimal overfitting.

The last model, the Random Forest classifier, achieved a test accuracy of 82% with strong F1 scores for the Medium and High categories. It attained a precision of 0.89 and a recall of 0.80 for the High category, as well as a precision of 0.79 and a recall of 0.98 for the Medium category. However, it struggled with the Low category, yielding a much lower recall of 0.31, which indicates frequent misclassifications. The confusion matrix confirms that while the model correctly identified most Medium scores, many Low scores were misclassified as a result of the underlying distribution of the data.
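The reported test-set metrics and the paired t-test can be computed as in the sketch below, continuing from the fitted model above.

from scipy import stats
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_pred = model.predict(X_test).ravel()

print("MSE :", mean_squared_error(y_test, y_pred))
print("MAE :", mean_absolute_error(y_test, y_pred))
print("R^2 :", r2_score(y_test, y_pred))

# Paired t-test: H0 is that the mean difference between observed and predicted scores is zero.
t_stat, p_value = stats.ttest_rel(y_test, y_pred)
print(f"t = {t_stat:.4f}, p = {p_value:.4f}")   # p > 0.05 -> fail to reject H0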

Fig. 5 Training error surface.

The confusion matrix in Fig. 6 helps evaluate the performance of the deep learning model (with regression) in classifying AI Affinity Scores. For visualization, the model’s continuous predictions and corresponding ground truth values were discretized post-hoc into three ordinal categories: “0” for Low, “1” for Medium, and “2” for High. This binning was applied only after model training and did not affect the regression model itself. The diagonal elements represent correct predictions and show strong performance for the Medium category, with 43 correct predictions. Additionally, 8 correct predictions were made for the Low category. The High category performed the worst, with only 3 correct predictions, indicating that the model struggles to accurately classify high-affinity scores. Most misclassifications appear in the off-diagonal elements. Specifically, Medium scores were often misclassified as both Low and High, and High scores were misclassified as Medium in five instances. This pattern suggests that the model has difficulty distinguishing between Medium and High scores, possibly due to overlapping feature distributions.
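A sketch of the post-hoc discretization behind Fig. 6 follows; the tertile cutoffs used here are assumptions chosen to split the observed scores into three roughly equal bins, and the arrays continue from the evaluation sketch above.

import numpy as np
from sklearn.metrics import confusion_matrix

# Tertile cutoffs estimated from the observed test scores.
low_cut, high_cut = np.quantile(y_test, [1 / 3, 2 / 3])

def to_class(scores):
    """Map continuous scores to 0 = Low, 1 = Medium, 2 = High."""
    return np.digitize(scores, bins=[low_cut, high_cut])

cm = confusion_matrix(to_class(y_test), to_class(y_pred), labels=[0, 1, 2])
print(cm)   # rows = observed class, columns = predicted class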

Fig. 6 Confusion matrix.

Comparison between the predicted and observed affinity scores

The plot in Fig. 8 shows the relationship between the observed (true) AI Affinity Scores and the model's predicted AI Affinity Scores. The red dashed line represents the ideal scenario where the predicted values match the observed values exactly. A key observation from Fig. 8 is that most points cluster around the red dashed line, indicating that the model's predictions are reasonable when compared to the actual values.

Paired T-test for synthetic data evaluation

The total sample size is 320; we generated 80 additional samples synthetically by resampling the original dataset with replacement (bootstrapping). Each sample in this synthetic dataset retained the same feature distribution as the original data. The distribution of the 80 synthetic AI Affinity Scores was treated as the observed scores for evaluation. We then added random Gaussian noise \(\sim N(\mu =0, \sigma =0.05)\) to the observed distribution of AI Affinity Scores to simulate the deep learning model's predictions; these noisy scores represent the predicted scores for the synthetic dataset. We performed a paired t-test to compare the observed and predicted affinity scores for the 80 synthetic rows and obtained a t-statistic of 0.496 and a p-value of 0.621. This p-value indicates no significant difference between the observed and predicted scores, showing stable model predictions. These results from the synthetic data evaluation show a strong alignment between the model's predictions and the observed values and attest to the robustness of the trained model, as seen in Fig. 7. Overall, the model's predictions align with the observed values with high accuracy (low MAE and strong alignment in the scatter plot), and the high p-value from the paired t-test indicates strong statistical agreement. This shows that the model effectively captures the relationship between features and the target variable.
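The synthetic-data check described above can be reproduced with the sketch below, where y stands for the vector of original affinity scores (a placeholder from the earlier sketches).

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

observed = rng.choice(y, size=80, replace=True)                    # bootstrap resample
predicted = observed + rng.normal(loc=0.0, scale=0.05, size=80)    # simulated predictions

t_stat, p_value = stats.ttest_rel(observed, predicted)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")   # p >> 0.05 -> no significant difference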

Fig. 7 Distribution of predicted versus observed AI affinity scores from the deep learning model and the linear regression baseline model (N = 320).

Fig. 8 Predicted versus observed affinity scores.

Impact of gender, age group & level of education on AI integrated healthcare

It is important to note that the survey constrains participant selections to a digital health space, i.e., the subsequent analysis reflects the preferences and attitudes of the referenced demographics in a universe where digital health is realized. For example, we find that older populations are more inclined towards AI integration when an acceptance of digital health is already present, whereas the opposite might be true in a more relaxed universe where digital health is optional. The group statistics for the distribution of AI Affinity Scores over gender are presented in Table 1 below. With fewer than 10 respondents identifying as neither male nor female, we do not have enough data to include this group in the analysis.

Table 1 Group statistics for AI affinity scores over gender.

Table 2 below shows that there is no statistically significant difference in AI Affinity Scores based on gender, implying that gender has no impact on the preferred degree of AI integration into healthcare.

Table 2 1-way ANOVA involving groups of AI affinity scores over gender.
Table 3 Group statistics for AI affinity scores over age group.
Table 4 1-way ANOVA involving groups of AI affinity scores over age group.

Similarly, Table 3 shows the group statistics for the distribution of AI Affinity Scores over age group, while Table 4 below shows that there is no statistically significant difference in AI Affinity Scores based on age group, implying that age has no impact on the preferred degree of AI integration into healthcare.

For the distribution of AI affinity scores over education level, we are particularly interested in the impact of the degree of academic exposure to the theory and application of AI. To this end, we only consider participants with at least some college-level exposure to AI, either directly through instruction or indirectly via informal interactions within the academic community.

Table 5 Group statistics for AI affinity scores over level of education.
Table 6 1-way ANOVA involving groups of AI affinity scores over level of education.

Table 5 shows the group statistics for the distribution of AI Affinity Scores over level of education, while Table 6 below shows that there is a statistically significant difference in AI Affinity Scores based on level of education at \(\alpha = 0.10\) (p-value = 0.09771), implying that level of education has an impact on the preferred degree of AI integration into healthcare. When comparing the group of people with advanced degrees directly against people with some college education (ignoring the group with only a college degree) at \(\alpha =0.05\), we find stronger evidence against the null hypothesis for ANOVA, with a p-value of 0.04629.
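The one-way ANOVA over education groups can be computed as sketched below, assuming a DataFrame df with hypothetical 'education' and 'affinity_score' columns.

from scipy import stats

groups = [g["affinity_score"].values for _, g in df.groupby("education")]
f_stat, p_value = stats.f_oneway(*groups)
print(f"F = {f_stat:.3f}, p = {p_value:.5f}")   # compare against alpha = 0.10 (or 0.05)

# A direct comparison of two groups (e.g., advanced degree vs. some college) can reuse
# f_oneway with only those two groups.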

Functional groups over AI affinity scores

For practical considerations and effective adoption, it is useful to categorize AI Affinity Scores, i.e., to associate them with domain-specific labels tailored to support decisions and minimize uncertainty over a space of healthcare protocols that optimize a patient's experience of care. To this end, an arbitrary number of categories of degree of AI integration may be assigned, and AI Affinity Scores can be distributed over the chosen categories using threshold functions. However, it may be at least marginally better to assign labels based on groups that arise implicitly from a Bayesian nonparametric statistical analysis of the data, under the assumption that a fully AI-integrated healthcare is the expectation and that deviations to other degrees of AI integration happen with concentration parameter \(\alpha\).

First, we pool the distribution of AI Affinity Scores into a single group and refer to the resulting distribution as \(\phi\), modeled as a Gaussian mixture over k clusters.

$$\begin{aligned} GMM(\phi ) = \sum _{j=1}^{k} \pi _{j}P(\phi ; \mu _{j},\sigma _{j}) \end{aligned}$$
(3)
  • \(\pi _{j}\) is the mixture coefficient of the jth component

  • \(\mu _{j}, \sigma _{j}\) are the model parameters of the jth Gaussian distribution

Then we state the Dirichlet prior as having the simple form:

$$\begin{aligned} p(\phi ) = GMM(\phi ) \end{aligned}$$
(4)
such that the following holds:

$$\begin{aligned} \frac{1}{\alpha }&\sim \Gamma (1,1) \\ \{\pi _{1}, \ldots , \pi _{k}\} \mid \alpha&\sim Dir\left( \frac{\alpha }{k}\right) \\ \{\mu _{1}, \ldots , \mu _{k}\}&\sim N(0,1) \\ \left\{ \frac{1}{\sigma _{1}}, \ldots , \frac{1}{\sigma _{k}}\right\}&\sim \Gamma (1,1) \end{aligned}$$

With TensorFlow Probability, we derive both the number of inferred clusters and the cluster element distribution, as shown in Table 7 below:

Table 7 Inferred clusters from Dirichlet process mixture model (DPMM).

Healthcare administrators can use this information to design intervention protocols and care packages for a functional AI-integrated healthcare system with five degrees (levels) of AI integration. The categories over which the AI Affinity Scores distribute can also be used to guide marketing strategies tailored to different populations. Those with high scores are likely early adopters of AI interventions, while those with lower scores may require targeted strategies to encourage adoption. Also, in line with upholding strong ethics, it is crucial to avoid marginalizing patients who prefer traditional care; under this lens, AI affinity scores can preserve human-centered care for those with lower AI affinity20,21. The asymmetry in cluster size shown in Table 7 allows for proper resource planning and allocation for targeted interventions over the set of functional categories based on their popularity.
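The study's clustering was implemented with TensorFlow Probability; as a lighter-weight illustration of the same idea, the sketch below uses scikit-learn's BayesianGaussianMixture with a Dirichlet-process prior, which infers the effective number of clusters by shrinking the weights of unused components toward zero (the concentration value and component cap are assumptions, and y is a placeholder for the vector of affinity scores).

import numpy as np
from sklearn.mixture import BayesianGaussianMixture

scores = y.reshape(-1, 1)                       # AI Affinity Scores as a single feature

dpgmm = BayesianGaussianMixture(
    n_components=10,                            # upper bound on k; unused components get ~0 weight
    weight_concentration_prior_type="dirichlet_process",
    weight_concentration_prior=1.0,             # concentration parameter alpha (assumed value)
    random_state=0,
)
labels = dpgmm.fit_predict(scores)

active = dpgmm.weights_ > 0.01                  # components that actually receive probability mass
print("inferred clusters:", int(active.sum()))
print("cluster sizes:", np.bincount(labels, minlength=10))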

Discussion and conclusion

The AI Affinity Score allows healthcare providers to personalize care delivery based on an individual's preference for AI integration, optimizing their experience. Tailoring AI integration to patient preferences can enhance engagement and satisfaction18,22,23. Patients with higher AI Affinity Scores may be more receptive to AI-driven interventions, while those with lower scores may prefer human-centered approaches. Although AI-based therapy can be effective, its success depends on acceptability, trust, and attitude toward AI24,25. Research shows that satisfaction with care is linked to engagement, adherence to therapy, and improved outcomes26,27. Embedding the AI Affinity Score in electronic medical records at intake can help determine the appropriate level of AI integration, potentially leading to better outcomes.

Our data show differences in AI Affinity Scores across levels of education, which can inform the allocation of AI-based technologies to areas with higher affinity, freeing human resources for regions with lower affinity scores and reducing healthcare inequities. More granular data is needed to assess variations within countries, including rural-urban differences, regional variations, and intra-city variations in AI Affinity Scores; this can help policymakers and healthcare administrators reduce disparities. In general, the data suggest that individuals with lower education tend to have less favorable attitudes towards AI-integrated healthcare24. This aligns with our findings and can inform resource planning and allocation, although additional data are necessary. Tailored messaging can leverage affinity scores to encourage adoption and engagement. Several studies have highlighted the importance of developing new models of patient segmentation and customizing communication strategies and healthcare delivery to meet the needs of different patient groups. AI Affinity Scores can be integrated into new models that incorporate social determinants of health, neighborhood characteristics, and consumer data to address the needs of different populations by stratifying people based on their preferences for AI-integrated healthcare28,29,30. The AI Affinity Score can also be used to monitor trends in AI acceptance over time, offering valuable insights into shifting attitudes and informing continuous improvements in AI applications to ensure they remain patient-centered.

A systematic review showed that patients generally accepted AI integration in healthcare when effectiveness is demonstrated, providers remain involved, and the integration maximizes the individual strengths of human providers and AI. There are several limitations associated with a survey-based study, as noted by other researchers, including non-response bias, selection bias (with younger and better-educated participants overrepresented in the studies reviewed), and a digital divide between older and younger individuals22. It is important to acknowledge these limitations in our model as well, especially since the data collection process introduced a selection bias: because the survey was conducted online, participants are more likely to be familiar with digital technologies and to have a positive attitude toward technology. This bias is mitigated by constraining our study with the assumption that digital health is not optional, as stated earlier. Additionally, the model relies on limited demographic variables, whereas attitudes toward AI integration are influenced by many factors. Incorporating additional data could improve the model's accuracy, but the current model's simplicity allows for lower variance and more stable predictions across several datasets. Adding more variables may increase complexity but also introduce more variability, potentially reducing generality and predictive stability31,32,33. Ultimately, we believe that the AI Affinity Score offers a practical tool for tailoring AI integration to individual patient preferences, enhancing engagement, optimizing healthcare delivery, and improving health outcomes.