Introduction

Age-related macular degeneration (AMD) is the commonest cause of blindness among elderly people in the developed world. There are an estimated 67 million patients with AMD in Europe, a figure expected to rise to 77 million by 2050 [1]. The neovascular, or wet, form of the disease requires ongoing monitoring, and with some treatments delivered under treat-and-extend protocols, patients can at times achieve a 24-week treatment interval [2, 3]. Combined with the ageing population, this creates an ever-increasing workload that places a strain on health care systems. Ophthalmology is the largest outpatient department of the NHS, and patients receiving treatment for AMD make up a large proportion of this group [4].

The development of modern artificial intelligence (AI) techniques such as deep learning, reinforcement learning, and large language models has presented new opportunities in clinical domains. Retinal imaging has featured in many projects exploring the predictive power of the technology, with such analysis techniques being investigated both for detecting disturbances of vision and for predicting the progression of conditions [5]. Similarly, automated analysis of retinal images is being used to predict the risk of many other disorders, such as cardiac disease and neurological dysfunction [6]. In some instances, these techniques are more accurate than human raters of images [7]. This opens up the possibility of an automated decision-making process that can happen in real time and does not require the involvement of highly trained experts, who are in limited supply.

However, it is not immediately clear whether delegating decisions that directly affect one’s clinical care to an AI system is acceptable to patients. Patient-centred care was a key theme in the report by Robert Francis, commissioned in response to events at Stafford Hospital, which emphasised ensuring that patients’ voices are heard in clinical decision making [8]. As more AI-based tools are incorporated into clinical pathways, a better understanding of patient perspectives on the acceptability of these tools is crucial.

One way of determining patient preferences is a technique called conjoint analysis, which is based on the idea of a trade-off. The full-concept method generates a number of profiles or scenarios, each representing a combination of the attributes of interest for the service package or product. If one assumes that each factor acts independently, then by careful choice of combinations only a fraction of the possible scenarios is needed to perform a full analysis. The method assesses the relative importance of the different attributes and shows which features individuals are prepared to trade to obtain what they value most. Utility scores (a measure of desirability) are generated for each attribute level and can be used to find the relative importance of each attribute. This method has been used previously to assess ophthalmic services [9,10,11,12,13].
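The idea of part-worth utilities can be sketched in a few lines of Python. This is a minimal illustration only: the study itself used the R conjoint package, and the profiles, attribute names and scores below are hypothetical. Under a balanced (orthogonal) design, the part-worth utility of a level is the deviation of the mean preference score of profiles containing that level from the grand mean, so the part-worths of each attribute sum to zero.

```python
# Hypothetical mini-design: each profile is a dict of attribute levels,
# with 'score' a preference score (here a reversed rank, higher = preferred).
profiles = [
    {"reader": "human", "checker": "none", "score": 1},
    {"reader": "human", "checker": "AI",   "score": 3},
    {"reader": "AI",    "checker": "none", "score": 2},
    {"reader": "AI",    "checker": "AI",   "score": 4},
]

def part_worths(profiles, attribute):
    """Part-worth of each level = mean score of profiles containing that
    level, expressed as a deviation from the grand mean, so the
    part-worths of a balanced attribute sum to zero."""
    grand_mean = sum(p["score"] for p in profiles) / len(profiles)
    levels = {p[attribute] for p in profiles}
    worths = {}
    for level in levels:
        scores = [p["score"] for p in profiles if p[attribute] == level]
        worths[level] = sum(scores) / len(scores) - grand_mean
    return worths

# For this respondent, having an AI checker carries positive utility
# relative to no checker (the part-worths sum to zero).
print(part_worths(profiles, "checker"))
```

The spread of the part-worths within an attribute, relative to the other attributes, is what yields the attribute-importance percentages reported below.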

Methods

The survey was conducted using a popular survey website, surveymonkey.com. This site was chosen for its accessibility for those with visual impairments, including compatibility with screen readers. The survey was modelled on a previously conducted conjoint survey that had a strong response. Wording in the survey provided an introduction to AI for those who may have been unfamiliar with the concept (Supplementary Information).

The survey was promoted through the Macular Society’s monthly e-newsletter, which has 82,000 subscribers, and through the Society’s dedicated webpage for surveys and focus groups. The survey was live for a month, between May and June 2024.

We used the conjoint package version 1.41, developed for R by Bak and Bartlomowicz [14] and available from The Comprehensive R Archive Network (CRAN).

We looked at four factors:

  • The initial reader, with two levels: human or artificial intelligence.

  • The error rate, with three levels: 5%, 10% and 20%.

  • Speed of result, with three levels: within one day, within two days and within four days.

  • Whether there was a second reader or checker, with three levels: none, human and AI.

The scenarios presented for ranking described a first reading by either a human or AI, followed by no second reading or a second reading by either a human or an AI system. The options were presented so that no assumption about a second reader followed from the ‘first reading’: if the first reader was AI, the second was not automatically human or AI as a result. We chose the differences between the levels for error rate and speed of reporting to be a constant ratio rather than a constant amount.

This generates 54 possible scenarios (2 × 3 × 3 × 3). We used a fractional factorial design that reduced the number of scenarios to be ranked to 13.
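The size of the full factorial design can be verified by enumerating every combination of factor levels. The sketch below is illustrative only (the study generated its fractional design with the R conjoint package); the dictionary keys are paraphrases of the factors above, not names from the study.

```python
# Enumerate the full factorial design: 2 x 3 x 3 x 3 = 54 scenarios,
# from which a fractional factorial design selects a small orthogonal
# subset for participants to rank.
from itertools import product

factors = {
    "initial_reader": ["human", "AI"],
    "error_rate":     ["5%", "10%", "20%"],
    "speed":          ["1 day", "2 days", "4 days"],
    "second_reader":  ["none", "human", "AI"],
}

scenarios = [dict(zip(factors, combo)) for combo in product(*factors.values())]
print(len(scenarios))  # 54
```

Ranking all 54 scenarios would be impractical for participants, which is why the fractional design of 13 scenarios, chosen so the factor levels remain balanced, was used instead.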

These scenarios were presented in a random order on SurveyMonkey, and participants were asked to rank them from most to least preferable. The results were stored in an Excel spreadsheet and then imported into R for analysis.

Participants were also asked about their age, gender, level of schooling (primary school, secondary school up to 16, A-levels, university and postgraduate), whether they were undergoing intraocular injections, and their awareness of AI (on a three-point scale: have heard about it but do not understand it; aware and have some understanding; very aware and understand it).

The utility scores for each factor level were calculated for each participant. The overall mean utility scores and their importance were then calculated. Next, for each factor level, a linear regression model was constructed with the part utility score for that level as the dependent variable and participant age, gender, level of schooling, knowledge of AI and experience of intraocular injections as the independent variables. The aim of these analyses was to determine whether patient-specific utility values could be predicted. Participants were also given the opportunity to provide free-text comments.
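The per-level regressions can be sketched as follows. This is a hypothetical illustration, not the study’s analysis: the study fitted multiple regressions in R with five predictors, whereas for brevity this shows ordinary least squares with a single predictor (age) using the closed-form solution, applied to made-up part-utility values.

```python
# Illustrative sketch (hypothetical data): regress one factor level's
# part-utility score on a participant characteristic via closed-form OLS.

def simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    return my - b * mx, b

ages = [55, 65, 75, 85]                    # hypothetical participants
utility_ai_reader = [0.4, 0.2, 0.0, -0.2]  # hypothetical part-utilities
a, b = simple_ols(ages, utility_ai_reader)
print(round(a, 3), round(b, 3))  # 1.5 -0.02 (utility falls with age here)
```

In the study proper, one such model was fitted per factor level (eleven in total), each with all five participant characteristics entered simultaneously.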

Results

A total of 374 participants responded to the survey; 193 did not fully complete it and were excluded, leaving 181 who fully completed the survey including the ranking task. Of the 181 responders, 43% had wet AMD, 35% had dry AMD, and 7% had no macular disease.

Eighty-eight participants provided free-text comments. Seven of the comments related to finding the ranking task too difficult. The commonest comments were positive about AI, citing benefits in consistency, speed of response, accuracy (though a number highlighted that this would need to be rigorously assessed, along with the need for further research) and freeing professionals’ time for more direct care. Perhaps the single most frequent comment was that AI should assist humans, with humans making the final decision. There was a consistent minority view that humans should make decisions about humans.

In this study, 181 individuals completed the task; the results are summarised in Table 1 and Fig. 1A. The two most highly valued factors were accuracy of reporting and having a second reader or checker. The least important factor was whether the initial reporting was done by a human or by AI. Similarly, when considering a second checker (see Fig. 1B), respondents did not perceive a difference in utility between a human and an AI second reader.

Fig. 1: Conjoint analysis output for factors and error checking.

A Figure showing the relative importance of the four factors. B The utility values associated with having a second reader or checker, showing that having the results checked would be welcomed but that there was no significant difference between whether it was done by a human or by AI.

Table 1 Mean utility score and importance of each factor and factor level.

The part utility scores were saved and used as the dependent variables in linear regression models with age, gender, having treatment, knowledge of AI and level of education as the independent variables. The aim was to see whether any of these predicted participants’ values. There was a total of 11 factor levels (two for whether the initial reader was human or AI, three for speed of turnaround, three for the error levels chosen and three for checking).

No assumptions were made about the ordering of the factor levels and, as can be seen for checking (Fig. 1B), there is no simple relationship between the three levels. Each factor level’s part utility was entered into a linear regression as the dependent variable, with age, gender, whether the participant had had intravitreal injections, level of education and knowledge about AI as the independent variables. Seven of the eleven models showed no significant associations and four did (Tables 2–4).

Table 2 Linear regression model predicting the utility of an error rate of 1 in 5 (top) and an error rate of 1 in 20 (bottom).
Table 3 Linear regression model predicting the utility of an AI checker.
Table 4 Linear regression model predicting the utility for no checker.

AI would appear to be very acceptable, with a non-significant trend in favour of a human making the decisions (mean utility score for AI and human respectively: –0.012, 0.012). Those who claimed to know most about AI trusted it least, and women were less trusting than men. Table 3 showed a negative regression coefficient between awareness of AI and the use of AI as a checker, with a corresponding positive coefficient between knowledge and having no checker at all (Table 4).

Discussion

The results of this conjoint analysis show that participants were most concerned about the accuracy of test interpretation and about having the results checked. The free comments were in line with this finding, with one of the more common comments suggesting that the best combination was a human making the final decision, supported by AI.

There were several comments suggesting that AI should not be used at all. However, this was not the finding from the ranking analysis, which showed that whether the initial reader was human or AI was of only minor importance compared to the error rate and the presence of a “checker” or second reader. When considering the response to having a second reader, or checker, it was notable that participants valued having decisions checked, yet there was almost no difference as to whether the checker was human or AI.

There was a trend favouring rapid reporting.

There was no association between the views of the participants and their age, gender, level of education or whether they were undergoing treatment. There was a weak negative association between the utility of AI as a checker and awareness of AI. This suggests that increasing awareness of AI may not necessarily improve acceptance.

There was a negative association between risk aversion and education level, with more highly educated participants showing a higher risk tolerance; correspondingly, there was a negative correlation between education level and the utility attached to error rate. Higher educational level is associated with both improved health outcomes and a higher expectation of a patient-centric standard of care [15, 16].

Interestingly, a similar association has been noted in economics, where a link between risk tolerance and education was also found [17]. Acceptable error rates in the clinical application of AI need careful consideration. Recent studies of AI in clinical settings have shown that patients are less trusting of AI than of a human rater, holding AI to higher standards in relation to error [18]. This needs to be incorporated into the design of clinical pathways, as well as into communication around AI-assisted decision making. The error rates chosen for this study represent reasonable thresholds discussed in the literature for both clinical domains and clinical applications of AI: for example, radiology (6.8% acceptable error rate for AI, 11.3% for humans), large language model based AI scribes (1–3%) [19], and human error in radiology diagnosis with postmortem confirmation of cases (20%) [20]. There is great variability across fields and techniques, and a balance of reasonable error thresholds was chosen to reflect this variance.

The study has a number of limitations. Firstly, the participants were self-selecting, many found the ranking task difficult, and the survey was only available online, thereby excluding those without access to the internet. Secondly, only 48% of those who visited the site completed the ranking task. This may have been due to the length or complexity of the survey, especially for the many participants with visual loss. Contact details for help were provided for those who had difficulty; however, no participants used them. Despite the relatively low completion rate, an advantage of online surveys is the relatively large number of participants they can reach. Nevertheless, the present sample represents a small percentage of the clinical population and as such has limited generalisability; larger, more in-depth studies will be needed to aid the design of clinical pathways and the oversight of AI systems in ophthalmology.

Finally, the participants may not be representative of the general macular disease population, as 39% of the participants had a college or university level education and 72% were aware of AI and what it is.

Conclusion

Patients with macular disease find AI to be acceptable in the assessment of retinal images. Patients prioritise accuracy, speed, and the presence of a checking mechanism when considering the adoption of AI in macular disease treatment pathways. These results should inform the development of patient focused policies for assured AI adoption.

Summary

What was known before

  • AI is considered to present key opportunities in retinal medicine

  • Detection and diagnosis of conditions can be made more efficient with the use of AI tools

  • Radiology and oncology have explored patient views of AI decision making, but macular treatment decisions had not been explored

What this study adds

  • This study explores what patients view as acceptable for AI use in macular clinics

  • This study adds the patient perspective of AI being acceptable in macular treatment decisions

  • This study demonstrates a priority for accuracy from the patient perspective, with a tolerance for error in line with other clinical fields