Introduction

Artificial intelligence (AI) can be classified into ‘assistive’ or ‘autonomous’1. With assistive AI, the physician is prompted with AI-generated insights after which he/she decides whether and how to use the prompt(s). With autonomous AI, the AI agent suggests a course of action which is the default action plan unless vetoed by the physician (‘conditional automation’). When patient care is at stake, acceptance of an AI algorithm for treatment decision-making in prospective, real-world clinical settings may deviate from the estimated level of acceptance based on retrospective or simulated evaluation2. To date, prospective trials of autonomous AI are often limited to image-based diagnosis, radiation therapy planning, behavioral nudges, or a chatbot for post-procedural follow-up2,3,4,5,6,7,8,9,10,11,12. Previously, AI models for deciding treatment strategies have been constructed, but they are rarely applied prospectively in clinical settings13,14,15. We hypothesize autonomous AI may be useful in pharmaceutical intervention when the physician believes that there is otherwise no reliable means to determine whether to initiate medication, that the consequence of inaction can be dangerous, and that the intervention may be harmful or costly.

There are many unanswered questions regarding use of autonomous AI to prescribe drugs. For instance, how do we embed an autonomous AI algorithm into the patient care pathway? Does the autonomous AI agent make decisions on which drug, which dose, and/or which schedule? Will physicians and patients agree to participate in a trial wherein an autonomous AI agent makes drug prescription decisions? In addition, we anticipate automatic ingestion of multi-modal co-variate data directly into the model would be necessary for acceptance and successful implementation of autonomous AI. Whether this is broadly feasible is challenged by the complex organization of hospital information systems16.

In this work, to test the feasibility of using autonomous AI to prescribe a drug we choose a setting after a haematopoietic cell transplant (HCT) where the aim is to prevent severe (grade 3−4) acute graft-versus-host disease (acute GvHD). Accurately predicting risk of developing severe acute GvHD is challenging with no widely-accepted model. Survival of patients with severe acute GvHD could be < 50% within one year17. The question is how to identify intermediate- and high-risk patients and intervene.

We conduct a proof-of-concept, phase-2 study of using ‘daGOAT’, an algorithm we have developed18, to monitor dynamic changes in 141 common clinical co-variates and prescribe a drug when the model identifies a participant to be at risk of developing severe acute GvHD. Our data show that physicians agree to use daGOAT in 85% of eligible patients and, in turn, 88% of these potential participants agree to use the model. Compliance with AI prescription is 98% initially, with few deviations from the AI-prescribed dose and/or schedule within one month. To sum up, many physicians and patients are receptive to conditional autonomous AI prescription.

Results

Model update

Retrospectively we used data of 723 human leukocyte antigen (HLA)-mismatched transplants from the NICHE-GOAT cohort at the Institute of Hematology, Chinese Academy of Medical Sciences (IHCAMS) to optimize the daGOAT model19.

After optimization, daGOAT (Fig. 1a) includes 14 peri-transplant features (recipient age, sex, body mass index, primary disease, donor type [HLA-haploidentical versus HLA-mismatched unrelated donor [MMUD]], sex match, ABO match, HLA match, graft source, pretransplant conditioning regimen, peri-transplant use of anti-thymocyte globulin [ATG], baseline regimen for acute GvHD prophylaxis, CD34-positive cell dose, and total nucleated cell dose) and 141 dynamic co-variates (complete blood count [CBC; 36 co-variates], blood biochemistry [44 co-variates], electrolytes [7 co-variates], blood cell flow cytometry [36 co-variates], and cytokines [18 co-variates]; Supplementary Data 1). In retrospective cross-validation experiments, AI-designated ‘high-risk’ had a subdistribution hazard ratio (sHR) = 14.3 (95% confidence interval [CI], 4.5–45.1) for severe acute GvHD compared with ‘low-risk’ in the validation set whilst for ‘intermediate-risk’ sHR = 7.3 (95% CI, 3.0–17.6) (Fig. 1b).

Fig. 1: Model update using retrospective data.

a The daGOAT model evaluates 14 peri-transplant features and 141 dynamic co-variates with three hyperparameters: \(\tau_{\mathrm{retrospective}}=3\) days (how far back the model looks at past values of dynamic co-variates when calculating the risk score for severe acute graft-versus-host disease [acute GvHD]), \(T_{\mathrm{active}}^{\mathrm{start}}=+17\) days posttransplant (the day the model starts reporting severe acute GvHD risk), and \(T_{\mathrm{active}}^{\mathrm{end}}=+23\) days posttransplant (the last day the model reports severe acute GvHD risk). For instance, if ‘today’ were day +22 posttransplant, daGOAT would integrate peri-transplant features and dynamic co-variates from day +19 to day +22 to compute the risk of developing severe acute GvHD. b Performance of daGOAT-computed risk-stratification on the validation set (1 December 2020–31 December 2021; n = 203) whilst the model was fitted using only data from the training set (1 April 2012–30 November 2020; n = 519). Purple, high-risk (n = 12); pink, intermediate-risk (n = 73); blue, low-risk (n = 118). Pairwise comparisons are performed using the two-sided Fine-Gray test. p-values are not adjusted for multiple-hypothesis testing. c Data densities of dynamic co-variates in the retrospective cohort. d Information contributed by peri-transplant features and separate categories and various combinations of categories of dynamic clinical co-variates. Source data for (b–d) are available online.

Mutual information between co-variates and risk-stratification

Despite the large number of dynamic co-variates included in daGOAT, data density was low for most co-variates in the retrospective dataset (Fig. 1c; Supplementary Data 1). When only a single category of dynamic co-variates was included in the model (that is, when the model was reduced to a highly truncated version), choosing CBC, blood cell flow cytometry, or blood biochemistry data led to higher mutual information between the risk-stratification computed by the truncated model and that computed by the complete model than choosing either of the other two categories (Fig. 1d).
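To make the comparison concrete, the following is a minimal sketch of how mutual information between two risk-stratifications could be computed. It is an illustration only and not the authors' implementation: the function name `mutual_information`, the use of bits as the unit, and the toy label lists are all assumptions.

```python
from collections import Counter
from math import log2


def mutual_information(labels_a, labels_b):
    """Mutual information (in bits) between two equal-length label sequences."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must have equal length")
    n = len(labels_a)
    joint = Counter(zip(labels_a, labels_b))  # joint counts of (a, b) pairs
    marg_a = Counter(labels_a)                # marginal counts of a
    marg_b = Counter(labels_b)                # marginal counts of b
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        # p_ab / (p_a * p_b) simplifies to count * n / (marg_a[a] * marg_b[b])
        mi += p_ab * log2(count * n / (marg_a[a] * marg_b[b]))
    return mi


# Hypothetical risk strata for six patients (not trial data).
complete = ["low", "low", "intermediate", "high", "low", "intermediate"]
truncated = ["low", "low", "intermediate", "intermediate", "low", "intermediate"]
print(mutual_information(complete, truncated))
```

A truncated model whose strata agree more closely with those of the complete model yields a larger value; two statistically independent stratifications yield zero.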

Conditional autonomous AI-driven prescription

We deployed daGOAT as a conditional autonomous AI agent on the hospital intranet and connected it to the hospital information system (Fig. 2a). From day +17 to day +23, each day at 1720 h the model would autonomously extract participant data from the hospital information system and classify each participant as being at low, intermediate, or high risk of developing severe acute GvHD (Fig. 2b). If a participant were classified as high-risk, his/her risk-stratification would stay at high-risk regardless of subsequent events. If a participant were classified as intermediate-risk, his/her risk level would stay at intermediate-risk unless it were subsequently revised to high-risk. Any adjustment of risk-strata would be communicated to the clinical study coordinator and attending physician at 1721 h via a dashboard in the desktop daGOAT information system and via alert messages sent by mobile-phone short message service (SMS; Fig. 3).
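The ‘never step down’ rule described above can be sketched as follows. This is a hedged illustration, not the deployed code: the names `RANK`, `FIRST_ACTIVE_DAY`, and `stratum_changes` are hypothetical, and we assume a participant starts at ‘low’ before the first daily call.

```python
RANK = {"low": 0, "intermediate": 1, "high": 2}
FIRST_ACTIVE_DAY = 17  # day +17 posttransplant, as stated in the text


def stratum_changes(daily_calls):
    """Apply the sticky rule to a list of daily model calls (day +17 onwards).

    Returns the final stratum and a list of (day, new_stratum) tuples, one
    for each day on which the stratum is up-staged (the events that would
    trigger an alert).
    """
    stratum = "low"  # assumed starting point before the first call
    changes = []
    for offset, call in enumerate(daily_calls):
        if RANK[call] > RANK[stratum]:  # only upward revisions are kept
            stratum = call
            changes.append((FIRST_ACTIVE_DAY + offset, stratum))
    return stratum, changes


# Hypothetical sequence of seven daily calls (days +17 to +23).
print(stratum_changes(["low", "intermediate", "low", "high", "low", "low", "low"]))
```

In this toy sequence the stratum is up-staged on days +18 and +20 and remains ‘high’ thereafter, even though later daily calls return ‘low’.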

Fig. 2: Deploying daGOAT as an autonomous artificial intelligence agent in the posttransplant care pathway.

a Interactions among the participant, clinician, researcher, hospital information system, and daGOAT. b Work flow of the prospective trial. Acute GvHD, acute graft-versus-host disease; HCT, haematopoietic cell transplant; HLA, human leukocyte antigen.

Fig. 3: Mock examples of daGOAT-generated alerts.

a Alert via mobile-phone short message service. Translation: “Short message service (July 17, Wednesday, 1721 h): [daGOAT] Today the daGOAT model is monitoring three participants, of which one (patient name: Fa Hai) is high-risk, one (Xu Xian) is intermediate-risk, and one (Bai Suzhen) is low-risk.” b Alert via a dashboard on the desktop computer. “返回患者中心”, back to the patient list; “姓名”, name; “性别”, sex; “就诊年龄”, age at transplant; “诊断”, primary disease; “修改信息”, edit patient information; “病程概览”, overview of the posttransplant clinical course; “中风险”, intermediate-risk; “高风险”, high-risk; “术后第__天”, day +__ posttransplant. Displayed patient names are made-up and do not correspond to real names of trial participants.

A model-designated high-risk participant was to be given oral ruxolitinib, 5 mg twice daily, until ≥day +60 posttransplant. A model-designated intermediate-risk participant was to be given oral ruxolitinib, 2.5 mg twice daily, until ≥day +60 posttransplant; if his/her risk-stratification were later up-staged to high-risk, the ruxolitinib dose would be increased to 5 mg twice daily. Physicians were directed to attempt discontinuing ruxolitinib by day +100 posttransplant. The protocol stipulated that the physician would regain full control of all intervention decisions once severe acute GvHD occurred (Fig. 2b).
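The dose rule can be written as a small lookup combined with up-staging. The sketch below is illustrative only: `DOSE_MG_TWICE_DAILY` and `daily_doses` are hypothetical names, a starting stratum of ‘low’ is assumed, and the physician veto, the ≥day +60 duration, and the hand-back of control after severe acute GvHD are deliberately not modelled.

```python
from typing import List, Optional

RANK = {"low": 0, "intermediate": 1, "high": 2}
# Per-administration ruxolitinib dose (mg, given twice daily) from the protocol text;
# None means no drug is prescribed by the model.
DOSE_MG_TWICE_DAILY = {"low": None, "intermediate": 2.5, "high": 5.0}


def daily_doses(daily_calls: List[str]) -> List[Optional[float]]:
    """Map daily model calls to the model-prescribed dose on each day."""
    doses = []
    stratum = "low"  # assumed starting point
    for call in daily_calls:
        if RANK[call] > RANK[stratum]:  # up-staging raises the dose; no step-down
            stratum = call
        doses.append(DOSE_MG_TWICE_DAILY[stratum])
    return doses


# Hypothetical calls: intermediate-risk on the second day, up-staged on the fourth.
print(daily_doses(["low", "intermediate", "low", "high", "low"]))
# [None, 2.5, 2.5, 5.0, 5.0]
```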

Participant enrollment

During the enrollment period 152 patients receiving HLA-mismatched transplants of granulocyte-colony stimulating factor (G-CSF)-mobilized blood-derived haematopoietic cells were eligible for participation. The physicians excluded 23 (15%) patients. The most common reason given was ‘slow haematopoietic recovery’ (n = 10), followed by ‘not in histologic complete remission (CR) or having detectable residual disease’ (n = 6 [acute leukaemia, n = 4; myelodysplastic syndromes, n = 2]), ‘unable to swallow medicine’ (n = 5), ‘human immunodeficiency virus (HIV) infection’ (n = 1), and ‘no given reason’ (n = 1). Of the remaining 129 eligible patients, 15 (12%) declined to participate in the trial. In summary, 114 (75%) of the 152 eligible patients participated in the trial whereas 38 (25%) did not. All the participants received myelo-ablative conditioning and baseline acute GvHD prophylaxis using ATG 2.5 mg/(kg·d) during days –4 to –1.

Because the predominant majority (96% [110/114]) of the transplants included in the trial were HLA-haploidentical whilst a small minority (4% [4/114]) were HLA-mismatched unrelated donor (MMUD) transplants, we focused on the 110 HLA-haploidentical transplants – hereafter referred to as the ‘AI focus group’ – in our analyses (Fig. 4a; Table 1).

Fig. 4: Execution of the prospective trial.

a Patient screening and participant enrollment. AI, artificial intelligence; HLA, human leukocyte antigen; MMUD, HLA-mismatched unrelated donor. b Densities of autonomously-extracted dynamic co-variate data in the AI focus group (n = 110). c daGOAT-computed risk-stratification in the AI focus group. Participant numbering does not correspond to the participants’ chronological order of receiving transplants. d Compliance with AI prescriptions in the AI focus group. Acute GvHD, acute graft-versus-host disease. Source data for (b) are available online.

Table 1 Baseline characteristics of patients

Dynamic co-variate data in the participants

Densities of autonomously-extracted dynamic co-variate data between days +14 and +23 posttransplant in the AI focus group (Fig. 4b) were largely comparable to data densities during the same posttransplant time interval in the retrospective data used to train the model (Fig. 1c). In the prospective trial, zeros in laboratory test results were occasionally recorded as ‘-’ rather than ‘0’ in the hospital information system and as a consequence daGOAT failed to extract all the zeros. daGOAT did not halt or give error messages because of these missing values; rather, it continued updating risk-stratification based on the data that were successfully extracted.
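The behaviour described above, in which an unparseable entry is treated as missing rather than as an error, can be sketched as follows. This is an assumption-laden illustration and not the daGOAT extraction code: `parse_lab_value` and `availability_indicator` are hypothetical names, and the indicator simply mirrors the availability indicator \(I_{k\tau}\) of Eq. (1) in Methods.

```python
from math import isfinite
from typing import Optional


def parse_lab_value(raw: Optional[str]) -> Optional[float]:
    """Convert a raw result string to a float, or None if it cannot be parsed."""
    if raw is None:
        return None
    try:
        value = float(raw.strip())
    except ValueError:
        return None  # e.g. '-' or an empty string is treated as missing
    return value if isfinite(value) else None


def availability_indicator(raw: Optional[str]) -> int:
    """Return 1 if the value is usable and 0 if it is treated as missing."""
    return 0 if parse_lab_value(raw) is None else 1


# A zero recorded as '-' is dropped, whereas a zero recorded as '0' is kept.
print(parse_lab_value("-"), parse_lab_value("0"))
```

Under this sketch a zero stored as ‘-’ contributes nothing to the risk score, which is consistent with the reported behaviour; whether such entries should instead be mapped to zero is a site-specific decision that the sketch does not make.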

Risk-stratification of the participants

Median times to granulocyte and platelet recovery were +12 and +14 days posttransplant, respectively (Supplementary Fig. 2). Cumulative incidences of granulocyte and platelet recovery were 100% (110/110) and 98% (108/110), respectively, by day +100. No participant developed severe acute GvHD on or before day +17. According to daGOAT, 53 (48%), 39 (35%), and 18 (16%) participants in the AI focus group were at low, intermediate, and high risk, respectively, of developing severe acute GvHD (Fig. 4c). In 56 (98%) of the 57 intermediate-to-high-risk participants, daily daGOAT-computed risk scores were steady or escalated progressively from day +17 to day +23. In one participant (#109), the risk score fluctuated between ‘high’ and ‘low’ between days +17 and +23; by protocol his/her risk-stratification was designated by daGOAT as ‘high-risk’.

Compliance with autonomous AI drug prescriptions

None of the low-risk participants in the AI focus group took ruxolitinib except for two persons (Participants #52 and #53) who started ruxolitinib after they developed severe acute GvHD (Fig. 4d). Fifty-six (98% of 57) intermediate- to high-risk participants immediately started ruxolitinib when prescribed by daGOAT (Fig. 4d). In one intermediate-risk participant (#55), the physician started giving ruxolitinib one day after daGOAT started prescribing. In seven additional participants the physicians deviated from the AI-prescribed dose or schedule within one month after AI prescription started: In two intermediate-risk participants (#56 and #90), physicians increased their dose to ≥10 mg/d after grade-2 acute GvHD was diagnosed. In three high-risk participants, physicians prescribed ≤5 mg/d because of concerns about positive measurable residual disease (MRD)-status pretransplant (Participants #109 and #110) or pancytopenia that started pretransplant (Participant #103). In one intermediate-risk participant (#87), dose was decreased to 2.5 mg/d on day +31 because of Stenotrophomonas bacteraemia. In another intermediate-risk participant (#92), ruxolitinib was discontinued early because his/her MRD-status became positive on day +33.

In the AI focus group there were 52 participants with acute leukaemia and negative MRD-status pretransplant. Two of them became MRD-positive before day +100: one low-risk participant (#47) and one intermediate-risk participant (#92). By comparison, in the co-variate-matched controls, three of the 121 acute leukaemia patients who were MRD-negative pretransplant became MRD-positive before day +100. The probability of ‘worsening’, that is, conversion from MRD-negative to MRD-positive before day +100, was comparable between the AI focus group and co-variate-matched controls (4% [2/52] versus 2% [3/121]; p = 0.64; two-sided Fisher’s exact test).
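The reported p-value can be reproduced from the counts given above. The following is a minimal standard-library sketch of a two-sided Fisher’s exact test (summing the probabilities of all tables that are no more likely than the observed one); the function name is ours and we do not claim it is the software used in the study.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(k):
        # Hypergeometric probability of k events in the first row.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table at most as probable as the observed one
    # (small relative tolerance guards against floating-point ties).
    return sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= p_obs * (1 + 1e-9))


# MRD conversion: 2 of 52 in the AI focus group versus 3 of 121 in controls.
p = fisher_exact_two_sided(2, 50, 3, 118)
print(round(p, 2))  # 0.64
```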

Outcomes in the participants

The pre-specified primary clinical endpoint was cumulative incidence of severe acute GvHD. Six participants (5.5% of 110) in the AI focus group developed severe acute GvHD, compared with 16% (41/252) in the co-variate-matched controls (Fig. 5a). Cumulative incidences of ≥stage 2 lower-gastrointestinal (GI), ≥stage 2 hepatic, and ≥stage 3 cutaneous acute GvHD were 5.5% (6/110), 3% (3/110), and 4% (4/110), respectively, in the AI focus group (Fig. 5b–d). Results of sensitivity analyses and a sub-group analysis of HLA-haploidentical transplants with HLA-mismatch ≥3/10 are displayed in Supplementary Fig. 3.

Fig. 5: Clinical outcomes.

a Severe (grade 3−4) acute GvHD. Blue, AI focus group (n = 110); red, co-variate-matched controls (n = 252). Pairwise comparison is performed using the two-sided Fine-Gray test. p-value is not adjusted for multiple-hypothesis testing. Acute GvHD, acute graft-versus-host disease; AI, artificial intelligence. b ≥stage 2 lower-gastrointestinal acute GvHD. c ≥stage 2 hepatic acute GvHD. d ≥stage 3 cutaneous acute GvHD. e Comparison of daGOAT-computed daily severe acute GvHD risk scores of the participants who developed severe acute GvHD in the AI focus group and the daily risk scores of the other participants who did not develop severe acute GvHD. Displayed daily risk scores are outputs of the ‘component model’ trained using severe acute GvHD as the regression target (Methods). The box-and-whisker plots indicate the distributions of daily risk scores of the 104 participants who did not develop severe acute GvHD (disaggregated into three risk-strata: high-risk, n = 16 [purple]; intermediate-risk, n = 37 [pink]; low-risk, n = 51 [blue]), whilst the solid circles indicate the individual daily scores of the six participants who later developed severe acute GvHD (disaggregated into three risk-strata: high-risk, n = 2 [purple]; intermediate-risk, n = 2 [pink]; low-risk, n = 2 [blue]). In the box-and-whisker plots, the box indicates the 25th-, 50th-, and 75th-percentile values whilst the whiskers indicate the ranges of values. Because only a small number of participants (n = 6) in the AI focus group developed severe acute GvHD, no statistical test is performed. Source data for (a–e) (including the individual data points underlying the box-and-whisker plots in (e)) are available online.

Between days +24 and +100 posttransplant, 66% (73/110) and 65% (72/110) of the participants in the AI focus group had haematologic and non-haematologic abnormalities, respectively, and 37% (41/110) had infection (Table 2; Supplementary Fig. 4). 18% (10/57) and 18% (10/57) of the daGOAT-designated intermediate-to-high-risk participants in the AI focus group had severe neutropenia and severe thrombocytopenia, respectively, between days +24 and +100, compared with 34% (18/53) and 45% (24/53) in the low-risk participants. The observed lower frequency of cytopenia in the intermediate-to-high-risk participants compared with the low-risk participants could be attributed to higher mean concentrations of neutrophils and platelets in blood during days +17 to +23 in the intermediate-to-high-risk participants compared with the low-risk participants (Supplementary Fig. 1), which – we speculate – might have offset the anticipated cytopenia side-effects of ruxolitinib. No immune flare (for example, cytokine storm) was observed after ruxolitinib-discontinuation in any person. Frequencies of most abnormalities and infection types were comparable between the AI focus group and co-variate-matched controls (Table 2). However, cumulative incidences of severe thrombocytopenia (31% versus 43%; p = 0.04), high aspartate transaminase level (7% versus 21%; p = 0.001), and haemorrhagic cystitis (18% versus 31%; p = 0.01) were lower in the AI focus group compared with the co-variate-matched controls. These observed differences could be attributed to faster platelet recovery in the AI focus group compared with the co-variate-matched controls (Supplementary Fig. 2), chance, incomplete co-variate-matching, and/or other explanations.

Table 2 Haematologic and non-haematologic abnormalities and infections between days +24 and +100 posttransplant

In the AI focus group, one-year cumulative incidence of chronic GvHD was 36% (95% CI, 19–53%), 47% (95% CI, 29–63%), and 23% (95% CI, 7–45%) in the daGOAT-designated low-, intermediate-, and high-risk participants, respectively. One-year cumulative incidence of relapse was 8% (95% CI, 3–16%) in the AI focus group overall. One-year survival was 90% (95% CI, 84–96%; Supplementary Fig. 5).

Post-hoc analyses suggest the six participants with severe acute GvHD in the AI focus group could not be readily distinguished from the other 104 participants in the AI focus group based on their daGOAT-computed risk scores between days +17 and +23 (Fig. 5e). All six cases of severe acute GvHD were steroid-refractory, and their second-line treatment included ruxolitinib (n = 6), basiliximab (n = 4), mesenchymal stromal cells (n = 3), infliximab (n = 1), and vedolizumab (n = 1) (Fig. 6). Of these six participants, one daGOAT-designated high-risk participant (#110) died on day +49 posttransplant, and one intermediate-risk participant (#63) died on day +190 posttransplant. The other four achieved sustained CR of acute GvHD on days +53 (Participant #107), +57 (#62), +76 (#53), and +106 (#52) posttransplant, respectively (Fig. 6).

Fig. 6: Clinical course of severe acute GvHD in the AI focus group.

The clinical course of the six participants who were in the AI focus group and developed severe acute GvHD is displayed in detail. Participant numbering is identical to that in Fig. 4. The first day of steroid-therapy is ‘day 1’ in this graph, whilst the previous day (that is, the day immediately before day 1) is ‘day –1’. Staging of acute GvHD in individual organs is indicated by numerals (0–4). Check marks (✓) indicate drug use. Acute GvHD, acute graft-versus-host disease; AI, artificial intelligence.

During the enrollment period four participants receiving MMUD transplants also utilized daGOAT-driven drug prescription (Fig. 4a). daGOAT prescribed ruxolitinib for two of them; both immediately complied with AI prescriptions and did not develop severe acute GvHD. However, one low-risk participant developed severe acute GvHD (he/she reached sustained CR at day +116 posttransplant). Were we to include the four MMUD transplants in our analysis, cumulative incidence of severe acute GvHD would be 6.1% (7/114; 95% CI, 2.7–11.6%) in all the participants who utilized AI prescription, compared with 16% (42/264) in co-variate-matched controls.

Post-hoc analyses of the non-participating patients

We also reviewed the 38 eligible patients who did not participate in the trial (Fig. 4a). All 38 received HLA-haploidentical transplants. Baseline characteristics of these non-participating patients were overall comparable to the AI focus group (Table 1). However, exploratory analysis suggests that eligible patients whose blood granulocyte concentration did not recover by day +12 posttransplant were less likely to participate compared with those with earlier granulocyte recovery (66% participating [52/79] versus 85% participating [62/73]; p = 0.008; odds ratio [OR] for participation = 0.34). Two of the 38 patients died before granulocyte recovery. Among the 38 non-participating patients there were five (13%) who developed severe acute GvHD; three of them were excluded by the physicians from trial enrollment (‘unable to swallow medicine’, n = 2; ‘acute leukaemia not in CR’, n = 1) whereas the other two patients refused to participate.
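The odds ratio quoted above follows directly from the two participation fractions. A minimal sketch of the arithmetic (the function name is illustrative):

```python
def odds_ratio(events_a, total_a, events_b, total_b):
    """Odds of the event in group A divided by the odds of the event in group B."""
    odds_a = events_a / (total_a - events_a)
    odds_b = events_b / (total_b - events_b)
    return odds_a / odds_b


# Participation: 52 of 79 without granulocyte recovery by day +12
# versus 62 of 73 with earlier recovery.
or_participation = odds_ratio(52, 79, 62, 73)
print(round(or_participation, 2))  # 0.34
```

That is, (52/27) ÷ (62/11) ≈ 0.34, in agreement with the reported value.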

Clinicians’ attitudes towards AI-driven prescription

Sixteen physicians and 46 nurses participated in trial conduct. They were surveyed regarding their attitudes towards conditional autonomous AI-driven prescription (Table 3). Fifteen (94%) physicians stated that the main benefit of daGOAT-driven, targeted pre-emptive drug intervention was the decrease of severe acute GvHD incidence. Five (31%) physicians mentioned that because of autonomous-AI monitoring of severe acute GvHD risk they could focus on other clinical problems. Five (31%) physicians expressed that the main drawback they encountered in the daGOAT trial was the need to closely monitor the participants who took ruxolitinib. Thirty percent (3/10) of the senior physicians (with ≥10 years of transplantation experience) said they still did not believe an AI model can decide when to prescribe medication to pre-empt severe acute GvHD. Forty-four (96%) nurses said that the main drawback was the workload required of them to explain to the participants risks and benefits of using daGOAT (n = 28), train the participants on how to take low-dose ruxolitinib (n = 32), or monitor the participants’ compliance with AI prescriptions (n = 38).

Table 3 Clinicians’ attitudes towards conditional autonomous AI-driven prescription

Discussion

Our proof-of-concept study interrogated two critical issues in medical AI.

First, how do we derive the maximal amount of value from the considerable data collected in a patient? New research indicates that even CBC, one of the most commonly-done laboratory tests in clinical medicine, has valuable information hitherto untapped20. We reason that common laboratory test data in transplant recipients are plausibly under-utilized also.

Second, what is the maximally plausible level of AI autonomy? Even ‘conditional automation’, the lowest level of AI autonomy, is rarely tested in drug prescription. Using the keyword phrase ‘autonomous artificial intelligence’ to query the PubMed database we identified 40 English-language medicine-related articles published before 24 February, 2025, of which ten were prospective clinical trials6,8,9,10,11,21,22,23,24,25. One study used an autonomous AI conversational agent for follow-up assessment of post-cataract surgery, whilst the other nine studied the role of autonomous AI in image-based diagnosis of diabetic retinopathy or colorectal polyps. However, not all studies of autonomous AI use the phrase ‘autonomous artificial intelligence’. A systematic review12 found that studies investigating the use of AI in cancer care following diagnosis are often about radiation therapy planning or behavioral nudges2,3,4,5,7. Currently, little is known about the feasibility of autonomous AI in pharmaceutical intervention. This contrasts with the vision that future clinicians ‘will increasingly interact with task-specific and domain-specific AI systems across a continuum of automation’1.

In this study we evaluate a conditional autonomous AI agent monitoring high-dimensional laboratory data and prescribing a drug to prevent severe acute GvHD in the setting of HLA-haploidentical transplants. Our success in trial conduct suggests the possibility of establishing trusting and working relationships among patients, clinicians, researchers, and an autonomous AI agent, provided that the algorithm is transparent, that its deployment does not cause inconvenience or disruption in the patient care pathways, that the status quo (that is, without using AI) is not ideal, and that people are convinced dismissing AI prescriptions can have detrimental consequences in some patients. Our study provides a paradigm for applying conditional autonomous AI in pharmaceutical intervention under current regulatory and ethical frameworks.

Some previously-reported predictive models for severe acute GvHD rely on peri-transplant features. These models have an area under the receiver operating characteristic curve (AUROC) of about 0.6 (refs. 26,27). Prediction of severe acute GvHD onset remains challenging despite availability of acute GvHD biomarkers that can be used to predict prognosis of acute GvHD after its onset or prognosis of steroid-refractory acute GvHD after the first-line treatment has failed28,29,30,31. The daGOAT model uses a different approach. We hypothesize subtle, concerted change patterns of high-dimensional dynamic co-variates before severe acute GvHD onset can be detected by an AI model trained to be a specialist for early warning18,32,33. A major challenge we encountered was the scarcity of data available for training the model. Large-cohort datasets of dynamic clinical co-variates are rare except in the context of intensive care14,34,35,36. Nevertheless, using computational techniques to address the ‘large p, small n’ problem and consistent with what we have reported elsewhere18, we show it is feasible to fit a very high dimensional (‘large p’) decision-making model using data from a limited-sized (‘small n’) cohort of ≈700 patients. Our AI model is tolerant of missing data, does not require a uniform data collection protocol to support its operation, and has an easy-to-use interface for physicians. Moreover, our training algorithm, the bulk of training data, and the final updated model are all open-source.

Median frequency of severe acute GvHD in intermediate-to-large-cohort (>170 cases) studies of HLA-haploidentical transplants without using pre-emptive ruxolitinib is ≈10% in recent years. There is variance around the median value, with reported frequency as low as 5% and as high as 20% (Supplementary Table 1)37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55. At our center, cumulative incidence of severe acute GvHD in haploidentical transplants was 16% during 2017–2021. With the use of daGOAT, frequency of severe acute GvHD was 5.5% in the AI focus group in this study. However, it has not been definitively proven that a decrease in severe acute GvHD incidence can be effected by the proposed strategy of utilizing daGOAT to pre-emptively prescribe ruxolitinib in model-designated intermediate-to-high-risk patients. The lower frequency of severe acute GvHD in the AI focus group might reflect enhanced monitoring and extra care of participating patients or possible selection bias. Because every participant who was designated intermediate-to-high-risk in the AI focus group took ruxolitinib, our study design could not disentangle the added benefit of the ‘information content’ of the model from the added benefit of the ‘behavioral consequence’ of the model. Nonetheless, a notable strength of our proposed approach is its data-driven risk classification. For physicians who champion routine, blanket administration of pre-emptive ruxolitinib in all patients receiving haploidentical transplants, the utility of the daGOAT model is that patients identified as low-risk would not receive the drug unnecessarily. In contrast, for physicians who are hesitant to use pre-emptive ruxolitinib routinely, the model’s utility is that only patients identified as at risk would receive the drug, at a dose-rate dependent on model-computed risk level. At the time of writing, ruxolitinib has been approved for second-line treatment of steroid-refractory acute GvHD, but not for preventing acute GvHD.

Because cumulative incidence of severe acute GvHD in the high-risk participants was still >10% in the AI focus group, we speculate a higher ruxolitinib dose might be needed in the high-risk participants. Alternatively, it is also plausible ruxolitinib might not work in some high-risk participants regardless of dose and other drugs need to be explored. Post-hoc analyses suggest that the 141 dynamic co-variates included in daGOAT could not easily distinguish participants not benefiting from pre-emptive ruxolitinib intervention from those who benefit. When data on more co-variates and more participants become available, systematic analysis may be able to identify patient sub-groups on which the AI model underperforms56. Risk scores of the three risk-strata for severe acute GvHD appeared to converge after +21 days posttransplant in the AI focus group, plausibly because of ruxolitinib medication, data sparsity, and/or the need to include additional latent and/or extraneous co-variates. We suggest increasing and/or broadening data after day +21 might improve model performance. For example, including plasma levels of biomarkers such as glucagon-like peptide-2 (GLP-2) in the list of monitored dynamic co-variates might improve the model’s performance57. On the other hand, it is also plausible the model could be further optimized by narrowing the monitoring time window or reducing the number of co-variates included in the model.

The current study has limitations. First, our study was not designed to critically evaluate efficacy of ruxolitinib at the dose and schedule prescribed by daGOAT to prevent severe acute GvHD. Also, it was not designed to see if daGOAT outperforms physicians in deciding when and how to intervene to prevent severe acute GvHD. Second, we have not addressed the crucial issues of liability (that is, determining who should be held accountable when errors occur) and the institution of a closed-loop mechanism whereby the AI agent adjusts itself in response to physician-reported anomalies58. Third, median age of the participants in the AI focus group was 42 years (close to the median age of transplant recipients in China59). Whether the investigated strategy is applicable to patients >60 years old is undetermined. Fourth, biology of acute GvHD in haploidentical transplants could have differences across different transplant protocols and/or ethnic groups, and it might be necessary to refit or fine-tune the model for each separate situation.

In conclusion, our proof-of-concept study suggests physicians and patients can be receptive to conditional autonomous AI drug prescription. Generalizability of our conclusion requires testing of other drugs and/or in other clinical settings.

Methods

Clinical definitions

An HLA-mismatched donor is defined as a donor who is ≤9/10 HLA-matched at HLA-A, -B, -C, -DR and -DQ. An HLA-haploidentical donor is defined as a donor who is a first- to third-degree relative ≥5/10 HLA-matched other than an HLA-identical sibling. Time of attaining granulocyte recovery posttransplant is defined as the first day of ≥3 consecutive days of a granulocyte concentration of ≥500/μL in blood. Time of attaining platelet recovery posttransplant is defined as the first day of ≥3 consecutive days of a platelet concentration of ≥20,000/μL in blood. Severe acute GvHD is defined as acute GvHD with a peak severity of grade 3−4 according to the MAGIC criteria, that is, ≥stage 2 lower-gastrointestinal (GI) acute GvHD, ≥stage 2 hepatic acute GvHD, or stage 4 cutaneous acute GvHD60.

Model update

The daGOAT model18 updates the risk score \(\varphi \left(t\right)\) for severe acute GvHD according to

$$\varphi \left(t\right)=\rho \left({{{\bf{z}}}}\right)+\sum\limits_{\tau=t-{\tau }_{{{\rm{retrospective}}}}}^{\tau=t}\sum\limits_{k}{I}_{k\tau }{\theta }_{k}\left({x}_{k}\left(\tau \right),\tau \right), \tag{1}$$

where \({{{\bf{z}}}}\) is a vector of peri-transplant features; \({x}_{k}\left(\tau \right)\) is the \(k\)-th dynamic co-variate’s value at time \(\tau\); \({I}_{k\tau }=0\) if the value of \({x}_{k}\left(\tau \right)\) is unavailable, and \({I}_{k\tau }=1\) otherwise; \({\theta }_{k}\left(x,\tau \right)\) is a ‘smooth’ function that describes the relationship between the \(k\)-th dynamic co-variate and severe acute GvHD risk (technical details of the smoothing function have been described elsewhere18); \(\rho \left({{{\bf{z}}}}\right)\) and \({\theta }_{k}\left({x}_{k}\left(\tau \right),\tau \right)\) are contributions of \({{{\bf{z}}}}\) and \({x}_{k}\left(\tau \right)\), respectively, to the risk score \(\varphi \left(t\right)\); and \({\tau }_{{{\rm{retrospective}}}}\) (unit: days) denotes how far back the model looks at past values of dynamic co-variates when computing the risk score \(\varphi \left(t\right)\). Furthermore, two additional hyperparameters are introduced: \({T}_{{{\rm{active}}}}^{{{\rm{start}}}}\) (the day the model starts computing the risk score for severe acute GvHD [unit: days posttransplant]) and \({T}_{{{\rm{active}}}}^{{{\rm{end}}}}\) (the last day the model computes the risk score for severe acute GvHD [unit: days posttransplant]).
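As an illustration of how Eq. 1 handles missing data, the following minimal Python sketch sums the contributions of whichever dynamic co-variate values are available within the look-back window. The function name `risk_score`, the dictionary layout, and the toy contribution functions are assumptions made for illustration only; they are not the fitted daGOAT functions nor the released R code.

```python
from typing import Callable, Dict, Optional

# Hypothetical containers: observations[k][day] holds x_k(day); None or a
# missing key means the value is unavailable (I_{k,tau} = 0).
Observations = Dict[str, Dict[int, Optional[float]]]
Theta = Dict[str, Callable[[float, int], float]]


def risk_score(t: int, rho_z: float, observations: Observations,
               theta: Theta, tau_retrospective: int) -> float:
    """Evaluate Eq. 1 for day t: rho(z) plus the contributions of all available
    dynamic co-variate values on days t - tau_retrospective .. t (inclusive)."""
    score = rho_z
    for tau in range(t - tau_retrospective, t + 1):
        for k, theta_k in theta.items():
            x = observations.get(k, {}).get(tau)
            if x is None:
                continue  # I_{k,tau} = 0: an unavailable value contributes nothing
            score += theta_k(x, tau)  # I_{k,tau} = 1
    return score


# Toy example with made-up contribution functions (illustrative values only).
theta = {
    "ALB": lambda x, tau: -0.01 * (x - 40.0),
    "IL-6": lambda x, tau: 0.002 * x,
}
observations = {
    "ALB": {15: 35.0, 17: 30.0},
    "IL-6": {16: 50.0, 17: None},
}
phi = risk_score(t=17, rho_z=0.1, observations=observations,
                 theta=theta, tau_retrospective=3)
```

In this toy case the missing IL-6 value on day +17 is simply skipped, so no imputation step is needed before scoring.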

The original daGOAT model included 14 peri-transplant features and 194 dynamic co-variates. Values of the hyperparameters were originally set to be: \({T}_{{{\rm{active}}}}^{{{\rm{start}}}}=+ 1\), \({T}_{{{\rm{active}}}}^{{{\rm{end}}}}=+ 100\), and \({\tau }_{{{\rm{retrospective}}}}=14\).

While planning for the prospective trial, we realized the original daGOAT model had several weaknesses that would inconvenience its prospective application:

First, data collection for some of the dynamic co-variates (for example, frequency of defecation) was not easy to automate.

Second, the original model’s ability to detect severe acute GvHD peaked at day +23 posttransplant whereas accuracy was considerably worse before and after18. As a consequence, we reasoned that after transplantation it might be beneficial to silence daGOAT initially and only allow the model to start giving alerts for severe acute GvHD risk after a certain amount of elapsed time.

Third, for at least some patients it would be too late to wait until day +23 to attempt to pre-empt severe acute GvHD.

Fourth, daGOAT originally classified patients into two risk-strata. However, having three risk-strata might be better than having two, because this would permit a more granular approach to pharmaceutical intervention that gives lower-intensity intervention to intermediate-risk participants and higher-intensity intervention to high-risk participants.

To remedy these weaknesses, we took the multi-pronged approach described below. We used retrospective data from the NICHE-GOAT cohort at the IHCAMS18 to assist model optimization. Five hundred and nineteen patients receiving HLA-mismatched HCT at the IHCAMS during 1 April 2012–30 November 2020 were used as the training set, whilst 204 patients receiving HLA-mismatched HCT during 1 December 2020–31 December 2021 were used as the validation set.

First, we eliminated some dynamic co-variates that were deemed impractical or unnecessary in prospective applications. For instance, whilst there were some bone marrow cell cytometry data points in the retrospective dataset, bone marrow cell cytometry is rarely done during the first month posttransplant, and even when it is occasionally done results are rarely available on the same day of sample collection. In addition, we removed co-variates whose evaluation could not be easily automated; examples included symptomatic complaints (for example, nausea), frequency of defecation, volume of defecation, urine output, etc. We also deleted co-variates that are no longer routinely reported by the clinical laboratory at the IHCAMS; examples included high fluorescence lymphocyte cells count, neutrophil forward scatter mean intensity, and monocyte fluorescent light mean intensity.

Although evidence was scanty for association between acute GvHD and many of the clinical co-variates (for example, plasma electrolytes), we did not perform additional ‘variable selection’ beyond the trimmings described above. We decided to let each co-variate speak for itself based on data. We expect co-variates not informative for predicting severe acute GvHD would have small weight values at the conclusion of model-fitting.

After being streamlined, daGOAT includes 14 peri-transplant features (same as the list in the original model) and 141 dynamic co-variates. The 141 dynamic co-variates are: CBC (36 co-variates), blood biochemistry (44 co-variates), electrolytes (7 co-variates), blood cell flow cytometry (36 co-variates), and cytokines (18 co-variates) (Supplementary Data 1).

Value of \({\tau }_{{{\rm{retrospective}}}}\) was chosen so that the following two criteria were met in cross-validation experiments: (1) the AUROC score between \(\varphi \left(t\right)\) (a numeric variable; computed using severe acute GvHD as the regression goal in Eq. 1) and severe acute GvHD onset peaked earlier between days +17 and +23 and (2) mean AUROC score between days +17 and +20 was near-maximized. These two criteria were designed to encourage the AUROC score to rise faster without sacrificing its absolute peak value. The two criteria were simultaneously met when \({\tau }_{{{\rm{retrospective}}}}=3\), with the AUROC score peaking at day +17 whilst mean AUROC score between days +17 and +20 was 0.67, close to the maximum mean value 0.68 (Supplementary Fig. 6).

We then developed a two-component model architecture for daGOAT to further improve its performance (Supplementary Fig. 7). We fitted the first ‘component model’ using severe acute GvHD as the regression target in Eq. 1 (as usual). For any given day between \({T}_{{{\rm{active}}}}^{{{\rm{start}}}}\) and \({T}_{{{\rm{active}}}}^{{{\rm{end}}}}\), we classify this component model’s output on that day as ‘high’ if it is larger than the 97th-percentile value in the training set for that day, ‘low’ if it is smaller than the 70th-percentile value, and ‘intermediate’ otherwise. Then, we fitted a second ‘component model’ using grade 2–4 acute GvHD as the regression target (that is, interpreting \(\varphi \left(t\right)\) in Eq. 1 as the risk score for grade 2–4 acute GvHD rather than severe acute GvHD) and likewise classify its daily outputs into ‘high’, ‘intermediate’, and ‘low’. We reasoned that ‘true’ higher risk should be higher risk according to both the component models. With the two-component model architecture, a patient’s risk level on a given day would be classified as ≥intermediate-risk if outputs from both the component models were ≥intermediate on that day; if, in addition, output from ≥1 component model were high, then the patient’s risk level on that day would be classified as high. Thus, from \({T}_{{{\rm{active}}}}^{{{\rm{start}}}}\) to \({T}_{{{\rm{active}}}}^{{{\rm{end}}}}\), each day the combined model classifies each person’s daily risk level as low, intermediate, or high. Finally, we designate each patient’s final risk-stratification to be his/her highest daily risk level between \({T}_{{{\rm{active}}}}^{{{\rm{start}}}}\) and \({T}_{{{\rm{active}}}}^{{{\rm{end}}}}\).
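The combination rule described above can be sketched in a few lines of Python. The names `Risk`, `classify_component`, `combine`, and `final_stratum` are hypothetical, and the percentile thresholds are passed in as plain numbers rather than derived from any training set.

```python
from enum import IntEnum
from typing import Sequence


class Risk(IntEnum):
    LOW = 0
    INTERMEDIATE = 1
    HIGH = 2


def classify_component(score: float, p70: float, p97: float) -> Risk:
    """Grade one component model's output on a given day against that day's
    70th- and 97th-percentile values in the training set."""
    if score > p97:
        return Risk.HIGH
    if score < p70:
        return Risk.LOW
    return Risk.INTERMEDIATE


def combine(severe: Risk, grade2to4: Risk) -> Risk:
    """Daily risk level: both components must be at least intermediate;
    'high' additionally requires at least one component to be high."""
    if min(severe, grade2to4) < Risk.INTERMEDIATE:
        return Risk.LOW
    if max(severe, grade2to4) == Risk.HIGH:
        return Risk.HIGH
    return Risk.INTERMEDIATE


def final_stratum(daily_levels: Sequence[Risk]) -> Risk:
    """Final risk-stratification: the highest daily level in the active window."""
    return max(daily_levels)
```

Note that a ‘high’ output from one component paired with a ‘low’ output from the other yields a low daily risk level under this rule, which reflects the requirement that both component models agree on elevated risk.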

With \({\tau }_{{{\rm{retrospective}}}}=3\) (its optimal value), values of \({T}_{{{\rm{active}}}}^{{{\rm{start}}}}\) and \({T}_{{{\rm{active}}}}^{{{\rm{end}}}}\) were chosen so that both the concordance score and the recall score were near-maximized in the validation set. Here, concordance was computed between severe acute GvHD onset and the maximum risk level between \({T}_{{{\rm{active}}}}^{{{\rm{start}}}}\) and \({T}_{{{\rm{active}}}}^{{{\rm{end}}}}\) (a categorical variable: low-, intermediate-, or high-risk; computed using the two-component model architecture for daGOAT). This happened when \({T}_{{{\rm{active}}}}^{{{\rm{start}}}}=+ 17\) and \({T}_{{{\rm{active}}}}^{{{\rm{end}}}}=+ 23\), with concordance score = 0.77 and recall score = 0.83 (Supplementary Fig. 6).

In summary, based on retrospective cross-validation experiments, concordance, recall, and AUROC scores and the AUROC score’s rising speed were simultaneously near-optimized with \({\tau }_{{{\rm{retrospective}}}}=3\), \({T}_{{{\rm{active}}}}^{{{\rm{start}}}}=+ 17\), and \({T}_{{{\rm{active}}}}^{{{\rm{end}}}}=+ 23\). In the validation set, daGOAT with the two-component model architecture achieved a concordance score of 0.77 for predicting severe acute GvHD. In contrast, with the original daGOAT model18, one had to wait until day +23 for the concordance score to peak at 0.78.

Cumulative incidence of severe acute GvHD in daGOAT-designated low-, intermediate-, and high-risk patients was 5%, 32%, and 50%, respectively, in the validation set (Fig. 1b). Eighty-three percent of the patients who later developed severe acute GvHD were classified as intermediate- or high-risk whilst 67% of the patients who did not develop severe acute GvHD were classified as low-risk. Cumulative incidence of ≥stage 2 lower-GI acute GvHD was 4%, 25%, and 50% in the low-, intermediate- and high-risk patients, respectively; ≥stage 2 hepatic acute GvHD, 2%, 16%, and 0%; and ≥stage 3 cutaneous acute GvHD, 2%, 7%, and 8%.

Finally, fixing \({\tau }_{{{\rm{retrospective}}}}=3\), \({T}_{{{\rm{active}}}}^{{{\rm{start}}}}=+ 17\), and \({T}_{{{\rm{active}}}}^{{{\rm{end}}}}=+ 23\) (that is, at their respective optimal values) and leveraging data from the entire retrospective cohort (that is, 723 transplants), we refitted the values of \(\rho \left({{{\bf{z}}}}\right)\) and \({\theta }_{k}\left(x,\tau \right)\). The model was last updated on 29 September 2022.

Correlation of co-variates with risk-stratification

For many dynamic co-variates, correlation with daGOAT-computed risk-stratification became apparent when we aggregated data across patients. Of the 141 dynamic co-variates included in the updated daGOAT model, 121 (86%) had ≥30 data points between days +1 and +23 in each of the three risk-strata in the retrospective cohort of 723 transplants. The mean temporal profiles of these 121 dynamic co-variates are displayed in Supplementary Fig. 1. Some dynamic co-variates had ‘crescendo’ patterns, with the lowest mean values in the low-risk stratum and highest mean values in the high-risk stratum; examples included lymphocyte count (LYMPH#), neutrophil count (NEUT#), platelet count (PLT), T cell count, T cell percentage in lymphocytes, interleukin (IL)-2R, IL-5, IL-6, IL-8, IL-10, bilirubin (BIL), aspartate aminotransferase (AST), and alanine aminotransferase (ALT). On the other hand, some dynamic co-variates had ‘decrescendo’ patterns, with the highest mean values in the low-risk stratum and lowest mean values in the high-risk stratum; examples included albumin (ALB), AST-to-ALT ratio, and natural killer (CD3−CD56+/CD16+) cell percentage in lymphocytes.

Mutual information between co-variates and risk-stratification

Entropy \(H\left(z\right)\) of a random variable \(z\) and mutual information \(I\left({z}_{1};{z}_{2}\right)\) between two random variables \({z}_{1}\) and \({z}_{2}\) were computed using logarithmic base 2 (unit: bits) as in common practice61.

Let \(Y\) denote risk-stratification based on the full-version daGOAT model (that is, including all the 14 peri-transplant features and 141 dynamic co-variates), \({Y}_{0}\) denote risk-stratification based on only the 14 peri-transplant features, and \({Y}_{\Omega }\) denote the risk-stratification were we to add a subset of dynamic co-variates, \(\Omega\), to the 14 peri-transplant features.

A priori (that is, without any prior information on the peri-transplant features or dynamic co-variates), the entropy or uncertainty of risk-stratification was \(H\left(Y\right)\) bits/person. Out of this total uncertainty \(I\left({Y}_{0}{;Y}\right)\) bits/person could be resolved by the 14 peri-transplant features. Adding a subset of dynamic co-variates, \(\Omega\), to the 14 peri-transplant features would add \(I\left({Y}_{\Omega }{;Y}\right)-I\left({Y}_{0}{;Y}\right)\) bits/person.
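The quantities above can be computed from paired per-person labels using the standard identity \(I\left(A;B\right)=H\left(A\right)+H\left(B\right)-H\left(A,B\right)\). The sketch below is a minimal plug-in estimator in Python; the function names and the four-person toy labels are illustrative assumptions, not study data.

```python
from collections import Counter
from math import log2
from typing import Hashable, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    """Empirical Shannon entropy in bits (logarithm base 2)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def mutual_information(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Empirical mutual information I(A;B) = H(A) + H(B) - H(A,B), in bits."""
    return entropy(a) + entropy(b) - entropy(list(zip(a, b)))


# Toy labels for four hypothetical persons (not study data).
y = ["low", "low", "high", "high"]        # full-model stratification Y
y0 = ["low", "low", "low", "low"]         # peri-transplant features only, Y_0
y_omega = ["low", "low", "high", "high"]  # after adding a co-variate subset, Y_Omega

# Incremental information contributed by the subset Omega, in bits/person.
incremental = mutual_information(y_omega, y) - mutual_information(y0, y)
```

In this toy case the peri-transplant labels carry no information about \(Y\), whereas \({Y}_{\Omega }\) reproduces \(Y\) exactly, so the increment equals \(H\left(Y\right)\).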

Among the 719 (99% of 723) patients who received HLA-mismatched transplants in the retrospective cohort and did not develop severe acute GvHD before day +17, using only the 14 peri-transplant features resolved 0.02 bits/person of uncertainty in risk-stratification (that is, \(I\left({Y}_{0}{;Y}\right)=\) 0.02). Adding CBC-, flow cytometry-, biochemistry-, electrolyte-, or cytokine-related dynamic co-variates (that is, adding only one of the five categories of dynamic co-variates) would add 0.18, 0.14, 0.09, 0.03, or 0.02 bits/person of incremental information, respectively, whereas including data of all the dynamic co-variates added 0.95 bits/person of total incremental information (Fig. 1d).

Note that contributions to ‘realized’ risk-stratification should not be interpreted as contributions to ‘true’ risk-stratification, which is unknown to us. For example, when cytokines were added last (that is, the last row in Fig. 1d), the cumulative mutual information with the final risk-stratification suddenly ‘jumped’. We interpret this ‘jump’ as merely reflecting that, once cytokines (the last group of co-variates to be included in the model) were also included, the computed risk-stratification at last became fully identical to the final daGOAT-computed risk-stratification.

Required sample size for the prospective trial

Cohort size of the prospective trial was designed so that there would be sufficient power for comparing cumulative incidences of severe acute GvHD between the prospective-trial cohort and co-variate-matched controls.

We assumed there would be two intensity levels of add-on pharmaceutical intervention for enhancement of acute GvHD prophylaxis: higher- and lower-level doses. We further assumed that, with ‘add-on prophylactic medication at the higher-level dose’, cumulative incidence of severe acute GvHD in high-risk participants would become comparable to that in intermediate-risk patients receiving only baseline prophylaxis, whilst, with ‘add-on prophylactic medication at the lower-level dose’, cumulative incidence of severe acute GvHD in intermediate-risk participants would become comparable to that in low-risk patients receiving only baseline prophylaxis. Thus, we estimated that, overall, cumulative incidence of severe acute GvHD would fall from 18% (the cumulative incidence in the recipients of HLA-mismatched transplants at the IHCAMS during 1 December 2020–31 December 2021) to 8% in the prospective-trial cohort. Assuming the prospective-trial cohort would be compared with co-variate-matched control cases selected at the target ratio of 3:1 from electronic medical records, to attain a 0.05 significance level and a 0.8 power at a presumed 5% dropout rate, we estimated that at least 102 participants would need to be enrolled in the prospective trial.
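The text does not state which formula was used for this estimate. As a rough plausibility check only, the sketch below applies a conventional normal-approximation formula for comparing two independent proportions with unequal allocation, then inflates the result for dropout. The choice of formula, the function name `trial_arm_size`, and the way dropout is applied are all assumptions and may differ from the calculation actually performed.

```python
from math import ceil
from statistics import NormalDist


def trial_arm_size(p_trial: float, p_control: float, controls_per_case: float,
                   alpha: float = 0.05, power: float = 0.8,
                   dropout: float = 0.05) -> int:
    """Participants to enrol in the trial arm for a two-sided comparison of two
    proportions (unpooled normal approximation), with controls_per_case matched
    controls per participant and inflation for the expected dropout rate."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    variance = (p_trial * (1 - p_trial)
                + p_control * (1 - p_control) / controls_per_case)
    n_evaluable = z ** 2 * variance / (p_trial - p_control) ** 2
    return ceil(n_evaluable / (1 - dropout))


# Inputs taken from the text: 8% versus 18%, 3:1 controls, 5% dropout.
n_required = trial_arm_size(p_trial=0.08, p_control=0.18, controls_per_case=3)
print(n_required)  # 102 under these assumptions
```

Under these particular assumptions the result agrees with the enrolment target given above, although that agreement does not establish that this was the method used.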

Dosing of intensified immune suppression

For participants identified by daGOAT to be at risk of severe acute GvHD, the IHCAMS Ethics Committee approved use of ruxolitinib, a selective JAK1/JAK2 inhibitor effective in treating steroid-refractory acute GvHD62,63,64, for severe acute GvHD prophylaxis in the context of the daGOAT trial. It was approved that ‘add-on prophylactic medication at the higher-level dose’ would be ‘oral 5 mg ruxolitinib twice daily until at least day +60’ and ‘add-on prophylactic medication at the lower-level dose’ would be ‘oral 2.5 mg ruxolitinib twice daily until at least day +60’.

Because 20 mg/d ruxolitinib is the standard dose for treating steroid-refractory acute GvHD, it was reasoned a lower dose might be sufficient for prophylaxis62. Because 5 mg/d has been tested in HLA-matched allogeneic transplants for acute GvHD prophylaxis, it was reasoned a higher dose might be needed for at least some patients receiving HLA-haploidentical transplants65.

Because daGOAT was trained on clinical co-variate data obtained before the start of therapy for acute GvHD, the study protocol did not stipulate de-escalation of the ruxolitinib dose even if a participant’s risk score improved after he/she started taking ruxolitinib. However, the trial protocol stipulated that in participants with neutrophil concentration <100/μL in blood ruxolitinib could be reduced in dose or discontinued until the physician judged it safe to restart.

Participant enrollment

Eligibility criteria included the following: (1) age >16 years; (2) first transplant and HLA-mismatched; (3) able to swallow medicine; (4) consent from the physician and patient. Sex was not considered in the study design. From 17 January 2023 to 30 June 2024 (‘enrollment period’) two transplant wards at the IHCAMS recruited participants. Physicians and potential participants were explicitly told that an AI model would decide whether, when, and at what dose to prescribe ruxolitinib.

Dynamic co-variate data collection in the participants

The daGOAT model requested from the hospital information system only data of the co-variates required by the model and only from the participants who enrolled in the trial. Because daGOAT permits missing data18, physicians were told to order laboratory tests in accordance with their own standard practice. If a physician decided to order blood cell flow cytometry analysis, we required the sample to be collected before 0700 h so that results would be available before 1620 h the same day. There was no other stipulation on data collection.

Outcome measures

Acceptance rate of trial participation and compliance with AI prescriptions were quantified. Primary clinical endpoint pre-specified in the trial protocol was the cumulative incidence of severe acute GvHD at day +100 posttransplant.

Participant follow-up

The last follow-up date was 11 March 2025. The median duration of follow-up was 327 days posttransplant.

Clinician survey

On 27 November 2024 we conducted an anonymous survey of physicians and nurses participating in the study to interrogate how they perceived daGOAT.

Statistics and software

We used greedy nearest neighbor matching with caliper constraint and the target matching ratio of 3:1 to select co-variate-matched controls from the 428 HLA-mismatched transplants that were done at the IHCAMS during 1 January 2019–31 December 2021 and did not use ruxolitinib before acute GvHD onset. Co-variates used for case-control matching were: recipient age, sex, primary disease, disease status (CR versus not in CR) and MRD-status pretransplant, donor type (HLA-haploidentical versus MMUD), HLA match, graft source (blood versus bone marrow), infused cell dose, conditioning regimen (myelo-ablative versus reduced-intensity), and baseline regimen for acute GvHD prophylaxis (‘ATG + cyclosporine A’ versus ‘ATG + tacrolimus’).
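For readers unfamiliar with the technique, the following Python sketch illustrates greedy nearest-neighbour matching with a caliper and a target ratio on a single numeric distance score. It is a simplified illustration and not the ‘matchit’ implementation used in this study; the function name `greedy_match`, the scalar score, the caliper value, and the toy data are all assumptions.

```python
from typing import Dict, List


def greedy_match(case_scores: Dict[str, float], control_scores: Dict[str, float],
                 ratio: int = 3, caliper: float = 0.1) -> Dict[str, List[str]]:
    """Greedy nearest-neighbour matching without replacement.

    In each of `ratio` rounds, every case takes its closest remaining control,
    provided the distance is within the caliper. A case may therefore end up
    with fewer than `ratio` controls (the ratio is a target, not a guarantee).
    """
    available = dict(control_scores)
    matches: Dict[str, List[str]] = {case: [] for case in case_scores}
    for _ in range(ratio):
        for case, score in case_scores.items():
            if not available:
                break
            best = min(available, key=lambda c: abs(available[c] - score))
            if abs(available[best] - score) <= caliper:
                matches[case].append(best)
                del available[best]  # each control is used at most once
    return matches


# Toy scores (hypothetical identifiers and values).
cases = {"A": 0.50, "B": 0.80}
controls = {"c1": 0.52, "c2": 0.49, "c3": 0.58, "c4": 0.79, "c5": 0.95, "c6": 0.30}
matched = greedy_match(cases, controls, ratio=3, caliper=0.1)
```

In the toy data, case B receives only one control because its remaining candidates fall outside the caliper, which is why the 3:1 ratio is described as a target.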

R code for executing the daGOAT model is available at GitHub19. The following R functions were used in statistical analyses: two-sided Wilcoxon test for comparing median values, ‘wilcox.test’; two-sided Fisher’s exact test for comparing proportions between two groups, ‘fisher.test’; two-sided log-rank test for comparing survival curves, ‘ggsurvplot’; computing hazard ratio when comparing survival curves or calculating concordance between risk score and acute GvHD onset, ‘coxph’; two-sided Fine-Gray test (treating mortality as a competing risk; single-variable model, that is, without controlling for additional co-variates) for comparing two cumulative incidence functions (of acute GvHD, chronic GvHD, or relapse), ‘crr’; and nearest neighbor matching for identifying the control cohort, ‘matchit’. Statistical significance is defined as p < 0.05. All the reported p-values are two-sided and not adjusted for multiple-hypothesis testing. In the plots of cumulative incidence functions of acute GvHD, event time is the first day of steroid-therapy.

Ethics approval and trial registration

This study was approved by the Ethics Committee (IIT2022034-EC-1) of the IHCAMS and registered in ClinicalTrials.gov (NCT05600855). All participants gave written informed consent (Supplementary Notes) consistent with precepts of the Declaration of Helsinki.

Role of the funding sources

The funders of this study had no role in data collection, analysis, interpretation, writing of the report, or the decision to submit for publication.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.