Fig. 6: Bridging preclinical datasets to clinical prediction via transfer learning.

a, Pie chart showing the proportion of each cancer type in the CDS_DB dataset; breast cancer (58.2%) and leukaemia (32.0%) are the most prevalent. b, t-SNE (ref. 63) plot of the pretreatment space of breast cancer subtypes from CDS_DB and breast cancer cell lines from the L1000 dataset, coloured by data source and cancer subtype. c–e, PCC comparison of various models under the unseen-patient (c), unseen-drug (d) and unseen-cancer (e) evaluation scenarios. For the unseen-patient scenario, results are reported for three settings: pan-cancer, breast cancer and leukaemia. For each model, two training strategies are compared: training from scratch and pretraining on the L1000 dataset. Performance gains achieved by the XPert model through pretraining are highlighted in red. Box plots show the distribution of PCC values obtained from 5-fold cross-validation, with the centre line indicating the median, the box representing the IQR (25th to 75th percentile) and the whiskers extending to 1.5 × IQR. Individual data points are shown as coloured squares. f, Violin plots showing the distribution of xdeg for the CDK1 and BUB1B genes, comparing the ground truth with predictions from XPert and XPert (pretrain). The width of each violin represents the kernel density estimate, and the central white dot indicates the median. g, Volcano plot showing genes with differential attention identified by XPert between the non-response (NON) and response (RES) groups. Red points represent genes with significantly increased attention in the non-response group, suggesting potential drug-resistance-related genes, whereas blue points represent genes with significantly decreased attention.