Fig. 5: In silico benchmark on unseen epitopes. | Nature Machine Intelligence

From: Conditional generation of real antigen-specific T cell receptor sequences

a, Schematic of the test set curation. Bubble plot of n = 14 test pMHCs coloured by allele, with the area corresponding to the number of unique cognate CDR3β sequences. The pMHC with the most reference TCRs was reserved for a deep in silico simulation and the remaining 13 were used for a sparse evaluation set. b, Repertoire-level features for reference CDR3β sequences and those generated by ER-TRANSFORMER (ER-TRANS), a modified ER-TRANSFORMER (ER-TRANS+), GRATCR and TCRT5, shown as smoothed density curves. c, Benchmark metrics showing the aggregate performance of all models on the benchmark dataset (n = 13). soNNia-derived metrics are aggregated across pMHCs and 1,000 simulations to account for the stochasticity of generations. Error bars show mean ± s.d. d, Modified true-positive counts. Box and whisker plots showing the median and quartile values for benchmark pMHCs (n = 13) on exact matches, sequence recovery ≥ 90% and GIANA reference-clustered sequences. Whiskers extend to 1.5× the interquartile range. e, Network diagrams for model generations. Clusters with a known binder (red) and GIANA-clustered translations (blue) are highlighted. The number of highlighted clusters (c), the number of clustered translations (t) and the number of known binders sampled (r) are reported per pMHC. GRATCR is omitted due to zero reference-clustered sequences. f, Network diagrams for RVRAYTYSK (EBV). g, In silico simulation of the RVRAYTYSK design challenge. Heat maps highlighting the rank of exact reference matches, sequence recovery ≥ 90% and GIANA reference-clustered sequences for each model are shown. For each metric, a summary bar plot counting the number of successes is shown, coloured by range. Panel a created with BioRender.com.
