Extended Data Fig. 6: TCRT5 Metrics @1000.
From: Conditional generation of real antigen-specific T cell receptor sequences

(a) Sequence logo plots generated from TCRT5 for the canonical GILGFVFTL (Influenza A), KLGGALQAK (CMV), and YLQPRTFLL (SARS-CoV-2) epitopes, using 1000 generations instead of 100. (b) TCRT5@1000 with beam search preferentially samples sequences at the right tail of OLGA generation probabilities. (c) Repertoire-level features of reference (validation target sequences) and generated CDR3βs. Bar plots for individual pMHCs are overlaid on one another. (d) K-mer spectrum shift plot showing the Jensen-Shannon divergence between generated and reference sequences for TCRT5@1000. Error bars mark the mean and 1 standard deviation across validation pMHCs (n = 20). Mean soNNia values are shown per simulated run, with 1000 generations per pMHC per run over 100 simulations. (e) Heat map of Jaccard index scores showing the co-occurrence of generated sequences across pMHC pairs at 1000 generations per pMHC. (f) Sankey diagram of TCRT5@1000 generations showing validity (measured by nonzero generation probability), known binding status, and training set membership.
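For readers unfamiliar with the two set-level metrics named in this legend, the following is a minimal sketch of how a k-mer spectrum Jensen-Shannon divergence and a Jaccard index could be computed. It is an illustration under stated assumptions, not the authors' pipeline: the function names, the choice of k = 3, the base-2 logarithm, and the toy sequences are all hypothetical and do not come from the paper.

```python
from collections import Counter
from math import log2


def kmer_spectrum(seqs, k=3):
    """Return the normalized k-mer frequency distribution of a set of sequences.

    k = 3 is an assumed default; the legend does not state the k used.
    """
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()} if total else {}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2, so bounded in [0, 1]) of two distributions."""
    div = 0.0
    for key in set(p) | set(q):
        pi, qi = p.get(key, 0.0), q.get(key, 0.0)
        m = 0.5 * (pi + qi)  # mixture distribution at this k-mer
        if pi > 0:
            div += 0.5 * pi * log2(pi / m)
        if qi > 0:
            div += 0.5 * qi * log2(qi / m)
    return div


def jaccard_index(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two collections of sequences."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy, made-up CDR3-like strings purely for illustration.
generated = ["CASSLGQ", "CASSPGT"]
reference = ["CASSLGQ", "CASRDGT"]

jsd = js_divergence(kmer_spectrum(generated), kmer_spectrum(reference))
overlap = jaccard_index(generated, reference)
print(f"JSD = {jsd:.3f}, Jaccard = {overlap:.3f}")
```

In this sketch a divergence of 0 means identical k-mer spectra and 1 means no shared k-mers, while the Jaccard index is computed on unique sequences, so duplicate generations do not inflate the overlap.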