Fig. 4: Qualitative assessment of TCRT5.
From: Conditional generation of real antigen-specific T cell receptor sequences

a, Repertoire-level features of reference (validation target) and generated CDR3β sequences. TCRT5 captures the tails of the CDR3β length distribution but preferentially samples sequences at the right tail of the OLGA generation probability (pgen) distribution. b, Sequence logo plots showing the decrease in per-position sequence diversity in the generated relative to the reference CDR3β sequences for three canonical pMHCs (GILGFVFTL (influenza A), KLGGALQAK (CMV) and YLQPRTFLL (SARS-CoV-2)). c, Generated sequences show lower Shannon entropy than the reference sequences at nearly all positions across all pMHCs. Bar plots for individual pMHCs are overlaid on one another. d, Plot of k-mer spectrum shift, showing the Jensen-Shannon (JS) divergence between the k-mer distributions of the generated and reference sequences. Mean JS divergences for soNNia generations (100 sequences sampled per pMHC across 100 simulations) are shown for reference. Error bars mark the mean and one standard deviation across validation pMHCs (n = 20). e, Heat map of Jaccard index scores, showing the co-occurrence of generated sequences across different pMHC pairs. f, TCRT5 repeats sequences across pMHCs in line with biological probabilities and is robust to training set abundance. Scatter plot of sequence occurrence across pMHCs against OLGA pgen, polyspecificity and training set frequency. g, TCRT5 generates experimentally validated antigen-specific CDR3β sequences unseen during training. Sankey diagram showing the validity (non-zero OLGA pgen), known antigen specificity status and training set membership of the generated sequences across the validation pMHCs.
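
The three summary statistics named in panels c to e (per-position Shannon entropy, JS divergence between k-mer spectra and the Jaccard index) can be illustrated with a minimal Python sketch. This is not the authors' analysis code. The function names are hypothetical, and the choice of k = 3, the base-2 logarithm and the restriction to equal-length sequences for the entropy calculation are assumptions made only for illustration.

```python
import math
from collections import Counter


def shannon_entropy(counts):
    """Shannon entropy in bits of a Counter of observations (log base 2 is an assumption)."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def positional_entropy(seqs):
    """Per-position entropy of amino-acid usage.

    Assumes all sequences have the same length; how sequences of
    different lengths are aligned is not specified in the caption.
    """
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("this sketch expects equal-length sequences")
    return [shannon_entropy(Counter(s[i] for s in seqs)) for i in range(length)]


def kmer_spectrum(seqs, k=3):
    """Normalized frequency of every overlapping k-mer (k = 3 is an assumption)."""
    counts = Counter(s[i:i + k] for s in seqs for i in range(len(s) - k + 1))
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits between two discrete distributions given as dicts."""
    keys = set(p) | set(q)
    m = {x: 0.5 * (p.get(x, 0.0) + q.get(x, 0.0)) for x in keys}

    def kl_to_mixture(a):
        # KL divergence from a to the mixture m; zero-probability terms contribute nothing.
        return sum(a[x] * math.log2(a[x] / m[x]) for x in a if a[x] > 0)

    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)


def jaccard_index(a, b):
    """Size of the intersection over size of the union of two sequence sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

With the base-2 logarithm the JS divergence is bounded between 0 (identical k-mer spectra) and 1 (no shared k-mers), which makes values comparable across pMHCs with different numbers of sequences.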