Fig. 1: Characteristics of the deep learning model and the training and evaluation datasets for prediction of HLA-I epitopes.

a, Datasets used for training and evaluation were curated by combining data from several previous studies as well as a recent download of the IEDB. Eluted ligand data were used as positives, and randomly sampled decoys from Swiss-Prot26 served as negatives. For evaluation, data from an immunopeptidomic study involving 24 monoallelic cell lines were used27. To evaluate immunogenicity, five studies28,29,30 that measured the immunogenicity of influenza epitopes identified via mass spectrometry were used. b–e, Peptide length distribution of the HLA-I binders (b,c) and pie chart of the proportion of epitopes per HLA-I allele (d,e) in the presentation training (b,d) and evaluation (c,e) datasets. All alleles present in the dataset with a frequency <1% are denoted as ‘other’. f, The binding module takes as input the amino acid sequences of the major histocompatibility complex and peptide in the form: [cls] mhc [sep] pep [eos], where [cls], [sep] and [eos] are special tokens marking the start of the input, the boundary between the two sequences and the end of the input, respectively. This combined sequence is fed to the Evolutionary Scale Modeling-2 (ESM-2) Transformer protein language model, and the vector representation for the [cls] token is used to represent the complex. The ligand elution module combines the binding vector with a long short-term memory (LSTM) recurrent neural network encoding of the peptide that includes its left and right flanks in the parent protein of origin. The model can be trained with or without flanking residues. The binding and LSTM features are then concatenated and used to compute a ligand presentation score. The model is first trained on the ligand presentation task; five replicates are then trained with different random seeds, and their scores are averaged to create an ensemble score. pHLA: peptide-human leukocyte antigen complex; TCR: T cell receptor. Panel f created with BioRender.com.
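The input layout and ensemble averaging described in panel f can be sketched as follows. This is a minimal illustration, not the authors' code: the sequences, function names and score values are hypothetical, and the real model tokenizes the sequence for ESM-2 rather than handling it as a plain string.

```python
# Hypothetical sketch of the panel-f input format and seed-ensemble averaging.
# Sequences and scores below are illustrative placeholders.

def build_input(mhc_seq: str, pep_seq: str) -> str:
    """Join the MHC and peptide sequences with the special tokens
    [cls] ... [sep] ... [eos] that delimit the two sequences."""
    return f"[cls] {mhc_seq} [sep] {pep_seq} [eos]"

def ensemble_score(scores: list[float]) -> float:
    """Average presentation scores from replicates trained with
    different random seeds to form the ensemble score."""
    return sum(scores) / len(scores)

# Example: a short (truncated) MHC pseudo-sequence and a peptide.
combined = build_input("GSHSMRY", "SIINFEKL")

# Presentation scores from five replicates with different random seeds.
avg = ensemble_score([0.91, 0.88, 0.93, 0.90, 0.89])
```

In the actual model, `combined` would be tokenized and passed through ESM-2, with the [cls] token's final-layer vector taken as the representation of the pHLA complex.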