Supplementary Figure 2: Peptide sequence encoding and detailed individual neural network architectures.
From: Predicting HLA class II antigen presentation through integrated deep learning

(a) An example (GCSADQACN) of how variable-length amino acid sequences (8–26 AA) are one-hot encoded for machine learning purposes. A peptide is represented by a 21 × 26 matrix: the rows correspond to the 21 possible amino acid characters, and each column marks the amino acid present at that position (1 = true). Positions left unfilled because of a peptide's shorter length are encoded as all-zero vectors, which are ignored by the neural network's masking layer (see the encoding sketch below). (b) The model architecture for computing peptide cleavage scores for HLA-DR presentation. The algorithm takes a query gene paired with a short peptide sequence and looks up the human proteome sequence database to determine the six amino acids upstream and downstream of the peptide (flanking sequences). A two-layer conventional neural network takes these 12 flanking amino acids as input and outputs a 0–1 cleavage score indicating the likelihood of HLA-DR presentation from flanking sequences alone (see the flanking-sequence sketch below). (c) The deep RNN model for predicting HLA-DR peptide presentation from peptide sequences only. The model consists of one masking layer, one RNN layer and two conventional dense layers; it takes one-hot encoded peptide sequences as input and outputs presentation scores indicating the likelihood of HLA-DR presentation from peptide sequences alone (see the RNN sketch below). This model was trained on naturally presented MCL HLA-DR peptide ligands. (d) The deep RNN model for predicting HLA-DR peptide in vitro binding affinities from IEDB binding data. A pair of a query HLA-DR allele and a peptide is encoded as a single sequence consisting of the HLA-DRB1 pseudosequence, a spacer (-), and the query peptide sequence. A deep RNN model takes the one-hot encoded sequence as input and outputs estimated in vitro binding affinities (1 - log50k(nM)). This model was trained on the IEDB quantitative HLA-DR peptide binding data, identical to the data used by NetMHCIIpan3.1. (e) Selection of training and validation data for HLA-DR presentation prediction models. ~35k naturally presented HLA-DR peptide ligands and ~105k length-matched random human peptides were randomly assigned to training (85%) and validation (15%) sets. Peptides in the validation set that were identical to, or substrings of, any peptide in the training set were moved to the training set to avoid overfitting (see the splitting sketch below). Training and cross-validation were repeated 10 times to determine regularization parameters and estimate the predictive power of the various models. The final performance of MARIA was determined on independent test sets.
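
The following is a minimal sketch of the one-hot encoding described in panel (a), written in Python with NumPy. The alphabet ordering and the use of 'X' as the 21st character are assumptions made here for illustration; only the 21 × 26 matrix shape, the 8–26 AA length range, and the all-zero padding columns come from the figure.

```python
import numpy as np

# 20 standard amino acids plus 'X' as the 21st character (the 21st symbol
# actually used by the authors is an assumption here)
ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"
MAX_LEN = 26

def one_hot_encode(peptide, max_len=MAX_LEN):
    """Encode a peptide (8-26 AA) as a 21 x 26 matrix.

    Rows index the 21 possible amino acid characters, columns index
    positions; columns beyond the peptide length stay all-zero so a
    downstream masking layer can ignore them.
    """
    mat = np.zeros((len(ALPHABET), max_len), dtype=np.float32)
    for pos, aa in enumerate(peptide.upper()):
        mat[ALPHABET.index(aa), pos] = 1.0
    return mat

encoded = one_hot_encode("GCSADQACN")   # example peptide from panel (a)
print(encoded.shape)                    # (21, 26)
```

For use as RNN input (panels c and d), this matrix would be transposed to (positions, amino acids) so that each time step of the recurrent layer receives one 21-dimensional amino acid vector.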
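
For panel (b), the sketch below illustrates the flanking-sequence lookup and a two-layer network producing a 0–1 cleavage score, assuming TensorFlow/Keras. The hidden-layer width, activation functions, and the 'X' padding for peptides near protein termini are illustrative assumptions, not values stated in the figure.

```python
from tensorflow import keras

def flanking_sequences(peptide, protein_seq, flank=6):
    """Return the 6 AA upstream + 6 AA downstream of `peptide` in `protein_seq`.

    Positions falling outside the protein are padded with 'X' placeholders
    (an assumption; the authors' handling of termini is not described here).
    """
    start = protein_seq.find(peptide)
    if start < 0:
        raise ValueError("peptide not found in the source protein")
    end = start + len(peptide)
    upstream = protein_seq[max(0, start - flank):start].rjust(flank, "X")
    downstream = protein_seq[end:end + flank].ljust(flank, "X")
    return upstream + downstream          # 12-residue flanking context

# Two dense layers mapping the one-hot encoded 12-residue context to a
# 0-1 cleavage score; the hidden width (32) is an illustrative guess.
cleavage_model = keras.Sequential([
    keras.layers.Input(shape=(12 * 21,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
```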
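
The sketch below corresponds to the deep RNN layout in panels (c) and (d): one masking layer, one RNN layer, and two dense layers, again assuming Keras. The RNN cell type (LSTM here) and layer widths are assumptions; the figure specifies only the layer order, the masked zero-padding, and the output interpretation.

```python
from tensorflow import keras

MAX_LEN, N_AA = 26, 21   # positions x amino acid alphabet from panel (a)

presentation_model = keras.Sequential([
    keras.layers.Input(shape=(MAX_LEN, N_AA)),
    keras.layers.Masking(mask_value=0.0),          # skips all-zero padding columns
    keras.layers.LSTM(64),                         # RNN cell type/width assumed
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),   # presentation score in [0, 1]
])

# Panel (d) uses the same layout, but the input is the one-hot encoded
# concatenation of the HLA-DRB1 pseudosequence, a '-' spacer, and the
# peptide, and the output is the regression target 1 - log50k(nM)
# rather than a presentation probability.
```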
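
Finally, a small sketch of the split in panel (e): an 85/15 random assignment followed by moving any validation peptide that is identical to, or a substring of, a training peptide back into the training set. This is a plain Python illustration of the described rule, not the authors' pipeline; the substring scan is quadratic and would need indexing to scale to ~140k peptides.

```python
import random

def split_with_substring_filter(peptides, val_frac=0.15, seed=0):
    """Randomly split peptides 85/15, then move validation peptides that are
    identical to, or substrings of, any training peptide into training."""
    rng = random.Random(seed)
    peptides = list(peptides)
    rng.shuffle(peptides)
    n_val = int(len(peptides) * val_frac)
    val, train = peptides[:n_val], peptides[n_val:]
    train_set = set(train)
    kept_val = []
    for p in val:
        if p in train_set or any(p in t for t in train):
            train.append(p)            # overlapping peptide moved to training
        else:
            kept_val.append(p)
    return train, kept_val
```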