Introduction

Acute myeloid leukemia (AML) is an aggressive hematological cancer arising from the accumulation of hematopoietic stem/progenitor cells in the bone marrow. Standard therapy of AML includes induction chemotherapy followed by intensive consolidation chemotherapy or high-dose therapy in combination with autologous or allogeneic hematopoietic stem cell transplantation, however AML is often incurable with chemotherapy1. Owing to its poor prognosis and high recurrence rate after chemotherapy, it is essential to develop alternative therapies.

The success of immunotherapy in cancer gave rise to renewed interest in developing immune-based therapy strategies in AML, and the current immunotherapy in AML aims to strengthen the ability of CD4+/CD8+ T cells to kill tumor cells via their successful recognition of major histocompatibility complex (MHC)-associated peptides (MAPs), also known as tumor antigens2,3. Recognizing of MAPs presented on the surface of cancer cells determines the efficacy of T cell-based cancer immunotherapy, and the ideal tumor antigens are required to be crucial for tumor initiation or propagation, present on all malignant cells and absent from all normal cells4.

Mutated tumor-specific antigens (TSAs) have attracted substantial attention in quests for vaccines against solid tumors. However, emerging evidence revealed the low mutational burden in AML, and certain mutated TSAs have been validated in AMLs5,6,7. Moreover, mutated TSAs are mostly patient-specific, thus lack of suitable target antigens impede the clinical application of immunotherapy in AML. Identifying TSAs shared across patients in AML is more attractive so that a greater number of patients may benefit from the more efficient and rapid treatment.

Large efforts have been made to explore canonical tumor antigens for cancer immunotherapy, which are encoded within the open reading frames (ORFs) of protein-coding genes, including mutated MAPs8. Recently, non-canonical MAPs (ncMAPs) were highlighted to elicit anti-tumor immune responses, which are derived from the aberrant translation of putative small ORFs in non-protein-coding regions (ncORFs), such as lncRNAs, pseudogenes, UTRs of coding genes9,10,11. This non-canonical category of tumor antigens has recently gained attention in various types of cancer7,12. It has been reported that ~2% genome region translated to canonical proteins, 75% of the human genome also could be transcribed as well as 36% lncRNAs, potentially representing a rich source of previously unexplored MAPs13,14. Moreover, many ncORFs were also actively translated and participated in diverse cellular functions across various cancer types9,14,15. If such abnormal translation events further derive tumor-specific and immunogenic ncMAPs, these MAPs could be both tumor-specific and common among patients, thereby generating an efficient and safe anti-tumor immune response10,11. Therefore, systematical identification of ncMAPs could greatly expand the cancer immunopeptidome.

In this study, we constructed the immunopeptidome landscape in AML patients, both cMAPs and ncMAPs were identified by querying canonical human proteome as well as our customized non-canonical reference library. The characteristics of ncMAPs and cMAPs were compared from multiple aspects to explore the potential of ncMAPs as tumor antigens. We also investigated the shared mechanism of MAPs among AML patients and MHC I alleles, which would help identify neoantigens shared across patients in AML. In particular, we integrated multi-omics data to accurately identified neoantigens, including immunopeptidome in normal tissues, and found that ncMAPs were critical components of tumor antigens in AML. The association between neoantigen-related genes and tumor microenvironment was investigated, and a computational model was further developed to discover their contribution to clinical prognosis.

Results

ncMAPs are indispensable tumor neoantigens

Recent studies have highlighted the importance of ncMAPs7,16, which represent a rich source of tumor neoantigens. A plethora of non-canonical translation products of transcripts have been discovered by Ribo-seq, including lncRNAs, pseudogenes and UTRs of coding genes, and out-of-frame alternative ORFs in annotated protein-coding genes. We collected and integrated the non-canonical ORFs supported by Ribo-seq from RPFdb17, nuORFdb11 and TransLnc14, which were used as a customized ncORF reference library to expand immunotherapy targets in AML. To comprehensively identify the MAPs in cancer, we integrated the human proteome obtained from UniProt with our customized ncORF reference library. We next performed MHC-I immunopeptidome MS/MS spectra searching to identify MAPs in 19 AML patients derived from all possible genomic regions, including non-coding regions (Fig. 1A). We used three methods (MixMHCpred, MHCflurry and NetMHCpan) to predict the binding between peptides and MHC-I. The peptides predicted to bind MHC-I by at least two methods were considered as MAPs (Fig. 1A). As a result, we identified 6,426 cMAPs and 1,838 ncMAPs in AML, respectively (Fig. S1A,B). A similar number distribution was discovered for identified cMAPs and ncMAPs from all samples. Moreover, we found that 49.1% of MAPs can be detected in published mono-allelic MHC-I MS data (Fig. S1C), indicating the high reliability of both cMAPs and ncMAPs11.

Fig. 1
figure 1

Comparison of different characteristics between ncMAPs and cMAPs. (A) The workflow for identification of cMAPs and ncMAPs. (B) Difference of main characteristics between ncMAPs and cMAPs which was calculated as the median of cMAPs minus the median of ncMAPs. Red points indicate difference in favor of ncMAPs, blue points indicate in favor of cMAPs, and grey indicates no difference. (C,D) Boxplot showing the difference of PEP (posterior error probability) and T-cell contact hydrophobicity between ncMAPs and cMAPs. The PEP is a statistical measure that estimates the probability of a peptide being wrongly identified in proteomics analysis with MaxQuant. The Wilcoxon test was utilized for statistical analysis. (E) Pearson correlations between observed retention times and predicted retention time. (F) The length distribution of source ORFs. (G) The percentage distribution of the number of MAPs identified across different numbers of samples. The x-axis represents how many samples MAPs will be detected in, while the y-axis displays the percentage of MAPs appearing in n samples (n = 1, 2,…, 18).

Next, the characteristics of ncMAPs were compared with those derived from protein-coding regions (Fig. 1B). We found that ncMAPs exhibit the comparable characteristics as cMAPs, a little better in the predicted MHC binding affinity, %rank in the top 2% for the corresponding HLA-A and C alleles, and hydrophobicity (Fig. 1B–F). Previous studies have discovered that high degree of T-cell contact hydrophobicity and hydrophobic fraction are associated with T-cell response18,19. We found that ncMAPs exhibited higher hydrophobicity scores compared to cMAPs in T-cell contact residues, suggesting stronger immunogenicity of ncMAPs (Fig. 1D and Fig. S1D). The immunogenicity has been predicted using DeepImmuno20, BigMHC-IM21 and TLImm22. We discovered that the distribution of these features was similar between cMAPs and ncMAPs (Fig. S1E–G).

We also compared the distribution of MS/MS observed MAP retention times (RT) with the distribution of RT predicted based on the DeepLC algorithm23. Significantly positive correlations between experimental and predicted RTs were observed for both ncMAPs (R = 0.96, p < 2.2e-16) and cMAPs (R = 0.97, p < 2.2e-16, Fig. 1E). In addition, we investigated the length of MAPs and found that the 9mer peptides were prevalent (Fig. S1H), which was consistent with the observations in previous reports24. ncMAPs tended to be derived from shorter proteins than cMAPs (Fig. 1F). Notably, we found that approximate 69.24% of cMAPs and 66.27% of ncMAPs were shared among at least two AML patients (Fig. 1G). Together, these results suggested that ncMAPs are not only a better source of neoantigen-targets in personalized immunotherapy treatment efforts, but also a source of shared neoantigens across AML patients.

Higher generation efficiency of source genes deriving ncMAPs

It has been observed the association between the expression levels of source genes and the specificity and sensitivity of neoantigen detection6,7,25,26, we hypothesized that the characteristics of source genes might be related with the detection of ncMAPs. We first annotated the ncMAP to genomic regions and identified the source genes. As a result, 1,299 customized ncORFs were mainly from lncRNAs (39.95%), 3′ overlap dORF (19.86%) and 5′ uORF (15.78%) regions (Fig. 2A and Fig. S2A). Compared to the distribution of references, we found that ncMAPs were likely to be derived from 3′ overlap dORF and lncRNAs (Fig. 2B and Fig. S2C). It is notable that 78.37% of source genes only produced one ncMAP, whereas more than half of source genes could produce multiple cMAPs (Fig. S2B).

Fig. 2
figure 2

Comparison of source proteins encoding ncMAPs and cMAPs. (A) Diverse non-canonical ORFs encoding ncMAPs. (B) Compared with reference library, the tendency of peptides MS-identified. (C,D) The expression levels of source genes encoding MAPs. (EG) The expressions in TCGA samples, the length of source ORFs and the ratio of the length covered with non-coding derived ncMAPs, with protein-coding derived ncMAPs, with protein-coding derived cMAPs, or with protein-coding derived MAPs.

Previous studies have been reported that the expression levels of source genes encoding cMAPs and ncMAPs were significantly different. We thus divided source genes into four groups (Fig. 2C). Indeed, the higher expressions were observed for source genes, in particular the expressions of genes generating both cMAPs and ncMAPs were the highest, followed by protein-coding genes derived cMAPs, protein-coding genes derived ncMAPs and ncORFs derived ncMAPs (Fig. 2C–E and Fig. S2D). For instance, PRPF8, as an essential RNA binding protein in pre-mRNA splicing, generates the highest number of MAPs, including 25 cMAPs and one ncMAP (Fig. S2E). Moreover, the expression levels of source genes were positively correlated with the number of MAPs encoded (Pearson correlation coefficient R = 0.32, P < 2.2e-16).

We next analyzed the length of source genes and found that the ncORFs deriving ncMAPs were the shortest, whereas protein-coding genes generated both kinds of MAPs were the longest (Fig. 2F). We also calculated the generating efficiency of source genes, which were defined as the ratio of the number of amino acids detected in the immunopeptidome to the number of amino acids in the source protein. Surprisingly, the efficiency of generating MAPs exhibited the opposite trend and the ratios were much higher for ncORFs than others (Fig. 2G). These results suggested that although both low expression levels and shorter lengths of source ncORFs that generate ncMAPs, the generation efficiency of ncMAPs was much higher than those of cMAPs.

Prevalence of ncMAPs and cMAPs in AML

The HLA genes are the most polymorphic across the human population, which can complicate the design and development of worldwide covering T-cell epitope-based vaccines. Because of antigens with specific HLA-I restrictions, such vaccines may be only effective across particular populations. For the vaccine to apply to a larger population, the development of MHC-mediated peptide vaccine needs to consider the population coverage of alleles and allele binding motifs27. To evaluate the coverages of HLA alleles in the AML population, we obtained the global frequencies of sample-specific HLA alleles from IEDB28. We observed that the cumulative coverages were high, reaching 90.3%, 76.1% and 94.0% for HLA-A (n = 11), HLA-B (n = 21) and HLA-C (n = 13) alleles (Fig. 3A), respectively. Moreover, MAPs presented by the same allele tend to be prevalent among patients. For example, more than 1,500 MAPs were presented by HLA-A*02:01, and >75% of these MAPs occurred in at least two samples (Fig. 3A). In addition, approximate 60% MAPs presented by HLA-A*01:01 were observed in at least two AML patients. These results indicated the prevalence of shared MAPs across AML patients.

Fig. 3
figure 3

The distribution of MAPs across patients in AML. (A) The top barplot shows the proportion of MAPs presented by different alleles in the samples. The left barplot shows the proportion of MAPs shared by different samples. The middle heatmap shows the percentage of allele-specific presented MAPs to total MAPs in the patient. The right barplot shows the number of MAPs by HLA alleles. (B) The pairwise similarity between samples. (C) The boxplot shows the distribution of Simpson coefficient with shared 0, 1, 2 or > 2 alleles.

Next, we calculated the Simpson index to further evaluate the shared degree of MAPs between patients. We found that the similarity scores were highly correlated, in particular for patients with high number of shared alleles (Fig. 3B). For instance, patients AML13 and AML17 shared 5 HLA-I alleles, which exhibited the highest similarity among all pairs of patients. Moreover, with the increasing number of shared alleles, the MAPs identified were likely to overlap with each other in AML patients (Fig. 3C). Together, these results suggested that the MAPs were prevalent across AML patients, making them potential therapy targets in clinical.

Global identification of shared motifs across alleles

It is well known that the biochemical properties of the peptides are important determinants of peptide binding to HLA I molecules. We thus calculated the pairwise sequence similarity between both alleles using HLA ligands as a proxy to measure HLA similarity. The average pairwise correlation values for HLA-A, B, and C have reached 0.51, 0.48 and 0.80, respectively. We found that HLA alleles belonging to the same supertypes tend to cluster together as expected, such as HLA-B27, HLA-B44, HLA-C03 and HLA-C07 (Fig. 4A, left panel). That is, supertypes as the dominant groups exhibited higher similarity, which was consistent with previous studies29. Supertypes have been found to be associated with clinical characteristics. For example, B44 supertype was found to effectively predict the progression-free survival (PFS) and OS of immune checkpoint blockade therapy in non-small cell lung cancer and melanoma30. Meanwhile, the correlation has attained 0.98 between HLA-C*02:02 and HLA-C*12:03, indicating that they could share motifs. Moreover, there was similar situations among HLA-C*01:02, HLA-C*06:02 and HLA-A02:01.

Fig. 4
figure 4

Alleles Similarity and identified motifs. (A) Pairwise correlations between alleles, based on a vector of frequencies of the 20 amino acids at every peptide position(left), and the Simpson coefficient (right). (B) 2D-visualization of sub-motifs identified across alleles (left), colored according to HLA locus and scaled in size according to the number of underlying peptides making up the sub-cluster. (C) Number of alleles sharing a sub-motif colored according to HLA locus.

On the other hand, the enriched peptide repertoire against the specific alleles is helpful to understand the functions of MAPs in shaping the cancer immunopeptidome. We thus evaluated the similarities between alleles based on the degree of shared binding MAPs. We obtained the similar clustering results (Fig. 4A, right panel), indicating that the sharing degree of MAPs could reflect the similarity of HLA alleles to some extent. In particular, a cluster composed of HLA-A and HLA-C alleles was observed, suggesting the potential sharing mechanism of MAPs across HLA-I supertypes.

To understand the preference for HLA presentation, we have compared the spectrum of distinct and shared sub-motifs across HLA-I alleles, revealing the diversity and complexity of endogenous HLA ligands in AML patients. Firstly, we explored the motifs of ncMAPs and cMAPs across different HLA-I alleles. The similar motif mode was identified between ncMAPs and cMAPs (Figure S3), but there were differences between the position probability matrix of amino acids (Figure S4D). For example, tryptophan and methionine had significant change at different HLA-binding position. Through the motif analysis, we also discovered some sub-motifs were shared among multiple HLA alleles. Next, the distance was calculated between pairwise peptides of a given length against the same allele. We exploited NMDS to reduce peptide distance matrices and DBScan to reveal the sub-motifs of peptides. In total, 254 sub-motifs were identified, which contained at least 20 peptides across alleles. To examine whether the sub-motifs detected were associated with multiple haplotypes, these sub-motifs were projected onto a 2-dimensional space to identify distinct motifs that were specific and shared among alleles (Fig. 4B,C). Of these, 188 sub-motifs were shared across alleles, and 66 were allele-specific. We identified the two largest allele-shared motifs (xLxxxxxxL and xLxxxxxxV), consisting of 23 and 18 alleles, respectively. Furthermore, we also discovered that the sub-motif (xExxxxxxY) with polar primary anchors was shared across HLA-A*26:01 and HLA-B44 supertypes (Fig. S4A). These results were consistent with previous reports that the second and C-terminal positions of epitopes are the most critical residues for HLA I binding that fit into the anchor pockets in the HLA I groove31,32. However, we also identified the some interesting sub-motifs, such as KxxxxxxxF and xxDxxxxxY, which have distinct dominant patterns different from regular p2 and p9 (Fig. S4B,C). These findings are helpful for us in identifying peptides with allele-shared motifs that enable the development of T-cell epitope vaccines with high-population coverage using a limited number of peptides. We found that although HLA-A/B alleles solely contributed half of the motifs, respectively, most of the HLA-C sub-motifs overlapped with sub-motifs from HLA-A and/or B alleles, which account for 71% of all HLA-C sub-motifs (Fig. 4C). Notably, both ncMAPs and cMAPs were merged and contributed to identifying the same anchor residue sub-motifs (Fig. 4B). Altogether, the close correlation between HLA alleles shared motifs provided new insight into why samples with only one shared allele had a high similarity (Fig. 3B).

Neoantigens identified by integration of multi-omics

Owing to the characteristic of low mutational burden in AML, mutation-derived neoantigens are mostly unique to individuals. However, lots of evidence demonstrated tumor-specific antigens (TSAs) generation by cancer-specific translation may be a sizable additional source of potential neoantigens11. Given that neoantigens that aberrant translation in cancer has been more attractive targets for immunotherapy, we integrated multi-omics data to identify potential neoantigens accurately, which minimizes the risk of on-target/off-tumor adverse events by limiting the expression of neoantigens in normal tissues32. To more extensively identify the neoantigens originated from non-canonical translation, we focused on the TSAs, aberrant expressed TSAs (aeTSAs) and CTAs (Fig. 5A). As a result, we identified 21 potential neoantigens from 14 genes, there are 8 TSAs, 4 aeTSAs and 9 CTAs, 61.9% of which were derived from non-coding regions (Fig. 5B).

Fig. 5
figure 5

Accurately identify tumor-specific antigens by integrative multi-level omics profiling. (A) Tumor specific antigen identification pipeline at the transcription level. (B) The number of identified antigens is shown according to ncMAP and cMAP. (C) Antigens are distributed in normal tissues and AML, and the multi-omics screened antigens were highlighted in red and bold. (D) Multi-omics screening antigen source gene expression levels are displayed by heatmap. Red is the example shown next. (E) ILPSYQLFL combined with HLA-A02:01 schema diagram. (F) The Scatter plot of neoantigens, x-axis is PEP, y-axis is Andromeda score. (G) ILPSYQLFL mass spectra.

Furthermore, immunopeptidomes from healthy tissues were re-analyzed against the same reference libraries to further filter out tumor-associated antigens, targeting which could result in severe and sometimes fatal toxicities by T cells16,33. Finally, thirteen neoantigens from eight genes, including RP11-9L18.2, RP11-328J14.2, RP11-65E22.2, GLRX3P2, LA16c-306E5.1, HNRNPA1P38, OR10A5 and DNTT, were identified that were neither expressed in healthy tissues nor detected in normal tissue immunopeptidomes (Fig. 5C,D). For example, the protein-coding gene DNTT, encoding the highest number of TSAs, was normally expressed in primitive hematopoietic cells and B-cell progenitors, whereas it was also over-expressed in older AML patients with NPM1 wild-type reported by previous studies34.

In particular, we identified a ncMAP, ILPSYQLFL, which was uniquely mapped to the pseudogene RP11-9L18.2. The robust binding to HLA-A02:01 was predicted by NetMHCpan (v.4.1)35, MHCflurry (v2.0)36 and MixMHCpred (v2.1)37 (Fig. 5E). Meanwhile, binding stability of the peptide binding to HLA-A02:01 was 3.13 h estimated by NetMHCStabPan and TCR recognition probability was 3.47*10−9 calculated using the multistate thermodynamic model. Thus, the peptide was considered immunogenic. Moreover, the ILPSYQLFL displayed the typical anchor motif xLxxxxxxL, which covered a set of 23 allele-shared sub-motifs, therefore likely to be binders and not contaminants. The PEP and Andromeda score for the best associated MS/MS spectrum of ILPSYQLFL were 0.029 and 91.069 (Fig. 5F). The peptide spectra were also detected in multiple samples/replicates and possessed high coverage across b and y ions (Fig. 5G). Other neoantigens predicted binding to HLA alleles also showed similar features (Figs. S5,6). Therefore, these results demonstrated that ncMAPs without expression and binding by MHC alleles in normal cells might be a sizable additional source of potential neoantigens and provide new antigenic targets for immunotherapy.

Expressions of neoantigen source genes associated with immune evasion

T cells recognize tumor cells by monitoring short peptides presented by the HLA-I molecules on the AML cell surface. In addition, macrophages and dendritic cells, as part of the innate immune cells, colud intake exogenous antigens and present them on the cell surface in the form of HLA-I-peptide complexes through the cross-antigen pathway, which could enhance the killing effect of CD8 + CTLs. It is vital to assess the association between neoantigens and the immune cell abundance8,38. Having determined that neoantigens could be presented by specific HLA alleles, we next evaluated the association of neoantigen-related genes with the abundance of immune cells across the samples using TCGA-LAML RNA-seq. The abundance of immune cells was assessed by xCell and CIBERSORT algorithm39,40. The xCell is based on ssGSEA of marker genes to estimate the abundance of immune cells, whereas the CIBERSORT measures immune cell abundance in samples based on a deconvolution algorithm. It is not difficult to find that there was a co-expression trend between genes encoding the neoantigens (Fig. 6A). In addition, these genes exhibited significant correlations with the immune score, stroma score, microenvironment score and cell stemness (Fig. 6A). The recognition of neoantigens by T-cell receptor (TCR) relies on the presentation of antigen-presenting cell (APC), thus the correlations between antigens and APCs were calculated. In fact, antigen genes were significantly related to antigen-presenting cells (e.g., macrophages, dendritic cells), T cells and other immune cells (Fig. 6B and Fig. S7A). Moreover, in order to examine whether the expressions of the neoantigen-related genes could recruit immune cells against the tumor, we also calculated the correlation between the expressions of genes and immune cell marker genes. We found that these genes were significantly related to the marker genes of CD4+ T cells, CD8+ T cells, memory T cells, helper T cells, dendritic cells, macrophages, and natural killer cells (Fig. 6C). These results indicated that the neoantigen-related genes were highly related to immune cell abundance.

Fig. 6
figure 6

Neoantigen expressions correlated with immune evasion. (A) Correlation of expression values between genes, and Correlation between gene expression values and various indexes. ImmuneScore, StromaScore and MicroenvironmentScore were calculated by xCell. (B) The correlation between the infiltration proportion of immune cells predicted by xCell and gene expression values in AML. (C) Correlation between gene expression values and immune cell marker gene expression values. (D) ROC curve for predicting OS in AML patients by risk scores. (E) Gene expression levels are displayed according to low, middle and high groups, and the Count value represents the number of genes expressed in each sample. (F) Kaplan-Meier survival analysis between risk groups.

To further explore the contributions of neoantigen-related genes to clinical prognosis, lasso regression analysis was performed on 8 genes that encode the neoantigens in AML. The risk score was obtained according to the product of the lasso regression coefficient and gene expressions, which performed well in predicting clinical outcomes, with areas under the ROC curve (AUC) of 0.631, 0.65, and 0.729 for 1-year, 3-year, and 5-year survival, respectively (Fig. 6D). We next ranked the patients by the risk scores and divided them into three groups according to the gap of scores shown by the ranking (Fig. S7B). The expression levels of neoantigen-related genes were shown in the low-, middle- and high-risk groups (Fig. 6E and Fig. S8). Clinical prognostic analysis indicated the high-risk group was associated with a poor prognosis compared with the low- and middle-risk group (Fig. 6F, P = 0.0015, log-rank test). Moreover, patients in the high-risk group had lower mRNAsi scores41 than the low-risk group (Fig. S7C, P = 0.032, Wilcoxon’s rank sum test). The abundance of granulocyte-monocyte progenitor cells in the low-risk group were significantly higher than those in the high-risk group (Fig. S7D, P = 0.038) with the consistency of previous reports42. However, the expression levels of HLA-A, B and C genes in the high-risk group were all significantly higher than those in the low-risk group (Fig. S9, P < 0.01). Meanwhile, we discovered that a higher number of neoantigens was associated with lower risk, and the expression levels of HLA-I genes were negatively correlated with the number of neoantigens. That is, although HLA-I genes were expressed in the high-risk group, the expression of neoantigens was lacking (Figs. S10,11).

In considering the high-risk group exhibiting the immune escape phenomenon, we first evaluated the expression of genes encoding chemokines that recruit immune cells and immune inhibitory genes, high expression of which could change the bone marrow microenvironment and shorten the survival time of patients38. We found that the expressions of both chemokines and immune inhibitory genes in the high-risk group were higher than those in the low-risk group (Fig. S12A). There was a strong positive correlation between the risk score and the WNT-β-catenin signaling pathway (Fig. S12B), the activation of which is known to inhibit T cell activation and infiltration38. Consistent with the previous report that the IFN-γ response pathway can induce immune checkpoint protein PD-L1 in tumor cells by activating STAT1, thus leading to immune escape and tumor recurrence43, the pathway and the expressions of STAT1 were upregulated in the high-risk group (Fig. S12C,D).

Taken together, we found that HLA-presented neoantigens can elicit the immune system to recognize the tumor cells. The phenomenon of immune escape occurred in the high-risk group of AML patients, therefore, WNT-β-catenin/IFN-γ inhibitors or immune checkpoint blockade treatment combined with neoantigen-targeted immunotherapies can be potentially used to treat high-risk patients.

Discussion

Non-canonical category of tumor antigens has recently gained attention in various types of cancer. Despite previous efforts to study ncMAPs, our study has revealed cartography of ncMAPs arose from a harmonized large-scale analysis of multiple-omics data in AML. We constructed more comprehensive non-canonical reference library. We utilized Ribo-seq to identify non-canonical ORFs and filtered those that were entirely contained within other ORF sequences, expanding the cancer immunotherapy targets in AML. We anticipate that this study will help pave the way for future research on antigens from non-canonical sources and engage further oncology research on alternative sources of antigens. Comparing multiple characteristics of ncMAPs to cMAPs, we found that ncMAPs exhibited even better performance than cMAPs in some aspects. For instance, HLA-A and C alleles tend to present ncMAPs and high T-cell contact hydrophobicity, making ncMAPs be recognized easily. Thus, we believe that ncMAPs are considerably more attractive tumor peptides for AML patients.

Moreover, ncMAPs were mainly derived from lncRNAs, UTR regions of coding genes. Although expression levels of their source genes were lower than ones of cMAP, the generation efficiency of source genes derived ncMAPs was higher. We characterized the motifs of MAPs across HLA alleles in detail and identified the allele-shared/specific motifs. Interestingly, we found that the more shared alleles between patients, the higher the similarity. HLA alleles belonging to supertypes were more similar, such as HLA-B18:01, HLA-B44:02, HLA-B44:03 and HLA-B44:05. Moreover, 254-identified sub-motifs were clustered into different motifs shared by different alleles, which indicated that many HLA alleles displayed a similar binding preference. Through the mechanism of shared motif among distinct HLA alleles, we expected to identify the neoantigens with an allele-shared motif and presented by more HLA alleles.

We identified 21 neoantigens based on transcriptome data, in order to further filter out tumor-associated antigens and reduce adverse effects, immunopeptidome from healthy tissues, were further used to discover the neoantigen candidates, which was ignored in previous studies. Here, 13 neoantigens were detected, and the neoantigen “ILPSYQLFL” encoded by RP11-9L18.2 belong to the largest motif “xLxxxxxxL”. The binding capacity between neoantigen-specific alleles were validated by NetMHCpan (v.4.1)35, MHCflurry (v2.0)36 and MixMHCpred (v2.1)37, which highlighted the reliability of our results.

We filtered MAPs at the RNA level and the level of immunopeptidome in terms of normal tissues to minimize the risk of on-target/off-tumor adverse events. Moreover, we found that the expression level of their source genes can also reflect the infiltration degree of immune cells, such as low proportion of T cells and B cells in immune microenvironment. We also constructed a risk model based on the expressions of source genes, which could accurately predict clinical outcomes. The escape events were likely to be occurred in high-risk group, for example, the enrichment of the WNT-β-catenin and IFN-γ pathways, upregulation of chemokines and immune inhibitory genes.

Only ncORFs derived from linear RNAs were considered in this study, and the circRNAs were ignored because most of them originate from the exon circularization of the protein-coding regions, which is not easy to distinguish from the peptides encoded by the coding genes. The reliability of our model will be further verified with the increasing amount of data when more data of MS-based immunopeptidome in AML are assayed. Moreover, neoantigen generation begins with alterations at the genomic, epigenomic, transcriptomic, translational, proteomic and antigen-processing levels along the HLA-I processing and presentation pathway. For example, alternative splicing, RNA editing and post-translational modifications could result in the generation of non-canonical antigens. In consideration of these factors, the source of the neoantigens would be more abundant and diverse. In the future, we will investigate the immunogenic capacity of neoantigens and the potential as therapeutic cancer vaccines.

In conclusion, our research highlighted that ncMAPs represented a rich source of tumor epitopes. They had the similar binding motifs to cMAPs across HLA-I haplotypes. The shared mechanism of MAPs among AML patients and MHC I alleles were observed, which would help identify neoantigens shared across patients in AML. By integrating the multi-omics data, we accurately identified neoantigens and reduced the unexpected adverse effects. The expressions of neoantigen-related genes could reflect the tumor microenvironment, and the neoantigen-based model was discovered to predict clinical prognosis. Integration of non-canonical tumor antigens as targets would increase the armamentarium of immunotherapy options available.

Methods

ncORF reference library

Ribo-seq-supported ORFs with their basic annotations were integrated from RPFdb17, nuORFdb11 and TransLnc14. All the ncORFs originated from presumed untranslated regions were retained, including lncRNAs, 5′ uORF, 5′ Overlap uORF, Out−of−Frame, 3′ dORF, 3′ Overlap dORF, pseudogene and other non-coding RNA derived ORFs. We kept the ncORFs with NTG (ATG, CTG, TTG, and GTG) start codons and TAA/TGA/TAG stop codons11,15,44. Peptides were translated into amino acid sequences, and only those longer than 8 amino acids were retained. After filtering peptides that were entirely contained within other known peptides, the remaining ones constituted the benchmarked ncORF library. Finally, 475, 782 ncORFs were incorporated to build the benchmarked non-canonical peptide repertoire for further analysis.

Immunopeptidome in AML and normal tissues

The immunopeptidome enabled us to directly discover and quantify the entire HLA-presented peptides, including epitopes derived from non-coding regions. The public available MS/MS immunopeptidome of 19 AML patients and 30 normal tissues were accessed, and the corresponding RAW files (PXD018542 and PXD019643) were downloaded from the PRIDE database, including 652 datasets from healthy tissues7,45. All the immunopeptidome data were analyzed by MaxQuant (v.2.1.0.0) against our integrated benchmarked ncORF library plus the canonical human proteome obtained from 20,386 sequences of human Swiss-Prot database (downloaded on February, 2022)46. According to previous studies15,45,47, protein N-terminal acetylation and oxidation (M) were set as variable modifications and no fixed modification was set. Enzyme specificity was set to ‘unspecific’. Possible peptide identifications were restricted from 8 to 12 amino acids and maximum peptide mass was set to 1 500 Da. LFQ was set to a ‘minimum ratio count’ of 1. A peptide spectrum match FDR of 0.05 was used, and no protein FDR was set. The Andromeda score was used to evaluate how well an acquired spectrum matches with the theoretical fragment masses. We removed peptides with MaxQuant scores lower than 80 or a posterior error probability (PEP) larger than 0.147. Razor peptides, found in more than one protein group, are assigned to the protein group with the larger number of identified peptides or, in case of a tie, to the protein group with more highly scoring peptides. Peptides identified with non-canonical leading razor proteins were considered as as putative cryptic MAPs.

Transcriptome data in AML and normal tissues

RNA-seq-based gene expression data in normal tissues was downloaded from GTEx that contains 17,382 samples across 30 normal tissues48. Gene expression data and clinical data of AML patients were accessed from The Cancer Genome Atlas (TCGA) (http://cancergenome.nih.gov/). To compare with the GTEx data, the FPKM expression units of the AML samples were transformed into TPM units.

Identification of MAPs

For the identification of MAPs, three algorithms for predicting MHC-bound peptides were applied to filter experimentally identified peptides, including NetMHCpan (v.4.1)35, MHCflurry (v2.0)36 and MixMHCpred (v2.1)37. In terms of each sample, the binding affinities between patient-specific alleles and MS-detected peptides were predicted. Only peptides with a percentile rank ≤ 2% by at least two methods were defined as MAPs.

Assessing hydrophobicity of MAPs

The hydrophobicity feature of MAPs was analyzed, including T-cell contact hydrophobicity and the fraction of hydrophobic residues. We calculated T-cell contact hydrophobicity by summing the Kyte–Doolittle hydrophobicity score for positions 4 through n − 1 of the peptide25. In addition, we also calculated the fraction of peptide residues that were hydrophobic49. Hydrophobic residues were considered to be “G”, “A”, “F”, “W”, “P”, “I”, “L”, “V” and “M”50.

Identification of neoantigens based on multi-omics data

According to the expression levels of source genes coding the MAPs, neoantigens were divided into 3 categories, including TSAs, aeTSAs and CTAs. We ranked source genes at a strict 90 percentile expression. A MAP was defined as a TSA, only if it was highly expressed (TPM > 1) in AML patients and was hardly or not expressed (TPM < = 1) in 30 healthy tissues obtained from GTEx. The aeTSAs were also identified with a relatively strict requirement, whose source genes was required with at least 2-fold higher expression in cancer patients compared to the highest level detected in corresponding normal samples10,11,51. CTAs were required to be widely expressed in AML samples but only expressed in normal germ and placental trophoblast cells (TPM > 1). We further filtered these neoantigens based on immunopeptidome from 30 healthy tissues, and those were retained for future analysis, if they did not detect or were weakly to HLA I molecular in normal tissues.

Allele similarity analysis

The information of HLA alleles was obtained from the original publication7,45. Allele similarity was assessed by the observed binding motifs as well as the degree of shared peptides between HLA alleles. Similarity of the observed binding motifs was evaluated by the pairwise correlations of vectors, which converted from the position probability matrix of 20 amino acids along the 9-mer peptide sequence per allele29.

To evaluate similarity in the degree of shared peptides, the Simpson index was calculated:

$${\rm{Simpson}}\,{\rm{index}}({\rm{i}},{\rm{j}})=\frac{|{{\rm{P}}}_{{\rm{i}}}\cap {{\rm{P}}}_{{\rm{j}}}|}{\min \{{{\rm{P}}}_{{\rm{i}}},{{\rm{P}}}_{{\rm{j}}}\}}$$

where Pi and Pj are the peptide list of binding to the specific allele i and j, respectively. The Simpson index range from no similarity (0) to full similarity (1). The closest neighbor in the observed binding motifs and the degree of shared peptides were considered to be alleles with correlation greater than 97.5% of all pairwise correlations.

Sub-motif and motif analysis across alleles

To group the 9-mer peptides identified for each allele into sub-motifs, we calculated the peptide distance metric between pairwise peptides of a given length:

$$D(A,{s}_{1},{s}_{2})=\frac{1}{L}\mathop{\sum }\limits_{i=1}^{L}distPMBEC({s}_{1i},{s}_{2i})\ast (1-{H}_{Ai})$$

where s1 and s2 are peptide sequences belonging to allele A, L is the length of peptide, H is the entropy of peptides binding to the allele A at the position i, which can be calculated by the function ‘MolecularEntropy’ of R package ‘HDMD’, \(distPMBEC=\,\max \,PMBEC-PMBEC\) is a 20 × 20 matrix of residue dissimilarities originated from a pre-computed amino acid similarity matrix biased by binding affinity data of combinatorial peptide mixtures52.

Non-metric multidimensional scaling (NMDS) was used to reduce peptide distance matrices to two dimensions. We used DBScan, a clustering method based on density, to reveal the subgroups of peptides. We also used the same method to identify the motif shared by two or more alleles11,29.

The survival risk model

The expressions of source genes that encoding the neoantigens, survival time and status in AML as input were used to perform the lasso regression analysis with the R package ‘glmnet’. The product of the lasso regression coefficient and gene expression matrix was recognized as the risk score:

$$Riskscor{e}_{s}=\mathop{\sum }\limits_{i=1}^{n}{g}_{si}\ast {\beta }_{i}$$

where g is the expression value of gene i in sample s, β is the regression coefficient of gene i, n is the number of the samples in AML.

Statistical analysis

All statistical calculations were performed using the R v.4.1.2. Correlation tests were evaluated using the R function “cor.test” with the Pearson method unless otherwise indicated. Tests involving comparisons of distributions in different group were performed with the Wilcoxon test. Log-rank P values for survival analysis were calculated with the “survival” R package. Kaplan-Meier analysis was used to estimate overall survival among patients in risk-high, risk-middle and risk-low group. Differences and associations were described as statistically significant for P < 0.05 (ns: P > 0.05; *P < 0.05; **P < 0.01, ***P < 0.001, ****P < 0.0001).