Abstract
Chinese hamster ovary (CHO) cells are used to produce almost 90% of therapeutic monoclonal antibodies (mAbs) and antibody fusion proteins (Fc-fusion). The annotation of non-canonical translation events in these cellular factories remains incomplete, limiting our ability to study CHO cell biology and detect host cell protein (HCP) impurities in the final antibody drug product. We utilised ribosome footprint profiling (Ribo-seq) to identify novel open reading frames (ORFs) including N-terminal extensions and thousands of short ORFs (sORFs) predicted to encode microproteins. Mass spectrometry-based HCP analysis of eight commercial antibody drug products (7 mAbs and 1 Fc-fusion protein) using the extended protein sequence database revealed the presence of microprotein impurities. We present evidence that microprotein abundance varies with growth phase and can be affected by the cell culture environment. In addition, our work provides a vital resource to facilitate future studies of non-canonical translation and the regulation of protein synthesis in CHO cell lines.
Introduction
Chinese hamster ovary (CHO) cells are the predominant mammalian expression host for the production of biologics, with nearly 90% of therapeutic monoclonal antibodies (mAbs) and Fc-fusion proteins produced in this cell line1. During the upstream cell culture phase of production, CHO cells continually secrete the recombinant protein into the supernatant. A series of downstream purification steps are required to recover the drug substance in the harvested cell culture media and reduce a range of impurities, including those originating from the host CHO cell line. Host cell proteins (HCPs) present in the final drug product are a particular concern due to the risk that a HCP could elicit an immune response in the patient or reduce efficacy2. In addition, the presence of proteolytic HCPs can degrade or affect the stability of the therapeutic antibody3,4. Regulatory authorities consider the amount of HCP in the final product to be a critical quality attribute, and biopharmaceutical companies aim to reduce the total HCP concentration to <100 ppm5.
The efficacy and safety track record of therapeutic proteins produced in CHO cells is a testament to the continual commitment of biopharmaceutical companies and regulatory authorities to ensuring product quality. Enzyme-linked immunosorbent assays (ELISA) that enable sensitive total HCP quantitation are widely used in batch release during commercial manufacturing6. It is recognised, however, that this method can be limited in terms of coverage7. For instance, some low molecular weight proteins7 may be weakly immunogenic, or indeed, a particular protein may not elicit a response in the species immunised to generate the HCP assay8. Regulatory authorities now recommend7 the use of mass spectrometry (MS) as an orthogonal HCP detection method5 to enable the identification and quantification of individual HCPs, even those at low concentrations. The resulting data can be used during process development to monitor the HCP impurities present at each unit operation of a downstream purification process and to demonstrate HCP clearance from the final drug product7,9. Identification of HCPs can also be used to guide upstream process development10 or identify targets for cell line engineering to remove unwanted HCPs11.
Since the publication of the first CHO cell genome12 and CHO cell-specific protein databases13, the detection of CHO HCP impurities in antibody drug products using MS has significantly improved. The quality of available genomes has steadily improved over time, and with the release of the Chinese hamster PICRH genome, the field now has a reference assembly comparable to that of model organisms14. While annotation of the transcriptome has progressed significantly, characterisation of the proteome is more challenging and remains incomplete, therefore limiting the ability of MS to detect the complete repertoire of potential HCP impurities.
The Chinese hamster reference genome has, for the most part, been annotated via a combination of ab initio computational pipelines, homology, ESTs, and transcriptomics data. Previously, the effectiveness of ribosome footprint profiling (Ribo-seq) in identifying translated regions of the Chinese hamster genome has been demonstrated15. Ribo-seq enables transcriptome-wide determination of ribosome occupancy at single nucleotide resolution, facilitating the discovery of new open reading frames (ORFs), and, when combined with RNA-seq, the identification of changes in translational regulation16. The technique utilises chemical or physical inhibitors to arrest translation and fixes translating ribosomes in position, resulting in the protection of ~30 nt of mRNA within the ribosome from subsequent enzymatic degradation. The resulting monosomes are purified via sucrose gradient, sucrose cushion, or size exclusion chromatography, followed by the isolation of ribosome-protected fragments (RPFs) through size selection, from which a sequencing library is prepared. Sequencing of RPFs and alignment to a reference genome or transcriptome permits the identification and quantitation of regions undergoing active translation. Over the last decade, Ribo-seq has provided compelling evidence that the traditional rules of eukaryotic translation need to be revised. For example, translation initiation at non-AUG codons is more widespread in mammalian genomes than previously thought17.
Data from Ribo-seq experiments have been used to annotate a range of non-canonical ORFs18, including N-terminal extensions19, detect translation of RNAs previously thought to be non-coding20 and study the regulatory role of ORFs in the 5’ leader sequence of mRNAs (i.e., upstream open reading frames)21. Ribo-seq has also revealed the existence of small open reading frames (sORFs) that produce potentially functional microproteins (classified as proteins < 100 aa) in a diverse range of organisms, including Drosophila22, zebrafish23, mouse24, and human25,26. Studies have shown that microproteins are involved in various cellular processes such as oxidative phosphorylation27, mitochondrial translation28, metabolism29, DNA repair30 and can also act as transcription factors31.
There has been considerable interest in enhancing the efficiency of CHO cell factories for mAb production using systems biology32 and cell line engineering33. Yet, while the importance of non-canonical ORFs is becoming increasingly understood in other organisms, the lack of annotation severely restricts the study of their role in CHO cell biology. Perhaps more surprisingly, given the fundamental role of protein synthesis in mAb production, there is a lack of knowledge of how translational regulation impacts CHO cell behaviour during the cell culture process. Indeed, apart from a small number of studies15,34, Ribo-seq has not received widespread attention in the field, and the capability of the technique to study CHO cell translation regulation has yet to be demonstrated. Annotation of non-canonical ORFs in the Chinese hamster genome and transcriptome-wide analysis of protein synthesis in CHO cells has the potential to identify additional avenues to improve therapeutic antibody production.
In this work, we utilise Ribo-seq with different translation inhibitors to analyse translation elongation and translation initiation in CHO cells. Using these data, we have significantly enhanced the annotation of non-canonical ORFs in the Chinese hamster genome. We have identified previously unannotated translation events in protein-coding genes and thousands of short ORFs predicted to encode microproteins. We have shown that Ribo-seq enables improved characterisation of CHO cell biology compared to performing transcriptomics alone and present evidence that non-canonical ORFs are altered in response to environmental changes and over the course of cell culture. Importantly, our work has improved MS-based HCP detection, enabling microprotein impurities to be identified in antibody drug products.
Results
Transcriptome-wide analysis of CHO cell translation initiation and elongation using Ribo-seq
The reduction of cell culture temperature (i.e., temperature shift) is a method used to extend the viability of some commercial cell culture processes and improve product quality35. In this study, we simulated an industrial temperature shift and conducted a series of Ribo-seq experiments. Our laboratory has demonstrated that temperature shift induces significant differences in gene expression and alters the cellular metabolism of a mAb-producing CHO-K1 cell line (CHO-K1 mAb)36. Given our previous findings and studies from other laboratories reporting the alteration of canonical protein abundance10,37,38,39, we reasoned that this temperature shift model would also induce widespread changes in translation regulation and provide an opportunity to identify novel Chinese hamster ORFs.
To perform ribosome footprint profiling, we conducted two identical cell culture experiments for the analysis of translation initiation and elongation. For both experiments, 8 replicate shake flasks were first grown for 48 h at 37 °C before the cell culture temperature was reduced to 31 °C (temperature shifted (TS) group; n = 4) while maintaining the remainder at 37 °C (non-temperature shifted (NTS) group; n = 4). Cells from both the TS and NTS groups were harvested for Ribo-seq 24 h after the reduction of cell culture temperature (Fig. 1a), at which point there was a reduced cell density of 30% (initiation experiment) and 24% (elongation experiment) in the TS sample group (Supplementary Fig. 1; Supplementary Data 1a). We performed ribosome footprint profiling experiments using harringtonine (HARR) (n = 8), an inhibitor of translation initiation24, and cycloheximide (CHX) (n = 8), an inhibitor of translation elongation16 (Fig. 1b). For each harringtonine-treated sample, a parallel sample (n = 8) was treated with DMSO and flash frozen to arrest translation (we refer to these data as No-drug (ND)). For the CHX samples, matched gene expression profiles were acquired using total RNA-seq (n = 8) (Fig. 1c) to enable the identification of significant differences in translational efficiency (TE) between the NTS and TS sample groups.
a 8 replicate shake flasks were seeded with a mAb-producing CHO-K1 cell line cultured for 48 h; at this point, the temperature of 4 shake flasks was reduced to 31 °C. At 72 h post-seeding, cells were harvested from the non-temperature and temperature-shifted cultures. We utilised (b) Ribo-seq using different inhibitors to capture information from initiating (harringtonine) and elongating (cycloheximide and no drug) ribosomes. The chemical structures shown for harringtonine (CID = 276389) and cycloheximide (CID = 6197) were obtained from PubChem94. In addition, (c) RNA-seq was used to measure the RNA levels. Following pre-processing of the raw Ribo-seq data, we (d) retained reads within the expected size range (coloured in blue) of RPFs (28-31nt). As expected, no peak was observed for the RNA-seq data. An optimum P-site offset of 12 nt was selected for all datasets, where (e) an average of 66% of RPFs exhibited triplet periodicity. No framing pattern was observed for the RNA-seq data. A metagene analysis was conducted for each Ribo-seq dataset, confirming (f) the expected enrichment of RPFs at the TIS of annotated protein-coding genes for harringtonine Ribo-seq when compared to elongation Ribo-seq (average cycloheximide and no-drug treated RPFs). Separate metagene profiles for CHX and ND are shown in Supplementary Fig. 3. Source data are provided as a Source Data file. Panels a, b and c created with BioRender.com, released under a Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International license.
Sequencing of the 24 resulting Ribo-seq libraries yielded an average of ~69, ~68, and ~58 million reads per sample for the CHX, HARR, and ND Ribo-seq, respectively. An average of ~56 million reads per sample were obtained for the 8 RNA-seq libraries. Low-quality reads were removed, and adapter sequences were trimmed from the raw Ribo-seq and RNA-seq data. For Ribo-seq data, an additional filtering stage was carried out to eliminate contamination from non-coding RNA. Reads were mapped to bowtie40 indices constructed from Cricetulus griseus rRNA, tRNA, and snoRNA sequences obtained from v22 of the RNA Central database41. Reads aligning to any of these indices were discarded from further analysis. This filtering stage removed an average of ~55%, ~40%, and ~39% of trimmed reads for the CHX, HARR, and ND samples, respectively (Supplementary Fig. 2; Supplementary Data 2).
Next, we examined the remaining Ribo-seq reads within the expected RPF length range (25-34nt) (Fig. 1d) to select the P-site offset (the distance from the 5’ end of a read to the first nucleotide of the P-site codon). Each Ribo-seq dataset was mapped to the Chinese hamster PICRH-1.0 genome using STAR42. The Plastid tool43 was used to assess the P-site offset and determine the proportion of reads exhibiting triplet periodicity (Fig. 1e) for NCBI-annotated canonical protein-coding genes for each offset. The optimum P-site offset was found to be 12 nt, for which ~ 71%, 65%, and 64% of reads exhibited the expected triplet periodicity for the CHX, HARR, and ND Ribo-seq datasets, respectively, and we retained the reads between 28–31 nt for further analysis (Fig. 1d). Prior to ORF identification, we confirmed the expected preferential enrichment of ribosomes at the translation initiation sites (TIS) of annotated protein-coding genes for the HARR Ribo-seq data in comparison to the CHX and ND Ribo-seq data (Fig. 1f; Supplementary Fig. 3).
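The triplet periodicity check described above can be illustrated with a minimal sketch. It is not the Plastid implementation; the function name, the 0-based transcript coordinates, and the toy read positions are assumptions made for illustration only. For a given P-site offset, each read is assigned to a reading frame relative to the annotated start codon, and the fraction of reads in each frame is reported.

```python
from collections import Counter


def frame_fractions(read_starts, cds_start, offset=12):
    """Fraction of reads whose inferred P-site lies in frame 0, 1 and 2.

    read_starts: 0-based transcript coordinates of read 5' ends (hypothetical input).
    cds_start:   0-based coordinate of the first nucleotide of the start codon.
    offset:      distance from the read 5' end to the first P-site nucleotide.
    """
    counts = Counter((start + offset - cds_start) % 3 for start in read_starts)
    total = sum(counts.values())
    if total == 0:
        return [0.0, 0.0, 0.0]
    return [counts[frame] / total for frame in range(3)]


# Toy example: five of six reads place their P-site in frame 0.
toy_reads = [88, 91, 94, 97, 89, 100]
fractions = frame_fractions(toy_reads, cds_start=100, offset=12)
```

Repeating this calculation over a range of candidate offsets and choosing the one that maximises the frame 0 fraction is one simple way to select an offset, under the assumption that most RPFs derive from in-frame elongating ribosomes.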
Ribo-seq enables the characterisation of novel ORFs in the Chinese hamster genome
The Ribo-seq data was used to refine the annotation of translated regions of the Chinese hamster PICRH-1.0 genome by conducting a transcriptome-wide analysis using ORF-RATER44. The ORF-RATER algorithm integrates initiation and elongation Ribo-seq data to enable the identification of unannotated ORFs by first finding all potential ORFs beginning at user-defined start codons that have an in-frame stop codon. The experimental Ribo-seq data is then used to confirm occupancy at each TIS and assess whether the putative ORF is undergoing active translation. To maximise the sensitivity of ORF detection, we merged the RPFs for all replicates in each type of Ribo-seq experiment yielding a total of approximately 144, 169, and 140 million RPFs for the harringtonine, cycloheximide, and no-drug treated Ribo-seq, respectively. Prior to ORF identification, transcripts from 4583 pseudogenes were removed. In addition, transcripts that had low coverage (n = 18,951), a high proportion of multimapped reads (n = 10), or where the RPFs aligned to a small number of positions within a transcript (n = 1662) were also excluded from further analysis. For the remaining transcripts, the initial ORF-RATER search step was limited to ORFs that began at an AUG or near-cognate start codon (i.e., CUG, GUG, or UUG). To determine if a potential TIS was occupied, only the RPF data from the HARR Ribo-seq was considered while CHX and ND-treated Ribo-seq data was utilised to determine if putative ORFs were translated by comparing the RPF occupancy of each ORF to the typical pattern of translation elongation observed for CDS of annotated protein coding genes.
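The initial candidate search step can be sketched as follows. This is a simplified illustration and not ORF-RATER code: the function name and the toy sequence are hypothetical, the sequence is treated as a spliced transcript written in DNA letters, and the 5 aa minimum length is borrowed from the ORF selection criteria used later in this study.

```python
START_CODONS = {"ATG", "CTG", "GTG", "TTG"}  # AUG plus near-cognate codons, as DNA
STOP_CODONS = {"TAA", "TAG", "TGA"}


def candidate_orfs(seq, min_aa=5):
    """List (start, end, start_codon) for each start codon with an in-frame stop.

    Coordinates are 0-based; end is exclusive and includes the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for start in range(len(seq) - 2):
        codon = seq[start:start + 3]
        if codon not in START_CODONS:
            continue
        # Walk downstream codon by codon until the first in-frame stop.
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                n_aa = (pos - start) // 3  # residues, counting the initiator
                if n_aa >= min_aa:
                    orfs.append((start, pos + 3, codon))
                break
    return orfs


# Toy transcript: a CUG-initiated ORF of 6 codons followed by a stop codon.
toy_transcript = "CC" + "CTG" + "GCA" * 5 + "TAA" + "GG"
found = candidate_orfs(toy_transcript)
```

Every candidate produced this way is only a sequence-level possibility; the Ribo-seq evidence is what determines whether the TIS is occupied and the ORF is translated.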
An initial group of 27,784 ORFs identified by ORF-RATER with an ORF-RATER score of ≥ 0.545,46 and ORF length ≥ 5 aa was selected for further analysis. The proteoforms identified included those present in the current annotation of the Chinese hamster genome (i.e., Annotated) and N-terminal extensions (i.e., Extension). Two distinct classes of ORFs initiating upstream of the annotated CDS (i.e., the main ORF) were also identified. The first type, called upstream ORFs (i.e., uORFs), initiates upstream and terminates before the start codon of the main ORF. The second upstream ORF type, termed overlapping upstream open reading frames (ouORFs), also initiates in the 5’ leader of mRNAs but extends downstream beyond the main ORF’s start codon and is translated in a different reading frame. As well as ORFs in mRNAs, we also identified ORFs in transcripts classified as non-coding in the PICRH-1.0 genome that had previously unannotated start and stop codons (i.e., New ORFs).
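The distinction between the upstream ORF classes can be expressed as a short coordinate check. The sketch below is illustrative only; the function name and labels are hypothetical, coordinates are assumed to be 0-based positions on the mRNA with an exclusive end, and an upstream start that reads into the main ORF in the same frame is assumed to correspond to an N-terminal extension.

```python
def classify_upstream_orf(orf_start, orf_end, cds_start):
    """Classify an ORF by its position relative to the main ORF start codon."""
    if orf_start >= cds_start:
        return "not upstream"
    if orf_end <= cds_start:
        return "uORF"  # terminates before the main ORF start codon
    if (cds_start - orf_start) % 3 != 0:
        return "ouORF"  # overlaps the main ORF in a different reading frame
    return "extension"  # in frame with the main ORF


labels = [
    classify_upstream_orf(10, 40, 100),
    classify_upstream_orf(80, 131, 100),
    classify_upstream_orf(70, 400, 100),
]
```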
The conditions used to inhibit translation initiation can, in some cases, lead to the identification of false positive internal ORFs due to the capture of residual elongating ribosomes45. In our case, we utilised flash freezing in combination with harringtonine, which will also result in the capture of a proportion of RPFs from elongating ribosomes, increasing the probability of erroneous identifications. To reduce false positives from internal TIS, we discarded ORFs classified as truncated (n = 9365) or internal (n = 1723), along with other low-confidence isoforms (n = 872), from further analysis. For the remaining ORFs, we utilised a method developed by Lee et al.47 to perform relative quantitation of the harringtonine signal at each TIS when compared to the ND Ribo-seq data. ORFs with an \(R_{\mathrm{harr}} - R_{\mathrm{nd}}\) value < 0.01 were eliminated from further analysis (n = 4633). The validity of ORFs with non-AUG TIS was further assessed in comparison to other proteoforms that overlapped on the same transcript. Where an AUG and non-AUG ORF were predicted to start within a 7 nt window, we eliminated the non-AUG initiated ORF. In cases where a pair of AUG-initiated ORFs, or a pair of non-AUG-initiated ORFs, were found within the window, only the ORF with the maximum \(R_{\mathrm{harr}} - R_{\mathrm{nd}}\) value was retained. For overlapping ORFs that started outside of the 7 nt window, non-AUG ORFs were retained if the \(R_{\mathrm{harr}} - R_{\mathrm{nd}}\) value at the TIS was at least five times higher than that of the AUG-initiated counterpart. This process eliminated a further 465 ORFs. In the final stage of the assessment, Ingolia’s fragment length organisation similarity score (FLOSS)48 was calculated for all ORFs (annotated and novel), and ORFs with a FLOSS score classified as an extreme outlier were removed (n = 525).
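The start codon rules for a pair of overlapping ORFs can be sketched as a small decision function. This is an illustrative reading of the rules above rather than the code used in the study: the dictionary keys, the function names, and the interpretation of the 7 nt window as an absolute TIS distance of at most 7 nt are assumptions, and the case of two same-type ORFs starting outside the window is not described in the text, so both are kept here.

```python
def is_aug(orf):
    return orf["codon"] == "AUG"


def keep_from_pair(a, b, window=7, fold=5.0):
    """Return the ORFs retained from two overlapping ORFs on one transcript.

    Each ORF is a dict with 'tis' (position), 'codon' and 'signal',
    where 'signal' stands for the R_harr - R_nd value at the TIS.
    """
    near = abs(a["tis"] - b["tis"]) <= window
    if is_aug(a) != is_aug(b):
        aug, non_aug = (a, b) if is_aug(a) else (b, a)
        if near:
            return [aug]  # AUG wins inside the window
        if non_aug["signal"] >= fold * aug["signal"]:
            return [aug, non_aug]  # strong non-AUG signal outside the window
        return [aug]
    if near:
        return [max(a, b, key=lambda orf: orf["signal"])]
    return [a, b]  # assumption: same-type ORFs outside the window are both kept


aug_orf = {"tis": 100, "codon": "AUG", "signal": 0.05}
cug_near = {"tis": 104, "codon": "CUG", "signal": 0.30}
cug_far = {"tis": 40, "codon": "CUG", "signal": 0.30}
gug_near = {"tis": 103, "codon": "GUG", "signal": 0.10}
```

In practice, the minimum signal filter (a value below 0.01) would be applied before this pairwise comparison.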
Upon the completion of the filtering process, 10,201 high-confidence ORFs were retained (Fig. 2a, Supplementary Fig. 4, Supplementary Data 3), of which ~44% (n = 4491) were not annotated in the Chinese hamster PICRH-1.0 genome. ~56% of these new identifications were predicted to start at near-cognate codons (i.e., CUG, GUG, or UUG). The novel N-terminal extensions, uORFs, and ouORFs that remained after this filtering process were compared to annotations in uORFdb49. The proportion of ORFs with near cognate start codons was comparatively lower in Chinese hamster than in the 11 species examined, including human, mouse, and rat (Supplementary Fig. 5). The ability to identify initiation at non-AUG codons enabled us to discover alternative ORFs of conventional protein-coding genes that would not be possible with previous annotation approaches for the Chinese hamster genome. For instance, ~11% (n = 527) of novel ORFs identified were N-terminal extensions of annotated protein-coding transcripts (e.g., Aurora kinase A (Fig. 2b)).
In this study, we utilised the ORF-RATER algorithm to identify ORFs initiating at near cognate (i.e., NUG) start codons from the Ribo-seq data. A total of (a) 10,201 ORFs were identified, including 4491 that were not previously annotated in the Chinese hamster genome. These ORFs included N-terminal extensions in annotated protein-coding genes. For instance, we identified (b) a CUG initiated N-terminal extension in a transcript of Aurka. The RNA-seq, CHX coverage of the transcript (full coverage [coloured grey] and P-site offset CHX coverage [coloured by reading frame relative to the annotated TIS]) along with the HARR-ND coverage (P-site offset) are shown, illustrating the initiation signal at the CUG start codon upstream of the NCBI annotated AUG start codon. Source data are provided as a Source Data file.
The Chinese hamster genome harbours thousands of short open reading frames
We also identified a considerable number of previously uncharacterized short open reading frames (sORFs) in the Chinese hamster genome (Supplementary Data 3). sORFs are defined as ORFs predicted to produce proteins < 100 aa, termed microproteins50. More than 90% of the ORFs identified in the 5’ region of mRNAs (uORFs (Fig. 3a) and ouORFs (Fig. 3b)) or in transcripts previously annotated as non-coding (Fig. 3c) were sORFs (Fig. 3d). 2276 uORFs were classified as sORFs and had an average putative microprotein length of 23 aa (Supplementary Fig. 6a). AUG (49.9%) was the most prevalent start codon, followed by CUG (29.3%), GUG (12.7%), and UUG (8.2%). The average predicted microprotein length of the ouORFs classified as sORFs (n = 918) was 39 aa (Supplementary Fig. 6b), with CUG (37.7%) the most frequent start codon, followed by AUG (33.0%), GUG (18.5%), and UUG (10.8%). For the New ORF class, the majority (480 of 487) were sORFs and found in transcripts annotated in NCBI as non-coding. The average length of the microproteins predicted to be encoded by these sORFs was 30 aa (Supplementary Fig. 6c). AUG (70.0%) was the most common start codon, followed by CUG (18.3%), GUG (8.3%), and UUG (3.3%).
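The sORF definition and the start codon tallies reported above amount to simple bookkeeping, sketched below. The function names and the toy input are hypothetical, and ORF lengths are assumed to be given in nucleotides including the stop codon.

```python
from collections import Counter


def is_sorf(orf_nt_length, max_aa=100):
    """True if the encoded protein is shorter than max_aa residues."""
    n_residues = orf_nt_length // 3 - 1  # subtract the stop codon
    return n_residues < max_aa


def start_codon_usage(start_codons):
    """Percentage of ORFs initiating at each start codon, most frequent first."""
    counts = Counter(start_codons)
    total = sum(counts.values())
    return {codon: round(100 * n / total, 1) for codon, n in counts.most_common()}


toy_usage = start_codon_usage(["AUG"] * 7 + ["CUG"] * 2 + ["GUG"])
```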
Examples are shown of (a) an uORF found in a Ddit3 transcript, (b) an ouORF in a Rab31 transcript, and (c) a sORF found in the transcript of a long non-coding RNA gene. a, b and c show the full CHX coverage [coloured grey] and P-site offset CHX coverage [coloured by reading frame relative to the annotated TIS]. Many previously uncharacterized ORFs identified in this study were (d) sORFs predicted to produce proteins < 100 aa. We focused on short open reading frames found in the 5’ leader of protein-coding transcripts (i.e., upstream ORFs and start overlapping uORFs) as well as ORFs found in non-coding RNAs where > 90% of all identified ORFs in these classes were sORFs. Comparison of the (e) amino acid frequencies of uORFs (both uORFs and ouORFs) and ncRNA sORFs to annotated proteins, as well as the expected amino acid frequency for the Chinese hamster genome, revealed differences in usage of amino acids, including arginine and glycine when compared to conventional protein-coding ORFs ( ≥ 100 aa). Supplementary Fig. 7 shows the frequency of all amino acids. The sORF populations were also found to have (f) a reduced codon adaption index (CAI) compared to previously annotated canonical proteins. A two-sided Kolmogorov–Smirnov test was used to assess the CAI difference between ORF types; a p-value < 0.01 was considered significant. The (d) and (f) boxplot center lines show the median length, and the whiskers extend to 1.5× the interquartile range. Source data are provided as a Source Data file.
Upstream ORFs and sORFs in the New ORF group were found to have differences in amino acid usage when compared to annotated proteins with ≥ 100 aa (Fig. 3e and Supplementary Fig. 7). The amino acid usage was comparable to a recent analysis conducted for microproteins encoded in the human genome26. CHO cell sORFs were found to have increased usage of arginine, glycine, and tryptophan and decreased usage of asparagine, glutamate, lysine, and aspartic acid. Alanine and proline were more prevalent in uORFs than annotated proteins and sORFs found in ncRNA, while methionine usage was more frequent in sORFs in ncRNA. We also compared the codon adaption index (CAI)51 between previously annotated protein-coding genes and novel ORF types. We found that the sORF population had a lower CAI (Fig. 3f), indicating that microproteins tend to have a lower abundance than canonical proteins.
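For readers unfamiliar with the CAI, the calculation can be sketched as the geometric mean of the relative adaptiveness of each codon, where relative adaptiveness is the frequency of a codon divided by that of the most frequent synonymous codon in a reference gene set. The sketch below uses a toy two-amino-acid reference table; the function names and counts are hypothetical, and codons absent from the weight table (such as those for methionine and tryptophan, which have no synonymous alternatives) are simply skipped.

```python
import math


def relative_adaptiveness(usage_by_aa):
    """usage_by_aa: {amino_acid: {codon: count}} from a reference gene set."""
    weights = {}
    for codon_counts in usage_by_aa.values():
        top = max(codon_counts.values())
        for codon, count in codon_counts.items():
            weights[codon] = count / top
    return weights


def cai(cds, weights):
    """Geometric mean of codon weights across a coding sequence (DNA letters)."""
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    logs = [math.log(weights[c]) for c in codons if weights.get(c, 0) > 0]
    return math.exp(sum(logs) / len(logs)) if logs else float("nan")


# Toy reference usage for leucine and lysine only.
toy_weights = relative_adaptiveness({
    "L": {"CTG": 40, "CTA": 10},
    "K": {"AAG": 30, "AAA": 15},
})
cai_preferred = cai("CTGAAG", toy_weights)  # only preferred codons
cai_rare = cai("CTAAAA", toy_weights)       # only non-preferred codons
```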
Microproteins are a source of process-related impurities in antibody drug products
Next, we sought to determine if microproteins predicted to be encoded by novel sORFs increased the coverage of MS-based HCP detection. We performed HCP analysis in our laboratory utilising the SP3 sample preparation method followed by liquid chromatography-tandem mass spectrometry (LC-MS/MS)52,53 to analyse 5 commercial mAb drug products (adalimumab, denosumab, nivolumab, pertuzumab and vedolizumab) as well as an Fc-fusion protein drug product (etanercept). We also utilised a publicly available dataset from a previous HCP analysis of 4 mAb drug products (adalimumab, bevacizumab, nivolumab and trastuzumab) (Fig. 4a)54. In total, we analysed 10 separate LC-MS/MS datasets spanning 8 antibody drug products and 5 different sample preparation methods (Fig. 4b), acquired on two different Orbitrap MS instruments, both operated in data-dependent acquisition (DDA) mode (Fig. 4c).
We utilised (a) data from LC-MS/MS-based HCP analyses of 8 antibody-drug products generated in our laboratory as well as a previous study by Pythoud et al. spanning (b) 5 sample preparation methods and captured on (c) 2 types of Orbitrap MS instruments. The DDA MS data was first searched against canonical proteins using MetaMorpheus with spectral mass calibration enabled, resulting in the identification of (d) canonical proteins in each product tested. For microproteins, the PepQuery2 algorithm was used to (e) detect microprotein PSMs. The false positive rate is reduced by PepQuery2 by statistical evaluation against randomly shuffled sequences (i.e. PepQuery p-value) and an unrestricted modification search against known proteins. We identified (d) 40 microproteins with (f) 28 microproteins found in more than one drug product. Note: The number of microproteins detected for each sample preparation method employed in the Pythoud et al. study is shown in Supplementary Fig. 8. Source data are provided as a Source Data file. Panels a and c created with BioRender.com, released under a Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International license.
The first stage in our MS analysis was to search the data using MetaMorpheus55 with spectral mass calibration enabled to identify canonical HCPs present in each drug product. For each drug product, a protein sequence database was constructed comprising the Chinese hamster reference proteome (n = 23,887, downloaded from UniProt on 27/03/2024), sequences of sample contaminants from the common Repository of Adventitious Proteins (cRAP, thegpm.org/crap), the respective recombinant protein sequences, and any quantitation or retention time standards that were added to the sample. A canonical protein was considered confidently identified if ≥ 2 peptides were detected and the protein-level FDR was < 0.01. We found canonical HCPs in all antibody drug products tested in both studies, including several previously identified HCPs in adalimumab (e.g., cathepsin L1, S100a1156) and vedolizumab (e.g., clusterin57) (Fig. 4d, Supplementary Data 4).
To identify microproteins in the drug product HCP data, we used PepQuery258,59, a peptide-centric algorithm designed specifically for detecting novel proteins and utilised previously for microprotein validation60. A PepQuery2 index was constructed from the MetaMorpheus mass-calibrated LC-MS/MS data from all drug product HCP analyses, comprising 132 samples and > 5.2 million MS/MS spectra. The microproteins annotated by Ribo-seq (n = 3681) were digested with semi-tryptic specificity and searched against the PepQuery2 index to identify candidate peptide spectral matches (PSMs). False positive microprotein identifications were initially eliminated by comparing these PSMs against peptides from the reference proteome (i.e., annotated Chinese hamster proteins, all antibody drug products, mass spectrometry standards, and contaminant sequences). The known peptide set comprised tryptic peptides from all reference proteins. In addition, we utilised MetaMorpheus to perform a liberal search of the drug product data with semi-tryptic specificity. Semi-tryptic peptides from a protein identified in any drug product at an FDR < 10% (n = 2973) were also included in the known peptide set. PSMs initially associated with microproteins for which a known peptide had an equal or greater PepQuery score were excluded. The remaining candidate microprotein PSMs were compared to randomly shuffled microprotein peptide sequences to determine the FDR. In the final stage of PepQuery, an unrestricted modification search was performed, and if a candidate microprotein PSM was found to have a better match to a post-translationally or artefactually modified known peptide, it was eliminated58,59.
The identification of microproteins using mass spectrometry is challenging due to their size and typically lower abundance: compared to canonical proteins, they contain fewer cleavage sites amenable to digestion with trypsin, which reduces the number of peptides detectable by MS61. Similar to other studies26,62, we considered a single peptide sufficient for microprotein identification. Only those microprotein peptides designated as confident by PepQuery (i.e., a p-value < 0.05 if the peptide length ≤ 8 aa, or < 0.01 for peptides > 8 aa) (Fig. 4e) and found in at least 50% of the replicates of each drug product or sample preparation cohort were retained (Supplementary Data 5).
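The peptide retention criteria can be sketched as a two-step filter. This is an illustration of the thresholds stated above and not PepQuery2 code; the function names, the tuple layout of the input, and the toy peptides are hypothetical.

```python
def is_confident(peptide, p_value):
    """Length-dependent p-value threshold: < 0.05 for <= 8 aa, < 0.01 otherwise."""
    threshold = 0.05 if len(peptide) <= 8 else 0.01
    return p_value < threshold


def retained_peptides(hits, n_replicates, min_fraction=0.5):
    """Keep peptides confidently detected in at least min_fraction of replicates.

    hits: iterable of (peptide, replicate_id, p_value) for one drug product
    or sample preparation cohort.
    """
    replicates_seen = {}
    for peptide, replicate, p_value in hits:
        if is_confident(peptide, p_value):
            replicates_seen.setdefault(peptide, set()).add(replicate)
    return sorted(
        peptide
        for peptide, replicates in replicates_seen.items()
        if len(replicates) / n_replicates >= min_fraction
    )


toy_hits = [
    ("PEPTIDEK", 1, 0.03),
    ("PEPTIDEK", 2, 0.04),
    ("LONGPEPTIDER", 1, 0.03),
    ("LONGPEPTIDER", 2, 0.005),
    ("LONGPEPTIDER", 3, 0.02),
]
kept = retained_peptides(toy_hits, n_replicates=4)
```

In the toy data, the 8 aa peptide passes the more permissive threshold in two of four replicates and is kept, whereas the 12 aa peptide passes the stricter threshold in only one replicate and is dropped.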
A total of 40 microprotein HCPs were identified across the eight antibody products (Fig. 4d). Of the 5 sample preparation methods, native digestion protocols resulted in the lowest number of microprotein identifications (Supplementary Fig. 8). While we detected two or more peptides from 13 of these microproteins across the dataset, most microprotein identifications resulted from the detection of a single peptide. Etanercept had the largest number of individual microproteins identified (n = 11) in the data generated in our laboratory, while the bevacizumab and trastuzumab samples analysed by Pythoud et al. had the largest number of microproteins (n = 18). The lowest number of microproteins detected was in vedolizumab (n = 4). Twenty-eight microproteins were identified in more than one of the drug products examined (Fig. 4f). A single microprotein was found in the adalimumab analysed in both this study and the Pythoud et al. study, while two microproteins were found in the nivolumab data from both studies. A 16 aa microprotein from a CUG-initiated ouORF in a Znf883 transcript (XM_027419542.2) was found in 7 of 8 antibody drug products tested (Fig. 4f).
We utilised the data generated in our laboratory for six drug products to assess the quantities of canonical proteins and microproteins present in the drug products. The canonical and microprotein PSMs identified by MetaMorpheus and PepQuery were combined, and FlashLFQ was used to perform label-free quantitation. The Hi3 standard quantitation method53 was used to estimate the abundance of individual HCPs. The ten quantifiable canonical HCPs detected across the six drug products ranged in concentration from 1.52 ppm to 46.94 ppm (median = 14.28 ppm). The Hi3 method requires ≥ 3 identified peptides for confident quantification of a protein. A single microprotein met this 3-peptide requirement and was found to be present in etanercept at a concentration of 1.92 ppm. Given the challenges of microprotein identification, we also decided to estimate the concentration of the microproteins with fewer than three peptides identified (Supplementary Fig. 9). A microprotein with two peptides identified was found at 0.16 ppm. Quantified microproteins that were identified from a single peptide (n = 15) represent the lowest confidence estimates in terms of HCP abundance. The majority of these microproteins (n = 14) were found to be below the median concentration observed for canonical HCPs. In contrast, the abundance estimate for the remaining microprotein was ~800 ppm, exceeding the canonical HCP range.
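The principle behind Hi3-style quantitation can be sketched as follows, assuming the common formulation in which the mean intensity of the three most intense peptides of a protein is proportional to its molar amount and is scaled against a spiked standard of known quantity. The function names, parameters, and numbers are hypothetical and do not reproduce the exact workflow used in this study. HCP concentration is expressed as ng of HCP per mg of drug product (ppm).

```python
def hi3_mean(intensities):
    """Mean intensity of the (up to) three most intense peptides."""
    top = sorted(intensities, reverse=True)[:3]
    return sum(top) / len(top)


def hcp_ppm(hcp_intensities, std_intensities, std_fmol, hcp_mw_da, product_ug):
    """Estimate HCP concentration in ppm (ng HCP per mg of drug product)."""
    hcp_fmol = hi3_mean(hcp_intensities) / hi3_mean(std_intensities) * std_fmol
    hcp_ng = hcp_fmol * hcp_mw_da * 1e-6  # fmol x g/mol = 1e-6 ng
    product_mg = product_ug * 1e-3
    return hcp_ng / product_mg


# Toy example: 50 fmol of a 20 kDa HCP in 100 ug of product.
toy_ppm = hcp_ppm(
    hcp_intensities=[3e6, 2e6, 1e6, 5e5],
    std_intensities=[4e7, 4e7, 4e7],
    std_fmol=1000,
    hcp_mw_da=20000,
    product_ug=100,
)
```

When fewer than three peptides are available, the mean is taken over one or two intensities, which is why such estimates carry lower confidence.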
The translation efficiency of sORFs found in non-coding RNA genes is altered in response to a reduction of cell culture temperature
Ribo-seq can also be used to assess translation efficiency by normalizing RPF occupancy of each ORF to the corresponding RNA abundance. Comparing translation efficiency between conditions enables transcriptome-wide differences in translational regulation to be identified16. To understand the extent to which changes in the translatome are associated with the CHO cell response to sub-physiological temperature, we performed a count-based analysis of translation efficiency using the CHX-treated Ribo-seq data along with the parallel RNA-seq data captured for TS (n = 4) and NTS (n = 4) samples (Supplementary Fig. 10).
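As a simple illustration of the quantity being compared, translation efficiency for one ORF can be written as a ratio of normalised counts. The sketch below is not the deltaTE procedure itself, which fits a generalised linear model; the pseudocount and function names are assumptions made for the example.

```python
from math import log2


def translation_efficiency(rpf_norm, rna_norm, pseudocount=0.5):
    """Normalised RPF occupancy divided by normalised RNA abundance.

    The pseudocount (an assumption of this sketch) avoids division by zero.
    """
    return (rpf_norm + pseudocount) / (rna_norm + pseudocount)


def log2_delta_te(rpf_ts, rna_ts, rpf_nts, rna_nts, pseudocount=0.5):
    """log2 change in translation efficiency between TS and NTS samples."""
    te_ts = translation_efficiency(rpf_ts, rna_ts, pseudocount)
    te_nts = translation_efficiency(rpf_nts, rna_nts, pseudocount)
    return log2(te_ts) - log2(te_nts)
```

With a pseudocount of zero, an ORF whose RPF count doubles while its RNA count is unchanged has a log2 ΔTE of 1.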
The Plastid43 cs generate algorithm was used to construct a gene-level annotation by merging the positions of all exons found in all transcripts of a gene. Only those RPFs/reads mapping to the CDS regions common to all transcripts for a particular gene contributed to the overall count. The RPFs/reads from the first 15 and last five codons for ORFs ≥ 100 aa and the first and last codons for ORFs < 100 aa were excluded. This step was intended to reduce potential bias from the cycloheximide-associated accumulation of ribosomes at the beginning and end of the CDS and enrich for those RPFs most likely to be associated with elongation63. It is not possible to accurately distinguish the expression/occupancy of the uORF/ouORFs from the canonical ORFs with the gene-level CDS counting approach used here. We, therefore, focused only on the ORF cohort identified in genes previously classified as non-coding in the reference annotation. 495 of these ORFs were identified, 480 of which were predicted to encode a microprotein (Fig. 5a). The average length of potential microproteins was 30 aa (Fig. 5b), with the majority of parent transcripts harboring 1 or 2 sORFs. However, there were instances where as many as five sORFs were found in a single non-coding RNA transcript (Fig. 5c). To ensure compatibility with the Plastid read/RPF counting algorithm43, we utilised only the longest sORF for each transcript (n = 395).
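The codon-trimming rule can be expressed compactly. The sketch below uses the ≥ 100 aa cutoff given in the Methods and treats ORF length as a codon count; the function name and the 0-based, end-exclusive convention are choices made for this example, not part of Plastid.

```python
def countable_codon_range(orf_len_codons, long_cutoff=100,
                          trim_start_long=15, trim_end_long=5):
    """Return the (start, end) codon indices that contribute to counts.

    Indices are 0-based and end-exclusive. Long ORFs lose their first 15
    and last 5 codons; short ORFs lose only their first and last codon.
    Returns None when nothing remains after trimming.
    """
    if orf_len_codons >= long_cutoff:
        start = trim_start_long
        end = orf_len_codons - trim_end_long
    else:
        start = 1
        end = orf_len_codons - 1
    return (start, end) if end > start else None
```

For a 30-codon sORF this keeps codons 1 to 28 (28 codons), so very short ORFs retain most of their length, whereas a 120-codon ORF keeps codons 15 to 114.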
To characterise the impact of reducing cell culture temperature, we analysed translation efficiency for canonical ORFs and non-coding RNA sORFs. Of the 513 ORFs classified as New by ORF-RATER, the (a) majority (n = 480) were sORFs found in non-coding RNA genes. The average length of these sORFs was (b) 30 aa with as many as (c) 5 encoded by a single transcript. Only the longest ORF per transcript (n = 395) predicted to encode a microprotein was included in the analysis. The deltaTE method was used to identify changes in RNA abundance and RPF occupancy between the TS (biological replicates n = 4) and NTS samples (biological replicates n = 4) using the RNA-seq (RNA) and Ribo-seq (RPF) data. (d) 2837 genes were found to be forwarded (significant RNA and RPF difference, no translation efficiency (TE) difference) following DESeq2 analysis (two-sided Wald test, Benjamini-Hochberg (BH) adjusted p-value < 0.05). 279 genes were found to be regulated exclusively at the level of translation (significant difference in RPF and TE, no RNA difference). 392 genes were buffered (a significant RNA difference anticorrelated with a difference in TE). 199 genes were found to be intensified (RNA difference correlated with TE difference). Following the application of the fold change filter (with ≥ |1.5| fold change for RNA and RPF for forwarded genes and ≥ |1.5| TE for translation exclusive, buffered and intensified), the resulting 1220 genes were used to perform an overrepresentation analysis against GO biological processes. The proportion of translationally regulated genes contributing to the significant enrichment (BH adjusted p-value < 0.05) of the 56 biological processes was determined. The (e) 10 biological processes with the largest proportion of translationally regulated genes are shown. 15 sORFs were found (f) within the forwarded category, and 5 were (g) found to undergo changes via buffering, intensification, and regulation upon the reduction of cell culture temperature.
The (g) boxplot centre line shows the median TE (normalised RPF count divided by normalised RNA count), and the whiskers extend to 1.5× the interquartile range. Source data are provided as a Source Data file.
We retained only the CDS regions (n = 10,741) that had an average count of at least 10 in both the RNA-seq and Ribo-seq datasets, and used the deltaTE method64 to identify and classify ORFs with differences in transcription and/or translational efficiency (ΔTE) between the TS and NTS samples. The deltaTE method introduces an interaction term to the DESeq2 generalized linear model to assess differences between the biological conditions observed from RNA-seq and Ribo-seq data separately and, importantly, between the different assays to calculate the false discovery rate (FDR) for changes in the RNA abundance, RPF occupancy and the translation efficiency for each gene. We initially classified the outputs from DESeq2 using only statistically significant differences (adjusted p-value < 0.05) and not fold change, as outlined by the deltaTE developers (Fig. 5d). Of the 3707 genes classified by deltaTE, we found that 76.5% (n = 2837) were transcriptionally forwarded, where a significant increase or decrease in RPF occupancy agreed with the change in RNA abundance observed (i.e., ΔTE adjusted p-value ≥ 0.05). 7.5% (n = 279) of differences between the TS and NTS samples were found to be translation exclusive, where both the RPF occupancy and ΔTE were significantly altered, while the RNA was unchanged (adjusted p-value ≥ 0.05). The remainder of genes that were altered at both the transcriptional and translational level (i.e., RNA and ΔTE adjusted p-value < 0.05) were further classified by taking the direction of the change into account. 10.6% (n = 392) of genes were found to be buffered, where changes in transcription were tempered at the level of translation, e.g., an increase in RNA abundance was associated with a decreased ΔTE. The remaining 5.4% (n = 199) of genes were found to be intensified by translation regulation, e.g., an upregulation in transcription was accompanied by an increased ΔTE.
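The four categories described above reduce to a small decision rule over the adjusted p-values and the direction of change. The sketch below follows the wording of this section only; the function and argument names are hypothetical, and the published deltaTE workflow may handle edge cases (for example missing p-values) differently.

```python
def classify_gene(padj_rna, padj_rpf, padj_te, lfc_rna, lfc_te, alpha=0.05):
    """Assign a regulatory category from significance and direction.

    padj_* are BH-adjusted p-values for RNA, RPF and TE changes;
    lfc_rna and lfc_te are the corresponding log2 fold changes.
    """
    sig_rna = padj_rna < alpha
    sig_rpf = padj_rpf < alpha
    sig_te = padj_te < alpha

    if sig_rna and sig_rpf and not sig_te:
        return "forwarded"      # RPF change follows the RNA change
    if sig_rpf and sig_te and not sig_rna:
        return "exclusive"      # translation changes, RNA does not
    if sig_rna and sig_te:
        # Same direction amplifies the RNA change; opposite direction tempers it.
        return "intensified" if lfc_rna * lfc_te > 0 else "buffered"
    return "unclassified"


# Category sizes reported in this section.
counts = {"forwarded": 2837, "exclusive": 279, "buffered": 392, "intensified": 199}
total = sum(counts.values())
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}
```

The `shares` dictionary recomputes each category as a percentage of the 3707 classified genes.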
To determine the extent to which translation regulation plays a role in the CHO cell response to mildly hypothermic conditions, we further filtered the deltaTE output. For transcriptionally forwarded genes, we retained only those genes (n = 863) with ≥ |1.5| fold change between the TS and NTS samples for both the RNA and RPF data. For translationally regulated categories (translation exclusive, buffered, and intensified), only genes with a ≥ |1.5| fold change in ΔTE were retained (n = 357) (Supplementary Data 6). Both cohorts of genes were combined, and an overrepresentation analysis against the Gene Ontology (GO) was performed (Supplementary Data 7). We determined the proportion of translation-exclusive, buffered, and intensified genes contributing to the 56 enriched GO categories (FDR < 0.05). For 20 significantly enriched biological processes, > 25% of genes were found to be differentially translationally regulated (Fig. 5e). We identified 15 sORFs within the forwarded cohort where a ≥ |1.5| fold increase or decrease in RNA-seq and Ribo-seq was observed without a change in ΔTE (Fig. 5f, Supplementary Fig. 11). A further five sORFs had a ≥ |1.5| fold increase or decrease in ΔTE and were classified as buffered, intensified, or regulated exclusively through translation (Fig. 5g).
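The category-specific fold change filter can be sketched as below. It assumes fold changes are held on the log2 scale, so the 1.5-fold cutoff becomes log2(1.5); the category labels and function name are hypothetical.

```python
from math import log2

LOG2_CUTOFF = log2(1.5)  # 1.5-fold expressed on the log2 scale

TRANSLATIONAL = {"exclusive", "buffered", "intensified"}


def passes_fold_change_filter(category, lfc_rna, lfc_rpf, lfc_te,
                              cutoff=LOG2_CUTOFF):
    """Apply the fold change filter appropriate to each category.

    Forwarded genes must change by at least the cutoff in both RNA and RPF;
    translationally regulated genes must change by at least the cutoff in TE.
    """
    if category == "forwarded":
        return abs(lfc_rna) >= cutoff and abs(lfc_rpf) >= cutoff
    if category in TRANSLATIONAL:
        return abs(lfc_te) >= cutoff
    return False


# Cohort sizes reported in this section after filtering.
n_forwarded, n_translational = 863, 357
n_combined = n_forwarded + n_translational
```

The combined cohort of 1220 genes is the input to the overrepresentation analysis, of which the 357 translationally regulated genes make up roughly 29%.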
The abundance of CHO cell microproteins changes in response to mild hypothermia and between the exponential and stationary growth phases
We utilised proteomic mass spectrometry to determine if microproteins predicted to be encoded by sORFs were present in whole-cell lysates. These data allowed us to overcome the inherent limitations of the RNA-seq and Ribo-seq analyses and enabled the identification of microproteins from the uORF and ouORF classes and cases when a single non-coding RNA gene encodes multiple microproteins. We also sought to determine if microprotein abundance was altered upon reducing cell culture temperature (Fig. 6a). For this experiment, we again acquired cells from a non-temperature shifted control at 72 h post seeding (n = 3) and 24 h post-temperature shift (72 h post seeding) (n = 3) as well as a sample at 48 h post-temperature shift (96 h post seeding) (n = 3) (Supplementary Fig. 12a, Supplementary Data 1b). An additional proteomics experiment was performed for a second CHO cell line to assess if microproteins could be detected and if abundance was altered between the exponential and stationary phases of cell growth (Fig. 6b). Here, a non-mAb producing CHO-K1 GS cell line was cultured for 7 days; samples were acquired for proteomics at 96 h post-seeding when the cells were in exponential growth (n = 4) and at 168 h when the cells had entered stationary phase (n = 4) (Supplementary Fig. 12b, Supplementary Data 1c). Cell lysates from both proteomics experiments were subjected to an SP3 protein clean-up procedure and tryptic digestion before LC-MS/MS (Fig. 6c). The resulting MS data from each proteomics sample were searched in an identical manner to that of the drug product HCP data. Canonical proteins were identified using MetaMorpheus (protein-level FDR < 0.01, ≥ 2 peptides detected). For PepQuery, an index was constructed comprising > 2.9 million MS/MS spectra.
The complete set of tryptic peptides from the reference proteins, along with those semi-tryptic peptides identified from a liberal MetaMorpheus database search (FDR < 10%) of the data from both proteomics experiments (n = 16,949) was utilised for the PepQuery known peptide set. Only those microproteins designated as confident by PepQuery in at least 50% of the replicates of a sample cohort were retained.
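The replicate-level retention rule lends itself to a short filter. The sketch below assumes the PepQuery calls have already been reduced to sets of confident microprotein identifiers per replicate; the nested-dictionary layout and all identifiers are hypothetical.

```python
from collections import Counter


def retained_microproteins(confident_calls, min_fraction=0.5):
    """Keep microproteins confident in at least min_fraction of replicates
    of any one sample cohort.

    confident_calls maps cohort name -> {replicate name -> set of IDs}.
    """
    kept = set()
    for replicates in confident_calls.values():
        n_replicates = len(replicates)
        # Count in how many replicates of this cohort each ID was confident.
        seen = Counter(mp for ids in replicates.values() for mp in ids)
        kept.update(mp for mp, n in seen.items()
                    if n / n_replicates >= min_fraction)
    return kept
```

With three replicates per cohort a microprotein must be confident in at least two; with four replicates, two are sufficient.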
To determine if ORFs predicted from the Ribo-seq could be identified at the protein level, we conducted LC-MS/MS-based proteomics. To generate the samples for proteomics (a) the temperature shift model used for Ribo-seq and RNA-seq was repeated (biological replicates, n = 3) and (b) cells from a different non-mAb-producing CHO-K1 GS cell line were captured (biological replicates, n = 4) during exponential growth (Day 4) and in the stationary phase (Day 7). Proteins were extracted from cell lysates, and a (c) SP3-based protein cleanup method followed by tryptic digestion was used to prepare samples for MS analysis. The resulting data were searched using MetaMorpheus for canonical proteins and PepQuery2 for microproteins in the same fashion as the drug product data. This analysis identified (d) 4737 and 5024 canonical proteins for the temperature shift and growth rate experiments, respectively. For microproteins, (e) 110 and 53 microproteins were identified from the temperature shift and growth rate proteomics experiments, respectively. The microproteins identified for both experiments (f and g) originated from uORF, ouORF, and RNAs previously annotated as non-coding. (h) 28 microproteins were identified in both proteomics experiments, while 2 lysate-identified microproteins were also found in antibody drug products. We identified microproteins that were significantly differentially abundant (proDA two-sided Wald test, BH adjusted p-value < 0.05) upon a comparison of non-temperature shifted control to those samples acquired at (i) 24 h, (j) 48 h post-temperature shift, as well as between (k) the exponential and stationary phases of cell culture. Source data are provided as a Source Data file. a, b, and c created with BioRender.com, released under a Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International license.
For the temperature shift proteomics experiment, 4737 canonical proteins were identified across the nine samples analysed by mass spectrometry (Fig. 6d, Supplementary Data 8a). 110 microproteins were detected from the uORF (n = 45), ouORF (n = 47) and New (n = 18) classes (Fig. 6f, Supplementary Data 8b). For the growth phase experiment, 5024 canonical proteins were identified along with 53 microproteins (Fig. 6e, Supplementary Data 8c) originating from the uORF (n = 28), ouORF (n = 19), and New (n = 6) classes (Fig. 6g, Supplementary Data 8d). Twenty-eight microproteins were detected in both the temperature shift and growth phase experiments, and two microproteins were found both in the CHO lysate samples and in antibody drug products (Fig. 6h).
The PSMs for confidently identified canonical proteins and microproteins for each experiment were merged, and FlashLFQ with match-between-runs enabled was used to generate LFQ values for each experiment. Only those canonical proteins and microproteins quantified in at least 50% of samples in a replicate cohort were retained for further analysis. The proDA algorithm65 was used to identify proteins significantly altered (i.e., log2 fold change of ≥ |1.2| and BH adjusted p-value < 0.05) between conditions for the temperature shift and growth rate experiments. Upon comparison of the 24 h and 48 h post-temperature shift samples to the non-temperature shifted control, 454 and 1117 canonical proteins were differentially expressed, respectively (Supplementary Data 9a, 9c). Of the 49 microproteins reliably quantified in this experiment, the abundance of 4 microproteins at 24 h and 9 microproteins at 48 h post-temperature shift was found to be altered (Fig. 6i, j, Supplementary Data 9b, 9d). In the second proteomics experiment, 1636 canonical proteins (Supplementary Data 9e) and 3 of the 10 quantified microproteins were found to be differentially expressed upon a comparison of the exponential and stationary phases of cell growth (Fig. 6k and Supplementary Data 9f).
Discussion
Here, we present the findings of a ribosome footprint profiling experiment where translation initiation and elongation were captured at single nucleotide resolution in CHO cells. The utilisation of harringtonine to arrest translation resulted in an enrichment of RPFs at the TIS and enabled transcriptome-wide identification of 4491 novel ORFs. We found that the use of alternative initiation sites is widespread across the CHO cell transcriptome, with ~56% of all new ORFs identified beginning at non-AUG start codons. We also identified 526 N-terminal extensions of previously annotated protein-coding ORFs that begin at near-cognate start codons. Most novel annotations were sORFs predicted to encode microproteins located in the 5’ leader sequence of Chinese hamster mRNAs or in ncRNA transcripts. We have confirmed the existence of more than 170 putative microproteins from LC-MS/MS analyses of whole cell lysate and, importantly, in antibody drug products.
The identification of sORFs resulted in a more comprehensive proteomic database for mass spectrometry, enabling the enhanced assessment of HCP impurities in commercial antibody drug products. We identified 40 host-cell microproteins using LC-MS/MS across eight different drug products, 28 of which were found in more than one drug product. While microproteins were identified in each drug product analysed, we could only confidently quantitate a single microprotein, found in the etanercept drug product, which was well below the median quantity observed for canonical HCPs. Most estimated microprotein concentrations (i.e., derived from 1 or 2 peptide identifications) were found to lie within the ppm range observed for canonical HCPs. For a microprotein in etanercept, the abundance was calculated from a single peptide and found to be particularly high, but it is important to note that any microprotein HCP concentration from < 3 detected peptides reported in this study should be considered as an estimate.
We wish to emphasise that we make no claims regarding any risk to the patient or impact on the efficacy of the therapeutic antibody products arising from the host cell microprotein impurities observed in this study. In fact, the safety and effectiveness of the more than 100 mAbs approved to date66, the majority of which are manufactured in CHO cells, is compelling evidence that microproteins do not cause issues, if present, in approved drug products. Nevertheless, CHO cell microproteins are a new class of host cell impurity and future studies to evaluate if, in certain circumstances, these HCPs could elicit an immune response, affect mAb stability, or how they escape the purification process would be valuable for the industry. To facilitate these efforts, we have made the protein sequence database used for MS analysis freely available (see Data Availability section).
The improvement in HCP detection gained from this study is also an important tool for a range of process optimisation approaches focused on limiting these species in the final drug product67. For instance, the HCP content in the final drug product can be influenced, in part, by the upstream process68,69. Here, we have provided evidence that microprotein abundance can be altered by a change in the bioreactor environment, e.g., temperature. Analysis of microprotein expression during cell culture could help refine strategies that optimise the upstream process to reduce unwanted HCPs prior to harvest. Canonical HCPs are known to vary through different downstream unit operations, and examining the population at different stages facilitates the establishment of an effective purification process7. Our work enhances the ability to capture a more comprehensive picture of the HCP population at each unit operation, facilitating further optimisation. In recent years, several groups have also reported the use of cell line engineering to reduce or knock out problematic HCPs from CHO cell lines11,70,71. Through the annotation of thousands of sORFs in this study, we have considerably expanded the number of potential HCPs that are now amenable to genome editing approaches.
We believe our work also provides an essential foundation for future studies of non-canonical translation events and the control of protein synthesis in this important cell line. For instance, our work will be of utility to those researchers exploring approaches for the design of expression vectors for therapeutic antibody synthesis. The uORFs found in this study pave the way for the use of endogenous uORFs, in a similar fashion to previous reports on synthetic uORFs72 used to precisely control translation and/or the post-translational modifications of mAbs or indeed more complicated protein formats such as bispecific antibodies73.
Another important aspect of this study is demonstrating the utility of analysing translation efficiency using Ribo-seq to enhance our understanding of CHO cell biology. Here, we have shown that 30% of the genes whose expression was altered following a decrease in cell culture temperature were translationally regulated, and that 11% of genes were regulated exclusively at the level of translation. We have also observed differences in the RNA expression and translation efficiency of sORFs encoded in transcripts previously annotated as non-coding in response to sub-physiological cell culture temperature. While AUG-initiated translation is thought to result in the highest rate of protein synthesis74, it is also possible, as with other species75, that non-AUG-initiated proteoforms could play important roles in bioprocess phenotypes. Therefore, Ribo-seq has the potential to provide a more comprehensive understanding of CHO cell behaviour than is possible with RNA-seq alone and to enable the identification of new candidates for cell line engineering studies in the future.
While the work conducted in this study has allowed us to significantly expand the annotation, it is possible that there remain further undiscovered ORFs in the Chinese hamster genome. We recognise that our work is also potentially limited by the combined use of harringtonine and flash freezing, which likely led to residual elongating ribosomes and subsequent identification of potential false positive translation initiation sites. We eliminated those classes of ORFs liable to be affected (i.e., truncations) entirely from further analysis and conservatively filtered the remaining classes to limit false positive identifications (at the expense of potentially increasing the false negative rate). Future studies utilising Chinese hamster tissues and different CHO cell lines grown under various conditions producing a range of mAbs and other protein formats or focusing on translation events such as ribosomal frameshifting76 could facilitate the identification of additional novel ORFs. In addition, performing Ribo-seq experiments with different translation inhibitors such as lactimidomycin or puromycin in the future could not only enable new ORFs to be identified but also allow quantitative comparison of CHO cell translation initiation in different conditions47,77.
As others have noted, detecting microproteins with mass spectrometry remains a challenge. In this study, we confirmed the existence of < 5% of predicted microproteins (173/3681). Only a small number of microproteins were quantified in the drug products or found to be differentially expressed in whole-cell lysate. Microprotein abundance and stability are thought to limit their detection. While we did not study stability, we found that the codon adaptation index (CAI) for sORFs indicated that the expected abundance would generally be lower than for canonical ORFs. We observed a marginal correlation between RPF occupancy and microprotein detection (Supplementary Fig. 13), indicating that other factors are important for identification. We believe the most significant opportunities to increase detection rates remain sample preparation and mass spectrometry methods, as well as bioinformatics approaches to reduce false positives. Approaches such as data-independent acquisition (DIA)78 combined with spectral libraries generated via machine learning79 have shown considerable promise, and harnessing these advances will be important for future studies of CHO cell microproteins.
In conclusion, we have performed a series of Ribo-seq experiments with various translation inhibitors to examine translation elongation and, notably, translation initiation in CHO cells. This approach substantially enhanced the annotation of non-canonical ORFs in the Chinese hamster genome. We discovered novel translation events in previously annotated protein-coding genes and identified thousands of new short ORFs predicted to encode microproteins. Our findings allow improved MS-based HCP detection in antibody drug products. In addition, our discovery of novel ORFs presents new opportunities to understand CHO cell biology and provides a foundation for harnessing an enhanced understanding of protein synthesis in these cells to further improve manufacturing process efficiency and the quality of therapeutic proteins.
Methods
Cell culture
Cell lines
In this study, two Chinese hamster ovary (CHO) cell lines were used for Ribo-seq and proteomics. The first was a monoclonal antibody-producing CHO-K1 cell line (CHO-K1 mAb) and the second a non-producing CHO-K1 GS cell line, provided by Pfizer Inc and Horizon Discovery, respectively. Both CHO cell lines used for this study were clonally derived and originated from the cell bank of each company.
Generation of samples for Ribo-seq and RNA-seq
The CHO-K1 mAb cell line was seeded at a density of 2 × 10⁵ cells/ml in 50 ml CHO-S-SFM-II media (Gibco, 12052098) in 8 replicate shake flasks in a Kuhner orbital shaker at 170 rpm at 5% CO2. The cultures were grown at 37 °C for 48 h post-seeding, at which point the temperature of 4 of the shake flasks was reduced to 31 °C, while the remaining four shake flasks were maintained at 37 °C (Fig. 1a). Samples for library preparation were acquired 72 h post-seeding. The procedure was repeated in two separate experiments; the first was used to generate Ribo-seq and matched total RNA-seq libraries from cycloheximide-treated cells (8 samples), and the second to generate Ribo-seq libraries from harringtonine-treated (8 samples) and matched no drug-treated cells.
Generation of samples for proteomics
For proteomics analysis, two experiments were conducted. The first experiment utilized an identical cell culture model for the CHO-K1 mAb cell line. Here, three biological replicates (two technical replicates each) were acquired for the non-temperature shifted control, at 24 h post temperature shift, and at 48 h post temperature shift. The second proteomics experiment focussed on growth phases (stationary v exponential). A non-mAb producing CHO-K1 GS cell line was seeded at a density of 2 × 10⁵ cells/ml in 30 ml CD FortiCHO™ medium (Gibco, cat.no. A1148301) supplemented with 4 mM L-glutamine (L-Glutamine, cat.no. 25030024) in eight 250 ml Erlenmeyer shake flasks. The cultures were maintained at 37 °C, 170 rpm, 5% CO2, and 80% humidity in a shaking incubator (Kuhner) for 4 or 7 days. Samples at Day 4 (four biological replicates) and at Day 7 (four biological replicates) were acquired and stored at −80 °C until analysis.
Ribosome footprint profiling
Translation Initiation sample preparation
At 72 hours post-seeding, cells were treated with harringtonine (2 µg/ml) (or DMSO) for 2 minutes at 31 °C or 37 °C. The cells were washed with ice-cold PBS supplemented with harringtonine or DMSO respectively and flash-frozen in liquid nitrogen. Frozen pellets were resuspended in 400 µl 1X Mammalian Polysome buffer (Illumina TruSeq Ribo Profile (mammalian) kit) prepared according to the manufacturer’s guidelines. Cell lysates were incubated on ice for 10 minutes, and then centrifuged at 18,000 × g for 10 minutes at 4°C to pellet cell debris. The supernatant was used for ribosome-protected fragment (RPF) isolation and library preparation.
Translation elongation sample preparation
Seventy-two hours post seeding, a total of 25 × 10⁶ cells (per replicate) were pelleted and resuspended in 20 ml of fresh CHO-S-SFM-II media supplemented with cycloheximide at a final concentration of 0.1 mg/mL and incubated at 37 °C or 31 °C for 10 min. Cells were subsequently pelleted, washed in 1 mL of ice-cold PBS containing 0.1 mg/mL of CHX, clarified, and lysed. Before the generation of ribosomal footprints, part of the lysate was used for total RNA extraction and RNA-seq library preparation with the TruSeq Ribo Profile (mammalian) kit (cat. no. RPHMR12126). RPFs were size selected on a gel, purified, and used for ribosome profiling library preparation with the Illumina TruSeq Ribo Profile (mammalian) kit.
Library preparation
To prepare RNA-seq and Ribo-seq libraries for sequencing, the TruSeq Ribo Profile (Mammalian) Kit (Illumina) (cat. no. RPHMR12126) was used in accordance with the manufacturer’s specifications. For Ribo-seq samples, RNase treatment was performed with 10 µl of TruSeq Ribo Profile Nuclease per 200 µl lysate at room temperature for 45 minutes with gentle shaking. Digestion was stopped with 15 µl SUPERaseIn (20 U/µl) (Ambion, cat. no. AM2696). Monosomes were isolated with size exclusion chromatography using the Illustra MicroSpin S-400 HR Columns (GE Life Sciences, cat. no. 27514001) according to the manufacturer’s instructions. Ribosomal RNA was removed with the RiboZero-Gold rRNA removal Kit (Illumina, cat. no. MRZG12324). Ribosome-protected fragments were size selected from a 15% denaturing urea polyacrylamide gel following electrophoresis (7 M urea, 19:1 acrylamide:bis-acrylamide). For the harringtonine and no-drug treated samples, a gel extraction step (from 15% denaturing gels) for the isolation of linker-ligated ribosome-protected fragments was added to the protocol after the linker ligation reaction, as described in Ingolia’s protocol80, to avoid a high concentration of linker dimers contaminating the final library. Following reverse transcription, cDNA was extracted from 7.5% denaturing urea gels. PCR-amplified libraries were purified from 8% polyacrylamide gels and subsequently analyzed with the Agilent High Sensitivity DNA assay (Agilent, Bioanalyzer).
Sequencing
The libraries for translation initiation and elongation analyses were sequenced on an Illumina NextSeq configured to yield 75 bp and 50 bp single-end reads, respectively.
Ribo-seq and RNA-seq data analysis
Pre-processing
Adapter sequences were trimmed from the Ribo-seq and RNA-seq datasets using Cutadapt v1.1881, and Trimmomatic v0.3682 was used to remove low-quality bases. To remove reads from contaminating RNAs from the Ribo-seq data, Chinese hamster rRNA, tRNA, and snoRNA sequences were downloaded from the RNAcentral v22 database41, and an individual bowtie v1.3.140 index was built for each type of RNA. The Ribo-seq reads were aligned against each index using the parameters: -v 2 -l 20 --norc. Reads that mapped to rRNA, tRNA, or snoRNA were discarded.
Read alignment
The pre-processed Ribo-seq and RNA-seq data were aligned to the NCBI CriGri-PICRH 1.0 genome and transcriptome (GCA_003668045.2)14 with STAR v2.7.8a42 using the following parameters: --outFilterMismatchNmax 2 --outFilterMultimapNmax 1 --outFilterMatchNmin 16 --alignEndsType EndToEnd.
Ribo-seq P-site offset identification and selection of RPFs
The P-site offset (the number of nucleotides between the 5’ end of a Ribo-seq read and the P-site of the ribosome footprint that was captured) was determined using Plastid v0.5.143 by first defining the genomic region around annotated Chinese hamster CDS using the Plastid metagene generate algorithm with default settings. The Plastid P-site tool was then used to assess the P-site for different read lengths around the expected mammalian RPF size (25-34nt) for CDSs with at least ten mapped reads at the start region. Following the determination of P-site offsets for each read length for the CHX, HARR, and ND Ribo-seq data, only those read lengths where ≥ 60% of the reads were found to have the expected triplet periodicity with a P-site offset of 12 nt were retained for further analysis.
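The read-length selection step can be illustrated with a small filter. This sketch assumes that, for each read length, P-site positions have already been assigned with the 12 nt offset and tallied by reading frame, with frame 0 taken as the expected frame; the input layout and function name are hypothetical and not part of Plastid.

```python
def periodic_read_lengths(frame_counts, min_fraction=0.60):
    """Return read lengths whose in-frame fraction meets the threshold.

    frame_counts maps read length -> (frame0, frame1, frame2) counts of
    P-sites falling in each reading frame of annotated CDSs.
    """
    kept = []
    for length, counts in sorted(frame_counts.items()):
        total = sum(counts)
        # Skip lengths with no reads to avoid dividing by zero.
        if total and counts[0] / total >= min_fraction:
            kept.append(length)
    return kept
```

Read lengths that fail the threshold are discarded entirely, which trades some depth for cleaner triplet periodicity in the downstream ORF calls.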
ORF identification
The eight replicates from each Ribo-seq type were merged to increase sensitivity before the ORF-RATER pipeline44 was used to identify ORFs in the Chinese hamster genome. Annotated pseudogenes were removed from the reference, and only those transcripts with a minimum of 64 mapped RPFs from the CHX and ND Ribo-seq data were considered for ORF identification. The ORF search was limited to NUG codons, with only the HARR Ribo-seq data used to identify the translation initiation sites. The CHX and ND RPFs were used to assess translation elongation for putative ORFs. Identified ORFs with an ORF-RATER score ≥ 0.545,46 and length ≥ 5 aa were retained. The ORF-RATER output was filtered further to remove low-confidence ORFs, a step necessitated by the nature of our Ribo-seq data. For the identified ORFs, we utilised the method of Lee et al.47 to confirm the enrichment of the data generated via a translational initiation inhibitor (i.e., harringtonine) versus a matched elongation inhibitor control (i.e., no-drug) as follows:
Where \(X\) is the number of RPFs occupying a 7 nt window around the ORF-RATER identified TIS, and \(N\) is the total number of RPFs mapped to the transcript. These counts were performed using the GenomicRanges v1.54.1 Bioconductor R package83. Where more than one novel ORF overlapped on the same transcript, we removed any near-cognate initiating ORFs when their TIS was located inside the 7 nt window of the AUG TIS. Where more than one AUG or non-AUG ORF initiated within the window, only the ORF with the maximum \(R_{\mathrm{harr}} - R_{\mathrm{nd}}\) value was retained. When ORFs overlapped outside this window, near-cognate ORFs were removed if their \(R_{\mathrm{harr}} - R_{\mathrm{nd}}\) value was less than five times that of their AUG-initiated ORF counterpart. For the final stage of filtering, the FLOSS score48 was calculated with the ORFik v1.22.0 Bioconductor R package84, and ORFs classified as extreme when compared to the FLOSS scores of annotated ORFs were removed. Visualisation of ORF coverage was accomplished using the Plastid make_wiggle algorithm for P-site offset coverage and deeptools v3.5.185 for full coverage. Where more than one coverage type was displayed in the same figure, the bins per million (BPM) value was scaled between 0 and 1.
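A minimal sketch of the enrichment comparison follows. It assumes that each R value is the fraction X/N for the relevant library; any constant scaling used in the original method is represented here by a hypothetical `scale` parameter, and all function names are illustrative.

```python
def tis_ratio(x_window, n_transcript, scale=1.0):
    """Fraction of a transcript's RPFs that fall in the 7 nt TIS window.

    `scale` is a hypothetical constant multiplier (assumption of this sketch).
    """
    if n_transcript == 0:
        return 0.0
    return scale * x_window / n_transcript


def tis_enrichment(x_harr, n_harr, x_nd, n_nd, scale=1.0):
    """R_harr - R_nd: harringtonine enrichment over the no-drug control."""
    return tis_ratio(x_harr, n_harr, scale) - tis_ratio(x_nd, n_nd, scale)


def keep_near_cognate(enrich_near_cognate, enrich_aug, fold=5.0):
    """Outside the TIS window, keep a near-cognate ORF only if its enrichment
    is at least `fold` times that of the overlapping AUG-initiated ORF."""
    return enrich_near_cognate >= fold * enrich_aug


def best_in_window(enrichment_by_orf):
    """Among ORFs initiating within one window, keep the most enriched."""
    return max(enrichment_by_orf, key=enrichment_by_orf.get)
```

Because the difference is taken between libraries, a site that is equally occupied with and without harringtonine scores zero regardless of how heavily the transcript is translated.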
Codon adaptation index
The 500 ORFs with the highest RPF occupancy were determined from the fragments per kilobase of transcript per million mapped reads (FPKM) value for the cycloheximide Ribo-seq data. These reference ORFs were used to determine the relative synonymous codon usage (RSCU). The RSCU was used to calculate the CAI with the cubar R package86 for canonical ORFs and sORFs from the uORF, ouORF, and New ORF classes.
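The relationship between the reference codon usage and the CAI can be sketched in Python following the general definition of Sharp and Li51: each codon's relative adaptiveness is its usage divided by that of the most used synonymous codon in the reference set, and the CAI of an ORF is the geometric mean of these weights. The sketch below is not the cubar implementation; the pseudocount of 0.5 and the function names are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from math import exp, log

# Standard genetic code, bases ordered T, C, A, G at each codon position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def codons(seq):
    """Split a coding sequence into complete codons (RNA is converted to DNA)."""
    seq = seq.upper().replace("U", "T")
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def relative_adaptiveness(reference_seqs, pseudocount=0.5):
    """Weight of each codon relative to the most used synonymous codon."""
    counts = Counter(c for s in reference_seqs for c in codons(s)
                     if c in CODON_TABLE)
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        families[aa].append(codon)
    weights = {}
    for aa, family in families.items():
        if aa == "*" or len(family) == 1:  # skip stop codons, Met and Trp
            continue
        top = max(counts[c] + pseudocount for c in family)
        for c in family:
            weights[c] = (counts[c] + pseudocount) / top
    return weights


def cai(seq, weights):
    """Geometric mean of the weights of the informative codons in seq."""
    logs = [log(weights[c]) for c in codons(seq) if c in weights]
    return exp(sum(logs) / len(logs)) if logs else float("nan")
```

In this sketch the reference sequences would be the 500 highly occupied ORFs, and `cai` would then be applied to each canonical ORF or sORF.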
Differential RNA expression, RPF count and differential translation analysis
To conduct gene-level differential translation analysis, the reference protein coding annotation was merged with selected ORFs found on non-coding RNAs. Before counting, Plastid cs generate was used to collapse transcripts that shared exons, remove regions comprised of more than one same-strand gene, and create position groups corresponding to exons, CDS, 5’ leader, and 3’UTR. Reads and RPFs aligning to the first 15 and last five codons of each CDS were discarded for ORFs ≥ 100 aa, while for those ORFs < 100 aa, the first and last codon counts were excluded26. For differential translation analysis, we utilised the deltaTE method64 using the RNA-seq counts as an interaction term within the DESeq2 (ref. 87) model to enable the identification of changes in RPF density independently of RNA abundance. An absolute fold change ≥ 1.5 and a BH-adjusted p-value < 0.05 were considered significant for all analyses.
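The codon-trimming rule applied before counting can be written as a small Python helper. This is an illustrative sketch only: it assumes the ORF length is given in codons excluding the stop codon and uses 0-based codon indices, and the function name is hypothetical.

```python
def counting_window(orf_length_aa, long_cutoff=100, trim_start=15, trim_end=5):
    """Return the (start, end) codon indices kept for counting, end exclusive.

    ORFs of at least `long_cutoff` codons lose their first 15 and last 5
    codons; shorter ORFs lose only the first and last codon. None is
    returned when no codon would remain.
    """
    if orf_length_aa >= long_cutoff:
        start, end = trim_start, orf_length_aa - trim_end
    else:
        start, end = 1, orf_length_aa - 1
    return (start, end) if end > start else None
```

Under these assumptions, a 100-codon CDS is counted over codons 15 to 94, whereas a 50-codon sORF is counted over codons 1 to 48.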
Enrichment analysis
The overrepresentation of GO biological processes in differentially expressed and/or differentially translated gene lists was assessed with the R WebGestaltR package88. Where no gene symbol was available, the Chinese hamster gene name was mapped to the NCBI Mus musculus GRCm39 annotation, and the corresponding mouse gene symbol was used. GO biological processes with a BH-adjusted p-value of <0.05 were considered significant.
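Overrepresentation analysis of this kind is commonly based on a hypergeometric test. The Python sketch below shows that calculation for a single GO term; it is a generic illustration and is not taken from WebGestaltR, and whether it matches the settings used here is an assumption. The argument names are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(k, n, K, N):
    """Upper-tail hypergeometric p-value P(X >= k).

    N: genes in the background set
    K: background genes annotated to the GO term
    n: genes in the query list (e.g., differentially translated genes)
    k: query genes annotated to the GO term
    """
    upper = min(n, K)
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total
```

The resulting p-values across all tested GO terms would then be adjusted with the Benjamini-Hochberg procedure before applying the 0.05 cut-off.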
CHO cell lysate proteomics
Sample preparation for reversed phase liquid chromatography-tandem mass spectrometry (RPLC-MS/MS)
Samples obtained from the temperature shift and growth phase experiments described above were prepared for proteomics using a semi-automated version of the SP3 protocol52. Briefly, CHO cells were pelleted via centrifugation at 300 × g for 5 min. Following a wash step with 1 × PBS, cells were lysed using 1 × RIPA buffer (Cell Signaling Technology, Dublin, Ireland) containing 1 × protease inhibitor (cOmplete™, Mini, EDTA-free Protease Inhibitor Cocktail, Sigma, Wicklow, Ireland) followed by sonication. After removal of cell debris via centrifugation at 14,000 × g for 10 min and resuspension in PBS containing 1 × protease inhibitor, the protein concentration was determined using the Pierce™ 660 nm Protein Assay Kit (Thermo Fisher Scientific, Dublin, Ireland). Protein reduction was accomplished via the addition of 5 mM 1,4-dithiothreitol to a sample aliquot containing 50 µg of protein followed by incubation at 56 °C for 30 min. Alkylation of proteins was achieved by incubating the sample for 30 min with 10 mM iodoacetamide at room temperature. Upon completion of this step, 50% (v/v) ethanol was added to the samples prior to purification with Sera-Mag Carboxylate SpeedBeads (Cytiva, Buckinghamshire, UK) according to the SP3 protocol52. Following recovery of the beads, proteins were digested in 50 mM ammonium bicarbonate (Sigma) containing 1 µg trypsin (Promega, Madison, WI, USA) for 4 h at 37 °C (ref. 89). The digested samples were acidified by adding 0.1% (v/v) formic acid before LC-MS/MS analysis. Note: An identical sample preparation procedure was carried out for HCP analysis of mAb drug products.
Temperature shift experiment RPLC-MS/MS analysis
CHO-K1 mAb cell samples from the temperature shift experiment were analysed using an Orbitrap Exploris™ 480 mass spectrometer (Thermo Fisher Scientific, Bremen, Germany) online hyphenated to an UltiMate™ 3000 RSLCnano system using an EASY-Spray™ source. 1 µg per sample was loaded onto a C18 Nano-Trap column followed by separation using an EASY-Spray Acclaim PepMap 100, 75 µm × 50 cm column maintained at 45.0 °C at a flow rate of 250.0 nL/min. Separation was achieved using a gradient of (A) 0.10% (v/v) formic acid in water and (B) 0.10% (v/v) formic acid in acetonitrile. Gradient conditions were as follows: 2–25% B in 120 min, followed by another increase to 45% B in 30 min. The separation was followed by two wash steps at 80% B for 5 min, and the column was re-equilibrated at 5% B for 15 min.
MS detection was carried out in centroid positive ion mode. First, full scans were acquired at a resolution setting of 60,000 (at m/z 200) with a scan range of m/z 200–2000. The normalised automatic gain control (AGC) target was set to 100% with a maximum IT of 50 ms. The 20 most abundant precursor ions were selected for HCD fragmentation using a normalised collision energy of 28%. For the isolation of precursor ions, an isolation window of 1.2 m/z and an intensity threshold of 5.0e3 was used. Fragment scans were acquired using a resolution setting of 15,000 (at m/z 200) with an AGC target of 50% and a maximum IT of 70 ms. Unassigned charge states and charge states > 7 were excluded from fragmentation. A dynamic exclusion was used for 45 seconds with a tolerance of ± 5 ppm.
Growth phase experiment RPLC-MS/MS analysis
For the growth phase experiment, mass spectrometric analysis was performed using an Orbitrap Eclipse™ Tribrid™ mass spectrometer (Thermo Fisher Scientific, Bremen, Germany) coupled to an UltiMate™ 3000 RSLCnano system using an EASY-Spray™ source (Thermo Fisher Scientific, Germering, Germany). 2 µg per sample was loaded onto a C18 Nano-Trap column followed by separation using an EASY-Spray Acclaim PepMap 100, 75 µm × 50 cm column (Thermo Fisher Scientific, Sunnyvale, CA, United States) maintained at 45.0 °C at a flow rate of 250.0 nL/min. Separation was achieved using a gradient of (A) 0.10% (v/v) formic acid in water and (B) 0.10% (v/v) formic acid in acetonitrile (LC-MS optima, Fisher Scientific). Gradient conditions were as follows: 5% B for 5 min, followed by a linear gradient of 5–25% in 95 min, followed by another increase to 35% B in 20 min. The separation was followed by two wash steps at 90% B for 5 min, and the column was re-equilibrated at 5% B for 15 min.
MS analysis was performed in positive ion mode. Full scans were acquired in the Orbitrap at a resolution setting of 120,000 (at m/z 200) with a scan range of m/z 200–2000 using a normalised AGC target of 100% and an automatic maximum injection time in centroid mode. Using a 3 s cycle time, ions were selected for HCD fragmentation using a collision energy setting of 28%. Fragment scans were acquired in the Orbitrap using a resolution setting of 30,000 (at m/z 200). The AGC target was set to 200%. For isolation of precursor ions, an isolation window of 1.2 m/z was used. An intensity threshold of 5.0e4 was applied while unassigned charge states, as well as charges > 6, were excluded. The dynamic exclusion time was set to 60 s with ± 5 ppm tolerance.
Host cell protein analysis
RPLC-MS/MS analysis of drug products
Following tryptic digestion, as described above, HCP analysis was conducted for adalimumab, denosumab, etanercept, nivolumab, pertuzumab and vedolizumab commercial drug products acquired from Evidentic (Berlin, Germany). Six technical replicates were acquired for each drug product (two separate aliquots of material were taken from a vial, and HCP analysis was performed in triplicate for each aliquot). Mass spectrometry was carried out using an Orbitrap Exploris™ 480 mass spectrometer coupled to a Vanquish™ Neo UHPLC system (Thermo Fisher Scientific, Germering, Germany) using an EASY-Spray™ source. LC-MS/MS analysis was performed in two steps: First, 100 ng/sample was analysed to generate an exclusion list containing drug product derived peptides. Second, replicate 3 µg samples were analysed using the corresponding exclusion lists to facilitate HCP detection. For quantitation, Hi3 E. coli (Waters, Milford, MA, USA) standard was added to reach a final concentration of 50 fmol/µg of protein injected.
Using the pressure-driven injection mode, either 100 ng or 3 µg of sample was loaded onto a C18 Nano-Trap column followed by separation using a 50 cm × 75 µm EASY-Spray™ PepMap™ Neo UHPLC column (Thermo Fisher Scientific, Sunnyvale, CA, USA). For separation, a linear gradient of 2–25% B (0.1% formic acid (v/v) in acetonitrile) over 60 min followed by another increase to 45% B in 30 min was used. Separation was followed by two wash steps at 80% B before column re-equilibration using 3 column volumes. The flow rate was 250 nL/min, and the column temperature was maintained at 45.0 °C.
Data-dependent MS/MS analysis was performed in positive ion mode. First, full scans were acquired covering a scan range of m/z 200–2000 using a resolution setting of 60,000 (at m/z 200). The AGC target was set to 100% with a maximum IT of 50 ms. The 20 most abundant ions were selected for fragmentation using a HCD collision energy of 28%. An isolation window of 1.2 m/z was used. The AGC target for MS2 scans acquired using a resolution setting of 15,000 (at m/z 200) was set to 50% with a maximum IT of 70 ms. Only charge states of +2 to +7 were used for fragmentation. An intensity threshold of 5.0e3 was applied. Dynamic exclusion was used for 45 s with a ± 5 ppm tolerance. For targeted mass exclusion of drug product derived peptides, a retention time window of 5 min and a tolerance of 10 ppm were allowed.
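The matching of a precursor against the exclusion list can be illustrated with a short Python sketch. Exclusion is applied by the instrument control software during acquisition, so the code below only illustrates the arithmetic of a ppm tolerance combined with a retention time window. It assumes that the 5 min window is centred on the listed retention time (i.e., ± 2.5 min); the function names and the list layout are hypothetical.

```python
def ppm_error(observed_mz, reference_mz):
    """Signed mass error of an observed m/z relative to a reference, in ppm."""
    return (observed_mz - reference_mz) / reference_mz * 1e6


def is_excluded(mz, rt_min, exclusion_list, tol_ppm=10.0, rt_window_min=5.0):
    """True if a precursor matches any (m/z, retention time) exclusion entry.

    exclusion_list: iterable of (reference_mz, reference_rt_min) tuples.
    Assumption: the retention time window is centred on the listed time.
    """
    half_window = rt_window_min / 2.0
    return any(abs(ppm_error(mz, ref_mz)) <= tol_ppm
               and abs(rt_min - ref_rt) <= half_window
               for ref_mz, ref_rt in exclusion_list)
```

At m/z 500, a 10 ppm tolerance corresponds to ± 0.005 m/z, so a precursor at m/z 500.004 eluting 1 min after a listed drug product peptide would be skipped under these assumptions.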
Mass spectrometry data analysis
Canonical protein identification
MetaMorpheus55 v1.0.5 was used to conduct a database search of the data from each drug product and whole cell lysate experiment. The following search settings were used: protease = trypsin; maximum missed cleavages = 2; minimum peptide length = 7; maximum peptide length = 45; initiator methionine behaviour = Variable; fixed modifications = Carbamidomethyl on C (except for those Pythoud et al. samples where native digestion preparation methods were used54); variable modifications = Oxidation on M, protein N-term acetylation; max mods per peptide = 2; precursor mass tolerance = ±10 ppm; product mass tolerance = ±0.05 Da. For each drug product search, we used a database comprising the Chinese hamster reference proteome (n = 23,887, downloaded from UniProt on 27/03/2024), contaminant sequences, the respective recombinant antibody protein sequences, and any quantitation or indexed retention time standards added to a given sample. Only the UniProt reference proteome and contaminant sequences were used for the whole-cell lysate samples. Those canonical proteins with ≥ 2 peptides detected and a protein-level FDR < 0.01 were considered confidently identified.
Microprotein identification
Microproteins found using Ribo-seq were identified from mass spectrometry data using PepQuery v2.0.2. Two distinct searches were performed, one for all drug product HCP data and the other for whole cell lysate proteomic data. The mass-calibrated mzML files generated by the MetaMorpheus canonical protein searches were converted to MGF files using ProteoWizard90 and PepQuery indexes constructed from these data.
For the PepQuery search, the chainsaw algorithm from ProteoWizard90 was used to perform a semi-tryptic in-silico digestion of the putative microproteins, allowing 2 missed cleavages, a minimum peptide length of 7 aa and a maximum peptide length of 45 aa. The known proteome comprised UniProt Chinese hamster proteins, contaminants, antibody sequences, and quantitation standards. First, a fully tryptic in-silico digestion of the known proteome was performed with chainsaw using the same missed cleavage and peptide length settings. In addition, we performed a MetaMorpheus search of the lysate and drug product data with semi-tryptic specificity; the other MetaMorpheus parameters were identical to those of the canonical protein search described above. Semi-tryptic peptides identified by MetaMorpheus with a peptide Q value < 10% for drug product and lysate data were combined with the fully tryptic peptides in the PepQuery known protein set.
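The difference between the fully tryptic and semi-tryptic digestions can be illustrated in Python. The sketch below is not chainsaw: it assumes a simplified trypsin rule (cleavage after every K or R, with no proline exception), and the function names are hypothetical.

```python
def tryptic_peptides(protein, missed_cleavages=2, min_len=7, max_len=45):
    """Fully tryptic peptides with up to `missed_cleavages` internal sites."""
    # Cut after each K or R; the final residue never adds an extra cut point.
    cuts = ([0]
            + [i + 1 for i, aa in enumerate(protein[:-1]) if aa in "KR"]
            + [len(protein)])
    peptides = set()
    for i in range(len(cuts) - 1):
        last = min(i + 2 + missed_cleavages, len(cuts))
        for j in range(i + 1, last):
            pep = protein[cuts[i]:cuts[j]]
            if min_len <= len(pep) <= max_len:
                peptides.add(pep)
    return peptides


def semi_tryptic_peptides(protein, missed_cleavages=2, min_len=7, max_len=45):
    """Peptides with at least one tryptic terminus (the other may be ragged)."""
    # Start from all fully tryptic peptides regardless of length, then take
    # every prefix (tryptic N-terminus) and suffix (tryptic C-terminus).
    full = tryptic_peptides(protein, missed_cleavages, 1, max(len(protein), 1))
    semi = set()
    for pep in full:
        for k in range(1, len(pep) + 1):
            semi.add(pep[:k])
            semi.add(pep[-k:])
    return {p for p in semi if min_len <= len(p) <= max_len}
```

Because every prefix and suffix is enumerated, the semi-tryptic set is much larger than the fully tryptic set, which is relevant for short microproteins that yield few fully tryptic peptides within the 7–45 aa range.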
Microprotein peptides were searched against the PepQuery indices with the following parameters: precursor mass tolerance = ±10 ppm; product mass tolerance = ±0.05 Da; fixed modifications = Carbamidomethyl on C (except for Pythoud et al. samples where native digestion was used54); variable modifications = Oxidation on M. Because PepQuery does not permit protein N-term acetylation as a variable modification, we carried out separate searches of the drug product and lysate data with the N-terminal peptide of each microprotein, using the same parameters as detailed above except with peptide N-term acetylation enabled. Peptides from microproteins were considered identified if PepQuery confident = Yes (i.e., a p-value < 0.05 if the peptide length ≤ 8 aa, or < 0.01 for peptides > 8 aa) and the peptide was detected in at least 50% of replicates for the respective sample cohort.
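The identification criteria can be summarised in a few lines of Python. The sketch encodes only the length-dependent p-value rule and the replicate requirement quoted above; it does not reproduce any other check that PepQuery performs when assigning its confident flag, and the function names and input layout are hypothetical.

```python
def is_confident(peptide, p_value):
    """Length-dependent p-value cut-off: 0.05 up to 8 aa, 0.01 above 8 aa."""
    threshold = 0.05 if len(peptide) <= 8 else 0.01
    return p_value < threshold


def peptide_identified(peptide, p_values_by_replicate, min_fraction=0.5):
    """True if the peptide is confident in at least half of the replicates.

    p_values_by_replicate: one entry per replicate in the sample cohort,
    with None where no spectrum matched the peptide.
    """
    if not p_values_by_replicate:
        return False
    hits = sum(1 for p in p_values_by_replicate
               if p is not None and is_confident(peptide, p))
    return hits / len(p_values_by_replicate) >= min_fraction
```

With six technical replicates per drug product, a peptide would therefore need a confident match in at least three replicates.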
Label free quantitation
FlashLFQ91 was used to perform label-free quantitation for the drug product and whole cell lysate data. Mass tolerance was set to ±10 ppm, and Match between runs (MBR) was enabled with the maximum retention time window specified as 0.7 minutes92. A protein was retained for differential abundance analysis if at least 50% of samples within one replicate cohort had non-zero LFQ values.
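The retention rule for differential abundance analysis can be expressed as a simple Python check. This is an illustrative sketch with a hypothetical function name and input layout (a mapping from cohort name to the LFQ values of one protein).

```python
def retain_protein(lfq_by_cohort, min_fraction=0.5):
    """Keep a protein if any cohort has enough non-zero LFQ values.

    lfq_by_cohort: dict mapping a replicate cohort (e.g., a growth phase)
    to the list of LFQ intensities of one protein across its samples.
    """
    for values in lfq_by_cohort.values():
        if values and sum(v > 0 for v in values) / len(values) >= min_fraction:
            return True
    return False
```

A protein quantified in two of four samples from one condition is retained even if it is absent from every sample of the other condition, which preserves proteins that appear or disappear between conditions.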
Identification of differentially expressed proteins
LFQ data were log2 transformed and normalised on total peptide amount, and the proDA algorithm65 was then utilised to fit a probabilistic dropout model to these data prior to differential expression analysis. Proteins with an absolute fold change ≥ 1.2 and a BH-adjusted p-value < 0.05 were considered differentially expressed.
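The final significance filter can be sketched in Python as a Benjamini-Hochberg (BH) adjustment followed by a fold-change threshold. The sketch is generic and is not the proDA implementation; it assumes that the fold-change cut-off is applied symmetrically on the log2 scale (i.e., |log2 fold change| ≥ log2(1.2)), and the function names are hypothetical.

```python
from math import log2


def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def is_differential(log2_fold_change, padj, min_fold=1.2, alpha=0.05):
    """Apply the fold-change and adjusted p-value thresholds together."""
    return abs(log2_fold_change) >= log2(min_fold) and padj < alpha
```

The same two functions would cover the RNA-seq and Ribo-seq analyses by setting `min_fold=1.5`.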
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
The Ribo-seq and RNA-seq data from the harringtonine, cycloheximide and no-drug treated cells have been deposited in the Sequence Read Archive (SRA) with accession code PRJNA778050. The mass spectrometry data have been deposited to the ProteomeXchange Consortium via the PRIDE partner repository with the following dataset identifiers: PXD053529 (drug product), PXD053492 (temperature shift proteomics), and PXD053511 (growth phase proteomics). The Pythoud et al. mass spectrometry data was obtained from PRIDE, accession ID: PXD019668. The microprotein sequence database used for HCP and proteomics analysis is available at https://doi.org/10.5281/zenodo.12609363. Source data are provided with this paper.
Code availability
The code required to reproduce the results presented in this manuscript93 is available at https://github.com/clarke-lab/tzani_microprotein_manuscript.
References
Walsh, G. & Walsh, E. Biopharmaceutical benchmarks 2022. Nat. Biotechnol. 40, 1722–1760 (2022).
Hanania, N. A. et al. Lebrikizumab in moderate-to-severe asthma: pooled data from two randomised placebo-controlled studies. Thorax 70, 748–756 (2015).
Li, X. et al. Identification and characterization of a residual host cell protein hexosaminidase B associated with N-glycan degradation during the stability study of a therapeutic recombinant monoclonal antibody product. Biotechnol. Prog. 37, e3128 (2021).
Luo, H. et al. Cathepsin L causes proteolytic cleavage of chinese-hamster-ovary cell expressed proteins during processing and storage: identification, characterization, and mitigation. Biotechnol. Prog. 35, e2732 (2019).
Bracewell, D. G., Francis, R. & Smales, C. M. The future of host cell protein (HCP) identification during process development and manufacturing linked to a risk-based management for their control. Biotechnol. Bioeng. 112, 1727–1737 (2015).
Zhu-Shimoni, J. et al. Host cell protein testing by ELISAs and the use of orthogonal methods. Biotechnol. Bioeng. 111, 2367–2379 (2014).
Pilely, K. et al. Monitoring process-related impurities in biologics–host cell protein analysis. Anal. Bioanal. Chem. 414, 747–758 (2022).
Henry, S. M., Sutlief, E., Salas-Solano, O. & Valliere-Douglass, J. ELISA reagent coverage evaluation by affinity purification tandem mass spectrometry. mAbs 9, 1065–1075 (2017).
Huang, Y., Molden, R., Hu, M., Qiu, H. & Li, N. Toward unbiased identification and comparative quantification of host cell protein impurities by automated iterative LC–MS/MS (HCP-AIMS) for therapeutic protein development. J. Pharm. Biomed. Anal. 200, 114069 (2021).
Goey, C. H., Bell, D. & Kontoravdi, C. Mild hypothermic culture conditions affect residual host cell protein composition post-Protein A chromatography. mAbs 10, 476–487 (2018).
Chiu, J. et al. Knockout of a difficult-to-remove CHO host cell protein, lipoprotein lipase, for improved polysorbate stability in monoclonal antibody formulations. Biotechnol. Bioeng. 114, 1006–1015 (2017).
Xu, X. et al. The genomic sequence of the Chinese hamster ovary (CHO)-K1 cell line. Nat. Biotechnol. 29, 735–741 (2011).
Meleady, P. et al. Utilization and evaluation of CHO-specific sequence databases for mass spectrometry based proteomics. Biotechnol. Bioeng. 109, 1386–1394 (2012).
Hilliard, W., MacDonald, M. L. & Lee, K. H. Chromosome-scale scaffolds for the Chinese hamster reference genome assembly to facilitate the study of the CHO epigenome. Biotechnol. Bioeng. 117, 2331–2339 (2020).
Li, S. et al. Proteogenomic annotation of the Chinese hamster reveals extensive novel translation events and endogenous retroviral elements. J. Proteome Res. 18, 2433–2445 (2019).
Ingolia, N. T., Ghaemmaghami, S., Newman, J. R. S. & Weissman, J. S. Genome-wide analysis in vivo of translation with nucleotide resolution using ribosome profiling. Science 324, 218–223 (2009).
Wright, B. W., Yi, Z., Weissman, J. S. & Chen, J. The dark proteome: translation from noncanonical open reading frames. Trends Cell Biol. https://doi.org/10.1016/j.tcb.2021.10.010 (2021).
Mudge, J. M. et al. Standardized annotation of translated open reading frames. Nat. Biotechnol. 40, 994–999 (2022).
Ivanov, I. P., Firth, A. E., Michel, A. M., Atkins, J. F. & Baranov, P. V. Identification of evolutionarily conserved non-AUG-initiated N-terminal extensions in human coding sequences. Nucleic Acids Res. 39, 4220–4234 (2011).
Ji, Z., Song, R., Regev, A. & Struhl, K. Many lncRNAs, 5’UTRs, and pseudogenes are translated and some are likely to express functional proteins. eLife 4, e08890 (2015).
Zhang, H. et al. Determinants of genome-wide distribution and evolution of uORFs in eukaryotes. Nat. Commun. 12, 1076 (2021).
Aspden, J. L. et al. Extensive translation of small open reading frames revealed by Poly-Ribo-Seq. eLife 3, e03528 (2014).
Bazzini, A. A. et al. Identification of small ORFs in vertebrates using ribosome footprinting and evolutionary conservation. EMBO J. 33, 981–993 (2014).
Ingolia, N. T., Lareau, L. F. & Weissman, J. S. Ribosome profiling of mouse embryonic stem cells reveals the complexity and dynamics of mammalian proteomes. Cell 147, 789–802 (2011).
Chen, J. et al. Pervasive functional translation of noncanonical human open reading frames. Science 367, 1140–1146 (2020).
Martinez, T. F. et al. Accurate annotation of human protein-coding small open reading frames. Nat. Chem. Biol. 16, 458–468 (2020).
Zhang, S. et al. Mitochondrial peptide BRAWNIN is essential for vertebrate respiratory complex III assembly. Nat. Commun. 11, 1312 (2020).
Rathore, A. et al. MIEF1 microprotein regulates mitochondrial translation. Biochemistry 57, 5564–5575 (2018).
Lee, C. et al. The mitochondrial-derived peptide MOTS-c promotes metabolic homeostasis and reduces obesity and insulin resistance. Cell Metab. 21, 443–454 (2015).
Slavoff, S. A., Heo, J., Budnik, B. A., Hanakahi, L. A. & Saghatelian, A. A human short open reading frame (sORF)-encoded polypeptide that stimulates DNA end joining. J. Biol. Chem. 289, 10950–10957 (2014).
Koh, M. et al. A short ORF-encoded transcriptional regulator. Proc. Natl Acad. Sci. USA 118, e2021943118 (2021).
Kuo, C.-C. et al. The emerging role of systems biology for engineering protein production in CHO cells. Curr. Opin. Biotechnol. 51, 64–69 (2018).
Donaldson, J., Kleinjan, D.-J. & Rosser, S. Synthetic biology approaches for dynamic CHO cell engineering. Curr. Opin. Biotechnol. 78, 102806 (2022).
Kallehauge, T. B. et al. Ribosome profiling-guided depletion of an mRNA increases cell growth rate and protein secretion. Sci. Rep. 7, 40388 (2017).
Masterton, R. J. & Smales, C. M. The impact of process temperature on mammalian cell lines and the implications for the production of recombinant proteins in CHO cells. Pharm. Bioprocess. 2, 49–61 (2014).
Tzani, I. et al. Subphysiological temperature induces pervasive alternative splicing in Chinese hamster ovary cells. Biotechnol. Bioeng. 117, 2489–2503 (2020).
Goey, C. H., Tsang, J. M. H., Bell, D. & Kontoravdi, C. Cascading effect in bioprocessing-The impact of mild hypothermia on CHO cell behavior and host cell protein composition. Biotechnol. Bioeng. 114, 2771–2781 (2017).
Jin, M., Szapiel, N., Zhang, J., Hickey, J. & Ghose, S. Profiling of host cell proteins by two-dimensional difference gel electrophoresis (2D-DIGE): Implications for downstream process development. Biotechnol. Bioeng. 105, 306–316 (2010).
Tait, A. S., Tarrant, R. D. R., Velez-Suberbie, M. L., Spencer, D. I. R. & Bracewell, D. G. Differential response in downstream processing of CHO cells grown under mild hypothermic conditions. Biotechnol. Prog. 29, 688–696 (2013).
Langmead, B., Trapnell, C., Pop, M. & Salzberg, S. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10, R25 (2009).
RNAcentral Consortium RNAcentral 2021: secondary structure integration, improved sequence search and new member databases. Nucleic Acids Res. 49, D212–D220 (2021).
Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013).
Dunn, J. G. & Weissman, J. S. Plastid: nucleotide-resolution analysis of next-generation sequencing and genomics data. BMC Genomics 17, 958 (2016).
Fields, A. P. et al. A regression-based analysis of ribosome-profiling data reveals a conserved complexity to mammalian translation. Mol. Cell 60, 816–827 (2015).
Eisenberg, A. R. et al. Translation initiation site profiling reveals widespread synthesis of non-AUG-initiated protein isoforms in yeast. Cell Syst. 11, 145–160.e5 (2020).
Finkel, Y. et al. The coding capacity of SARS-CoV-2. Nature 589, 125–130 (2021).
Lee, S. et al. Global mapping of translation initiation sites in mammalian cells at single-nucleotide resolution. Proc. Natl Acad. Sci. USA 109, E2424–E2432 (2012).
Ingolia, N. T. et al. Ribosome profiling reveals pervasive translation outside of annotated protein-coding genes. Cell Rep. 8, 1365–1379 (2014).
Manske, F. et al. The new uORFdb: integrating literature, sequence, and variation data in a central hub for uORF research. Nucleic Acids Res. 51, D328–D336 (2023).
Olexiouk, V., Van Criekinge, W. & Menschaert, G. An update on sORFs.org: a repository of small ORFs identified by ribosome profiling. Nucleic Acids Res. 46, D497–D502 (2018).
Sharp, P. M. & Li, W. H. The codon Adaptation Index−a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res. 15, 1281–1295 (1987).
Hughes, C. S. et al. Single-pot, solid-phase-enhanced sample preparation for proteomics experiments. Nat. Protoc. 14, 68–85 (2019).
Strasser, L. et al. Detection and quantitation of host cell proteins in monoclonal antibody drug products using automated sample preparation and data-independent acquisition LC-MS/MS. J. Pharm. Anal. 11, 726–731 (2021).
Pythoud, N. et al. Optimized sample preparation and data processing of data-independent acquisition methods for the robust quantification of trace-level host cell protein impurities in antibody drug products. J. Proteome Res. 20, 923–931 (2021).
Solntsev, S. K., Shortreed, M. R., Frey, B. L. & Smith, L. M. Enhanced global post-translational modification discovery with MetaMorpheus. J. Proteome Res. 17, 1844–1851 (2018).
Füssl, F. et al. Comprehensive characterisation of the heterogeneity of adalimumab via charge variant analysis hyphenated on-line to native high resolution Orbitrap mass spectrometry. mAbs 11, 116–128 (2019).
Zhang, Q. et al. Comprehensive tracking of host cell proteins during monoclonal antibody purifications using mass spectrometry. mAbs 6, 659–670 (2014).
Wen, B. & Zhang, B. PepQuery enables fast, accurate, and convenient proteomic validation of novel genomic alterations. Genome Res 29, 485–493 (2019).
Wen, B. & Zhang, B. PepQuery2 democratizes public MS proteomics data for rapid peptide searching. Nat. Commun. 14, 2213 (2023).
Cao, X. et al. Comparative proteomic profiling of unannotated microproteins and alternative proteins in human cell lines. J. Proteome Res. 19, 3418–3426 (2020).
Leong, A. Z.-X. et al. Short open reading frames (sORFs) and microproteins: an update on their identification and validation measures. J. Biomed. Sci. 29, 19 (2022).
Zhu, Y. et al. Discovery of coding regions in the human genome by integrated proteogenomics analysis workflow. Nat. Commun. 9, 903 (2018).
McGlincy, N. J. & Ingolia, N. T. Transcriptome-wide measurement of translation by ribosome profiling. Methods 126, 112–129 (2017).
Chothani, S. et al. deltaTE: detection of translationally regulated genes by integrative analysis of Ribo-seq and RNA-seq data. Curr. Protoc. Mol. Biol. 129, e108 (2019).
Ahlmann-Eltze, C. & Anders, S. proDA: Probabilistic dropout analysis for identifying differentially abundant proteins in label-free mass spectrometry. Biorxiv Preprint at https://doi.org/10.1101/661496 (2020).
Mullard, A. FDA approves 100th monoclonal antibody product. Nat. Rev. Drug Discov. 20, 491–495 (2021).
Tuameh, A., Harding, S. E. & Darton, N. J. Methods for addressing host cell protein impurities in biopharmaceutical product development. Biotechnol. J. 18, e2200115 (2023).
Wilson, L. J., Lewis, W., Kucia-Tran, R. & Bracewell, D. G. Identification of upstream culture conditions and harvest time parameters that affect host cell protein clearance. Biotechnol. Prog. 35, e2805 (2019).
Hogwood, C. E., Bracewell, D. G. & Smales, C. M. Measurement and control of host cell proteins (HCPs) in CHO cell bioprocesses. Curr. Opin. Biotechnol. 30, 153–160 (2014).
Fukuda, N., Senga, Y. & Honda, S. Anxa2- and Ctsd-knockout CHO cell lines to diminish the risk of contamination with host cell proteins. Biotechnol. Prog. 35, e2820 (2019).
Kol, S. et al. Multiplex secretome engineering enhances recombinant protein production and purity. Nat. Commun. 11, 1908 (2020).
Ferreira, J. P., Overton, K. W. & Wang, C. L. Tuning gene expression with synthetic upstream open reading frames. Proc. Natl Acad. Sci. USA 110, 11284–11289 (2013).
Ong, H. K., Nguyen, N. T. B., Bi, J. & Yang, Y. Vector design for enhancing expression level and assembly of knob-into-hole based FabscFv-Fc bispecific antibodies in CHO cells. Antib. Ther. 5, 288 (2022).
Kearse, M. G. & Wilusz, J. E. Non-AUG translation: a new start for protein synthesis in eukaryotes. Genes Dev. 31, 1717–1731 (2017).
Liang, H. et al. PTENα, a PTEN isoform translated through alternative initiation, regulates mitochondrial function and energy metabolism. Cell Metab. 19, 836–848 (2014).
Ketteler, R. On programmed ribosomal frameshifting: the alternative proteomes. Front. Genet. 3, 242 (2012).
Zhang, P. et al. Genome-wide identification and differential analysis of translational initiation. Nat. Commun. 8, 1749 (2017).
Martinez, T. F. et al. Profiling mouse brown and white adipocytes to identify metabolically relevant small ORFs and functional microproteins. Cell Metab. 35, 166–183.e11 (2023).
Gessulat, S. et al. Prosit: proteome-wide prediction of peptide tandem mass spectra by deep learning. Nat. Methods 16, 509–518 (2019).
Ingolia, N. T., Brar, G. A., Rouskin, S., McGeachy, A. M. & Weissman, J. S. The ribosome profiling strategy for monitoring translation in vivo by deep sequencing of ribosome-protected mRNA fragments. Nat. Protoc. 7, 1534–1550 (2012).
Martin, M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet. J. 17, 10–12 (2011).
Bolger, A. M., Lohse, M. & Usadel, B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30, 2114–2120 (2014).
Lawrence, M. et al. Software for computing and annotating genomic ranges. PLoS Comput. Biol. 9, e1003118 (2013).
Tjeldnes, H. et al. ORFik: a comprehensive R toolkit for the analysis of translation. BMC Bioinforma. 22, 336 (2021).
Ramírez, F. et al. deepTools2: a next generation web server for deep-sequencing data analysis. Nucleic Acids Res. 44, W160–W165 (2016).
Zhang, H. mt1022/cubar: Release v0.5.1. Zenodo https://doi.org/10.5281/zenodo.11060142 (2024).
Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15, 550 (2014).
Wang, J., Vasaikar, S., Shi, Z., Greer, M. & Zhang, B. WebGestalt 2017: a more comprehensive, powerful, flexible and interactive gene set enrichment analysis toolkit. Nucleic Acids Res. 45, W130–W137 (2017).
Strasser, L. et al. Proteomic landscape of adeno-associated virus (AAV)-producing HEK293 Cells. Int. J. Mol. Sci. 22, 11499 (2021).
Chambers, M. C. et al. A cross-platform toolkit for mass spectrometry and proteomics. Nat. Biotechnol. 30, 918–920 (2012).
Millikin, R. J., Solntsev, S. K., Shortreed, M. R. & Smith, L. M. Ultrafast Peptide Label-Free Quantification with FlashLFQ. J. Proteome Res. 17, 386–391 (2018).
Sandmann, C.-L. et al. Evolutionary origins and interactomes of human, young microproteins and small peptides translated from short open reading frames. Mol. Cell 83, 994–1011.e18 (2023).
Clarke, et al. manuscript code. https://doi.org/10.5281/zenodo.13285416 (2024).
Kim, S. et al. PubChem 2023 update. Nucleic Acids Res. 51, D1373–D1380 (2023).
Acknowledgements
The authors gratefully acknowledge funding from Science Foundation Ireland (grant references: 15/CDA/3259 (to CC) and 13/SIRG/2084 (to CC)). We thank Michelle Chain, Filipe Guapo and Ciara Tierney for technical assistance in the temperature shift proteomics cell culture experiment and mass spectrometry analyses. We are also grateful to the Carapito Laboratory for their support in incorporating the Pythoud et al. mass spectrometry data in this study.
Author information
Contributions
I.T. and C.C. conceived the study and designed experiments. Cell culture and Ribo-seq were carried out by I.T. and P.K. Ribo-seq data analysis was performed by M.C.R. and C.C. L.S. and J.B. performed the mass spectrometry experiments and analyses, and C.C. analysed the data. M.C.R., I.T., L.Z., M.C., B.L.K., N.B., J.B. and C.C. wrote the manuscript. All authors reviewed the paper.
Ethics declarations
Competing interests
I.T., M.C.R., P.K., L.S., B.L.K., M.C., N.B., J.B., and C.C. declare no competing interests. L.Z. is an employee of Pfizer Inc.
Peer review
Peer review information
Nature Communications thanks the anonymous reviewers for their contribution to the peer review of this work. A peer review file is available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Tzani, I., Castro-Rivadeneyra, M., Kelly, P. et al. Detection of host cell microprotein impurities in antibody drug products. Nat Commun 15, 8605 (2024). https://doi.org/10.1038/s41467-024-51870-0
DOI: https://doi.org/10.1038/s41467-024-51870-0