Detection of host cell microprotein impurities in antibody drug products

Tzani, Ioanna; Castro-Rivadeneyra, Marina; Kelly, Paul; Strasser, Lisa; Zhang, Lin; Clynes, Martin; Karger, Barry L.; Barron, Niall; Bones, Jonathan; Clarke, Colin

doi:10.1038/s41467-024-51870-0

Article
Open access
Published: 04 October 2024

Nature Communications volume 15, Article number: 8605 (2024)

Abstract

Chinese hamster ovary (CHO) cells are used to produce almost 90% of therapeutic monoclonal antibodies (mAbs) and antibody fusion proteins (Fc-fusion). The annotation of non-canonical translation events in these cellular factories remains incomplete, limiting our ability to study CHO cell biology and detect host cell protein (HCP) impurities in the final antibody drug product. We utilised ribosome footprint profiling (Ribo-seq) to identify novel open reading frames (ORFs) including N-terminal extensions and thousands of short ORFs (sORFs) predicted to encode microproteins. Mass spectrometry-based HCP analysis of eight commercial antibody drug products (7 mAbs and 1 Fc-fusion protein) using the extended protein sequence database revealed the presence of microprotein impurities. We present evidence that microprotein abundance varies with growth phase and can be affected by the cell culture environment. In addition, our work provides a vital resource to facilitate future studies of non-canonical translation and the regulation of protein synthesis in CHO cell lines.

Introduction

Chinese hamster ovary (CHO) cells are the predominant mammalian expression host for the production of biologics, with nearly 90% of therapeutic monoclonal antibodies (mAbs) and Fc-fusion proteins produced in this cell line¹. During the upstream cell culture phase of production, CHO cells continually secrete the recombinant protein into the supernatant. A series of downstream purification steps are required to recover the drug substance in the harvested cell culture media and reduce a range of impurities, including those originating from the host CHO cell line. Host cell proteins (HCPs) present in the final drug product are a particular concern due to the risk that an HCP could elicit an immune response in the patient or reduce efficacy². In addition, proteolytic HCPs can degrade the therapeutic antibody or affect its stability^3,4. Regulatory authorities consider the amount of HCP in the final product to be a critical quality attribute, and biopharmaceutical companies aim to reduce the total HCP concentration to <100 ppm⁵.

The efficacy and safety track record of therapeutic proteins produced in CHO cells is a testament to the continual commitment of biopharmaceutical companies and regulatory authorities to ensuring product quality. Enzyme-linked immunosorbent assays (ELISA) that enable sensitive total HCP quantitation are widely used in batch release during commercial manufacturing⁶. It is recognised, however, that this method can be limited in terms of coverage⁷. For instance, some low molecular weight proteins⁷ may be weakly immunogenic, or indeed, a particular protein may not elicit a response in the species immunised to generate the HCP assay⁸. Regulatory authorities now recommend⁷ the use of mass spectrometry (MS) as an orthogonal HCP detection method⁵ to enable the identification and quantification of individual HCPs, even those at low concentrations. The resulting data can be used during process development to monitor the HCP impurities present at each unit operation of a downstream purification process and to demonstrate HCP clearance from the final drug product^7,9. Identification of HCPs can also be used to guide upstream process development¹⁰ or identify targets for cell line engineering to remove unwanted HCPs¹¹.

Since the publication of the first CHO cell genome¹² and CHO cell-specific protein databases¹³, the detection of CHO HCP impurities in antibody drug products using MS has significantly improved. The quality of available genomes has steadily improved over time, and with the release of the Chinese hamster PICRH genome, the field now has a reference assembly comparable to that of model organisms¹⁴. While annotation of the transcriptome has progressed significantly, characterisation of the proteome is more challenging and remains incomplete, therefore limiting the ability of MS to detect the complete repertoire of potential HCP impurities.

The Chinese hamster reference genome has, for the most part, been annotated via a combination of ab initio computational pipelines, homology, ESTs, and transcriptomics data. Previously, the effectiveness of ribosome footprint profiling (Ribo-seq) in identifying translated regions of the Chinese hamster genome has been demonstrated¹⁵. Ribo-seq enables transcriptome-wide determination of ribosome occupancy at single nucleotide resolution, facilitating the discovery of new open reading frames (ORFs), and, when combined with RNA-seq, the identification of changes in translational regulation¹⁶. The technique utilises chemical or physical inhibitors to arrest translation and fixes translating ribosomes in position, resulting in the protection of ~30 nt of mRNA within the ribosome from subsequent enzymatic degradation. The resulting monosomes are purified via sucrose gradient, sucrose cushion, or size exclusion chromatography, followed by the isolation of ribosome-protected fragments (RPFs) through size selection, from which a sequencing library is prepared. Sequencing of RPFs and alignment to a reference genome or transcriptome permits the identification and quantitation of regions undergoing active translation. Over the last decade, Ribo-seq has provided compelling evidence that the traditional rules of eukaryotic translation need to be revised. For example, translation initiation at non-AUG codons is more widespread in mammalian genomes than previously thought¹⁷.

Data from Ribo-seq experiments have been used to annotate a range of non-canonical ORFs¹⁸, including N-terminal extensions¹⁹, detect translation of RNAs previously thought to be non-coding²⁰ and study the regulatory role of ORFs in the 5’ leader sequence of mRNAs (i.e., upstream open reading frames)²¹. Ribo-seq has also revealed the existence of small open reading frames (sORFs) that produce potentially functional microproteins (classified as proteins < 100 aa) in a diverse range of organisms, including Drosophila²², zebrafish²³, mouse²⁴, and human^25,26. Studies have shown that microproteins are involved in various cellular processes such as oxidative phosphorylation²⁷, mitochondrial translation²⁸, metabolism²⁹, DNA repair³⁰ and can also act as transcription factors³¹.

There has been considerable interest in enhancing the efficiency of CHO cell factories for mAb production using systems biology³² and cell line engineering³³. Yet, while the importance of non-canonical ORFs is becoming increasingly understood in other organisms, the lack of annotation severely restricts the study of their role in CHO cell biology. Perhaps more surprisingly, given the fundamental role of protein synthesis in mAb production, there is a lack of knowledge of how translational regulation impacts CHO cell behaviour during the cell culture process. Indeed, apart from a small number of studies^15,34, Ribo-seq has not received widespread attention in the field, and the capability of the technique to study CHO cell translation regulation has yet to be demonstrated. Annotation of non-canonical ORFs in the Chinese hamster genome and transcriptome-wide analysis of protein synthesis in CHO cells has the potential to identify additional avenues to improve therapeutic antibody production.

In this work, we utilise Ribo-seq with different translation inhibitors to analyse translation elongation and translation initiation in CHO cells. Using these data, we have significantly enhanced the annotation of non-canonical ORFs in the Chinese hamster genome. We have identified previously unannotated translation events in protein-coding genes and thousands of short ORFs predicted to encode microproteins. We have shown that Ribo-seq enables improved characterisation of CHO cell biology compared to performing transcriptomics alone and present evidence that non-canonical ORFs are altered in response to environmental changes and over the course of cell culture. Importantly, our work has improved MS-based HCP detection, enabling microprotein impurities to be identified in antibody drug products.

Results

Transcriptome-wide analysis of CHO cell translation initiation and elongation using Ribo-seq

The reduction of cell culture temperature (i.e., temperature shift) is a method used to extend the viability of some commercial cell culture processes and improve product quality³⁵. In this study, we simulated an industrial temperature shift and conducted a series of Ribo-seq experiments. Our laboratory has demonstrated that temperature shift induces significant differences in gene expression and alters the cellular metabolism of a mAb-producing CHO-K1 cell line (CHO-K1 mAb)³⁶. Given our previous findings and studies from other laboratories reporting the alteration of canonical protein abundance^10,37,38,39, we reasoned that this temperature shift model would also induce widespread changes in translation regulation and provide an opportunity to identify novel Chinese hamster ORFs.

To perform ribosome footprint profiling, we conducted two identical cell culture experiments for the analysis of translation initiation and elongation. For both experiments, 8 replicate shake flasks were first grown for 48 h at 37 °C before the cell culture temperature was reduced to 31 °C (temperature shifted (TS) group; n = 4) while maintaining the remainder at 37 °C (non-temperature shifted (NTS) group; n = 4). Cells from both the TS and NTS groups were harvested for Ribo-seq 24 h after the reduction of cell culture temperature (Fig. 1a), at which point cell density in the TS sample group was 30% (initiation experiment) and 24% (elongation experiment) lower than in the NTS group (Supplementary Fig. 1; Supplementary Data 1a). We performed ribosome footprint profiling experiments using harringtonine (HARR) (n = 8), an inhibitor of translation initiation²⁴, and cycloheximide (CHX) (n = 8), an inhibitor of translation elongation¹⁶ (Fig. 1b). For each harringtonine-treated sample, a parallel sample (n = 8) was treated with DMSO and flash frozen to arrest translation (we refer to these data as No-drug (ND)). For the CHX samples, matched gene expression profiles were acquired using total RNA-seq (n = 8) (Fig. 1c) to enable the identification of significant differences in translational efficiency (TE) between the NTS and TS sample groups.

**Fig. 1: Analysis of CHO cell translation using ribosome footprint profiling.**

Sequencing of the 24 resulting Ribo-seq libraries yielded an average of ~69, ~68, and ~58 million reads per sample for the CHX, HARR, and ND Ribo-seq, respectively. An average of ~56 million reads per sample were obtained for the 8 RNA-seq libraries. Low-quality reads were removed, and adapter sequences were trimmed from the raw Ribo-seq and RNA-seq data. For Ribo-seq data, an additional filtering stage was carried out to eliminate contamination from non-coding RNA. Reads were mapped to bowtie⁴⁰ indices constructed from Cricetulus griseus rRNA, tRNA, and snoRNA sequences obtained from v22 of the RNA Central database⁴¹. Reads aligning to any of these indices were discarded from further analysis. This filtering stage removed an average of ~55%, ~40%, and ~39% of trimmed reads for the CHX, HARR, and ND samples, respectively (Supplementary Fig. 2; Supplementary Data 2).

Next, we examined the remaining Ribo-seq reads within the expected RPF length range (25–34 nt) (Fig. 1d) to select the P-site offset (the distance from the 5’ end of a read to the first nucleotide of the P-site codon). Each Ribo-seq dataset was mapped to the Chinese hamster PICRH-1.0 genome using STAR⁴². The Plastid tool⁴³ was used to assess the P-site offset and determine the proportion of reads exhibiting triplet periodicity (Fig. 1e) for NCBI-annotated canonical protein-coding genes for each offset. The optimum P-site offset was found to be 12 nt, for which ~71%, ~65%, and ~64% of reads exhibited the expected triplet periodicity for the CHX, HARR, and ND Ribo-seq datasets, respectively, and we retained the reads between 28 and 31 nt for further analysis (Fig. 1d). Prior to ORF identification, we confirmed the expected preferential enrichment of ribosomes at the translation initiation sites (TIS) of annotated protein-coding genes for the HARR Ribo-seq data in comparison to the CHX and ND Ribo-seq data (Fig. 1f; Supplementary Fig. 3).
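The idea of triplet periodicity can be illustrated with a minimal Python sketch. It is not the Plastid implementation; the function name, the 0-based transcript coordinates, and the toy read positions are assumptions made purely for illustration. Each read 5’ end is shifted by the P-site offset, and the reading frame of the resulting position relative to the annotated start codon is tallied.

```python
from collections import Counter

def frame_fractions(read_starts, cds_start, offset=12):
    """Fraction of reads whose inferred P-site falls in frame 0, 1, or 2.

    read_starts: 0-based transcript coordinates of read 5' ends (hypothetical input).
    cds_start:   0-based coordinate of the first nucleotide of the start codon.
    offset:      distance from the read 5' end to the P-site (12 nt in the text).
    """
    frames = Counter((start + offset - cds_start) % 3 for start in read_starts)
    total = sum(frames.values())
    if total == 0:
        return [0.0, 0.0, 0.0]
    return [frames[f] / total for f in range(3)]

# Toy data: seven read 5' ends on a transcript whose CDS starts at position 100.
toy_starts = [88, 91, 94, 97, 100, 89, 96]
fractions = frame_fractions(toy_starts, cds_start=100)
```

In this toy example five of the seven inferred P-sites fall in frame 0, so a strongly periodic dataset would show a similarly dominant first entry.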

Ribo-seq enables the characterisation of novel ORFs in the Chinese hamster genome

The Ribo-seq data was used to refine the annotation of translated regions of the Chinese hamster PICRH-1.0 genome by conducting a transcriptome-wide analysis using ORF-RATER⁴⁴. The ORF-RATER algorithm integrates initiation and elongation Ribo-seq data to enable the identification of unannotated ORFs by first finding all potential ORFs beginning at user-defined start codons that have an in-frame stop codon. The experimental Ribo-seq data is then used to confirm occupancy at each TIS and assess whether the putative ORF is undergoing active translation. To maximise the sensitivity of ORF detection, we merged the RPFs for all replicates in each type of Ribo-seq experiment yielding a total of approximately 144, 169, and 140 million RPFs for the harringtonine, cycloheximide, and no-drug treated Ribo-seq, respectively. Prior to ORF identification, transcripts from 4583 pseudogenes were removed. In addition, transcripts that had low coverage (n = 18,951), a high proportion of multimapped reads (n = 10), or where the RPFs aligned to a small number of positions within a transcript (n = 1662) were also excluded from further analysis. For the remaining transcripts, the initial ORF-RATER search step was limited to ORFs that began at an AUG or near-cognate start codon (i.e., CUG, GUG, or UUG). To determine if a potential TIS was occupied, only the RPF data from the HARR Ribo-seq was considered while CHX and ND-treated Ribo-seq data was utilised to determine if putative ORFs were translated by comparing the RPF occupancy of each ORF to the typical pattern of translation elongation observed for CDS of annotated protein coding genes.
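The initial search step described above, enumerating every ORF that begins at an AUG or near-cognate start codon and ends at the first in-frame stop codon, can be sketched as follows. This is a simplified illustration and not the ORF-RATER code; `candidate_orfs` and its `min_aa` parameter are hypothetical names, and the subsequent scoring against Ribo-seq data is not modelled.

```python
START_CODONS = {"AUG", "CUG", "GUG", "UUG"}
STOP_CODONS = {"UAA", "UAG", "UGA"}

def candidate_orfs(rna, min_aa=1):
    """List (start, end, start_codon) for each start codon with an in-frame stop.

    start is the 0-based index of the start codon; end is the index just past
    the stop codon. min_aa counts coding codons and excludes the stop codon.
    """
    rna = rna.upper().replace("T", "U")
    orfs = []
    for start in range(len(rna) - 2):
        codon = rna[start:start + 3]
        if codon not in START_CODONS:
            continue
        # Walk downstream codon by codon until the first in-frame stop.
        for pos in range(start + 3, len(rna) - 2, 3):
            if rna[pos:pos + 3] in STOP_CODONS:
                if (pos - start) // 3 >= min_aa:
                    orfs.append((start, pos + 3, codon))
                break
    return orfs

# Toy transcript: AUG followed by five GCC codons and a UAA stop.
toy_rna = "GG" + "AUG" + "GCC" * 5 + "UAA" + "C"
toy_orfs = candidate_orfs(toy_rna, min_aa=5)
```

Start codons with no in-frame stop before the end of the transcript are silently skipped, which matches the requirement that a candidate ORF has an in-frame stop codon.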

An initial group of 27,784 ORFs identified by ORF-RATER with an ORF-RATER score of ≥ 0.5^45,46 and ORF length ≥ 5 aa was selected for further analysis. The proteoforms identified included those present in the current annotation of the Chinese hamster genome (i.e., Annotated) and N-terminal extensions (i.e., Extension). Two distinct classes of ORFs initiating upstream of the annotated CDS (i.e., the main ORF) were also identified. The first type, called upstream ORFs (i.e., uORFs), initiates upstream and terminates before the start codon of the main ORF. The second upstream ORF type, termed overlapping upstream open reading frames (ouORFs), also initiates in the 5’ leader of mRNAs but extends downstream beyond the main ORF’s start codon and is translated in a different reading frame. As well as ORFs in mRNAs, we also identified ORFs in transcripts classified as non-coding in the PICRH-1.0 genome that had previously unannotated start and stop codons (i.e., New ORFs).

The conditions used to inhibit translation initiation can, in some cases, lead to the identification of false positive internal ORFs due to the capture of residual elongating ribosomes⁴⁵. In our case, we utilized flash freezing in combination with harringtonine, which will also result in the capture of a proportion of RPFs from elongating ribosomes, increasing the probability of erroneous identifications. To reduce false positives from internal TIS, we discarded ORFs classified as truncated (n = 9365) or internal (n = 1723), as well as other low-confidence isoforms (n = 872), from further analysis. For the remaining ORFs, we utilised a method developed by Lee et al.⁴⁷ to perform relative quantitation of the harringtonine signal at each TIS when compared to the ND Ribo-seq data. ORFs with a $R_{\mathrm{harr}} - R_{\mathrm{nd}}$ value < 0.01 were eliminated from further analysis (n = 4633). The validity of ORFs with non-AUG TIS was further assessed in comparison to other proteoforms that overlapped on the same transcript. Where an AUG and non-AUG ORF were predicted to start within a 7 nt window, we eliminated the non-AUG initiated ORF. In cases when a pair of AUG-initiated ORFs, or a pair of non-AUG-initiated ORFs, were found within the window, only the ORF with the maximum $R_{\mathrm{harr}} - R_{\mathrm{nd}}$ value was retained. For overlapping ORFs that started outside of the 7 nt window, non-AUG ORFs were retained if the $R_{\mathrm{harr}} - R_{\mathrm{nd}}$ value at the TIS was at least five times higher than that of the AUG-initiated counterpart. This process eliminated a further 465 ORFs. The final stage in the assessment of novel ORFs was the calculation of Ingolia’s fragment length organisation similarity score (FLOSS)⁴⁸ for all ORFs (annotated and novel); ORFs with a FLOSS score classified as an extreme outlier were removed (n = 525).
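As a quick consistency check, the filtering steps in this paragraph can be tallied against the initial group of 27,784 ORFs. The short Python sketch below uses only the counts reported in the text; the dictionary keys are informal labels chosen for illustration.

```python
initial_orfs = 27_784

# ORFs removed at each filtering stage, as reported in the text.
removed = {
    "truncated": 9_365,
    "internal": 1_723,
    "other low-confidence isoforms": 872,
    "harringtonine signal below threshold": 4_633,
    "overlapping TIS comparison": 465,
    "FLOSS extreme outliers": 525,
}

total_removed = sum(removed.values())
remaining_orfs = initial_orfs - total_removed
```

The remainder agrees with the number of high-confidence ORFs reported after filtering.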

Upon the completion of the filtering process, 10,201 high-confidence ORFs were retained (Fig. 2a, Supplementary Fig. 4, Supplementary Data 3), of which ~44% (n = 4491) were not annotated in the Chinese hamster PICRH-1.0 genome. ~56% of these new identifications were predicted to start at near-cognate codons (i.e., CUG, GUG, or UUG). The novel N-terminal extensions, uORFs, and ouORFs that remained after this filtering process were compared to annotations in uORFdb⁴⁹. The proportion of ORFs with near cognate start codons was comparatively lower in Chinese hamster than in the 11 species examined, including human, mouse, and rat (Supplementary Fig. 5). The ability to identify initiation at non-AUG codons enabled us to discover alternative ORFs of conventional protein-coding genes that would not be possible with previous annotation approaches for the Chinese hamster genome. For instance, ~11% (n = 527) of novel ORFs identified were N-terminal extensions of annotated protein-coding transcripts (e.g., Aurora kinase A (Fig. 2b)).

**Fig. 2: Ribo-seq identifies thousands of novel CHO cell ORFs.**

The Chinese hamster genome harbours thousands of short open reading frames

We also identified a considerable number of previously uncharacterized short open reading frames (sORFs) in the Chinese hamster genome (Supplementary Data 3). sORFs are defined as ORFs predicted to produce proteins < 100 aa, termed microproteins⁵⁰. More than 90% of the ORFs identified in the 5’ region of mRNAs (uORFs (Fig. 3a) and ouORFs (Fig. 3b)) or in transcripts previously annotated as non-coding (Fig. 3c) were sORFs (Fig. 3d). 2276 uORFs were classified as sORFs and had an average putative microprotein length of 23 aa (Supplementary Fig. 6a). AUG (49.9%) was the most prevalent start codon, followed by CUG (29.3%), GUG (12.7%), and UUG (8.2%). The average predicted microprotein length of the ouORFs classified as sORFs (n = 918) was 39 aa (Supplementary Fig. 6b), with CUG (37.7%) the most frequent start codon, followed by AUG (33.0%), GUG (18.5%), and UUG (10.8%). For the New ORF class, the majority (480 of 487) were sORFs and found in transcripts annotated in NCBI as non-coding. The average length of the microproteins predicted to be encoded by these sORFs was 30 aa (Supplementary Fig. 6c). AUG (70.0%) was the most common start codon, followed by CUG (18.3%), GUG (8.3%), and UUG (3.3%).

**Fig. 3: Ribosome footprint profiling uncovers thousands of short open reading frames in the Chinese hamster genome.**

Upstream ORFs and sORFs in the New ORF group were found to have differences in amino acid usage when compared to annotated proteins with ≥ 100 aa (Fig. 3e and Supplementary Fig. 7). The amino acid usage was comparable to a recent analysis conducted for microproteins encoded in the human genome²⁶. CHO cell sORFs were found to have increased usage of arginine, glycine, and tryptophan and decreased usage of asparagine, glutamate, lysine, and aspartic acid. Alanine and proline were more prevalent in uORFs than annotated proteins and sORFs found in ncRNA, while methionine usage was more frequent in sORFs in ncRNA. We also compared the codon adaptation index (CAI)⁵¹ between previously annotated protein-coding genes and novel ORF types. We found that the sORF population had a lower CAI (Fig. 3f), suggesting that microproteins tend to have a lower abundance than canonical proteins.
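For readers unfamiliar with the CAI, it is commonly computed as the geometric mean of per-codon relative adaptiveness weights, where each weight is the frequency of a codon divided by that of the most frequent synonymous codon in a reference gene set. The Python sketch below illustrates that calculation only; the weight table is a made-up toy example rather than Chinese hamster data, and codons absent from the table are simply skipped (an assumption of this sketch).

```python
import math

def codon_adaptation_index(cds, weights):
    """Geometric mean of relative adaptiveness weights over the codons of a CDS."""
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    scored = [weights[c] for c in codons if c in weights]
    if not scored:
        raise ValueError("no codons with a known weight")
    return math.exp(sum(math.log(w) for w in scored) / len(scored))

# Hypothetical weights: 1.0 marks the preferred codon for an amino acid.
toy_weights = {"GCC": 1.0, "GCU": 0.5, "AAG": 1.0, "AAA": 0.25}
toy_cai = codon_adaptation_index("GCCGCUAAGAAA", toy_weights)
```

A sequence built entirely from preferred codons scores 1.0, and greater use of rare codons pulls the value towards 0.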

Microproteins are a source of process-related impurities in antibody drug products

Next, we sought to determine if microproteins predicted to be encoded by novel sORFs increased the coverage of MS-based HCP detection. We performed HCP analysis in our laboratory utilising the SP3 sample preparation method followed by liquid chromatography-tandem mass spectrometry (LC-MS/MS)^52,53 to analyse 5 commercial mAb drug products (adalimumab, denosumab, nivolumab, pertuzumab and vedolizumab) as well as an Fc-fusion protein drug product (etanercept). We also utilised a publicly available dataset from a previous HCP analysis of 4 mAb drug products (adalimumab, bevacizumab, nivolumab and trastuzumab) (Fig. 4a)⁵⁴. In total, we analysed 10 separate LC-MS/MS datasets spanning 8 antibody drug products and 5 different sample preparation methods (Fig. 4b), acquired on two different Orbitrap MS instruments, both operated in data-dependent acquisition (DDA) mode (Fig. 4c).

**Fig. 4: Microproteins are a class of potential host cell impurity in antibody drug products.**

The first stage in our MS analysis was to search the data using MetaMorpheus⁵⁵ with spectral mass calibration enabled to identify canonical HCPs present in each drug product. For each drug product, a protein sequence database was constructed comprising the Chinese hamster reference proteome (n = 23,887, downloaded from UniProt on 27/03/2024), sequences of sample contaminants from the common Repository of Adventitious Proteins (cRAP, thegpm.org/crap), the respective recombinant protein sequences, and any quantitation or retention time standards that were added to the sample. A canonical protein was considered confidently identified if ≥ 2 peptides were detected and the protein-level FDR was < 0.01. We found canonical HCPs in all antibody drug products tested in both studies, including several previously identified HCPs in adalimumab (e.g., cathepsin L1, S100a11⁵⁶) and vedolizumab (e.g., clusterin⁵⁷) (Fig. 4d, Supplementary Data 4).

To identify microproteins in the drug product HCP data, we used PepQuery2^58,59, a peptide-centric algorithm designed specifically for detecting novel proteins and utilized previously for microprotein validation⁶⁰. A PepQuery2 index was constructed from the MetaMorpheus mass-calibrated LC-MS/MS data from all drug product HCP analyses, comprising 132 samples and > 5.2 million MS/MS spectra. The microproteins annotated by Ribo-seq (n = 3681) were digested with semi-tryptic specificity and searched against the PepQuery index to identify candidate peptide spectral matches (PSMs). False positive microprotein identifications were initially eliminated by comparing these PSMs against peptides from the reference proteome (i.e., annotated Chinese hamster proteins, all antibody drug products, mass spectrometry standards, and contaminant sequences). The known peptide set comprised tryptic peptides from all reference proteins. In addition, we utilized MetaMorpheus to perform a liberal search of the drug product data with semi-tryptic specificity. Semi-tryptic peptides from a protein identified in any drug product at an FDR < 10% (n = 2973) were also included in the known peptide set. PSMs initially associated with microproteins were excluded if a known peptide achieved an equal or greater PepQuery score for the same spectrum. The remaining candidate microprotein PSMs were compared to randomly shuffled microprotein peptide sequences to determine the FDR. In the final stage of PepQuery, an unrestricted modification search was performed, and if a candidate microprotein PSM was found to have a better match to a post-translationally or artefactually modified known peptide, it was eliminated^58,59.
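To make "semi-tryptic specificity" concrete, the Python sketch below generates peptides in which at least one terminus arises from tryptic cleavage. It is an illustration under stated assumptions rather than the digestion routine used by PepQuery2 or MetaMorpheus: trypsin is modelled as cleaving after K or R except before P, missed cleavages are ignored, and `min_len` is a hypothetical parameter.

```python
import re

def tryptic_peptides(protein):
    """Fully tryptic peptides: cleave after K or R unless the next residue is P."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]

def semi_tryptic_peptides(protein, min_len=7):
    """Peptides with at least one tryptic terminus (no missed cleavages)."""
    peptides = set()
    for pep in tryptic_peptides(protein):
        for k in range(min_len, len(pep) + 1):
            peptides.add(pep[:k])   # tryptic N-terminal boundary, ragged C-terminus
            peptides.add(pep[-k:])  # ragged N-terminus, tryptic C-terminal boundary
    return peptides

# Toy microprotein sequence (hypothetical).
toy_peptides = semi_tryptic_peptides("MASTKLLDEFGR", min_len=5)
```

Because every prefix and suffix above the length cut-off is emitted, a semi-tryptic search space is considerably larger than a fully tryptic one, which is one reason the downstream false positive controls matter.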

The identification of microproteins using mass spectrometry is challenging: compared to canonical proteins, their small size, lower abundance, and fewer cleavage sites amenable to trypsin digestion reduce the number of peptides detectable by MS⁶¹. Similar to other studies^26,62, we considered a single peptide sufficient for microprotein identification. Only those microprotein peptides designated as confident by PepQuery (i.e., a p-value < 0.05 if the peptide length ≤ 8 aa, or < 0.01 for peptides > 8 aa) (Fig. 4e) and found in at least 50% of the replicates of each drug product or sample preparation cohort were retained (Supplementary Data 5).
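The two retention criteria in this paragraph, a length-dependent p-value threshold and detection in at least 50% of replicates, can be expressed as a small Python sketch. The tuple layout of the input and the function names are assumptions for illustration and do not reflect the PepQuery output format.

```python
def is_confident(peptide, p_value):
    """Length-dependent p-value threshold as stated in the text."""
    threshold = 0.05 if len(peptide) <= 8 else 0.01
    return p_value < threshold

def retained_peptides(psms, n_replicates, min_fraction=0.5):
    """Keep peptides confidently detected in at least min_fraction of replicates.

    psms: iterable of (peptide, replicate_id, p_value) tuples (hypothetical layout).
    """
    replicates_seen = {}
    for peptide, replicate_id, p_value in psms:
        if is_confident(peptide, p_value):
            replicates_seen.setdefault(peptide, set()).add(replicate_id)
    return {
        pep for pep, reps in replicates_seen.items()
        if len(reps) / n_replicates >= min_fraction
    }

# Toy PSMs for one cohort with four replicates.
toy_psms = [
    ("PEPTIDEK", 1, 0.03), ("PEPTIDEK", 2, 0.04),
    ("LONGPEPTIDER", 1, 0.03), ("LONGPEPTIDER", 2, 0.005), ("LONGPEPTIDER", 3, 0.2),
]
toy_retained = retained_peptides(toy_psms, n_replicates=4)
```

Here the 8-residue peptide passes in two of four replicates and is kept, whereas the longer peptide meets its stricter threshold in only one replicate and is dropped.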

A total of 40 microprotein HCPs were identified across the eight antibody drug products (Fig. 4d). Of the 5 sample preparation methods, native digestion protocols resulted in the lowest number of microprotein identifications (Supplementary Fig. 8). While we detected two or more peptides from 13 of these microproteins across the dataset, most microprotein identifications resulted from the detection of a single peptide. Etanercept had the largest number of individual microproteins identified (n = 11) from the analyses conducted in our laboratory, while the bevacizumab and trastuzumab samples analyzed by Pythoud et al. had the largest number of microproteins (n = 18). The lowest number of microproteins detected was in vedolizumab (n = 4). Twenty-eight microproteins were identified in more than one of the drug products examined (Fig. 4f). A single microprotein was found in the adalimumab analyzed in this study and the Pythoud et al. study, while two microproteins were found in the nivolumab data from both studies. A 16 aa microprotein from a CUG-initiated ouORF in a Znf883 transcript (XM_027419542.2) was found in 7 of 8 antibody drug products tested (Fig. 4f).

We utilised the data generated in our laboratory for six drug products to assess the quantities of canonical proteins and microproteins present in the drug products. The canonical and microprotein PSMs identified by MetaMorpheus and PepQuery were combined, and FlashLFQ was used to perform label-free quantitation. The Hi3 standard quantitation method⁵³ was used to determine the concentration of individual HCPs. The ten quantifiable canonical HCPs detected across the six drug products ranged in concentration from 1.52 ppm to 46.94 ppm (median = 14.28 ppm). The Hi3 method requires ≥ 3 identified peptides for confident quantification of a protein. A single quantifiable microprotein met the 3-peptide confidence level for accurate quantitation and was found to be present in etanercept at a concentration of 1.92 ppm. Given the challenges of microprotein identification, we also decided to estimate the concentration of the microproteins with fewer than three peptides identified (Supplementary Fig. 9). A microprotein with two peptides identified was found at 0.16 ppm. Quantified microproteins that were identified from a single peptide (n = 15) represent the lowest confidence estimates in terms of HCP abundance. The majority of these microproteins (n = 14) were found to be below the median concentration observed for canonical HCPs. In contrast, the remaining microprotein abundance estimate was ~800 ppm, exceeding the canonical HCP range.
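The general logic of Hi3-style quantitation can be sketched in Python as follows. This is a generic illustration and not the authors' pipeline: it assumes the molar amount of a protein is proportional to the mean intensity of its three most intense peptides, that a spiked standard of known amount is available, and that ppm is expressed as ng of HCP per mg of drug product. All names and numbers are hypothetical.

```python
def top3_mean(intensities):
    """Mean of the three most intense peptide signals."""
    top = sorted(intensities, reverse=True)[:3]
    if len(top) < 3:
        raise ValueError("Hi3 requires at least three quantified peptides")
    return sum(top) / 3

def hi3_ppm(hcp_intensities, std_intensities, std_fmol, hcp_mw_da, product_ug):
    """Estimate HCP concentration in ppm (ng HCP per mg product)."""
    hcp_fmol = top3_mean(hcp_intensities) / top3_mean(std_intensities) * std_fmol
    hcp_ng = hcp_fmol * hcp_mw_da * 1e-6   # fmol x g/mol gives fg; 1e-6 converts fg to ng
    product_mg = product_ug / 1000
    return hcp_ng / product_mg

# Toy numbers: 100 fmol of standard, a 20 kDa HCP, 10 ug of product on column.
toy_ppm = hi3_ppm(
    hcp_intensities=[3e6, 2e6, 1e6, 5e5],
    std_intensities=[4e7, 4e7, 4e7],
    std_fmol=100,
    hcp_mw_da=20_000,
    product_ug=10,
)
```

With fewer than three peptides the top-three mean is undefined, so estimates for one- or two-peptide microproteins need a relaxed variant of this calculation and carry correspondingly lower confidence.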

The translation efficiency of sORFs found in non-coding RNA genes is altered in response to a reduction of cell culture temperature

Ribo-seq can also be used to assess translation efficiency by normalizing RPF occupancy of each ORF to the corresponding RNA abundance. Comparing translation efficiency between conditions enables transcriptome-wide differences in translational regulation to be identified¹⁶. To understand the extent to which changes in the translatome are associated with the CHO cell response to sub-physiological temperature, we performed a count-based analysis of translation efficiency using the CHX-treated Ribo-seq data along with the parallel RNA-seq data captured for TS (n = 4) and NTS (n = 4) samples (Supplementary Fig. 10).
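As a simplified illustration of the quantity being compared, the Python sketch below computes translation efficiency as library-size-normalised RPF counts divided by library-size-normalised RNA-seq counts, and then the log2 change between two conditions. It does not reproduce the count-based statistical model used in this study; the counts-per-million normalisation and all numbers are assumptions for illustration.

```python
import math

def translation_efficiency(rpf_count, rna_count, rpf_library_size, rna_library_size):
    """RPF occupancy normalised to RNA abundance, both in counts per million."""
    rpf_cpm = rpf_count / rpf_library_size * 1e6
    rna_cpm = rna_count / rna_library_size * 1e6
    return rpf_cpm / rna_cpm

def log2_delta_te(te_treated, te_control):
    """log2 ratio of translation efficiency between two conditions."""
    return math.log2(te_treated / te_control)

# Toy ORF: RNA level unchanged, RPF occupancy halved after temperature shift.
te_nts = translation_efficiency(200, 100, 1_000_000, 1_000_000)
te_ts = translation_efficiency(100, 100, 1_000_000, 1_000_000)
toy_delta = log2_delta_te(te_ts, te_nts)
```

In this toy case the translation efficiency halves while RNA abundance is constant, giving a log2 change of about -1.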

The Plastid⁴³ cs generate algorithm was used to construct a gene-level annotation by merging the positions of all exons found in all transcripts of a gene. Only those RPFs/reads mapping to the CDS regions common to all transcripts for a particular gene contributed to the overall count. The RPFs/reads from the first 15 and last five codons for ORFs ≥ 100 aa and the first and last codons for ORFs < 100 aa were excluded. This step was intended to reduce potential bias from the cycloheximide-associated accumulation of ribosomes at the beginning and end of the CDS and enrich for those RPFs most likely to be associated with elongation⁶³. It is not possible to accurately distinguish the expression/occupancy of the uORF/ouORFs from the canonical ORFs with the gene-level CDS counting approach used here. We, therefore, focused only on the ORF cohort identified in genes previously classified as non-coding in the reference annotation. 495 of these ORFs were identified, 480 of which were predicted to encode a microprotein (Fig. 5a). The average length of potential microproteins was 30 aa (Fig. 5b), with the majority of parent transcripts harboring 1 or 2 sORFs. However, there were instances where as many as five sORFs were found in a single non-coding RNA transcript (Fig. 5c). To ensure compatibility with the Plastid read/RPF counting algorithm⁴³, we utilised only the longest sORF for each transcript (n = 395).

**Fig. 5: Temperature shift induces alterations in translation regulation of CHO cell canonical ORFs and sORFs.**

We retained only the CDS regions (n = 10,741) that had an average count of at least 10 in both the RNA-seq and Ribo-seq datasets, and used deltaTE⁶⁴ to identify and classify ORFs with differences in transcription and/or translational efficiency (ΔTE) between the TS and NTS samples. The deltaTE method introduces an interaction term to the DESeq2 generalized linear model to assess differences between the biological conditions observed from RNA-seq and Ribo-seq data separately and, importantly, between the different assays to calculate the false discovery rate (FDR) for changes in the RNA abundance, RPF occupancy and the translation efficiency for each gene. We initially classified the outputs from DESeq2 using only statistically significant differences (adjusted p-value < 0.05) and not fold change, as outlined by the deltaTE developers (Fig. 5d). Of the 3707 genes classified by deltaTE, we found that 76.5% (n = 2837) were transcriptionally forwarded, where a significant increase or decrease in RPF occupancy agreed with the change in RNA abundance observed (i.e., ΔTE adjusted p-value ≥ 0.05). 7.5% (n = 279) of differences between the TS and NTS samples were found to be translation exclusive, where both the RPF occupancy and ΔTE were significantly altered, while the RNA was unchanged (adjusted p-value ≥ 0.05). The remainder of genes that were altered at both the transcriptional and translational level (i.e., RNA and ΔTE adjusted p-value < 0.05) were further classified by taking the direction of the change into account. 10.6% (n = 392) of genes were found to be buffered where changes in transcription were tempered at the level of translation, e.g., an increase in RNA abundance was associated with a decreased ΔTE. The remaining 5.4% (n = 199) of genes were found to be intensified by translation regulation, e.g., an upregulation in transcription was accompanied by an increased ΔTE.
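The classification rules described in this paragraph can be summarised with a short Python sketch. It encodes only the decision logic as stated in the text, applied to adjusted p-values and log2 fold changes that would come from the DESeq2 model; it is not the deltaTE code, and the function and argument names are hypothetical. The final lines check that the reported category counts add up to the number of classified genes.

```python
def classify_gene(padj_rna, padj_rpf, padj_te, lfc_rna, lfc_te, alpha=0.05):
    """Assign a regulatory category from significance calls and change direction."""
    rna_sig = padj_rna < alpha
    rpf_sig = padj_rpf < alpha
    te_sig = padj_te < alpha
    if rna_sig and rpf_sig and not te_sig:
        return "forwarded"       # RPF change follows the RNA change
    if rpf_sig and te_sig and not rna_sig:
        return "exclusive"       # translation changes, RNA does not
    if rna_sig and te_sig:
        # Same direction amplifies the RNA change, opposite direction dampens it.
        return "intensified" if lfc_rna * lfc_te > 0 else "buffered"
    return "unclassified"

# Category counts reported in the text.
reported = {"forwarded": 2837, "exclusive": 279, "buffered": 392, "intensified": 199}
total_classified = sum(reported.values())
percentages = {k: round(100 * v / total_classified, 1) for k, v in reported.items()}
```

For example, a gene with significant and same-signed changes in RNA abundance and ΔTE is labelled intensified, whereas opposite signs give buffered.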

To determine the extent to which translation regulation plays a role in the CHO cell response to mildly hypothermic conditions, we further filtered the deltaTE output. For transcriptionally forwarded genes, we retained only those genes (n = 863) with ≥ |1.5| fold change between the TS and NTS samples for both the RNA and RPF data. For translationally regulated categories (translation exclusive, buffered, and intensified), only genes with a ≥ |1.5| change in ΔTE were retained (n = 357) (Supplementary Data 6). Both cohorts of genes were combined, and an overrepresentation analysis against the Gene Ontology (GO) was performed (Supplementary Data 7). We determined the proportion of translation-exclusive, buffered, and intensified genes contributing to the 56 enriched GO categories (FDR < 0.05). For 20 significantly enriched biological processes, > 25% of genes were found to be differentially translationally regulated (Fig. 5e). We identified 15 sORFs within the forwarded cohort where a ≥ |1.5| fold increase or decrease in RNA-seq and Ribo-seq was observed without a change in ΔTE (Fig. 5f, Supplementary Fig. 11). A further five sORFs had a ≥ |1.5| fold increase or decrease in ΔTE and were classified as buffered, intensified, or regulated exclusively through translation (Fig. 5g).

The abundance of CHO cell microproteins changes in response to mild hypothermia and between the exponential and stationary growth phases

We utilised proteomic mass spectrometry to determine if microproteins predicted to be encoded by sORFs were present in whole-cell lysates. These data allowed us to overcome the inherent limitations of the RNA-seq and Ribo-seq analyses and enabled the identification of microproteins from the uORF and ouORF classes and cases when a single non-coding RNA gene encodes multiple microproteins. We also sought to determine if microprotein abundance was altered upon reducing cell culture temperature (Fig. 6a). For this experiment, we again acquired cells from a non-temperature shifted control at 72 h post seeding (n = 3) and 24 h post-temperature shift (72 h post seeding) (n = 3) as well as a sample at 48 h post-temperature shift (96 h post seeding) (n = 3) (Supplementary Fig. 12a, Supplementary Data 1b). An additional proteomics experiment was performed for a second CHO cell line to assess if microproteins could be detected and if abundance was altered between the exponential and stationary phases of cell growth (Fig. 6b). Here, a non-mAb producing CHO-K1 GS cell line was cultured for 7 days; samples were acquired for proteomics at 96 h post-seeding when the cells were in exponential growth (n = 4) and at 168 h when the cells had entered stationary phase (n = 4) (Supplementary Fig. 12b, Supplementary Data 1c). Cell lysates from both proteomics experiments were subjected to an SP3 protein clean-up procedure and tryptic digestion before LC-MS/MS (Fig. 6c). The resulting MS data from each proteomics sample were searched in an identical manner to that of the drug product HCP data. Canonical proteins were identified using MetaMorpheus (protein-level FDR < 0.01, ≥ 2 peptides detected). For PepQuery, an index was constructed comprising > 2.9 million MS/MS spectra. The complete set of tryptic peptides from the reference proteins, along with those semi-tryptic peptides identified from a liberal MetaMorpheus database search (FDR < 10%) of the data from both proteomics experiments (n = 16,949), was utilised for the PepQuery known peptide set. Only those microproteins designated as confident by PepQuery in at least 50% of the replicates of a sample cohort were retained.
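The replicate-based retention rule can be illustrated with a short sketch. It assumes a microprotein is kept if it meets the 50% threshold in at least one sample cohort; the data structures and names below are hypothetical and do not reflect PepQuery's output format.

```python
def retain_confident(calls, cohorts, min_fraction=0.5):
    """Keep microproteins confidently identified in enough replicates.

    calls: {microprotein_id: set of sample names with a confident call}
    cohorts: {cohort_name: list of sample names in that cohort}
    A microprotein is retained if, in at least one cohort, the fraction of
    replicates with a confident call is >= min_fraction (an assumption about
    how the rule is applied across cohorts).
    """
    retained = set()
    for microprotein, samples in calls.items():
        for members in cohorts.values():
            hits = sum(1 for s in members if s in samples)
            if hits / len(members) >= min_fraction:
                retained.add(microprotein)
                break
    return retained


cohorts = {"NTS": ["n1", "n2", "n3"], "TS24": ["a1", "a2", "a3"]}
calls = {"mp1": {"n1", "n2"}, "mp2": {"a3"}, "mp3": {"n1", "a1"}}
print(sorted(retain_confident(calls, cohorts)))
```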

**Fig. 6: Proteomic analysis of CHO cell microproteins in response to mild hypothermia and at different cell culture growth phases.**

For the temperature shift proteomics experiment, 4737 canonical proteins were identified across the nine samples analysed by mass spectrometry (Fig. 6d, Supplementary Data 8a). 110 microproteins were detected from the uORF (n = 45), ouORF (n = 47) and New (n = 18) classes (Fig. 6f, Supplementary Data 8b). For the growth phase experiment, 5024 canonical proteins were identified along with 53 microproteins (Fig. 6e, Supplementary Data 8c) originating from the uORF (n = 28), ouORF (n = 19), and New (n = 6) classes (Fig. 6g, Supplementary Data 8d). Twenty-eight microproteins were detected in both the temperature shift and growth phase experiments (Fig. 6h), with two microproteins found in the CHO lysate samples and in antibody drug products (Fig. 6h).

The PSMs for confidently identified canonical proteins and microproteins for each experiment were merged, and FlashLFQ with match-between-runs enabled was used to generate LFQ values for each experiment. Only those canonical proteins and microproteins quantified in at least 50% of samples in a replicate cohort were retained for further analysis. The proDA algorithm⁶⁵ was used to identify proteins significantly altered (i.e., log₂ fold change of ≥ |1.2| and BH adjusted p-value < 0.05) between conditions for the temperature shift and growth phase experiments. Upon comparison of the 24 h and 48 h post-temperature shift samples to the non-temperature shifted control, 454 and 1117 canonical proteins were differentially expressed, respectively (Supplementary Data 9a, 9c). Of the 49 microproteins reliably quantified in this experiment, the abundance of 4 microproteins at 24 h and 9 microproteins at 48 h post-temperature shift was found to be altered (Fig. 6i, j, Supplementary Data 9b, 9d). In the second proteomics experiment, 1636 canonical proteins (Supplementary Data 9e) and 3 of the 10 quantified microproteins were found to be differentially expressed upon a comparison of the exponential and stationary phases of cell growth (Fig. 6k and Supplementary Data 9f).
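For readers less familiar with the significance criteria used here, the sketch below shows the Benjamini-Hochberg (BH) adjustment and a combined fold-change and adjusted p-value filter. It is a generic illustration and not the proDA implementation; the function names are hypothetical, and the default threshold simply reuses the value stated in the text.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def is_significant(log2_fc, padj, min_abs_log2fc=1.2, alpha=0.05):
    """Apply the fold-change and adjusted p-value thresholds together."""
    return abs(log2_fc) >= min_abs_log2fc and padj < alpha


padj = bh_adjust([0.01, 0.04, 0.03, 0.005])
print([round(p, 3) for p in padj])
```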

Discussion

Here, we present the findings of a ribosome footprint profiling experiment where translation initiation and elongation were captured at single nucleotide resolution in CHO cells. The utilisation of harringtonine to arrest translation resulted in an enrichment of RPFs at the TIS and enabled transcriptome-wide identification of 4491 novel ORFs. We found that the use of alternative initiation sites is widespread across the CHO cell transcriptome, with ~56% of all new ORFs identified beginning at non-AUG start codons. We also identified 526 N-terminal extensions of previously annotated protein-coding ORFs that begin at near-cognate start codons. Most novel annotations were sORFs predicted to encode microproteins located in the 5’ leader sequence of Chinese hamster mRNAs or in ncRNA transcripts. We have confirmed the existence of more than 170 putative microproteins from LC-MS/MS analyses of whole cell lysate and, importantly, in antibody drug products.

The identification of sORFs resulted in a more comprehensive proteomic database for mass spectrometry, enabling the enhanced assessment of HCP impurities in commercial antibody drug products. We identified 40 host-cell microproteins using LC-MS/MS across eight different drug products, 28 of which were found in more than one drug product. While microproteins were identified in each drug product analysed, we could only confidently quantitate a single microprotein found in the etanercept drug product, which was well below the median quantity observed for canonical HCPs. Most estimated microprotein concentrations (i.e., derived from 1 or 2 peptide identifications) were found to lie within the ppm range observed for canonical HCPs. For a microprotein in etanercept, the abundance was calculated from a single peptide and found to be particularly high, but it is important to note that any microprotein HCP concentration from < 3 detected peptides reported in this study should be considered as an estimate.

We wish to emphasise that we make no claims regarding any risk to the patient or impact on the efficacy of the therapeutic antibody products arising from the host cell microprotein impurities observed in this study. In fact, the safety and effectiveness of the more than 100 mAbs approved to date⁶⁶, the majority of which are manufactured in CHO cells, is compelling evidence that microproteins do not cause issues, if present, in approved drug products. Nevertheless, CHO cell microproteins are a new class of host cell impurity and future studies to evaluate if, in certain circumstances, these HCPs could elicit an immune response, affect mAb stability, or how they escape the purification process would be valuable for the industry. To facilitate these efforts, we have made the protein sequence database used for MS analysis freely available (see Data Availability section).

The improvement in HCP detection gained from this study is also an important tool for a range of process optimisation approaches focused on limiting these species in the final drug product⁶⁷. For instance, the HCP content in the final drug product can be influenced, in part, by the upstream process⁶⁸,⁶⁹. Here, we have provided evidence that microprotein abundance can be altered by a change in the bioreactor environment, in this case temperature. Analysis of microprotein expression during cell culture could help refine strategies that optimise the upstream process to reduce unwanted HCPs prior to harvest. Canonical HCPs are known to vary through different downstream unit operations, and examining the population at different stages facilitates the establishment of an effective purification process⁷. Our work enhances the ability to capture a more comprehensive picture of the HCP population at each unit operation, facilitating further optimisation. In recent years, several groups have also reported the use of cell line engineering to reduce or knock out problematic HCPs from CHO cell lines¹¹,⁷⁰,⁷¹. Through the annotation of thousands of sORFs in this study, we have considerably expanded the number of potential HCPs that are now amenable to genome editing approaches.

We believe our work also provides an essential foundation for future studies of non-canonical translation events and the control of protein synthesis in this important cell line. For instance, our work will be of utility to those researchers exploring approaches for the design of expression vectors for therapeutic antibody synthesis. The uORFs found in this study pave the way for the use of endogenous uORFs, in a similar fashion to previous reports on synthetic uORFs⁷² used to precisely control translation and/or the post-translational modifications of mAbs or indeed more complicated protein formats such as bispecific antibodies⁷³.

Another important aspect of this study is demonstrating the utility of analysing translation efficiency using Ribo-seq to enhance our understanding of CHO cell biology. Here, we have shown that the expression of 30% of genes altered following a decrease in cell culture temperature were translationally regulated. 11% of genes were regulated exclusively at the level of translation. We have also observed differences in the RNA expression and translation efficiency of sORFs encoded in transcripts previously annotated as non-coding in response to sub-physiological cell culture temperature. While AUG-initiated translation is thought to result in the highest rate of protein synthesis⁷⁴, it is also possible, as with other species⁷⁵, that non-AUG-initiated proteoforms could play important roles in bioprocess phenotypes. Therefore Ribo-seq has the potential to result in a more comprehensive understanding of CHO cell behaviour than is possible with RNA-seq alone and has the potential to enable the identification of new candidates for cell line engineering studies in the future.

While the work conducted in this study has allowed us to significantly expand the annotation, it is possible that there remain further undiscovered ORFs in the Chinese hamster genome. We recognise that our work is also potentially limited by the combined use of harringtonine and flash freezing, which likely led to residual elongating ribosomes and subsequent identification of potential false positive translation initiation sites. We eliminated those classes of ORFs liable to be affected (i.e., truncations) entirely from further analysis and conservatively filtered the remaining classes to limit false positive identifications (at the expense of potentially increasing the false negative rate). Future studies utilising Chinese hamster tissues and different CHO cell lines grown under various conditions producing a range of mAbs and other protein formats or focusing on translation events such as ribosomal frameshifting⁷⁶ could facilitate the identification of additional novel ORFs. In addition, performing Ribo-seq experiments with different translation inhibitors such as lactimidomycin or puromycin in the future could not only enable new ORFs to be identified but also allow quantitative comparison of CHO cell translation initiation in different conditions⁴⁷,⁷⁷.

As others have noted, detecting microproteins with mass spectrometry remains a challenge. In this study, we confirmed the existence of < 5% of predicted microproteins (173/3681). Only a small number of microproteins were quantified in the drug products or found to be differentially expressed in whole-cell lysate. Microprotein abundance and stability are thought to limit their detection. While we did not study stability, we found that the codon adaption index (CAI) for sORFs indicated that the expected abundance would generally be lower than for canonical ORFs. We observed a marginal correlation between RPF occupancy and microprotein detection (Supplementary Fig. 13), indicating that other factors are important for identification. We believe the most significant opportunities to increase detection rates remain sample preparation and mass spectrometry methods, as well as bioinformatics approaches to reduce false positives. Approaches such as data-independent acquisition (DIA)⁷⁸ combined with spectral libraries generated via machine learning⁷⁹ have shown considerable promise, and harnessing these advances will be important for future studies of CHO cell microproteins.

In conclusion, we have performed a series of Ribo-seq experiments with various translation inhibitors to examine translation elongation and, notably, translation initiation in CHO cells. This approach substantially enhanced the annotation of non-canonical ORFs in the Chinese hamster genome. We discovered novel translation events in previously annotated protein-coding genes and identified thousands of new short ORFs predicted to encode microproteins. Our findings allow improved MS-based HCP detection in antibody drug products. In addition, our discovery of novel ORFs presents new opportunities to understand CHO cell biology and provides a foundation for harnessing an enhanced understanding of protein synthesis in these cells to further improve manufacturing process efficiency and the quality of therapeutic proteins.

Methods

Cell culture

Cell lines

In this study, two Chinese hamster ovary (CHO) cell lines were used for Ribo-seq and proteomics. The first was a monoclonal antibody producing CHO-K1 cell line (CHO-K1 mAb) and the second a non-producing CHO-K1 GS cell line, provided by Pfizer Inc and Horizon Discovery, respectively. Both CHO cell lines used for this study were clonally derived and originated from the cell bank of each company.

Generation of samples for Ribo-seq and RNA-seq

The CHO-K1 mAb cell line was seeded at a density of 2 × 10⁵ cells/ml in 50 ml CHO-S-SFM-II media (Gibco, 12052098) in 8 replicate shake flasks in a Kuhner orbital shaker at 170 rpm at 5% CO₂. The cultures were grown at 37 °C for 48 h post-seeding, at which point the temperature of 4 of the shake flasks was reduced to 31 °C, while the remaining four shake flasks were maintained at 37 °C (Fig. 1a). Samples for library preparation were acquired 72 h post-seeding. The procedure was repeated in two separate experiments; the first was used to generate Ribo-seq and matched total RNA-seq libraries from cycloheximide-treated cells (8 samples), and the second to generate Ribo-seq libraries from harringtonine-treated (8 samples) and matched no drug-treated cells.

Generation of samples for proteomics

For proteomics analysis, two experiments were conducted. The first experiment utilized an identical cell culture model for the CHO-K1 mAb cell line. Here, three biological replicates (two technical replicates each) were acquired for the non-temperature shifted control, at 24 h post temperature shift, and at 48 h post temperature shift. The second proteomics experiment focussed on growth phases (stationary v exponential). A non-mAb producing CHO-K1 GS cell line was seeded at a density of 2 × 10⁵ cells/ml in 30 ml CD FortiCHO™ medium (Gibco, cat. no. A1148301) supplemented with 4 mM L-glutamine (L-Glutamine, cat. no. 25030024) in eight 250 ml Erlenmeyer shake flasks. The cultures were maintained at 37 °C, 170 rpm, 5% CO₂, and 80% humidity in a shaking incubator (Kuhner) for 4 or 7 days. Samples at Day 4 (four biological replicates) and at Day 7 (four biological replicates) were acquired and stored at −80 °C until analysis.

Ribosome footprint profiling

Translation Initiation sample preparation

At 72 hours post-seeding, cells were treated with harringtonine (2 µg/ml) (or DMSO) for 2 minutes at 31 °C or 37 °C. The cells were washed with ice-cold PBS supplemented with harringtonine or DMSO respectively and flash-frozen in liquid nitrogen. Frozen pellets were resuspended in 400 µl 1X Mammalian Polysome buffer (Illumina TruSeq Ribo Profile (mammalian) kit) prepared according to the manufacturer’s guidelines. Cell lysates were incubated on ice for 10 minutes, and then centrifuged at 18,000 × g for 10 minutes at 4°C to pellet cell debris. The supernatant was used for ribosome-protected fragment (RPF) isolation and library preparation.

Translation elongation sample preparation

Seventy-two hours post seeding, a total of 25 × 10⁶ cells (per replicate) were pelleted and resuspended in 20 ml of fresh CHO-S-SFMII media supplemented with cycloheximide at a final concentration of 0.1 mg/mL and incubated at 37°C or 31 °C for 10 min. Cells were subsequently pelleted, washed in 1 mL of ice-cold PBS containing 0.1 mg/mL of CHX, clarified, and lysed. Before the generation of ribosomal footprints, part of the lysate was used for total RNA extraction and RNA-seq library preparation with the TruSeq Ribo Profile (mammalian) kit (cat. no. RPHMR12126). RPFs were size selected on a gel, purified, and used for ribosome profiling library preparation with the Illumina TruSeq Ribo Profile (mammalian) kit.

Library preparation

To prepare RNA-seq and Ribo-seq libraries for sequencing, the TruSeq Ribo Profile (Mammalian) Kit (Illumina) (cat. no. RPHMR12126) was used in accordance with the manufacturer’s specifications. For Ribo-seq samples, RNase treatment was performed with 10 µl of TruSeq Ribo Profile Nuclease per 200 µl lysate at room temperature for 45 minutes with gentle shaking. Digestion was stopped with 15 µl SUPERaseIn (20 U/µl) (Ambion, cat. no. AM2696). Monosomes were isolated with size exclusion chromatography using the Illustra MicroSpin S-400 HR Columns (GE Life Sciences, cat. no. 27514001) according to the manufacturer’s instructions. Ribosomal RNA was removed with the RiboZero-Gold rRNA removal Kit (Illumina, cat. no. MRZG12324). Ribosome-protected fragments were size selected from a 15% denaturing urea polyacrylamide gel following electrophoresis (7 M urea, 19:1 acrylamide:bis-acrylamide). For the harringtonine and no-drug treated samples, a gel extraction step (from 15% denaturing gels) for the isolation of linker-ligated ribosome-protected fragments was added to the protocol after the linker ligation reaction, as described in Ingolia’s protocol⁸⁰, to avoid a high concentration of linker dimers contaminating the final library. Following reverse transcription, cDNA was extracted from 7.5% denaturing urea gels. PCR amplified libraries were purified from 8% polyacrylamide gels and subsequently analyzed with the Agilent High Sensitivity DNA assay (Agilent, Bioanalyzer).

Sequencing

The libraries for translation initiation and elongation analyses were sequenced on an Illumina NextSeq configured to yield 75 bp and 50 bp single-end reads, respectively.

Ribo-seq and RNA-seq data analysis

Pre-processing

Adapter sequences were trimmed from the Ribo-seq and RNA-seq datasets using Cutadapt v1.18⁸¹, and Trimmomatic v0.36⁸² was used to remove low-quality bases. To remove reads from contaminating RNAs from the Ribo-seq data, Chinese hamster rRNA, tRNA, and snoRNA sequences were downloaded from the RNAcentral v22 database⁴¹, and an individual bowtie v1.3.1⁴⁰ index was built for each type of RNA. The Ribo-seq reads were aligned against each index using the parameters: -v 2 -l 20 --norc. Reads that mapped to rRNA, tRNA, or snoRNA were discarded.

Read alignment

The pre-processed Ribo-seq and RNA-seq data were aligned to the NCBI CriGri-PICRH 1.0 genome and transcriptome (GCA_003668045.2)¹⁴ with STAR v2.7.8a⁴² using the following parameters: --outFilterMismatchNmax 2 --outFilterMultimapNmax 1 --outFilterMatchNmin 16 --alignEndsType EndToEnd.

Ribo-seq P-site offset identification and selection of RPFs

The P-site offset (the number of nucleotides between the 5’ end of a Ribo-seq read and the P-site of the ribosome footprint that was captured) was determined using Plastid v0.5.1⁴³ by first defining the genomic region around annotated Chinese hamster CDS using the Plastid metagene generate algorithm with default settings. The Plastid P-site tool was then used to assess the P-site for different read lengths around the expected mammalian RPF size (25-34nt) for CDSs with at least ten mapped reads at the start region. Following the determination of P-site offsets for each read length for the CHX, HARR, and ND Ribo-seq data, only those read lengths where ≥ 60% of the reads were found to have the expected triplet periodicity with a P-site offset of 12 nt were retained for further analysis.
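The read-length selection step can be sketched as follows. This is an illustrative interpretation rather than the Plastid implementation: it assumes that "expected triplet periodicity" means the P-site (5’ end plus 12 nt) falls in frame 0 of the annotated CDS, and all function names are hypothetical.

```python
P_SITE_OFFSET = 12  # nt from the 5' end of the read to the P-site


def frame_counts_by_length(reads, offset=P_SITE_OFFSET):
    """Tabulate P-site reading frames per read length.

    reads: iterable of (five_prime_pos, read_length), where five_prime_pos is
    0-based and relative to the first nucleotide of the start codon.
    """
    counts = {}
    for pos, length in reads:
        frame = (pos + offset) % 3
        counts.setdefault(length, [0, 0, 0])[frame] += 1
    return counts


def periodic_read_lengths(frame_counts, min_fraction=0.6):
    """Return read lengths with at least min_fraction of P-sites in frame 0."""
    kept = []
    for length, counts in sorted(frame_counts.items()):
        total = sum(counts)
        if total and counts[0] / total >= min_fraction:
            kept.append(length)
    return kept


reads = [(-12, 28)] * 7 + [(-11, 28)] * 3 + [(-12, 30)] * 5 + [(-10, 30)] * 5
print(periodic_read_lengths(frame_counts_by_length(reads)))
```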

ORF identification

The eight replicates from each Ribo-seq type were merged to increase sensitivity before the ORF-RATER pipeline⁴⁴ was used to identify ORFs in the Chinese hamster genome. Annotated pseudogenes were removed from the reference, and only those transcripts with a minimum of 64 mapped RPFs from the CHX and ND Ribo-seq data were considered for ORF identification. The ORF search was limited to NUG codons, with only the HARR Ribo-seq data used to identify the translation initiation sites. The CHX and ND RPFs were used to assess translation elongation for putative ORFs. Identified ORFs with an ORF-RATER score ≥ 0.5⁴⁵,⁴⁶ and length ≥ 5 aa were retained. Further filtering of ORF-RATER was performed via the removal of low-confidence ORFs necessitated by the nature of our Ribo-seq data. For the identified ORFs, we utilised the method of Lee et al.⁴⁷ to confirm the enrichment of the data generated via a translational initiation inhibitor (i.e., harringtonine) versus a matched elongation inhibitor control (i.e., no-drug) as follows:

$$R_{\mathrm{harr}} - R_{\mathrm{nd}} = \left(\frac{X_{\mathrm{harr}}}{N_{\mathrm{harr}}} \times 10\right) - \left(\frac{X_{\mathrm{nd}}}{N_{\mathrm{nd}}} \times 10\right)$$

Where $X$ is the number of RPFs occupying a 7 nt window around the ORF-RATER identified TIS, and $N$ is the total number of RPFs mapped to the transcript. These counts were performed using the GenomicRanges v1.54.1 Bioconductor R package⁸³. Where more than one novel ORF overlapped on the same transcript, we removed any near-cognate initiating ORFs when their TIS was located inside the 7 nt window of the AUG TIS. Where more than one AUG or non-AUG ORF initiated within the window, only the ORF with the maximum $R_{\mathrm{harr}} - R_{\mathrm{nd}}$ value was retained. When ORFs overlapped outside this window, near-cognate ORFs were removed if their $R_{\mathrm{harr}} - R_{\mathrm{nd}}$ value was less than five times that of the AUG-initiated ORF counterpart. For the final stage of filtering, the FLOSS score⁴⁸ was calculated with the ORFik v1.22.0 Bioconductor R package⁸⁴, and ORFs classified as extreme when compared to the FLOSS scores of annotated ORFs were removed. Visualisation of ORF coverage was accomplished using the Plastid make_wiggle algorithm for P-site offset coverage and deeptools v3.5.1⁸⁵ for full coverage. Where more than one coverage type was displayed in the same figure, the bins per million (BPM) value was scaled between 0 and 1.
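A worked sketch of the enrichment score may help make the calculation concrete. It assumes the 7 nt window is centred on the TIS (three nucleotides either side) and that RPFs are represented by their P-site positions; the function names are hypothetical, and the counting in the study was done with GenomicRanges rather than code like this.

```python
def count_in_window(p_site_positions, tis_position, half_width=3):
    """Count RPFs whose P-site lies within the window around the TIS.

    half_width=3 gives a 7 nt window centred on the TIS (an assumption).
    """
    return sum(1 for p in p_site_positions if abs(p - tis_position) <= half_width)


def tis_enrichment(x_harr, n_harr, x_nd, n_nd, scale=10):
    """Compute R_harr - R_nd from window counts (X) and transcript totals (N)."""
    r_harr = x_harr / n_harr * scale
    r_nd = x_nd / n_nd * scale
    return r_harr - r_nd


# Example: 30 of 100 harringtonine RPFs and 5 of 100 no-drug RPFs at the TIS.
print(tis_enrichment(x_harr=30, n_harr=100, x_nd=5, n_nd=100))
```

A positive value indicates that a larger share of the transcript's footprints sits at the candidate TIS in the harringtonine data than in the no-drug control.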

Codon adaption index

The 500 ORFs with the highest RPF occupancy were determined from the fragments per kilobase per million mapped reads (FPKM) value for the cycloheximide Ribo-seq data. These reference ORFs were used to determine the relative synonymous codon usage (RSCU). The RSCU was used to calculate the CAI with the cubar R package⁸⁶ for the Canonical ORFs and sORFs from the uORF, ouORF, and New ORF classes.
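The relationship between the reference set, codon weights, and CAI can be sketched in a few lines. This follows the standard definition of CAI (geometric mean of relative adaptiveness values) and is not the cubar implementation; the pseudocount and the exclusion of Met, Trp, and stop codons are assumptions made for this illustration.

```python
from itertools import product
from math import exp, log

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def relative_adaptiveness(reference_seqs, pseudocount=0.5):
    """Codon weights w = count / max count within each synonymous family.

    This ratio equals RSCU / max(RSCU) because the family-wise normalisation
    cancels. The pseudocount (an assumption) avoids zero weights.
    """
    counts = {codon: pseudocount for codon in CODON_TABLE}
    for seq in reference_seqs:
        for i in range(0, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in counts:
                counts[codon] += 1
    weights = {}
    for codon, aa in CODON_TABLE.items():
        if aa in "*MW":  # stop codons and single-codon amino acids
            continue
        family_max = max(counts[c] for c, a in CODON_TABLE.items() if a == aa)
        weights[codon] = counts[codon] / family_max
    return weights


def cai(seq, weights):
    """Geometric mean of codon weights over the scorable codons of seq."""
    logs = [log(weights[seq[i:i + 3]])
            for i in range(0, len(seq) - 2, 3) if seq[i:i + 3] in weights]
    return exp(sum(logs) / len(logs)) if logs else float("nan")


w = relative_adaptiveness(["CTGCTGCTGCTA"])
print(round(cai("CTGCTA", w), 3))
```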

Differential RNA expression, RPF count and differential translation analysis

To conduct gene-level differential translation analysis, the reference protein coding annotation was merged with selected ORFs found on non-coding RNAs. Before counting, Plastid cs generate was used to collapse transcripts that shared exons, remove regions comprised of more than one same-strand gene, and create position groups corresponding to exons, CDS, 5’ leader, and 3’UTR. Reads and RPFs aligning to the first 15 and last five codons of each CDS were discarded for ORFs ≥ 100 aa, while for those ORFs < 100 aa, the first and last codon counts were excluded²⁶. For differential translation analysis, we utilised the deltaTE method⁶⁴ using the RNA-seq counts as an interaction term within the DESeq2⁸⁷ model to enable the identification of changes in RPF density independently of RNA abundance. An absolute fold change ≥ |1.5| and BH adjusted p-value < 0.05 were considered significant for all analyses.
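The codon-trimming rule and the fold-change threshold described above can be written out explicitly. The sketch below is illustrative only; it assumes codons are indexed from 0 over the coding codons of the ORF, and the function names are hypothetical.

```python
from math import log2


def countable_codon_range(orf_length_aa):
    """Return (start, end) codon indices kept for counting, end exclusive.

    ORFs >= 100 aa: drop the first 15 and last 5 codons.
    ORFs < 100 aa: drop only the first and last codon.
    """
    if orf_length_aa >= 100:
        return 15, orf_length_aa - 5
    return 1, orf_length_aa - 1


def passes_thresholds(log2_fc, padj, min_fold=1.5, alpha=0.05):
    """Absolute fold change >= min_fold and BH adjusted p-value < alpha."""
    return abs(log2_fc) >= log2(min_fold) and padj < alpha


print(countable_codon_range(120), countable_codon_range(50))
```

Trimming the ends of each ORF reduces the influence of footprints that accumulate at initiation and termination sites, which is why the shorter sORFs are trimmed less aggressively than long ORFs.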

Enrichment analysis

The overrepresentation of GO biological processes in differentially expressed and/or differentially translated gene lists was assessed with the R WebGestaltR package⁸⁸. Where no gene symbol was available, the Chinese hamster gene name was mapped to the NCBI Mus musculus GRCm39 annotation, and the corresponding mouse gene symbol was used. GO biological processes with a BH-adjusted p-value of <0.05 were considered significant.
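Overrepresentation analysis is commonly based on a one-sided hypergeometric test, and the sketch below illustrates that calculation for a single GO term. Whether WebGestaltR was configured in exactly this way is not stated here, so this should be read as a generic example; the function name and the example numbers are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(k, n, K, N):
    """P(X >= k) for a hypergeometric draw.

    N: genes in the background
    K: background genes annotated to the GO term
    n: genes in the list of interest
    k: genes in the list annotated to the GO term
    """
    upper = min(n, K)
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Hypothetical example: 8 of 50 listed genes hit a term covering 100 of 5000 genes.
print(hypergeom_enrichment_p(k=8, n=50, K=100, N=5000))
```

The resulting p-values for all tested GO terms would then be adjusted for multiple testing (BH) before applying the 0.05 cut-off.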

CHO cell lysate proteomics

Sample preparation for reversed phase liquid chromatography-tandem mass spectrometry (RPLC-MS/MS)

Samples obtained from the temperature shift and growth phase experiments described above were prepared for proteomics using a semi-automated version of the SP3 protocol⁵². Briefly, CHO cells were pelleted via centrifugation at 300 × g for 5 mins. Following a wash step with 1 × PBS, cells were lysed using 1 × RIPA buffer (Cell Signaling Technology, Dublin, Ireland) containing 1 × protease inhibitor (cOmplete™, Mini, EDTA-free Protease Inhibitor Cocktail, Sigma, Wicklow, Ireland) followed by sonication. After removal of cell debris via centrifugation at 14,000 × g for 10 mins and resuspension in PBS containing 1 × protease inhibitor, the protein concentration was determined using the Pierce™ 660 nm Protein Assay Kit (Thermo Fisher Scientific, Dublin, Ireland). Protein reduction was accomplished via the addition of 5 mM 1,4-dithiothreitol to a sample aliquot containing 50 µg of protein followed by incubation at 56 °C for 30 mins. Alkylation of proteins was achieved by incubating the sample for 30 mins with 10 mM iodoacetamide at room temperature. Upon completion of this step, 50% (v/v) ethanol was added to the samples prior to purification with Sera-Mag Carboxylate SpeedBeads (Cytiva, Buckinghamshire, UK) according to the SP3 protocol⁵². Following recovery of the beads, proteins were digested in 50 mM ammonium bicarbonate (Sigma) containing 1 µg trypsin (Promega, Madison, WI, USA) for 4 h at 37 °C⁸⁹. The digested samples were acidified by adding 0.1% (v/v) formic acid before LC-MS/MS analysis. Note: An identical sample preparation procedure was carried out for HCP analysis of mAb drug products.

Temperature shift experiment RPLC-MS/MS analysis

CHO-K1 mAb cell samples from the temperature shift experiment were analysed using an Orbitrap Exploris™ 480 mass spectrometer (Thermo Fisher Scientific, Bremen, Germany) online hyphenated to an UltiMate™ 3000 RSLCnano system using an EASY-Spray™ source. 1 µg per sample was loaded onto a C18 Nano-Trap column followed by separation using an EASY-Spray Acclaim PepMap 100, 75 µm × 50 cm column maintained at 45.0 °C at a flow rate of 250.0 nL/min. Separation was achieved using a gradient of (A) 0.10% (v/v) formic acid in water and (B) 0.10% (v/v) formic acid in acetonitrile. Gradient conditions were as follows: 2–25% B in 120 min, followed by another increase to 45% B in 30 min. The separation was followed by two wash steps at 80% B for 5 min, and the column was re-equilibrated at 5% B for 15 min.

MS detection was carried out in centroid positive ion mode. First, full scans were acquired at a resolution setting of 60,000 (at m/z 200) with a scan range of m/z 200–2000. The normalised automatic gain control (AGC) target was set to 100% with a maximum IT of 50 ms. The 20 most abundant precursor ions were selected for HCD fragmentation using a normalised collision energy of 28%. For the isolation of precursor ions, an isolation window of 1.2 m/z and an intensity threshold of 5.0e3 was used. Fragment scans were acquired using a resolution setting of 15,000 (at m/z 200) with an AGC target of 50% and a maximum IT of 70 ms. Unassigned charge states and charge states > 7 were excluded from fragmentation. A dynamic exclusion was used for 45 seconds with a tolerance of ± 5 ppm.

Growth phase experiment RPLC-MS/MS analysis

For the growth phase experiment, mass spectrometric analysis was performed using an Orbitrap Eclipse™ Tribrid™ mass spectrometer (Thermo Fisher Scientific, Bremen, Germany) coupled to an UltiMate™ 3000 RSLCnano system using an EASY-Spray™ source (Thermo Fisher Scientific, Germering, Germany). 2 µg per sample were loaded onto a C18 Nano-Trap Column followed by separation using an EASY-Spray Acclaim PepMap 100, 75 µm × 50 cm column (Thermo Fisher Scientific, Sunnyvale, CA, United States) maintained at 45.0 °C at a flow rate of 250.0 nL/min. Separation was achieved using a gradient of (A) 0.10% (v/v) formic acid in water and (B) 0.10% (v/v) formic acid in acetonitrile (LC-MS optima, Fisher Scientific). Gradient conditions were as follows: 5% B for 5 min, followed by a linear gradient of 5–25% in 95 min, followed by another increase to 35% B in 20 min. The separation was followed by two wash steps at 90% B for 5 min, and the column was re-equilibrated at 5% B for 15 min.

MS analysis was performed in positive ion mode. Full scans were acquired in the Orbitrap at a resolution setting of 120,000 (at m/z 200) with a scan range of m/z 200–2000 using a normalised AGC target of 100% and an automatic maximum injection time in centroid mode. Using a 3 s cycle time, ions were selected for HCD fragmentation using a collision energy setting of 28%. Fragment scans were acquired in the Orbitrap using a resolution setting of 30,000 (at m/z 200). The AGC target was set to 200%. For isolation of precursor ions, an isolation window of 1.2 m/z was used. An intensity threshold of 5.0e4 was applied while unassigned charge states, as well as charges > 6, were excluded. The dynamic exclusion time was set to 60 s with ± 5 ppm tolerance.

Host cell protein analysis

RPLC-MS/MS analysis of drug products

Following tryptic digestion, as described above, HCP analysis was conducted for adalimumab, denosumab, etanercept, nivolumab, pertuzumab and vedolizumab commercial drug products acquired from Evidentic (Berlin, Germany). Six technical replicates were acquired for each drug product (two separate aliquots of material were taken from a vial, and HCP analysis was performed in triplicate for each aliquot). Mass spectrometry was carried out using an Orbitrap Exploris™ 480 mass spectrometer coupled to a Vanquish™ Neo UHPLC system (Thermo Fisher Scientific, Germering, Germany) using an EASY-Spray™ source. LC-MS/MS analysis was performed in two steps: First, 100 ng/sample was analysed to generate an exclusion list containing drug product derived peptides. Second, replicate 3 µg samples were analysed using the corresponding exclusion lists to facilitate HCP detection. For quantitation, Hi3 E. coli (Waters, Milford, MA, USA) standard was added to reach a final concentration of 50 fmol/µg of protein injected.
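To illustrate how a spiked standard of known amount can be converted into an HCP concentration, the sketch below applies the general "top three peptides" idea: the mean intensity of the most intense peptides of an HCP is compared with that of the standard. This is an assumption-laden illustration and not the quantitation workflow used in this study; the function names, the use of ppm as ng HCP per mg of drug product, and the example values are all hypothetical.

```python
def top_n_mean(intensities, n=3):
    """Mean of the n most intense peptide signals (fewer if fewer exist)."""
    top = sorted(intensities, reverse=True)[:n]
    return sum(top) / len(top)


def hcp_ppm(hcp_peptide_intensities, standard_peptide_intensities,
            hcp_mass_da, standard_fmol_per_ug=50.0):
    """Estimate HCP level in ppm (ng HCP per mg of drug product).

    standard_fmol_per_ug: amount of spiked standard per ug of protein injected.
    hcp_mass_da: molecular mass of the HCP in Da (g/mol).
    """
    ratio = top_n_mean(hcp_peptide_intensities) / top_n_mean(standard_peptide_intensities)
    hcp_fmol_per_ug = ratio * standard_fmol_per_ug
    # fmol x g/mol = 1e-15 g = 1e-6 ng
    hcp_ng_per_ug = hcp_fmol_per_ug * hcp_mass_da * 1e-6
    return hcp_ng_per_ug * 1000.0  # ng per mg


print(hcp_ppm([100, 100, 100], [1000, 1000, 1000], hcp_mass_da=10_000))
```

When fewer than three peptides are available, as is typical for microproteins, the mean is taken over the peptides present, which is one reason such values are best treated as estimates.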

Using the pressure-driven injection mode, either 100 ng or 3 µg of sample was loaded onto a C18 Nano-Trap column followed by separation using a 50 cm × 75 µm EASY-Spray™ PepMap™ Neo UHPLC column (Thermo Fisher Scientific, Sunnyvale, CA, USA). For separation, a linear gradient of 2–25% B (0.1% formic acid (v/v) in acetonitrile) over 60 min followed by another increase to 45% B in 30 min was used. Separation was followed by two wash steps at 80% B before column re-equilibration using 3 column volumes. The flow rate was 250 nL/min, and the column temperature was maintained at 45.0 °C.

Data-dependent MS/MS analysis was performed in positive ion mode. First, full scans were acquired covering a scan range of m/z 200–2000 using a resolution setting of 60,000 (at m/z 200). The AGC target was set to 100% with a maximum IT of 50 ms. The 20 most abundant ions were selected for fragmentation using an HCD collision energy of 28%. An isolation window of 1.2 m/z was used. MS2 scans were acquired using a resolution setting of 15,000 (at m/z 200), with the AGC target set to 50% and a maximum IT of 70 ms. Only charge states of +2 to +7 were used for fragmentation. An intensity threshold of 5.0e3 was applied. Dynamic exclusion was used for 45 s with a ± 5 ppm tolerance. For targeted mass exclusion of drug product derived peptides, a retention time window of 5 min and a tolerance of 10 ppm were allowed.
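The targeted mass exclusion step can be pictured as a lookup against the exclusion list built from the 100 ng run. The following is a minimal sketch, not the instrument's implementation: the names `ExclusionEntry`, `ppm_error` and `is_excluded` are hypothetical, and the 5 min retention time window is assumed to be centred on the recorded retention time (± 2.5 min), which the text does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExclusionEntry:
    mz: float      # m/z of a drug product derived peptide ion
    rt_min: float  # retention time recorded in the 100 ng run (minutes)


def ppm_error(observed_mz, reference_mz):
    """Signed mass error of observed_mz relative to reference_mz, in ppm."""
    return (observed_mz - reference_mz) / reference_mz * 1e6


def is_excluded(mz, rt_min, entries, ppm_tol=10.0, rt_window_min=5.0):
    """True if a precursor matches any exclusion entry in both m/z and time.

    Assumption: the retention time window is centred on the entry,
    i.e. +/- rt_window_min / 2 around the recorded retention time.
    """
    half_window = rt_window_min / 2.0
    return any(
        abs(ppm_error(mz, e.mz)) <= ppm_tol
        and abs(rt_min - e.rt_min) <= half_window
        for e in entries
    )
```

A precursor that matches an entry in m/z but elutes outside the window would still be eligible for fragmentation under this model.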

Mass spectrometry data analysis

Canonical protein identification

MetaMorpheus⁵⁵ v1.0.5 was used to conduct a database search of the data from each drug product and whole cell lysate experiment. The following search settings were used: protease = trypsin; maximum missed cleavages = 2; minimum peptide length = 7; maximum peptide length = 45; initiator methionine behaviour = Variable; fixed modifications = Carbamidomethyl on C (except for those Pythoud et al. samples where native digestion preparation methods were used⁵⁴); variable modifications = Oxidation on M, protein N-term acetylation; max mods per peptide = 2; precursor mass tolerance = ±10 ppm; product mass tolerance = ±0.05 Da. For each drug product search, the database comprised the Chinese hamster reference proteome (n = 23,887, downloaded from UniProt on 27/03/2024), contaminant sequences, the respective recombinant antibody protein sequences, and any quantitation or indexed retention time standards added to a given sample. Only the UniProt reference proteome and contaminant sequences were used for the whole cell lysate samples. Canonical proteins with ≥ 2 peptides detected and a protein-level FDR < 0.01 were considered confidently identified.

Microprotein identification

Microproteins found using Ribo-seq were identified from mass spectrometry data using PepQuery v2.0.2. Two distinct searches were performed, one for all drug product HCP data and the other for whole cell lysate proteomic data. The mass-calibrated mzML files generated by the MetaMorpheus canonical protein searches were converted to MGF files using ProteoWizard⁹⁰, and PepQuery indexes were constructed from these data.

For the PepQuery search, the chainsaw algorithm from ProteoWizard⁹⁰ was used to perform a semi-tryptic in-silico digestion of the putative microproteins, allowing 2 missed cleavages, a minimum peptide length of 7 aa and a maximum peptide length of 45 aa. The known proteome comprised UniProt Chinese hamster proteins, contaminants, antibody sequences, and quantitation standards. First, a fully tryptic in-silico digestion of the known proteome was performed with chainsaw using the same missed cleavage and peptide length settings. In addition, we performed a MetaMorpheus search of the lysate and drug product data with semi-tryptic specificity; the other MetaMorpheus parameters were identical to the canonical protein search described above. Semi-tryptic peptides identified by MetaMorpheus with a peptide Q value < 10% for drug product and lysate data were combined with the fully tryptic peptides in the PepQuery known protein set.
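To make the digestion settings concrete, the sketch below enumerates fully tryptic and semi-tryptic peptides for a single sequence. It is a simplified stand-in, not the chainsaw algorithm itself: the function names are hypothetical, and cleavage is assumed to occur after every K or R with no proline exception, which may differ from the rule chainsaw applies.

```python
def tryptic_peptides(protein, missed_cleavages=2, min_len=7, max_len=45):
    """Fully tryptic peptides of `protein` within the length limits.

    Assumption: cleavage after every K or R (no proline rule).
    """
    # Cleavage positions: both termini plus the position after each internal K/R.
    sites = [0]
    sites += [i + 1 for i, aa in enumerate(protein[:-1]) if aa in "KR"]
    sites.append(len(protein))
    peptides = set()
    for i in range(len(sites) - 1):
        last = min(i + 1 + missed_cleavages, len(sites) - 1)
        for j in range(i + 1, last + 1):
            pep = protein[sites[i]:sites[j]]
            if min_len <= len(pep) <= max_len:
                peptides.add(pep)
    return peptides


def semi_tryptic_peptides(protein, missed_cleavages=2, min_len=7, max_len=45):
    """Peptides with at least one tryptic terminus, within the length limits."""
    # Parents are enumerated without length limits so that a short
    # semi-tryptic peptide can derive from a long tryptic parent.
    parents = tryptic_peptides(protein, missed_cleavages, 1, len(protein))
    peptides = set()
    for parent in parents:
        for n in range(min_len, min(max_len, len(parent)) + 1):
            peptides.add(parent[:n])   # tryptic N-terminus, ragged C-terminus
            peptides.add(parent[-n:])  # ragged N-terminus, tryptic C-terminus
    return peptides
```

For example, `tryptic_peptides("MKAAAAAAARCCCCCCCK", missed_cleavages=0)` yields the two 8-residue peptides and drops the 2-residue `MK` fragment as too short. The fully tryptic set is always a subset of the semi-tryptic set under this model.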

Microprotein peptides were searched against the PepQuery indexes with the following parameters: precursor mass tolerance = ±10 ppm; product mass tolerance = ±0.05 Da; fixed modifications = Carbamidomethyl on C (except for Pythoud et al. samples where native digestion was used⁵⁴); variable modifications = Oxidation on M. PepQuery does not permit protein N-term acetylation as a variable modification, so we carried out separate searches of the drug product and lysate data with the N-terminal peptide of each microprotein, using the same parameters as detailed above except with peptide N-term acetylation enabled. Peptides from microproteins were considered identified if PepQuery confident = Yes (i.e., a p-value < 0.05 if the peptide length ≤ 8 aa, or < 0.01 for peptides > 8 aa) and the peptide was detected in at least 50% of replicates for the respective sample cohort.
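The acceptance rule in the last sentence can be expressed as a short predicate. This is a sketch of the stated thresholds only, with hypothetical function names. It does not reproduce PepQuery's internal validation, and it assumes each replicate contributes either one p-value or `None` when the peptide was not matched.

```python
def passes_pvalue(peptide, p_value):
    """Length-dependent p-value cut-off as described in the text."""
    threshold = 0.05 if len(peptide) <= 8 else 0.01
    return p_value < threshold


def is_identified(peptide, p_values_by_replicate, min_fraction=0.5):
    """True if the peptide is confident in at least min_fraction of replicates.

    p_values_by_replicate holds one entry per replicate in the cohort:
    a p-value, or None where the peptide was not matched (assumption).
    """
    if not p_values_by_replicate:
        return False
    confident = sum(
        1 for p in p_values_by_replicate
        if p is not None and passes_pvalue(peptide, p)
    )
    return confident / len(p_values_by_replicate) >= min_fraction
```

With six replicates per drug product, a peptide would need a confident match in at least three of them under this rule.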

Label free quantitation

FlashLFQ⁹¹ was used to perform label-free quantitation for the drug product and whole cell lysate data. Mass tolerance was set to ±10 ppm, and Match between runs (MBR) was enabled with the maximum retention time window specified as 0.7 minutes⁹². A protein was retained for differential abundance analysis if at least 50% of samples within one replicate cohort had non-zero LFQ values.
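The retention criterion for differential abundance analysis can be sketched as follows. The function name and the input layout (a mapping from cohort name to per-sample LFQ values, with zero meaning missing) are assumptions made for illustration, not FlashLFQ's output format.

```python
def retain_protein(lfq_by_cohort, min_fraction=0.5):
    """True if at least one cohort has >= min_fraction non-zero LFQ values.

    lfq_by_cohort maps a cohort name to the protein's LFQ intensities
    across that cohort's samples; 0 denotes a missing value (assumption).
    """
    return any(
        len(values) > 0
        and sum(1 for v in values if v > 0) / len(values) >= min_fraction
        for values in lfq_by_cohort.values()
    )
```

A protein quantified in two of four samples in one cohort is kept even if it is absent from every sample in the other cohort.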

Identification of differentially expressed proteins

LFQ data were log2 transformed and normalised on total peptide amount, and the proDA algorithm⁶⁵ was then utilised to fit a probabilistic dropout model to these data prior to differential expression analysis. Proteins with an absolute fold change ≥ 1.2 and a Benjamini-Hochberg (BH) adjusted p-value < 0.05 were considered differentially expressed.
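The final thresholding step can be illustrated without the proDA model itself. The sketch below assumes that per-protein log2 fold changes and raw p-values are already available, applies the standard Benjamini-Hochberg adjustment, and interprets the fold change cut-off as 1.2-fold in either direction. The function names are hypothetical.

```python
import math


def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def is_differentially_expressed(log2_fc, adj_p, fc_cutoff=1.2, alpha=0.05):
    """Assumption: the cut-off applies to up- and down-regulation alike."""
    return abs(log2_fc) >= math.log2(fc_cutoff) and adj_p < alpha
```

On the log2 scale, a 1.2-fold change corresponds to roughly ±0.26, so the fold change filter is permissive and the adjusted p-value does most of the selection.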

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

The Ribo-seq and RNA-seq data from the harringtonine, cycloheximide and no-drug treated cells have been deposited in the Sequence Read Archive (SRA) with accession code PRJNA778050. The mass spectrometry data have been deposited to the ProteomeXchange Consortium via the PRIDE partner repository with the following dataset identifiers: PXD053529 (drug product), PXD053492 (temperature shift proteomics), and PXD053511 (growth phase proteomics). The Pythoud et al. mass spectrometry data were obtained from PRIDE, accession ID: PXD019668. The microprotein sequence database used for HCP and proteomics analysis is available at https://doi.org/10.5281/zenodo.12609363. Source data are provided with this paper.

Code availability

The code required to reproduce the results presented in this manuscript⁹³ is available at https://github.com/clarke-lab/tzani_microprotein_manuscript.

References

Walsh, G. & Walsh, E. Biopharmaceutical benchmarks 2022. Nat. Biotechnol. 40, 1722–1760 (2022).
Hanania, N. A. et al. Lebrikizumab in moderate-to-severe asthma: pooled data from two randomised placebo-controlled studies. Thorax 70, 748–756 (2015).
Li, X. et al. Identification and characterization of a residual host cell protein hexosaminidase B associated with N-glycan degradation during the stability study of a therapeutic recombinant monoclonal antibody product. Biotechnol. Prog. 37, e3128 (2021).
Luo, H. et al. Cathepsin L causes proteolytic cleavage of Chinese-hamster-ovary cell expressed proteins during processing and storage: identification, characterization, and mitigation. Biotechnol. Prog. 35, e2732 (2019).
Bracewell, D. G., Francis, R. & Smales, C. M. The future of host cell protein (HCP) identification during process development and manufacturing linked to a risk-based management for their control. Biotechnol. Bioeng. 112, 1727–1737 (2015).
Zhu-Shimoni, J. et al. Host cell protein testing by ELISAs and the use of orthogonal methods. Biotechnol. Bioeng. 111, 2367–2379 (2014).
Pilely, K. et al. Monitoring process-related impurities in biologics–host cell protein analysis. Anal. Bioanal. Chem. 414, 747–758 (2022).
Henry, S. M., Sutlief, E., Salas-Solano, O. & Valliere-Douglass, J. ELISA reagent coverage evaluation by affinity purification tandem mass spectrometry. mAbs 9, 1065–1075 (2017).
Huang, Y., Molden, R., Hu, M., Qiu, H. & Li, N. Toward unbiased identification and comparative quantification of host cell protein impurities by automated iterative LC–MS/MS (HCP-AIMS) for therapeutic protein development. J. Pharm. Biomed. Anal. 200, 114069 (2021).
Goey, C. H., Bell, D. & Kontoravdi, C. Mild hypothermic culture conditions affect residual host cell protein composition post-Protein A chromatography. mAbs 10, 476–487 (2018).
Chiu, J. et al. Knockout of a difficult-to-remove CHO host cell protein, lipoprotein lipase, for improved polysorbate stability in monoclonal antibody formulations. Biotechnol. Bioeng. 114, 1006–1015 (2017).
Xu, X. et al. The genomic sequence of the Chinese hamster ovary (CHO)-K1 cell line. Nat. Biotechnol. 29, 735–741 (2011).
Meleady, P. et al. Utilization and evaluation of CHO-specific sequence databases for mass spectrometry based proteomics. Biotechnol. Bioeng. 109, 1386–1394 (2012).
Hilliard, W., MacDonald, M. L. & Lee, K. H. Chromosome-scale scaffolds for the Chinese hamster reference genome assembly to facilitate the study of the CHO epigenome. Biotechnol. Bioeng. 117, 2331–2339 (2020).
Li, S. et al. Proteogenomic annotation of the Chinese hamster reveals extensive novel translation events and endogenous retroviral elements. J. Proteome Res. 18, 2433–2445 (2019).
Ingolia, N. T., Ghaemmaghami, S., Newman, J. R. S. & Weissman, J. S. Genome-wide analysis in vivo of translation with nucleotide resolution using ribosome profiling. Science 324, 218–223 (2009).
Wright, B. W., Yi, Z., Weissman, J. S. & Chen, J. The dark proteome: translation from noncanonical open reading frames. Trends Cell Biol. https://doi.org/10.1016/j.tcb.2021.10.010 (2021).
Mudge, J. M. et al. Standardized annotation of translated open reading frames. Nat. Biotechnol. 40, 994–999 (2022).
Ivanov, I. P., Firth, A. E., Michel, A. M., Atkins, J. F. & Baranov, P. V. Identification of evolutionarily conserved non-AUG-initiated N-terminal extensions in human coding sequences. Nucleic Acids Res. 39, 4220–4234 (2011).
Ji, Z., Song, R., Regev, A. & Struhl, K. Many lncRNAs, 5’UTRs, and pseudogenes are translated and some are likely to express functional proteins. eLife 4, e08890 (2015).
Zhang, H. et al. Determinants of genome-wide distribution and evolution of uORFs in eukaryotes. Nat. Commun. 12, 1076 (2021).
Aspden, J. L. et al. Extensive translation of small open reading frames revealed by Poly-Ribo-Seq. eLife 3, e03528 (2014).
Bazzini, A. A. et al. Identification of small ORFs in vertebrates using ribosome footprinting and evolutionary conservation. EMBO J. 33, 981–993 (2014).
Ingolia, N. T., Lareau, L. F. & Weissman, J. S. Ribosome profiling of mouse embryonic stem cells reveals the complexity and dynamics of mammalian proteomes. Cell 147, 789–802 (2011).
Chen, J. et al. Pervasive functional translation of noncanonical human open reading frames. Science 367, 1140–1146 (2020).
Martinez, T. F. et al. Accurate annotation of human protein-coding small open reading frames. Nat. Chem. Biol. 16, 458–468 (2020).
Zhang, S. et al. Mitochondrial peptide BRAWNIN is essential for vertebrate respiratory complex III assembly. Nat. Commun. 11, 1312 (2020).
Rathore, A. et al. MIEF1 microprotein regulates mitochondrial translation. Biochemistry 57, 5564–5575 (2018).
Lee, C. et al. The mitochondrial-derived peptide MOTS-c promotes metabolic homeostasis and reduces obesity and insulin resistance. Cell Metab. 21, 443–454 (2015).
Slavoff, S. A., Heo, J., Budnik, B. A., Hanakahi, L. A. & Saghatelian, A. A human short open reading frame (sORF)-encoded polypeptide that stimulates DNA end joining. J. Biol. Chem. 289, 10950–10957 (2014).
Koh, M. et al. A short ORF-encoded transcriptional regulator. Proc. Natl Acad. Sci. USA 118, e2021943118 (2021).
Kuo, C.-C. et al. The emerging role of systems biology for engineering protein production in CHO cells. Curr. Opin. Biotechnol. 51, 64–69 (2018).
Donaldson, J., Kleinjan, D.-J. & Rosser, S. Synthetic biology approaches for dynamic CHO cell engineering. Curr. Opin. Biotechnol. 78, 102806 (2022).
Kallehauge, T. B. et al. Ribosome profiling-guided depletion of an mRNA increases cell growth rate and protein secretion. Sci. Rep. 7, 40388 (2017).
Masterton, R. J. & Smales, C. M. The impact of process temperature on mammalian cell lines and the implications for the production of recombinant proteins in CHO cells. Pharm. Bioprocess. 2, 49–61 (2014).
Tzani, I. et al. Subphysiological temperature induces pervasive alternative splicing in Chinese hamster ovary cells. Biotechnol. Bioeng. 117, 2489–2503 (2020).
Goey, C. H., Tsang, J. M. H., Bell, D. & Kontoravdi, C. Cascading effect in bioprocessing-The impact of mild hypothermia on CHO cell behavior and host cell protein composition. Biotechnol. Bioeng. 114, 2771–2781 (2017).
Jin, M., Szapiel, N., Zhang, J., Hickey, J. & Ghose, S. Profiling of host cell proteins by two-dimensional difference gel electrophoresis (2D-DIGE): Implications for downstream process development. Biotechnol. Bioeng. 105, 306–316 (2010).
Tait, A. S., Tarrant, R. D. R., Velez-Suberbie, M. L., Spencer, D. I. R. & Bracewell, D. G. Differential response in downstream processing of CHO cells grown under mild hypothermic conditions. Biotechnol. Prog. 29, 688–696 (2013).
Langmead, B., Trapnell, C., Pop, M. & Salzberg, S. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10, R25 (2009).
RNAcentral Consortium RNAcentral 2021: secondary structure integration, improved sequence search and new member databases. Nucleic Acids Res. 49, D212–D220 (2021).
Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013).
Dunn, J. G. & Weissman, J. S. Plastid: nucleotide-resolution analysis of next-generation sequencing and genomics data. BMC Genomics 17, 958 (2016).
Fields, A. P. et al. A regression-based analysis of ribosome-profiling data reveals a conserved complexity to mammalian translation. Mol. Cell 60, 816–827 (2015).
Eisenberg, A. R. et al. Translation initiation site profiling reveals widespread synthesis of non-AUG-initiated protein isoforms in yeast. Cell Syst. 11, 145–160.e5 (2020).
Finkel, Y. et al. The coding capacity of SARS-CoV-2. Nature 589, 125–130 (2021).
Lee, S. et al. Global mapping of translation initiation sites in mammalian cells at single-nucleotide resolution. Proc. Natl Acad. Sci. USA 109, E2424–E2432 (2012).
Ingolia, N. T. et al. Ribosome profiling reveals pervasive translation outside of annotated protein-coding genes. Cell Rep. 8, 1365–1379 (2014).
Manske, F. et al. The new uORFdb: integrating literature, sequence, and variation data in a central hub for uORF research. Nucleic Acids Res. 51, D328–D336 (2023).
Olexiouk, V., Van Criekinge, W. & Menschaert, G. An update on sORFs.org: a repository of small ORFs identified by ribosome profiling. Nucleic Acids Res. 46, D497–D502 (2018).
Sharp, P. M. & Li, W. H. The codon Adaptation Index−a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res. 15, 1281–1295 (1987).
Hughes, C. S. et al. Single-pot, solid-phase-enhanced sample preparation for proteomics experiments. Nat. Protoc. 14, 68–85 (2019).
Strasser, L. et al. Detection and quantitation of host cell proteins in monoclonal antibody drug products using automated sample preparation and data-independent acquisition LC-MS/MS. J. Pharm. Anal. 11, 726–731 (2021).
Pythoud, N. et al. Optimized sample preparation and data processing of data-independent acquisition methods for the robust quantification of trace-level host cell protein impurities in antibody drug products. J. Proteome Res. 20, 923–931 (2021).
Solntsev, S. K., Shortreed, M. R., Frey, B. L. & Smith, L. M. Enhanced global post-translational modification discovery with MetaMorpheus. J. Proteome Res. 17, 1844–1851 (2018).
Füssl, F. et al. Comprehensive characterisation of the heterogeneity of adalimumab via charge variant analysis hyphenated on-line to native high resolution Orbitrap mass spectrometry. mAbs 11, 116–128 (2019).
Zhang, Q. et al. Comprehensive tracking of host cell proteins during monoclonal antibody purifications using mass spectrometry. mAbs 6, 659–670 (2014).
Wen, B. & Zhang, B. PepQuery enables fast, accurate, and convenient proteomic validation of novel genomic alterations. Genome Res. 29, 485–493 (2019).
Wen, B. & Zhang, B. PepQuery2 democratizes public MS proteomics data for rapid peptide searching. Nat. Commun. 14, 2213 (2023).
Cao, X. et al. Comparative proteomic profiling of unannotated microproteins and alternative proteins in human cell lines. J. Proteome Res. 19, 3418–3426 (2020).
Leong, A. Z.-X. et al. Short open reading frames (sORFs) and microproteins: an update on their identification and validation measures. J. Biomed. Sci. 29, 19 (2022).
Zhu, Y. et al. Discovery of coding regions in the human genome by integrated proteogenomics analysis workflow. Nat. Commun. 9, 903 (2018).
McGlincy, N. J. & Ingolia, N. T. Transcriptome-wide measurement of translation by ribosome profiling. Methods 126, 112–129 (2017).
Chothani, S. et al. deltaTE: detection of translationally regulated genes by integrative analysis of Ribo-seq and RNA-seq data. Curr. Protoc. Mol. Biol. 129, e108 (2019).
Ahlmann-Eltze, C. & Anders, S. proDA: Probabilistic dropout analysis for identifying differentially abundant proteins in label-free mass spectrometry. Preprint at bioRxiv https://doi.org/10.1101/661496 (2020).
Mullard, A. FDA approves 100th monoclonal antibody product. Nat. Rev. Drug Discov. 20, 491–495 (2021).
Tuameh, A., Harding, S. E. & Darton, N. J. Methods for addressing host cell protein impurities in biopharmaceutical product development. Biotechnol. J. 18, e2200115 (2023).
Wilson, L. J., Lewis, W., Kucia-Tran, R. & Bracewell, D. G. Identification of upstream culture conditions and harvest time parameters that affect host cell protein clearance. Biotechnol. Prog. 35, e2805 (2019).
Hogwood, C. E., Bracewell, D. G. & Smales, C. M. Measurement and control of host cell proteins (HCPs) in CHO cell bioprocesses. Curr. Opin. Biotechnol. 30, 153–160 (2014).
Fukuda, N., Senga, Y. & Honda, S. Anxa2- and Ctsd-knockout CHO cell lines to diminish the risk of contamination with host cell proteins. Biotechnol. Prog. 35, e2820 (2019).
Kol, S. et al. Multiplex secretome engineering enhances recombinant protein production and purity. Nat. Commun. 11, 1908 (2020).
Ferreira, J. P., Overton, K. W. & Wang, C. L. Tuning gene expression with synthetic upstream open reading frames. Proc. Natl Acad. Sci. USA 110, 11284–11289 (2013).
Ong, H. K., Nguyen, N. T. B., Bi, J. & Yang, Y. Vector design for enhancing expression level and assembly of knob-into-hole based FabscFv-Fc bispecific antibodies in CHO cells. Antib. Ther. 5, 288 (2022).
Kearse, M. G. & Wilusz, J. E. Non-AUG translation: a new start for protein synthesis in eukaryotes. Genes Dev. 31, 1717–1731 (2017).
Liang, H. et al. PTENα, a PTEN isoform translated through alternative initiation, regulates mitochondrial function and energy metabolism. Cell Metab. 19, 836–848 (2014).
Ketteler, R. On programmed ribosomal frameshifting: the alternative proteomes. Front. Genet. 3, 242 (2012).
Zhang, P. et al. Genome-wide identification and differential analysis of translational initiation. Nat. Commun. 8, 1749 (2017).
Martinez, T. F. et al. Profiling mouse brown and white adipocytes to identify metabolically relevant small ORFs and functional microproteins. Cell Metab. 35, 166–183.e11 (2023).
Gessulat, S. et al. Prosit: proteome-wide prediction of peptide tandem mass spectra by deep learning. Nat. Methods 16, 509–518 (2019).
Ingolia, N. T., Brar, G. A., Rouskin, S., McGeachy, A. M. & Weissman, J. S. The ribosome profiling strategy for monitoring translation in vivo by deep sequencing of ribosome-protected mRNA fragments. Nat. Protoc. 7, 1534–1550 (2012).
Martin, M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet. J. 17, 10–12 (2011).
Bolger, A. M., Lohse, M. & Usadel, B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30, 2114–2120 (2014).
Lawrence, M. et al. Software for computing and annotating genomic ranges. PLoS Comput. Biol. 9, e1003118 (2013).
Tjeldnes, H. et al. ORFik: a comprehensive R toolkit for the analysis of translation. BMC Bioinforma. 22, 336 (2021).
Ramírez, F. et al. deepTools2: a next generation web server for deep-sequencing data analysis. Nucleic Acids Res. 44, W160–W165 (2016).
Zhang, H. mt1022/cubar: Release v0.5.1. Zenodo https://doi.org/10.5281/zenodo.11060142 (2024).
Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15, 550 (2014).
Wang, J., Vasaikar, S., Shi, Z., Greer, M. & Zhang, B. WebGestalt 2017: a more comprehensive, powerful, flexible and interactive gene set enrichment analysis toolkit. Nucleic Acids Res. 45, W130–W137 (2017).
Strasser, L. et al. Proteomic landscape of adeno-associated virus (AAV)-producing HEK293 Cells. Int. J. Mol. Sci. 22, 11499 (2021).
Chambers, M. C. et al. A cross-platform toolkit for mass spectrometry and proteomics. Nat. Biotechnol. 30, 918–920 (2012).
Millikin, R. J., Solntsev, S. K., Shortreed, M. R. & Smith, L. M. Ultrafast Peptide Label-Free Quantification with FlashLFQ. J. Proteome Res. 17, 386–391 (2018).
Sandmann, C.-L. et al. Evolutionary origins and interactomes of human, young microproteins and small peptides translated from short open reading frames. Mol. Cell 83, 994–1011.e18 (2023).
Clarke, et al. manuscript code. https://doi.org/10.5281/zenodo.13285416 (2024).
Kim, S. et al. PubChem 2023 update. Nucleic Acids Res. 51, D1373–D1380 (2023).

Acknowledgements

The authors gratefully acknowledge funding from Science Foundation Ireland (grant references: 15/CDA/3259 (to CC) and 13/SIRG/2084 (to CC)). We thank Michelle Chain, Filipe Guapo and Ciara Tierney for technical assistance in the temperature shift proteomics cell culture experiment and mass spectrometry analyses. We are also grateful to the Carapito Laboratory for their support in incorporating the Pythoud et al. mass spectrometry data in this study.

Author information

These authors contributed equally: Ioanna Tzani, Marina Castro-Rivadeneyra, Jonathan Bones, Colin Clarke.

Authors and Affiliations

National Institute for Bioprocessing Research and Training, Fosters Avenue, Blackrock, Co. Dublin, Ireland
Ioanna Tzani, Marina Castro-Rivadeneyra, Paul Kelly, Lisa Strasser, Niall Barron, Jonathan Bones & Colin Clarke
School of Chemical and Bioprocess Engineering, University College Dublin, Belfield, Dublin, Ireland
Marina Castro-Rivadeneyra, Niall Barron, Jonathan Bones & Colin Clarke
Bioprocess R&D, Pfizer Inc., Andover, Massachusetts, USA
Lin Zhang
National Institute for Cellular Biotechnology, Dublin City University, Dublin, Ireland
Martin Clynes
Barnett Institute, Northeastern University, 360 Huntington Ave, Boston, MA, USA
Barry L. Karger

Contributions

I.T. and C.C. conceived the study and designed experiments. Cell culture and Ribo-seq were carried out by I.T. and P.K. Ribo-seq data analysis was performed by M.C.R. and C.C. L.S. and J.B. performed the mass spectrometry experiments and analyses, and C.C. analysed the data. M.C.R., I.T., L.Z., M.C., B.L.K., N.B., J.B. and C.C. wrote the manuscript. All authors reviewed the paper.

Corresponding author

Correspondence to Colin Clarke.

Ethics declarations

Competing interests

I.T., M.C.R., P.K., L.S., B.L.K., M.C., N.B., J.B. and C.C. declare no competing interests. L.Z. is an employee of Pfizer Inc.

Peer review

Peer review information

Nature Communications thanks the anonymous reviewers for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Tzani, I., Castro-Rivadeneyra, M., Kelly, P. et al. Detection of host cell microprotein impurities in antibody drug products. Nat Commun 15, 8605 (2024). https://doi.org/10.1038/s41467-024-51870-0

Received: 12 April 2023
Accepted: 21 August 2024
Published: 04 October 2024
DOI: https://doi.org/10.1038/s41467-024-51870-0