Clinical and analytical validation of a combined RNA and DNA exome assay across a large tumor cohort

Yudina, Anastasiya; Tazearslan, Cagdas; Baisangurov, Artur; Nuzhdina, Ekaterina; Lauziere, Kelley; Segodin, Vitaly; Podsvirova, Svetlana; Starikov, Sergey; Chasse, Madison; Shaposhnikov, Kirill; Kaneunyenye, Leznath; Klimchuk, Olesia; Kuzkina, Natalia; English, Noel; Khegai, Gleb; Sookiasian, Danielle; Shafranskaya, Daria; Fernandez, Dawn; Lozinsky, Yaroslav; Sobolev, Andrew; Abdou, Mary; Turova, Polina; Chernyshov, Konstantin; Efremov, Alexey; Andrewes, Samuel; Feinberg, Aviva; McKenna, Brianna; Brown, Jessica H.; Love, Anna; Curran, John; Lennerz, Jochen; Bagaev, Alexander

doi:10.1038/s43856-025-00934-3

Article
Open access
Published: 16 June 2025

Anastasiya Yudina¹,
Cagdas Tazearslan¹,
Artur Baisangurov¹,
Ekaterina Nuzhdina¹,
Kelley Lauziere¹,
Vitaly Segodin¹,
Svetlana Podsvirova¹,
Sergey Starikov¹,
Madison Chasse¹,
Kirill Shaposhnikov ORCID: orcid.org/0009-0006-2153-8565¹,
Leznath Kaneunyenye¹,
Olesia Klimchuk¹,
Natalia Kuzkina¹,
Noel English¹,
Gleb Khegai ORCID: orcid.org/0009-0003-2505-3846¹,
Danielle Sookiasian¹,
Daria Shafranskaya¹,
Dawn Fernandez¹,
Yaroslav Lozinsky¹,
Andrew Sobolev¹,
Mary Abdou¹,
Polina Turova¹,
Konstantin Chernyshov¹,
Alexey Efremov¹,
Samuel Andrewes¹,
Aviva Feinberg¹,
Brianna McKenna¹,
Jessica H. Brown¹,
Anna Love ORCID: orcid.org/0000-0003-4627-3138¹,
John Curran¹,
Jochen Lennerz ORCID: orcid.org/0000-0003-2434-4978¹ &
…
Alexander Bagaev ORCID: orcid.org/0000-0002-8680-854X¹

Communications Medicine volume 5, Article number: 236 (2025)

Abstract

Background

Combining RNA sequencing (RNA-seq) with whole exome sequencing (WES) from a single tumor sample can substantially improve the detection of clinically relevant alterations in cancer. However, routine clinical adoption of this integrated approach remains limited, especially for RNA-seq, due to the absence of standardized validation frameworks.

Methods

We developed and validated an assay that integrates RNA-seq and WES for evaluating gene expression, gene fusions, tumor microenvironment signatures, somatic single nucleotide variants (SNVs), insertions/deletions (INDELs), and copy number variations (CNVs). Exome-wide somatic reference standards were generated to support analytical validation using multiple sequencing runs of cell lines at varying purities.

Results

Assay validation involves three steps: (1) analytical validation using custom reference samples containing 3042 SNVs and 47,466 CNVs; (2) orthogonal testing in patient samples; and (3) assessment of clinical utility in real-world cases. Applied to 2230 clinical tumor samples, the integrated assay enables direct correlation of somatic alterations with gene expression, recovers variants missed by DNA-only testing, and improves detection of gene fusions. In addition to uncovering clinically actionable alterations in 98% of cases, the assay also reveals complex genomic rearrangements that would likely have remained undetected without RNA data.

Conclusions

This study provides practical validation guidelines for integrated RNA and DNA sequencing in clinical oncology. The combined assay enhances the detection of actionable alterations, thereby facilitating personalized treatment strategies for cancer patients.

Plain language summary

People with cancer have changes in the DNA sequence within their cancer cells that are not present in noncancerous cells. Accurate identification of DNA changes in tumors is crucial for optimizing cancer care for individual people with cancer. It is also helpful to be able to look at changes in the RNA produced by cells, as RNA is the molecule used in cells to determine which proteins are produced, and the proteins tend to determine how a cell behaves. Currently, assessment of DNA and RNA is often performed separately. Here, we validated a combined test to detect both clinically relevant DNA alterations and changes in the RNA produced. Looking at over 2000 tumors, we found alterations missed by DNA-only approaches. Our method could improve diagnostic accuracy, streamline clinical workflows, and potentially save money, ultimately leading to improved cancer treatment decisions and better patient outcomes.

Introduction

Diagnostic methods have evolved to address emerging therapeutic targets to support the rapid growth of personalized medicine. In 2013, Frampton et al. validated a targeted genomic panel, establishing a framework for next-generation sequencing (NGS) assay development¹. Nonetheless, most clinical NGS assays rely on DNA-seq with targeted gene panels, leaving many clinically relevant genes untested².

While RNA-seq has become a standard approach for measuring fusions and characterizing tissue phenotypes, its clinical adoption remains limited. Whole exome sequencing (WES) identifies single nucleotide variants (SNVs), insertions/deletions (INDELs), copy number variations (CNVs), loss of heterozygosity (LOH), microsatellite instability (MSI), and tumor mutational burden (TMB) across more than 20,000 genes. It also surpasses targeted panels in identifying TMB³ and large-scale (arm-level) CNVs^4,5,6. Combining WES with RNA-seq increases actionable findings through the detection of gene expression changes, fusions, and alternative splicing events. Gene expression signatures can also predict immunotherapy outcomes, highlighting the importance of robust validation methods.

Currently, comprehensive guidelines for integrated RNA-seq and WES assay development, validation, and analysis are lacking. Therefore, genomic studies to date have relied on germline references^7,8,9 or validated targeted panel assays^10,11,12. Such guidelines are integral for clinical implementation, particularly for the interpretation of somatic variants from RNA-seq.

Here, we propose comprehensive guidelines for validating SNVs, CNVs, MSI, TMB, gene expression, and fusions detected through an in-house, integrated WES and RNA-seq assay (Tumor Portrait™ [BostonGene Corporation; Waltham, MA, USA]). Validations include: (1) an analytical step using reference materials and cell lines; (2) orthogonal testing with clinical samples; and (3) clinical validation on 2230 patient samples.

Clinical validation enables the creation of an interpretation framework linking somatic variants, CNVs, and fusions to related gene expression profiles, revealing allele-specific expression of oncogenic drivers. Additionally, the RNA-seq variant-calling framework improves the detection of low-coverage hotspot variants. These findings underscore the need for updated guidelines supporting integrated DNA- and RNA-seq assays. By providing a clear roadmap for validating combined assays and interpreting their results, we aim to facilitate their clinical adoption, advancing patient care and personalized treatment strategies in oncology.

Methods

Laboratory procedures

Nucleic acid isolation

Nucleic acid isolation was performed from fresh frozen (FF) solid tumors with the AllPrep DNA/RNA Mini Kit (Qiagen, Valencia, CA, USA) and from normal tissue (whole blood, peripheral blood mononuclear cells [PBMCs], or saliva) with the QIAmp DNA Blood Mini Kit (Qiagen, Valencia, CA, USA) and Maxwell RSC Stabilized Saliva DNA Kit (Promega, Madison, WI, USA). The AllPrep DNA/RNA FFPE Kit (Qiagen, Valencia, CA, USA) was used for nucleic acid isolation from formalin-fixed paraffin-embedded (FFPE) solid tumors. DNA and RNA extracts were tested for contamination and structural integrity. DNA and RNA quantity and quality were measured using a Qubit 2.0 (Thermo Fisher Scientific, Waltham, MA, USA), NanoDrop OneC (Thermo Fisher Scientific, Waltham, MA, USA), and TapeStation 4200 (Agilent Technologies, Santa Clara, CA, USA).

Library preparation for DNA and RNA sequencing

For both FF and FFPE protocols, 10–200 ng of extracted DNA or RNA was required for their respective library preparations. Library construction from FF tissue RNA was performed with the TruSeq stranded mRNA kit (Illumina, San Diego, CA, USA). Library construction from FFPE tissue DNA and RNA was performed using the SureSelect XTHS2 DNA and SureSelect XTHS2 RNA exome capture kits, respectively (Agilent Technologies, Santa Clara, CA, USA). For hybridization and capture, the SureSelect Human All Exon V7 + UTR (Agilent Technologies, Santa Clara, CA, USA) exome probe was used for RNA, and the SureSelect Human All Exon V7 exome probe (Agilent Technologies, Santa Clara, CA, USA) was used for DNA. Quality, concentration, and size of the prepared libraries were assessed using the Qubit 2.0 (Thermo Fisher Scientific, Waltham, MA, USA), TapeStation 4200 (Agilent Technologies, Santa Clara, CA, USA), and LightCycler 480 (Roche, CA) or QuantStudio™ 5 Real-Time PCR System (Thermo Fisher Scientific, Waltham, MA).

Sequencing

Sequencing was performed on a NovaSeq 6000 (Illumina, San Diego, CA, USA). For each step of library preparation, acceptable and target values of amounts, concentrations, average fragment size, RNA integrity number (RIN) scores, and light absorption metrics were confirmed with quality control steps. The primary NovaSeq 6000 QC metrics (Q30 > 90%, PF > 80%) were monitored in BaseSpace Sequence Hub during every run, and stringent bioinformatics pipeline analysis was performed for the prepared libraries. All samples included in the analysis passed QC thresholds at every stage of DNA or RNA library preparation and bioinformatics analysis.

Bioinformatics procedures

Alignment

WES data were mapped to the human genome (hg38) using BWA aligner v0.7.17. GATK v4.1.2 and mosdepth v0.2.1 were used to mark PCR duplicate reads and to collect sequencing metrics, including average coverage statistics. RNA-seq data were mapped to the human genome (hg38) using STAR aligner v2.4.2 with default parameters and minor modifications. For gene expression quantification, reads were aligned to the human transcriptome (hg38) with Kallisto v0.43.0 using default parameters¹³.

Quality control (QC)

QC was performed as previously described by Bagaev et al.¹⁴. Briefly, standard QC for WES was performed via fastQC v0.11.9 and FastqScreen v0.14.0. Picard v2.20.7 MarkDuplicates was used to remove duplicate reads. Off-target reads were counted using samtools view v1.10 by intersecting alignments with the target regions file provided by the vendor (Agilent Technologies, Santa Clara, CA, USA). Unique reads were counted using samtools view v1.10 with the -F 0x400 flag. Off-target and duplication rates were calculated by dividing the off-target and duplicate reads by the total number of reads, respectively. Standard QC for RNA-seq was performed via RSeQC v3.0.1, including assessment of the percentage of sense strand reads for DNA contamination control. Sample mix-ups were controlled for by comparing HLA types (obtained via OptiType v1.3.5)¹⁵ and calculating the concordance of germline SNVs in housekeeping genes.
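
The rate calculation described above can be sketched as follows. This is a minimal illustration that assumes the read counts have already been obtained (for example, from samtools); the function and argument names are hypothetical and not part of the published pipeline.

```python
def qc_rates(total_reads: int, off_target_reads: int, duplicate_reads: int) -> dict:
    """Return off-target and duplication rates as fractions of all reads.

    Both rates use the total read count as the denominator, as stated in the text.
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return {
        "off_target_rate": off_target_reads / total_reads,
        "duplication_rate": duplicate_reads / total_reads,
    }
```

For example, `qc_rates(1000, 100, 250)` gives an off-target rate of 0.1 and a duplication rate of 0.25.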

SNV and INDEL detection

Variant calling

Germline SNVs and INDELs and somatic SNVs were detected using optimized Strelka v2.9.10 on both normal and paired tumor/normal samples in exome mode with the following parameter modifications: QSI_NT score ≥ 50, QSS_NT ≥ 50, Somatic EVS > 15¹⁶. Somatic INDEL (1–49 bp) calling was performed via Strelka v2.9.10 using small INDEL candidates from Manta v1.5.0^16,17. Variant calling from RNA-seq data was performed via Pisces v5.2.10.49¹⁸.

Filtration

The outputs from the variant calling algorithms underwent subsequent filtration. For germline variants, filtration of calls from chrX and chrY was performed according to gender. An additional strand bias filter was applied to germline INDELs. For somatic mutations, the output after Strelka2 processing underwent filtration using a basic filter (tumor depth ≥ 10 reads, normal depth ≥ 20 reads, and normal VAF ≤ 0.05). Then, a threshold for tumor VAF ≥ 0.05 was applied. Finally, a complex filter based on the combination of Strelka2 QSS and EVS scores was applied:

$$QC=\frac{e^{x}}{1+e^{x}}$$

(1)

where x = b1 * QSS + b2 * EVS + c. Parameters b1, b2, and c were assessed via logistic regression. The following additional filters were applied to the standard parameters: VAF and alternative read counts in tumor and normal samples. Somatic and germline mutation calling was performed across all covered regions (including covered off-target regions), and per-nucleotide coverage thresholds were applied (80X for tumor samples and 20X for normal samples) to establish the required coverage needed to optimize quality metrics of the output.
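
The basic filter and the logistic score in Eq. (1) can be sketched as below. The fitted values of b1, b2, and c are not reported in the text, so they are passed in as arguments; the function names are illustrative assumptions rather than the authors' implementation.

```python
import math


def passes_basic_filter(tumor_depth, normal_depth, tumor_vaf, normal_vaf):
    """Depth and VAF thresholds quoted in the text for somatic calls."""
    return (
        tumor_depth >= 10
        and normal_depth >= 20
        and normal_vaf <= 0.05
        and tumor_vaf >= 0.05
    )


def qc_score(qss, evs, b1, b2, c):
    """Logistic score QC = e^x / (1 + e^x) with x = b1*QSS + b2*EVS + c.

    b1, b2, and c are placeholders for coefficients fitted by logistic
    regression; their actual values are not given in the text.
    """
    x = b1 * qss + b2 * evs + c
    # Use the numerically stable form for each sign of x.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    ex = math.exp(x)
    return ex / (1.0 + ex)
```

A score near 1 indicates a call that the regression considers reliable; the cutoff applied to the score is not stated in this section and is left to the caller.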

Benchmarking

Mutect2 and FilterMutectCalls from GATK v4.1 and Dragen v4.3 in somatic mode were used for SNV/INDEL calling^19,20.

Pileup files parsing

For additional verification of SNV/INDEL calls, samtools mpileup v1.10 and a custom downstream parsing script were used to precisely check discordant calls between two samples.

TMB assessment

The TMB score was calculated based on the somatic mutation calling output with the additional exclusion of intronic and silent mutations. The mutations for TMB score assessment were taken only from target regions, and normalization was performed by target length in megabases. TMB-high was defined as ≥10 mut/Mb, and TMB-low was defined as <10 mut/Mb. For FF and FFPE samples, mutations that passed the following thresholds were used for TMB score calculation: (1) tumor coverage depth for mutation ≥25; (2) tumor VAF (SNV, INDEL) ≥ 0.05; (3) tumor alternative read support ≥ 5; (4) normal coverage depth for mutation ≥15; (5) normal VAF ≤ 0.05; (6) tumor VAF * tumor alternative read support ≥ 0.6.
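
The six thresholds and the normalization above can be sketched as follows. The `Mutation` record, its field names, and the derivation of tumor VAF as alternative reads divided by depth are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Mutation:
    tumor_depth: int
    tumor_alt_reads: int
    normal_depth: int
    normal_vaf: float
    in_target: bool = True               # only target-region mutations count
    is_silent_or_intronic: bool = False  # excluded from TMB

    @property
    def tumor_vaf(self) -> float:
        # Assumption: VAF is taken as alt reads / depth.
        return self.tumor_alt_reads / self.tumor_depth if self.tumor_depth else 0.0


def counts_toward_tmb(m: Mutation) -> bool:
    """Apply thresholds (1)-(6) from the text plus the region/type exclusions."""
    return (
        m.in_target
        and not m.is_silent_or_intronic
        and m.tumor_depth >= 25
        and m.tumor_vaf >= 0.05
        and m.tumor_alt_reads >= 5
        and m.normal_depth >= 15
        and m.normal_vaf <= 0.05
        and m.tumor_vaf * m.tumor_alt_reads >= 0.6
    )


def tmb_score(mutations, target_length_bp: int) -> float:
    """Qualifying mutations per megabase of target."""
    n = sum(counts_toward_tmb(m) for m in mutations)
    return n / (target_length_bp / 1e6)


def tmb_status(score: float) -> str:
    return "TMB-high" if score >= 10 else "TMB-low"
```

Note that criterion (6) is stricter than (2) and (3) combined at low support: a mutation with 5 alternative reads at a VAF of 0.05 passes (2) and (3) but fails (6), since 0.05 × 5 = 0.25.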

MSI status determination

MSISensor2 was used to calculate the MSI score. A standard cut-off of 20, which represents the percentage of microsatellite unstable regions out of 2800 analyzed, was used for MSI status classification²¹.

CNV detection

A primary step for CNV detection was the compilation of an extraction coverage reference for tumor and normal samples. The reference was defined by finding common regions/peaks of coverage in a set of 30 normal reference samples. Peaks of coverage were calculated with macs3 v3.0.0a7. Samtools v1.10 was used for coverage extraction²². For each sample, only highly covered regions (at least 20% of the median sample coverage) were retained after filtration; these regions were then intersected using BEDTools v2.30.0, resulting in a final reference²³. Modified Sequenza v2.1.2 was used to call CNVs using the following process: (1) The coverage area was calculated under each peak from the extraction coverage reference; (2) The peak coverage was used as a homozygous input for the calculation of the depth of coverage ratio (ratio of coverage between tumor and normal samples); (3) Heterozygous germline SNVs were used to calculate B-allele frequency (BAF) in tumor samples; (4) Homozygous and heterozygous positions were merged together for the main input into Sequenza in pileup format²⁴. This approach allowed us to focus on highly covered regions in the genome, reducing the impact of noise. This was particularly important for low-purity samples, where the slightest elevation or reduction of the depth ratio and BAF metrics can be interpreted as a change in copy numbers.

The following additional standard techniques were also used during CNV calling: (1) Full segmentation based on the depth ratio and BAF metrics; (2) Filtration of CNV calls based on a depth of coverage ratio ≥10 (ratio of coverage between tumor and normal samples), and a coverage of tumor alternative allele ≥5; (3) Filtration of segments during model fitting based on the number of homozygous/heterozygous positions found in a segment (threshold is equal to 10 for the number of homozygous/heterozygous positions); (4) Filtration of segments caught in a centromere region.

For correct interpretation of copy numbers, ploidy of the sample was assessed with Sequenza, FACETS, and manual verification as needed^24,25. Normalized states of CNVs were introduced to compare between CNV profiles. We defined the following CNV levels: “Loss,” “Deletion,” “Neutral,” “Gain,” “Amplification,” and “High Amplification.” High-level CNVs included “Loss” (complete loss of a region), “Amplification,” and “High Amplification,” while low-level CNVs included “Deletion” (partial loss) and “Gain” (slight increase in number of copies). “Neutral” was defined as no CNV detected, with the total copy number equal to ploidy. These states were defined by normalizing the total copy number of a segment/gene to the sample ploidy, which was calculated for each sample as a measure of the average number of complete chromosome sets. In the analysis, we also used normalized CNV values (−2, −1, 0, 1, 2) that corresponded to the aforementioned levels: “Loss,” “Deletion,” “Neutral,” “Gain,” and “Amplification.”

The classification of the above-listed CNV categories can be done via the following logical expression:

$$\text{if } cn > 1.5 \cdot pl \to \text{amplification},$$

(2)

$$\text{if } 1.5 \cdot pl \ge cn > pl \to \text{gain},$$

(3)

$$\text{if } pl = cn \to \text{neutral},$$

(4)

$$\text{if } pl > cn > 0.5 \cdot pl \to \text{shallow deletion},$$

(5)

$$\text{if } 0.5 \cdot pl \ge cn \to \text{deletion},$$

(6)

where pl is ploidy (≥2), and cn is the copy number of the segment.
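
The classification rules can be written as a single function. This sketch reads the amplification rule as cn > 1.5 × pl, an assumption chosen so that it complements the gain rule in Eq. (3); the function name is illustrative.

```python
def classify_cnv(cn: float, pl: float) -> str:
    """Classify a segment's copy number (cn) relative to sample ploidy (pl)."""
    if cn > 1.5 * pl:
        return "amplification"
    if cn > pl:               # pl < cn <= 1.5 * pl
        return "gain"
    if cn == pl:
        return "neutral"
    if cn > 0.5 * pl:         # 0.5 * pl < cn < pl
        return "shallow deletion"
    return "deletion"         # cn <= 0.5 * pl
```

With integer copy numbers and a ploidy of 2, a single-copy loss falls on the boundary 0.5 × pl and is therefore classified as a deletion; the shallow deletion class only becomes reachable at higher ploidy (for example, cn = 3 at pl = 4).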

To classify segments by length and calculate statistics for every group, the following segment categories were introduced: (1) Focal-level segments with a length less than 10% of the arm, but no more than 3 Mb. (2) Arm-level segments with a length at least 50% of the arm length. (3) Long-focal segments that fall in between the focal- and arm-level categories.
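
A minimal sketch of the length-based categories, assuming that a focal segment must satisfy both conditions (shorter than 10% of the arm and no longer than 3 Mb); the function name and the base-pair units are illustrative choices.

```python
def classify_segment(segment_len_bp: int, arm_len_bp: int) -> str:
    """Assign a segment to the focal, long-focal, or arm-level category."""
    fraction = segment_len_bp / arm_len_bp
    if fraction >= 0.5:
        return "arm-level"
    if fraction < 0.1 and segment_len_bp <= 3_000_000:
        return "focal"
    return "long-focal"  # everything between focal and arm-level
```

Under this reading, a 5 Mb segment on a 100 Mb arm is long-focal even though it covers only 5% of the arm, because it exceeds the 3 Mb cap.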

Benchmarking

Dragen v4.3 in panel of normal mode and FACETS were used for CNV detection^20,25.

Expression analysis

Gene expression analysis was performed as previously described^14,26. RNA-seq reads were aligned to GRCh38.d1.vd1 using Kallisto v0.42.4 and annotated with GENCODE v23 transcripts with default parameters^13,27,28. The noncoding RNA, mitochondria- and histone-related transcripts were removed, and the protein-coding transcripts, IGH/K/L- and TCR-related transcripts, were retained, resulting in 20,062 protein-coding genes. Gene expression was quantified as transcripts per million (TPM) and log2-transformed²⁹, and gene expression signatures were calculated based on methods from Bagaev et al.¹⁴. STAR aligner v2.4.2 was used to provide read counts for exon skipping events detection for CLDN18, AR, and MET genes.
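
The log2 transformation of TPM values can be sketched as below. The text does not state whether a pseudocount was used; the value of 1 here is an assumption that keeps genes with zero expression defined.

```python
import math


def log2_tpm(tpm_by_gene: dict, pseudocount: float = 1.0) -> dict:
    """Return log2(TPM + pseudocount) per gene.

    The pseudocount is an illustrative assumption, not a value from the text.
    """
    return {gene: math.log2(tpm + pseudocount) for gene, tpm in tpm_by_gene.items()}
```

With the default pseudocount, a gene at 0 TPM maps to 0 and a gene at 3 TPM maps to 2.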

Fusion transcript detection

For fusion transcripts detection, STAR-fusion v1.8.1 was used with subsequent annotation of FusionInspector in validation mode^30,31. Detected fusion transcripts that met any of the following filtering criteria were removed: (1) Fusion fragments per million (FFPM) values ≥ 0.1; (2) Number of junction reads supporting the fusion ≥5; (3) A sum of all supporting reads (junction and spanning) ≥7; and (4) genes overlapping each other or with a breakpoint distance >10 kb. Functional annotation was also taken into account (mitochondrial genes and fusions from FusionCatcher black list were filtered out). The developed pipeline used a list of databases (Supplementary Table 11) to functionally annotate fusions and to identify de novo events.

Benchmarking

Arriba v2.4 was used for fusion transcript detection³².

Immune receptor repertoire analysis

Immune clonotypes were determined with MiXCR v3.0.12³³ using standard parameters according to the recommendation for bulk RNA-seq analysis.

Analytical validation procedures

Reference materials

Validation experiments were performed on FF cell lines and human tissues from the following commercial vendors: iSpecimen (Lexington, MA, USA), American Type Culture Collection (ATCC; Manassas, VA, USA), Coriell Institute for Medical Research (Camden, NJ, USA), and CureLine (Brisbane, CA, USA). Cell lines and FF tissue samples were stored long-term at −80 °C until used. Cell lines were cultured and tested for mycoplasma according to best practices from the ATCC. All specimens were examined for healthy morphology before proceeding to DNA and RNA extraction for next-generation sequencing (NGS). Commercial FFPE tissue samples (Cureline, Brisbane, CA, USA) and FFPE blocks made in-house from FF samples with the same origin were used to assess the performance of the NGS laboratory-developed test (LDT). After receiving or creating FFPE tissues, blocks were stored at room temperature until RNA and/or DNA extraction. All Cureline patient samples were collected with signed informed consent.

Commercial synthetic reference materials

DNA

The commercially available reference standards Seraseq® gDNA TMB Mix Score 7 (Cat. No. 0710-1326), 9 (Cat. No. 0710-1325), 20 (Cat. No. 0710-1324), 26 (Cat. No. 0710-1323; SeraCare, Gaithersburg, MD, USA), which contain a defined number of overall alterations per megabase of the genome, were utilized to validate the TMB assay in FF tissue and perform further orthogonal validation of FFPE tissue. The commercially available reference standards Seraseq FFPE TMB RM Score 7 (Cat. No. 0710-1307), 9 (Cat. No. 0710-1308), 20 (Cat. No. 0710-1309), and 26 (Cat. No. 0710-1310; WT + TUMOR), which contain a defined number of overall alterations per megabase of the genome, were utilized to validate the TMB assay in FFPE tissue.

The Seraseq® Lung & Brain CNV Mix (SeraCare, Gaithersburg, MD, USA) and Seraseq® Breast CNV Mix (SeraCare, Gaithersburg, MD, USA) reference DNA mixes were tested for CNVs using the BostonGene WES workflow. To establish the limit of detection (LOD) for CNVs, we used these reference materials with a number of amplification levels (+3 copies, +6 copies, +12 copies).

RNA

Validation of expression level assessment was performed using the following reference standards: Invitrogen™ Universal Human Reference RNA (Thermo Fisher Scientific, Waltham, MA, USA) and Invitrogen™ External RNA Controls Consortium (ERCC) ExFold Spike-In Mixes (Cat. No. 4456739, Thermo Fisher Scientific, Waltham, MA, USA) for FF samples. The references were used to estimate the range of detection, LOD, accuracy, precision, and reproducibility for detecting RNA expression.

The following reference standards were used for gene fusion detection validation: Seraseq® FFPE Tumor Fusion RNA v4 Reference Material (SeraCare, Gaithersburg, MD, USA), ALK-RET-ROS1 Fusion FFPE RNA Reference Standard (Horizon Discovery, Waterbeach, UK), and 5-Fusion Multiplex (Negative Control) (Horizon Discovery, Waterbeach, UK).

Cell lines

Platinum Genome cell lines, GM12877 and GM12878, prepared both as FF samples and in FFPE blocks were used to measure the accuracy of germline mutation calling for hereditary cancer predisposition syndromes. Platinum Genome references were used for germline calling optimization and validation³⁴. To validate the accuracy of somatic SNV and INDEL detection, high-level CNV detection, and TMB score assessment, pools of previously sequenced and well characterized cell lines were used: HCC1143 (CRL-2321)³⁵, HCC1937 (CRL-2336)³⁶, COLO829 (CRL-1974)³⁷, HCC1395 (CRL-2324)³⁵, and NCI-H1770 (CRL-5893)³⁸, and matched baselines were COLO829BL (CRL-1980), HCC1143BL (CRL-2362), HCC1937BL (CRL-2337), HCC1395BL (CRL-2325), and NCI-BL1770 (CRL-5960). For each cell line and its corresponding normal B cells, separate FFPE blocks were prepared—one for the tumor cells and one for the matched-normal cells. These blocks were stored for 3-4 months, after which DNA extracts were prepared from both the tumor and normal FFPE blocks, as well as from tumor and baseline B-cell cultures. The tumor and matched-normal DNA extracts were then mixed in a range of proportions (0:100, 10:90, 20:80, 30:70, 50:50, 25:75, 100:0) before proceeding with library construction procedures. For the RNA-seq dilution experiment, RNA extracts from the 100% tumor FFPE blocks were mixed with RNA extracts from the matched-normal FFPE blocks.

For validation of our MSI assay in FFPE tissues, pools of four MSI-high (MSI-H) cell lines, HCT 116 (CCL-247), LoVo (CCL-229), DLD-1 (CCL-221), and HCT-15 (CCL-225), with a range of purities (5%, 10%, 20%, 30%, 50%, 75%, 100%) were used^39,40,41. HCC1395BL (CRL-2325) was used as a background for dilutions since these cell lines did not have matched-normal B-cell cultures.

The cell line fusion analysis was performed with extracts from K562 (CCL-243), MCF-7 (HTB-22), A-673 (CRL-1598), RT-4 (HTB-2), U-118MG (HTB-15), BT-20 (HTB-19), NCI-H2228 (CRL-5935), SK-BR-3 (HTB-30), and THP-1 (TIB-22) FFPE blocks. Extracts from GM12877, GM12878, and HCC1143BL (CRL-2362) cell lines in FFPE blocks were used as negative controls^35,42,43,44,45,46,47,48,49,50. To assess the LOD of fusion detection, RNA extract from one tumor cell line was diluted into RNA extract from another tumor cell line. For FFPE samples, the following cell line pairs were used: BT-20:THP-1 and NCI-H2228:K562. For FF samples, the following cell line pair was used: K562:MCF-7. The dilution series ran from a 95:5 ratio to a 5:95 ratio in 15% steps.
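
The mixing ratios implied by a 15% step between 95:5 and 5:95 can be enumerated as follows; the function name is illustrative.

```python
def dilution_series(start: int = 95, stop: int = 5, step: int = 15) -> list:
    """Return (first line %, second line %) pairs from start down to stop."""
    return [(pct, 100 - pct) for pct in range(start, stop - 1, -step)]
```

This yields seven mixes: 95:5, 80:20, 65:35, 50:50, 35:65, 20:80, and 5:95.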

Gene expression analysis was performed using A-673 (CRL-1598), BT-20 (HTB-19), COLO829 (CRL-1974), Caki-1 (HTB-46), GM12877, GM12878, HCC1143 (CRL-2321), K562 (CCL-243), MCF-7 (HTB-22), NCI-H2228, RT-4 (HTB-2), Reh (CRL-8286), SK-BR-3 (HTB-30), SNU-16 (CRL-5974), T-47D (HTB-133), THP-1 (TIB-22), U-118MG (HTB-15), and WERI-Rb-1 (HTB-169).

Biological replicates of the COLO829 cell line were used as a positive control for each clinical sequencing run. The samples were used for longitudinal reproducibility assessment of both RNA-seq and WES-defined events.

Tumor samples

A total of 59 paired normal and tumor samples (internal BostonGene samples and Cureline Tissue Bank [Brisbane, CA, USA]) from patients with various solid tumor diagnoses were used for orthogonal validation. From this cohort, 53 patients were selected for further analysis after the median coverage and purity assessment for each sample was established (Supplementary Table 14). Paired FF and FFPE samples from 110 patients were used to measure the detection accuracy of RNA expression in FFPE samples. Informed consent was obtained for BostonGene and Cureline patient samples.

Orthogonal validation

qPCR

Quantitative polymerase chain reaction (qPCR) was performed to validate RNA-seq expression in 51 FFPE samples using predesigned TaqMan Gene Expression Assays (Thermo Fisher Scientific, Waltham, MA, USA) and TaqMan Fast Advance Master Mix (Cat# 4444556, Thermo Fisher Scientific, Waltham, MA, USA) according to the manufacturer’s protocol for 99 genes (Supplementary Table 8). We normalized qPCR Ct values to the PCBP1 housekeeping gene⁵¹ for comparison with RNA-seq-derived expression values (TPM). qPCR was also performed to validate the detection of clinically relevant fusion genes (118 fusion genes, 63 clinical samples). To detect fusions, custom qPCR ZEN double-quenched probe assays, containing a 5’ FAM 520 fluorophore and a 3’ Iowa Black FQ quencher, were designed using IDT’s PrimerQuest™ Tool (Integrated DNA Technologies, Inc., Coralville, Iowa, USA). Fusion qPCR reactions were performed with TaqMan Fast Advance Master Mix (Cat# 4444556, Thermo Fisher Scientific, Waltham, MA, USA) according to the manufacturer’s protocol.
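
Normalization of Ct values to a housekeeping gene is commonly expressed as a ΔCt. The sketch below assumes that convention and, for the relative expression, roughly 100% amplification efficiency; the text does not specify either detail, and the function names are illustrative.

```python
def delta_ct(ct_target: float, ct_reference: float) -> float:
    """Ct of the target gene minus Ct of the reference (here, PCBP1)."""
    return ct_target - ct_reference


def relative_expression(ct_target: float, ct_reference: float) -> float:
    """Expression relative to the reference gene, 2^(-dCt).

    Assumes product doubling per cycle; this is an illustrative simplification.
    """
    return 2.0 ** (-delta_ct(ct_target, ct_reference))
```

Because a lower Ct means higher abundance, −ΔCt rather than ΔCt is the quantity expected to rise with log-scale RNA-seq expression.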

Orthogonal laboratory sequencing

Paired-tumor and -normal FFPE samples from 59 patients were sequenced at the BostonGene laboratory. The same set of samples was sequenced at a CLIA-certified, CAP-accredited, and NY-state-approved reference laboratory, Novogene (https://en.novogene.com/technology/certification/). All mutations assessed via Novogene’s sequencing technology, including SNVs, INDELs, CNVs, TMB score, and MSI status, were considered as a reference set.

Assessment of coverage requirements

We used 83 internal control (COLO829 cell line) samples to assess the percentage of coverage completeness. To observe the coverage dependency on low-covered samples, in silico downsampling of each sample was performed to get a range from median 20X to 440X via random deletion of subsets of reads from each sample’s BAM file using samtools v.1.15.1.

To establish the number of reads sufficient for reliable transcript detection and expression level assessment, 10 technical replicates of the same extract from the COLO829 cell line were used. Each sample was downsampled in silico to 5, 10, 15, 20, 30, 40, 50, 60, 70, and 80 million reads. The downsampling was performed on FASTQ files via seqtk v1.3 prior to further analysis.
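
When planning such a downsampling experiment, it can help to compute the fraction of reads to keep for each target depth. The helper below is an illustrative sketch and does not reproduce the authors' commands; the resulting fraction could be supplied to a subsampling tool.

```python
# Target read depths (in reads) listed in the text.
TARGET_READS = [n * 1_000_000 for n in (5, 10, 15, 20, 30, 40, 50, 60, 70, 80)]


def subsample_fraction(total_reads: int, target_reads: int) -> float:
    """Fraction of reads to keep so that about target_reads remain."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return min(1.0, target_reads / total_reads)


def plan_downsampling(total_reads: int) -> dict:
    """Map each reachable target depth to its sampling fraction."""
    return {
        t: subsample_fraction(total_reads, t)
        for t in TARGET_READS
        if t <= total_reads
    }
```

Targets above the original depth are skipped, since a library cannot be downsampled to more reads than it contains.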

Assessment of SNV and INDEL variant-calling performance

Cell line references

True mutations and putative artifacts were classified by assessing the concordance between the decrease in VAF and cancer cell line purity. Mutations detected at least three times in serial dilutions with a linear coefficient between 0.70 and 1.30 and a Pearson’s correlation coefficient greater than 0.80, or at least four times with a Pearson’s correlation coefficient greater than 0.50, were classified as true mutations and were used to assess the precision of our mutation calling. True-positive mutation sets for this validation were defined by performing four to six NGS replicates on COLO829, HCC1143, HCC1937, HCC1395, and NCI-H1770 cell lines with 100% tumor purity. The replicate sequencing outputs were corrected with sequencing data from matched-normal samples and intersected; mutations with a VAF standard deviation of less than 10% that were covered by at least 80 reads were defined as true positives for each cell line and were used to assess the sensitivity of mutation calling. We adjusted the reference set of variants for each dilution, excluding variants with a median VAF lower than 5%. The SNV VAFs of HCC1395, HCC1143, HCC1937, and COLO829 cell lines were compared with previously published data^52,53. Sensitivity and VAF correlation were calculated.
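
The dilution-based classification rule can be sketched as follows. This sketch assumes that the linear coefficient is the least-squares slope of observed VAF against expected VAF (VAF at 100% purity scaled by the dilution purity), which the text does not state explicitly; the function names are illustrative.

```python
import math


def slope_and_pearson(xs, ys):
    """Least-squares slope of ys on xs and Pearson's correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return float("nan"), float("nan")  # no variation, correlation undefined
    return sxy / sxx, sxy / math.sqrt(sxx * syy)


def is_true_mutation(expected_vafs, observed_vafs) -> bool:
    """Apply the two acceptance rules to the dilutions where the variant was seen."""
    n = len(observed_vafs)
    if n < 3:
        return False
    slope, r = slope_and_pearson(expected_vafs, observed_vafs)
    if math.isnan(r):
        return False
    if 0.70 <= slope <= 1.30 and r > 0.80:
        return True                   # at least 3 detections, near-unit slope
    return n >= 4 and r > 0.50        # at least 4 detections, weaker correlation
```

A variant whose observed VAF stays flat across purities, as expected for a recurrent artifact, has an undefined or low correlation and is rejected by both rules.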

Clinical data utilization

The performance of germline and somatic variant calling was assessed via calculation of recall (sensitivity) and precision quality metrics. Variants identified in the independent laboratory’s (Novogene) sequencing data were considered as a reference set of mutations. Clinical data were also used to assess mutation counts within clinically actionable genes (Supplementary Table 6).
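
Recall and precision against a reference call set can be computed as in the sketch below. Representing each variant as a (chromosome, position, reference allele, alternative allele) tuple is an assumption made for illustration.

```python
def precision_recall(called, reference):
    """Return (precision, recall) of called variants against a reference set.

    Variants are hashable keys, e.g. (chrom, pos, ref, alt) tuples.
    """
    called, reference = set(called), set(reference)
    true_positives = len(called & reference)
    precision = true_positives / len(called) if called else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall
```

Here precision is the share of assay calls confirmed by the reference laboratory, and recall is the share of reference variants recovered by the assay.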

Performance assessment of TMB-score and MSI status detection

TMB score and classification of high or low TMB levels were calculated based on somatic mutation calling from both BostonGene’s and the independent laboratory’s (Novogene) sequencing data. Correlation of the TMB score and the accuracy of classification were assessed. MSI score and classification as microsatellite stable (MSS) or MSI were calculated for both BostonGene’s and Novogene’s sequencing data. Correlation of the MSI score and the accuracy of classification were assessed.

Assessment of CNV detection performance

Performance of our LDT’s CNV detection was first assessed using data available for COLO829 from Craig et al.⁵³. The original annotation table consisted of 6586 genes covering four unique CNV classifications: gain, loss, focal gain, and focal loss. After intersection with our list of covered regions, the final set consisted of 5706 genes. To compare the previously published classifications with our classifications, we unified the categories as gain, loss, or neutral. Therefore, we replaced the focal gain and focal loss categories with gain and loss, respectively. The same annotation was used for our categories: shallow deletions and deletions were categorized as losses, while amplifications and gains were categorized as gains.
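
The category unification can be expressed as two lookup tables. The agreement fraction computed below is an illustrative metric rather than one named in the text, and the function and variable names are hypothetical.

```python
# Published labels (gain, loss, focal gain, focal loss) mapped to unified terms.
PUBLISHED_TO_UNIFIED = {
    "gain": "gain",
    "focal gain": "gain",
    "loss": "loss",
    "focal loss": "loss",
}

# Assay labels mapped to the same unified terms.
ASSAY_TO_UNIFIED = {
    "amplification": "gain",
    "gain": "gain",
    "neutral": "neutral",
    "shallow deletion": "loss",
    "deletion": "loss",
}


def unify(label: str, mapping: dict) -> str:
    return mapping[label.strip().lower()]


def agreement(published: dict, assay: dict) -> float:
    """Fraction of genes present in both tables whose unified labels match."""
    shared = published.keys() & assay.keys()
    if not shared:
        return 0.0
    matches = sum(
        unify(published[g], PUBLISHED_TO_UNIFIED) == unify(assay[g], ASSAY_TO_UNIFIED)
        for g in shared
    )
    return matches / len(shared)
```

Both arguments to `agreement` map gene names to labels; only genes present in both tables are compared, mirroring the intersection with covered regions described above.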

Next, we evaluated the consistency of our CNV calling algorithm by assessing the quality metrics for cell line dilutions. Due to the absence of gold standard references with widely acknowledged lists of CNVs, we created references for each cell line by using different and independent runs of samples with 100% tumor purity: six replicates for COLO829, HCC1143, and HCC1937 and four replicates for HCC1395 and NCI-H1770.

For CNV normalization and classification, we used an approach based on the ploidy of the sample and calculated the total number of copies. We normalized total CNVs by ploidy value, thus classifying them into five categories (please see the Bioinformatics procedures section). All the comparisons with references and clinical samples were carried out in these normalized terms. For cell line references, normalized regions were compared between four (NCI-H1770, HCC1395) and six (COLO829, HCC1143, HCC1937) sequencing replicates. For further analysis, only completely concordant regions were selected.

We also performed analytical validation using 52 clinical samples that were sequenced on two platforms: BostonGene and an independent laboratory (Novogene). The Novogene samples were used as references for all the metrics that were calculated; the samples were divided into two categories (low-purity: 20–30%; high-purity ≥ 30%) based on the differences in metrics.

Statistical analysis for test samples (i.e., cell line dilutions, clinical samples) was performed for arm- and gene-level CNVs. An arm-level CNV was defined as a segment that exceeds 50% of the actual arm length. Thus, the total copy number (tCN) of the arm was defined by the presence of arm-level segments. Notably, if the arm was fragmented by a large number of CNVs (e.g., chromothripsis), then it was excluded from the statistical analysis of that sample. A gene-level CNV was defined as a gene that intersects with its corresponding segment by at least 10% of its length. If a gene intersected multiple segments, then the tCN of the longest of those segments was assigned to that gene. This approach allowed us to resolve discrepancies between segment annotations, so that two CNV profiles could be compared objectively.
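The arm- and gene-level definitions can be sketched as follows. Coordinates are treated as half-open intervals on a single chromosome, and only segments that meet the 10% overlap requirement are considered when choosing the longest one; both are assumptions made for this illustration, as are the function names.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length of the overlap between two half-open intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))

def gene_tcn(gene, segments, min_fraction=0.10):
    """gene: (start, end); segments: list of (start, end, tCN).

    Returns the tCN of the longest segment covering at least
    min_fraction of the gene, or None if no segment qualifies.
    """
    g_start, g_end = gene
    g_len = g_end - g_start
    candidates = [
        s for s in segments
        if overlap(g_start, g_end, s[0], s[1]) >= min_fraction * g_len
    ]
    if not candidates:
        return None
    longest = max(candidates, key=lambda s: s[1] - s[0])
    return longest[2]

def arm_tcn(arm, segments, min_fraction=0.50):
    """Returns the tCN of the segment exceeding half of the arm, else None."""
    a_start, a_end = arm
    arm_len = a_end - a_start
    for start, end, tcn in segments:
        if overlap(a_start, a_end, start, end) > min_fraction * arm_len:
            return tcn
    return None  # fragmented arm: excluded from arm-level statistics

print(gene_tcn((100, 200), [(0, 150, 2), (150, 1000, 3)]))
print(arm_tcn((0, 1000), [(0, 300, 1), (300, 600, 2), (600, 1000, 3)]))
```

In the second call no segment covers more than half of the arm, so the arm would be left out of the arm-level comparison for that sample.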

Loss of heterozygosity (LOH) events were defined as segments with a BAF equal to zero, while copy neutral LOH (cnLOH) events were defined as segments that had a BAF equal to zero and a tCN equal to ploidy. Gene-level metrics were calculated for both types of events. Negative events were defined as neutral events in the reference sample (tCN is equal to ploidy), while other events (tCN is not equal to ploidy) were considered as positive events. During the comparison between test and reference samples, we used a straightforward approach in which normalized statuses had to match each other: nCN “gain” for a particular gene/arm in the reference sample had to be “gain” in the test sample to be counted as a true positive, while a “gain” in the reference sample and “amplification” in the test sample resulted in a false-positive event.
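A compact sketch of these definitions is given below. The true-positive and mismatched-positive cases follow the text directly; scoring a reference-positive event that is neutral in the test sample as a false negative, and a reference-neutral event that is non-neutral in the test sample as a false positive, is our assumption of the conventional reading.

```python
def is_loh(baf):
    """LOH: B-allele frequency equal to zero."""
    return baf == 0

def is_cnloh(baf, tcn, ploidy):
    """Copy-neutral LOH: BAF equal to zero and total copy number equal to ploidy."""
    return baf == 0 and tcn == ploidy

def compare_status(reference, test, neutral="neutral"):
    """Score one gene/arm by its normalized status in reference and test."""
    if reference == neutral and test == neutral:
        return "TN"
    if reference == neutral:
        return "FP"   # assumption: event called where the reference is neutral
    if test == neutral:
        return "FN"   # assumption: reference event missed in the test sample
    # Both non-neutral: statuses must match exactly.
    return "TP" if reference == test else "FP"

print(compare_status("gain", "gain"), compare_status("gain", "amplification"))
```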

Analysis of gene expression correlation

We assessed the expression level of ERCC spike-ins for FF samples (37 clinical samples and 18 cell lines) by determining the TPM limit of detection. ERCC transcripts were aligned by Kallisto v0.42.4 to GRCh38.d1.vd1 with the addition of 92 ERCC transcript sequences. After alignment and removal of outliers, the expression values of 89 ERCC transcripts were normalized to TPM, but scaled to a total of 10,000 instead of 1 million. Genes with an expression STD greater than 2 were selected for the comparison between paired FF and FFPE samples processed using poly-A-based or EC-based RNA-seq; this resulted in 1389 genes for comparison. The need to assess highly variable genes arose from the observation that, for stably expressed genes, the correlation between FF and FFPE RNA-seq data may appear artificially low, even though direct comparisons showed that gene expression levels remained consistent across the cohort of samples. We then compared the expression levels of 99 genes in FFPE samples from 51 patients with qPCR measurements.
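The rescaled normalization of the spike-in transcripts can be sketched as a TPM-style calculation whose values sum to 10,000 rather than 1 million. The inputs (estimated counts and transcript lengths) and the transcript labels are hypothetical; whether raw or effective lengths were used is not stated in the text.

```python
def scaled_tpm(counts, lengths_bp, total=10_000):
    """Length-normalize counts, then scale so that all values sum to `total`."""
    rates = {t: counts[t] / lengths_bp[t] for t in counts}
    denom = sum(rates.values())
    return {t: total * r / denom for t, r in rates.items()}

# Two placeholder spike-in transcripts of equal length.
values = scaled_tpm({"ERCC-A": 100, "ERCC-B": 300}, {"ERCC-A": 1000, "ERCC-B": 1000})
print(values)
```

Because only 89 transcripts share the total, a smaller scaling constant keeps the resulting values in a range comparable to that of endogenous genes.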

The reproducibility (CV for gene expression and gene signatures) assessment was performed on 10 FFPE clinical samples¹⁴. Each of the ten samples was processed as three technical replicates, and expression-based CV was assessed within each sample’s replicates separately.

Reproducibility of tumor microenvironment (TME) subtype classification

TME subtypes were determined using methods previously described by Bagaev et al.¹⁴. Seven clinical samples from patients with solid tumor diagnoses were used for reproducibility assessment. Each sample was prepared in three replicates within one day and on two additional days (five replicates in total). Each set of replicates was prepared with 15, 20, and 50 ng of input starting material, resulting in a total of 15 replicates for each clinical sample.

Assessment of fusion genes detection performance

The F1-score was calculated for different FFPM thresholds. Precision was assessed on synthetic reference materials (Myeloid Fusion Reference, Seraseq FFPE Tumor Fusion RNA v4 Reference Material, Horizon Reference); sensitivity was assessed on the same synthetic reference materials and on cell lines (NCI-H2228, MCF-7, K562, RT-4, THP-1, A-673, SK-BR-3, U118MG, BT-20). We determined the LOD as a function of the maximum fusion expression level, based on our cell line dilution experiments.
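The threshold scan can be sketched as below. For simplicity the sketch scores precision and sensitivity on a single sample set, whereas the study used separate sets for the two metrics; the fusion names and FFPM values are placeholders.

```python
def f1_at_threshold(calls, truth, threshold):
    """calls: dict fusion -> FFPM of every candidate; truth: set of expected fusions."""
    kept = {f for f, ffpm in calls.items() if ffpm >= threshold}
    tp = len(kept & truth)
    fp = len(kept - truth)
    fn = len(truth - kept)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def best_threshold(calls, truth, thresholds):
    """Threshold with the highest F1 (first one wins on ties)."""
    return max(thresholds, key=lambda t: f1_at_threshold(calls, truth, t)[2])

calls = {"GENE1--GENE2": 5.0, "GENE3--GENE4": 0.3,
         "GENE5--GENE6": 0.05, "GENE7--GENE8": 0.08}
truth = {"GENE1--GENE2", "GENE3--GENE4", "GENE9--GENE10"}
print(best_threshold(calls, truth, [0.01, 0.1, 1.0]))
```

A very low threshold admits low-support false positives, while a very high one discards weakly expressed true fusions, so F1 peaks at an intermediate value.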

Three different types of samples were used to assess the performance of our LDT’s detection of fusion: (i) FF samples; (ii) FFPE samples extracted using a QIAamp DNAMaxwell kit, and (iii) FFPE samples extracted by a Qiagen kit. Fusions detected by RNA-seq were confirmed by real-time qPCR reactions with primers designed to detect fusions or fusion products.

Statistics and reproducibility

Analytical accuracy

The accuracy of our NGS LDT was calculated by comparing our results to the reference results (either reference sets or orthogonal data). Accuracy was evaluated with a standard formula that reflects the percentage of correct predictions in the entire set:

$$\frac{{True\; Positive}+{True\; Negative}}{{True\; Positive}+{False\; Positive}+{True\; Negative}+{False\; Negative}}$$

(7)

Analytical sensitivity/recall (PPA, or true-positive rate)

Sensitivity was evaluated with a standard formula that can be interpreted as the percentage of correct positive predictions in the reference data set:

$$\frac{{True\; Positive}}{{True\; Positive}+{False\; Negative}}.$$

(8)

Analytical specificity (NPA, or true-negative rate)

Specificity was evaluated with a standard formula that reflects the percentage of correct negative predictions in the reference data set:

$$\frac{{True\; Negative}}{{True\; Negative}+{False\; Positive}}$$

(9)

Precision (PPV)

Precision was evaluated with a standard formula that reflects the percentage of correct positive predictions in the reported data set:

$$\frac{{True\; Positive}}{{True\; Positive}+{False\; Positive}}$$

(10)
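Equations (7) to (10) can be computed together from the four confusion counts, as in the minimal sketch below; the counts in the example are hypothetical.

```python
def _ratio(numerator, denominator):
    """Safe division that returns NaN for an empty denominator."""
    return numerator / denominator if denominator else float("nan")

def validation_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, specificity and precision from confusion counts."""
    return {
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),   # Eq. (7)
        "sensitivity": _ratio(tp, tp + fn),               # Eq. (8), PPA
        "specificity": _ratio(tn, tn + fp),               # Eq. (9), NPA
        "precision": _ratio(tp, tp + fp),                 # Eq. (10), PPV
    }

print(validation_metrics(tp=90, fp=5, tn=900, fn=10))
```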

Jaccard index

Jaccard index was calculated for side-by-side comparison of CNV profiling of clinical samples via a standard formula where the length of segments was evaluated in nucleotides:

$$\frac{{Intersection\; Segments}}{{Intersection\; Segments}+{Assay\; Specific\; Segments}+{Orthogonal\; Specific\; Segments}}$$

(11)
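A minimal sketch of Eq. (11) on nucleotide lengths is shown below. It assumes that each profile is reduced to a list of non-overlapping half-open intervals for one CNV class on one chromosome; how segments of different classes were combined in the study is not specified here.

```python
def total_length(intervals):
    """Summed length of non-overlapping (start, end) intervals."""
    return sum(end - start for start, end in intervals)

def intersection_length(a, b):
    """Nucleotides shared between two interval lists."""
    return sum(
        max(0, min(e1, e2) - max(s1, s2))
        for s1, e1 in a
        for s2, e2 in b
    )

def jaccard(assay, orthogonal):
    """Shared length divided by shared plus assay-only plus orthogonal-only length."""
    inter = intersection_length(assay, orthogonal)
    assay_only = total_length(assay) - inter
    orthogonal_only = total_length(orthogonal) - inter
    denom = inter + assay_only + orthogonal_only
    return inter / denom if denom else float("nan")

print(jaccard([(0, 100), (200, 300)], [(50, 100), (200, 350)]))
```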

VAF correlation coefficient

Standard Pearson’s correlation coefficient was calculated for test and reference data.

Coefficient of variation for gene-level expression

$${Coefficient\; of\; Variation}=\frac{{Standard\; Deviation}}{{Mean}},$$

(12)

for each gene expression as a measure of reproducibility.

Gene signature variation assessment

$${STD2Size90}=\frac{{Standard\; Deviation}}{{q}_{95}-{q}_{5}}$$

(13)

for each gene signature for the corresponding samples, where q is a quantile of signature expression distribution for the corresponding TCGA cohort.
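Both reproducibility measures can be sketched with the Python standard library. The use of the sample standard deviation and of the "inclusive" quantile method are assumptions; the replicate values and the reference cohort below are synthetic.

```python
from statistics import mean, quantiles, stdev

def coefficient_of_variation(values):
    """Standard deviation of replicate values divided by their mean."""
    return stdev(values) / mean(values)

def std2size90(replicate_scores, cohort_scores):
    """Replicate standard deviation relative to the 5th-95th percentile
    range of the signature in a reference cohort."""
    cuts = quantiles(cohort_scores, n=20, method="inclusive")
    q5, q95 = cuts[0], cuts[-1]
    return stdev(replicate_scores) / (q95 - q5)

replicates = [10, 12, 14]           # three technical replicates
cohort = list(range(101))           # synthetic reference distribution, 0..100
print(coefficient_of_variation(replicates), std2size90(replicates, cohort))
```

Scaling by the cohort range expresses replicate noise as a fraction of the biologically observed spread of the signature, which makes signatures with different dynamic ranges comparable.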

Clinical patient analysis

Cohort of patients

A total of 2230 clinical samples (1996 solid tumors and 234 hematological malignancies) for which both WES and RNA-seq data were available were included in this study (Supplementary Table 19). Patient samples were collected at the BostonGene laboratory between 2021 and 2024. All samples were processed in-house following the same standard operating procedures. Each patient provided informed consent for the use of their tumor and normal samples and associated clinical data for research purposes. The use of clinical samples was conducted in accordance with the Declaration of Helsinki and has been granted exemption from ethics approval by the Biomedical Research Alliance of New York (BRANY) Institutional Review Board (IRB; #22-12-938-853).

For each patient, sets of somatic SNVs and INDELs with VAF assessment, CNVs with absolute and normalized copy numbers, and fusion transcripts with FFPM and expression values across the entire transcriptome were obtained.

Panorama of driving events

The mutation rate of a particular gene for each diagnosis group was investigated. The percent assigned to each displayed gene for each diagnosis was assessed as a cumulative value of all alterations present in the gene (if the gene for a particular patient contained 2 or more SNVs or CNVs, it was counted only once). Thresholds for different events included 0.2 for somatic SNVs/INDELs, 0.2 for CNV events (both amplification and deletion), 0.05 for fusion events, and 0.03 for clinically pertinent germline variants. For somatic SNVs and INDELs, intronic and silent variants were filtered out.
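The once-per-patient counting rule can be sketched as follows. Interpreting the listed thresholds as the minimum fraction of patients required for a gene to be displayed is our assumption, and the events in the example are invented.

```python
from collections import defaultdict

def gene_alteration_rates(events, n_patients):
    """events: iterable of (patient_id, gene); repeated events in the same
    gene of the same patient are counted once."""
    carriers = defaultdict(set)
    for patient, gene in events:
        carriers[gene].add(patient)
    return {gene: len(patients) / n_patients for gene, patients in carriers.items()}

def genes_to_display(rates, threshold):
    """Genes whose alteration rate reaches the display threshold."""
    return sorted(g for g, r in rates.items() if r >= threshold)

events = [("P1", "TP53"), ("P1", "TP53"), ("P2", "TP53"), ("P3", "KRAS")]
rates = gene_alteration_rates(events, n_patients=10)
print(rates, genes_to_display(rates, threshold=0.2))
```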

Overexpression calculation

The percentage of overexpression of a particular gene for a diagnosis was calculated as the number of samples with expression of the gene higher than 87% (more than the STD of normal distribution) of the cohort with the same diagnosis divided by the overall number of samples in the cohort (internal normalization).

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Results

Development of an integrated RNA and DNA sequencing assay

A complementary WES and RNA-seq bioinformatics pipeline was optimized for an integrated comprehensive NGS-based assay (Methods). Matched-normal WES was used to accurately identify tumor purity, ploidy, somatic variants, TMB, MSI, and CNVs⁵⁴. As nucleic acid degradation is a well reported challenge during extraction from FFPE tissues^55,56, an exome capture (EC) workflow was used to increase DNA- and RNA-seq quality from degraded FFPE material (Fig. 1a)⁵⁷. To replicate standard poly-A mRNA enrichment from fresh frozen (FF) tissue, 17 Mb of untranslated regions (UTRs) were added to the RNA EC protocol (Fig. 1a, Methods), resulting in even gene body coverage and full transcript capture (Supplementary Fig. 1a, b)⁵⁸.

**Fig. 1: Integrated WES/RNA-seq assay from the same tissue.**

To ensure WES and RNA-seq integration and exclude cross-contamination, SNV concordance (>70%) and HLA genotyping were assessed across normal and tumor WES and tumor RNA-seq samples (Fig. 1b). While exonic germline variants were detected across all three samples, utilization of matched-normal samples allowed separation of somatic and germline variants with high precision. RNA-seq further confirmed expressed driver mutations (Fig. 1b).

To improve the overall performance of the WES assay, EC protocols covering 19,047 genes were optimized to maintain low off-target and duplication rates (Fig. 1c). The duplication rate (27.2 ± 0.90%) was used to assess the initial library complexity and quality of the input material for WES, and the off-target rate (29.0 ± 0.28%) served as a probe specificity indicator (Methods). Low complexity regions, homopolymeric regions (≥9 bp), and highly polymorphic genes were excluded to minimize false-positive SNVs and INDELs (Supplementary Tables 1–3). Less than 3% of the target exome was characterized as difficult to align and filtered from the analysis (Fig. 1c).

For RNA-seq, the proportion of on-target reads was prioritized. After removing low-quality and unmapped reads, 91% were from coding sequence (CDS) exons and UTRs, while 9% were in introns and intergenic regions (Fig. 1c). The total number of reads (read count) determines the limit of detection (LOD) of expression⁵⁹. Transcript diversity plateaued at 50 M paired-end (PE) reads (Fig. 1d) with <20% CV for transcripts expressed above 1 transcript per million (TPM; Fig. 1e, Methods).

Final performance assessment for each biomarker group included: (1) technical validation using reference standards; (2) concordance with orthogonal assays on a reference cohort of patients with known alterations; (3) assay application in clinical practice and comparison with other large genomic studies (Fig. 1f).

Unique exome-wide WES/RNA integrated somatic reference standards

A major challenge in the analytical validation of WES assays is the lack of reference exomes for somatic variant calling to establish accurate exome-wide performance characteristics^60,61. To address this limitation, somatic mutation reference standards were established using commercial cell lines (COLO829, HCC1143, HCC1937, HCC1395, and NCI-H1770) with serial dilutions (10%–100%; Fig. 1g, Methods, Supplementary Table 4). Reference cancer cell lines were heterogeneous and highly polyploid, with somatic variants spanning 5% to 100% VAF⁵². Tumor purity was concordant with VAF for true-positive calls; therefore, false positives were identified by VAF and tumor purity discordance (Supplementary Figs. 1c–g). Similarly, true variants had biologically meaningful trinucleotide context, while FFPE artifacts (i.e., polymerase errors) exhibited random context, indirectly supporting separation of true- and false-positive calls (Supplementary Figs. 1c–g)^62,63,64. The defined set of mutations was additionally confirmed by evaluating corresponding SNVs from RNA-seq (Fig. 1h, Supplementary Fig. 2a–e). While 35% of reference variants were found in RNA-seq, only 1% of variants below the thresholds intersected (Fig. 1h, Supplementary Fig. 2a–e). The VAF of expressed variants between RNA-seq and WES had varying correlations (r = 0.2–0.6 for 10–100% purity, Supplementary Fig. 2f–k), likely explained by allele-specific expression⁶⁵ and differential expression of cancer-specific genes (Supplementary Fig. 2l–n)⁶⁶. RNA-seq VAF of genes overexpressed in cancerous tissue can be high at low purities (<30%; Supplementary Fig. 2m).

The final set contained 3042 variants and up to 18,354 wild-type (WT) genes for each cell line, expanding previous limited sets (Supplementary Tables 4–5)⁵². The SNVs and INDELs were distributed uniformly across the exome with an average density of 43 mut/Mb, providing the most comprehensive reference somatic exome to date (Supplementary Fig. 2o). The reference is available at https://github.com/BostonGene/Somatic_reference_standards/⁶⁷.

Optimized thresholds for accurate variant calling

A minimum coverage depth of 80 reads was required to achieve F1-scores of >0.90 for cell lines with ≥20% purity (Fig. 1i, Supplementary Fig. 3a, b). A similar analysis of germline variant calling established a minimal coverage depth of 20 reads (Fig. 1j). A median WES target coverage of 150X yielded >85% completeness at the 80 reads depth and stable F1 scores (Fig. 1k). At 150X median coverage, clinically actionable genes achieved a median coverage of 240X, lowering the LOD to 2% VAF for these genes (Supplementary Table 6). At 100X normal WES, germline variants reached >0.95 F1-score with a threshold of 20 reads per nucleotide (Fig. 1j–k).

Precise assessment of normal/tumor depth ratio and B-allele frequency (BAF) is instrumental in the accurate determination of CNVs²⁴. At 150X, the variance of depth ratio plateaued and BAF stabilized (Fig. 1l), while the F1-score slightly improved with higher coverage (Supplementary Fig. 3a, b). A median coverage of 150X with high overall completeness appeared to be optimal for clinical tumor/normal matched variant calling and was significantly lower than the coverage typically required by targeted tumor-only assays¹.

Analytical validation of RNA-seq gene expression

DNA contamination is difficult to identify in RNA libraries and can result in abnormal distribution of gene expression. Additional DNase treatment steps can mitigate DNA contamination in RNA-seq samples (Fig. 2a). A positive dependence was observed between DNA contamination and the proportion of sense strand reads (Fig. 2b), which could serve as an indicator of contamination in strand-specific RNA-seq.

**Fig. 2: Development and validation of gene expression calling pipelines.**

The RNA-seq pipeline was tested on commercial references ERCC Spike-In Mixes (89 foreign transcripts) admixed in RNA extracted from 18 cell lines and 55 tumor samples (r = 0.96–0.97; Fig. 2c, d)^68,69. Although expression values from EC and poly-A RNA-seq cannot be compared directly due to batch effects stemming from technical differences^70,71, gene expression across 110 paired FFPE and FF samples exhibited strong correlations (Fig. 2e, f, Supplementary Table 7). The median correlation of 1389 variably expressed genes (STD > 2) reached 0.88 (Fig. 2f).

To orthogonally validate FFPE-derived RNA-seq expression values, qPCR was performed on matched samples for 99 genes from different genomic regions in 51 clinical samples (median r = 0.85; Fig. 2e, g, h, Supplementary Table 8). FFPE RNA-seq was stable (r = 0.99, p = 1.0e-308) with strong correlations between replicates from different days (Fig. 2i). Using multiple replicates, the RNA-seq LOD was established as 1 TPM, where gene expression CV was <20%. Below 1 TPM, gene expression reproducibility was low (Fig. 2j). Overall, the CV for all genes expressed at >1 TPM was as low as 3.6% and 2.4% for single genes and gene signatures, respectively (Fig. 2k). RNA from two COLO829 FFPE blocks was sequenced over four months to assess assay stability over time (Fig. 2l). Expression analysis demonstrated high inter-assay reproducibility with <5% variability over time for cell lines from the same passage (Fig. 2l). We recommend careful preparation and passage annotation of cell line RNA prepared as a reference control.

Reproducibility of gene signature scores and TME classification

Gene signatures can capture functional changes and complex properties of cancer tissue and can be applied for tumor microenvironment (TME) classification across cancer types^{72,73,74,75,76,77}. Bagaev et al. described pancancer TME subtypes associated with immunotherapy response: Immune-Enriched (IE), Immune-Enriched/Fibrotic (IE/F), Fibrotic (F), and Immune-Depleted (D)¹⁴.

To further validate the feasibility of classification using these TME subtypes in clinical settings, tumor samples (10, 20, and 50 ng input) were sequenced five times (Fig. 2m)¹⁴. Gene signature (ssGSEA) scores were stable within and between days, with 2.8% and 3.1% variance, respectively (Fig. 2n), resulting in high reproducibility of TME classification probability scores across all RNA inputs (STD < 2%, Fig. 2o). As shown in the corresponding heatmaps for a range of typical and atypical samples (Fig. 2p, Supplementary Fig. 4a), the overall reproducibility of signatures and TME classification across all tested samples supported the clinical application of RNA-seq for complex diagnostic signatures. Applying the classification to a cohort of 1399 clinical samples, we confirmed that the distribution of gene signatures (Supplementary Fig. 5a), mean Z-score of signatures (Supplementary Fig. 5b), and subtype distribution within major cancer types (Supplementary Fig. 5c) all concurred with previously reported findings, supporting reproducibility³⁹.

Gene fusion calling accuracy depends on intrinsic transcript expression level

Fusions were identified by junction and spanning reads after removing artifacts and misalignments (Fig. 3a). Expression values for fusion transcripts are generally higher than WT transcripts⁷⁸, which can upshift fusion transcript exon coverage (Fig. 3b); however, if WT gene expression is high, the impact of fusion transcripts on coverage may not be observed. Detection of fusions depended on the transcript expression, spanning and junction reads, and overall coverage. A minimum of 30 M PE reads was required to detect the majority of tested fusions, with fusion-related reads plateauing at 50 M PE reads (Fig. 3c). The pipeline was benchmarked against other approaches. For low-expressed fusions at 30% purity, the recommended approach led to 100% sensitivity, compared to 88.9% and 77.8% for Arriba (v2.4) and STAR-Fusion, respectively (Supplementary Fig. 6a–c)^30,32, with similar performance for highly expressed events (Supplementary Fig. 6a, b).

**Fig. 3: Validation of novel gene fusion calling from RNA-seq.**

The first technical validation step included a diverse set of reference materials and cell lines (Methods; Fig. 3d, Supplementary Table 9) and resulted in 0.98 overall analytical sensitivity for detection of 83 various fusions (Fig. 3d). The properties of the fusion breakpoint regions varied significantly, especially the expression levels of gene partners of the fusion transcript (Fig. 3d). Well characterized cell lines and reference materials (Methods) were used to assess sensitivity, precision, and F1 scores, indicating that a 0.1 fusion fragments per million (FFPM) LOD maximized fusion detection performance for FF (F1 = 0.95 ± 0.02) and FFPE (F1 = 0.91 ± 0.01) samples (Fig. 3e, f). Four cell lines were evaluated to establish the LOD for fusions with varying maximum FFPMs (Fig. 3g). For transcripts with >2 FFPM, the LOD was 10% purity, while fusions with 1–2 FFPM had a 20–30% purity LOD (Fig. 3h). Fusions with low expression values (<1 FFPM) required an estimated 30–50% purity LOD. The LOD depended on tumor content⁴⁸ and fusion expression levels.

Orthogonal qPCR validation included 115 fusions in 63 patient samples (Fig. 3i and Supplementary Tables 10–11), resulting in 98.0 ± 1.7% sensitivity, 99.0 ± 0.9% precision, and >99.9% specificity (Fig. 3i). Notably, whole transcriptome RNA-seq can be used for de novo gene fusion calling, overcoming the limitations of predesigned probe sets. A total of 69 de novo fusions were confirmed by qPCR, including RPUSD3--NTRK2, which was detected in a uterine sarcoma (Fig. 3j). Evaluation of 19 fusions in 8 clinical samples over 3 days generated high reproducibility (100%) for fusions (>1 FFPM; Fig. 3k).

Bulk RNA-seq can also be used to assemble V(D)J recombination fusion transcripts of the CDR3 region of T- (TCR) and B-cell receptors (BCR)^33,78. Minimal differences were observed when BCR fractions were compared from FF and FFPE on B cell lines and lymphoma samples, demonstrating that EC is comparable to poly-A RNA-seq for BCR assembly (Fig. 3l and Supplementary Table 12).

Analytical validation of exome-wide somatic SNV and INDEL variant calling

Initial somatic SNV and INDEL variant calling was performed by applying Strelka2¹⁶ and standard filtration⁷⁹, which rendered low F1 scores (Supplementary Fig. 7a–f, Methods). Variant calling between FF and FFPE samples was tested using the reference set of 3042 mutations, resulting in concordant VAFs (r = 0.94, p = 1.0e-308) and high precision (95%; Fig. 4a). However, 176 of the 239 variants were confirmed as true variants in pileup files (Methods), leading to an adjusted precision of 98%, indicating the filtering approach removed the majority of FFPE artifacts.

**Fig. 4: Validation of SNV and INDEL calling.**

Technical validation included analysis of 558 previously established somatic variants for COLO829, HCC1143, HCC1937, and HCC1395⁵², which yielded high sensitivity (98%) and concordance with the previously reported VAFs (r = 0.87, p = 4.9e-177; Fig. 4b). TMB measured in cell lines was dependent on tumor purity but stable above 20% (Fig. 4c).

Next, the LOD and sensitivity for SNV/INDEL detection were assessed on the 3042 reference variants from COLO829, HCC1143, HCC1937, HCC1395, and NCI-H1770 at serial dilutions (10%–100% purity; Fig. 4d, e and Supplementary Table 13). Samples with ≥20% purity demonstrated sensitivities of 95.3 ± 0.89% for SNVs and 85.2 ± 3% for INDELs, while those with ≥30% purity had sensitivities of 97.1 ± 0.8% for SNVs and 89.9 ± 1.8% for INDELs (Fig. 4d, e and Supplementary Table 13). Across all cell line dilutions, >95% precision and 99.8 ± 0.04% specificity were observed. Remarkably, precision remained stable across all purities (Fig. 4d, e, and Supplementary Table 13), and the majority of mutations (>50%) were missed when purity was ≤10% (Supplementary Fig. 8a). Given these observations, 20% purity and 5% VAF LODs were established.

To achieve high precision (>95% across all purities), additional quality filters were introduced (Methods). The pipeline was benchmarked with Mutect2 (GATK v4.1)¹⁹ and Illumina Dragen (v4.3)²⁰ in somatic mode using recommended filters (Supplementary Fig. 7a–f). The proposed pipeline yielded higher F1-scores for SNVs (F1 = 95.3%) and INDELs (F1 = 91.9%) compared to other tools: Strelka2 (v2.9; SNVs F1 = 73.7%; INDELs F1 = 76.4%), Mutect2 (SNVs F1 = 68.9%; INDELs F1 = 28.4%), and Dragen (v4.3; SNVs F1 = 91.6%; INDELs F1 = 31.1%) at 50% tumor purity (Supplementary Fig. 7a–f).

For orthogonal validation, 53 clinical tumor samples of various tissue origins (Supplementary Table 14) were sequenced by an independent CLIA-certified laboratory. Strong VAF correlations for the 16,553 SNVs (r = 0.92, p = 1.0e-308) and 2046 INDELs (r = 0.82, p = 1.0e-308) were observed (Fig. 4f, g). Although 2842 variants were initially discordant, 2505 variants were manually identified as low-VAF true positives (Methods), resulting in 98.8% precision and 99.1% sensitivity. Exome-wide performance on all reference and tissue samples resulted in a sensitivity of 96.1% for SNVs and 96.9% for INDELs for all genes, reaching >99% for clinically actionable alterations (Supplementary Table 15).

Comprehensive clinical tests require internal controls with diverse features, such as COLO829. Sequencing COLO829 over 4 months demonstrated consistent exome-wide mutation calling, supporting its use as a reference material in NGS testing (Fig. 4h).

Calculating exome-wide TMB and MSI

TMB and MSI are widely used in clinical practice as predictive and prognostic biomarkers for many cancer types^{3,80,81,82,83,84,85}. TMB was dependent on tumor purity, with an LOD of 20% purity (Fig. 4c, Methods). TMB measured on Seraseq® gDNA TMB (30% and 100%) reference materials (Methods) strongly correlated with the expected values of the reference materials (100%: r = 0.9997, p = 6.1e-8; 30%: r = 0.998, p = 1.5e-8; Supplementary Fig. 8b). Notably, assessment of patient and cell line samples indicated that the optimized thresholds for FFPE filtration (Methods) resulted in negligible differences in the TMB for FF materials (r = 0.99, p = 4.2e-64; Supplementary Fig. 8c). TMB from 53 clinical samples (20–90% tumor purity; Supplementary Table 14) was concordant between our workflow and an independent reference laboratory (r = 0.99, p = 1.7e-41; Fig. 4i).

To determine the MSI LOD, serial dilutions of MSI-high cell lines, LoVo, DLD-1, HCT15, and HCT16, were prepared. Similar to TMB, MSI scores depended on purity (Supplementary Fig. 8d). For samples with >20% purity, our assay achieved 100% accuracy against orthogonal results (Supplementary Table 15). Interestingly, among the 53 clinical samples, TMB-high samples were generally MSI-high (Fig. 4j). MSI patient samples analyzed in an independent laboratory (Supplementary Table 16) yielded up to 100% sensitivity and specificity.

Robust single-copy CNV and LOH calling

An adopted algorithm for copy number detection and normalization was applied to account for sample ploidy during CNV classification (amplifications, gains, shallow deletions, deletions, and neutrals; Methods; Fig. 5a). CNVs were interpreted as either amplifications or deletions, depending on sample ploidy (Fig. 5b). For example, a segment with three copies reported as neutral in a triploid sample would appear as a gain in diploid and shallow deletion in tetraploid samples (Fig. 5b).
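The ploidy dependence in this example can be illustrated with a deliberately reduced sketch that reports only whether a segment lies above, at, or below the sample ploidy. The cut-offs that separate gains from amplifications and shallow deletions from deletions are defined in the Methods and are not reproduced here; rounding ploidy to the nearest integer is an assumption.

```python
def direction_vs_ploidy(total_copies, ploidy):
    """Position of a segment's total copy number relative to sample ploidy."""
    baseline = round(ploidy)  # assumption: integer baseline copy number
    if total_copies > baseline:
        return "above ploidy"
    if total_copies < baseline:
        return "below ploidy"
    return "neutral"

# The same three-copy segment in diploid, triploid and tetraploid samples.
for ploidy in (2, 3, 4):
    print(ploidy, direction_vs_ploidy(3, ploidy))
```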

**Fig. 5: Assessing copy number variations.**

Validation on commercial reference materials showed that the algorithm detected amplifications in the three genes in Seraseq® Breast and Brain-Lung CNV Mixes and Brain-Lung-Breast (BLB) Mix (Supplementary Table 17). Then, using a published set of 5700 genes from COLO829⁵³, we demonstrated >98% sensitivity at ≥20% purity (Fig. 5c). To further validate exome-wide CNV calling, a unified reference exome for CNVs covering 19,047 genes was prepared for FFPE COLO829, HCC1143, HCC1937, HCC1395, and NCI-H1770 cell lines (Fig. 5d, Supplementary Fig. 9a, Methods). BAF (r = 0.95, p = 1.0e-308) and depth ratio (r = 0.99, p = 1.0e-308) strongly correlated between FF and FFPE tissues, confirming the effectiveness of CNV calling from both sample types (Fig. 5d, e). Exome-wide CNV calling showed high precision (>97.6% and >98%), sensitivity (>81% and >94%), specificity (>96.5% and >96.4%), and accuracy (>91.8% and >96.1%) at 30% tumor purity for arm- and gene-level CNVs, respectively (Fig. 5f, Supplementary Fig. 9b, and Supplementary Table 18). Gene-level LOH and copy-neutral LOH resulted in >85% sensitivity and precision at 20% purity and >98% at 30% purity (Fig. 5g, Supplementary Fig. 9c, and Supplementary Table 18).

The pipeline achieved the highest F1-score (>91%), sensitivity (>86%), and specificity (>91%) across all purities compared to the performance of standard Sequenza (v2.1)²⁴, FACETS (v0.5.14)²⁵, and Dragen (v4.3; Panel of Normals)²⁰, which showed sensitivities of >78%, >61%, and >31%, respectively (Supplementary Figs. 10a–c). Specificity for the other tools generally remained below 80% (Supplementary Fig. 10c).

Orthogonal testing of 52 tumors in an independent laboratory revealed high exome-wide concordance per nucleotide (Jaccard index = 98.5%; Methods, Fig. 5h, Supplementary Fig. 11a). Gene-level CNVs at >30% purity achieved median specificity of >95% and sensitivity of >92% (Fig. 5i). Using COLO829, high reproducibility with only mild variations in depth ratio was observed over 3 months (Supplementary Fig. 11b, c).

Clinical applications of the integrated whole exome and transcriptome Tumor Portrait^TM assay

Integrated WES and RNA-seq were applied to 2230 real-world clinical samples to assess mutation rates and biomarker overexpression (Supplementary Table 19, Methods). The analysis included alterations and RNA overexpression across tumor types (Fig. 6). On average, each sample had 173 SNVs, 8 INDELs, 3410 amplifications, 4174 deletions, 5 fusions, and 4 overexpressed antibody-drug conjugate (ADC)-related genes. Polyploidy was observed in 30.3% of tumors. TMB-high status (>10 mut/Mb) was found in 6.8% of cases, most commonly in cutaneous melanoma and uterine cancers. Clinically actionable mutations were found in 98% of samples, and ADC-related gene overexpression in 89%. Mutation frequencies for TP53 (43%), KRAS (12%), PTEN (4.5%), and PIK3CA (11.2%) matched published large-scale data (Supplementary Table 20)^86,87,88. Notable fusions included ERG in 29.2% of prostate cancers and RB1 in 9.5% of uterine tumors.

**Fig. 6: Analysis of genomic events of clinical samples.**

To support therapeutic targeting, we evaluated RNA overexpression of surface markers, including potential ADC targets (Fig. 7a)^89,90. ERBB2 was overexpressed in 73% of HER2+ breast cancers, and its expression was concordant with amplification rate. Other commonly overexpressed genes included: FOLR1 in 67% of kidney and 77% of ovarian cancers; PMEL in 82% of melanoma; and TACSTD2 in 73% of bladder cancer. RNA data also enabled classification of TME subtypes (Fig. 7b), informing immunotherapy decisions¹⁴.

**Fig. 7: Analysis of expression patterns of clinical samples.**

Interpretation of somatic mutations and germline variants with RNA-seq in clinical practice

While DNA remains the primary source for detecting variants, RNA-seq provides complementary information for clinical interpretation and validation. We found 37.1% of somatic SNVs and INDELs were expressed in RNA, rising to 63.9% in clinically actionable genes (Fig. 8a, b and Supplementary Fig. 12a, b). RNA-seq also confirmed 55.5% of germline variants, with 2.1% showing altered zygosity compared to WES, enabling allele-specific expression analysis. Most expressed variants were missense (63%) or silent (24.2%), but a small percentage were nonsense (5%) and frameshift (3%; Supplementary Table 21), indicating some structurally impactful mutations escape nonsense-mediated decay and may affect protein function (Fig. 8c).

**Fig. 8: Concordance of genomic events and RNA-seq data.**

RNA-seq VAF often exceeded DNA-derived VAF, with 16.7% of variants showing increased RNA to DNA VAF ratios and only 3.2% showing lower ratios (Fig. 8a). Seventeen genes, including CDKN2A, BCL2, MYC, and CCND1, had increased RNA VAF, which can be associated with amplification or selective expression, providing a new avenue for clinical interpretation (Supplementary Table 22 and Fig. 8a).

RNA-seq also rescued low-VAF subclonal mutations missed in DNA due to coverage thresholds^91,92. Rescued variants had 2.7x higher VAF and 2.4x higher coverage in RNA (p = 0.1e-79 and 0.1e-9, respectively; Fig. 8d). Across 2230 patients, RNA-seq integration recovered 18,231 additional missense mutations (+4.5% to WES), 2962 splice site variants (+0.7%), and 505 frameshift mutations (+0.12%).

Interpretation of structural variants by RNA-seq

Clinically actionable CNV events correlated with expression levels in a tissue-specific manner (Methods, Fig. 8e)^93,94. We assessed Spearman correlations for recurrent genes with CNVs, including CDK4 (r = 1.0, p = 1.4e-241) and CCNE1 (r = 1.0, p = 1.4e-24) in sarcoma and ERBB2 (r = 0.93, p = 2.3e-2) and CCND1 (r = 1.0, p = 1.4e-24) in breast cancer (Fig. 8e, f and Supplementary Fig. 13a).
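
For readers who want to reproduce this type of analysis, the sketch below computes a Spearman correlation between gene copy number and expression using only the Python standard library. It is an illustration under stated assumptions rather than the code used in this study: the helper names and the per-sample values are hypothetical, and no p-value is computed.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: the Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical per-sample values for a single gene.
copy_number = [2, 2, 3, 4, 6, 9]
expression_tpm = [10.0, 12.0, 15.0, 30.0, 55.0, 140.0]

rho = spearman_rho(copy_number, expression_tpm)
print(f"Spearman rho = {rho:.3f}")
```

Working on ranks rather than raw values makes the statistic insensitive to the strongly skewed distribution of expression values, which is one reason a rank correlation suits copy number versus expression comparisons.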

We found evidence of overexpression of the fused oncogene target and the promoter donor gene in TMPRSS2–ERG fusions in prostate cancer and DNAJB1–PRKACA fusions in hepatobiliary cancer, where ERG and PRKACA are known drivers (Fig. 8g and Supplementary Fig. 13c, Methods)^95,96,97. Further analysis across the 2230 patient samples found previously unreported fusion events, with 57 cases of potentially clinically actionable events needing further clinical annotation. The described approach could be used as an additional interpretation tool for unreported oncogenic fusions.

Discussion

As the landscape of precision oncology continues to evolve with ongoing biomarker discovery and development of targeted therapies^98,99, integrative analysis of RNA signatures and DNA biomarkers from a single sample is becoming increasingly important in clinical decision-making^100,101,102. Unlike targeted assays covering a limited number of genes, WES and RNA-seq assays can identify de novo events, novel fusions, and gene expression levels, thereby expanding the repertoire of clinically actionable findings^14,100,103. RNA-seq may also serve as a valuable immunohistochemistry alternative for uncovering ADC targets^90,104. Complex gene signatures are already widely used in treatment response prediction models to estimate probabilities of immunotherapy outcomes in research settings⁹. Importantly, our analysis herein demonstrated the reproducibility of our TME subtype classification, highlighting its clinical utility.

To address challenges associated with the longitudinal dynamics of RNA expression that complicate reproducibility¹⁰⁵, we used COLO829 cells as a sequencing control to monitor expression stability over time. As expected, our results revealed stable gene expression within a single batch of extracted RNA, indicating the suitability of COLO829 cells as a reference.

Whole transcriptome RNA-seq can identify nonconventional fusions¹⁰⁶. We tested 171 gene fusions from synthetic materials, cell lines, and clinical samples, including previously unreported events with potential clinical relevance (e.g., RPUSD3–NTRK2), to uncover RNA-seq parameters that may influence fusion detection. Our analysis showed high specificity (>99.9%) and sensitivity (96%) overall for fusion detection (Supplementary Table 15). Moreover, tumor bulk RNA-seq allowed simultaneous assembly of V(D)J fusion transcripts of B and T cells within the TME. These findings indicated FFPE-derived RNA-seq can reliably identify complex and previously undetected fusion events in a clinical setting.

Traditional validation guidelines were designed for small targeted panels focused on known hotspots^1,107,108, often using diluted platinum cell line variants to estimate VAF LOD¹⁰⁸. However, such models fail to capture tumor heterogeneity and purity^109,110,111,112. Standard false-positive metrics, such as specificity calculated per nucleotide, inflate performance estimates and are not scalable for exome-wide analysis. To address this, we developed an exome-wide reference dataset of 3042 somatic SNVs and INDELs as well as 47,472 CNVs and high-confidence WT gene sets across five polyclonal, polyploid cell lines. This approach enables direct precision assessment based on event-level metrics, yielding a more accurate false-positive rate evaluation for WES assay development.
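
The contrast between per-nucleotide specificity and event-level precision can be made concrete with a short worked example in Python. All counts below are hypothetical and chosen only to show the arithmetic; for simplicity the sketch assumes no false negatives, so every interrogated position that is neither a true nor a false call is a true negative.

```python
def specificity(true_negatives, false_positives):
    """TN / (TN + FP), computed here over interrogated nucleotide positions."""
    return true_negatives / (true_negatives + false_positives)


def precision(true_positives, false_positives):
    """TP / (TP + FP), computed over reported events."""
    return true_positives / (true_positives + false_positives)


# Hypothetical run: 35 Mb interrogated, 3000 true calls, 150 false calls.
interrogated_positions = 35_000_000
true_positives = 3_000
false_positives = 150
true_negatives = interrogated_positions - true_positives - false_positives

per_nucleotide_specificity = specificity(true_negatives, false_positives)
event_level_precision = precision(true_positives, false_positives)

print(f"per-nucleotide specificity: {per_nucleotide_specificity:.6f}")
print(f"event-level precision:      {event_level_precision:.4f}")
```

In this toy case roughly one reported call in twenty is false, yet the per-nucleotide specificity stays above 99.999% because the denominator is dominated by tens of millions of unmutated positions.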

Targeted panels may not reliably assess TMB and MSI for certain cancer types^113,114, and TMB and MSI scores declined with tumor purity, indicating a need for subclonal mutation correction or purity-adjusted thresholds and exome-wide panel use^115,116. Still, our assay showed >99% concordance with reference materials and orthogonal methods at ≥20% tumor purity. Across 16,553 SNVs and 2046 INDELs from samples with wide purity ranges (20–90%), exome-wide sensitivity was 96.2–96.9%, with 99.9% specificity and 99.3% sensitivity for clinically actionable genes (Supplementary Table 15).
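
One way to see why mutation-based scores fall with purity is the standard tumor and normal mixture model for the expected VAF of a somatic variant. The Python sketch below is an illustration of that general model and not the correction applied by the assay; the function name, parameter names, and example values are assumptions.

```python
def expected_vaf(purity, tumor_copy_number=2, mutated_copies=1,
                 cancer_cell_fraction=1.0, normal_copy_number=2):
    """Expected VAF of a somatic variant in a tumor/normal cell mixture.

    purity: fraction of cells in the sample that are tumor cells.
    cancer_cell_fraction: fraction of tumor cells carrying the variant
    (1.0 for a clonal variant, below 1.0 for a subclonal one).
    """
    if not 0.0 <= purity <= 1.0:
        raise ValueError("purity must be in [0, 1]")
    if not 0.0 <= cancer_cell_fraction <= 1.0:
        raise ValueError("cancer_cell_fraction must be in [0, 1]")
    mutant_alleles = purity * cancer_cell_fraction * mutated_copies
    total_alleles = (purity * tumor_copy_number
                     + (1.0 - purity) * normal_copy_number)
    return mutant_alleles / total_alleles


# Clonal heterozygous variant in a diploid region at two purities.
vaf_at_80 = expected_vaf(0.80)
vaf_at_20 = expected_vaf(0.20)
# Same variant at 20% purity when present in only half of the tumor cells.
subclonal_vaf_at_20 = expected_vaf(0.20, cancer_cell_fraction=0.5)

print(f"{vaf_at_80:.3f} {vaf_at_20:.3f} {subclonal_vaf_at_20:.3f}")
```

Under this model a subclonal variant in a low-purity sample can fall near or below a fixed VAF cutoff, which is consistent with the need for purity-adjusted thresholds noted above.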

Reliable CNV detection in tumors with variable purity and ploidy cannot rely solely on BAF and depth ratio from targeted regions^117,118,119,120,121. Targeted assays often report amplifications without ploidy correction¹²², leading to misinterpretation. Our analysis revealed polyploidy in over 30% of samples analyzed, highlighting the need to account for ploidy in clinical interpretation of CNV gains and losses.

In the final validation step of the guidelines proposed herein, assay performance was evaluated on 2230 real-world clinical samples and compared to known mutation rates by diagnosis (Supplementary Table 20)^86,87,123. Almost all cases (98%) harbored variants in clinically actionable genes, and 89% showed overexpression of ADC-related drug targets. These findings support the clinical utility of an exome-wide, non-targeted WES and RNA-seq combination assay for therapy selection, underscoring its untapped potential.

RNA-seq enhances the sensitivity for detecting subclonal variants with low VAF or those in poorly covered regions⁹¹. Consistent with RADIA algorithm findings, RNA-seq confirmed 37% of WES-identified variants and rescued an additional 7% of events below detection thresholds, up to 50% of which affected protein-coding regions. Notably, RNA-based VAF reflected not only tumor content, but also allele-specific expression influenced by gene regulation or CNVs. Statistically significant VAF deviations in genes like CDKN2A, BCL2, MYC, and CCND1 demonstrate the value of RNA-seq in refining variant interpretation. Clinically actionable CNV events (e.g., MTAP, ERBB2, CCNE1, CD274) showed expression changes correlated with copy number, supporting RNA-seq as a functional tool for CNV interpretation at borderline values (5–7 copies).

In summary, emerging assays require new principles for analytical and clinical validation beyond targeted panels of hotspot mutations. Integrated WES and RNA-seq assays can evaluate entire genomes to identify more clinically actionable findings, highlighting the potential clinical use of RNA-seq as a variant interpretation tool for complex somatic rearrangements. Such greater diagnostic yields are valuable to clinicians, pharmaceutical companies, and payers as they direct patients to biomarker-based targeted therapies with superior outcomes, realizing the full potential of personalized medicine. We believe our proposed three-step validation approach addresses a critical need for comprehensive guidelines that can steer the development of integrated RNA and DNA sequencing assays.

Data availability

Processed and reference data are available in the supplementary tables. The raw cell line sequencing data generated and analyzed during the current study are available at SRA repository PRJNA1134786 at the following URL: https://www.ncbi.nlm.nih.gov/bioproject/PRJNA1134786.v¹²⁴ Variant reference files are deposited at https://github.com/BostonGene/Somatic_reference_standards/⁶⁷. The raw sequencing data of clinical patients are not publicly available due to data privacy regulations on the use of such data. De-identified processed clinical data used to demonstrate the clinical application of the integrated WES and WTS assay are provided at https://zenodo.org/doi/10.5281/zenodo.15261029¹²⁵. Supplementary Data 1 includes processed and reference data in supplementary tables. Supplementary Tables 1–3 list genomic regions excluded from analysis due to low complexity, homopolymeric stretches, or high polymorphism. Supplementary Tables 4–5 contain somatic references and WT genes for COLO829, HCC1143, HCC1937, HCC1395, and NCI-H1770 reference cell lines. Genes termed “clinically actionable” are available in Supplementary Table 6. Supplementary Tables 7–8 report correlations among FF and FFPE along with RNA-seq and qPCR. Supplementary Tables 9–11 include data from fusion analysis, such as performance, orthogonal qPCR, and reference databases. Supplementary Table 12 contains BCRs across FF and FFPE samples. Performance metrics are summarized in Supplementary Tables 13, 15, 17, and 18. Supplementary Tables 14 and 16 report sequencing results and MSI status for the 53 clinical samples. Supplementary Table 19 illustrates the distribution of diagnoses from the full cohort of 2230 patients. Supplementary Table 20 contains a comparative analysis of the BostonGene Tumor Portrait^TM against other large-scale genomic analyses. The distribution of alteration types from RNA-seq and allele-specific expression is available in Supplementary Tables 21 and 22.
Source data for the main figures in the manuscript are provided in Supplementary Data 2.

Code availability

All code for processing the reference cell line datasets is deposited online (https://github.com/BostonGene/Somatic_reference_standards/)⁶⁷.

Abbreviations

BAF:: B-allele frequency
BCR:: B-cell receptor
CDS:: Coding sequence
CNV:: Copy number variation
EC:: Exome capture
FF:: Fresh frozen
FFPE:: Formalin-fixed paraffin-embedded
INDELs:: Insertions and deletions
LOD:: Limit of detection
LOH:: Loss of heterozygosity
MSI:: Microsatellite instability
NGS:: Next-generation sequencing
PE:: Paired-end
RNA-seq:: RNA sequencing
TMB:: Tumor mutational burden
TCR:: T-cell receptor
TPM:: Transcripts per million
TME:: Tumor microenvironment
UTR:: Untranslated region
VAF:: Variant allele frequency
WES:: Whole exome sequencing
WT:: Wild-type
WTS:: Whole transcriptome sequencing

References

Frampton, G. M. et al. Development and validation of a clinical cancer genomic profiling test based on massively parallel DNA sequencing. Nat. Biotechnol. 31, 1023–1031 (2013).
Hussen, B. M. et al. The emerging roles of NGS in clinical oncology and personalized medicine. Pathol. Res. Pract. 230, 153760 (2022).
Merino, D. M. et al. Establishing guidelines to harmonize tumor mutational burden (TMB): in silico assessment of variation in TMB quantification across diagnostic platforms: phase I of the Friends of Cancer Research TMB Harmonization Project. J. Immunother. Cancer 8, e000147 (2020).
Wang, X. et al. Copy number alterations detected by whole-exome and whole-genome sequencing of esophageal adenocarcinoma. Hum. Genomics 9, 22 (2015).
Luthra, R. et al. A targeted high-throughput next-generation sequencing panel for clinical screening of mutations, gene amplifications, and fusions in solid Tumors. J. Mol. Diagn. JMD 19, 255–264 (2017).
Grasso, C. et al. Assessing copy number alterations in targeted, amplicon-based next-generation sequencing data. J. Mol. Diagn. JMD 17, 53–63 (2015).
Linderman, M. D. et al. Analytical validation of whole exome and whole genome sequencing for clinical applications. BMC Med. Genomics 7, 20 (2014).
Wrzeszczynski, K. O. et al. Analytical validation of clinical whole-genome and transcriptome sequencing of patient-derived tumors for reporting targetable variants in cancer. J. Mol. Diagn. 20, 822–835 (2018).
Zook, J. M. et al. Extensive sequencing of seven human genomes to characterize benchmark reference materials. Sci. Data 3, 160025 (2016).
Rennert, H. et al. Development and validation of a whole-exome sequencing test for simultaneous detection of point mutations, indels and copy-number alterations for precision cancer care. NPJ Genomic Med. 1, 16019 (2016).
Roepman, P. et al. Clinical validation of whole genome sequencing for cancer diagnostics. J. Mol. Diagn. JMD 23, 816–833 (2021).
Beaubier, N. et al. Integrated genomic profiling expands clinical options for patients with cancer. Nat. Biotechnol. 37, 1351–1360 (2019).
Bray, N. L., Pimentel, H., Melsted, P. & Pachter, L. Near-optimal probabilistic RNA-seq quantification. Nat. Biotechnol. 34, 525–527 (2016).
Bagaev, A. et al. Conserved pan-cancer microenvironment subtypes predict response to immunotherapy. Cancer Cell 39, 845–865.e7 (2021).
Szolek, A. et al. OptiType: precision HLA typing from next-generation sequencing data. Bioinforma. Oxf. Engl. 30, 3310–3316 (2014).
Kim, S. et al. Strelka2: fast and accurate calling of germline and somatic variants. Nat. Methods 15, 591–594 (2018).
Chen, X. et al. Manta: rapid detection of structural variants and indels for germline and cancer sequencing applications. Bioinforma. Oxf. Engl. 32, 1220–1222 (2016).
Dunn, T. et al. Pisces: an accurate and versatile variant caller for somatic and germline next-generation sequencing data. Bioinforma. Oxf. Engl. 35, 1579–1581 (2019).
Van der Auwera, G. A. & O’Connor, B. D. Genomics in the Cloud: Using Docker, GATK, and WDL in Terra. (O’Reilly Media, 2020).
Illumina. DRAGEN secondary analysis (Version 4.3) [Computer software]. https://www.illumina.com/products/by-type/informatics-products/dragen-secondary-analysis.html (2024).
Niu, B. et al. MSIsensor: microsatellite instability detection using paired tumor-normal sequence data. Bioinforma. Oxf. Engl. 30, 1015–1016 (2014).
Danecek, P. et al. Twelve years of SAMtools and BCFtools. GigaScience 10, giab008 (2021).
Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).
Favero, F. et al. Sequenza: allele-specific copy number and mutation profiles from tumor sequencing data. Ann. Oncol. 26, 64–70 (2015).
Shen, R. & Seshan, V. E. FACETS: allele-specific copy number and clonal heterogeneity analysis tool for high-throughput DNA sequencing. Nucleic Acids Res. 44, e131–e131 (2016).
Zaitsev, A. et al. Precise reconstruction of the TME using bulk RNA-seq and a machine learning algorithm trained on artificial transcriptomes. Cancer Cell 40, 879–894.e16 (2022).
Vivian, J. et al. Toil enables reproducible, open source, big biomedical data analyses. Nat. Biotechnol. 35, 314–316 (2017).
Frankish, A. et al. GENCODE reference annotation for the human and mouse genomes. Nucleic Acids Res. 47, D766–D773 (2019).
Abrams, Z. B., Johnson, T. S., Huang, K., Payne, P. R. O. & Coombes, K. A protocol to evaluate RNA sequencing normalization methods. BMC Bioinformatics 20, 679 (2019).
Haas, B. J. et al. Accuracy assessment of fusion transcript detection via read-mapping and de novo fusion transcript assembly-based methods. Genome Biol. 20, 213 (2019).
Haas, B. J. et al. Targeted in silico characterization of fusion transcripts in tumor and normal tissues via FusionInspector. Cell Rep. Methods 3, 100467 (2023).
Uhrig, S. et al. Accurate and efficient detection of gene fusions from RNA sequencing data. Genome Res. 31, 448–460 (2021).
Bolotin, D. A. et al. MiXCR: software for comprehensive adaptive immunity profiling. Nat. Methods 12, 380–381 (2015).
Eberle, M. A. et al. A reference data set of 5.4 million phased human variants validated by genetic inheritance from sequencing a three-generation 17-member pedigree. Genome Res. 27, 157–164 (2017).
Gazdar, A. F. et al. Characterization of paired tumor and non-tumor cell lines established from patients with breast cancer. Int. J. Cancer 78, 766–774 (1998).
Tomlinson, G. E. et al. Characterization of a breast cancer cell line derived from a germ-line BRCA1 mutation carrier. Cancer Res 58, 3237–3242 (1998).
Easty, D. J. et al. Protein B61 as a new growth factor: expression of B61 and up-regulation of its receptor epithelial cell kinase during melanoma progression. Cancer Res. 55, 2528–2532 (1995).
Phelps, R. M. et al. NCI-navy medical oncology branch cell line data base. J. Cell. Biochem. Suppl. 24, 32–91 (1996).
Brattain, M. G., Fine, W. D., Khaled, F. M., Thompson, J. & Brattain, D. E. Heterogeneity of malignant cells from a human colonic carcinoma. Cancer Res. 41, 1751–1756 (1981).
Drewinko, B., Romsdahl, M. M., Yang, L. Y., Ahearn, M. J. & Trujillo, J. M. Establishment of a human carcinoembryonic antigen-producing colon adenocarcinoma cell line. Cancer Res. 36, 467–475 (1976).
Dexter, D. L., Barbosa, J. A. & Calabresi, P. N,N-dimethylformamide-induced alteration of cell culture characteristics and loss of tumorigenicity in cultured human colon carcinoma cells. Cancer Res. 39, 1020–1025 (1979).
Martínez-Ramírez, A. et al. Characterization of the A673 cell line (Ewing tumor) by molecular cytogenetic techniques. Cancer Genet. Cytogenet. 141, 138–142 (2003).
Rigby, C. C. & Franks, L. M. A human tissue culture cell line from a transitional cell tumour of the urinary bladder: growth, chromosone pattern and ultrastructure. Br. J. Cancer 24, 746–754 (1970).
Pontén, J. & Macintyre, E. H. Long term culture of normal and neoplastic human glia. Acta Pathol. Microbiol. Scand. 74, 465–486 (1968).
Lasfargues, E. Y. & Ozzello, L. Cultivation of human breast carcinomas. J. Natl. Cancer Inst. 21, 1131–1147 (1958).
Quinn, K. A. et al. Insulin-like growth factor expression in human cancer cell lines. J. Biol. Chem. 271, 11477–11483 (1996).
Fogh, J. & Trempe, G. New Human Tumor Cell Lines. In Human Tumor Cells In Vitro (ed. Fogh, J.) 115–159 (Springer US, 1975). https://doi.org/10.1007/978-1-4757-1647-4_5
Tsuchiya, S. et al. Establishment and characterization of a human acute monocytic leukemia cell line (THP-1). Int. J. Cancer 26, 171–176 (1980).
Campbell, C. D. et al. Population-genetic properties of differentiated human copy-number polymorphisms. Am. J. Hum. Genet. 88, 317–332 (2011).
Redon, R. et al. Global variation in copy number in the human genome. Nature 444, 444–454 (2006).
Rácz, G. A., Nagy, N., Tóvári, J., Apáti, Á. & Vértessy, B. G. Identification of new reference genes with stable expression patterns for gene expression studies using human cancer and normal cell lines. Sci. Rep. 11, 19459 (2021).
Olsson, E. et al. Mutation screening of 1237 cancer genes across six model cell lines of basal-like breast cancer. PLoS ONE 10, e0144528 (2015).
Craig, D. W. et al. A somatic reference standard for cancer genome sequencing. Sci. Rep. 6, 24607 (2016).
Mandelker, D. & Ceyhan-Birsoy, O. Evolving significance of tumor-normal sequencing in cancer care. Trends Cancer 6, 31–39 (2020).
Medeiros, F., Rigl, C. T., Anderson, G. G., Becker, S. H. & Halling, K. C. Tissue handling for genome-wide expression analysis: a review of the issues, evidence, and opportunities. Arch. Pathol. Lab. Med. 131, 1805–1816 (2007).
Turashvili, G. et al. Nucleic acid quantity and quality from paraffin blocks: defining optimal fixation, processing and DNA/RNA extraction techniques. Exp. Mol. Pathol. 92, 33–43 (2012).
Cieslik, M. et al. The use of exome capture RNA-seq for highly degraded RNA with application to clinical cancer sequencing. Genome Res. 25, 1372–1381 (2015).
Kotlov, N. et al. Procrustes is a machine-learning approach that removes cross-platform batch effects from clinical RNA sequencing data. Commun. Biol. 7, 392 (2024).
Sims, D., Sudbery, I., Ilott, N. E., Heger, A. & Ponting, C. P. Sequencing depth and coverage: key considerations in genomic analyses. Nat. Rev. Genet. 15, 121–132 (2014).
Xiao, C. et al. Personalized genome assembly for accurate cancer somatic mutation discovery using tumor-normal paired reference samples. Genome Biol. 23, 237 (2022).
Xu, H., DiCarlo, J., Satya, R. V., Peng, Q. & Wang, Y. Comparison of somatic mutation calling methods in amplicon and whole exome sequence data. BMC Genomics 15, 244 (2014).
Diossy, M. et al. Strand orientation bias detector to determine the probability of FFPE sequencing artifacts. Brief. Bioinform. 22, bbab186 (2021).
Korona, D. A., LeCompte, K. G. & Pursell, Z. F. The high fidelity and unique error signature of human DNA polymerase ε. Nucleic Acids Res. 39, 1763–1773 (2011).
Do, H. & Dobrovic, A. Sequence artifacts in DNA from formalin-fixed tissues: causes and strategies for minimization. Clin. Chem. 61, 64–71 (2015).
Robles-Espinoza, C. D., Mohammadi, P., Bonilla, X. & Gutierrez-Arcelus, M. Allele-specific expression: applications in cancer and technical considerations. Curr. Opin. Genet. Dev. 66, 10–19 (2021).
Jiang, J. et al. Identification of hub genes associated with melanoma development by comprehensive bioinformatics analysis. Front. Oncol. 11, 621430 (2021).
Yudina, A. & Kotlov, N. BostonGene/Somatic_reference_standards: Somatic reference for SNV/INDEL and CNV. Zenodo https://doi.org/10.5281/ZENODO.15261589 (2025).
Owens, N. D. L. et al. Measuring absolute RNA copy numbers at high temporal resolution reveals transcriptome kinetics in development. Cell Rep. 14, 632–647 (2016).
SEQC/MAQC-III Consortium. A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the sequencing quality control consortium. Nat. Biotechnol. 32, 903–914 (2014).
Esteve-Codina, A. et al. A comparison of RNA-Seq results from paired formalin-fixed paraffin-embedded and fresh-frozen glioblastoma tissue samples. PLoS ONE 12, e0170632 (2017).
Bossel Ben-Moshe, N. et al. mRNA-seq whole transcriptome profiling of fresh frozen versus archived fixed tissues. BMC Genomics 19, 419 (2018).
Boutros, P. C. et al. Prognostic gene signatures for non-small-cell lung cancer. Proc. Natl. Acad. Sci. USA 106, 2824–2828 (2009).
Bao, M., Zhang, L. & Hu, Y. Novel gene signatures for prognosis prediction in ovarian cancer. J. Cell. Mol. Med. 24, 9972–9984 (2020).
Latha, N. R. et al. Gene expression signatures: a tool for analysis of breast cancer prognosis and therapy. Crit. Rev. Oncol. Hematol. 151, 102964 (2020).
Rabushko, E. et al. Experimentally deduced criteria for detection of clinically relevant fusion 3’ oncogenes from FFPE bulk RNA sequencing data. Biomedicines 10, 1866 (2022).
Deng, W. et al. Fusion gene detection using whole-exome sequencing data in cancer patients. Front. Genet. 13, 820493 (2022).
Walther, C. et al. Gene fusion detection in formalin-fixed paraffin-embedded benign fibrous histiocytomas using fluorescence in situ hybridization and RNA sequencing. Lab. Investig. J. Tech. Methods Pathol. 95, 1071–1076 (2015).
Heyer, E. E. et al. Diagnosis of fusion genes using targeted RNA sequencing. Nat. Commun. 10, 1388 (2019).
de Schaetzen van Brienen, L. et al. Comparative analysis of somatic variant calling on matched FF and FFPE WGS samples. BMC Med. Genomics 13, 94 (2020).
Rizvi, N. A. et al. Mutational landscape determines sensitivity to PD-1 blockade in non–small cell lung cancer. Science 348, 124–128 (2015).
Yarchoan, M., Hopkins, A. & Jaffee, E. M. Tumor mutational burden and response rate to PD-1 inhibition. N. Engl. J. Med. 377, 2500–2501 (2017).
Snyder, A. et al. Genetic basis for clinical response to CTLA-4 blockade in melanoma. N. Engl. J. Med. 371, 2189–2199 (2014).
Innocenti, F. et al. Mutational analysis of patients with colorectal cancer in CALGB/SWOG 80405 identifies new roles of microsatellite instability and tumor mutational burden for patient outcome. J. Clin. Oncol. 37, 1217–1227 (2019).
Popat, S., Hubner, R. & Houlston, R. S. Systematic review of microsatellite instability and colorectal cancer prognosis. J. Clin. Oncol. 23, 609–618 (2005).
Fancello, L., Gandini, S., Pelicci, P. G. & Mazzarella, L. Tumor mutational burden quantification from targeted gene panels: major advancements and challenges. J. Immunother. Cancer 7, 183 (2019).
Aaltonen, L. A. et al. Pan-cancer analysis of whole genomes. Nature 578, 82–93 (2020).
Priestley, P. et al. Pan-cancer whole-genome analyses of metastatic solid tumours. Nature 575, 210–216 (2019).
Cancer Genome Atlas Research Network. et al. The Cancer Genome Atlas Pan-Cancer Analysis Project. Nat. Genet. 45, 1113–1120 (2013).
Sorokin, M. et al. RNA sequencing in comparison to immunohistochemistry for measuring cancer biomarkers in breast cancer and lung cancer specimens. Biomedicines 8, 114 (2020).
Kushnarev, V. et al. 143 Correlating RNA-seq detection and IHC staining of potential antibody-drug conjugate (ADC) targets: HER3, HER2, TROP2, Nectin4, and aFLR. in Regular and Young Investigator Award Abstracts A163 (Journal for ImmunoTherapy of Cancer, 2023). https://doi.org/10.1136/jitc-2023-SITC2023.0143
Radenbaugh, A. J. et al. RADIA: RNA and DNA integrated analysis for somatic mutation detection. PLoS ONE 9, e111516 (2014).
Neums, L. et al. VaDiR: an integrated approach to Variant Detection in RNA. GigaScience 7, gix122 (2018).
Shi, P., Chen, C. & Yao, Y. Correlation between HER-2 gene amplification or protein expression and clinical pathological features of breast cancer. Cancer Biother. Radiopharm. 34, 42–46 (2019).
Blancato, J., Singh, B., Liu, A., Liao, D. J. & Dickson, R. B. Correlation of amplification and overexpression of the c-myc oncogene in high-grade breast cancer: FISH, in situ hybridisation and immunohistochemical analyses. Br. J. Cancer 90, 1612–1619 (2004).
Latysheva, N. S. & Babu, M. M. Discovering and understanding oncogenic gene fusions through data intensive computational approaches. Nucleic Acids Res. 44, 4487–4503 (2016).
Kastenhuber, E. R. et al. DNAJB1–PRKACA fusion kinase interacts with β-catenin and the liver regenerative response to drive fibrolamellar hepatocellular carcinoma. Proc. Natl. Acad. Sci. USA 114, 13076–13084 (2017).
Tomlins, S. A. et al. Role of the TMPRSS2-ERG gene fusion in prostate cancer. Neoplasia 10, 177-IN9 (2008).
Al-Jundi, M., Thakur, S., Gubbi, S. & Klubo-Gwiezdzinska, J. Novel targeted therapies for metastatic thyroid cancer—a comprehensive review. Cancers 12, 2104 (2020).
Melosky, B. et al. The rapidly evolving landscape of novel targeted therapies in advanced non-small cell lung cancer. Lung Cancer 160, 136–151 (2021).
Chernyshov, K. et al. Aggregated analysis of 1000 patients with cancer to assess the benefits of integrated whole exome and whole transcriptome sequencing. J. Clin. Oncol. 41, 3076–3076 (2023).
Peymani, F., Farzeen, A. & Prokisch, H. RNA sequencing role and application in clinical diagnostic. Pediatr. Investig. 6, 29–35 (2022).
Rakicevic, L. DNA and RNA molecules as a foundation of therapy strategies for treatment of cardiovascular diseases. Pharmaceutics 15, 2141 (2023).
Lau, D., Bobe, A. M. & Khan, A. A. RNA sequencing of the tumor microenvironment in precision cancer immunotherapy. Trends Cancer 5, 149–156 (2019).
Flynn, P., Suryaprakash, S., Grossman, D., Panier, V. & Wu, J. The antibody-drug conjugate landscape. Nat. Rev. Drug Discov. https://doi.org/10.1038/d41573-024-00064-w (2024).
Byron, S. A., Van Keuren-Jensen, K. R., Engelthaler, D. M., Carpten, J. D. & Craig, D. W. Translating RNA sequencing into clinical diagnostics: opportunities and challenges. Nat. Rev. Genet. 17, 257–271 (2016).
Zito Marino, F. et al. NTRK fusions, from the diagnostic algorithm to innovative treatment in the era of precision medicine. Int. J. Mol. Sci. 21, 3718 (2020).
McCabe, M. J. et al. Development and validation of a targeted gene sequencing panel for application to disparate cancers. Sci. Rep. 9, 17052 (2019).
Li, H. Toward better understanding of artifacts in variant calling from high-coverage samples. Bioinformatics 30, 2843–2851 (2014).
Cai, L., Yuan, W., Zhang, Z., He, L. & Chou, K.-C. In-depth comparison of somatic point mutation callers based on different tumor next-generation sequencing depth data. Sci. Rep. 6, 36540 (2016).
LaDuca, H. et al. Exome sequencing covers >98% of mutations identified on targeted next generation sequencing panels. PLoS ONE 12, e0170843 (2017).
Krøigård, A. B., Thomassen, M., Lænkholm, A.-V., Kruse, T. A. & Larsen, M. J. Evaluation of nine somatic variant callers for detection of somatic mutations in exome and targeted deep sequencing data. PLoS ONE 11, e0151664 (2016).
Chen, Z. et al. Systematic comparison of somatic variant calling performance among different sequencing depth and mutation frequency. Sci. Rep. 10, 3501 (2020).
Fang, H. et al. Tumour mutational burden is overestimated by target cancer gene panels. J. Natl. Cancer Cent. 3, 56–64 (2023).
Bartels, S. et al. Concordance in detection of microsatellite instability by PCR and NGS in routinely processed tumor specimens of several cancer types. Cancer Med. 12, 16707–16715 (2023).
Anagnostou, V. et al. Multimodal genomic features predict outcome of immune checkpoint blockade in non-small-cell lung cancer. Nat. Cancer 1, 99–111 (2020).
Schou Nørøxe, D. et al. Tumor mutational burden and purity adjustment before and after treatment with temozolomide in 27 paired samples of glioblastoma: a prospective study. Mol. Oncol. 16, 206–218 (2022).
Masood, D. et al. Evaluation of somatic copy number variation detection by NGS technologies and bioinformatics tools on a hyper-diploid cancer genome. Genome Biol. 25, 163 (2024).
Gordeeva, V. et al. Benchmarking germline CNV calling tools from exome sequencing data. Sci. Rep. 11, 14416 (2021).
Seed, G. et al. Gene copy number estimation from targeted next-generation sequencing of prostate cancer biopsies: analytic validation and clinical qualification. Clin. Cancer Res. 23, 6070–6077 (2017).
Carter, S. L. et al. Absolute quantification of somatic DNA alterations in human cancer. Nat. Biotechnol. 30, 413–421 (2012).
Van Loo, P. et al. Allele-specific copy number analysis of tumors. Proc. Natl. Acad. Sci. USA 107, 16910–16915 (2010).
Singh, A. K. et al. Detecting copy number variation in next generation sequencing data from diagnostic gene panels. BMC Med. Genomics 14, 214 (2021).
Zehir, A. et al. Mutational landscape of metastatic cancer revealed from prospective clinical sequencing of 10,000 patients. Nat. Med. 23, 703–713 (2017).
Bagaev, A. & BostonGene Corporation. Clinical and analytical validation of a combined RNA and DNA exome assay. NCBI (2024).
Bagaev, A. & BostonGene Corporation. Clinical and analytical validation of a combined RNA and DNA exome assay across a large tumor cohort: dataset. https://doi.org/10.5281/zenodo.15261029 (2025).

Acknowledgements

This study was funded by BostonGene Corporation. We thank Egor Anoshkin and Elizabeth (Suos) Scott for their help with sample procurement, Felix Frenkel for establishing the initial bioinformatics infrastructure used in this study, Anton Sivkov and Danil Stupichev for their support of preceding experiments, and Anna Kamysheva and Alexander Morozov for their valuable insights.

Author information

Authors and Affiliations

BostonGene Corporation, Waltham, MA, USA
Anastasiya Yudina, Cagdas Tazearslan, Artur Baisangurov, Ekaterina Nuzhdina, Kelley Lauziere, Vitaly Segodin, Svetlana Podsvirova, Sergey Starikov, Madison Chasse, Kirill Shaposhnikov, Leznath Kaneunyenye, Olesia Klimchuk, Natalia Kuzkina, Noel English, Gleb Khegai, Danielle Sookiasian, Daria Shafranskaya, Dawn Fernandez, Yaroslav Lozinsky, Andrew Sobolev, Mary Abdou, Polina Turova, Konstantin Chernyshov, Alexey Efremov, Samuel Andrewes, Aviva Feinberg, Brianna McKenna, Jessica H. Brown, Anna Love, John Curran, Jochen Lennerz & Alexander Bagaev

Contributions

A.Y. contributed to data generation, major data analysis, data structuring, data harmonization, analysis discussions, the writing and reviewing of the manuscript, figure design and preparation, and study design. C.T. and A. Baisangurov contributed to the design and supervision of the research and data analysis. E.N. contributed to the design and supervision of the bioinformatics pipelines and validation experiments, and to preparing the infrastructure for calculations. K.L., M.C., L.K., N.E., D.S., D.F., S.A., A.F., and B.M. performed laboratory research. V.S. contributed to the development of the CNV calling pipeline, CNV experiment planning, and analysis of CNV calling results (for both cell lines and clinical samples). S.P. performed research and data analysis. S.S. contributed to the development of the gene fusion calling pipeline, experiment planning, and analysis of gene fusion calling results for cell lines, reference materials, and clinical samples. K.S. analyzed RNA-seq expression data of FF and FFPE samples and qPCR orthogonal validation results. O.K. contributed to the design of orthogonal validation experiments, quality control of sequencing data for clinical samples, and SNV/INDEL cell line data analysis for exome-wide reference preparation. N.K. developed the SNV and INDEL mutation calling pipeline and compared the somatic SNV calling results to published data. G.K. contributed to the optimization of the somatic SNV and INDEL filtration step for downstream analysis, TMB score calculation, and comparison of TMB score to orthogonal data. D. Sh. contributed to the development and optimization of the gene fusion calling pipeline and analysis of gene fusion calling results for references. Y.L. contributed to the development of the RNA-seq expression assessment pipeline and the analysis of the ERCC Spike-in experiments. A.S. monitored the somatic SNV, CNV, and expression data for internal control samples. M.A. helped obtain new reagents and samples. A.E. helped develop the quality control pipeline and analyze the coverage characteristics of validation samples. A.L. and J.H.B. contributed to the writing, reviewing, and preparation of figures. J.C. and J.L. contributed to the supervision of the work. A. Bagaev conceived and supervised the study, provided experiment design, and contributed to the writing and reviewing of the manuscript and figures.

Corresponding author

Correspondence to Alexander Bagaev.

Ethics declarations

Competing interests

The authors declare the following competing interests: Jochen Lennerz is the Chief Scientific Officer, and Alexander Bagaev is the Chief Product Officer at BostonGene Corporation. All authors were employees of BostonGene Corporation when the study was performed. All other authors declare no competing interests.

Peer review

Peer review information

Communications Medicine thanks the anonymous reviewers for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Description of Additional Supplementary File

Supplementary Data 1

Supplementary Data 2

Reporting Summary

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Yudina, A., Tazearslan, C., Baisangurov, A. et al. Clinical and analytical validation of a combined RNA and DNA exome assay across a large tumor cohort. Commun Med 5, 236 (2025). https://doi.org/10.1038/s43856-025-00934-3

Received: 09 March 2025
Accepted: 27 May 2025
Published: 16 June 2025
Version of record: 16 June 2025
DOI: https://doi.org/10.1038/s43856-025-00934-3