Abstract
Background
Combining RNA sequencing (RNA-seq) with whole exome sequencing (WES) from a single tumor sample can substantially improve the detection of clinically relevant alterations in cancer. However, routine clinical adoption of this integrated approach remains limited, especially for RNA-seq, due to the absence of standardized validation frameworks.
Methods
We developed and validated an assay that integrates RNA-seq and WES for evaluating gene expression, gene fusions, tumor microenvironment signatures, somatic single nucleotide variants (SNVs), insertions/deletions (INDELs), and copy number variations (CNVs). Exome-wide somatic reference standards were generated to support analytical validation using multiple sequencing runs of cell lines at varying purities.
Results
Assay validation involves 3 steps: (1) analytical validation using custom reference samples containing 3042 SNVs and 47,466 CNVs; (2) orthogonal testing in patient samples; and (3) assessment of clinical utility in real-world cases. Applied to 2230 clinical tumor samples, the integrated assay enables direct correlation of somatic alterations with gene expression, recovers variants missed by DNA-only testing, and improves detection of gene fusions. In addition to uncovering clinically actionable alterations in 98% of cases, the assay also reveals complex genomic rearrangements that would likely have remained undetected without RNA data.
Conclusions
This study provides practical validation guidelines for integrated RNA and DNA sequencing in clinical oncology. The combined assay enhances the detection of actionable alterations, thereby facilitating personalized treatment strategies for cancer patients.
Plain language summary
People with cancer have changes in the DNA sequence within their cancer cells that are not present in noncancerous cells. Accurate identification of DNA changes in tumors is crucial for optimizing cancer care for individual people with cancer. It is also helpful to be able to look at changes in the RNA produced by cells, as RNA is the molecule used in cells to determine which proteins are produced, and the proteins tend to determine how a cell behaves. Currently, assessment of DNA and RNA is often performed separately. Here, we validated a combined test to detect both clinically relevant DNA alterations and changes in the RNA produced. Looking at over 2000 tumors, we found alterations missed by DNA-only approaches. Our method could improve diagnostic accuracy, streamline clinical workflows, and potentially save money, ultimately leading to improved cancer treatment decisions and better patient outcomes.
Introduction
Diagnostic methods have evolved to address emerging therapeutic targets to support the rapid growth of personalized medicine. In 2013, Frampton et al. validated a targeted genomic panel, establishing a framework for next-generation sequencing (NGS) assay development1. Nonetheless, most clinical NGS assays rely on DNA-seq with targeted gene panels, leaving many clinically relevant genes untested2.
While RNA-seq has become a standard approach for measuring fusions and characterizing tissue phenotypes, its clinical adoption remains limited. Whole exome sequencing (WES) identifies single nucleotide variants (SNVs), insertions/deletions (INDELs), copy number variations (CNVs), loss of heterozygosity (LOH), microsatellite instability (MSI), and tumor mutational burden (TMB) across more than 20,000 genes. It also surpasses targeted panels in identifying TMB3 and large-scale (arm-level) CNVs4,5,6. Combining WES with RNA-seq increases actionable findings through the detection of gene expression changes, fusions, and alternative splicing events. Gene expression signatures can also predict immunotherapy outcomes, highlighting the importance of robust validation methods.
Currently, comprehensive guidelines for integrated RNA-seq and WES assay development, validation, and analysis are lacking. Therefore, genomic studies to date have relied on germline references7,8,9 or validated targeted panel assays10,11,12. Such guidelines are integral for clinical implementation, particularly for the interpretation of somatic variants from RNA-seq.
Here, we propose comprehensive guidelines for validating SNVs, CNVs, MSI, TMB, gene expression, and fusions detected through an in-house, integrated WES and RNA-seq assay (Tumor Portrait™ [BostonGene Corporation; Waltham, MA, USA]). Validations include: (1) an analytical step using reference materials and cell lines; (2) orthogonal testing with clinical samples; and (3) clinical validation on 2230 patient samples.
Clinical validation enables the creation of an interpretation framework linking somatic variants, CNVs, and fusions to related gene expression profiles, revealing allele-specific expression of oncogenic drivers. Additionally, the RNA-seq variant-calling framework improves the detection of low-coverage hotspot variants. These findings underscore the need for updated guidelines supporting integrated DNA- and RNA-seq assays. By providing a clear roadmap for validating combined assays and interpreting their results, we aim to facilitate their clinical adoption, advancing patient care and personalized treatment strategies in oncology.
Methods
Laboratory procedures
Nucleic acid isolation
Nucleic acid isolation was performed from fresh frozen (FF) solid tumors with the AllPrep DNA/RNA Mini Kit (Qiagen, Valencia, CA, USA) and from normal tissue (whole blood, peripheral blood mononuclear cells [PBMCs], or saliva) with the QIAamp DNA Blood Mini Kit (Qiagen, Valencia, CA, USA) and Maxwell RSC Stabilized Saliva DNA Kit (Promega, Madison, WI, USA). The AllPrep DNA/RNA FFPE Kit (Qiagen, Valencia, CA, USA) was used for nucleic acid isolation from formalin-fixed paraffin-embedded (FFPE) solid tumors. DNA and RNA extracts were tested for contamination and structural integrity. DNA and RNA quantity and quality were measured using a Qubit 2.0 (Thermo Fisher Scientific, Waltham, MA, USA), NanoDrop OneC (Thermo Fisher Scientific, Waltham, MA, USA), and TapeStation 4200 (Agilent Technologies, Santa Clara, CA, USA).
Library preparation for DNA and RNA sequencing
For both FF and FFPE protocols, 10–200 ng of extracted DNA or RNA was required for the respective library preparations. Library construction from FF tissue RNA was performed with the TruSeq stranded mRNA kit (Illumina, San Diego, CA, USA). Library construction from FFPE tissue DNA and RNA was performed using the SureSelect XTHS2 DNA and SureSelect XTHS2 RNA exome capture kits, respectively (Agilent Technologies, Santa Clara, CA, USA). For hybridization and capture, the SureSelect Human All Exon V7 + UTR (Agilent Technologies, Santa Clara, CA, USA) exome probe was used for RNA, and the SureSelect Human All Exon V7 exome probe (Agilent Technologies, Santa Clara, CA, USA) was used for DNA. Quality, concentration, and size of the prepared libraries were assessed using the Qubit 2.0 (Thermo Fisher Scientific, Waltham, MA, USA), TapeStation 4200 (Agilent Technologies, Santa Clara, CA, USA), and LightCycler 480 (Roche, CA) or QuantStudio™ 5 Real-Time PCR System (Thermo Fisher Scientific, Waltham, MA, USA).
Sequencing
Sequencing was performed on a NovaSeq 6000 (Illumina, San Diego, CA, USA). For each step of library preparation, acceptable and target values of amounts, concentrations, average fragment size, RNA integrity number (RIN) scores, and light absorption metrics were confirmed with quality control steps. The primary analysis of NovaSeq 6000 QC metrics (Q30 > 90%, PF > 80%) in BaseSpace Sequence Hub was monitored during every run, and stringent bioinformatics pipeline analysis was performed for the prepared libraries. All samples passed QC thresholds at every stage of DNA or RNA library preparation and bioinformatics analysis.
Bioinformatics procedures
Alignment
WES data were mapped to the human genome (hg38) using BWA aligner v.0.7.17. GATK v4.1.2 and mosdepth v0.2.1 were used for marking PCR duplicate reads and collecting sequencing metrics, including average coverage statistics. RNA-seq data were mapped to the human genome (hg38) using the default parameters of STAR aligner v2.4.2 with minor modifications. For gene expression quantification, reads were aligned to the human transcriptome (hg38) with Kallisto v0.43.0 using default parameters13.
Quality control (QC)
QC was performed as previously described by Bagaev et al.14. Briefly, standard QC for WES was performed via fastQC v0.11.9 and FastqScreen v0.14.0. Picard v2.20.7 MarkDuplicates was used to remove duplicate reads. Off-target reads were calculated using samtools view v1.10 together with the target regions file provided by the vendor (Agilent Technologies, Santa Clara, CA, USA). Unique reads were calculated using samtools view v1.10 with the -F0x400 flag. Off-target and duplication rates were calculated by dividing the off-target and duplicate reads by the total number of reads, respectively. Standard QC for RNA-seq was performed via RSeQC v3.0.1, including assessment of the percentage of sense strand reads for DNA contamination control. Control of sample mixing was done by comparison of HLA types (obtained via OptiType v1.3.5)15 and calculation of SNV concordance of germline variants in housekeeping genes.
SNV and INDEL detection
Variant calling
Germline SNVs and INDELs and somatic SNVs were detected using optimized Strelka v2.9.10 on both normal and paired tumor/normal samples in exome mode with the following parameter modifications: QSI_NT score ≥ 50, QSS_NT ≥ 50, and Somatic EVS > 15 (ref. 16). Somatic INDEL (1–49 bp) calling was performed via Strelka v2.9.10 using small INDEL candidates from Manta v1.5.0 (refs. 16,17). Variant calling from RNA-seq data was performed via Pisces v5.2.10.49 (ref. 18).
Filtration
The outputs from the variant calling algorithms underwent subsequent filtration. For germline variants, filtration of calls from chrX and chrY was performed according to gender. An additional strand bias filter was applied to germline INDELs. For somatic mutations, the output after Strelka2 processing underwent filtration using a basic filter (tumor depth ≥ 10 reads, normal depth ≥ 20 reads, and normal VAF ≤ 0.05). Then, a threshold for tumor VAF ≥ 0.05 was applied. Finally, a complex filter based on the combination of Strelka2 QSS and EVS scores was applied:
where x = b1 * QSS + b2 * EVS + c. Parameters b1, b2, and c were assessed via logistic regression. The following additional filters were applied to the standard parameters: VAF and alternative read counts in tumor and normal samples. Somatic and germline mutation calling was performed across all covered regions (including covered off-target regions), and per-nucleotide coverage thresholds were applied (80X for tumor samples and 20X for normal samples) to establish the required coverage needed to optimize quality metrics of the output.
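The somatic filtration cascade described above can be sketched in a few lines. This is a minimal illustration, not the pipeline's code: it assumes the complex filter keeps a variant when the standard logistic function of x reaches a cutoff, and the coefficient values, the cutoff, and all class and function names are hypothetical placeholders because the fitted parameters are not given in the text.

```python
import math
from dataclasses import dataclass


@dataclass
class SomaticCall:
    tumor_depth: int
    normal_depth: int
    tumor_vaf: float
    normal_vaf: float
    qss: float  # Strelka2 QSS score
    evs: float  # Strelka2 EVS score


# Hypothetical placeholders: the fitted b1, b2, c and the cutoff are not stated.
B1, B2, C = 0.02, 0.15, -4.0
P_CUTOFF = 0.5


def passes_basic_filter(call: SomaticCall) -> bool:
    # Basic filter from the text: tumor depth >= 10, normal depth >= 20, normal VAF <= 0.05.
    return call.tumor_depth >= 10 and call.normal_depth >= 20 and call.normal_vaf <= 0.05


def logistic_score(call: SomaticCall) -> float:
    # x = b1 * QSS + b2 * EVS + c, mapped through the logistic function (assumed form).
    x = B1 * call.qss + B2 * call.evs + C
    return 1.0 / (1.0 + math.exp(-x))


def keep_call(call: SomaticCall) -> bool:
    # Apply the three steps in order: basic filter, tumor VAF threshold, complex filter.
    return (
        passes_basic_filter(call)
        and call.tumor_vaf >= 0.05
        and logistic_score(call) >= P_CUTOFF
    )
```

With these placeholder values, a call with QSS 100 and EVS 20 gives x = 1 and is kept, provided it also passes the depth and VAF thresholds.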
Benchmarking
Mutect2 and FilterMutectCalls from GATK v4.1 and Dragen v4.3 in somatic mode were used for SNV/INDEL calling19,20.
Pileup files parsing
For additional verification of SNV/INDEL calls, samtools mpileup v1.10 and a custom downstream parsing script were used to precisely check discordant calls between two samples.
TMB assessment
The TMB score was calculated based on the somatic mutation calling output with the additional exclusion of intronic and silent mutations. The mutations for TMB score assessment were taken only from target regions, and normalization was performed by target length in megabases. TMB-high was defined as ≥10 mut/Mb, and TMB-low was defined as <10 mut/Mb. For FF and FFPE samples, mutations that passed the following thresholds were used for TMB score calculation: (1) tumor coverage depth for mutation ≥25; (2) tumor VAF (SNV, INDEL) ≥ 0.05; (3) tumor alternative read support ≥ 5; (4) normal coverage depth for mutation ≥15; (5) normal VAF ≤ 0.05; (6) tumor VAF * tumor alternative read support ≥ 0.6.
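The TMB rules above translate directly into a counting function. The sketch below is illustrative only: the `Mutation` record, its field names, and the effect labels are assumptions introduced for the example, while the numeric thresholds are the six listed in the text.

```python
from dataclasses import dataclass


@dataclass
class Mutation:
    tumor_depth: int
    tumor_vaf: float
    tumor_alt_reads: int
    normal_depth: int
    normal_vaf: float
    on_target: bool
    effect: str  # hypothetical labels, e.g. "missense", "silent", "intron"


EXCLUDED_EFFECTS = {"silent", "intron"}


def counts_for_tmb(m: Mutation) -> bool:
    # On-target, non-silent, non-intronic mutations passing thresholds (1)-(6).
    return (
        m.on_target
        and m.effect not in EXCLUDED_EFFECTS
        and m.tumor_depth >= 25
        and m.tumor_vaf >= 0.05
        and m.tumor_alt_reads >= 5
        and m.normal_depth >= 15
        and m.normal_vaf <= 0.05
        and m.tumor_vaf * m.tumor_alt_reads >= 0.6
    )


def tmb_score(mutations, target_length_bp: int) -> float:
    # Mutations per megabase of target region.
    n = sum(1 for m in mutations if counts_for_tmb(m))
    return n / (target_length_bp / 1_000_000)


def tmb_class(score: float) -> str:
    return "TMB-high" if score >= 10 else "TMB-low"
```

For example, 30 qualifying mutations over a 2 Mb target give 15 mut/Mb, which is classified as TMB-high.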
MSI status determination
MSISensor2 was used to calculate the MSI score. A standard cut-off of 20, which represents the percentage of microsatellite unstable regions out of 2800 analyzed, was used for MSI status classification21.
CNV detection
A primary step for CNV detection was compilation of an extraction coverage reference for tumor and normal samples. The reference was defined by finding common regions/peaks of coverage in a set of 30 normal reference samples. Peaks of coverage were calculated with macs3 v3.0.0a7. Samtools v1.10 was used for coverage extraction22. For each sample, only highly covered regions were left after filtration (at least 20% of the median sample coverage), which were then intersected by BEDTools v2.30.0 resulting in a final reference23. Modified Sequenza v2.1.2 was used to call CNVs using the following process: (1) The coverage area was calculated under each peak from the extraction coverage reference; (2) The peak coverage was used as a homozygous input for the calculation of the depth of coverage ratio (ratio of coverage between tumor and normal samples); (3) Heterozygous germline SNVs were used to calculate B-allele frequency (BAF) in tumor samples; (4) Homozygous and heterozygous positions were merged together for the main input into Sequenza in pileup format24. This approach allowed us to focus our attention on highly covered regions in the genome, reducing the impact of noise. This was particularly important for low-purity samples, where the slightest elevation or reduction of the depth ratio and BAF metrics can be interpreted as a change in copy numbers.
The following additional standard techniques were also used during CNV calling: (1) Full segmentation based on the depth ratio and BAF metrics; (2) Filtration of CNV calls based on a depth of coverage ratio ≥10 (ratio of coverage between tumor and normal samples), and a coverage of tumor alternative allele ≥5; (3) Filtration of segments during model fitting based on the number of homozygous/heterozygous positions found in a segment (threshold is equal to 10 for the number of homozygous/heterozygous positions); (4) Filtration of segments caught in a centromere region.
For correct interpretation of copy numbers, ploidy of the sample was assessed with Sequenza, FACETS, and manual verification as needed24,25. Normalized states of CNVs were introduced to compare between CNV profiles. We defined the following CNV levels: “Loss,” “Deletion,” “Neutral,” “Gain,” “Amplification,” and “High Amplification.” High-level CNVs included “Loss” (complete loss of a region), “Amplification,” and “High Amplification,” while low-level CNVs included “Deletion” (partial loss) and “Gain” (slight increase in number of copies). “Neutral” was defined as no CNV detected, with the total copy number equal to ploidy. These states were defined through ploidy normalization of the total copy number for a segment/gene; ploidy was calculated for each sample and used as a measure of the average number of complete chromosome sets. In the analysis, we also used normalized CNV values (−2, −1, 0, 1, 2) that corresponded to the aforementioned levels: “Loss,” “Deletion,” “Neutral,” “Gain,” and “Amplification.”
The classification of the above-listed CNV categories can be done via the following logical expression:
where pl is ploidy (≥2), and cn is the copy number of the segment.
To classify segments by length and calculate statistics for every group, the following segment categories were introduced: (1) Focal-level segments with a length less than 10% of the arm, but no more than 3 Mb. (2) Arm-level segments with a length at least 50% of the arm length. (3) Long-focal segments that fall in between the focal- and arm-level categories.
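The three length categories above can be expressed as a small function. This is a sketch under one stated assumption: "less than 10% of the arm, but no more than 3 Mb" is read as both conditions holding at once. The function name is hypothetical.

```python
def classify_segment(segment_length_bp: int, arm_length_bp: int) -> str:
    """Assign a CNV segment to a length category relative to its chromosome arm."""
    fraction = segment_length_bp / arm_length_bp
    if fraction >= 0.5:
        return "arm-level"
    # Focal: shorter than 10% of the arm AND no longer than 3 Mb (assumed reading).
    if fraction < 0.10 and segment_length_bp <= 3_000_000:
        return "focal"
    # Everything between the focal and arm-level categories.
    return "long-focal"
```

On a 100 Mb arm, a 2 Mb segment is focal, a 5 Mb segment is long-focal (it exceeds 3 Mb), and a 60 Mb segment is arm-level.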
Benchmarking
Dragen v4.3 in panel-of-normals mode and FACETS were used for CNV detection20,25.
Expression analysis
Gene expression analysis was performed as previously described14,26. RNA-seq reads were aligned to GRCh38.d1.vd1 using Kallisto v0.42.4 and annotated with GENCODE v23 transcripts with default parameters13,27,28. The noncoding RNA, mitochondria- and histone-related transcripts were removed, and the protein-coding transcripts, IGH/K/L- and TCR-related transcripts, were retained, resulting in 20,062 protein-coding genes. Gene expression was quantified as transcripts per million (TPM) and log2-transformed29, and gene expression signatures were calculated based on methods from Bagaev et al.14. STAR aligner v2.4.2 was used to provide read counts for exon skipping events detection for CLDN18, AR, and MET genes.
Fusion transcript detection
For fusion transcript detection, STAR-fusion v1.8.1 was used with subsequent annotation by FusionInspector in validation mode30,31. Detected fusion transcripts were filtered using the following criteria: (1) Fusion fragments per million (FFPM) values ≥ 0.1; (2) Number of junction reads supporting the fusion ≥5; (3) A sum of all supporting reads (junction and spanning) ≥7; and (4) genes overlapping each other or with a breakpoint distance >10 kb. Functional annotation was also taken into account (mitochondrial genes and fusions from the FusionCatcher black list were filtered out). The developed pipeline used a list of databases (Supplementary Table 11) to functionally annotate fusions and to identify de novo events.
Benchmarking
Arriba 2.4 was used for fusion transcript detection32.
Immune receptor repertoire analysis
Immune clonotypes were determined with MiXCR v3.0.12 (ref. 33) using standard parameters according to the recommendation for bulk RNA-seq analysis.
Analytical validation procedures
Reference materials
Validation experiments were performed on FF cell lines and human tissues from the following commercial vendors: iSpecimen (Lexington, MA, USA), American Type Culture Collection (ATCC; Manassas, VA, USA), Coriell Institute for Medical Research (Camden, NJ, USA), and Cureline (Brisbane, CA, USA). Cell lines and FF tissue samples were stored long-term at −80 °C until used. Cell lines were cultured and tested for mycoplasma according to best practices from the ATCC. All specimens were examined for healthy morphology before proceeding to DNA and RNA extraction for next-generation sequencing (NGS). Commercial FFPE tissue samples (Cureline, Brisbane, CA, USA) and FFPE blocks made in-house from FF samples with the same origin were used to assess the performance of the NGS laboratory-developed test (LDT). After receiving or creating FFPE tissues, blocks were stored at room temperature until RNA and/or DNA extraction. All Cureline patient samples were collected with signed informed consent.
Commercial synthetic reference materials
DNA
The commercially available reference standards Seraseq® gDNA TMB Mix Score 7 (Cat. No. 0710-1326), 9 (Cat. No. 0710-1325), 20 (Cat. No. 0710-1324), 26 (Cat. No. 0710-1323; SeraCare, Gaithersburg, MD, USA), which contain a defined number of overall alterations per megabase of the genome, were utilized to validate the TMB assay in FF tissue and perform further orthogonal validation of FFPE tissue. The commercially available reference standards Seraseq FFPE TMB RM Score 7 (Cat. No. 0710-1307), 9 (Cat. No. 0710-1308), 20 (Cat. No. 0710-1309), and 26 (Cat. No. 0710-1310; WT + TUMOR), which contain a defined number of overall alterations per megabase of the genome, were utilized to validate the TMB assay in FFPE tissue.
The Seraseq® Lung & Brain CNV Mix (SeraCare, Gaithersburg, MD, USA) and Seraseq® Breast CNV Mix (SeraCare, Gaithersburg, MD, USA) reference DNA mixes were tested for CNVs using the BostonGene WES workflow. To establish the limit of detection (LOD) for CNVs, we used these reference materials with a number of amplification levels (+3 copies, +6 copies, +12 copies).
RNA
Validation of expression level assessment was performed using the following reference standards: Invitrogen™ Universal Human Reference RNA (Thermo Fisher Scientific, Waltham, MA, USA) and Invitrogen™ External RNA Controls Consortium (ERCC) ExFold Spike-In Mixes (Cat. No. 4456739, Thermo Fisher Scientific, Waltham, MA, USA) for FF samples. The references were used to estimate the range of detection, LOD, accuracy, precision, and reproducibility for detecting RNA expression.
The following reference standards were used for gene fusion detection validation: Seraseq® FFPE Tumor Fusion RNA v4 Reference Material (SeraCare, Gaithersburg, MD, USA), ALK-RET-ROS1 Fusion FFPE RNA Reference Standard (Horizon Discovery, Waterbeach, UK), and 5-Fusion Multiplex (Negative Control) (Horizon Discovery, Waterbeach, UK).
Cell lines
Platinum Genome cell lines, GM12877 and GM12878, prepared both as FF samples and in FFPE blocks were used to measure the accuracy of germline mutation calling for hereditary cancer predisposition syndromes. Platinum Genome references were used for germline calling optimization and validation34. To validate the accuracy of somatic SNV and INDEL detection, high-level CNV detection, and TMB score assessment, pools of previously sequenced and well characterized cell lines were used: HCC1143 (CRL-2321)35, HCC1937 (CRL-2336)36, COLO829 (CRL-1974)37, HCC1395 (CRL-2324)35, and NCI-H1770 (CRL-5893)38, and matched baselines were COLO829BL (CRL-1980), HCC1143BL (CRL-2362), HCC1937BL (CRL-2337), HCC1395BL (CRL-2325), and NCI-BL1770 (CRL-5960). For each cell line and its corresponding normal B cells, separate FFPE blocks were prepared—one for the tumor cells and one for the matched-normal cells. These blocks were stored for 3-4 months, after which DNA extracts were prepared from both the tumor and normal FFPE blocks, as well as from tumor and baseline B-cell cultures. The tumor and matched-normal DNA extracts were then mixed in a range of proportions (0:100, 10:90, 20:80, 30:70, 50:50, 25:75, 100:0) before proceeding with library construction procedures. For the RNA-seq dilution experiment, RNA extracts from the 100% tumor FFPE blocks were mixed with RNA extracts from the matched-normal FFPE blocks.
For validation of our MSI assay in FFPE tissues, pools of four MSI-high (MSI-H) cell lines, HCT 116 (CCL-247), LoVo (CCL-229), DLD-1 (CCL-221), and HCT-15 (CCL-225), with a range of purities (5%, 10%, 20%, 30%, 50%, 75%, 100%) were used39,40,41. HCC1395BL (CRL-2325) was used as a background for dilutions since these cell lines did not have matched-normal B-cell cultures.
The cell line fusion analysis was performed with extracts from K562 (CCL-243), MCF-7 (HTB-22), A-673 (CRL-1598), RT-4 (HTB-2), U-118MG (HTB-15), BT-20 (HTB-19), NCI-H2228 (CRL-5935), SK-BR-3 (HTB-30), and THP-1 (TIB-22) FFPE blocks. Extracts from GM12877, GM12878, and HCC1143BL (CRL-2362) cell lines in FFPE blocks were used as negative controls35,42,43,44,45,46,47,48,49,50. To assess the LOD of fusion detection, RNA extract from one tumor cell line was diluted into RNA extract from another tumor cell line. For FFPE samples, the following cell line pairs were used: BT-20:THP-1 and NCI-H2228:K562. For FF samples, the following cell line pair was used: K562:MCF-7. The dilutions started at a ratio of 95:5 and changed to a 5:95 ratio with a 15% step.
Gene expression analysis was performed using A-673 (CRL-1598), BT-20 (HTB-19), COLO829 (CRL-1974), Caki-1 (HTB-46), GM12877, GM12878, HCC1143 (CRL-2321), K562 (CCL-243), MCF-7 (HTB-22), NCI-H2228, RT-4 (HTB-2), Reh (CRL-8286), SK-BR-3 (HTB-30), SNU-16 (CRL-5974), T-47D (HTB-133), THP-1 (TIB-22), U-118MG (HTB-15), and WERI-Rb-1 (HTB-169).
Biological replicates of the COLO829 cell line were used as a positive control for each clinical sequencing run. The samples were used for longitudinal reproducibility assessment of both RNA-seq and WES-defined events.
Tumor samples
A total of 59 paired normal and tumor samples (internal BostonGene samples and Cureline Tissue Bank [Brisbane, CA, USA]) from patients of various solid diagnoses were used for orthogonal validation. From this cohort, 53 patients were selected for further analysis after the median coverage and purity assessment for each sample was established (Supplementary Table 14). Paired FF and FFPE samples from 110 patients were used to measure the detection accuracy of RNA expression in FFPE samples. Informed consent was obtained for BostonGene and Cureline patient samples.
Orthogonal validation
qPCR
Quantitative polymerase chain reaction (qPCR) was performed to validate RNA-seq expression in 51 FFPE samples using predesigned TaqMan Gene Expression Assays (Thermo Fisher Scientific, Waltham, MA, USA) and TaqMan Fast Advanced Master Mix (Cat# 4444556, Thermo Fisher Scientific, Waltham, MA, USA) according to the manufacturer’s protocol for 99 genes (Supplementary Table 8). We normalized qPCR Ct values to the PCBP1 housekeeping gene51 to compare with RNA-seq-derived expression values (TPM). qPCR was also performed to validate the detection of clinically relevant fusion genes (118 fusion genes, 63 clinical samples). To detect fusions, custom qPCR ZEN double-quenched probe assays, containing a 5’ FAM 520 fluorophore and a 3’ Iowa Black FQ quencher, were designed using IDT’s PrimerQuest™ Tool (Integrated DNA Technologies, Inc., Coralville, Iowa, USA). Fusion qPCR reactions were performed with TaqMan Fast Advanced Master Mix (Cat# 4444556, Thermo Fisher Scientific, Waltham, MA, USA) according to the manufacturer’s protocol.
Orthogonal laboratory sequencing
Paired tumor and normal FFPE samples from 59 patients were sequenced at the BostonGene laboratory. The same set of samples was sequenced at a CLIA-certified, CAP-accredited, and NY-state-approved reference laboratory, Novogene (https://en.novogene.com/technology/certification/). All mutations assessed via Novogene’s sequencing technology, including SNVs, INDELs, CNVs, TMB score, and MSI status, were considered as a reference set.
Assessment of coverage requirements
We used 83 internal control (COLO829 cell line) samples to assess the percentage of coverage completeness. To observe the coverage dependency on low-covered samples, in silico downsampling of each sample was performed to get a range from median 20X to 440X via random deletion of subsets of reads from each sample’s BAM file using samtools v.1.15.1.
To establish the sufficient number of reads for proper transcripts detection and expression level assessment, 10 technical replicates of the same extract from the COLO829 cell line were used. Each sample was in silico downsampled to the stage of 5, 10, 15, 20, 30, 40, 50, 60, 70, 80 million reads. The downsampling was performed on fastq files via seqtk v1.3 prior to further analysis.
Assessment of SNV and INDEL variant-calling performance
Cell line references
True mutations and putative artifacts were classified by assessing the concordance between VAF decrease and cancer cell line purity. Mutations detected at least three times in serial dilutions with a linear coefficient in the range of 0.70 to 1.30 and a Pearson’s correlation coefficient greater than 0.80, or detected a minimum of four times with a Pearson’s correlation coefficient greater than 0.50, were classified as true mutations and were used to assess the precision of our mutation calling. True-positive mutation sets for this validation were defined by performing four to six NGS replicates on COLO829, HCC1143, HCC1937, HCC1395, and NCI-H1770 cell lines with 100% tumor purity. The replicate sequencing outputs were corrected with sequencing data from matched-normal samples and intersected to identify true mutations with a VAF standard deviation of less than 10% and covered by at least 80 reads. These defined the true positives for each cell line, which were used to assess the sensitivity of mutation calling. We adjusted the reference set of variants for each dilution, excluding variants with a median VAF lower than 5%. The SNV VAFs of HCC1395, HCC1143, HCC1937, and COLO829 cell lines were compared with previously published data52,53. Sensitivity and VAF correlation were calculated.
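The dilution-based classification rule above can be sketched as follows. The sketch rests on assumptions the text does not spell out: the "linear coefficient" is taken to be the least-squares slope of observed VAF against expected VAF (expected VAF being the VAF at 100% purity scaled by tumor fraction), and undetected dilutions are passed as `None`. All function names are hypothetical.

```python
import math


def pearson_and_slope(xs, ys):
    """Return (Pearson r, least-squares slope of y on x)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return 0.0, 0.0  # no variation: treat as uncorrelated
    return sxy / math.sqrt(sxx * syy), sxy / sxx


def is_true_mutation(expected_vafs, observed_vafs) -> bool:
    # Keep only the dilutions in which the mutation was detected.
    pairs = [(e, o) for e, o in zip(expected_vafs, observed_vafs) if o is not None]
    n = len(pairs)
    if n < 3:
        return False
    xs, ys = zip(*pairs)
    r, slope = pearson_and_slope(xs, ys)
    # Rule 1: >= 3 detections, slope in [0.70, 1.30], r > 0.80.
    if 0.70 <= slope <= 1.30 and r > 0.80:
        return True
    # Rule 2: >= 4 detections, r > 0.50.
    return n >= 4 and r > 0.50
```

A variant whose observed VAF tracks the dilution series is accepted, while a call that appears at a constant or erratic VAF regardless of purity is treated as a putative artifact.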
Clinical data utilization
The performance of germline and somatic variant calling was assessed via calculation of recall (sensitivity) and precision quality metrics. Variants identified in the independent laboratory’s (Novogene) sequencing data were considered as a reference set of mutations. Clinical data was also used for mutation counts assessment within clinically actionable genes (Supplementary Table 6).
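Recall and precision against a reference call set reduce to set operations on variant keys. A minimal sketch, assuming each variant is identified by a hypothetical (chromosome, position, reference allele, alternative allele) tuple:

```python
def precision_recall(test_calls, reference_calls):
    """Compare a test call set with a reference set of variant keys."""
    test, reference = set(test_calls), set(reference_calls)
    true_positives = len(test & reference)
    precision = true_positives / len(test) if test else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall
```

Here the reference set would be the variants identified in the independent laboratory's data, and the test set would be the assay's own calls for the same sample.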
Performance assessment of TMB-score and MSI status detection
TMB score and classification of high or low TMB levels were calculated based on somatic mutation calling from both BostonGene’s and the independent laboratory’s (Novogene) sequencing data. Correlation of the TMB score and the accuracy of classification was assessed. MSI score and classification as microsatellite stable (MSS) or MSI were calculated for both BostonGene’s and Novogene’s sequencing data. Correlation of the MSI score and the accuracy of classification was assessed.
Assessment of CNV detection performance
Performance of our LDT’s CNV detection was first assessed using data available for COLO829 from Craig et al.53. The original annotation table consisted of 6586 genes covering four unique CNV classifications: gain, loss, focal gain, and focal loss. After intersection with our list of covered regions, the final set consisted of 5706 genes. In order to compare the previously published classifications with our classifications, we unified the categories as gain, loss, or neutral. Therefore, we replaced the focal gain and focal loss categories with gain and loss, respectively. The same annotation was used for our categories: shallow deletions and deletions were categorized as losses, while amplifications and gains were categorized as gains.
Next, we evaluated the consistency of our CNV calling algorithm by assessing the quality metrics for cell line dilutions. Due to the absence of gold standard references with widely acknowledged lists of CNVs, we created references for each cell line by using different and independent runs of samples with 100% tumor purity: six replicates for COLO829, HCC1143, and HCC1937 and four replicates for HCC1395 and NCI-H1770.
For CNV normalization and classification, we used an approach based on ploidy of the sample and calculated the total number of copies. We normalized total CNVs by ploidy value, thus classifying them into five categories (please see the Bioinformatics procedures section). All the comparisons with references and clinical samples were carried out in these normalized terms. For cell line references, normalized regions were compared between four (NCI-H1770, HCC1395) and six (COLO829, HCC1143, HCC1937) sequencing replicates. For further analysis, only completely concordant regions were selected.
We also performed analytical validation using 52 clinical samples that were sequenced on two platforms: BostonGene and an independent laboratory (Novogene). The Novogene samples were used as references for all the metrics that were calculated; the samples were divided into two categories (low-purity: 20–30%; high-purity ≥ 30%) based on the differences in metrics.
Statistical analysis for test samples (i.e., cell line dilutions, clinical samples) was calculated for arm- and gene-level CNVs. An arm-level CNV was defined as a segment that exceeds 50% of the actual arm length. Thus, the total copy number (tCN) of the arm was defined by the presence of arm-level segments. Notably, if the arm was fractured by a high amount of CNVs (e.g., chromothripsis), then it was excluded from the statistical analysis of that sample. A gene-level CNV was defined as a gene that intersects with its corresponding segment by at least 10% of its length. If a gene had an intersection with multiple segments, then the tCN of the longest of those segments was assigned for that gene. This approach allowed us to resolve discrepancies between annotation of segments, which allowed for two CNV profiles to be objectively compared.
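The gene-level assignment rule above (at least 10% overlap, longest segment wins) can be sketched as follows. Coordinates are assumed to be half-open intervals on a single chromosome, and the function name and tuple layout are hypothetical.

```python
def gene_total_copy_number(gene_start, gene_end, segments):
    """Assign a tCN to a gene from overlapping segments.

    segments: iterable of (start, end, tcn) on the same chromosome.
    Returns None when no segment covers at least 10% of the gene.
    """
    gene_length = gene_end - gene_start
    candidates = []
    for start, end, tcn in segments:
        overlap = min(gene_end, end) - max(gene_start, start)
        if overlap > 0 and overlap >= 0.10 * gene_length:
            candidates.append((end - start, tcn))
    if not candidates:
        return None
    # With several qualifying segments, take the tCN of the longest one.
    return max(candidates, key=lambda item: item[0])[1]
```

Taking the longest qualifying segment gives every gene a single tCN even when two CNV profiles place their breakpoints differently.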
Loss of heterozygosity (LOH) events were defined as segments that had a BAF equal to zero, while copy neutral LOH (cnLOH) events were defined as segments that had a BAF equal to zero and a tCN equal to ploidy. Gene-level metrics were calculated for both types of events. Negative events were defined as neutral events in the reference sample (tCN is equal to ploidy), while other events (tCN is not equal to ploidy) were considered as positive events. During the comparison between test and reference samples, we used a straightforward approach in which normalized statuses had to match each other: nCN “gain” for a particular gene/arm in the reference sample had to be “gain” in the test sample to be counted as a true positive, while a “gain” in the reference sample and “amplification” in the test sample resulted in a false-positive event.
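The matching rule above can be written as a small decision function. The text specifies the true-positive and mismatched-state cases; scoring a non-neutral reference event that is called neutral in the test sample as a false negative, and neutral-versus-neutral as a true negative, are assumptions made for this sketch.

```python
def compare_event(reference_state: str, test_state: str) -> str:
    """Score one gene/arm by comparing normalized CNV states."""
    if reference_state == "Neutral":
        # Negative event in the reference.
        return "TN" if test_state == "Neutral" else "FP"
    if test_state == reference_state:
        return "TP"
    if test_state == "Neutral":
        return "FN"  # assumption: a missed positive event
    # Mismatched non-neutral states, e.g. reference "Gain" vs test "Amplification".
    return "FP"
```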
Analysis of gene expression correlation
We assessed the expression level of ERCC spike-ins for FF samples (37 clinical samples and 18 cell lines) to determine the TPM limit of detection. ERCC transcripts were aligned by Kallisto v0.42.4 to GRCh38.d1.vd1 with the addition of 92 ERCC transcript sequences. After alignment and removal of outliers, the expression values of 89 ERCC transcripts were normalized to TPM, but scaled to a total of 10,000 instead of 1 million. For the FF and FFPE comparison, genes with an expression STD greater than 2 between paired FF and FFPE samples, processed using poly-A-based or EC-based RNA-seq, were selected; this resulted in 1389 genes for comparison. The need to assess highly variable genes arose from the observation that, for stably expressed genes, the correlation between FF and FFPE RNA-seq data may appear artificially low, even though direct comparisons showed that gene expression levels remained consistent across the cohort of samples. We then compared the expression levels of 99 genes in FFPE samples from 51 patients with qPCR measurements.
The reproducibility (CV for gene expression and gene signatures) assessment was performed on 10 FFPE clinical samples14. Each of the ten samples was processed as three technical replicates, and expression-based CV was assessed within each sample's replicates separately.
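The rescaled TPM normalization described above can be illustrated as follows. This is a sketch under stated assumptions rather than the authors' code: the function name `scaled_tpm` is hypothetical, and the inputs are assumed to be estimated read counts and effective transcript lengths for the ERCC transcripts only.

```python
from typing import List, Sequence


def scaled_tpm(counts: Sequence[float],
               effective_lengths: Sequence[float],
               scale: float = 10_000.0) -> List[float]:
    """TPM-style normalization whose values sum to `scale` instead of 1e6."""
    # Length-normalized read rate per transcript.
    rates = [c / l for c, l in zip(counts, effective_lengths)]
    total = sum(rates)
    if total == 0:
        return [0.0] * len(rates)
    return [r / total * scale for r in rates]
```

With two transcripts of equal length and counts of 10 and 20, the values come out at one third and two thirds of 10,000.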
Reproducibility of tumor microenvironment (TME) subtype classification
TME subtypes were determined using methods previously described by Bagaev et al.14. Seven clinical samples from patients with solid tumor diagnoses were used for reproducibility assessment. Each sample was prepared in three replicates within one day and on two additional days (five replicates in total). Each set of replicates was prepared with 15, 20, and 50 ng of input starting material, resulting in a total of 15 replicates for each clinical sample.
Assessment of fusion genes detection performance
The F1-score was calculated for different FFPM thresholds. Precision was assessed on synthetic reference materials (Myeloid Fusion Reference, Seraseq FFPE Tumor Fusion RNA v4 Reference Material, Horizon Reference). Sensitivity was assessed on the same synthetic reference materials and on cell lines (NCI-H2228, MCF-7, K562, RT-4, THP-1, A-673, SK-BR-3, U118MG, BT-20). We determined the LOD as a function of the maximum fusion expression level based on our cell line dilution experiments.
Three different types of samples were used to assess the fusion detection performance of our LDT: (i) FF samples; (ii) FFPE samples extracted using a QIAamp DNAMaxwell kit; and (iii) FFPE samples extracted using a Qiagen kit. Fusions detected by RNA-seq were confirmed by real-time qPCR with primers designed to detect fusions or fusion products.
Statistics and reproducibility
Analytical accuracy
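The accuracy of our NGS LDT was calculated by comparing our results to the reference results (either reference sets or orthogonal data). Accuracy was evaluated with a standard formula that reflects the percentage of correct predictions in the entire set:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \times 100\%$$
where TP, TN, FP, and FN are the numbers of true-positive, true-negative, false-positive, and false-negative calls, respectively.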
Analytical sensitivity/recall (PPA, or true-positive rate)
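Sensitivity was evaluated with a standard formula that can be interpreted as the percentage of correct positive predictions in the reference data set (TP, true positives; FN, false negatives):
$$\mathrm{Sensitivity} = \frac{TP}{TP + FN} \times 100\%$$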
Analytical specificity (NPA, or true-negative rate)
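Specificity was evaluated with a standard formula that reflects the percentage of correct negative predictions in the reference data set (TN, true negatives; FP, false positives):
$$\mathrm{Specificity} = \frac{TN}{TN + FP} \times 100\%$$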
Precision (PPV)
Precision was evaluated with a standard formula that reflects the percentage of correct positive predictions in the reported data set (TP, true positives; FP, false positives):
$$\mathrm{Precision} = \frac{TP}{TP + FP} \times 100\%$$
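For readers who want to reproduce these metrics from confusion-matrix counts, a minimal sketch follows. The `Counts` container and function names are hypothetical, values are returned as fractions rather than percentages, and the F1-score (used elsewhere in this study) is included as the harmonic mean of precision and sensitivity.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    tp: int  # true positives
    fp: int  # false positives
    tn: int  # true negatives
    fn: int  # false negatives


def _ratio(numerator: float, denominator: float) -> float:
    """Return numerator/denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def accuracy(c: Counts) -> float:
    return _ratio(c.tp + c.tn, c.tp + c.tn + c.fp + c.fn)


def sensitivity(c: Counts) -> float:  # PPA, recall, true-positive rate
    return _ratio(c.tp, c.tp + c.fn)


def specificity(c: Counts) -> float:  # NPA, true-negative rate
    return _ratio(c.tn, c.tn + c.fp)


def precision(c: Counts) -> float:  # PPV
    return _ratio(c.tp, c.tp + c.fp)


def f1_score(c: Counts) -> float:
    # Harmonic mean of precision and sensitivity, written in count form.
    return _ratio(2 * c.tp, 2 * c.tp + c.fp + c.fn)
```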
Jaccard index
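The Jaccard index was calculated for side-by-side comparison of CNV profiling of clinical samples via a standard formula where the length of segments was evaluated in nucleotides:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
where $A$ and $B$ are the sets of nucleotides covered by the compared segments in the test and reference profiles.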
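A nucleotide-level Jaccard index over genomic intervals can be computed as in the sketch below. It is illustrative only: the function names are hypothetical, intervals are half-open `(start, end)` pairs on a single chromosome, and each input list is assumed to contain non-overlapping intervals (for example, all segments sharing one CNV status in one profile).

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open (start, end)


def total_length(intervals: List[Interval]) -> int:
    return sum(end - start for start, end in intervals)


def intersection_length(a: List[Interval], b: List[Interval]) -> int:
    """Length in nucleotides shared by two sets of non-overlapping intervals."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    shared = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if hi > lo:
            shared += hi - lo
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return shared


def jaccard(a: List[Interval], b: List[Interval]) -> float:
    shared = intersection_length(a, b)
    union = total_length(a) + total_length(b) - shared
    # Two empty profiles are treated as identical.
    return shared / union if union else 1.0
```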
VAF correlation coefficient
Standard Pearson’s correlation coefficient was calculated for test and reference data.
Coefficient of variance for gene level expression
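The CV was calculated as
$$\mathrm{CV} = \frac{\sigma}{\mu} \times 100\%$$
for each gene expression as a measure of reproducibility, where $\sigma$ and $\mu$ are the standard deviation and mean of the expression values across replicates.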
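The CV is conventionally the ratio of the standard deviation to the mean. A minimal sketch for one gene across technical replicates is shown below; the function name is hypothetical, and the use of the sample standard deviation (n − 1 denominator) is an assumption, since the estimator is not specified here.

```python
import statistics
from typing import Sequence


def cv_percent(values: Sequence[float]) -> float:
    """Coefficient of variation (%) of one gene across replicates."""
    mean = statistics.mean(values)
    if mean == 0:
        return float("nan")  # CV is undefined when the mean is zero
    # Assumption: sample standard deviation (n - 1 in the denominator).
    return statistics.stdev(values) / mean * 100.0
```

For three replicates with expression values 9, 10, and 11, the standard deviation is 1 and the mean is 10, giving a CV of 10%.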
Gene signature variation assessment
for each gene signature for the corresponding samples, where q is a quantile of signature expression distribution for the corresponding TCGA cohort.
Clinical patient analysis
Cohort of patients
A total of 2230 clinical samples (1996 solid tumors and 234 hematological malignancies) for which both WES and RNA-seq data were available were included in this study (Supplementary Table 19). Patient samples were collected at the BostonGene laboratory between 2021 and 2024. All samples were processed in-house following the same standard operating procedures. Each patient provided informed consent for the use of their tumor and normal samples and associated clinical data for research purposes. The use of clinical samples was conducted in accordance with the Declaration of Helsinki and has been granted exemption from ethics approval by the Biomedical Research Alliance of New York (BRANY) Institutional Review Board (IRB; #22-12-938-853).
For each patient, sets of somatic SNVs and INDELs with VAF assessment, CNVs with absolute and normalized copy numbers, and fusion transcripts with FFPM and expression values across the entire transcriptome were obtained.
Panorama of driving events
The mutation rate of a particular gene was investigated for each diagnosis group. The percentage assigned to each displayed gene for each diagnosis was calculated as a cumulative value of all alterations present in the gene (if a gene contained two or more SNVs or CNVs in a particular patient, it was counted only once). Frequency thresholds for different events were 0.2 for somatic SNVs/INDELs, 0.2 for CNV events (both amplification and deletion), 0.05 for fusion events, and 0.03 for clinically pertinent germline variants. For somatic SNVs and INDELs, intronic and silent variants were filtered out.
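The "counted only once" rule can be sketched by collecting the set of altered patients per gene within one diagnosis group. The function name and the `(patient_id, gene)` record layout are hypothetical; this is an illustration of the counting rule, not the study's code.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def gene_alteration_frequency(alterations: Iterable[Tuple[str, str]],
                              n_patients: int) -> Dict[str, float]:
    """Fraction of patients in one diagnosis group with any alteration per gene.

    `alterations` holds one (patient_id, gene) record per detected alteration.
    """
    patients_by_gene = defaultdict(set)
    for patient_id, gene in alterations:
        # A set ensures a patient with several alterations in the same gene
        # contributes only once.
        patients_by_gene[gene].add(patient_id)
    return {gene: len(patients) / n_patients
            for gene, patients in patients_by_gene.items()}
```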
Overexpression calculation
The percentage of overexpression of a particular gene for a diagnosis was calculated as the number of samples with expression of the gene higher than 87% (more than the STD of normal distribution) of the cohort with the same diagnosis divided by the overall number of samples in the cohort (internal normalization).
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Results
Development of an integrated RNA and DNA sequencing assay
A complementary WES and RNA-seq bioinformatics pipeline was optimized for an integrated comprehensive NGS-based assay (Methods). Matched-normal WES was used to accurately identify tumor purity, ploidy, somatic variants, TMB, MSI, and CNVs54. As nucleic acid degradation is a well reported challenge during extraction from FFPE tissues55,56, an exome capture (EC) workflow was used to increase DNA- and RNA-seq quality from degraded FFPE material (Fig. 1a)57. To replicate standard poly-A mRNA enrichment from fresh frozen (FF) tissue, 17 Mb of untranslated regions (UTRs) were added to the RNA EC protocol (Fig. 1a, Methods), resulting in even gene body coverage and full transcript capture (Supplementary Fig. 1a, b)58.
a Tissue sample processing workflow. b Tumor/normal variant calling and RNA confirmation in TP53 exons; reference (blue) and mutant (red) alleles shown. Sample integrity confirmed via SNV concordance and HLA genotype. c Proportion of read types in optimized DNA and RNA sequencing. d Dependence of variability of detected transcripts on total read numbers per sample (n = 10). Box plots show the interquartile range. e Dependence of coefficient of variance (CV) of single gene expression (log2TPM) on the expression level. f Schema showing the three-step validation approach for different biomarker groups. g Schema of experiment to develop comprehensive somatic references for COLO829, HCC1143, HCC1395, HCC1937, and NCI-H1770 cell lines. h Scatter plot of VAF and coverage for all variants (n = 500) in COLO829 (100% purity); true positives (purple), RNA-confirmed (yellow), and filtered false positives (pink). Dashed line marks filtering threshold. Somatic (i) and germline (j) variant calling F1-scores by coverage and tumor purity (i) or genotypes (j); shading shows s.e.m. from 10 samples per dilution (n = 60). Dependence of coverage completeness (k), BAF variance (pink dots, left y-axis), and tumor/normal depth ratio variance (purple dots, right y-axis) (l) on median sample coverage (n = 250).
To ensure WES and RNA-seq integration and exclude cross-contamination, SNV concordance (>70%) and HLA genotyping were assessed across normal and tumor WES and tumor RNA-seq samples (Fig. 1b). While exonic germline variants were detected across all three samples, utilization of matched-normal samples allowed separation of somatic and germline variants with high precision. RNA-seq further confirmed expressed driver mutations (Fig. 1b).
To improve the overall performance of the WES assay, EC protocols covering 19,047 genes were optimized to maintain low off-target and duplication rates (Fig. 1c). The duplication rate (27.2 ± 0.90%) was used to assess the initial library complexity and quality of the input material for WES, and the off-target rate (29.0 ± 0.28%) served as a probe specificity indicator (Methods). Low complexity regions, homopolymeric regions (≥9 bp), and highly polymorphic genes were excluded to minimize false-positive SNVs and INDELs (Supplementary Tables 1–3). Less than 3% of the target exome was characterized as difficult to align and filtered from the analysis (Fig. 1c).
For RNA-seq, the proportion of on-target reads was prioritized. After removing low-quality and unmapped reads, 91% of reads were from coding sequence (CDS) exons and UTRs, while 9% were in introns and intergenic regions (Fig. 1c). The total number of reads (read count) determines the limit of detection (LOD) of expression59. Transcript diversity plateaued at 50 M paired-end (PE) reads (Fig. 1d) with <20% CV for transcripts expressed above 1 transcript per million (TPM; Fig. 1e, Methods).
Final performance assessment for each biomarker group included: (1) technical validation using reference standards; (2) concordance with orthogonal assays on a reference cohort of patients with known alterations; (3) assay application in clinical practice and comparison with other large genomic studies (Fig. 1f).
Unique exome-wide WES/RNA integrated somatic reference standards
A major challenge in the analytical validation of WES assays is the lack of reference exomes for somatic variant calling to establish accurate exome-wide performance characteristics60,61. To address this limitation, somatic mutation reference standards were established using commercial cell lines (COLO829, HCC1143, HCC1937, HCC1395, and NCI-H1770) with serial dilutions (10%–100%; Fig. 1g, Methods, Supplementary Table 4). Reference cancer cell lines were heterogeneous and highly polyploid, with somatic variants spanning 5% to 100% VAF52. Tumor purity was concordant with VAF for true-positive calls; therefore, false positives were identified by VAF and tumor purity discordance (Supplementary Figs. 1c–g). Similarly, true variants had biologically meaningful trinucleotide context, while FFPE artifacts (i.e., polymerase errors) exhibited random context, indirectly supporting separation of true- and false-positive calls (Supplementary Figs. 1c–g)62,63,64. The defined set of mutations was additionally confirmed by evaluating corresponding SNVs from RNA-seq (Fig. 1h, Supplementary Fig. 2a–e). While 35% of reference variants were found in RNA-seq, only 1% of variants below the thresholds intersected (Fig. 1h, Supplementary Fig. 2a–e). The VAF of expressed variants between RNA-seq and WES had varying correlations (r = 0.2–0.6 for 10–100% purity, Supplementary Fig. 2f–k), likely explained by allele-specific expression65 and differential expression of cancer-specific genes (Supplementary Fig. 2l–n)66. RNA-seq VAF of genes overexpressed in cancerous tissue can be high at low purities (<30%; Supplementary Fig. 2m).
The final set contained 3042 variants and up to 18,354 wild-type (WT) genes for each cell line, expanding previous limited sets (Supplementary Tables 4–5)52. The SNVs and INDELs were distributed uniformly across the exome with an average density of 43 mut/Mb, providing the most comprehensive reference somatic exome to date (Supplementary Fig. 2o). The reference is available at https://github.com/BostonGene/Somatic_reference_standards/ (ref. 67).
Optimized thresholds for accurate variant calling
A minimum coverage depth of 80 reads was required to achieve F1-scores of >0.90 for cell lines with ≥20% purity (Fig. 1i, Supplementary Fig. 3a, b). A similar analysis of germline variant calling established a minimal coverage depth of 20 reads (Fig. 1j). A median WES target coverage of 150X yielded >85% completeness at the 80 reads depth and stable F1 scores (Fig. 1k). At 150X median coverage, clinically actionable genes achieved a median coverage of 240X, lowering the LOD to 2% VAF for these genes (Supplementary Table 6). At 100X normal WES, germline variants reached >0.95 F1-score with a threshold of 20 reads per nucleotide (Fig. 1j–k).
Precise assessment of normal/tumor depth ratio and B-allele frequency (BAF) is instrumental in the accurate determination of CNVs24. At 150X, the variance of depth ratio plateaued and BAF stabilized (Fig. 1l), while the F1-score slightly improved with higher coverage (Supplementary Fig. 3a, b). A median coverage of 150X with high overall completeness appeared to be optimal for clinical tumor/normal matched variant calling and was significantly lower than the coverage typically required by targeted tumor-only assays1.
Analytical validation of RNA-seq gene expression
DNA contamination is difficult to identify in RNA libraries and can result in abnormal distribution of gene expression. Additional DNase treatment steps can mitigate DNA contamination in RNA-seq samples (Fig. 2a). A positive dependence was observed between DNA contamination and the proportion of sense strand reads (Fig. 2b), which could serve as an indicator of contamination in strand-specific RNA-seq.
Gene expression in DNA-contaminated samples (a), sense strand reads proportion and DNA contamination (b), before and after DNase treatment. Correlation of measured TPMs for ERCC spike-ins (n = 89) with known molar concentrations in tumor (c) and cell line (d) samples. Colors indicate four transcript groups in the mix (red, pink, blue, gray). e Schema of orthogonal validation of gene expressions. WTS, whole transcriptome sequencing. f Distribution of Pearson’s correlation coefficients for single gene expression (n = 1389) between FF and FFPE samples of the same tissue. Genes with wide dynamic range (STD > 2) were used. Shading shows 95% confidence interval of correlation coefficient. g Correlation between RNA-seq (log2TPM) and qPCR (ΔCt) measurements for 4 genes (n = 57). h Distribution of Pearson’s correlation coefficients for single gene expressions (n = 99) measured by RNA-seq (TPM) or qPCR. i Intra-day reproducibility of gene expressions (TPM) measured in the same sample; r = 0.99, p = 1.0e-308. j Gene expression CV and TPM across 10 clinical samples measured on 3 different days. k Median CV of gene expressions (n = 20,062) and signatures (n = 21) for TPM > 1. Box plots show the interquartile range. l Gene expression stability across COLO829 cell line ranges over 4 months; expressions from the same FFPE block preparation remained stable. m Schema of TME subtypes classification. n Median variance for signature score reproducibility for inter- and intra-day replicates (n = 3) of clinical samples (n = 7). Box plots show the interquartile range. o TME subtype probability for varying input amounts (10, 20, 50 ng). Error bars: s.e.m. p Heatmap of gene signature scores from Bagaev et al. across all input amounts in reproducibility experiments for predicted TME subtypes.
The RNA-seq pipeline was tested on commercial ERCC Spike-In Mix references (89 foreign transcripts) admixed in RNA extracted from 18 cell lines and 55 tumor samples, and measured TPMs correlated strongly with the known concentrations (r = 0.96–0.97; Fig. 2c, d)68,69. Although expression values from EC and poly-A RNA-seq cannot be compared directly due to batch effects stemming from technical differences70,71, gene expression across 110 paired FFPE and FF samples exhibited strong correlations (Fig. 2e, f, Supplementary Table 7). The median correlation of 1389 variably expressed genes (STD > 2) reached 0.88 (Fig. 2f).
To orthogonally validate FFPE-derived RNA-seq expression values, qPCR was performed on matched samples for 99 genes from different genomic regions in 51 clinical samples (median r = 0.85; Fig. 2e, g, h, Supplementary Table 8). FFPE RNA-seq was stable (r = 0.99, p = 1.0e-308) with strong correlations between replicates from different days (Fig. 2i). Using multiple replicates, the RNA-seq LOD was established as 1 TPM, where gene expression CV was <20%. Below 1 TPM, gene expression reproducibility was low (Fig. 2j). Overall, the CV for all genes expressed at >1 TPM was as low as 3.6% and 2.4% for single genes and gene signatures, respectively (Fig. 2k). RNA from two COLO829 FFPE blocks was sequenced over four months to assess assay stability over time (Fig. 2l). Expression analysis demonstrated high inter-assay reproducibility with <5% variability over time for cell lines from the same passage (Fig. 2l). We recommend careful preparation and passage annotation of cell line RNA prepared as a reference control.
Reproducibility of gene signature scores and TME classification
Gene signatures can capture functional changes and complex properties of cancer tissue and can be applied for tumor microenvironment (TME) classification across cancer types72,73,74,75,76,77. Bagaev et al. described pancancer TME subtypes associated with immunotherapy response: Immune-Enriched (IE), Immune-Enriched/Fibrotic (IE/F), Fibrotic (F), and Immune-Depleted (D)14.
To further validate the feasibility of classification using these TME subtypes in clinical settings, tumor samples (10, 20, and 50 ng input) were sequenced five times (Fig. 2m)14. Gene signature (ssGSEA) scores were stable across intra- and inter-day replicates, with 2.8% and 3.1% variance, respectively (Fig. 2n), resulting in high reproducibility of TME classification probability scores across all RNA inputs (STD < 2%, Fig. 2o). As shown in the corresponding heatmaps for a range of typical and atypical samples (Fig. 2p, Supplementary Fig. 4a), the overall reproducibility of signatures and TME classification across all tested samples supported the clinical application of RNA-seq for complex diagnostic signatures. Applying the classification to a cohort of 1399 clinical samples, we confirmed that the distribution of gene signatures (Supplementary Fig. 5a), mean Z-score of signatures (Supplementary Fig. 5b), and subtype distribution within major cancer types (Supplementary Fig. 5c) all concurred with previously reported findings, supporting reproducibility39.
Gene fusion calling accuracy depends on intrinsic transcript expression level
Fusions were identified by junction and spanning reads after removing artifacts and misalignments (Fig. 3a). Expression values for fusion transcripts are generally higher than WT transcripts78, which can upshift fusion transcript exon coverage (Fig. 3b); however, if WT gene expression is high, the impact of fusion transcripts on coverage may not be observed. Detection of fusions depended on the transcript expression, spanning and junction reads, and overall coverage. A minimum of 30 M PE reads was required to detect the majority of tested fusions, with fusion-related reads plateauing at 50 M PE reads (Fig. 3c). The pipeline was benchmarked against other approaches. For low-expressed fusions at 30% purity, the recommended approach led to 100% sensitivity, compared to 88.9% and 77.8% for Arriba (v2.4) and STAR-Fusion, respectively (Supplementary Fig. 6a–c)30,32, with similar performance for highly expressed events (Supplementary Fig. 6a, b).
a Schema for fusion detection based on junction and spanning reads. b Diagram of ATIC–ALK fusion with exon coverage near the breakpoint. Red and blue lines show average coverage for right and left partner genes; gray line indicates normal gene coverage. c Fusion-supporting reads and total sequenced reads; dashed line marks the required read threshold. d Scatter plot of fusion gene partner expression and fusion fragments per million (FFPM) for 76 reference samples. F1-score, sensitivity, and precision of fusion calling and FFPM threshold as LOD for FF (e) and FFPE (f) samples. Shading shows the s.e.m. g Fusion covering reads and tumor purity for 22 fusions at 100% purity. h Purity LOD for fusions with different initial FFPM levels (>2, 1–2, <1). i Heatmap of qPCR validation for fusion-gene detection in patient samples (n = 25). j Diagram of RPUSD3–NTRK2 fusion with exon coverage near the breakpoint. k Reproducibility of fusion detection across different days; red crosses mark absent calls. l BCR fusion assembly and major clonotype fraction in Epstein-Barr immortalized B cells (GM12877, GM12878) and follicular lymphoma samples, comparing FF and FFPE tissues.
The first technical validation step included a diverse set of reference materials and cell lines (Methods; Fig. 3d, Supplementary Table 9) and resulted in 0.98 overall analytical sensitivity for detection of 83 different fusions (Fig. 3d). The properties of the fusion breakpoint regions varied significantly, especially the expression levels of gene partners of the fusion transcript (Fig. 3d). Well-characterized cell lines and reference materials (Methods) were used to assess sensitivity, precision, and F1 scores, indicating that a 0.1 fusion fragments per million (FFPM) LOD maximized fusion detection performance for FF (F1 = 0.95 ± 0.02) and FFPE (F1 = 0.91 ± 0.01) samples (Fig. 3e, f). Four cell lines were evaluated to establish the LOD for fusions with varying maximum FFPMs (Fig. 3g). For transcripts with >2 FFPM, the LOD was 10% purity, while fusions with 1–2 FFPM had a 20–30% purity LOD (Fig. 3h). Fusions with low expression values (<1 FFPM) required an estimated 30–50% purity LOD. The LOD depended on tumor content48 and fusion expression levels.
Orthogonal qPCR validation included 115 fusions in 63 patient samples (Fig. 3i and Supplementary Tables 10–11), resulting in 98.0 ± 1.7% sensitivity, 99.0 ± 0.9% precision, and >99.9% specificity (Fig. 3i). Notably, whole transcriptome RNA-seq can be used for de novo gene fusion calling, overcoming the limitations of predesigned probe sets. A total of 69 de novo fusions were confirmed by qPCR, including RPUSD3–NTRK2, which was detected in a uterine sarcoma (Fig. 3j). Evaluation of 19 fusions in 8 clinical samples over 3 days showed high reproducibility (100%) for fusions (>1 FFPM; Fig. 3k).
Bulk RNA-seq can also be used to assemble V(D)J recombination fusion transcripts of the CDR3 region of T- (TCR) and B-cell receptors (BCR)33,78. Minimal differences were observed when BCR fractions were compared from FF and FFPE on B cell lines and lymphoma samples, demonstrating that EC is comparable to poly-A RNA-seq for BCR assembly (Fig. 3l and Supplementary Table 12).
Analytical validation of exome-wide somatic SNV and INDEL variant calling
Initial somatic SNV and INDEL variant calling was performed by applying Strelka216 and standard filtration79, which rendered low F1 scores (Supplementary Fig. 7a–f, Methods). Variant calling between FF and FFPE samples was tested using the reference set of 3042 mutations, resulting in concordant VAFs (r = 0.94, p = 1.0e-308) and high precision (95%; Fig. 4a). However, 176 of the 239 variants were confirmed as true variants in pileup files (Methods), leading to an adjusted precision of 98%, indicating the filtering approach removed the majority of FFPE artifacts.
a Correlation of VAF for reference variants (n = 3042) between fresh frozen (FF) and formalin-fixed paraffin-embedded (FFPE) samples; Pearson correlation r = 0.94, p = 1.0e-308. Shading shows the 95% confidence interval. b Concordance of detected variants in the current study with previously published variants (n = 558) in the same four cell lines52, performed in duplicate; Pearson correlation r = 0.87, p = 4.9e-177. Shading shows the 95% confidence interval. c Dependence of TMB score on tumor purity in reference cell lines. Dashed lines represent the expected TMB for each cell line. Dependence of sensitivity and precision for somatic SNVs (d) and INDELs (e) on purity in the reference cell lines; Shading shows the s.e.m. Concordance of SNV (f) and INDEL (g) variant calling with orthogonal platform. Shading shows the 95% confidence interval. h Heatmap representing VAF reproducibility for the COLO829 reference cell line sequenced multiple times within 4 months (mutations in BRAF, ATR, HP1BP3 are highlighted in red); (i) Concordance of calculated TMB score with orthogonal platform. j Correlation of TMB and MSI scores in tumors.
Technical validation included analysis of 558 previously established somatic variants for COLO829, HCC1143, HCC1937, and HCC139552, which yielded high sensitivity (98%) and concordance with the previously reported VAFs (r = 0.87, p = 4.9e-177; Fig. 4b). TMB measured in cell lines was dependent on tumor purity but stable above 20% (Fig. 4c).
Next, the LOD and sensitivity for SNV/INDEL detection were assessed on the 3042 reference variants from COLO829, HCC1143, HCC1937, HCC1395, and NCI-H1770 at serial dilutions (10%–100% purity; Fig. 4d, e and Supplementary Table 13). Samples with ≥20% purity demonstrated sensitivities of 95.3 ± 0.89% for SNVs and 85.2 ± 3% for INDELs, while those with ≥30% purity had sensitivities of 97.1 ± 0.8% for SNVs and 89.9 ± 1.8% for INDELs (Fig. 4d, e and Supplementary Table 13). Across all cell line dilutions, >95% precision and 99.8 ± 0.04% specificity were observed. Remarkably, precision remained stable across all purities (Fig. 4d, e, and Supplementary Table 13), whereas the majority of mutations (>50%) were missed when purity was ≤10% (Supplementary Fig. 8a). Given these observations, 20% purity and 5% VAF LODs were established.
To achieve high precision (>95% across all purities), additional quality filters were introduced (Methods). The pipeline was benchmarked with Mutect2 (GATK v4.1)19 and Illumina Dragen (v4.3)20 in somatic mode using recommended filters (Supplementary Fig. 7a–f). The proposed pipeline yielded higher F1-scores for SNVs (F1 = 95.3%) and INDELs (F1 = 91.9%) compared to other tools: Strelka2 (v2.9; SNVs F1 = 73.7%; INDELs F1 = 76.4%), Mutect2 (SNVs F1 = 68.9%; INDELs F1 = 28.4%), and Dragen (v4.3; SNVs F1 = 91.6%; INDELs F1 = 31.1%) at 50% tumor purity (Supplementary Fig. 7a–f).
For orthogonal validation, 53 clinical tumor samples of various tissue origins (Supplementary Table 14) were sequenced by an independent CLIA-certified laboratory. Strong VAF correlations for the 16,553 SNVs (r = 0.92, p = 1.0e-308) and 2046 INDELs (r = 0.82, p = 1.0e-308) were observed (Fig. 4f, g). Although 2842 variants were initially discordant, 2505 of these were manually identified as low-VAF true positives (Methods), resulting in 98.8% precision and 99.1% sensitivity. Exome-wide performance on all reference and tissue samples resulted in a sensitivity of 96.1% for SNVs and 96.9% for INDELs for all genes, reaching >99% for clinically actionable alterations (Supplementary Table 15).
Comprehensive clinical tests require internal controls with diverse features, such as COLO829. Sequencing COLO829 over 4 months demonstrated consistent exome-wide mutation calling, supporting its use as a reference material in NGS testing (Fig. 4h).
Calculating exome-wide TMB and MSI
TMB and MSI are widely used in clinical practice as predictive and prognostic biomarkers for many cancer types3,80,81,82,83,84,85. TMB was dependent on tumor purity, with an LOD of 20% purity (Fig. 4c, Methods). TMB measured in Seraseq® gDNA TMB (30% and 100%) reference materials (Methods) strongly correlated with the expected values of the reference materials (100%: r = 0.9997, p = 6.1e-8; 30%: r = 0.998, p = 1.5e-8; Supplementary Fig. 8b). Notably, assessment of patient and cell line samples indicated that the optimized thresholds for FFPE filtration (Methods) resulted in negligible differences in the TMB for FF materials (r = 0.99, p = 4.2e-64; Supplementary Fig. 8c). TMB from 53 clinical samples (20–90% tumor purity; Supplementary Table 14) was concordant between our workflow and an independent reference laboratory (r = 0.99, p = 1.7e-41; Fig. 4i).
To determine the MSI LOD, serial dilutions of the MSI-high cell lines LoVo, DLD-1, HCT15, and HCT116 were prepared. Similar to TMB, MSI scores depended on purity (Supplementary Fig. 8d). For samples with >20% purity, our assay achieved 100% accuracy against orthogonal results (Supplementary Table 15). Interestingly, among the 53 clinical samples, TMB-high samples were generally MSI-high (Fig. 4j). MSI patient samples analyzed in an independent laboratory (Supplementary Table 16) yielded up to 100% sensitivity and specificity.
Robust single-copy CNV and LOH calling
An adopted algorithm for copy number detection and normalization was applied to account for sample ploidy during CNV classification (amplifications, gains, shallow deletions, deletions, and neutrals; Methods; Fig. 5a). CNVs were interpreted as either amplifications or deletions, depending on sample ploidy (Fig. 5b). For example, a segment with three copies reported as neutral in a triploid sample would appear as a gain in diploid and shallow deletion in tetraploid samples (Fig. 5b).
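The ploidy dependence described above can be illustrated with a small classifier. This is a sketch only: the function name is hypothetical, ploidy is assumed to be rounded to an integer, and the cut-offs separating "deletion" from "shallow deletion" (zero copies) and "amplification" from "gain" (at least `amp_ratio` times ploidy) are placeholder assumptions, since the actual thresholds are defined in the Methods and not restated here.

```python
def cnv_status(tcn: int, ploidy: int, amp_ratio: float = 2.0) -> str:
    """Classify a segment's total copy number relative to sample ploidy."""
    if tcn == ploidy:
        return "neutral"
    if tcn < ploidy:
        # Assumption: complete loss is a deletion, partial loss is shallow.
        return "deletion" if tcn == 0 else "shallow deletion"
    # Assumption: amplification starts at amp_ratio * ploidy (placeholder).
    return "amplification" if tcn >= amp_ratio * ploidy else "gain"
```

Under these assumptions, a three-copy segment is classified as neutral in a triploid sample, as a gain in a diploid sample, and as a shallow deletion in a tetraploid sample, matching the example in the text.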
a Schema for CNV calling using BAF and tumor/normal depth ratio. b Whole exome CNV calling examples for reference cell lines with varying ploidy (2n, 3n, 4n). CNV interpretation depends on tumor ploidy. c Sensitivity of CNV calling for COLO829 cell line (n = 5692) and its dependence on purity. d Comparison of COLO829 exome from FF and FFPE tissues. e BAF and depth ratio correlation between FF and FFPE samples. Shading shows the 95% confidence interval. f Arm- and gene-level CNV and g LOH/cnLOH calling performance across different purities for five reference cell lines (error bars show s.e.m.). h Heatmaps of exome-wide CNV comparison across 8 clinical samples sequenced in different labs; top - high purity, bottom - low purity. Gray - neutral, blue - shallow deletion, purple - deletion, pink - gain, red - amplification. i Concordance of CNV calling with orthogonal sequencing in high purity (>30%) clinical samples (n = 39). Box plots show the interquartile range.
Validation on commercial reference materials showed that the algorithm detected amplifications in the three genes in Seraseq® Breast and Brain-Lung CNV Mixes and Brain-Lung-Breast (BLB) Mix (Supplementary Table 17). Then, using a published set of 5700 genes from the COLO829 cell line53, we demonstrated >98% sensitivity at ≥20% purity (Fig. 5c). To further validate exome-wide CNV calling, a unified reference exome for CNVs covering 19,047 genes was prepared for FFPE COLO829, HCC1143, HCC1937, HCC1395, and NCI-H1770 cell lines (Fig. 5d, Supplementary Fig. 9a, Methods). BAF (r = 0.95, p = 1.0e-308) and depth ratio (r = 0.99, p = 1.0e-308) strongly correlated between FF and FFPE tissues, confirming the effectiveness of CNV calling from both sample types (Fig. 5d, e). Exome-wide CNV calling showed high precision (>97.6% and >98%), sensitivity (>81% and >94%), specificity (>96.5% and >96.4%), and accuracy (>91.8% and >96.1%) at 30% tumor purity for arm- and gene-level CNVs, respectively (Fig. 5f, Supplementary Fig. 9b, and Supplementary Table 18). Gene-level LOH and copy-neutral LOH calling resulted in >85% sensitivity and precision at 20% purity and >98% at 30% purity (Fig. 5g, Supplementary Fig. 9c, and Supplementary Table 18).
The pipeline achieved the highest F1-score (>91%), sensitivity (>86%), and specificity (>91%) across all purities compared to the performance of standard Sequenza (v2.1)24, FACETS (v0.5.14)25, and Dragen (v4.3; Panel of Normals)20, which showed sensitivities of >78%, >61%, and >31%, respectively (Supplementary Figs. 10a–c). Specificity for the other tools generally remained below 80% (Supplementary Fig. 10c).
Orthogonal testing of 52 tumors in an independent laboratory revealed high exome-wide concordance per nucleotide (Jaccard index = 98.5%; Methods, Fig. 5h, Supplementary Fig. 11a). Gene-level CNVs at >30% purity achieved median specificity of >95% and sensitivity of >92% (Fig. 5i). Using COLO829, high reproducibility with only mild variations in depth ratio was observed over 3 months (Supplementary Fig. 11b, c).
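The exact definition of the per-nucleotide Jaccard index is given in the Methods, which are not reproduced here. As one plausible reading, the sketch below scores two CNV call sets on a single chromosome as the number of bases assigned the same state by both laboratories divided by the number of bases covered by either. The function names, the half-open interval convention, and the state labels are assumptions made for illustration.

```python
# A segment is a half-open interval [start, end) with a CNV state label.
Segment = tuple[int, int, str]


def state_at(segments: list[Segment], pos: int):
    """Return the state covering `pos`, or None if uncovered.

    Assumes segments within one call set do not overlap.
    """
    for start, end, state in segments:
        if start <= pos < end:
            return state
    return None


def jaccard_per_base(a: list[Segment], b: list[Segment]) -> float:
    """Bases with identical state in both sets / bases covered by either."""
    # Every segment boundary is a potential change point.
    cuts = sorted({p for s, e, _ in a + b for p in (s, e)})
    agree = union = 0
    for left, right in zip(cuts, cuts[1:]):
        state_a, state_b = state_at(a, left), state_at(b, left)
        if state_a is None and state_b is None:
            continue  # gap covered by neither call set
        width = right - left
        union += width
        if state_a == state_b:
            agree += width
    return agree / union if union else float("nan")


# Hypothetical segments, for illustration only.
lab_1 = [(0, 100, "gain"), (100, 200, "neutral")]
lab_2 = [(0, 80, "gain"), (80, 200, "neutral")]
concordance = jaccard_per_base(lab_1, lab_2)
```

A genome-wide value would sum the agreeing and covered bases across chromosomes before dividing.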
Clinical applications of the integrated whole exome and transcriptome Tumor Portrait™ assay
Integrated WES and RNA-seq were applied to 2230 real-world clinical samples to assess mutation rates and biomarker overexpression (Supplementary Table 19, Methods). The analysis included alterations and RNA overexpression across tumor types (Fig. 6). On average, each sample had 173 SNVs, 8 INDELs, 3410 amplifications, 4174 deletions, 5 fusions, and 4 overexpressed antibody-drug conjugate (ADC)-related genes. Polyploidy was observed in 30.3% of tumors. TMB-high status (>10 mut/Mb) was found in 6.8% of cases, most commonly in cutaneous melanoma and uterine cancers. Clinically actionable mutations were found in 98% of samples, and ADC-related gene overexpression in 89%. Mutation frequencies for TP53 (43%), KRAS (12%), PTEN (4.5%), and PIK3CA (11.2%) matched published large-scale data (Supplementary Table 20)86,87,88. Notable fusions included ERG in 29.2% of prostate cancers and RB1 in 9.5% of uterine tumors.
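The TMB-high classification above reduces to a mutation count normalized by the size of the callable region. The sketch below illustrates that arithmetic; only the >10 mut/Mb threshold comes from the text, whereas the function names, the choice of which mutations are counted, and the callable region size in the example are assumptions.

```python
TMB_HIGH_THRESHOLD = 10.0  # mut/Mb, threshold stated in the text


def tumor_mutational_burden(mutation_count: int, callable_bases: int) -> float:
    """Mutations per megabase of callable sequence."""
    if callable_bases <= 0:
        raise ValueError("callable_bases must be positive")
    return mutation_count / (callable_bases / 1_000_000)


def is_tmb_high(mutation_count: int, callable_bases: int) -> bool:
    """True when TMB strictly exceeds the threshold."""
    return tumor_mutational_burden(mutation_count, callable_bases) > TMB_HIGH_THRESHOLD


# Hypothetical sample: 420 qualifying mutations over an assumed 35 Mb region.
tmb = tumor_mutational_burden(420, 35_000_000)
high = is_tmb_high(420, 35_000_000)
```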
Major genomic findings for oncogenes and tumor suppressors across solid cancers (n = 1399). Frequency thresholds for color code: 0.2 for somatic SNVs/INDELs, 0.2 for CNVs, 0.05 for fusion events, 0.03 for clinically pertinent germline variants. Intronic and silent variants were filtered out for SNVs/INDELs.
To support therapeutic targeting, we evaluated RNA overexpression of surface markers, including potential ADC targets (Fig. 7a)89,90. ERBB2 was overexpressed in 73% of HER2+ breast cancers, and its expression was concordant with amplification rate. Other commonly overexpressed genes included: FOLR1 in 67% of kidney and 77% of ovarian cancers; PMEL in 82% of melanoma; and TACSTD2 in 73% of bladder cancer. RNA data also enabled classification of TME subtypes (Fig. 7b), informing immunotherapy decisions14.
Interpretation of somatic mutations and germline variants with RNA-seq in clinical practice
While DNA remains the primary source for detecting variants, RNA-seq provides complementary information for clinical interpretation and validation. We found 37.1% of somatic SNVs and INDELs were expressed in RNA, rising to 63.9% in clinically actionable genes (Fig. 8a, b and Supplementary Fig. 12a, b). RNA-seq also confirmed 55.5% of germline variants, with 2.1% showing altered zygosity compared to WES, enabling allele-specific expression analysis. Most expressed variants were missense (63%) or silent (24.2%), but a small percentage were nonsense (5%) and frameshift (3%; Supplementary Table 21), indicating some structurally impactful mutations escape nonsense-mediated decay and may affect protein function (Fig. 8c).
a Correlation of RNA and DNA VAF for actionable gene variants in clinical samples (n = 2230), highlighting CCND1 and CDKN2A shifts in RNA VAF. b Proportion of confirmed and RNA-seq rescued variants across 2230 patients. Error bars show the s.e.m. c Mutation type distribution of RNA-seq rescued variants. d RNA and DNA VAF and coverage statistics for RNA-seq rescued variants. Box plots show interquartile range. e Heatmap of Spearman correlation of RNA expression and CNAs across genes and cancer types. f Gene expression vs. CNV status in sarcoma and breast cancer. Box plots show interquartile range. g Expression levels of partner genes in TMPRSS2–ERG or DNAJB1–PRKACA fusion-positive and -negative samples. Boxes show interquartile range.
RNA-seq VAF often exceeded DNA-derived VAF, with 16.7% of variants showing increased RNA to DNA VAF ratios and only 3.2% showing lower ratios (Fig. 8a). Seventeen genes, including CDKN2A, BCL2, MYC, and CCND1, had increased RNA VAF, which can be associated with amplification or selective expression, providing a new avenue for clinical interpretation (Supplementary Table 22 and Fig. 8a).
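The comparison of RNA and DNA VAF can be illustrated with a simple ratio-based classification. This is a sketch under stated assumptions: the text does not specify how increased or lower ratios were defined, so the two-fold cutoff and the function names below are hypothetical, and a production analysis would also account for read depth and statistical uncertainty.

```python
def vaf(alt_reads: int, total_reads: int) -> float:
    """Variant allele frequency: alternate reads / total reads."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return alt_reads / total_reads


def classify_vaf_shift(dna_alt: int, dna_total: int,
                       rna_alt: int, rna_total: int,
                       fold: float = 2.0) -> str:
    """Label a variant by its RNA-to-DNA VAF ratio.

    `fold` is a hypothetical cutoff, not a value taken from the study.
    """
    dna_vaf = vaf(dna_alt, dna_total)
    rna_vaf = vaf(rna_alt, rna_total)
    if dna_vaf == 0:
        raise ValueError("variant has no DNA support; ratio undefined")
    ratio = rna_vaf / dna_vaf
    if ratio >= fold:
        return "increased"
    if ratio <= 1 / fold:
        return "decreased"
    return "similar"


# Hypothetical read counts, for illustration only.
label = classify_vaf_shift(dna_alt=20, dna_total=100, rna_alt=90, rna_total=150)
```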
RNA-seq also rescued low-VAF subclonal mutations missed in DNA due to coverage thresholds91,92. Rescued variants had 2.7x higher VAF and 2.4x higher coverage in RNA (p = 0.1e-79 and 0.1e-9, respectively; Fig. 8d). Across 2230 patients, RNA-seq integration recovered 18,231 additional missense mutations (+4.5% to WES), 2962 splice site variants (+0.7%), and 505 frameshift mutations (+0.12%).
Interpretation of structural variants by RNA-seq
Clinically actionable CNV events correlated with expression levels in a tissue-specific manner (Methods, Fig. 8e)93,94. We assessed Spearman correlations for recurrent genes with CNVs, including CDK4 (r = 1.0, p = 1.4e-241) and CCNE1 (r = 1.0, p = 1.4e-24) in sarcoma and ERBB2 (r = 0.93, p = 2.3e-2) and CCND1 (r = 1.0, p = 1.4e-24) in breast cancer (Fig. 8e, f and Supplementary Fig. 13a).
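Spearman correlation, used above to relate copy number to expression, is the Pearson correlation of rank-transformed values, with tied observations sharing their average rank. The following dependency-free sketch shows the calculation; the function names and the example values are illustrative, and significance testing (the p values reported above) is omitted.

```python
from math import sqrt


def average_ranks(values: list[float]) -> list[float]:
    """1-based ranks; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences of length >= 2")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / sqrt(var_x * var_y)


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman rank correlation."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-sample copy numbers and expression values for one gene.
copy_number = [2, 2, 3, 4, 6, 8]
expression = [10.0, 12.0, 11.0, 30.0, 55.0, 90.0]
rho = spearman(copy_number, expression)
```

Because the statistic depends only on ranks, it captures monotonic but nonlinear relationships, such as expression that saturates at high copy number.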
We found evidence of overexpression of a fused oncogene target and a promoter donor gene (Fig. 8g and Supplementary Fig. 13c, Methods) in TMPRSS2–ERG for prostate cancer and DNAJB1–PRKACA for hepatobiliary cancer, where ERG and PRKACA are known driver events (Fig. 8g and Supplementary Fig. 13c)95,96,97. Further analysis across the 2230 patient samples found previously unreported fusion events, with 57 cases of potentially clinically actionable events needing further clinical annotation. The described approach could be used as an additional interpretation tool for nonreported oncogenic fusions.
Discussion
As the landscape of precision oncology continues to evolve with ongoing biomarker discovery and development of targeted therapies98,99, integrative analysis of RNA signatures and DNA biomarkers from a single sample is becoming increasingly important in clinical decision-making100,101,102. Unlike targeted assays covering a limited number of genes, WES and RNA-seq assays can identify de novo events, novel fusions, and gene expression levels, thereby expanding the repertoire of clinically actionable findings14,100,103. RNA-seq may also serve as a valuable immunohistochemistry alternative for uncovering ADC targets90,104. Complex gene signatures are already widely used in treatment response prediction models to estimate probabilities of immunotherapy outcomes in research settings9. Importantly, our analysis herein demonstrated the reproducibility of our TME subtype classification, highlighting its clinical utility.
To address challenges associated with the longitudinal dynamics of RNA expression that complicate reproducibility105, we used COLO829 cells as a sequencing control to monitor expression stability over time. As expected, our results revealed stable gene expression within a single batch of extracted RNA, indicating the suitability of COLO829 cells as a reference.
Whole transcriptome RNA-seq can identify nonconventional fusions106. We tested 171 gene fusions from synthetic materials, cell lines, and clinical samples, including previously unreported events with potential clinical relevance (e.g., RPUSD3–NTRK2), to uncover RNA-seq parameters that may influence fusion detection. Our analysis showed high specificity (>99.9%) and sensitivity (96%) overall for fusion detection (Supplementary Table 15). Moreover, tumor bulk RNA-seq allowed simultaneous assembly of V(D)J fusion transcripts of B and T cells within the TME. These findings indicated FFPE-derived RNA-seq can reliably identify complex and previously undetected fusion events in a clinical setting.
Traditional validation guidelines were designed for small targeted panels focused on known hotspots1,107,108, often using diluted platinum cell line variants to estimate VAF LOD108. However, such models fail to capture tumor heterogeneity and purity109,110,111,112. Standard false-positive metrics—like specificity calculated per nucleotide—inflate performance estimates and are not scalable for exome-wide analysis. To address this, we developed an exome-wide reference dataset of 3042 somatic SNVs and INDELs as well as 47,472 CNVs and high-confidence WT gene sets across five polyclonal, polyploid cell lines. This approach enables direct precision assessment based on event-level metrics, yielding a more accurate false-positive rate evaluation for WES assay development.
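The inflation of per-nucleotide specificity can be seen with a short calculation. In the sketch below, only the reference size of 3042 events comes from the text; the evaluated region size and the true- and false-positive counts are hypothetical, each variant is treated as occupying a single base, and the function names are illustrative.

```python
def per_nucleotide_specificity(false_positives: int,
                               evaluated_bases: int,
                               reference_variant_bases: int) -> float:
    """Specificity when every non-variant base counts as a true negative."""
    true_negatives = evaluated_bases - reference_variant_bases - false_positives
    return true_negatives / (true_negatives + false_positives)


def event_level_precision(true_positives: int, false_positives: int) -> float:
    """Fraction of reported events that are present in the reference."""
    return true_positives / (true_positives + false_positives)


# Hypothetical run: 2900 of 3042 reference events recovered, 300 false calls,
# evaluated over an assumed 35 Mb region.
specificity = per_nucleotide_specificity(
    false_positives=300,
    evaluated_bases=35_000_000,
    reference_variant_bases=3042,
)
precision = event_level_precision(true_positives=2900, false_positives=300)
```

With these illustrative numbers, per-nucleotide specificity stays above 99.99% even though roughly one reported call in eleven is false, which is why event-level precision gives a more informative view of the false-positive burden.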
Targeted panels may not reliably assess TMB and MSI for certain cancer types113,114, and TMB and MSI scores declined with tumor purity, indicating a need for subclonal mutation correction or purity-adjusted thresholds and exome-wide panel use115,116. Still, our assay showed >99% concordance with reference materials and orthogonal methods at ≥20% tumor purity. Across 16,553 SNVs and 2046 INDELs from samples with wide purity ranges (20–90%), exome-wide sensitivity was 96.2–96.9%, with 99.9% specificity and 99.3% sensitivity for clinically actionable genes (Supplementary Table 15).
Reliable CNV detection in tumors with variable purity and ploidy cannot rely solely on BAF and depth ratio from targeted regions117,118,119,120,121. Targeted assays often report amplifications without ploidy correction122, leading to misinterpretation. Our analysis revealed multiploidy in over 30% of samples analyzed, highlighting the need to account for ploidy in clinical interpretation of CNV gains and losses.
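The dependence of CNV interpretation on purity and ploidy can be made concrete with the mixture model commonly used by allele-specific copy number tools, in which observed signal is a purity-weighted blend of tumor and diploid normal cells. The sketch below is an illustration of that general model, not the implementation used in this study; the function names and example values are assumptions.

```python
def expected_depth_ratio(copy_number: int, purity: float, ploidy: float,
                         normal_copies: int = 2) -> float:
    """Expected tumor/normal depth ratio for a segment.

    The ratio is normalized so that a segment at the tumor's average
    ploidy yields 1.0.
    """
    segment = purity * copy_number + (1 - purity) * normal_copies
    baseline = purity * ploidy + (1 - purity) * normal_copies
    return segment / baseline


def expected_baf(b_copies: int, copy_number: int, purity: float) -> float:
    """Expected B-allele frequency at a germline heterozygous site.

    Normal cells contribute one B allele out of two copies.
    """
    b_signal = purity * b_copies + (1 - purity) * 1
    total_signal = purity * copy_number + (1 - purity) * 2
    return b_signal / total_signal


# The same four copies read as a gain in a diploid tumor but as
# copy-neutral in a tetraploid tumor (hypothetical 50% purity).
gain_in_diploid = expected_depth_ratio(copy_number=4, purity=0.5, ploidy=2)
neutral_in_tetraploid = expected_depth_ratio(copy_number=4, purity=0.5, ploidy=4)

# Copy-neutral LOH: two copies, none carrying the B allele.
cnloh_baf = expected_baf(b_copies=0, copy_number=2, purity=0.5)
```

Under this model, a depth ratio alone cannot distinguish an amplification from a baseline shift caused by whole-genome doubling, which is why ploidy must be estimated before gains and losses are reported.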
In the final validation step of the guidelines proposed herein, assay performance was evaluated on 2230 real-world clinical samples and compared to known mutation rates by diagnosis (Supplementary Table 20)86,87,123. Almost all cases (98%) harbored variants in clinically actionable genes, and 89% showed overexpression of ADC-related drug targets. These findings support the clinical utility of an exome-wide, non-targeted WES and RNA-seq combination assay for therapy selection, underscoring its untapped potential.
RNA-seq enhances the sensitivity for detecting subclonal variants with low VAF or those in poorly covered regions91. Consistent with RADIA algorithm findings, RNA-seq confirmed 37% of WES-identified variants and rescued an additional 7% of events below detection thresholds, up to 50% of which affected protein-coding regions. Notably, RNA-based VAF reflected not only tumor content, but also allele-specific expression influenced by gene regulation or CNVs. Statistically significant VAF deviations in genes like CDKN2A, BCL2, MYC, and CCND1 demonstrate the value of RNA-seq in refining variant interpretation. Clinically actionable CNV events (e.g., MTAP, ERBB2, CCNE1, CD274) showed expression changes correlated with copy number, supporting RNA-seq as a functional tool for CNV interpretation at borderline values (5–7 copies).
In summary, emerging assays require new principles for analytical and clinical validation beyond targeted panels of hotspot mutations. Integrated WES and RNA-seq assays can evaluate entire exomes and transcriptomes to identify more clinically actionable findings, highlighting the potential clinical use of RNA-seq as a variant interpretation tool for complex somatic rearrangements. Such greater diagnostic yields are valuable to clinicians, pharmaceutical companies, and payers alike, as they help direct patients to biomarker-based targeted therapies with superior outcomes, realizing the full potential of personalized medicine. We believe our proposed three-step validation approach addresses a critical need for comprehensive guidelines that can steer the development of integrated RNA and DNA sequencing assays.
Data availability
Processed and reference data are available in the supplementary tables. The raw sequencing data for the cell lines generated and analyzed during the current study are available at SRA repository PRJNA1134786 at the following URL: https://www.ncbi.nlm.nih.gov/bioproject/PRJNA1134786 (ref. 124). Variant reference files are deposited at https://github.com/BostonGene/Somatic_reference_standards/ (ref. 67). The raw sequencing data of clinical patients are not publicly available due to data privacy regulations on the use of such data. De-identified processed clinical data used to demonstrate the clinical application of the integrated WES and WTS assay are provided at https://zenodo.org/doi/10.5281/zenodo.15261029 (ref. 125). Supplementary Data 1 includes processed and reference data in supplementary tables. Supplementary Tables 1–3 list genomic regions excluded from analysis due to low complexity, homopolymeric stretches, or high polymorphism. Supplementary Tables 4–5 contain somatic references and WT genes for COLO829, HCC1143, HCC1937, HCC1395, and NCI-H1770 reference cell lines. Genes termed “clinically actionable” are available in Supplementary Table 6. Supplementary Tables 7–8 report correlations among FF and FFPE along with RNA-seq and qPCR. Supplementary Tables 9–11 include data from fusion analysis, such as performance, orthogonal qPCR, and reference databases. Supplementary Table 12 contains BCRs across FF and FFPE samples. Performance metrics are summarized in Supplementary Tables 13, 15, 17, and 18. Supplementary Tables 14 and 16 report sequencing results and MSI status for the 53 clinical samples. Supplementary Table 19 illustrates the distribution of diagnoses from the full cohort of 2230 patients. Supplementary Table 20 contains a comparative analysis of the BostonGene Tumor Portrait™ against other large-scale genomic analyses. The distribution of alteration types from RNA-seq and allele-specific expression is available in Supplementary Tables 21, 22.
Source data for the main figures in the manuscript are provided in Supplementary Data 2.
Code availability
All code for processing the reference cell line datasets is deposited online (https://github.com/BostonGene/Somatic_reference_standards/)67.
Abbreviations
- BAF: B-allele frequency
- BCR: B-cell receptor
- CDS: Coding sequence
- CNV: Copy number variation
- EC: Exome capture
- FF: Fresh frozen
- FFPE: Formalin-fixed paraffin-embedded
- INDELs: Insertions and deletions
- LOD: Limit of detection
- LOH: Loss of heterozygosity
- MSI: Microsatellite instability
- NGS: Next-generation sequencing
- PE: Paired-end
- RNA-seq: RNA sequencing
- TMB: Tumor mutational burden
- TCR: T-cell receptor
- TPM: Transcripts per million
- TME: Tumor microenvironment
- UTR: Untranslated region
- VAF: Variant allele frequency
- WES: Whole exome sequencing
- WT: Wild-type
- WTS: Whole transcriptome sequencing
References
Frampton, G. M. et al. Development and validation of a clinical cancer genomic profiling test based on massively parallel DNA sequencing. Nat. Biotechnol. 31, 1023–1031 (2013).
Hussen, B. M. et al. The emerging roles of NGS in clinical oncology and personalized medicine. Pathol. Res. Pract. 230, 153760 (2022).
Merino, D. M. et al. Establishing guidelines to harmonize tumor mutational burden (TMB): in silico assessment of variation in TMB quantification across diagnostic platforms: phase I of the Friends of Cancer Research TMB Harmonization Project. J. Immunother. Cancer 8, e000147 (2020).
Wang, X. et al. Copy number alterations detected by whole-exome and whole-genome sequencing of esophageal adenocarcinoma. Hum. Genomics 9, 22 (2015).
Luthra, R. et al. A targeted high-throughput next-generation sequencing panel for clinical screening of mutations, gene amplifications, and fusions in solid tumors. J. Mol. Diagn. 19, 255–264 (2017).
Grasso, C. et al. Assessing copy number alterations in targeted, amplicon-based next-generation sequencing data. J. Mol. Diagn. 17, 53–63 (2015).
Linderman, M. D. et al. Analytical validation of whole exome and whole genome sequencing for clinical applications. BMC Med. Genomics 7, 20 (2014).
Wrzeszczynski, K. O. et al. Analytical validation of clinical whole-genome and transcriptome sequencing of patient-derived tumors for reporting targetable variants in cancer. J. Mol. Diagn. 20, 822–835 (2018).
Zook, J. M. et al. Extensive sequencing of seven human genomes to characterize benchmark reference materials. Sci. Data 3, 160025 (2016).
Rennert, H. et al. Development and validation of a whole-exome sequencing test for simultaneous detection of point mutations, indels and copy-number alterations for precision cancer care. NPJ Genomic Med. 1, 16019 (2016).
Roepman, P. et al. Clinical validation of whole genome sequencing for cancer diagnostics. J. Mol. Diagn. 23, 816–833 (2021).
Beaubier, N. et al. Integrated genomic profiling expands clinical options for patients with cancer. Nat. Biotechnol. 37, 1351–1360 (2019).
Bray, N. L., Pimentel, H., Melsted, P. & Pachter, L. Near-optimal probabilistic RNA-seq quantification. Nat. Biotechnol. 34, 525–527 (2016).
Bagaev, A. et al. Conserved pan-cancer microenvironment subtypes predict response to immunotherapy. Cancer Cell 39, 845–865.e7 (2021).
Szolek, A. et al. OptiType: precision HLA typing from next-generation sequencing data. Bioinformatics 30, 3310–3316 (2014).
Kim, S. et al. Strelka2: fast and accurate calling of germline and somatic variants. Nat. Methods 15, 591–594 (2018).
Chen, X. et al. Manta: rapid detection of structural variants and indels for germline and cancer sequencing applications. Bioinformatics 32, 1220–1222 (2016).
Dunn, T. et al. Pisces: an accurate and versatile variant caller for somatic and germline next-generation sequencing data. Bioinformatics 35, 1579–1581 (2019).
Van der Auwera, G. A. & O’Connor, B. D. Genomics in the Cloud: Using Docker, GATK, and WDL in Terra (O’Reilly Media, 2020).
Illumina. DRAGEN secondary analysis (Version 4.3) [Computer software]. https://www.illumina.com/products/by-type/informatics-products/dragen-secondary-analysis.html (2024).
Niu, B. et al. MSIsensor: microsatellite instability detection using paired tumor-normal sequence data. Bioinformatics 30, 1015–1016 (2014).
Danecek, P. et al. Twelve years of SAMtools and BCFtools. GigaScience 10, giab008 (2021).
Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).
Favero, F. et al. Sequenza: allele-specific copy number and mutation profiles from tumor sequencing data. Ann. Oncol. 26, 64–70 (2015).
Shen, R. & Seshan, V. E. FACETS: allele-specific copy number and clonal heterogeneity analysis tool for high-throughput DNA sequencing. Nucleic Acids Res. 44, e131 (2016).
Zaitsev, A. et al. Precise reconstruction of the TME using bulk RNA-seq and a machine learning algorithm trained on artificial transcriptomes. Cancer Cell 40, 879–894.e16 (2022).
Vivian, J. et al. Toil enables reproducible, open source, big biomedical data analyses. Nat. Biotechnol. 35, 314–316 (2017).
Frankish, A. et al. GENCODE reference annotation for the human and mouse genomes. Nucleic Acids Res. 47, D766–D773 (2019).
Abrams, Z. B., Johnson, T. S., Huang, K., Payne, P. R. O. & Coombes, K. A protocol to evaluate RNA sequencing normalization methods. BMC Bioinformatics 20, 679 (2019).
Haas, B. J. et al. Accuracy assessment of fusion transcript detection via read-mapping and de novo fusion transcript assembly-based methods. Genome Biol. 20, 213 (2019).
Haas, B. J. et al. Targeted in silico characterization of fusion transcripts in tumor and normal tissues via FusionInspector. Cell Rep. Methods 3, 100467 (2023).
Uhrig, S. et al. Accurate and efficient detection of gene fusions from RNA sequencing data. Genome Res. 31, 448–460 (2021).
Bolotin, D. A. et al. MiXCR: software for comprehensive adaptive immunity profiling. Nat. Methods 12, 380–381 (2015).
Eberle, M. A. et al. A reference data set of 5.4 million phased human variants validated by genetic inheritance from sequencing a three-generation 17-member pedigree. Genome Res. 27, 157–164 (2017).
Gazdar, A. F. et al. Characterization of paired tumor and non-tumor cell lines established from patients with breast cancer. Int. J. Cancer 78, 766–774 (1998).
Tomlinson, G. E. et al. Characterization of a breast cancer cell line derived from a germ-line BRCA1 mutation carrier. Cancer Res. 58, 3237–3242 (1998).
Easty, D. J. et al. Protein B61 as a new growth factor: expression of B61 and up-regulation of its receptor epithelial cell kinase during melanoma progression. Cancer Res. 55, 2528–2532 (1995).
Phelps, R. M. et al. NCI-navy medical oncology branch cell line data base. J. Cell. Biochem. Suppl. 24, 32–91 (1996).
Brattain, M. G., Fine, W. D., Khaled, F. M., Thompson, J. & Brattain, D. E. Heterogeneity of malignant cells from a human colonic carcinoma. Cancer Res. 41, 1751–1756 (1981).
Drewinko, B., Romsdahl, M. M., Yang, L. Y., Ahearn, M. J. & Trujillo, J. M. Establishment of a human carcinoembryonic antigen-producing colon adenocarcinoma cell line. Cancer Res. 36, 467–475 (1976).
Dexter, D. L., Barbosa, J. A. & Calabresi, P. N,N-dimethylformamide-induced alteration of cell culture characteristics and loss of tumorigenicity in cultured human colon carcinoma cells. Cancer Res. 39, 1020–1025 (1979).
Martínez-Ramírez, A. et al. Characterization of the A673 cell line (Ewing tumor) by molecular cytogenetic techniques. Cancer Genet. Cytogenet. 141, 138–142 (2003).
Rigby, C. C. & Franks, L. M. A human tissue culture cell line from a transitional cell tumour of the urinary bladder: growth, chromosone pattern and ultrastructure. Br. J. Cancer 24, 746–754 (1970).
Pontén, J. & Macintyre, E. H. Long term culture of normal and neoplastic human glia. Acta Pathol. Microbiol. Scand. 74, 465–486 (1968).
Lasfargues, E. Y. & Ozzello, L. Cultivation of human breast carcinomas. J. Natl. Cancer Inst. 21, 1131–1147 (1958).
Quinn, K. A. et al. Insulin-like growth factor expression in human cancer cell lines. J. Biol. Chem. 271, 11477–11483 (1996).
Fogh, J. & Trempe, G. New Human Tumor Cell Lines. In Human Tumor Cells In Vitro (ed. Fogh, J.) 115–159 (Springer US, 1975). https://doi.org/10.1007/978-1-4757-1647-4_5
Tsuchiya, S. et al. Establishment and characterization of a human acute monocytic leukemia cell line (THP-1). Int. J. Cancer 26, 171–176 (1980).
Campbell, C. D. et al. Population-genetic properties of differentiated human copy-number polymorphisms. Am. J. Hum. Genet. 88, 317–332 (2011).
Redon, R. et al. Global variation in copy number in the human genome. Nature 444, 444–454 (2006).
Rácz, G. A., Nagy, N., Tóvári, J., Apáti, Á. & Vértessy, B. G. Identification of new reference genes with stable expression patterns for gene expression studies using human cancer and normal cell lines. Sci. Rep. 11, 19459 (2021).
Olsson, E. et al. Mutation screening of 1237 cancer genes across six model cell lines of basal-like breast cancer. PLoS ONE 10, e0144528 (2015).
Craig, D. W. et al. A somatic reference standard for cancer genome sequencing. Sci. Rep. 6, 24607 (2016).
Mandelker, D. & Ceyhan-Birsoy, O. Evolving significance of tumor-normal sequencing in cancer care. Trends Cancer 6, 31–39 (2020).
Medeiros, F., Rigl, C. T., Anderson, G. G., Becker, S. H. & Halling, K. C. Tissue handling for genome-wide expression analysis: a review of the issues, evidence, and opportunities. Arch. Pathol. Lab. Med. 131, 1805–1816 (2007).
Turashvili, G. et al. Nucleic acid quantity and quality from paraffin blocks: defining optimal fixation, processing and DNA/RNA extraction techniques. Exp. Mol. Pathol. 92, 33–43 (2012).
Cieslik, M. et al. The use of exome capture RNA-seq for highly degraded RNA with application to clinical cancer sequencing. Genome Res. 25, 1372–1381 (2015).
Kotlov, N. et al. Procrustes is a machine-learning approach that removes cross-platform batch effects from clinical RNA sequencing data. Commun. Biol. 7, 392 (2024).
Sims, D., Sudbery, I., Ilott, N. E., Heger, A. & Ponting, C. P. Sequencing depth and coverage: key considerations in genomic analyses. Nat. Rev. Genet. 15, 121–132 (2014).
Xiao, C. et al. Personalized genome assembly for accurate cancer somatic mutation discovery using tumor-normal paired reference samples. Genome Biol. 23, 237 (2022).
Xu, H., DiCarlo, J., Satya, R. V., Peng, Q. & Wang, Y. Comparison of somatic mutation calling methods in amplicon and whole exome sequence data. BMC Genomics 15, 244 (2014).
Diossy, M. et al. Strand orientation bias detector to determine the probability of FFPE sequencing artifacts. Brief. Bioinform. 22, bbab186 (2021).
Korona, D. A., LeCompte, K. G. & Pursell, Z. F. The high fidelity and unique error signature of human DNA polymerase ε. Nucleic Acids Res. 39, 1763–1773 (2011).
Do, H. & Dobrovic, A. Sequence artifacts in DNA from formalin-fixed tissues: causes and strategies for minimization. Clin. Chem. 61, 64–71 (2015).
Robles-Espinoza, C. D., Mohammadi, P., Bonilla, X. & Gutierrez-Arcelus, M. Allele-specific expression: applications in cancer and technical considerations. Curr. Opin. Genet. Dev. 66, 10–19 (2021).
Jiang, J. et al. Identification of hub genes associated with melanoma development by comprehensive bioinformatics analysis. Front. Oncol. 11, 621430 (2021).
Yudina, A. & Kotlov, N. BostonGene/Somatic_reference_standards: Somatic reference for SNV/INDEL and CNV. Zenodo https://doi.org/10.5281/ZENODO.15261589 (2025).
Owens, N. D. L. et al. Measuring absolute RNA copy numbers at high temporal resolution reveals transcriptome kinetics in development. Cell Rep. 14, 632–647 (2016).
SEQC/MAQC-III Consortium. A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the sequencing quality control consortium. Nat. Biotechnol. 32, 903–914 (2014).
Esteve-Codina, A. et al. A comparison of RNA-Seq results from paired formalin-fixed paraffin-embedded and fresh-frozen glioblastoma tissue samples. PLoS ONE 12, e0170632 (2017).
Bossel Ben-Moshe, N. et al. mRNA-seq whole transcriptome profiling of fresh frozen versus archived fixed tissues. BMC Genomics 19, 419 (2018).
Boutros, P. C. et al. Prognostic gene signatures for non-small-cell lung cancer. Proc. Natl. Acad. Sci. USA 106, 2824–2828 (2009).
Bao, M., Zhang, L. & Hu, Y. Novel gene signatures for prognosis prediction in ovarian cancer. J. Cell. Mol. Med. 24, 9972–9984 (2020).
Latha, N. R. et al. Gene expression signatures: a tool for analysis of breast cancer prognosis and therapy. Crit. Rev. Oncol. Hematol. 151, 102964 (2020).
Rabushko, E. et al. Experimentally deduced criteria for detection of clinically relevant fusion 3’ oncogenes from FFPE bulk RNA sequencing data. Biomedicines 10, 1866 (2022).
Deng, W. et al. Fusion gene detection using whole-exome sequencing data in cancer patients. Front. Genet. 13, 820493 (2022).
Walther, C. et al. Gene fusion detection in formalin-fixed paraffin-embedded benign fibrous histiocytomas using fluorescence in situ hybridization and RNA sequencing. Lab. Invest. 95, 1071–1076 (2015).
Heyer, E. E. et al. Diagnosis of fusion genes using targeted RNA sequencing. Nat. Commun. 10, 1388 (2019).
de Schaetzen van Brienen, L. et al. Comparative analysis of somatic variant calling on matched FF and FFPE WGS samples. BMC Med. Genomics 13, 94 (2020).
Rizvi, N. A. et al. Mutational landscape determines sensitivity to PD-1 blockade in non–small cell lung cancer. Science 348, 124–128 (2015).
Yarchoan, M., Hopkins, A. & Jaffee, E. M. Tumor mutational burden and response rate to PD-1 inhibition. N. Engl. J. Med. 377, 2500–2501 (2017).
Snyder, A. et al. Genetic basis for clinical response to CTLA-4 blockade in melanoma. N. Engl. J. Med. 371, 2189–2199 (2014).
Innocenti, F. et al. Mutational analysis of patients with colorectal cancer in CALGB/SWOG 80405 identifies new roles of microsatellite instability and tumor mutational burden for patient outcome. J. Clin. Oncol. 37, 1217–1227 (2019).
Popat, S., Hubner, R. & Houlston, R. S. Systematic review of microsatellite instability and colorectal cancer prognosis. J. Clin. Oncol. 23, 609–618 (2005).
Fancello, L., Gandini, S., Pelicci, P. G. & Mazzarella, L. Tumor mutational burden quantification from targeted gene panels: major advancements and challenges. J. Immunother. Cancer 7, 183 (2019).
Aaltonen, L. A. et al. Pan-cancer analysis of whole genomes. Nature 578, 82–93 (2020).
Priestley, P. et al. Pan-cancer whole-genome analyses of metastatic solid tumours. Nature 575, 210–216 (2019).
Cancer Genome Atlas Research Network. et al. The Cancer Genome Atlas Pan-Cancer Analysis Project. Nat. Genet. 45, 1113–1120 (2013).
Sorokin, M. et al. RNA sequencing in comparison to immunohistochemistry for measuring cancer biomarkers in breast cancer and lung cancer specimens. Biomedicines 8, 114 (2020).
Kushnarev, V. et al. 143 Correlating RNA-seq detection and IHC staining of potential antibody-drug conjugate (ADC) targets: HER3, HER2, TROP2, Nectin4, and aFLR. in Regular and Young Investigator Award Abstracts A163 (Journal for ImmunoTherapy of Cancer, 2023). https://doi.org/10.1136/jitc-2023-SITC2023.0143
Radenbaugh, A. J. et al. RADIA: RNA and DNA integrated analysis for somatic mutation detection. PLoS ONE 9, e111516 (2014).
Neums, L. et al. VaDiR: an integrated approach to Variant Detection in RNA. GigaScience 7, gix122 (2018).
Shi, P., Chen, C. & Yao, Y. Correlation between HER-2 gene amplification or protein expression and clinical pathological features of breast cancer. Cancer Biother. Radiopharm. 34, 42–46 (2019).
Blancato, J., Singh, B., Liu, A., Liao, D. J. & Dickson, R. B. Correlation of amplification and overexpression of the c-myc oncogene in high-grade breast cancer: FISH, in situ hybridisation and immunohistochemical analyses. Br. J. Cancer 90, 1612–1619 (2004).
Latysheva, N. S. & Babu, M. M. Discovering and understanding oncogenic gene fusions through data intensive computational approaches. Nucleic Acids Res. 44, 4487–4503 (2016).
Kastenhuber, E. R. et al. DNAJB1–PRKACA fusion kinase interacts with β-catenin and the liver regenerative response to drive fibrolamellar hepatocellular carcinoma. Proc. Natl. Acad. Sci. USA 114, 13076–13084 (2017).
Tomlins, S. A. et al. Role of the TMPRSS2-ERG gene fusion in prostate cancer. Neoplasia 10, 177-IN9 (2008).
Al-Jundi, M., Thakur, S., Gubbi, S. & Klubo-Gwiezdzinska, J. Novel targeted therapies for metastatic thyroid cancer—a comprehensive review. Cancers 12, 2104 (2020).
Melosky, B. et al. The rapidly evolving landscape of novel targeted therapies in advanced non-small cell lung cancer. Lung Cancer 160, 136–151 (2021).
Chernyshov, K. et al. Aggregated analysis of 1000 patients with cancer to assess the benefits of integrated whole exome and whole transcriptome sequencing. J. Clin. Oncol. 41, 3076–3076 (2023).
Peymani, F., Farzeen, A. & Prokisch, H. RNA sequencing role and application in clinical diagnostic. Pediatr. Investig. 6, 29–35 (2022).
Rakicevic, L. DNA and RNA molecules as a foundation of therapy strategies for treatment of cardiovascular diseases. Pharmaceutics 15, 2141 (2023).
Lau, D., Bobe, A. M. & Khan, A. A. RNA sequencing of the tumor microenvironment in precision cancer immunotherapy. Trends Cancer 5, 149–156 (2019).
Flynn, P., Suryaprakash, S., Grossman, D., Panier, V. & Wu, J. The antibody-drug conjugate landscape. Nat. Rev. Drug Discov. https://doi.org/10.1038/d41573-024-00064-w (2024).
Byron, S. A., Van Keuren-Jensen, K. R., Engelthaler, D. M., Carpten, J. D. & Craig, D. W. Translating RNA sequencing into clinical diagnostics: opportunities and challenges. Nat. Rev. Genet. 17, 257–271 (2016).
Zito Marino, F. et al. NTRK fusions, from the diagnostic algorithm to innovative treatment in the era of precision medicine. Int. J. Mol. Sci. 21, 3718 (2020).
McCabe, M. J. et al. Development and validation of a targeted gene sequencing panel for application to disparate cancers. Sci. Rep. 9, 17052 (2019).
Li, H. Toward better understanding of artifacts in variant calling from high-coverage samples. Bioinformatics 30, 2843–2851 (2014).
Cai, L., Yuan, W., Zhang, Z., He, L. & Chou, K.-C. In-depth comparison of somatic point mutation callers based on different tumor next-generation sequencing depth data. Sci. Rep. 6, 36540 (2016).
LaDuca, H. et al. Exome sequencing covers >98% of mutations identified on targeted next generation sequencing panels. PLoS ONE 12, e0170843 (2017).
Krøigård, A. B., Thomassen, M., Lænkholm, A.-V., Kruse, T. A. & Larsen, M. J. Evaluation of nine somatic variant callers for detection of somatic mutations in exome and targeted deep sequencing data. PLoS ONE 11, e0151664 (2016).
Chen, Z. et al. Systematic comparison of somatic variant calling performance among different sequencing depth and mutation frequency. Sci. Rep. 10, 3501 (2020).
Fang, H. et al. Tumour mutational burden is overestimated by target cancer gene panels. J. Natl. Cancer Cent. 3, 56–64 (2023).
Bartels, S. et al. Concordance in detection of microsatellite instability by PCR and NGS in routinely processed tumor specimens of several cancer types. Cancer Med. 12, 16707–16715 (2023).
Anagnostou, V. et al. Multimodal genomic features predict outcome of immune checkpoint blockade in non-small-cell lung cancer. Nat. Cancer 1, 99–111 (2020).
Schou Nørøxe, D. et al. Tumor mutational burden and purity adjustment before and after treatment with temozolomide in 27 paired samples of glioblastoma: a prospective study. Mol. Oncol. 16, 206–218 (2022).
Masood, D. et al. Evaluation of somatic copy number variation detection by NGS technologies and bioinformatics tools on a hyper-diploid cancer genome. Genome Biol. 25, 163 (2024).
Gordeeva, V. et al. Benchmarking germline CNV calling tools from exome sequencing data. Sci. Rep. 11, 14416 (2021).
Seed, G. et al. Gene copy number estimation from targeted next-generation sequencing of prostate cancer biopsies: analytic validation and clinical qualification. Clin. Cancer Res. 23, 6070–6077 (2017).
Carter, S. L. et al. Absolute quantification of somatic DNA alterations in human cancer. Nat. Biotechnol. 30, 413–421 (2012).
Van Loo, P. et al. Allele-specific copy number analysis of tumors. Proc. Natl. Acad. Sci. USA 107, 16910–16915 (2010).
Singh, A. K. et al. Detecting copy number variation in next generation sequencing data from diagnostic gene panels. BMC Med. Genomics 14, 214 (2021).
Zehir, A. et al. Mutational landscape of metastatic cancer revealed from prospective clinical sequencing of 10,000 patients. Nat. Med. 23, 703–713 (2017).
Bagaev, A. & BostonGene Corporation. Clinical and analytical validation of a combined RNA and DNA exome assay. NCBI (2024).
Bagaev, A. & BostonGene Corporation. Clinical and analytical validation of a combined RNA and DNA exome assay across a large tumor cohort: dataset. https://doi.org/10.5281/zenodo.15261029 (2025).
Acknowledgements
This study was funded by BostonGene Corporation. We thank Egor Anoshkin and Elizabeth (Suos) Scott for their help with sample procurement, Felix Frenkel for establishing the initial bioinformatics infrastructure used in this study, Anton Sivkov and Danil Stupichev for their support of preceding experiments, and Anna Kamysheva and Alexander Morozov for their valuable insights.
Author information
Contributions
A.Y. contributed to data generation, major data analysis, data structuring, data harmonization, analysis discussions, the writing and reviewing of the manuscript, figure design and preparation, and study design. C.T. and A. Baisangurov contributed to the design and supervision of the research and data analysis. E.N. contributed to the design and supervision of the bioinformatics pipelines and validation experiments, and to preparing infrastructure for calculations. K.L., M.C., L.K., N.E., D.S., D.F., S.A., A.F., and B.M. performed laboratory research. V.S. contributed to the development of the CNV calling pipeline, CNV experiment planning, and analysis of CNV calling results (for both cell lines and clinical samples). S.P. performed research and data analysis. S.S. contributed to the development of the gene fusion calling pipeline, experiment planning, and analysis of gene fusion calling results for cell lines, reference materials, and clinical samples. K.S. performed analysis of RNA-seq expression data of FF and FFPE samples and qPCR orthogonal validation results. O.K. contributed to the design of orthogonal validation experiments, quality control of sequencing data for clinical samples, and SNV/INDEL cell line data analysis for exome-wide reference preparation. N.K. developed the SNV and INDEL mutation calling pipeline and compared the somatic SNV calling results to published data. G.K. contributed to the optimization of the somatic SNV and INDEL filtration step for downstream analysis, TMB score calculation, and comparison of TMB score to orthogonal data. D. Sh. contributed to the development and optimization of the gene fusion calling pipeline and analysis of gene fusion calling results for references. Y.L. contributed to the development of the RNA-seq expression assessment pipeline and the analysis of the ERCC Spike-in experiments. A.S. monitored the somatic SNV, CNV, and expression data for internal control samples. M.A. helped obtain new reagents and samples. A.E. helped develop the quality control pipeline and analyze the coverage characteristics of validation samples. A.L. and J.H.B. contributed to the writing, reviewing, and preparation of figures. J.C. and J.L. contributed to the supervision of the work. A. Bagaev conceived and supervised the study, provided experiment design, and contributed to the writing and reviewing of the manuscript and figures.
Ethics declarations
Competing interests
The authors declare the following competing interests: Jochen Lennerz is the Chief Scientific Officer, and Alexander Bagaev is the Chief Product Officer at BostonGene Corporation. All authors were employees of BostonGene Corporation when the study was performed. All other authors declare no competing interests.
Peer review
Peer review information
Communications Medicine thanks the anonymous reviewers for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Yudina, A., Tazearslan, C., Baisangurov, A. et al. Clinical and analytical validation of a combined RNA and DNA exome assay across a large tumor cohort. Commun Med 5, 236 (2025). https://doi.org/10.1038/s43856-025-00934-3
DOI: https://doi.org/10.1038/s43856-025-00934-3