Introduction

Cancer genome studies have accelerated the discovery of novel diagnostic, prognostic and therapeutic biomarkers. Prospective sequencing of cancer biopsies is now routine in clinical oncology and represents a pivotal step in treatment planning. To this end, most clinical testing assays leverage targeted panel sequencing, a cost-effective, scalable solution that is amenable to the short fragment sizes and low DNA quality associated with formalin-fixed, paraffin-embedded (FFPE) tissue.

However, panel tests are designed to support the detection of small mutations in the genes most frequently mutated in common cancers. These targeted approaches do not yield clinically relevant findings in up to 60% of cancer patients, especially those with less common cancer diagnoses1,2. WGS is the only DNA profiling assay that enables comprehensive characterization of a cancer genome through the identification of the full spectrum of clinically relevant mutations across variant classes, such as SNVs, indels, SVs, copy number variants (CNVs) and complex mutation signatures, such as tumor mutational burden (TMB), microsatellite instability (MSI) and HRD3.

In recent years, seminal studies have demonstrated the diagnostic utility of WGS, enabling detection of clinically relevant biomarkers in 68–90% of patients with rare cancers, pediatric tumors, hematological neoplasms, and advanced metastatic disease4,5,6,7,8,9. With decreasing sequencing costs, WGS is becoming increasingly accessible. However, a major limitation of WGS applications in oncology is the reliance on FF tissue as a source of tumor DNA. FFPE preparation yields hyper-fragmented DNA (225–300 bp) that is shorter than the optimal WGS range (360–480 bp) and generates low-coverage data enriched in artifactual variant calls10,11,12,13,14. Development of end-to-end laboratory and analytical workflows to generate high-fidelity WGS data from FFPE material would open up molecularly guided clinical decision support for patients who do not benefit from existing diagnostic approaches. Importantly, routine WGS profiling of FFPE would advance biomarker discovery through the analysis of ongoing and retrospective clinical trial cohorts15,16.

Recent efforts to address the challenges of WGS in FFPE material have focused on optimization of DNA extraction and library preparation protocols17,18,19,20,21,22,23, or computational tools to post-process mutation calls24,25,26. Of these, FFPESig and FFPolish filter artifactual SNVs but not indels and FFPESig specifically is trained on targeted data rather than WGS. More recently, FFPEimpact was developed as a scoring method to quantify the level of FFPE damage within a sample; however, this tool does not output a set of high-quality filtered calls27. Importantly, none of these methodologies has been quality-controlled against validated clinical sequencing assays. As such, there is still no computational tool that effectively assesses the impact of FFPE damage across the spectrum of variant classes in a cancer genome, including SNVs, indels as well as SVs, and clinically relevant mutation signatures (TMB, HRD, MSI, etc.)28,29. As laboratory workflows are established for the derivation of high-quality sequencing data, there is a pressing need to develop generalizable tools for analysis and variant reporting in FFPE-derived WGS data.

Here, we analyze 56 trios of matched FF, FFPE and normal control WGS data to identify the genome-wide consequences of FFPE damage across all variant classes using a consensus calling approach30. We developed FFPErase, a random forest classifier that improves FFPE artifact classification and concordance between matched FF/FFPE datasets and delivers clinical-grade reporting across all variant classes and clinically relevant mutation signatures, thereby unlocking the potential of WGS in clinical oncology.

Results

Cohort and data characteristics

We acquired matched FF/FFPE WGS data from 168 samples across three centers, comprising 56 patient trios (FF, FFPE, germline control) and 16 cancer types (Fig. 1a, Supplementary Data 1). The dataset includes 40 published trios from Oxford University Hospitals, 10 in-house trios from Memorial Sloan Kettering Cancer Center (MSKCC), and six trios from BC Cancer (BCC).

Fig. 1: Cohort Characteristics.

a Graphical representation of data accrual, demonstrating acquisition of trios and sequencing at multiple centers. b Pointplot showing distribution across 56 FF/FFPE pairs in percentage of calls unique to FFPE sample in all passed calls versus consensus calls for each variant class. Center represents mean percentage and error bars represent 95% confidence intervals. c Scatterplot showing call totals for all passed calls versus consensus calls for each variant class. d Stacked barplots demonstrating proportion of detected mutation calls within FFPE samples that are shared with matched FF in each mutation class. Top plots show artifact enrichment in all passed calls and bottom plots show artifact enrichment specifically in high-confidence consensus calls. FF, Fresh Frozen; FFPE, Formalin-fixed, paraffin-embedded; SNV, single nucleotide variant; indel, small insertion or deletion; SV, structural variant. Source data are provided as a Source Data file. Panel a created in BioRender. Papaemmanuil, E. (2025) https://BioRender.com/qlm8gy2.

Formalin fixation, DNA extraction, and WGS were performed independently in each center, resulting in median genome-wide coverages of 93x for FF and 51x for FFPE samples (Supplementary Fig. 1a). The target FFPE coverage for MSKCC, Oxford and BCC was 80x, 70x and 40x, respectively, and the average drop in coverage from FF to FFPE was 36x. Purity and ploidy estimates remained consistent across FF and FFPE. Although protocols differed between institutions, FFPE libraries across all centers had shorter average insert sizes (166–358 bp) than FF (356–503 bp) and increased GC bias (Supplementary Fig. 1b, c). These results demonstrate lower effective coverage and library quality in FFPE-derived WGS.

Comparison of genome-wide mutation calls between FFPE and FF

To evaluate the effect of FFPE processing on variant calling from WGS, we first compared the output of single-algorithm variant calling approaches to consensus calling using three individual variant callers per mutation class. We evaluated FF/FFPE mutations considering (1) the union of all passed calls and (2) consensus calls (i.e., variants supported by at least two callers).
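The two call sets compared here can be illustrated with a minimal sketch. It assumes each caller's output has already been reduced to hashable variant keys of the form (chromosome, position, ref, alt); the caller names and variants are hypothetical, and exact-key matching is a simplification (SV breakpoints, for instance, would typically need a positional tolerance that is not modeled here).

```python
from collections import Counter


def union_calls(calls_by_caller):
    """Union of all passed calls across callers."""
    merged = set()
    for calls in calls_by_caller.values():
        merged |= set(calls)
    return merged


def consensus_calls(calls_by_caller, min_support=2):
    """Variants reported by at least `min_support` callers."""
    support = Counter()
    for calls in calls_by_caller.values():
        support.update(set(calls))  # each caller counts once per variant
    return {variant for variant, n in support.items() if n >= min_support}


# Hypothetical toy input: three callers, variant keys as (chrom, pos, ref, alt).
toy_calls = {
    "caller_a": {("chr1", 100, "C", "T"), ("chr2", 200, "G", "A")},
    "caller_b": {("chr1", 100, "C", "T"), ("chr3", 300, "A", "AT")},
    "caller_c": {("chr1", 100, "C", "T"), ("chr2", 200, "G", "A")},
}

toy_union = union_calls(toy_calls)
toy_consensus = consensus_calls(toy_calls, min_support=2)
```

In this toy input the variant seen by only one caller is kept in the union but dropped from the consensus set.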

Our data demonstrated that consensus variant calling significantly decreases the proportion of FFPE-specific SVs (from 92% to 12%). However, this improvement is not seen in SNV and indel calls, where the median proportion of FFPE-specific mutations remained high (62 and 73%, respectively). This highlights the need to implement a consensus variant calling methodology and also to incorporate additional approaches for artifact filtration across variant classes (Fig. 1b, c). Retaining a consensus calling approach, we called 2,346,008 SNVs and 387,851 indels in FF samples and 3,946,938 SNVs and 466,232 indels in FFPE, corresponding to a median 2.0x and 2.4x increase in SNVs and indels, respectively, with variability across centers (Supplementary Data 2)10,18. Across the dataset, there were up to 152x more SNVs and 42x more indels called in FFPE-derived WGS data, lowering the precision of small mutation calling to 50% for SNVs and 62% for indels. SV calling precision was less affected amongst the samples (median fold-change 0.76 (0.19–1.42)) (Fig. 1d, Supplementary Fig. 1d).

Overall, 85% of SNVs (1,992,218) and 75% of indels (289,231) in FF were also called in FFPE (Supplementary Data 3). While SV consensus calling maintained high precision (80%, 5094/6364), sensitivity was lower (57%, 5094/8880), owing to reduced coverage among other factors such as shorter read fragments and lower mapping quality (Supplementary Fig. 2a, b). The impact of coverage was evident: coverage at FF-specific SV loci (n = 3776) was on average 15x lower in FFPE than at loci of shared FF/FFPE calls (n = 5102). With 52% of FF-specific loci having at least 2 supporting reads in FFPE, increased coverage could recover these SVs. CNV profiles in FFPE showed higher noise levels resulting in hyper-segmentation and less reliable determination of copy number state (Supplementary Fig. 2c–f)17,18.
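The percentages in this paragraph follow from simple ratios when FF calls are treated as the reference set. The sketch below reproduces that arithmetic from the counts quoted in the text; the function names are illustrative and not part of any published tool.

```python
def precision_and_sensitivity(shared, ffpe_total, ff_total):
    """FF is the reference: precision = shared / FFPE calls, sensitivity = shared / FF calls."""
    return shared / ffpe_total, shared / ff_total


def fraction_recovered(shared, ff_total):
    """Fraction of FF calls that are also called in FFPE."""
    return shared / ff_total


# SV consensus calls: 5094 shared, 6364 called in FFPE, 8880 called in FF.
sv_precision, sv_sensitivity = precision_and_sensitivity(5094, 6364, 8880)

# Small mutations: FF calls that were also called in FFPE.
snv_recovered = fraction_recovered(1_992_218, 2_346_008)
indel_recovered = fraction_recovered(289_231, 387_851)

print(f"SV precision {sv_precision:.0%}, SV sensitivity {sv_sensitivity:.0%}")
print(f"SNVs recovered {snv_recovered:.0%}, indels recovered {indel_recovered:.0%}")
```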

Impact of FFPE damage on the ability to identify clinically relevant molecular biomarkers

A major motivation for expanding WGS application to FFPE biospecimens is the reporting of clinically relevant biomarkers. We therefore evaluated the sensitivity and specificity of FFPE WGS data as compared to matched FF WGS in calling established molecular biomarkers. Annotations with OncoKB31 and COSMIC32 identified 117 small mutations (27 indels, 90 SNVs), 46 SVs, and 32 amplifications in FF-derived WGS data. Sensitivity for detection of the same drivers in FFPE was 89% (174/195). An additional set of 13 FFPE-specific findings were reported, likely due to intratumor heterogeneity9 (Supplementary Fig. 3, Supplementary Data 4–5, Supplementary Information).

Next, we assessed the implications of FFPE damage for TMB and cancer-associated mutation signatures. Derived from our consensus call set, genome-wide TMB was elevated in FFPE (median: 10.28, range: 1.42–536.38) compared to FF (median: 3.45, range: 0.04–561.56), yet coding TMB was unaffected (Fig. 2a–c), suggesting elevated artifact representation in non-coding regions. Of note, without consensus calling, coding TMB was elevated 7-fold on average (Supplementary Fig. 4a).
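The text does not spell out how TMB is computed. As a rough sketch only, the code below assumes the common convention of somatic mutations per megabase of callable sequence; the mutation counts and territory sizes are hypothetical placeholders, not values from this study.

```python
def tumor_mutational_burden(n_mutations, callable_bases):
    """Mutations per megabase (assumed convention; denominator is the callable territory)."""
    if callable_bases <= 0:
        raise ValueError("callable_bases must be positive")
    return n_mutations / (callable_bases / 1e6)


# Hypothetical example: same sample, genome-wide versus coding territory.
GENOME_BASES = 3_000_000_000   # assumed callable genome size
CODING_BASES = 30_000_000      # assumed callable coding size

genomewide_tmb = tumor_mutational_burden(9_000, GENOME_BASES)
coding_tmb = tumor_mutational_burden(90, CODING_BASES)
```

Under this convention, artifacts concentrated in non-coding sequence raise the genome-wide value while leaving the coding value largely unchanged, which matches the pattern described above.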

Fig. 2: Genomewide Signal Impact.

a Example of a WGS FF tumor resection (H201536) through a Circos plot showing the different types of somatic mutations along the genome. The outermost ring shows the intermutation distance for all SNVs color-coded by the pyrimidine partner of the mutated base. The middle ring shows small insertions (green) and deletions (red). The innermost ring shows copy number changes, and the arcs show SVs. b Circos plot as described for matched FFPE tumor resection. c Violin plot showing TMB distribution across cohort (n = 56) for FF and FFPE samples excluding hypermutator cases H203508 and H203526 for interpretability. Both coding and genomewide TMB calculations are shown (two-sided Mann–Whitney U test, Coding: p = 3.489e-01, Genomewide: p = 2.381e-06). In the middle of the split violin is a box and whisker plot for all data points (coding and non-coding) with the median as a white dot, a thick bar for the interquartile range (25–75th percentile) and a thin bar that goes to the minimum and maximum of the data. d Boxplots showing cosine similarity in COSMIC SBS and ID signature contribution matrices (median SBS: 0.58, ID: 0.84) as well as cosine similarity in SV context matrices (median: 0.97) between 56 matched FF/FFPE samples. The line in the middle of each box represents the median, while the upper and lower ends of each box represent the interquartile range (IQR) (25th–75th percentile) and the whiskers represent 1.5x the IQR. e Barplot showing significant signature differences between FF and FFPE for 56 matched samples (two-sided Mann–Whitney U test, SBS1 p = 3.641e-02; SBS37 p = 1.211e-12; ID2 p = 8.420e-03). Bars at mean contribution of signature of all 56 samples and error bands showing 95% confidence intervals. f Barplot showing HRD detection status for callers HRDetect and CHORD in patients with HRD (n = 7). g Concordance in HRDetect and CHORD scores for the seven HRD positive patients identified with scores above the recommended threshold for each tool. Red dotted lines show threshold of detection per tool (0.5 for CHORD and 0.7 for HRDetect) for FFPE sample score. FF, Fresh Frozen; FFPE, Formalin-fixed, paraffin-embedded; SNV, single nucleotide variant; indel, small insertion or deletion; SV, structural variant; HRD, homologous recombination deficiency. Raw data for (a, b) can be accessed at the EGA study. Source data for (c–g) are provided in Supplementary Data 2, 6, 8 and the data repository. Source data are also provided as a Source Data file.

Samples with a high artifact rate showed enrichment in specific mutation signatures. For example, an increased contribution of SBS37 was observed in 45/56 FFPE samples (median proportion: 23.4%) compared to the corresponding FF (12/56, median proportion: 3.6%) (Fig. 2d–e, Supplementary Fig. 4b, c, Supplementary Data 6). Additionally, FFPE-derived data exhibited a 2.8x increase in both insertions and repeat-mediated deletions (Supplementary Fig. 4d). In contrast, SV profiles remained largely unaffected (median cosine similarity: 0.97, range: 0.02-1.00) (Supplementary Data 7)3,33.
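Cosine similarity is the concordance measure used here for comparing matched FF and FFPE profiles. A minimal sketch of the calculation is shown below; the four-channel vectors are hypothetical stand-ins for real mutation context counts (for example, 96 trinucleotide channels for SBS), and returning 0.0 for an all-zero profile is a convention chosen for this sketch.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length count or exposure vectors."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same number of channels")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for an empty profile; an assumption of this sketch
    return dot / (norm_a * norm_b)


# Hypothetical toy profiles over four channels.
ff_profile = [10, 20, 30, 40]
ffpe_scaled = [20, 40, 60, 80]      # same shape, twice the burden
ffpe_distorted = [10, 20, 30, 400]  # one channel inflated by artifacts

same_shape = cosine_similarity(ff_profile, ffpe_scaled)
distorted = cosine_similarity(ff_profile, ffpe_distorted)
```

Because the measure depends on profile shape rather than total count, a uniformly higher mutation burden leaves it unchanged, whereas artifacts concentrated in particular channels lower it.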

We hypothesized that the rise in genome-wide mutation burden and corresponding artifact signatures could impair the detection of composite mutation signatures like HRD. Utilizing HRDetect34 and CHORD29, 7 samples were flagged as likely HRD by both callers in the FF data. However, HRD scores in the corresponding FFPE data were below the detection cutoff for 7/7 cases by HRDetect, and 4/7 by CHORD, resulting in incorrect HRD classification (Fig. 2f, g, Supplementary Data 8). These discrepancies demonstrate that increased SNV/indel artifacts in FFPE affect two key predictor features of HRD, SBS3 and microhomology-mediated indels, leading to false-negative HRD classification (Supplementary Fig. 4e).

Evaluation of MSI status by TMB, MSIsensor, and MIMcall scores in FF identified 2 MSI-positive samples (TMBs: 99 and 562) (Supplementary Data 9). Both MSI positive samples exhibited increased contribution of MMR-associated signatures (SBS15, SBS21)28,35,36,37,38. Contrary to HRD, FFPE artifacts did not affect presentation or detection of MMR-related signatures. There was one false positive MSIsensor call in FFPE (H158880; FFPE: 11.09, FF: 1.68) and none in MIMcall. However, this sample had low TMB (FFPE: 1.71, FF: 1.43) and no presentation of MMR-associated signatures in FF, demonstrating the necessity of integrating multiple genome-wide features in MSI assessment.

Taken together, this analysis demonstrates that whilst common cancer-associated mutations can be captured in FFPE-derived WGS data, without correction, clinically relevant signatures can be missed (e.g., HRD) or misclassified (e.g., MSI). Importantly, comprehensive assessment of FFPE-derived data across biomarker classes, inclusive of small mutations, structural variants and complex mutation signatures, is warranted to ensure accurate interpretation and reporting of these signatures.

Rationale for the development of an FFPE-specific artifact classifier

The distinct mutation patterns observed in FFPE, such as the enrichment of SBS37 signature and repeat-mediated deletions, present an opportunity to develop informatics solutions that learn and correct for FFPE-specific artifacts. To this end, we evaluated 33 SNV and 29 indel-associated features (Supplementary Data 10) and noted significant importance in features related to VAF, read mapping quality, and insert size (Supplementary Fig. 5a-b)39.

We designed a machine learning classifier tailored to discern FFPE-specific artifacts. We hypothesized that certain patterns of FFPE-associated artifacts are likely shared across preparation protocols, while some may be protocol or laboratory-specific. To test this hypothesis, we evaluated three classification models: (1) a single-center model, trained and tested on mutations from samples from one center; (2) a two-center model, trained on mutations from samples from two centers and tested on data from the third; and (3) a combined model, trained per sample on mutations across all centers but excluding mutations from the test sample. All models utilized a balanced random forest architecture, with the 1- and 3-center approaches employing leave-one-out cross-validation and the 2-center approach representing validation on a fully independent dataset (Fig. 3a–c, Supplementary Data 11, Supplementary Information).
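The three evaluation schemes differ only in how samples are partitioned into training and test sets. The sketch below illustrates that partitioning using the per-center trio counts reported above; the sample identifiers are hypothetical, the classifier itself is omitted, and holding out one sample per fold in the single-center scheme is an assumption based on the leave-one-out description.

```python
CENTER_SIZES = {"Oxford": 40, "MSKCC": 10, "BCC": 6}  # trios per center, from the cohort description


def build_cohort(center_sizes):
    """Hypothetical sample list as (sample_id, center) tuples."""
    return [
        (f"{center}_{i:02d}", center)
        for center, n in center_sizes.items()
        for i in range(1, n + 1)
    ]


def single_center_splits(cohort, center):
    """1-center scheme: leave-one-sample-out within a single center."""
    members = [s for s in cohort if s[1] == center]
    for held_out in members:
        yield [s for s in members if s != held_out], [held_out]


def two_center_splits(cohort):
    """2-center scheme: train on two centers, test on the fully held-out third."""
    for test_center in sorted({center for _, center in cohort}):
        train = [s for s in cohort if s[1] != test_center]
        test = [s for s in cohort if s[1] == test_center]
        yield test_center, train, test


def three_center_splits(cohort):
    """3-center scheme: one model per held-out sample, trained on all other samples."""
    for held_out in cohort:
        yield [s for s in cohort if s != held_out], [held_out]


cohort = build_cohort(CENTER_SIZES)
fold_sizes = {tc: (len(tr), len(te)) for tc, tr, te in two_center_splits(cohort)}
```

In each scheme the mutations of the test sample never enter training; only the 2-center scheme additionally excludes every sample from the test sample's own center.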

Fig. 3: Artifact Classifier.

a Graphical representation of training and validation process from mutation labeling to validation stratification on the basis of centers. b Parallel axis plot showing feature distributions for random sampling of 6704 SNVs (50% shared, 50% artifact). Final axis categorizes FFPE unique mutations as 1 and mutations shared with FF as 0 for the ARTIFACT class. c PCA reducing feature space to two principal components and shading mutations as shared with FF (blue) or FFPE unique (green) for SNVs from all samples excluding hypermutator cases (1,823,304 artifacts and 560,618 real). Axis legends adjusted for extreme outliers (greater than 0.00001 percentile and less than 99.999 percentile). d AUC for ROC and Precision-Recall for 500 random training samplings of 750 (50% shared, 50% artifacts) mutations from 3 samples per center (nine total) and tested on all other sample mutations. e Timepoints for an example tumor (H135022) derived from FF and FFPE pre and post filtration are shown in the form of a Circos plot as described in Fig. 2. FFPE, formalin-fixed, paraffin-embedded; PCA, principal component analysis; FF, fresh frozen; SNV, single nucleotide variant; AUC, area under curve; ROC, receiver operating characteristic. Raw data for panels b–e can be accessed at the dbGAP, EGA, and 100,000 genomes studies. Source data can be generated using the tool repository. Panel a created in BioRender. Papaemmanuil, E. (2025) https://BioRender.com/nzk6z38.

The classifier’s performance across all approaches showed a 93–94% median accuracy for SNVs and 89–91% for indels. Importantly, models where the test center data were incorporated during training (1- and 3-center) marginally outperformed the 2-center approach (Supplementary Fig. 5c), suggesting that there are both shared and center-specific effects of FFPE damage, which can only be captured if they are represented in the training dataset. Using the 3-center approach, precision for SNV/indel calling increased from 52 to 92% for FFPE mutations relative to the FF calls, while maintaining an 80% sensitivity. Notably, the cross-center approach was the only one that did not filter high-confidence driver mutations shared by FF and FFPE, demonstrating that it extracts the most robust FFPE artifact profile (Fig. 3d, e, Supplementary Data 12).
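The accuracy, precision and sensitivity figures quoted here can be related to one another with a small sketch. It assumes each FFPE call carries a ground-truth label (shared with matched FF, or FFPE-unique) and a classifier decision (kept or filtered); the toy labels are hypothetical, and the exact metric definitions used in the study may differ.

```python
def filter_metrics(is_shared, kept):
    """Summarize artifact filtering against FF-derived labels.

    is_shared[i] -- True if FFPE call i is also present in matched FF (treated as real)
    kept[i]      -- True if the classifier retained call i (predicted real)
    """
    pairs = list(zip(is_shared, kept))
    if not pairs:
        raise ValueError("no calls provided")
    shared_total = sum(1 for shared, _ in pairs if shared)
    kept_total = sum(1 for _, keep in pairs if keep)
    kept_shared = sum(1 for shared, keep in pairs if shared and keep)
    correct = sum(1 for shared, keep in pairs if shared == keep)
    return {
        "precision_before": shared_total / len(pairs),
        "precision_after": kept_shared / kept_total if kept_total else 0.0,
        "sensitivity": kept_shared / shared_total if shared_total else 0.0,
        "accuracy": correct / len(pairs),
    }


# Hypothetical toy labels: 5 shared calls and 5 FFPE-unique calls.
toy_is_shared = [True] * 5 + [False] * 5
toy_kept = [True, True, True, True, False, False, False, False, False, True]
toy_metrics = filter_metrics(toy_is_shared, toy_kept)
```

In this toy case, filtering raises precision from 0.5 to 0.8 at the cost of one shared call, which illustrates the precision and sensitivity trade-off reported above.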

Post-filtering results in high concordance in drivers and global signatures

Next, we used the 3-center approach to filter artifacts in each hold-out sample and assessed the impact of FFPE filtration on reporting of clinically relevant drivers and global mutational patterns. Concordance for TMB, COSMIC signatures, and HRD detection drastically improved.

Upon filtration, median genome-wide TMB decreased from 10.3 (1.4–536.4) to 2.9 (0.3–497.4), much closer to FF TMB estimates (median: 3.5) (Fig. 4a, Supplementary Data 13). The cosine similarity between FF and FFPE for SNV signatures improved to 0.94 (0.40–1.00) (Fig. 4b, Supplementary Fig. 6a-b, Supplementary Data 14), and the contribution of FFPE-specific signatures (e.g., SBS37) was notably decreased. Similar results were obtained for indels (Fig. 4c). To verify that signature concordance after filtering was independent of the signature fitting method, we calculated signature exposures using both MutationalPatterns40 and MuSiCal41 and observed consistent top signatures and concordance between matched FF and filtered FFPE samples (Supplementary Fig. 6c).

Fig. 4: Filtration Impact on Global Signatures.

a Boxplots showing distribution of genomewide TMB in both FF and FFPE samples prior to filtration with FFPErase compared to FFPE samples post filtration for all 56 matched FF/FFPE pairs. The line in the middle of each box represents the median, while the upper and lower ends of each box represent the interquartile range (IQR) (25–75th percentile) and the whiskers represent 1.5x the IQR. b Stacked bar plot showing median contribution percentage of highest contribution signatures in SNVs for FF, FFPE, and filtered FFPE. c Same as (b) for indel signatures. d Distribution comparisons for FF vs. unfiltered FFPE vs. filtered FFPE of HRDetect (Frozen vs. FFPE p = 9.876e-03, Frozen vs. Filtered p = 1) and CHORD (Frozen vs. FFPE p = 4.281e-03, Frozen vs. Filtered p = 2.479e-01) scores for 7 HRD positive cases by FF scoring with shaded regions for each tool’s HRD threshold (two-sided Mann–Whitney U test). TMB, tumor mutational burden; FF, fresh frozen; FFPE, formalin-fixed, paraffin-embedded; COSMIC, Catalog Of Somatic Mutations In Cancer; SBS, single base substitution; ID, insertion/deletion; HRD, homologous recombination deficiency. Source data for all panels are provided in Supplementary Data 6, 8, 13 and the data repository. Source data are also provided as a Source Data file.

Post-filtration, 7/7 HRD+ tumors were correctly classified in FFPE (Fig. 4d). For the 2 cases identified as MSI+, artifact filtration reduced mutation burden by 7–8% without removing contribution of MMR-associated signatures. These results demonstrate that FFPErase removes artifacts without losing sensitivity, thus enabling accurate detection of biological signals related to HRD and MMR. Importantly, FFPErase accurately classified all drivers shared between FF/FFPE (n = 103). Of the 11 FFPE-unique driver calls, 4 (1 SNV, 3 indels) were classified as artifacts. Two of these were indels located in the polymorphic HLA-A locus, a highly variable region, suggesting that classifier performance in loci with decreased mapping quality may require more training data.

Performance of FFPErase as compared to existing approaches

We compared the FFPErase results from the 3-center approach to two publicly available artifact removal strategies, FFPolish and FFPESig. Additionally, we adapted the scoring tool FFPEimpact to output filtered mutational contexts (Supplementary Information)24,25. FFPolish leverages a feature extraction approach like FFPErase for SNVs but not indels. When applied to our data, FFPolish had a lower accuracy compared to FFPErase (63 vs. 95%) and critically removed 42% of SNV drivers shared by matched FF/FFPE tissue (Fig. 5a, Supplementary Data 15). FFPESig does not provide an artifact detection strategy but rather corrects trinucleotide contexts using FFPE signatures learned from targeted sequencing data. The adapted version of FFPEimpact outputs a similarly filtered trinucleotide context but does not provide per-mutation classification. To assess the performance of all tools, we quantified cosine similarity between matched FF and post-filtered FFPE contexts and show that the similarity index was higher using FFPErase compared to all other methods (FFPErase: 0.99, FFPolish: 0.91, FFPEimpact: 0.90, FFPESig: 0.83) (Fig. 5b–d, Supplementary Data 16). These findings highlight that FFPErase is the only tool to deliver a per-mutation classification, filter putative artifacts and yield near-FF variant calls. Furthermore, we applied our FFPErase filtration to a representative dataset from the FFPolish study24 consisting of 31 Burkitt lymphoma and 38 cervical cancer cases and achieved a median accuracy of 97% for SNVs and 90% for indels across samples (Supplementary Fig. 7).

Fig. 5: Comparison to Existing Methods.

a ROC and PR AUC plots comparing FFPErase, FFPolish, and a VAF thresholding method (setting a filter on every mutation with VAF < 0.2*sample purity) results on all samples within the cohort. b Boxplot showing the distribution of mutational context cosine similarities as compared to FF data for unfiltered and filtered FFPE by use of FFPErase, FFPESig, FFPolish and FFPEimpact with statistical annotations between unfiltered and filtered data by each tool for all 56 matched pairs (two-sided Mann–Whitney U test, FFPErase p = 6.024e-07, FFPESig p = 9.951e-02, FFPolish p = 1; FFPEimpact p = 1; ns: 5e-02 < p ≤ 1; *: 1e-02 < p ≤ 5e-02; **: 1e-03 < p ≤ 1e-02; ***: 1e-04 < p ≤ 1e-03; ****: p < 1e-04). The line in the middle of each box represents the median, while the upper and lower ends of each box represent the interquartile range (IQR) (25th–75th percentile) and the whiskers represent 1.5x the IQR. c Same as (b) for indels for all 56 matched pairs (two-sided Mann–Whitney U test, FFPErase p = 1.838e-10, FFPEimpact p = 1). d Example 96-mutational context plots for substitutions from H203498 FF/FFPE alongside the filtered FFPE profiles using FFPErase, FFPESig, FFPolish and FFPEimpact. SBS, single base substitution; FF, fresh frozen; FFPE, formalin-fixed, paraffin-embedded; ROC, receiver operating characteristic; PR, precision recall; AUC, area under curve. Raw data for panels a and d can be accessed at the dbGAP, EGA, and 100,000 genomes studies. Source data for (b, c) are provided in Supplementary Data 16 and the data repository. Source data are also provided as a Source Data file.
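The VAF thresholding baseline in panel a flags every mutation with VAF below 0.2 times the sample purity. A minimal sketch of that rule follows; the dictionary-based mutation records and the example values are hypothetical.

```python
def vaf_threshold_filter(mutations, purity, factor=0.2):
    """Split mutations into kept and flagged using the rule VAF < factor * purity."""
    cutoff = factor * purity
    kept = [m for m in mutations if m["vaf"] >= cutoff]
    flagged = [m for m in mutations if m["vaf"] < cutoff]
    return kept, flagged


# Hypothetical sample with purity 0.5, giving a cutoff of 0.1.
toy_mutations = [
    {"id": "m1", "vaf": 0.05},
    {"id": "m2", "vaf": 0.12},
    {"id": "m3", "vaf": 0.30},
]
kept_calls, flagged_calls = vaf_threshold_filter(toy_mutations, purity=0.5)
```

A single-feature rule of this kind cannot distinguish a low-VAF artifact from a genuine subclonal mutation, which is one reason a multi-feature classifier serves as the main comparison.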

Next, we evaluated FFPErase performance against clinical findings in a cohort of 34 cases for which we had FFPE-derived WGS data and matched clinical sequencing by FDA-approved MSK-IMPACT1 from the same biopsy. A total of 92 clinically relevant events were reported by MSK-IMPACT, of which 99% were retained in FFPErase-processed WGS data. Importantly, analysis of WGS from the FFPE specimen revealed additional, clinically relevant events in 41% (14/34) of patients, increasing the number of findings by 24% (Fig. 6a, Table 1, Supplementary Data 17). These findings were not captured by panel tests and were mostly represented by SVs.

Fig. 6: Detailed Use-Case for PARP-inhibitor treated Ovarian Case H203572.

a Graphical representation of concordance test against FDA-approved MSK-IMPACT panel calls with breakout scatterplot of called mutations by VAF in WGS vs. VAF in MSK-IMPACT. b Timeline for ovarian cancer patient H203572 with respect to WGS samples with corresponding circos plots described in Fig. 3 for: pre-therapy right ovary resection from FF tissue, pre-therapy infracolic omentum resection from FF tissue, post-therapy distal pancreas resection from FFPE tissue both pre and post filtration with FFPErase. Additional panels below consisting of barplots showing the absolute number of mutations attributed to the five SBS signatures with the highest exposure in the tumor, below that the 96 trinucleotide contexts of SNVs, and lastly the global HRD probability score from HRDetect for each sample. c Phylogenetic tree for pre-filtration clonality analysis with each bar representing the number of mutations per abstracted clone and colored by proportions of study relevant SBS signatures with gray denoting other signature contributions. Drivers are annotated on the clone they are associated with and are matched by using mean CCF per clone as compared to the mutation CCF. d 96 trinucleotide contexts for each SNV clone pre-filtration along with bars on the right that are length adjusted for number of mutations within the clone and show percentage of variants filtered by FFPErase. VAF, variant allele frequency; FF, fresh frozen; SBS, single base substitution; FFPE, formalin-fixed, paraffin-embedded. Source data for panel a are provided in Supplementary Data 17 and the data repository. Source data are also provided as a Source Data file. Raw data for (b–d) can be accessed at the dbGAP study. Panel 6b created in BioRender. Papaemmanuil, E. (2025) https://BioRender.com/cdp997a.

Table 1 Driver Annotations for FFPE WGS to MSK-IMPACT Validation

Clinical use case: artifactual clone from FFPE removed from ovarian cancer with PARP inhibitor resistance

Beyond driver mutations and global signatures, FFPE artifact filtration can directly impact the analysis of tumor clonal structure, with direct implications for surveillance studies during disease progression and treatment response.

We present a case study that highlights the clinical utility of FFPErase in an ovarian cancer patient with a germline BRCA2 mutation and a heterogeneous response to platinum and PARP inhibitors (PARPi). WGS data were generated for the two responsive pre-treatment FF specimens and one resistant post-therapy FFPE sample with a BRCA2 reversion mutation (Fig. 6b). The two pre-treatment FF tumors had TMBs of 4.09 and 4.40 with high HRDetect scores (97 and 96%). In contrast, prior to filtering, the FFPE sample had a high TMB (20.19) and low HRDetect score (37%). FFPErase designated 80% of SNVs and 43% of indels as artifacts, refining the TMB estimate to 4.44 and the HRDetect score to 99% in the processed data.
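The effect of filtering on the HRD call in this case can be sketched by applying the HRDetect detection threshold of 0.7 (given in the Fig. 2 legend) to the scores quoted above. Treating a score strictly above the threshold as HRD-positive is an assumption of this sketch, and the sample labels are descriptive placeholders.

```python
HRDETECT_THRESHOLD = 0.7  # detection threshold quoted in the Fig. 2 legend


def hrd_call(score, threshold=HRDETECT_THRESHOLD):
    """Return 'HRD' when the score exceeds the threshold (strict comparison is assumed)."""
    return "HRD" if score > threshold else "not HRD"


# HRDetect scores for the ovarian case, as reported in the text.
case_scores = {
    "FF pre-treatment (first specimen)": 0.97,
    "FF pre-treatment (second specimen)": 0.96,
    "FFPE post-therapy, unfiltered": 0.37,
    "FFPE post-therapy, FFPErase-filtered": 0.99,
}

hrd_calls = {name: hrd_call(score) for name, score in case_scores.items()}
for name, call in hrd_calls.items():
    print(f"{name}: {call}")
```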

Prior to filtering, clonal reconstruction across three samples identified 8 clones. HRD-related signature SBS3 was detected in all clones except for two that were unique to FFPE (5 and 7). Filtration with FFPErase demonstrates that clone 7 was primarily defined by artifactual mutations (CCF: 20.2%, 92% artifacts) and corrects the mutational profile of clone 5 (CCF: 77.8%, 17% artifacts). Notably, mutation signature analysis of clone 5 after FFPErase filtering showed no evidence of SBS3, suggesting HR proficiency42. This is consistent with the BRCA2 reversion. Clone 5 also had high contributions of cisplatin signatures (SBS31/35) (Fig. 6c, d, Supplementary Fig. 8)43. Taken together, cleanup with FFPErase enables accurate genomic characterization of a clone associated with treatment resistance that would have been missed without correction.

Discussion

Challenges in performing WGS in FFPE biospecimens pose a critical barrier to adoption of WGS in clinical oncology. In this study, we deliver a comprehensive meta-analysis of three independent FFPE WGS datasets with matched FF and germline control WGS data (n = 56 trios)18. We study the genome-wide presentation of FFPE-related artifacts across mutation classes including SNVs/indels, SVs and global mutational signatures inclusive of HRD, MSI and TMB.

Despite variation in FFPE preparation and sequencing protocols, we identified common artifact patterns across the three centers, suggesting that these are associated with core components of FFPE processing. Consistent with prior studies, SNV/indel calling is confounded by an enrichment of FFPE-associated artifacts in 95% of tumors. Artifact mutations in FFPE are enriched for SBS37, insertions and repeat-mediated deletions. While FFPE-associated artifacts do not impair driver mutation detection and prioritization in our dataset, they do affect detection accuracy of clinically relevant mutation signatures such as HRD. For example, with FFPE WGS data as input, HRDetect fails to correctly classify 7/7 patients with HRD+ status in FF data, whilst CHORD captures only 3/7 HRD+ cases. Whilst genome-wide TMB is increased in FFPE data, coding TMB is not, which is consistent with clinical testing data44. SV calling was mostly consistent between FF and FFPE when utilizing consensus calling, requiring two out of three callers to call an event. SV sensitivity was predominantly affected by variation in coverage, which may result in a loss of detection and reporting of specific SVs; however, genome-wide SV profiles (i.e., HRD) are preserved.

The enrichment of artifactual calls highlights a clear need to differentiate artifacts from real mutations in FFPE-derived WGS data. Consensus calling greatly increases precision for SVs, but not for small mutations. To this end, we leverage the data in this analysis and develop FFPErase, a machine learning model that is informed by the analysis of SNVs/indels from matched FF/FFPE WGS data from three centers. FFPErase achieves high accuracy in artifact classification, increasing the precision of mutation calling by 40 percentage points (from 52 to 92%) without filtering out driver events. Benchmarking showed improved model performance when in-house and external samples were both included in training. This observation indicates that FFPE-associated artifacts are generalizable across FFPE processing workflows, yet center-specific artifacts are also present. FFPErase delivers a center-agnostic FFPE filtering workflow. Most importantly, this tool enables the characterization of clinically relevant mutation signatures like HRD without reducing TMB associated with biologically relevant signals such as MSI.

Existing methods like FFPESig and FFPolish have limited capabilities in artifact removal: they focus solely on SNVs, are trained on targeted data, or lack a per-mutation artifact prediction. FFPEimpact is designed to score the extent of FFPE damage and does not remove artifacts from the data. Our comparison with these tools shows that FFPErase achieves the highest accuracy in artifact detection (95%) and the highest cosine similarity between matched FF and FFPE contexts (0.99), while critically preserving 95% of shared SNVs. To ensure generalizability, the tool was further tested against a fourth, external dataset of 69 tumor samples from two previously unseen cancer types, achieving 97% accuracy. Notably, FFPErase is the only tool benchmarked against an FDA-approved targeted panel assay. FFPErase not only retains 99% of clinically relevant events reported by MSK-IMPACT, but also uncovers additional clinically significant events in 41% of patients, primarily SVs.
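To make the cosine similarity metric concrete, the following is a minimal sketch of how the similarity between a matched FF and FFPE mutation-context profile could be computed. The function name and the toy counts are hypothetical and are not taken from the FFPErase code; real substitution profiles would use 96 trinucleotide channels.

```python
import math


def cosine_similarity(profile_a, profile_b):
    """Cosine similarity between two equal-length mutation-count vectors."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must have the same number of channels")
    dot = sum(a * b for a, b in zip(profile_a, profile_b))
    norm_a = math.sqrt(sum(a * a for a in profile_a))
    norm_b = math.sqrt(sum(b * b for b in profile_b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("each profile must contain at least one mutation")
    return dot / (norm_a * norm_b)


# Hypothetical counts over four channels (illustration only).
ff_counts = [120, 30, 45, 5]
ffpe_counts = [240, 60, 90, 10]  # same shape as FF, twice the burden
similarity = cosine_similarity(ff_counts, ffpe_counts)
```

Because the metric depends only on the shape of the profile and not on the total burden, two profiles with proportional counts score close to 1 even when one sample carries many more mutations.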

Sequential sampling is increasingly implemented in clinical practice as a means to understand disease progression and treatment response. We present a case study that shows how FFPE-related artifacts affect the resolution and interpretation of tumor clonal phylogenies. Sequential WGS profiling of an ovarian cancer patient with heterogeneous response to PARP inhibition shows the emergence of a treatment-resistant clone in a post-therapy FFPE resection. Deconvolution of the subclonal structure and HRD status is challenged by the presence of artifacts in the FFPE resection. Upon filtering with FFPErase, we deliver a clean representation of the clonal architecture of the tumor and an accurate clone-specific assessment of HRD status that informs disease presentation and treatment resistance.

A limitation of our study is that FFPErase does not address hypersegmentation in CN from FFPE-derived WGS data. Adapting methods such as dryclean or ACE may address this, but would require significantly larger datasets45,46. Additionally, although we demonstrate that high-depth coverage can be achieved in FFPE, optimization of laboratory protocols to improve library quality and sequencing depth is warranted but beyond the scope of this study. We note that applying FFPErase to targeted/exome sequencing data rather than WGS would require retraining the model using relevant data (e.g., exome or panel) to account for variations in sequencing depth and coverage patterns between these assays.

Our work provides a clear path forward for the implementation of FFPE-derived WGS datasets in oncology through the combination of consensus variant calling and artifact filtration methods such as FFPErase. Applications such as FFPErase enable the utilization of FFPE biospecimens and ultimately help democratize the development of clinical WGS workflows in a wide variety of settings (academic centers, community hospitals, etc.) and accelerate correlative studies for biomarker discovery.

Methods

Study participants

Patients who were seen within the Department of Pediatrics at Memorial Sloan Kettering Cancer Center with presumed or established solid tumor malignancies (including CNS tumors) were eligible to enroll on an institutional prospective tumor/germline-sequencing protocol (ClinicalTrials.gov number, NCT01775072) with informed consent from the patients or their guardians. This study was approved by the MSKCC Institutional Review Board/Privacy Board. Patients with newly diagnosed as well as relapsed/refractory disease were eligible. The rest of the data was sourced from public or request-for-access repositories as described18,24.

Study design and WGS analysis

Raw data for 56 tumor specimens were collected across three centers, representing independent workflows for FFPE sample preparation and sequencing (Supplementary Data 1). For each tumor, a trio of samples was sourced: two resections, one prepared as FF and the other as FFPE, and a matched normal sample from peripheral blood. An additional 35 FFPE-only tumor specimens with matched normal peripheral blood were collected and processed from MSKCC. Patient specimens spanned 16 cancer indications and age at resection ranged from 5 to 89 years. Both FF and FFPE tumors were assessed using the same analytical pipelines based on consensus calling across all variant classes9,30.

Sample Collection

Patient samples from three independent centers were collected, processed and sequenced separately. A trio of samples was collected for each patient: two tumor samples from the same resection, prepared as FF and FFPE respectively, and one matched normal sample for matched analysis. Ten trios were collected from MSKCC, six from BCC, and forty from Oxford University Hospitals.

Tissue fixation, processing, and embedding

For the 45 subjects from MSKCC, surgical samples were placed into standard tissue cassette(s) and completely immersed in fresh 10% Neutral Buffered Formalin (NBF) and loaded into the Peloris tissue processor. Surgical routine samples were processed for about 7 h in multiple steps per reagent consisting of formalin (90 min), alcohol (125 min), xylene (90 min) and paraffin (90 min). Cassettes were then removed and embedded in paraffin blocks using the Tissue-Tek AutoTEC a120.

DNA extraction

For the 45 subjects from MSKCC, tumor DNA was extracted from fresh frozen (FF) OCT embedded tissue biopsies and matched normal from peripheral blood using the DNeasy Blood & Tissue Kit (Qiagen catalog # 69504) according to the manufacturer’s protocol for all but one sample, for which the AllPrep DNA/RNA Mini Kit (Qiagen catalog # 80204) was used. FFPE tissue was deparaffinized using heat treatment (90 °C for 10 min in 480 μL PBS and 20 μL 10% Tween 20), centrifugation (10,000 × g for 15 min) and ice chill. Paraffin and supernatant were removed, and the pellet was washed with 1 mL of 100% EtOH followed by an overnight incubation in 400 μL of 1 M NaSCN for rehydration and impurity removal. Tissues were subsequently digested with 40 μL of Proteinase K (600 mAU/mL) in 360 μL Buffer ATL at 55 °C. DNA isolation proceeded with the Mag-Bind Blood & Tissue DNA HDQ 96 Kit (Chemagic SKU M6399-01) according to the manufacturer’s protocol modified by replacing AW2 buffer with 80% ethanol. All DNA was eluted in 0.5X Buffer AE.

For the six subjects from BCC, tumor DNA was extracted from FF tissue biopsies and matched normal. FFPE tissue was deparaffinized using xylene and then extracted using the QIAamp DNA FFPE Tissue Kit (Qiagen catalog # 56404) according to the manufacturer’s protocol.

DNA extraction for the 40 samples from Oxford University Hospitals is detailed in the associated publication18.

Whole genome sequencing

For the 45 subjects from MSKCC, after PicoGreen quantification and quality control by Agilent BioAnalyzer, 500 ng of genomic DNA was sheared for all FF samples and for all but five FFPE samples, which had less DNA available (minimum gDNA: 140.9 ng), using a LE220-plus Focused-ultrasonicator (Covaris catalog # 500569). Sequencing libraries were prepared using the KAPA Hyper Prep Kit (Kapa Biosystems KK8504) and were subjected to a 0.5X size selection using AMPure XP beads (Beckman Coulter catalog # A63882) after post-ligation cleanup. For the FF samples, PCR-free libraries were pooled equivolume for sequencing; for the FFPE samples, five cycles of PCR were performed. Samples were then run on a NovaSeq 6000 in a 150 bp/150 bp paired-end run, using the NovaSeq 6000 SP, S1, S2, or S4 Reagent Kit (300 Cycles) (Illumina).

For the six subjects from BCC, FF samples were prepared as follows. One microgram of genomic DNA was arrayed in a 96-well microtitre plate and subjected to shearing by sonication (Covaris). Sheared DNA was end-repaired and size selected using AMPure XP beads targeting a 300–400 bp fraction. After 3’ A-tailing, full length TruSeq adapters were ligated. Libraries were purified using AMPure XP beads. Library fragment sizes were assessed using an aliquot of PCR amplified library DNA on the Agilent 2100 Bioanalyzer DNA1000 chip, or Caliper GX DNA1000 chip. The PCR-free library concentration was quantified using a qPCR Library Quantification kit (KAPA, KK4824). Four cycles of PCR were applied to rescue low-yield libraries prior to sequencing with paired-end 100 base reads on the Illumina HiSeq platform according to manufacturer recommendations.

FFPE samples were prepared as follows. gDNA was normalized to 250 ng in a volume of 62 μL elution buffer (Qiagen) and transferred into a microTUBE plate for shearing on an LE220 (Covaris) acoustic sonicator using the following conditions: Duty Factor, 20%; Peak Incident Power, 450 W; Cycles per burst, 200; Duration, 2 × 60 s with an intervening spin. The FFPE protocol yields a dominant DNA peak in the size range between 300 and 400 bp. To improve the library quality of FFPE-derived DNA, solid-phase reversible immobilization (SPRI) bead-based size selection was performed before library construction to remove smaller DNA fragments from highly degraded FFPE DNA. If not removed early in the library construction process, these smaller fragments would otherwise dominate the final amplified library. FFPE DNA damage repair, end-repair and phosphorylation were combined in a single reaction using an enzymatic premix (NEB), and the product was bead purified using a 0.8:1 (bead:sample) ratio to remove small FFPE fragments. Repaired DNA fragments were next A-tailed for ligation to paired-end, partial Illumina sequencing adapters, then purified twice with PCR Clean DX beads (0.9:1 ratio). Full-length adaptered products were achieved by performing 8 cycles of PCR with primers introducing fault-tolerant hexamer “barcodes” allowing multiplexing of libraries. Indexed PCR products were double purified with 0.9:1 beads. Concentration of final libraries was determined using size profiles obtained from a high sensitivity Caliper LabChip GX together with Quant-iT (Invitrogen) quantification. Libraries were sequenced with paired-end 150 base reads on a HiSeqX instrument running v2.5 chemistry (Illumina).

WGS for the 40 samples from Oxford University Hospitals is detailed in the associated publication18.

Bioinformatic analysis

Sequencing data were gathered and analyzed analogously using the Isabl platform. The analysis included: (1) data QC; (2) ensemble variant calling for germline and somatically acquired mutations from at least two out of three algorithms run for each variant class; (3) signature extraction (i.e., mutation signatures, MSI score, homologous recombination deficiency); (4) variant classification; and (5) the generation of a clinical prototype summary report9. Briefly, upon completion of each sequencing run, Isabl imports paired tumor-normal FASTQ files, executes alignment and quality control algorithms, and generates tumor purity and ploidy estimates. Whole-genome paired-end reads were aligned to the human reference genome (GRCh37d5) using BWA-mem (v0.7.17) as a part of the pcap-core v2.18.2 wrapper (https://github.com/cancerit/PCAP-core)47. The wrapper includes marking of duplicates using Picard. For tumor samples, ensemble variant calling was performed for each variant class (substitutions, insertions and deletions, and structural variations). High-confidence somatic mutations are classified with regard to their putative role in cancer pathogenesis, and statistical post-processing enables the derivation of MSI scores, mutation signatures and HRD scores. Clinical relevance of mutations in common cancer genes was annotated using the OncoKB, COSMIC, Ensembl Variant Effect Predictor, VAGrENT, gnomAD and ClinVar databases. Details of the variant calling and annotation can be found in the Supplementary Information.

Identification of somatic mutations in whole-genome sequences

Somatic alterations were detected by comparing the tumor against the matched normal for each variant type. All bioinformatic tools were launched using an in-house wrapper. Allele-specific subclonal CN changes were detected using Battenberg (cgpBattenberg v1.4.0) (https://github.com/cancerit/cgpBattenberg)48. Single-nucleotide variants (SNVs) were identified using Strelka2 (v2.9.1 with manta v1.3.1) (https://github.com/Illumina/strelka), MuTect2 (gatk:v4.0.1.2) (https://github.com/broadinstitute/gatk), and CaVEMan (cgpCavemanWrapper v1.7.5) (https://github.com/cancerit/cgpCaVEManWrapper)48,49,50. Variant post-processing was done using default flags for Strelka2 and MuTect2, while for CaVEMan, cgpCavemanPostprocessing (v1.5.2) was used, filtering for sequencing artifacts with ≥3 mutant alleles in at least 1% of samples within a panel of 100 unmatched blood normals (https://github.com/cancerit/cgpCaVEManPostProcessing). Small insertions and deletions (indels) were detected using Strelka2, MuTect2, and Pindel (cgpPindel v1.5.4) (https://github.com/cancerit/cgpPindel) and filtered against a panel of 100 unmatched normals51. Structural genomic variants (SVs) were identified using SvABA (~v1.0.0 commit 47c7a88) (https://github.com/walaj/svaba), GRIDSS (v2.2.2) (https://github.com/PapenfussLab/gridss), and BRASS (v4.0.5 with GRASS v1.1.6) (https://github.com/cancerit/BRASS) using a panel of 100 in-house unmatched normals33,52,53. Finally, microsatellite-instability status was assessed using MSISensor (v0.5) (https://github.com/ding-lab/msisensor)35 and MIMcall (https://github.com/afujimoto/MIMcall)38.
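The panel-of-normals criterion described for CaVEMan post-processing can be illustrated with a minimal sketch. This is not the cgpCaVEManPostProcessing implementation; the function name and input layout are hypothetical, and the sketch assumes the rule is read as "flag a site when at least 1% of panel normals each show three or more mutant-allele reads".

```python
def is_panel_artifact(mutant_read_counts, min_mutant_reads=3,
                      min_sample_fraction=0.01):
    """Flag a site as a likely sequencing artifact using a panel of normals.

    mutant_read_counts: mutant-allele read count observed at this site in
    each unmatched normal of the panel (one integer per normal).
    """
    if not mutant_read_counts:
        raise ValueError("the panel of normals is empty")
    # Count normals that carry the mutant allele with enough read support.
    supporting = sum(1 for n in mutant_read_counts if n >= min_mutant_reads)
    return supporting / len(mutant_read_counts) >= min_sample_fraction


# Hypothetical panel of 100 normals: one normal shows 3 mutant reads.
panel_counts = [0] * 99 + [3]
flagged = is_panel_artifact(panel_counts)
```

With a panel of 100 normals, the 1% threshold means that a single normal with sufficient mutant-read support is enough to flag the site under this reading.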

Variant consolidation and annotation

VCF files for SNVs and indels were merged with an in-house wrapper using chromosome, position, reference allele, and alternative allele. The merged VCFs were annotated with VAGrENT (v3.3.0, https://github.com/cancerit/VAGrENT) and VEP (v92, https://github.com/Ensembl/ensembl-vep)54,55. VCF files for SVs were merged using MergeSVvcfs (v1.0.2, https://github.com/papaemmelab/mergeSVvcf). High-confidence mutations were designated as those passed by at least two callers; SV calls were additionally required to be >600 bp in length.
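The consolidation rule can be sketched as follows. This is an illustrative sketch rather than the in-house wrapper or MergeSVvcfs: the function names, the input layout and the example calls are hypothetical, and the handling of events without a defined length (e.g., interchromosomal SVs) is not specified in the text and is not modeled here.

```python
from collections import defaultdict


def consensus_small_variants(calls_by_caller, min_callers=2):
    """Keep SNV/indel calls passed by at least `min_callers` callers.

    calls_by_caller maps a caller name to an iterable of
    (chromosome, position, ref_allele, alt_allele) tuples that passed
    that caller's filters.
    """
    support = defaultdict(set)
    for caller, calls in calls_by_caller.items():
        for chrom, pos, ref, alt in calls:
            support[(chrom, pos, ref, alt)].add(caller)
    return {key: sorted(callers) for key, callers in support.items()
            if len(callers) >= min_callers}


def is_high_confidence_sv(n_passing_callers, length_bp,
                          min_callers=2, min_length_bp=600):
    """Apply the two SV criteria: caller support and minimum length."""
    return n_passing_callers >= min_callers and length_bp > min_length_bp


# Hypothetical passed calls from three SNV callers.
calls = {
    "strelka2": [("1", 100, "A", "T"), ("2", 200, "G", "C")],
    "mutect2": [("1", 100, "A", "T")],
    "caveman": [("3", 300, "C", "G")],
}
high_confidence = consensus_small_variants(calls)
```

In this toy example only the variant at chromosome 1, position 100 is retained, because it is the only call supported by two callers.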

Calculation of TMB

TMB was calculated using high-confidence somatic substitutions and indels that fall within coding regions. The totals for these variant classes were combined and then converted to coding TMB using a divisor of 30 to approximate the length of the human exome in Mb; genome-wide TMB used a divisor of 2900 to approximate the length of the human genome in Mb. Tumors with values greater than ten mutations per Mb were considered hypermutators, following the threshold set by Grobner et al.8.
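The arithmetic can be written out as a short sketch. The divisors and the hypermutator threshold come from the text; the function names and the mutation counts are hypothetical and serve only as an illustration.

```python
EXOME_MB = 30      # approximate length of the human exome in Mb (from the text)
GENOME_MB = 2900   # approximate length of the human genome in Mb (from the text)
HYPERMUTATOR_THRESHOLD = 10  # mutations per Mb (from the text)


def tumor_mutational_burden(n_substitutions, n_indels, region_mb):
    """Mutations per Mb: combined substitution and indel count over region size."""
    if region_mb <= 0:
        raise ValueError("region size must be positive")
    return (n_substitutions + n_indels) / region_mb


def is_hypermutator(tmb):
    """Values strictly greater than the threshold are called hypermutators."""
    return tmb > HYPERMUTATOR_THRESHOLD


# Hypothetical tumor: 450 coding substitutions and 30 coding indels.
coding_tmb = tumor_mutational_burden(450, 30, EXOME_MB)   # 480 / 30
hypermutated = is_hypermutator(coding_tmb)
```

The same function gives a genome-wide value when called with genome-wide counts and `GENOME_MB` as the divisor.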

Identification of mutation signatures for substitutions and indels

Mutational signature analysis was performed with the MutationalPatterns package (v3.4.1, https://bioconductor.org/packages/release/bioc/html/MutationalPatterns.html) for both substitutions and indels using signatures for the SBS and ID variant classes from COSMIC Mutational Signatures (v3.1), with the addition of the temozolomide signature from Kucab et al. for substitution signatures.

Inference of clonal structure

Clonal structure was analyzed using high-confidence SNVs called in each biopsy or the union of SNVs whenever multiple biopsies were available for a patient. DPClust (v0.2.2, https://github.com/Wedge-Oxford/dpclust) was used for calculation of cancer cell fraction corrected for purity and local CN, as well as for clustering and assignment of mutations across samples, with the exception of the Gibbs Sampling Dirichlet Process step, which was optimized internally48. Clonal ordering was deduced using clonevol (v0.99.11, https://github.com/hdng/clonevol)56. Mutational signatures were computed in each cluster independently. Figures were generated with matplotlib (v3.1.0, https://matplotlib.org/).
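For orientation, the correction of cancer cell fraction (CCF) for purity and local CN can be sketched with the commonly used relationship between variant allele fraction (VAF), purity and copy number. This is not the DPClust code: the function name is hypothetical, and the sketch assumes that the mutation multiplicity (number of mutated copies per cancer cell) is already known, whereas in practice it has to be estimated.

```python
def cancer_cell_fraction(vaf, purity, tumor_total_cn,
                         normal_total_cn=2, multiplicity=1):
    """Estimate CCF from VAF, corrected for purity and local copy number.

    mutation copy number = VAF / purity
                           * (purity * tumor CN + (1 - purity) * normal CN)
    CCF = mutation copy number / multiplicity
    """
    if not 0 < purity <= 1:
        raise ValueError("purity must be in (0, 1]")
    if multiplicity < 1:
        raise ValueError("multiplicity must be at least 1")
    mean_copies = purity * tumor_total_cn + (1 - purity) * normal_total_cn
    mutation_copy_number = vaf / purity * mean_copies
    return mutation_copy_number / multiplicity


# Hypothetical heterozygous mutation in a diploid region of a 50% pure tumor.
ccf = cancer_cell_fraction(vaf=0.25, purity=0.5, tumor_total_cn=2)
```

In this example a VAF of 0.25 corresponds to a CCF of 1, i.e., a clonal mutation, because half of the sequenced cells are normal and only one of two tumor copies is mutated.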

Development of artifact classification model

The artifact classifier was designed using scikit-learn (v0.24.1, https://scikit-learn.org/0.24/) and imbalanced-learn (v0.8.0, https://imbalanced-learn.org/stable/). Custom mutational feature extraction was performed on the high-confidence call set produced from Isabl for both somatic substitutions and indels to format the data for training and testing. Classification was parallelized using Nextflow (https://github.com/nextflow-io/nextflow) to achieve a computation time of under 2 hours. The final artifact classifier takes a VCF and a BAM as input and was trained using a balanced Random Forest model. It leverages 33 features per single nucleotide variant and 29 features per indel that are either at the sample or mutation level and has a depth of 100 trees. Test strategies involving training and test data from the same center employed a leave-one-out cross validation strategy in which a model was produced for each test sample using equal amounts of randomly sourced mutations from all other available samples for training. This was done for the 1-center and 3-center approaches. Several machine learning architectures were explored and compared, including a convolutional neural network approach leveraging pileup images57. These are detailed in Supplementary Information. We implemented the code in such a way that the model can be retrained with new data utilizing the preprocessing and training modules58.
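The leave-one-out strategy can be sketched as a split generator. This is an illustrative sketch and not the FFPErase training module: the function name and data layout are hypothetical, and "equal amounts of randomly sourced mutations" is read here as drawing the same number of mutations from every remaining sample, which is an assumption. Model fitting itself is omitted; the training rows could, for instance, be passed to imbalanced-learn's `BalancedRandomForestClassifier`.

```python
import random


def leave_one_sample_out_splits(mutations_by_sample, seed=0):
    """Yield (test_sample, train_rows, test_rows), holding out each sample once.

    Training rows are drawn at random, in equal numbers, from every sample
    other than the held-out one (limited by the smallest remaining sample).
    """
    if len(mutations_by_sample) < 2:
        raise ValueError("at least two samples are required")
    rng = random.Random(seed)  # fixed seed for reproducible subsampling
    samples = sorted(mutations_by_sample)
    for test_sample in samples:
        others = [s for s in samples if s != test_sample]
        n_per_sample = min(len(mutations_by_sample[s]) for s in others)
        train_rows = []
        for s in others:
            train_rows.extend(rng.sample(mutations_by_sample[s], n_per_sample))
        yield test_sample, train_rows, list(mutations_by_sample[test_sample])


# Hypothetical mutation identifiers per sample.
data = {"A": ["a1", "a2", "a3"], "B": ["b1", "b2"], "C": ["c1", "c2", "c3", "c4"]}
splits = list(leave_one_sample_out_splits(data))
```

Holding out whole samples, rather than individual mutations, keeps mutations from the test tumor out of training, so the reported performance reflects behavior on an unseen sample.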

Feature extraction

Features are extracted from tumor BAMs using a VCF to designate mutation loci and leveraging a combination of custom scripts and Picard’s CollectSequencingArtifactMetrics (v2.25.6, https://github.com/broadinstitute/picard). Features were assessed and selected using permutation feature importance and random forest mean decrease in impurity. Feature correlation was also assessed using Spearman’s correlation to determine relationships between all features.
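As an illustration of the feature-correlation step, Spearman’s correlation between two features can be computed as the Pearson correlation of their ranks. The sketch below is self-contained and is not the code used in the study; the function names and the feature values are hypothetical.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation between two equal-length feature vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two vectors of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant feature")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-mutation features: read depth and variant allele fraction.
depth = [35, 60, 42, 80, 55]
vaf = [0.10, 0.32, 0.18, 0.45, 0.30]
rho = spearman(depth, vaf)
```

Pairs of features with a correlation close to 1 or -1 carry largely redundant information, which is one reason to inspect correlations alongside importance scores.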

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.