CNVoyant a machine learning framework for accurate and explainable copy number variant classification

Schuetz, Robert J.; Ceyhan, Defne; Antoniou, Austin A.; Chaudhari, Bimal P.; White, Peter

doi:10.1038/s41598-024-72470-4

Article
Open access
Published: 28 September 2024

Robert J. Schuetz¹, Defne Ceyhan¹, Austin A. Antoniou², Bimal P. Chaudhari^2,3,4,5 & Peter White^1,2,3

Scientific Reports volume 14, Article number: 22411 (2024)

Abstract

The precise classification of copy number variants (CNVs) presents a significant challenge in genomic medicine, primarily due to the complex nature of CNVs and their diverse impact on rare genetic diseases (RGDs). This complexity is compounded by the limitations of existing methods in accurately distinguishing between benign, uncertain, and pathogenic CNVs. Addressing this gap, we introduce CNVoyant, a machine learning-based multi-class framework designed to enhance the clinical significance classification of CNVs. Trained on a comprehensive dataset of 52,176 ClinVar entries across pathogenic, uncertain, and benign classifications, CNVoyant incorporates a broad spectrum of genomic features, including genome position, disease-gene annotations, dosage sensitivity, and conservation scores. Models to predict the clinical significance of copy number gains and losses were trained independently. Final models were selected after testing 29 machine learning architectures and 10,000 hyperparameter combinations each for deletions and duplications via fivefold cross-validation. We validate the performance of CNVoyant by leveraging a comprehensive set of 21,574 CNVs from the DECIPHER database, a highly regarded resource known for its extensive catalog of chromosomal imbalances linked to clinical outcomes. Compared to alternative approaches, CNVoyant shows marked improvements in precision-recall and ROC AUC metrics for binary pathogenic classifications while going one step further, offering multi-classification of clinical significance and corresponding SHAP explainability plots. Additionally, when provided germline CNV calls from real-world RGD cases with diagnostic CNV(s), CNVoyant correctly classified all diagnostic CNVs as having pathogenic significance with high confidence. This large-scale validation demonstrates CNVoyant’s superior accuracy and underscores its potential to aid genomic researchers and clinical geneticists in interpreting the clinical implications of real CNVs.

Introduction

The establishment of reference genomes, sequencing technologies, and post-processing algorithms has ushered in an era where genetic variation is reliably detectable. Databases are maintained to define functional regions of the genome^1,2, catalog observed genetic variants^3,4, record variant frequencies in different human populations^5,6,7, and provide annotations regarding clinical significance⁸. However, entries in these resources favor smaller genetic changes, specifically single nucleotide variations (SNVs) and short insertions and deletions (indels). To date, these short variants have been the focus for clinical germline diagnoses in rare genetic diseases (RGDs); however, this may be a symptom of the limited capacity to discern the clinical significance of larger structural variants (SVs).

SVs cover larger segments of DNA and include, but are not limited to, copy number variants (CNVs), translocations, and inversions, all of which span at least 50 base pairs (bp)^9,10,11. The recent clinical adoption of genome sequencing (GS) has led to more reliable identification of SVs, at a much finer resolution than was possible with microarray technology^12,13,14. In contrast to exome sequencing (ES), which focuses on coding sequences, GS extends the breadth of detection to intronic and intergenic regions. This attribute is crucial given that SVs frequently occur in non-coding regions and can encompass multiple genes. Despite the newly available data, understanding the clinical significance of detected SVs remains a challenge. Compared to recurrent SVs with well-defined breakpoints, rare SVs are particularly difficult to interpret. Even when rare SVs are observed in population frequency databases, they are often not annotated for clinical significance¹⁵.

Despite the advent of next-generation sequencing (NGS), diagnostic genetic variants are typically only identified in 25–45% of patients undergoing GS for suspicion of having an RGD^16,17,18,19,20. Reanalysis of undiagnosed cases has yielded additional diagnoses after considering CNVs, confirming reports of CNV involvement in RGDs^21,22,23. Recent efforts have been made to standardize the interpretation of CNVs, culminating in the American College of Medical Genetics and Genomics (ACMG) technical standards for interpreting CNVs²⁴. These guidelines consider population frequency, the impact of overlapping functional regions, and previous clinical interpretation²⁵. Within these guidelines, these features are evaluated to determine haploinsufficiency (HI) and triplosensitivity (TS), the tolerance to regional losses or gains in the genome, respectively. Both concepts fall under the broader category of dosage sensitivity.

The ACMG guidelines, while highly valued in the clinical setting, tend to classify CNVs as variants of uncertain significance (VUS), as observed in the algorithmic implementation of the guidelines, ClassifyCNV²⁶. In the interpretation setting, this limited specificity results in lengthy candidate CNV sets requiring review, with many benign CNVs being classified as VUS. To address this problem, several machine learning (ML) approaches have been proposed to enhance the precision of classifying the clinical significance of CNVs. These algorithms statistically learn from data elements related to dosage sensitivity, overlapping genes, population frequencies, regulatory elements, topologically associated domains, and genomic position to predict pathogenic potential^27,28,29,30,31. None of these algorithms combines all informative features in a single model. Moreover, as is common for ML-based classifiers, they do not explain why a given classification was chosen for a CNV, which limits interpretability. Together, these two limitations motivated the development of an improved ML approach.

Here we introduce CNVoyant, a tree-based, multi-class clinical significance classifier that combines previously reported features with novel features to classify CNVs more accurately than previously published methods. CNVoyant provides prediction explanations and enhances the accuracy and granularity of clinical significance classifications, enabling rapid identification and interpretation of potentially pathogenic CNVs.

Methods

CNVoyant’s target audience comprises individuals determining the role of a CNV in a suspected genomic disorder. This diverse group of clinical genomicists includes variant scientists, geneticists, clinical molecular geneticists, and laboratory directors. CNVoyant is designed to require minimal bioinformatics expertise and enables easy interpretation of results. It accepts a list of CNVs, specified by chromosomal coordinates and variant type (deletion or duplication), and returns probabilities for pathogenic, VUS, and benign clinical significance. For each CNV entry, CNVoyant classifies the variant by determining the maximum probability among the clinical significance classes.
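The final classification step described above can be illustrated with a minimal sketch. The class names, dictionary layout, and tie-breaking order are assumptions made for illustration and do not reflect CNVoyant's actual output schema.

```python
# Hypothetical class labels; ties resolve to the earliest label in this tuple.
CLASSES = ("benign", "vus", "pathogenic")

def classify(probabilities):
    """Return the clinical significance class with the highest probability."""
    return max(CLASSES, key=lambda label: probabilities[label])

# Illustrative probabilities for a single CNV (not real model output).
example = {"benign": 0.05, "vus": 0.15, "pathogenic": 0.80}
predicted = classify(example)
```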

The capacity of CNVoyant to classify the clinical significance of CNVs was tested in 21,574 CNVs curated from DECIPHER after training on 52,176 CNVs published in ClinVar (Fig. 1). This approach is consistent with previously reported comparisons of pathogenicity, where a set of CNVs is examined in a general context rather than focusing on individual patients. This also aligns with the previously cited ACMG technical guidelines, which recommend uncoupling CNV pathogenicity classification from the implications for a specific patient²⁴. Features were generated to capture information related to genomic position, variant composition, overlapping functional annotation, population frequency, conservation, and dosage sensitivity. Finally, to demonstrate the efficacy of CNVoyant in a real-world clinical diagnostic setting, CNVoyant predictions were obtained from five genomic disease patients with diagnostic CNVs.

Training dataset curation

CNVoyant is trained on CNVs included in the January 2023 XML release of ClinVar⁸. This XML file was parsed, and extracted variants were limited to CNVs (variant type of “copy number gain” or “copy number loss”) that did not have duplicated variant positions. The Reference ClinVar Accession Number (RCV) entry was chosen to represent each CNV to avoid training on duplicates that can arise in cases of multiple submitters. 40,837 of the extracted CNVs were aligned to the GRCh37 reference genome, all of which required a genomic coordinate mapping via the UCSC liftOver command line tool³² to be combined with the 12,641 CNVs that were aligned to GRCh38. Following liftOver, 1,126 variant entries were identified as duplicates of entries originally aligned to GRCh38, and they were omitted. 850 CNVs were omitted due to ambiguous clinical significance labeling or conflicting clinical significance annotation in entries with matching genomic coordinates. 20 variants were removed for having a size of less than 50 bp. The remaining ClinVar CNVs with at least one pathogenic or likely pathogenic designation were labeled as pathogenic for training purposes. Non-pathogenic variants with at least one VUS designation were labeled as VUS, and remaining variants containing only benign or likely benign classifications were labeled as benign. Altogether, 52,176 CNVs were included in model training (Fig. 2a, Table 1; DEL: 6,886 pathogenic, 7,191 VUS, 13,134 benign; DUP: 3,028 pathogenic, 10,792 VUS, 11,145 benign).
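The label precedence described above (pathogenic, then VUS, then benign) can be sketched as follows. The designation strings and the function name are assumptions chosen for illustration; the exact strings in the ClinVar XML release may differ.

```python
# Assumed designation strings, compared case-insensitively.
PATHOGENIC = {"pathogenic", "likely pathogenic"}
VUS = {"uncertain significance"}
BENIGN = {"benign", "likely benign"}

def training_label(designations):
    """Collapse the designations recorded for one CNV into a training label."""
    seen = {d.strip().lower() for d in designations}
    if seen & PATHOGENIC:          # at least one (likely) pathogenic designation
        return "pathogenic"
    if seen & VUS:                 # non-pathogenic with at least one VUS
        return "vus"
    if seen and seen <= BENIGN:    # only benign or likely benign designations
        return "benign"
    return None                    # unrecognized labeling; assumed filtered upstream
```

In this sketch, entries that return `None` correspond to the ambiguously labeled CNVs that were omitted before training.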

Table 1 Distribution of variant type and clinical significance in training and test sets.

Testing dataset curation

A test set of CNVs was curated from the web interface of the v11.18 release of the DECIPHER database³³. All 29,453 DECIPHER CNVs are aligned to the GRCh38 reference genome and have been assigned clinical significance following manual review. 712 variants had variant types other than CNV deletions or duplications and as such were removed from the test set. The remaining 28,132 variants were lifted to GRCh37 to accommodate the expected input for comparator algorithms; 800 failed to map to an autosomal or sex chromosome contig and were omitted. 1,003 entries were removed for having shared genomic coordinates and conflicting clinical significance annotation, while 5,138 entries were removed for having duplicated genomic coordinates and clinical significance annotation. To guard against data leakage between train and test sets, 118 variants whose coordinates matched a ClinVar submission were omitted from the test set. 38 variants were removed for having a size of less than 50 bp. 21,574 CNVs were included in the final test set, which was used to benchmark CNVoyant against comparator algorithms (Fig. 2b; DEL: 6,183 pathogenic, 3,785 VUS, 1,097 benign; DUP: 3,360 pathogenic, 5,595 VUS, 1,554 benign).

CNVoyant feature selection

CNVoyant utilizes 17 features to classify the clinical significance of candidate CNVs. These features were selected based on publicly available, peer-reviewed resources and were chosen for their relevance and proven utility in the context of clinical significance classification for RGDs. Several feature distributions are right-skewed in the training data (counts of overlapping genes, exons, promoter regions, diseases, pathogenic ClinVar SNVs/indels, and bp length). These features are log-transformed to reflect more normal distributions. All features are normalized via the sklearn.preprocessing.MinMaxScaler Python class prior to training or prediction³⁴.
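As a minimal sketch of this preprocessing, the snippet below log-transforms two skewed columns of a toy feature matrix and rescales every column to the unit interval. The toy values, the column indices, and the use of `log1p` (to keep zero counts finite) are assumptions; the source does not specify the exact log variant.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy feature matrix; columns are [gene_count, bp_length, gc_content].
X = np.array([
    [0, 1_200, 0.41],
    [3, 85_000, 0.38],
    [40, 3_200_000, 0.52],
], dtype=float)

SKEWED_COLUMNS = [0, 1]  # assumed positions of the right-skewed features

X_transformed = X.copy()
# log1p(x) = log(1 + x); chosen here so that counts of zero stay defined.
X_transformed[:, SKEWED_COLUMNS] = np.log1p(X_transformed[:, SKEWED_COLUMNS])

# Rescale each column to [0, 1] based on its observed minimum and maximum.
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X_transformed)
```

In practice the scaler would be fit on the training set only and then reused unchanged via `scaler.transform` at prediction time.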

Genomic position (2 features)

Centromere distance: The number of bp separating the centromere from the candidate CNV. Distance from the centromere to the CNV is determined by selecting the CNV boundary closest to the centromere: the end coordinate on the P arm or the start coordinate on the Q arm.
Telomere distance: The number of bp separating the telomere from the candidate CNV. Distance from the telomere to the CNV is calculated by selecting the CNV boundary furthest from the centromere: the start coordinate on the P arm or the end coordinate on the Q arm.
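The two positional features can be sketched as below. The function names, the half-open coordinate convention, the treatment of telomeres as the chromosome ends, and the handling of CNVs that span the centromere are all assumptions made for illustration.

```python
def centromere_distance(start, end, cen_start, cen_end):
    """Distance in bp from the centromere to the nearest CNV boundary."""
    if end <= cen_start:       # CNV on the p arm: use its end coordinate
        return cen_start - end
    if start >= cen_end:       # CNV on the q arm: use its start coordinate
        return start - cen_end
    return 0                   # CNV overlaps the centromere (assumed convention)

def telomere_distance(start, end, cen_start, cen_end, chrom_length):
    """Distance in bp from the telomere to the CNV boundary furthest from the centromere."""
    if end <= cen_start:       # p arm: measured from the chromosome start
        return start
    if start >= cen_end:       # q arm: measured from the chromosome end
        return chrom_length - end
    return min(start, chrom_length - end)  # spans the centromere (assumed convention)
```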

CNV composition (2 features)

GC Content: Percentage of nucleotides in the genomic region encompassed by the candidate CNV that are guanine or cytosine.
BP Length: Total bp spanning the candidate CNV.
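Both composition features are simple to compute once the reference sequence for the CNV interval is available. In this sketch, skipping ambiguous bases such as N and the half-open coordinate convention are assumptions.

```python
def gc_content(sequence):
    """Percentage of called bases (A, C, G, T) that are G or C."""
    called = [base for base in sequence.upper() if base in "ACGT"]
    if not called:
        return 0.0  # no called bases in the interval (assumed fallback)
    return 100.0 * sum(base in "GC" for base in called) / len(called)

def bp_length(start, end):
    """Span of the CNV in bp, assuming half-open coordinates."""
    return end - start
```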

Functional annotation (5 features)

Count of genes: The gene count is defined as the total number of genes that overlap the candidate CNV. Genes overlap the candidate CNV if at least one bp is shared between the candidate CNV’s location and the gene’s genomic coordinates drawn from the RefSeq database¹. All overlap calculations were made via the Bedtools intersect function³⁵.
Count of diseases: The disease count is defined as the total number of diseases associated with the gene(s) that overlap the candidate CNV. This allows genes with more disease associations to be distinguished from genes with only a single or no disease association. Disease-gene associations are referenced from the curated annotations provided by Online Mendelian Inheritance in Man (OMIM)³⁶.
Count of exons: Overlapping exon count is calculated by summing exons, across all genes, that overlap the candidate CNV. The exonic boundaries were padded by 10 base pairs to account for canonical splice regions.
Count of promoter regions: CNVs may overlap the promoter region of a gene rather than the gene itself. To address this, we include a count of promoter regions, defined as the interval between the transcription start site (TSS) and 1,000 base pairs upstream of the TSS.
Count of ClinVar pathogenic SNVs and indels: A sum of overlapping pathogenic SNV/indels is included to capture potentially relevant pathogenicity interpretation. To obtain a set of overlapping pathogenic SNV/indels, ClinVar³⁷ is intersected with the candidate CNV and limited to variants interpreted as having “Pathogenic” or “Likely Pathogenic” significance.
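The interval logic behind these counts can be sketched in plain Python. This is an illustration of the overlap rules only, not the Bedtools-based implementation; the function names, half-open coordinates, and strand handling for promoters are assumptions.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals share at least one bp."""
    return a_start < b_end and b_start < a_end

def promoter_interval(tss, strand, upstream=1_000):
    """Interval covering 1,000 bp upstream of the TSS, oriented by strand."""
    if strand == "+":
        return max(0, tss - upstream), tss
    return tss, tss + upstream   # upstream lies at higher coordinates on "-"

def pad_exon(start, end, padding=10):
    """Extend exon boundaries by 10 bp to cover canonical splice regions."""
    return max(0, start - padding), end + padding

def count_overlaps(cnv, intervals):
    """Number of annotation intervals that overlap the CNV (start, end)."""
    return sum(overlaps(cnv[0], cnv[1], s, e) for s, e in intervals)
```

The same `count_overlaps` helper applies to genes, padded exons, promoter intervals, and pathogenic SNV/indel positions.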

Population frequency (1 feature)

GnomAD SV popmax: To estimate population frequency, we identify the highest frequency across all gnomAD SV (V4)³⁸ entries that match the CNV’s variant type (deletion or duplication) and exhibit at least 50% reciprocal overlap in genomic coordinates. This means that the candidate CNV and the gnomAD SV entry it overlaps with must share at least half of their span, ensuring a significant genomic coverage overlap between the observed CNV and those described in gnomAD SV.
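A minimal sketch of the reciprocal overlap rule and the popmax lookup follows. The tuple layout, the restriction to a single chromosome, and the fallback frequency of 0.0 when nothing matches are assumptions for illustration.

```python
def reciprocal_overlap(a_start, a_end, b_start, b_end, threshold=0.5):
    """True if the shared span covers at least `threshold` of both intervals."""
    shared = min(a_end, b_end) - max(a_start, b_start)
    if shared <= 0:
        return False
    return (shared / (a_end - a_start) >= threshold
            and shared / (b_end - b_start) >= threshold)

def popmax(cnv, catalog):
    """Highest frequency among catalog entries matching the CNV.

    cnv: (start, end, svtype); catalog: iterable of (start, end, svtype, frequency).
    """
    matches = [
        freq for start, end, svtype, freq in catalog
        if svtype == cnv[2] and reciprocal_overlap(cnv[0], cnv[1], start, end)
    ]
    return max(matches, default=0.0)  # assumed value when no entry matches

cnv = (100, 200, "DEL")
catalog = [
    (120, 220, "DEL", 0.01),   # 80 bp shared: 80% of each interval
    (100, 1000, "DEL", 0.20),  # covers the CNV, but only 11% of the entry
    (100, 200, "DUP", 0.30),   # identical span, wrong variant type
]
frequency = popmax(cnv, catalog)
```

The second catalog entry shows why the overlap must be reciprocal: a much larger population variant fully contains the candidate CNV yet is not treated as the same event.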

Conservation (2 features)

PhyloP: To estimate the conservation of a candidate CNV, PhyloP scores are referenced. PhyloP scores are available at single nucleotide resolution. Because single nucleotide scores are highly variable, aggregates taken directly over them are correlated with the size of the candidate CNV. To mitigate this correlation, we employ a centered moving average that considers all scores within a specified window (Supplemental Fig. 1). This effectively smooths the otherwise volatile conservation score curve. The maximum value of this smoothed curve within the genomic coordinates of the candidate CNV is returned as the PhyloP feature. A maximum was chosen considering that higher PhyloP scores indicate higher estimated conservation.
phastCons: The same procedure used to calculate the PhyloP feature was used to estimate conservation according to the phastCons score. Similar to PhyloP, the maximum was chosen as the aggregate function considering that higher phastCons scores indicate higher estimated conservation.
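The smoothing step shared by both conservation features can be sketched as follows. The window size and the truncation of the window at the interval edges are assumptions; the source does not state either.

```python
def centered_moving_average(scores, window):
    """Smooth per-base scores with a centered window, truncated at the edges."""
    half = window // 2
    smoothed = []
    for i in range(len(scores)):
        lo, hi = max(0, i - half), min(len(scores), i + half + 1)
        smoothed.append(sum(scores[lo:hi]) / (hi - lo))
    return smoothed

def conservation_feature(scores, window=5):
    """Maximum of the smoothed conservation curve across the CNV."""
    return max(centered_moving_average(scores, window))

# A single-base spike is damped by smoothing before the maximum is taken.
per_base = [0, 0, 9, 0, 0]
feature = conservation_feature(per_base, window=3)
```

In this toy example the raw maximum is 9, while the smoothed maximum is 3.0, so an isolated high-scoring base contributes less than a sustained conserved stretch would.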

Dosage sensitivity (5 features)

HI Score: Dosage sensitivity was estimated by overlapping the candidate CNV with a curated set of dosage sensitive regions described by ClinGen³⁹. Manually curated HI scores are available for all curated regions. HI scores were one-hot encoded to handle each unique value and split into binary features. In the case of multiple overlapping regions, HI score features were summed.
TS Score: As with HI Score, one-hot encoded binary features were generated from the curated TS scores of annotated dosage sensitive regions. The sum is also taken across all binary features when multiple dosage-sensitive regions overlap with the candidate CNV.
HI Index: The minimum HI Index score⁴⁰ observed in overlapping dosage-sensitive regions is obtained to represent the HI Index value for the candidate CNV. HI Index estimates the probability of loss intolerance, where lower values predict haploinsufficiency.
pLI: The maximum pLI score³ observed in overlapping dosage-sensitive regions is obtained from gnomAD to represent the pLI value for the candidate CNV. pLI estimates the probability of loss intolerance, where higher values predict haploinsufficiency.
LOEUF: The minimum LOEUF score⁴¹ observed in overlapping dosage-sensitive regions is obtained to represent the LOEUF value for the candidate CNV. LOEUF estimates the probability of loss intolerance, where lower values predict haploinsufficiency.
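The aggregation rules for the dosage sensitivity features can be sketched as below. The score vocabulary, the dictionary keys, and the use of `None` when no region overlaps are assumptions for illustration.

```python
# Assumed vocabulary of curated dosage scores; adjust to the resource in use.
SCORE_LEVELS = (0, 1, 2, 3, 30, 40)

def dosage_features(regions):
    """Aggregate dosage annotations over the regions a CNV overlaps.

    Each region is a dict with hypothetical keys:
    hi_score, ts_score, hi_index, pli, loeuf.
    """
    features = {}
    for prefix in ("hi_score", "ts_score"):
        for level in SCORE_LEVELS:
            # One-hot encode each region's score, then sum across regions.
            features[f"{prefix}_{level}"] = sum(r[prefix] == level for r in regions)
    # Keep the value pointing most strongly toward haploinsufficiency.
    features["hi_index"] = min((r["hi_index"] for r in regions), default=None)
    features["pli"] = max((r["pli"] for r in regions), default=None)
    features["loeuf"] = min((r["loeuf"] for r in regions), default=None)
    return features

regions = [
    {"hi_score": 3, "ts_score": 0, "hi_index": 12.5, "pli": 0.99, "loeuf": 0.21},
    {"hi_score": 3, "ts_score": 1, "hi_index": 48.0, "pli": 0.10, "loeuf": 0.95},
]
features = dosage_features(regions)
```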

Training procedure and model selection

For model training, features were generated for CNVs included in the ClinVar training set before partitioning into deletion and duplication sets. Separate multi-class models were trained for duplications and deletions, each predicting whether a candidate CNV has benign, uncertain, or pathogenic clinical significance. 29 common ML architectures were tested via fivefold cross-validation for each CNV type before selecting the top performers according to multi-class F1 score (Supplemental Table 1). Following architecture selection, fivefold cross-validation was again employed to tune hyperparameters and find the most accurate model. 10,000 sets of hyperparameter values were tested for each CNV type via the sklearn.model_selection.RandomizedSearchCV class. The top-performing models were calibrated to class distributions in the training set via the isotonic method implemented in the sklearn.calibration.CalibratedClassifierCV class.
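The tuning and calibration steps can be sketched with the two scikit-learn classes named above. The synthetic data, the small search space, the `f1_macro` scoring choice, and the use of only four sampled configurations are assumptions that keep the example fast; they are not the settings used for CNVoyant.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic three-class data standing in for one CNV type (e.g. deletions).
X, y = make_classification(
    n_samples=300, n_features=17, n_informative=8, n_classes=3, random_state=0
)

# Randomized hyperparameter search with fivefold cross-validation.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={
        "n_estimators": [25, 50],
        "max_depth": [3, 5, None],
        "min_samples_leaf": [1, 2, 4],
    },
    n_iter=4,             # the study reports 10,000 sampled sets per CNV type
    cv=5,
    scoring="f1_macro",   # assumed stand-in for the multi-class F1 criterion
    random_state=0,
)
search.fit(X, y)

# Isotonic calibration of a model built with the selected hyperparameters.
calibrated = CalibratedClassifierCV(
    RandomForestClassifier(random_state=0, **search.best_params_),
    method="isotonic",
    cv=5,
)
calibrated.fit(X, y)
probabilities = calibrated.predict_proba(X[:5])
```

Each row of `probabilities` holds one calibrated probability per class, which is the form consumed by the maximum-probability classification rule.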

Comparator algorithm selection

CNVoyant predictions were compared to five published ML-based CNV pathogenicity classifiers, X-CNV²⁷, TADA²⁸, dbCNV²⁹, StrVCTVRE³⁰, and ISV³¹, as well as the algorithmic implementation of the ACMG technical standards for CNV interpretation, ClassifyCNV²⁶. The test set was passed to CNVoyant and all comparator algorithms to test for generalizability and generate accuracy metrics for benchmarking. Given that X-CNV, TADA, dbCNV, and ClassifyCNV take GRCh37-aligned variants as input, UCSC liftOver was again called to lift DECIPHER variants from GRCh38 to GRCh37. X-CNV pathogenic probabilities yielded only three unique values across the entirety of the test set, which was unexpected given the generation of a relatively heterogeneous feature set. Rather than attempting to amend the source code to produce more continuous output, X-CNV was omitted from benchmarking.

Measuring performance

To generate granular performance data in the test cohort, accuracy metrics are reported for each of the three CNV prediction classes: benign, VUS, and pathogenic. The classification probability for each class was utilized to sort the list of CNVs and generate precision-recall (PR) and receiver operating characteristic (ROC) curves. The area under these curves (PR AUC, ROC AUC) is referenced to measure model accuracy, in addition to the average F1 score and overall accuracy for multi-class predictions. F1 and overall accuracy are only reported for CNVoyant, dbCNV, and ClassifyCNV, as these algorithms are the only three that provide multi-class output. dbCNV provides likely pathogenic and likely benign classification designations in addition to pathogenic, VUS, and benign designations. Likely pathogenic predictions were mapped to pathogenic classification and likely benign predictions were mapped to benign classification to ensure a fair comparison to CNVoyant and ClassifyCNV.

TADA, ISV, and StrVCTVRE all output a single score to estimate CNV pathogenic probability. The complement of the pathogenic probability score (1 − Pr(pathogenic)) was calculated to estimate benign significance scores for these comparator algorithms. The ClassifyCNV output is a score rather than a probability, but the complement was still chosen to represent benign significance, as higher scores represent a higher pathogenicity. CNVoyant is the only algorithm that provides benign and VUS probabilities; these values were used in plotting corresponding benign and VUS classification curves. dbCNV does not provide probabilities or a continuous confidence score for classification, so there is no value to reference in plotting ROC and PR curves. As such, dbCNV was not included in the ROC and PR curve comparisons.
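The per-class evaluation for a single-score comparator can be sketched as follows. ROC AUC is computed here from its pairwise-ranking definition in plain Python purely for illustration; in practice a library routine such as `sklearn.metrics.roc_auc_score` would normally be used. The labels and scores are toy values.

```python
def roc_auc(is_positive, scores):
    """ROC AUC as the chance that a positive outscores a negative (ties count 0.5)."""
    positives = [s for flag, s in zip(is_positive, scores) if flag]
    negatives = [s for flag, s in zip(is_positive, scores) if not flag]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Toy truth labels and pathogenicity scores from a hypothetical comparator.
truth = ["pathogenic", "benign", "vus", "pathogenic", "benign"]
pathogenic_score = [0.9, 0.2, 0.8, 0.7, 0.4]

# Benign score taken as the complement of the pathogenic score.
benign_score = [1.0 - s for s in pathogenic_score]

# One-vs-rest evaluation: each class is compared against all other CNVs.
auc_pathogenic = roc_auc([t == "pathogenic" for t in truth], pathogenic_score)
auc_benign = roc_auc([t == "benign" for t in truth], benign_score)
```

Because the benign score is a monotone transformation of the pathogenic score, a comparator evaluated this way cannot separate VUS from the other two classes, which is why VUS curves are available only for CNVoyant.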

Subgroup analyses were conducted to evaluate the relative performance of CNVoyant when candidate CNVs are partitioned by either bp length or their overlap with known segmental duplications (SD). This approach grants a more granular assessment of CNVoyant's classification accuracy across different genomic contexts, assessing whether CNVoyant maintains its predictive power across short and long CNVs. Pathogenic CNVs can be relatively small if the overlapping sequence holds critical biological significance, as is observed in cases of exon deletion. As such, it is important to ensure that CNVoyant does not make less accurate predictions in one group versus the other. In addition to partitioning subgroup analysis by bp length, CNVoyant accuracy was compared between CNVs overlapping with SDs and those not. SDs are regions within the genome that share at least 90% of their genomic sequence⁴². Clinically, these regions introduce complexity during significance interpretation⁴³. As such, CNVoyant prediction accuracy was compared, and feature value distributions were generated between these groups to understand characteristic differences between CNVs overlapping with SDs and those not.

Deciphering feature influence on CNV classification with SHAP analysis

SHAP (SHapley Additive exPlanations) values offer a qualitative analysis tool for understanding how each feature influences the clinical significance prediction for distinct CNV classes. For CNVoyant, we generated SHAP beeswarm plots across all classes to visualize the effect of training features on model prediction⁴⁴. These plots rank features by their importance and use color coding to depict the direction of their influence on the model's output. Each point on a plot represents a feature's SHAP value for an individual observation, quantifying its contribution to moving the model's prediction from the base value—the dataset's average prediction—toward the actual prediction.
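The additive property that these plots rely on can be written compactly. In the notation below, which is introduced here for illustration, \(f_c(x)\) is the model output for class \(c\) on a CNV with feature vector \(x\), \(M\) is the number of features, \(\phi_{i,c}(x)\) is the SHAP value of feature \(i\), and \(\phi_{0,c}\) is the base value.

```latex
f_c(x) = \phi_{0,c} + \sum_{i=1}^{M} \phi_{i,c}(x),
\qquad
\phi_{0,c} = \mathbb{E}\left[ f_c(X) \right]
```

A positive \(\phi_{i,c}(x)\) therefore moves the prediction for class \(c\) above the dataset average, and a negative value moves it below, with one such decomposition per clinical significance class.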

Application of CNVoyant in diagnostic CNV cases

To demonstrate the capacity to prioritize likely pathogenic CNVs, CNVoyant predictions were obtained for CNVs detected from five patients. Patients with diagnostic CNVs were selected from the rapid genome sequencing (rGS) study in the Institute for Genomic Medicine at Nationwide Children’s Hospital⁴⁵. Inclusion criteria for the rGS study required that a medical geneticist determine that rapid genome sequencing could inform medical care. Out of 218 patients, four patients had one diagnostic CNV and one patient had two diagnostic CNVs. CNV calls were generated by the GATK GermlineCNVCaller⁴⁶ and manually validated via a locally developed, clinically validated CNV user interface. To measure the performance of CNVoyant, we report the prioritized rank of the known diagnostic variant(s) for each case, along with the predicted probabilities of benign, uncertain, and pathogenic clinical significance.
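The prioritized rank reported for each diagnostic variant can be sketched as a sort on pathogenic probability. The record layout and the CNV identifiers below are hypothetical placeholders, not patient data.

```python
def rank_by_pathogenic(predictions):
    """Map each CNV identifier to its rank, with 1 as the most likely pathogenic."""
    ordered = sorted(predictions, key=lambda p: p["pathogenic"], reverse=True)
    return {p["cnv"]: rank for rank, p in enumerate(ordered, start=1)}

# Hypothetical per-patient predictions (identifiers and values are illustrative).
patient_calls = [
    {"cnv": "cnv_a", "benign": 0.70, "vus": 0.25, "pathogenic": 0.05},
    {"cnv": "cnv_b", "benign": 0.02, "vus": 0.08, "pathogenic": 0.90},
    {"cnv": "cnv_c", "benign": 0.30, "vus": 0.50, "pathogenic": 0.20},
]
ranks = rank_by_pathogenic(patient_calls)
```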

Results

Our evaluation of multiple machine learning architectures and subsequent hyperparameter tuning culminated in selecting a random forest model for both duplication and deletion CNV predictions. The performance of CNVoyant and each comparator algorithm was then evaluated using PR AUC and ROC AUC metrics for the benign, VUS, and pathogenic classes. For pathogenic CNV entries in the test set, CNVoyant displayed the highest PR AUC (0.858) and ROC AUC (0.870) (Fig. 3a, b, Table 2). StrVCTVRE displayed the second-highest PR AUC (0.816), followed by ClassifyCNV (0.812), ISV (0.804), and TADA (0.701). In terms of ROC AUC, ISV was the second-most accurate algorithm (0.847), followed by StrVCTVRE (0.827), ClassifyCNV (0.773), and TADA (0.748). For benign CNV entries in the test set, CNVoyant displayed the highest PR AUC (0.463), followed by StrVCTVRE (0.461), ClassifyCNV (0.373), ISV (0.344), and TADA (0.282) (Supplemental Fig. 2(a-b)). In terms of ROC AUC, CNVoyant was again the most accurate algorithm (0.848), followed by ISV (0.819), StrVCTVRE (0.817), TADA (0.751), and ClassifyCNV (0.689). CNVoyant’s VUS classification was respectable but less accurate than its pathogenic classification (PR AUC: 0.642; ROC AUC: 0.757). Similarly, CNVoyant’s benign classification was less accurate than its pathogenic classification.

Table 2 Benchmarking algorithmic classification of CNV clinical significance. The classification score performance of CNVoyant was compared to four algorithms (ISV, StrVCTVRE, TADA, ClassifyCNV) in determining the clinical significance of deletions, duplications and combined CNVs. The effectiveness of each algorithm was assessed by calculating the area under the curve (AUC) for both the precision-recall (PR) and receiver operating characteristic (ROC) curves. These metrics were selected to provide a comprehensive evaluation of each classifier's ability to discriminate between clinically significant and non-significant CNVs under various threshold settings. CNVoyant demonstrated superior performance to all compared algorithms across most CNV subsets as evaluated by these metrics, except for StrVCTVRE, which exhibited a higher PR AUC in classifying benign deletion variants. dbCNV was excluded from this comparison due to the absence of a continuous variable necessary for plotting PR and ROC curves.

The most informative features in the SHAP beeswarm plots for pathogenic classification differed between deletion and duplication events (Fig. 4). Pathogenic SNV/indel overlap was the most important feature for deletions, and exon count was the most important feature for duplications. For deletions, the HI index and a curated HI score of “sufficient evidence” were the second and third most important features, respectively. For duplications, bp length and promoter region count were the second and third most important features, respectively. Exon count and disease count ranked among the top five most informative features for both deletions and duplications. PhyloP was more informative than phastCons in both variant types but was more important in predicting pathogenic deletions (6th most important feature) than pathogenic duplications (9th most important feature). SHAP beeswarm plots for benign and VUS classification indicated more similar feature importance between duplication and deletion variants (Supplemental Fig. 3, benign (a-b) and VUS (c-d)). For benign classification, the top four features were shared between duplication and deletion variants in the same order of importance. Bp length was the most important feature, followed by pathogenic SNV/indel overlap, PhyloP, and exon count. For VUS classification, bp length was the most important feature, followed by exon count for both duplications and deletion variants. PhyloP was the third most important feature for deletion events, followed by gene count. Pathogenic SNV/indel overlap was the third most important feature for VUS classification in duplications, followed by PhyloP. Population frequency and GC content were relatively uninformative across benign, VUS, and pathogenic predictions.

In our analysis, CNVoyant demonstrated superior accuracy in multi-class clinical significance classification (overall accuracy: 0.669, average F1: 0.629) compared to dbCNV (overall accuracy: 0.610, average F1: 0.565) and ClassifyCNV (overall accuracy: 0.626, average F1: 0.465) (Fig. 5). When only considering benign variants, CNVoyant (F1: 0.466) outperformed dbCNV (F1: 0.427) and ClassifyCNV (F1: 0.084). ClassifyCNV was the most accurate model in predicting VUS CNVs (F1: 0.689), followed by CNVoyant (F1: 0.648) and dbCNV (F1: 0.539). For pathogenic CNVs, CNVoyant was the most accurate (F1: 0.773), followed by dbCNV (F1: 0.729) and ClassifyCNV (F1: 0.622).

To further detail performance on different sized CNVs, subgroup analyses were performed to measure CNVoyant performance based on CNV bp length (log₁₀ transformed) and overlap with SDs. The CNV bp length analysis considered three categories: CNVs whose lengths are more than one standard deviation below the mean, within one standard deviation of the mean, and more than one standard deviation above the mean (Supplemental Fig. 4). For smaller CNVs (bp < 85,037, one standard deviation less than the mean), CNVoyant favored benign predictions, but managed to correctly assign the majority of pathogenic variants as pathogenic. VUS CNVs were more often mislabeled as benign than correctly labeled as VUS in this subgroup. For moderately sized CNVs (85,037 ≤ bp ≤ 3,217,671), CNVoyant predicted the correct clinical significance for the majority of variants within each label class. For large CNVs (bp > 3,217,671, one standard deviation greater than the mean), CNVoyant favored pathogenic predictions across all label classes, with a larger proportion of true VUS being labeled as pathogenic compared to true benign variants. CNVoyant accuracy varied between CNVs overlapping with SD regions compared to those not (Supplemental Fig. 5). In CNVs that overlap with SDs, CNVoyant was more successful in classifying the overall clinical significance of CNVs (accuracy = 0.724 vs. 0.575). In CNVs that do not overlap with SDs, CNVoyant’s performance in classifying pathogenic clinical significance was reduced (F1 = 0.484 vs. 0.828 in SD overlapping CNVs). For CNVs classified as benign or VUS, similar performance was observed in SD overlapping CNVs (Benign F1 = 0.421; VUS F1 = 0.647) and non-SD overlapping CNVs (Benign F1 = 0.491; VUS F1 = 0.648).
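The length-based partition can be sketched as follows: lengths are log₁₀ transformed, and each CNV is assigned to a subgroup according to whether it falls more than one standard deviation from the mean. The function name, the subgroup labels, and the use of the sample standard deviation are assumptions for illustration.

```python
import math
from statistics import mean, stdev

def length_subgroups(bp_lengths):
    """Label each CNV as small, moderate, or large on the log10 length scale."""
    logs = [math.log10(bp) for bp in bp_lengths]
    mu, sd = mean(logs), stdev(logs)   # sample standard deviation (assumed)
    labels = []
    for value in logs:
        if value < mu - sd:
            labels.append("small")
        elif value > mu + sd:
            labels.append("large")
        else:
            labels.append("moderate")
    return labels

# Toy lengths spanning several orders of magnitude.
groups = length_subgroups([10**3, 10**5, 10**5, 10**5, 10**7])
```

Back-transforming the two cut points (10 raised to the mean minus or plus one standard deviation) would give bp thresholds of the kind quoted in the text.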

The application of CNVoyant in diagnostic CNV cases successfully demonstrated the capacity to prioritize likely pathogenic CNVs. For all five patients selected from the rGS protocol, CNVoyant placed the diagnostic CNV in the first position and placed the additional diagnostic variant in the second position in the case where two diagnostic CNVs were present (Table 3). Additionally, all diagnostic CNVs were predicted to be pathogenic with a probability of at least 0.8.

Table 3 CNVoyant predictions for six diagnostic CNVs observed in the clinical setting. Displayed are CNVoyant predictions for six diagnostic CNVs identified in five patients. Each row corresponds to a CNV and provides the following information: Patient: the de-identified identifier for each patient; Diagnostic CNV: the chromosomal location (GRCh38) and type of CNV (deletion or duplication); CNV Pathogenic Rank: the prioritized rank assigned to the CNV among all of the patient’s CNVs, sorted by pathogenic probability; Prediction (Probability): CNVoyant probability predictions for the CNV being of benign, uncertain (VUS), or pathogenic clinical significance. The highest class prediction is indicated in bold on a per-variant basis.


Discussion

CNVoyant sets a new standard in the classification of clinical significance for CNVs. Our novel algorithm outperformed the five leading algorithms for classifying CNVs (ClassifyCNV, ISV, StrVCTVRE, TADA, dbCNV) across all accuracy metrics and clinical significance classes in the DECIPHER test set. This unparalleled accuracy underscores CNVoyant's advanced analytical capabilities, especially when considering the complexity and variability resulting from individual provider submissions within the DECIPHER data set. Furthermore, for the first time, our comprehensive evaluation of feature performance within the predictive model has uncovered novel insights into the determinants of CNV clinical significance, offering a deeper understanding of the underlying drivers of classification.

CNVoyant prediction probabilities closely align with the observed class distributions in the DECIPHER test set (Supplemental Fig. 6), supporting the generalizability of these predictions. The subgroup analyses indicate what performance to expect for CNVs that are small or large, or that overlap with SDs. In the first subgroup analysis, CNVoyant was most accurate in CNVs between 85,037 and 3,217,671 bp in length. In the second, CNVoyant was most accurate when candidate CNVs overlapped with an SD (Supplemental Fig. 5). Upon further inspection of feature value distributions between classes, it became apparent that CNVs that do not intersect an SD are generally smaller and are associated with fewer known ClinVar pathogenic sequence variants, genes, and diseases. They also less frequently carry population frequency and dosage sensitivity annotations. As these annotations become more robust following the expansion of relevant literature, the accuracy of CNVoyant clinical significance predictions is expected to increase.
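One common way to check whether predicted probabilities align with observed class frequencies is a reliability table: predictions for one class are grouped into probability bins, and the mean predicted probability in each bin is compared with the fraction of CNVs in that bin that truly belong to the class. The sketch below assumes equal-width bins and uses invented toy values; it is not claimed to reproduce the analysis behind Supplemental Fig. 6.

```python
def reliability_table(probs, is_class, n_bins=5):
    """Compare mean predicted probability with observed frequency per bin.

    probs    -- predicted probabilities for one class, each in [0, 1]
    is_class -- 1 if the CNV truly belongs to that class, else 0
    Returns rows of (bin_start, bin_end, n, mean_predicted, observed_fraction).
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, is_class):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[index].append((p, y))
    rows = []
    for i, members in enumerate(bins):
        if not members:
            continue  # skip empty bins
        mean_predicted = sum(p for p, _ in members) / len(members)
        observed = sum(y for _, y in members) / len(members)
        rows.append((i / n_bins, (i + 1) / n_bins, len(members), mean_predicted, observed))
    return rows


# Toy pathogenic probabilities and true labels (hypothetical).
toy_probs = [0.05, 0.15, 0.10, 0.10, 0.50, 0.50, 0.90, 0.95, 0.85, 0.90]
toy_truth = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
table = reliability_table(toy_probs, toy_truth)
for row in table:
    print(row)
```

A well-calibrated model yields rows in which the last two columns are close to each other.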

Regarding explainability, the SHAP values generated from the test set also reflect intuitive reasoning driving predictions. Larger CNVs overlap with more functional and dosage sensitive regions, which are logically more likely to be pathogenic, and this was clearly reflected in the pathogenic SHAP beeswarm plots (Fig. 4 and Supplemental Fig. 3). Conversely, an inverse relationship exists where smaller CNVs that overlap with fewer regions drive benign predictions. The length of a candidate CNV was a simple but highly important feature omitted from ISV, StrVCTVRE, and TADA. Specifically, bp length was the most informative feature for both deletions and duplications in both VUS and benign variants. We hypothesize that a portion of the overall performance gained over the comparator algorithms was due to the addition of this feature.
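Feature rankings of the kind discussed here are commonly obtained by averaging the absolute SHAP value of each feature over all test-set CNVs, which is also the usual ordering in beeswarm plots. The sketch below assumes a precomputed matrix of SHAP values for one class; the feature names and numbers are hypothetical placeholders rather than CNVoyant output.

```python
def rank_features_by_mean_abs_shap(shap_rows, feature_names):
    """Rank features by mean absolute SHAP value across samples (largest first)."""
    totals = [0.0] * len(feature_names)
    for row in shap_rows:
        for j, value in enumerate(row):
            totals[j] += abs(value)
    n_samples = len(shap_rows)
    importance = {name: total / n_samples for name, total in zip(feature_names, totals)}
    return sorted(importance.items(), key=lambda item: item[1], reverse=True)


# Hypothetical SHAP values: one row per CNV, one column per feature.
feature_names = ["bp_length", "clinvar_pathogenic_count", "conservation"]
shap_rows = [
    [0.30, 0.10, -0.05],
    [-0.40, 0.20, 0.05],
    [0.20, -0.15, 0.00],
]
ranking = rank_features_by_mean_abs_shap(shap_rows, feature_names)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Taking absolute values matters because a feature such as bp length can push some predictions up and others down; a signed mean would let those effects cancel.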

In deletions, the count of pathogenic SNVs and indels contained within the CNV boundaries was the most important feature in predicting pathogenic significance and the second most important feature in predicting benign significance. This is also to be expected, as regions more intolerant to variation have more disease-causing variants. Given that loss of function is the most common variant type among pathogenic or likely pathogenic ClinVar SNV and indel annotations (72.5% of such variants), the emphasis placed on deletion events aligns with expectations. This trend was further observed in the context of conservation, with deletion variants spanning highly conserved regions having more pathogenic potential. Interestingly, HI and conservation metrics showed predictive value in classifying duplication variants. On further investigation, we found that HI and TS features were recently reported to overlap largely, which is consistent with the trend we observed⁴⁷.
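A feature such as the count of pathogenic SNVs and indels within the CNV boundaries can be computed efficiently with a binary search over sorted variant positions. The sketch below is illustrative only: it assumes positions on a single chromosome, closed interval coordinates, and invented toy positions, and it is not taken from the CNVoyant source code.

```python
from bisect import bisect_left, bisect_right


def count_variants_in_cnv(sorted_positions, start, end):
    """Count variant positions p with start <= p <= end.

    sorted_positions must be sorted ascending and refer to the same
    chromosome as the CNV (closed coordinates are an assumption here).
    """
    return bisect_right(sorted_positions, end) - bisect_left(sorted_positions, start)


# Hypothetical pathogenic variant positions on one chromosome.
pathogenic_positions = [100, 250, 250, 900, 1500]
print(count_variants_in_cnv(pathogenic_positions, 200, 1000))
```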

As previously stated, ClassifyCNV is an algorithm that encodes the logic driven by the most recent ACMG technical standards for CNV interpretation. There is considerable value in minimizing false negative predictions, especially in clinical settings. ClassifyCNV was highly capable of limiting false negatives in the validation set, calling only 36 pathogenic CNVs benign, compared with 474 for CNVoyant and 599 for dbCNV. The consequence of avoiding false negatives is less accuracy in identifying true negatives. We observed this lack of true negative recall in the DECIPHER cohort, where ClassifyCNV predicted that only 4.7% of the 2,651 benign CNVs had benign clinical significance. This effectively leaves 98.9% of all called CNVs to be interpreted by clinical genomicists, a value that does not significantly reduce the burden of variant interpretation. As such, ML methods should be utilized as prioritization methods rather than definitive clinical classification methods. Effectively, algorithms like CNVoyant can answer the question of which CNVs should be reviewed first. By using ML methods as prioritizers, variants with less pathogenic potential can be discounted compared to other CNVs rather than eliminated from consideration entirely, somewhat mitigating the false negative problem. To ensure that truly pathogenic variants are not incorrectly classified as benign in clinical settings, CNVoyant and ClassifyCNV predictions should both be considered.
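The trade-off described here can be made concrete with two simple quantities: the recall for a given class, and the share of CNVs that are not predicted benign and therefore still need manual review. The sketch below uses invented toy labels; only the final arithmetic line uses figures from the text, showing that 4.7% of 2,651 benign CNVs corresponds to roughly 125 variants.

```python
from collections import Counter


def class_recall(true_labels, predicted_labels, target):
    """Fraction of CNVs truly labeled `target` that were also predicted `target`."""
    predictions = [p for t, p in zip(true_labels, predicted_labels) if t == target]
    if not predictions:
        raise ValueError(f"no CNVs with true label {target!r}")
    return sum(1 for p in predictions if p == target) / len(predictions)


def review_burden(predicted_labels):
    """Fraction of CNVs not predicted benign, i.e. still needing manual review."""
    counts = Counter(predicted_labels)
    return 1 - counts["benign"] / len(predicted_labels)


# Toy labels (hypothetical), ordered as 4 benign, 3 VUS, 3 pathogenic.
toy_true = ["benign"] * 4 + ["vus"] * 3 + ["pathogenic"] * 3
toy_pred = ["benign", "vus", "vus", "pathogenic",
            "vus", "vus", "pathogenic",
            "pathogenic", "pathogenic", "benign"]
print(class_recall(toy_true, toy_pred, "benign"))
print(class_recall(toy_true, toy_pred, "pathogenic"))
print(review_burden(toy_pred))

# Arithmetic from the text: 4.7% of the 2,651 benign DECIPHER CNVs.
benign_called_benign = round(0.047 * 2651)
print(benign_called_benign)
```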

ML methods obtain better negative predictive value by considering features that are not included in the current clinical guidelines and learning from previous CNV interpretation submissions. Comparator models each have certain blind spots that CNVoyant aims to account for. TADA fails to contextualize the genomic position of the CNV itself and instead focuses on overlapping topologically associating domains. ISV and StrVCTVRE address these shortcomings but fail to consider reported pathogenic SNVs and indels in ClinVar. Echoing the principle of Occam's Razor, CNVoyant underscores the power of simplicity by leveraging a concise set of features to outperform more complex models. This approach streamlines the analytical process and enhances the model’s explainability, affirming the notion that simplicity often leads to superior outcomes.

The potential for variability in the rigor with which clinical significance is assigned within the DECIPHER dataset reflects one potential shortcoming of our study. CNVs in this test set were submitted by individual providers, and submitters likely used varying methods to assess the clinical significance of a given CNV. Given the vast size of this test dataset (21,574 CNVs) and the challenges of reassessing all these CNVs with a standardized set of guidelines, we had to accept the assigned significance label in our test data. To address this potential shortcoming, we obtained CNVoyant predictions for five patients who were affected by RGDs with causal CNVs. CNVoyant correctly classified all diagnostic variants as having pathogenic clinical significance and assigned high pathogenic probabilities, thereby demonstrating its potential utility in the clinical setting. Notably, CNVoyant consistently prioritized the diagnostic variant in the first position for every case. This ability to prioritize potentially pathogenic CNVs from a list of candidates can significantly reduce the time needed to identify a pathogenic variant, a crucial advantage as germline CNV callers continue to advance.

In the realm of clinical genomics, CNVs with existing pathogenic interpretation submissions should be referenced in favor of CNVoyant predictions. CNVoyant predictions, and computational pathogenicity predictions in general, should be considered secondary evidence when previous interpretation records detailing pathogenic significance are available. Clinical experts frequently encounter rare, uninterpreted CNVs that straddle the line between different clinical significance classes. In these cases, understanding the rationale behind ML classification can be invaluable. CNVoyant addresses this need by providing a feature that is novel among existing CNV classification algorithms: SHAP force plots (Supplemental Fig. 7). This approach enables CNVoyant to effectively highlight the CNV features with the greatest influence on the model's classification decision, providing critical insights to guide clinical interpretation. With this in mind, we engineered CNVoyant to export the plots into portable static image files, which can easily be attached to clinical notes or reports.
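The idea behind a force plot is that a single prediction decomposes into a base value plus one signed contribution per feature. The sketch below computes exact Shapley values for a tiny, hypothetical scoring function by averaging marginal contributions over all feature orderings. It is meant only to show the additive property; it is not the CNVoyant model, and the SHAP library uses more efficient algorithms and a background dataset rather than the single all-zero baseline assumed here.

```python
from itertools import permutations


def exact_shapley(predict, instance, baseline):
    """Exact Shapley values: mean marginal contribution over all feature orderings."""
    names = list(instance)
    contributions = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)          # start from the baseline feature values
        previous = predict(current)
        for name in order:
            current[name] = instance[name]  # reveal one feature at a time
            value = predict(current)
            contributions[name] += value - previous
            previous = value
    return {name: total / len(orderings) for name, total in contributions.items()}


def toy_score(x):
    """Hypothetical score with one additive term and one interaction term."""
    return 0.1 + 0.2 * x["log10_length"] + 0.3 * x["pathogenic_variants"] * x["dosage_sensitive"]


instance = {"log10_length": 2.0, "pathogenic_variants": 1.0, "dosage_sensitive": 1.0}
baseline = {"log10_length": 0.0, "pathogenic_variants": 0.0, "dosage_sensitive": 0.0}
phi = exact_shapley(toy_score, instance, baseline)
print(phi)
# Additivity: base value plus all contributions recovers the prediction.
print(toy_score(baseline) + sum(phi.values()), toy_score(instance))
```

In this toy case the interaction term is split evenly between the two features that jointly produce it, which is the behavior a reader of a force plot should keep in mind when two annotations are correlated.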

It should be noted that CNVoyant alone cannot predict the diagnostic significance of a patient’s CNVs. Other factors must be considered, including variant zygosity, phenotypic overlap with associated diseases, mode of inheritance of associated diseases, presence of additional variants in trans, and differences in CNV frequencies in clinical settings enriched with patients affected by genetic conditions. CNVoyant should instead be included in diagnostic classification architectures to limit candidate diagnostic CNVs to only those with requisite probabilities of pathogenicity. CNVoyant outputs probabilities of benign, uncertain, and pathogenic significance, which are predicated on label distributions in the training set. In clinical cases, there are far more benign variants than variants with uncertain or pathogenic significance. Often, a patient will only have CNVs of benign clinical significance. To combat this class imbalance, CNVoyant should be retrained on interpreted CNVs from real patients before implementation in a clinical decision-support setting. While the current catalog of annotated CNVs is limited, it will undoubtedly grow exponentially as more CNVs are detected and interpreted in clinical settings. In anticipation of new data, we have open-sourced CNVoyant’s source code to allow users to train models with new data using the same feature set and architecture.
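The effect of class imbalance can be illustrated with a short calculation. Under a simplified binary view (pathogenic versus not pathogenic) and with a fixed sensitivity and specificity, the fraction of "pathogenic" calls that are truly pathogenic falls sharply as pathogenic CNVs become rarer. The sensitivity, specificity, and prevalence values below are hypothetical and are not estimates of CNVoyant performance.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Probability that a CNV called pathogenic is truly pathogenic (Bayes' rule)."""
    true_positives = sensitivity * prevalence
    false_positives = (1.0 - specificity) * (1.0 - prevalence)
    return true_positives / (true_positives + false_positives)


# Hypothetical operating point evaluated at three pathogenic prevalences.
for prevalence in (0.30, 0.05, 0.01):
    ppv = positive_predictive_value(0.9, 0.9, prevalence)
    print(f"prevalence={prevalence:.2f} -> PPV={ppv:.3f}")
```

This is one reason probabilities learned from a curated training set may need recalibration, or the model may need retraining, before use on unselected patient call sets in which most CNVs are benign.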

Of note, CNVoyant is designed to classify the clinical significance of CNVs in the context of RGDs rather than clinical oncology. The count of OMIM diseases affected by the candidate CNV was a critical feature in classifying pathogenicity, and this may be misleading when assessing the clinical significance of CNVs in cancer. Additionally, CNVoyant was validated using a dataset designed to catalog variant interpretation in RGDs. Extended validation experiments would be required to determine if CNVoyant holds utility in clinical oncology pathogenicity classification. The addition of oncology-specific features could aid in the expansion of classification utility, specifically indicators of tumor-suppressing or oncogenic potential. Counts of genes belonging to an oncology-related gene list could also prove useful in identifying pathogenic CNVs in cancer.

Finally, it should be reiterated that CNVs are only one category within the larger domain of structural variation. Additional pathogenicity prediction algorithms are required to predict the clinical significance of other types of structural variation, including inversion and translocation events. Comprehensive variant prioritization algorithms must account for all structural variant types and simultaneously consider shorter variants, including SNVs and indels.

Conclusions

The advent of GS technologies and advanced algorithms has revolutionized our ability to reliably detect genetic variants, including duplications and deletions. Clinical genomics experts must painstakingly interpret these CNVs to determine their relevance to a patient’s suspected genetic condition. To aid this process, we introduce CNVoyant, a highly accurate algorithm for classifying the clinical significance of CNVs. CNVoyant’s unparalleled accuracy in classifying the clinical significance of CNVs is driven by a unique ML architecture and a carefully selected set of features that capture the multitude of factors that should be considered when evaluating the impact of a CNV. Importantly, CNVoyant demystifies ML decisions through SHAP force plots, providing the rationale behind the algorithm’s classification for any given CNV and enhancing transparency for clinicians. With the source code publicly available, CNVoyant invites continuous evolution, allowing for retraining with new data or specific populations. This adaptability will ensure that CNVoyant remains at the forefront of genomic medicine, simplifying variant prioritization and scaling to meet the demands of expanding GS applications. CNVoyant not only sets a new standard for accuracy and explainability, but also advances the capability to discern pathogenic significance, marking a significant leap in genomics.

Data availability

ClinVar XML files used to generate the training set are available via the provided FTP site (https://ftp.ncbi.nlm.nih.gov/pub/clinvar/xml/), and DECIPHER variants are available via the web-based graphical user interface (https://www.deciphergenomics.org/). CNVoyant is published under an Open Source Initiative approved 3-Clause BSD License to ensure that any interested academic institution can perform optimization with their cohorts and implement their version of the algorithm in their respective diagnostic workflows. The code for the CNVoyant algorithm is available in our GitHub repository (https://github.com/nch-igm/CNVoyant). It is also available via the pip package manager (https://pypi.org/project/CNVoyant/) and conda package manager in Linux and OS distributions (https://anaconda.org/schuetz.12/cnvoyant).

References

O’Leary, N. A. et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 44, D733-45 (2016).
Howe, K. L. et al. Ensembl 2021. Nucleic Acids Res. 49, D884-91 (2021).
Exome Aggregation Consortium et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature. 536, 285–91 (2016).
Sherry, S. T. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 29, 308–11 (2001).
Koch, L. Exploring human genomic diversity with gnomAD. Nat. Rev. Genet. 21, 448–448 (2020).
The UK10K Consortium, Writing group et al. The UK10K project identifies rare variants in health and disease. Nature. 526, 82–90 (2015).
1000 Genomes Project Consortium et al. A global reference for human genetic variation. Nature. 526, 68–74 (2015).
Landrum, M. J. et al. ClinVar: Improving access to variant interpretations and supporting evidence. Nucleic Acids Res. 46, D1062-7 (2018).
1000 Genomes Project Consortium et al. Mapping copy number variation by population-scale genome sequencing. Nature. 470, 59–65 (2011).
MacDonald, J. R., Ziman, R., Yuen, R. K. C., Feuk, L. & Scherer, S. W. The database of genomic variants: A curated collection of structural variation in the human genome. Nucleic Acids Res. 42, D986-92 (2014).
Coutelier, M. et al. Combining callers improves the detection of copy number variants from whole-genome sequencing. Eur. J. Hum. Genet. 30, 178–86 (2022).
Liu, Z. et al. Towards accurate and reliable resolution of structural variants for clinical diagnosis. Genome Biol. 23, 68 (2022).
Sanchis-Juan, A. et al. Complex structural variants in Mendelian disorders: Identification and breakpoint resolution using short- and long-read genome sequencing. Genome Med. 10, 95 (2018).
Gross, A. M. et al. Copy-number variants in clinical genome sequencing: Deployment and interpretation for rare and undiagnosed disease. Genet. Med. 21, 1121–30 (2019).
NHGRI Centers for Common Disease Genomics et al. Mapping and characterization of structural variation in 17,795 human genomes. Nature. 583, 83–9 (2020).
Yang, Y. et al. Molecular findings among patients referred for clinical whole-exome sequencing. JAMA. 312, 1870 (2014).
Clark, M. M. et al. Meta-analysis of the diagnostic and clinical utility of genome and exome sequencing and chromosomal microarray in children with suspected genetic diseases. Npj Genomic Med. 3, 16 (2018).
Tan, T. Y. et al. A head-to-head evaluation of the diagnostic efficacy and costs of trio versus singleton exome sequencing analysis. Eur. J. Hum. Genet. 27, 1791–9 (2019).
Kumar, R. D. et al. Clinical genome sequencing: three years’ experience at a tertiary children’s hospital. Genet. Med. 25, 100916 (2023).
McLean, A. et al. Informing a value care model: Lessons from an integrated adult neurogenomics clinic. Intern. Med. J. 53, 2198–207 (2023).
Bergant, G. et al. Comprehensive use of extended exome analysis improves diagnostic yield in rare disease: a retrospective survey in 1,059 cases. Genet. Med. 20, 303–12 (2018).
Hegele, R. A. Copy-number variations and human disease. Am. J. Hum. Genet. 81, 414–5 (2007).
Weischenfeldt, J., Symmons, O., Spitz, F. & Korbel, J. O. Phenotypic impact of genomic structural variation: Insights from and for human disease. Nat. Rev. Genet. 14, 125–38 (2013).
Riggs, E. R. et al. Technical standards for the interpretation and reporting of constitutional copy-number variants: A joint consensus recommendation of the American College of Medical Genetics and Genomics (ACMG) and the Clinical Genome Resource (ClinGen). Genet. Med. 22, 245–57 (2020).
Collins, R. L. et al. A structural variation reference for medical and population genetics. Nature. 581, 444–51 (2020).
Gurbich, T. A. & Ilinsky, V. V. ClassifyCNV: a tool for clinical annotation of copy-number variants. Sci. Rep. 10, 20375 (2020).
Zhang, L. et al. X-CNV: Genome-wide prediction of the pathogenicity of copy number variations. Genome Med. 13, 132 (2021).
Hertzberg, J., Mundlos, S., Vingron, M. & Gallone, G. TADA—a machine learning tool for functional annotation-based prioritisation of pathogenic CNVs. Genome Biol. 23, 67 (2022).
Lv, K. et al. dbCNV: deleteriousness-based model to predict pathogenicity of copy number variations. BMC Genomics. 24, 131 (2023).
Sharo, A. G., Hu, Z., Sunyaev, S. R. & Brenner, S. E. StrVCTVRE: A supervised learning method to predict the pathogenicity of human genome structural variants. Am. J. Hum. Genet. 109, 195–209 (2022).
Gažiová, M. et al. Automated prediction of the clinical impact of structural copy number variations. Sci. Rep. 12, 555 (2022).
Hinrichs, A. S. The UCSC genome browser database: Update 2006. Nucleic Acids Res. 34, D590-8 (2006).
Firth, H. V. et al. DECIPHER: Database of chromosomal imbalance and phenotype in humans using ensembl resources. Am. J. Hum. Genet. 84, 524–33 (2009).
Pedregosa, F. et al. Scikit-learn: Machine learning in python. J. Mach. Learn. Res. 12, 2825–30 (2011).
Quinlan, A. R. & Hall, I. M. BEDTools: A flexible suite of utilities for comparing genomic features. Bioinformatics. 26, 841–2 (2010).
Amberger, J. S., Bocchini, C. A., Schiettecatte, F., Scott, A. F. & Hamosh, A. OMIMorg: Online Mendelian Inheritance in Man (OMIM®), an online catalog of human genes and genetic disorders. Nucleic Acids Res. 43, D789-98 (2015).
Landrum, M. J. et al. ClinVar: public archive of relationships among sequence variation and human phenotype. Nucleic Acids Res. 42, D980-5 (2014).
Gudmundsson, S. et al. Variant interpretation using population databases: Lessons from gnomAD. Hum. Mutat. 43, 1012–30 (2022).
Rehm, H. L. et al. ClinGen — the clinical genome resource. N. Engl. J. Med. 372, 2235–42 (2015).
Huang, N., Lee, I., Marcotte, E. M. & Hurles, M. E. Characterising and predicting haploinsufficiency in the human genome. PLoS Genet. 6, e1001154 (2010).
Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature. 581, 434–43 (2020).
Bailey, J. A., Yavor, A. M., Massa, H. F., Trask, B. J. & Eichler, E. E. Segmental duplications: Organization and impact within the current human genome project assembly. Genome Res. 11, 1005–17 (2001).
Sudmant, P. H. et al. Diversity of human copy number variation and multicopy genes. Science. 330, 641–6 (2010).
Lundberg, S. M. & Lee, S.-I. A unified approach to interpreting model predictions. Adv. Neural Inf. Process. Syst. (2017).
Chaudhari, B. et al. Outcomes of in-house rapid genome sequencing at a Children’s Hospital. Mol. Genet. Metab. 132, S165-6 (2021).
Babadi, M. et al. GATK-gCNV enables the discovery of rare copy number variants from exome sequencing data. Nat. Genet. 55, 1589–97 (2023).
Collins, R. L. et al. A cross-disorder dosage sensitivity map of the human genome. Cell. 185, 3041-3055.e25 (2022).


Acknowledgements

We thank the Nationwide Children’s Hospital Foundation and The Abigail Wexner Research Institute at Nationwide Children’s Hospital for generously supporting this work. These funding bodies had no role in the study's design, collection, analysis, and interpretation of data, nor in writing the manuscript. We thank the patients and their families for providing the data needed to train and test CNVoyant. We also thank the physicians and bioinformaticians who submitted their CNV interpretations to ClinVar and DECIPHER.

Funding

BPC’s work on this publication was supported, in part, by the National Center for Advancing Translational Sciences of the National Institutes of Health under Grant Number UM1TR004548. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Author information

Authors and Affiliations

The Office of Data Sciences, The Abigail Wexner Research Institute at Nationwide Children’s Hospital, 575 Children’s Crossroad, Columbus, OH, 43215, USA
Robert J. Schuetz, Defne Ceyhan & Peter White
The Steve and Cindy Rasmussen Institute for Genomic Medicine, The Abigail Wexner Research Institute, Nationwide Children’s Hospital, 575 Children’s Crossroad, Columbus, OH, 43215, USA
Austin A. Antoniou, Bimal P. Chaudhari & Peter White
Department of Pediatrics, The Ohio State University College of Medicine, Columbus, OH, USA
Bimal P. Chaudhari & Peter White
Divisions of Neonatology, Genetics and Genomic Medicine, Nationwide Children’s Hospital, Columbus, OH, USA
Bimal P. Chaudhari
Center for Clinical and Translational Science, The Ohio State University and Nationwide Children’s Hospital, Columbus, OH, USA
Bimal P. Chaudhari

Authors

Robert J. Schuetz
Defne Ceyhan
Austin A. Antoniou
Bimal P. Chaudhari
Peter White

Contributions

R.S. curated training data, designed and implemented the machine learning workflow, wrote and published the CNVoyant software package, wrote the first draft of the manuscript, and prepared figures and tables. D.C. curated test data and aided in executing hyperparameter tuning procedures. A.A. contributed to experimental design and wrote segments of the Python package. B.P.C. and P.W. conceived, designed, and supervised the project, oversaw and contributed to algorithm development, provided support for the utilization of AWS and computational resources, and contributed significantly to manuscript writing and revision. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Bimal P. Chaudhari or Peter White.

Ethics declarations

Competing interests

The authors declare no competing interests.

Ethics approval and consent to participate

This study exclusively employed publicly available datasets, and thus did not involve direct human subjects research. In conducting this analysis, we strictly adhered to the terms of use, access, and distribution of the datasets as outlined by their respective sources. We ensured that our research methods and objectives were aligned with the ethical guidelines for research and data use, including respecting privacy, intellectual property rights, and the integrity of the data.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary Information.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.


About this article

Cite this article

Schuetz, R.J., Ceyhan, D., Antoniou, A.A. et al. CNVoyant a machine learning framework for accurate and explainable copy number variant classification. Sci Rep 14, 22411 (2024). https://doi.org/10.1038/s41598-024-72470-4


Received: 22 April 2024
Accepted: 09 September 2024
Published: 28 September 2024
Version of record: 28 September 2024
DOI: https://doi.org/10.1038/s41598-024-72470-4
