Abstract
The precise classification of copy number variants (CNVs) presents a significant challenge in genomic medicine, primarily due to the complex nature of CNVs and their diverse impact on rare genetic diseases (RGDs). This complexity is compounded by the limitations of existing methods in accurately distinguishing between benign, uncertain, and pathogenic CNVs. Addressing this gap, we introduce CNVoyant, a machine learning-based multi-class framework designed to enhance the clinical significance classification of CNVs. Trained on a comprehensive dataset of 52,176 ClinVar entries across pathogenic, uncertain, and benign classifications, CNVoyant incorporates a broad spectrum of genomic features, including genome position, disease-gene annotations, dosage sensitivity, and conservation scores. Models to predict the clinical significance of copy number gains and losses were trained independently. Final models were selected after testing 29 machine learning architectures and 10,000 hyperparameter combinations each for deletions and duplications via fivefold cross-validation. We validate the performance of CNVoyant by leveraging a comprehensive set of 21,574 CNVs from the DECIPHER database, a highly regarded resource known for its extensive catalog of chromosomal imbalances linked to clinical outcomes. Compared to alternative approaches, CNVoyant shows marked improvements in precision-recall and ROC AUC metrics for binary pathogenic classifications while going one step further, offering multi-classification of clinical significance and corresponding SHAP explainability plots. Additionally, when provided germline CNV calls from real-world RGD cases with diagnostic CNV(s), CNVoyant correctly classified all diagnostic CNVs as having pathogenic significance with high confidence. 
This large-scale validation demonstrates CNVoyant’s superior accuracy and underscores its potential to aid genomic researchers and clinical geneticists in interpreting the clinical implications of real CNVs.
Introduction
The establishment of reference genomes, sequencing technologies, and post-processing algorithms has ushered in an era where genetic variation is reliably detectable. Databases are maintained to define functional regions of the genome1,2, catalog observed genetic variants3,4, record variant frequencies in different human populations5,6,7, and provide annotations regarding clinical significance8. However, entries in these resources favor smaller genetic changes, specifically single nucleotide variations (SNVs) and short insertions and deletions (indels). To date, these short variants have been the focus for clinical germline diagnoses in rare genetic diseases (RGDs); however, this may be a symptom of the limited capacity to discern the clinical significance of larger structural variants (SVs).
SVs cover larger segments of DNA and include, but are not limited to, copy number variants (CNVs), translocations, and inversions, all of which span at least 50 base pairs (bp)9,10,11. The recent clinical adoption of genome sequencing (GS) has led to more reliable identification of SVs, at a much finer resolution than was possible with microarray technology12,13,14. In contrast to exome sequencing (ES), which focuses on coding sequences, GS extends the breadth of detection to intronic and intergenic regions. This attribute is crucial given that SVs frequently occur in non-coding regions and can encompass multiple genes. Despite the newly available data, understanding the clinical significance of detected SVs remains a challenge. Compared to recurrent SVs with well-defined breakpoints, rare SVs are particularly difficult to interpret. Even when rare SVs are observed in population frequency databases, they are often not annotated for clinical significance15.
Despite the advent of next-generation sequencing (NGS), diagnostic genetic variants are typically only identified in 25–45% of patients undergoing GS for suspicion of having an RGD16,17,18,19,20. Reanalysis of undiagnosed cases has yielded additional diagnoses after considering CNVs, confirming reports of CNV involvement in RGDs21,22,23. Recent efforts have been made to standardize the interpretation of CNVs, culminating in the American College of Medical Genetics and Genomics (ACMG) technical standards for interpreting CNVs24. These guidelines consider population frequency, the impact of overlapping functional regions, and previous clinical interpretation25. Within these guidelines, these features are evaluated to determine haploinsufficiency (HI) and triplosensitivity (TS), the tolerance to regional losses or gains in the genome, respectively. Both concepts fall under the broader category of dosage sensitivity.
The ACMG guidelines, while highly valued in the clinical setting, tend to classify CNVs as variants of uncertain significance (VUS), as observed in the algorithmic implementation of the guidelines, ClassifyCNV26. In the interpretation setting, this limited specificity results in lengthy candidate CNV sets requiring review, with many benign CNVs being classified as VUS. To address this problem, several machine learning (ML) approaches have been proposed to enhance the precision of classifying the clinical significance of CNVs. These algorithms statistically learn from data elements related to dosage sensitivity, overlapping genes, population frequencies, regulatory elements, topologically associated domains, and genomic position to predict pathogenic potential27,28,29,30,31. None of these algorithms combines all informative features in a single model. Moreover, as is common for ML-based classifiers, they do not explain why a given classification was chosen for a CNV, which limits interpretability. Together, these two limitations motivated the development of an improved ML approach.
Here we introduce CNVoyant, a tree-based, multi-class clinical significance classifier that combines previously reported features with novel features to classify CNVs more accurately than previously published methods. CNVoyant provides prediction explanations and enhances the accuracy and granularity of clinical significance classifications, enabling rapid identification and interpretation of potentially pathogenic CNVs.
Methods
CNVoyant’s target audience comprises individuals determining the role of a CNV in a suspected genomic disorder. This diverse group of clinical genomicists includes variant scientists, geneticists, clinical molecular geneticists, and laboratory directors. CNVoyant is designed to require minimal bioinformatics expertise and enables easy interpretation of results. It accepts a list of CNVs, specified by chromosomal coordinates and variant type (deletion or duplication), and returns probabilities for pathogenic, VUS, and benign clinical significance. For each CNV entry, CNVoyant classifies the variant by determining the maximum probability among the clinical significance classes.
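The final classification step described above can be sketched as follows. This is a minimal illustration, not the CNVoyant source code; the function name `classify` and the dictionary keys are hypothetical.

```python
def classify(probabilities):
    """Return the class label with the highest predicted probability.

    `probabilities` maps a class label to its probability. On a tie, the
    label that appears first in the dictionary is returned (an arbitrary
    choice made for this sketch).
    """
    return max(probabilities, key=probabilities.get)


# Hypothetical output for a single CNV.
example = {"benign": 0.10, "vus": 0.25, "pathogenic": 0.65}
prediction = classify(example)
```

Reporting the three probabilities alongside the label lets a reviewer see how close the runner-up class was.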
The capacity of CNVoyant to classify the clinical significance of CNVs was tested on 21,574 CNVs curated from DECIPHER after training on 52,176 CNVs published in ClinVar (Fig. 1). This approach is consistent with previously reported comparisons of pathogenicity, where a set of CNVs is examined in a general context rather than focusing on individual patients. This also aligns with the previously cited ACMG technical guidelines, which recommend uncoupling CNV pathogenicity classification from the implications for a specific patient24. Features were generated to capture information related to genomic position, variant composition, overlapping functional annotation, population frequency, conservation, and dosage sensitivity. Finally, to demonstrate the efficacy of CNVoyant in a real-world clinical diagnostic setting, CNVoyant predictions were obtained from five genomic disease patients with diagnostic CNVs.
CNVoyant development framework. The final CNVoyant models are a result of the illustrated machine learning pipeline and are designed to predict the pathogenicity of copy number variations (CNVs). The training set is comprised of 52,176 CNVs (24,965 duplications, 27,211 deletions) parsed from the January 2023 version of ClinVar, and the test set is comprised of 21,574 CNVs (10,509 duplications, 11,065 deletions) from DECIPHER v11.18. Features are generated from annotations related to genomic position, variant composition, clinical significance, and dosage sensitivity. Two models were trained to classify deletion and duplication events independently. Training data for each CNV type was partitioned into 5 cross folds. Accuracy metrics observed in each fold were utilized to (1) select the optimal architecture from 29 candidates, (2) select an optimal set of hyperparameters from 10,000 permutations, and (3) calibrate outputted probabilities to class distributions in the training data. The resulting models were used to generate probabilities of benign significance (Pr (Benign)), VUS (Pr (VUS)), and pathogenic significance (Pr (Pathogenic)) for CNVs in the test set. A clinical significance prediction is also provided by taking a maximum over the set of benign, VUS, and pathogenic probabilities. The CNVoyant output generated from the test set was later used for benchmarking.
Training dataset curation
CNVoyant is trained on CNVs included in the January 2023 XML release of ClinVar8. This XML file was parsed, and extracted variants were limited to CNVs (variant type of “copy number gain” or “copy number loss”) that did not have duplicated variant positions. The Reference ClinVar Accession Number (RCV) entry was chosen to represent each CNV to avoid training on duplicates that can arise in cases of multiple submitters. 40,837 of the extracted CNVs were aligned to the GRCh37 reference genome, all of which required a genomic coordinate mapping via the UCSC liftOver command line tool32 to be combined with the 12,641 CNVs that were aligned to GRCh38. Following liftOver, 1,126 variant entries were identified as duplicates of entries originally aligned to GRCh38, and they were omitted. 850 CNVs were omitted due to ambiguous clinical significance labeling or conflicting clinical significance annotation in entries with matching genomic coordinates. 20 variants were removed for having a size of less than 50 bp. The remaining ClinVar CNVs with at least one pathogenic or likely pathogenic designation were labeled as pathogenic for training purposes. Non-pathogenic variants with at least one VUS designation were labeled as VUS, and remaining variants containing only benign or likely benign classifications were labeled as benign. Altogether, 52,176 CNVs were included in model training (Fig. 2a, Table 1; DEL: 6,886 pathogenic, 7,191 VUS, 13,134 benign; DUP: 3,028 pathogenic, 10,792 VUS, 11,145 benign).
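The labeling precedence described above (pathogenic first, then VUS, then benign) can be sketched as follows. The designation strings and the function name `training_label` are assumptions made for illustration; the actual ClinVar XML values and parsing code are not reproduced here.

```python
# Assumed, lower-cased designation strings (illustrative only).
PATHOGENIC = {"pathogenic", "likely pathogenic"}
VUS = {"uncertain significance"}
BENIGN = {"benign", "likely benign"}


def training_label(designations):
    """Collapse the designations of one CNV into a single training label."""
    normalized = {d.strip().lower() for d in designations}
    if normalized & PATHOGENIC:          # at least one P/LP designation
        return "pathogenic"
    if normalized & VUS:                 # non-pathogenic with at least one VUS
        return "vus"
    if normalized and normalized <= BENIGN:  # only B/LB designations
        return "benign"
    return None                          # unrecognized label: left unlabeled here
```

Returning `None` for unrecognized strings is a choice of this sketch; in the curation described above, entries with ambiguous labels were omitted before this step.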
Training and test set curation. CNVoyant was trained with copy number variants (CNVs) curated from ClinVar and tested on variants curated from DECIPHER. The flowcharts indicate the reasoning for omitting 2,002 variants from the training set (a) and 7,809 variants from the test set (b). For ClinVar, 6 CNVs were mapped to contigs other than autosomes or sex chromosomes, 1,126 had matching genomic coordinates and clinical significance, 572 had ambiguous clinical significance labels, 278 variants had matching genomic coordinates and conflicting clinical significance labels, and 20 spanned less than 50 base pairs. For DECIPHER, 712 CNVs had variant types other than “duplication” or “deletion”, 5,138 had matching genomic coordinates and clinical significance, 1,003 had matching genomic coordinates and conflicting clinical significance labels, 118 overlapped with values in the training set, and 38 spanned less than 50 base pairs.
Testing dataset curation
A test set of CNVs was curated from the web interface of the v11.18 release of the DECIPHER database33. All 29,453 DECIPHER CNVs are aligned to the GRCh38 reference genome and have been assigned clinical significance following manual review. 712 variants had variant types other than CNV deletions or duplications and as such were removed from the test set. The remaining 28,132 variants were lifted to GRCh37 to accommodate the expected input for comparator algorithms; 800 failed to map to an autosomal or sex chromosome contig and were omitted. 1,003 entries were removed for having shared genomic coordinates and conflicting clinical significance annotation, while 5,138 entries were removed for having duplicated genomic coordinates and clinical significance annotation. To guard against data leakage between train and test sets, 118 variants whose coordinates matched a ClinVar submission were omitted from the test set. 38 variants were removed for having a size of less than 50 bp. 21,574 CNVs were included in the final test set, which was used to benchmark CNVoyant against comparator algorithms (Fig. 2b; DEL: 6,183 pathogenic, 3,785 VUS, 1,097 benign; DUP: 3,360 pathogenic, 5,595 VUS, 1,554 benign).
CNVoyant feature selection
CNVoyant utilizes 17 features to classify the clinical significance of candidate CNVs. These features were selected based on publicly available, peer-reviewed resources and were chosen for their relevance and proven utility in the context of clinical significance classification for RGDs. Several feature distributions are right-skewed in the training data (counts of overlapping genes, exons, promoter regions, diseases, pathogenic ClinVar SNVs/indels, and bp length). These features are log-transformed to reflect more normal distributions. All features are normalized via the sklearn preprocessing.MinMaxScaler Python class prior to training or prediction34.
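The two preprocessing steps can be illustrated with a dependency-free sketch. It assumes a `log(1 + x)` transform so that zero counts remain defined (the exact transform is not stated above), and it re-implements min-max scaling by hand rather than calling sklearn; by default, MinMaxScaler likewise maps each feature to the range [0, 1].

```python
import math


def log_transform(values):
    """Compress a right-skewed count feature; log1p keeps zero counts valid."""
    return [math.log1p(v) for v in values]


def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # constant feature: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical overlapping-gene counts for four CNVs.
gene_counts = [0, 1, 3, 250]
scaled = min_max_scale(log_transform(gene_counts))
```

In practice the minimum and maximum would be learned from the training data and reused at prediction time, which is what fitting a scaler once and then transforming new data achieves.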
Genomic position (2 features)
- Centromere distance: The number of bp separating the centromere from the candidate CNV. Distance from the centromere to the CNV is determined by selecting the CNV boundary closest to the centromere: the end coordinate on the P arm or the start coordinate on the Q arm.
- Telomere distance: The number of bp separating the telomere from the candidate CNV. Distance from the telomere to the CNV is calculated by selecting the CNV boundary furthest from the centromere: the start coordinate on the P arm or the end coordinate on the Q arm.
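The two position features can be sketched as below. The coordinate convention (telomeres at position 0 and at the chromosome length) and the handling of CNVs that span the centromere are assumptions of this sketch; the text above does not specify them.

```python
def position_features(start, end, cen_start, cen_end, chrom_length):
    """Return (centromere_distance, telomere_distance) in bp for one CNV."""
    if end <= cen_start:
        # P arm: the end coordinate is closest to the centromere,
        # the start coordinate is closest to the telomere.
        return cen_start - end, start
    if start >= cen_end:
        # Q arm: the start coordinate is closest to the centromere,
        # the end coordinate is closest to the telomere.
        return start - cen_end, chrom_length - end
    # CNV overlaps the centromere (assumed handling): distance 0 to the
    # centromere, and distance to the nearer telomere.
    return 0, min(start, chrom_length - end)
```

For example, on a toy chromosome of length 1,000 with a centromere at 400 to 500, a CNV at 100 to 200 lies 200 bp from the centromere and 100 bp from the telomere.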
CNV composition (2 features)
- GC Content: Percentage of nucleotides in the genomic region encompassed by the candidate CNV that are guanine or cytosine.
- BP Length: Total bp spanning the candidate CNV.
Functional annotation (5 features)
- Count of genes: The gene count is defined as the total number of genes that overlap the candidate CNV. Genes overlap the candidate CNV if at least one bp is shared between the candidate CNV’s location and the gene’s genomic coordinates drawn from the RefSeq database1. All overlap calculations were made via the Bedtools intersect function35.
- Count of diseases: The disease count is defined as the total number of diseases associated with the gene(s) that overlap the candidate CNV. This allows genes with more disease associations to be distinguished from genes with only a single or no disease association. Disease-gene associations are referenced from the curated annotations provided by Online Mendelian Inheritance in Man (OMIM)36.
- Count of exons: Overlapping exon count is calculated by summing exons, across all genes, that overlap the candidate CNV. The exonic boundaries were padded by 10 base pairs to account for canonical splice regions.
- Count of promoter regions: CNVs may overlap the promoter region of a gene rather than the gene itself. To address this, we include a count of promoter regions, defined as the interval between the transcription start site (TSS) and 1,000 base pairs upstream of the TSS.
- Count of ClinVar pathogenic SNVs and indels: A sum of overlapping pathogenic SNV/indels is included to capture potentially relevant pathogenicity interpretation. To obtain a set of overlapping pathogenic SNV/indels, ClinVar37 is intersected with the candidate CNV and limited to variants interpreted as having “Pathogenic” or “Likely Pathogenic” significance.
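The overlap-based counts above can be illustrated with a small stand-in for an interval intersection. This sketch assumes half-open coordinates and strand-aware promoter intervals; it is not the Bedtools call used in the pipeline, and all function names are hypothetical.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals share at least one bp."""
    return a_start < b_end and b_start < a_end


def promoter_interval(tss, strand, upstream=1000):
    """Interval from the TSS to `upstream` bp upstream (strand-aware, assumed)."""
    if strand == "+":
        return max(0, tss - upstream), tss
    return tss, tss + upstream


def count_overlaps(cnv, intervals, pad=0):
    """Count intervals overlapping the CNV after padding each side by `pad` bp."""
    cnv_start, cnv_end = cnv
    return sum(
        overlaps(cnv_start, cnv_end, max(0, start - pad), end + pad)
        for start, end in intervals
    )


cnv = (1000, 2000)
exons = [(1990, 2100), (2005, 2050), (3000, 3100)]
exon_count = count_overlaps(cnv, exons, pad=10)  # 10 bp splice-region padding
```

With the 10 bp padding, the exon starting at 2,005 is counted even though its unpadded coordinates fall just outside the CNV.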
Population frequency (1 feature)
- GnomAD SV popmax: To estimate population frequency, we identify the highest frequency across all gnomAD SV (V4)38 entries that match the CNV’s variant type (deletion or duplication) and exhibit at least 50% reciprocal overlap in genomic coordinates. This means that the candidate CNV and the gnomAD SV entry it overlaps with must share at least half of their span, ensuring a significant genomic coverage overlap between the observed CNV and those described in gnomAD SV.
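The reciprocal-overlap criterion can be sketched as follows. The record fields (`type`, `start`, `end`, `popmax_af`) and the default of 0.0 when no entry matches are assumptions of this sketch, not the gnomAD SV schema.

```python
def reciprocal_overlap(a_start, a_end, b_start, b_end):
    """Smaller of the two fractions of each interval covered by the other."""
    shared = min(a_end, b_end) - max(a_start, b_start)
    if shared <= 0:
        return 0.0
    return min(shared / (a_end - a_start), shared / (b_end - b_start))


def popmax(cnv, cnv_type, entries, threshold=0.5):
    """Highest frequency among same-type entries with enough reciprocal overlap."""
    start, end = cnv
    frequencies = [
        e["popmax_af"]
        for e in entries
        if e["type"] == cnv_type
        and reciprocal_overlap(start, end, e["start"], e["end"]) >= threshold
    ]
    return max(frequencies, default=0.0)  # assumed default when nothing matches


# Hypothetical population entries.
entries = [
    {"type": "DEL", "start": 120, "end": 220, "popmax_af": 0.01},  # 80% both ways
    {"type": "DEL", "start": 100, "end": 400, "popmax_af": 0.20},  # covers 1/3 of entry
    {"type": "DUP", "start": 100, "end": 200, "popmax_af": 0.50},  # wrong type
    {"type": "DEL", "start": 150, "end": 210, "popmax_af": 0.03},  # exactly 50% of CNV
]
frequency = popmax((100, 200), "DEL", entries)
```

The second entry is common in the population but far larger than the candidate, so the reciprocal requirement prevents it from lending its frequency to the smaller CNV.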
Conservation (2 features)
- PhyloP: To estimate the conservation of a candidate CNV, PhyloP scores are referenced. PhyloP scores are available at single nucleotide resolution. Single nucleotide scores are highly variable, and aggregate statistics computed from them are thus correlated with the size of the candidate CNV. To mitigate this correlation, we employ a centered moving average that considers all scores within a specified reading frame (Supplemental Fig. 1). This effectively smooths the otherwise volatile conservation score curve. The maximum value of this smoothed curve within the genomic coordinates of the candidate CNV is returned as the PhyloP feature. A maximum was chosen considering that higher PhyloP scores indicate higher estimated conservation.
- phastCons: The same procedure used to calculate the PhyloP feature was used to estimate conservation according to the phastCons score. Similar to PhyloP, the maximum was chosen as the aggregate function considering that higher phastCons scores indicate higher estimated conservation.
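The smoothing-then-maximum procedure can be sketched as below. The window size and the truncation of the window at the ends of the score track are assumptions of this sketch; the actual frame size is given in the supplement and is not reproduced here.

```python
def centered_moving_average(scores, window):
    """Average each position with its neighbors in a centered window.

    The window is truncated at the ends of the track (assumed edge handling).
    """
    half = window // 2
    smoothed = []
    for i in range(len(scores)):
        lo = max(0, i - half)
        hi = min(len(scores), i + half + 1)
        smoothed.append(sum(scores[lo:hi]) / (hi - lo))
    return smoothed


def conservation_feature(scores, window, cnv_start, cnv_end):
    """Maximum of the smoothed curve within the CNV (half-open index range)."""
    return max(centered_moving_average(scores, window)[cnv_start:cnv_end])


# Toy per-base scores with a single sharp peak at index 2.
scores = [0, 0, 9, 0, 0, 0, 0]
feature = conservation_feature(scores, window=3, cnv_start=3, cnv_end=7)
```

The isolated peak of 9 is damped to 3 after smoothing, which illustrates how averaging limits the influence of a single highly conserved base on the maximum.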
Dosage sensitivity (5 features)
- HI Score: Dosage sensitivity was estimated by overlapping the candidate CNV with a curated set of dosage sensitive regions described by ClinGen39. Manually curated HI scores are available for all curated regions. HI scores were one-hot encoded to handle each unique value and split into binary features. In the case of multiple overlapping regions, HI score features were summed.
- TS Score: As with HI Score, one-hot encoded binary features were generated from the curated TS scores of annotated dosage sensitive regions. The sum is also taken across all binary features when multiple dosage-sensitive regions overlap with the candidate CNV.
- HI Index: The minimum HI Index score40 observed in overlapping dosage-sensitive regions is obtained to represent the HI Index value for the candidate CNV. HI Index estimates the probability of loss intolerance, where lower values predict haploinsufficiency.
- pLI: The maximum pLI score3 observed in overlapping dosage-sensitive regions is obtained from gnomAD to represent the pLI value for the candidate CNV. pLI estimates the probability of loss intolerance, where higher values predict haploinsufficiency.
- LOEUF: The minimum LOEUF score41 observed in overlapping dosage-sensitive regions is obtained to represent the LOEUF value for the candidate CNV. LOEUF estimates the probability of loss intolerance, where lower values predict haploinsufficiency.
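The aggregation across overlapping dosage-sensitive regions can be sketched as follows. The set of score categories, the record field names, and the return value for CNVs with no overlapping region are all assumptions of this sketch.

```python
# Assumed set of curated score values; treat as illustrative.
SCORE_CATEGORIES = (0, 1, 2, 3, 30, 40)


def one_hot_sum(scores, categories=SCORE_CATEGORIES):
    """One-hot encode each score, then sum the encodings across regions."""
    counts = {category: 0 for category in categories}
    for score in scores:
        if score in counts:
            counts[score] += 1
    return counts


def dosage_features(regions):
    """Aggregate dosage-sensitivity annotations over overlapping regions."""
    if not regions:
        return None  # assumed handling when nothing overlaps
    return {
        "hi_score": one_hot_sum(r["hi_score"] for r in regions),
        "ts_score": one_hot_sum(r["ts_score"] for r in regions),
        "hi_index": min(r["hi_index"] for r in regions),  # lower = more sensitive
        "pli": max(r["pli"] for r in regions),            # higher = more sensitive
        "loeuf": min(r["loeuf"] for r in regions),        # lower = more sensitive
    }


# Two hypothetical overlapping regions.
regions = [
    {"hi_score": 3, "ts_score": 0, "hi_index": 12.5, "pli": 0.99, "loeuf": 0.2},
    {"hi_score": 3, "ts_score": 1, "hi_index": 40.0, "pli": 0.10, "loeuf": 0.9},
]
features = dosage_features(regions)
```

Each continuous score is aggregated in the direction that signals the most dosage-sensitive overlapping region, so a single sensitive gene is not diluted by neighboring tolerant ones.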
Training procedure and model selection
For model training, features were generated for CNVs included in the ClinVar training set before partitioning into deletion and duplication sets. Separate multi-class models were trained for duplications and deletions, each predicting whether a candidate CNV has benign, uncertain, or pathogenic clinical significance. 29 common ML architectures were tested via fivefold cross-validation for each CNV type before selecting the top performers according to multi-class F1 score (Supplemental Table 1). Following architecture selection, fivefold cross-validation was again employed to tune hyperparameters and find the most accurate model. 10,000 sets of hyperparameter values were tested for each CNV type via the sklearn.model_selection.RandomizedSearchCV class. The top-performing models were calibrated to class distributions in the training set via the isotonic method implemented in the sklearn.calibration.CalibratedClassifierCV class.
Comparator algorithm selection
CNVoyant predictions were compared to five published ML-based CNV pathogenicity classifiers, X-CNV27, TADA28, dbCNV29, StrVCTVRE30, and ISV31, as well as the algorithmic implementation of the ACMG technical standards for CNV interpretation, ClassifyCNV26. The test set was passed to CNVoyant and all comparator algorithms to test for generalizability and generate accuracy metrics for benchmarking. Given that X-CNV, TADA, dbCNV, and ClassifyCNV take GRCh37-aligned variants as input, UCSC liftOver was again called to lift DECIPHER variants from GRCh38 to GRCh37. X-CNV pathogenic probabilities yielded only three unique values across the entirety of the test set, which was unexpected given the generation of a relatively heterogeneous feature set. Rather than attempting to amend the source code to produce more continuous output, X-CNV was omitted from benchmarking.
Measuring performance
To generate granular performance data in the test cohort, accuracy metrics are reported for each of the three CNV prediction classes: benign, VUS, and pathogenic. The classification probability for each class was utilized to sort the list of CNVs and generate precision-recall (PR) and receiver operating characteristic (ROC) curves. The area under these curves (PR AUC, ROC AUC) is referenced to measure model accuracy, in addition to the average F1 score and overall accuracy for multi-class predictions. F1 and overall accuracy are only reported for CNVoyant, dbCNV, and ClassifyCNV, as these algorithms are the only three that provide multi-class output. dbCNV provides likely pathogenic and likely benign classification designations in addition to pathogenic, VUS, and benign designations. Likely pathogenic predictions were mapped to pathogenic classification and likely benign predictions were mapped to benign classification to ensure a fair comparison to CNVoyant and ClassifyCNV.
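The multi-class metrics and the label collapsing applied to dbCNV output can be sketched as below. The label strings and function names are hypothetical, and "average F1" is taken here to be the unweighted mean of the three per-class F1 scores, which is consistent with the per-class and average values reported in the Results.

```python
CLASSES = ("benign", "vus", "pathogenic")
COLLAPSE = {"likely pathogenic": "pathogenic", "likely benign": "benign"}


def f1_per_class(truth, predicted):
    """Per-class F1 = 2*TP / (2*TP + FP + FN); 0.0 when the class never occurs."""
    scores = {}
    for c in CLASSES:
        tp = sum(t == c and p == c for t, p in zip(truth, predicted))
        fp = sum(t != c and p == c for t, p in zip(truth, predicted))
        fn = sum(t == c and p != c for t, p in zip(truth, predicted))
        denominator = 2 * tp + fp + fn
        scores[c] = 2 * tp / denominator if denominator else 0.0
    return scores


def summarize(truth, predicted):
    """Return (overall accuracy, average F1, per-class F1) after collapsing labels."""
    predicted = [COLLAPSE.get(p, p) for p in predicted]
    per_class = f1_per_class(truth, predicted)
    accuracy = sum(t == p for t, p in zip(truth, predicted)) / len(truth)
    return accuracy, sum(per_class.values()) / len(CLASSES), per_class


# Toy labels, not data from the study.
truth = ["benign", "vus", "pathogenic", "pathogenic"]
predicted = ["likely benign", "vus", "likely pathogenic", "vus"]
accuracy, average_f1, per_class = summarize(truth, predicted)
```

Because the average is unweighted, a class with few test examples (such as benign CNVs in the test set) influences the average as much as the larger classes.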
TADA, ISV, and StrVCTVRE all output a single score to estimate CNV pathogenic probability. The complement of the pathogenic probability score (1-Pr (pathogenic)) was calculated to estimate benign significance scores for these comparator algorithms. The ClassifyCNV output is a score rather than a probability, but the complement was still chosen to represent benign significance, as higher scores represent a higher pathogenicity. CNVoyant is the only algorithm that provides benign and VUS probabilities; these values were used in plotting corresponding benign and VUS classification curves. dbCNV does not provide probabilities or a continuous confidence score for classification, so there is no value to reference in plotting ROC and PR curves. As such, dbCNV was not included in the ROC and PR curve comparisons.
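The use of the complement score for benign curves can be illustrated with a rank-based ROC AUC. This sketch computes the AUC as the probability that a positive example outscores a negative one (ties count one half); it is a quadratic-time illustration on toy values, not the benchmarking code.

```python
def roc_auc(is_positive, scores):
    """ROC AUC as the fraction of positive/negative pairs ranked correctly."""
    positives = [s for flag, s in zip(is_positive, scores) if flag]
    negatives = [s for flag, s in zip(is_positive, scores) if not flag]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Toy labels and single pathogenicity scores from a hypothetical comparator.
truth = ["pathogenic", "benign", "vus", "benign"]
pathogenic_score = [0.9, 0.2, 0.3, 0.4]

# Benign scores are estimated as the complement of the pathogenic score.
benign_score = [1 - s for s in pathogenic_score]

auc_pathogenic = roc_auc([t == "pathogenic" for t in truth], pathogenic_score)
auc_benign = roc_auc([t == "benign" for t in truth], benign_score)
```

Since a single score cannot separate VUS from the other two classes on its own axis, the VUS variant with a low pathogenic score lowers the benign AUC in this toy example.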
Subgroup analyses were conducted to evaluate the relative performance of CNVoyant when candidate CNVs are partitioned by either bp length or their overlap with known segmental duplications (SD). This approach grants a more granular assessment of CNVoyant's classification accuracy across different genomic contexts, assessing whether CNVoyant maintains its predictive power across short and long CNVs. Pathogenic CNVs can be relatively small if the overlapping sequence holds critical biological significance, as is observed in cases of exon deletion. As such, it is important to ensure that CNVoyant does not make less accurate predictions in one group versus the other. In addition to partitioning subgroup analysis by bp length, CNVoyant accuracy was compared between CNVs overlapping with SDs and those not. SDs are regions within the genome that share at least 90% of their genomic sequence42. Clinically, these regions introduce complexity during significance interpretation43. As such, CNVoyant prediction accuracy was compared, and feature value distributions were generated between these groups to understand characteristic differences between CNVs overlapping with SDs and those not.
Deciphering feature influence on CNV classification with SHAP analysis
SHAP (SHapley Additive exPlanations) values offer a qualitative analysis tool for understanding how each feature influences the clinical significance prediction for distinct CNV classes. For CNVoyant, we generated SHAP beeswarm plots across all classes to visualize the effect of training features on model prediction44. These plots rank features by their importance and use color coding to depict the direction of their influence on the model's output. Each point on a plot represents a feature's SHAP value for an individual observation, quantifying its contribution to moving the model's prediction from the base value—the dataset's average prediction—toward the actual prediction.
Application of CNVoyant in diagnostic CNV cases
To demonstrate the capacity to prioritize likely pathogenic CNVs, CNVoyant predictions were obtained for CNVs detected from five patients. Patients with diagnostic CNVs were selected from the rapid genome sequencing (rGS) study in the Institute for Genomic Medicine at Nationwide Children’s Hospital45. Inclusion criteria for the rGS study required that a medical geneticist determine that rapid genome sequencing could inform medical care. Out of 218 patients, four patients had one diagnostic CNV and one patient had two diagnostic CNVs. CNV calls were generated by the GATK GermlineCNVCaller46 and manually validated via a locally developed, clinically validated CNV user interface. To measure the performance of CNVoyant, we report the prioritized rank of the known diagnostic variant(s) for each case, along with the predicted probabilities of benign, uncertain, and pathogenic clinical significance.
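The rank reported for each diagnostic variant can be sketched as follows, assuming that a patient's CNV calls are ordered by descending predicted pathogenic probability. The ordering criterion, field names, and identifiers are assumptions of this sketch.

```python
def diagnostic_ranks(predictions, diagnostic_ids):
    """Return the 1-based rank of each diagnostic CNV within one patient's calls."""
    ordered = sorted(predictions, key=lambda row: row["pathogenic"], reverse=True)
    rank_of = {row["id"]: rank for rank, row in enumerate(ordered, start=1)}
    return {cnv_id: rank_of[cnv_id] for cnv_id in diagnostic_ids}


# Hypothetical calls for one patient.
calls = [
    {"id": "cnv_a", "benign": 0.70, "vus": 0.25, "pathogenic": 0.05},
    {"id": "cnv_b", "benign": 0.02, "vus": 0.08, "pathogenic": 0.90},
    {"id": "cnv_c", "benign": 0.20, "vus": 0.50, "pathogenic": 0.30},
]
ranks = diagnostic_ranks(calls, ["cnv_b"])
```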
Results
Evaluation of multiple machine learning architectures and subsequent hyperparameter tuning led to the selection of a random forest model for both duplication and deletion CNV predictions. The performance of each algorithm was evaluated using PR AUC and ROC AUC metrics. For pathogenic CNV entries in the test set, CNVoyant displayed the highest PR AUC (0.858) and ROC AUC (0.870) (Fig. 3a, b, Table 2). StrVCTVRE displayed the second-highest PR AUC (0.816), followed by ClassifyCNV (0.812), ISV (0.804), and TADA (0.701). In terms of ROC AUC, ISV was the second-most accurate algorithm (0.847), followed by StrVCTVRE (0.827), ClassifyCNV (0.773), and TADA (0.748). For benign CNV entries in the test set, CNVoyant displayed the highest PR AUC (0.463), followed by StrVCTVRE (0.461), ClassifyCNV (0.373), ISV (0.344), and TADA (0.282) (Supplemental Fig. 2(a-b)). In terms of ROC AUC, CNVoyant was again the most accurate algorithm (0.848), followed by ISV (0.819), StrVCTVRE (0.817), TADA (0.751), and ClassifyCNV (0.689). CNVoyant’s VUS classification was less accurate than its pathogenic classification (PR AUC: 0.642; ROC AUC: 0.757), as was its benign classification.
Binary classification of pathogenic copy number variants. The performance of CNVoyant was compared to four algorithms (ISV, StrVCTVRE, TADA, ClassifyCNV) in the binary classification of pathogenic CNVs. The discriminative power of each algorithm is quantified using the area under the curve (AUC) from both (a) precision-recall (PR AUC) and (b) receiver operating characteristic (ROC AUC) curves. CNVoyant demonstrates superior performance in distinguishing pathogenic from non-pathogenic CNVs, achieving the highest PR AUC of 0.858, indicating its effectiveness in correctly identifying pathogenic CNVs with a high degree of precision and recall. The rankings for PR AUC performance are as follows: CNVoyant (0.858), StrVCTVRE (0.816), ClassifyCNV (0.812), ISV (0.804), and TADA (0.701). Similarly, CNVoyant leads in ROC AUC with a score of 0.870, showcasing its overall capability to accurately classify CNVs across different thresholds. The ROC AUC rankings are: CNVoyant (0.870), ISV (0.847), StrVCTVRE (0.827), ClassifyCNV (0.773), and TADA (0.748).
The most informative features in the SHAP beeswarm plots for pathogenic classification differed between deletion and duplication events (Fig. 4). Pathogenic SNV/indel overlap was the most important feature for deletions, and exon count was the most important feature for duplications. For deletions, the HI index and a curated HI score of “sufficient evidence” were the second and third most important features, respectively. For duplications, bp length and promoter region count were the second and third most important features, respectively. Exon count and disease count ranked among the five most informative features for both deletions and duplications. PhyloP was more informative than phastCons in both variant types but was more important in predicting pathogenic deletions (6th most important feature) than pathogenic duplications (9th most important feature). SHAP beeswarm plots for benign and VUS classification indicated more similar feature importance between duplication and deletion variants (Supplemental Fig. 3, benign (a-b) and VUS (c-d)). For benign classification, the top four features were shared between duplication and deletion variants in the same order of importance. Bp length was the most important feature, followed by pathogenic SNV/indel overlap, PhyloP, and exon count. For VUS classification, bp length was the most important feature, followed by exon count, for both duplication and deletion variants. PhyloP was the third most important feature for deletion events, followed by gene count. Pathogenic SNV/indel overlap was the third most important feature for VUS classification in duplications, followed by PhyloP. Population frequency and GC content were relatively uninformative across benign, VUS, and pathogenic predictions.
SHAP Beeswarm Plots for CNVoyant pathogenic classification. SHapley Additive exPlanations (SHAP) values are provided to illustrate the impact of genomic features on the machine learning classification of CNVs. SHAP values offer a measure of each feature's contribution to the model's prediction, with higher absolute values indicating greater influence. Separate models were trained for (a) CNV deletions and (b) duplications; beeswarm plots are provided for each. Each point in the graph indicates a feature value for a specific training CNV. Positive SHAP values indicate that features support a pathogenic classification, and negative values detract from a pathogenic classification. The color intensity reflects the magnitude of feature values. Features are displayed in descending order by influence on the model's decision. Detailed feature descriptions are provided in the CNVoyant Feature Selection section of the Methods.
In our analysis, CNVoyant demonstrated superior accuracy in multi-class clinical significance classification (overall accuracy: 0.669, average F1: 0.629) compared to dbCNV (overall accuracy: 0.610, average F1: 0.565) and ClassifyCNV (overall accuracy: 0.626, average F1: 0.465) (Fig. 5). When only considering benign variants, CNVoyant (F1: 0.466) outperformed dbCNV (F1: 0.427) and ClassifyCNV (F1: 0.084). ClassifyCNV was the most accurate model in predicting VUS CNVs (F1: 0.689), followed by CNVoyant (F1: 0.648) and dbCNV (F1: 0.539). For pathogenic CNVs, CNVoyant was the most accurate (F1: 0.773), followed by dbCNV (F1: 0.729) and ClassifyCNV (F1: 0.622).
Multi-Class confusion matrices for CNV classification. This visualization presents confusion matrices for CNVoyant, dbCNV, and ClassifyCNV, showcasing the algorithms' ability to classify CNVs into multiple categories. The matrices illustrate the correlation between actual categories (row-wise) and predicted categories (column-wise), with color intensity indicating the proportion of observations normalized by the totals for actual labels. Darker shades denote higher proportions, highlighting the model’s classification capability per category. Ideally, a perfect classifier would have all observations along the diagonal line from the top left to the bottom right, indicating accurate category prediction for every observation. Among the algorithms capable of multi-class predictions, CNVoyant outperforms the others, demonstrating more precise classification across different CNV categories. Specifically, CNVoyant exhibits the most effective classification of benign and pathogenic CNVs, with F1 scores of 0.466 and 0.773, respectively. This compares favorably to dbCNV, with benign and pathogenic F1 scores of 0.427 and 0.729, and ClassifyCNV, with significantly lower scores of 0.084 for benign and 0.622 for pathogenic CNVs. Notably, while ClassifyCNV shows a preference for variants of uncertain significance (VUS) predictions with an F1 score of 0.689, it underperforms in benign CNV classification. CNVoyant not only leads in category-specific F1 scores but also achieves the highest overall accuracy rate of 0.669, indicating a greater proportion of correct predictions across all categories, compared to ClassifyCNV (0.626) and dbCNV (0.610). Additionally, CNVoyant maintains the highest average F1 score across categories (0.629), evidencing its superior balanced performance across benign, pathogenic, and VUS classifications, in contrast to dbCNV (0.565) and ClassifyCNV (0.465), which exhibit lower average F1 scores.
To further detail performance on differently sized CNVs, subgroup analyses were performed to measure CNVoyant performance based on CNV bp length (log10 transformed) and overlap with SDs. The CNV bp length analysis considered three categories: CNVs whose lengths are more than one standard deviation below the mean, within one standard deviation of the mean, and more than one standard deviation above the mean (Supplemental Fig. 4). For smaller CNVs (bp < 85,037, one standard deviation less than the mean), CNVoyant favored benign predictions, but managed to correctly assign the majority of pathogenic variants as pathogenic. VUS CNVs were more often mislabeled as benign than correctly labeled as VUS in this subgroup. For moderately sized CNVs (85,037 ≤ bp ≤ 3,217,671), CNVoyant predicted the correct clinical significance for the majority of variants within each label class. For large CNVs (bp > 3,217,671, one standard deviation greater than the mean), CNVoyant favored pathogenic predictions across all label classes, with a larger proportion of true VUS being labeled as pathogenic compared to true benign variants. CNVoyant accuracy differed between CNVs that overlap SD regions and those that do not (Supplemental Fig. 5). In CNVs that overlap with SDs, CNVoyant was more successful in classifying the overall clinical significance of CNVs (accuracy = 0.724 vs. 0.575 in non-SD overlapping CNVs). In CNVs that do not overlap with SDs, CNVoyant’s performance in classifying pathogenic clinical significance was reduced (F1 = 0.484 vs. 0.828 in SD overlapping CNVs). For CNVs classified as benign or VUS, similar performance was observed in SD overlapping CNVs (Benign F1 = 0.421; VUS F1 = 0.647) and non-SD overlapping CNVs (Benign F1 = 0.491; VUS F1 = 0.648).
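The size subgroups can be expressed as a simple binning rule. The sketch below uses the two bp thresholds reported above and, assuming they sit exactly one standard deviation either side of the mean on the log10 scale, recovers the implied log10 mean and standard deviation. The function and constant names are hypothetical and chosen only for illustration.

```python
import math

# Thresholds reported in the text (bp). They are assumed here to equal
# mean - 1 SD and mean + 1 SD of log10(CNV length), respectively.
SMALL_MAX_BP = 85_037
LARGE_MIN_BP = 3_217_671


def size_category(length_bp):
    """Assign a CNV to the small, moderate, or large length subgroup."""
    if length_bp < SMALL_MAX_BP:
        return "small"
    if length_bp > LARGE_MIN_BP:
        return "large"
    return "moderate"


def implied_log10_mean_and_sd(lower_bp, upper_bp):
    """Recover the log10-scale mean and SD from the mean -/+ 1 SD thresholds."""
    lo, hi = math.log10(lower_bp), math.log10(upper_bp)
    return (lo + hi) / 2, (hi - lo) / 2


mean_log10, sd_log10 = implied_log10_mean_and_sd(SMALL_MAX_BP, LARGE_MIN_BP)
# Back-transforming the log10 mean gives the geometric mean length in bp.
geometric_mean_bp = 10 ** mean_log10
```

Because the bins are defined on log-transformed lengths, the "moderate" range is symmetric around the geometric mean length (roughly half a megabase under the stated assumption) rather than around the arithmetic mean.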
The application of CNVoyant in diagnostic CNV cases successfully demonstrated the capacity to prioritize likely pathogenic CNVs. For all five patients selected from the rGS protocol, CNVoyant placed the diagnostic CNV in the first position and placed the additional diagnostic variant in the second position in the case where two diagnostic CNVs were present (Table 3). Additionally, all diagnostic CNVs were predicted to be pathogenic with a probability of at least 0.8.
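Prioritization of this kind amounts to sorting a patient's candidate CNVs by predicted pathogenic probability. The following is a minimal sketch under that assumption; the record type, identifiers, and probabilities are made up for illustration and do not reflect the CNVoyant output format or any real case.

```python
from typing import NamedTuple


class CnvPrediction(NamedTuple):
    """Hypothetical container for one CNV's three class probabilities."""
    cnv_id: str
    p_benign: float
    p_vus: float
    p_pathogenic: float


def rank_by_pathogenicity(predictions):
    """Return predictions ordered from most to least likely pathogenic."""
    return sorted(predictions, key=lambda p: p.p_pathogenic, reverse=True)


# Made-up example values for a single patient.
example = [
    CnvPrediction("cnv_a", 0.70, 0.25, 0.05),
    CnvPrediction("cnv_b", 0.05, 0.10, 0.85),
    CnvPrediction("cnv_c", 0.30, 0.50, 0.20),
]

ranked = rank_by_pathogenicity(example)
top_candidate = ranked[0].cnv_id
```

Ranking keeps every candidate in the list, so a CNV with a low pathogenic probability is reviewed later rather than discarded.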
Discussion
CNVoyant sets a new standard in the classification of clinical significance for CNVs. Our novel algorithm outperformed the five leading algorithms for classifying CNVs (ClassifyCNV, ISV, StrVCTVRE, TADA, dbCNV) across nearly all accuracy metrics and clinical significance classes in the DECIPHER test set; the one exception was VUS classification, where ClassifyCNV achieved a higher F1 score (0.689 vs. 0.648). This accuracy underscores CNVoyant's advanced analytical capabilities, especially when considering the complexity and variability resulting from individual provider submissions within the DECIPHER data set. Furthermore, for the first time, our comprehensive evaluation of feature performance within the predictive model has uncovered novel insights into the determinants of CNV clinical significance, offering a deeper understanding of the underlying drivers of classification.
CNVoyant prediction probabilities closely align with the observed class distributions in the DECIPHER test set (Supplemental Fig. 6), supporting the generalizability of these predictions. Subgroup analyses across genomic contexts showed the expected variation in performance for CNVs that are small or large in length or that overlap with SDs. In the first subgroup analysis, CNVoyant was most accurate in CNVs between 85,037 and 3,217,671 bp in length. In the second, CNVoyant was most accurate when candidate CNVs overlapped with an SD (Supplemental Fig. 5). Upon further inspection of feature value distributions between classes, it became apparent that CNVs that do not intersect an SD are generally smaller and are associated with fewer known ClinVar pathogenic sequence variants, genes, and diseases. They are also less frequently annotated with population frequency and dosage sensitivity annotations. As these annotations become more robust following the expansion of relevant literature, the accuracy of CNVoyant clinical significance predictions is expected to increase.
Regarding explainability, the SHAP values generated from the test set also reflect intuitive reasoning driving predictions. Larger CNVs overlap with more functional and dosage sensitive regions, which are logically more likely to be pathogenic, and this was clearly reflected in the pathogenic SHAP beeswarm plots (Fig. 4 and Supplemental Fig. 3). Conversely, an inverse relationship exists where smaller CNVs that overlap with fewer regions drive benign predictions. The length of a candidate CNV was a simple but highly important feature omitted from ISV, StrVCTVRE, and TADA. Specifically, bp length was the most informative feature for both deletions and duplications in both VUS and benign variants. We hypothesize that a portion of the overall performance gained over the comparator algorithms was due to the addition of this feature.
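One common way to turn per-CNV SHAP values into a global feature ranking is to take the mean absolute SHAP value of each feature across the test set. The sketch below illustrates that summary with made-up numbers. The feature names and values are hypothetical, and the source does not confirm that this exact statistic was used to rank features here.

```python
def rank_features_by_mean_abs_shap(shap_rows, feature_names):
    """Rank features by mean absolute SHAP value across observations.

    shap_rows: list of rows, one per CNV, each holding one SHAP value per feature.
    Returns (feature_name, mean_abs_shap) pairs, most influential first.
    """
    n_rows = len(shap_rows)
    mean_abs = [
        sum(abs(row[j]) for row in shap_rows) / n_rows
        for j in range(len(feature_names))
    ]
    return sorted(zip(feature_names, mean_abs), key=lambda pair: pair[1], reverse=True)


# Made-up SHAP values for three CNVs and three hypothetical features.
feature_names = ["bp_length", "pathogenic_snv_count", "conservation"]
shap_rows = [
    [0.4, -0.1, 0.05],
    [-0.6, 0.2, 0.0],
    [0.5, -0.3, 0.1],
]

ranking = rank_features_by_mean_abs_shap(shap_rows, feature_names)
```

Taking the absolute value matters: a feature such as length can push small CNVs toward benign and large CNVs toward pathogenic, and a signed mean would let those contributions cancel.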
In deletions, the count of pathogenic SNVs and indels contained within the CNV boundaries was the most important feature in predicting pathogenic significance and the second most important feature in predicting benign significance. This is also to be expected, as regions more intolerant to variation have more disease-causing variants. Given that loss of function is the most common variant type of pathogenic or likely pathogenic ClinVar SNV and indel annotations (72.5% of such variants), the emphasis placed on deletion events aligns with expectations. This trend was further observed in the context of conservation, with deletion variants spanning highly conserved regions having more pathogenic potential. Interestingly, HI and conservation metrics showed predictive value in classifying duplication variants. On further investigation, we found a recent report that HI and TS features largely overlap, which is consistent with our observed trend [47].
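The count of pathogenic SNVs and indels inside a CNV is an interval-containment count. A minimal sketch follows, assuming 1-based inclusive coordinates, a single chromosome, and a pre-sorted list of variant positions. The function name and example positions are hypothetical, and the actual feature may have been computed differently (for example, with an interval-intersection tool).

```python
from bisect import bisect_left, bisect_right


def count_variants_in_cnv(sorted_positions, cnv_start, cnv_end):
    """Count positions p with cnv_start <= p <= cnv_end.

    sorted_positions must be sorted ascending and refer to the same
    chromosome as the CNV; duplicates (multiple variants at one site)
    are each counted.
    """
    first = bisect_left(sorted_positions, cnv_start)   # first index with p >= start
    last = bisect_right(sorted_positions, cnv_end)     # first index with p > end
    return last - first


# Made-up positions of pathogenic SNVs/indels on one chromosome.
pathogenic_positions = [100, 250, 250, 400, 900]

inside = count_variants_in_cnv(pathogenic_positions, 200, 400)
none_inside = count_variants_in_cnv(pathogenic_positions, 401, 899)
```

Binary search keeps each lookup logarithmic in the number of annotated variants, which is convenient when many candidate CNVs are annotated against the same variant list.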
As previously stated, ClassifyCNV is an algorithm that encodes the logic driven by the most recent ACMG technical standards for CNV interpretation. There is considerable value in minimizing false negative predictions, especially in clinical settings. ClassifyCNV was highly capable of limiting false negatives in the validation set, calling only 36 pathogenic CNVs benign, compared to 474 for CNVoyant and 599 for dbCNV. The consequence of avoiding false negatives is lower accuracy in identifying true negatives. We observed this lack of true negative recall in the DECIPHER cohort, where ClassifyCNV predicted that only 4.7% of the 2,651 benign CNVs had benign clinical significance. This effectively leaves 98.9% of called CNVs to be interpreted by clinical genomicists, a value that does not significantly reduce the burden of variant interpretation. As such, ML methods should be utilized as prioritization methods rather than definitive clinical classification methods. Effectively, algorithms like CNVoyant can answer the question of which CNVs should be reviewed first. By using ML methods as prioritizers, variants with less pathogenic potential can be discounted compared to other CNVs rather than eliminated from consideration entirely, somewhat mitigating the false negative problem. To ensure that truly pathogenic variants are not incorrectly classified as benign in clinical settings, CNVoyant and ClassifyCNV predictions should both be considered.
ML methods obtain better negative predictive value by considering features that are not included in the current clinical guidelines and learning from previous CNV interpretation submissions. Comparator models each have certain blind spots that CNVoyant aims to account for. TADA fails to contextualize the genomic position of the CNV itself and instead focuses on overlapping topologically associating domains. ISV and StrVCTVRE address these shortcomings but fail to consider reported pathogenic SNVs and indels in ClinVar. Echoing the principle of Occam's Razor, CNVoyant underscores the power of simplicity by leveraging a concise set of features to outperform more complex models. This approach streamlines the analytical process and enhances the model’s explainability, affirming the notion that simplicity often leads to superior outcomes.
The potential for variability in the rigor with which clinical significance is assigned within the DECIPHER dataset reflects one potential shortcoming of our study. CNVs in this test set were submitted by individual providers, and submitters likely used varying methods to assess the clinical significance of a given CNV. Given the vast size of this test dataset (21,574 CNVs) and the challenges of reassessing all these CNVs with a standardized set of guidelines, we had to accept the assigned significance label in our test data. To address this potential shortcoming, we obtained CNVoyant predictions for five patients who were affected by RGDs with causal CNVs. CNVoyant correctly classified all diagnostic variants as having pathogenic clinical significance and assigned high pathogenic probabilities, thereby demonstrating its potential utility in the clinical setting. Notably, CNVoyant consistently prioritized the diagnostic variant in the first position for every case. This ability to prioritize potentially pathogenic CNVs from a list of candidates can significantly reduce the time needed to identify a pathogenic variant, a crucial advantage as germline CNV callers continue to advance.
In the realm of clinical genomics, CNVs with existing pathogenic interpretation submissions should be referenced in favor of CNVoyant predictions. CNVoyant predictions, and computational pathogenicity predictions in general, should be considered secondary evidence when previous interpretation records detailing pathogenic significance are available. Clinical experts frequently encounter rare, uninterpreted CNVs that straddle the line between different clinical significance classes. In these cases, understanding the rationale behind ML classification can be invaluable. CNVoyant addresses this need by providing a feature that is novel amongst existing CNV classification algorithms: SHAP force plots (Supplemental Fig. 7). This approach enables CNVoyant to effectively highlight the CNV features with the greatest influence on the model's classification decision, providing critical insights to guide clinical interpretation. With this in mind, we engineered CNVoyant to export the plots into portable static image files, which can easily be attached to clinical notes or reports.
It should be noted that CNVoyant alone cannot predict the diagnostic significance of a patient’s CNVs. Other factors must be considered, including variant zygosity, phenotypic overlap with associated diseases, mode of inheritance of associated diseases, presence of additional variants in trans, and differences in CNV frequencies in clinical settings enriched with patients affected by genetic conditions. CNVoyant should instead be included in diagnostic classification architectures to limit candidate diagnostic CNVs to only those with requisite probabilities of pathogenicity. CNVoyant outputs probabilities of benign, uncertain, and pathogenic significance, which are predicated on label distributions in the training set. In clinical cases, there are far more benign variants than variants with uncertain or pathogenic significance. Often, a patient will only have CNVs of benign clinical significance. To combat this class imbalance, CNVoyant should be retrained on interpreted CNVs from real patients before implementation in a clinical decision-support setting. While the current catalog of annotated CNVs is limited, it will undoubtedly grow exponentially as more CNVs are detected and interpreted in clinical settings. In anticipation of new data, we have open-sourced CNVoyant’s source code to allow users to train models with new data using the same feature set and architecture.
Of note, CNVoyant is designed to classify the clinical significance of CNVs in the context of RGDs rather than clinical oncology. The count of OMIM diseases affected by the candidate CNV was a critical feature in classifying pathogenicity, and this may be misleading when assessing the clinical significance of CNVs in cancer. Additionally, CNVoyant was validated using a dataset designed to catalog variant interpretation in RGDs. Extended validation experiments would be required to determine if CNVoyant holds utility in clinical oncology pathogenicity classification. The addition of oncology-specific features could aid in the expansion of classification utility, specifically indicators of tumor-suppressing or oncogenic potential. Counts of genes belonging to an oncology-related gene list could also prove useful in identifying pathogenic CNVs in cancer.
Finally, it should be reiterated that CNVs are only one category within the larger domain of structural variation. Additional pathogenicity prediction algorithms are required to predict the clinical significance of other types of structural variation, including inversion and translocation events. Comprehensive variant prioritization algorithms must account for all structural variant types and simultaneously consider shorter variants, including SNVs and indels.
Conclusions
The advent of GS technologies and advanced algorithms has revolutionized our ability to detect genetic variants, including duplications and deletions, reliably. Clinical genomics experts must painstakingly interpret these CNVs to determine their relevance to a patient’s suspected genetic condition. To aid this process, we introduce CNVoyant, a highly accurate algorithm for classifying the clinical significance of CNVs. CNVoyant’s accuracy in classifying CNVs' clinical significance is driven by a unique ML architecture and a carefully selected set of features that capture the multitude of factors that should be considered when evaluating the impact of a CNV. Importantly, CNVoyant demystifies ML decisions through SHAP force plots, providing the rationale behind the algorithm’s classification for any given CNV and enhancing transparency for clinicians. With the source code publicly available, CNVoyant invites continuous evolution, allowing for retraining with new data or specific populations. This adaptability will ensure that CNVoyant remains at the forefront of genomic medicine, simplifying variant prioritization and scaling to meet the demands of expanding GS applications. CNVoyant not only sets a new standard for accuracy and explainability, but also advances the capability to discern pathogenic significance, marking a significant leap in genomics.
Data availability
ClinVar XML files used to generate the training set are available via the provided FTP site (https://ftp.ncbi.nlm.nih.gov/pub/clinvar/xml/), and DECIPHER variants are available via the web-based graphical user interface (https://www.deciphergenomics.org/). CNVoyant is published under an Open Source Initiative approved 3-Clause BSD License to ensure that any interested academic institution can perform optimization with their cohorts and implement their version of the algorithm in their respective diagnostic workflows. The code for the CNVoyant algorithm is available in our GitHub repository (https://github.com/nch-igm/CNVoyant). It is also available via the pip package manager (https://pypi.org/project/CNVoyant/) and the conda package manager for Linux and macOS distributions (https://anaconda.org/schuetz.12/cnvoyant).
References
O’Leary, N. A. et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 44, D733-45 (2016).
Howe, K. L. et al. Ensembl 2021. Nucleic Acids Res. 49, D884-91 (2021).
Exome Aggregation Consortium et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature. 536, 285–91 (2016).
Sherry, S. T. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 29, 308–11 (2001).
Koch, L. Exploring human genomic diversity with gnomAD. Nat. Rev. Genet. 21, 448–448 (2020).
The UK10K Consortium, Writing group et al. The UK10K project identifies rare variants in health and disease. Nature. 526, 82–90 (2015).
1000 Genomes Project Consortium et al. A global reference for human genetic variation. Nature. 526, 68–74 (2015).
Landrum, M. J. et al. ClinVar: Improving access to variant interpretations and supporting evidence. Nucleic Acids Res. 46, D1062-7 (2018).
1000 Genomes Project Consortium et al. Mapping copy number variation by population-scale genome sequencing. Nature. 470, 59–65 (2011).
MacDonald, J. R., Ziman, R., Yuen, R. K. C., Feuk, L. & Scherer, S. W. The database of genomic variants: A curated collection of structural variation in the human genome. Nucleic Acids Res. 42, D986-92 (2014).
Coutelier, M. et al. Combining callers improves the detection of copy number variants from whole-genome sequencing. Eur. J. Hum. Genet. 30, 178–86 (2022).
Liu, Z. et al. Towards accurate and reliable resolution of structural variants for clinical diagnosis. Genome Biol. 23, 68 (2022).
Sanchis-Juan, A. et al. Complex structural variants in Mendelian disorders: Identification and breakpoint resolution using short- and long-read genome sequencing. Genome Med. 10, 95 (2018).
Gross, A. M. et al. Copy-number variants in clinical genome sequencing: Deployment and interpretation for rare and undiagnosed disease. Genet Med. 21, 1121–30 (2019).
NHGRI Centers for Common Disease Genomics et al. Mapping and characterization of structural variation in 17,795 human genomes. Nature. 583, 83–9 (2020).
Yang, Y. et al. Molecular findings among patients referred for clinical whole-exome sequencing. JAMA. 312, 1870 (2014).
Clark, M. M. et al. Meta-analysis of the diagnostic and clinical utility of genome and exome sequencing and chromosomal microarray in children with suspected genetic diseases. Npj Genomic Med. 3, 16 (2018).
Tan, T. Y. et al. A head-to-head evaluation of the diagnostic efficacy and costs of trio versus singleton exome sequencing analysis. Eur. J. Hum. Genet. 27, 1791–9 (2019).
Kumar, R. D. et al. Clinical genome sequencing: three years’ experience at a tertiary children’s hospital. Genet. Med. 25, 100916 (2023).
McLean, A. et al. Informing a value care model: Lessons from an integrated adult neurogenomics clinic. Intern Med. J. 53, 2198–207 (2023).
Bergant, G. et al. Comprehensive use of extended exome analysis improves diagnostic yield in rare disease: a retrospective survey in 1,059 cases. Genet Med. 20, 303–12 (2018).
Hegele, R. A. Copy-number variations and human disease. Am. J. Hum. Genet. 81, 414–5 (2007).
Weischenfeldt, J., Symmons, O., Spitz, F. & Korbel, J. O. Phenotypic impact of genomic structural variation: Insights from and for human disease. Nat. Rev. Genet. 14, 125–38 (2013).
Riggs, E. R. et al. Technical standards for the interpretation and reporting of constitutional copy-number variants: A joint consensus recommendation of the American College of Medical Genetics and Genomics (ACMG) and the Clinical Genome Resource (ClinGen). Genet. Med. 22, 245–57 (2020).
Collins, R. L. et al. A structural variation reference for medical and population genetics. Nature. 581, 444–51 (2020).
Gurbich, T. A. & Ilinsky, V. V. ClassifyCNV: a tool for clinical annotation of copy-number variants. Sci. Rep. 10, 20375 (2020).
Zhang, L. et al. X-CNV: Genome-wide prediction of the pathogenicity of copy number variations. Genome Med. 13, 132 (2021).
Hertzberg, J., Mundlos, S., Vingron, M. & Gallone, G. TADA—a machine learning tool for functional annotation-based prioritisation of pathogenic CNVs. Genome Biol. 23, 67 (2022).
Lv, K. et al. dbCNV: deleteriousness-based model to predict pathogenicity of copy number variations. BMC Genomics. 24, 131 (2023).
Sharo, A. G., Hu, Z., Sunyaev, S. R. & Brenner, S. E. StrVCTVRE: A supervised learning method to predict the pathogenicity of human genome structural variants. Am. J. Hum. Genet. 109, 195–209 (2022).
Gažiová, M. et al. Automated prediction of the clinical impact of structural copy number variations. Sci. Rep. 12, 555 (2022).
Hinrichs, A. S. The UCSC genome browser database: Update 2006. Nucleic Acids Res. 34, D590-8 (2006).
Firth, H. V. et al. DECIPHER: Database of chromosomal imbalance and phenotype in humans using ensembl resources. Am. J. Hum. Genet. 84, 524–33 (2009).
Pedregosa, F. et al. Scikit-learn: Machine learning in python. J. Mach. Learn. Res. 12, 2825–30 (2011).
Quinlan, A. R. & Hall, I. M. BEDTools: A flexible suite of utilities for comparing genomic features. Bioinformatics. 26, 841–2 (2010).
Amberger, J. S., Bocchini, C. A., Schiettecatte, F., Scott, A. F. & Hamosh, A. OMIMorg: Online Mendelian Inheritance in Man (OMIM®), an online catalog of human genes and genetic disorders. Nucleic Acids Res. 43, D789-98 (2015).
Landrum, M. J. et al. ClinVar: public archive of relationships among sequence variation and human phenotype. Nucleic Acids Res. 42, D980-5 (2014).
Gudmundsson, S. et al. Variant interpretation using population databases: Lessons from gnomAD. Hum. Mutat. 43, 1012–30 (2022).
Rehm, H. L. et al. ClinGen — the clinical genome resource. N. Engl. J. Med. 372, 2235–42 (2015).
Huang, N., Lee, I., Marcotte, E. M. & Hurles, M. E. Characterising and predicting haploinsufficiency in the human genome. PLoS Genet. 6, e1001154 (2010).
Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature. 581, 434–43 (2020).
Bailey, J. A., Yavor, A. M., Massa, H. F., Trask, B. J. & Eichler, E. E. Segmental duplications: Organization and impact within the current human genome project assembly. Genome Res. 11, 1005–17 (2001).
Sudmant, P. H. et al. Diversity of human copy number variation and multicopy genes. Science. 330, 641–6 (2010).
Lundberg, S. M. & Lee, S.-I. A unified approach to interpreting model predictions. Adv. Neural Inf. Process. Syst. (2017).
Chaudhari, B. et al. Outcomes of in-house rapid genome sequencing at a Children’s Hospital. Mol. Genet. Metab. 132, S165-6 (2021).
Babadi, M. et al. GATK-gCNV enables the discovery of rare copy number variants from exome sequencing data. Nat. Genet. 55, 1589–97 (2023).
Collins, R. L. et al. A cross-disorder dosage sensitivity map of the human genome. Cell. 185, 3041-3055.e25 (2022).
Acknowledgements
We thank the Nationwide Children’s Hospital Foundation and The Abigail Wexner Research Institute at Nationwide Children’s Hospital for generously supporting this work. These funding bodies had no role in the study's design, collection, analysis, and interpretation of data, nor in writing the manuscript. We thank the patients and their families for providing the data needed to train and test CNVoyant. We also thank the physicians and bioinformaticians who submitted their CNV interpretations to ClinVar and DECIPHER.
Funding
BPC’s work on this publication was supported, in part, by the National Center for Advancing Translational Sciences of the National Institutes of Health under Grant Number UM1TR004548. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Author information
Contributions
R.S. curated training data, designed and implemented the machine learning workflow, wrote and published the CNVoyant software package, wrote the first draft of the manuscript, and prepared figures and tables. D.C. curated test data and aided in executing hyperparameter tuning procedures. A.A. contributed to experimental design and wrote segments of the Python package. B.P.C. and P.W. conceived, designed, and supervised the project, oversaw and contributed to algorithm development, provided support for the utilization of AWS and computational resources, and contributed significantly to manuscript writing and revision. All authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Ethics approval and consent to participate
This study exclusively employed publicly available datasets, and thus did not involve direct human subjects research. In conducting this analysis, we strictly adhered to the terms of use, access, and distribution of the datasets as outlined by their respective sources. We ensured that our research methods and objectives were aligned with the ethical guidelines for research and data use, including respecting privacy, intellectual property rights, and the integrity of the data.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Schuetz, R.J., Ceyhan, D., Antoniou, A.A. et al. CNVoyant a machine learning framework for accurate and explainable copy number variant classification. Sci Rep 14, 22411 (2024). https://doi.org/10.1038/s41598-024-72470-4
DOI: https://doi.org/10.1038/s41598-024-72470-4