Fig. 1 | Scientific Reports

Fig. 1

From: CNVoyant a machine learning framework for accurate and explainable copy number variant classification

CNVoyant development framework. The final CNVoyant models result from the illustrated machine learning pipeline and are designed to predict the pathogenicity of copy number variants (CNVs). The training set comprises 52,176 CNVs (24,965 duplications, 27,211 deletions) parsed from the January 2023 version of ClinVar, and the test set comprises 21,574 CNVs (10,509 duplications, 11,065 deletions) from DECIPHER v11.18. Features are generated from annotations related to genomic position, variant composition, clinical significance, and dosage sensitivity. Two models were trained, one to classify deletions and one to classify duplications. Training data for each CNV type were partitioned into 5 cross-validation folds. Accuracy metrics observed in each fold were used to (1) select the optimal architecture from 29 candidates, (2) select an optimal set of hyperparameters from 10,000 permutations, and (3) calibrate the output probabilities to the class distributions in the training data. The resulting models were used to generate probabilities of benign significance (Pr(Benign)), uncertain significance (Pr(VUS)), and pathogenic significance (Pr(Pathogenic)) for CNVs in the test set. A clinical significance prediction is also provided by selecting the class with the highest of the benign, VUS, and pathogenic probabilities. The CNVoyant output generated from the test set was later used for benchmarking.
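The final step of the caption, deriving a clinical significance call from the three class probabilities, can be illustrated with a minimal Python sketch. This is not CNVoyant's own code: the function name `predict_significance`, the dictionary input format, and the tie-breaking rule (first label in `LABELS` order) are assumptions made for illustration only.

```python
from typing import Dict

# Class labels named in the caption; the ordering here is an assumption
# and only matters for breaking exact ties.
LABELS = ("Benign", "VUS", "Pathogenic")


def predict_significance(probs: Dict[str, float]) -> str:
    """Return the label with the highest predicted probability.

    `probs` is a hypothetical mapping from each label in LABELS to its
    probability, e.g. Pr(Benign), Pr(VUS), Pr(Pathogenic).
    """
    missing = [label for label in LABELS if label not in probs]
    if missing:
        raise ValueError(f"missing probabilities for: {missing}")
    # max() returns the first maximal item, so exact ties resolve in
    # LABELS order (an illustrative choice, not taken from the paper).
    return max(LABELS, key=lambda label: probs[label])


# Illustrative values only, not results from the paper.
example = {"Benign": 0.10, "VUS": 0.30, "Pathogenic": 0.60}
prediction = predict_significance(example)
```

With the illustrative values above, `prediction` is the pathogenic label because 0.60 is the largest of the three probabilities.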
