Fig. 3

Feature selection steps in the application of machine learning algorithms. Subfigures (A), (B), and (C) display Spearman correlation heatmaps for the LB, VOUS, and LP variants, respectively. Subfigure (D) illustrates the results of principal component analysis (PCA) applied across 42 in silico tools, revealing that the accessible in silico tools can be explained with two or three components. PC1 explains 60% of the variance, PC2 explains 24%, and PC3 explains 4%; together they account for 88% of the total variance. The elbow plot analysis suggests that two components are more appropriate than three, as the additional component provides minimal improvement in clustering performance. (E) This classification system was further utilized to assess the distribution of amino acid positions. (F) Feature selection was performed with recursive feature elimination with cross-validation (RFECV), which identified six in silico tools as the optimal feature set for precise variant classification. The results underscore the effectiveness of these tools in distinguishing between different variant categories. (G) Feature importance metrics based on the RF algorithm; REVEL outperformed the other scores. Determination of the optimal number of clusters according to the elbow method (H) and silhouette scores (I); according to both criteria, three clusters showed the best overall performance. (J) 3-means clustering results: REVEL (first cluster); Mutation Assessor and MetaLR (second cluster); CADD, SIFT, and PolyPhen-2 (third cluster). (Hyperparameters: n_clusters=[2, 3, 4, 5], init='k-means++', n_init=10, max_iter=300, tol=1e-4, distance_metric='euclidean', random_state=42, verbose=0, algorithm='lloyd'.) Subfigures (K), (L), and (M) present the Spearman correlation analysis for the LP, LB, and overall (LP + LB) variants, conducted on the top six selected in silico tools.
The performance of the classification models was evaluated via the F1 metric, with the corresponding F1 scores displayed in subfigure (N). According to the RFECV scores and PCA, three in silico tools represent the optimal number of features; the classifier achieved its highest performance with three features.
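The RFECV step with an RF estimator and F1 scoring can be sketched as follows; the feature matrix and labels are synthetic placeholders standing in for the 42 tool scores and the binary LB/LP labels:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(42)
# Synthetic stand-in: 42 in silico tool scores for 200 variants, binary LB/LP labels
X = rng.random((200, 42))
y = rng.integers(0, 2, 200)

# RFECV drops one feature per step; cross-validated F1 picks the feature
# count at which performance peaks.
selector = RFECV(estimator=RandomForestClassifier(n_estimators=50, random_state=42),
                 step=1, cv=StratifiedKFold(5), scoring='f1')
selector.fit(X, y)
n_selected = selector.n_features_   # number of tools retained
```

On random labels the selected count is arbitrary; on the real data this procedure yields the three-feature optimum reported in (N).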