Introduction

The PAX6 gene (Paired box 6, OMIM #607108, HGNC 8620) encodes a DNA-binding protein that performs essential regulatory functions during eye development in many animal species including humans [1, 2]. Genetic variants in PAX6 underlie a number of ophthalmic disorders. By far the most common PAX6-related oculopathy is aniridia (OMIM #106210), a condition associated with PAX6 haploinsufficiency due to heterozygous loss-of-function variants [3]. Missense variants have been generally linked with milder phenotypes [4, 5]. However, in 2020, a study by Williamson et al. highlighted that certain heterozygous PAX6 missense variants can cause clinical manifestations that are more severe than aniridia (including microphthalmia and anophthalmia) [6]. Predicting the effect of the growing number of missense variants that are being identified remains challenging. Notably, when established criteria (such as those described by the American College of Medical Genetics and Association of Molecular Pathology (ACMG/AMP)) are used to classify these sequence alterations, a significant proportion are classified as variants of uncertain significance (VUS) [7, 8].

Computational (in silico) tools are commonly used to provide evidence to support or refute variant pathogenicity [8]. Each tool employs a different algorithm; features commonly taken into account include evolutionary conservation and protein/domain structure (Supplementary Table 1). It is noted that some algorithms combine the output from other tools to achieve a single consensus prediction (meta-predictors) [9].

A number of previous studies have evaluated the performance of commonly used computational tools in different genes, noting significant variability in predictive performance [10,11,12,13]. Aiming to increase the reliability of existing algorithms and to optimize their predictions, some studies have proposed the introduction of gene-specific thresholds [14, 15]. To date, computational tool evaluation and optimization have not been undertaken in the context of PAX6 and this study aims to address this gap.

Materials and methods

Dataset collection

In our primary analysis, PAX6 missense variants from publicly available resources were collected from: the Genome Aggregation Database (gnomAD) version 2.1.1 (v2) and version 3.1.1 (v3) (controls/biobanks subsets); the Leiden Open Variation Database (LOVD) version 2.0 and version 3.0; the Human Genetic Mutation Database (HGMD) Public version; and ClinVar (the websites of these resources can be found in the Web Resources section) (all accessed in February 2023). A biomedical literature search (MEDLINE/PubMed) using the term “PAX6” and focusing on articles between 2021 and 2023 was also undertaken [16,17,18,19,20]. We excluded duplicates and VUS (including “likely disease-causing mutation with questionable pathogenicity” (DM?) in HGMD), and then categorized the remaining variants into: “Primary Dataset Neutral” and “Primary Dataset Disease” (Fig. 1).

Fig. 1: Overview of the datasets used in the primary analysis.
figure 1

The utilized resources, selection criteria and filtering steps are outlined. Orange boxes: number of compiled variants; gray boxes: exclusion criteria and number of excluded variants; blue boxes: number of filtered variants based on their pathogenicity; yellow box: number of variants in the two sub-datasets. gnomAD Genome Aggregation Database, LOVD Leiden Open Variation Database, HGMD Human Gene Mutation Database, DM “disease-causing mutation” (as assigned in HGMD); DM?, “likely disease-causing mutation with questionable pathogenicity” (as assigned in HGMD), VUS variants of uncertain significance. All five resources were accessed in February 2023.

Primary Dataset Neutral included: (i) variants previously classified as benign or likely benign and (ii) variants present in gnomAD, a population-scale database that does not include individuals with severe pediatric disease [16]. While it cannot be excluded that certain PAX6 missense variants reported in the gnomAD controls/biobanks cohorts are pathogenic (e.g. if linked with subclinical phenotypes or incomplete penetrance), we adopted a pragmatic approach and considered these changes as “presumed benign”. Although filtering gnomAD variants based on their allele frequency would increase the likelihood of including only truly benign variants, this would reduce the dataset size. Hence, we did not apply such a filter. Primary Dataset Disease included missense variants labeled as pathogenic in ClinVar, LOVD or PubMed and variants labeled as DM in HGMD.

For validation purposes, a secondary analysis was conducted involving PAX6 missense variants from our local database at the Manchester Center for Genomic Medicine (MCGM), part of the North West Genomic Laboratory Hub (accessed in May 2023). These variants correspond to changes that were evaluated in an accredited diagnostics laboratory with >15 years’ experience in assessing genetic alterations from individuals with ophthalmic disorders. All variants were classified according to the ACMG/AMP 2015 guidelines [8] and changes assigned to the “likely pathogenic” and “pathogenic” categories formed the “Secondary Dataset Disease” (Fig. 2). For this replication study, variants present in the BRAVO database (version TOPMed Freeze 8) were collected (accessed in May 2023) and formed “Secondary Dataset Neutral”. Duplicates were excluded, while the detected VUS were used for downstream analysis [21].

Fig. 2: Overview of the datasets used in the secondary analysis.
figure 2

The utilized resources, selection criteria and filtering steps are outlined. Orange box, number of compiled variants; gray box, exclusion criteria and number of excluded variants; blue box, number of filtered and compiled variants based on their pathogenicity; yellow box, number of variants grouped into two sub-datasets and the number of VUS (Variant of Uncertain Significance) collected. MCGM denotes Manchester Center for Genomic Medicine while BRAVO denotes TOPMed (Trans-Omics for Precision Medicine) Data Freeze 8. Both resources were accessed in May 2023.

All variants were numbered based on Genome Reference Consortium Human Build 38 (GRCh38). Variants from gnomAD v2 were lifted over to this reference, using the transcript ENST00000241001 (Ensembl ID), which encodes the canonical PAX6 protein, comprising 422 amino acids (UniProt ID: P26367-1) [22].

Descriptive analysis

The distribution of variants in Primary Dataset Disease, Primary Dataset Neutral, Secondary Dataset Disease and Secondary Dataset Neutral along the linear protein sequence (as retrieved from UniProt) was visualized using a lolliplot diagram. The cBioPortal (version 5.4.5) tool was used to generate the relevant figure (accessed in May 2023) (Fig. 3) [23].

Fig. 3: Distribution of the PAX6 missense variants included in this study.
figure 3

Variant distribution according to their pathogenicity is shown. The X-axis represents the PAX6 canonical protein sequence (422 amino acids), while the Y-axis denotes the number of variants impacting the same residue. A difference can be noted in the clustering patterns of the presumed neutral variants between the primary and secondary datasets (b compared to e). In the latter, the number of variants that are found to be altering amino acids in the DNA-binding domains is greater and this could potentially reflect the more diverse ancestral backgrounds and phenotypic profiles of the individuals included in the BRAVO dataset compared to those included in the gnomAD controls/biobanks cohorts. Green bar, paired domain; red bar, homeodomain. DM?, “likely disease-causing mutation with questionable pathogenicity” (as assigned in the Human Gene Mutation Database [HGMD]); VUS variants of uncertain significance. BRAVO denotes the TOPMed (Trans-Omics for Precision Medicine) Data Freeze 8 while gnomAD denotes Genome Aggregation Database v2 and v3 (controls/biobanks subsets).

Computational tools

Ten commonly used computational prediction tools were assessed: AlphaMissense, BayesDel, CADD, ClinPred, Eigen, MutPred2, Polyphen-2, REVEL, SIFT4G and VEST4 [24,25,26,27,28,29,30,31,32,33]. These tools employ various algorithms to evaluate variant pathogenicity (more information on the utilized approaches can be found in Supplementary Table 1). The dbNSFP (version 4.1) resource was used to obtain pathogenicity scores for each tested variant. As the utilized version of dbNSFP did not include AlphaMissense prediction scores, these were extracted from the AlphaMissense_hg38.tsv.gz file provided in the relevant publication [24].

Depending on how the obtained scores compared to each algorithm’s pre-set threshold (determined by the respective tool’s developers), the studied variants were classified as “predicted pathogenic” or “predicted benign” [34]. Default thresholds were set for CADD and Eigen based on previous studies (although the use of a single, arbitrary threshold is not recommended by the tools’ developers). For AlphaMissense, variants with scores ranging from 0.564 to 1.00 were assigned to the “predicted pathogenic” category (in line with observations in the publication that introduced this tool) [24]; all other variants were assigned to a “predicted benign” group. Higher scores indicated a higher likelihood of a pathogenic prediction for all tools except SIFT4G. In a few cases, a single tool generated multiple scores and we opted for the following: CADD-phred; BayesDel AddAF (incorporates allele frequency data); Eigen raw for coding variants; and the PolyPhen-2 HumVar-trained model (which is suitable for studying Mendelian diseases) [26]. The prediction outputs “deleterious”, “damaging”, “probably damaging”, or “possibly damaging” were considered “predicted pathogenic”, while the terms “tolerated” or “benign” were deemed “predicted benign”.

Performance assessment

Initially, performance parameters were calculated using the PAX6 missense variants included in the primary datasets. We estimated sensitivity, specificity, accuracy, precision (Positive Predictive Value; PPV), and the Matthews Correlation Coefficient (MCC) [35]. To determine the best-performing tool, we used MCC, which ranges from -1 (constant false predictions) to 1 (perfect predictions) with 0 indicating random predictions.

We hypothesized that using an optimized, gene-specific threshold can improve the performance of each tool. Receiver Operating Characteristic (ROC) curves were utilized to identify the threshold that yielded the highest MCC score for each tool. This was achieved by iteratively adjusting the threshold and calculating the corresponding MCC score until the optimal value was identified. The quality of the prediction obtained using the optimized threshold was then compared to that obtained using the default threshold. The IBM SPSS (Version 25.0) [36] software was used for these analyses.

Subsequently, we explored if the analytical performance could be further improved by combining the three tools with the highest MCC scores into a custom meta-predictor. We adopted the “majority rule” method (agreement of over 50% of the employed tools), which involved classifying a variant as “predicted pathogenic” if it received a “predicted pathogenic” score in at least two out of the three selected tools.

Validation and evaluation

The findings for the tool with the highest MCC score were validated using a fivefold cross-validation approach (similar to that previously described by Tang et al. [11]). Briefly, this involved randomly dividing variants into five subsets of equal size, four of which (80%) formed the training set, while the remaining subset (20%) served as the test set. Within the training set, the optimized threshold that maximized the MCC was determined. The obtained threshold was then applied to assess performance on the testing set. This process was repeated five times until all subsets were utilized as the testing set. The resulting analytical pipeline was then evaluated on a secondary dataset and was used to assess a set of variants that were previously classified as VUS.

Results

PAX6 variant datasets

Our primary analysis included a total of 241 variants from publicly available databases. Using pre-determined criteria (see Methods) these were split into two groups: Primary Dataset Disease (n = 167) and Primary Dataset Neutral (n = 74) (Fig. 1). For the secondary analysis, we collected 17 unique variants from our local database, consisting of seven that were classed as VUS and 10 classed as pathogenic (Secondary Dataset Disease). We supplemented these with 65 presumed benign variants from the BRAVO resource (Secondary Dataset Neutral) (Fig. 2). All missense variants included in the primary and secondary analyses are shown in Supplementary Table 2.

Descriptive analysis

When the distribution of the studied variants was mapped, presumed pathogenic changes tended to cluster around the two DNA-binding protein domains of PAX6: the Paired Domain (PD) and the HomeoDomain (HD). Conversely, presumed benign variants were more likely to affect residues outside these domains. VUS did not show a clear clustering pattern (Fig. 3).

Performance of computational tools

The predictive performance of ten tools was evaluated. When the performance metrics were calculated using the default threshold set by the tools’ developers, considerable variability was noted (Table 1a). Most tools exhibited high sensitivity (exceeding 88%) but had low specificity scores (with the latter being in keeping with the findings of previous studies, e.g. [10, 11, 37, 38]). SIFT4G and AlphaMissense achieved specificity scores of 88% and 81%, respectively. In contrast, other tools showed specificities below 70%, with CADD, BayesDel and VEST4 scoring the lowest at 12%, 14% and 19%, respectively. The other metrics, such as accuracy and PPV, ranged from 72% to 88% and 71% to 94%, respectively. The MCC scores ranged from 0.22 to 0.74, with the top-three tools attaining the highest scores being SIFT4G at 0.74, followed by AlphaMissense at 0.72 and MutPred2 at 0.62.

Table 1 Performance of the computational tools assessed in this study (in tasks involving PAX6 missense variant evaluation).

Improving performance through threshold optimization

Aiming to obtain gene-specific thresholds tailored to PAX6, we performed ROC curve analysis and determined the value that achieved the maximum MCC score for each tool (see Supplementary Fig. 1). The default thresholds were generally lower compared to the optimized thresholds (Table 2b), except for SIFT4G (which, unlike the other tools, assigns lower scores to variants with a higher likelihood of being predicted as pathogenic). Following threshold optimization, all the performance parameters of the tools showed improvement, with a notable increase in specificity scores. At the optimized threshold, AlphaMissense achieved the highest MCC score of 0.81, succeeded by SIFT4G and REVEL at 0.77 (Table 2b).

Table 2 Fivefold cross validation results showing the performance of the AlphaMissense tool (in tasks involving PAX6 missense variant evaluation).

Performance of combination of tools

We assessed if the predictive performance could be further improved by combining multiple tools. A combination of the top-three tools (AlphaMissense, SIFT4G and REVEL) with optimized thresholds, demonstrated an MCC score of 0.78, with a sensitivity of 87% and accuracy of 90%. These results outperformed those obtained by combining the predictions of SIFT4G and AlphaMissense or REVEL and AlphaMissense but the MCC score was lower than the combination of SIFT4G and REVEL (Supplementary Table 3). Interestingly, the MCC score of AlphaMissense alone (following threshold optimization) was higher (0.81) than the MCC score of all combined approaches.

Validation and further evaluation

To assess the reliability of the results of our primary analysis (concerning AlphaMissense), we conducted further studies using a fivefold cross-validation approach. The findings confirmed the robustness of AlphaMissense (with the threshold optimization) in predicting the effect of PAX6 variants (Table 2).

Further evaluation using a different set of variants (secondary dataset) confirmed (i) that AlphaMissense and SIFT4G are among the higher-ranking tools; and (ii) that gene-specific thresholds lead to enhanced predictive performance (Table 3). It is worth noting that, except for sensitivity, the values in the secondary analysis were lower than those obtained in the primary analysis. This difference is likely to be influenced by the varying proportion of presumed benign and presumed pathogenic variants between the corresponding primary and secondary datasets.

Table 3 Performance of the computational tools assessed in this study (in tasks involving PAX6 missense variant evaluation): secondary analysis.

Lastly, a set of seven VUS from our local database were analyzed. Among these variants, six were consistently classified as pathogenic by all the ten tools investigated. However, one variant, PAX6 c.926 T > G, p.(Phe309Cys), showed discordant predictions (see Supplementary Table 2b) with AlphaMissense and SIFT4G labelling this variant as predicted benign (with scores of 0.1654 and 0.16, respectively). Notably, PAX6 c.926 T > G, p.(Phe309Cys), affects a residue in the C-terminal region, whereas the other six variants alter residues in one of the PAX6 DNA-binding domains (PD or HD).

Discussion

We assessed the performance of ten commonly used variant prediction tools in the context of missense variants in a highly-conserved gene, PAX6. Using default settings, most tools were able to make reliable predictions in relation to pathogenic variants. However, their ability to correctly predict benign variants was limited (i.e., there was high sensitivity but low specificity). These results are consistent with those from previous studies conducted on a genome-wide or an individual gene level [10, 11, 13, 37,38,39]. By generating optimized, gene-specific thresholds for each tool, it was possible to achieve improved performance compared to conventional approaches.

When default thresholds were used, SIFT4G, AlphaMissense and MutPred2 were found to be the top-ranking algorithms (i.e., had the highest MCC scores). Following threshold optimization, AlphaMissense emerged as the best performing tool with the highest MCC score, followed by SIFT4G and REVEL, while MutPred2 shifted to the fourth position. AlphaMissense uses a deep learning model that builds on the protein structure prediction tool AlphaFold2 [24]. SIFT4G evaluates the impact of amino acid substitutions based on evolutionary conservation and sequence homology, aligning well with the highly-conserved nature of the PAX6 gene [25, 40]. MutPred2 also incorporates a conservation-based approach along with other features. It is noted that MutPred2 was previously found to have good performance in prediction tasks involving variants in PITX2, a paired-like homeodomain transcription factor that is also expressed in the developing eye [41]. REVEL emerged as the best meta-predictor in the context of PAX6; this was unsurprising as its superior performance over other ensemble tools has previously been demonstrated [37, 42,43,44].

Our findings support the use of gene-specific thresholds, as opposed to relying on default settings [45]. Even REVEL, one of the highest performing tools, had a specificity of 47% (misclassifying 39 out of 74 presumed benign missense variants) with the default threshold. This issue arises due to the training process of the tools, where variants from multiple genes are used. This default approach allows for the possibility of underfitting, where crucial details necessary to capture the characteristics of an individual gene are overlooked. It is noted that, upon applying optimized thresholds, all tools demonstrated substantial improvement, particularly in specificity (Table 2). This observation is consistent with the findings of other studies looking at different genes [11, 13].

We attempted to combine the predictions of the top-three performing tools (following threshold optimization) using the majority rule method. The results demonstrated good performance, with most of the parameters surpassing 84% and the MCC ranging from 0.76 to 0.79 (Supplementary Table 3). However, the use of AlphaMissense alone outperformed this approach (Table 1b). The high performance of this tool was confirmed through a fivefold cross-validation experiment and in the secondary dataset (Table 3). To a degree, our findings contradict the observations of similar studies. For instance, Leong et al. found that the best performance for predicting KCNQ1 variant pathogenicity was achieved by considering three out of the five tools that were examined [12]. Likewise, Tang et al. reported achieving optimal performance in the context of SCN1A variants when combining the three best-performing tools [11]. Conversely, our findings align with those of a study by Gunning et al. which supported the adoption of a single tool instead of using a consensus-based approach [42].

Using AlphaMissense to evaluate seven PAX6 missense variants that have been previously classified as VUS resolved some of the discordance for one change, c.926 T > G, p.(Phe309Cys), by suggesting that it does not have an effect on molecular function. This variant, unlike most PAX6 pathogenic missense changes, affects a residue outside the DNA-binding domains [46]. This result could potentially be attributed to AlphaMissense’s ability to pinpoint functionally crucial sites (instead of simply evaluating the overall evolutionary conservation of a protein) [24]. It is noted that a few recent studies have shown that AlphaMissense can reliably classify subsets of variants that are known to affect molecular function [47,48,49].

The present study has several limitations, including the availability of a relatively small number of presumed pathogenic variants due to the rarity of PAX6-related disease. Additionally, we were unable to exclude the possibility that some of the studied genetic variants may have been utilized for training some of the evaluated tools. Notably, it is possible that some of the PAX6 missense variants that were presumed to be neutral/benign in this study (e.g. due to their presence in the gnomAD controls/biobanks datasets) may have been miscategorized and could in fact be associated with overlooked phenotypes or incomplete penetrance. To evaluate the robustness of the findings, we modeled this potential issue by repeating the analyses using an intentionally contaminated variant dataset. The main results of the study could be replicated in this context (Supplementary Table 4). Finally, it is noted that we did not (i) consider all mechanistic consequences of missense events, (ii) seek to exclude exonic splice variants from the core datasets, (iii) combine conventional missense impact prediction methods with methods that evaluate other mechanisms of genetic variant impact (e.g. splicing or gene expression). Future studies could explore the performance of a wider range of computational approaches, including tools considering splicing and/or the 3D-structure of the protein, and algorithms using advanced artificial neural network approaches.

It is highlighted that variant pathogenicity predictors constitute one of the many pieces of evidence that can be used to evaluate the effect of genetic alterations. It is crucial to consider other factors (including segregation analysis, population frequency and the outcomes of functional assays) [50]. Refinement of the ACMG/AMP sequence variant guidelines (and utilization of Bayesian approaches) is expected to provide an enhanced framework that would help generate robust estimates by improving how different lines of evidence are combined.

Conclusion

In summary, this study offers insights into how computational prediction tools can be optimally used for the task of PAX6 missense variant evaluation. The best-performing approach, which involves using a PAX6-specific threshold for AlphaMissense, can be utilized in different contexts and has the potential to enhance variant interpretation, ultimately leading to more precise and timely diagnoses for individuals with PAX6-related disorders.

Main web resources