Introduction

While gene-specific and gene panel genetic testing for rare and common diseases is readily accessible in most economically stable countries, this is not the case for Africa. As such, genetic screening for public health intervention is lacking across the continent. A literature review of cancer genetic screening within Sub-Saharan Africa revealed that only 15 (31.25%) of 48 countries were represented in published genetic studies, with the vast majority limited to BRCA1 and BRCA2, the known breast cancer genes1. Even within ethnically diverse countries, genetic testing for the most common diseases is tailored for people of European ancestry2, including for prostate cancer (PCa), the disease example used in this study. Specifically, men of African ancestry are at the greatest risk of presenting at a younger age and with more aggressive pathology, including associated mortality, compared with men of non-African descent3,4. Yet, there is currently no consensus on PCa germline testing for Africans5. The reason, as with many other diseases, is a lack of sufficient data6. In addressing the health equity gap for PCa, and thereby contributing to improving the region’s health metrics7, we urge caution in viewing Africa and Africans in singular terms8, given the continent’s more than 2,000 representative language groups and its position at the epicentre of human genetic diversity9,10.

In 2015, the American College of Medical Genetics and Genomics (ACMG) and the Association for Molecular Pathology (AMP) proposed a panel of 28 assessment criteria to categorise genetic variants as pathogenic, likely pathogenic, uncertain significance, likely benign or benign11. These criteria are based on population, computational/predictive, functional, and segregation data, and include data from databases (i.e. sequence, population and variant/disease-specific databases), as well as patient information such as family history. However, the ACMG-AMP guidelines come with their own set of reproducibility challenges, ranging from the gathering of information and the criteria of interpretation to the algorithms, cut-offs and databases used. Addressing these challenges, the Clinical Interpretation of Genetic Variants (InterVar) tool was created, which, based on 18 of the ACMG-AMP criteria, aims to minimise human error and, as such, enable clinical interpretation12. In contrast, the National Center for Biotechnology Information variant database ClinVar13 relies on tested and validated functional effects of variants from external submitters for the determination of pathogenicity, providing classification as [likely] pathogenic, [likely] benign or uncertain significance14. However, both the ACMG-AMP guideline and database-classified pathogenicity prediction approaches are constrained by the limited availability of African-derived genomic resources and associated variant classification, which calls into question the suitability of established prediction criteria and hampers database inclusion15. This was evidenced most recently when interrogating 20 genes contributing to current PCa germline testing panels16. Compared with non-Africans, we found Southern African men, who are at 3-fold greater risk for aggressive disease presentation17, to be 2-fold less likely to present with a known pathogenic variant (PV) and to carry an elevated number of coding variants of unknown significance (VUS)16. Given that experimental interpretation of variant disease pathogenicity is tedious and time-consuming, it is critical to evaluate the applicability of in silico variant pathogenicity prediction tools (VPPTs) for evaluating the functionality of African-relevant VUS.

Here, having access to published African-vs-European ancestral whole genome sequenced variant data, generated and called using a single technical and analytical pipeline for men meeting National Comprehensive Cancer Network (NCCN) criteria for PCa germline testing18, we test 54 VPPT algorithms19,20. Focusing on pathologically matched and genetically non-admixed patients, balanced for African (n = 50), specifically Southern African, and European (n = 50), largely Australian, ancestries, we provide a thorough evaluation of VPPT performance for all whole genome germline rare variants (minor allele frequency [MAF] < 1%) and make ancestry-specific VPPT-derived workflow recommendations, with testing showing an increase in VUS pathogenic prediction for African patients.

Methods

Inclusion and ethics statement

The data presented in this study have been made available by the Southern African Prostate Cancer Study (SAPCS) Data Access Committee and with study approval granted by the Garvan/St Vincent’s Prostate Cancer Biobank. Patients were recruited adhering to the principles of the Helsinki Declaration, providing informed consent as stipulated by approvals granted from the University of Pretoria Faculty of Health Sciences Research Ethics Committee (HREC) in South Africa (including US Federal wide assurance FWA00002567 and IRB00002235 IORG0001762; #43/2010) and St Vincent’s HREC in Australia (#SVH/12/231), with approval for genomic interrogation provided by the St. Vincent’s HREC (#SVH/15/227) and Human Research Protection Office of the US Army Medical Research and Development Command E02371 (TARGET Africa) and E03280 (HEROIC PCaPH Africa1K).

Germline variant data source and patient inclusion criteria

Whole genome variant data were derived from deep sequenced (average 46X coverage, range: 30–97X) whole germline genomes from an African-vs-European ancestral aggressive presenting PCa resource made accessible via the European Genome-Phenome Archive (EGA; https://ega-archive.org) and including men of Southern African (EGAD00001009067) and European ancestry (EGAD00001009066)18. Noting NCCN guidelines, which advise germline genetic testing for men with metastatic, recurrent or high-risk localised PCa, defined as Gleason score derived International Society of Urological Pathology (ISUP) group grading 4 and 5, and/or prostate-specific antigen (PSA) > 20 ng/mL, regardless of family history21, patient selection was biased towards ISUP defined advanced disease presentation and balanced for ethnic representation (Table 1). Representing the Southern African population identifier, our study included 50 Black South Africans (100% ISUP 4–5), while representing the European identifier, our study included 49 Australians and 1 White South African (98% ISUP 4–5, with a single European man presenting with ISUP 3). Notably, PSA was excluded as an inclusion criterion as a consequence of the elevated overall PSA levels at presentation in our African over European men (mean 448.3 ng/mL, 22-fold greater than the NCCN European-based guideline threshold for aggressive disease, versus 12.8 ng/mL, p = 0.001). Elevated PSA levels concur with previous reports for men with and without PCa in the SAPCS3,17. Additionally, Black South African men presented at diagnosis over 6 years later than our European Australian patients at surgery, which has largely been attributed to regional differences in PSA screening practices. All sequenced data were generated and processed using a single technical and analytical pipeline, allowing for direct comparative analyses, while patient ancestries were verified using 7,472,833 markers across the genome and fastSTRUCTURE v1.0 population substructure analysis22, allowing for selection of patients representing exclusively African or European genetic ancestries.

Table 1 Demographic and clinicopathological characteristics and benchmark (positive and negative) datasets by genetic ancestry

Benchmark positive and negative datasets for VPPT testing

Single-nucleotide variants (SNVs) were extracted for our 50 African (n = 19,045,878) and 50 European ancestral advanced PCa patients (n = 11,811,487), representing 4,465,388 and 3,752,976 rare variants (MAF < 1%), respectively. Rare variants were further annotated using ANNOVAR23 and dbNSFP (v4.7)24 to establish ClinVar classifications (version: 20240611)13, while InterVar (version: 2021-08, https://github.com/WGLab/InterVar)12 was run to establish the ACMG-AMP guideline pathogenic predictions. For each ancestry, positive and negative datasets were generated using known variant ClinVar classifications, defined as Pathogenic or Likely pathogenic (PVs) and Benign or Likely benign (benign variants, BVs), respectively, and using the corresponding InterVar ACMG-AMP guideline-driven predictions (Table 1). After removing overlapping variants, the merged dataset comprised 158 African- and 202 European-derived PVs in the positive set and 41,045 and 103,886 BVs in the negative set, respectively.
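The benchmark construction described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the record layout (`id`, `maf`, `clinvar`, `intervar`), the label strings and the handling of conflicting labels are all hypothetical, and "removing overlapping variants" is interpreted here simply as counting each variant once in the merged sets.

```python
# Assumed label strings; real annotation files may encode these differently.
PATHOGENIC = {"pathogenic", "likely pathogenic"}
BENIGN = {"benign", "likely benign"}


def benchmark_label(classification):
    """Map a classification string to 'PV', 'BV' or None (anything else)."""
    if classification is None:
        return None
    text = classification.replace("_", " ").strip().lower()
    if text in PATHOGENIC:
        return "PV"
    if text in BENIGN:
        return "BV"
    return None


def build_benchmark(variants, maf_cutoff=0.01):
    """Return (positive, negative) sets of variant ids for rare variants."""
    positive, negative = set(), set()
    for v in variants:
        if v["maf"] >= maf_cutoff:  # keep rare variants only (MAF < 1%)
            continue
        labels = {benchmark_label(v.get("clinvar")),
                  benchmark_label(v.get("intervar"))} - {None}
        if labels == {"PV"}:
            positive.add(v["id"])  # sets count a shared variant only once
        elif labels == {"BV"}:
            negative.add(v["id"])
        # Assumption: variants with conflicting PV/BV labels are left out.
    return positive, negative


# Hypothetical toy records, for illustration only.
toy_variants = [
    {"id": "var1", "maf": 0.001, "clinvar": "Pathogenic", "intervar": "Likely pathogenic"},
    {"id": "var2", "maf": 0.002, "clinvar": None, "intervar": "Benign"},
    {"id": "var3", "maf": 0.050, "clinvar": "Benign", "intervar": "Benign"},
    {"id": "var4", "maf": 0.004, "clinvar": "Uncertain significance", "intervar": None},
    {"id": "var5", "maf": 0.003, "clinvar": "Likely_benign", "intervar": "Pathogenic"},
]
positive_set, negative_set = build_benchmark(toy_variants)
```

In this toy input, only `var1` enters the positive set and `var2` the negative set; the common variant, the uncertain variant and the conflicting variant are excluded.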

Predicting VPPT performance criteria

The 54 VPPTs included for prediction performance assessment were broadly categorised by method (Fig. 1) and further defined by their year of development and method of prediction (Table 2). The categories were: (i) multiple sequence alignment (MSA), the earliest prediction method, developed at the completion of the Human Genome Project and used to evaluate (interspecific) evolutionary conservation; (ii) protein structural/functional parameters, which evaluate how variants affect proteins’ physical structure and in turn their functions; (iii) supervised machine learning (ML), introduced in the early 2010s, which uses positively and negatively labelled training datasets for variant prediction; (iv) unsupervised ML, which does not rely on labelled data; (v) deep learning (DL), a subset of ML that is powered by artificial intelligence to automate feature extraction and as such requires a larger training dataset and more complex computation; and lastly (vi) meta-predictors, which integrate features and prediction scores from multiple VPPTs, often combined with their in-house ML methods. Prediction scores and results were similarly derived from dbNSFP v4.7 via ANNOVAR23. If available, the directly predicted classification of variant pathogenicity was used (e.g. T for Tolerant, D for Deleterious), while default/recommended numeric cut-off scores were applied to other tools.
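The rule for converting tool output into a binary call (use the direct classification if present, otherwise apply a numeric cut-off) can be illustrated as follows. This is a sketch only: the function name, the `higher_is_deleterious` flag and the example cut-off are hypothetical, and treating only `D` as deleterious is an assumption, since individual tools may emit additional categories.

```python
def call_deleterious(direct_call=None, score=None, cutoff=None,
                     higher_is_deleterious=True):
    """Return True (deleterious), False (tolerated/benign) or None (no call).

    A direct categorical call takes priority; otherwise the numeric score
    is compared against a tool-specific cut-off supplied by the caller.
    """
    if direct_call is not None:
        # Assumption: only 'D' counts as deleterious; 'T' and others do not.
        return direct_call == "D"
    if score is None or cutoff is None:
        return None  # tool produced no usable output for this variant
    if higher_is_deleterious:
        return score >= cutoff
    # Some tools score in the opposite direction (lower = more damaging).
    return score <= cutoff


# Hypothetical usage; 0.5 is a placeholder, not any tool's recommended cut-off.
from_label = call_deleterious(direct_call="D")
from_score = call_deleterious(score=0.7, cutoff=0.5)
```

Keeping the score direction as an explicit parameter avoids silently inverting calls for tools whose lower scores indicate greater predicted damage.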

Fig. 1: 35-year timeline depicting major historical events in the development of in silico variant pathogenicity prediction tools (VPPTs).

Major historical events shaping VPPT development, grouped by timeline: 1990–2000 (Genomic Era Begins), 2000–2010 (First Generation of VPPTs), 2010–2015 (Early Machine Learning) and 2015–current (Contemporary Machine Learning).

Table 2 Summary of VPPTs tested in the study

The predictive power of the 54 VPPTs for African and European variant data was assessed using standard performance metrics, including sensitivity (true positive rate), specificity (true negative rate), false positive rate (FPR) and false negative rate (FNR). True positives (TP) were defined as PVs predicted as [likely] deleterious and true negatives (TN) as BVs predicted as benign. False positives (FP) were defined as BVs predicted as [likely] deleterious, while false negatives (FN) were PVs predicted as benign. Additionally, the Matthews Correlation Coefficient (MCC) was calculated for all VPPTs using Eq. 1.

$$\begin{array}{c}
{Sensitivity}=\frac{{TP}}{{TP}+{FN}}\\
{Specificity}=\frac{{TN}}{{TN}+{FP}}\\
{FPR}=\frac{{FP}}{{TN}+{FP}}\\
{FNR}=\frac{{FN}}{{TP}+{FN}}\\
{MCC}=\frac{{TP}\times {TN}-{FP}\times {FN}}{\sqrt{({TP}+{FP})\times ({TP}+{FN})\times ({TN}+{FP})\times ({TN}+{FN})}}
\end{array}$$
(1)

R Base Functions25 and RStudio26 (version: 2023.6.1.524) were used for data analysis and formula calculations.
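For readers wishing to reproduce the metrics in Eq. 1, a minimal Python sketch is given here (the study itself used R). The function name and the example counts are hypothetical; the formulas follow Eq. 1 term by term, with undefined ratios returned as NaN rather than raising an error.

```python
import math


def performance_metrics(tp, fn, tn, fp):
    """Compute sensitivity, specificity, FPR, FNR and MCC from counts."""

    def ratio(numerator, denominator):
        # Return NaN when a metric is undefined (zero denominator).
        return numerator / denominator if denominator else float("nan")

    mcc_denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "fpr": ratio(fp, tn + fp),
        "fnr": ratio(fn, tp + fn),
        "mcc": ratio(tp * tn - fp * fn, mcc_denominator),
    }


# Hypothetical counts for one tool on one benchmark dataset.
example = performance_metrics(tp=80, fn=20, tn=900, fp=100)
```

With these illustrative counts, sensitivity is 0.8 and FPR is 0.1. Because the negative set is far larger than the positive set, as in the benchmark datasets described above, MCC is a useful complement to sensitivity and specificity alone.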

After exclusion of known PVs and BVs, VUS were further defined using a data-driven workflow for potentially deleterious variant (PDV) identification (Fig. 2). To further define PDV oncogenic potential, we applied the Cancer Genome Interpreter (CGI) (version: March 2024)27, with additional inclusion of predicted Loss of Function (pLoF) variants retrieved from Ensembl’s Variant Effect Predictor (release 112)28.

Fig. 2: Pathogenic prediction workflow for whole genome single-nucleotide variant (SNV) interrogation for 50 African (orange) and 50 European (blue) clinically matched advanced prostate cancer cases.

Rare SNVs were further defined as pathogenic variants (PVs) or benign variants (BVs) to establish the positive and negative benchmark variant datasets, respectively. Variants of unknown significance (VUS) were further interrogated for classification as potentially deleterious variants (PDVs) using ancestral-specific 10-VPPT criteria, with ancestral-unique VPPTs in bold, and with further cancer-specific classification as potentially oncogenic variants (POVs), based on interrogation for predicted Loss of Function (pLoF) and Cancer Genome Interpreter (CGI) oncogenic status.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Results

African-vs-European ancestral assessment of VPPTs using the benchmark datasets

While the 50 African PCa cases presented with 1.6- and 1.2-fold more SNVs and rare variants, respectively, we found the 50 European PCa cases to present with 2.5- and 1.4-fold more known ClinVar-classified positive PVs and negative BVs, respectively (Table 1). Although ACMG-AMP guidelines (using InterVar) provided a better balance between the ancestries, we observed a 1.1-fold increase in positive PVs called for European men, with this variance widening to 2.8-fold for negative BVs. Notably, across all 54 VPPTs, mean sensitivity was slightly higher for European variants (0.68 vs 0.70, p = 0.026) in the merged ClinVar and InterVar benchmark datasets (Supplementary Table 1). Irrespective of ancestry (African vs European), VPPTs with the greatest sensitivity included M-CAP (0.98 vs 0.96), GERP-NR (0.92 vs 0.94) and CADD (0.91 vs 0.97), while GERP-NR had the highest FPR (0.81 vs 0.80). The greatest sensitivity differences between the ancestries included ClinPred favouring African variants (0.67 vs 0.48), while VARITY-ER-LOO performed better with European variants (0.63 vs 0.49). However, taken together, the 54 VPPTs performed slightly better with African than European variants (mean MCC 0.19 vs 0.17, p = 8.57E-06). VPPTs with top overall performance included BayesDel-addAF, MetaRNN, ClinPred, LINSIGHT and BayesDel-noAF, with only LINSIGHT performing better in the European over African cohort (0.52 vs 0.46). The raw number of positive and negative variants predicted is available in Supplementary Table 2.
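The 1.6- and 1.2-fold differences quoted for SNVs and rare variants follow directly from the counts reported in the Methods, as the short check below illustrates. The variable names are illustrative; the ClinVar and InterVar fold differences depend on Table 1 counts and are not recomputed here.

```python
# Variant counts as reported in the Methods (per ancestral cohort of 50).
total_snvs = {"African": 19_045_878, "European": 11_811_487}
rare_snvs = {"African": 4_465_388, "European": 3_752_976}


def fold_difference(counts, numerator="African", denominator="European"):
    """Ratio of one cohort's count over the other, rounded to 1 decimal."""
    return round(counts[numerator] / counts[denominator], 1)


snv_fold = fold_difference(total_snvs)   # African over European, all SNVs
rare_fold = fold_difference(rare_snvs)   # African over European, rare SNVs
```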

Given the lack of pathogenic representation for African-derived variants within ClinVar, and in turn the bias of our benchmark dataset towards European-derived variants of known pathogenicity (both positive and negative), we sought to independently test VPPT performance by ancestry for ClinVar-classified and InterVar ACMG-AMP guideline-predicted variants. For known ClinVar variants, we found predictive performance to be similar between our African and European cohorts (MCC 0.13 vs 0.14, p = 3.19E-02) (Supplementary Table 3), while sensitivity was marginally higher for African variants (0.77 vs 0.74, p = 0.063). Independent of ancestry, top-performing tools overall included MetaRNN (0.61 African vs 0.51 European), ClinPred (0.60 vs 0.49), BayesDel-addAF (0.54 vs 0.50), BayesDel-noAF (0.28 vs 0.28) and REVEL (0.25 vs 0.26). The raw number of positive and negative variants predicted is available in Supplementary Table 4. In contrast and as predicted, the sensitivity for African pathogenicity prediction declined when restricting our analyses to the InterVar dataset (Table 3, Supplementary Fig. 1), significantly increasing the ancestry gap (0.66 vs 0.71, p = 9.86E-06). Only 13 VPPTs (24.1%, 13/54) showed higher sensitivity in our African versus European data, compared to 16 VPPTs (29.6%, 16/54) in the merged benchmark dataset. For Africans, the top five tools with the highest sensitivity in descending order included M-CAP, MetaSVM, MetaLR, GERP-NR and phastCons470way-mammalian, while for Europeans these included M-CAP, CADD, GERP-NR, Eigen-raw and MutationTaster. Notably, GERP-NR accounted for the highest FPR in both the African and European cohorts (0.78 vs 0.77). VPPTs that favoured European variant prediction for sensitivity and showed the greatest ancestral disparity included VARITY-ER-LOO (0.45 vs 0.65), MutationAssessor (0.69 vs 0.87) and PROVEAN (0.62 vs 0.80). As to overall performance, a slightly higher MCC score was still observed for the African dataset (0.23 vs 0.20, p = 7.91E-05). Irrespective of ancestry, the top-performing tools included BayesDel-addAF (0.74 vs 0.73), LINSIGHT (0.50 vs 0.67), MetaRNN (0.67 vs 0.64), ClinPred (0.61 vs 0.62), BayesDel-noAF (0.52 vs 0.46) and MetaSVM (0.51 vs 0.38); these tools were also among the VPPTs with the lowest FPR, with the exception of LINSIGHT (0.30 vs 0.20). The raw number of positive and negative variants predicted is available in Supplementary Table 5.

Table 3 Performance of 54 VPPTs for ancestry-specific African (AFR) vs European (EUR) advanced prostate cancer benchmark datasets classified by InterVar (ACMG-AMP guidelines) and ordered in descending order of sensitivity (Sen) for our African ancestral patients

African-vs-European ancestral assessment for top-performing VPPTs

Based on the MCC scores of the 54 VPPTs tested with African and European variant datasets, we proposed and tested ancestrally independent pathogenic prediction workflows, with additional relevance for calling cancer-specific pathogenicity (Fig. 2). Applied to our data, after exclusion of PVs and BVs, including exclusion of rare variants with a reported MAF > 1% across the globally representative gnomAD v4.1 database, a total of roughly 5.3 million VUS (3,479,179 African and 1,818,267 European) remained. Based on VPPT performance derived from our InterVar ACMG-AMP guideline benchmarking dataset, and excluding VPPTs with FPR > 0.3, we selected the top 10 performing VPPTs by ancestry, proposing ancestry-specific 10-VPPT PDV detection criteria. While MetaSVM, CADD, Eigen-raw, BayesDel-noAF, phyloP100way-vertebrate and MVP are shared between the ancestral workflows, MutationTaster, DANN, LRT and GERP-RS are African-specific and MutationAssessor, PROVEAN, LIST-S2 and REVEL are European-specific. The European-specific VPPTs outperformed for sensitivity between 1.13- and 1.29-fold over African data. Irrespective of patient ancestry, we further refined for PDVs through the inclusion of stop-gain and splicing variants as defined by ACMG-AMP guidelines11, classifying 13,269 African and 9,427 European-derived VUS as potentially deleterious. While overall this represents a greater percentage of European (0.52%) over African (0.38%) VUS, as a proportion of the total number of rare genome-wide SNVs by ancestry, PDVs represent a 1.18-fold increase for our African data.
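The selection step (exclude VPPTs with FPR > 0.3, then keep the ten best performers per ancestry) can be sketched as a simple filter-and-rank operation. This is an illustration under assumptions rather than the exact procedure used: the ranking metric is left as a parameter because it is not restated here, and the tool names and values in the example are hypothetical.

```python
def select_top_vppts(performance, max_fpr=0.3, n=10, rank_by="sensitivity"):
    """Return up to n tool names with FPR <= max_fpr, best-ranked first.

    performance maps tool name -> dict of metrics for one ancestry.
    rank_by is an assumption; any metric present in the dicts can be used.
    """
    eligible = [name for name, metrics in performance.items()
                if metrics["fpr"] <= max_fpr]
    eligible.sort(key=lambda name: performance[name][rank_by], reverse=True)
    return eligible[:n]


# Hypothetical metrics for four placeholder tools in one ancestry.
toy_performance = {
    "tool_a": {"sensitivity": 0.98, "fpr": 0.75},  # sensitive but too many FPs
    "tool_b": {"sensitivity": 0.90, "fpr": 0.25},
    "tool_c": {"sensitivity": 0.85, "fpr": 0.10},
    "tool_d": {"sensitivity": 0.60, "fpr": 0.05},
}
top_two = select_top_vppts(toy_performance, n=2)
```

Running the same function separately on African and European performance tables would yield two lists whose intersection corresponds to the shared VPPTs and whose differences correspond to the ancestry-specific ones.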

To provide further verification for our African versus European 10-VPPT PDV prediction, we assessed each dataset using the alternative ancestral workflow. While the number of identified PDVs decreased for our African cohort by 2.5% (338 variants), the PDV count increased by 1.3% (120 variants) for the European data. Assuming these differences are largely driven by the four ancestrally unique VPPTs, further targeted assessment for these VPPTs using the alternative ancestral workflow showed a major decline of 19.6% (2,367 PDVs) for our African data and an increase of 19.5% (1,435 PDVs) for our European data. Additionally, we observed a 1.2-fold increase in the FPR between the African and the European 10-VPPTs for our European data (0.2 vs 0.17) and a further 1.5-fold increase between the African and the European unique 4-VPPTs (0.25 vs 0.17). Assuming that VPPTs largely favour European-derived variant prediction, we focussed on the most commonly used VPPTs20, namely PolyPhen2, SIFT, CADD and MutationTaster, with PolyPhen2, SIFT and MutationTaster further highlighted in the ACMG-AMP guidelines11. Using the African ancestral InterVar ACMG-AMP guideline benchmark dataset, further PolyPhen2 and SIFT tool-specific selection was based on the highest sensitivity with FPR < 0.3, identifying SIFT-4G and PolyPhen2-HVAR. Compared to the number of PDVs predicted using these four most common VPPTs, our ancestral-specific workflows increased the number of African predicted PDVs by 1.2% (159 PDVs) and only minimally decreased the European predicted PDVs by 0.15% (14 PDVs).

Assessment of ancestry-specific 10-VPPT workflows to predict oncogenicity

To further evaluate the ability of our 10-VPPT ancestry-specific criteria to improve cancer-specific studies, we propose additional filtering steps to refine PDVs as potentially oncogenic variants (POVs). Here we utilise the prediction of pLoF variants, as well as the identification of oncogenic genetic alterations using the CGI tool27. Classically, LoF variants are regarded as gene-disrupting alterations that result in a premature stop codon during protein translation29, while oncogenic variants often occur upstream and cause tumour suppressor inactivation30. We identified 23 African and 9 European pLoF variants, none of which overlapped with CGI predictions, giving a merged total of 234 and 160 POVs, respectively. Notably, our POVs increased the per capita ClinVar/InterVar PVs (Number of PVs/Number of patients) from 3.16 for Africans and 4.04 for Europeans to 7.84 and 7.24, respectively. As a total number of genome-wide SNVs, the gap between POV prediction equated to a 1.1-fold increase for African over European data.
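The per capita values can be reproduced from the counts given in the text, on the assumption (consistent with the reported figures) that POVs are added to the known benchmark PVs before dividing by the 50 patients per ancestry. The names below are illustrative.

```python
PATIENTS_PER_ANCESTRY = 50

# Counts as reported: known benchmark PVs and newly predicted POVs.
known_pvs = {"African": 158, "European": 202}
povs = {"African": 234, "European": 160}


def per_capita(count, patients=PATIENTS_PER_ANCESTRY):
    """Number of variants per patient, rounded to two decimals."""
    return round(count / patients, 2)


# (before, after) per capita values for each ancestry.
per_capita_summary = {
    ancestry: (per_capita(known_pvs[ancestry]),
               per_capita(known_pvs[ancestry] + povs[ancestry]))
    for ancestry in known_pvs
}
```

This gives 3.16 to 7.84 for the African cohort and 4.04 to 7.24 for the European cohort, matching the values reported above.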

Discussion

Although VPPTs have been widely utilised and evaluated for specific diseases31 and variant types32,33, most VPPT comparison studies have excluded African ancestral data34,35,36. Focusing on an African-vs-European ancestral whole genome comparative PCa cohort including men of Southern African versus European ancestries, our study highlights the need to ensure that VPPTs are appropriate for African inclusion, as we demonstrate the inadequacies of current pathogenic databases and prediction guidelines. Importantly, our patients were clinicopathologically matched and cases with genetic admixture were excluded, while the use of a single data generation and variant calling pipeline avoided the otherwise inevitable between-study variability18. Given their origin at the epicentre of human genome diversity9,37,38, it is not surprising that our Southern African patients present with 1.6- and 1.2-fold more genome-wide SNVs and rare SNVs, respectively. Conversely, European patients have a greater percentage of rare variants (31.8% vs 23.5%), which may be explained by recently diverged populations having experienced a major genetic bottleneck during the out-of-Africa migration event39,40. Irrespective, as a proportion of SNVs, PVs and BVs are two and four times more frequent in European than in African data, respectively, emphasising the associated discrepancies with regard to ClinVar content and InterVar ACMG-AMP guidelines for African inferences. Using our established positive and negative benchmark datasets, we evaluated VPPT performance, suggesting a 10-VPPT African-specific workflow to maximise PDV identification. Using these criteria, we increased the proportion of PDVs identified in our African data and in turn narrowed the margin observed between the ancestries. As our study focuses on PCa, through further loss of function and oncogenic predictions, we provide further clarification for potential pathogenicity, again narrowing the ancestral prediction gap.

A thorough review of the literature identified a single study that included African ancestral data for VPPT assessment32. That study focused on African American data, which represent predominantly West African genomic ancestry, whereas we and others have shown genetic substructure divergence between African Americans and Southern Africans41,42. However, similarities between the studies are noted. These include a limited gap in specificity between African and European ancestral data, while specificity is higher for LRT, CADD and MutationTaster and lower for FATHMM for both African American and Southern African data. In contrast, Southern African men did not show higher specificity for VEST4 and PolyPhen2, nor lower specificity for PROVEAN, MutationAssessor and SIFT. Moreover, we showed that current VPPTs performed worse when predicting African PVs, while the disparity in sensitivity was greater for the InterVar ACMG-AMP guideline benchmark datasets than for the merged benchmark datasets, suggesting that the training of current VPPTs is biased towards the ClinVar database. Given the difference in the inter-ethnic gap between sensitivity and specificity of current VPPTs, we question whether the prediction of rare deleterious variants is more computationally challenging than that of rare non-functional/neutral variants.

Based on our findings, we provide further discussion with regard to ancestry-specific VPPT selection. Among the six ancestrally shared top-performing VPPTs, three are meta-predictors (MetaSVM, CADD and BayesDel-noAF), alongside the unsupervised ML VPPT Eigen-raw, the supervised DL VPPT MVP and the MSA VPPT phyloP100way-vertebrate. Meta-predictors such as BayesDel, CADD, MetaSVM, ClinPred and REVEL generally have better performance43, while MetaSVM has some of the lowest FPR44. Eigen and phyloP have similarly shown decent performance for sensitivity and FPR34. Three MSA VPPTs each were selected as African- and European-specific, with the supervised DANN complementing the African workflow and the meta-predictor REVEL the European workflow. Irrespective of ancestry, no VPPT relying solely on supervised ML met our selection criteria, due to either low sensitivity or high FPRs. A concern raised for supervised ML is that the same set of variants is recursively used for both training and performance assessment, resulting in poor prediction on unknown data19,45. Although several VPPTs performed with very high sensitivity, such as M-CAP, GERP-NR and GenoCanyon, they were excluded from consideration due to high FPRs. Notably, two independent European-based comparison studies found M-CAP, followed by GenoCanyon, to have one of the highest FPRs35,44. Further consideration must also be given to VPPTs designed for specific variant types. For instance, as LINSIGHT has been built to predict non-coding variants only46, it is not surprising that it generated the lowest number of predicted positive and negative variants.

Since ACMG recommended the return of genomic incidental findings from a minimum set of 56 actionable genes in clinical sequencing47, large-scale studies48,49 have sought to define and standardise the criteria for variant pathogenicity classification, including variant allele frequency, segregation, the number of patients affected, de novo events in a trio and disruptive variants; VPPTs were tested for their utility but did not form part of the classification criteria. These studies reported nearly 50% fewer PVs in actionable genes for African than for European ancestry, which the authors attributed to the underrepresentation of African participants in publicly available databases and the literature. Furthermore, a perspective article from Southern African researchers raised the question of the utility of VPPTs in African genomes and called for an approach to infer African variant pathogenicity15. While our focus on Southern Africans provides advantages with respect to minimising non-African and non-Southern African genetic admixture, we in turn acknowledge that our observations may not translate across the African diaspora, calling for further regionally relevant and ethnolinguistically inclusive studies. Furthermore, our and other studies are limited by a lack of disease-negative population-matched control resources. This has led to the generation of public data such as the Australian Medical Genome Reference Bank, which provides the community with whole genome variant and associated phenotype data for over 3000 healthy aged (>75 years) participants with no known cancer, cardiovascular disease or dementia50. However, to the best of our knowledge, no such resource exists for Sub-Saharan Africa. In contrast, while resources like gnomAD capture WGS data for 37,545 African genomes (gnomAD v4)51, phenotypic data are not prioritised and, as previously demonstrated, African representation is regionally confined and lacking for Southern African inclusion42. Here we advocate for the establishment of younger-aged disease-free population-matched resources. Our rationale is that human ageing increases the frequency of somatic mutations in hematopoietic stem cells, a phenomenon more prevalent in DNA repair genes; while the risk is modest, the rarity of these variants has the potential for further misclassification during germline screening52. Either way, accurate filtering for non-pathogenic African-relevant PDVs requires a substantial global commitment.

Conclusion

Here we curate, to the best of our knowledge, the first ancestry-specific set of VPPTs informed by our African-vs-European ancestral genome-wide variant evaluation. Establishing African-specific criteria, we begin to close the ancestry gap for PV prediction using tailored in silico methods. While just the beginning, our study is a call for action: our results highlight the need for further representation across the broad African identifier, further disease type-specific evaluations, and the establishment of African-relevant and regionally specific databases, from clinically curated to disease-free resources.