Fig. 4: Abundance of protein features across nine missense variant datasets.

The nine datasets comprise gnomAD variants binned by allele frequency (AF): very rare (AF \(<\) 0.1%), rare (0.1% \(\le\) AF \(<\) 0.5%), low frequency (0.5% \(\le\) AF \(<\) 5%) and common (AF \(\ge\) 5%); ClinVar variants grouped by clinical significance (PLP, BLB and VUS); and HGMD disease mutations grouped by confidence level (high and low). For details about each protein feature, see ‘Protein features in the G2P portal’ in Methods. a, The abundance of each sequence annotation from UniProt and each PTM site within a given dataset. The calculated abundance of a feature (for example, active site) is shown as the numerical value at each data point (see Supplementary Fig. 6 for details of the feature abundance calculation). Each point is color-coded by its normalized abundance, that is, its abundance divided by the maximum value among the nine datasets (shown in bold and circled), to facilitate comparison of relative abundances across different features. For example, the abundance of the active site is highest in the ClinVar PLP dataset (0.23), which therefore has a normalized abundance of 1 and the darkest color, whereas the gnomAD common dataset has a normalized abundance of 0/0.23 = 0 and the lightest color. b, The proportion of three-class (left) and nine-class (right) secondary structures within variant datasets. The nine secondary structure classes are grouped into three larger classes: helix (H; \(3_{10}\)-helix/G, α-helix/H, π-helix/I and polyproline helix/P), strand (B; β-sheet/E and β-bridge/B) and loop (C; bend/S, turn/T and coil/C). Structured regions (helix and strand) harbor a larger share of pathogenic variants (~56% of ClinVar PLP variants and HGMD high-confidence disease mutations). c, Violin plots showing the distributions of 3D structural features (accessible surface area and backbone phi/psi angles) across different variant datasets.
The plots are split by AlphaFold prediction confidence into high (pLDDT \(\ge\) 70, n = 4,134,666) and low (pLDDT \(<\) 70, n = 2,544,814) confidence groups. The violins illustrate the probability density of the data at different values, with the white dot representing the median, the thick black bar in the center representing the interquartile range (IQR), and the thin black line representing the 95% confidence interval. The variant features summarized in b and c were computed using AlphaFold structures.
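
The normalization used for color coding in a and the nine-to-three-class grouping in b can be illustrated with a minimal Python sketch. The function names (`normalize_abundance`, `three_class_proportions`) and the dictionary layout are illustrative assumptions, not the code used to produce the figure. Only the two active-site values quoted in the caption are used as input; the other seven datasets are omitted.

```python
from collections import Counter

# Nine-class to three-class grouping, as listed in the caption for panel b.
NINE_TO_THREE = {
    "G": "H", "H": "H", "I": "H", "P": "H",  # helix
    "E": "B", "B": "B",                      # strand
    "S": "C", "T": "C", "C": "C",            # loop
}


def normalize_abundance(abundance_by_dataset):
    """Divide each dataset's abundance by the maximum across datasets (panel a)."""
    peak = max(abundance_by_dataset.values())
    if peak == 0:
        # Feature absent from every dataset: report 0 everywhere, avoid 0/0.
        return {name: 0.0 for name in abundance_by_dataset}
    return {name: value / peak for name, value in abundance_by_dataset.items()}


def three_class_proportions(nine_class_labels):
    """Collapse nine-class labels into H/B/C and return their proportions (panel b)."""
    counts = Counter(NINE_TO_THREE[label] for label in nine_class_labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no secondary structure labels supplied")
    return {cls: counts.get(cls, 0) / total for cls in ("H", "B", "C")}


# Active-site abundances quoted in the caption (two of the nine datasets).
active_site = {"ClinVar PLP": 0.23, "gnomAD common": 0.0}
print(normalize_abundance(active_site))
```

With these two values, ClinVar PLP maps to a normalized abundance of 1 (darkest color) and gnomAD common to 0 (lightest color), matching the worked example in a.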