Genomics 2 Proteins portal: a resource and discovery tool for linking genetic screening outputs to protein sequences and structures

Kwon, Seulki; Safer, Jordan; Nguyen, Duyen T.; Hoksza, David; May, Patrick; Arbesfeld, Jeremy A.; Rubin, Alan F.; Campbell, Arthur J.; Burgin, Alex; Iqbal, Sumaiya

doi:10.1038/s41592-024-02409-0

Resource
Open access
Published: 18 September 2024

Nature Methods volume 21, pages 1947–1957 (2024)

Abstract

Recent advances in AI-based methods have revolutionized the field of structural biology. Concomitantly, high-throughput sequencing and functional genomics have generated genetic variants at an unprecedented scale. However, efficient tools and resources are needed to link disparate data types—to ‘map’ variants onto protein structures, to better understand how the variation causes disease, and thereby design therapeutics. Here we present the Genomics 2 Proteins portal (https://g2p.broadinstitute.org/): a human proteome-wide resource that maps 20,076,998 genetic variants onto 42,413 protein sequences and 77,923 structures, with a comprehensive set of structural and functional features. Additionally, the Genomics 2 Proteins portal allows users to interactively upload protein residue-wise annotations (for example, variants and scores) as well as the protein structure beyond databases to establish the connection between genomics to proteins. The portal serves as an easy-to-use discovery tool for researchers and scientists to hypothesize the structure–function relationship between natural or synthetic variations and their molecular phenotypes.

Main

We live in the era of big biological data, where a surge of efficient deep learning methods has made millions of high-quality predicted protein structures available to the biomedical community^1,2,3,4,5. At the same time, cryo-electron microscopy and other experimental methods are leading to a burst of high-resolution protein structures and assemblies^6,7,8. These advances come when an unprecedented number of genetic variants in the general population and those associated with diseases have been identified and accumulated in multiple databases^9,10,11,12,13. Concomitantly, advances in functional genomics approaches (for example, base editing^14,15, prime editing¹⁶ and Perturb-seq¹⁷) have enabled the generation of synthetic mutations and the quantification of their functional impact in different cellular models. Mapping natural or synthetic variants in the context of protein structure provides valuable molecular-level insights and helps hypothesize the structure–function mechanism of the variant.

Challenges remain, however, in connecting genomic data (that is, variants) to protein structural data due to the complexity introduced by diverse RNA transcripts and protein isoforms originating from a single DNA sequence¹⁸, which requires accurate transcript–protein isoform mapping. Another technical hurdle lies in reconciling disparate formats between genomic identifiers of variants such as rsIDs¹⁹ or Human Genome Variation Society (HGVS) notations²⁰ and those of protein data featured on amino acid sequences and their spatial coordinates in structures²¹, often only available for fragments of the full-length protein. Aligning and harmonizing this discrepancy in naming and formatting requires interdisciplinary knowledge or collaborations across researchers from genetics, structural biology and computational biology. Therefore, proteome-wide linking of human genes to protein structures through identifiers, and the subsequent matching of variants to their corresponding structural positions, necessitates efficient computational methods. Additionally, a user-friendly online method is needed that supports such linking alongside interactive options for uploading data, applying customized filters on variants and exporting interoperable results. Such a method would enable a broad community to map variants onto proteins, analyze targets informed by genetics and thereby design better therapeutics. We developed the Genomics 2 Proteins (G2P) portal to meet this need.

The G2P portal is a bioinformatic tool to dynamically query, retrieve and connect genetic variants and transcripts to protein sequence annotations and structures, wrapped within an interactive web interface with visualization functions. At the time of writing, ~20 million genetic variants within all human protein-coding genes from public databases are aggregated and annotated within protein sequences and structures with comprehensive protein feature reports. By exploiting experimentally solved and predicted protein structures, the G2P portal covers 99% of all human proteins with corresponding structures.

Results

G2P bioinformatic method and portal

The G2P web portal and the underlying bioinformatic method are built upon a dynamic application programming interface (API) for seamless mapping of identifiers from public databases for genes (HUGO Gene Nomenclature Committee (HGNC))²², transcripts (the Ensembl genome browser²³ and RefSeq²⁴; NCBI Reference Sequence Database), protein sequences (UniProt KnowledgeBase (UniProtKB)²¹) and structures (Protein Data Bank (PDB)⁶ and AlphaFold Database (AlphaFoldDB)²⁵). We call the API Genomics 2 Proteins 3D or G2P3D API (‘Construction of G2P3D API’ in Methods). As of October 2023, the API links 20,292 human genes that encode 20,242 UniProtKB proteins corresponding to 42,413 isoforms, via 53,607 Ensembl transcripts and 57,543 RefSeq transcripts, to 77,923 protein structures (58,027 PDB structures and 19,896 AlphaFold structures). A schematic overview of the portal and data flow via G2P3D API is illustrated in Fig. 1.

**Fig. 1: The bioinformatic framework of the G2P portal.**

Multi-omics data aggregated in the portal showed that about 47% of all human genes have one unique protein isoform to date, as available in UniProtKB (release 2023_05). The remaining 53% of genes have on average three isoforms by alternative splicing. Overall, 85% of all canonical protein isoforms were translated from the MANE (Matched Annotation from NCBI and EBI) Select transcripts²⁶. Additionally, 99% of human proteins have either PDB or AlphaFold structures. Further statistics of data aggregated via G2P3D API are available in Supplementary Note 1.

The web-based user interface of the G2P portal is built on Google Cloud infrastructure (‘G2P Google Cloud infrastructure’ in Methods and Extended Data Fig. 1). The portal features two distinct modules (Fig. 1a): (1) Gene/Protein Lookup: a human proteome-wide resource for users to link genetic variants from transcripts to protein sequences and structures; (2) Interactive Mapping: a tool for users to analyze their own data, so that analysis is not limited to publicly available variants or protein structures. The Gene/Protein Lookup module currently aggregates variants from the Genome Aggregation Database (gnomAD)⁹, ClinVar¹⁰ and the Human Gene Mutation Database¹¹ (HGMD) and then maps them onto amino acid positions upon dynamic retrieval of the protein sequence and structure (Fig. 1b). Moreover, a comprehensive set of protein residue-wise annotations (referred to as ‘protein features’) is computed and aggregated in the portal, such as UniProt-based features (for example, domain and active site), structural features, posttranslational modifications (PTMs)²⁷ and variant effects from MaveDB²⁸. Finally, variants are mapped onto protein sequences and structures simultaneously with protein features, aiding informed variant-mechanism analysis by assessing the mutational impact on important protein features. Further details of submodules within the two main modules of the portal are available in ‘G2P portal sitemap’ in Methods and Extended Data Fig. 2.

A suite of visualization tools has been implemented in the portal for intuitive data exploration (‘Data visualization tools in the G2P portal’ in Methods and Extended Data Fig. 3). The portal undergoes a biannual data update to incorporate the latest information from external databases and maintain the dynamically linked APIs (Supplementary Note 2 and Supplementary Table 1).

Resources in the G2P portal

The Gene/Protein Lookup module of the G2P portal contains variants, protein structures and protein features for all human protein-coding genes (Supplementary Table 2). Genes and proteins are classified by the HGNC²² gene family and protein class from PANTHER knowledgebase²⁹ (Supplementary Note 3.1, Supplementary Fig. 1 and Supplementary Tables 3 and 4).

Variant data

We applied the bioinformatic method to variants in three human genetic variation databases: gnomAD⁹, ClinVar¹⁰ and HGMD¹¹, referred to as gnomAD, ClinVar and HGMD variants, respectively (Fig. 2). As of October 2023, G2P portal aggregates 18,014,632 protein-coding variants annotated in 18,723 human genes from gnomAD v2.1.1, 1,749,628 protein-coding variants in 18,180 genes from ClinVar (10/2023 release) and 312,738 disease-causing mutations in 12,918 genes from HGMD professional 2023.01 (‘Variant aggregation’ in Methods). In the portal, variants were grouped based on their protein consequence: missense, nonsense, synonymous, frameshift, in-frame indel and others (Fig. 2a–c) and visualized as separate tracks in the protein sequence viewer. Additionally, the portal allows for a customized selection of variants based on database-specific filters: allele frequency (AF) and allele count (AC) for gnomAD, clinical significance for ClinVar, and confidence ascertainment for HGMD variants. This unique functionality facilitates the mapping and interpretation of variants based on their specific protein consequences and clinically relevant criteria.
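The filtering described above can be illustrated with a minimal Python sketch. It assumes a hypothetical in-memory `Variant` record; the field names, the `filter_variants` helper and the example values are illustrative assumptions, not the portal's actual schema or code.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class Variant:
    # Hypothetical record layout, for illustration only.
    source: str                                   # "gnomAD", "ClinVar" or "HGMD"
    position: int                                 # 1-based residue position
    consequence: str                              # e.g. "missense", "nonsense"
    allele_frequency: Optional[float] = None      # gnomAD only
    clinical_significance: Optional[str] = None   # ClinVar only


def filter_variants(
    variants: Iterable[Variant],
    consequences: Set[str],
    max_af: Optional[float] = None,
    significances: Optional[Set[str]] = None,
) -> List[Variant]:
    """Keep variants with a selected protein consequence, then apply
    the database-specific filter that matches each variant's source."""
    selected = []
    for v in variants:
        if v.consequence not in consequences:
            continue
        if v.source == "gnomAD" and max_af is not None:
            if v.allele_frequency is None or v.allele_frequency > max_af:
                continue
        if v.source == "ClinVar" and significances is not None:
            if v.clinical_significance not in significances:
                continue
        selected.append(v)
    return selected


# Toy data; positions and frequencies are made up.
variants = [
    Variant("gnomAD", 12, "missense", allele_frequency=0.2),
    Variant("gnomAD", 40, "missense", allele_frequency=1e-5),
    Variant("ClinVar", 87, "missense", clinical_significance="Pathogenic"),
    Variant("ClinVar", 90, "synonymous", clinical_significance="Benign"),
]
kept = filter_variants(
    variants,
    consequences={"missense"},
    max_af=0.001,  # arbitrary cutoff chosen for the example
    significances={"Pathogenic", "Likely pathogenic"},
)
kept_positions = [v.position for v in kept]
```

Each surviving variant keeps its source label, so the result could still be drawn as separate per-database tracks.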

**Fig. 2: Statistics of variants from gnomAD, ClinVar and HGMD databases aggregated in the G2P portal.**

Most gnomAD variants (~97%) are very rare (Fig. 2d), and the fraction of common variants is the largest for synonymous variants (0.78%). Clinical significance of ClinVar variants (Fig. 2e) showed a correlation with protein consequences: the proportions of benign/likely benign (BLB) and pathogenic/likely pathogenic (PLP) variants are inversely related across synonymous (90.1% BLB and 0.1% PLP) and nonsense (0.8% BLB and 83.0% PLP) mutations. Missense mutations hold the highest fraction (87.1%) of variants of uncertain significance (VUS) and conflicting interpretation (CI). Within HGMD data, frameshift and nonsense mutations account for the two largest fractions of high-confidence disease mutations (87% and 81%, respectively), while most synonymous mutations are of low confidence (Fig. 2f). Additional variant statistics are available in Supplementary Note 3.2. The portal maps variants to both canonical and noncanonical protein isoforms (a comparative overview of statistics is shown in Supplementary Fig. 2).

Protein structural data

The G2P portal dynamically queries structure databases to map variants annotated on transcripts onto structures of canonical protein isoforms (according to UniProtKB²¹). As of October 2023, variants were mapped onto 58,027 PDB⁶ structures of 7,973 proteins (Supplementary Fig. 3 and Supplementary Table 5) and 19,896 AlphaFold²⁵ structures of 19,972 proteins. AlphaFoldDB covers structures of 12,105 proteins with no experimental structure available yet, with reasonably high predicted local distance difference test (pLDDT)²⁵ (median ≈ 75; Supplementary Note 4 and Supplementary Figs. 4 and 5), highlighting the utility of AlphaFold structures for variant mapping and protein three-dimensional (3D) feature analysis of variant positions. As a result, over 90% of gnomAD, ClinVar and HGMD variants were successfully mapped onto protein structures (Fig. 3a–c).
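Because experimental structures often cover only fragments of a protein, placing a variant on a structure requires translating its sequence position into the structure's residue numbering. The Python sketch below shows one simple way to do this, assuming the alignment is supplied as contiguous segments with no insertion codes. The `Segment` type and `to_structure_residue` function are hypothetical and do not describe the portal's implementation.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Segment:
    # One contiguous stretch of the protein sequence observed in a chain.
    chain: str
    unp_start: int   # first sequence position covered (1-based, inclusive)
    unp_end: int     # last sequence position covered (inclusive)
    pdb_start: int   # structure residue number aligned to unp_start


def to_structure_residue(
    position: int, segments: List[Segment]
) -> Optional[Tuple[str, int]]:
    """Return (chain, residue number) for a sequence position, or None
    when the position lies outside every covered segment."""
    for seg in segments:
        if seg.unp_start <= position <= seg.unp_end:
            return seg.chain, seg.pdb_start + (position - seg.unp_start)
    return None


# Toy coverage: two fragments of one chain with a gap between them.
segments = [Segment("A", 1, 100, 1), Segment("A", 150, 300, 101)]
mapped = to_structure_residue(160, segments)
unmapped = to_structure_residue(120, segments)
```

Positions that return `None` fall in unresolved regions, one reason a variant may map to a predicted structure but not to an experimental one.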

**Fig. 3: Statistics of variants mapped on 3D structures in the G2P portal.**

The distribution of protein consequences and database-specific groups of variants (based on AF for gnomAD, clinical significance for ClinVar and confidence for HGMD) mapped on structures illustrated that gnomAD variants do not show differences in AF distributions between variants mapped on PDB and AlphaFold structures (Fig. 3d). A higher fraction of ClinVar PLP variants were mapped on PDB structures (14.5%) compared to AlphaFold structures (10.2%; Fig. 3e). Similarly, a higher fraction of high-confidence disease mutations from HGMD were mapped on PDB structures (78%) than AlphaFold structures (72%; Fig. 3f).

Protein feature data

A comprehensive set of per-residue protein features are integrated into the portal (‘Protein features in the G2P portal’ in Methods), to help users establish the relationship between genetic variations and protein structure–function (‘Variant and feature mapping onto proteins’ in Methods). The features are grouped into (1) physicochemical properties of amino acids, (2) structural features^30,31,32 based on PDB and AlphaFold structures, (3) sequence annotations from UniProtKB, (4) PTMs from PhosphoSitePlus²⁷, and (5) readouts from multiplexed assays of variant effect (MAVE) when available in MaveDB²⁸. The portal’s proteome-wide feature annotations provide insight into the differential structural and functional characteristics of gnomAD versus ClinVar and HGMD variants. In Fig. 4, we present the abundance of protein features across nine missense variant groups divided by database-specific filters: gnomAD (very rare, rare, low frequency and common), ClinVar (PLP, BLB and VUS) and HGMD (high and low confidence). Notably, gnomAD common and ClinVar BLB missense variants as well as ClinVar PLP and HGMD high-confidence missense variants exhibit similar distributions across all protein features. Similar results for variants leading to different protein consequences are reported in Supplementary Note 5 and Supplementary Figs. 6–10.

**Fig. 4: Abundance of protein features across nine missense variant datasets.**

Case study—Gene/Protein Lookup module

In the Gene/Protein Lookup module, users can query the G2P portal by a gene or protein name and map variants from gnomAD⁹, ClinVar¹⁰ and HGMD¹¹ for a selected transcript, together with protein features, onto the protein sequence. Variants and features can thereafter be transferred to a selected PDB⁶ or AlphaFold²⁵ structure. Users can explore the customized annotations of protein sequences and structures with variants and protein features online and download mappings in tabular and PyMOL³³ formats.

Figure 5 presents a case study using MORC2, encoding ATPase MORC2 protein, associated with Charcot–Marie–Tooth disease type 2 (CMT2Z)³⁴. Upon selecting a transcript under the ‘variant to protein sequence’ tab (Fig. 5a), users can view variants and protein features mapped in the protein sequence viewer. The portal provides easily selectable filters for variants and features. For example, by filtering for Missense and Pathogenic/Likely pathogenic ClinVar variants, we observe a cluster of PLP missense variants in the N-terminal region (residues 20–470) of MORC2. Protein feature tracks offer further insights into this region: (1) the PDBe/SIFTS track shows that the N terminus of MORC2 has experimentally solved structures; (2) accessible surface area and AlphaFold confidence (pLDDT) tracks indicate that the N-terminal region exhibits greater structural integrity, with lower surface area and higher pLDDT, compared to the C-terminal region; and (3) the binding site track highlights the predominant presence of catalytic sites (ATP- and Zn²⁺-binding) in the N terminus. Users can visualize these observations on 3D structures using the structure viewer under the ‘variant to protein structure’ tab (Fig. 5b). Here, mapping ClinVar PLP missense variants (yellow, Fig. 5b) alongside the binding site track on the crystal structure of the MORC2 homodimer (PDB 5OF9) reveals the mutations’ proximity to the dimer interface and binding sites (black, Fig. 5b). This case study underscores the utility of the G2P portal in unveiling the spatial distribution of MORC2 pathogenic variants and their protein structure–function relationship.

**Fig. 5: A use case of the Gene/Protein Lookup module for reported variants and protein features of *MORC2*.**

Case study—Interactive Mapping module

The Interactive Mapping module of the G2P portal allows users to upload genetic variants or sequence annotations, for example, domains, drug-binding pockets and conservation scores, and map them to the target protein’s structure (PDB, AlphaFold or user-provided), extending the capability of linking genomics to proteins beyond existing databases and even the human proteome. The portal ensures the security of user-uploaded data (Extended Data Fig. 1); user-provided data are not shared or saved in the portal’s backend. The versatility of the integrated pipeline allows concurrent mapping of multiple data types (genetic variants, discrete feature annotations and continuous scores) from gene to protein sequence and structure and exporting results for downstream analyses. Figure 6 presents a case study using the published base-editing (BE) scanning results of the DNA methyltransferase 3A¹⁵. From ‘start with a gene/protein identifier’, we entered the gene DNMT3A, chose a structure (PDB 4U7T) and uploaded annotations such as (1) 34 missense variants (base-edited position) with absolute sgRNA scores ≥ 2 standard deviations, (2) sgRNA scores from the BE screen and the pathogenicity prediction scores from AlphaMissense³⁵, and (3) domain annotations (domain) from ref. ¹⁵ (Supplementary Table 6). User-uploaded annotations are visible and selectable in the viewer (Fig. 6a), and users can supplement these with additional annotations from integrated resources (described in ‘Resources in the G2P portal’ in Results). By selecting ‘Base-edited position’ and ‘domain’ annotations (Fig. 6a, left), the user can pinpoint the 3D positions of variants within each domain (Fig. 6a, right): 24 and 4 variants are found in the MTase and ADD domains, respectively.
Figure 6b illustrates the concurrent mapping of user-uploaded and integrated data in the Gene/Protein Lookup module, for example, ClinVar PLP variants and three-class secondary structures; this capability allows users to analyze their variants in the context of known pathogenic variants (Fig. 6b, top) and structural features (Fig. 6b, bottom). Finally, Fig. 6c showcases the mapping of the BE results onto the AlphaFold structure, colored by the AlphaFold confidence scores (pLDDT) and AlphaMissense scores averaged over 20 amino acid substitutions at a reference amino acid position—a capability in the portal that enables users to characterize variants using any state-of-the-art variant effect prediction score. We observed that all 34 base-edited variants mapped to residues with pLDDT > 70, and that 31 variants are pathogenic according to AlphaMissense (score > 0.57). An extension of the DNMT3A case study to map user-uploaded annotations on user-provided ESMFold⁴ structure is available in Supplementary Note 6 (Supplementary Fig. 11).
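Collapsing per-substitution predictor scores into one value per residue, as done for the AlphaMissense coloring above, amounts to a grouped mean. The Python sketch below illustrates this with made-up scores. The `(position, alternate residue, score)` tuple layout, the function names and the 0.5 cutoff are assumptions for illustration only.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple


def mean_score_per_position(
    scores: Iterable[Tuple[int, str, float]]
) -> Dict[int, float]:
    """Average substitution-level scores over all alternate residues
    reported at each reference position."""
    by_position = defaultdict(list)
    for position, _alt_residue, score in scores:
        by_position[position].append(score)
    return {pos: mean(vals) for pos, vals in by_position.items()}


def positions_above(per_position: Dict[int, float], cutoff: float) -> List[int]:
    """Sorted positions whose averaged score exceeds the cutoff."""
    return sorted(pos for pos, s in per_position.items() if s > cutoff)


# Toy input: two substitutions at each of two positions (values made up).
toy_scores = [
    (10, "A", 0.9),
    (10, "C", 0.7),
    (11, "A", 0.2),
    (11, "C", 0.4),
]
per_position = mean_score_per_position(toy_scores)
flagged = positions_above(per_position, cutoff=0.5)  # cutoff is arbitrary here
```

The resulting position-to-score dictionary has the same shape as any other continuous residue-wise annotation, so it could be uploaded alongside variants and discrete features.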

**Fig. 6: A use case of the Interactive Mapping module using *DNMT3A* base-editing screens.**

Discussion

Genetic screening is increasingly applied in clinical practice^36,37,38; a persistent bottleneck in translational and clinical genetics, however, is decoding the molecular-level implications of the screening output, that is, genetic variations. Most clinically identified variants remain of uncertain significance³⁹, and determining therapeutically actionable variants is challenging⁴⁰. Bridging genetic variants to structural biology provides a means to connect the potential cause of many diseases to the molecular effect^41,42,43,44. However, integration of data across genomics, transcriptomics, protein sequence and structure, necessary for connecting variants to protein structure, is not straightforward due to different data types and inherent complexities. Here, we present the G2P portal that overcomes these challenges of multi-omics data integration and establishes the bridge between gene variants and protein structures for the human proteome to help generate hypotheses for translating genetic discovery into molecular consequences and therapeutic mechanisms.

Recognizing the importance, multiple efforts have been made to connect variants to proteins: UniProt feature/variant viewer⁴⁵ displays feature annotations and disease-related variants on protein sequences. RCSB 3D-sequence viewer⁴⁶ shows features from a specific PDB entry and a limited set of variants from UniProt. The Gene2Phenotype⁴⁷, also called G2P, facilitates diagnostic variant filtering given a gene or phenotype, and connects users to DECIPHER⁴⁸ for protein information and mapping of protein-changing variants onto proteins. Additionally, VarMap⁴⁹, in conjunction with ProtVar⁵⁰, offers resources to contextualize missense variations with respect to protein sequence and structures. The G2P portal stands out from existing tools on multiple fronts, including the scale of integrated resources, the versatility of the user interface and flexible data visualization and export functions tailored toward investigating functional implications of variants within the context of proteins.

The G2P portal distinguishes itself from UniProt feature/variant viewer⁴⁵, RCSB 3D-sequence viewer⁴⁶ and DECIPHER⁴⁸ sequence/structure viewers by simultaneously mapping variants from the largest population variant^9,12 and multiple disease variant databases^10,11 as separate tracks alongside protein feature tracks. Additionally, users can customize their analysis by filtering variants by protein consequences (for example, missense and synonymous), clinical significance and AF—a feature essential to differentiate local enrichment (or depletion) of pathogenic (or common, putatively benign) variants on the linear protein sequence and 3D structural region associated with a specific protein function. Another differentiating attribute of the G2P portal is its comprehensive protein feature aggregation, encompassing curated annotations from UniProtKB²¹, PDBe/SIFTS³¹, AlphaFoldDB²⁵, MaveDB²⁸, PTM data²⁷ and computed features (for example, solvent accessible area). Unlike existing viewers in UniProt⁴⁵, RCSB PDB⁴⁶ and DECIPHER⁴⁸, the G2P portal can map multiple features, including discrete and quantitative scores concurrently with variants, onto structures. This functionality of the portal enables the identification of mutational hotspots in genes/proteins and their characterization by structural/functional features, which reportedly informs variant prioritization^51,52. The G2P portal also offers enhanced data export capabilities; users can seamlessly download mapping results in multiple interoperable formats, including tabular CSV/TSV files for protein sequences and PyMOL-compatible files for structures, making the user experience of downstream analysis easy and efficient.

Furthermore, the growing landscape of natural and synthetic variants, as well as predicted protein structures, underscores the urgent, unmet need for a flexible, dynamic and interactive tool for variant mapping on protein structures that goes beyond existing databases. The Interactive Mapping module of the portal allows for this, a capability distinct from existing tools. To compare, ProtVar⁵⁰ allows users to upload variants but links them to protein structures in a variant-by-variant manner. In contrast, the focus of the G2P portal is to offer uploading of variants by a target gene along with protein residue-level scores or discrete feature annotations and mapping them concurrently onto the target protein’s sequence and structure. Notably, two features of Interactive Mapping make the G2P portal uniquely versatile: the ability to analyze user-uploaded variants and features (1) in the context of public data and (2) on user-uploaded structures. For example, a clinician with an unreported de novo mutation or a molecular biologist with a set of mutations out of a base-editor screen can upload their data by searching the target gene of interest and interactively investigate their data on the target protein’s sequence and structure, alongside already reported variants from population and clinical databases. Similarly, a structural biologist can upload a previously unsolved structure of a drug target or a structure model predicted by AI methods^1,2,3,4 and can map the known set of disease mutations onto the structure for structure-based rationalization of the impact of disease mutations.

In summary, the G2P portal is an open-source discovery tool for human proteome-wide linking of genetic variants to protein sequences and structures. The versatile interface of the portal is built upon an integrated and dynamic bioinformatic method enabling rapid and efficient investigation of genetic variants with user-defined filters from different databases in the context of proteins, which otherwise requires manual labor and is prone to error. We are committed to growing and maintaining the G2P portal, with regular data updates and documentation of release logs. The future expansion plan includes incorporating cross-isoform and cross-species variant mappings, automating sequence similarity-based structure search for variant mapping, predictions of binding pockets⁵³ and free energy change upon mutation^54,55. The data and capabilities integrated into the portal will connect researchers across different fields of biology for a holistic understanding of how genetic variants impact protein structure and function and thereby facilitate the spectrum of basic biology research, from the translation of genetic discovery into better target selection to drug discovery.

Methods

Construction of G2P3D API

We integrated public databases focusing on genes, transcripts and proteins to build an API for seamless mapping of identifiers for genes, transcripts, protein sequences and structures, referred to as the G2P3D API (Fig. 1). The HGNC²² maintains a curated online repository (https://www.genenames.org/) of approved genes and their unique symbols and names for human loci. The Ensembl genome browser (http://useast.ensembl.org/)²³ offers access to a wide range of genomic annotations. UniProtKB (https://www.uniprot.org/)²¹ provides the most current data on protein sequences and functions. These databases each specialize in different aspects of biology and are regularly updated; thus, there are situations where gene symbols annotated in UniProtKB have been changed or withdrawn in the HGNC, and UniProtKB IDs annotated in the Ensembl browser have become obsolete in the latest release of UniProtKB. To address this issue, the G2P3D API integrates UniProtKB, Ensembl and HGNC to ensure it captures the most comprehensive and up-to-date information on genes, transcripts and proteins.

First, we obtained a list of all human proteins from UniProtKB/Swiss-Prot (indexed by UniProt Accession or UniProtAC) and their corresponding HGNC IDs. Then, we retrieved gene symbols for each protein from HGNC with the provided HGNC ID. Subsequently, all Ensembl and RefSeq transcript identifiers and corresponding UniProtKB protein isoform identifiers were obtained via the Mart View API from Ensembl BioMart⁵⁶ for the human reference genome GRCh38. These data were processed to map each gene symbol (HGNC) to its encoded UniProtAC and then each protein isoform to its corresponding Ensembl and RefSeq transcript when available from Ensembl. Additionally, canonical protein isoform annotations, as defined by UniProtKB, and the canonical Ensembl and MANE Select²⁶ annotation of transcripts were assembled. Next, the PDB⁶ identifiers for the experimentally solved protein structures per protein were obtained using Graph-API (https://www.ebi.ac.uk/pdbe/graph-api/uniprot/unipdb/:UniProtAC/) and the identifier for the predicted structure by AlphaFold²⁵ was retrieved using API (https://alphafold.ebi.ac.uk/api/prediction/:UniProtAC/). As of October 2023, the G2P3D API, incorporated into the G2P portal (see an example of the API output in Fig. 1b), links 20,292 HGNC genes (Supplementary Table 2) that encode 20,242 UniProtKB/Swiss-Prot human proteins corresponding to 42,413 protein isoforms, via 53,607 Ensembl transcripts and 57,543 RefSeq transcripts, to 77,923 3D protein structures (58,027 experimentally solved and 19,896 computationally predicted).
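One way to picture the result of these steps is a nested record per gene that ties the gene symbol to its protein accession, isoforms, transcripts and structures. The Python sketch below is a hypothetical data model for illustration. The class names, field names and all identifiers are placeholders, and the actual API output format (Fig. 1b) may differ.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Isoform:
    isoform_id: str
    canonical: bool                      # canonical as defined by UniProtKB
    ensembl_transcripts: List[str] = field(default_factory=list)
    refseq_transcripts: List[str] = field(default_factory=list)


@dataclass
class GeneRecord:
    gene_symbol: str                     # HGNC symbol
    uniprot_ac: str                      # UniProt accession
    isoforms: List[Isoform] = field(default_factory=list)
    pdb_ids: List[str] = field(default_factory=list)
    alphafold_ids: List[str] = field(default_factory=list)

    def canonical_isoform(self) -> Optional[Isoform]:
        """First isoform flagged canonical, or None if none is flagged."""
        return next((i for i in self.isoforms if i.canonical), None)

    def structure_count(self) -> int:
        """Experimental plus predicted structures linked to this gene."""
        return len(self.pdb_ids) + len(self.alphafold_ids)


# Placeholder identifiers only; none refer to real database entries.
record = GeneRecord(
    gene_symbol="GENE1",
    uniprot_ac="X00000",
    isoforms=[
        Isoform("X00000-1", True, ["ENST_PLACEHOLDER_1"], ["NM_PLACEHOLDER_1"]),
        Isoform("X00000-2", False, ["ENST_PLACEHOLDER_2"]),
    ],
    pdb_ids=["1ABC", "2DEF"],
    alphafold_ids=["AF-X00000-F1"],
)
```

Summing `structure_count()` over all records mirrors how the reported total of 77,923 structures decomposes into 58,027 experimental and 19,896 predicted entries.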

The G2P3D API is available at https://g2p.broadinstitute.org/api/gene/:geneName/protein/:UniProtAC/gene-transcript-protein-isoform-structure-map/. The Swagger user interface for the API and its documentation are available at https://g2p.broadinstitute.org/api-docs/.
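A request URL can be built by substituting the two path parameters in the template above. The Python sketch below uses only the standard library. The helper names are hypothetical, the accession `Q9Y6K1` is an example value assumed to correspond to DNMT3A and is not taken from the text, and the response body is returned as raw text because its format is not specified here.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Path template from the text, with :geneName and :UniProtAC as placeholders.
G2P3D_TEMPLATE = (
    "https://g2p.broadinstitute.org/api/gene/{gene}/protein/{uniprot_ac}"
    "/gene-transcript-protein-isoform-structure-map/"
)


def g2p3d_url(gene: str, uniprot_ac: str) -> str:
    """Fill the path parameters, percent-encoding any unsafe characters."""
    return G2P3D_TEMPLATE.format(
        gene=quote(gene, safe=""),
        uniprot_ac=quote(uniprot_ac, safe=""),
    )


def fetch_mapping(gene: str, uniprot_ac: str, timeout: float = 30.0) -> str:
    """Fetch the mapping for one gene/protein pair and return the raw
    response body (requires network access; parsing is left to the caller)."""
    with urlopen(g2p3d_url(gene, uniprot_ac), timeout=timeout) as response:
        return response.read().decode("utf-8")


# Build (but do not send) an example request.
example_url = g2p3d_url("DNMT3A", "Q9Y6K1")
```

The Swagger documentation linked above remains the authoritative description of the parameters and response fields.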

G2P Google Cloud infrastructure

The schematic of the G2P portal infrastructure is presented in Extended Data Fig. 1. The portal frontend is implemented in React.js, which is served by a Node.js backend running on Google Cloud Platform. The RCSB Saguaro 1D Feature Viewer⁵⁷ and Mol*⁵⁸ are adopted and customized as protein sequence and structure viewers, respectively, to visualize the frontend data on protein sequences and structures. The backend runs on Google App Engine, a serverless and on-demand compute offering that launches a variable number of backend instances proportional to usage.

Google Cloud Storage (GCS) is utilized as the primary data store for variant and protein feature annotations per gene/protein alongside an in-memory datastore used on the backend to track the gene–transcript–protein isoform–protein structure mapping. The static data stored in GCS are collected, processed, formatted and uploaded by the portal admin (Extended Data Fig. 1). To load static data from GCS, the portal requests files directly from the frontend, which reduces latency by avoiding an additional ‘hop’ where data must first travel to the backend before reaching the frontend. From our testing, the minimum observed time for a backend request is a 60-ms round trip, and by requesting files directly from the frontend, the G2P portal saves a minimum of 60 ms per request. To load data from the in-memory datastore, the portal frontend makes requests to backend APIs, and the backend retrieves and returns the relevant records. The datastore is managed directly by the backend server, not by a separate process. In addition to managed data sources, the portal dynamically requests data from external APIs to provide the most current information possible. The full list of external and internal APIs as well as static and dynamic data maintained in the G2P portal are available in Extended Data Fig. 1.

To this end, the G2P portal web app requests the latest protein sequence and structure records directly from UniProtKB²¹, PDBe³², AlphaFoldDB²⁵ and EMBL-EBI APIs⁵⁹. In the ‘Interactive Mapping’ module of the portal, users can provide their data (protein residue-wise annotation of variants, features, scores and protein structures) for joint analysis of user data with G2P-provided resources (‘Resources in the G2P portal’ in Results). The Interactive Mapping module can be securely accessed via Google sign-in, and to further ensure data confidentiality, all user-uploaded data remain within the user’s local browser only; therefore, no user-provided data leaves the user’s device. This ensures that the user has full, secure control over their data while simultaneously providing access to the G2P portal’s variants and protein features for joint analysis. When a user searches a gene or protein via the Gene/Protein Lookup or as part of the Interactive Mapping workflow, static mapping information is fetched directly with the G2P3D API to connect gene to protein to transcript to sequence to structure. Subsequently, detailed gene-specific and protein-specific data are fetched as static data from GCS and dynamic data from external APIs.

G2P portal sitemap

The homepage is the central hub for navigating to two primary modules of the G2P portal: (1) Gene/Protein Lookup and (2) Interactive Mapping, complemented by a top navigation bar featuring tabs for About, Documentation, Statistics, API, Release Logs and Feedback (Extended Data Fig. 2). The disclaimer for using data in the G2P portal is available in the About page. The Statistics page shows the overview of the latest data in the portal. Across the two main modules of the portal, a suite of visualization tools has been implemented for intuitive exploration of the data—protein sequence viewer, variant information and protein feature cards, variant and protein feature tables, protein structure viewer, and mutagenesis output viewer. Details of these viewers are available in ‘Data visualization tools in the G2P portal’ in Methods and Extended Data Fig. 3.

Users can access the Gene/Protein Lookup module by searching for a human gene or protein name. Upon valid input, users are directed to the gene/protein overview page containing the gene family and protein class information for the input gene and a navigation bar with tabs for five submodules, as follows. (1) The ‘protein sequence annotations’ tab hosts a protein sequence viewer that displays a complete list of protein features aggregated within the G2P portal (‘Protein features in the G2P portal’ in Methods). Users can choose a protein isoform identifier from the list of isoforms available for the selected protein, according to UniProtKB²¹. By default, protein features are displayed for the canonical protein isoform. (2) The ‘variant to protein sequence’ tab permits users to select an RNA transcript ID, to map variants from gnomAD⁹, ClinVar¹⁰ and HGMD¹¹ for the selected transcript onto the protein sequence, and displays the mapped variants on the protein sequence viewer on top of protein features (Fig. 5a and Extended Data Fig. 3a). Users can apply filters on variants (different source databases and database-specific filters, for example, AF for gnomAD and pathogenicity for ClinVar) and protein features from an easily (un-)selectable checklist to the left of the sequence viewer. Variant and protein feature data displayed on the protein sequence viewer can also be explored as a table view and are exportable in CSV and PDF formats. By clicking on a specific variant within the sequence viewer, users can expand the variant and protein feature cards with detailed information on the variant and protein features at the variant position (‘Data visualization tools in the G2P portal’ in Methods and Extended Data Fig. 3c). (3) Under the ‘variant to protein structure’ tab, users can find the list of available PDB and AlphaFold protein structures for the selected gene (Fig. 5b). After selecting a structure, users are directed to the ‘structure_map’ page, where they can map variants and protein features onto structures and view them in the protein structure viewer, coupled with the sequence viewer (Extended Data Fig. 3b). Both the protein sequence and structure viewers support dynamic feature and variant selection as described above. Outputs from the structure viewer are exportable in PyMOL-compatible formats. (4) The ‘gene to transcript to protein isoform mapping’ tab provides a table view of the mapping of identifiers across gene, transcript and protein sequences, downloadable in TSV format. The canonical protein isoform according to UniProtKB, the canonical transcript according to Ensembl and the MANE Select transcript for the input gene are indicated in the table. (5) The ‘additional resources’ tab offers links to external gene information, such as UCSC⁶⁰, ChEMBL⁶¹, DrugBank⁶², Orphanet⁶³ and OMIM⁶⁴. Moreover, the portal integrates MAVE data from MaveDB for 40 genes²⁸ (Supplementary Table 7). When available, the ‘additional resources’ tab displays the MAVE data (that is, mutagenesis scores) as heat maps. Additionally, the portal shows the title, description and a short method text describing the MAVE assay. The raw JSON files of scores are available to download alongside a hyperlink to the original source of data.

In the ‘Interactive Mapping’ module, users can start their exploration from either a gene/protein identifier or their own protein structures (respective case studies are presented in Fig. 6 and Supplementary Fig. 11). When starting with a gene/protein identifier, users can provide their target gene of interest as input and then choose a structure (PDB or AlphaFold structure). The portal retrieves the protein sequence and the list of available structures dynamically from the UniProt sequence API and PDB/AlphaFold APIs, respectively. Alternatively, users can start with their own protein structures and upload them in PDB format. In both scenarios, the final step opens a window for annotations, which provides a sample format and allows users to enter their annotations (variants, scores or features). The resulting data are displayed in the ‘view results’ section (Fig. 6a), featuring both sequence and structure viewers. When starting with a gene/protein identifier, users can also append additional annotations, such as protein features and variants, corresponding to the selected transcript or protein isoforms, and map them simultaneously with the user-uploaded data on protein sequences and structures.

Data visualization tools in the G2P portal

Protein sequence viewer

We adopted the RCSB Saguaro 1D Feature Viewer⁴⁴ and customized it for online visualization of variants and protein features mapped onto the protein sequence with dynamic application of filters on variants and protein features, referred to henceforth as the ‘protein sequence viewer’ (Extended Data Fig. 3a). The protein sequence viewer in the G2P portal is highly flexible. Variants and features are grouped under collapsible and expandable headers according to variant databases and feature groups and can be easily filtered in and out of the sequence viewer according to AF or pathogenicity criteria (see ‘Resources in the G2P portal’ in Results and ‘Protein features in the G2P portal’ in Methods for further details on variants and features integrated in the G2P portal). Users can download the customized mapping data as residue-wise annotations in CSV or PDF format. For example, Extended Data Fig. 3a shows the mapping of CBS gnomAD missense variants with the filter ‘singleton’ and ClinVar missense variants with the filter Pathogenic/Likely pathogenic, in the context of UniProt sequence features alone, with other protein features collapsed for clarity.

Protein structure viewer

We integrated the Mol* protein structure viewer⁵⁸ to visualize variants, protein features and scores on protein structures, simultaneously with protein sequence (Extended Data Fig. 3b). Users can map three types of data from sequence to structure: variants (mutation positions, as spheres), scores (continuous variable, as a heat map) and multiclass features (discrete/categorical variable discretely colored by category). Users can map, review and recolor features as desired, and apply data filters concurrently. For example, a user can filter CBS ClinVar missense PLP variants (orange spheres) and gnomAD synonymous singletons (green spheres) and map them concurrently with the domain annotation (light blue) from UniProtKB on the protein structure (Extended Data Fig. 3b). In the Interactive Mapping module, users can map user-uploaded annotations on the structure and can further add variant and feature annotations from available databases, to inspect user-uploaded data in the context of existing data.

The structure viewer is interconnected with the sequence viewer; when a user hovers over residues in the sequence, they are highlighted in the structure, and vice versa. The G2P portal is dynamically linked with and loads structures from the PDB⁶ and AlphaFold²⁵. Many AlphaFold structures show high-confidence structured domains surrounded by low-confidence regions, which can obscure the structured regions and globular domains and make the structure harder to analyze. The structure viewer therefore allows users to hide residues on AlphaFold structures based on the AlphaFold per-residue confidence (pLDDT). To export data for subsequent analysis, the structure viewer allows users to download structures and all accompanying features in a prepared PyMOL file, which includes both user-uploaded and G2P portal-provided features as annotations in the PyMOL session.

Variant and protein feature table

Users can view per-residue annotations of variants and protein features per gene (or protein) by clicking ‘view as table’ on top of the protein sequence viewer (Extended Data Fig. 3a). For gnomAD variants, the table includes the HGVS annotation of variants (HGVSp, HGVSc), AC and frequency information, homozygote count, and so on (for example, see https://g2p.broadinstitute.org/table/LDLR/P01130-1/ENST00000558518/missense/). For ClinVar variants, the details include genomic and protein consequences, ClinVar variation type, and other clinically relevant information as available in ClinVar, such as clinical significance, phenotypes and review status (for example, see https://g2p.broadinstitute.org/clinvartable/LDLR/P01130-1/ENST00000558518/clinvar_single/). Similarly, for HGMD variants, the table lists the variant consequences (genomic and protein), codon change, HGMD confidence and disease annotations (for example, see https://g2p.broadinstitute.org/hgmdtable/LDLR/P01130-1/ENST00000558518/missense/). The protein feature table (for example, see https://g2p.broadinstitute.org/features/LDLR/P01130/P01130-1/) includes all features described in ‘Protein features in the G2P portal’ in Methods. Data in these tabular views can be downloaded as machine-readable text files for further use, except for the licensed HGMD professional data. Note that all variant-level information reflects data available in source databases (gnomAD, ClinVar and HGMD), and users are referred to the respective databases for the definitions and details of that information.

Variant information and protein feature cards

From the protein sequence viewer, users can click on a variant position to view detailed variant information and protein features for that position as summary reports in ‘variant information’ and ‘protein feature’ cards, respectively (Extended Data Fig. 3c). These cards include details of a selected variant, which are also available in the ‘table view’ for the entire gene or protein (as described above in ‘Variant and protein feature table’). For example, in the case presented in Extended Data Fig. 3c, users can click on the CBS ClinVar missense variant at Gly116 on the protein sequence viewer, and a card is displayed below with details for variant p.Gly116Arg, for example, that it has been classified as a PLP variant and is associated with homocystinuria. At the same time, the protein feature card shows a summary of five categories of protein features for the residue position Gly116. The summary highlights that the variant p.Gly116Arg replaces a small, flexible amino acid (Gly) with a charged amino acid (Arg) (physicochemical properties), that the variant is located in a buried region of the protein structure with an accessible surface area of 7 Å² (structural features), and that this missense variant substitutes a known PTM site (PhosphoSitePlus PTMs). Whenever available, the variant and feature information in the cards is linked to its original source so that users can check for updates in the original data (Extended Data Fig. 3c).

Mutagenesis output viewer

We implemented a mutagenesis output viewer to display the MAVE from MaveDB²⁸, when available (Supplementary Table 7). Users can view MaveDB data under the ‘additional resources’ tab of the Gene/Protein Lookup module (Extended Data Fig. 2). For single missense mutations, a 21 × N heat map is displayed, where N is the range of mutations covered by MAVE perturbations with 21 rows for the 20 different amino acids and the stop codon possible at each position. Each value in the heat map corresponds to the score recorded in the MAVE, or the average of multiple scores if multiple scores were recorded for the same mutation. An example is shown in Extended Data Fig. 3d for CBS MAVE readouts collected via DMS-TileSeq at low levels of vitamin B₆. Scores show a clear distinction between residues 90 and 390 (low scores in blue) and residues at the N terminus and C terminus (high scores in red). For double mutant MAVEs, where two different residues were perturbed concurrently, an N × N heat map is displayed where the row and column each represent one of the two residue positions perturbed in the experiment. As with the single missense mutations, the value in the heat map corresponds to the reported score from the mutation or the average of all scores reported for the residue pair. Different MAVEs utilizing different techniques have different score scales and scores that require interpretation in the context of the methodology used by the corresponding MAVE. To this end, the G2P portal includes a brief description of the experimental technique and scoring methodology of the paper, as provided by MaveDB, and additional links to the score set page in MaveDB and the associated publication such that users can best understand the experimental conditions under which any specific score of interest was collected. To facilitate deeper analysis, the portal includes a downloadable JSON file with all coding and noncoding variants from MAVE data.
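The construction of the single-mutation heat map described above can be illustrated with a minimal Python sketch. It is not the portal's implementation: the row order, the `(position, alt, score)` record layout and the function name `build_heatmap` are assumptions made for illustration only. It shows how a 21 × N grid can be filled, with repeated measurements for the same mutation averaged.

```python
from collections import defaultdict

# Assumed row order (illustrative): 20 amino acids plus '*' for the stop codon.
ROWS = list("ACDEFGHIKLMNPQRSTVWY") + ["*"]


def build_heatmap(records, start, end):
    """Build a 21 x N grid of scores for positions start..end (inclusive).

    records: iterable of (position, alt, score) tuples (hypothetical layout).
    Cells without any measurement stay None; repeated measurements for the
    same (position, alt) pair are averaged.
    """
    n = end - start + 1
    sums = defaultdict(float)
    counts = defaultdict(int)
    for pos, alt, score in records:
        # Skip records outside the plotted range or with unknown symbols.
        if alt not in ROWS or not (start <= pos <= end):
            continue
        sums[(alt, pos)] += score
        counts[(alt, pos)] += 1

    grid = [[None] * n for _ in ROWS]
    for (alt, pos), total in sums.items():
        grid[ROWS.index(alt)][pos - start] = total / counts[(alt, pos)]
    return grid
```

A double-mutant N × N map could be built in the same way by keying the sums on the pair of perturbed positions instead of on `(alt, pos)`.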

Variant aggregation

We downloaded raw VCF files (https://gnomad.broadinstitute.org/downloads/) for genome and exome datasets from gnomAD⁹ v2.1.1 and selectively extracted variants that passed all variant filters for quality control (filter = ‘PASS’ flag) and possessed a valid HGVSp annotation. When the same variant was identified in both genome and exome datasets, we summed the AC and the sample count, subsequently calculating the merged AF value. Variant data from ClinVar¹⁰ (October 2023 release) were downloaded directly from the FTP site (https://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz). Variants were filtered based on the reference genome GRCh38 and a valid HGVSp annotation. From the HGMD professional release (version 2023.1)¹¹, variants on GRCh38 and with disease-causing state (variantType = ‘DM’ flag, indicating disease mutation) were extracted. Among those variants, we collected variants that have a valid HGVSp annotation retrieved from the Ensembl Variant Effect Predictor⁶⁵ REST API (https://rest.ensembl.org/vep/human/hgvs/:hgvs_notation/). Variants were excluded under the following conditions: (1) the reference or altered amino acid is not one of the 20 natural amino acids, or (2) the gene is not included in the list of genes from the G2P3D API, which contains only a reliable set of genes present in both the HGNC²² and UniProtKB²¹ databases. The resulting variant aggregation spans 18,014,632 gnomAD variants, 1,749,628 ClinVar variants and 312,783 HGMD variants mapped on protein sequences.
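The merging of exome and genome records for the same gnomAD variant can be sketched as follows. This is a minimal illustration, not the pipeline used for the portal: the dictionary keys `ac` and `an` and the function name `merge_gnomad` are hypothetical, and the ‘sample count’ in the text is assumed here to correspond to the allele number (AN), so that the merged AF is the summed AC divided by the summed AN.

```python
def merge_gnomad(exome=None, genome=None):
    """Merge exome and genome counts for one variant.

    Each argument is either None (variant absent from that dataset) or a
    dict with integer 'ac' (allele count) and 'an' (allele number).
    Returns a dict with the summed counts and the recomputed frequency.
    """
    present = [d for d in (exome, genome) if d is not None]
    if not present:
        raise ValueError("variant must be present in at least one dataset")

    ac = sum(d["ac"] for d in present)
    an = sum(d["an"] for d in present)
    # Recompute AF from the summed counts rather than averaging the two AFs,
    # which would weight the smaller dataset too heavily.
    af = ac / an if an else 0.0
    return {"ac": ac, "an": an, "af": af}
```

For example, a variant with AC = 3 and AN = 1,000 in exomes and AC = 1 and AN = 1,000 in genomes would receive a merged AF of 4/2,000 under these assumptions.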

Variant and feature mapping onto proteins

Genetic variants are annotated on the transcript; for example, variants sourced from gnomAD⁹ are annotated on Ensembl²³ transcripts (ENST-), and those from ClinVar¹⁰ and HGMD¹¹ are annotated on RefSeq²⁴ transcripts (NM-). Each variant aggregated from the databases was linked to its corresponding protein isoform IDs using the in-house G2P3D API (Fig. 1b) and then mapped onto its amino acid position upon fetching the protein sequence using UniProt REST API (https://rest.uniprot.org/uniprotkb/:UniProtAC.json/). Variants were mapped to both canonical and noncanonical protein sequences but only to structures of canonical protein sequences. Finally, proteins’ functional and structural features were annotated onto variant positions at the protein level (‘Protein Features in G2P portal’). The predicted structures cover the full-length protein sequences; however, the experimental structures often cover only parts of the protein and have gaps. We used the polymer_coverage API (https://www.ebi.ac.uk/pdbe/api/pdb/entry/polymer_coverage/:pdbid) to map experimental structure coverage to the sequence space for each chain. We then mapped protein residue positions from sequence to structure and consequently transferred the variants (that is, protein consequence positions) to protein structures, leveraging Mol*⁵⁸ functionality to properly align variants to positions before and after gaps. We found some limitations with polymer_coverage API and Mol* coverage detection, for example, a gap in a PTEN structure, PDB 5BUG (that is, missing region in crystallographic structure), is incorrectly reported in the API response and incorrectly aligned in the Mol* software.
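The effect of incomplete experimental coverage on variant transfer can be illustrated with a short sketch. It assumes that the observed segments of a chain have already been expressed as inclusive (start, end) ranges in protein sequence coordinates; the function name `split_by_coverage` and the input layout are hypothetical and do not reproduce the response format of the polymer_coverage API or the alignment logic in Mol*.

```python
def split_by_coverage(positions, observed_segments):
    """Separate sequence positions into those resolved in a structure
    and those falling into gaps.

    positions: iterable of 1-based residue positions on the protein sequence.
    observed_segments: list of inclusive (start, end) ranges, assumed to be
        in the same sequence coordinates as the positions.
    Returns (covered, in_gap) as two lists, preserving input order.
    """
    covered, in_gap = [], []
    for pos in positions:
        if any(start <= pos <= end for start, end in observed_segments):
            covered.append(pos)
        else:
            in_gap.append(pos)
    return covered, in_gap
```

Only the positions returned in `covered` can be displayed on the experimental structure; positions in `in_gap` remain visible on the sequence and, where available, on the full-length predicted structure.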

Protein features in the G2P portal

The G2P portal provides a comprehensive set of protein features on both protein sequences and structures, which include physicochemical properties of amino acids, sequence annotations collected from external databases such as UniProtKB²¹ and PhosphoSitePlus²⁷, 3D structural features collected from PDB⁶ and AlphaFold²⁵ and readouts from the MAVEs when available in MaveDB²⁸.

(1) The physicochemical properties of reference amino acids: The 20 natural amino acids are grouped into six categories based on physicochemical properties of their side chain R-groups: (i) aliphatic: alanine (Ala/A), isoleucine (Ile/I), leucine (Leu/L), methionine (Met/M) and valine (Val/V); (ii) aromatic: phenylalanine (Phe/F), tryptophan (Trp/W) and tyrosine (Tyr/Y); (iii) polar/neutral: asparagine (Asn/N), glutamine (Gln/Q), serine (Ser/S) and threonine (Thr/T); (iv) positively charged: arginine (Arg/R), histidine (His/H) and lysine (Lys/K); (v) negatively charged: aspartic acid (Asp/D) and glutamic acid (Glu/E); (vi) special: proline (Pro/P; has a cyclic side chain and cannot make backbone hydrogen bonds), glycine (Gly/G; lacks a side chain, which allows flexibility) and cysteine (Cys/C; has a reactive sulfhydryl group -SH in the side chain). In addition to these groupings, the molar mass (g mol⁻¹) and hydropathy index (a numerical measure reflecting the hydrophobicity of a side chain; the larger the number, the more hydrophobic the amino acid) of each amino acid are shown for the protein sequence.
(2) 3D structural features: The G2P portal provides precomputed annotations on structural features. These features are computed based on AlphaFold-predicted structures, aiming for extensive coverage. Secondary structures of amino acids refer to the local 3D conformations of the polypeptide backbone. DSSP³⁰ (Define Secondary Structure of Proteins) is the standard tool for determining secondary structure by classifying each residue into a three-class structure (H, helix; B, β-sheet/strand; C, loop/coil) or a nine-class structure (G, 3₁₀-helix; H, α-helix; I, π-helix; P, polyproline helix; B, isolated β-bridge; E, parallel β-sheet; S, bend; T, turn; C, loop/coil). We utilized DSSP to annotate both three-class and nine-class secondary structures on AlphaFold structures. When experimental structures are available (for example, from PDBe/SIFTS³¹), we provide PDBe/SIFTS secondary structures, which are derived from experimental structures, in a separate track. Additionally, DSSP calculates the accessible surface area (in Å²) and the backbone torsional phi/psi angles (in degrees) for each amino acid position within the context of the protein’s 3D structures. Furthermore, we include a per-residue confidence score produced by AlphaFold, known as pLDDT. The score ranges from 0 to 100 and categorizes the confidence as ‘very high’ (pLDDT > 90), ‘high’ (pLDDT between 70 and 90), ‘low’ (pLDDT between 50 and 70) or ‘very low’ (pLDDT < 50). Residues are color coded accordingly. It is important to note that residues with very low pLDDT scores may indicate that their structures are disordered in isolation.
(3) Sequence annotation from UniProtKB: We gathered the sequence annotations that describe various regions, domains, or sites of interest for a protein, elucidating its function, binding, sequence motif, domain/site/region, molecular preprocessing and more. The G2P portal offers 31 selected sequence annotations: active site, binding site, chain, coiled coil, compositional bias, cross-link, disulfide bond, DNA binding, domain, glycosylation, initiator methionine, intramembrane, lipidation, modified residue, motif, mutagenesis, non-adjacent residues, non-standard residue, non-terminal residue, peptide, propeptide, region, repeat, sequence conflict, sequence uncertainty, signal, site, topological domain, transit peptide, transmembrane and zinc finger.
(4) PTM: PTM refers to the covalent and enzyme-mediated modification of proteins to form mature proteins. We collected amino acid positions of seven different PTM types from the PhosphoSitePlus database: (i) acetylation: addition of an acetyl group; (ii) methylation: addition of a methyl group; (iii) O-GlcNAc: addition of N-acetylglucosamine, also known as O-linked N-acetylglucosamine; (iv) O-GalNAc: addition of N-acetylgalactosamine, also known as O-linked N-acetylgalactosamine; (v) phosphorylation: addition of a phosphoryl group; (vi) SUMOylation: addition of SUMO protein (small ubiquitin-like modifiers); (vii) ubiquitination: attachment of ubiquitin.
(5) Readouts from MAVE: MaveDB²⁸ is a public repository dedicated to housing datasets from MAVEs. These datasets primarily result from deep mutational scanning or massively parallel reporter assay experiments. When a gene/protein is available in MaveDB (Supplementary Table 7), amino acid positions displaying variants whose effect falls within the top and bottom 99th percentile are highlighted in the protein sequence and structure viewer. Only the top and bottom 99th percentile are displayed in the viewers for clarity of visualization, but the full data are displayed as heat maps under the ‘additional resources’ tab of the Gene/Protein Lookup module and are downloadable in JSON format (Extended Data Figs. 2 and 3d).
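Two of the per-residue annotations listed above, the physicochemical group of the reference amino acid and the pLDDT confidence category, can be expressed as simple lookups. The sketch below is illustrative only: the names `AA_GROUPS`, `GROUP_OF` and `plddt_band` are hypothetical, and the handling of scores that fall exactly on a threshold (90, 70 or 50), which the text does not specify, is an assumption that assigns them to the lower category.

```python
# Six physicochemical groups of the 20 natural amino acids, as listed in
# feature (1), keyed by one-letter code.
AA_GROUPS = {
    "aliphatic": "AILMV",
    "aromatic": "FWY",
    "polar/neutral": "NQST",
    "positively charged": "RHK",
    "negatively charged": "DE",
    "special": "PGC",
}

# Reverse lookup: one-letter code -> group name.
GROUP_OF = {aa: group for group, letters in AA_GROUPS.items() for aa in letters}


def plddt_band(score):
    """Return the confidence category for a pLDDT score in [0, 100].

    Assumption: scores exactly at 90, 70 or 50 fall into the lower category.
    """
    if not 0 <= score <= 100:
        raise ValueError("pLDDT must lie between 0 and 100")
    if score > 90:
        return "very high"
    if score > 70:
        return "high"
    if score > 50:
        return "low"
    return "very low"
```

A residue-hiding filter such as the one offered in the structure viewer could then keep only residues whose category is, for example, ‘high’ or ‘very high’.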

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

All resources in this paper are available on the G2P portal website (https://g2p.broadinstitute.org/). The G2P3D API is available at https://g2p.broadinstitute.org/api/gene/:geneName/protein/:uniprotId/gene-transcript-protein-isoform-structure-map/. The G2P Protein Feature API is available at https://g2p.broadinstitute.org/api/gene/:geneName/protein/:uniprotId/protein-features/. All APIs are easily accessible through the portal Swagger UI at https://g2p.broadinstitute.org/api-docs/. The following databases are accessed by the portal: HUGO Gene Nomenclature Committee (https://www.genenames.org/), Ensembl browser (https://www.ensembl.org/), NCBI Reference Sequence Database (https://www.ncbi.nlm.nih.gov/refseq/), UniProtKB (https://www.uniprot.org/), RCSB PDB (https://www.rcsb.org/), AlphaFold Protein Structure Database (https://alphafold.ebi.ac.uk/), gnomAD (https://gnomad.broadinstitute.org/), ClinVar (https://www.ncbi.nlm.nih.gov/clinvar/), the HGMD (https://www.hgmd.cf.ac.uk/ac/index.php/), PhosphoSitePlus (https://www.phosphosite.org/homeAction.action), MaveDB (https://mavedb.org/) and the PANTHER knowledgebase (https://www.pantherdb.org/). Source data are provided with this paper.
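As a minimal sketch of how the two endpoint templates above can be filled in, the following Python functions substitute a gene name and a UniProt accession into the stated paths. The function names are hypothetical and the sketch only builds URLs; it makes no assumption about the response format, for which the Swagger UI should be consulted.

```python
from urllib.parse import quote

# Base path taken from the endpoint templates given in the text.
BASE = "https://g2p.broadinstitute.org/api"


def g2p3d_url(gene_name, uniprot_id):
    """URL of the gene-transcript-protein isoform-structure map endpoint."""
    gene = quote(gene_name, safe="")
    acc = quote(uniprot_id, safe="")
    return f"{BASE}/gene/{gene}/protein/{acc}/gene-transcript-protein-isoform-structure-map/"


def protein_features_url(gene_name, uniprot_id):
    """URL of the protein feature endpoint."""
    gene = quote(gene_name, safe="")
    acc = quote(uniprot_id, safe="")
    return f"{BASE}/gene/{gene}/protein/{acc}/protein-features/"
```

For LDLR (UniProt accession P01130), the gene used in the table examples above, `g2p3d_url("LDLR", "P01130")` yields the first template with `:geneName` and `:uniprotId` replaced.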

Code availability

The G2P portal website is available at https://g2p.broadinstitute.org/. An open-source Python library and command line tool for data access via the portal APIs is available at https://github.com/broadinstitute/g2papi/. Documentation for API access for the G2P3D API is available at https://g2p.broadinstitute.org/api-docs/. The portal relies on third-party software including NodeJS v20.11.0 to run the server. ReactJS 17.0.2, Mol* 3.9.1 and rcsb-saguaro 2.5.12 software packages were used for visualization. All data analysis scripts and the Python client library were written using Python, version 3.10.

References

Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
Baek, M. et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science 373, 871–876 (2021).
Krishna, R. et al. Generalized biomolecular modeling and design with RoseTTAFold All-Atom. Science 384, eadl2528 (2024).
Lin, Z. M. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123–1130 (2023).
Hekkelman, M. L., Vries, I. D., Joosten, R. P. & Perrakis, A. AlphaFill: enriching AlphaFold models with ligands and cofactors. Nat. Methods 20, 205–213 (2023).
Berman, H. M. et al. The Protein Data Bank. Nucleic Acids Res. 28, 235–242 (2000).
Burley, S. K. et al. Protein Data Bank: the single global archive for 3D macromolecular structure data. Nucleic Acids Res. 47, gky949 (2018).
Patwardhan, A. et al. Data management challenges in three-dimensional EM. Nat. Struct. Mol. Biol. 19, 1203–1207 (2012).
Gudmundsson, S. et al. Variant interpretation using population databases: lessons from gnomAD. Hum. Mutat. 43, 1012–1030 (2022).
Landrum, M. J. et al. ClinVar: improving access to variant interpretations and supporting evidence. Nucleic Acids Res. 46, gkx1153 (2017).
Stenson, P. D. et al. The Human Gene Mutation Database (HGMD): optimizing its use in a clinical diagnostic or research setting. Hum. Genet. 139, 1197–1207 (2020).
Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).
Turner, T. N. et al. denovo-db: a compendium of human de novo variants. Nucleic Acids Res. 45, D804–D811 (2017).
Porto, E. M., Komor, A. C., Slaymaker, I. M. & Yeo, G. W. Base editing: advances and therapeutic opportunities. Nat. Rev. Drug Discov. 19, 839–859 (2020).
Lue, N. Z. et al. Base editor scanning charts the DNMT3A activity landscape. Nat. Chem. Biol. 19, 176–186 (2023).
Anzalone, A. V. et al. Search-and-replace genome editing without double-strand breaks or donor DNA. Nature 576, 149–157 (2019).
Dixit, A. et al. Perturb-seq: dissecting molecular circuits with scalable single-cell RNA profiling of pooled genetic screens. Cell 167, 1853–1866 (2016).
Andreadis, A., Gallego, M. E. & Nadal-Ginard, B. Generation of protein isoform diversity by alternative splicing: mechanistic and biological implications. Annu. Rev. Cell Biol. 3, 207–242 (1987).
Sherry, S. T. et al. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 29, 308–311 (2001).
den Dunnen, J. T. Describing sequence variants using HGVS nomenclature. in Genotyping: Methods and Protocols (eds White S. J. & Cantsilieris S.) 243–251 (Springer New York, 2017).
Apweiler, R. et al. UniProt: the Universal Protein knowledgebase. Nucleic Acids Res. 32, D115–D119 (2004).
Seal, R. L. et al. Genenames.org: the HGNC resources in 2023. Nucleic Acids Res. 51, D1003–D1009 (2022).
Hubbard, T. et al. The Ensembl genome database project. Nucleic Acids Res. 30, 38–41 (2002).
Pruitt, K. D., Tatusova, T. & Maglott, D. R. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 35, D61–D65 (2007).
Varadi, M. et al. AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res. 50, D439–D444 (2021).
Morales, J. et al. A joint NCBI and EMBL-EBI transcript set for clinical genomics and research. Nature 604, 310–315 (2022).
Hornbeck, P. V. et al. PhosphoSitePlus, 2014: mutations, PTMs and recalibrations. Nucleic Acids Res. 43, D512–D520 (2015).
Esposito, D. et al. MaveDB: an open-source platform to distribute and interpret data from multiplexed assays of variant effect. Genome Biol. 20, 223 (2019).
Mi, H., Muruganujan, A., Casagrande, J. T. & Thomas, P. D. Large-scale gene function analysis with the PANTHER classification system. Nat. Protoc. 8, 1551–1566 (2013).
Kabsch, W. & Sander, C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22, 2577–2637 (1983).
Dana, J. M. et al. SIFTS: updated Structure Integration with Function, Taxonomy and Sequences resource allows 40-fold increase in coverage of structure-based annotations for proteins. Nucleic Acids Res. 47, D482–D489 (2019).
Armstrong, D. R. et al. PDBe: improved findability of macromolecular structure data in the PDB. Nucleic Acids Res. 48, D335–D343 (2020).
Schrödinger, L. The PyMOL Molecular Graphics System, version 1.8 (2015).
Sancho, P. et al. Characterization of molecular mechanisms underlying the axonal Charcot–Marie–Tooth neuropathy caused by mutations. Hum. Mol. Genet 28, 1629–1644 (2019).
Cheng, J. et al. Accurate proteome-wide missense variant effect prediction with AlphaMissense. Science 381, eadg7492 (2023).
Ramos, E. M. et al. Characterizing genetic variants for clinical action. Am. J. Med. Genet. C Semin. Med. Genet. 166, 93–104 (2014).
Lau, T. K. & Leung, T. N. Genetic screening and diagnosis. Curr. Opin. Obstet. Gynecol. 17, 163–169 (2005).
Stark, Z. & Scott, R. H. Genomic newborn screening for rare diseases. Nat. Rev. Genet. 24, 755–766 (2023).
Hoffman-Andrews, L. The known unknown: the challenges of genetic variants of uncertain significance in clinical practice. J. Law Biosci. 4, 648–657 (2017).
Carter, T. C. & He, M. M. Challenges of identifying clinically actionable genetic variants for precision medicine. J. Healthc. Eng. https://doi.org/10.1155/2016/3617572 (2016).
Woodard, J., Iqbal, S. & Mashaghi, A. Circuit topology predicts pathogenicity of missense mutations. Proteins 90, 1634–1644 (2022).
Iqbal, S. et al. Comprehensive characterization of amino acid positions in protein structures reveals molecular effect of missense variants. Proc. Natl Acad. Sci. USA 117, 28201–28211 (2020).
Iqbal, S. et al. MISCAST: MIssense variant to protein StruCture Analysis web SuiTe. Nucleic Acids Res. 48, gkaa361 (2020).
Costain, G. & Andrade, D. M. Third-generation computational approaches for genetic variant interpretation. Brain 146, 411–412 (2023).
Watkins, X., Garcia, L. J., Pundir, S., Martin, M. J. & Consortium, U. ProtVista: visualization of protein sequence annotations. Bioinformatics 33, 2040–2041 (2017).
Bittrich, S. et al. RCSB Protein Data Bank: improved annotation, search and visualization of membrane protein structures archived in the PDB. Bioinformatics 38, 1452–1454 (2022).
Thormann, A. et al. Flexible and scalable diagnostic filtering of genomic variants using G2P with Ensembl VEP. Nat. Commun. https://doi.org/10.1038/s41467-019-10016-3 (2019).
Bragin, E. et al. DECIPHER: database for the interpretation of phenotype-linked plausibly pathogenic sequence and copy-number variation. Nucleic Acids Res. 42, D993–D1000 (2014).
Stephenson, J. D., Laskowski, R. A., Nightingale, A., Hurles, M. E. & Thornton, J. VarMap: a web tool for mapping genomic coordinates to protein sequence and structure and retrieving protein structural annotations. Bioinformatics 35, 4854–4856 (2019).
Stephenson, J. D. et al. ProtVar: mapping and contextualizing human missense variation. Nucleic Acids Res. https://doi.org/10.1093/nar/gkae413 (2024).
Hicks, M., Bartha, I., di Iulio, J., Venter, J. C. & Telenti, A. Functional characterization of 3D protein structures informed by human genetic diversity. Proc. Natl Acad. Sci. USA 116, 8960–8965 (2019).
Iqbal, S. et al. Delineation of functionally essential protein regions for 242 neurodevelopmental genes. Brain 146, 519–533 (2022).
Meller, A. et al. Predicting locations of cryptic pockets from single protein structures using the PocketMiner graph neural network. Nat. Commun. 14, 1177 (2023).
Schymkowitz, J. et al. The FoldX web server: an online force field. Nucleic Acids Res. 33, W382–W388 (2005).
Tiberti, M. et al. MutateX: an automated pipeline for in silico saturation mutagenesis of protein structures and structural ensembles. Brief. Bioinform. 23, bbac074 (2022).
Smedley, D. et al. BioMart—biological queries made easy. BMC Genomics 10, 22 (2009).
Segura, J., Rose, Y., Westbrook, J., Burley, S. K. & Duarte, J. M. RCSB Protein Data Bank 1D tools and services. Bioinformatics 36, btaa1012 (2020).
Sehnal, D. et al. Mol* Viewer: modern web app for 3D visualization and analysis of large biomolecular structures. Nucleic Acids Res. 49, W431–W437 (2021).
Madeira, F. et al. Search and sequence analysis tools services from EMBL-EBI in 2022. Nucleic Acids Res. 50, W276–W279 (2022).
Karolchik, D. et al. The UCSC Genome Browser Database. Nucleic Acids Res. 31, 51–54 (2003).
Gaulton, A. et al. ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res. 40, D1100–D1107 (2012).
Wishart, D. S. et al. DrugBank: a comprehensive resource for in silico drug discovery and exploration. Nucleic Acids Res. 34, D668–D672 (2006).
Weinreich, S. S., Mangon, R., Sikkens, J. J., Teeuw, M. E. E. & Cornel, M. C. Orphanet: a European database for rare diseases. Ned. Tijdschr. Geneeskd. 152, 518–519 (2008).
Hamosh, A., Scott, A. F., Amberger, J., Valle, D. & McKusick, V. A. Online Mendelian Inheritance In Man (OMIM). Hum. Mutat. 15, 57–61 (2000).
McLaren, W. et al. The Ensembl Variant Effect Predictor. Genome Biol. https://doi.org/10.1186/s13059-016-0974-4 (2016).

Acknowledgements

We thank G. Poncet-Montange and A. Wagner for scientific discussions. We gratefully acknowledge B. Hajian for the scientific illustration of the G2P portal and the PATTERN team at the Broad Institute for feedback on the website design. This work was supported by a grant from the Merkin Institute for Transformative Technologies in Healthcare (to S.I.). We also acknowledge support from NIH/NHGRI grants UM1HG011969 and RM1HG010461 and the Australian Government (to A.F.R.). The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.

Author information

These authors contributed equally: Seulki Kwon, Jordan Safer.

Authors and Affiliations

Center for the Development of Therapeutics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
Seulki Kwon, Jordan Safer, Arthur J. Campbell, Alex Burgin & Sumaiya Iqbal
PATTERN, Broad Institute of MIT and Harvard, Cambridge, MA, USA
Duyen T. Nguyen
Department of Software Engineering, Faculty of Mathematics and Physics, Charles University, Prague, Czech Republic
David Hoksza
Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Esch-sur-Alzette, Luxembourg
Patrick May
The Steve and Cindy Rasmussen Institute for Genomic Medicine, Nationwide Children’s Hospital, Columbus, OH, USA
Jeremy A. Arbesfeld
Bioinformatics Division, The Walter and Eliza Hall Institute of Medical Research, Parkville, Victoria, Australia
Alan F. Rubin
Department of Medical Biology, University of Melbourne, Parkville, Victoria, Australia
Alan F. Rubin
Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
Sumaiya Iqbal
Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA, USA
Sumaiya Iqbal
Cancer Data Sciences, Dana-Farber/Harvard Cancer Center, Boston, MA, USA
Sumaiya Iqbal

Authors

Seulki Kwon
Jordan Safer
Duyen T. Nguyen
David Hoksza
Patrick May
Jeremy A. Arbesfeld
Alan F. Rubin
Arthur J. Campbell
Alex Burgin
Sumaiya Iqbal

Contributions

S.I. conceptualized the project, designed the study and led the development of the G2P portal. S.K. performed the data analyses. J.S., D.T.N. and D.H. developed the G2P website. J.S., S.K., P.M., J.A.A. and A.F.R. contributed to the data curation. S.K., J.S. and S.I. wrote the manuscript. D.T.N., D.H., P.M., J.A.A., A.F.R., A.J.C. and A.B. reviewed the final manuscript. S.I., A.J.C. and A.B. contributed to the funding acquisition. S.I. supervised the project.

Corresponding author

Correspondence to Sumaiya Iqbal.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Methods thanks Russ Altman and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Arunima Singh, in collaboration with the Nature Methods team. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 The Google Cloud infrastructure of the G2P portal.

This figure illustrates the web implementation of the portal. The frontend is implemented in React.js and includes a customized version of the RCSB Saguaro 1D Feature Viewer and Mol* as the protein sequence and structure viewers, respectively. The backend is implemented in Node.js and runs on Google App Engine. Users can query, upload, and retrieve data from the portal; arrows in different colors show the flow of user-uploaded data (orange), static data (cyan), and dynamic data (pink). All static data are stored in Google Cloud Storage. All user-uploaded data remain in the user’s browser, which keeps the data confidential.

Extended Data Fig. 2 The sitemap of the Genomics 2 Proteins (G2P) portal.

From the home page, users can access the About, Documentation, Release Logs, APIs in the portal, and Feedback pages from the navigation bar at the top of the portal. There are two main modules in the portal: (1) Gene/Protein Lookup, accessible by searching for a human gene or protein name; (2) Interactive Mapping, accessible via secure Google sign-in after clicking the button displayed on the home page. The Gene/Protein Lookup module has five submodules: protein sequence annotation, variant mapping to protein sequence, variant mapping to protein structure, gene to transcript to protein isoform mapping, and links to additional resources. The Interactive Mapping module has two submodules, which let users start from any human gene or from a protein structure and map user-uploaded data onto the target protein’s sequence and structure. The user input, data sources, visualization methods, and downloadable data formats within each submodule are listed in the figure.

Extended Data Fig. 3 Data visualization tools in the G2P portal.

(a) Protein sequence viewer. This viewer displays protein residue-wise variants and protein features for the selected gene and transcript. Variants can be filtered based on protein consequences and database-specific filters. Data displayed within the viewer can be exported in tabular format (View as table button) and downloaded in CSV or PDF format (Download button). The figure shows gnomAD missense variants (singletons; blue) and ClinVar missense variants (pathogenic/likely pathogenic; orange) for gene CBS and transcript NM_000071, along with residue-wise physicochemical properties and UniProt sequence annotations in the protein sequence viewer. (b) Protein structure viewer. In the G2P portal, the structure viewer is coupled with the sequence viewer so that variants and protein features shown in the sequence viewer can be mapped interactively onto the structure. Users can click a track in the sequence viewer to select the variants or features to visualize in the structure viewer. Users can download the customized mapping results in a PyMOL-compatible file. The figure displays the concurrent mapping of gnomAD synonymous singleton variants (green spheres), ClinVar missense pathogenic/likely pathogenic variants (orange spheres), and the Domain annotation from UniProtKB (light blue) onto the structure (PDB: 7QGT). (c) Variant information and protein feature cards. These cards provide a per-variant summary of variant details and protein features for the variant position (see Methods: Data visualization tools in the G2P portal, for details). The example in this figure shows the details of CBS variant Gly116Arg from ClinVar and the physicochemical, structural, and functional features for the variant position, Gly116. The variant and features are linked to their sources whenever available. (d) Mutagenesis output viewer. This viewer shows the mutagenesis readouts for a gene, when available in MaveDB²⁸, as a heatmap. By hovering over the heatmap, users can view the readouts from the assay, and they can download the entire score set by clicking the download icon. The figure highlights residues 90–390, whose mutagenesis readouts differ from those of the rest of the protein.
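The PyMOL-compatible export mentioned in panel (b) can be pictured with a minimal Python sketch. This is not the portal's actual export format: the function name `pymol_script`, the selection names and the gnomAD residue numbers are hypothetical, and residue numbers are assumed to already follow the numbering of the structure file. Only the PDB ID 7QGT, position 116 and the sphere colors are taken from the caption. The emitted lines use standard PyMOL commands (`fetch`, `hide`, `show`, `color`, `select`, `deselect`).

```python
from typing import Dict, Iterable, Tuple


def pymol_script(pdb_id: str, tracks: Dict[str, Tuple[str, Iterable[int]]]) -> str:
    """Build a PyMOL command script that shows residue sets as colored spheres.

    `tracks` maps a selection name (no spaces) to (PyMOL color name, residue numbers).
    Residue numbers are assumed to be in the structure's own numbering.
    """
    lines = [
        f"fetch {pdb_id}, async=0",  # download the entry and wait for it to load
        "hide everything",
        "show cartoon",
        "color grey80",
    ]
    for name, (color, residues) in tracks.items():
        # PyMOL joins several residue numbers with '+', e.g. "resi 45+210".
        resi = "+".join(str(r) for r in sorted(set(residues)))
        if not resi:
            continue  # skip tracks with no residues
        lines.append(f"select {name}, resi {resi} and name CA")
        lines.append(f"show spheres, {name}")
        lines.append(f"color {color}, {name}")
    lines.append("deselect")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    example = pymol_script(
        "7QGT",
        {
            "clinvar_plp": ("orange", [116]),     # Gly116, from the caption
            "gnomad_syn": ("green", [45, 210]),   # hypothetical positions
        },
    )
    print(example)
```

Saving the returned text as a `.pml` file and opening it in PyMOL should give a view of the same kind as panel (b), provided PyMOL can download the entry.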

Supplementary information

Supplementary Information

Supplementary Notes 1–6, Supplementary Figs. 1–11 and Supplementary Table 1.

Reporting Summary

Peer Review File

Supplementary Tables 2–7

Supplementary Table 2 The list of genes in the G2P3D API. The table includes the HGNC symbols, HGNC IDs, locus groups and corresponding UniProtKB accessions for 20,292 human genes.

Supplementary Table 3 HGNC gene families. The table lists the HGNC gene family annotations, with the total count and the HGNC gene symbols assigned to each family.

Supplementary Table 4 PANTHER protein classes. The table lists the PANTHER protein classes, with the total count and the UniProtKB identifiers assigned to each class.

Supplementary Table 5 The number of available PDB structures per protein. The table includes the UniProtKB accession of each protein, the corresponding gene symbols and the count of PDB structures, along with the PDB identifiers, for 7,973 proteins. The chain ID corresponding to each protein is given in parentheses; for example, for B2M/P61769, 1A1M(B) denotes that chain B of PDB structure 1A1M corresponds to B2M/P61769.

Supplementary Table 6 Annotations for the interactive mapping case study. The table includes an example of a user-uploaded annotation for the interactive mapping case study of DNMT3A presented in Fig. 6 and Supplementary Fig. 10. Annotations were obtained from the base editor scanning study of DNMT3A²¹.

Supplementary Table 7 A list of genes available from MaveDB.
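The `PDB(chain)` notation used in Supplementary Table 5 can be split programmatically. The sketch below parses entries such as `1A1M(B)`. The function names are hypothetical, and the code assumes a classic four-character PDB ID, a single alphanumeric chain label inside the parentheses, and comma-separated entries within a cell (the separator is an assumption and is not stated in the table description).

```python
import re
from typing import List, Tuple

# Classic PDB IDs are four characters: a digit followed by three alphanumerics.
_ENTRY = re.compile(r"^([0-9][A-Za-z0-9]{3})\(([A-Za-z0-9]+)\)$")


def parse_pdb_chain(entry: str) -> Tuple[str, str]:
    """Split an entry like '1A1M(B)' into (PDB ID, chain ID)."""
    match = _ENTRY.match(entry.strip())
    if match is None:
        raise ValueError(f"not in PDB(chain) form: {entry!r}")
    # PDB IDs are case-insensitive, so normalize them; chain IDs are case-sensitive.
    return match.group(1).upper(), match.group(2)


def parse_pdb_cell(cell: str, sep: str = ",") -> List[Tuple[str, str]]:
    """Parse a delimited cell of PDB(chain) entries; `sep` is an assumed delimiter."""
    return [parse_pdb_chain(part) for part in cell.split(sep) if part.strip()]


if __name__ == "__main__":
    print(parse_pdb_chain("1A1M(B)"))  # ('1A1M', 'B')
```

Returning the chain separately makes it straightforward to select only the chain that belongs to the protein of interest when a structure contains several molecules.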

Source data

Source Data Fig. 2

Statistical source data.

Source Data Fig. 3

Statistical source data.

Source Data Fig. 4

Statistical source data.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Kwon, S., Safer, J., Nguyen, D.T. et al. Genomics 2 Proteins portal: a resource and discovery tool for linking genetic screening outputs to protein sequences and structures. Nat Methods 21, 1947–1957 (2024). https://doi.org/10.1038/s41592-024-02409-0

Received: 03 January 2024
Accepted: 09 August 2024
Published: 18 September 2024
Version of record: 18 September 2024
Issue date: October 2024
DOI: https://doi.org/10.1038/s41592-024-02409-0
