Fig. 2

Petagraph dataset schema. (a) General schema design. “Concept” nodes (blue circle) form the bidirectionally connected backbone of the Petagraph schema; each is an organizing node where multiple annotation systems for a single entity converge. For example, Type 2 diabetes (T2D) may have many different definitions and codings across systems such as ICD10, MONDO, HPO, or SNOMED, all of which can link to one Concept node for T2D. Each system has its own Code node, which represents that system’s coded identifier and connects to the corresponding Concept. The CUI-CUI connector between Concept nodes represents over 2,000 edge types defining Concept-to-Concept relationships. Code nodes (yellow circle) store the systematic IDs from the different systems, and Terms (brown circle) give human-readable definitions for Codes and Concepts. Semantic types (light blue circle) classify the Concept node type, while Definition nodes (orange circle) provide a unified definition of the Concept node. Bidirectional relationship links exist only between Concept nodes, which simplifies queries and improves query times. (b) Specific schema example. The Concept node for TP53 (C0079419) is shown with three of its Codes and their Terms: HGNC (HGNC:11998), NCI Gene (NCI:C17359), and ENTREZ (ENTREZ:7157). Other Code nodes also connect to C0079419 but are not shown. Here, too, the only bidirectional connections are those between Concept nodes, in this case between TP53’s Concept node and one of the HPO Concept nodes. PT: Preferred term. STY: Semantic type. DEF: Definition.
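
The schema in this caption can be illustrated with a minimal in-memory sketch. This is not Petagraph's implementation or query interface; the class names (`Code`, `Concept`, `Graph`), the relationship labels, the term strings, and the HPO placeholder identifiers are all assumptions made for illustration. Only the TP53 CUI and its three Code identifiers are taken from panel (b).

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Code:
    sab: str       # source system, e.g. "HGNC"
    code_id: str   # identifier as written in the caption, e.g. "HGNC:11998"
    terms: dict[str, str] = field(default_factory=dict)  # term type (e.g. "PT") -> text


@dataclass
class Concept:
    cui: str
    codes: list[Code] = field(default_factory=list)
    semantic_types: list[str] = field(default_factory=list)  # STY
    definitions: list[str] = field(default_factory=list)     # DEF


class Graph:
    """Toy container: Codes hang off Concepts; only Concepts link to each other."""

    def __init__(self) -> None:
        self.concepts: dict[str, Concept] = {}
        self.edges: set[tuple[str, str, str]] = set()  # (source CUI, type, target CUI)

    def add_concept(self, concept: Concept) -> Concept:
        self.concepts[concept.cui] = concept
        return concept

    def link(self, a: str, forward: str, b: str, inverse: str) -> None:
        # Concept-to-Concept relationships are stored in both directions.
        self.edges.add((a, forward, b))
        self.edges.add((b, inverse, a))

    def concept_for_code(self, code_id: str) -> Concept | None:
        # Resolve a source-system identifier to the Concept it attaches to.
        for concept in self.concepts.values():
            if any(code.code_id == code_id for code in concept.codes):
                return concept
        return None

    def neighbors(self, cui: str) -> list[tuple[str, str]]:
        return sorted((rel, dst) for src, rel, dst in self.edges if src == cui)


g = Graph()
tp53 = g.add_concept(Concept(
    cui="C0079419",
    codes=[
        # Identifiers from the caption; the "PT" strings are placeholders.
        Code("HGNC", "HGNC:11998", {"PT": "TP53"}),
        Code("NCI", "NCI:C17359", {"PT": "TP53"}),
        Code("ENTREZ", "ENTREZ:7157", {"PT": "TP53"}),
    ],
))
# Hypothetical HPO Concept: the caption does not give its CUI or Code.
hpo = g.add_concept(Concept(
    cui="HPO_CONCEPT_PLACEHOLDER",
    codes=[Code("HPO", "HPO:PLACEHOLDER")],
))
# Hypothetical relationship labels standing in for one of the 2,000+ edge types.
g.link(tp53.cui, "associated_with", hpo.cui, "inverse_associated_with")

# Any of the three identifiers resolves to the same Concept.
resolved = {g.concept_for_code(c).cui for c in ("HGNC:11998", "NCI:C17359", "ENTREZ:7157")}
```

In this sketch, looking up an entity by any source identifier takes one hop from Code to Concept, and all traversal between entities then happens on Concept-to-Concept edges, mirroring the design choice the caption describes.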