Fig. 3: Simulated patients mimic real-world patients.
From: Simulation of undiagnosed patients with novel genetic conditions

Diagnosed, real-world patients from the Undiagnosed Diseases Network (orange) and a disease-matched cohort of simulated patients (teal) have similar numbers of (a) candidate genes per patient (average μ of 13.13 vs. 13.94) and (b) positive phenotype terms per patient (average of 24.08 vs. 21.57). (c) Real patients (orange) and simulated patients (teal) are indistinguishable based on their annotated positive phenotype terms within each Orphanet disease category, as visualized using non-linear dimension reduction via a Uniform Manifold Approximation and Projection (UMAP) plot. The horizontal and vertical axes are uniform across all plots. The number of real patients within each disease category, n, is listed in the corner of each plot; there are 20 simulated patients for each real patient. (d) For each real-world patient, all simulated patients in the disease-matched cohort are ranked randomly (black) and by the Jaccard similarity of their phenotype terms to the query real-world patient (purple). The Empirical Cumulative Distribution Function (ECDF) plot shows that the basic Jaccard similarity metric retrieves simulated patients with the same disease as the query real patient more accurately than random retrieval does. (e) The distributions of shortest path distances between all non-causal candidate and true causal genes in a gene–gene interaction network are indistinguishable between real-world and simulated patients. n is the number of patients in each patient category. Source data are provided as a Source Data file.
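
The retrieval comparison in panel d can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy cohort, and the placeholder phenotype labels (`T1`, `T2`, ...) are assumptions made for illustration. The sketch ranks simulated patients by the Jaccard similarity of their phenotype term sets to a query patient, ranks them again in random order as a baseline, and reports the rank of the first simulated patient with the same disease as the query.

```python
import random


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two term collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_by_jaccard(query_terms, cohort):
    """Sort simulated patients from most to least similar to the query.

    cohort is a list of (patient_id, disease, terms) tuples (hypothetical layout).
    """
    return sorted(cohort, key=lambda p: jaccard(query_terms, p[2]), reverse=True)


def rank_randomly(cohort, seed=0):
    """Baseline: return the cohort in a random order."""
    shuffled = list(cohort)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def first_match_rank(ranked, disease):
    """1-based rank of the first simulated patient with the query's disease."""
    for rank, (_, patient_disease, _) in enumerate(ranked, start=1):
        if patient_disease == disease:
            return rank
    return None


# Toy data: placeholder labels, not real phenotype terms or diseases.
TOY_COHORT = [
    ("sim1", "disease_X", {"T1", "T2", "T3"}),
    ("sim2", "disease_Y", {"T7", "T8"}),
    ("sim3", "disease_X", {"T1", "T9"}),
]
QUERY_TERMS = {"T1", "T2", "T4"}
QUERY_DISEASE = "disease_X"

if __name__ == "__main__":
    by_similarity = rank_by_jaccard(QUERY_TERMS, TOY_COHORT)
    by_chance = rank_randomly(TOY_COHORT)
    print("Jaccard rank of first match:", first_match_rank(by_similarity, QUERY_DISEASE))
    print("Random rank of first match:", first_match_rank(by_chance, QUERY_DISEASE))
```

Collecting the first-match rank for every real-world query patient and plotting the fraction of queries with rank at or below each value gives an ECDF of the kind shown in the panel, with one curve per ranking strategy.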