Extended Data Fig. 3: Simulation experiment for the identification of pairs of read and genotype data derived from the same individuals. | Nature Microbiology

From: Reconstruction of the personal information from human genome reads in gut metagenome sequencing data

a, Histograms of the standardized likelihood score for cases in which the read data and genotype data originate from different samples. The blue line is the density plot made from the simulation results and the black line is the normal distribution.

b, Q-Q plots of the observed versus expected P values for cases in which the read data and genotype data originate from different samples. The red line indicates y = x.

c, Scatter plots comparing the P values calculated from the background distribution of the score (Pdistribution) with the P values calculated empirically by the permutation procedure (Ppermutation), for cases in which the read and genotype data originate from different samples. The red line indicates y = x.

d, Re-identification from a set of genotype data based on the 175 (5 WGS data × 5 random seeds × 7 coverages) simulated read data. The x axis of the scatter plots indicates the number of bases used to calculate the likelihood scores. The y axis indicates the likelihood scores (left and middle) or P values (right). The colour of each point indicates the result of one of the 17,500 tests (100 genotype data × 25 simulated data × 7 coverages). Pairs of read and genotype data are identified on the basis of the top score (left) or P values (middle and right). For part of the middle plot, the distribution of the standardized likelihood score is shown, stratified by whether or not the simulated read data and the genotype data derive from the same individual.

e, Re-identification from a set of genotype data based on the 105 (3 WGS data × 5 random seeds × 7 coverages) simulated read data. The three samples with WGS data have a familial relationship, as indicated. The x axis of the scatter plots indicates the number of bases used to calculate the likelihood scores. The y axis indicates the likelihood scores (left and middle) or P values (right). The colours of the points represent the categories of the pairs of read data and genotype data (left) or the results of the P value-based predictions (100 genotype data × 15 simulated data × 7 coverages = 10,500 tests).

FN, false negative; FP, false positive; TN, true negative; TP, true positive.
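
The quantities compared in panels c to e can be illustrated with a minimal Python sketch. It is not the authors' implementation: the legend does not define the likelihood score, the permutation procedure or the significance threshold. The sketch therefore assumes a standard normal background for the standardized score, a one-sided upper-tail test, and a caller-supplied threshold `alpha`. All function names are hypothetical.

```python
import math
import random


def p_from_distribution(z):
    """Upper-tail P value of a standardized score z.

    Assumption: the background distribution is standard normal.
    """
    return 0.5 * math.erfc(z / math.sqrt(2))


def p_from_permutation(z, null_scores):
    """Empirical upper-tail P value from scores of non-matching pairs.

    Uses the common (count + 1) / (n + 1) correction so that the
    estimate is never exactly zero (an assumption, not from the legend).
    """
    exceed = sum(1 for s in null_scores if s >= z)
    return (exceed + 1) / (len(null_scores) + 1)


def classify(same_individual, p_value, alpha):
    """Label one read/genotype test as TP, FP, TN or FN."""
    called_match = p_value < alpha
    if called_match:
        return "TP" if same_individual else "FP"
    return "FN" if same_individual else "TN"


# Test counts stated in the legend.
reads_d = 5 * 5 * 7        # WGS data x random seeds x coverages
tests_d = 100 * 25 * 7     # genotype data x simulated data x coverages
reads_e = 3 * 5 * 7
tests_e = 100 * 15 * 7

# Hypothetical null scores drawn from a standard normal distribution,
# standing in for scores of read/genotype pairs from different samples.
rng = random.Random(0)
null_scores = [rng.gauss(0.0, 1.0) for _ in range(10_000)]

z = 2.0
p_dist = p_from_distribution(z)
p_perm = p_from_permutation(z, null_scores)
```

When the null scores really do follow the assumed background distribution, `p_dist` and `p_perm` should be close to each other, which is the agreement with y = x that panel c is checking.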