Abstract
The fine-scale genetic structure within populations, focusing on demographic histories and migration patterns, has been explored previously. However, limited attention has been paid to understanding how genetic structure influences lifestyle and dietary habits within an epidemiological framework. This study explores the fine-scale genetic structure within a homogeneous Japanese population using advanced unsupervised learning techniques—Principal Component Analysis (PCA), Uniform Manifold Approximation and Projection (UMAP), and Density-Based Spatial Clustering of Applications with Noise (DBSCAN)—coupled with direct-to-consumer genetic testing data. We investigate the associated genetic factors and examine the relationship between the genetic structure and geographic ancestry. Additionally, using cross-sectional data and multinomial logistic regression, we further elucidate the nuanced impacts of lifestyle and dietary factors across genetic clusters, emphasizing the importance of integrating genetic data with epidemiological research. This study introduces a new framework for genetic epidemiology that considers both genetic and environmental influences.
Introduction
In the era of personalized and customized health, the establishment and expansion of direct-to-consumer genetic testing (DTC-GT) have opened new avenues for gaining deeper insights into genetic information and disease risk prediction derived from prior genome-wide association studies (GWAS)1. The extensive data from DTC-GT companies like 23andMe, which are widely used in scientific research, highlight the growing recognition of the value of such data2. Owing to its ease of use, genetic research is increasingly utilizing DTC-GT data, aiming for more accurate and personalized healthcare applications. In Japan, MYCODE, provided by DeNA Life Science (DLS), Inc., (Tokyo, Japan), has taken the lead in the DTC-GT market3. It empowers users to make informed health decisions, potentially improves their behavioral habits, and contributes to scientific research, in which the MYCODE community voluntarily participates3.
In the landscape of GWAS, challenges stem from the population structure, which poses risks of false positives owing to the inadvertent inclusion of individuals with undetected admixtures4, prompting a growing support for machine-learning methods5. With the increasing availability of extensive genetic data, data-driven approaches have emerged as promising tools6,7. Although principal component analysis (PCA), a linear transformation method to identify principal components (PCs), remains prevalent in genetic research8, a notable shift has occurred toward employing unsupervised machine-learning methods, including t-distributed stochastic neighbor embedding (t-SNE) and uniform manifold approximation and projection (UMAP)9,10. These methods, used for reducing dimensionality in genotype data, improve the visualization of genetic variations, help delineate population structures, and mitigate stratification biases in GWAS6,11, especially after PCA initialization12,13. UMAP leverages Riemannian geometry and algebraic topology, enabling it to capture and retain the complex high-dimensional topology of data points within a lower-dimensional space10. This method has demonstrated superior performance in capturing fine-scale population structures across various genetic data types, such as bulk transcriptomics, single-cell RNA sequencing, and single nucleotide polymorphism (SNP) genotype data12,13,14. Its efficacy extends to visualizing ancestral composition in cohorts and discerning intricate patterns in diverse biobanks, establishing UMAP as a valuable tool to unravel the complexities of genetic data15.
The Japanese population is considered genetically homogeneous and characterized by relatively low genetic diversity16. Studies have been dedicated to revealing the fine-scale genetic structure within the Japanese population, recognizing the existing genetic diversity based on the dual-structure model17,18. A recent Biobank Japan study found that the PCA-UMAP dimensionality reduction method outperformed other techniques, demonstrating its superior efficacy in capturing the intricate structure of the Japanese subpopulation12. PCA-UMAP merges PCA and UMAP by applying UMAP to the principal components of genotype data, resulting in a more accurate classification of population clusters. This approach also offers computational advantages and reduces statistical noise. The fine-scale genetic structure of the Japanese population, primarily focusing on uncovering ancestry and geographical differences in the admixture proportion due to historical migration, has been explored earlier12,19. These studies clearly revealed the underlying population structure in Japan, emphasizing the genetic differences among islands and providing evidence of historical migration. Regional differences in the phenotypes of complex traits in Japanese individuals have been attributed to regional genetic differences20. Previous investigations have concentrated on the geographic distribution of genetic backgrounds, primarily through a polygenic risk score, which is a numerical representation of the impact of genetic variants on a trait identified by GWAS12,21. Isshiki et al.22 found that the distribution of polygenic height scores could explain the height gradient observed in Japan. Similarly, Sakaue et al.12 discovered variations in the polygenic risk score for complex human traits among Japanese subpopulations. However, few studies have explored how population genetic structure influences comprehensive lifestyle and dietary habits within an epidemiological framework.
To address this gap, our objective was to investigate the intricate interactions between fine-scale genetic structure, lifestyle, and dietary habits in the Japanese population using a combination of methods. In this study, we applied unsupervised learning methods, including PCA, UMAP, and the density-based spatial clustering of applications with noise (DBSCAN) algorithm23, to MYCODE genotype data to investigate the population’s fine-scale genetic structure, represented as clusters. We subsequently employed GWAS and post-GWAS analyses to identify genetic factors associated with the clusters as traits (Fig. 1a and b). Following this, we conducted a comprehensive examination of the associations of geographic ancestry, lifestyle, and dietary habits with the genetic clusters using statistical analyses (Fig. 1c). This approach aimed to provide an innovative perspective combining genetic insights with epidemiological statistics. Through this procedure, we revealed intricate interactions between the population’s fine-scale genetic structure and lifestyle and dietary habits (Fig. 1d).
a The study population comprised 55,551 Japanese individuals who purchased the MYCODE genetic kit. Genotypes were obtained using microarrays, and genetic data were processed through imputation, phasing, quality control, and linkage disequilibrium pruning. Figure 1a includes a person icon by Valerie Lamm and a Japan map icon by Rahe, both from The Noun Project (CC BY 3.0). b After applying eligibility criteria, genotype allele matrices from 49,440 participants were analyzed using PCA-UMAP and DBSCAN to uncover fine-scale genetic structure within the homogeneous population. This analysis resulted in 43,726 participants being retained for further analysis, including FST analysis, GWAS, and post-GWAS analyses for genomic interpretation. c Statistical analyses were conducted on lifestyle and dietary data for the same 43,726 participants. Geographical analyses focused on follow-up questionnaire responses from 2268 participants. d This approach integrates genomic interpretation with epidemiological methods to explore interactions among population structure, lifestyle, and dietary habits. Icons in Figure 1c and d were created by Freepik from Flaticon.com.
Results
Dimensionality reduction and clustering
After conducting genotype data quality control and LD pruning, we performed PCA on 395,665 pruned-in variants from 49,440 participants (Supplementary Figs. 1 and 2). Subsequently, UMAP was applied to the first 20 PCs of the genotype data, and the result is depicted in a two-dimensional scatter plot (Fig. 2a). The PCA-UMAP projection provided a discrete separation of individuals into six clusters, as observed by visual inspection (Fig. 2b). This observation established that the combination of unsupervised learning methods, PCA and UMAP, effectively distinguished participants within the same ethnic population. Furthermore, after DBSCAN clustering, we excluded noise points and participants who did not belong to the six largest clusters. This resulted in 43,726 participants being distributed among the top six clusters while maintaining the same projection shape (Fig. 2b).
This figure presents two-dimensional visualizations of genotype data from Japanese participants, projected using the PCA-UMAP dimensionality reduction technique. DBSCAN was then used to cluster the PCA-UMAP projection, with six distinct clusters represented by different colors. a The PCA-UMAP projection prior to clustering. b The PCA-UMAP projection after DBSCAN clustering. c A pie chart illustrating the distribution of participants across the clusters, with each pie segment colored to correspond with the cluster assignments from Figure 2b.
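The clustering step can be illustrated with a minimal sketch. The code below is a toy, pure-Python DBSCAN applied to hypothetical two-dimensional points that stand in for a PCA-UMAP projection, followed by a filter that drops noise and keeps only the largest clusters. The function names, the eps and min_samples values, and the data are illustrative assumptions, not the implementation or settings used in this study.

```python
from collections import Counter
from math import dist

NOISE = -1


def dbscan(points, eps, min_samples):
    """Label each point with a cluster id (0, 1, ...) or NOISE (-1)."""
    labels = [None] * len(points)

    def neighbors(i):
        # Brute-force radius query (includes the point itself).
        # Acceptable for a toy example, far too slow for tens of thousands of points.
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb = neighbors(j)
            if len(nb) >= min_samples:  # j is a core point: keep expanding
                queue.extend(nb)
    return labels


def keep_largest(labels, k):
    """Return indices of points that belong to the k largest non-noise clusters."""
    sizes = Counter(label for label in labels if label != NOISE)
    kept = {c for c, _ in sizes.most_common(k)}
    return [i for i, label in enumerate(labels) if label in kept]


# Hypothetical 2D embedding: a 3x3 blob, a 2x2 blob, and one isolated point.
points = (
    [(0.1 * x, 0.1 * y) for x in range(3) for y in range(3)]
    + [(5 + 0.1 * x, 5 + 0.1 * y) for x in range(2) for y in range(2)]
    + [(10.0, 10.0)]
)
labels = dbscan(points, eps=0.15, min_samples=3)
kept = keep_largest(labels, k=1)  # analogous to retaining only the top clusters
```

In this toy input the isolated point is labeled as noise and only the nine points of the larger blob survive the filter, which mirrors the exclusion of noise and minor clusters described above.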
Cluster 1, represented in blue in Fig. 2b, held a central position in the PCA-UMAP plot, while Cluster 6 was positioned the furthest from Cluster 1. This spatial arrangement suggests a clear and structured pattern within the homogeneous Japanese genetic data, indicating the underlying fine-scale population structure and potential genetic diversity. In total, 43,726 eligible healthy Japanese adults were included in the GWAS, post-GWAS, and multinomial logistic regression analyses (Supplementary Fig. 1). Cluster 1 comprised the largest proportion of the total population (54.5%), while Cluster 6 had the lowest proportion at 1.8%. The mean age and BMI of the entire study population were 45 (SD = 9.7) years and 22 (SD = 3.5) kg/m2, respectively. Males constituted 50.7% of the total population. In addition, consistent similarities were observed across all clusters in the distributions of age, BMI, sex ratios, drinking and smoking status, and blood glucose level (Supplementary Table 1).
Genetic differentiation among genetic clusters
Using Weir and Cockerham’s fixation index (FST)24, we assessed genetic differentiation among the six clusters identified by DBSCAN clustering. Weir and Cockerham’s FST incorporates unbiased estimators of variance components and offers a reliable metric for assessing genetic differentiation. A weighted mean FST of 0.000904 (SD = 0.015) indicated relatively low genetic differentiation among the clusters, confirming the homogeneity of the Japanese population in the study group. FST values for pairwise comparisons between clusters and the comparison of each cluster with the entire population are presented in Fig. 3a. Despite a low level of differentiation, distinctions were observed. Clusters 2 and 3 exhibited the highest genetic dissimilarities. Clusters 1, 5, and 6 exhibited the highest similarity, with the lowest pairwise FST in the heatmap. Notably, the lowest FST was observed between Cluster 1 and the entire population, emphasizing the close resemblance of Cluster 1 to the overall population.
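To give a sense of the scale of these values, the sketch below computes a simplified, heterozygosity-based FST, (HT - HS) / HT, for a single biallelic SNP from per-cluster allele frequencies. This is deliberately not the Weir and Cockerham estimator used in the study, which additionally corrects for sample size through variance components; the function name and the input frequencies are hypothetical.

```python
def fst_biallelic(freqs, sizes):
    """Simplified FST = (H_T - H_S) / H_T for one biallelic locus.

    freqs: allele frequency of one allele in each subpopulation
    sizes: number of individuals in each subpopulation (used as weights)
    """
    total = sum(sizes)
    weights = [n / total for n in sizes]
    # Pooled allele frequency and the heterozygosity expected from it.
    p_bar = sum(w * p for w, p in zip(weights, freqs))
    h_t = 2 * p_bar * (1 - p_bar)
    # Weighted mean of the within-subpopulation expected heterozygosities.
    h_s = sum(w * 2 * p * (1 - p) for w, p in zip(weights, freqs))
    if h_t == 0:
        return 0.0  # monomorphic locus: no differentiation to measure
    return (h_t - h_s) / h_t


# Hypothetical frequencies for two equally sized clusters.
fst_small = fst_biallelic([0.50, 0.52], [100, 100])  # about 0.0004
fst_fixed = fst_biallelic([0.0, 1.0], [100, 100])    # complete differentiation
```

A two-percentage-point difference in allele frequency already yields a value on the order of 0.0004, so a genome-wide mean near 0.0009 is compatible with clusters whose allele frequencies differ only slightly at most loci.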
a A heatmap illustrates genetic differentiation among the six clusters, measured by FST. b A Manhattan plot visualizes the FST values calculated across the newly designated four clusters. A grey dashed line indicates the threshold of mean + 3SD, and a red dashed line marks the highlight threshold at 0.5. c, d Regional plots showing SNPs with FST > 0.5 on chromosomes 6 and 11. These SNPs are annotated to unique genes; if multiple SNPs correspond to the same gene, only the SNP with the highest FST value is displayed. For additional details, refer to Supplementary Data 1 and 2.
We then identified SNPs with noteworthy FST values, suggesting a potential association with genes influencing adaptive or economically significant traits25. Identifying SNPs based on the mean and SD of FST is a commonly employed method to detect selection signatures26. The Manhattan plot (Fig. 3b) revealed peak SNPs on chromosomes 6 (mean FST = 0.009772) and 11 (mean FST = 0.002453). Among 395,665 SNPs, 1339 and 132 on chromosomes 6 and 11, respectively, were identified as outliers surpassing three SDs from the mean, indicating that these loci had values higher than 99.8% of the total SNPs26. We established a threshold of FST > 0.5 to highlight genomic regions containing the most differentiated SNPs, with detailed annotations available in Supplementary Data 1 and 2. The closest genes associated with these regions are illustrated in Fig. 3c and d. The SNPs exhibiting the highest FST are located in the regions 6p22 and 6p21. The 6p21 region lies within the Major Histocompatibility Complex (MHC) and harbors a high concentration of its core genes. Meanwhile, the 6p22 region, positioned adjacent to this central area, also contributes to the broader context of MHC-related genetic functions. These regions correspond with the extended MHC region and have been previously linked to schizophrenia susceptibility27. Specifically, SNPs rs189984590 and rs548568008, located on chromosome 6 within the EHMT2 and PRRT1 genes, respectively, exhibited FST values close to 0.6. EHMT2, a key epigenetic regulator, is involved in histone modification and has implications for neurological disorders28. PRRT1 affects synaptic processes and is crucial for neurodevelopment and cognitive functions29. On chromosome 11, the SNPs demonstrating the greatest genetic differentiation are found in the 11p11 region, which is enriched for genes associated with neurodevelopmental processes30.
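The mean + 3 SD rule can be sketched in a few lines. The example below flags values that exceed the mean by more than three standard deviations and, for orientation, computes the upper-tail probability beyond three SDs under a normal approximation (roughly 0.13%). Per-SNP FST values are typically right-skewed, so the empirical fraction of flagged SNPs need not match that figure. The function names and the toy values are assumptions made for illustration.

```python
from statistics import NormalDist, mean, pstdev


def outlier_threshold(values, n_sd=3.0):
    """Cut-off defined as mean + n_sd * (population) standard deviation."""
    return mean(values) + n_sd * pstdev(values)


def flag_outliers(values, n_sd=3.0):
    """Indices of values strictly above the mean + n_sd * SD cut-off."""
    cut = outlier_threshold(values, n_sd)
    return [i for i, v in enumerate(values) if v > cut]


# Hypothetical per-SNP FST values: 99 near-zero loci and one strongly differentiated locus.
fst_values = [0.001] * 99 + [0.5]
outliers = flag_outliers(fst_values)

# One-sided tail beyond +3 SD if the values were normally distributed.
upper_tail = 1.0 - NormalDist().cdf(3.0)
```

Here only the last locus is flagged. A fixed cut-off such as FST > 0.5 can then be applied on top of this rule to restrict attention to the most differentiated SNPs.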
The MHC region, particularly the human leukocyte antigen (HLA) genes, is crucial in a diverse range of complex human diseases and quantitative traits31. MHC genes are categorized into three classes: class I, including highly polymorphic genes such as HLA-A, HLA-B, and HLA-C; class II, including genes such as HLA-DPA1, HLA-DPB1, HLA-DQA1, and others involved in antigen presentation; and class III, containing genes associated with inflammatory responses, leukocyte maturation, and the complement cascade32. In 2015, Nakaoka et al.33 investigated the distinction in HLA frequencies across 10 regional populations in Japan and found that HLA-A*24:02-C*12:02-B*52:01-DRB1*15:02-DPB1*09:01 exhibited distinctive regional frequency patterns. Consistent with previous observations33,34, we identified SNPs that were highly differentiated among the four clusters and mapped to HLA-B, -DRA, -DRB1, -DQB1, -DQA2, and -DQB2, predominantly located in or near the HLA region on chromosome 6. Collectively, despite the absence of specific geographical and regional information, our unsupervised learning methods effectively classified clusters exhibiting genetic differentiation consistent with previously reported patterns, thereby defining the fine-scale genetic structure of the study population.
To strengthen the reliability of our findings, we conducted additional FST analyses on randomly generated clusters, ensuring these clusters retained the structure of the identified clusters in our study. We assessed the similarity of different sets of clusters using the Adjusted Rand Index (ARI). The results, detailed in Supplementary Table 2 and Supplementary Data 3 and 4, confirm that the genetic clusters observed in our study are not due to random variation but represent distinct and genuine genetic fine-structures within the population.
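For readers unfamiliar with the ARI, the following minimal sketch computes it from two label vectors using the standard pair-counting formula. An ARI of 1 indicates identical partitions (up to relabeling), values near 0 indicate agreement no better than chance, and negative values indicate less agreement than expected by chance. The function name and example labels are illustrative and are not taken from the study's analysis code.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two partitions of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label vectors must have the same length")
    n = len(labels_a)
    if n < 2:
        return 1.0  # no pairs to compare
    # Pairs of items placed together in both partitions.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs of items placed together in each partition separately.
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions are a single cluster
    return (sum_cells - expected) / (max_index - expected)


# Same partition with permuted cluster names: perfect agreement.
ari_same = adjusted_rand_index([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2])
# Partitions that cut across each other: worse than chance.
ari_crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Because the index is invariant to cluster naming, it suits comparisons between observed clusters and randomly generated ones, where label identities carry no meaning.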
Association between genetic clusters and geographic ancestry
Expanding upon this genetic categorization, we consolidated Clusters 1, 5, and 6 into a new entity designated Cluster 1*, guided by their genetic similarities as revealed through FST analysis. This strategic consolidation was informed by the low genetic differentiation observed between these clusters. FST values and characteristics of the newly formed clusters are presented in Supplementary Fig. 3 and Supplementary Table 3.
To elucidate the relationship between these PCA-UMAP-DBSCAN-defined clusters and geographic ancestry, we engaged a cohort of 2268 participants from a follow-up survey that gathered geographical information. We first performed chi-square tests to reveal potential associations between an individual’s genetic cluster and familial geographic lineage. As shown in Fig. 4a, we found that the genetic clusters were significantly related to seven questions in the survey, which covered geographical details ranging from the individual’s birthplace to the birthplaces of their parents and grandparents. Subsequently, we developed random forest models, which are powerful machine-learning techniques known for their simplicity and robustness35, to predict genetic clusters based on geographic information. Two models were used: one incorporated only the geographic variables that exhibited significant associations in the chi-square tests (labeled sig), whereas the other utilized the entire suite of geographic data collected (labeled all). The ROC curves, illustrated in Fig. 4b, yielded area under the curve values of approximately 0.76. These values suggested a certain predictive capability of the geographic data, and also highlighted the potential influence of additional factors in genetic diversity, given their accuracy scores of 0.60 and 0.59, respectively.
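The chi-square test of independence used above compares observed counts in a cluster-by-category table with the counts expected if cluster membership and the geographic variable were unrelated. The sketch below computes only the test statistic and its degrees of freedom with the standard library; converting these to a P value would require a chi-square survival function (for example from a statistics package), which is omitted here. The table values and names are hypothetical, and the code assumes every row and column total is non-zero.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return stat, dof


# Hypothetical counts: rows are two genetic clusters, columns are two birthplace regions.
associated = [[30, 10], [10, 30]]
independent = [[10, 20], [20, 40]]

stat_assoc, dof_assoc = chi_square_statistic(associated)
stat_indep, dof_indep = chi_square_statistic(independent)
```

The first table, in which each cluster is concentrated in a different region, gives a large statistic, whereas the second, in which both clusters share the same regional proportions, gives a statistic of zero.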
a The figure depicts significant associations between genetic clusters and seven geographic variables as determined by chi-squared tests. Sky-blue bars indicate the strength of the associations in terms of -log10(P value). A red dotted line marks the statistical significance threshold of P = 0.05. b Two models are compared using ROC curves. The sig model uses only geographic variables that have significant relationships in Figure 4a, while the all model includes all geographic data. c ADMIXTURE analysis results with the number of hypothetical ancestral populations (K) set to four are presented. Distinct ancestral genetic mixtures are labeled from a–d. d This chart displays how the previously defined genetic clusters are distributed across the ancestral groups found in the ADMIXTURE analysis.
Further insights into the genetic structure of the cohort were obtained through unsupervised analysis with ADMIXTURE36, a software tool designed for the maximum likelihood estimation of individual ancestries. We systematically explored the values of K, which represents the number of ancestral populations assumed, to discern the fine-scale genetic structure of our cohort’s ancestry (Supplementary Figs. 4 and 5). We chose K = 4 for the ADMIXTURE analysis to allow comparison with our previous clustering findings from the PCA-UMAP-DBSCAN and FST analyses. This decision allowed us to reveal four distinct ancestral genetic mixtures, as depicted in Fig. 4c, with detailed information in Supplementary Data 5. The distribution of these mixtures illustrates a clear alignment with the genetic clusters previously defined through other methodologies (Fig. 4d). Specifically, Mixture B comprised predominantly individuals from Cluster 3 (150 of 168, i.e., 89.2%), whereas Mixture C mainly consisted of individuals from Cluster 2 (230 of 260, i.e., 88.5%). By contrast, Mixtures A and D were primarily composed of individuals from the merged Cluster 1*, which comprised 1,485 individuals (65.5%) in the follow-up cohort.
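A comparison of this kind can be sketched as a cross-tabulation. The example below assumes, purely for illustration, that each individual is assigned to the ancestry component with the largest estimated proportion; the text does not state the assignment rule that was actually used. The ancestry proportions, cluster labels, and function names are all hypothetical.

```python
from collections import Counter


def dominant_component(q_row):
    """Index of the largest ancestry proportion for one individual."""
    return max(range(len(q_row)), key=q_row.__getitem__)


def crosstab(clusters, q_matrix, names="ABCD"):
    """Count individuals per (dominant mixture, genetic cluster) pair."""
    counts = Counter()
    for cluster, q_row in zip(clusters, q_matrix):
        counts[(names[dominant_component(q_row)], cluster)] += 1
    return counts


def share(counts, mixture, cluster):
    """Fraction of a mixture's members that come from the given cluster."""
    total = sum(n for (m, _), n in counts.items() if m == mixture)
    return counts[(mixture, cluster)] / total if total else 0.0


# Hypothetical K = 4 ancestry proportions (each row sums to 1) and cluster labels.
clusters = ["1*", "1*", "2", "3", "3", "2"]
q_matrix = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.7],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.2, 0.6, 0.1, 0.1],
    [0.1, 0.6, 0.2, 0.1],
]
counts = crosstab(clusters, q_matrix)
b_from_cluster3 = share(counts, "B", "3")
```

The resulting shares play the same role as the proportions reported above, such as the fraction of Mixture B drawn from Cluster 3.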
The alignment of genetic clusters with ancestral mixtures defined by ADMIXTURE analysis suggested a patterned genetic composition. The combined findings from the chi-square tests, predictive modeling, and ADMIXTURE analysis revealed a structured picture of a genetic cluster related to geographic ancestry. Populations tend to exhibit clustering according to geographic region, as determined by genetic distance. Although geographical information contributes to predicting genetic clusters37, our analysis indicates that it is not the sole definitive factor, suggesting that other factors could contribute to the genetic diversity of the population. The alignment observed in two types of clusters, where there was a significant overlap between the structures defined by machine-learning techniques and those identified by ADMIXTURE analysis, suggests that historical demographic events for particular clusters could significantly influence genetic compositions. This observation implies that the underlying genetic profiles of these clusters may be more prominently shaped by geographical drift than those of the other clusters. Beyond geographical distance, factors such as epigenetic markers, notably DNA methylation, a crucial epigenetic mechanism that is sensitive to environmental influences, may impact population structure38,39. These influences suggest that genetic diversity is the product of complex gene-environment interactions, which are further complicated by epigenetic modifications that reflect the intricate genetic and epigenetic interplay within populations38. As physical and social barriers among different populations decrease, genetic mixing becomes more common, suggesting that factors beyond geography are increasingly influential in shaping the population structure40. Overall, our results suggest a complex relationship among the concepts of genetic similarity, genetic ancestry, and genealogical ancestry41.
Integrated insights from genetic interpretation and epidemiological results
We conducted GWAS and post-GWAS analyses based on the newly assigned clusters to investigate potential differences in gene expression, lifestyle, and dietary habits (Supplementary Data 6–10). The mapped genes of each cluster were used for gene set enrichment analysis (GSEA) with the FUMA pipeline42 to evaluate whether the genes exhibit statistically significant associations with trait-associated genetic variants. FUMA used 15 databases for GSEA, identifying 1616 significant relationships (Supplementary Data 9). Our study particularly highlighted the GWAS Catalog database, which is pivotal for exploring the genetic foundations of diverse traits and diseases, making it relevant to our investigation of the potential links between genetic variations and lifestyle and dietary habits. Among the gene sets analyzed, the GWAS Catalog revealed 220 significant relationships. Notably, these relationships exhibited the second-lowest P value when gene sets were evaluated as groups, as illustrated in Supplementary Fig. 6.
A total of 64 unique gene sets were identified as statistically significant, with intriguing findings such as the “Fruit Consumption” gene set ranking 20th when sorted by adjusted P in ascending order, and the last associated with “A body shape index” (Supplementary Data 10). Given the information density, we focused on the distinct traits of each cluster in the Japanese population, employing a Circos plot (Fig. 5a). The upper circle segments correspond to the specific gene sets that are unique to each cluster, as determined by their non-overlapping presence in the GWAS Catalog database. For example, Cluster 1* was distinctly associated with HDL cholesterol levels, while Cluster 3 was uniquely linked to metabolic traits such as “Alanine aminotransferase” and “Glycated hemoglobin levels”.
a The Circos plot illustrates the Gene Set Enrichment Analysis (GSEA) results across the four identified clusters, using negative log-transformed P values. The lower circle of the plot represents each of the four clusters. The upper circle highlights the specific gene sets that are unique to each cluster, as determined by their non-overlapping presence in the GWAS Catalog. b A forest plot presents the outcomes of the multinomial logistic regression analysis. The model was adjusted for age, sex, and BMI for each exposure variable. The plot uses vertical lines on the y-axis to represent different lifestyle or dietary variables, with horizontal lines showing ORs and their 95% CIs. Points of intersection with the vertical line at OR = 1.0 suggest no effect. The figure only includes lifestyle or dietary variables with statistically significant relationships (P < 0.05).
Despite these specific associations, a uniform distribution of gene sets and their interrelations across clusters was evident (Supplementary Fig. 7). The consistent significance across gene sets likely stemmed from shared significant SNPs (Supplementary Data 11). This pattern became more apparent in the context of the one-vs-all GWAS analysis, where the studied traits essentially represented the fine-scale population structure, and the persistent occurrence of particular SNPs resulted from the shared genetic background within the homogeneous Japanese population. The consistently observed shared genetic influences, as reflected in the enrichment of the same gene sets within the population’s fine-scale genetic structure, can be attributed to pleiotropy, where a single gene can affect multiple seemingly unrelated traits. This finding highlights the effect of specific genetic loci on diverse traits, thereby emphasizing the importance of accounting for pleiotropy in explaining the genetic basis of phenotypic diversity.
In addition, we generated a forest plot using statistically significant results from multinomial logistic regression analysis of cross-sectional data (Fig. 5b) to illustrate the ORs and 95% confidence intervals (CI). The regression model was adjusted for age, sex, and BMI43 using Cluster 1* as the reference group because it was observed to be the most similar to the entire population. In total, 21 significant results were found among 175 exposure variables, with an average of 23,521 individuals per analysis after excluding missing values. Among the significant findings, the lowest P value, around 0.005, was observed for “Daily Milk Intake” in Cluster 3. The figure presents consistent positive associations in Clusters 3 and 4, notably for weekly food consumption. Additionally, the results included two exposures: “Total Weekly Calcium Consumption” for Cluster 4 (OR = 0.998, CI: 0.997–0.9999, P = 0.042) and “Total Weekly Fish Consumption” for Cluster 3 (OR = 1.002, CI: 1.000–1.003, P = 0.025). The small magnitude of these ORs suggested nuanced effects (see Supplementary Data 12 for detailed information, including survey questions). The regression analysis indicated significant associations across two domains: quality of life, emphasizing sleep disturbance and fatigue; and dietary habits, with a focus on the intake of fish and vegetables. An examination of Fig. 5a and b together reveals interesting patterns. For instance, Clusters 3 and 4 have higher vegetable consumption compared to the reference Cluster 1*. Furthermore, the Circos plot indicates that these clusters uniquely exhibit statistically significant GSEA results for body shape, feelings of guilt, and non-lobar intracerebral hemorrhage (MTAG). Such variations highlight the diverse impacts of these factors on health across different genetic groups. The observed differences among genetic clusters suggest a complex interplay between epidemiological exposures and fine-scale genetic structures.
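The interpretation of ORs this close to 1 depends on the unit of the exposure. As a minimal worked sketch, the code below shows how an OR and its 95% CI follow from a logistic regression coefficient and its standard error, and how a per-unit OR compounds over a larger change in a continuous exposure. The coefficient, standard error, and the choice of a 100-unit change are hypothetical; the measurement units of the consumption variables are not given in the text above.

```python
from math import exp


def odds_ratio_ci(beta, se, z=1.96):
    """Odds ratio and Wald 95% CI from a log-odds coefficient and its standard error."""
    return exp(beta), exp(beta - z * se), exp(beta + z * se)


def rescale_odds_ratio(or_per_unit, units):
    """Odds ratio for a change of `units` in the exposure, given the per-unit OR."""
    return or_per_unit ** units


# Hypothetical coefficient and standard error for one exposure in one cluster.
or_hat, ci_low, ci_high = odds_ratio_ci(beta=0.002, se=0.0008)

# A per-unit OR of 1.002 compounds to about 1.22 over a 100-unit difference.
or_per_100 = rescale_odds_ratio(1.002, 100)
```

Under these assumptions, an OR of 1.002 per unit is not negligible when individuals differ by many units of intake, which is one reason the reported effects are better described as small per unit than as absent.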
Discussion
In this study, we identified the population structure of the Japanese population using unsupervised learning techniques including PCA, UMAP, and DBSCAN. We investigated the genetic factors associated with these clusters through FST, GWAS, and post-GWAS analyses, uncovering significant gene sets linked to specific traits and diseases. Notably, gene sets related to dietary habits such as “Fish- and plant-related diet” and health conditions like “Liver iron content” were identified, highlighting the unique genetic background of the Japanese population44 and the potential impact of genetics on health and lifestyle choices45. Despite the uniform genetic background across all clusters, the variance in the epidemiological relationships between lifestyle factors, dietary habits, and genetic clusters highlights the complex relationship between genetics and environmental factors. Multinomial regression analysis was used to compare different clusters in relation to lifestyle and dietary habits, illustrating the varied impacts of these factors across genetic clusters. These combined results underscore the complexity of genetic predispositions and their interactions with lifestyle and dietary habits, offering insights into gene-environment interactions by utilizing the concept of fine-scale genetic structure46.
Genetic research on the Japanese population has delved into its adaptation through two principal lenses: the admixture history of ancient lineages, highlighting the dual-structure model that illustrates the indigenous identities of the Jomon and Yayoi peoples; and the genetic distinctions within geographical regions (Hondo and Ryukyu)17. Prior studies that aimed to understand the genetic fine-scale structure in the Japanese population12,19 have primarily served to confirm the concordance between geographical distribution and the observed genetic structure. However, a pressing question remains: is employing geographical or indigenous labels as proxies for the identified fine-scale structures the optimal approach? Addressing the nuanced issue of population descriptors in genetics research has become increasingly important47, especially in terms of presenting genetic classifications to the public without fueling debates on genetic essentialism48. In 2023, the US National Academies of Sciences, Engineering, and Medicine (NASEM) published guidance recommending best practices for using population descriptors in research47. This guidance advises carefully considering the use of geographical ancestry to avoid misconceptions, particularly the risk of misinterpreting geography as indicative of environmental exposures. It also warns against using indigenous identity in ways that could imply “purity” and lead to discriminatory interpretations.
The fine-scale structure we uncovered reveals patterns where clusters serve as indicators of individuals with substantial genetic similarity. Our use of “genetic similarity” as a descriptor aligns with existing guidelines47. It is a well-accepted concept that phenotypes arise from the dynamic interplay between genetics and the environment49,50. Our analysis reveals distinct patterns among populations grouped by genetic similarity, suggesting potential interactions between genetic factors and environmental conditions. In light of these findings, we highlight the need for future studies that explicitly incorporate gene-environment analyses. Such research would more clearly elucidate the impact of societal changes on genetic variation through complex social determinants and interactions. This points to the necessity of including a broad spectrum of environmental and social factors in genetic research. It is essential to integrate sociological and environmental contexts into genetic studies to fully explore the intricate, intertwined relationships between society and human biology. This approach would help uncover how societal developments may influence genetic variation through detailed causal networks of social determinants and gene-environment interactions51.
Our study helps to understand the complex factors influencing health outcomes, with implications for both public health and precision medicine. The application of advanced unsupervised learning technologies, such as PCA-UMAP-DBSCAN, effectively identified clusters with significant genetic differentiation within a homogeneous population. These combined technologies have uncovered intricate patterns within genetic data, providing efficient classification based on intrinsic genetic structures. Simplifying the complex genotype matrix into discrete categorical variables that represent the fine-scale genetic structure enables the use of epidemiological regression. Our statistical analysis revealed gene-environment interactions across clusters that share a homogeneous genetic background. This comprehensive approach links genetic predispositions to lifestyle and dietary patterns, yielding valuable insights for public health interventions. This framework also highlights the possibility of strengthening public health research by including genetic structure as an essential variable in mainstream statistical studies, as integrating genetic architecture could improve our understanding of complex relationships in genetic epidemiology52. Moreover, the convenient nature of the DTC-GT data enables the creation of near-real-time biobank data, allowing researchers to engage in more precise data processing53. Swift recruitment facilitated by DTC-GT enables faster follow-up studies and offers crucial insights into personalized medicine. This approach allows for the rapid assembly and modification of prevention programs, thereby improving their effectiveness54. Additionally, incorporating gene-environment interactions and genetic clusters into the development of personalized prevention programs may improve their efficacy. By considering genetic susceptibility and modifiable lifestyle factors, these programs could improve personalized medicine by matching individuals using genomic profiling55.
Our study has some limitations. First, validation of gene set enrichment patterns in future studies is essential to establish the robustness of the explanation for the entire Japanese population. Second, the simplicity of the overall regression model, which was adjusted only for age, sex, and BMI, could introduce a potential bias in the interpretation of these relationships. The ORs close to 1.0 may indicate a potential limitation in establishing substantial associations with the respective clusters. Third, the exclusion of missing data could introduce a potential source of bias. The use of self-reported questionnaires for lifestyle and dietary habits in the cross-sectional study could also introduce recall bias. Additionally, while UMAP effectively highlights local data structures, it may overemphasize the differentiation between clusters. Future studies might benefit from validating findings through additional methods that offer different perspectives on the global relationships among genotype data. Finally, inherent self-selection bias could arise from customers opting to participate in the DTC-GT, potentially leading to subject bias. These limitations underscore the need for cautious interpretation and highlight the areas for consideration in future research.
In conclusion, by applying machine-learning techniques, we revealed the fine-scale genetic structure of Japanese DTC-GT customers, marked by high genetic variation among the identified clusters, and found a significant relationship between these genetic clusters and geographical ancestry. This study introduces an innovative framework for genetic epidemiology by integrating genetic insights with cross-sectional statistical analyses, shedding light on the subtle effects of lifestyle and dietary factors across different genetic clusters. With the future availability of health outcome data, research could further explore the relationship between risk factors and health outcomes within the context of genetic architecture.
Methods
Study participants
A total of 61,728 healthy Japanese adults were initially enrolled as customers of MYCODE (DLS Inc., Tokyo, Japan), a personal genome service in Japan. Participants returned their saliva samples to the DLS laboratory along with a signed consent form indicating their willingness to participate in MYCODE Research, in which their anonymized genetic data and health-related information would be used for scientific research. After sample quality control by the DLS laboratory, 55,551 participants and 684,436 autosomal SNPs were retained. Samples were excluded for sex inconsistency, identity by descent (\(\hat{\pi } > 0.1875\)), missing call rate (>0.01), non-Japanese ancestry estimated by genetic PCA using East Asian samples from the 1000 Genomes Project Phase 3 version 5, and autosomal heterozygosity (>3 standard deviations (SD) above the mean).
For this study, the eligibility criteria were further established as follows: (i) age 20–64 years, (ii) validated height and weight data, and (iii) height within three times the interquartile range (IQR)56. PCA was conducted on the genotype data of the participants following the quality control procedures described below using PLINK (v1.9)57.
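The three eligibility criteria can be expressed as a simple filter. The following is a minimal sketch rather than the pipeline used in this study: the function name `eligible_mask` and the toy values are hypothetical, and criterion (iii) is interpreted here as a fence of three times the IQR below the first quartile and above the third quartile, which is an assumption about how the rule was applied.

```python
import numpy as np

def eligible_mask(age, height_cm, weight_kg, k=3.0):
    """Boolean mask for criteria (i)-(iii); the IQR fence form is an assumption."""
    age = np.asarray(age, dtype=float)
    height_cm = np.asarray(height_cm, dtype=float)
    weight_kg = np.asarray(weight_kg, dtype=float)

    # (i) age 20-64 years, inclusive
    age_ok = (age >= 20) & (age <= 64)
    # (ii) height and weight present (NaN marks a missing or unvalidated value)
    has_data = ~np.isnan(height_cm) & ~np.isnan(weight_kg)
    # (iii) height within k * IQR of the quartiles of the non-missing heights
    q1, q3 = np.nanpercentile(height_cm, [25, 75])
    iqr = q3 - q1
    with np.errstate(invalid="ignore"):
        height_ok = (height_cm >= q1 - k * iqr) & (height_cm <= q3 + k * iqr)
    return age_ok & has_data & height_ok

# Toy data (hypothetical): one participant too young, one too old,
# one with missing weight, and one with an implausible height
age = [19, 25, 40, 64, 65, 30]
height = [170.0, 165.0, 172.0, 160.0, 168.0, 250.0]
weight = [60.0, 55.0, np.nan, 50.0, 70.0, 80.0]
mask = eligible_mask(age, height, weight)
print(mask)
```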
All ethical regulations relevant to human research participants were followed. The study was approved by both the ethics committee of DeNA Life Science Inc. (protocol #20140717_1) and the Institute of Medical Science, University of Tokyo (protocol #2019-48-1219) (Tokyo, Japan). Participants consented to the publication of research findings using their data, under the condition that no personally identifiable information is disclosed. All personal identifiers have been removed to protect confidentiality and comply with ethical standards.
Genotyping and quality control
SNP genotyping was performed using the Infinium OmniExpress-24+ BeadChip or Human OmniExpress-24+ BeadChip (Illumina, Inc., San Diego, CA, USA) in the DLS laboratory. Before genotype imputation, quality control was performed on the 684,436 genotyped autosomal SNPs, excluding SNPs with (i) minor allele frequency < 0.01, (ii) Hardy-Weinberg equilibrium P < \(10^{-6}\), or (iii) missing call rate > 0.01. Initial haplotype phasing was conducted on the entire cohort of 55,551 individuals using Eagle (v2.4.1)58, with the 1000 Genomes Project Phase 3 version 5 reference panel (1KGP-JPT; available at https://mathgen.stats.ox.ac.uk/impute/1000GP_Phase3.html). Genotype imputation was performed using Minimac3. Quality control of the resulting genotype data excluded SNPs with (i) minor allele frequency < 0.01, (ii) Hardy-Weinberg equilibrium P < \(10^{-6}\), (iii) call rate < 95%, or (iv) \(r^{2}\) < 0.7. All genetic variants initially identified were re-annotated using “bcftools annotate”. Linkage disequilibrium (LD) pruning was conducted after quality control with the option “--indep-pairwise 50 5 0.2” in PLINK (v1.9)57. Following the application of the eligibility criteria to the study participants, the dataset consisted of 395,665 pruned-in variants from 43,726 individuals. Genomic coordinates were based on the GRCh37/hg19 reference.
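To make the frequency and missingness thresholds concrete, the sketch below applies them to a small genotype matrix with NumPy. It is an illustration only, since the study performed these steps with dedicated tools. The function name `snp_qc_mask` and the toy matrix are hypothetical, and the Hardy-Weinberg and imputation-quality filters are omitted.

```python
import numpy as np

def snp_qc_mask(genotypes, maf_min=0.01, missing_max=0.01):
    """Mask of SNPs (columns) passing the MAF and missing-call-rate filters.

    genotypes: (n_samples, n_snps) array of 0/1/2 allele counts, NaN = missing.
    """
    g = np.asarray(genotypes, dtype=float)
    called = ~np.isnan(g)
    missing_rate = 1.0 - called.mean(axis=0)
    with np.errstate(invalid="ignore", divide="ignore"):
        # Frequency of the counted allele among non-missing calls
        freq = np.nansum(g, axis=0) / (2.0 * called.sum(axis=0))
    maf = np.minimum(freq, 1.0 - freq)
    # SNPs are excluded when MAF < maf_min or missing rate > missing_max
    return (maf >= maf_min) & (missing_rate <= missing_max)

# Toy matrix (hypothetical): 200 samples, 3 SNPs
g = np.zeros((200, 3))
g[:100, 0] = 1          # SNP 0: allele frequency 0.25, no missing calls
                        # SNP 1: monomorphic, so MAF = 0
g[:100, 2] = 1
g[:10, 2] = np.nan      # SNP 2: 5% missing calls
mask = snp_qc_mask(g)
print(mask)
```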
Dimensionality reduction and clustering
We followed the dimensionality reduction approach used in previous literature12. First, we performed PCA using PLINK (v1.9) with the default parameters. The first two PCs were extracted and plotted in two dimensions for outlier inspection. We then performed UMAP on the first 20 PCs of the genotype data using the Python umap-learn package (v0.5.3; https://pypi.org/project/umap-learn/) with the default parameters. The choice of the top 20 principal components was based on commonly accepted practice and informed by prior research on the fine-scale structure of the Japanese population12. A scree plot of the eigenvalues is included in Supplementary Fig. 2.
After the manual identification of six distinct clusters based on the separation observed in the PCA-UMAP plot (Fig. 1a), clustering and visualization were conducted. DBSCAN, a widely used clustering algorithm in the spatial domain23, was implemented using the Python scikit-learn package (v1.3.0; https://scikit-learn.org/stable/) with the parameters set to eps = 0.068 and min_samples = 3. Noisy samples identified using DBSCAN were subsequently excluded, resulting in a final cohort of 43,726 individuals for further analysis (Supplementary Fig. 1). Python (v3.9.12) and R (v4.3.2) were used for all analyses in this study.
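The DBSCAN step with the reported parameters can be sketched as follows. The two-dimensional points are synthetic stand-ins for a PCA-UMAP embedding (an assumption made so the example is self-contained), and the variable names are hypothetical. Only `eps = 0.068` and `min_samples = 3` are taken from the text.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Synthetic stand-in for a 2-D embedding: two tight groups plus one isolated point
group_a = rng.normal(loc=(0.0, 0.0), scale=0.01, size=(200, 2))
group_b = rng.normal(loc=(1.0, 1.0), scale=0.01, size=(200, 2))
isolated = np.array([[5.0, 5.0]])
embedding = np.vstack([group_a, group_b, isolated])

# Parameters as reported in the text
labels = DBSCAN(eps=0.068, min_samples=3).fit_predict(embedding)

# DBSCAN labels noise samples as -1; these are dropped before further analysis
keep = labels != -1
n_clusters = len(set(labels[keep]))
print(n_clusters, int(keep.sum()))
```

Because `eps` is expressed in the units of the embedding, the value reported here is only meaningful for an embedding on a comparable scale.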
Genetic variance analysis
We used the FST value, as defined by Weir and Cockerham24, to investigate the overall genetic differentiation among the six clusters identified using PCA-UMAP analysis. FST calculations were performed using the --fst option in PLINK (v1.9)57. ANNOVAR (v2018Jul08)59 was used for the functional annotation of SNPs with high FST values. In addition, we conducted pairwise comparisons of the clusters in a one-versus-one manner. Furthermore, one-versus-all FST analyses were performed to quantify genetic variance between each cluster and the entire dataset. Randomly generated cluster sets were created using the numpy package (v1.24.4) with two distinct seeds (42 and 100). The similarity between these random clusters and the original clusters was assessed using the adjusted Rand index (ARI) from the scikit-learn package (v1.3.0).
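The comparison between random and original clusters can be illustrated with a short sketch. The text does not state how the random cluster sets were drawn; here the original labels are simply permuted, which preserves cluster sizes, and this choice is an assumption. The helper name `random_cluster_ari` and the toy labels are hypothetical, while the seeds 42 and 100 follow the text.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

def random_cluster_ari(labels, seed):
    """ARI between the given labels and a seeded random permutation of them."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    random_labels = rng.permutation(labels)  # keeps cluster sizes (assumption)
    return adjusted_rand_score(labels, random_labels)

# Toy labels (hypothetical): six equally sized clusters
original = np.repeat(np.arange(6), 500)
ari_42 = random_cluster_ari(original, seed=42)
ari_100 = random_cluster_ari(original, seed=100)
print(ari_42, ari_100)
```

An ARI near 0 indicates agreement no better than chance, whereas identical partitions give an ARI of 1.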
Geographical information and related analysis
The geographical information was obtained from a follow-up survey of 3132 volunteers, featuring nine questions on current residence, birthplace, longest place of residence before adulthood, and six familial birthplaces, covering all 47 Japanese prefectures and an additional “I don’t know/don’t want to answer” option. As shown in Supplementary Fig. 8, these prefectures were further organized into 10 regional blocks according to the Type I region classification by the Ministry of Internal Affairs and Communications in Japan (https://www.soumu.go.jp/main_content/000872772.pdf). After excluding responses with missing data, our analysis focused on 2268 participants with complete responses. A chi-square analysis was initially conducted using the scipy.stats module (v1.11.4) in Python to confirm that the follow-up cohort shared a similar distribution of genetic clusters with the main dataset. For association analysis, we conducted chi-square tests to examine the relationship between geographical information and genetic clusters. Using the Python scikit-learn package (v1.3.0), we trained random forest classifiers with the genetic cluster as the target variable and two sets of features: one containing only the significantly associated geographic variables and one containing all the geographic data collected. Stratified five-fold cross-validation was applied to the dataset to ensure the consistency and reliability of the model performance. To mitigate class imbalance, we used micro-averaging to generate a receiver operating characteristic (ROC) curve. ADMIXTURE software (v1.3.0)36 was used to conduct an unsupervised estimation of ancestral components among the 2268 participants. We examined different numbers of assumed ancestral components by setting K to values from 2 to 12, following previous work12. In the resulting plots, ancestral components are distinguished by color and individuals are sorted by their predominant ancestry.
We selected K = 4 for our analyses as it provided a balance between minimizing error and aligning with the coherent genetic structures identified through our previous analyses. Supplementary Figs. 4 and 5 provide a comprehensive overview of the ancestral estimations across the entire range of K values with cross-validation errors.
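Returning to the classifier evaluation described earlier in this subsection, the combination of stratified five-fold cross-validation and a micro-averaged ROC can be sketched as below. The data are simulated, with a single hypothetical "regional block" answer that agrees with the cluster label most of the time, and the hyperparameters are illustrative assumptions rather than the settings used in this study.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.preprocessing import label_binarize

rng = np.random.default_rng(0)
n = 600
cluster = rng.integers(0, 3, size=n)  # hypothetical genetic cluster labels
# Hypothetical survey answer: a regional code that often matches the cluster
region = np.where(rng.random(n) < 0.7, cluster, rng.integers(0, 3, size=n))
X = np.eye(3)[region]                 # one-hot encoding of the categorical answer

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0)
# Out-of-fold class probabilities for every participant
proba = cross_val_predict(clf, X, cluster, cv=cv, method="predict_proba")

# Micro-averaging: binarize the labels and pool all (sample, class) pairs
# into a single binary ROC problem
y_bin = label_binarize(cluster, classes=[0, 1, 2])
micro_auc = roc_auc_score(y_bin.ravel(), proba.ravel())
print(round(micro_auc, 3))
```

Pooling all sample-class pairs weights each participant equally, so the summary is dominated by the larger clusters rather than averaged per class.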
GWAS and Post-GWAS analyses
To explore genetic variance among the six clusters, one-versus-all case-control studies were conducted using PLINK (v1.9)57. A logistic regression model assuming additive genetic effects was used for association analysis, adjusting for age and sex as covariates. The association analysis was repeated six times, designating each cluster in turn as the case group. The FUMA pipeline (v1.6.0; https://fuma.ctglab.nl/), from the SNP2GENE to the GENE2FUNC functions, was employed to explore the results obtained from PLINK. In the SNP2GENE step, lead and candidate SNPs were identified using default parameters, except that 1KGP-EAS was used as the reference panel. SNPs were considered independently significant if they reached the default genome-wide significance threshold of \(5 \times 10^{-8}\) and were independent of each other at LD \(r^{2}\) < 0.6. FUMA also defines independent lead SNPs at LD \(r^{2}\) < 0.1. Genomic risk loci were subsequently mapped to genes using positional and eQTL mapping. For positional mapping, ANNOVAR annotations were employed in the FUMA pipeline and candidate SNPs were assigned to the nearest genes within a maximum distance of 0 kb. For eQTL mapping, the expression data for blood tissues in GTEx v8 were used. A default false discovery rate of 0.05 was applied to define significant eQTL associations. In addition, the GENE2FUNC feature in FUMA conducts gene set enrichment analysis (GSEA) by performing hypergeometric tests on the mapped genes to assess the overrepresentation of biological functions using public databases, including the GWAS Catalog, MSigDB, and WikiPathways. The circos plot was generated using the circlize package (v0.4.15).
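The hypergeometric test underlying this kind of overrepresentation analysis can be written in a few lines. This is a generic sketch of a one-sided test, not a reproduction of FUMA's implementation (which also handles background gene selection and multiple-testing correction); the function name `enrichment_p` and the gene counts are hypothetical.

```python
from scipy.stats import hypergeom

def enrichment_p(n_background, n_in_set, n_mapped, n_overlap):
    """P(overlap >= n_overlap) when n_mapped genes are drawn from the background."""
    # sf(k - 1) gives P(X >= k) for the hypergeometric distribution
    return hypergeom.sf(n_overlap - 1, n_background, n_in_set, n_mapped)

# Hypothetical counts: 20,000 background genes, a gene set of 100 genes,
# 50 mapped genes, 5 of which fall in the gene set (0.25 expected by chance)
p_value = enrichment_p(20000, 100, 50, 5)
print(p_value)
```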
Lifestyle and dietary habits
From August 2014 to June 2020, MYCODE users answered an optional web-based questionnaire on lifestyle and dietary habits on the service website to obtain personalized health advice. Because the questionnaires were self-administered, the timing of data collection varied between participants. The anonymized answer data of the MYCODE Research participants were used for this study. To ensure data quality, responses that appeared suspicious were identified and removed during the pre-analysis phase. The questionnaire covered anthropometric traits; dietary habits such as the intake of calcium, dietary fiber, fruit, fish, and red and yellow/green/other vegetables; physical activity; smoking and drinking habits; sleep-related behaviors; stress responses; and blood test results such as triglyceride, LDL cholesterol, and glucose levels. We used the calcium self-check chart60 to assess calcium intake and the Brief Job Stress Questionnaire to evaluate stress responses; the latter contains a subset of 29 items, comprising 18 items related to psychological stress responses and 11 items focusing on physical stress responses and somatic symptoms61. Other questionnaires on dietary habits were originally developed by DLS based on the Standard Tables of Food Composition in Japan (https://fooddb.mext.go.jp/). To remove potentially erroneous answers to free-answer questions on alcohol intake, answers of more than 10 glasses of wine (120 mL per glass) per day or more than 10 cans of chuhai (350 mL per can) per day were converted to missing values.
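The conversion of implausible alcohol answers to missing values amounts to a one-line rule. In the minimal sketch below, the function name `mask_implausible` and the example answers are hypothetical, and "over 10" is read as strictly greater than 10, which is an assumption.

```python
import numpy as np

def mask_implausible(units_per_day, max_units=10):
    """Replace free-answer values above max_units per day with NaN (missing)."""
    x = np.asarray(units_per_day, dtype=float)
    return np.where(x > max_units, np.nan, x)

# Hypothetical answers in glasses of wine per day
cleaned = mask_implausible([0, 2, 10, 11, 30])
print(cleaned)
```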
Multinomial logistic regression analysis
For each lifestyle and dietary habit, multinomial logistic regression analyses were performed using the Python scikit-learn library (v1.3.0). For each exposure variable of interest, participants with missing values for that variable were excluded from the corresponding regression. The models were adjusted for age, sex, and body mass index (BMI)43; the covariate data contained no missing values. All statistical analyses were conducted using Python, with statistical significance set at P < 0.05. The multinomial logistic regression model for a lifestyle/dietary habit variable X and a specific cluster k is as follows:
$$\ln \left(\frac{P(Y=k)}{P(Y=\,{\mbox{reference}})}\right)={\beta }_{0k}+{\beta }_{1k}\,{\mbox{age}}+{\beta }_{2k}\,{\mbox{sex}}+{\beta }_{3k}\,{\mbox{BMI}}+{\beta }_{4k}X$$
where:
- \(\frac{P(Y=k)}{P(Y=\,{\mbox{reference}})}\) is the odds of the dependent variable Y being in cluster k rather than in the reference cluster (Cluster 1), given the value of the lifestyle/dietary habit variable X.
- β1k, β2k, β3k, and β4k are the coefficients for cluster k corresponding to the variables age, sex, BMI, and X, where X is the specific lifestyle/dietary habit variable.
- β0k is the intercept term representing the baseline log-odds for cluster k.
- age, sex, and BMI are the covariates adjusted in the model.
The odds ratio (OR) for X in relation to cluster k is calculated as:
$${{\mbox{OR}}}_{k}={e}^{{\beta }_{4k}}$$
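A sketch of how reference-coded odds ratios can be recovered from a scikit-learn multinomial fit is given below. It uses simulated data with known effects, and the model settings are assumptions, since the text does not report them. In particular, scikit-learn regularizes by default, so a large `C` is used here to approximate an unpenalized fit. The sketch does not compute P values.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 6000
# Simulated design matrix (columns: age, sex, BMI, X); continuous columns are z-scored
design = np.column_stack([
    rng.normal(size=n),           # age (standardized)
    rng.integers(0, 2, size=n),   # sex (0/1)
    rng.normal(size=n),           # BMI (standardized)
    rng.normal(size=n),           # X, the lifestyle/dietary variable
])
# True log-odds versus the reference cluster (index 0) depend only on X here
true_beta_x = np.array([0.0, 0.7, -0.4])
logits = design[:, [3]] * true_beta_x
prob = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
cluster = np.array([rng.choice(3, p=p) for p in prob])

# A large C approximates an unpenalized fit (assumption; see lead-in)
model = LogisticRegression(C=1e6, max_iter=2000).fit(design, cluster)

# scikit-learn stores one coefficient row per class (softmax form), so the
# log-odds versus the reference is the difference from the reference row
ref = 0
beta_x = model.coef_[:, 3] - model.coef_[ref, 3]
odds_ratios = np.exp(beta_x)
print(odds_ratios)
```

The fitted ORs for X should lie close to exp(0.7) and exp(−0.4) for the two non-reference clusters, and the OR for the reference cluster is 1 by construction.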
Statistics and reproducibility
In our study, we utilized PCA, UMAP, and DBSCAN to explore the fine-scale genetic structure of a homogeneous Japanese population based on autosomal SNP data. Additionally, we investigated the relationship between genetic clusters and lifestyle factors by applying multinomial logistic regression.
To ensure the reproducibility of our findings, we provide comprehensive descriptions of all procedures, from data collection to analysis. This detail includes our data cleaning process, the criteria for including and excluding participants, and the specific parameters set within our statistical models. Initially, our study included 61,728 participants, which was then narrowed down to 43,726 individuals after quality control. For replication purposes, researchers should employ a similarly large cohort of both genetic and phenotypic data, ideally encompassing over 40,000 participants post-quality control, to ensure robust and replicable results.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
The source data underpinning the figures presented in this study are provided as follows: Due to privacy concerns, the raw data for Fig. 2, which include detailed clustering information on individual participants, cannot be disclosed. Data supporting Fig. 3 are contained in Supplementary Data 1 and 2. Figure 4c and d draw from Supplementary Data 5. Individual responses underlying Figs. 4a and b are withheld to protect participant privacy. Data supporting Fig. 5a are provided in Supplementary Data 10. For Fig. 5b, while individual responses are not available, summary statistics are provided in Supplementary Data 12. Access to human SNP genotype and individual health-related data is controlled by privacy, legal, and ethical considerations. These sensitive data are held by DeNA Life Science, Inc., based on the informed consent obtained from participants, which does not allow deposition in a public repository. As of October 2024, ownership will be transferred to Allm Inc. due to business succession. However, these data may be obtained from the corresponding author upon a justified request. Data access requires ethical committee approval, followed by the provision of an opt-out opportunity for study participants. Data sharing must also comply with MYCODE Research’s security policies, which mandate appropriate access control, logging and monitoring, a dedicated and isolated network, and antivirus measures for environments handling sensitive data. Specific details regarding these requirements will be discussed upon individual request. Please note that the opt-out and data preparation processes will take approximately six months.
Code availability
The code for dimensionality reduction and clustering of the genotype data is available on GitHub: https://github.com/YichiChen-z/dimension-reduction-clustering/.
Change history
11 August 2025
In this article the handling editor name was missing and should have read primary handling editors: Qiao Fan and Rosie Bunton-Stasyshyn. The original article has been corrected.
References
Roberts, J. S. & Ostergren, J. Direct-to-consumer genetic testing and personal genomics services: a review of recent empirical studies. Curr. Genet. Med. Rep. 1, 182–200 (2013).
Hayden, E. C. The rise and fall and rise again of 23andMe. Nature 550, 174–177 (2017).
Miyake, K. Psy17-2 - 5 years of steady progress: DTC genetic testing service mycode. Ann. Oncol. 30, vi23 (2019).
Marchini, J., Cardon, L. R., Phillips, M. S. & Donnelly, P. The effects of human population structure on large genetic association studies. Nat. Genet. 36, 512–517 (2004).
Libbrecht, M. W. & Noble, W. S. Machine learning applications in genetics and genomics. Nat. Rev. Genet. 16, 321–332 (2015).
Ausmees, K. & Nettelblad, C. A deep learning framework for characterization of genotype data. G3 Genes∣Genomes∣Genetics 12, jkac020 (2022).
Schrider, D. R. & Kern, A. D. Supervised machine learning for population genetics: A new paradigm. Trends Genet. 34, 301–312 (2018).
Price, A. L. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 38, 904–909 (2006).
van der Maaten, L. & Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).
Healy, J. & McInnes, L. Uniform manifold approximation and projection. Nat. Rev. Methods Primers 4, 82 (2024).
Gaspar, H. A. & Breen, G. Probabilistic ancestry maps: a method to assess and visualize population substructures in genetics. BMC Bioinforma. 20, 116 (2019).
Sakaue, S. et al. Dimensionality reduction reveals fine-scale structure in the Japanese population with consequences for polygenic risk prediction. Nat. Commun. 11, 1569 (2020).
Cristian, P.-M. et al. Diffusion on PCA-UMAP manifold captures a well-balance of local, global, and continuum structure to denoise single-cell RNA sequencing data. bioRxiv 2022.06.09.495525 http://biorxiv.org/content/early/2022/06/12/2022.06.09.495525.abstract (2022).
Yang, Y. et al. Dimensionality reduction by UMAP reinforces sample heterogeneity analysis in bulk transcriptomic data. Cell reports 36 (2021).
Diaz-Papkovich, A., Anderson-Trocmé, L. & Gravel, S. A review of UMAP in population genetics. J. Hum. Genet. 66, 85–91 (2021).
Haga, H., Yamada, R., Ohnishi, Y., Nakamura, Y. & Tanaka, T. Gene-based SNP discovery as part of the Japanese millennium genome project: identification of 190,562 genetic variations in the human genome. J. Hum. Genet. 47, 605–610 (2002).
HANIHARA, K. Dual structure model for the population history of the Japanese. Jpn Rev. 1–33 http://www.jstor.org/stable/25790895 (1991).
Jinam, T. A., Kanzawa-Kiriyama, H. & Saitou, N. Human genetic diversity in the Japanese archipelago: dual structure and beyond. Genes Genet. Syst. 90, 147–152 (2015).
Takeuchi, F. et al. The fine-scale genetic structure and evolution of the Japanese population. PLOS ONE 12, e0185487– (2017).
Hirata, J. et al. Genetic and phenotypic landscape of the major histocompatibility complex region in the Japanese population. Nat. Genet. 51, 470–480 (2019).
Lewis, C. M. & Vassos, E. Polygenic risk scores: from research tools to clinical instruments. Genome Med. 12, 1–11 (2020).
Isshiki, M., Watanabe, Y. & Ohashi, J. Geographic variation in the polygenic score of height in Japan. Hum. Genet. 140, 1097–1108 (2021).
Schubert, E., Sander, J., Ester, M., Kriegel, H. P. & Xu, X. DBSCAN revisited, revisited: why and how you should (still) use DBSCAN. ACM Trans. Database Syst. (TODS) 42, 1–21 (2017).
Weir, B. S. & Cockerham, C. C. Estimating f-statistics for the analysis of population structure. Evolution 38, 1358–1370 (1984).
Zhao, F., McParland, S., Kearney, F., Du, L. & Berry, D. P. Detection of selection signatures in dairy and beef cattle using high-density genomic information. Genet. Sel. Evol. 47, 49 (2015).
Maiorano, A. M. et al. Assessing genetic architecture and signatures of selection of dual purpose gir cattle populations using genomic information. PLOS ONE 13, 1–24 (2018).
Ikeda, M. et al. Genome-wide association study of schizophrenia in a Japanese population. Biol. Psychiatry 69, 472–478 (2011).
O’Leary, N. A. et al. Reference sequence (refseq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 44, D733–D745 (2016).
Aleksander, S. et al. Updates to the alliance of genome resources central infrastructure. Genetics 227, iyae049 (2024).
Stauffer, E.-M. et al. The genetic relationships between brain structure and schizophrenia. Nat. Commun. 14, 7820 (2023).
Trowsdale, J. & Knight, J. C. Major histocompatibility complex genomics and human disease. Annu. Rev. Genomics Hum. Genet. 14, 301–323 (2013).
Dendrou, C. A., Petersen, J., Rossjohn, J. & Fugger, L. HLA variation and disease. Nat. Rev. Immunol. 18, 325–339 (2018).
Nakaoka, H. & Inoue, I. Distribution of HLA haplotypes across Japanese archipelago: similarity, difference and admixture. J. Hum. Genet. 60, 683–690 (2015).
Yamaguchi-Kabata, Y. et al. Genetic differences in the two main groups of the Japanese population based on autosomal SNPs and haplotypes. J. Hum. Genet. 57, 326–334 (2012).
Biau, G. & Scornet, E. A random forest guided tour. Test 25, 197–227 (2016).
Alexander, D. H., Novembre, J. & Lange, K. Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 19, 1655–1664 (2009).
Yamada, Y. et al. Disappearance of differences in nutrient intake across two local cultures in Japan: A comparison between Tokyo and Kyoto. Tohoku J. Exp. Med. 179, 235–245 (1996).
Liu, J. et al. Identification of genetic and epigenetic marks involved in population structure. PloS one 5, e13209 (2010).
Barfield, R. T. et al. Accounting for population stratification in DNA methylation studies. Genet. Epidemiol. 38, 231–241 (2014).
Tishkoff, S. A. & Kidd, K. K. Implications of biogeography of human populations for ‘race’ and medicine. Nat. Genet. 36, S21–S27 (2004).
Coop, G. Genetic similarity versus genetic ancestry groups as sample descriptors in human genetics. arXiv preprint arXiv:2207.11595 (2022).
Watanabe, K., Taskesen, E., Van Bochoven, A. & Posthuma, D. Functional mapping and annotation of genetic associations with FUMA. Nat. Commun. 8, 1826 (2017).
Rothman, K. J., Greenland, S. & Lash, T. L. Design strategies to improve study accuracy. Mod. Epidemiol. 3, 168–182 (2008).
Hayashi, H. et al. Genetic background of primary iron overload syndromes in Japan. Intern. Med. 45, 1107–1111 (2006).
Perry, G. H. et al. Diet and the evolution of human amylase gene copy number variation. Nat. Genet. 39, 1256–1260 (2007).
Sul, J. H., Martin, L. S. & Eskin, E. Population structure in genetic studies: Confounding factors and mixed models. PLoS Genet. 14, e1007309 (2018).
National Academies of Sciences, Engineering, and Medicine, Division of Behavioral and Social Sciences and Education, Health and Medicine Division, Committee on Population, Board on Health Sciences Policy & Committee on the Use of Race, Ethnicity, and Ancestry as Population Descriptors in Genomics Research. Using population descriptors in genetics and genomics research: a new framework for an evolving field. (National Academies Press (US), 2023). https://www.ncbi.nlm.nih.gov/books/NBK592836/.
Kozlov, M. ‘All of Us’ genetics chart stirs unease over controversial depiction of race. Nature https://doi.org/10.1038/d41586-024-00568-w (2024).
Li, X., Guo, T., Mu, Q., Li, X. & Yu, J. Genomic and environmental determinants and their interplay underlying phenotypic plasticity. Proc. Natl Acad. Sci. 115, 6679–6684 (2018).
Seabrook, J. A. & Avison, W. R. Genotype–environment interaction and sociology: Contributions and complexities. Soc. Sci. Med. 70, 1277–1284 (2010).
Freese, J. Genetics and the social science explanation of individual outcomes. Am. J. Sociol. 114, S1–S35 (2008).
Smith, G. D. et al. Genetic epidemiology and public health: hope, hype, and future prospects. Lancet 366, 1484–1498 (2005).
Howard, H. C., Sterckx, S., Cockbain, J., Cambon-Thomsen, A. & Borry, P. The convergence of direct-to-consumer genetic testing companies and biobanking activities: the example of 23andMe. In Knowing New Biotechnologies, 59–74 (Routledge, 2015).
Singleton, A., Erby, L. H., Foisie, K. V. & Kaphingst, K. A. Informed choice in direct-to-consumer genetic testing (dtcgt) websites: a content analysis of benefits, risks, and limitations. J. Genet. Counsel. 21, 433–439 (2012).
Carlsten, C. et al. Genes, the environment and personalized medicine: We need to harness both environmental and genetic data to maximize personal and population health. EMBO Rep. 15, 736–739 (2014).
Akiyama, M. et al. Genome-wide association study identifies 112 new loci for body mass index in the Japanese population. Nat. Genet. 49, 1458–1467 (2017).
Purcell, S. et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81, 559–575 (2007).
Loh, P.-R. et al. Reference-based phasing using the haplotype reference consortium panel. Nat. Genet. 48, 1443–1448 (2016).
Wang, K., Li, M. & Hakonarson, H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 38, e164–e164 (2010).
Ishii, K., Uenishi, K. & Ishida, H. Development of a simple “self-check sheet for calcium” and its reliability. Osteoporos. Jpn. 13, 497–502 (2005).
Shimomitsu, T. The final development of the brief job stress questionnaire mainly used for assessment of the individuals. Ministry of Labour sponsored grant for the prevention of work-related illness: The 1999 report 126–164 (2000).
Acknowledgements
We express our sincere gratitude to all the participants in this study. Special thanks go to DeNA Life Science Inc., Tokyo, Japan, and the University of Tokyo for their collaboration and the provision of support through a collaborative research fund. Additionally, we appreciate the support from the Human Genome Center at the Institute of Medical Science, the University of Tokyo (http://sc.hgc.jp/shirokane.html), for providing super-computing resources. This study was supported by the Human Genome Center at the Institute of Medical Science, the University of Tokyo. We acknowledge and thank the creators of the icons used in this publication. Icons for ‘questionnaire,’ ‘test,’ ‘brain,’ ‘food,’ ‘cyclocross,’ ‘dna,’ and ‘mutation’ were created by Freepik and obtained from Flaticon.com. Icons for ‘person’ were created by Valerie Lamm from The Noun Project, and the ‘Japan Map’ icon was created by Rahe, also from The Noun Project. These resources have been instrumental in the visual representation in Fig. 1.
Author information
Contributions
S.Imoto supervised the study. Y.C. designed the study and conducted the data analyses. Y.C. wrote the manuscript, with support from K.K. and S.Imoto. S. Ishida provided the data and contributed to writing part of the methods section. All authors reviewed the manuscript, and approved the final version for publication.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Communications Biology thanks Guanglin He, Fiona Hagenbeek, and the other anonymous reviewer for their contribution to the peer review of this work. Primary Handling Editors: Qiao Fan and Rosie Bunton-Stasyshyn.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Chen, Y., Katayama, K., Ishida, S. et al. Intricate interactions between fine-scale genetic structure, lifestyle, and dietary habits in the Japanese population. Commun Biol 8, 1046 (2025). https://doi.org/10.1038/s42003-025-08479-w