Intricate interactions between fine-scale genetic structure, lifestyle, and dietary habits in the Japanese population

Chen, Yichi; Katayama, Kotoe; Ishida, Sachiko; Imoto, Seiya

doi:10.1038/s42003-025-08479-w

Article
Open access
Published: 12 July 2025

Yichi Chen¹, Kotoe Katayama², Sachiko Ishida^3,4 & Seiya Imoto^1,2 (ORCID: orcid.org/0000-0002-2989-308X)

Communications Biology volume 8, Article number: 1046 (2025)

This article has been updated

Abstract

The fine-scale genetic structure within populations, focusing on demographic histories and migration patterns, has been explored previously. However, limited attention has been paid to understanding how genetic structure influences lifestyle and dietary habits within an epidemiological framework. This study explores the fine-scale genetic structure within a homogeneous Japanese population using advanced unsupervised learning techniques—Principal Component Analysis (PCA), Uniform Manifold Approximation and Projection (UMAP), and Density-Based Spatial Clustering of Applications with Noise (DBSCAN)—coupled with direct-to-consumer genetic testing data. We investigate the associated genetic factors and examine the relationship between the genetic structure and geographic ancestry. Additionally, using cross-sectional data and multinomial logistic regression, we further elucidate the nuanced impacts of lifestyle and dietary factors across genetic clusters, emphasizing the importance of integrating genetic data with epidemiological research. This study introduces a new framework for genetic epidemiology that considers both genetic and environmental influences.

Introduction

In the era of personalized and customized health, the establishment and expansion of direct-to-consumer genetic testing (DTC-GT) have opened new avenues for gaining deeper insights into genetic information and disease risk prediction derived from prior genome-wide association studies (GWAS)¹. The extensive data from DTC-GT companies like 23andMe, which are widely used in scientific research, highlight the growing recognition of the value of such data². Because DTC-GT data are easy to use, genetic research increasingly relies on them, aiming for more accurate and personalized healthcare applications. In Japan, MYCODE, provided by DeNA Life Science (DLS), Inc. (Tokyo, Japan), has taken the lead in the DTC-GT market³. It empowers users to make informed health decisions, potentially improves their behavioral habits, and contributes to scientific research, in which the MYCODE community voluntarily participates³.

In the landscape of GWAS, challenges stem from the population structure, which poses risks of false positives owing to the inadvertent inclusion of individuals with undetected admixtures⁴, prompting a growing support for machine-learning methods⁵. With the increasing availability of extensive genetic data, data-driven approaches have emerged as promising tools^6,7. Although principal component analysis (PCA), a linear transformation method to identify principal components (PCs), remains prevalent in genetic research⁸, a notable shift has occurred toward employing unsupervised machine-learning methods, including t-distributed stochastic neighbor embedding (t-SNE) and uniform manifold approximation and projection (UMAP)^9,10. These methods, used for reducing dimensionality in genotype data, improve the visualization of genetic variations, help delineate population structures, and mitigate stratification biases in GWAS^6,11, especially after PCA initialization^12,13. UMAP leverages Riemannian geometry and algebraic topology, enabling it to capture and retain the complex high-dimensional topology of data points within a lower-dimensional space¹⁰. This method has demonstrated superior performance in capturing fine-scale population structures across various genetic data types, such as bulk transcriptomics, single-cell RNA sequencing, and single nucleotide polymorphism (SNP) genotype data^12,13,14. Its efficacy extends to visualizing ancestral composition in cohorts and discerning intricate patterns in diverse biobanks, establishing UMAP as a valuable tool to unravel the complexities of genetic data¹⁵.

The Japanese population is considered genetically homogeneous and characterized by relatively low genetic diversity¹⁶. Studies have been dedicated to revealing the fine-scale genetic structure within the Japanese population, recognizing the existing genetic diversity based on the dual-structure model^17,18. A recent Biobank Japan study found that the PCA-UMAP dimensionality reduction method outperformed other techniques, demonstrating its superior efficacy in capturing the intricate structure of the Japanese subpopulation¹². PCA-UMAP merges PCA and UMAP by applying UMAP to the principal components of genotype data, resulting in a more accurate classification of population clusters. This approach also offers computational advantages and reduces statistical noise. The fine-scale genetic structure of the Japanese population, primarily focusing on uncovering ancestry and geographical differences in the admixture proportion due to historical migration, has been explored earlier^12,19. These studies clearly revealed the underlying population structure in Japan, emphasizing the genetic differences among islands and providing evidence of historical migration. Regional differences in the phenotypes of complex traits in Japanese individuals have been attributed to regional genetic differences²⁰. Previous investigations have concentrated on the geographic distribution of genetic backgrounds, primarily through a polygenic risk score, which is a numerical representation of the impact of genetic variants on a trait identified by GWAS^12,21. Isshiki et al.²² found that the distribution of polygenic height scores could explain the height gradient observed in Japan. Similarly, Sakaue et al.¹² discovered variations in the polygenic risk score for complex human traits among Japanese subpopulations. However, few studies have explored how population genetic structure influences comprehensive lifestyle and dietary habits within an epidemiological framework.

To address this gap, our objective was to investigate the intricate interactions between fine-scale genetic structure, lifestyle, and dietary habits in the Japanese population by combining several methods. In this study, we applied unsupervised learning methods, including PCA, UMAP, and the density-based spatial clustering of applications with noise (DBSCAN) algorithm²³, to MYCODE genotype data to investigate the fine-scale genetic structure of the population, represented as clusters. We subsequently employed GWAS and post-GWAS analyses to identify genetic factors associated with the clusters as traits (Fig. 1a and b). Following this, we conducted a comprehensive examination of the associations between geographical information, lifestyle, dietary habits, and genetic clusters using statistical analyses (Fig. 1c). This approach aimed to provide an innovative perspective combining genetic insights with epidemiological statistics. Through this procedure, we revealed intricate interactions between the population’s fine-scale genetic structure and lifestyle and dietary habits (Fig. 1d).

Results

Dimensionality reduction and clustering

After conducting genotype data quality control and LD pruning, we performed PCA on 395,665 pruned-in variants from 49,440 participants (Supplementary Figs. 1 and 2). Subsequently, UMAP was applied to the first 20 PCs of the genotype data, which is depicted in a two-dimensional scatter plot (Fig. 2a). The PCA-UMAP provided a discrete separation of individuals into six clusters, as observed by visual inspection (Fig. 2b). This observation established that the combination of unsupervised learning methods, PCA and UMAP, effectively distinguished participants within the same ethnic population. Furthermore, we excluded noise and participants who did not belong to the six largest clusters. This resulted in 43,726 participants being distributed among the top six clusters while maintaining the same projection shape (Fig. 2b).
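The clustering step can be illustrated with a minimal, pure-Python sketch of DBSCAN followed by the "keep the largest clusters" filter. The toy 2D points stand in for a PCA-UMAP embedding, and `eps`, `min_samples`, and `keep_largest` are hypothetical values chosen for the example; they are not the parameters used in the study.

```python
from collections import Counter
from math import dist

def dbscan(points, eps, min_samples):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise."""
    n = len(points)
    # A point's neighborhood includes the point itself.
    neighbors = [[j for j in range(n) if dist(points[i], points[j]) <= eps]
                 for i in range(n)]
    labels = [None] * n                      # None means "not visited yet"
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1                   # provisional noise
            continue
        cluster += 1                         # i is a core point: start a cluster
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster          # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])   # j is also core: keep expanding
    return labels

# Hypothetical embedding: a blob of 5, a blob of 4, and one isolated point.
points = [(0, 0), (0, 0.1), (0.1, 0), (0.1, 0.1), (0.05, 0.05),
          (5, 5), (5, 5.1), (5.1, 5), (5.1, 5.1),
          (10, 0)]
labels = dbscan(points, eps=0.3, min_samples=3)

# Drop noise and keep only the largest clusters (here the single largest).
keep_largest = 1
sizes = Counter(label for label in labels if label != -1)
largest = [label for label, _ in sizes.most_common(keep_largest)]
kept = [i for i, label in enumerate(labels) if label in largest]
```

In this toy case the isolated point is labeled as noise and only the five points of the larger blob are retained, mirroring the exclusion of noise and minor clusters described above.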

**Fig. 2: Genetic clusters after the dimensionality reduction method.**

Cluster 1, represented in blue in Fig. 2b, held a central position in the PCA-UMAP plot, while Cluster 6 was positioned the furthest from Cluster 1. This spatial arrangement suggests a clear and structured pattern within the homogeneous Japanese genetic data, indicating the underlying fine-scale population structure and potential genetic diversity. In total, 43,726 eligible healthy Japanese adults were included in the GWAS, post-GWAS, and multinomial logistic regression analyses (Supplementary Fig. 1). Cluster 1 comprised the largest proportion of the total population (54.5%), while Cluster 6 had the lowest proportion at 1.8%. The mean age and BMI of the entire study population were 45 (SD = 9.7) years and 22 (SD = 3.5) kg/m², respectively. Males constituted 50.7% of the total population. In addition, the distributions of age, BMI, sex ratio, drinking and smoking status, and blood glucose level were similar across all clusters (Supplementary Table 1).

Genetic differentiation among genetic clusters

Using Weir and Cockerham’s fixation index (F_ST)²⁴, we assessed the proportion of genetic diversity among the six clusters identified by DBSCAN clustering. Weir and Cockerham’s F_ST incorporates unbiased estimators of variance components and offers a reliable metric for assessing genetic differentiation. A weighted mean F_ST of 0.000904 (SD = 0.015) indicated relatively low genetic differentiation among the clusters, confirming the homogeneity of the Japanese population in the study group. F_ST values for pairwise comparisons between clusters and the comparison of each cluster with the entire population are presented in Fig. 3a. Despite a low level of differentiation, distinctions were observed. Clusters 2 and 3 exhibited the highest genetic dissimilarities. Clusters 1, 5, and 6 exhibited the highest similarity, with the lowest pairwise F_ST in the heatmap. Notably, the lowest F_ST was observed between Cluster 1 and the entire population, emphasizing the close resemblance of Cluster 1 to the overall population.
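To make the quantity concrete, the sketch below computes a simplified Wright-style F_ST for a single biallelic SNP from per-cluster allele frequencies, as (H_T − H_S) / H_T. This is deliberately not Weir and Cockerham's estimator, which additionally corrects for sample size through variance components; the function name, frequencies, and cluster sizes are hypothetical.

```python
def wright_fst(freqs, sizes):
    """Simplified F_ST for one biallelic SNP.

    freqs: allele frequency of the same allele in each cluster
    sizes: number of individuals in each cluster (used as weights)
    """
    total = sum(sizes)
    # Pooled allele frequency across all clusters.
    p_bar = sum(p * n for p, n in zip(freqs, sizes)) / total
    # Expected heterozygosity in the pooled population.
    h_total = 2 * p_bar * (1 - p_bar)
    # Size-weighted mean of within-cluster expected heterozygosity.
    h_within = sum(2 * p * (1 - p) * n for p, n in zip(freqs, sizes)) / total
    if h_total == 0:
        return 0.0          # monomorphic site: no differentiation to measure
    return (h_total - h_within) / h_total

# Hypothetical examples with two equally sized clusters.
fst_similar = wright_fst([0.48, 0.52], [100, 100])   # nearly identical frequencies
fst_fixed = wright_fst([0.0, 1.0], [100, 100])       # fixed for opposite alleles
```

With nearly identical frequencies the value is close to zero (about 0.0016 here), which is the regime the genome-wide mean reported above falls into; complete fixation for opposite alleles gives 1.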

We then identified SNPs with noteworthy F_ST values, suggesting a potential association with genes influencing adaptive or economically significant traits²⁵. Identifying SNPs based on the mean and SD of F_ST is a commonly employed method to detect selection signatures²⁶. The Manhattan plot (Fig. 3b) revealed peak SNPs on chromosomes 6 (mean F_ST = 0.009772) and 11 (mean F_ST = 0.002453). Among 395,665 SNPs, 1339 and 132 on chromosomes 6 and 11, respectively, were identified as outliers surpassing three SDs from the mean, indicating that these loci had values higher than 99.8% of the total SNPs²⁶. We established a threshold of F_ST > 0.5 to highlight genomic regions containing the most differentiated SNPs, with detailed annotations available in Supplementary Data 1 and 2. The closest genes associated with these regions are illustrated in Fig. 3c and d. The SNPs exhibiting the highest F_ST are located in the regions 6p22 and 6p21. The 6p21 region lies within the Major Histocompatibility Complex (MHC) and harbors a high concentration of its core genes. The adjacent 6p22 region also contributes to the broader context of MHC-related genetic functions. These regions correspond with the extended MHC region and have been previously linked to schizophrenia susceptibility²⁷. Specifically, SNPs rs189984590 and rs548568008, located on chromosome 6 within the EHMT2 and PRRT1 genes, respectively, exhibited F_ST values close to 0.6. EHMT2, a key epigenetic regulator, is involved in histone modification and has implications for neurological disorders²⁸. PRRT1 affects synaptic processes and is crucial for neurodevelopment and cognitive functions²⁹. On chromosome 11, the SNPs demonstrating the greatest genetic differentiation are found in the 11p11 region, which is enriched for genes associated with neurodevelopmental processes³⁰.
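The outlier rule described above (values more than three SDs above the mean) can be sketched as follows. The SNP identifiers and F_ST values are made up for illustration, and the use of the population SD is an assumption, since the text does not state which SD definition was applied.

```python
from statistics import mean, pstdev

def fst_outliers(fst_by_snp, n_sd=3.0):
    """Return (cutoff, outlier SNP ids) using a mean + n_sd * SD rule."""
    values = list(fst_by_snp.values())
    cutoff = mean(values) + n_sd * pstdev(values)
    outliers = sorted(snp for snp, value in fst_by_snp.items() if value > cutoff)
    return cutoff, outliers

# Hypothetical data: 99 weakly differentiated SNPs and one strongly differentiated SNP.
fst_by_snp = {f"snp_{i}": 0.001 for i in range(99)}
fst_by_snp["snp_high"] = 0.6

cutoff, outliers = fst_outliers(fst_by_snp)
```

Because the cutoff depends on the spread of the whole distribution, a handful of extreme loci raises the SD and with it the threshold, which is one reason a fixed threshold such as F_ST > 0.5 can be a useful complement for highlighting the most differentiated regions.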

The MHC region, particularly the human leukocyte antigen (HLA) genes, is crucial in a diverse range of complex human diseases and quantitative traits³¹. The MHC is categorized into three subclasses: class I, including highly polymorphic genes such as HLA-A, HLA-B, and HLA-C; class II, encoding genes such as HLA-DPA1, HLA-DPB1, HLA-DQA1, and others involved in antigen presentation; and class III, containing genes associated with inflammatory responses, leukocyte maturation, and the complement cascade³². In 2015, Nakaoka et al.³³ investigated the distinction in HLA frequencies across 10 regional populations in Japan and found that HLA-A*24:02-C*12:02-B*52:01-DRB1*15:02-DPB1*09:01 exhibited distinctive regional frequency patterns. Consistent with previous observations^33,34, we identified highly differentiated SNPs within the four clusters, predominantly located in or near the HLA region on chromosome 6 (HLA-B, -DRA, -DRB1, -DQB1, -DQA2, and -DQB2). Collectively, despite the absence of specific geographical and regional information, our unsupervised learning methods effectively classified clusters exhibiting genetic differentiation that had been previously validated, thereby defining the fine-scale genetic structure of the study population.

To strengthen the reliability of our findings, we conducted additional F_ST analyses on randomly generated clusters, ensuring these clusters retained the structure of the identified clusters in our study. We assessed the similarity of different sets of clusters using the Adjusted Rand Index (ARI). The results, detailed in Supplementary Table 2 and Supplementary Data 3 and 4, confirm that the genetic clusters observed in our study are not due to random variation but represent distinct and genuine genetic fine-structures within the population.
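As an illustration of this kind of check, the sketch below computes the ARI between two labelings from pair counts and builds a random labeling that preserves cluster sizes by shuffling the labels. Interpreting "retained the structure" as preserving cluster sizes is an assumption, and the labels and the seed are hypothetical.

```python
import random
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two labelings of the same individuals."""
    n = len(labels_a)
    # Pairs of individuals placed together in both labelings.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / comb(n, 2)   # agreement expected by chance
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0                                    # degenerate case: trivial labelings
    return (together_both - expected) / (maximum - expected)

def shuffled_labels(labels, seed):
    """Random labeling with exactly the same cluster sizes as `labels`."""
    shuffled = list(labels)
    random.Random(seed).shuffle(shuffled)
    return shuffled

observed = [0, 0, 1, 1, 2, 2]
relabeled = [1, 1, 0, 0, 2, 2]            # same partition, different cluster names
ari_same = adjusted_rand_index(observed, relabeled)
ari_crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
random_clusters = shuffled_labels(observed, seed=0)
```

An ARI of 1 means the two partitions are identical up to renaming, values near 0 indicate chance-level agreement, and negative values indicate less agreement than expected by chance.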

Association between genetic clusters and geographic ancestry

Expanding upon this genetic categorization, we consolidated Clusters 1, 5, and 6 into a new entity designated Cluster 1*, guided by their genetic similarities as revealed through F_ST analysis. This strategic consolidation was informed by the low genetic differentiation observed between these clusters. F_ST values and characteristics of the newly formed clusters are presented in Supplementary Fig. 3 and Supplementary Table 3.

To elucidate the relationship between these PCA-UMAP-DBSCAN-defined clusters and geographic ancestry, we engaged a cohort of 2268 participants from a follow-up survey that gathered geographical information. We first performed chi-square tests to reveal potential associations between an individual’s genetic cluster and familial geographic lineage. As shown in Fig. 4a, we found that the genetic clusters were significantly related to seven questions in the survey, which covered geographical details ranging from the individual’s birthplace to the birthplaces of their parents and grandparents. Subsequently, we developed random forest models, which are powerful machine-learning techniques known for their simplicity and robustness³⁵, to predict genetic clusters based on geographic information. Two models were used: one incorporated only the geographic variables that exhibited significant associations in the chi-square tests (labeled sig), whereas the other utilized the entire suite of geographic data collected (labeled all). The ROC curves, illustrated in Fig. 4b, yielded area under the curve values of approximately 0.76. These values suggested a certain predictive capability of the geographic data, and also highlighted the potential influence of additional factors in genetic diversity, given their accuracy scores of 0.60 and 0.59, respectively.
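The chi-square test of independence used for the first step can be sketched from a contingency table of genetic cluster by birthplace region. The counts below are invented for illustration, and the sketch returns only the test statistic and degrees of freedom; obtaining a P value additionally requires the chi-square distribution, for example from a statistics library.

```python
def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected if cluster and region were independent.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof

# Hypothetical counts: rows are genetic clusters, columns are birthplace regions.
cluster_by_region = [
    [30, 10, 10],
    [10, 30, 10],
    [10, 10, 30],
]
statistic, dof = chi_square_independence(cluster_by_region)
```

A large statistic relative to its degrees of freedom, as in this toy table, indicates that cluster membership and region are not independent.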

**Fig. 4: Analysis of genetic clusters in relation to geographic ancestry.**

Further insights into the genetic structure of the cohort were obtained through unsupervised analysis with ADMIXTURE³⁶, software designed for the maximum likelihood estimation of individual ancestries. We systematically explored the values of K, which represents the number of ancestral populations assumed, to discern the fine-scale genetic structure of our cohort’s ancestry (Supplementary Figs. 4 and 5). We selected K = 4 to allow comparison with our previous clustering findings from the PCA-UMAP-DBSCAN and F_ST analyses. This choice revealed four distinct ancestral genetic mixtures, as depicted in Fig. 4c, with detailed information in Supplementary Data 5. The distribution of these mixtures illustrates a clear alignment with the genetic clusters previously defined through other methodologies (Fig. 4d). Specifically, Mixture B was composed predominantly of individuals from Cluster 3 (150 of 168, i.e., 89.2%), whereas Mixture C mainly consisted of individuals from Cluster 2 (230 of 260, i.e., 88.5%). By contrast, Mixtures A and D were primarily composed of individuals from the merged Cluster 1*, which comprised 1,485 individuals (65.5%) in the follow-up cohort.

The alignment of genetic clusters with ancestral mixtures defined by ADMIXTURE analysis suggested a patterned genetic composition. The combined findings from the chi-square tests, predictive modeling, and ADMIXTURE analysis revealed a structured picture of a genetic cluster related to geographic ancestry. Populations tend to exhibit clustering according to geographic region, as determined by genetic distance. Although geographical information contributes to predicting genetic clusters³⁷, our analysis indicates that it is not the sole definitive factor, suggesting that other factors could contribute to the genetic diversity of the population. The alignment observed in two types of clusters, where there was a significant overlap between the structures defined by machine-learning techniques and those identified by ADMIXTURE analysis, suggests that historical demographic events for particular clusters could significantly influence genetic compositions. This observation implies that the underlying genetic profiles of these clusters may be more prominently shaped by geographical drift than those of the other clusters. Beyond geographical distance, factors such as epigenetic markers, notably DNA methylation, a crucial epigenetic mechanism that is sensitive to environmental influences, may impact population structure^38,39. These influences suggest that genetic diversity is the product of complex gene-environment interactions, which are further complicated by epigenetic modifications that reflect the intricate genetic and epigenetic interplay within populations³⁸. As physical and social barriers among different populations decrease, genetic mixing becomes more common, suggesting that factors beyond geography are increasingly influential in shaping the population structure⁴⁰. Overall, our results suggest a complex relationship among the concepts of genetic similarity, genetic ancestry, and genealogical ancestry⁴¹.

Integrated insights from genetic interpretation and epidemiological results

We conducted GWAS and post-GWAS analyses based on the newly assigned clusters to investigate potential differences in gene expression, lifestyle, and dietary habits (Supplementary Data 6–10). The mapped genes of each cluster were used for gene set enrichment analysis (GSEA) with the FUMA pipeline⁴² to evaluate whether the genes exhibit statistically significant associations with trait-associated genetic variants. FUMA used 15 databases for GSEA, identifying 1616 significant relationships (Supplementary Data 9). Our study particularly highlighted the GWAS Catalog database, which is pivotal for exploring the genetic foundations of diverse traits and diseases, making it relevant to our investigation of the potential links between genetic variations and lifestyle and dietary habits. Among the gene sets analyzed, the GWAS Catalog revealed 220 significant relationships. Notably, these relationships exhibited the second-lowest P when gene sets were evaluated as groups, as illustrated in Supplementary Fig. 6.

A total of 64 unique gene sets were identified as statistically significant, with intriguing findings such as the “Fruit Consumption” gene set ranking 20th when sorted by adjusted P in ascending order, and the last associated with “A body shape index” (Supplementary Data 10). Given the information density, we focused on the distinct traits of each cluster in the Japanese population, employing a Circos plot (Fig. 5a). The upper circle segments correspond to the specific gene sets that are unique to each cluster, as determined by their non-overlapping presence in the GWAS Catalog database. For example, Cluster 1* was distinctly associated with HDL cholesterol levels, while Cluster 3 was uniquely linked to metabolic traits such as “Alanine aminotransferase” and “Glycated hemoglobin levels”.

**Fig. 5: Integrated insights of genetic interpretation and epidemiological results.**

Despite these specific associations, a uniform distribution of gene sets and their interrelations across clusters was evident (Supplementary Fig. 7). The consistent significance across gene sets likely stemmed from shared significant SNPs (Supplementary Data 11). This pattern became more apparent in the context of the one-vs-all GWAS analysis, where the studied traits essentially represented the fine-scale population structure, and the persistent occurrence of particular SNPs resulted from the shared genetic background within the homogeneous Japanese population. The consistently observed shared genetic influences, as reflected in the enrichment of the same gene sets within the population’s fine-scale genetic structure, can be attributed to pleiotropy, where a single gene can affect multiple seemingly unrelated traits. This finding highlights the effect of specific genetic loci on diverse traits, thereby emphasizing the importance of accounting for pleiotropy in explaining the genetic basis of phenotypic diversity.

In addition, we generated a forest plot using statistically significant results from multinomial logistic regression analysis of cross-sectional data (Fig. 5b) to illustrate the odds ratios (ORs) and 95% confidence intervals (CIs). The regression model was adjusted for age, sex, and BMI⁴³ using Cluster 1* as the reference group because it was observed to be the most similar to the entire population. In total, 21 significant results were found among 175 exposure variables, with an average of 23,521 individuals observed per variable after excluding missing values. Among the significant findings, the lowest P value, around 0.005, was observed for “Daily Milk Intake” in Cluster 3. The figure presents consistent positive associations in Clusters 3 and 4, notably with weekly food consumption. Additionally, the results included two exposures, “Total Weekly Calcium Consumption” for Cluster 4 (OR = 0.998, CI: 0.997–0.9999, P = 0.042) and “Total Weekly Fish Consumption” for Cluster 3 (OR = 1.002, CI: 1.000–1.003, P = 0.025). The small magnitude of these ORs suggested nuanced effects (see Supplementary Data 12 for detailed information, including survey questions). The regression analysis indicated significant associations across two domains: quality of life, emphasizing sleep disturbance and fatigue; and dietary habits, with a focus on the intake of fish and vegetables. An examination of Fig. 5a and b together reveals interesting patterns. For instance, Clusters 3 and 4 have higher vegetable consumption compared to the reference Cluster 1*. Furthermore, the Circos plot indicates that these clusters uniquely exhibit statistically significant GSEA results for body shape, feelings of guilt, and non-lobar intracerebral hemorrhage (MTAG). Such variations highlight the diverse impacts of these factors on health across different genetic groups. The observed differences among genetic clusters suggest a complex interplay between epidemiological exposures and fine-scale genetic structures.
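For readers less familiar with how such ORs are read, the sketch below shows the standard relationship between a logistic regression coefficient and its OR with a Wald 95% CI, and how a per-unit OR compounds over a larger change in the exposure. The coefficient, standard error, and the 100-unit change are hypothetical; the measurement units of the exposures are defined in Supplementary Data 12 and are not assumed here.

```python
from math import exp

def odds_ratio_ci(beta, se, z=1.96):
    """Odds ratio and Wald confidence interval from a log-odds coefficient."""
    return exp(beta), exp(beta - z * se), exp(beta + z * se)

def scaled_odds_ratio(per_unit_or, units):
    """Odds ratio implied by a change of `units` in the exposure.

    Log-odds are linear in the exposure, so per-unit ORs multiply.
    """
    return per_unit_or ** units

# Hypothetical coefficient with no effect: OR is 1 and the CI straddles 1.
or_null, lower_null, upper_null = odds_ratio_ci(beta=0.0, se=0.1)

# A per-unit OR of 1.002 looks negligible, but compounds over 100 units.
or_100_units = scaled_odds_ratio(1.002, 100)
```

Under these assumptions a per-unit OR of 1.002 corresponds to an OR of roughly 1.22 for a 100-unit difference, so whether an OR close to 1 is small in practice depends on the scale on which the exposure is measured.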

Discussion

In this study, we identified the population structure of the Japanese population using unsupervised learning techniques including PCA, UMAP, and DBSCAN. We investigated the genetic factors associated with these clusters through F_ST, GWAS, and post-GWAS analyses, uncovering significant gene sets linked to specific traits and diseases. Notably, gene sets related to dietary habits such as “Fish- and plant-related diet” and health conditions like “Liver iron content” were identified, highlighting the unique genetic background of the Japanese population⁴⁴ and the potential impact of genetics on health and lifestyle choices⁴⁵. Despite the uniform genetic background across all clusters, the variance in the epidemiological relationships between lifestyle factors, dietary habits, and genetic clusters highlights the complex relationship between genetics and environmental factors. Multinomial regression analysis was used to compare different clusters in relation to lifestyle and dietary habits, illustrating the varied impacts of these factors across genetic clusters. These combined results underscore the complexity of genetic predispositions and their interactions with lifestyle and dietary habits, offering insights into gene-environment interactions by utilizing the concept of fine-scale genetic structure⁴⁶.

Genetic research on the Japanese population has delved into its adaptation through two principal lenses: the admixture history of ancient lineages, highlighting the dual-structure model that illustrates the indigenous identities of the Jomon and Yayoi peoples; and the genetic distinctions within geographical regions (Hondo and Ryukyu)¹⁷. Prior studies that aimed to understand the genetic fine-scale structure in the Japanese population^12,19 have primarily served to confirm the concordance between geographical distribution and the observed genetic structure. However, a pressing question remains: is employing geographical or indigenous labels as proxies for the identified fine-scale structures the optimal approach? Awareness of the need to address the nuanced issue of population descriptors in genetics research has been growing⁴⁷, especially in terms of presenting genetic classifications to the public without fueling debates on genetic essentialism⁴⁸. In 2023, the US National Academies of Sciences, Engineering, and Medicine (NASEM) published guidance recommending best practices for using population descriptors in research⁴⁷. This guidance advises carefully considering the use of geographical ancestry to avoid misconceptions, particularly the risk of misinterpreting geography as indicative of environmental exposures. It also warns against using indigenous identity in ways that could imply “purity” and lead to discriminatory interpretations.

The fine-scale structure we uncovered reveals patterns where clusters serve as indicators of individuals with substantial genetic similarity. Our use of “genetic similarity” as a descriptor aligns with existing guidelines⁴⁷. It is a well-accepted concept that phenotypes arise from the dynamic interplay between genetics and the environment^49,50. Our analysis reveals distinct patterns among populations grouped by genetic similarity, suggesting potential interactions between genetic factors and environmental conditions. In light of these findings, we highlight the need for future studies that explicitly incorporate gene-environment analyses. Such research would more clearly elucidate the impact of societal changes on genetic variation through complex social determinants and interactions. This points to the necessity of including a broad spectrum of environmental and social factors in genetic research. It is essential to integrate sociological and environmental contexts into genetic studies to fully explore the intricate, intertwined relationships between society and human biology. This approach would help uncover how societal developments may influence genetic variation through detailed causal networks of social determinants and gene-environment interactions⁵¹.

Our study helps to understand the complex factors influencing health outcomes, with implications for both public health and precision medicine. The application of advanced unsupervised learning technologies, such as PCA-UMAP-DBSCAN, effectively identified clusters with significant genetic differentiation within a homogeneous population. These combined technologies have uncovered intricate patterns within genetic data, demonstrating efficient classification based on intrinsic genetic structures. Simplifying the complex genotype matrix into discrete categorical variables that represent the fine-scale genetic structure enables the use of epidemiological regression. Our statistical analysis revealed gene-environment interactions across clusters that share a homogeneous genetic background. This comprehensive approach links genetic predispositions to lifestyle and dietary patterns, yielding valuable insights for public health interventions. This framework also highlights the possibility of strengthening public health research by including genetic structure as an essential variable in mainstream statistical studies, as integrating genetic architecture could improve our understanding of complex relationships in genetic epidemiology⁵². Moreover, the convenient nature of the DTC-GT data enables the creation of near-time biobank data, allowing researchers to engage in more precise data processing⁵³. Swift recruitment facilitated by DTC-GT enables faster follow-up studies and offers crucial insights into personalized medicine. This approach allows for the rapid assembly and modification of prevention programs, thereby improving their effectiveness⁵⁴. Additionally, incorporating gene-environment interactions and genetic clusters into the development of personalized prevention programs may improve their efficacy. By considering genetic susceptibility and modifiable lifestyle factors, these programs could improve personalized medicine by matching individuals using genomic profiling⁵⁵.

Our study has some limitations. First, validation of gene set enrichment patterns in future studies is essential to establish the robustness of the explanation for the entire Japanese population. Second, the simplicity of the overall regression model, which was adjusted only for age, sex, and BMI, could introduce a potential bias in the interpretation of these relationships. The ORs close to 1.0 may indicate a potential limitation in establishing substantial associations with the respective clusters. Third, the exclusion of missing data could introduce a potential source of bias. The use of self-reported questionnaires for lifestyle and dietary habits in the cross-sectional study could also introduce recall bias. Additionally, while UMAP effectively highlights local data structures, it may overemphasize the differentiation between clusters. Future studies might benefit from validating findings through additional methods that offer different perspectives on the global relationships among genotype data. Finally, inherent self-selection bias could arise from customers opting to participate in the DTC-GT, potentially leading to subject bias. These limitations underscore the need for cautious interpretation and highlight the areas for consideration in future research.

In conclusion, by applying machine-learning techniques, we revealed the fine-scale genetic structure of Japanese DTC-GT customers, marked by high genetic variation among the identified clusters, and revealed a significant relationship between these genetic clusters and geographical ancestry. This study introduces an innovative framework for genetic epidemiology by integrating genetic insights with cross-sectional statistical analyses, shedding light on the subtle effects of lifestyle and dietary factors across different genetic clusters. With the future availability of health outcome data, research could further explore the relationship between risk factors and health outcomes within the context of genetic architecture.

Methods

Study participants

A total of 61,728 healthy Japanese adults were initially enrolled as customers of MYCODE (DLS Inc., Tokyo, Japan), a personal genome service in Japan. Participants returned their saliva samples to the DLS laboratory along with a signed consent form indicating their willingness to participate in MYCODE Research, in which their anonymous genetic data and health-related information would be used for scientific research. After sample quality control by the DLS laboratory, 55,551 participants and 684,436 autosomal SNPs were retained. Samples were excluded for sex inconsistency, identity by descent ($\hat{\pi } > 0.1875$), missing call rate (>0.01), non-Japanese ancestry estimated by genetic PCA using East Asian samples from the 1000 Genomes Project Phase 3 version 5, and autosomal heterozygosity (>3 standard deviations (SD) above the mean).

For this study, the eligibility criteria were further established as follows: (i) age 20–64 years, (ii) validated height and weight data, and (iii) height within three times the interquartile range (IQR)⁵⁶. PCA was conducted on the genotype data of the participants following the quality control procedures described below using PLINK (v1.9)⁵⁷.
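The three eligibility criteria can be sketched as a simple filter. This is a minimal illustration, not the authors' code: the text does not spell out how the "three times the IQR" rule is anchored, so the fences below (Q1 − 3·IQR and Q3 + 3·IQR) are an assumption, and the function names and toy heights are hypothetical.

```python
import statistics

def iqr_bounds(values, k=3.0):
    """Return (lower, upper) fences at Q1 - k*IQR and Q3 + k*IQR (assumed rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def is_eligible(age, height_cm, weight_kg, bounds):
    """Apply criteria (i)-(iii); None marks a missing or unvalidated measurement."""
    if height_cm is None or weight_kg is None:
        return False
    lower, upper = bounds
    return 20 <= age <= 64 and lower <= height_cm <= upper

# Toy heights in cm (hypothetical values, for illustration only).
heights = [150.0, 155.0, 160.0, 165.0, 170.0, 175.0, 180.0]
bounds = iqr_bounds(heights)
```

With these toy values the quartiles are 157.5 and 172.5, so the fences fall at 112.5 cm and 217.5 cm; a multiplier of 3 removes only extreme outliers rather than trimming the tails of the distribution.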

All ethical regulations relevant to human research participants were followed. The study was approved by both the ethics committee of DeNA Life Science Inc. (protocol #20140717_1) and the Institute of Medical Science, University of Tokyo (protocol #2019-48-1219) (Tokyo, Japan). Participants consented to the publication of research findings using their data, under the condition that no personally identifiable information is disclosed. All personal identifiers have been removed to protect confidentiality and comply with ethical standards.

Genotyping and quality control

SNP genotyping was performed using the Infinium OmniExpress-24+ BeadChip or Human OmniExpress-24+ BeadChip (Illumina, Inc., San Diego, CA, USA) in the DLS laboratory. Before genotype imputation, quality control was performed on the 684,436 genotyped autosomal SNPs, excluding SNPs with (i) minor allele frequency < 0.01, (ii) Hardy-Weinberg equilibrium P < 10⁻⁶, or (iii) missing call rate > 0.01. Haplotype phasing was first conducted on the entire cohort of 55,551 individuals using Eagle (v2.4.1)⁵⁸, with the reference panels of the 1000 Genomes Phase 3 version 5 (1KGP-JPT; available at https://mathgen.stats.ox.ac.uk/impute/1000GP_Phase3.html). Genotype imputation was performed using Minimac3. Quality control of the imputed genotype data excluded SNPs with (i) minor allele frequency < 0.01, (ii) Hardy-Weinberg equilibrium P < 10⁻⁶, (iii) call rate < 95%, or (iv) r² < 0.7. All genetic variants initially identified were re-annotated using “bcftools annotate”. Linkage disequilibrium (LD) pruning was conducted after quality control with the option “--indep-pairwise 50 5 0.2” in PLINK (v1.9)⁵⁷. After the eligibility criteria were applied to the study participants, the dataset consisted of 395,665 pruned-in variants from 43,726 individuals. Genomic coordinates were based on the GRCh37/hg19 reference.
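To make the per-SNP exclusion criteria concrete, the following sketch computes the minor allele frequency and missing call rate for a single SNP and applies the pre-imputation thresholds quoted above. It is an illustration only: the study ran these filters in PLINK, the Hardy-Weinberg test is omitted here for brevity, and the function names and 0/1/2/None genotype coding are assumptions.

```python
def snp_qc_stats(genotypes):
    """Return (minor allele frequency, missing call rate) for one SNP.

    genotypes: alternate-allele counts (0, 1, 2), or None where uncalled.
    """
    called = [g for g in genotypes if g is not None]
    missing_rate = 1 - len(called) / len(genotypes)
    if not called:
        return 0.0, missing_rate
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1 - alt_freq), missing_rate

def passes_qc(genotypes, maf_min=0.01, missing_max=0.01):
    """Keep a SNP only if MAF >= 0.01 and missing call rate <= 0.01."""
    maf, missing_rate = snp_qc_stats(genotypes)
    return maf >= maf_min and missing_rate <= missing_max

common = [0, 1, 2, 1] * 50          # alternate allele frequency 0.5
rare = [0] * 199 + [1]              # alternate allele frequency 0.0025
gappy = [0, 1, 2, None] * 50        # 25% of calls missing
```

Here `common` is retained, while `rare` fails the frequency filter and `gappy` fails the call-rate filter.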

Dimensionality reduction and clustering

We followed the dimensionality reduction approach described in previous literature¹². First, we performed PCA using PLINK (v1.9) with the default parameters. The first two PCs were projected onto a two-dimensional space for outlier inspection. We then performed UMAP on the first 20 PCs of the genotype data using the Python umap-learn package (v0.5.3; https://pypi.org/project/umap-learn/) with the default parameters. The choice of the top 20 principal components was based on commonly accepted practice and informed by prior research on the fine-scale structure of the Japanese population¹². A scree plot of the eigenvalues is included in Supplementary Fig. 2.

After the manual identification of six distinct clusters based on the separation observed in the PCA-UMAP plot (Fig. 1a), clustering and visualization were conducted. DBSCAN, a widely used clustering algorithm in the spatial domain²³, was implemented using the Python scikit-learn package (v1.3.0; https://scikit-learn.org/stable/) with the parameters set to eps = 0.068 and min_samples = 3. Noisy samples identified using DBSCAN were subsequently excluded, resulting in a final cohort of 43,726 individuals for further analysis (Supplementary Fig. 1). Python (v3.9.12) and R (v4.3.2) were used for all analyses in this study.
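To show what the two DBSCAN parameters control, here is a small pure-Python re-implementation of the algorithm. It is a didactic sketch only: the study used the scikit-learn implementation, and the toy two-dimensional points below are hypothetical stand-ins for UMAP coordinates. As in scikit-learn, `min_samples` counts the point itself, and the label -1 marks noise.

```python
import math

def dbscan(points, eps=0.068, min_samples=3):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    def neighbours(i):
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_samples:      # not a core point: provisional noise
            labels[i] = -1
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:           # border point earlier marked as noise
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reach = neighbours(j)
            if len(reach) >= min_samples:  # j is itself a core point: expand
                queue.extend(reach)
    return labels

# Two tight groups and one isolated point (hypothetical coordinates).
points = [(0.00, 0.00), (0.05, 0.00), (0.00, 0.05),
          (1.00, 1.00), (1.05, 1.00), (1.00, 1.05),
          (5.00, 5.00)]
labels = dbscan(points)
kept = [p for p, lab in zip(points, labels) if lab != -1]  # drop noise samples
```

A small `eps` with a low `min_samples` splits the embedding into tight groups and flags isolated samples as noise, which corresponds to the exclusion of noisy samples described above.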

Genetic variance analysis

We used the F_ST value, as defined by Weir and Cockerham²⁴, to investigate the overall genetic differentiation among the six clusters identified using PCA-UMAP analysis. F_ST calculations were performed using the --fst option in PLINK (v1.9)⁵⁷. ANNOVAR (v2018Jul08)⁵⁹ was used for the functional annotation of SNPs with high F_ST values. In addition, we conducted pairwise comparisons of the clusters in a one-versus-one manner. Furthermore, one-versus-all F_ST analyses were performed to quantify genetic variance between each cluster and the entire dataset. Randomly generated cluster sets were created using the numpy package (v1.24.4) with two distinct seeds (42 and 100). The similarity between these random clusters and the original clusters was assessed using the adjusted Rand index (ARI) from the scikit-learn package (v1.3.0).
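For readers unfamiliar with the estimator, the sketch below evaluates the Weir and Cockerham F_ST at a single biallelic locus from per-cluster sample sizes, allele frequencies, and observed heterozygote proportions. It is an illustration under stated assumptions, not the PLINK implementation: the function name and toy inputs are hypothetical, and genome-wide summaries combine the variance components across loci rather than relying on one locus.

```python
def weir_cockerham_fst(pops):
    """Single-locus, biallelic F_ST = a / (a + b + c).

    pops: one (n, p, h) tuple per cluster, where n is the number of
    individuals, p the allele frequency, and h the observed proportion
    of heterozygotes.
    """
    r = len(pops)
    n_total = sum(n for n, _, _ in pops)
    n_bar = n_total / r
    n_c = (n_total - sum(n * n for n, _, _ in pops) / n_total) / (r - 1)
    p_bar = sum(n * p for n, p, _ in pops) / n_total
    h_bar = sum(n * h for n, _, h in pops) / n_total
    s2 = sum(n * (p - p_bar) ** 2 for n, p, _ in pops) / ((r - 1) * n_bar)
    pq = p_bar * (1 - p_bar)
    # Variance components: among clusters (a), among individuals within
    # clusters (b), and between alleles within individuals (c).
    a = (n_bar / n_c) * (s2 - (pq - (r - 1) / r * s2 - h_bar / 4) / (n_bar - 1))
    b = (n_bar / (n_bar - 1)) * (
        pq - (r - 1) / r * s2 - (2 * n_bar - 1) / (4 * n_bar) * h_bar)
    c = h_bar / 2
    return a / (a + b + c)

# Two clusters fixed for different alleles: complete differentiation.
fst_fixed = weir_cockerham_fst([(100, 1.0, 0.0), (100, 0.0, 0.0)])
# Two clusters with identical frequencies: no differentiation.
fst_same = weir_cockerham_fst([(100, 0.5, 0.5), (100, 0.5, 0.5)])
```

The estimator can be slightly negative when clusters are indistinguishable, as in the second example; such values are usually read as zero differentiation.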

Geographical information and related analysis

The geographical information was obtained from a follow-up survey of 3132 volunteers, featuring nine questions on current residence, birthplace, longest place of residence before adulthood, and six familial birthplaces, covering all 47 Japanese prefectures and an additional “I don’t know/don’t want to answer” option. As shown in Supplementary Fig. 8, these prefectures were further organized into 10 regional blocks according to the Type I region classification by the Ministry of Internal Affairs and Communications in Japan (https://www.soumu.go.jp/main_content/000872772.pdf). After excluding responses with missing data, our analysis focused on 2268 participants with complete responses. A chi-square test was first conducted, using the scipy.stats module (v1.11.4) in Python, to confirm that the follow-up cohort shared a similar distribution of genetic clusters with the main dataset. For association analysis, we conducted chi-square tests to examine the relationship between geographical information and genetic clusters. Using the scikit-learn package (v1.3.0) in Python, we trained random forest classifiers with the genetic cluster as the target variable and two alternative feature sets: one containing only the significantly associated geographic variables and the other containing all the geographic data collected. Stratified five-fold cross-validation was applied to the dataset to ensure the consistency and reliability of the model performance. To mitigate class imbalance, we used micro-averaging to generate a receiver operating characteristic (ROC) curve. ADMIXTURE software (v1.3.0)³⁶ was used to conduct an unsupervised estimation of ancestral components among the 2268 participants. We examined different numbers of assumed ancestral components by setting the K values in the range of 2 to 12¹². This representation distinguishes ancestral components through color coding and sorts individuals based on their predominant ancestry. We selected K = 4 for our analyses as it provided a balance between minimizing error and aligning with the coherent genetic structures identified through our previous analyses. Supplementary Figs. 4 and 5 provide a comprehensive overview of the ancestral estimations across the entire range of K values with cross-validation errors.
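Micro-averaging, as used for the ROC analysis above, pools every (sample, class) pair into a single binary problem before computing the curve. The sketch below computes the corresponding area under the curve in pure Python. It is a minimal illustration: the function name, the toy labels, and the toy class scores are hypothetical, and the study itself relied on scikit-learn.

```python
def micro_average_auc(y_true, scores, n_classes):
    """Micro-averaged one-vs-rest ROC AUC.

    y_true: class index per sample; scores: per-sample list of class scores.
    Every (sample, class) pair is pooled into one binary problem.
    """
    pooled = [(s[k], 1 if y == k else 0)
              for y, s in zip(y_true, scores) for k in range(n_classes)]
    pos = [s for s, t in pooled if t == 1]
    neg = [s for s, t in pooled if t == 0]
    # AUC equals P(score_pos > score_neg), counting ties as one half.
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 1, 2]
good_scores = [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.1, 0.2, 0.7]]
flat_scores = [[1 / 3, 1 / 3, 1 / 3]] * 3

auc_good = micro_average_auc(y_true, good_scores, n_classes=3)
auc_flat = micro_average_auc(y_true, flat_scores, n_classes=3)
```

Because all pairs are pooled, each sample contributes equally regardless of its cluster, so large clusters weigh more heavily in the result than they would under macro-averaging.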

GWAS and Post-GWAS analyses

To explore genetic variance among the six clusters, one-versus-all case-control studies were conducted using PLINK (v1.9)⁵⁷. A logistic regression model assuming additive genetic effects was used for association analysis, adjusting for age and sex as covariates. The association analysis was repeated six times, designating each cluster in turn as the case group. The FUMA pipeline (v1.6.0; https://fuma.ctglab.nl/), from the SNP2GENE to the GENE2FUNC functions, was employed to explore the results obtained from PLINK. In the SNP2GENE step, lead and candidate SNPs were identified using default parameters, except that 1KGP-EAS was used as the reference panel. An SNP was considered independently significant if it reached the genome-wide significance level, defined by the default threshold of 5 × 10⁻⁸, and was independent of other significant SNPs at LD r² < 0.6. FUMA also defines independent lead SNPs at LD r² < 0.1. Genomic risk loci were subsequently mapped to genes using positional and eQTL mapping. For positional mapping, ANNOVAR annotations were employed in the FUMA pipeline and candidate SNPs were assigned to the nearest genes within a maximum distance of 0 kb. For eQTL mapping, the expression data for blood tissues in GTEx v8 were used. A default false discovery rate of 0.05 was applied to define significant eQTL associations. In addition, the GENE2FUNC feature in FUMA conducts gene set enrichment analysis (GSEA) by performing hypergeometric tests on the mapped genes to assess the overrepresentation of biological functions using public databases, including the GWAS Catalog, MSigDB, and WikiPathways. The circos plot was generated using the circlize package (v0.4.15).
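The hypergeometric test behind this kind of overrepresentation analysis asks how likely it is to see at least the observed overlap between the mapped genes and a gene set if genes were drawn at random from the background. The sketch below computes that upper-tail probability directly. It is an illustration only: the function and argument names are hypothetical, and it does not reproduce FUMA's background definition or multiple-testing correction.

```python
from math import comb

def hypergeom_enrichment_p(n_background, n_in_set, n_mapped, n_overlap):
    """P(X >= n_overlap) when n_mapped genes are drawn without replacement
    from n_background genes, of which n_in_set belong to the gene set."""
    upper = min(n_in_set, n_mapped)
    total = comb(n_background, n_mapped)
    tail = sum(comb(n_in_set, i) * comb(n_background - n_in_set, n_mapped - i)
               for i in range(n_overlap, upper + 1))
    return tail / total

# Toy numbers: 10 background genes, 4 in the set, 3 mapped, 2 overlapping.
p_toy = hypergeom_enrichment_p(10, 4, 3, 2)
```

For the toy numbers the tail is (6·6 + 4·1) / 120 = 1/3, so an overlap of two genes would not be surprising; in practice the background contains many thousands of genes and the resulting P values are adjusted for the number of gene sets tested.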

Lifestyle and dietary habits

MYCODE users answered an optional web-based questionnaire on lifestyle and dietary habits on the service website to obtain personalized health advice from August 2014 to June 2020. The timing of the data collection varied between participants. Since the study involved self-administered questionnaires, the response times differed for each participant. The anonymized answer data of the MYCODE Research participants were used for this study. To ensure data quality, responses that appeared suspicious were identified and removed during the pre-analysis phase. The questionnaire contained anthropometric traits; dietary habits such as the intake of calcium, dietary fiber, fruit, fish, and red and yellow/green/other vegetables; physical activity; smoking and drinking habits; sleep-related behaviors; stress responses; and blood test results such as triglyceride, LDL cholesterol, and glucose levels. We used the calcium self-check chart⁶⁰ for calcium intake assessment and the Brief Job Stress Questionnaire for stress response evaluation, which contains a subset of 29 items, comprising 18 items related to psychological stress responses and 11 items focusing on physical stress responses and somatic symptoms⁶¹. Other questionnaires on dietary habits were originally developed by DLS based on the Standard Tables of Food Composition in Japan (https://fooddb.mext.go.jp/). To remove potential wrong answers for free-answer questions on alcohol intake, answers for over 10 glasses of wine (120 mL per glass) per day or 10 cans of chuhai (350 mL per can) per day were converted to missing values.

Multinomial logistic regression analysis

For each lifestyle and dietary habit, multinomial logistic regression analyses were performed using the scikit-learn Python library (v1.3.0). For each exposure variable of interest, observations with missing values for that lifestyle or dietary habit were excluded from the multinomial regression. Age, sex, and body mass index (BMI) were adjusted for in the model⁴³, and the covariate data contained no missing values. All statistical analyses were conducted using Python, with the significance threshold set at P < 0.05. The formula for multinomial logistic regression with lifestyle/dietary habit variable X and a specific cluster k is as follows:

$$\log \left(\frac{P(Y=k)}{P(Y=\,{\mbox{reference}})}\right)={\beta }_{0k}+{\beta }_{1k}\cdot {\mbox{age}}+{\beta }_{2k}\cdot {\mbox{sex}}+{\beta }_{3k}\cdot {\mbox{BMI}}\,+{\beta }_{4k}\cdot X$$

where:

$(\frac{P(Y=k)}{P(Y=\,{\mbox{reference}})})$ is the ratio of the probability that the dependent variable Y is cluster k to the probability that Y is the reference cluster, given the value of the lifestyle/dietary habit variable X, with Cluster 1* as the reference.
β_1k, β_2k, β_3k, and β_4k are the coefficients for cluster k corresponding to the variables age, sex, BMI, and X, where X is the specific lifestyle/dietary habit variable.
β_0k is the intercept term representing the baseline log-odds for cluster k.
age, sex, and BMI are the covariates adjusted in the model.

The odds ratio (OR) for X in relation to cluster k is calculated as:

$${{\mbox{OR}}}_{X,k}=\exp ({\beta }_{4k})$$
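The two formulas above can be checked numerically. The sketch below converts a coefficient into an odds ratio and turns a set of per-cluster coefficients into cluster probabilities, with the reference cluster implicitly having all-zero coefficients. It is a minimal illustration: the function names and the coefficient values are hypothetical and are not estimates from this study.

```python
import math

def odds_ratio(beta_4k):
    """OR for X in relation to cluster k."""
    return math.exp(beta_4k)

def cluster_probabilities(betas, age, sex, bmi, x):
    """betas: {cluster: (b0, b_age, b_sex, b_bmi, b_x)} for non-reference clusters."""
    logits = {k: b0 + b1 * age + b2 * sex + b3 * bmi + b4 * x
              for k, (b0, b1, b2, b3, b4) in betas.items()}
    denom = 1.0 + sum(math.exp(v) for v in logits.values())
    probs = {k: math.exp(v) / denom for k, v in logits.items()}
    probs["reference"] = 1.0 / denom
    return probs

# Hypothetical coefficients for two non-reference clusters.
betas = {"cluster_2": (-1.0, 0.01, 0.2, -0.02, math.log(2)),
         "cluster_3": (-2.0, 0.00, 0.1, 0.01, -0.3)}

p0 = cluster_probabilities(betas, age=40, sex=1, bmi=22.0, x=0)
p1 = cluster_probabilities(betas, age=40, sex=1, bmi=22.0, x=1)

# Raising X by one unit multiplies the odds of cluster_2 versus the
# reference by exp(beta_4k), whatever the covariate values are.
odds_change = (p1["cluster_2"] / p1["reference"]) / (p0["cluster_2"] / p0["reference"])
```

In this toy example the coefficient for X in `cluster_2` is log 2, so the odds of belonging to that cluster rather than the reference double for each unit increase in X; an OR close to 1.0 corresponds to a coefficient close to zero.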

Statistics and reproducibility

In our study, we utilized PCA, UMAP, and DBSCAN to explore the fine-scale genetic structure of a homogeneous Japanese population based on autosomal SNP data. Additionally, we investigated the relationship between genetic clusters and lifestyle factors by applying multinomial logistic regression.

To ensure the reproducibility of our findings, we provide comprehensive descriptions of all procedures, from data collection to analysis. This detail includes our data cleaning process, the criteria for including and excluding participants, and the specific parameters set within our statistical models. Initially, our study included 61,728 participants, which was then narrowed down to 43,726 individuals after quality control. For replication purposes, researchers should employ a similarly large cohort of both genetic and phenotypic data, ideally encompassing over 40,000 participants post-quality control, to ensure robust and replicable results.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

The source data underpinning the figures presented in this study are provided as follows. Due to privacy concerns, the raw data for Fig. 2, which include detailed clustering information on individual participants, cannot be disclosed. Data supporting Fig. 3 are contained in Supplementary Data 1 and 2. Figure 4c and d draw from Supplementary Data 5. Individual responses underlying Figs. 4a and b are withheld to protect participant privacy. Data supporting Fig. 5a are provided in Supplementary Data 10. For Fig. 5b, while individual responses are not available, summary statistics are provided in Supplementary Data 12. Access to human SNP genotype and individual health-related data is controlled by privacy, legal, and ethical considerations. These sensitive data are held by DeNA Life Sciences, Inc., based on the informed consent obtained from participants, which does not allow deposition in a public repository. As of October 2024, ownership will be transferred to Allm Inc. due to business succession. However, these data may be obtained from the corresponding author upon a justified request. Data access requires ethical committee approval, followed by the provision of an opt-out opportunity for study participants. Data sharing must also comply with MYCODE Research’s security policies, which mandate appropriate access control, logging and monitoring, a dedicated and isolated network, and antivirus measures for environments handling sensitive data. Specific details regarding these requirements will be discussed upon individual request. Please note that the processes of opt-outs and data preparation will take approximately six months.

Code availability

The code for dimensionality reduction and clustering of the genotype data is available on GitHub: https://github.com/YichiChen-z/dimension-reduction-clustering/.

Change history

11 August 2025
In this article the handling editor name was missing and should have read primary handling editors: Qiao Fan and Rosie Bunton-Stasyshyn. The original article has been corrected.

References

Roberts, J. S. & Ostergren, J. Direct-to-consumer genetic testing and personal genomics services: a review of recent empirical studies. Curr. Genet. Med. Rep. 1, 182–200 (2013).
Hayden, E. C. The rise and fall and rise again of 23 and me. Nature 550, 174–177 (2017).
Miyake, K. Psy17-2 - 5 years of steady progress: DTC genetic testing service mycode. Ann. Oncol. 30, vi23 (2019).
Marchini, J., Cardon, L. R., Phillips, M. S. & Donnelly, P. The effects of human population structure on large genetic association studies. Nat. Genet. 36, 512–517 (2004).
Libbrecht, M. W. & Noble, W. S. Machine learning applications in genetics and genomics. Nat. Rev. Genet. 16, 321–332 (2015).
Ausmees, K. & Nettelblad, C. A deep learning framework for characterization of genotype data. G3 Genes∣Genomes∣Genetics 12, jkac020 (2022).
Schrider, D. R. & Kern, A. D. Supervised machine learning for population genetics: A new paradigm. Trends Genet. 34, 301–312 (2018).
Price, A. L. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 38, 904–909 (2006).
der Maaten, L. V. & Hinton, G. Visualizing data using t-sne. J. Mach. Learn. Res. 9 (2008).
Healy, J. & McInnes, L. Uniform manifold approximation and projection. Nat. Rev. Methods Primers 4, 82 (2024).
Gaspar, H. A. & Breen, G. Probabilistic ancestry maps: a method to assess and visualize population substructures in genetics. BMC Bioinforma. 20, 116 (2019).
Sakaue, S. et al. Dimensionality reduction reveals fine-scale structure in the Japanese population with consequences for polygenic risk prediction. Nat. Commun. 11, 1569 (2020).
Cristian, P.-M. et al. Diffusion on PCA-UMAP manifold captures a well-balance of local, global, and continuum structure to denoise single-cell RNA sequencing data. bioRxiv 2022.06.09.495525 http://biorxiv.org/content/early/2022/06/12/2022.06.09.495525.abstract (2022).
Yang, Y. et al. Dimensionality reduction by UMAP reinforces sample heterogeneity analysis in bulk transcriptomic data. Cell reports 36 (2021).
Diaz-Papkovich, A., Anderson-Trocmé, L. & Gravel, S. A review of UMAP in population genetics. J. Hum. Genet. 66, 85–91 (2021).
Haga, H., Yamada, R., Ohnishi, Y., Nakamura, Y. & Tanaka, T. Gene-based SNP discovery as part of the Japanese millennium genome project: identification of 190,562 genetic variations in the human genome. J. Hum. Genet. 47, 605–610 (2002).
Hanihara, K. Dual structure model for the population history of the Japanese. Jpn Rev. 1–33 http://www.jstor.org/stable/25790895 (1991).
Jinam, T. A., Kanzawa-Kiriyama, H. & Saitou, N. Human genetic diversity in the Japanese archipelago: dual structure and beyond. Genes Genet. Syst. 90, 147–152 (2015).
Takeuchi, F. et al. The fine-scale genetic structure and evolution of the Japanese population. PLOS ONE 12, e0185487 (2017).
Hirata, J. et al. Genetic and phenotypic landscape of the major histocompatibilty complex region in the Japanese population. Nat. Genet. 51, 470–480 (2019).
Lewis, C. M. & Vassos, E. Polygenic risk scores: from research tools to clinical instruments. Genome Med. 12, 1–11 (2020).
Isshiki, M., Watanabe, Y. & Ohashi, J. Geographic variation in the polygenic score of height in Japan. Hum. Genet. 140, 1097–1108 (2021).
Schubert, E., Sander, J., Ester, M., Kriegel, H. P. & Xu, X. Dbscan revisited, revisited: why and how you should (still) use dbscan. ACM Trans. Database Syst. (TODS) 42, 1–21 (2017).
Weir, B. S. & Cockerham, C. C. Estimating f-statistics for the analysis of population structure. Evolution 38, 1358–1370 (1984).
Zhao, F., McParland, S., Kearney, F., Du, L. & Berry, D. P. Detection of selection signatures in dairy and beef cattle using high-density genomic information. Genet. Sel. Evol. 47, 49 (2015).
Maiorano, A. M. et al. Assessing genetic architecture and signatures of selection of dual purpose gir cattle populations using genomic information. PLOS ONE 13, 1–24 (2018).
Ikeda, M. et al. Genome-wide association study of schizophrenia in a Japanese population. Biol. Psychiatry 69, 472–478 (2011).
O’Leary, N. A. et al. Reference sequence (refseq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 44, D733–D745 (2016).
Aleksander, S. et al. Updates to the alliance of genome resources central infrastructure. Genetics 227, iyae049 (2024).
Stauffer, E.-M. et al. The genetic relationships between brain structure and schizophrenia. Nat. Commun. 14, 7820 (2023).
Trowsdale, J. & Knight, J. C. Major histocompatibility complex genomics and human disease. Annu. Rev. Genomics Hum. Genet. 14, 301–323 (2013).
Dendrou, C. A., Petersen, J., Rossjohn, J. & Fugger, L. Hla variation and disease. Nat. Rev. Immunol. 18, 325–339 (2018).
Nakaoka, H. & Inoue, I. Distribution of HLA haplotypes across Japanese archipelago: similarity, difference and admixture. J. Hum. Genet. 60, 683–690 (2015).
Yamaguchi-Kabata, Y. et al. Genetic differences in the two main groups of the Japanese population based on autosomal SNPs and haplotypes. J. Hum. Genet. 57, 326–334 (2012).
Biau, G. & Scornet, E. A random forest guided tour. Test 25, 197–227 (2016).
Alexander, D. H., Novembre, J. & Lange, K. Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 19, 1655–1664 (2009).
Yamada, Y. et al. Disappearance of differences in nutrient intake across two local cultures in Japan: A comparison between Tokyo and Kyoto. Tohoku J. Exp. Med. 179, 235–245 (1996).
Liu, J. et al. Identification of genetic and epigenetic marks involved in population structure. PloS one 5, e13209 (2010).
Barfield, R. T. et al. Accounting for population stratification in DNA methylation studies. Genet. Epidemiol. 38, 231–241 (2014).
Tishkoff, S. A. & Kidd, K. K. Implications of biogeography of human populations for ‘race’ and medicine. Nat. Genet. 36, S21–S27 (2004).
Coop, G. Genetic similarity versus genetic ancestry groups as sample descriptors in human genetics. arXiv preprint arXiv:2207.11595 (2022).
Watanabe, K., Taskesen, E., Van Bochoven, A. & Posthuma, D. Functional mapping and annotation of genetic associations with fuma. Nat. Commun. 8, 1826 (2017).
Rothman, K. J., Greenland, S. & Lash, T. L. Design strategies to improve study accuracy. Mod. Epidemiol. 3, 168–182 (2008).
Hayashi, H. et al. Genetic background of primary iron overload syndromes in Japan. Intern. Med. 45, 1107–1111 (2006).
Perry, G. H. et al. Diet and the evolution of human amylase gene copy number variation. Nat. Genet. 39, 1256–1260 (2007).
Sul, J. H., Martin, L. S. & Eskin, E. Population structure in genetic studies: Confounding factors and mixed models. PLoS Genet. 14, e1007309 (2018).
National Academies of Sciences, Engineering, and Medicine, Division of Behavioral and Social Sciences and Education, Health and Medicine Division, Committee on Population, Board on Health Sciences Policy & Committee on the Use of Race, Ethnicity, and Ancestry as Population Descriptors in Genomics Research. Using population descriptors in genetics and genomics research: a new framework for an evolving field. (National Academies Press (US), 2023). https://www.ncbi.nlm.nih.gov/books/NBK592836/.
Kozlov, M. ‘All of Us’ genetics chart stirs unease over controversial depiction of race. Nature https://doi.org/10.1038/d41586-024-00568-w (2024).
Li, X., Guo, T., Mu, Q., Li, X. & Yu, J. Genomic and environmental determinants and their interplay underlying phenotypic plasticity. Proc. Natl Acad. Sci. 115, 6679–6684 (2018).
Seabrook, J. A. & Avison, W. R. Genotype–environment interaction and sociology: Contributions and complexities. Soc. Sci. Med. 70, 1277–1284 (2010).
Freese, J. Genetics and the social science explanation of individual outcomes. Am. J. Sociol. 114, S1–S35 (2008).
Smith, G. D. et al. Genetic epidemiology and public health: hope, hype, and future prospects. Lancet 366, 1484–1498 (2005).
Howard, H. C., Sterckx, S., Cockbain, J., Cambon-Thomsen, A. & Borry, P. The convergence of direct-to-consumer genetic testing companies and biobanking activities: the example of 23 and me. In Knowing New Biotechnologies, 59–74 (Routledge, 2015).
Singleton, A., Erby, L. H., Foisie, K. V. & Kaphingst, K. A. Informed choice in direct-to-consumer genetic testing (dtcgt) websites: a content analysis of benefits, risks, and limitations. J. Genet. Counsel. 21, 433–439 (2012).
Carlsten, C. et al. Genes, the environment and personalized medicine: We need to harness both environmental and genetic data to maximize personal and population health. EMBO Rep. 15, 736–739 (2014).
Akiyama, M. et al. Genome-wide association study identifies 112 new loci for body mass index in the Japanese population. Nat. Genet. 49, 1458–1467 (2017).
Purcell, S. et al. Plink: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81, 559–575 (2007).
Loh, P.-R. et al. Reference-based phasing using the haplotype reference consortium panel. Nat. Genet. 48, 1443–1448 (2016).
Wang, K., Li, M. & Hakonarson, H. Annovar: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 38, e164–e164 (2010).
Ishii, K., Uenishi, K. & Ishida, H. Development of a simple “self-check sheet for calcium” and its reliability. Osteoporos. Jpn. 13, 497–502 (2005).
Shimomitsu, T. The final development of the brief job stress questionnaire mainly used for assessment of the individuals. Ministry of Labour sponsored grant for the prevention of work-related illness: The 1999 report 126–164 (2000).


Acknowledgements

We express our sincere gratitude to all the participants in this study. Special thanks go to DeNA Life Science Inc., Tokyo, Japan, and the University of Tokyo for their collaboration and the provision of support through a collaborative research fund. Additionally, we appreciate the support from the Human Genome Center at the Institute of Medical Science, the University of Tokyo (http://sc.hgc.jp/shirokane.html), for providing super-computing resources. This study was supported by the Human Genome Center at the Institute of Medical Science, the University of Tokyo. We acknowledge and thank the creators of the icons used in this publication. Icons for ‘questionnaire,’ ‘test,’ ‘brain,’ ‘food,’ ‘cyclocross,’ ‘dna,’ and ‘mutation’ were created by Freepik and obtained from Flaticon.com. Icons for ‘person’ were created by Valerie Lamm from The Noun Project, and the ‘Japan Map’ icon was created by Rahe, also from The Noun Project. These resources have been instrumental in the visual representation in Fig. 1.

Author information

Authors and Affiliations

Division of Health Medical Intelligence, Human Genome Center, the Institute of Medical Science, the University of Tokyo, Minato-ku, Tokyo, Japan
Yichi Chen & Seiya Imoto
Laboratory of Sequence Analysis, Human Genome Center, the Institute of Medical Science, the University of Tokyo, Minato-ku, Tokyo, Japan
Kotoe Katayama & Seiya Imoto
DeNA Life Science, Inc., Shibuya-ku, Tokyo, Japan
Sachiko Ishida
Allm Inc., Shibuya-ku, Tokyo, Japan
Sachiko Ishida

Contributions

S. Imoto supervised the study. Y.C. designed the study and conducted the data analyses. Y.C. wrote the manuscript, with support from K.K. and S. Imoto. S. Ishida provided the data and contributed to writing part of the methods section. All authors reviewed the manuscript and approved the final version for publication.

Corresponding author

Correspondence to Seiya Imoto.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Communications Biology thanks Guanglin He, Fiona Hagenbeek, and the other anonymous reviewer for their contribution to the peer review of this work. Primary Handling Editors: Qiao Fan and Rosie Bunton-Stasyshyn.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Description of Additional Supplementary Files

Supplementary Data 1-12

Reporting Summary

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.


About this article

Cite this article

Chen, Y., Katayama, K., Ishida, S. et al. Intricate interactions between fine-scale genetic structure, lifestyle, and dietary habits in the Japanese population. Commun Biol 8, 1046 (2025). https://doi.org/10.1038/s42003-025-08479-w


Received: 07 October 2024
Accepted: 03 July 2025
Published: 12 July 2025
Version of record: 12 July 2025
DOI: https://doi.org/10.1038/s42003-025-08479-w
