Introduction

In the era of personalized and customized health, the establishment and expansion of direct-to-consumer genetic testing (DTC-GT) have opened new avenues for gaining deeper insights into genetic information and disease risk prediction derived from prior genome-wide association studies (GWAS)1. The extensive data from DTC-GT companies like 23andMe, which are widely used in scientific research, highlight the growing recognition of the value of such data2. Because DTC-GT data are easy to use, genetic research increasingly draws on them, aiming for more accurate and personalized healthcare applications. In Japan, MYCODE, provided by DeNA Life Science (DLS), Inc. (Tokyo, Japan), has taken the lead in the DTC-GT market3. It empowers users to make informed health decisions, potentially improves their behavioral habits, and contributes to scientific research, in which the MYCODE community voluntarily participates3.

In the landscape of GWAS, challenges stem from the population structure, which poses risks of false positives owing to the inadvertent inclusion of individuals with undetected admixtures4, prompting a growing support for machine-learning methods5. With the increasing availability of extensive genetic data, data-driven approaches have emerged as promising tools6,7. Although principal component analysis (PCA), a linear transformation method to identify principal components (PCs), remains prevalent in genetic research8, a notable shift has occurred toward employing unsupervised machine-learning methods, including t-distributed stochastic neighbor embedding (t-SNE) and uniform manifold approximation and projection (UMAP)9,10. These methods, used for reducing dimensionality in genotype data, improve the visualization of genetic variations, help delineate population structures, and mitigate stratification biases in GWAS6,11, especially after PCA initialization12,13. UMAP leverages Riemannian geometry and algebraic topology, enabling it to capture and retain the complex high-dimensional topology of data points within a lower-dimensional space10. This method has demonstrated superior performance in capturing fine-scale population structures across various genetic data types, such as bulk transcriptomics, single-cell RNA sequencing, and single nucleotide polymorphism (SNP) genotype data12,13,14. Its efficacy extends to visualizing ancestral composition in cohorts and discerning intricate patterns in diverse biobanks, establishing UMAP as a valuable tool to unravel the complexities of genetic data15.

The Japanese population is considered genetically homogeneous and characterized by relatively low genetic diversity16. Studies have been dedicated to revealing the fine-scale genetic structure within the Japanese population, recognizing the existing genetic diversity based on the dual-structure model17,18. A recent Biobank Japan study found that the PCA-UMAP dimensionality reduction method outperformed other techniques, demonstrating its superior efficacy in capturing the intricate structure of the Japanese subpopulation12. PCA-UMAP merges PCA and UMAP by applying UMAP to the principal components of genotype data, resulting in a more accurate classification of population clusters. This approach also offers computational advantages and reduces statistical noise. The fine-scale genetic structure of the Japanese population, primarily focusing on uncovering ancestry and geographical differences in the admixture proportion due to historical migration, has been explored earlier12,19. These studies clearly revealed the underlying population structure in Japan, emphasizing the genetic differences among islands and providing evidence of historical migration. Regional differences in the phenotypes of complex traits in Japanese individuals have been attributed to regional genetic differences20. Previous investigations have concentrated on the geographic distribution of genetic backgrounds, primarily through a polygenic risk score, which is a numerical representation of the impact of genetic variants on a trait identified by GWAS12,21. Isshiki et al.22 found that the distribution of polygenic height scores could explain the height gradient observed in Japan. Similarly, Sakaue et al.12 discovered variations in the polygenic risk score for complex human traits among Japanese subpopulations. However, few studies have explored how population genetic structure influences comprehensive lifestyle and dietary habits within an epidemiological framework.

To address this gap, our objective was to investigate the intricate interactions between fine-scale genetic structure, lifestyle, and dietary habits in the Japanese population by combining several methods. In this study, we applied unsupervised learning methods, including PCA, UMAP, and the density-based spatial clustering of applications with noise (DBSCAN) algorithm23, to MYCODE genotype data to investigate the population’s fine-scale genetic structure, represented as clusters. We subsequently employed GWAS and post-GWAS analyses to identify genetic factors associated with the clusters as traits (Fig. 1a and b). Following this, we conducted a comprehensive examination of the associations between geographical, lifestyle, dietary habits, and genetic clusters using statistical analyses (Fig. 1c). This approach aimed to provide an innovative perspective combining genetic insights with epidemiological statistics. Through this procedure, we revealed intricate interactions between the population’s fine-scale genetic structure and lifestyle and dietary habits (Fig. 1d).

Fig. 1: Study overview.

a The study population comprised 55,551 Japanese individuals who purchased the MYCODE genetic kit. Genotypes were obtained using microarrays, and genetic data were processed through imputation, phasing, quality control, and linkage disequilibrium pruning. Figure 1a includes a person icon by Valerie Lamm and a Japan map icon by Rahe, both from The Noun Project (CC BY 3.0). b After applying eligibility criteria, genotype allele matrices from 49,440 participants were analyzed using PCA-UMAP and DBSCAN to uncover fine-scale genetic structure within the homogeneous population. This analysis resulted in 43,726 participants being retained for further analysis, including FST analysis, GWAS, and post-GWAS analyses for genomic interpretation. c Statistical analyses were conducted on lifestyle and dietary data for the same 43,726 participants. Geographical analyses focused on follow-up questionnaire responses from 2268 participants. d This approach integrates genomic interpretation with epidemiological methods to explore interactions among population structure, lifestyle, and dietary habits. Icons in Figure 1c and d are created by Freepik from Flaticon.com.

Results

Dimensionality reduction and clustering

After conducting genotype data quality control and LD pruning, we performed PCA on 395,665 pruned-in variants from 49,440 participants (Supplementary Figs. 1 and 2). Subsequently, UMAP was applied to the first 20 PCs of the genotype data, which is depicted in a two-dimensional scatter plot (Fig. 2a). The PCA-UMAP provided a discrete separation of individuals into six clusters, as observed by visual inspection (Fig. 2b). This observation established that the combination of unsupervised learning methods, PCA and UMAP, effectively distinguished participants within the same ethnic population. Furthermore, we excluded noise and participants who did not belong to the six largest clusters. This resulted in 43,726 participants being distributed among the top six clusters while maintaining the same projection shape (Fig. 2b).
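The filtering step described above (dropping noise and keeping only the largest clusters) can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes cluster labels are already available as integers, uses the common convention (for example in scikit-learn's DBSCAN) that noise is labeled -1, and renumbers the kept clusters by descending size, which is an assumption made here for readability. The function name `keep_largest_clusters` and the toy labels are hypothetical.

```python
from collections import Counter

NOISE = -1  # assumed convention: DBSCAN-style label for unassigned points


def keep_largest_clusters(labels, k=6):
    """Return (sample_index, new_label) pairs for samples in the k largest clusters.

    Noise points and samples in smaller clusters are dropped. New labels run
    from 1 (largest cluster) to k (smallest kept cluster); this renumbering
    is an illustrative assumption.
    """
    counts = Counter(label for label in labels if label != NOISE)
    # most_common sorts by descending size; ties keep first-seen order
    top = [label for label, _ in counts.most_common(k)]
    rank = {label: i + 1 for i, label in enumerate(top)}
    return [(i, rank[label]) for i, label in enumerate(labels) if label in rank]


# Toy labels: three real clusters (0, 1, 2), one singleton (3), one noise point
labels = [0, 0, 0, 1, 1, -1, 2, 0, 1, 3]
kept = keep_largest_clusters(labels, k=2)
```

With `k=2`, only the samples from the two largest toy clusters survive, and the indices in `kept` can be used to subset the genotype matrix and the two-dimensional projection without recomputing the embedding.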

Fig. 2: Genetic clusters after the dimensionality reduction method.

This figure presents two-dimensional visualizations of genotype data from Japanese participants, grouped using PCA-UMAP dimensionality reduction techniques. Then, DBSCAN was used to cluster the PCA-UMAP projection, with six distinct clusters represented by different colors. a The PCA-UMAP projection prior to clustering. b The PCA-UMAP projection post-DBSCAN clustering. c A pie chart illustrating the distribution of participants across the clusters, with each pie segment colored to correspond with the cluster assignments from Figure 2b.

Cluster 1, represented in blue in Fig. 2b, held a central position in the PCA-UMAP plot, while Cluster 6 was positioned the furthest from Cluster 1. This spatial arrangement suggests a clear and structured pattern within the homogeneous Japanese genetic data, indicating the underlying fine-scale population structure and potential genetic diversity. In total, 43,726 eligible healthy Japanese adults were included in the GWAS, post-GWAS, and multinomial logistic regression analyses (Supplementary Fig. 1). Cluster 1 comprised the largest proportion of the total population (54.5%), while Cluster 6 had the lowest proportion at 1.8%. The mean age and BMI of the entire study population were 45 (SD = 9.7) years and 22 (SD = 3.5) kg/m2, respectively. Males constituted 50.7% of the total population. In addition, consistent similarities were observed across all clusters in the distributions of age, BMI, sex ratios, drinking and smoking status, and blood glucose level (Supplementary Table 1).

Genetic differentiation among genetic clusters

Using Weir and Cockerham’s fixation index (FST)24, we assessed the proportion of genetic diversity among the six clusters identified by DBSCAN clustering. Weir and Cockerham’s FST incorporates unbiased estimators of variance components and offers a reliable metric for assessing genetic differentiation. A weighted mean FST of 0.000904 (SD = 0.015) indicated relatively low genetic differentiation among the clusters, confirming the homogeneity of the Japanese population in the study group. FST values for pairwise comparisons between clusters and the comparison of each cluster with the entire population are presented in Fig. 3a. Despite a low level of differentiation, distinctions were observed. Clusters 2 and 3 exhibited the highest genetic dissimilarities. Clusters 1, 5 and 6 exhibited the highest similarity, with the lowest pairwise FST in the heatmap. Notably, the lowest FST was observed between Cluster 1 and the entire population, emphasizing the close resemblance of Cluster 1 to the overall population.
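To make the idea of a fixation index concrete, the sketch below computes a simplified Wright-style FST for a single biallelic SNP from per-cluster allele frequencies, as (H_T - H_S) / H_T. It is deliberately not the Weir and Cockerham estimator used in the study, which adds sample-size corrections through variance components; the function name `wright_fst` and the example frequencies are hypothetical.

```python
def wright_fst(freqs, sizes):
    """Simplified FST for one biallelic SNP (not the Weir-Cockerham estimator).

    freqs: allele frequency of the reference allele in each cluster
    sizes: number of individuals in each cluster (used as weights)
    """
    total = sum(sizes)
    # Size-weighted mean allele frequency across clusters
    p_bar = sum(p * n for p, n in zip(freqs, sizes)) / total
    # Expected heterozygosity if all clusters were one randomly mating pool
    h_t = 2 * p_bar * (1 - p_bar)
    if h_t == 0:
        return 0.0  # monomorphic SNP: no variation to partition
    # Size-weighted mean of within-cluster expected heterozygosity
    h_s = sum(2 * p * (1 - p) * n for p, n in zip(freqs, sizes)) / total
    return (h_t - h_s) / h_t


# Two equally sized clusters with modestly different allele frequencies
fst_modest = wright_fst([0.4, 0.6], [100, 100])
# Two clusters fixed for different alleles: complete differentiation
fst_fixed = wright_fst([0.0, 1.0], [100, 100])
```

Values near 0 indicate that almost all variation lies within clusters, which matches the low genome-wide mean reported here, while values near 1 indicate nearly fixed differences at that locus.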

Fig. 3: Genetic differentiation among genetic clusters.

a A heatmap illustrates genetic differentiation among five clusters, measured by FST. b A Manhattan plot visualizes the FST values calculated across the newly designated four clusters. A grey dashed line indicates the threshold of mean + 3SD, and a red dashed line marks the highlight threshold at 0.5. c, d Regional plots showing SNPs with FST > 0.6 in chromosome 6 and 11. These SNPs are annotated to unique genes; if multiple SNPs correspond to the same gene, only the SNP with the highest FST value is displayed. For additional details, refer to Supplementary Data 1 and 2.

We then identified SNPs with noteworthy FST values, suggesting a potential association with genes influencing adaptive or economically significant traits25. Identifying SNPs based on the mean and SD of FST is a commonly employed method to detect selection signatures26. The Manhattan plot (Fig. 3b) revealed peak SNPs on chromosomes 6 (mean FST = 0.009772) and 11 (mean FST = 0.002453). Among 395,665 SNPs, 1339 and 132 on chromosomes 6 and 11, respectively, were identified as outliers surpassing three SDs from the mean, indicating that these loci had values higher than 99.8% of the total SNPs26. We established a threshold of FST > 0.5 to highlight genomic regions containing the most differentiated SNPs, with detailed annotations available in Supplementary Data 1 and 2. The closest genes associated with these regions are illustrated in Fig. 3c and d. The SNPs exhibiting the highest FST are located in the regions 6p22 and 6p21. The 6p21 region, which harbors a high concentration of core genes of the Major Histocompatibility Complex (MHC), is situated within the MHC. Meanwhile, the 6p22 region, positioned adjacent to this central area, also contributes to the broader context of MHC-related genetic functions. These regions correspond with the extended MHC region and have been previously linked to schizophrenia susceptibility27. Specifically, SNPs rs189984590 and rs548568008, located on chromosome 6 within the EHMT2 and PRRT1 genes, respectively, exhibited FST values close to 0.6. EHMT2, a key epigenetic regulator, is involved in histone modification and has implications for neurological disorders28. PRRT1 affects synaptic processes and is crucial for neurodevelopment and cognitive functions29. On chromosome 11, the SNPs demonstrating the greatest genetic differentiation are found in the 11p11 region, which is enriched for genes associated with neurodevelopmental processes30.
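The mean-plus-three-SD rule for flagging highly differentiated SNPs can be sketched as below. This is an illustration under assumptions: per-SNP FST values are taken as given, the sample standard deviation is used (the text does not say whether the sample or population SD was applied), and the SNP identifiers are invented for the example.

```python
import statistics


def fst_outliers(fst_by_snp, n_sd=3.0):
    """Return (threshold, outlier SNP ids) using mean + n_sd * SD of FST."""
    values = list(fst_by_snp.values())
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample SD; an assumption for this sketch
    threshold = mean + n_sd * sd
    outliers = sorted(snp for snp, v in fst_by_snp.items() if v > threshold)
    return threshold, outliers


# Toy data: twenty weakly differentiated SNPs and one strongly differentiated SNP
fst_by_snp = {f"snp_{i}": 0.001 for i in range(20)}
fst_by_snp["snp_hot"] = 0.6
threshold, outliers = fst_outliers(fst_by_snp)
```

Because a single extreme value inflates both the mean and the SD, the rule is conservative on small inputs; with hundreds of thousands of SNPs, as in this study, the threshold is dominated by the bulk of near-zero values.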

The MHC region, particularly the human leukocyte antigen (HLA) genes, is crucial in a diverse range of complex human diseases and quantitative traits31. MHCs are categorized into three subclasses: class I, including highly polymorphic genes such as HLA-A, HLA-B, and HLA-C genes; class II, encoding genes such as HLA-DPA1, HLA-DPB1, HLA-DQA1, and others involved in antigen presentation; and class III, containing genes associated with inflammatory responses, leukocyte maturation, and the complement cascade32. In 2015, Nakaoka et al.33 investigated the distinction in HLA frequencies across 10 regional populations in Japan and found that HLA-A*24:02-C*12:02-B*52:01-DRB1*15:02-DPB1*09:01 exhibited distinctive regional frequency patterns. Our results were consistent with these previous observations33,34: we identified SNPs that were highly differentiated among the four clusters and annotated to HLA-B, -DRA, -DRB1, -DQB1, -DQA2, and -DQB2, predominantly located in or near the HLA region on chromosome 6. Collectively, despite the absence of specific geographical and regional information, our unsupervised learning methods effectively classified clusters exhibiting genetic differentiation that had been validated previously, thereby defining the fine-scale genetic structure of the study population.

To strengthen the reliability of our findings, we conducted additional FST analyses on randomly generated clusters, ensuring these clusters retained the structure of the identified clusters in our study. We assessed the similarity of different sets of clusters using the Adjusted Rand Index (ARI). The results, detailed in Supplementary Table 2 and Supplementary Data 3 and 4, confirm that the genetic clusters observed in our study are not due to random variation but indeed represent distinct and genuine genetic fine-structures within the population.
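For readers unfamiliar with the ARI, the sketch below implements the standard pair-counting formula in plain Python. It compares two labelings of the same samples and is invariant to how the cluster labels are named; the toy labelings are hypothetical and are not the randomized clusters from the Supplementary materials.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two labelings of the same samples."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two samples: labelings trivially agree
    # Pairs of samples placed together in both labelings
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed together within each labeling separately
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / total_pairs  # agreement expected by chance
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. every sample in a single cluster
    return (together_both - expected) / (maximum - expected)


same_partition = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
partial_match = adjusted_rand_index([0, 0, 1, 2], [0, 0, 1, 1])
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

An ARI of 1 means the two partitions are identical up to relabeling, values near 0 are what chance agreement would produce, and negative values indicate less agreement than expected by chance.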

Association between genetic clusters and geographic ancestry

Expanding upon this genetic categorization, we consolidated Clusters 1, 5, and 6 into a new entity designated Cluster 1*, guided by their genetic similarities as revealed through FST analysis. This strategic consolidation was informed by the low genetic differentiation observed between these clusters. FST values and characteristics of the newly formed clusters are presented in Supplementary Fig. 3 and Supplementary Table 3.

To elucidate the relationship between these PCA-UMAP-DBSCAN-defined clusters and geographic ancestry, we engaged a cohort of 2268 participants from a follow-up survey that gathered geographical information. We first performed chi-square tests to reveal potential associations between an individual’s genetic cluster and familial geographic lineage. As shown in Fig. 4a, we found that the genetic clusters were significantly related to seven questions in the survey, which covered geographical details ranging from the individual’s birthplace to the birthplaces of their parents and grandparents. Subsequently, we developed random forest models, which are powerful machine-learning techniques known for their simplicity and robustness35, to predict genetic clusters based on geographic information. Two models were used: one incorporated only the geographic variables that exhibited significant associations in the chi-square tests (labeled sig), whereas the other utilized the entire suite of geographic data collected (labeled all). The ROC curves, illustrated in Fig. 4b, yielded area under the curve values of approximately 0.76. These values suggested a certain predictive capability of the geographic data, and also highlighted the potential influence of additional factors in genetic diversity, given their accuracy scores of 0.60 and 0.59, respectively.
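The chi-square test of association used above rests on comparing observed counts in a cluster-by-category table with the counts expected under independence. The sketch below computes only the test statistic and degrees of freedom; obtaining a P value additionally requires the chi-square distribution (for example from a statistics library), which is omitted here. The tables are invented toy counts, not survey data.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table.

    table: list of rows of observed counts (e.g. clusters x birthplace regions).
    Assumes every row and column has a non-zero total.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were independent
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Toy 2 x 2 tables: one with an association, one perfectly independent
stat_assoc, dof_assoc = chi_square_statistic([[10, 20], [20, 10]])
stat_indep, dof_indep = chi_square_statistic([[10, 20], [20, 40]])
```

A larger statistic for a given number of degrees of freedom corresponds to a smaller P value, which is what the bars in Fig. 4a summarize on a negative log scale.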

Fig. 4: Analysis of genetic clusters in relation to geographic ancestry.

a The figure depicts significant associations between genetic clusters and seven geographic variables as determined by chi-squared tests. Sky-blue bars indicate the strength of the associations in terms of \(-\log_{10}(P\ \text{value})\). A red dotted line marks the statistical significance threshold of P = 0.05. b Two models are compared using ROC curves. The sig model uses only geographic variables that have significant relationships in Figure 4a, while the all model includes all geographic data. c ADMIXTURE analysis results with the number of hypothetical ancestral populations (K) set to four are presented. Distinct ancestral genetic mixtures are labeled A to D. d This chart displays how the previously defined genetic clusters are distributed across the ancestral groups found in the ADMIXTURE analysis.

Further insights into the genetic structure of the cohort were obtained through unsupervised analysis with ADMIXTURE36, software designed for the maximum likelihood estimation of individual ancestries. We systematically explored the values of K, which represents the number of ancestral populations assumed, to discern the fine-scale genetic structure of our cohort’s ancestry (Supplementary Figs. 4 and 5). We chose K = 4 for the ADMIXTURE analysis to allow comparison with our previous clustering findings from the PCA-UMAP-DBSCAN and FST analyses. This decision allowed us to reveal four distinct ancestral genetic mixtures, as depicted in Fig. 4c, with detailed information in Supplementary Data 5. The distribution of these mixtures illustrates a clear alignment with the genetic clusters previously defined through other methodologies (Fig. 4d). Specifically, Mixture B predominantly comprised individuals from Cluster 3 (150 of 168, i.e., 89.2%), whereas Mixture C mainly consisted of individuals from Cluster 2 (230 of 260, i.e., 88.5%). By contrast, Mixtures A and D were primarily composed of individuals from the merged Cluster 1*, which comprised 1,485 individuals (65.5%) in the follow-up cohort.
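The comparison between ancestral mixtures and genetic clusters amounts to a cross-tabulation. The sketch below shows one way this could be done, assuming that each individual is assigned to the mixture with the largest ancestry proportion; the text does not state the assignment rule, so this is an assumption, as are the column order A to D and the toy proportions.

```python
from collections import Counter, defaultdict

MIXTURE_NAMES = ("A", "B", "C", "D")  # assumed column order of the K = 4 proportions


def dominant_mixture(q_row):
    """Name of the mixture with the largest ancestry proportion for one individual."""
    best = max(range(len(q_row)), key=lambda k: q_row[k])
    return MIXTURE_NAMES[best]


def cluster_share_by_mixture(q_matrix, clusters):
    """For each mixture, the fraction of its individuals that fall in each cluster."""
    counts = defaultdict(Counter)
    for q_row, cluster in zip(q_matrix, clusters):
        counts[dominant_mixture(q_row)][cluster] += 1
    shares = {}
    for mixture, counter in counts.items():
        total = sum(counter.values())
        shares[mixture] = {cluster: n / total for cluster, n in counter.items()}
    return shares


# Toy ancestry proportions (rows sum to 1) and cluster labels for five individuals
q_matrix = [
    [0.70, 0.10, 0.10, 0.10],
    [0.10, 0.80, 0.05, 0.05],
    [0.05, 0.75, 0.10, 0.10],
    [0.10, 0.60, 0.20, 0.10],
    [0.10, 0.10, 0.70, 0.10],
]
clusters = ["1*", "3", "3", "1*", "2"]
shares = cluster_share_by_mixture(q_matrix, clusters)
```

In the toy data, two of the three individuals assigned to Mixture B belong to Cluster 3, which is the same kind of proportion reported in the text for the real cohort.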

The alignment of genetic clusters with ancestral mixtures defined by ADMIXTURE analysis suggested a patterned genetic composition. The combined findings from the chi-square tests, predictive modeling, and ADMIXTURE analysis revealed a structured picture of a genetic cluster related to geographic ancestry. Populations tend to exhibit clustering according to geographic region, as determined by genetic distance. Although geographical information contributes to predicting genetic clusters37, our analysis indicates that it is not the sole definitive factor, suggesting that other factors could contribute to the genetic diversity of the population. The alignment observed in two types of clusters, where there was a significant overlap between the structures defined by machine-learning techniques and those identified by ADMIXTURE analysis, suggests that historical demographic events for particular clusters could significantly influence genetic compositions. This observation implies that the underlying genetic profiles of these clusters may be more prominently shaped by geographical drift than those of the other clusters. Beyond geographical distance, factors such as epigenetic markers, notably DNA methylation, a crucial epigenetic mechanism that is sensitive to environmental influences, may impact population structure38,39. These influences suggest that genetic diversity is the product of complex gene-environment interactions, which are further complicated by epigenetic modifications that reflect the intricate genetic and epigenetic interplay within populations38. As physical and social barriers among different populations decrease, genetic mixing becomes more common, suggesting that factors beyond geography are increasingly influential in shaping the population structure40. Overall, our results suggest a complex relationship among the concepts of genetic similarity, genetic ancestry, and genealogical ancestry41.

Integrated insights from genetic interpretation and epidemiological results

We conducted GWAS and post-GWAS analyses based on the newly assigned clusters to investigate potential differences in gene expression, lifestyle, and dietary habits (Supplementary Data 6–10). The mapped genes of each cluster were used for GSEA utilizing the FUMA pipeline42 to evaluate whether the genes exhibit statistically significant associations with trait-associated genetic variants. FUMA used 15 databases for GSEA, identifying 1616 significant relationships (Supplementary Data 9). Our study particularly highlighted the GWAS Catalog database, which is pivotal for exploring the genetic foundations of diverse traits and diseases, making it relevant to our investigation of the potential links between genetic variations and lifestyle and dietary habits. Among the gene sets analyzed, the GWAS Catalog revealed 220 significant relationships. Notably, these relationships exhibited the second-lowest P when gene sets were evaluated as groups, as illustrated in Supplementary Fig. 6.

A total of 64 unique gene sets were identified as statistically significant, with intriguing findings such as the “Fruit Consumption” gene set ranking 20th when sorted by adjusted P in ascending order, and the last associated with “A body shape index” (Supplementary Data 10). Given the information density, we focused on the distinct traits of each cluster in the Japanese population, employing a Circos plot (Fig. 5a). The upper circle segments correspond to the specific gene sets that are unique to each cluster, as determined by their non-overlapping presence in the GWAS Catalog database. For example, Cluster 1* was distinctly associated with HDL cholesterol levels, while Cluster 3 was uniquely linked to metabolic traits such as “Alanine aminotransferase” and “Glycated hemoglobin levels”.

Fig. 5: Integrated insights of genetic interpretation and epidemiological results.

a The Circos plot illustrates the Gene Set Enrichment Analysis (GSEA) results across the four identified clusters, using negative log-transformed P values. The lower circle of the plot represents each of the four clusters. The upper circle highlights the specific gene sets that are unique to each cluster, as determined by their non-overlapping presence in the GWAS Catalog. b A forest plot presents the outcomes of the multinomial logistic regression analysis. The model was adjusted for age, sex, and BMI for each exposure variable. The plot uses vertical lines on the y-axis to represent different lifestyle or dietary variables, with horizontal lines showing ORs and their 95% CIs. Points of intersection with the vertical line at OR = 1.0 suggest no effect. The figure only includes lifestyle or dietary variables with statistically significant relationships (P < 0.05).

Despite these specific associations, a uniform distribution of gene sets and their interrelations across clusters was evident (Supplementary Fig. 7). The consistent significance across gene sets likely stemmed from shared significant SNPs (Supplementary Data 11). This pattern became more apparent in the context of the one-vs-all GWAS analysis, where the studied traits essentially represented the fine-scale population structure, and the persistent occurrence of particular SNPs resulted from the shared genetic background within the homogeneous Japanese population. The consistently observed shared genetic influences, as reflected in the enrichment of the same gene sets within the population’s fine-scale genetic structure, can be attributed to pleiotropy, where a single gene can affect multiple seemingly unrelated traits. This finding highlights the effect of specific genetic loci on diverse traits, thereby emphasizing the importance of accounting for pleiotropy in explaining the genetic basis of phenotypic diversity.

In addition, we generated a forest plot using statistically significant results from multinomial logistic regression analysis of cross-sectional data (Fig. 5b) to illustrate the ORs and 95% confidence intervals (CI). The regression model was adjusted for age, sex, and BMI43 using Cluster 1* as the reference group because it was observed to be the most similar to the entire population. In total, 21 significant results were found among 175 exposure variables, with an average of 23,521 individuals observed per variable after excluding missing values. Among the significant findings, the lowest P value, approximately 0.005, was observed for “Daily Milk Intake” in Cluster 3. The figure presents consistent positive associations in Clusters 3 and 4, notably with weekly food consumption. Additionally, results included two exposures of “Total Weekly Calcium Consumption” for Cluster 4 (OR = 0.998, CI: 0.997–0.9999, P = 0.042) and “Total Weekly Fish Consumption” for Cluster 3 (OR = 1.002, CI: 1.000–1.003, P = 0.025). The small magnitude of these ORs suggested nuanced effects (see Supplementary Data 12 for detailed information, including survey questions). The regression analysis indicated significant associations across two domains: quality of life, emphasizing sleep disturbance and fatigue; and dietary habits, with a focus on the intake of fish and vegetables. An examination of Fig. 5a and b together reveals interesting patterns. For instance, Clusters 3 and 4 have higher vegetable consumption compared to the reference Cluster 1*. Furthermore, the Circos plot indicates that these clusters uniquely exhibit statistically significant GSEA results for body shape, feelings of guilt, and non-lobar intracerebral hemorrhage (MTAG). Such variations highlight the diverse impacts of these factors on health across different genetic groups. The observed differences among genetic clusters suggest a complex interplay between epidemiological exposures and fine-scale genetic structures.
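Odds ratios very close to 1 for continuous exposures are reported per unit of the exposure, so their practical size depends on the measurement scale. The sketch below shows the standard relations OR = exp(beta) with a Wald-type 95% CI, and how a per-unit OR compounds over a larger difference in exposure. The units of the dietary variables are not given in this section, so the 100-unit difference and the coefficient values are purely hypothetical.

```python
import math

Z_95 = 1.959963984540054  # standard normal quantile for a two-sided 95% interval


def odds_ratio_with_ci(beta, se, z=Z_95):
    """Convert a logistic regression coefficient and its SE to (OR, lower, upper)."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


def scaled_odds_ratio(or_per_unit, delta):
    """Odds ratio implied for a difference of `delta` exposure units."""
    # On the log-odds scale effects add, so on the OR scale they multiply
    return or_per_unit ** delta


# A per-unit OR of 1.002 compounded over a hypothetical 100-unit difference
or_100_units = scaled_odds_ratio(1.002, 100)
# A null coefficient gives OR = 1 with an interval symmetric on the log scale
or_null, ci_low, ci_high = odds_ratio_with_ci(beta=0.0, se=0.1)
```

Under this hypothetical scaling, a per-unit OR of 1.002 corresponds to an OR of roughly 1.22 across 100 units, which illustrates why small per-unit estimates need not imply negligible associations.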

Discussion

In this study, we identified the population structure of the Japanese population using unsupervised learning techniques including PCA, UMAP, and DBSCAN. We investigated the genetic factors associated with these clusters through FST, GWAS, and post-GWAS analyses, uncovering significant gene sets linked to specific traits and diseases. Notably, gene sets related to dietary habits such as “Fish- and plant-related diet” and health conditions like “Liver iron content” were identified, highlighting the unique genetic background of the Japanese population44 and the potential impact of genetics on health and lifestyle choices45. Despite the uniform genetic background across all clusters, the variance in the epidemiological relationships between lifestyle factors, dietary habits, and genetic clusters highlights the complex relationship between genetics and environmental factors. Multinomial regression analysis was used to compare different clusters in relation to lifestyle and dietary habits, illustrating the varied impacts of these factors across genetic clusters. These combined results underscore the complexity of genetic predispositions and their interactions with lifestyle and dietary habits, offering insights into gene-environment interactions by utilizing the concept of fine-scale genetic structure46.

Genetic research on the Japanese population has delved into its adaptation through two principal lenses: the admixture history of ancient lineages, highlighting the dual-structure model that illustrates the indigenous identities of the Jomon and Yayoi peoples; and the genetic distinctions within geographical regions (Hondo and Ryukyu)17. Prior studies that aimed to understand the genetic fine-scale structure in the Japanese population12,19 have primarily served to confirm the concordance between geographical distribution and the observed genetic structure. However, a pressing question remains: is employing geographical or indigenous labels as proxies for the identified fine-scale structures the optimal approach? Addressing the nuanced issue of population descriptors in genetics research has become increasingly important47, especially in terms of presenting genetic classifications to the public without fueling debates on genetic essentialism48. In 2023, the US National Academies of Sciences, Engineering, and Medicine (NASEM) published guidance recommending best practices for using population descriptors in research47. This guidance advises carefully considering the use of geographical ancestry to avoid misconceptions, particularly the risk of misinterpreting geography as indicative of environmental exposures. It also warns against using indigenous identity in ways that could imply “purity” and lead to discriminatory interpretations.

The fine-scale structure we uncovered reveals patterns where clusters serve as indicators of individuals with substantial genetic similarity. Our use of “genetic similarity” as a descriptor aligns with existing guidelines47. It is a well-accepted concept that phenotypes arise from the dynamic interplay between genetics and the environment49,50. Our analysis reveals distinct patterns among populations grouped by genetic similarity, suggesting potential interactions between genetic factors and environmental conditions. In light of these findings, we highlight the need for future studies that explicitly incorporate gene-environment analyses. Such research would more clearly elucidate the impact of societal changes on genetic variation through complex social determinants and interactions. This points to the necessity of including a broad spectrum of environmental and social factors in genetic research. It is essential to integrate sociological and environmental contexts into genetic studies to fully explore the intricate, intertwined relationships between society and human biology. This approach would help uncover how societal developments may influence genetic variation through detailed causal networks of social determinants and gene-environment interactions51.

Our study helps to understand the complex factors influencing health outcomes, with implications for both public health and precision medicine. The application of advanced unsupervised learning technologies, such as PCA-UMAP-DBSCAN, effectively identified clusters with significant genetic differentiation within a homogeneous population. These combined technologies have uncovered intricate patterns within genetic data, providing efficient classification based on intrinsic genetic structures. Simplifying the complex genotype matrix into discrete categorical variables that represent the fine-scale genetic structure enables the use of epidemiological regression. Our statistical analysis revealed gene-environment interactions across clusters that share a homogeneous genetic background. This comprehensive approach links genetic predispositions to lifestyle and dietary patterns, yielding valuable insights for public health interventions. This framework also highlights the possibility of strengthening public health research by including genetic structure as an essential variable in mainstream statistical studies, as integrating genetic architecture could improve our understanding of complex relationships in genetic epidemiology52. Moreover, the convenient nature of the DTC-GT data enables the creation of near-time biobank data, allowing researchers to engage in more precise data processing53. Swift recruitment facilitated by DTC-GT enables faster follow-up studies and offers crucial insights into personalized medicine. This approach allows for the rapid assembly and modification of prevention programs, thereby improving their effectiveness54. Additionally, incorporating gene-environment interactions and genetic clusters into the development of personalized prevention programs may improve their efficacy. By considering genetic susceptibility and modifiable lifestyle factors, these programs could improve personalized medicine by matching individuals using genomic profiling55.

Our study has some limitations. First, validation of gene set enrichment patterns in future studies is essential to establish the robustness of the explanation for the entire Japanese population. Second, the simplicity of the overall regression model, which was adjusted only for age, sex, and BMI, could introduce a potential bias in the interpretation of these relationships. The ORs close to 1.0 may indicate a potential limitation in establishing substantial associations with the respective clusters. Third, the exclusion of missing data could introduce a potential source of bias. The use of self-reported questionnaires for lifestyle and dietary habits in the cross-sectional study could also introduce recall bias. Additionally, while UMAP effectively highlights local data structures, it may overemphasize the differentiation between clusters. Future studies might benefit from validating findings through additional methods that offer different perspectives on the global relationships among genotype data. Finally, inherent self-selection bias could arise from customers opting to participate in the DTC-GT, potentially leading to subject bias. These limitations underscore the need for cautious interpretation and highlight the areas for consideration in future research.

In conclusion, by applying machine-learning techniques, we revealed the fine-scale genetic structure of Japanese DTC-GT customers, marked by high genetic variation among the identified clusters, and revealed a significant relationship between these genetic clusters and geographical ancestry. This study introduces an innovative framework for genetic epidemiology by integrating genetic insights with cross-sectional statistical analyses, shedding light on the subtle effects of lifestyle and dietary factors across different genetic clusters. With the future availability of health outcome data, research could further explore the relationship between risk factors and health outcomes within the context of genetic architecture.

Methods

Study participants

A total of 61,728 healthy Japanese adults were initially enrolled as customers of MYCODE (DLS Inc., Tokyo, Japan), a personal genome service in Japan. Participants returned their saliva samples to the DLS laboratory along with a signed consent form indicating their willingness to participate in MYCODE Research, where their anonymous genetic data and health-related information would be utilized for scientific research objectives. During sample quality control by the DLS laboratory, 55,551 participants along with 684,436 autosomal SNPs were retained after exclusion for sex inconsistency, identity by descent (\(\hat{\pi } > 0.1875\)), missing call rate (>0.01), non-Japanese ancestry estimated by the genetic PCA using East Asian samples from the 1000 Genomes Project Phase 3 version 5, and autosomal heterozygosity (>3 standard deviations (SD) above the mean).

For this study, the eligibility criteria were further established as follows: (i) age 20–64 years, (ii) validated height and weight data, and (iii) height within three times the interquartile range (IQR)56. PCA was conducted on the genotype data of the participants following the quality control procedures described below using PLINK (v1.9)57.

All ethical regulations relevant to human research participants were followed. The study was approved by both the ethics committee of DeNA Life Science Inc. (protocol #20140717_1) and the Institute of Medical Science, University of Tokyo (protocol #2019-48-1219) (Tokyo, Japan). Participants consented to the publication of research findings using their data, under the condition that no personally identifiable information is disclosed. All personal identifiers have been removed to protect confidentiality and comply with ethical standards.

Genotyping and quality control

SNP genotyping was performed using Infinium OmniExpress-24+ BeadChip or Human OmniExpress-24+ BeadChip (Illumina, Inc., San Diego, CA, USA) in the DLS laboratory. Before genotype imputation, quality control was performed on the 684,436 genotyped autosomal SNPs, excluding those with (i) minor allele frequency < 0.01, (ii) Hardy-Weinberg equilibrium \(P < 10^{-6}\), or (iii) missing call rate > 0.01. Haplotype phasing was first conducted on the entire cohort of 55,551 individuals using Eagle (v2.4.1)58, with the reference panels of the 1000 Genomes phase 3 version 5 (1KGP-JPT; available at https://mathgen.stats.ox.ac.uk/impute/1000GP_Phase3.html). Genotype imputation was performed using Minimac3. Quality control of the resulting genotype data excluded SNPs with (i) minor allele frequency < 0.01, (ii) Hardy-Weinberg equilibrium \(P < 10^{-6}\), (iii) call rate < 95%, or (iv) \(r^{2} < 0.7\). All genetic variants were then re-annotated using “bcftools annotate”. Linkage disequilibrium (LD) pruning was conducted after quality control with the option “--indep-pairwise 50 5 0.2” in PLINK (v1.9)57. Following the application of eligibility criteria to the study participants, the dataset consisted of 395,665 pruned-in variants from 43,726 individuals. Genomic coordinates were based on the GRCh37/hg19 reference.
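As an illustration of the three SNP-level filters listed above, the following minimal sketch applies them to a toy genotype matrix. It is not the PLINK implementation used in this study: the function name `snp_qc_mask`, the matrix layout (samples by SNPs, alternate-allele counts with NaN for missing calls), and the toy data are assumptions, and the Hardy-Weinberg check uses a one-degree-of-freedom chi-square approximation rather than an exact test.

```python
import math
import numpy as np


def snp_qc_mask(genotypes, maf_min=0.01, hwe_p_min=1e-6, missing_max=0.01):
    """Boolean mask over SNP columns passing the MAF, HWE, and missingness filters.

    `genotypes` is a samples x SNPs array of alternate-allele counts (0, 1, 2)
    with np.nan marking missing calls (a hypothetical layout for illustration).
    """
    g = np.asarray(genotypes, dtype=float)
    n_called = (~np.isnan(g)).sum(axis=0)
    missing_rate = 1.0 - n_called / g.shape[0]

    # Observed genotype counts per SNP (NaN never equals 0, 1, or 2).
    n_aa, n_ab, n_bb = ((g == v).sum(axis=0) for v in (0, 1, 2))

    with np.errstate(divide="ignore", invalid="ignore"):
        q = (n_ab + 2 * n_bb) / (2 * n_called)  # alternate-allele frequency
        maf = np.minimum(q, 1 - q)
        expected = n_called * np.array([(1 - q) ** 2, 2 * q * (1 - q), q ** 2])
        observed = np.array([n_aa, n_ab, n_bb])
        chi2 = ((observed - expected) ** 2 / expected).sum(axis=0)

    # Monomorphic SNPs give 0/0 above; set chi2 to 0 (they fail the MAF filter anyway).
    chi2 = np.nan_to_num(chi2, nan=0.0)
    # Survival function of a chi-square variable with 1 degree of freedom.
    hwe_p = np.array([math.erfc(math.sqrt(c / 2.0)) for c in chi2])

    return (maf >= maf_min) & (hwe_p >= hwe_p_min) & (missing_rate <= missing_max)


# Toy data: 200 samples, 4 SNPs.
in_hwe = np.repeat([0.0, 1.0, 2.0], [50, 100, 50])  # exact HWE proportions
no_het = np.repeat([0.0, 2.0], [100, 100])          # no heterozygotes at all
sparse = in_hwe.copy()
sparse[:10] = np.nan                                # 5% missing calls
demo = np.column_stack([in_hwe, np.zeros(200), no_het, sparse])

mask = snp_qc_mask(demo)
print(mask)
```

In this toy matrix only the first SNP is retained; the others fail the minor allele frequency, Hardy-Weinberg, and missing call rate filters, respectively.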

Dimensionality reduction and clustering

We followed the dimensionality reduction approach described in previous literature12. First, we performed PCA using PLINK (v1.9) with the default parameters. The first two PCs were extracted and projected onto a two-dimensional space for outlier inspection. We then performed UMAP on the first 20 PCs of the genotype data using the Python umap-learn package (v0.5.3; https://pypi.org/project/umap-learn/) with the default parameters. The choice of the top 20 PCs was based on commonly accepted practices and informed by prior research on the fine-scale structure of the Japanese population12. A scree plot of the eigenvalues is included in Supplementary Fig. 2.

After the manual identification of six distinct clusters based on the separation observed in the PCA-UMAP plot (Fig. 1a), clustering and visualization were conducted. DBSCAN, a widely used clustering algorithm in the spatial domain23, was implemented using the Python scikit-learn package (v1.3.0; https://scikit-learn.org/stable/) with the parameters set to eps = 0.068 and min_samples = 3. Noisy samples identified using DBSCAN were subsequently excluded, resulting in a final cohort of 43,726 individuals for further analysis (Supplementary Fig. 1). Python (v3.9.12) and R (v4.3.2) were used for all analyses in this study.
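The clustering and noise-removal step can be sketched as follows. This is a minimal illustration rather than the analysis code of this study: the two-dimensional coordinates are a synthetic stand-in for the PCA-UMAP embedding (three regular grids plus two isolated points), and only the DBSCAN parameters (eps = 0.068, min_samples = 3) are taken from the text. In practice, the embedding would come from UMAP applied to the top 20 PCs.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic stand-in for the 2-D embedding (an assumption for illustration):
# three dense 10 x 10 grids with 0.05 spacing, plus two isolated points.
grid = np.array([(i * 0.05, j * 0.05) for i in range(10) for j in range(10)])
embedding = np.vstack([
    grid,
    grid + [5.0, 0.0],
    grid + [0.0, 5.0],
    [[2.5, 2.5], [8.0, 8.0]],
])

# Parameters as reported in the text; DBSCAN labels noise samples as -1.
labels = DBSCAN(eps=0.068, min_samples=3).fit_predict(embedding)

is_noise = labels == -1
n_clusters = len(set(labels[~is_noise]))
kept = embedding[~is_noise]  # samples retained for downstream analysis

print(n_clusters, int(is_noise.sum()), kept.shape)
```

Because eps is expressed in the units of the embedding, the value reported here is only meaningful for an embedding on a comparable scale; a different UMAP run or parameterization would likely require retuning.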

Genetic variance analysis

We used the FST value, as defined by Weir and Cockerham24, to investigate the overall genetic differentiation among the six clusters identified using PCA-UMAP analysis. FST calculations were performed using the --fst option in PLINK (v1.9)57. ANNOVAR (v2018Jul08)59 was used for the functional annotation of SNPs with high FST values. In addition, we conducted pairwise comparisons of the clusters in a one-versus-one manner. Furthermore, one-versus-all FST analyses were performed to quantify genetic variance between each cluster and the entire dataset. Randomly generated cluster sets were created using the numpy package (v1.24.4) with two distinct seeds (42 and 100). The similarity between these random clusters and the original clusters was assessed using the adjusted Rand index (ARI) from the scikit-learn package (v1.3.0).
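The comparison between random and original cluster assignments can be illustrated with the sketch below. The way the random cluster sets were generated is not detailed above, so the sketch assumes that labels were shuffled while preserving cluster sizes; the cluster sizes themselves are hypothetical. Only the two seeds and the use of the ARI follow the text.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

# Hypothetical cluster assignment: six clusters of unequal size.
cluster_labels = np.repeat(np.arange(6), [500, 300, 100, 50, 30, 20])

ari_by_seed = {}
for seed in (42, 100):
    rng = np.random.default_rng(seed)
    # Assumption: random clusters keep the original cluster sizes (label shuffle).
    random_labels = rng.permutation(cluster_labels)
    ari_by_seed[seed] = adjusted_rand_score(cluster_labels, random_labels)

print(ari_by_seed)
```

An ARI near 0 indicates agreement no better than chance, whereas identical partitions give an ARI of 1, so values near 0 for the random sets would support the view that the observed clusters are not reproduced by arbitrary groupings of the same sizes.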

Geographical information and related analysis

The geographical information was obtained from a follow-up survey of 3132 volunteers, featuring nine questions on current residence, birthplace, longest place of residence before adulthood, and six familial birthplaces, covering all 47 Japanese prefectures and an additional “I don’t know/don’t want to answer” option. As shown in Supplementary Fig. 8, these prefectures were further organized into 10 regional blocks according to the Type I region classification by the Ministry of Internal Affairs and Communications in Japan (https://www.soumu.go.jp/main_content/000872772.pdf). After excluding responses with missing data, our analysis focused on 2268 participants with complete responses. A chi-square test was first conducted using the scipy.stats module (v1.11.4) in Python to confirm that the follow-up cohort shared a similar distribution of genetic clusters with the main dataset. For association analysis, we conducted chi-square tests to examine the relationship between geographical information and genetic clusters. Using the scikit-learn package (v1.3.0) in Python, we then trained random forest classifiers with the genetic cluster as the target variable and two alternative feature sets: one containing only the significantly associated geographic variables and the other containing all the geographic data collected. Stratified five-fold cross-validation was applied to ensure the consistency and reliability of model performance. To mitigate class imbalance, we used micro-averaging to generate a receiver operating characteristic (ROC) curve. ADMIXTURE software (v1.3.0)36 was used to conduct an unsupervised estimation of ancestral components among the 2268 participants. We examined different numbers of assumed ancestral components by setting K from 2 to 12 (ref. 12). The resulting representation distinguishes ancestral components through color coding and sorts individuals based on their predominant ancestry.
We selected K = 4 for our analyses as it provided a balance between minimizing error and aligning with the coherent genetic structures identified through our previous analyses. Supplementary Figs. 4 and 5 provide a comprehensive overview of the ancestral estimations across the entire range of K values with cross-validation errors.
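The random forest evaluation described above (stratified five-fold cross-validation with a micro-averaged ROC curve) can be sketched as follows. The data are simulated, the integer coding of the geographic variables is an assumption, and the hyperparameters are illustrative defaults rather than those used in this study.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.preprocessing import label_binarize

rng = np.random.default_rng(0)
n = 600

# Simulated, imbalanced cluster labels (three clusters for brevity).
cluster = rng.choice(3, size=n, p=[0.6, 0.3, 0.1])
# Hypothetical geographic features: one informative region code, one pure noise.
informative = np.where(rng.random(n) < 0.8, cluster, rng.integers(0, 3, size=n))
noise = rng.integers(0, 10, size=n)
X = np.column_stack([informative, noise])

# Out-of-fold class probabilities from stratified five-fold cross-validation.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
proba = cross_val_predict(clf, X, cluster, cv=cv, method="predict_proba")

# Micro-averaging: pool every (sample, class) pair into one binary problem.
y_onehot = label_binarize(cluster, classes=[0, 1, 2])
fpr, tpr, _ = roc_curve(y_onehot.ravel(), proba.ravel())
micro_auc = auc(fpr, tpr)

print(round(micro_auc, 3))
```

Micro-averaging weights each individual equally, so the pooled curve is dominated by the larger clusters; a macro-averaged curve would instead weight each cluster equally.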

GWAS and Post-GWAS analyses

To explore genetic variance among the six clusters, one-versus-all case-control studies were conducted using PLINK (v1.9)57. A logistic regression model assuming additive genetic effects was used for association analysis, adjusting for age and sex as covariates. The association analysis was repeated six times, designating each cluster in turn as the case group. The FUMA pipeline (v1.6.0; https://fuma.ctglab.nl/), from the SNP2GENE to the GENE2FUNC functions, was employed to explore the results obtained from PLINK. In the SNP2GENE step, lead and candidate SNPs were identified using default parameters, except that 1KGP-EAS was used as the reference panel. SNPs were considered independently significant if they reached the default genome-wide significance threshold of \(5\times 10^{-8}\) and were independent of each other at LD \(r^{2} < 0.6\). FUMA also defines independent lead SNPs at a lower LD threshold of \(r^{2} < 0.1\). Genomic risk loci were subsequently mapped to genes using functional and eQTL mapping. For positional mapping, ANNOVAR annotations were employed in the FUMA pipeline and candidate SNPs were assigned to the nearest genes within a maximum distance of 0 kb. For eQTL mapping, the expression data for blood tissues in GTEx v8 were used. A default false discovery rate of 0.05 was applied to define significant eQTL associations. In addition, the GENE2FUNC feature in FUMA conducts gene set enrichment analysis (GSEA) by performing hypergeometric tests using mapped genes to assess the overrepresentation of biological functions using public databases, including the GWAS Catalog, MSigDB, and WikiPathways. The circos plot was generated using the circlize R package (v0.4.15).
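The hypergeometric test underlying this kind of overrepresentation analysis can be written compactly. The sketch below is a generic illustration, not FUMA code; the function name and the example counts are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(n_background, n_in_set, n_mapped, n_overlap):
    """Upper-tail probability P(X >= n_overlap) of the hypergeometric distribution.

    n_background: number of background genes
    n_in_set:     background genes belonging to the gene set
    n_mapped:     number of mapped (input) genes
    n_overlap:    mapped genes that fall in the gene set
    """
    upper = min(n_in_set, n_mapped)
    total = comb(n_background, n_mapped)
    tail = sum(
        comb(n_in_set, i) * comb(n_background - n_in_set, n_mapped - i)
        for i in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical example: 20,000 background genes, a 150-gene set,
# 60 mapped genes of which 5 belong to the set.
p_value = hypergeom_enrichment_p(20_000, 150, 60, 5)
print(p_value)
```

With many gene sets tested at once, the resulting P values would normally be adjusted for multiple testing before interpretation.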

Lifestyle and dietary habits

MYCODE users answered an optional web-based questionnaire on lifestyle and dietary habits on the service website to obtain personalized health advice from August 2014 to June 2020. Because the questionnaire was self-administered, the timing of responses differed for each participant. The anonymized answer data of the MYCODE Research participants were used for this study. To ensure data quality, responses that appeared suspicious were identified and removed during the pre-analysis phase. The questionnaire covered anthropometric traits; dietary habits such as the intake of calcium, dietary fiber, fruit, fish, and red and yellow/green/other vegetables; physical activity; smoking and drinking habits; sleep-related behaviors; stress responses; and blood test results such as triglyceride, LDL cholesterol, and glucose levels. We used the calcium self-check chart60 for calcium intake assessment and the Brief Job Stress Questionnaire for stress response evaluation, using a subset of 29 items comprising 18 items related to psychological stress responses and 11 items focusing on physical stress responses and somatic symptoms61. Other questions on dietary habits were originally developed by DLS based on the Standard Tables of Food Composition in Japan (https://fooddb.mext.go.jp/). To remove potentially erroneous answers to free-text questions on alcohol intake, answers exceeding 10 glasses of wine (120 mL per glass) per day or 10 cans of chuhai (350 mL per can) per day were converted to missing values.

Multinomial logistic regression analysis

For each lifestyle and dietary habit, multinomial logistic regression analyses were performed using the scikit-learn Python library (v1.3.0). For each exposure variable of interest, participants with missing values for that lifestyle or dietary habit were excluded from the corresponding regression. Age, sex, and body mass index (BMI) were adjusted for in the model43, and the covariate data contained no missing values. All statistical analyses were conducted using Python, with statistical significance set at P < 0.05. The formula for multinomial logistic regression with lifestyle/dietary habit variable X and a specific cluster k is as follows:

$$\log \left(\frac{P(Y=k)}{P(Y=\,{\mbox{reference}})}\right)={\beta }_{0k}+{\beta }_{1k}\cdot {\mbox{age}}+{\beta }_{2k}\cdot {\mbox{sex}}+{\beta }_{3k}\cdot {\mbox{BMI}}\,+{\beta }_{4k}\cdot X$$

where:

  • \(\frac{P(Y=k)}{P(Y=\,{\mbox{reference}})}\) is the odds of the dependent variable Y being in cluster k rather than in the reference cluster (Cluster 1), given the value of the lifestyle/dietary habit variable X.

  • \({\beta }_{1k}\), \({\beta }_{2k}\), \({\beta }_{3k}\), and \({\beta }_{4k}\) are the coefficients for cluster k corresponding to the variables age, sex, BMI, and X, where X is the specific lifestyle/dietary habit variable.

  • \({\beta }_{0k}\) is the intercept term representing the baseline log-odds for cluster k.

  • age, sex, and BMI are the covariates adjusted in the model.

The odds ratio (OR) for X in relation to cluster k is calculated as:

$${{\mbox{OR}}}_{X,k}=\exp ({\beta }_{4k})$$
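The model and the odds ratio calculation above can be illustrated with the following sketch on simulated data. All variable names and the simulated effect sizes are assumptions, and three clusters are used for brevity. Two points are specific to scikit-learn rather than to the formula: its multinomial model uses a symmetric softmax parameterization, so coefficients relative to the reference cluster are obtained by subtracting the reference row, and it regularizes by default, which the sketch weakens with a large `C`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 6000

# Simulated covariates (age and BMI standardized) and a binary habit X.
age = rng.standard_normal(n)
sex = rng.integers(0, 2, size=n)
bmi = rng.standard_normal(n)
x = rng.integers(0, 2, size=n)

# Simulated truth: X raises the log-odds of cluster 1 versus cluster 0 by 1.0
# and has no effect on cluster 2 versus cluster 0.
logits = np.column_stack([np.zeros(n), -0.5 + 1.0 * x, np.full(n, -0.5)])
# Gumbel-max trick: draws each label from the softmax of its logits.
y = np.argmax(logits + rng.gumbel(size=logits.shape), axis=1)

features = np.column_stack([age, sex, bmi, x])
# Large C makes the default L2 penalty negligible (near-unpenalized fit).
model = LogisticRegression(C=1e6, max_iter=2000).fit(features, y)

reference = 0  # index of the reference cluster
# beta_4k relative to the reference: subtract the reference row of coef_.
beta_x = model.coef_[:, 3] - model.coef_[reference, 3]
odds_ratios = np.exp(beta_x)

print(np.round(odds_ratios, 2))
```

By construction, the OR for the reference cluster equals 1, and the estimate for cluster 1 should lie near exp(1.0). scikit-learn does not report standard errors or P values, so these would need to be obtained separately, for example by bootstrapping.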

Statistics and reproducibility

In our study, we utilized PCA, UMAP, and DBSCAN to explore the fine-scale genetic structure of a homogeneous Japanese population based on autosomal SNP data. Additionally, we investigated the relationship between genetic clusters and lifestyle factors by applying multinomial logistic regression.

To ensure the reproducibility of our findings, we provide comprehensive descriptions of all procedures, from data collection to analysis. This detail includes our data cleaning process, the criteria for including and excluding participants, and the specific parameters set within our statistical models. Initially, our study included 61,728 participants, which was then narrowed down to 43,726 individuals after quality control. For replication purposes, researchers should employ a similarly large cohort of both genetic and phenotypic data, ideally encompassing over 40,000 participants post-quality control, to ensure robust and replicable results.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.