Abstract
Accurate genotype imputation is essential for large-scale genetic studies and precision medicine. While East Asian (EAS)-specific reference panels like ChinaMAP and CHN100k have been developed, most studies still rely on multi-ancestry panels like TOPMed due to the large sample size. However, their performance in underrepresented groups like Southeast Asians remains unclear. Using high-coverage whole-genome sequencing and SNP-array data from 8,316 Chinese and Thai individuals, we systematically evaluate six state-of-the-art reference panels for genotype imputation. Our results show that EAS-specific panels outperformed multi-ancestry panels for East and Southeast Asian populations. For example, ChinaMAP achieves a mean heterozygosity concordance rate above 0.90 without R2 filtering, whereas TOPMed requires an R2 threshold of 0.60-0.70 to achieve comparable results. Notably, we find that recent positive selection drives regional disparities in imputation accuracy, as illustrated by the olfactory receptor gene cluster. More importantly, our results indicate that the choice of reference panel and R2 thresholds have a significant impact on polygenic risk score estimation for disease prediction. These findings provide valuable guidelines for improving genotype imputation in East and Southeast Asian populations and underscore the need for ancestrally diverse reference panels to support globally equitable genomic research.
Similar content being viewed by others
Introduction
Genotype imputation is a statistical method used to infer missing genotypes in individual samples. It is widely applied in genome-wide association studies (GWASs) to increase the density of genetic variants and facilitate meta-analyses across studies with varied genotyping platforms1. The accuracy of imputation relies heavily on the characteristics of the reference panel, with sample size, ancestral composition, and haplotype diversity being crucial determinants2.
Reference panels produced by the 1000 Genomes Project3 (1KG Phase3 v5; n = 2504), Haplotype Reference Consortium4 (HRC; n = 32,470), and the Trans-Omics for Precision Medicine5 (TOPMed r3; n = 133,597) are widely adopted in most GWASs since they offer large sample sizes and rich haplotype data. However, East and Southeast Asian populations remain underrepresented in these reference panels, despite their substantial contribution to global diversity6,7. Previous studies have shown that the underrepresentation in the TOPMed reference panel resulted in reduced imputation accuracy for individuals of Asian ancestry compared to those of European ancestry8. The selection of appropriate reference panels is essential for uncovering ancestry-specific disease mechanisms and ensuring equitable translational applications of genetic research9.
Recent efforts have established specific reference panels for these East and Southeast Asian populations. Notable examples include the Westlake BioBank for Chinese10 (WBBC; n = 4489), China Metabolic Analytics Project11 (ChinaMAP; n = 10,155), CHN100k12 (n = 25,734), and South and East Asian Reference Database13 (SEAD; n = 11,067). However, these reference panels remain underutilized in current studies, partially due to limited evaluation and uncertainty regarding their optimal use in imputation applications.
To enhance the applicability of current reference panels for East and Southeast Asian populations, we curated an SNP array dataset comprising 6997 individuals of Chinese and Thai ancestry. In addition, we generated a high-coverage whole-genome sequencing (WGS) dataset containing a total of 1319 individuals from both populations. These datasets enable a more comprehensive evaluation of imputation quality across various reference panels.
Imputation quality is commonly evaluated using metrics such as INFO or R2, which assess the reliability by measuring the proportion of variation explained by the imputation14. However, these metrics can become biased when the reference panels inadequately represent the target population, particularly for low-frequency variants15. Despite this limitation, most studies apply a fixed threshold (e.g., R2 > 0.30) to filter imputed data, potentially compromising downstream analyses.
Genotype concordance rate, which measures the proportion of correctly imputed genotypes for individuals, is widely regarded as the gold standard for evaluating imputation quality. However, this metric can be misleading for variants with skewed allele frequencies. For example, when a variant’s minor allele frequency (MAF) is below 5%, simply predicting the major alleles for all samples can yield over 90% accuracy, regardless of true genotype status16. Two alternative metrics are used to address this limitation: the heterozygosity concordance rate, which only evaluates concordance for heterozygous genotypes for individuals, and the imputation quality score (IQS)16,17, which accounts for chance agreement of genotypes due to biased allele frequency. These metrics provide a more reliable assessment of imputation accuracy.
In this study, we aimed to identify suitable reference panels for Chinese and Thai populations. Furthermore, we also examined potential factors driving differences in genotype imputation, an important question that has not been fully explored to date. Our results indicate that recent positive selection contributes to regional differences in imputation accuracy. Taken together, these findings not only provide practical guidelines for improving genotype imputation but also offer a mechanistic explanation for the importance of ancestry-matched reference panels.
Results
Dataset preparation
A total of 8316 samples of Chinese and Thai ancestry were included in this study following quality control (Methods; Supplementary Table 1). The Chinese dataset comprised 3194 samples previously genotyped using the Infinium Asian Screening Array-24 v1.0 (ASA)18,19, a tailored genotyping array for Asian populations, along with 1,263 newly generated WGS samples with an average depth of 36.1. The Thai dataset consisted of 3803 ASA-genotyped samples from prior studies18,19 and 56 newly sequenced WGS samples with an average depth of 43.8.
Genetic variants for the WGS samples were detected using the GATK pipeline20,21,22 (Methods). Variants that passed quality control and overlapped with the Infinium ASA platform were extracted from the WGS data. These shared variants were then used to merge the WGS samples with the ASA-genotyped samples. This procedure was performed separately for the Chinese and Thai datasets, resulting in combined ASA-WGS datasets for each population. Principal component analysis (PCA) confirmed no significant difference between the ASA-genotyped and WGS samples within each combined dataset (Supplementary Fig. 1).
The combined dataset in each population was phased and imputed using six reference panels (Methods): CHN100k, ChinaMAP, WBBC, SEAD panels, and two global reference panels: 1KG (Phase3 v5) and TOPMed (r3). Differences among these reference panels were outlined in Supplementary Table 2. Imputation performance was evaluated by comparing the imputed genotypes to the genotypes detected by WGS.
Coverage of imputation across reference panels
Given that certain reference panels did not support genotype imputation for variants within the human leukocyte antigen (HLA) region (chr6: 29,722,775–33,314,387 based on the assembly GRCh38) or for indels, and showed a varied output format for indel imputation, our analysis focused on comparing the proportion of single nucleotide variants (SNVs) outside the HLA region. In the Chinese dataset, the imputed variants from the CHN100k panel showed the highest overlap with those identified by WGS among the reference panels (68.6%), followed by the TOPMed panel (59.4%) and the ChinaMAP panel (58.9%) (Supplementary Fig. 2a and Supplementary Table 3). In contrast, the 1KG panel, a widely used global reference, demonstrated the lowest overlap at 31.9%. In the Thai WGS dataset, the SEAD panel exhibited the highest overlap with the WGS-identified variants (81.2%), followed by the CHN100k panel at 78.6%, the TOPMed panel at 77.6% and the ChinaMAP panel at 76.1% (Supplementary Fig. 2b and Supplementary Table 3).
We further categorized WGS-identified variants as common or low-frequency variants (MAF < 0.05). As expected, common variants had substantially higher overlap with imputed datasets than low-frequency variants (Fig. 1a and Supplementary Table 3). In the Chinese dataset, more than 85% of common variants identified by WGS overlapped with those imputed by most reference panels. For low-frequency variants, the CHN100k panel demonstrated the highest overlap at 65.2%, followed by the TOPMed panel at 55.5% and the ChinaMAP panel at 54.3%. Notably, although the 1KG panel showed a high overlap with common variants (89.2%), its overlap for low-frequency variants was considerably lower at 23.6%.
The Y-axis represents the percentage of WGS-identified variants that overlap with imputed variants, while the X-axis shows the different reference panels used for genotype imputation. Overlaps for common variants (minor allele frequency > 0.05) are shown in green and overlaps for low-frequency variants (minor allele frequency ≤0.05) are displayed in yellow. a Results based on the Chinese WGS cohort (n = 1263). MAF estimates were derived from this Chinese WGS cohort. b Results based on the Thai WGS cohort (n = 56). MAF estimates were derived from East Asian samples in the 1000 Genomes Project (n = 513).
A similar pattern was observed in the Thai dataset. 97.1% of common variants identified by WGS overlapped with those imputed by the SEAD panel, followed by the 1KG panel (92.8%), the CHN100k panel (91.6%), and the ChinaMAP panel (91.4%) (Fig. 1b and Supplementary Table 3). For low-frequency variants, the SEAD panel again demonstrated the highest coverage (68.5%), outperforming the CHN100k (68.1%) and TOPMed (68.0%) panels. In contrast, the 1KG panel showed a low overlap at 56.1%.
We noted that the coverage between the Thai and Chinese datasets is not directly comparable because of the marked difference in sample size. The larger Chinese dataset is able to capture both a greater number of variants and much rarer ones than the Thai dataset. Collectively, these findings reveal that a significant proportion of WGS-identified variants, particularly low-frequency variants, remain undetected by current imputation references.
Evaluation of imputation accuracy across reference panels
To evaluate imputation accuracy, high-quality WGS samples were used as the gold standard. The imputed genotypes were then compared with the actual genotypes determined by WGS. Variants identified by WGS and shared across all imputation panels were included in the analysis. Accuracy metrics, including imputation R2, heterozygosity concordance rate, and IQS, were computed for the comparison (Methods). These metrics allowed us to evaluate imputation accuracy at both variant-level (R2 and IQS) and sample-level (heterozygosity concordance rate).
In the Chinese dataset, 9,781,349 SNVs shared between WGS and imputation panels were included in the analysis. We found that imputation accuracy varied with allele frequency. As expected, common variants generally showed higher imputation accuracy than low-frequency variants across all metrics. Among the reference panels, ChinaMAP consistently outperformed the others, achieving the highest scores across different MAF categories (Fig. 2a). While the CHN100k panel also performed well, its results were slightly lower than those of ChinaMAP. Notably, although the 1KG achieved the second-highest imputation R2 scores for common variants, its values appeared inflated relative to actual genotypes. This was evidenced by significantly lower heterozygosity concordance rates and IQS scores compared to ChinaMAP and CHN100k panels (Fig. 2a).
The X-axis represents minor allele frequency (MAF) categories, while the Y-axis in the top panel displays the mean imputation R2. The middle panel shows the heterozygosity concordance rate for each sample, and the bottom panel represents the mean imputation quality score (IQS) across the MAF categories. a Analyses were based on 9,781,349 variants shared between the six reference panels and the Chinese WGS cohort (n = 1263), with variants stratified by MAF. MAF estimates were derived from this Chinese WGS cohort. b Analyses were based on 6,299,991 variants shared between the six reference panels and the Thai WGS cohort, with variants stratified by MAF. MAF estimates for the Thai cohort were derived from East Asian samples in the 1000 Genomes Project (n = 513).
Analysis of the Thai dataset, including 6,299,991 shared variants between WGS and panels, revealed different patterns of imputation accuracy across reference panels. Most panels demonstrated comparable performance in both heterozygosity concordance rates and IQS scores across MAF categories, except the WBBC panel (Fig. 2b). In addition, the TOPMed panel exhibited superior performance for low-frequency variants. Notably, while the SEAD panel demonstrated acceptable concordance rates, we observed consistent deflation in its imputation R2 compared to the actual genotypes. Together with the findings in the Chinese dataset, these results suggest that imputation R2 scores may not fully reflect the genotype concordance between imputed and actual data among reference panels.
To further evaluate the impact of SNP array choice on imputation performance, we repeated the analysis using variants included on the Infinium Global Screening Array (GSA) v3.0, which is optimized for cross-ancestry studies (Methods). We also observed that the ChinaMAP panel achieved the best heterozygosity concordance rates and IQS scores across MAF categories in the Chinese dataset, followed by the CHN100k panel (Supplementary Fig. 3). These findings were consistent with those obtained using variants on the Infinium ASA array, suggesting that the choice of SNP array has minimal impact on the trend of imputation performance among these reference panels.
R2 thresholds need to be tailored based on reference panels
Due to the absence of ground-truth genotype data in real-world studies, imputation R2 scores remain a critical metric for quality control of imputed variants in GWAS analyses. While current practice commonly employs a fixed R2 threshold for this purpose23, the above analyses suggest that optimal filtering thresholds should be determined through panel-specific validation studies.
For each reference panel, we used heterozygosity concordance rates and IQS scores as benchmarks to evaluate imputation accuracy across imputation R2 thresholds. In the Chinese dataset, while increasing the R2 threshold generally improved both heterozygosity concordance rates and IQS scores, the relationships were not strictly linear, particularly for the ChinaMAP and CHN100k panels (Fig. 3a). Notably, even without R2 filtering, these two panels achieved a mean heterozygosity concordance rate above 0.90 and an average IQS score exceeding 0.70 (Fig. 3a and Supplementary Table 4, 5). Further increasing the threshold from 0 to 0.40 for the results from the ChinaMAP panel showed minimal improvement, suggesting limited benefit from applying a threshold within this range. By comparison, the 1KG and TOPMed panels required an R2 threshold between 0.60 and 0.70 to achieve similar performance levels, which substantially reduced the percentage of imputed variants (Fig. 3a).
The X-axis represents imputation R2 threshold categories, while the Y-axis in the top panel displays heterozygosity concordance rate for each sample. The middle panel shows the mean imputation quality score (IQS) and the bottom panel represents percentage of variants retained across R2 thresholds. a Analyses are based on 9,781,349 variants shared between the six reference panels and the Chinese WGS cohort (n = 1263), with variants binned by R2 thresholds. b Analyses are based on 6,299,991 variants shared between the six reference panels and the Thai WGS cohort (n = 56), with variants binned by R2 thresholds.
We also investigated the performance of low-frequency variants. Without R2 filtering, the ChinaMAP panel achieved a mean heterozygosity concordance rate above 0.80 and an average IQS score of 0.67 (Supplementary Fig. 4a and Supplementary Table 6, 7). The robust performance underscores the advantage of using the ChinaMAP panel for downstream analyses in Chinese populations.
In the Thai dataset, the SEAD panel slightly outperformed other panels across when the R² threshold was increased to 0.30 (Fig. 3b and Supplementary Table 4-5). However, without R2 filtering, the ChinaMAP and 1KG panels showed considerably higher mean IQS values than the SEAD panel (Fig. 3b). For low-frequency variants, the heterozygosity concordance rate was substantially lower compared to the Chinese dataset. Achieving a mean heterozygosity concordance rate of 0.80 required raising the R2 threshold for the SEAD panel to between 0.40 and 0.50 (Supplementary Fig. 4b and Supplementary Table 6, 7). This adjustment resulted in a reduction of more than 30% of retained low-frequency variants.
We also assessed the potential impact of SNP array choice on imputation performance by repeating the analysis using variants included on the Infinium GSA array. Consistent with the results derived from variants on the Infinium ASA array, ChinaMAP still demonstrated the best performance across different R2 thresholds in the Chinese dataset, followed by CHN100k (Supplementary Fig. 5). These results suggest that the observed trends in imputation performance among reference panels are unlikely to be substantially affected by the choice of SNP arrays.
Collectively, these findings highlight the critical need to optimize reference panel-specific R2 thresholds that account for ancestral background, enabling balanced optimization of imputation accuracy while retaining sufficient variants for downstream analyses.
Recent selection may contribute to regional variation in imputation quality
We further subdivided autosomal regions into 100 kb windows to assess spatial patterns of imputation quality. Given that allele frequencies vary across genomic regions, we evaluated their performance using the IQS score, which accounts for biased allele frequency in imputation quality evaluation16,17 (Methods). In the Chinese dataset, the ChinaMAP panel significantly outperformed others, yielding the highest IQS scores in over 92% of windows (Fig. 4a and Supplementary Table 8). In contrast, the Thai dataset displayed markedly different patterns: the ChinaMAP and 1KG panels were optimal in approximately 31.7% and 31.1% of windows, respectively, followed by TOPMed (12.7%) and SEAD (11.6%) (Fig. 4b). This trend generally aligned with the performance of mean IQS scores observed without R2 filtering (Fig. 3b).
Each dot corresponds to a 100 kb genomic region, with different colors indicating various reference panels. The color of each dot represents the reference panel with the best performance (measured by mean IQS) for that specific region. a Analyses were based on 9,781,349 variants shared among the six reference panels and the Chinese WGS cohort (n = 1263). b Analyses were based on 6,299,991 variants shared among the six reference panels and the Thai WGS cohort (n = 56).
We further investigated regional disparities in imputation performance by examining IQS variation among reference panels across genomic windows. In the Chinese dataset, we found 37 windows with substantial differences in imputation quality, characterized by a mean-variance exceeding 0.02 (Supplementary Fig. 6 and Supplementary Data 1). Notably, most of these regions (34 out of 37) showed significantly better performance when the Chinese-matched reference panels were used. This observation could be attributed to the effects of recent selection on these regions.
To further evaluate this hypothesis, we examined signatures of recent positive selection using the Singleton Density Score (SDS) method24 with data from an independent Chinese cohort25 (RePoS database; n = 3946). Owing to an insufficient number of variants (<20 variants) within a window, three of the 34 regions were excluded from the analysis. Among the remaining regions, four showed suggestive signatures of recent positive selection (SDS P-value < 5E-04). This ratio was significantly higher compared to regions with minimal variation in imputation quality (OR = 6.75, Fisher’s exact test P-value = 0.0043; Method).
A notable example is a region on chromosome 11 (55.4–55.6 Mb) that contains an olfactory receptor gene cluster (Fig. 5a). Within the region, ChinaMAP (mean IQS = 0.836) demonstrated significantly better imputation performance compared to other panels, such as 1KG (mean IQS = 0.762), TOPMed (mean IQS = 0.644), and SEAD (mean IQS = 0.473). A strong signature of recent positive selection in the region was also detected in the Chinese population, as indicated by the lead SNP rs511492 (SDS = 7.46, SDS P value = 8.69E-14; Fig. 5b). Prior studies have shown an extended haplotype for the alternative allele (T) compared to the reference allele (C)25, further supporting evidence of recent selection in this region. Notably, one of the selected haplotypes (CTT; consisting of rs117749670, rs1459101 and rs511492) carried a stop-gain mutation (rs1459101-C) in the OR4C16 gene. In our Chinese WGS dataset, this haplotype was observed at a frequency of 15.3%. However, it was nearly absent in African, American, South Asian, and European populations (Fig. 5c). These findings help explain why non-Chinese reference panels performed less effectively in regions that have undergone recent selection in the Chinese population.
a Imputation quality performance across reference panels within the region. Each dot represents the IQS score, with different colors indicating the corresponding reference panels. b Selection signatures within the region. The recent selection signature is measured by the Singleton Density Score (SDS) method. Each dot represents the -log10 transformed SDS P-value for individual variants in the region. *rs1459101 is a stop-gain variant in the OR4C16 gene. c Frequency of the selected haplotype (CTT) across 26 global populations. The CTT haplotype, composed of rs117749670, rs1459101, and rs511492, was analyzed using high-coverage data from the 1000 Genomes Project (phase 3) to calculate haplotype frequencies.
Immune-related genes enriched in regions with poor imputation quality
Through the regional evaluation, we also identified 0.94% of genomic regions with consistently poor imputation quality (mean IQS < 0.40) in both populations (Fig. 4 and Supplementary Data 3, 4). These encompassed critical immune-related loci, including immunoglobulin heavy loci (IGH; 16p11.2), kappa (IGK@; 2p11.2) and lambda (IGL@; 22q11.22) gene clusters, along with the T cell receptor beta locus (TRB; 7q34). These regions pose significant challenges for imputation due to the extensive haplotype diversity, segmental duplications, and complex structural26,27,28. Notably, evidence of recent positive selection at the IGH cluster in the Chinese population25 underscores the need for long-read sequencing approaches to resolve population-specific architecture in these regions.
Genes within other poorly imputed regions also showed significant enrichment for immunological pathways including macrophage inflammatory protein (MIP)-1beta signaling (adjusted P-value = 9.20E-06), leukocyte immunoglobulin-like receptor (LILR) activity (adjusted P-value = 5.02E-05), C-C motif chemokine production (eg. CCL3, CCL15, CCL18 and CCL23; adjusted P-value = 1.74E-04) and others (Supplementary Data 5). These findings imply that certain association signals for immune-related diseases or traits remain undetected in East and Southeast Asian populations due to inadequate imputation quality.
Impact of imputation quality on disease risk estimation
Polygenic risk scores (PRS) have been widely utilized to predict individuals’ risk of developing complex diseases9. However, the quality of genotype imputation could affect PRS calculations and, consequently, the accuracy of disease prediction. To evaluate this effect, we leverage the recorded phenotypes in Chinese dataset to examine how different imputation references impact PRS estimation to predict individuals’ risk of systemic lupus erythematosus (SLE).
Using WGS from 1,263 Chinese individuals, we compared SLE PRS calculated from true genotypes against the PRS generated by genotype dosages imputed from each reference panel (Methods). To compute the PRS, we used 203 previously reported SLE-associated variants19,29 shared between the WGS dataset and the reference panels. The results showed that PRSs calculated from imputation using the ChinaMAP and CHN100K panels were highly correlated with the score derived from the true genotypes, with Pearson correlation coefficients (r) of 0.945 and 0.937, respectively. The correlations were much greater than those obtained using the 1KG (r = 0.894) and TopMed (r = 0.898) panels (Fig. 6a).
a Comparison of polygenic risk scores (PRS) derived from true genotypes with PRS computed using genotype dosages imputed from various reference panels. Each dot represents an individual WGS sample. b Predictive accuracy for systemic lupus erythematosus (SLE) using PRS computed by various reference panels. The accuracy is measured by the area under the receiver operating characteristic curve (AUC). Error bars represent the 95% credible intervals for AUC values. c Predictive accuracy for SLE using PRS built from variants filtered at different R2 thresholds. d The number of SLE-associated variants included in constructing the PRS at various R2 thresholds. e Predictive accuracy for SLE using PRS built from variants filtered at different R2 thresholds using Genome-Wide SNP Sets. f The number of variants included in constructing the PRS at various R2 thresholds using Genome-Wide SNP Sets.
To further evaluate the impact on disease risk prediction, we extended the analysis to the ASA-WGS combined dataset, comprising 2027 SLE cases and 2430 controls of Chinese ancestry. Using this data, we measured the accuracy of SLE prediction by the area under the receiver-operating characteristic curve (AUC). The results showed that the PRSs constructed using genotype dosage imputed from the CHN100k (AUC = 0.761) and ChinaMAP (AUC = 0.759) panels performed slightly better in SLE prediction than those generated using other panels (e.g., AUC = 0.751 for both 1KG and TopMed; Fig. 6b).
We next investigated the impact of varying R2 thresholds on prediction performance. Unlike the above study, all 239 previously reported SLE-associated variants (Supplementary Table 10) were included in this analysis. The results showed that overall predictive power declined as the R2 threshold increased (Fig. 6c). The mean AUC across panels dropped significantly from 0.783 at an R2 threshold of 0 to the mean of 0.736 at an R2 threshold of 0.90 (Paired t-test P-value = 2.31E-06). This reduction is likely attributed to the decreased number of associated variants retained at higher R2 thresholds (Fig. 6d). We also observed that PRSs derived from the CHN100k and ChinaMAP panels achieved the best performance without any R2 filtering. This pattern suggests that raising the R2 threshold may enhance predictive power by reducing the influence of poorly imputed genotypes when utilizing global reference panels.
Considering potential bias toward associated variants, we further implemented lassosum30, a genome-wide PRS method, to reanalyze the data (Methods). The model was trained on the summary statistics from a previous Japanese and Korean meta-analysis of SLE31 (5369 cases, 233,167 controls). The PRS estimations performed better when constructed from imputed data using the ancestry-matched reference panels (Fig. 6e). We also observed that the mean AUC across all panels dropped significantly from 0.682 at an R2 threshold of 0.00 to 0.653 at a threshold of 0.90 (paired t-test P-value = 2.44E-04). A marked reduction in prediction accuracy was observed when the threshold was greater than 0.70. These patterns are generally consistent with the results using reported SLE‑associated variants, but the genome‑wide PRS method was more robust and less sensitive to variant selection bias. Collectively, these findings highlight the importance of ancestry-matched reference panels in improving PRS estimation for disease risk prediction.
Discussion
The underrepresentation of East and Southeast Asian populations in widely used global reference panels, such as TOPMed, poses significant challenges for genetic discovery and translational applications in these ancestral groups. By leveraging high-coverage WGS data as a gold standard, we sought to address longstanding uncertainties regarding the optimal selection of imputation panels and quality control strategies for GWAS in East and Southeast Asian populations. More importantly, we further explore potential factors underlying imputation differences across reference panels. Our results demonstrate that recent positive selection may contribute to the accuracy of genotype imputation, exemplified by the olfactory receptor gene cluster. These findings provide new insights into the choice of reference panels and merit broader attention in future studies.
By comparing the overlap between imputed variants and those identified through WGS, our study suggests that a substantial proportion of WGS-identified variants, particularly low-frequency variants, remain uncovered by current imputation references. While global reference panels effectively capture common variants, ancestry-matched panels provide significantly better coverage for low-frequency variants. Notably, we showed an often-overlooked limitation of earlier studies that relied on the 1KG panel, which showed the lowest overlap for low-frequency variants in East and Southeast Asian populations. These results emphasize the continued importance of sequencing technologies for comprehensive variant discovery in future studies.
Our findings demonstrate that imputation R2 scores do not accurately represent the true concordance between imputed and actual genotype data, and the concordances at a fixed R2 threshold vary considerably across different reference panels. For example, in our analysis of Chinese data, the ChinaMAP panel achieved a good performance without any R2 filtering. In contrast, the TOPMed panel required an R2 threshold between 0.60 and 0.70 to reach comparable performance. The deviation may result from genetic differences between the study and reference populations32. Therefore, panel-specific R2 thresholds should be considered rather than applying a uniform threshold for quality control.
This study provides actionable guidelines for genotype imputation and quality control in GWAS of East and Southeast Asian populations. For studies involving the Chinese population, the ChinaMAP and CHN100k panels are recommended due to their superior coverage and accuracy compared with other reference panels. However, we note that ChinaMap does not support the imputation of indels, as well as variants within the HLA region, although it showed the best performance in the Chinese population. Given the high genetic diversity of HLA alleles and their critical role in immune response, there is a need for specialized tools tailored to high-quality HLA allele imputation. In addition, we also showed that the imputation performance across reference panels is unlikely to be affected by the choice of SNP arrays (Supplementary Figs. 3 and 5).
For Thai population studies, the SEAD panel demonstrated broader coverage and slightly higher concordance with actual genotypes when the R2 threshold is increased to 0.30 or above. Thus, we recommend using the SEAD panel for imputing Thai population data, while applying an R2 threshold of at least 0.30 to ensure reliable variant quality.
Although the total number of samples in the SEAD panel is comparable to ChinaMAP, its more diverse composition likely accounts for the better performance observed in the Thai dataset. The SEAD panel included 4480 Chinese samples, 4563 individuals from the SG10K project6, 1031 from the GAsP pilot project7, and 993 East and South Asian samples from the 1000 Genomes Project3. In contrast to ChinaMAP and CHN100k, this broader representation encompasses some individuals that are genetically closer to Thai samples, which likely contributes to the gains in imputation accuracy. In addition, we found that all three panels (SEAD, ChinaMAP, and CHN100k) used the SHAPEIT algorithm for phasing, though with different versions (SHAPEIT233 for SEAD and SHAPEIT434 for ChinaMAP and CHN100k). In this study, we phased all test samples using the same algorithm and then performed imputation using Minimac414. Thus, the stronger performance of SEAD in the Thai population is unlikely to stem from technical differences and is more plausibly attributed to its broader representation of populations genetically closer to Thai samples.
However, our results also highlight the influence of R2 thresholds on the performance of Thai samples: without filtering, ChinaMAP and 1KG performed slightly better (Fig. 2b and Fig. 4b), whereas at thresholds of 0.30 or above, SEAD becomes the preferred panel (Fig. 3b). In addition, the Thai dataset exhibited greater variation in imputation quality, with 168 genomic windows showing significant differences across reference panels, compared to 37 windows in the Chinese dataset (Supplementary Fig. 6 and Supplementary Data 1-2). These findings underscore the need for more tailored and optimized reference panels to enhance imputation accuracy for this population. Notably, the Genomics Thailand project has been launched, which could provide a valuable foundation for developing an optimized imputation panel for the Thai population.
More importantly, our study suggests that recent selection may contribute to the imputation variations across different reference panels. This is exemplified by the olfactory receptor gene cluster on chromosome 11, which has been reported to be under recent positive selection in the Chinese population (Fig. 5). The selected haplotype (CTT) in this region is highly prevalent in East Asian populations but nearly absent in others. Due to the unique haplotype structure, imputation quality can be significantly improved when using ancestry-matched reference panels. Previous studies on the lactase (LCT) gene also highlight potential limitations of genotype imputation in the regions under recent selections35. In addition, recent investigations in type 2 diabetes (T2D)-associated loci demonstrated that haplotype differences between study cohorts and the reference panels can skew imputation results, favoring allelic calls that are more common in the reference panel36. Taken together, these findings underscore the importance of using a matched reference panel to improve imputation accuracy in regions with varying haplotype frequencies across populations.
Our studies also revealed that certain regions, including multiple immune-related genes such as the IGH, IGK@, IGL@, and TRB loci, were poorly imputed in East and Southeast Asian Populations. The complex structure and evidence of recent selection within these regions26,27,28 suggest that targeted long-read sequencing approaches are necessary to refine them in an ancestry-specific manner. In addition, genes involved in MIP-1beta signaling, LILR activity, and C-C motif chemokine production, playing an important role in the immune system37,38,39,40, were also enriched in these poorly imputed regions. These findings suggest that certain essential association signals related to immune-related diseases or traits may have been ignored in previous studies. However, it is important to note that these results may be influenced by the SNP array platform used in our analysis, which was limited to the Infinium ASA platform. Further studies are needed to determine whether imputation quality in these regions can be improved using other genotyping platforms.
We also identified other important genomic regions that were poorly imputed (Supplementary Data 3-4). One example lies on chromosome 15 (25,000,001–32,200,000 bp), which overlaps with 15q11–q13. This locus is prone to structural rearrangements and is a known hotspot for recurrent copy number variations. Several genes within this region, including SNORD116, HERC2, UBE3A, SNHG14 (UBE3A-AS1), and CHRNA7, have been linked to neurodevelopmental disorders41,42. Another example is a region on chromosome 17 (45,800,001–46,300,000 bp) that contains MAPT, which encodes the tau protein, a hallmark of Alzheimer’s disease and frontotemporal dementia43,44. The MAPT locus is flanked by three highly homologous low-copy repeats (LCRs) spanning approximately 1.7 Mb43, which likely complicates accurate genotype imputation. These poorly imputed regions highlight areas that deserve closer attention in future studies.
Finally, we demonstrated the significant impact of reference panel selection on disease risk prediction, using SLE as a model disease. Our results showed notable improvements in predictive power when using ancestry-optimized reference panels. Notably, filtering variants based on an R2 threshold would reduce predictive power. We chose SLE due to the following reasons: 1) it is a prototypic autoimmune disease with a strong genetic component (heritability 43–66%45,46,47); 2) although most GWAS findings to date have been biased toward European populations, GWAS findings for SLE are mainly based on Asian populations19,31,48; 3) we had access to a relatively large, well-curated SLE dataset for robust analysis.
We acknowledge several limitations in this study. First, although our analyses focused on Chinese and Thai populations, which represent important groups, they do not capture the full genetic diversity of East and Southeast Asians. Second, the Thai WGS dataset was relatively small, which may limit the power to detect low-frequency variants. Third, our PRS analyses were restricted to SLE, and further studies are needed to determine whether these patterns hold for other diseases or continuous phenotypes.
In summary, this study provides valuable guidelines for future GWAS in East and Southeast Asian populations. More importantly, it highlights the importance of ongoing efforts to develop ancestrally diverse reference panels, as recent selection may generate ancestry-specific haplotypes that are challenging to capture without sufficient representation. Together with advancements in long-read sequencing technologies, these efforts will be essential for addressing current disparities in imputation quality and enhancing the accuracy of disease prediction.
Methods
Ethics approval
This study was approved by the institutional review boards, including the ethical committee from the Hospital Authority Hong Kong West Cluster (UW 07-119) and the Faculty of Medicine Ramathibodi Hospital, Mahidol University (12-58-12). All patients signed a written informed consent to participate in the study. All ethical regulations relevant to human research participants were followed.
Sample overview
This study collected a total of 8438 individuals from Chinese and Thai populations. Of these, 3198 Chinese samples and 3830 Thai samples were sourced from our earlier studies and genotyped using the Infinium Asian Screening Array-24 v1.0 (ASA)18,19. To assess imputation accuracy, we recruited 1353 samples of Chinese ancestry from Hong Kong, China, and 57 samples of Thai ancestry from Bangkok, Thailand. Whole-genome DNA from these individuals was sequenced using the DNBSEQ platform by BGI Genomics, with a read length of 150 base pairs (PE150).
Variant calling
NVIDIA Parabricks v4.3.0 software suite49 was used to identify germline variants for WGS data. The software suite significantly accelerated genomic analysis by utilizing graphics processing units (GPUs) instead of traditional central processing units (CPUs), achieving faster processing speeds while maintaining accuracy equivalent to the standard GATK (Genome Analysis Toolkit) best practices20,21,22.
The analysis followed a series of steps to process the sequencing data. Raw paired-end sequencing data were aligned to the GRCh38 reference genome using the BWA-mem algorithm. PCR-induced duplicate fragments were removed with the MarkDuplicates algorithm. Base Quality Score Recalibration (BQSR) was applied to correct sequencing quality discrepancies that arise from different sequencing cycles and contexts. Variants in individual samples were subsequently detected using the HaplotypeCaller algorithm. Joint genotyping was performed using Glnexus50, a tool designed for large-scale genomic variant analysis.
In addition, vcftools51 was used to refine the data by filtering out variants with genotype quality (GQ) less than 20 or depth less than 5. Variants with a missing rate exceeding 20% among samples were removed. “Half-called” variants, referring to genotypic calls that are incomplete or ambiguous within the variant calling process, were excluded from the analysis. Following these steps, 46,218,832 variants on autosomes were retained in the Chinese WGS dataset, consisting of 42,751,494 SNVs and 3,467,338 indels. Similarly, 15,604,227 variants on autosomes were retained in the Thai WGS dataset, including 12,598,039 SNVs and 3,006,188 indels. The transition/transversion (Ti/Tv) ratio for individuals ranged from 2.03 to 2.06 in both datasets.
Quality control and genotype imputation
To enhance phasing accuracy, the WGS samples were merged with the ASA-genotyped samples based on their shared variants. Quality control steps were applied to the merged datasets using PLINKv1.952. Variants with missingness greater than 0.05, MAF less than 0.01, or failing the Hardy-Weinberg equilibrium (HWE) test (P-value < 1.00E-05) were removed. Samples identified as potential duplicates (identical-by-descent relationship PIHAT > 0.90), or exhibiting abnormal heterozygosity levels (|F-score|> 0.1) were also excluded. After quality control, 8316 samples from both populations were retained for downstream analyses (Supplementary Table 1).
Eagle (version 2.4.1)53 was then used to phase samples from the WGS-ASA combined dataset for each population without using reference. The Genetic map generated by the HapMap project54 was utilized during this process. The phased datasets from Chinese and Thai populations were submitted to the CHN100k12, ChinaMAP55, Westlake10, TOPMed5, and Michigan14 Imputation Servers, respectively. The 1KG reference panel was included in the Michigan Imputation Server, and the SEAD reference panel was included in the Westlake Imputation Server. The Minimac4 algorithm14 was used for genotype imputation across all imputation servers.
Evaluation of imputation quality
To evaluate the performance of different reference panels, high-quality WGS data was used as the gold standard. Variants overlapping with those on the Infinium ASA array were extracted for imputation. After imputation, the imputed genotypes were compared with the actual genotypes determined by WGS to assess imputation performance across different reference panels.
Heterozygosity concordance rate and IQS score16,17 were used to evaluate imputation performance across reference panels. The heterozygosity concordance rate was calculated by comparing the observed heterozygous genotypes from WGS with the imputed genotypes for each individual. The IQS score for each variant was computed by subtracting the chance agreement (\({P}_{c}\)) from the observed agreement (\({P}_{o}\)), and then dividing the result by the maximum possible agreement excluding the chance agreement (Eq. 1). The calculation is represented by the following equations16,17:
Where \({P}_{o}\) represents the observed concordance rate for different genotype classes, calculated as the sum of estimated genotype probabilities for each matching genotype class (Eq. 2). \({P}_{11}\) denotes the estimated genotype probability for the homozygous reference genotype (0/0) in cases where the true genotype, determined by WGS, is 0/0 and the imputed genotype also matches as 0/0. Similarly, \({P}_{22}\) and \({P}_{33}\) represent the estimated genotype probabilities for the heterozygous genotype (0/1) and the alternative homozygous genotype (1/1), respectively, in cases where both the true and imputed genotypes align. \({N}_{11}\), \({N}_{22}\) and \({N}_{33}\) refer to the number of individuals with matched genotypes for the homozygous reference (0/0), heterozygous (0/1), and alternative homozygous (1/1) categories, respectively. \(N\) denotes the total number of individuals.
The chance agreement (\({P}_{c}\)) is calculated as the sum of the products of marginal frequencies, representing the expected agreement if genotypes were randomly assigned based on their respective marginal rates (Eq. 3). Specifically, W1, W2, and W3 refer to the marginal frequency for the observed genotype 0/0, 0/1, and 1/1, respectively, while Y1, Y2, and Y3 represent the marginal frequency for the imputed genotype 0/0, 0/1, and 1/1, respectively, as illustrated in Supplementary Table 9.
For the analysis of the Chinese dataset, the MAF categories were defined using 1263 Chinese WGS samples. For the analysis of the Thai dataset, the MAF categories were defined based on WGS data derived from 1KG East Asian populations, due to a limited number of WGS samples in the Thai dataset.
Impact of SNP array choice on imputation performance
To streamline the analysis, we focused on variants located on chromosome 1 within the Chinese dataset, which had a much larger sample size. After quality control, 27,419 GSA variants were included in the analysis, 27% of which overlapped with variants on the Infinium ASA array. Phasing was performed using Eagle (version 2.4.1)53 based on these GSA variants, and the phased datasets were then used for imputation across different reference panels. To assess imputation performance using variants onthe Infinium GAS array, accuracy metrics such as imputation R2, heterozygosity concordance rate, and IQS were computed for comparison across different reference panels.
Region-based analysis
For regional analysis, 9,781,349 SNVs shared by all reference panels and the Chinese WGS dataset were used in the Chinese dataset. Similarly, 6,299,991 shared SNVs were used for analysis in the Thai dataset. To conduct this analysis, the genome was divided into windows of 100 kbp. This resulted in 26,352 windows in the Chinese dataset and 26,029 windows in the Thai dataset, excluding the HLA region. The difference in the number of windows between the two datasets reflected different variants included in each analysis (Supplementary Table 8).
To account for differences in allele frequencies across genomic regions, we assessed imputation quality using the IQS score. For each genomic window, the mean IQS score for each reference panel and the variation in IQS across panels were calculated. Windows with a mean IQS below 0.4 across all panels were classified as regions with poor imputation quality, while those with a variance in IQS exceeding 0.02 across panels were categorized as regions with significant differences. Regions identified as having poor imputation quality in both the Chinese and Thai datasets were further analyzed. From these regions, 542 genes were identified and used for enrichment analysis with ToppGene56 (Supplementary Data 5).
To investigate whether genomic regions with substantial variation in imputation accuracy across reference panels were enriched in regions under recent positive selection, we analyzed selection signals using the SDS method24. Compared to extended haplotype heterozygosity (EHH), integrated haplotype score (iHS), and cross-population extended haplotype homozygosity (XP-EHH) approaches, the SDS method offers greater sensitivity and power in identifying more recent selection events, occurring within the past 2000–3000 years24. The SDS data were sourced from the RePoS database, including 3946 WGS Chinese samples25. We focused on regions with the best performance in Chinese ancestry-matched panels. We excluded regions with an insufficient number of variants (<20 variants) in the RePoS database. For comparison, a set of 22,540 genomic regions with an IQS variance of less than 0.005 was used as a benchmark. Fisher’s exact test was then conducted to examine whether genomic regions with substantial variation in imputation accuracy were more likely to be enriched in regions under recent positive selection compared to regions without significant variation. Due to insufficient data for assessing recent positive selection in the Thai population, this analysis was not performed on the Thai dataset.
Calculation of polygenic risk scores
The polygenic risk score (PRS) of SLE was calculated for each individual using the equation provided below:
Where \({\beta }_{i}\) represents the effect size of the ith SLE-associated variant, \({{Dosage}}_{i}\) represents the corresponding allele dosage obtained from imputation results, and n is the total number of associated variants used in the calculation. In this study, a total of 239 SLE-associated variants outside of the HLA region were extracted from previous studies19,29 and their genetic effect on SLE development was summarized in Supplementary Table 10.
To assess the impact of reference panel selection on SLE prediction, the PRS was calculated using 203 SLE-associated variants that were shared across the six reference panels and the Chinese WGS datasets. No R2 filtering was applied in this analysis. For the WGS samples in the Chinese population, we evaluated the accuracy of PRSs by comparing those derived from imputed genotype dosages with PRSs calculated from sequencing-based genotypes, using the Pearson Correlation Coefficient (r) to measure the agreement.
The performance of various reference panels on SLE prediction for the Chinese ASA-WGS combined dataset was assessed using the AUC score. To examine the impact of different imputation R2 thresholds on SLE prediction, all SLE-associated variants were analyzed. For each R2 threshold, PRSs were calculated using the associated variants retained in each reference panel, and the corresponding AUC scores were calculated. The AUC scores and the 95% confidence intervals were calculated using the pROC57 package.
In addition, lassosum, one of the most effective genome-wide PRS methods30,58, was also applied to the imputed datasets from different reference panels. Previous SLE GWAS summary statistics from Japanese and Korean populations31 (5,369 cases and 233,167 controls) were used to train the model. Linkage disequilibrium (LD) among variants was estimated from the testing dataset.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
The details of the datasets used for conducting this research are provided here: The data for the 1000 Genomes Project Phase 33 are available at https://www.internationalgenome.org/. The Genetic map (hg38)54 is publicly available at https://storage.googleapis.com/broad-alkesgroup-public/Eagle/downloads/tables/genetic_map_hg38_withX.txt.gz. The RePoS SDS dataset is available at http://bigdata.ibp.ac.cn/RePoS/download.php. All other data supporting the findings of this study are available from the corresponding author upon reasonable request.
Code availability
Custom code for variant calling, quality control, and IQS calculation can be found on GitHub (https://github.com/LI-DINGYANG/Synecdoche). Custom code for variant calling, quality control, and IQS calculation can also be found on Zenodo (https://doi.org/10.5281/zenodo.17472213)59. CHN100k Imputation Server12: http://bioinformatics.hit.edu.cn/imputation/ ChinaMAP Imputation Server55: http://www.mbiobank.com/imputation/ Westlake Imputation Server10: https://imputationserver.westlake.edu.cn/index.html TOPMed Imputation Server5: https://imputation.biodatacatalyst.nhlbi.nih.gov/ Michigan Imputation Server14: https://imputationserver.sph.umich.edu/index.html PLINK52 (v1.9): http://www.cog-genomics.org/plink/1.9/ EAGLE253 (v2.4.1): https://alkesgroup.broadinstitute.org/Eagle/ VCFtools51 (v0.1.16): https://vcftools.github.io/index.html GLnexus50 (v1.4.1): https://github.com/dnanexus-rnd/GLnexus Clara Parabricks49 (v4.3.0-1): https://www.nvidia.com/en-us/clara/genomics/ Toppgene56: https://toppgene.cchmc.org/ Locuszoom60: https://github.com/statgen/locuszoom Lassosum30: https://github.com/tshmak/lassosum pROC package57: https://www.expasy.org/resources/proc Geography of Genetic Variants Browser61: https://popgen.uchicago.edu/ggv/
References
Das, S., Abecasis, G. R. & Browning, B. L. Genotype Imputation from Large Reference Panels. Annu. Rev. Genom. Hum. Genet. 19, 73–96 (2018).
Sengupta, D. et al. Performance and accuracy evaluation of reference panels for genotype imputation in sub-Saharan African populations. Cell Genom. 3, 100332 (2023).
Auton, A. et al. A global reference for human genetic variation. Nature 526, 68–74 (2015).
McCarthy, S. et al. A reference panel of 64,976 haplotypes for genotype imputation. Nat. Genet. 48, 1279–1283 (2016).
Taliun, D. et al. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. Nature 590, 290–299 (2021).
Wu, D. et al. Large-scale whole-genome sequencing of three diverse Asian populations in Singapore. Cell 179, 736–749.e15 (2019).
Wall, J. D. et al. The GenomeAsia 100 K Project enables genetic discoveries across Asia. Nature 576, 106–111 (2019).
Cahoon, J. L. et al. Imputation accuracy across global human populations. Am. J. Hum. Genet. 111, 979–989 (2024).
Martin, A. R. et al. Clinical use of current polygenic risk scores may exacerbate health disparities. Nat. Genet. 51, 584–591 (2019).
Cong, P.-K. et al. Genomic analyses of 10,376 individuals in the Westlake BioBank for Chinese (WBBC) pilot project. Nat. Commun. 13, 2939 (2022).
Cao, Y. et al. The ChinaMAP analytics of deep whole genome sequences in 10,588 individuals. Cell Res. 30, 717–731 (2020).
Jiang, T. et al. A comprehensive genetic variant reference for the Chinese population. Sci. Bull. https://doi.org/10.1016/j.scib.2024.06.017 (2024).
Yang, M.-Y. et al. SEAD reference panel with 22,134 haplotypes boosts rare variant imputation and genome-wide association analysis in Asian populations. Nat. Commun. 15, 10839 (2024).
Das, S. et al. Next-generation genotype imputation service and methods. Nat. Genet. 48, 1284–1287 (2016).
Shi, M. et al. Genotype imputation accuracy and the quality metrics of the minor ancestry in multi-ancestry reference panels. Brief. Bioinform. 25, bbad509 (2024).
Lin, P. et al. A new statistic to evaluate imputation reliability. PLOS ONE 5, e9697 (2010).
Ramnarine, S. et al. When does choice of accuracy measure alter imputation accuracy assessments?. PLoS ONE 10, e0137601 (2015).
Wang, Y.-F. et al. Identification of shared and asian-specific loci for systemic lupus erythematosus and evidence for roles of type iii interferon signaling and lysosomal function in the disease: a multi-ancestral genome-wide association study. Arthritis Rheumatol. 74, 840–848 (2022).
Wang, Y.-F. et al. Identification of 38 novel loci for systemic lupus erythematosus and genetic heterogeneity between ancestral groups. Nat. Commun. 12, 772 (2021).
Van der Auwera, G. A. et al. From FastQ data to high-confidence variant calls: the genome analysis toolkit best practices pipeline. Curr. Protoc. Bioinforma. 43, 11.10.1–11.10.33 (2013).
McKenna, A. et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303 (2010).
DePristo, M. A. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat. Genet. 43, 491–498 (2011).
Roshyara, N. R., Kirsten, H., Horn, K., Ahnert, P. & Scholz, M. Impact of pre-imputation SNP-filtering on genotype imputation results. BMC Genet. 15, 88 (2014).
Field, Y. et al. Detection of human adaptation during the past 2000 years. Science 354, 760–764 (2016).
Luo, H. et al. Recent positive selection signatures reveal phenotypic evolution in the Han Chinese population. Sci. Bull. 68, 2391–2404 (2023).
Engelbrecht, E. et al. Resolving haplotype variation and complex genetic architecture in the human immunoglobulin kappa chain locus in individuals of diverse ancestry. Genes Immun. 25, 297–306 (2024).
Watson, C. T. et al. Complete haplotype sequence of the human immunoglobulin heavy-chain variable, diversity, and joining genes and characterization of allelic and copy-number variation. Am. J. Hum. Genet. 92, 530–546 (2013).
Rodriguez, O. L. et al. Genetic variation in the immunoglobulin heavy chain locus shapes the human antibody repertoire. Nat. Commun. 14, 4419 (2023).
Laurynenka, V. & Harley, J. B. The 330 risk loci known for systemic lupus erythematosus (SLE): a review. Front. Lupus 2, 1398035 (2024).
Mak, T. S. H., Porsch, R. M., Choi, S. W., Zhou, X. & Sham, P. C. Polygenic scores via penalized regression on summary statistics. Genet. Epidemiol. 41, 469–480 (2017).
Yin, X. et al. Meta-analysis of 208370 East Asians identifies 113 susceptibility loci for systemic lupus erythematosus. Ann. Rheum. Dis. 80, 632 (2021).
Gurdasani, D. et al. Uganda genome resource enables insights into population history and genomic discovery in Africa. Cell 179, 984–1002.e36 (2019).
Delaneau, O., Zagury, J.-F. & Marchini, J. Improved whole-chromosome phasing for disease and population genetic studies. Nat. Methods 10, 5–6 (2013).
Delaneau, O., Zagury, J.-F., Robinson, M. R., Marchini, J. L. & Dermitzakis, E. T. Accurate, scalable and integrative haplotype estimation. Nat. Commun. 10, 5436 (2019).
Ali, A. T., Liebert, A., Lau, W., Maniatis, N. & Swallow, D. M. The hazards of genotype imputation in chromosomal regions under selection: a case study using the Lactase gene region. Ann. Hum. Genet. 86, 24–33 (2022).
Lau, W. et al. The hazards of genotype imputation when mapping disease susceptibility variants. Genome Biol. 25, 7 (2024).
Hughes, C. E. & Nibbs, R. J. B. A guide to chemokines and their receptors. FEBS J. 285, 2944–2971 (2018).
Menten, P., Wuyts, A. & Van Damme, J. Macrophage inflammatory protein-1. Cytokine Growth Factor Rev. 13, 455–481 (2002).
Barrow, A. D. & Trowsdale, J. The extended human leukocyte receptor complex: diverse ways of modulating immune responses. Immunol. Rev. 224, 98–123 (2008).
McLaren, P. J. et al. Polymorphisms of large effect explain the majority of the host genetic contribution to variation of HIV-1 virus load. Proc. Natl Acad. Sci. 112, 14658–14663 (2015).
Paparella, A. et al. Structural variation evolution at the 15q11-q13 disease-associated locus. Int. J. Mol. Sci. 24, 15818 (2023).
Kirov, G. et al. The penetrance of copy number variations for schizophrenia and developmental delay. Schizophr. Rare Gene Var. Uncommon Syndr. 75, 378–385 (2014).
Cruts, M. et al. Genomic architecture of human 17q21 linked to frontotemporal dementia uncovers a highly homologous family of low-copy repeats in the tau region. Hum. Mol. Genet. 14, 1753–1762 (2005).
Myers, A. J. et al. The H1c haplotype at the MAPT locus is associated with Alzheimer’s disease. Hum. Mol. Genet. 14, 2399–2404 (2005).
Kuo, C.-F. et al. Familial aggregation of systemic lupus erythematosus and coaggregation of autoimmune diseases in affected families. JAMA Intern. Med. 175, 1518–1526 (2015).
Lawrence, J. S., Martins, C. L. & Drake, G. L. A family survey of lupus erythematosus. 1. Heritability. J. Rheumatol. 14, 913–921 (1987).
Wang, J. et al. Systemic lupus erythematosus: a genetic epidemiology study of 695 patients from China. Arch. Dermatol. Res. 298, 485–491 (2007).
Guga, S., Wang, Y., Graham, D. C. & Vyse, T. J. A review of genetic risk in systemic lupus erythematosus. Expert Rev. Clin. Immunol. 19, 1247–1258 (2023).
NVIDIA (2024). Clara Parabricks (v4.3.0-1). https://www.nvidia.com/en-us/clara/genomics/ (2024).
Lin, M. F. et al. GLnexus: joint variant calling for large cohort sequencing. bioRxiv 343970. https://doi.org/10.1101/343970 (2018).
Danecek, P. et al. The variant call format and VCFtools. Bioinformatics 27, 2156–2158 (2011).
Chang, C. C. et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience 4, s13742–015-0047–8 (2015).
Loh, P.-R. et al. Reference-based phasing using the Haplotype Reference Consortium panel. Nat. Genet. 48, 1443–1448 (2016).
Altshuler, D. & Donnelly, P. & The International HapMap Consortium. A haplotype map of the human genome. Nature 437, 1299–1320 (2005).
Li, L. et al. The ChinaMAP reference panel for the accurate genotype imputation in Chinese populations. Cell Res. 31, 1308–1310 (2021).
Chen, J., Bardes, E. E., Aronow, B. J. & Jegga, A. G. ToppGene Suite for gene list enrichment analysis and candidate gene prioritization. Nucleic Acids Res. 37, W305–W311 (2009).
Robin, X. et al. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinforma. 12, 77 (2011).
Pain, O. et al. Evaluation of polygenic prediction methodology within a reference-standardized framework. PLOS Genet. 17, e1009021 (2021).
Li, D. Imputation disparities driven by recent selection and their impact on disease risk estimation in East and Southeast Asian populations. https://doi.org/10.5281/zenodo.17472213 (2025).
Boughton, A. P. et al. LocusZoom.js: interactive and embeddable visualization of genetic association study results. Bioinformatics 37, 3017–3018 (2021).
Marcus, J. H. & Novembre, J. Visualizing the geography of genetic variants. Bioinformatics 33, 594–595 (2017).
Acknowledgements
We gratefully acknowledge funding support from the Shenzhen-Hong Kong Jointly Funded Project (Category A; SGDX20230116093201002), the Stability Support for Higher Education from Shenzhen Science and Technology Program, the Sanming Project of Medicine in Shenzhen (SZSM202411006), the Guangdong Natural Science Foundation Youth Enhancement Project (2024A1515030287), the Guangdong Provincial Science and Technology Department (GDSTC; 2024A0505050001), the 1 + 1 + 1 CUHK-CUHK(SZ)-GDSTC Joint Collaboration Fund (2025A0505000056), the Guangdong Basic Research Center of Excellence for Aggregate Science (C10120250191), the National Natural Science Foundation of China (Grant No. 82471825), National Research Foundation of Korea (NRF-2021R1A6A1A03038899) and the National Facility & Equipment Center (NFEC-2023R1A6C101A009). We also extend our thanks to the Warshel Institute for Computational Biology and their funding support from Shenzhen City and Longgang District (LGKCSDPT2024001). The authors thank the anonymous reviewers for their constructive comments and valuable suggestions that helped to improve the quality of the manuscript.
Author information
Authors and Affiliations
Contributions
Y.-F. Wang conceived the study. D. Li took the lead in data analysis. P. Tangtanatakul, Y. Lei, H-Y. Huang, Y.-C.-D. Lin, C. Li, Y. Chen, L. Cai, J. Zhao, P. Pisitkul, T. Suangtamai, J. Yu, Y. Zhou, Y. Xu, Y.Xiao, P. Kunhapan, R. Sun, G. Yu, H. Sun, N. Hirankarn, M. Wang, H.-D, Huang, W. Yang undertook subject recruitment and collected phenotype data. Y. Ishikawa, C. Terao, K. Kim and S.-C. Bae provided SLE summary statistics from East Asian populations for PRS training. D. Li, P. Tangtanatakul, Y. Lei, X. Liu, W. Yang and Y.-F. Wang carried out data analyses and interpretation. D. Li, P. Tangtanatakul and Y.-F. Wang wrote the manuscript. All authors read and contributed to the manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Communications Biology thanks Jeffrey Beck and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editors: Eirini Marouli and Aylin Bircan. A peer review file is available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Li, D., Tangtanatakul, P., Lei, Y. et al. Imputation disparities driven by recent selection and their impact on disease risk estimation in East and Southeast Asian populations. Commun Biol 8, 1822 (2025). https://doi.org/10.1038/s42003-025-09214-1
Received:
Accepted:
Published:
Version of record:
DOI: https://doi.org/10.1038/s42003-025-09214-1








