Imputation disparities driven by recent selection and their impact on disease risk estimation in East and Southeast Asian populations

Li, Dingyang; Tangtanatakul, Pattarin; Lei, Yao; Liu, Xiaoxi; Huang, Hsi-Yuan; Lin, Yang-Chi-Dung; Li, Chengjia; Chen, Yidan; Cai, Lizhi; Zhao, Jinglu; Pisitkul, Prapaporn; Suangtamai, Thanitta; Yu, Jinhan; Zhou, Yihang; Xu, Yuan; Xiao, Yue; Kunhapan, Punna; Sun, Rui; Yu, Guangjun; Sun, Hao; Hirankarn, Nattiya; Ishikawa, Yuki; Terao, Chikashi; Kim, Kwangwoo; Bae, Sang-Cheol; Wang, Meiying; Huang, Hsien-Da; Yang, Wanling; Wang, Yong-Fei

doi:10.1038/s42003-025-09214-1


Article
Open access
Published: 21 November 2025

Communications Biology volume 8, Article number: 1822 (2025)

Abstract

Accurate genotype imputation is essential for large-scale genetic studies and precision medicine. While East Asian (EAS)-specific reference panels like ChinaMAP and CHN100k have been developed, most studies still rely on multi-ancestry panels like TOPMed because of their large sample sizes. However, their performance in underrepresented groups like Southeast Asians remains unclear. Using high-coverage whole-genome sequencing and SNP-array data from 8,316 Chinese and Thai individuals, we systematically evaluate six state-of-the-art reference panels for genotype imputation. Our results show that EAS-specific panels outperformed multi-ancestry panels for East and Southeast Asian populations. For example, ChinaMAP achieves a mean heterozygosity concordance rate above 0.90 without R² filtering, whereas TOPMed requires an R² threshold of 0.60-0.70 to achieve comparable results. Notably, we find that recent positive selection drives regional disparities in imputation accuracy, as illustrated by the olfactory receptor gene cluster. More importantly, our results indicate that the choice of reference panel and R² thresholds have a significant impact on polygenic risk score estimation for disease prediction. These findings provide valuable guidelines for improving genotype imputation in East and Southeast Asian populations and underscore the need for ancestrally diverse reference panels to support globally equitable genomic research.

Introduction

Genotype imputation is a statistical method used to infer missing genotypes in individual samples. It is widely applied in genome-wide association studies (GWASs) to increase the density of genetic variants and facilitate meta-analyses across studies with varied genotyping platforms¹. The accuracy of imputation relies heavily on the characteristics of the reference panel, with sample size, ancestral composition, and haplotype diversity being crucial determinants².

Reference panels produced by the 1000 Genomes Project³ (1KG Phase3 v5; n = 2504), Haplotype Reference Consortium⁴ (HRC; n = 32,470), and the Trans-Omics for Precision Medicine⁵ (TOPMed r3; n = 133,597) are widely adopted in most GWASs since they offer large sample sizes and rich haplotype data. However, East and Southeast Asian populations remain underrepresented in these reference panels, despite their substantial contribution to global diversity^6,7. Previous studies have shown that the underrepresentation in the TOPMed reference panel resulted in reduced imputation accuracy for individuals of Asian ancestry compared to those of European ancestry⁸. The selection of appropriate reference panels is essential for uncovering ancestry-specific disease mechanisms and ensuring equitable translational applications of genetic research⁹.

Recent efforts have established specific reference panels for these East and Southeast Asian populations. Notable examples include the Westlake BioBank for Chinese¹⁰ (WBBC; n = 4489), China Metabolic Analytics Project¹¹ (ChinaMAP; n = 10,155), CHN100k¹² (n = 25,734), and South and East Asian Reference Database¹³ (SEAD; n = 11,067). However, these reference panels remain underutilized in current studies, partially due to limited evaluation and uncertainty regarding their optimal use in imputation applications.

To enhance the applicability of current reference panels for East and Southeast Asian populations, we curated an SNP array dataset comprising 6997 individuals of Chinese and Thai ancestry. In addition, we generated a high-coverage whole-genome sequencing (WGS) dataset containing a total of 1319 individuals from both populations. These datasets enable a more comprehensive evaluation of imputation quality across various reference panels.

Imputation quality is commonly evaluated using metrics such as INFO or R², which assess the reliability by measuring the proportion of variation explained by the imputation¹⁴. However, these metrics can become biased when the reference panels inadequately represent the target population, particularly for low-frequency variants¹⁵. Despite this limitation, most studies apply a fixed threshold (e.g., R² > 0.30) to filter imputed data, potentially compromising downstream analyses.

Genotype concordance rate, which measures the proportion of correctly imputed genotypes for individuals, is widely regarded as the gold standard for evaluating imputation quality. However, this metric can be misleading for variants with skewed allele frequencies. For example, when a variant’s minor allele frequency (MAF) is below 5%, simply predicting the major alleles for all samples can yield over 90% accuracy, regardless of true genotype status¹⁶. Two alternative metrics are used to address this limitation: the heterozygosity concordance rate, which only evaluates concordance for heterozygous genotypes for individuals, and the imputation quality score (IQS)^16,17, which accounts for chance agreement of genotypes due to biased allele frequency. These metrics provide a more reliable assessment of imputation accuracy.
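The contrast between these metrics can be illustrated with a minimal Python sketch. It assumes hard-call genotypes coded 0/1/2 and writes IQS in its chance-corrected (kappa) form; the published IQS can also be computed from genotype posterior probabilities, which this sketch does not do. The toy variant and the function names are hypothetical and are not taken from the study's pipeline.

```python
from collections import Counter

def genotype_concordance(true, imputed):
    """Fraction of samples whose imputed genotype matches the true genotype."""
    return sum(t == i for t, i in zip(true, imputed)) / len(true)

def het_concordance(true, imputed):
    """Concordance restricted to samples that are truly heterozygous (coded 1)."""
    calls = [i for t, i in zip(true, imputed) if t == 1]
    return sum(i == 1 for i in calls) / len(calls) if calls else float("nan")

def iqs(true, imputed):
    """Chance-corrected agreement (kappa form) over genotype classes 0, 1, 2."""
    n = len(true)
    p_obs = genotype_concordance(true, imputed)
    t_counts, i_counts = Counter(true), Counter(imputed)
    # Agreement expected by chance from the marginal genotype frequencies.
    p_chance = sum(t_counts[g] * i_counts[g] for g in (0, 1, 2)) / n**2
    if p_chance == 1.0:
        return float("nan")  # undefined when only one class is present
    return (p_obs - p_chance) / (1 - p_chance)

# Hypothetical variant with MAF = 0.04: 92 major homozygotes, 8 heterozygotes.
true_gt = [0] * 92 + [1] * 8
naive = [0] * 100  # always predict the major homozygote

print(genotype_concordance(true_gt, naive))
print(het_concordance(true_gt, naive))
print(iqs(true_gt, naive))
```

For this toy variant the naive predictor reaches a raw concordance of 0.92, while its heterozygosity concordance and IQS are both 0, which is the failure mode the two alternative metrics are meant to expose.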

In this study, we aimed to identify suitable reference panels for Chinese and Thai populations. Furthermore, we also examined potential factors driving differences in genotype imputation, an important question that has not been fully explored to date. Our results indicate that recent positive selection contributes to regional differences in imputation accuracy. Taken together, these findings not only provide practical guidelines for improving genotype imputation but also offer a mechanistic explanation for the importance of ancestry-matched reference panels.

Results

Dataset preparation

A total of 8316 samples of Chinese and Thai ancestry were included in this study following quality control (Methods; Supplementary Table 1). The Chinese dataset comprised 3194 samples previously genotyped using the Infinium Asian Screening Array-24 v1.0 (ASA)^18,19, a tailored genotyping array for Asian populations, along with 1,263 newly generated WGS samples with an average depth of 36.1. The Thai dataset consisted of 3803 ASA-genotyped samples from prior studies^18,19 and 56 newly sequenced WGS samples with an average depth of 43.8.

Genetic variants for the WGS samples were detected using the GATK pipeline^20,21,22 (Methods). Variants that passed quality control and overlapped with the Infinium ASA platform were extracted from the WGS data. These shared variants were then used to merge the WGS samples with the ASA-genotyped samples. This procedure was performed separately for the Chinese and Thai datasets, resulting in combined ASA-WGS datasets for each population. Principal component analysis (PCA) confirmed no significant difference between the ASA-genotyped and WGS samples within each combined dataset (Supplementary Fig. 1).

The combined dataset in each population was phased and imputed using six reference panels (Methods): four Asian panels (CHN100k, ChinaMAP, WBBC, and SEAD) and two global panels, 1KG (Phase3 v5) and TOPMed (r3). Differences among these reference panels are outlined in Supplementary Table 2. Imputation performance was evaluated by comparing the imputed genotypes to the genotypes detected by WGS.

Coverage of imputation across reference panels

Given that certain reference panels did not support genotype imputation for variants within the human leukocyte antigen (HLA) region (chr6: 29,722,775–33,314,387 based on the assembly GRCh38) or for indels, and showed a varied output format for indel imputation, our analysis focused on comparing the proportion of single nucleotide variants (SNVs) outside the HLA region. In the Chinese dataset, the imputed variants from the CHN100k panel showed the highest overlap with those identified by WGS among the reference panels (68.6%), followed by the TOPMed panel (59.4%) and the ChinaMAP panel (58.9%) (Supplementary Fig. 2a and Supplementary Table 3). In contrast, the 1KG panel, a widely used global reference, demonstrated the lowest overlap at 31.9%. In the Thai WGS dataset, the SEAD panel exhibited the highest overlap with the WGS-identified variants (81.2%), followed by the CHN100k panel at 78.6%, the TOPMed panel at 77.6% and the ChinaMAP panel at 76.1% (Supplementary Fig. 2b and Supplementary Table 3).

We further categorized WGS-identified variants as common (MAF ≥ 0.05) or low-frequency (MAF < 0.05). As expected, common variants had substantially higher overlap with imputed datasets than low-frequency variants (Fig. 1a and Supplementary Table 3). In the Chinese dataset, more than 85% of common variants identified by WGS overlapped with those imputed by most reference panels. For low-frequency variants, the CHN100k panel demonstrated the highest overlap at 65.2%, followed by the TOPMed panel at 55.5% and the ChinaMAP panel at 54.3%. Notably, although the 1KG panel showed a high overlap with common variants (89.2%), its overlap for low-frequency variants was considerably lower at 23.6%.
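The coverage comparison can be sketched as follows. This is an illustration under stated assumptions, not the study's code: variants are keyed by hypothetical (chromosome, position, ref, alt) tuples, SNVs are recognized by single-base alleles, and the toy variants are invented. The HLA coordinates and the 0.05 MAF cutoff come from the text.

```python
# HLA region on GRCh38 as given in the text; variants inside it are skipped.
HLA = ("chr6", 29_722_775, 33_314_387)

def outside_hla(chrom, pos):
    """True when the position does not fall inside the HLA interval."""
    return not (chrom == HLA[0] and HLA[1] <= pos <= HLA[2])

def overlap_by_maf(wgs_variants, imputed_keys, maf_cutoff=0.05):
    """Share of WGS SNVs (outside HLA) also present in an imputed set, per MAF class.

    wgs_variants: dict mapping (chrom, pos, ref, alt) -> MAF in the WGS data.
    imputed_keys: set of (chrom, pos, ref, alt) keys produced by one panel.
    """
    totals = {"common": 0, "low_frequency": 0}
    hits = {"common": 0, "low_frequency": 0}
    for key, maf in wgs_variants.items():
        chrom, pos, ref, alt = key
        if len(ref) != 1 or len(alt) != 1 or not outside_hla(chrom, pos):
            continue  # keep SNVs outside HLA only
        cls = "low_frequency" if maf < maf_cutoff else "common"
        totals[cls] += 1
        hits[cls] += key in imputed_keys
    return {c: hits[c] / totals[c] for c in totals if totals[c]}

# Hypothetical toy input: one indel and one HLA variant are filtered out.
wgs = {
    ("chr1", 1000, "A", "G"): 0.30,
    ("chr1", 2000, "C", "T"): 0.01,
    ("chr1", 3000, "G", "A"): 0.02,
    ("chr6", 30_000_000, "A", "C"): 0.20,
    ("chr2", 500, "AT", "A"): 0.10,
}
imputed = {("chr1", 1000, "A", "G"), ("chr1", 2000, "C", "T"),
           ("chr6", 30_000_000, "A", "C")}
overlap = overlap_by_maf(wgs, imputed)
print(overlap)
```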

**Fig. 1: Percentage of WGS-Identified Variants from Chinese and Thai WGS Dataset Overlapping with Imputed Variants Across Reference Panels.**

A similar pattern was observed in the Thai dataset: 97.1% of common variants identified by WGS overlapped with those imputed by the SEAD panel, followed by the 1KG panel (92.8%), the CHN100k panel (91.6%), and the ChinaMAP panel (91.4%) (Fig. 1b and Supplementary Table 3). For low-frequency variants, the SEAD panel again demonstrated the highest coverage (68.5%), outperforming the CHN100k (68.1%) and TOPMed (68.0%) panels. In contrast, the 1KG panel showed a low overlap at 56.1%.

We noted that the coverage between the Thai and Chinese datasets is not directly comparable because of the marked difference in sample size. The larger Chinese dataset is able to capture both a greater number of variants and much rarer ones than the Thai dataset. Collectively, these findings reveal that a significant proportion of WGS-identified variants, particularly low-frequency variants, remain undetected by current imputation references.

Evaluation of imputation accuracy across reference panels

To evaluate imputation accuracy, high-quality WGS samples were used as the gold standard. The imputed genotypes were then compared with the actual genotypes determined by WGS. Variants identified by WGS and shared across all imputation panels were included in the analysis. Accuracy metrics, including imputation R², heterozygosity concordance rate, and IQS, were computed for the comparison (Methods). These metrics allowed us to evaluate imputation accuracy at both variant-level (R² and IQS) and sample-level (heterozygosity concordance rate).

In the Chinese dataset, 9,781,349 SNVs shared between WGS and imputation panels were included in the analysis. We found that imputation accuracy varied with allele frequency. As expected, common variants generally showed higher imputation accuracy than low-frequency variants across all metrics. Among the reference panels, ChinaMAP consistently outperformed the others, achieving the highest scores across different MAF categories (Fig. 2a). While the CHN100k panel also performed well, its results were slightly lower than those of ChinaMAP. Notably, although the 1KG achieved the second-highest imputation R² scores for common variants, its values appeared inflated relative to actual genotypes. This was evidenced by significantly lower heterozygosity concordance rates and IQS scores compared to ChinaMAP and CHN100k panels (Fig. 2a).

**Fig. 2: Evaluation of Imputation Quality Across Reference Panels by Minor Allele Frequency Categories in Chinese and Thai WGS Datasets.**

Analysis of the Thai dataset, including 6,299,991 shared variants between WGS and panels, revealed different patterns of imputation accuracy across reference panels. Most panels demonstrated comparable performance in both heterozygosity concordance rates and IQS scores across MAF categories, except the WBBC panel (Fig. 2b). In addition, the TOPMed panel exhibited superior performance for low-frequency variants. Notably, while the SEAD panel demonstrated acceptable concordance rates, we observed consistent deflation in its imputation R² compared to the actual genotypes. Together with the findings in the Chinese dataset, these results suggest that imputation R² scores may not fully reflect the genotype concordance between imputed and actual data among reference panels.

To further evaluate the impact of SNP array choice on imputation performance, we repeated the analysis using variants included on the Infinium Global Screening Array (GSA) v3.0, which is optimized for cross-ancestry studies (Methods). We also observed that the ChinaMAP panel achieved the best heterozygosity concordance rates and IQS scores across MAF categories in the Chinese dataset, followed by the CHN100k panel (Supplementary Fig. 3). These findings were consistent with those obtained using variants on the Infinium ASA array, suggesting that the choice of SNP array has minimal impact on the trend of imputation performance among these reference panels.

R² thresholds need to be tailored based on reference panels

Due to the absence of ground-truth genotype data in real-world studies, imputation R² scores remain a critical metric for quality control of imputed variants in GWAS analyses. While current practice commonly employs a fixed R² threshold for this purpose²³, the above analyses suggest that optimal filtering thresholds should be determined through panel-specific validation studies.

For each reference panel, we used heterozygosity concordance rates and IQS scores as benchmarks to evaluate imputation accuracy across imputation R² thresholds. In the Chinese dataset, while increasing the R² threshold generally improved both heterozygosity concordance rates and IQS scores, the relationships were not strictly linear, particularly for the ChinaMAP and CHN100k panels (Fig. 3a). Notably, even without R² filtering, these two panels achieved a mean heterozygosity concordance rate above 0.90 and an average IQS score exceeding 0.70 (Fig. 3a and Supplementary Table 4, 5). Further increasing the threshold from 0 to 0.40 for the results from the ChinaMAP panel showed minimal improvement, suggesting limited benefit from applying a threshold within this range. By comparison, the 1KG and TOPMed panels required an R² threshold between 0.60 and 0.70 to achieve similar performance levels, which substantially reduced the percentage of imputed variants (Fig. 3a).
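A threshold sweep of this kind can be sketched in a few lines. The example assumes each variant carries an imputation R² and a benchmark score such as IQS, and that filtering keeps variants with R² at or above the threshold (whether the comparison is strict is an assumption). The four toy variants are hypothetical.

```python
def sweep_r2(variants, thresholds):
    """For each R² threshold, report the retained fraction and mean benchmark score.

    variants: list of (imputation_r2, benchmark_score) tuples for one panel.
    Returns a list of (threshold, fraction_retained, mean_score) rows.
    """
    rows = []
    for thr in thresholds:
        kept = [score for r2, score in variants if r2 >= thr]
        mean_score = sum(kept) / len(kept) if kept else float("nan")
        rows.append((thr, len(kept) / len(variants), mean_score))
    return rows

def minimal_threshold(rows, target):
    """Smallest threshold whose mean score reaches the target, with its retention."""
    for thr, retained, mean_score in rows:
        if mean_score >= target:
            return thr, retained
    return None  # target never reached

# Hypothetical per-variant values for a single panel.
toy = [(0.10, 0.20), (0.35, 0.50), (0.65, 0.80), (0.95, 0.98)]
rows = sweep_r2(toy, [0.0, 0.3, 0.6, 0.9])
print(minimal_threshold(rows, 0.85))
```

Running the same sweep per panel makes the trade-off explicit: a panel that only reaches the target at a high threshold does so at the cost of discarding a larger share of its imputed variants.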

**Fig. 3: Evaluation of Imputation Quality Across Reference Panels by Imputation R2 Thresholds in Chinese and Thai WGS Datasets.**

We also investigated the performance of low-frequency variants. Without R² filtering, the ChinaMAP panel achieved a mean heterozygosity concordance rate above 0.80 and an average IQS score of 0.67 (Supplementary Fig. 4a and Supplementary Table 6, 7). The robust performance underscores the advantage of using the ChinaMAP panel for downstream analyses in Chinese populations.

In the Thai dataset, the SEAD panel slightly outperformed other panels when the R² threshold was increased to 0.30 (Fig. 3b and Supplementary Table 4, 5). However, without R² filtering, the ChinaMAP and 1KG panels showed considerably higher mean IQS values than the SEAD panel (Fig. 3b). For low-frequency variants, the heterozygosity concordance rate was substantially lower compared to the Chinese dataset. Achieving a mean heterozygosity concordance rate of 0.80 required raising the R² threshold for the SEAD panel to between 0.40 and 0.50 (Supplementary Fig. 4b and Supplementary Table 6, 7). This adjustment resulted in a reduction of more than 30% of retained low-frequency variants.

We also assessed the potential impact of SNP array choice on imputation performance by repeating the analysis using variants included on the Infinium GSA array. Consistent with the results derived from variants on the Infinium ASA array, ChinaMAP still demonstrated the best performance across different R² thresholds in the Chinese dataset, followed by CHN100k (Supplementary Fig. 5). These results suggest that the observed trends in imputation performance among reference panels are unlikely to be substantially affected by the choice of SNP arrays.

Collectively, these findings highlight the critical need to optimize reference panel-specific R² thresholds that account for ancestral background, enabling balanced optimization of imputation accuracy while retaining sufficient variants for downstream analyses.

Recent selection may contribute to regional variation in imputation quality

We further subdivided autosomal regions into 100 kb windows to assess spatial patterns of imputation quality. Given that allele frequencies vary across genomic regions, we evaluated their performance using the IQS score, which accounts for biased allele frequency in imputation quality evaluation^16,17 (Methods). In the Chinese dataset, the ChinaMAP panel significantly outperformed others, yielding the highest IQS scores in over 92% of windows (Fig. 4a and Supplementary Table 8). In contrast, the Thai dataset displayed markedly different patterns: the ChinaMAP and 1KG panels were optimal in approximately 31.7% and 31.1% of windows, respectively, followed by TOPMed (12.7%) and SEAD (11.6%) (Fig. 4b). This trend generally aligned with the performance of mean IQS scores observed without R² filtering (Fig. 3b).
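The windowed comparison can be sketched as below, assuming per-variant IQS values are available for each panel and that windows are non-overlapping 100 kb bins indexed by integer division of the position. The panel labels and records are hypothetical placeholders.

```python
from collections import Counter, defaultdict

WINDOW = 100_000  # 100 kb, as in the text

def window_mean_iqs(records):
    """Average IQS per panel within each window.

    records: iterable of (panel, chrom, pos, iqs).
    Returns {(chrom, window_index): {panel: mean_iqs}}.
    """
    sums = defaultdict(lambda: defaultdict(lambda: [0.0, 0]))
    for panel, chrom, pos, score in records:
        cell = sums[(chrom, pos // WINDOW)][panel]
        cell[0] += score
        cell[1] += 1
    return {win: {p: s / n for p, (s, n) in panels.items()}
            for win, panels in sums.items()}

def best_panel_share(win_means):
    """Fraction of windows in which each panel has the highest mean IQS."""
    winners = Counter(max(m, key=m.get) for m in win_means.values())
    total = sum(winners.values())
    return {panel: count / total for panel, count in winners.items()}

# Hypothetical records for two panels across two windows on chr1.
records = [
    ("panelA", "chr1", 50_000, 0.9), ("panelA", "chr1", 60_000, 0.7),
    ("panelB", "chr1", 55_000, 0.6),
    ("panelA", "chr1", 150_000, 0.5), ("panelB", "chr1", 160_000, 0.8),
]
win_means = window_mean_iqs(records)
shares = best_panel_share(win_means)
print(shares)
```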

**Fig. 4: Overview of Reference Panel Performance Across Genomic Regions in Chinese and Thai Datasets.**

We further investigated regional disparities in imputation performance by examining IQS variation among reference panels across genomic windows. In the Chinese dataset, we found 37 windows with substantial differences in imputation quality, characterized by a mean-variance exceeding 0.02 (Supplementary Fig. 6 and Supplementary Data 1). Notably, most of these regions (34 out of 37) showed significantly better performance when the Chinese-matched reference panels were used. This observation could be attributed to the effects of recent selection on these regions.
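One way to flag such windows is sketched below. The exact statistic behind the 0.02 cutoff is not spelled out in this section, so the sketch makes an assumption: it takes the sample variance of the per-panel mean IQS values within a window. The window labels and IQS values are hypothetical.

```python
from statistics import variance

def high_variance_windows(win_means, cutoff=0.02):
    """Return windows whose between-panel variance in mean IQS exceeds the cutoff.

    win_means: {window: {panel: mean_iqs}}.
    Assumption: sample variance across panels; the study may define it differently.
    """
    flagged = {}
    for window, panel_means in win_means.items():
        values = list(panel_means.values())
        if len(values) < 2:
            continue  # variance needs at least two panels
        var = variance(values)
        if var > cutoff:
            flagged[window] = var
    return flagged

# Hypothetical windows: the first shows a strong panel effect, the second does not.
toy_windows = {
    "window_1": {"panelA": 0.90, "panelB": 0.50, "panelC": 0.60},
    "window_2": {"panelA": 0.80, "panelB": 0.82, "panelC": 0.79},
}
flagged = high_variance_windows(toy_windows)
print(sorted(flagged))
```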

To further evaluate this hypothesis, we examined signatures of recent positive selection using the Singleton Density Score (SDS) method²⁴ with data from an independent Chinese cohort²⁵ (RePoS database; n = 3946). Owing to an insufficient number of variants (<20 variants) within a window, three of the 34 regions were excluded from the analysis. Among the remaining regions, four showed suggestive signatures of recent positive selection (SDS P-value < 5E-04). This proportion was significantly higher than that in regions with minimal variation in imputation quality (OR = 6.75, Fisher’s exact test P-value = 0.0043; Methods).

A notable example is a region on chromosome 11 (55.4–55.6 Mb) that contains an olfactory receptor gene cluster (Fig. 5a). Within the region, ChinaMAP (mean IQS = 0.836) demonstrated significantly better imputation performance compared to other panels, such as 1KG (mean IQS = 0.762), TOPMed (mean IQS = 0.644), and SEAD (mean IQS = 0.473). A strong signature of recent positive selection in the region was also detected in the Chinese population, as indicated by the lead SNP rs511492 (SDS = 7.46, SDS P value = 8.69E-14; Fig. 5b). Prior studies have shown an extended haplotype for the alternative allele (T) compared to the reference allele (C)²⁵, further supporting evidence of recent selection in this region. Notably, one of the selected haplotypes (CTT; consisting of rs117749670, rs1459101 and rs511492) carried a stop-gain mutation (rs1459101-C) in the OR4C16 gene. In our Chinese WGS dataset, this haplotype was observed at a frequency of 15.3%. However, it was nearly absent in African, American, South Asian, and European populations (Fig. 5c). These findings help explain why non-Chinese reference panels performed less effectively in regions that have undergone recent selection in the Chinese population.
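The relationship between the reported SDS and its P value can be checked with a short calculation. The sketch assumes the SDS is a standardized score that is approximately standard normal under neutrality and that the P value is two-sided; neither assumption is stated in this section, and the function name is hypothetical.

```python
import math

def sds_two_sided_p(sds):
    """Two-sided P value for a standardized score under a standard normal null."""
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) and stays accurate in the far tail.
    return math.erfc(abs(sds) / math.sqrt(2))

p_lead = sds_two_sided_p(7.46)  # lead SNP rs511492, SDS as reported in the text
print(f"{p_lead:.2e}")
```

Under these assumptions the rounded SDS of 7.46 gives a P value on the order of 1E-13, in line with the reported 8.69E-14; the small residual difference would be expected from rounding of the SDS.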

**Fig. 5: Imputation disparities and selection signatures in the *Olfactory Receptor Gene Cluster* region.**

Immune-related genes enriched in regions with poor imputation quality

Through the regional evaluation, we also identified 0.94% of genomic regions with consistently poor imputation quality (mean IQS < 0.40) in both populations (Fig. 4 and Supplementary Data 3, 4). These encompassed critical immune-related loci, including the immunoglobulin heavy (IGH; 16p11.2), kappa (IGK@; 2p11.2), and lambda (IGL@; 22q11.22) gene clusters, along with the T cell receptor beta locus (TRB; 7q34). These regions pose significant challenges for imputation due to their extensive haplotype diversity, segmental duplications, and complex structural variation^26,27,28. Notably, evidence of recent positive selection at the IGH cluster in the Chinese population²⁵ underscores the need for long-read sequencing approaches to resolve population-specific architecture in these regions.

Genes within other poorly imputed regions also showed significant enrichment for immunological pathways, including macrophage inflammatory protein (MIP)-1beta signaling (adjusted P-value = 9.20E-06), leukocyte immunoglobulin-like receptor (LILR) activity (adjusted P-value = 5.02E-05), C-C motif chemokine production (e.g., CCL3, CCL15, CCL18 and CCL23; adjusted P-value = 1.74E-04) and others (Supplementary Data 5). These findings imply that certain association signals for immune-related diseases or traits remain undetected in East and Southeast Asian populations due to inadequate imputation quality.

Impact of imputation quality on disease risk estimation

Polygenic risk scores (PRS) have been widely utilized to predict individuals’ risk of developing complex diseases⁹. However, the quality of genotype imputation could affect PRS calculations and, consequently, the accuracy of disease prediction. To evaluate this effect, we leveraged the recorded phenotypes in the Chinese dataset to examine how different imputation reference panels affect PRS estimation for predicting individuals’ risk of systemic lupus erythematosus (SLE).

Using WGS from 1,263 Chinese individuals, we compared SLE PRS calculated from true genotypes against the PRS generated by genotype dosages imputed from each reference panel (Methods). To compute the PRS, we used 203 previously reported SLE-associated variants^19,29 shared between the WGS dataset and the reference panels. The results showed that PRSs calculated from imputation using the ChinaMAP and CHN100k panels were highly correlated with the score derived from the true genotypes, with Pearson correlation coefficients (r) of 0.945 and 0.937, respectively. The correlations were much greater than those obtained using the 1KG (r = 0.894) and TOPMed (r = 0.898) panels (Fig. 6a).
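The comparison can be sketched as a weighted sum of dosages followed by a Pearson correlation. This is a minimal illustration assuming an additive PRS with one effect size per variant and restriction to variants present in both datasets; the variant identifiers, weights, and dosages are hypothetical.

```python
import math

def prs(dosages, weights, variants):
    """Additive score per sample: sum over variants of effect size times dosage."""
    n = len(dosages[variants[0]])
    return [sum(weights[v] * dosages[v][i] for v in variants) for i in range(n)]

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical effect sizes and four samples; rs_c is missing from the imputed set.
weights = {"rs_a": 0.2, "rs_b": -0.1, "rs_c": 0.5}
true_dos = {"rs_a": [0, 1, 2, 1], "rs_b": [2, 1, 0, 1], "rs_c": [0, 0, 1, 1]}
imp_dos = {"rs_a": [0.1, 0.9, 1.8, 1.0], "rs_b": [1.9, 1.2, 0.1, 1.0]}

# Score only the variants available in both datasets and in the weight table.
shared = sorted(set(weights) & set(true_dos) & set(imp_dos))
prs_true = prs(true_dos, weights, shared)
prs_imp = prs(imp_dos, weights, shared)
r = pearson_r(prs_true, prs_imp)
print(shared, round(r, 3))
```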

**Fig. 6: Impact of imputation results on disease risk prediction.**

To further evaluate the impact on disease risk prediction, we extended the analysis to the ASA-WGS combined dataset, comprising 2027 SLE cases and 2430 controls of Chinese ancestry. Using these data, we measured the accuracy of SLE prediction by the area under the receiver-operating characteristic curve (AUC). The results showed that the PRSs constructed using genotype dosage imputed from the CHN100k (AUC = 0.761) and ChinaMAP (AUC = 0.759) panels performed slightly better in SLE prediction than those generated using other panels (e.g., AUC = 0.751 for both 1KG and TOPMed; Fig. 6b).
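The AUC used here has a simple rank interpretation: it is the probability that a randomly chosen case has a higher score than a randomly chosen control, with ties counted as one half. The sketch below computes it directly from that definition; it is not the software used in the study, and the scores and labels are hypothetical.

```python
def auc(scores, labels):
    """AUC as the fraction of case-control pairs ranked correctly (ties count 0.5)."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return wins / (len(cases) * len(controls))

# Hypothetical PRS values: two cases (label 1) and two controls (label 0).
toy_scores = [0.9, 0.2, 0.3, 0.1]
toy_labels = [1, 1, 0, 0]
toy_auc = auc(toy_scores, toy_labels)
print(toy_auc)
```

The pairwise form is quadratic in sample size, which remains practical for a few thousand cases and controls; a rank-based implementation would be preferable for much larger cohorts.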

We next investigated the impact of varying R² thresholds on prediction performance. Unlike the above analysis, all 239 previously reported SLE-associated variants (Supplementary Table 10) were included. The results showed that overall predictive power declined as the R² threshold increased (Fig. 6c). The mean AUC across panels dropped significantly from 0.783 at an R² threshold of 0 to 0.736 at an R² threshold of 0.90 (paired t-test P-value = 2.31E-06). This reduction is likely attributable to the decreased number of associated variants retained at higher R² thresholds (Fig. 6d). We also observed that PRSs derived from the CHN100k and ChinaMAP panels achieved the best performance without any R² filtering. This pattern suggests that R² filtering offers little benefit for ancestry-matched panels, whereas raising the R² threshold may enhance predictive power for global reference panels by reducing the influence of poorly imputed genotypes.

Considering potential bias toward associated variants, we further implemented lassosum³⁰, a genome-wide PRS method, to reanalyze the data (Methods). The model was trained on the summary statistics from a previous Japanese and Korean meta-analysis of SLE³¹ (5369 cases, 233,167 controls). The PRS estimations performed better when constructed from imputed data using the ancestry-matched reference panels (Fig. 6e). We also observed that the mean AUC across all panels dropped significantly from 0.682 at an R² threshold of 0.00 to 0.653 at a threshold of 0.90 (paired t-test P-value = 2.44E-04). A marked reduction in prediction accuracy was observed when the threshold was greater than 0.70. These patterns are generally consistent with the results using reported SLE‑associated variants, but the genome‑wide PRS method was more robust and less sensitive to variant selection bias. Collectively, these findings highlight the importance of ancestry-matched reference panels in improving PRS estimation for disease risk prediction.

Discussion

The underrepresentation of East and Southeast Asian populations in widely used global reference panels, such as TOPMed, poses significant challenges for genetic discovery and translational applications in these ancestral groups. By leveraging high-coverage WGS data as a gold standard, we sought to address longstanding uncertainties regarding the optimal selection of imputation panels and quality control strategies for GWAS in East and Southeast Asian populations. More importantly, we further explored potential factors underlying imputation differences across reference panels. Our results suggest that recent positive selection may contribute to differences in the accuracy of genotype imputation across reference panels, as exemplified by the olfactory receptor gene cluster. These findings provide new insights into the choice of reference panels and merit broader attention in future studies.

By comparing the overlap between imputed variants and those identified through WGS, our study suggests that a substantial proportion of WGS-identified variants, particularly low-frequency variants, remain uncovered by current imputation references. While global reference panels effectively capture common variants, ancestry-matched panels provide significantly better coverage for low-frequency variants. Notably, we showed an often-overlooked limitation of earlier studies that relied on the 1KG panel, which showed the lowest overlap for low-frequency variants in East and Southeast Asian populations. These results emphasize the continued importance of sequencing technologies for comprehensive variant discovery in future studies.

Our findings demonstrate that imputation R² scores do not accurately represent the true concordance between imputed and actual genotype data, and the concordances at a fixed R² threshold vary considerably across different reference panels. For example, in our analysis of Chinese data, the ChinaMAP panel achieved a good performance without any R² filtering. In contrast, the TOPMed panel required an R² threshold between 0.60 and 0.70 to reach comparable performance. The deviation may result from genetic differences between the study and reference populations³². Therefore, panel-specific R² thresholds should be considered rather than applying a uniform threshold for quality control.

This study provides actionable guidelines for genotype imputation and quality control in GWAS of East and Southeast Asian populations. For studies involving the Chinese population, the ChinaMAP and CHN100k panels are recommended due to their superior coverage and accuracy compared with other reference panels. However, we note that although ChinaMAP showed the best performance in the Chinese population, it does not support the imputation of indels or of variants within the HLA region. Given the high genetic diversity of HLA alleles and their critical role in immune response, there is a need for specialized tools tailored to high-quality HLA allele imputation. In addition, we showed that the imputation performance across reference panels is unlikely to be affected by the choice of SNP arrays (Supplementary Figs. 3 and 5).

For Thai population studies, the SEAD panel demonstrated broader coverage and slightly higher concordance with actual genotypes when the R² threshold is increased to 0.30 or above. Thus, we recommend using the SEAD panel for imputing Thai population data, while applying an R² threshold of at least 0.30 to ensure reliable variant quality.

Although the total number of samples in the SEAD panel is comparable to ChinaMAP, its more diverse composition likely accounts for the better performance observed in the Thai dataset. The SEAD panel included 4480 Chinese samples, 4563 individuals from the SG10K project⁶, 1031 from the GAsP pilot project⁷, and 993 East and South Asian samples from the 1000 Genomes Project³. In contrast to ChinaMAP and CHN100k, this broader representation encompasses some individuals that are genetically closer to Thai samples, which likely contributes to the gains in imputation accuracy. In addition, we found that all three panels (SEAD, ChinaMAP, and CHN100k) used the SHAPEIT algorithm for phasing, though with different versions (SHAPEIT2³³ for SEAD and SHAPEIT4³⁴ for ChinaMAP and CHN100k). In this study, we phased all test samples using the same algorithm and then performed imputation using Minimac4¹⁴. Thus, the stronger performance of SEAD in the Thai population is unlikely to stem from technical differences and is more plausibly attributed to its broader representation of populations genetically closer to Thai samples.

However, our results also highlight the influence of R² thresholds on the performance of Thai samples: without filtering, ChinaMAP and 1KG performed slightly better (Fig. 2b and Fig. 4b), whereas at thresholds of 0.30 or above, SEAD becomes the preferred panel (Fig. 3b). In addition, the Thai dataset exhibited greater variation in imputation quality, with 168 genomic windows showing significant differences across reference panels, compared to 37 windows in the Chinese dataset (Supplementary Fig. 6 and Supplementary Data 1-2). These findings underscore the need for more tailored and optimized reference panels to enhance imputation accuracy for this population. Notably, the Genomics Thailand project has been launched, which could provide a valuable foundation for developing an optimized imputation panel for the Thai population.

More importantly, our study suggests that recent selection may contribute to the variation in imputation quality across reference panels. This is exemplified by the olfactory receptor gene cluster on chromosome 11, which has been reported to be under recent positive selection in the Chinese population (Fig. 5). The selected haplotype (CTT) in this region is highly prevalent in East Asian populations but nearly absent in others. Because of this unique haplotype structure, imputation quality can be significantly improved by using ancestry-matched reference panels. Previous studies of the lactase (LCT) gene also highlight potential limitations of genotype imputation in regions under recent selection³⁵. In addition, recent investigations of type 2 diabetes (T2D)-associated loci demonstrated that haplotype differences between study cohorts and reference panels can skew imputation results, favoring allelic calls that are more common in the reference panel³⁶. Taken together, these findings underscore the importance of using a matched reference panel to improve imputation accuracy in regions with varying haplotype frequencies across populations.

Our study also revealed that certain regions, including multiple immune-related loci such as IGH, IGK@, IGL@, and TRB, were poorly imputed in East and Southeast Asian populations. The complex structure of these regions and the evidence of recent selection within them^26,27,28 suggest that targeted long-read sequencing approaches are necessary to refine them in an ancestry-specific manner. In addition, genes involved in MIP-1beta signaling, LILR activity, and C-C motif chemokine production, which play important roles in the immune system^37,38,39,40, were also enriched in these poorly imputed regions. These findings suggest that some association signals for immune-related diseases or traits may have been missed in previous studies. However, it is important to note that these results may be influenced by the SNP array platform used in our analysis, which was limited to the Infinium ASA platform. Further studies are needed to determine whether imputation quality in these regions can be improved using other genotyping platforms.

We also identified other important genomic regions that were poorly imputed (Supplementary Data 3-4). One example lies on chromosome 15 (25,000,001–32,200,000 bp), which overlaps with 15q11–q13. This locus is prone to structural rearrangements and is a known hotspot for recurrent copy number variations. Several genes within this region, including SNORD116, HERC2, UBE3A, SNHG14 (UBE3A-AS1), and CHRNA7, have been linked to neurodevelopmental disorders^41,42. Another example is a region on chromosome 17 (45,800,001–46,300,000 bp) that contains MAPT, which encodes the tau protein, a hallmark of Alzheimer’s disease and frontotemporal dementia^43,44. The MAPT locus is flanked by three highly homologous low-copy repeats (LCRs) spanning approximately 1.7 Mb⁴³, which likely complicates accurate genotype imputation. These poorly imputed regions highlight areas that deserve closer attention in future studies.

Finally, we demonstrated the significant impact of reference panel selection on disease risk prediction, using SLE as a model disease. Our results showed notable improvements in predictive power when using ancestry-optimized reference panels. Notably, filtering variants by an R² threshold reduced predictive power. We chose SLE for the following reasons: 1) it is a prototypic autoimmune disease with a strong genetic component (heritability 43–66%^45,46,47); 2) although most GWAS findings to date have been biased toward European populations, GWAS findings for SLE are mainly based on Asian populations^19,31,48; 3) we had access to a relatively large, well-curated SLE dataset for robust analysis.

We acknowledge several limitations in this study. First, although our analyses focused on Chinese and Thai populations, which represent important groups, they do not capture the full genetic diversity of East and Southeast Asians. Second, the Thai WGS dataset was relatively small, which may limit the power to detect low-frequency variants. Third, our PRS analyses were restricted to SLE, and further studies are needed to determine whether these patterns hold for other diseases or continuous phenotypes.

In summary, this study provides valuable guidelines for future GWAS in East and Southeast Asian populations. More importantly, it highlights the importance of ongoing efforts to develop ancestrally diverse reference panels, as recent selection may generate ancestry-specific haplotypes that are challenging to capture without sufficient representation. Together with advancements in long-read sequencing technologies, these efforts will be essential for addressing current disparities in imputation quality and enhancing the accuracy of disease prediction.

Methods

Ethics approval

This study was approved by the institutional review boards, including the ethical committee from the Hospital Authority Hong Kong West Cluster (UW 07-119) and the Faculty of Medicine Ramathibodi Hospital, Mahidol University (12-58-12). All patients signed a written informed consent to participate in the study. All ethical regulations relevant to human research participants were followed.

Sample overview

This study included a total of 8438 individuals from Chinese and Thai populations. Of these, 3198 Chinese samples and 3830 Thai samples were sourced from our earlier studies and genotyped using the Infinium Asian Screening Array-24 v1.0 (ASA)^18,19. To assess imputation accuracy, we recruited 1353 individuals of Chinese ancestry from Hong Kong, China, and 57 individuals of Thai ancestry from Bangkok, Thailand. Genomic DNA from these individuals was whole-genome sequenced on the DNBSEQ platform by BGI Genomics, using paired-end reads of 150 base pairs (PE150).

Variant calling

NVIDIA Parabricks v4.3.0 software suite⁴⁹ was used to identify germline variants for WGS data. The software suite significantly accelerated genomic analysis by utilizing graphics processing units (GPUs) instead of traditional central processing units (CPUs), achieving faster processing speeds while maintaining accuracy equivalent to the standard GATK (Genome Analysis Toolkit) best practices^20,21,22.

The analysis followed a series of steps to process the sequencing data. Raw paired-end sequencing data were aligned to the GRCh38 reference genome using the BWA-MEM algorithm. PCR-induced duplicate fragments were removed with the MarkDuplicates algorithm. Base Quality Score Recalibration (BQSR) was applied to correct sequencing quality discrepancies that arise from different sequencing cycles and contexts. Variants in individual samples were subsequently detected using the HaplotypeCaller algorithm. Joint genotyping was performed using GLnexus⁵⁰, a tool designed for large-scale genomic variant analysis.

In addition, vcftools⁵¹ was used to refine the data by filtering out variants with genotype quality (GQ) less than 20 or depth less than 5. Variants with a missing rate exceeding 20% among samples were removed. “Half-called” variants, referring to genotypic calls that are incomplete or ambiguous within the variant calling process, were excluded from the analysis. Following these steps, 46,218,832 variants on autosomes were retained in the Chinese WGS dataset, consisting of 42,751,494 SNVs and 3,467,338 indels. Similarly, 15,604,227 variants on autosomes were retained in the Thai WGS dataset, including 12,598,039 SNVs and 3,006,188 indels. The transition/transversion (Ti/Tv) ratio for individuals ranged from 2.03 to 2.06 in both datasets.
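As a minimal sketch of the filtering logic described above, the following Python function decides whether a single site is retained. It is not the vcftools command used in the study. Two points are assumptions: genotype calls failing the GQ or depth cut-offs are treated as missing before the site-level missing rate is computed, and a site is dropped if any sample carries a half-called genotype. The function names and the tuple layout are hypothetical.

```python
def is_half_called(genotype: str) -> bool:
    """True for calls such as './1', where only one allele is missing."""
    alleles = genotype.replace("|", "/").split("/")
    return "." in alleles and any(a != "." for a in alleles)


def site_passes(calls, min_gq=20, min_dp=5, max_missing=0.20):
    """Decide whether a site is retained.

    calls: list of (genotype, GQ, DP) tuples, one per sample.
    """
    if not calls:
        return False
    n_missing = 0
    for genotype, gq, dp in calls:
        if is_half_called(genotype):
            return False  # assumption: one half-call excludes the whole site
        # Assumption: low-GQ or low-depth calls are counted as missing.
        if "." in genotype or gq < min_gq or dp < min_dp:
            n_missing += 1
    # Sites with a missing rate above 20% are removed.
    return n_missing / len(calls) <= max_missing
```

Under these assumptions, a site at which one of five samples fails the GQ cut-off has a missing rate of exactly 20% and is kept, whereas a site with two failing samples is removed.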

Quality control and genotype imputation

To enhance phasing accuracy, the WGS samples were merged with the ASA-genotyped samples based on their shared variants. Quality control steps were applied to the merged datasets using PLINK v1.9⁵². Variants with missingness greater than 0.05, MAF less than 0.01, or failing the Hardy-Weinberg equilibrium (HWE) test (P-value < 1.00E-05) were removed. Samples identified as potential duplicates (identity-by-descent proportion PI_HAT > 0.90) or exhibiting abnormal heterozygosity levels (|F-score| > 0.1) were also excluded. After quality control, 8316 samples from both populations were retained for downstream analyses (Supplementary Table 1).
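The thresholds above can be summarised as two predicates. This is an illustrative sketch only: the study applied these filters with PLINK, and the function names and argument layout here are hypothetical. The per-variant and per-sample statistics are assumed to have been computed beforehand, and the sketch does not decide which member of a duplicate pair to drop.

```python
def variant_passes_qc(missing_rate, maf, hwe_p,
                      max_missing=0.05, min_maf=0.01, min_hwe_p=1e-5):
    """Keep a variant unless it fails any of the three variant-level filters."""
    return (missing_rate <= max_missing
            and maf >= min_maf
            and hwe_p >= min_hwe_p)


def sample_passes_qc(max_pi_hat, f_coefficient,
                     dup_pi_hat=0.90, max_abs_f=0.1):
    """Keep a sample unless it looks like a duplicate or has abnormal heterozygosity.

    max_pi_hat: largest PI_HAT between this sample and any other sample.
    f_coefficient: inbreeding coefficient (F) estimated from heterozygosity.
    """
    return max_pi_hat <= dup_pi_hat and abs(f_coefficient) <= max_abs_f
```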

Eagle (version 2.4.1)⁵³ was then used to phase samples from the WGS-ASA combined dataset for each population without a reference panel. The genetic map generated by the HapMap project⁵⁴ was utilized during this process. The phased datasets from the Chinese and Thai populations were submitted to the CHN100k¹², ChinaMAP⁵⁵, Westlake¹⁰, TOPMed⁵, and Michigan¹⁴ Imputation Servers. The 1KG reference panel was included in the Michigan Imputation Server, and the SEAD reference panel was included in the Westlake Imputation Server. The Minimac4 algorithm¹⁴ was used for genotype imputation across all imputation servers.

Evaluation of imputation quality

To evaluate the performance of different reference panels, high-quality WGS data was used as the gold standard. Variants overlapping with those on the Infinium ASA array were extracted for imputation. After imputation, the imputed genotypes were compared with the actual genotypes determined by WGS to assess imputation performance across different reference panels.

Heterozygosity concordance rate and IQS score^16,17 were used to evaluate imputation performance across reference panels. The heterozygosity concordance rate was calculated by comparing the observed heterozygous genotypes from WGS with the imputed genotypes for each individual. The IQS score for each variant was computed by subtracting the chance agreement (${P}_{c}$) from the observed agreement (${P}_{o}$), and then dividing the result by the maximum possible agreement excluding the chance agreement (Eq. 1). The calculation is represented by the following equations^16,17:

$$IQS=\frac{P_{o}-P_{c}}{1-P_{c}} \tag{1}$$

$$P_{o}=\frac{\sum_{n=1}^{N_{11}}P_{11\_n}+\sum_{n=1}^{N_{22}}P_{22\_n}+\sum_{n=1}^{N_{33}}P_{33\_n}}{N} \tag{2}$$

$$P_{c}=\frac{W_{1}Y_{1}+W_{2}Y_{2}+W_{3}Y_{3}}{N\times N} \tag{3}$$

Where ${P}_{o}$ represents the observed concordance rate for different genotype classes, calculated as the sum of estimated genotype probabilities for each matching genotype class (Eq. 2). ${P}_{11}$ denotes the estimated genotype probability for the homozygous reference genotype (0/0) in cases where the true genotype, determined by WGS, is 0/0 and the imputed genotype also matches as 0/0. Similarly, ${P}_{22}$ and ${P}_{33}$ represent the estimated genotype probabilities for the heterozygous genotype (0/1) and the alternative homozygous genotype (1/1), respectively, in cases where both the true and imputed genotypes align. ${N}_{11}$, ${N}_{22}$ and ${N}_{33}$ refer to the number of individuals with matched genotypes for the homozygous reference (0/0), heterozygous (0/1), and alternative homozygous (1/1) categories, respectively. $N$ denotes the total number of individuals.

The chance agreement (${P}_{c}$) is calculated as the sum of the products of marginal frequencies, representing the expected agreement if genotypes were randomly assigned based on their respective marginal rates (Eq. 3). Specifically, W₁, W₂, and W₃ refer to the marginal frequency for the observed genotype 0/0, 0/1, and 1/1, respectively, while Y₁, Y₂, and Y₃ represent the marginal frequency for the imputed genotype 0/0, 0/1, and 1/1, respectively, as illustrated in Supplementary Table 9.
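To make Eqs. 1–3 concrete, the sketch below computes the IQS for one variant from paired WGS and imputed genotypes coded as 0, 1, or 2 alternative alleles. It is a simplified illustration rather than the code released with this study. It assumes best-guess genotypes, so each matched individual contributes 1 to the observed agreement (with posterior probabilities, the diagonal would accumulate probabilities instead), and it treats W and Y as marginal counts, as implied by the N × N denominator in Eq. 3. The second function shows one reading of the heterozygosity concordance rate, namely the fraction of WGS heterozygous genotypes that are also imputed as heterozygous. All function names are hypothetical.

```python
import math


def iqs(true_gt, imputed_gt):
    """Imputation quality score for one variant (Eqs. 1-3), best-guess genotypes."""
    if len(true_gt) != len(imputed_gt) or not true_gt:
        raise ValueError("genotype lists must be non-empty and of equal length")
    n = len(true_gt)
    # 3x3 table: rows = WGS genotype, columns = imputed genotype.
    table = [[0] * 3 for _ in range(3)]
    for t, g in zip(true_gt, imputed_gt):
        table[t][g] += 1
    p_o = sum(table[k][k] for k in range(3)) / n            # observed agreement
    w = [sum(table[k]) for k in range(3)]                   # WGS marginals
    y = [sum(table[r][k] for r in range(3)) for k in range(3)]  # imputed marginals
    p_c = sum(w[k] * y[k] for k in range(3)) / (n * n)      # chance agreement
    if p_c == 1.0:
        return math.nan  # monomorphic in both call sets: IQS is undefined
    return (p_o - p_c) / (1 - p_c)


def het_concordance(true_gt, imputed_gt):
    """Fraction of WGS heterozygous genotypes that were imputed as heterozygous."""
    het = [g for t, g in zip(true_gt, imputed_gt) if t == 1]
    if not het:
        return math.nan
    return sum(1 for g in het if g == 1) / len(het)
```

For example, if the WGS genotypes are `[0, 0, 1, 1]` and every genotype is imputed as `0`, raw concordance is 0.5 but the IQS is 0, because that level of agreement is exactly what the marginal counts predict by chance.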

For the analysis of the Chinese dataset, the MAF categories were defined using 1263 Chinese WGS samples. For the analysis of the Thai dataset, the MAF categories were defined based on WGS data derived from 1KG East Asian populations, due to a limited number of WGS samples in the Thai dataset.

Impact of SNP array choice on imputation performance

To streamline the analysis, we focused on variants located on chromosome 1 within the Chinese dataset, which had a much larger sample size. After quality control, 27,419 GSA variants were included in the analysis, 27% of which overlapped with variants on the Infinium ASA array. Phasing was performed using Eagle (version 2.4.1)⁵³ based on these GSA variants, and the phased datasets were then used for imputation across different reference panels. To assess imputation performance using variants on the Infinium GSA array, accuracy metrics such as imputation R², heterozygosity concordance rate, and IQS were computed for comparison across different reference panels.

Region-based analysis

For regional analysis, 9,781,349 SNVs shared by all reference panels and the Chinese WGS dataset were used in the Chinese dataset. Similarly, 6,299,991 shared SNVs were used for analysis in the Thai dataset. To conduct this analysis, the genome was divided into windows of 100 kbp. This resulted in 26,352 windows in the Chinese dataset and 26,029 windows in the Thai dataset, excluding the HLA region. The difference in the number of windows between the two datasets reflected different variants included in each analysis (Supplementary Table 8).

To account for differences in allele frequencies across genomic regions, we assessed imputation quality using the IQS score. For each genomic window, the mean IQS score for each reference panel and the variation in IQS across panels were calculated. Windows with a mean IQS below 0.4 across all panels were classified as regions with poor imputation quality, while those with a variance in IQS exceeding 0.02 across panels were categorized as regions with significant differences. Regions identified as having poor imputation quality in both the Chinese and Thai datasets were further analyzed. From these regions, 542 genes were identified and used for enrichment analysis with ToppGene⁵⁶ (Supplementary Data 5).
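The window classification can be sketched as follows. This is a hedged illustration, not the study's code. It assumes 1-based positions with windows starting at x00,001, reads "below 0.4 across all panels" as every panel's window mean falling below 0.4, and uses the sample variance of the per-panel window means; the source does not specify these details. The record layout and function name are hypothetical.

```python
from collections import defaultdict
from statistics import mean, variance

WINDOW_BP = 100_000  # 100 kbp windows


def classify_windows(records, poor_cutoff=0.4, var_cutoff=0.02):
    """records: iterable of (chrom, pos, panel, iqs) tuples, one per variant and panel."""
    per_window = defaultdict(lambda: defaultdict(list))
    for chrom, pos, panel, score in records:
        # Position 100,000 falls in window 0; position 100,001 in window 1.
        key = (chrom, (pos - 1) // WINDOW_BP)
        per_window[key][panel].append(score)

    result = {}
    for key, panels in per_window.items():
        panel_means = {p: mean(scores) for p, scores in panels.items()}
        means = list(panel_means.values())
        result[key] = {
            "panel_means": panel_means,
            # Poor imputation: every panel's mean IQS is below the cut-off.
            "poor": all(m < poor_cutoff for m in means),
            # Significant difference: variance of panel means exceeds the cut-off.
            "divergent": len(means) > 1 and variance(means) > var_cutoff,
        }
    return result
```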

To investigate whether genomic regions with substantial variation in imputation accuracy across reference panels were enriched in regions under recent positive selection, we analyzed selection signals using the SDS method²⁴. Compared to extended haplotype homozygosity (EHH), integrated haplotype score (iHS), and cross-population extended haplotype homozygosity (XP-EHH) approaches, the SDS method offers greater sensitivity and power in identifying more recent selection events, such as those occurring within the past 2000–3000 years²⁴. The SDS data were sourced from the RePoS database, which includes 3946 Chinese WGS samples²⁵. We focused on regions with the best performance in Chinese ancestry-matched panels. We excluded regions with an insufficient number of variants (<20 variants) in the RePoS database. For comparison, a set of 22,540 genomic regions with an IQS variance of less than 0.005 was used as a benchmark. Fisher’s exact test was then conducted to examine whether genomic regions with substantial variation in imputation accuracy were more likely to be enriched in regions under recent positive selection compared to regions without significant variation. Due to insufficient data for assessing recent positive selection in the Thai population, this analysis was not performed on the Thai dataset.
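For readers unfamiliar with the enrichment test, the sketch below computes a one-sided Fisher's exact P-value from a 2 × 2 table using only the Python standard library. The choice of a one-sided (enrichment) alternative is an assumption, since the source does not state the sidedness, and the function name is hypothetical. In the setting described above, the rows would be regions with high IQS variance versus benchmark regions, and the columns would be regions with or without a selection signal.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for the table [[a, b], [c, d]].

    Returns P(X >= a) under the hypergeometric null, where X is the
    top-left cell count with all margins held fixed.
    """
    row1 = a + b          # e.g. number of high-variance regions
    col1 = a + c          # e.g. number of regions with a selection signal
    n = a + b + c + d
    k_max = min(row1, col1)
    numerator = sum(
        comb(col1, k) * comb(n - col1, row1 - k) for k in range(a, k_max + 1)
    )
    return numerator / comb(n, row1)
```

With a purely illustrative table `[[3, 1], [1, 3]]`, the function returns 17/70 (about 0.243); these counts are not results from this study.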

Calculation of polygenic risk scores

The polygenic risk score (PRS) of SLE was calculated for each individual using the equation provided below:

$$PRS=\sum_{i=1}^{n}\beta_{i}\times Dosage_{i} \tag{4}$$

Where ${\beta }_{i}$ represents the effect size of the ith SLE-associated variant, ${{Dosage}}_{i}$ represents the corresponding allele dosage obtained from imputation results, and n is the total number of associated variants used in the calculation. In this study, a total of 239 SLE-associated variants outside of the HLA region were extracted from previous studies^19,29 and their genetic effect on SLE development was summarized in Supplementary Table 10.
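Eq. 4 amounts to a weighted sum of allele dosages. A minimal sketch is given below, assuming effect sizes and dosages are stored in dictionaries keyed by variant identifier and that variants absent from the imputed data are simply skipped. The function name and data layout are hypothetical, and the effect allele is assumed to match the allele counted by the dosage.

```python
def polygenic_risk_score(betas, dosages):
    """Weighted sum of imputed allele dosages (Eq. 4) for one individual.

    betas:   dict mapping variant ID -> effect size (beta) of the effect allele
    dosages: dict mapping variant ID -> imputed dosage (0.0 to 2.0) of that allele
    """
    shared = betas.keys() & dosages.keys()  # variants present in both inputs
    return sum(betas[v] * dosages[v] for v in shared)
```

For instance, with toy effect sizes `{"v1": 0.5, "v2": -0.2}` and dosages `{"v1": 2.0, "v2": 1.0}`, the score is 0.5 × 2.0 − 0.2 × 1.0 = 0.8.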

To assess the impact of reference panel selection on SLE prediction, the PRS was calculated using 203 SLE-associated variants that were shared across the six reference panels and the Chinese WGS datasets. No R² filtering was applied in this analysis. For the WGS samples in the Chinese population, we evaluated the accuracy of PRSs by comparing those derived from imputed genotype dosages with PRSs calculated from sequencing-based genotypes, using the Pearson Correlation Coefficient (r) to measure the agreement.
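The agreement between imputation-based and sequencing-based PRSs can be quantified with a plain Pearson correlation, sketched below with the standard library only. The function name is hypothetical, and the inputs are assumed to be two equally ordered lists of per-individual scores.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / sqrt(sxx * syy)
```

Here `x` would hold the PRSs computed from imputed dosages and `y` the PRSs computed from WGS genotypes for the same individuals.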

The performance of various reference panels on SLE prediction for the Chinese ASA-WGS combined dataset was assessed using the AUC score. To examine the impact of different imputation R² thresholds on SLE prediction, all SLE-associated variants were analyzed. For each R² threshold, PRSs were calculated using the associated variants retained in each reference panel, and the corresponding AUC scores were calculated. The AUC scores and the 95% confidence intervals were calculated using the pROC⁵⁷ package.
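The threshold scan described above can be sketched as follows. The study computed AUC values and confidence intervals with the pROC package in R; this sketch instead uses the rank-based (Mann-Whitney) definition of the AUC in plain Python and omits confidence intervals. The function names and data layout are hypothetical, and the toy numbers in the usage note are not study results.

```python
def auc(case_scores, control_scores):
    """Probability that a random case scores higher than a random control (ties count 0.5)."""
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


def auc_by_r2_threshold(variants, dosages, is_case, thresholds):
    """AUC of the PRS after keeping only variants with imputation R2 >= threshold.

    variants: dict variant ID -> (beta, imputation R2) for one reference panel
    dosages:  list of dicts (variant ID -> dosage), one per individual
    is_case:  list of booleans aligned with dosages
    """
    results = {}
    for threshold in thresholds:
        kept = [v for v, (_, r2) in variants.items() if r2 >= threshold]
        scores = [sum(variants[v][0] * d[v] for v in kept) for d in dosages]
        cases = [s for s, flag in zip(scores, is_case) if flag]
        controls = [s for s, flag in zip(scores, is_case) if not flag]
        results[threshold] = auc(cases, controls)
    return results
```

In a toy example with two variants, one of which has a low R², raising the threshold removes that variant from the score and can lower the AUC, which illustrates how R² filtering may discard informative variants.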

In addition, lassosum, one of the most effective genome-wide PRS methods^30,58, was also applied to the imputed datasets from different reference panels. Previous SLE GWAS summary statistics from Japanese and Korean populations³¹ (5,369 cases and 233,167 controls) were used to train the model. Linkage disequilibrium (LD) among variants was estimated from the testing dataset.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

The details of the datasets used for conducting this research are provided here: The data for the 1000 Genomes Project Phase 3³ are available at https://www.internationalgenome.org/. The Genetic map (hg38)⁵⁴ is publicly available at https://storage.googleapis.com/broad-alkesgroup-public/Eagle/downloads/tables/genetic_map_hg38_withX.txt.gz. The RePoS SDS dataset is available at http://bigdata.ibp.ac.cn/RePoS/download.php. All other data supporting the findings of this study are available from the corresponding author upon reasonable request.

Code availability

Custom code for variant calling, quality control, and IQS calculation can be found on GitHub (https://github.com/LI-DINGYANG/Synecdoche) and on Zenodo (https://doi.org/10.5281/zenodo.17472213)⁵⁹. The imputation servers and software used in this study are available at the following locations:
CHN100k Imputation Server¹²: http://bioinformatics.hit.edu.cn/imputation/
ChinaMAP Imputation Server⁵⁵: http://www.mbiobank.com/imputation/
Westlake Imputation Server¹⁰: https://imputationserver.westlake.edu.cn/index.html
TOPMed Imputation Server⁵: https://imputation.biodatacatalyst.nhlbi.nih.gov/
Michigan Imputation Server¹⁴: https://imputationserver.sph.umich.edu/index.html
PLINK⁵² (v1.9): http://www.cog-genomics.org/plink/1.9/
Eagle2⁵³ (v2.4.1): https://alkesgroup.broadinstitute.org/Eagle/
VCFtools⁵¹ (v0.1.16): https://vcftools.github.io/index.html
GLnexus⁵⁰ (v1.4.1): https://github.com/dnanexus-rnd/GLnexus
Clara Parabricks⁴⁹ (v4.3.0-1): https://www.nvidia.com/en-us/clara/genomics/
ToppGene⁵⁶: https://toppgene.cchmc.org/
LocusZoom⁶⁰: https://github.com/statgen/locuszoom
lassosum³⁰: https://github.com/tshmak/lassosum
pROC package⁵⁷: https://www.expasy.org/resources/proc
Geography of Genetic Variants Browser⁶¹: https://popgen.uchicago.edu/ggv/

References

Das, S., Abecasis, G. R. & Browning, B. L. Genotype Imputation from Large Reference Panels. Annu. Rev. Genom. Hum. Genet. 19, 73–96 (2018).
Sengupta, D. et al. Performance and accuracy evaluation of reference panels for genotype imputation in sub-Saharan African populations. Cell Genom. 3, 100332 (2023).
Auton, A. et al. A global reference for human genetic variation. Nature 526, 68–74 (2015).
McCarthy, S. et al. A reference panel of 64,976 haplotypes for genotype imputation. Nat. Genet. 48, 1279–1283 (2016).
Taliun, D. et al. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. Nature 590, 290–299 (2021).
Wu, D. et al. Large-scale whole-genome sequencing of three diverse Asian populations in Singapore. Cell 179, 736–749.e15 (2019).
Wall, J. D. et al. The GenomeAsia 100 K Project enables genetic discoveries across Asia. Nature 576, 106–111 (2019).
Cahoon, J. L. et al. Imputation accuracy across global human populations. Am. J. Hum. Genet. 111, 979–989 (2024).
Martin, A. R. et al. Clinical use of current polygenic risk scores may exacerbate health disparities. Nat. Genet. 51, 584–591 (2019).
Cong, P.-K. et al. Genomic analyses of 10,376 individuals in the Westlake BioBank for Chinese (WBBC) pilot project. Nat. Commun. 13, 2939 (2022).
Cao, Y. et al. The ChinaMAP analytics of deep whole genome sequences in 10,588 individuals. Cell Res. 30, 717–731 (2020).
Jiang, T. et al. A comprehensive genetic variant reference for the Chinese population. Sci. Bull. https://doi.org/10.1016/j.scib.2024.06.017 (2024).
Yang, M.-Y. et al. SEAD reference panel with 22,134 haplotypes boosts rare variant imputation and genome-wide association analysis in Asian populations. Nat. Commun. 15, 10839 (2024).
Das, S. et al. Next-generation genotype imputation service and methods. Nat. Genet. 48, 1284–1287 (2016).
Shi, M. et al. Genotype imputation accuracy and the quality metrics of the minor ancestry in multi-ancestry reference panels. Brief. Bioinform. 25, bbad509 (2024).
Lin, P. et al. A new statistic to evaluate imputation reliability. PLOS ONE 5, e9697 (2010).
Ramnarine, S. et al. When does choice of accuracy measure alter imputation accuracy assessments?. PLoS ONE 10, e0137601 (2015).
Wang, Y.-F. et al. Identification of shared and asian-specific loci for systemic lupus erythematosus and evidence for roles of type iii interferon signaling and lysosomal function in the disease: a multi-ancestral genome-wide association study. Arthritis Rheumatol. 74, 840–848 (2022).
Wang, Y.-F. et al. Identification of 38 novel loci for systemic lupus erythematosus and genetic heterogeneity between ancestral groups. Nat. Commun. 12, 772 (2021).
Van der Auwera, G. A. et al. From FastQ data to high-confidence variant calls: the genome analysis toolkit best practices pipeline. Curr. Protoc. Bioinforma. 43, 11.10.1–11.10.33 (2013).
McKenna, A. et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303 (2010).
DePristo, M. A. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat. Genet. 43, 491–498 (2011).
Roshyara, N. R., Kirsten, H., Horn, K., Ahnert, P. & Scholz, M. Impact of pre-imputation SNP-filtering on genotype imputation results. BMC Genet. 15, 88 (2014).
Field, Y. et al. Detection of human adaptation during the past 2000 years. Science 354, 760–764 (2016).
Luo, H. et al. Recent positive selection signatures reveal phenotypic evolution in the Han Chinese population. Sci. Bull. 68, 2391–2404 (2023).
Engelbrecht, E. et al. Resolving haplotype variation and complex genetic architecture in the human immunoglobulin kappa chain locus in individuals of diverse ancestry. Genes Immun. 25, 297–306 (2024).
Watson, C. T. et al. Complete haplotype sequence of the human immunoglobulin heavy-chain variable, diversity, and joining genes and characterization of allelic and copy-number variation. Am. J. Hum. Genet. 92, 530–546 (2013).
Rodriguez, O. L. et al. Genetic variation in the immunoglobulin heavy chain locus shapes the human antibody repertoire. Nat. Commun. 14, 4419 (2023).
Laurynenka, V. & Harley, J. B. The 330 risk loci known for systemic lupus erythematosus (SLE): a review. Front. Lupus 2, 1398035 (2024).
Mak, T. S. H., Porsch, R. M., Choi, S. W., Zhou, X. & Sham, P. C. Polygenic scores via penalized regression on summary statistics. Genet. Epidemiol. 41, 469–480 (2017).
Yin, X. et al. Meta-analysis of 208370 East Asians identifies 113 susceptibility loci for systemic lupus erythematosus. Ann. Rheum. Dis. 80, 632 (2021).
Gurdasani, D. et al. Uganda genome resource enables insights into population history and genomic discovery in Africa. Cell 179, 984–1002.e36 (2019).
Delaneau, O., Zagury, J.-F. & Marchini, J. Improved whole-chromosome phasing for disease and population genetic studies. Nat. Methods 10, 5–6 (2013).
Delaneau, O., Zagury, J.-F., Robinson, M. R., Marchini, J. L. & Dermitzakis, E. T. Accurate, scalable and integrative haplotype estimation. Nat. Commun. 10, 5436 (2019).
Ali, A. T., Liebert, A., Lau, W., Maniatis, N. & Swallow, D. M. The hazards of genotype imputation in chromosomal regions under selection: a case study using the Lactase gene region. Ann. Hum. Genet. 86, 24–33 (2022).
Lau, W. et al. The hazards of genotype imputation when mapping disease susceptibility variants. Genome Biol. 25, 7 (2024).
Hughes, C. E. & Nibbs, R. J. B. A guide to chemokines and their receptors. FEBS J. 285, 2944–2971 (2018).
Menten, P., Wuyts, A. & Van Damme, J. Macrophage inflammatory protein-1. Cytokine Growth Factor Rev. 13, 455–481 (2002).
Barrow, A. D. & Trowsdale, J. The extended human leukocyte receptor complex: diverse ways of modulating immune responses. Immunol. Rev. 224, 98–123 (2008).
McLaren, P. J. et al. Polymorphisms of large effect explain the majority of the host genetic contribution to variation of HIV-1 virus load. Proc. Natl Acad. Sci. 112, 14658–14663 (2015).
Paparella, A. et al. Structural variation evolution at the 15q11-q13 disease-associated locus. Int. J. Mol. Sci. 24, 15818 (2023).
Kirov, G. et al. The penetrance of copy number variations for schizophrenia and developmental delay. Biol. Psychiatry 75, 378–385 (2014).
Cruts, M. et al. Genomic architecture of human 17q21 linked to frontotemporal dementia uncovers a highly homologous family of low-copy repeats in the tau region. Hum. Mol. Genet. 14, 1753–1762 (2005).
Myers, A. J. et al. The H1c haplotype at the MAPT locus is associated with Alzheimer’s disease. Hum. Mol. Genet. 14, 2399–2404 (2005).
Kuo, C.-F. et al. Familial aggregation of systemic lupus erythematosus and coaggregation of autoimmune diseases in affected families. JAMA Intern. Med. 175, 1518–1526 (2015).
Lawrence, J. S., Martins, C. L. & Drake, G. L. A family survey of lupus erythematosus. 1. Heritability. J. Rheumatol. 14, 913–921 (1987).
Wang, J. et al. Systemic lupus erythematosus: a genetic epidemiology study of 695 patients from China. Arch. Dermatol. Res. 298, 485–491 (2007).
Guga, S., Wang, Y., Graham, D. C. & Vyse, T. J. A review of genetic risk in systemic lupus erythematosus. Expert Rev. Clin. Immunol. 19, 1247–1258 (2023).
NVIDIA. Clara Parabricks (v4.3.0-1). https://www.nvidia.com/en-us/clara/genomics/ (2024).
Lin, M. F. et al. GLnexus: joint variant calling for large cohort sequencing. bioRxiv 343970. https://doi.org/10.1101/343970 (2018).
Danecek, P. et al. The variant call format and VCFtools. Bioinformatics 27, 2156–2158 (2011).
Chang, C. C. et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience 4, s13742–015-0047–8 (2015).
Loh, P.-R. et al. Reference-based phasing using the Haplotype Reference Consortium panel. Nat. Genet. 48, 1443–1448 (2016).
Altshuler, D. & Donnelly, P. & The International HapMap Consortium. A haplotype map of the human genome. Nature 437, 1299–1320 (2005).
Li, L. et al. The ChinaMAP reference panel for the accurate genotype imputation in Chinese populations. Cell Res. 31, 1308–1310 (2021).
Chen, J., Bardes, E. E., Aronow, B. J. & Jegga, A. G. ToppGene Suite for gene list enrichment analysis and candidate gene prioritization. Nucleic Acids Res. 37, W305–W311 (2009).
Robin, X. et al. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinforma. 12, 77 (2011).
Pain, O. et al. Evaluation of polygenic prediction methodology within a reference-standardized framework. PLOS Genet. 17, e1009021 (2021).
Li, D. Imputation disparities driven by recent selection and their impact on disease risk estimation in East and Southeast Asian populations. https://doi.org/10.5281/zenodo.17472213 (2025).
Boughton, A. P. et al. LocusZoom.js: interactive and embeddable visualization of genetic association study results. Bioinformatics 37, 3017–3018 (2021).
Marcus, J. H. & Novembre, J. Visualizing the geography of genetic variants. Bioinformatics 33, 594–595 (2017).


Acknowledgements

We gratefully acknowledge funding support from the Shenzhen-Hong Kong Jointly Funded Project (Category A; SGDX20230116093201002), the Stability Support for Higher Education from Shenzhen Science and Technology Program, the Sanming Project of Medicine in Shenzhen (SZSM202411006), the Guangdong Natural Science Foundation Youth Enhancement Project (2024A1515030287), the Guangdong Provincial Science and Technology Department (GDSTC; 2024A0505050001), the 1 + 1 + 1 CUHK-CUHK(SZ)-GDSTC Joint Collaboration Fund (2025A0505000056), the Guangdong Basic Research Center of Excellence for Aggregate Science (C10120250191), the National Natural Science Foundation of China (Grant No. 82471825), National Research Foundation of Korea (NRF-2021R1A6A1A03038899) and the National Facility & Equipment Center (NFEC-2023R1A6C101A009). We also extend our thanks to the Warshel Institute for Computational Biology and their funding support from Shenzhen City and Longgang District (LGKCSDPT2024001). The authors thank the anonymous reviewers for their constructive comments and valuable suggestions that helped to improve the quality of the manuscript.

Author information

These authors contributed equally: Dingyang Li, Pattarin Tangtanatakul.

Authors and Affiliations

School of Medicine, Warshel Institute for Computational Biology, The Second Affiliated Hospital, The Chinese University of Hong Kong, Shenzhen, Guangdong, China
Dingyang Li, Hsi-Yuan Huang, Yang-Chi-Dung Lin, Chengjia Li, Yidan Chen, Lizhi Cai, Jinglu Zhao, Jinhan Yu, Yihang Zhou, Yuan Xu, Yue Xiao, Rui Sun, Guangjun Yu, Hao Sun, Hsien-Da Huang & Yong-Fei Wang
Guangdong Basic Research Center of Excellence for Aggregate Science, School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen, Guangdong, China
Dingyang Li, Hsien-Da Huang & Yong-Fei Wang
Department of Transfusion Medicine and Clinical Microbiology, Faculty of Allied Health Sciences, Bangkok, Thailand
Pattarin Tangtanatakul
Centre of Excellence in Immunology and Immune-Mediated Diseases, Department of Microbiology, Chulalongkorn University, Bangkok, Thailand
Pattarin Tangtanatakul
Department of Paediatrics & Adolescent Medicine, Queen Mary Hospital, The University of Hong Kong, Hong Kong, China
Yao Lei & Wanling Yang
Laboratory for Statistical and Translational Genetics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
Xiaoxi Liu, Yuki Ishikawa & Chikashi Terao
Guangdong Provincial Key Laboratory of Digital Biology and Drug Development, The Chinese University of Hong Kong, Shenzhen, Guangdong, China
Hsi-Yuan Huang, Yang-Chi-Dung Lin & Hsien-Da Huang
Faculty of Medicine, Section of Translational Medicine, Mahidol University, Ramathibodi Hospital, Bangkok, Thailand
Prapaporn Pisitkul
Division of Allergy, Immunology, and Rheumatology, Department of Medicine, Faculty of Medicine, Ramathibodi Hospital, Mahidol University, Bangkok, Thailand
Thanitta Suangtamai
Department of Medical Sciences, Ministry of Public Health, Nonthaburi, Thailand
Punna Kunhapan
Immunology Division, Department of Microbiology, Faculty of Medicine, Chulalongkorn University, Bangkok, Thailand
Nattiya Hirankarn
Clinical Research Center, Shizuoka General Hospital, Shizuoka, Japan
Chikashi Terao
Department of Applied Genetics, School of Pharmaceutical Sciences, University of Shizuoka, Shizuoka, Japan
Chikashi Terao
Department of Biology, Kyung Hee University, Seoul, Republic of Korea
Kwangwoo Kim
Department of Rheumatology, Hanyang University Hospital for Rheumatic Diseases, Seoul, Republic of Korea
Sang-Cheol Bae
Hanyang University Institute for Rheumatology Research and Hanyang Institute of Bioscience and Biotechnology, Seoul, Republic of Korea
Sang-Cheol Bae
Department of Rheumatology and Immunology, The Second People’s Hospital, The First Affiliated Hospital of Shenzhen University, Shenzhen, China
Meiying Wang

Authors

Dingyang Li
Pattarin Tangtanatakul
Yao Lei
Xiaoxi Liu
Hsi-Yuan Huang
Yang-Chi-Dung Lin
Chengjia Li
Yidan Chen
Lizhi Cai
Jinglu Zhao
Prapaporn Pisitkul
Thanitta Suangtamai
Jinhan Yu
Yihang Zhou
Yuan Xu
Yue Xiao
Punna Kunhapan
Rui Sun
Guangjun Yu
Hao Sun
Nattiya Hirankarn
Yuki Ishikawa
Chikashi Terao
Kwangwoo Kim
Sang-Cheol Bae
Meiying Wang
Hsien-Da Huang
Wanling Yang
Yong-Fei Wang

Contributions

Y.-F. Wang conceived the study. D. Li took the lead in data analysis. P. Tangtanatakul, Y. Lei, H.-Y. Huang, Y.-C.-D. Lin, C. Li, Y. Chen, L. Cai, J. Zhao, P. Pisitkul, T. Suangtamai, J. Yu, Y. Zhou, Y. Xu, Y. Xiao, P. Kunhapan, R. Sun, G. Yu, H. Sun, N. Hirankarn, M. Wang, H.-D. Huang and W. Yang undertook subject recruitment and collected phenotype data. Y. Ishikawa, C. Terao, K. Kim and S.-C. Bae provided SLE summary statistics from East Asian populations for PRS training. D. Li, P. Tangtanatakul, Y. Lei, X. Liu, W. Yang and Y.-F. Wang carried out data analysis and interpretation. D. Li, P. Tangtanatakul and Y.-F. Wang wrote the manuscript. All authors read and contributed to the manuscript.

Corresponding author

Correspondence to Yong-Fei Wang.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Communications Biology thanks Jeffrey Beck and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editors: Eirini Marouli and Aylin Bircan. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Transparent Peer Review file (PDF)

Supplementary Information (PDF)

Description of Additional Supplementary files (PDF)

Supplementary Data 1 (XLSX)

Supplementary Data 2 (XLSX)

Supplementary Data 3 (XLSX)

Supplementary Data 4 (XLSX)

Supplementary Data 5 (XLSX)

Reporting Summary (PDF)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Li, D., Tangtanatakul, P., Lei, Y. et al. Imputation disparities driven by recent selection and their impact on disease risk estimation in East and Southeast Asian populations. Commun Biol 8, 1822 (2025). https://doi.org/10.1038/s42003-025-09214-1

Received: 28 April 2025
Accepted: 10 November 2025
Published: 21 November 2025
Version of record: 29 December 2025
DOI: https://doi.org/10.1038/s42003-025-09214-1