Introduction

Inflammatory Bowel Disease (IBD) includes Crohn’s disease (CD) and ulcerative colitis (UC). The etiology of IBD is multifactorial, involving a complex interplay of genetic predispositions, environmental influences, and dysregulation of gut microbiota, leading to an aberrant immune response1. While the precise cause of IBD remains elusive, the clinical progression of the disease is notably heterogeneous among patients, which makes it an unpredictable disease2.

More than 240 genetic variants related to IBD have been identified3,4,5. Unlike classical Mendelian disorders, IBD is a genetically complex disease, and traditional genetic analyses are not sufficient to capture the disease's complexity6. However, genetic studies have helped, in some cases, to identify which individuals are at higher risk of IBD and which IBD patients will suffer a disabling disease course7. Host genetic factors are known to influence susceptibility to IBD. Furthermore, ancestry influences both the risk of developing the disease and its presentation8. An important limitation of the first genetic studies in IBD is that they were primarily conducted in people of European ancestry9. Recent genetic studies have incorporated diverse populations, revealing that IBD prediction improves when data from multiple ancestry groups are used rather than data from a single population. This approach has significant implications for identifying population-specific variants, which could facilitate the development of targeted treatments10. South American populations, such as the Chilean population, are underrepresented in genome-wide association studies (GWAS). Our study aimed to explore, in a South American sample, the relationship between ancestry proportion and IBD clinical phenotypes. Additionally, we assessed the impact of IBD risk variants previously identified in IBD GWAS on clinical outcomes of the disease. To accomplish this objective, we used traditional statistical analysis and machine learning tools to develop predictive models.

Methods

Patient recruitment

We conducted an observational, prospective study at Hospital San Borja Arriarán (HSBA), a tertiary referral center for IBD in Santiago, Chile. The study included patients from similar socioeconomic backgrounds, classified as working class (D) and lower middle class (C3) according to the scale of the Association of Market Researchers and Public Opinion, Chile11. Patients were enrolled if they had an IBD diagnosis supported by clinical, endoscopic, histologic, and imaging data according to clinical guidelines12,13,14 and International Classification of Diseases criteria. Patients were invited to participate during scheduled colonoscopies requested by their doctors. A comprehensive database of relevant clinical data was compiled for all participants at the time of recruitment. We collected clinical data common to UC and CD, including sex, age, age at diagnosis, diagnosis before 40 years of age, alcohol consumption and smoking habits, family history of IBD, and extraintestinal manifestations (EIMs). Main IBD phenotypes were classified according to the Montreal classification15. We also recorded history of cytomegalovirus, Clostridioides difficile, and coronavirus disease 2019 (COVID-19) infection; laboratory parameters; IBD resective surgery; pouch; current use of steroids, immunomodulators, and biological therapies; anti-TNF-naïve status; history of discontinuation or failure of anti-TNF therapy; primary non-response to anti-TNF; loss of response to anti-TNF; history of immunogenicity to biological therapy; clinical activity (Harvey Bradshaw index for CD and total Mayo score for UC); and endoscopic activity (Simple Endoscopic Score for Crohn’s Disease (SES-CD) for CD and endoscopic Mayo score for UC). Clinical remission was defined16 as a Harvey Bradshaw index below 5 for CD and a total Mayo score below 3 for UC. Endoscopic remission was defined16 as an SES-CD below 3 for CD and an endoscopic Mayo score of 0 for UC.
Histologic remission was defined by the absence of erosion, ulceration, epithelial damage, and neutrophils17,18. We defined prolonged clinical and endoscopic remission as clinical and endoscopic remission over the last five years, and frequent relapse as more than one flare per year over the previous five years. Furthermore, a control group comprised patients who underwent a colonoscopy indicated by their doctor, had no conditions such as IBD, immune disorders, or cancer, and had normal findings on the exam. Ethics approval was obtained from the Institutional Review Boards of the Servicio de Salud Metropolitano Central/HSBA (IRB:43/2022) and the Pontificia Universidad Católica de Chile (IRB:220228001). All individuals provided written informed consent. All methods were performed in accordance with the relevant guidelines and regulations.

Genotyping

Five mL of blood was drawn from each participant and collected in ethylenediaminetetraacetic acid disodium salt (EDTA) tubes. DNA was then extracted with the Invisorb Blood Universal purification kit (Invitek, ref. 1031150200) according to the manufacturer’s instructions. Samples were stored at −80 °C until genotyped at Erasmus MC, the Netherlands, where 725,497 single nucleotide polymorphisms (SNPs) were assayed using Illumina’s Infinium Global Screening Array.

Genotyping QC

Genotype quality control (QC) was performed in RStudio (R version 4.2.2) with the plinkQC library. The perMarkerQC function was used to assess per-marker missingness rates across samples, deviation from Hardy-Weinberg equilibrium (HWE), and minor allele frequency (MAF), with a MAF threshold of 0.01 (Supplementary Fig. 1). Additionally, the perIndividualQC function was used to evaluate heterozygosity rates, missingness, concordance of assigned sex with SNP-inferred sex, relatedness to other study individuals, and genetic ancestry of the samples in the PLINK dataset. The QC results for individuals and markers are shown in Supplementary Figs. 1 and 2.
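The per-marker filtering idea can be illustrated with a minimal Python sketch. It is not the plinkQC implementation: the function name per_marker_qc, the toy genotype matrix, and the missingness cut-off of 0.01 are assumptions made for illustration, and only the MAF threshold of 0.01 comes from the text. Genotypes are assumed to be coded as 0, 1, or 2 copies of the alternate allele, with NaN for missing calls.

```python
import numpy as np

def per_marker_qc(genotypes, maf_min=0.01, max_missing=0.01):
    """Return a boolean mask of SNP columns passing MAF and missingness filters.

    genotypes: (n_individuals, n_snps) array of alternate-allele counts
    (0, 1, 2), with np.nan marking a missing call.
    """
    g = np.asarray(genotypes, dtype=float)
    missing_rate = np.isnan(g).mean(axis=0)       # fraction of missing calls per SNP
    called = (~np.isnan(g)).sum(axis=0)           # individuals with a call per SNP
    with np.errstate(invalid="ignore", divide="ignore"):
        # Two alleles per called individual
        alt_freq = np.nansum(g, axis=0) / (2 * called)
        maf = np.minimum(alt_freq, 1 - alt_freq)  # minor allele frequency
        keep = (maf >= maf_min) & (missing_rate <= max_missing)
    return keep

# Hypothetical toy data: 4 individuals x 4 SNPs
toy = np.array([
    [0, 1, 0, 2],
    [0, 1, 0, np.nan],
    [0, 2, 0, 1],
    [0, 0, 1, 1],
])
mask = per_marker_qc(toy)
print(mask)
```

In this toy matrix the first SNP is monomorphic and the last has a missing call in one of four individuals, so both would be dropped under the assumed thresholds.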

Estimation of genetic ancestry

We performed a global ancestry analysis using ADMIXTURE19. We employed a reference panel of populations obtained from the 1000 Genomes Project and HapMap for our analyses. This included the Native American population (AMR = 43 unrelated individuals), the European population (CEU = 56 unrelated individuals), the African population (YRI = 55 unrelated individuals), and our dataset of 342 individuals from the Chilean population. Of note, the 43 Native American samples exhibited 99% or higher Native American ancestry. This reference group was assembled from several populations: ten individuals from the Nahua, six from the Maya, two from the Quechua, and twenty-five from the Aymara. Additionally, reference populations from the 1000 Genomes Project (ASW: African Ancestry in Southwest USA, TSI: Toscani in Italy, IBS: Iberian Population in Spain, MXL: Mexican Ancestry in Los Angeles, USA, PUR: Puerto Ricans in Puerto Rico, CLM: Colombians from Medellin, Colombia, PEL: Peruvians from Lima, Peru) were included for comparison in the population structure and admixture analyses.

PLINK20 was used to manipulate the VCF and bed file formats from HapMap, obtaining a total of 41,193 SNPs with genotypes for all 496 individuals in the study. To reduce the impact of linkage disequilibrium on our ancestry estimation, we pruned and filtered the SNPs using the PLINK options (--indep-pairwise 50 10 0.1 --geno 0.01), resulting in a refined set of 23,716 SNPs suitable for ADMIXTURE analysis. We then used the ADMIXTURE cross-validation option (--cv) to determine the optimal number of ancestral populations, or clusters, for a supervised analysis. ADMIXTURE was run for two to six possible ancestral groups (K = 2 … K = 6) to identify the number of ancestral populations with the lowest CV error (Supplementary Fig. 3). A K value of 3 yielded the lowest average CV error, indicating that three ancestral populations most accurately represent the genetic background of the Chilean individuals in this study. For comparison, we conducted an additional ADMIXTURE analysis incorporating 1000 Genomes (1000G) data. Using PLINK, we merged genotype data from 962 individuals across 232,297 SNPs, including reference populations ASW (55), TSI (107), IBS (107), MXL (64), PUR (104), CLM (97), PEL (86), and CHI (342). After variant pruning and minor allele frequency (MAF) filtering (MAF > 0.05), a total of 96,786 markers were retained for ADMIXTURE analysis (Fig. 1).
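Choosing K by the lowest cross-validation error can be sketched as follows. The sketch assumes that each run with --cv writes a log line of the form "CV error (K=3): 0.548"; the helper best_k and the error values in example_log are hypothetical placeholders, not results from this study.

```python
import re

# Assumed log-line format, e.g. "CV error (K=3): 0.548"
CV_LINE = re.compile(r"CV error \(K=(\d+)\):\s*([0-9.]+)")

def best_k(log_text):
    """Return (K with the lowest CV error, {K: error}) parsed from log text."""
    errors = {int(k): float(err) for k, err in CV_LINE.findall(log_text)}
    if not errors:
        raise ValueError("no CV error lines found")
    return min(errors, key=errors.get), errors

# Hypothetical values for illustration only (one line per K = 2..6)
example_log = """\
CV error (K=2): 0.571
CV error (K=3): 0.548
CV error (K=4): 0.553
CV error (K=5): 0.559
CV error (K=6): 0.566
"""

chosen_k, cv_errors = best_k(example_log)
print(chosen_k, cv_errors)
```

In practice the log lines from the separate runs (one per K) would be concatenated before parsing.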

Fig. 1

Population Structure and Genetic Ancestry Composition of Chilean Individuals. (A) Principal Component Analysis (PCA): PCA plot of SNP data showing clustering by superpopulation (AFR: African populations, AMR: American populations, EAS: East Asian populations, EUR: European populations, SAS: South Asian populations). PC1 (47%) and PC2 (26%) capture most of the genetic variation, with Chilean individuals clustering with the Admixed American populations, reflecting their admixed ancestry. (B) Global ancestry proportions of Chilean individuals inferred using ADMIXTURE (K = 3). Three primary ancestry components are identified: European (EUR, green), Amerindian (AMR, orange), and African (AFR, red). Reference populations from the 1000 Genomes Project (ASW, TSI, IBS, MXL, PUR, CLM, PEL) were included for comparison. (C) Ancestry Proportions: Boxplots displaying the proportion of European, Amerindian, and African ancestry in the Chilean cohort. The results highlight the predominance of European (median = 58%) and Amerindian (median = 39%) ancestry, with a smaller African (median = 3%) component.

Furthermore, a genetic Principal Component Analysis (PCA) was conducted using PLINK and the 1000 Genomes data. The PCA was performed using a total of 107,891 SNPs from 2915 unrelated individuals, retaining 20 components (Fig. 1). Finally, the ADMIXTURE and PCA results were visualized using the ggplot2 and tidyverse libraries in RStudio (R version 4.2.2).

Statistical methods

We analyzed the association between phenotype and ancestry proportions using the Chi-square test and Fisher’s exact test, and calculated odds ratios using the Wald method. For these analyses, we used the epitools, readxl, and rapportools libraries in R version 4.2.2. Next, we explored the association between Amerindian ancestry proportion (AMR) and categorical (demographic and clinical) variables, and the relationship between AMR and numerical variables. These analyses were performed using Python libraries such as pandas, seaborn, matplotlib, and scipy.stats. To compare the medians of quantitative variables between two categorical groups, we employed the non-parametric Mann-Whitney U test. For categorical variables, we used the Chi-square test and Fisher’s exact test to assess significant differences between groups. Odds ratios (OR) and confidence intervals (CI) were calculated to further evaluate the associations. We considered a p-value < 0.05 as indicative of significance. From published IBD GWAS1,2,21, we investigated 226 SNPs related to IBD among 291 Chilean IBD genotypes obtained from the bim, fam, and ped files from the Illumina array after performing the GWAS quality control. Using R studio version 4.2.3 and the libraries genio, plinkFile, readr, and tidyverse, we filtered the 226 mentioned variants (Supplementary Table 1). A total of 171 variants were found in our Chilean cohort. This information was integrated to build a database merging the clinical data with the genotypes. We aimed to explore the potential association between SNP genotypes related to IBD and High Amerindian Ancestry Proportion (HAAP), defined as greater than 43%, which corresponds to the third quartile of the Amerindian ancestry proportion in our sample.
A contingency table was constructed, and a Chi-square test was conducted in Python, using libraries such as pandas, seaborn, and matplotlib.pyplot, to determine the statistical significance of the association. The significance threshold was set at 0.05. Subsequently, we conducted logistic regression analyses, incorporating covariates such as age, sex, smoking, and current anti-TNF therapy to account for potential confounders. These analyses used the readxl library for data import, dplyr for data manipulation, and stats for statistical modeling in R. The same analysis was performed to explore the association between SNPs related to IBD and prolonged clinical and endoscopic remission.
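The tests described above can be sketched in Python with scipy.stats. This is an illustrative sketch on simulated data, not the study's analysis script: the arrays amr and outcome are randomly generated placeholders, the helper wald_odds_ratio is a hypothetical name, and the HAAP cut-off is taken as the third quartile of the simulated ancestry values with an inclusive comparison, which is an assumption.

```python
import numpy as np
from scipy import stats

def wald_odds_ratio(table, alpha=0.05):
    """Odds ratio with a Wald (normal-approximation) confidence interval.

    table = [[a, b], [c, d]]: rows are exposed/unexposed,
    columns are outcome yes/no. All four cells must be non-zero.
    """
    (a, b), (c, d) = table
    odds_ratio = (a * d) / (b * c)
    se_log_or = np.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = stats.norm.ppf(1 - alpha / 2)
    lower = np.exp(np.log(odds_ratio) - z * se_log_or)
    upper = np.exp(np.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper

# Simulated placeholder data (not study data)
rng = np.random.default_rng(0)
amr = rng.beta(8, 12, size=200)        # hypothetical ancestry proportions
outcome = rng.random(200) < 0.3        # hypothetical binary clinical outcome

# High-ancestry group: at or above the third quartile (assumed inclusive)
threshold = np.quantile(amr, 0.75)
haap = amr >= threshold

# 2x2 contingency table: HAAP (yes/no) by outcome (yes/no)
table = np.array([
    [np.sum(haap & outcome), np.sum(haap & ~outcome)],
    [np.sum(~haap & outcome), np.sum(~haap & ~outcome)],
])

chi2, p_chi2, dof, expected = stats.chi2_contingency(table)
_, p_fisher = stats.fisher_exact(table)

# Compare the ancestry distribution between outcome groups
u_stat, p_mwu = stats.mannwhitneyu(amr[outcome], amr[~outcome],
                                   alternative="two-sided")

or_est, ci_low, ci_high = wald_odds_ratio(table)
print(f"chi2 p={p_chi2:.3f}, Fisher p={p_fisher:.3f}, MWU p={p_mwu:.3f}")
print(f"OR={or_est:.2f} (95% CI {ci_low:.2f}-{ci_high:.2f})")
```

The Wald interval is computed on the log-odds scale and breaks down when any cell is zero; Fisher's exact test is the usual fallback for sparse tables.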

Furthermore, building on our previous study, in which we developed a regression model for various binary clinical outcomes22, we constructed a classification model specifically for prolonged clinical/endoscopic remission. The aim was to examine the relevance of various features in predicting this outcome. These features encompassed clinical outcomes, laboratory parameters, ancestry proportions, and SNPs. To this end, decision tree and random forest techniques were employed to better understand the genetic and clinical factors associated with prolonged clinical/endoscopic remission. The models were built in Python: pandas was used for data manipulation and analysis, numpy for mathematical operations, and matplotlib.pyplot and seaborn for data visualization. Data preprocessing involved scaling with StandardScaler and handling missing values using SimpleImputer. The data were split into training and testing sets using the train_test_split function from sklearn.model_selection.

We experimented with several classification algorithms, including Logistic Regression, Decision Tree Classifier, and Random Forest Classifier from the sklearn.linear_model, sklearn.tree, and sklearn.ensemble Python libraries. Model performance was evaluated with metrics from sklearn.metrics, such as the confusion matrix, classification report, precision-recall curve, and recall score. Preprocessing tools from sklearn.preprocessing, such as MinMaxScaler, LabelEncoder, and OneHotEncoder, were applied as needed. To optimize the models, we used GridSearchCV from sklearn.model_selection for hyperparameter tuning.
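A minimal sketch of such a workflow with scikit-learn is given below. It runs on synthetic data and is not the study's code: the feature matrix, the outcome, the use of a Pipeline, the hyperparameter grid, the median imputation strategy, and the choice of recall as the tuning metric are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic placeholder data: 300 samples, 6 numeric features, binary outcome
rng = np.random.default_rng(42)
n_samples, n_features = 300, 6
X = rng.normal(size=(n_samples, n_features))
y = (X[:, 0] + 0.5 * X[:, 1]
     + rng.normal(scale=0.5, size=n_samples) > 0).astype(int)
X[rng.random(X.shape) < 0.05] = np.nan   # inject about 5% missing values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Imputation and scaling are fitted on training folds only, inside the pipeline
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("clf", RandomForestClassifier(random_state=0)),
])

# Hypothetical grid; the study's actual grid is not reported in the text
param_grid = {
    "clf__n_estimators": [50, 100],
    "clf__max_depth": [3, None],
}
search = GridSearchCV(pipe, param_grid, scoring="recall", cv=5)
search.fit(X_train, y_train)

y_pred = search.predict(X_test)
print(search.best_params_)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

# Impurity-based feature importances of the tuned forest (sum to 1)
importances = search.best_estimator_.named_steps["clf"].feature_importances_
print(importances.round(3))
```

Wrapping the preprocessing steps in a pipeline keeps the imputer and scaler from seeing the held-out data during cross-validation, which is one common way to avoid optimistic performance estimates.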

Results

We genotyped 384 IBD patients and controls at Erasmus MC, the Netherlands, using Illumina’s Infinium Global Screening Array, which assays 725,497 SNPs. After applying standard filters and quality control, 357,392 variants and 342 individuals (291 cases and 51 controls) remained for further analysis. Most of the SNPs discarded (n = 232,141) were not polymorphic in the Chilean individuals (MAF < 0.01, Supplementary Figs. 1 and 2). The genetic structure of the Chilean population, as revealed by PCA and ADMIXTURE analysis, demonstrates its admixed nature (Fig. 1). In the PCA, Chilean individuals cluster with the Admixed American (AMR) populations, confirming their mixed genetic background. This positioning aligns with historical admixture events involving European colonizers, indigenous Native American groups, and African descendants. The ADMIXTURE analysis (Fig. 1B) further supports this by identifying three primary ancestral clusters (African, European, and Native American) within the Chilean population. European and Native American ancestries predominate, with a smaller African contribution. Figure 1C quantifies these proportions, showing that most Chilean individuals have a predominant European component, a substantial Native American component, and a minor African component. Together, these results highlight the diverse genetic heritage of the Chilean population, shaped by historical migration, colonization, and the transatlantic slave trade, and distinguish Chileans from other reference populations such as African Americans (ASW), Puerto Ricans (PUR), and Peruvians (PEL). In Fig. 1, green represents European ancestry, orange represents Amerindian ancestry, and red represents African ancestry.
The ADMIXTURE cross-validation error was evaluated for K = 2–6 ancestral population clusters within the Chilean cohort (Supplementary Fig. 3). The optimal number of ancestral populations was K = 3, using CEU (European), AMR (Amerindian), and YRI (African) as reference panels. PCA was conducted to examine the ancestry relationships (Fig. 1A). The distribution of ancestry proportions in the studied population is provided in Table 1. This cohort’s median ancestry distribution comprised 58% European, 39% Amerindian, and 3% African ancestry. Notably, the third quartile of Amerindian ancestry proportion was 42.9%, so we set the threshold for a high Amerindian ancestry proportion at 43% for subsequent analyses. The analysis of IBD risk did not reveal any significant differences based on ancestry proportion when comparing cases and controls (Supplementary Tables 2 and 3).

Table 1 Distribution of ancestry proportion in the studied cohort.

A total of 291 patients with IBD were included in the study, with 216 (74%) diagnosed with UC and 75 (26%) with CD. The clinical characteristics of the investigated IBD patients are summarized in Table 2. The median age of the patients was 50 years (range: 15–81), and the median duration of disease was nine years (range: 0–49 years). Extraintestinal manifestations were reported by 36% of the IBD patients, and more than 50% had a history of hospitalization. Most patients were non-smokers. Surgical resection was reported by 16% of the patients, and 10% had a history of Clostridioides difficile infection. At the time of the study, 15% of the patients were using steroids, 15% were on anti-TNF therapy, 81% were anti-TNF naive, and 31% were using thiopurines. According to the Montreal classification, 55% of UC cases had extensive colitis, 26% had left-sided colitis, 18% had proctitis, and information on disease extent was unavailable for 1% of cases. In the CD group, only 8% were diagnosed before the age of 17. The most common disease location in CD was colonic (L2), observed in 51% of cases, followed by ileocolonic (L3) in 33% of cases. Upper digestive tract involvement (L4) was present in only 9% of CD cases, and 43% had perianal involvement. The most frequently observed CD behavior was inflammatory (B1, 41%), followed by penetrating (B3, 33%) and stricturing (B2, 25%).

Table 2 Clinical characterization of the studied cohort.

Several findings emerged when examining the association between AMR and clinical variables in the UC group. First, the median AMR was higher in patients diagnosed before the age of 40 than in those diagnosed later (39.9% versus 37.4%, P value = 0.01). Conversely, it was lower in patients who maintained clinical and endoscopic remission over the last five years (35% versus 39%, P value = 0.02). Interestingly, a higher median AMR was associated with IBD reactivation during a COVID-19 infection (43% versus 39%, P value = 0.006) (Table 3). Family history of IBD was also among the studied variables; however, we did not find any association with the median AMR (40% vs. 39%, P value = 0.39). In addition, we explored the impact of a HAAP (High Amerindian Ancestry Proportion, threshold 43%) on clinical outcomes within the UC group. HAAP was significantly associated with resective surgery (57%, OR = 4.27, CI = 1.41–12.94, P value = 0.01), pouch (70%, OR = 1.86–3.01, P value = 0.003), clinical and endoscopic remission over one year (19%, OR = 0.46, CI = 0.24–0.87), and IBD reactivation during a COVID-19 infection (62%, OR = 5.16, CI = 1.61–6.53), as shown in Table 4. That is, 70% of UC patients who required pouch formation had a HAAP. Moreover, among the UC patients who maintained clinical and endoscopic remission over one year, 81% had a lower Amerindian ancestry proportion, while only 19% had a HAAP. Additionally, 62% of UC patients who experienced IBD reactivation during a COVID-19 infection had a HAAP.

Table 3 Significant associations between ulcerative colitis clinical outcomes and median Amerindian ancestry.
Table 4 Significant associations between ulcerative colitis outcomes and high Amerindian ancestry.

In the CD group, the median AMR was lower in patients with perianal disease than in those without (33.5% versus 39.5%, P value = 0.03). Additionally, only 6% of the CD patients who underwent resective surgery had a HAAP (OR = 0.17, CI = 0.03–0.83, P value = 0.02). These findings suggest a potential association between Amerindian ancestry and a lower likelihood of developing perianal disease and of requiring resective surgery in the CD group (Table 5), pointing to a possible role of Amerindian ancestry in shaping the CD phenotype within this cohort.

Table 5 Significant associations between Crohn’s disease outcomes and Amerindian ancestry based on (a) median ancestry proportion, and (b) high Amerindian ancestry.

We then combined data from UC and CD patients to explore the potential association between Amerindian ancestry and clinical variables common to IBD. In IBD patients diagnosed before the age of 40, the median AMR was higher than in those diagnosed later (40% versus 38%, P value = 0.03). We found a similar trend in IBD patients who experienced one or more outcomes associated with severe disease, such as surgery, failure of anti-TNF treatment, pouch, or flares in the last five years. In this group, the median AMR was 39.4% compared to 34.9% in the reference group (P value = 0.0007). Conversely, a lower median Amerindian ancestry proportion was observed in IBD patients currently on biological therapy (36.6% versus 39.3%, P value = 0.03) and in those who achieved sustained clinical and endoscopic remission in the last five years (36.2% versus 39.6%, P value = 0.0006). Additionally, in the IBD group, we found a significant association between Amerindian ancestry proportion and a history of gastrointestinal infection, previous Clostridioides difficile infection, and prolonged clinical and endoscopic remission (over five years) (Tables 6 and 7). Figure 2 summarizes these results.

Table 6 Significant associations between inflammatory bowel disease clinical outcomes and median Amerindian ancestry.
Table 7 Significant associations between inflammatory bowel disease and high Amerindian ancestry.
Fig. 2

Impact of Amerindian Ancestry on Clinical Variables in IBD Subgroups. (A) Median Amerindian ancestry proportion across various clinical features in IBD (median in the group with the feature versus median in the group without it). A higher median Amerindian ancestry proportion was associated with early-onset IBD/UC, a severe disease course (IBD), and UC flare during COVID-19 infection. Conversely, a lower median Amerindian ancestry proportion was linked to prolonged clinical and endoscopic remission in UC and IBD, current use of biological therapy in IBD, and perianal disease in CD. (B) Impact of High Amerindian Ancestry Proportion on clinical outcomes. We defined HAAP as an Amerindian ancestry proportion equal to or greater than 43%. In the UC group, a high proportion of patients with a history of pouch formation, surgical resection, or IBD flare during a COVID-19 infection had HAAP. Conversely, most patients who achieved clinical and endoscopic remission over a year (UC), underwent resective surgery (CD), had a previous history of gastrointestinal infection (IBD), had a past Clostridioides difficile infection (IBD), or had prolonged clinical and endoscopic remission (IBD) did not have HAAP. IBD: Inflammatory Bowel Disease, UC: Ulcerative Colitis, CD: Crohn’s Disease, HAAP: High Amerindian Ancestry Proportion, COVID-19: coronavirus disease 2019.

We also investigated the potential association between HAAP and genotypes of SNPs previously associated with IBD. The SNPs significantly associated with HAAP are shown in Supplementary Table 5. Furthermore, we performed a gene set enrichment analysis with gProfiler, using the genes to which these SNPs were mapped (Table 8). Our analysis revealed a significant enrichment of cellular response to interleukin-6 (GO:0071354, adjusted p-value = 0.05) and histone H3Y41 kinase activity (GO:0035401, adjusted p-value = 0.03) within the gene/protein set. These findings might suggest an involvement of the IL-6 cytokine that varies with Amerindian ancestry. Further studies should explore how histone modifications influence gene expression patterns in IBD in populations such as those of Latin American countries, where epigenetic changes may account for the rising incidence of IBD23.

Table 8 Gene set enrichment analysis (gProfiler) of genes mapped to single nucleotide polymorphisms associated with Amerindian ancestry proportion.

We found significant associations between SNPs previously linked to IBD and prolonged clinical and endoscopic remission, as shown in Table 9 and Supplementary Table 4. Additionally, a gene set enrichment analysis (Table 10) revealed that the genes associated with these SNPs were connected to specific enzyme activities, including L-cystine L-cysteine-lyase (deaminating) (GO:0044540, adjusted P value = 0.04), homocysteine desulfhydrase activity (GO:0047982, adjusted P value = 0.04), cystathionine gamma-lyase activity (GO:0004123, adjusted P value = 0.04), selenocystathionine gamma-lyase activity (GO:0098606, adjusted P value = 0.04), and L-cysteine desulfhydrase activity (GO:0080146, adjusted P value = 0.04). The connection between these enzymes, prolonged clinical and endoscopic remission in IBD, and microbiota interaction is an avenue for future research24,25,26.

Table 9 Genetic variants associated with prolonged clinical and endoscopic remission.
Table 10 Gene set enrichment analysis (gProfiler) of genes associated with SNPs linked to prolonged clinical and endoscopic remission in the Chilean IBD cohort studied.

As mentioned, we developed a classification model for prolonged clinical and endoscopic remission. This model allowed us to explore the feasibility of identifying individuals with a less aggressive disease course and a more favorable prognosis, and to evaluate the importance of features (clinical variables, laboratory parameters, ancestry proportions, and SNPs) for this outcome. Such a model can help evaluate the influence of both clinical and genetic factors on disease progression. The two most effective models in our study were the decision tree (abbreviated TD) and Random Forest (RF) models.

The TD model exhibited exceptional performance on the training data, achieving 100% accuracy, precision, recall, and F1 score (weighted average). The model demonstrated a precision of 97%, recall of 97%, and an F1 score of 97% (weighted average) on the testing data. See Supplementary Figure S4. Upon analyzing the variable importance in the TD model, we identified that the most significant factor was the history of outcomes related to a severe course (such as surgery, failure of anti-TNF treatment, pouch, or flares within the last five years), accounting for 80% of the model’s importance. Other influential factors included female sex (6%) and creatinine levels (5%). Please refer to Fig. 3 for further details.

Similarly, the RF model exhibited strong performance on the training data, achieving 100% accuracy, precision, recall, and F1 score (weighted average). On the testing data, the model achieved a precision of 100%, recall of 73%, and an F1 score of 84% (weighted average) (Supplementary Figure S4). Consistent with the TD model, the most important variable for this classifier was the history of outcomes related to a severe course, accounting for 75% of its importance. Other contributing factors included clinical and endoscopic remission in the last year (5%), creatinine levels (3%), hemoglobin levels (2%), age at diagnosis (1%), and loss of response to anti-TNF treatment (1%). Please refer to Fig. 4 for further details.

Fig. 3

Top ten features identified in the Decision Tree model for predicting Prolonged Clinical and Endoscopic Remission. In this classifier, the most important feature was a history of outcomes associated with a severe phenotype, including surgery, failure of anti-TNF treatment, and relapse within the past years. Other significant predictors included sex, creatinine levels, and the genetic variant rs921720.

Fig. 4

Top ten features identified in the Random Forest model for predicting Prolonged Clinical and Endoscopic Remission. The Random Forest classifier identified several key clinical features. These included a history of outcomes associated with a severe phenotype, such as surgery, failure of anti-TNF treatment, and relapse within the past years. Additionally, clinical and endoscopic remission over a year, creatinine levels, hemoglobin levels, age at diagnosis, loss of response to anti-TNF medication, glutamate pyruvate transaminase (GPT) levels, white cell count, the genetic variant rs7236492, and clinical remission were also found to be significant predictors.

Discussion

While inflammatory bowel disease (IBD) was initially believed to affect primarily individuals of European ancestry, there has been a significant shift in the epidemiological landscape, with an increasing prevalence observed in Latin America as well as in the Latino population in the United States. In Latin America, IBD is currently in an accelerating stage, marked by rising incidence and prevalence rates27. Meanwhile, the reported prevalence of IBD among Latino communities in the United States is approximately 383 per 100,000 person-years28,29.

Latin American populations differ from Caucasian populations in that they result from genetic admixture among European, Native American, and African ancestral populations30. Each Latin American population presents a unique pattern of these three ancestral groups, contributing to its distinct genetic makeup. This mixing of genetic backgrounds from multiple continents is reflected in the wide range of genetic variation and phenotypic characteristics observed across Latin American countries and regions31. Therefore, assessing how variations in ancestry may impact the IBD phenotype across populations can reveal differences that could facilitate the implementation of personalized medicine approaches. In our cohort, the predominant subtype of IBD was UC, accounting for 74% of cases, which is similar to previously reported rates in Latin America32. The average age of onset was 36 years, and approximately 36% of patients reported extraintestinal manifestations, as in previous studies in Latin America28,32,33. Regarding disease extent, pancolitis was the most common presentation in UC patients (55%), which aligns with findings from other Latin American studies32. Nevertheless, the distribution of UC extent varies across regions in Latin America. In Puerto Rico, the prevalence of proctitis (Montreal classification E1) was found to be as high as 55.3%34. Meanwhile, in Peru, the prevalence of left-sided colitis (Montreal classification E2) varied between 11.1% and 62.9% in different studies35,36. As for extensive colitis (Montreal classification E3), one Brazilian study reported a prevalence of 12%37, whereas in Argentina the prevalence was reported to be as high as 77%38. In CD patients, colonic disease was the most prevalent localization (51%), whereas only 16% showed isolated ileal involvement.
This differs from other IBD studies in which Latin American CD patients mainly developed ileocolonic disease28,32. Another difference was the rate of upper gastrointestinal involvement, which was found in 9% of our CD patients, twice the rate reported in other Latin American IBD studies32. Previous studies have shown that African American or Black, Hispanic, and Asian patients with CD tend to have a more extensive distribution of intestinal inflammation than non-Hispanic White patients. Specifically, in studies that examined disease location among different ancestries, higher proportions of non-Hispanic White patients had isolated ileal disease when directly compared with African American, Hispanic, or Asian patients with CD39. Furthermore, perianal CD was present in 43% of Crohn’s patients, higher than the 16.7% reported in other Latin American studies32. Interestingly, in Latin America, perianal involvement varies from 12% in Brazil to 53% in Peru40,41. Despite these differences, inflammatory behavior was the most prevalent in CD, which is consistent with observations in other Latin American IBD populations32,38. Overall, our findings demonstrate both similarities and differences in the characteristics of IBD in our cohort compared to previous studies conducted in Latin America.

On average, Chileans are 53% European and 42% Amerindian (the latter disaggregated into 25% Mapuche and 18% Aymara)42. The ancestry distribution in our Chilean IBD cohort was 58% European, 39% Amerindian, and 3% African. In our previous work, we found a significant association between a high Mapuche ancestry proportion and the risk of IBD43. However, we did not observe risk differences according to Amerindian ancestry proportion in this cohort. One possible explanation is that, in this study, we used a Native American ancestry proportion derived from a reference panel that included a broader Latin American population rather than focusing specifically on the Mapuche population. The proportion of Native American ancestry in Chile represents a combination of various Native American groups, including the Mapuche and Aymara populations44, so pooling these groups may have masked an effect specific to Mapuche ancestry. To address this issue, we estimated the Mapuche and Aymara ancestry proportions using the K = 4 clustering results, which included European, Aymara, Mapuche, and African groups, as opposed to the K = 3 clustering, which included only European, Amerindian, and African groups. However, no significant differences in IBD risk were observed (Supplementary Table 3). Another potential explanation is the use of a larger and different control group in our previous study (3,147 individuals of Chilean descent from a gallbladder cancer study)45. Hence, further investigation with a larger sample size is warranted to definitively explore the potential influence of ancestry on IBD risk.

We made some notable observations when exploring the relationship between ancestry and clinical outcomes in UC.
First, we found a higher median Amerindian ancestry in the group of patients diagnosed before age 40, suggesting a potential association between ancestry and early-onset UC. In contrast, patients who achieved prolonged clinical and endoscopic remission had a lower median Amerindian ancestry, indicating a possible negative correlation between Amerindian ancestry and sustained remission in UC.

Furthermore, interesting findings emerged among UC patients who underwent pouch surgery. Approximately 70% of these patients had HAAP. Similarly, 57% of UC patients who required surgery exhibited HAAP. These findings may suggest an association between a high Amerindian ancestry and a more severe phenotype in UC. Conversely, a lower median proportion of Amerindian ancestry was observed in CD patients with perianal disease. Moreover, among CD patients who required surgery, a large majority (94%) had a lower Amerindian ancestry. These contrasting observations suggest that the influence of Amerindian ancestry on disease severity and surgical outcomes may differ between UC and CD patients. While a higher Amerindian ancestry appears to be associated with a more severe phenotype in UC, a lower Amerindian ancestry may be linked to perianal disease and the need for surgery in CD patients. These differences could be attributed to various factors, including genetic, environmental, and immunological influences. It is possible that certain genetic variants or alleles associated with Amerindian ancestry contribute to an increased risk or severity of UC or are protective against CD in these patients. Additionally, environmental factors prevalent in populations with higher Amerindian ancestry may play a role in exacerbating disease severity. It is important to note that these associations between ancestry and disease characteristics are complex and multifactorial. Genetic and environmental factors interact in intricate ways, and additional research is needed to further understand the underlying mechanisms driving these differences. When analyzing the overall IBD group, it is important to consider the divergent effects of ancestry on UC and CD. Interestingly, as in UC, we found a higher median Amerindian ancestry in the subgroup of patients diagnosed before age 40.
In contrast, a lower median Amerindian ancestry was associated with prolonged clinical and endoscopic remission. However, these results should be interpreted in the context of the sample size discrepancy between UC and CD, with the UC cohort being almost three times larger. Apart from the genetic variability linked to the general risk of developing IBD, significant attention has been given to exploring the relationship between genetic variants and specific subtypes or characteristics of IBD, such as prolonged clinical and endoscopic remission (Tables 9 and 10). Importantly, our analyses indicated no significant differences in the use of anti-TNF therapy between high and low Amerindian ancestry groups, suggesting that treatment disparities are unlikely to influence the observed outcomes. These patient groups belonged to the same socioeconomic stratum and had equal access to biological therapies.

Several SNPs exhibit significant associations with prolonged clinical and endoscopic remission, as shown in Table 9. For rs6871626, the AA genotype shows a strong association with remission, with a multivariate OR of 5.86 (CI 1.80–19.08, p = 0.002). The AG and AA genotypes of rs7134599 are associated with remission, with multivariate ORs of 2.43 (CI 1.06–5.82, p = 0.04) and 3.88 (CI 1.09–12.44, p = 0.02), respectively. The TT genotype of rs11150589 shows a multivariate OR of 4.02 (CI 1.40–11.23, p = 0.007). Finally, the AA genotype of rs2651244 is strongly associated with remission, with a multivariate OR of 5.12 (CI 1.55–16.00, p = 0.005). These findings highlight significant genetic influences on IBD remission outcomes.
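As a reading aid, the odds ratios and confidence intervals quoted above can be related to their p-values with a short back-calculation. The following is a minimal sketch in Python, assuming that the reported intervals are 95% Wald intervals that are symmetric on the log-odds scale; the text does not state the interval method, and the function name `wald_summary` is purely illustrative. Because the inputs are rounded and the original models may have used a different test, the recovered p-values are expected to match the reported ones only approximately.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def wald_summary(odds_ratio, ci_low, ci_high):
    """Back-calculate SE, z, and a two-sided p-value from an OR and its 95% CI.

    Assumption: the interval is a Wald interval, symmetric on the log-odds scale.
    """
    log_or = math.log(odds_ratio)
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * Z_95)
    z = log_or / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return se, z, p


# Multivariate estimates quoted in the text: (OR, lower CI bound, upper CI bound).
reported = {
    "rs6871626 AA": (5.86, 1.80, 19.08),
    "rs7134599 AG": (2.43, 1.06, 5.82),
    "rs7134599 AA": (3.88, 1.09, 12.44),
    "rs11150589 TT": (4.02, 1.40, 11.23),
    "rs2651244 AA": (5.12, 1.55, 16.00),
}

for label, (or_value, low, high) in reported.items():
    se, z, p = wald_summary(or_value, low, high)
    print(f"{label}: SE(log OR)={se:.3f}, z={z:.2f}, approx. p={p:.4f}")
```

The wide intervals translate into large standard errors on the log-odds scale, which is consistent with the small subgroup sizes acknowledged in the limitations.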

The gene set enrichment analysis derived from SNPs associated with IBD and HAAP revealed a significant association with IL-6, a key inflammatory mediator in IBD. IL-6 influences the differentiation of T helper 17 (Th17) cells and supports Th1 cell survival, contributing to inflammation in CD and UC46.

A key point is that our analysis did not specifically target SNPs directly associated with Amerindian ancestry; instead, we aimed to explore which IBD genetic variants are linked to HAAP. In this way, we sought to identify genetic variants that may also be connected to ancestry, thereby indirectly enhancing our understanding of how ancestry influences IBD phenotypes. During the development of classifiers for predicting clinical and endoscopic remission over the past five years, we observed that at least one outcome associated with a severe disease course emerged as the primary distinguishing feature. These outcomes encompassed surgery, failure of anti-TNF treatment, pouch surgery, or flares within the last five years. None of the SNPs exhibited an importance level exceeding 5% in these models. Our findings suggest that clinical features play a more significant role than the studied SNPs in predicting these outcomes within our population. These results motivate us to expand our sample size and plan for future whole-genome sequencing to identify new genetic variants that may be relevant to our population.

Because of the relatively small size of our dataset, we focused on established genetic associations to address a specific question: the potential risk associated with previously identified variants in the phenotype of these IBD individuals.

Strengths and limitations

Our study provides valuable insights into the clinical and genetic dimensions of IBD within a South American cohort, emphasizing the role of ancestry in disease phenotype. A main strength lies in the integration of clinical characteristics with comprehensive genotyping data of SNPs previously associated with IBD risk in other ancestral cohorts. This allows for a more detailed analysis of the genetic influences on IBD in a historically underrepresented population. The study helps to explore how Amerindian ancestry influences IBD phenotype, contributing to personalized medicine.

One limitation of the study is its relatively small sample size, particularly in the CD group, which may constrain the generalizability of the findings. In addition, the focus on a Chilean cohort restricts the ability to extrapolate results to other Latin American populations with different genetic admixtures.

The study’s cross-sectional nature also limits causal interpretations of the observed associations. Further large longitudinal multi-ethnic cohort studies, including larger and not yet characterized Latin American populations, would help to decipher the complicated interactions between IBD phenotypes and ancestry and their relationship with the immune response. Such studies would aid in confirming genetic associations and exploring the influence of environmental and lifestyle factors.

Conclusion

Our findings demonstrate differences in IBD phenotypes based on Amerindian ancestry proportion, suggesting that genetic or ancestral factors may contribute to the disease’s phenotype and severity. Additionally, the results indicate that the influence of Amerindian ancestry differs in direction and effect between UC and CD patients. Further research is necessary to gain a deeper understanding of the underlying mechanisms that drive these associations.