Introduction

Global genetic diversity in livestock determines their ability to respond to selection and climate change, and their climatic footprint1. A large proportion of the global genetic diversity in cattle is contained in indigenous breeds, which are generally adapted to their abiotic environment, fodder, parasites, and pathogens. Yet, they are understudied and under pressure from genetic erosion occurring as a result of cross-breeding with higher-yielding breeds, leading to a loss of valuable genetic resources in cattle as a whole2,3. On the other hand, historical admixture between different cattle breeds, as well as between cattle and other members of the Bos genus, has introduced new genetic variation that has helped populations adapt to novel environments, such as in eastern Asia4,5 and Africa6,7.

While recent studies using whole genome sequencing have substantially increased our understanding of the overall evolutionary history of cattle in Asia4,5,8,9,10, Southeast Asian cattle have not been studied, leaving a large region where the history of cattle is almost unknown. Yet, Southeast Asia is an important region in cattle history, constituting the main dispersal route from India to China4,5,11, and is the likely scene of introgression from endemic wild bovine species into cattle4,5,12. A resolution of the historical movement of cattle in this region would also provide information regarding the early trade routes and the political and cultural links within Southeast Asia13.

Indonesian cattle are unique examples of stabilized hybrid cattle with substantial contributions from two parental Bos species with limited cross-fertility. They contain both zebu (Bos indicus) and banteng (B. javanicus) ancestry, with different breeds representing potentially different proportions of zebu and banteng14,15,16. Banteng were probably domesticated as early as 3500 years ago on Java17,18. These domesticated banteng, referred to as Bali cattle, have historically been one of the most abundant and important livestock breeds in the country and remain so today19. Zebu cattle were introduced to the archipelago later, but the sources and date of introduction are obscure. More recently, taurine cattle were imported to Indonesia from Europe and Australia from the 19th century with the aim of improving local cattle by crossbreeding20. It is commonly assumed that indigenous Indonesian breeds have acquired adaptations to the Indonesian climate, diseases, and to the smallholder husbandry practiced throughout the archipelago21,22. The mixing of indigenous banteng and introduced zebu cattle may have conferred banteng adaptations to the admixed Indonesian cattle by the process of adaptive introgression16. The unique history and mosaic of ancestry components make Indonesian cattle globally important because they contain genetic variation and adaptations that are not found in any other cattle population.

Admixture is common between species in the Bos genus23, and cases of adaptive introgression of exotic Bos ancestry have been identified in East Asian cattle4,5,12. However, in general, introgression from other species is thought to be predominantly under negative selection due to the presence of reproductive barriers between species24. Such barriers can take the form of Bateson-Dobzhansky-Muller incompatibilities (BDMIs) or other genetic incompatibility mechanisms, whereby alleles are harmful outside their native genomic context25. As a consequence, minor parent ancestry (i.e., the lesser of two ancestry components in a hybrid population) is usually under negative selection26,27. These studies also highlight that the nature of the selection pressure may differ from case to case depending on the degree of evolutionary divergence between hybridizing populations, the amount of genetic load in each parental source population, and on the environment in which the hybrid population evolved28. Furthermore, in admixed livestock such as Indonesian cattle, the selection regime on introgressing haplotypes might differ from those found in wild species. Indonesian cattle therefore represent an interesting model system to investigate how selection works on introgressed haplotypes in the Bos genus, which is a question of fundamental evolutionary importance and potential value in cattle breeding.

In order to elucidate the genomic resources in Indonesian cattle and how they have been shaped by their complex and poorly resolved history, we whole-genome sequenced 233 Indonesian cattle and banteng samples. These were analyzed alongside publicly available data representing global cattle diversity. This allowed us to quantify the population structure, genetic diversity, and admixture processes in Indonesian cattle, as well as the genetic contribution of banteng to present-day cattle. We also investigated the genomic landscape of ancestry in the admixed breeds, identified candidates of adaptive introgression, and compared them to the parallel case of wild Bos introgression into East Asian cattle4,5,12. Our analyses provide detailed insights into the origins, history, introgression and selection in a unique and globally important admixture system of cattle.

Results

We generated 233 new whole genomes, out of which 179 are from Indonesian cattle breeds, including Aceh (31), Pesisir (37), Pasundan (26), Jabres (33), Madura (34) and Sumba Ongole (18). A further 51 whole genomes are of Bali cattle from Bali (19), Kupang (17) and the feral population in northern Australia (15); and three are of captive Javan banteng from Texas, USA. The samples were sequenced to a mean depth of ~13.1X (range 9.62X – 34.9X; Fig. 1a; Supplementary Fig. 1; Supplementary Data 1; Supplementary Data 2). We combined them with 81 publicly available genomes from 2 Javan banteng, 8 Bali cattle, 42 zebu, 27 taurine, and 2 gaur (B. gaurus), resulting in 314 samples for mapping and subsequent data filtering (Supplementary Fig. 1; Supplementary Data 1; Supplementary Data 2). After mapping to two reference genomes, taurine cattle (BosTau9, GenBank: GCA_002263795.2, ARS-UCD1.2) and water buffalo (WaterBuffalo, GenBank: GCA_003121395.1, UOA_WB_1), and sample-quality filtering, we removed 81 samples due to suspected duplication (54 samples) and relatedness (27 samples of first and second degree). Hence, the final dataset consisted of 233 samples, including 157 newly sequenced individuals and 76 downloaded individuals passing sample filtering (Supplementary Data 1; Supplementary Data 2). Rigorous filtering to keep only sites with high confidence genotype calls resulted in 1,486,371,715 (53.9%) remaining sites in the BosTau9-mapped datasets, and 1,462,140,103 (55.1%) sites in the WaterBuffalo-mapped datasets (Supplementary Data 3). The number of SNPs and the reference genome used in each downstream analysis are shown in Supplementary Data 4.

Fig. 1: Sampling and population structure.

a Sampling locations of the sequenced Indonesian cattle and Bali cattle. b Visualization of 231 samples using two principal components based on genetic distance inferred by HaploNet. One individual (N_31B) had equal admixture from banteng and taurine cattle, referred to as the putative F1 hybrid. c Global genetic differentiation measured as FST values using Hudson’s estimator between all population pairs. d Individual ancestry proportions inferred by HaploNet for K = 3 and K = 7.

Population structure

We used HaploNet29 to visualize relative genetic distances using principal component analysis (PCA) and to estimate admixture proportions of the individuals (Fig. 1b; Fig. 1d; Supplementary Fig. 2; Supplementary Fig. 3). The first two principal components revealed three main clusters of ancestry, corresponding to banteng, zebu, and taurine cattle, with a gradient of admixed samples between zebu and taurine cattle also evident in the admixture analysis using three ancestry components (K = 3; Fig. 1d). The Indonesian cattle (Madura, Pasundan, Pesisir, Jabres, Aceh, Sumba Ongole) were located in the zebu cluster in the PCA; however, there was a tendency towards the banteng cluster, especially in Madura cattle, which is also observed in the admixture proportions at K = 3 (Fig. 1b; Fig. 1d). East Asian zebu also showed a tendency towards the banteng cluster in the PCA (Fig. 1b), but banteng admixture in this population was inferred to be only 1.26% in the admixture analysis at K = 3 (Fig. 1d; Supplementary Data 5), which is much lower than previously reported4. This apparent underestimation of banteng ancestry in East Asian zebu and in some of the Indonesian breeds at K = 3 could be due to an unknown (‘ghost’) source of admixture, which is not well represented by the Javan banteng ancestry component represented by our samples, consistent with Dai et al. 12. Three Pasundan individuals had substantial admixture from the taurine cluster (Fig. 1b; Fig. 1d), suggesting recent cross-breeding. All Bali cattle clustered with banteng except for two individuals from Kupang (N_31B and N101B), one of which had equal admixture from banteng and taurine cattle (Fig. 1b; Fig. 1d), suggesting it is a hybrid between these two lineages.

The admixture analysis at K = 7, the highest value of K that converged, assigned each Indonesian breed to a separate genetic cluster, except for Jabres and Pasundan, both of which we inferred to be admixed between Madura and Sumba Ongole (Fig. 1d), consistent with the recent formation of Pasundan in 1907 by crossbreeding Madura with Ongole-type cattle30. Pasundan was more heterogeneous than other breeds, which is consistent with reports that Pasundan consists of individuals with more variable morphological features than Bali cattle, Madura, or Sumba Ongole31. Similarly, Aceh (1) and Pesisir (5) contained individuals with uncharacteristic admixture proportions, suggesting that some recent cross-breeding has taken place. Even at K = 7, we found no substructure within the banteng-Bali cattle group.

We found low genetic differentiation among three Javan breeds (Pasundan, Jabres, and Madura; FST = 0.01–0.04), and between the two Sumatran breeds (Aceh and Pesisir; FST = 0.04) (Fig. 1c; Supplementary Fig. 4). Similarly, we found almost no genetic differentiation (FST = 0.00–0.01) among the Bali cattle populations, with the exception of the Australian Bali cattle, consistent with its recent strong founder event32. Differentiation was generally high between banteng and unadmixed cattle populations (FST = 0.62–0.85), but several Indonesian breeds had reduced FST with banteng, especially Madura (FST = 0.42–0.54). A NeighborNet tree based on the pairwise global FST distances conforms with the PCA and admixture analyses, and suggests reticulation in the evolutionary relationships among the populations (Supplementary Fig. 5).
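To make the FST values above more concrete, the following is a minimal Python sketch of Hudson's estimator computed as a ratio of averages across SNPs. It assumes per-SNP allele frequencies and haploid sample sizes as inputs; the function name and input layout are illustrative and are not taken from the study's pipeline.

```python
def hudson_fst(p1, p2, n1, n2):
    """Hudson's FST as a ratio of averages across SNPs.

    p1, p2: per-SNP allele frequencies in populations 1 and 2 (same length).
    n1, n2: number of sampled chromosomes in each population (assumed
            constant across SNPs here for simplicity).
    """
    num = 0.0
    den = 0.0
    for a, b in zip(p1, p2):
        # Squared frequency difference, corrected for finite sample size.
        num += (a - b) ** 2 - a * (1 - a) / (n1 - 1) - b * (1 - b) / (n2 - 1)
        # Probability that two alleles drawn from different populations differ.
        den += a * (1 - b) + b * (1 - a)
    return num / den if den > 0 else float("nan")
```

Summing numerator and denominator before dividing keeps rare variants from dominating the genome-wide value. With small samples the estimate can be slightly negative when the two populations have identical frequencies.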

Genetic diversity and inbreeding

We estimated genome-wide heterozygosity in all 231 samples (Fig. 2a). We found the highest median heterozygosity in Madura cattle (0.00548), followed by other admixed or unadmixed zebu cattle: East Asian zebu (0.00525), Pasundan (0.00521), Jabres (0.00515), Aceh (0.00411), Pesisir (0.00412), African zebu (0.00406), and South Asian zebu (0.00396). Madura cattle have even higher genetic diversity than East Asian zebu, previously considered the most genetically diverse cattle in the world due to introgression from other Bos species4,33. Taurine cattle had lower heterozygosities, consistent with previous findings34,35,36,37, whereas banteng and Bali cattle had the lowest heterozygosity of all populations.
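As an illustration of what a per-sample heterozygosity value of this kind represents, the sketch below counts heterozygous calls over all called sites, including non-variable ones. The genotype encoding (tuples of two alleles, `None` for a missing call) is an assumption made for this example only.

```python
def genome_heterozygosity(genotypes):
    """Fraction of called sites (variable and invariant) that are heterozygous.

    genotypes: iterable of (allele_a, allele_b) tuples; None marks a site
               without a confident call, which is skipped.
    """
    called = 0
    het = 0
    for gt in genotypes:
        if gt is None:
            continue
        called += 1
        if gt[0] != gt[1]:
            het += 1
    return het / called if called else float("nan")
```

Because invariant sites are in the denominator, values on the order of 0.004 to 0.005 correspond to roughly four to five heterozygous sites per thousand callable bases.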

Fig. 2: Genetic diversity and inbreeding.

a Heterozygosity of all Indonesian cattle and other cattle based on genotype data with non-variable sites included (see Methods). Boxplots indicate median (centre line), the 25th and 75th percentiles (box), and the highest and lowest values within the upper and lower quartiles ± 1.5 × interquartile range, respectively (whiskers). b The total number of ROH segments (y axis) and the total length (Mb) of the genome in ROH (x axis) for all 231 samples. Each dot represents an individual. Sample sizes of the populations included in (a, b) are as follows: Aceh (n = 25), Pesisir (n = 24), Pasundan (n = 24), Jabres (n = 3), Madura (n = 17), Sumba Ongole (n = 14), Bali (n = 19), Kupang (n = 16), Australia (n = 12), Captive banteng (n = 5), Unknown-Indonesia (n = 8), South Asian zebu (n = 13), East Asian zebu (n = 8), African zebu (n = 10), Asian admixed (n = 15), East Asian taurine (n = 3), European taurine (n = 10), and African taurine (n = 5).

We also analyzed runs of homozygosity (ROH) as a measure of genomic autozygosity and recent inbreeding. A visualization approach similar to Liu et al. 38 was taken to assess different settings for the ROH calling and to validate that our final ROH calling approach was accurate (Supplementary Fig. 6). While we emphasize that ROH calling is highly sensitive to even small errors in genotype calling, our approach was able to detect most of the apparent true ROHs, and evidently able to identify meaningful differences in ROH length and abundance between different samples. Most of the inferred ROHs were relatively short (0.5–2 Mb), except for captive banteng and Australian Bali cattle, which were dominated by long ROHs (>5 Mb), indicative of recent inbreeding in these individuals (Fig. 2b; Supplementary Fig. 7). We also observed a few individuals in Pasundan and Jabres with large fractions of long ROHs (>10 Mb), indicating occasional, rather than prevalent, cases of close-kin mating in Indonesian cattle (Supplementary Fig. 7). In banteng-Bali cattle, heterozygosity across populations was more uniform when removing ROH regions from the heterozygosity estimation (Supplementary Fig. 8), confirming that inbreeding and recent demographic events play a large role in explaining the variation in genetic diversity across these populations. In contrast, genetic diversity in the cattle breeds is more driven by other factors, such as older demographic history or potentially the amount of admixture.

Admixture and origin of Indonesian cattle

To assess the origin and history of admixture events in Indonesian cattle, we performed a series of analyses after removing individuals with evidence of recent admixture. We first used TreeMix on the Indonesian cattle, banteng, and representative zebu and taurine cattle populations to infer the population tree assuming different numbers of admixture events. This analysis showed that at least five admixture events are required to explain the inter-population genetic variation (Supplementary Fig. 9), and we inferred extensive admixture events from the banteng branch into Indonesian and East Asian cattle (Fig. 3a). TreeMix also suggested three distinct lineages of Indonesian cattle, with the Sumatran cattle breeds (Aceh and Pesisir) being separated from the Javan breeds (Madura, Jabres, and Pasundan) and Sumba Ongole. D-statistics analyses confirmed that all Indonesian cattle and East Asian cattle have excess allele sharing with banteng relative to South Asian zebu (Fig. 3b). Conversely, the different Bali cattle populations had similar amounts of allele sharing with zebu cattle as did banteng, with the exception of Bali cattle from Kupang, which showed a weak but significant signal of excess allele sharing with zebu, indicating introgression from zebu cattle into this population after its divergence from other Bali cattle populations.
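For orientation, a D-statistic of the form D(P1, P2; P3, P4) can be computed from per-SNP allele frequencies as sketched below. This follows the frequency-based definition used in the ADMIXTOOLS family; the function is a standalone illustration, not a call into ADMIXTOOLS2, and sign conventions differ between tools. Block-jackknife standard errors, which are needed to judge significance, are omitted.

```python
def d_statistic(p1, p2, p3, p4):
    """D(P1, P2; P3, P4) from per-SNP allele frequencies.

    Each argument is a list of frequencies of the same allele in one
    population. With this sign convention, a positive value indicates
    excess allele sharing between P1 and P3 (or between P2 and P4).
    """
    num = 0.0
    den = 0.0
    for w, x, y, z in zip(p1, p2, p3, p4):
        num += (w - x) * (y - z)
        den += (w + x - 2 * w * x) * (y + z - 2 * y * z)
    return num / den if den > 0 else float("nan")
```

With the outgroup as P4, equal numbers of sites supporting each discordant pattern give D = 0, which is the expectation without gene flow.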

Fig. 3: Population history and introgression.

a Population tree inferred with TreeMix assuming five migration events. We used a water buffalo as an outgroup. Arrows represent migration edges, with colour indicating the migration weight (proportion of the admixed population estimated to derive from the source population). b D-statistics calculated by ADMIXTOOLS2 when using water buffalo as outgroup H4, various cattle populations as H1/H2. Left panel: South Asian zebu and captive banteng as H2 and H3; right panel: captive banteng and South Asian zebu as H1 and H3. A significant non-zero value, as depicted by the red arrow in the graphic for each panel, provides evidence for gene flow between H3 and H1 (left panel), and between H3 and H2 (right panel). Data are presented as the estimated D-statistic ± 3 standard errors. Star represents significant allele sharing for each combination. Sample sizes for each population are as follows: Aceh (n = 24), Pesisir (n = 17), Pasundan (n = 21), Jabres (n = 3), Madura (n = 17), Sumba Ongole (n = 14), Bali (n = 19), Kupang (n = 14), Australia (n = 12), Captive banteng (n = 5), Unknown-Indonesia (n = 8), South Asian zebu (n = 11), East Asian zebu (n = 8), African zebu (n = 10), Asian admixed (n = 15), East Asian taurine (n = 3), European taurine (n = 10), and African taurine (n = 5). c Individual admixture proportion from LOTER and F4. We used unadmixed zebu and banteng as the two ancestry source references for LOTER. We used water buffalo and taurine as outgroups for the F4 ratio admixture inference.

To quantify the amount of banteng-like introgression in admixed cattle populations, we estimated genome-wide admixture proportions using local ancestry inference with LOTER39 and F4-derived admixture ratio40. The pattern was consistent with D-statistics and other results, and suggested that Madura individuals have very high banteng ancestry proportions of 0.33–0.37 (Fig. 3c; Supplementary Data 6). Banteng ancestry proportions were generally homogeneous within breeds, suggesting that the admixture is relatively old and evenly distributed across individuals, except in Pasundan where we found 0.13–0.26 banteng ancestry (Fig. 3c; Supplementary Data 6), consistent with the recent formation of this breed and the variable proportions of Madura and Ongole ancestry in Pasundan individuals (Fig. 1d). Overall, we obtained similar proportions using both F4 ratio and LOTER. However, the F4 ratio was lower than the LOTER-based results in Madura (LOTERmedian = 0.35, F4median = 0.29), Pasundan (LOTERmedian = 0.23, F4median = 0.19) and Jabres (LOTERmedian = 0.24, F4median = 0.20), possibly due to small amounts of taurine introgression into these populations, or other violations of the F4 assumptions. Correspondingly, the F1 hybrid offspring of a Bali cattle and a taurine cattle from Kupang showed a much-reduced F4 ratio as expected (Fig. 3c; Supplementary Data 6).
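The logic of an F4-ratio admixture estimate can be sketched as follows. The population labels A, O, X, B and C are generic placeholders in the standard formulation (X modelled as a mixture of B and C, A related to B, O an outgroup) and are assumptions of this example; the exact population assignment used in the study is described in its Methods and is not reproduced here.

```python
def f4(a, b, c, d):
    """f4(A, B; C, D): mean product of allele-frequency differences."""
    return sum((w - x) * (y - z) for w, x, y, z in zip(a, b, c, d)) / len(a)


def f4_ratio(a, o, x, b, c):
    """Estimate the proportion of B-related ancestry in X.

    X is modelled as a mixture of B and C; A is a population related to B,
    and O is an outgroup. All inputs are per-SNP allele frequency lists.
    """
    return f4(a, o, x, c) / f4(a, o, b, c)
```

If X is an exact mixture with a fraction alpha from B, the numerator equals alpha times the denominator, so the ratio recovers alpha. Gene flow from a population outside this model, such as taurine cattle, breaks that proportionality, which is one way the estimate can deviate from a local-ancestry-based value.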

To disentangle the history of introgression from other Bos sources, we used three complementary approaches. First, we used Hmmix41 to identify regions in the genome that were introgressed from highly divergent source populations into the six admixed Asian cattle breeds. We then calculated the genetic similarity of each introgressed region with two different potential source lineages, Javan banteng, and gaur. We found that the average genetic similarity with these two source species in introgressed regions differed substantially among breeds, with Javan breeds having higher similarity with the Javan banteng source, and the Sumatran breeds and East Asian zebu having higher similarity with gaur (Fig. 4a). Aceh cattle and East Asian zebu had almost identical average genetic similarity patterns, whereas Pesisir cattle had patterns intermediate between the Javan breeds and Aceh. Furthermore, the genetic distance calculated within introgressed regions shared between pairs of individuals showed a similar pattern (Fig. 4b; Supplementary Fig. 10), suggesting that the three Javan breeds have a distinct introgression source from the two Sumatran breeds, which are in turn more similar to East Asian zebu.

Fig. 4: Origin of introgressed fragments in Indonesian cattle.

a The mean ratio of SNPs shared with two outgroups, Javan banteng and gaur, in inferred introgressed regions of each admixed cattle population, based on Hmmix. b Pairwise identity by state matrix (mibs) calculated from overlapping archaic regions (probability > 0.9) between two individuals using a 10 Kb window size. Inset: neighbor-joining tree of the genetic distance (1-mibs) in pairwise overlapping archaic regions. c Inference of admixture time using AncestryHMM with a one-pulse model. Numbers shown are mean admixture times in generations across 100 bootstraps. d Genome-wide heterozygosity in two admixed populations, Aceh (n = 24) and Madura (n = 17), when stratified by the three possible ancestry states: homozygous for zebu ancestry (“Zebu” on x-axis), homozygous for banteng ancestry (“Banteng”), heterozygous for zebu/banteng ancestry (“Mixed ancestry”) and finally across the whole genome (“All”). Results for all admixed populations and East Asian zebu are in Supplementary Fig. 14. Boxplots indicate median (centre line), the 25th and 75th percentiles (box), and the highest and lowest values within the upper and lower quartiles ± 1.5 × interquartile range, respectively (whiskers). e PCA analysis on genome segments of banteng ancestry origin and zebu ancestry origin respectively, as inferred by LOTER. To remove any distant relatedness among samples, we removed one of each pair of individuals with K1 > 0.2 identified by ngsRelate for (b, d) (Supplementary Fig. 31).

Next, we inferred the introgression time using AncestryHMM42 (Fig. 4c) and again found distinct differences between the Javan breeds and the other admixed Asian cattle, with a younger inferred admixture time of 269 generations for Madura, corresponding to 1345–1883 years ago assuming a generation time of 5–7 years4,43,44. Pesisir and especially Aceh showed substantially older introgression times (407 and 1051 generations, respectively), again suggesting that the introgression into Sumatran breeds was a separate event from the introgression into Javan breeds. We note that AncestryHMM tends to overestimate the age of the introgression if the source is a ghost population not well represented by the proxy donor population data used in the analysis45, as already indicated by the HaploNet (Fig. 1d) and Hmmix (Fig. 4a) results. Therefore, the older inferred dating of introgression in the Sumatran breeds, and even more so in the East Asian zebu, could be explained if the Bos source deviated significantly from that represented by our Javan banteng data — e.g., one or more of the Bos lineages found in mainland Southeast Asia, as suggested before for East Asian zebu33. Complicated introgression involving more than one species of wild Bos has been suggested for East Asian zebu before12.
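The conversion from generations to calendar years used above is a simple multiplication by the assumed generation time. The short sketch below reproduces the Madura range and applies the same 5–7 year assumption to the other two estimates; the helper name is illustrative.

```python
def generations_to_years(generations, gen_time_range=(5, 7)):
    """Convert an admixture time in generations to a (min, max) range in years."""
    low, high = gen_time_range
    return generations * low, generations * high


# Point estimates in generations, as reported in the text.
estimates = {"Madura": 269, "Pesisir": 407, "Aceh": 1051}
for breed, gens in estimates.items():
    print(breed, generations_to_years(gens))
```

These ranges reflect only the uncertainty in generation time, not the bootstrap variance of the estimates or the possible upward bias from a ghost source population.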

Last, to disentangle the origin of the zebu and banteng ancestry components of the admixed Asian breeds, we used local ancestry inference to mask sequentially one of these two ancestries in the individual genomes, and performed PCA using a method targeted for datasets with very high missingness on the resulting genotype matrices. We checked how well the local ancestry masking works by redoing the D-statistic analysis applying the masking, and confirmed that this masking removes all significant traces of banteng-like ancestry (Supplementary Fig. 11a), bringing the D-statistics close to zero for all admixed populations. The masked EMU-PCA analyses showed that the banteng ancestry groups into three different clusters: (1) extant Javan banteng and Bali cattle, (2) Javan cattle, and (3) Sumatran cattle and East Asian zebu (Fig. 4e; Supplementary Fig. 12; Supplementary Fig. 13). The zebu ancestry groups into five main clusters represented by (i) South Asian zebu and Sumba Ongole, (ii) Madura, (iii) Pesisir, (iv) Aceh, and (v) East Asian zebu, respectively (Fig. 4e; Supplementary Fig. 12; Supplementary Fig. 13). The other Javan breeds — Pasundan and Jabres — fall on a cline between Madura and Ongole cattle, consistent with them being a recent admixture of these two sources. The addition of new downloaded samples from northernmost Southeast Asia and East Asia (e.g. Myanmar and southern China, Supplementary Data 7) did not resolve the origins of Indonesian zebu ancestry, but showed a cline of zebu ancestry consistent with their geographical placement on the shortest route between India and China (Supplementary Fig. 11b).

As a measure of polymorphism within and between the source populations of admixed individuals, we calculated heterozygosity in each of the three ancestry states along the genome in admixed breeds: tracts homozygous for zebu ancestry, tracts heterozygous for zebu/banteng ancestry, and tracts homozygous for banteng ancestry (Fig. 4d; Supplementary Fig. 14). The results highlight how introgression elevates the overall genetic diversity, but also showed that the banteng source contributing to Madura had much higher genetic diversity (median heterozygosity: 0.0043) than in any extant banteng populations (median heterozygosity in captive banteng: 0.0019; in Bali cattle: 0.0027). This suggests that the formation of the Madura breed did not entail a major bottleneck, and that the banteng population contributing to Madura had a higher genetic diversity than the ones found today. Aceh and Pesisir showed lower diversity in the banteng ancestry tracts compared with those in Madura and Pasundan, further supporting the different introgression history between Sumatran breeds and Javan breeds (Fig. 4d; Supplementary Fig. 14).

In summary, our analyses suggest three distinct origins of the Indonesian cattle: (1) Sumatran breeds Aceh and Pesisir, which have similar zebu and banteng ancestries as East Asian zebu and show signs of either old and/or ghost Bos introgression from mainland Southeast Asia; (2) Javan breeds Madura, Pasundan, and Jabres, which have introgression that occurred more recently and from a source more similar to extant Javan banteng; and (3) Sumba Ongole, which does not have appreciable Bos introgression and is genetically similar to South Asian zebu, consistent with its recent introduction20.

Genomic landscape of ancestry

mtDNA haplotypes in Indonesian breeds were found both within banteng and cattle haplogroups, whereas Y-chromosomal haplotypes were exclusively of cattle origin (Supplementary Fig. 15; Supplementary Fig. 16). This suggests that interbreeding did not involve male banteng, whereas both male and female cattle were involved, consistent with previous evidence15. In addition, the mtDNA haplotype network corroborated a closer affinity of the Indonesian breeds, except Sumba Ongole, with East Asian than with South Asian zebu (Supplementary Fig. 15).

To investigate the distribution of different ancestries across the genome in admixed cattle, we used the local ancestry inference results from the five admixed Asian cattle breeds with sufficient sample sizes (Aceh, Pesisir, Pasundan, Madura, and East Asian zebu). Preliminary analyses showed that the local ancestry inference was robust to the composition of the reference panels used in LOTER (Supplementary Fig. 17) and to the choice of reference genome (Supplementary Fig. 18). Based on the local ancestry inference from each individual, we calculated the mean banteng ancestry in non-overlapping 50 Kb windows for each population. We visualized the genome-wide local ancestry landscape for each admixed breed (Fig. 5a; Supplementary Fig. 19). We found noticeable peaks and valleys, suggesting that selection (positive as well as negative) might have been acting on introgressed haplotypes. Some large-scale features are visible; for example, there are two large-scale ‘deserts’ of banteng ancestry on chromosome 7 (~5–25 Mb and 40–55 Mb), and another one on chromosome 16 (~2–12 Mb). These markedly decrease the average banteng ancestry across chromosomes 7 and 16 relative to other autosomes (Supplementary Fig. 20).
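The windowing step can be illustrated with a short sketch that averages a per-SNP banteng ancestry fraction within non-overlapping 50 Kb windows. The input format (chromosome, 0-based position, population-level fraction of haplotypes assigned to banteng) is an assumption for this example and does not reflect the actual LOTER output format.

```python
from collections import defaultdict

WINDOW = 50_000  # window size in bp, matching the 50 Kb windows in the text


def mean_ancestry_by_window(sites, window=WINDOW):
    """Average banteng ancestry per non-overlapping window.

    sites: iterable of (chrom, pos, banteng_fraction) tuples, where pos is a
           0-based coordinate and banteng_fraction is the share of haplotypes
           in the population inferred as banteng at that SNP.
    Returns a dict mapping (chrom, window_start) to the mean fraction.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for chrom, pos, frac in sites:
        key = (chrom, (pos // window) * window)
        sums[key] += frac
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}
```

Windows without any SNP are simply absent from the result, so downstream comparisons between breeds should be restricted to windows present in both.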

Fig. 5: Ancestry landscape of Indonesian cattle.

a Banteng ancestry across chromosome 13 in five cattle groups (Aceh, East Asian zebu, Madura, Pasundan, and Pesisir). Window-based scan of regions with extreme banteng ancestry using the proportion of inferred banteng SNPs from LOTER for each cattle group divided by the mean proportion per group. Pink shade marks regions in the genome-wide top 5% of the normalized LOTER summed across all groups. Similar plots for other chromosomes are in Supplementary Fig. 19. b LOTER and Ux results across cattle groups, showing windows in ASIP (chr 13, 62700000–64250000) and KIT region (chr 6, 64900000–71100000). The windows containing the actual coding region of ASIP and KIT are highlighted with the gene name. c Spearman correlations of banteng proportion in all windows of 50 Kb. d Upset plot of genes within top 5% highest banteng proportion windows for five admixed groups (Aceh, Pesisir, Pasundan, Madura, East Asian zebu). The number in each bar represents the number of genes between different breed comparisons. e Word clouds of the QTL terms contained within the top 5% of LOTER inferred banteng ancestry for each group. We included only the top 5% of terms overlapping most frequently with regions of high banteng ancestry for each cattle group.

In addition, banteng ancestry was severely depleted on the X chromosome across breeds (Supplementary Fig. 21), consistent with sex chromosomes harboring more incompatibility loci than autosomes28,46. To investigate the overall type of selection acting on haplotypes of banteng (minor) ancestry in a zebu (major) ancestry background, we calculated the correlation between the mean banteng ancestry and three genomic features: recombination rate, conservation score, and coding region density in Madura cattle (Supplementary Fig. 22). These analyses revealed a very weak positive correlation between recombination rate and banteng ancestry, a very weak negative correlation between conserved sites and banteng ancestry, and a very weak positive correlation between banteng ancestry and coding sites density, collectively indicating a lack of linked purifying selection removing banteng ancestry overall, contrasting with the results from hybridizing swordtail (Xiphophorus spp.) and hominin populations24.

We observed an overall high correlation in banteng ancestry proportions per 50 Kb window even between breeds that we inferred to have independent introgression events — e.g., East Asian zebu and Madura (Spearman’s ρ = 0.30; Fig. 5c). The correlation was markedly higher between Madura and Pasundan (ρ = 0.58; Fig. 5c), and between Aceh and Pesisir (ρ = 0.49; Fig. 5c), both inferred to share introgression history.
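For reference, Spearman's ρ between two breeds is the Pearson correlation of the ranks of their per-window banteng ancestry values. The dependency-free sketch below assigns average ranks to ties; it is an illustration of the statistic, not the code used in the study, and expects the two lists to cover the same windows in the same order.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman_rho(x, y):
    """Spearman's rank correlation between two per-window ancestry vectors."""
    return pearson(average_ranks(x), average_ranks(y))
```

Using ranks makes the measure insensitive to the different overall banteng proportions of the breeds, so it captures whether the same windows tend to be high or low in both.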

Regions of high banteng ancestry in admixed cattle

To investigate the functional consequences of introgression, we identified regions of high banteng ancestry (top 5%) within breeds using 50 Kb non-overlapping windows across the genome. We used three complementary methods of local ancestry inference (LOTER, Hmmix, and Ux), which showed consistent but complementary results (Supplementary Fig. 23, see also Methods). In total, the top 5% of windows ranked by population-wise banteng ancestry as inferred from LOTER overlap with 839 genes in Madura, 813 genes in Pasundan, 892 genes in Pesisir, 1041 genes in Aceh, and 914 genes in East Asian zebu (Fig. 5d), in all cases a significantly lower gene density than expected by chance (Supplementary Fig. 24). These results suggest that elevated banteng ancestry tends to occur in relatively gene-poor regions. Gene ontology enrichment tests identified multiple immune system-related enrichment categories across breeds, as well as developmental processes (Supplementary Data 8). In addition, the outlier regions overlap with many production-relevant quantitative trait loci that differed among populations, notably milk-related (e.g., ‘milk fat percentage’, ‘milk yield’), fertility-related (‘calving ease’, ‘conception rate’), and body size or growth-related traits (‘body weight’, ‘average daily gain’) (Fig. 5e). We additionally report the list of genes in top 5% outlier windows that overlapped between any two of the three methods (LOTER, Hmmix, and Ux), together with a Venn diagram showing the number of overlapping genes for each Indonesian breed (Supplementary Fig. 25; Supplementary Data 9–11).

Shared outlier regions and convergent adaptive introgression

Identifying regions of convergent negative or positive selection acting on introgressing haplotypes in admixed cattle can be valuable for assessing the potential for genetic improvement of cattle through gene editing or controlled crossing. To investigate signals of shared or convergent adaptive introgression from banteng, we tabulated the overlap between genes in the top 5% windows with banteng ancestry across the five admixed breeds with sufficient sample sizes (Fig. 5d; Supplementary Data 12). The magnitude of gene sharing between populations followed the same trends as the overall genomic correlations in banteng ancestry (Fig. 5c), with the exception that the Pesisir and Aceh (Sumatran) breeds shared more outlier genes (195) than any other population pair (Fig. 5d).

Among the genes that have elevated (top 5%) banteng ancestry across multiple breeds, several are well-known for their associations with particular cattle phenotypes (Supplementary Fig. 19). These include the well-known coat-color genes ASIP (Pasundan, Pesisir, Aceh, and East Asian zebu), KITLG (Madura, Pesisir, Aceh, and East Asian zebu), TYR (Pesisir, Aceh and East Asian zebu), and KIT (Pesisir, Aceh, and East Asian zebu). Lactase LCT and myostatin MSTN were also in outlier windows in four populations (Madura, Pesisir, Aceh, and East Asian zebu), although the same windows also contain other genes that are plausible targets of selection, e.g., MCM6, which is also associated with coat color47 and is positioned in the LCT window. LCT and MSTN have well-known roles in lactose digestion48,49 and the growth of muscle fibre50, respectively. In addition, the bovine MHC region (BoLA), which is strongly implicated in the immune system of all vertebrates51, was also among the regions with the highest summed normalized banteng ancestry score (Fig. 5a; Supplementary Fig. 19). Of these regions, the ASIP and KIT regions were also clearly identified by the Ux statistic (Fig. 5b; Supplementary Fig. 26), corroborating that these two coat-color genes are examples of convergent adaptive introgression. Both genes are within extensive genomic regions that have elevated banteng ancestry across all admixed cattle populations (Fig. 5b; Supplementary Fig. 19). The window containing ASIP and the one immediately downstream had the two highest summed normalized banteng ancestry scores across the genome, with East Asian zebu, Aceh, and Pesisir having particularly elevated banteng ancestry in this region (Fig. 5a; Supplementary Fig. 19). When investigating the haplotype structure in ASIP, we identified several banteng-like haplotypes segregating among the admixed cattle populations (Supplementary Fig. 27), suggesting introgression of different ASIP-carrying haplotypes, some of which are not common in extant Javan banteng and Bali cattle. We discovered nine non-synonymous mutations (Supplementary Data 13) unique to the banteng-derived ASIP. Two of these are classified as potentially high impact with regard to protein structure.

The remaining highlighted gene regions identified with LOTER have less straightforward interpretations, as they could represent regions with incomplete lineage sorting or with more complex haplotype structure complicating the assignment of regions to either a banteng or zebu origin. This does not exclude that haplotypes in these gene regions are adaptively introgressed as they are consistently identified as closer to existing banteng than zebu haplotypes across admixed breeds.

Discussion

We generated the whole-genome data set from Indonesian cattle, a group of diverse breeds of interest for the history of cattle and their unique genetic variation caused by wild Bos introgression. Although admixture between zebu and banteng in Indonesian cattle is not a novel finding14,15,23,52,53, the admixture proportions have so far not been accurately estimated, and the evolutionary history of different breeds has not been investigated, nor has the genomic landscape of ancestries. Our findings uncover several novel and relevant aspects for livestock genetics — we identify 3,454,058 novel introgressed SNP variants in cattle (Supplementary Data 14), and we have provided a genomic map of banteng introgression both within and across a number of cattle breeds with unique introgression histories, potentially useful for genetic improvement in cattle.

The magnitude of wild Bos introgression in Indonesian cattle is unprecedented in any other cattle populations, and it makes these populations of global importance for cattle genetic resources. Similar to East Asian cattle, banteng-like introgression has elevated the genetic diversity in Indonesian cattle relative to non-admixed zebu breeds4,12, and in Madura we found the highest genetic diversity measured to date in any cattle population. We found that the ancestor of Madura cattle was the result of a pulse-like admixture between an introduced zebu and a genetically diverse endemic Bos javanicus population around 1345–1883 years ago. Our genetic dating of this event is remarkably consistent with the historical record54 and coincides with the earliest archaeological evidence of cattle on Java20. The high diversity of the banteng-derived introgressed regions, combined with the fertility barriers between banteng and zebu requiring several generations of backcrossing to achieve full fertility in hybrids19, both point towards a large-scale and possibly human-controlled practice of interbreeding. We also demonstrated that these admixed archaic Javan cattle were subsequently mixed with an Ongole-like zebu population to form the two West Javan breeds, Pasundan and Jabres, which therefore inherited significant amounts of banteng ancestry from Madura-like ancestors.

In contrast, we found uniformly low genetic diversity in banteng and Bali cattle. The captive banteng and the feral Bali cattle in Australia had particularly low genetic diversity and up to 61% of their genome in runs of homozygosity. Interestingly, the banteng ancestry component in the Javan cattle breeds was different from that in extant Bos javanicus, with higher genetic diversity than the extant individuals. This indicates that Javan banteng might have experienced a reduction in genetic diversity since the ancestral population introgressed into the admixed Javan cattle about 1345–1883 years ago, and that extant Indonesian cattle therefore carry otherwise lost B. javanicus genetic variation, potentially available for genetic rescue. Besides Bali cattle in Kupang, where we identified a weak signal of historical introgression, as well as an F1 hybrid and one other recently admixed individual, we found no indication of introgression of cattle alleles into other Bali cattle and banteng individuals. This alleviates previous concerns that Bali cattle are experiencing genetic swamping from cattle22. However, the reduced genetic diversity and high inbreeding coefficients in Bali cattle do warrant concerns about breeding practices55, and suggest that more focus is needed to prevent inbreeding depression in this important local livestock. The feral population of Australian Bali cattle has been proposed as a possible back-up population for wild Javan banteng, considering that it is the largest existing wild population of banteng32,56. However, we corroborated that this population has strongly reduced genetic diversity and increased inbreeding56 as a result of its small founder population of just 20 individuals57,58, casting doubt on its utility as a back-up population without deliberate intervention using genetic rescue59.

We identify three separate origins and introductions of cattle to Indonesia: two early introductions (>1345 years ago) giving rise to the breeds now on Java and Sumatra, respectively, and one recent and well-documented introduction of Ongole cattle from India20. Due to the absence of genomic data from most of mainland Southeast Asia, we could not infer the geographical source of the zebu founders of the Javan and Sumatran cattle, but our analyses show that the Sumatran zebu ancestry is more similar, but not identical, to East Asian zebu, and that Sumatran cattle share features of Bos introgression with East Asian zebu. On the other hand, the Javan zebu ancestry is more distant from any other known Asian cattle, including South Asian and East Asian, and Javan breeds show a distinct age and source of Bos introgression. In conjunction with mtDNA similarities (Supplementary Fig. 15) and previously inferred SNP genotype similarities among Sumatran, Thai, and Chinese breeds52,53,60, we conclude that the most likely source of the early introduction of cattle to Sumatra was mainland Southeast or eastern Asia, in contrast to previous hypotheses based on strong cultural and trade links between the Indian subcontinent and Indonesia in this period15,20. Hence, our results support the presence of extensive connectivity between mainland and maritime Southeast Asia in the period before 700 AD, possibly through Indianized empires in Southeast Asia with extensive maritime trading networks61. In contrast, our ability to trace the source of Javan zebu cattle is limited, and we simply note that Javan cattle are highly unique compared to all other Asian cattle with genomic data, both with respect to their zebu ancestry and their introgression signature. 
Interestingly, recent linguistic research established that the ancient Javanese word for zebu cattle derives from a Mon-Khmer language, consistent with cattle being introduced to Indonesia through Indianized cultures in mainland Southeast Asia62.

Consistent with two distinct, early cattle introductions, Javan and Sumatran breeds also have distinct signatures of introgression from banteng-like (wild Bos) populations. We hypothesize that at least part of the introgressed component in the Sumatran breeds Aceh and Pesisir comes from an alternative Bos source not represented in this study. Mainland or Bornean banteng, gaur, gayal (B. frontalis) and kouprey (B. sauveli) could all be candidates, and the latter three have possibly introgressed into zebu in East Asia12,23. The challenges of establishing the species source of wild Bos introgression have been highlighted before12,33, and are exacerbated by the paucity of whole-genome data from these endangered species.

There are both theoretical arguments and empirical evidence supporting that negative selection on introgressing haplotypes is the norm28,46,63, even though introgression can also be adaptive64,65,66. In cattle, introgression is thought to have helped particular breeds adapt to local conditions when introduced to new areas of the world5,23,67. We found several pathways and quantitative trait loci plausibly related to local adaptations, or to the individual agro-ecologies of the different breeds, that are enriched in the regions of high banteng ancestry, including some related to the immune system, growth and development, and fertility. We found only weak evidence of genome-wide linked negative selection acting on banteng haplotypes in the admixed breeds, in contrast to Homo sapiens/neanderthalensis68,69 and swordtails24. Previous studies suggested that overall linked negative selection on introgression is expected to be correlated with divergence time70, but our results suggest that introgression is overall much better tolerated among Bos species than among hominins and swordtails, despite Bos divergence times being intermediate between these two other examples (~ 500,000 to 700,000 years ago estimated for banteng-zebu; Supplementary Fig. 28). This is an important finding, because it suggests that genetic variation contained in wild Bos may be suitable to genetically improve cattle worldwide. Limited genetic incompatibility among Bos could potentially explain the highly reticulated evolution of the Bos genus23, although more studies are required to test this hypothesis. Moreover, it is possible that introgression into a domesticated species can follow distinct selection trajectories compared to non-domesticated species due to the fundamentally different pressures imposed by artificial as opposed to natural selection71, an intriguing possibility that warrants further investigation.

Correlated genomic distributions of banteng ancestry across breeds with presumed distinct admixture histories, e.g., Madura and East Asian zebu, suggest the presence of convergent selection, consistent with parallelism found in replicate populations of hybrid swordtail fishes27. Regions almost devoid of banteng ancestry in all breeds, including on large parts of the X chromosome, could contain Bateson-Dobzhansky-Muller incompatibilities, resulting in consistent negative selection on introgressing haplotypes28, or regions where banteng haplotypes have convergently been removed by artificial selection. Incompatibility in specific regions of the genome is not inconsistent with a lack of genome-wide linked selection against the minor parent ancestry. We found lower cross-breed correlations of banteng ancestry in the candidate regions for positive selection than genome-wide (Fig. 5c; Supplementary Fig. 29), suggesting that positive selection on introgression is less convergent than negative selection. This pattern is expected if negative selection is caused by ubiquitous species-wide features, such as genetic incompatibilities or genome architecture, whereas adaptive introgression is generally idiosyncratic in each breed, consistent with their diverse morphology and agro-ecology20.

Despite predominantly breed-specific adaptive introgression, we did find cases of likely convergent adaptive introgression, such as in the coat-color genes ASIP, KIT, and possibly KITLG, and TYR, suggesting that coat color is a trait that has recurrently been under adaptive introgression in cattle receiving gene flow from banteng-like sources. A region around ASIP in particular had a strong signal of convergent adaptive introgression, with multiple different haplotypes introgressing in different admixture events. This is consistent with coat color being among the most generic domestication traits72, and possibly one of the earliest targets of artificial selection in the domestication process73. If coat color is generally under artificial selection in livestock, introgressing genetic variants affecting this trait would provide a new substrate for such selection64. Interestingly, the Indonesian cattle breeds examined here share a tan coat color with banteng, supporting that introgressing coat color genes may have been under positive selection. These results are consistent with previous research identifying ASIP and other coat-color genes as candidates of both positive selection on standing variation in zebu cattle5 and adaptive introgression in Chinese cattle4,5,12.

We identified many SNP variants in Indonesian cattle that are not found in cattle breeds lacking wild Bos introgression, providing a rich and novel source of genetic variation of global importance for managing the genetic resources of cattle. We show that the introgressed regions in high frequency across admixed breeds have important functional associations in cattle, making them potentially relevant for targeted cattle breeding and genetic improvement. As many of the introgressed genetic variants are previously unknown and therefore have not been included in previous cattle genome-wide association studies, they could provide new substrate for production-related selection. In addition, their presence across multiple admixed cattle populations provides a catalog of variants that have already passed through a sustained filter of selection in a zebu-genomic background, proving their genomic compatibility in a cattle background. We anticipate that future studies will be able to test the potential phenotypic associations of some of the regions we have identified here, enhancing the utility of this catalog of introgressed variants for cattle breeding. Such introgressed diversity in tropical cattle breeds may prove vital for future adaptation of other cattle breeds to increasing global temperatures as predicted over the coming decades.

Methods

Sample collection and laboratory protocol

The research presented in this study complies with all relevant ethical regulations and was conducted in accordance with the Code of Conduct for Responsible Research of the University of Copenhagen. We collected 233 samples from 6 Indonesian cattle breeds (Aceh, Pesisir, Pasundan, Jabres, Madura, and Sumba Ongole), 3 Bali cattle populations (Bali, Kupang, Australia), and three individuals of Javan banteng from captivity in Texas, USA (Fig. 1a; Supplementary Data 1; Supplementary Data 2; Supplementary Fig. 1). The Bali cattle from Australia come from a feral population in Garig Gunak Barlu National Park in northern Australia, descended from 20 individuals that were released from an abandoned British outpost in 184932,56. Blood samples were kept in an EDTA buffer in the field, stored at −196 °C in a dry shipper as soon as possible for transfer from the field to the laboratory in Bogor, and then moved to a −80 °C freezer for long-term storage. We extracted DNA following the manufacturer’s protocol for the QIAGEN Blood and cell culture Kit. Before the default protocol, we added three treatment steps: (1) adding 500 µl of ice-cold water to the blood samples, (2) centrifuging the diluted blood samples for 20 min at 17,900 x g and 4 °C, and (3) discarding the supernatant without disturbing the pellet. These extra steps were required for the default kit protocol to work in the humid climate of the Indonesian lab. Before using gel electrophoresis to check the quality of the genomic DNA, we measured the DNA concentrations with a Qubit 2.0 Fluorometer and a Nanodrop. After DNA extraction, 1 mg genomic DNA was fragmented by Covaris (350 base pairs on average), followed by purification with the AxyPrep Mag PCR clean-up kit. The fragments were end-repaired by End Repair Mix and then purified. 
The repaired DNA was combined with A-Tailing Mix, then the Illumina adaptors were ligated to the DNA adenylate 3’ ends, followed by product purification. Size selection was performed targeting insert sizes of 350 base pairs (bp). Several rounds of PCR amplification with PCR Primer Cocktail and PCR Master Mix were performed to enrich the adaptor-ligated DNA fragments. After purification, the size and quality of libraries was assessed by the Agilent Technologies 2100 Bioanalyzer and ABI StepOnePlus Realtime PCR System.

Additionally, we downloaded 81 publicly available, whole-genome sequencing datasets: 8 samples from Bali cattle from an unknown locality in Indonesia, 42 individuals of Bos indicus spreading from East Asia, South Asia, Latin America, and Africa, 27 individuals of Bos taurus from Asia, Middle East, Europe, and Africa, and two gaur (Bos gaurus), two Javan banteng (Bos javanicus) from zoological gardens (Supplementary Data 2; Supplementary Fig. 1).

Sequencing and mapping

All samples were sequenced using Illumina paired-end 2 × 150 bp reads. This includes 230 samples sequenced to a depth of 9.62X–17.0X on the Illumina NovaSeq platform and 3 samples from captive Javan banteng sequenced to a depth of 15.9X–34.9X on the Illumina HiSeq2500 platform (Illumina Inc., San Diego, CA, USA). We assessed the quality of the raw reads using FastQC (bioinformatics.babraham.ac.uk/projects/fastqc) and MultiQC74 before mapping.

For mapping, we used a modified version of PALEOMIX BAM pipeline75 (github.com/xiqtcacf/IndonesianCattle-Scripts), which is a pipeline designed for the processing of demultiplexed, high-throughput, short-read sequencing data. We first trimmed Illumina universal adapters using AdapterRemoval v2.3.276. We merged read pairs with overlapping sequences of at least 11 bp to improve the fidelity of the overlapping region by selecting the highest-quality base when mismatches are observed. Mismatching positions in the alignment, where both read bases had the same quality, were set to ‘N’ via the ‘--collapse-conservatively’ option. We did not trim Ns or low-quality bases and only empty reads resulting from primer-dimers were discarded. We then mapped all trimmed reads using BWA-mem v0.7.17-r11887077 to two chromosome-level reference genomes: (1) BosTau9 (GenBank: GCA_002263795.2, ARS-UCD1.2), a female taurine from Hereford breed, and (2) Waterbuffalo (GenBank: GCA_003121395.1, UOA_WB_1), a female water buffalo from the Mediterranean breed. PCR duplicates were flagged using samtools v1.11 ‘markdup’ for paired reads and PALEOMIX ‘rmdup_collapsed’ for merged reads.

We merged the resulting BAM alignments from collapsed and paired reads for each individual, and filtered them based on standard BAM flags to exclude unmapped reads, reads with unmapped mate reads, secondary alignments, reads that failed QC, PCR duplicates, and supplementary alignments. We further excluded reads in alignments with inferred insert sizes <50 bp or >1000 bp, reads where <50 bp or <50% of the read was aligned, and read pairs in which mates mapped to different contigs or not in the expected orientation. We finally generated statistics of the filtered BAM files using samtools ‘stats’ and ‘idxstats’78.
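The flag-based part of this filter corresponds to a bitmask over the standard SAM flag field. The sketch below is illustrative only and is not the modified PALEOMIX code. It uses the flag bits defined in the SAM specification. The function signature is hypothetical, applying the insert-size check only to paired reads is an assumption, and the mate-contig and orientation checks are omitted.

```python
# SAM flag bits (from the SAM specification).
PAIRED = 0x1
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8
SECONDARY = 0x100
QC_FAIL = 0x200
DUPLICATE = 0x400
SUPPLEMENTARY = 0x800

EXCLUDE_MASK = (UNMAPPED | MATE_UNMAPPED | SECONDARY
                | QC_FAIL | DUPLICATE | SUPPLEMENTARY)


def passes_filters(flag, insert_size, aligned_bp, read_length):
    """Return True if an alignment survives the flag and length filters."""
    if flag & EXCLUDE_MASK:
        return False
    # Assumption: insert size is only meaningful for paired reads.
    if flag & PAIRED and not 50 <= abs(insert_size) <= 1000:
        return False
    # Require at least 50 aligned bases and at least 50% of the read aligned.
    if aligned_bp < 50 or aligned_bp < 0.5 * read_length:
        return False
    return True
```

With these bits, `EXCLUDE_MASK` equals 3852, the value that would be passed to an exclusion option such as `samtools view -F`.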

Sample filtering

Heterozygosity

We excluded samples with extraordinarily high heterozygosity, because these samples likely suffer from DNA contamination or considerable sequencing errors. We calculated heterozygosity per individual based on the site frequency spectrum (SFS) using genotype likelihoods with the GATK model in ANGSD. The analysis revealed six individuals with excessively high heterozygosity (≥ 0.00620; Supplementary Data 2), and we excluded five of the six from downstream analyses. We kept the sample with the highest heterozygosity from Kupang (N_31B) as preliminary analyses suggested it might be an F1 hybrid.

Relatedness filtering

We removed duplicates and closely related samples using the methodology described in Waples et al.79. We first computed the two-dimensional site-frequency spectrum (2d-SFS) for each pair of samples and then calculated three statistics from the 2d-SFS: R0, R1, and the KING-robust kinship coefficient. We found 54 duplicated pairs with KING-robust kinship > 0.460, of which most were from the Jabres breed, and 27 pairs of up to approximately second-degree relatives (KING-robust > 0.150). We excluded all but one sample with lower coverage from identified duplicated and related pairs, leading to 81 samples discarded (Supplementary Data 2).
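To make the three statistics concrete, the sketch below computes them from a 3 × 3 two-sample SFS. The formulas are written from the definitions in Waples et al. as recalled here, not from the present text, and should be checked against that reference. The matrix layout (rows: genotype 0/1/2 of sample 1, columns: genotype of sample 2, cells labelled A to I row-wise) and the function names are assumptions. The thresholds in `classify_pair` are the ones stated above.

```python
def relatedness_stats(sfs2d):
    """Compute R0, R1 and KING-robust kinship from a 3x3 2d-SFS.

    Assumed layout: sfs2d[i][j] is the number (or proportion) of sites
    where sample 1 carries i and sample 2 carries j derived alleles.
    """
    (A, B, C), (D, E, F), (G, H, I) = sfs2d
    r0 = (C + G) / E
    discordant = B + D + H + F + C + G
    r1 = E / discordant if discordant else float("inf")
    king = (E - 2 * (C + G)) / (B + D + H + F + 2 * E)
    return r0, r1, king


def classify_pair(king):
    """Apply the kinship thresholds used for sample filtering."""
    if king > 0.460:
        return "duplicate"
    if king > 0.150:
        return "related"
    return "unrelated"
```

For two identical samples every off-diagonal cell is zero, so KING-robust reduces to E / 2E = 0.5, which is why a threshold just below 0.5 identifies duplicates.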

Site filtering

Reference genome filtering

We implemented reference genome filtering based on different criteria. We used GenMap v1.2.080 to calculate the mappability score of each site of both the BosTau9 and the Waterbuffalo reference genomes, conservatively using 100 bp k-mers with up to two mismatches allowed (-K 100 -E 2), and default remaining settings. We removed all sites with a mappability score <1 for downstream analyses. We used RepeatMasker v.4.1.1 (repeatmasker.org) to identify repeat regions in both reference genomes, using ‘rmblast’ as the search engine and ‘mammal’ as the query species with default settings. We also excluded repeat regions identified by RepeatMasker, annotated sex chromosomes and scaffolds that were not assembled into chromosomes (Supplementary Data 3). Additionally, we inferred the sample sex using SATC81, based on the normalized sequencing depth on sex-linked scaffolds for each sample (Supplementary Data 2).

Global depth filtering

For each of the two mapping datasets, we estimated the global depth (read count) per site across all samples using the ANGSD command ‘-minMapQ 25 -minQ 30 -doCounts 1 -doDepth 1 -dumpCounts 1 -maxdepth 4000’ and then estimated the per-site median depth. We excluded sites with a global depth <0.5 times the median (0.5 × 1717 = 858.5) or >1.5 times the median (1.5 × 1717 = 2575.5) from all analyses (Supplementary Data 3).
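The depth thresholds follow directly from the median. A minimal sketch of the arithmetic, assuming the per-site global depths are available as a list of integers (the function names are illustrative):

```python
from statistics import median


def depth_bounds(site_depths, low=0.5, high=1.5):
    """Return (lower, upper) depth cutoffs as multiples of the median."""
    med = median(site_depths)
    return low * med, high * med


def keep_site(depth, bounds):
    """Keep a site whose global depth lies within the cutoffs."""
    lower, upper = bounds
    return lower <= depth <= upper
```

With a median of 1717 this gives cutoffs of 858.5 and 2575.5, matching the values quoted above.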

Excess heterozygosity filtering

We removed regions with excessive heterozygosity, which is likely caused by problematic mapping due to repetitive or paralogous regions. We first generated a preliminary file of genotype likelihoods using ANGSD with the GATK model (-GL 2) from common polymorphic sites (MAF ≥ 0.05 and SNP p < 0.000001), base quality at least 25 (-minQ 25), and minimum mapping quality of 30 (-minMapQ 30). Using these genotype likelihoods as input to PCAngsd v0.98582,83, we then calculated the per-site inbreeding coefficients (F), ranging from −1 where all samples are heterozygous to 1 where all samples are homozygous, and performed a Hardy-Weinberg equilibrium likelihood ratio test accounting for population structure. The optimal number of principal components to model the population structure was inferred based on Velicer’s minimum average partial test84 implemented in PCAngsd. Finally, we removed windows of 10 Kb around sites with significant excessive heterozygosity estimates (F < −0.95 and p < 0.000001) based on the per-site inbreeding coefficients for both BosTau9 and Waterbuffalo reference (Supplementary Data 3).
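The meaning of the per-site inbreeding coefficient can be illustrated with a simplified calculation from hard genotype counts. This is not the PCAngsd estimator, which works on genotype likelihoods and accounts for population structure. It is only the textbook F = 1 − Hobs/Hexp under a single panmictic population, shown to clarify why F approaches −1 when every sample is heterozygous.

```python
def inbreeding_coefficient(n_hom_ref, n_het, n_hom_alt):
    """Naive per-site F = 1 - observed / expected heterozygosity."""
    n = n_hom_ref + n_het + n_hom_alt
    p = (2 * n_hom_alt + n_het) / (2 * n)  # alternate allele frequency
    expected_het = 2 * p * (1 - p)
    if expected_het == 0:
        raise ValueError("site is monomorphic; F is undefined")
    return 1 - (n_het / n) / expected_het


def excess_heterozygosity(f_value, p_value, f_max=-0.95, alpha=1e-6):
    """Flag a site using the thresholds stated in the text."""
    return f_value < f_max and p_value < alpha
```

A site where all samples are heterozygous, as expected when paralogous copies are collapsed onto one reference locus, gives F = −1 and would be flagged if the accompanying test is significant.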

Genotype calling and imputation

We performed genotype calling for both datasets mapped to BosTau9 and Waterbuffalo reference genomes using bcftools v1.1485. Genotype calling was only performed on the samples maintained after sample filtering and only on genomic regions retained after site filtering. We used the ‘bcftools mpileup’ function based on reads with a minimum base quality of 25 and a minimum mapping quality of 30, enabling ‘--per-sample-mF’ to increase calling sensitivity. We then called genotypes with ‘bcftools call’ using ‘--multiallelic-caller’. Finally, we removed both multiallelic sites and indels, and applied additional filtering using the setGT plugin of bcftools, imposing a minimum depth of coverage per site of 10 and only accepting heterozygous calls with at least 3 reads supporting each allele.
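The final genotype-level filter can be expressed as a small decision rule. The sketch below is not the bcftools setGT expression used in the study. It is a plain restatement of the two criteria for a biallelic site, and it assumes that depth is the sum of reference and alternate read counts and that a failing genotype is set to missing (`None`).

```python
def filter_genotype(genotype, ref_reads, alt_reads,
                    min_depth=10, min_allele_reads=3):
    """Return the genotype if it passes the filters, otherwise None.

    genotype: tuple of two allele indices, e.g. (0, 1) for a heterozygote.
    """
    # Assumption: per-sample depth is the sum of both allele counts.
    if ref_reads + alt_reads < min_depth:
        return None
    is_het = genotype[0] != genotype[1]
    if is_het and min(ref_reads, alt_reads) < min_allele_reads:
        return None
    return genotype
```

For example, a heterozygous call supported by 9 reference reads and 2 alternate reads has sufficient depth but is rejected because one allele has fewer than 3 supporting reads.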

We did genotype imputation and phasing to remedy genotype missingness and refine the genotypes, because some samples had low depth for regular genotype calling. To prepare the input, we extracted bi-allelic SNPs from the genotype data mapped to both references: BosTau9, or ‘internal’ reference and Waterbuffalo, or ‘external’ reference. We did imputation and phasing using BEAGLE v3.3.286 separately for each chromosome. We visualized the distribution of genotype discordance between the original vcf and imputed vcf genotype files (Supplementary Fig. 30a). In order to evaluate the accuracy of imputation, we additionally conducted the analysis by downsampling the highest-depth banteng individual (34.86X, LIB112407_Banteng_85B_Texas) to depths 1X, 5X, 10X, then imputing them, and comparing the imputed genotypes with the high-quality genotype calls for the full data from this sample using an R package ‘vcfppR’87. The analysis showed a very high concordance between imputed genotype calls and true genotype calls, supporting the accuracy of imputation (Supplementary Fig. 30b).

PCA and admixture analyses

To investigate population structure, we used HaploNet29, which implements a neural network on local clustering of phased data. We trained the HaploNet model using default settings and produced log-likelihoods that can be processed further for PCA and admixture analyses. We used ten eigenvectors to capture population structure between and within the filtered dataset of 231 individuals (without the two gaurs) mapped to BosTau9. For the admixture analysis, we set the number of ancestral components (K) from 3 to 12, with 50 independent runs for each K. We used a convergence criterion of reaching within 5 log-likelihood units of the lowest log-likelihood in at least 3 independent replicates. We obtained convergence with K from 3 to 7. Based on the HaploNet results at K = 7 we then defined a subset of individuals as non-admixed representatives of each breed or population by removing samples that contained > 10% of admixture from a different ancestry source than the population majority.

Genetic diversity, runs of homozygosity, and population differentiation

Heterozygosity

We assessed genetic diversity of cattle based on genome-wide individual heterozygosity, using filtered genotype-called data that included non-variable sites with a depth between 6 and twice the average depth of each sample, a minimum allelic support of 3, a minimum mapping quality of 30, and a base quality of 25. We calculated the individual heterozygosity as the number of heterozygous genotypes relative to the total number of sites.

Runs of homozygosity

We first explored various approaches and data filtering options to optimize the detection of runs of homozygosity (ROHs) in each individual. For assessment and validation, we visualized ROHs across the genome similar to the approach in Liu et al.38 and visually checked the performance of ROH calling under different filtering and PLINK detection settings. The assessment criteria included looking for signs of long apparent ROHs being broken up, and the ability of smaller regions of consistently reduced heterozygosity to be called as ROHs in the analysis. Based on these analyses, we used the imputed datasets mapped to BosTau9 to estimate ROHs using PLINK v1.90b6.24. In PLINK we applied maximum heterozygous calls of 5 (--homozyg-window-het 5), a minimum of 500 kilobases for a ROH (--homozyg-kb 500), and filtered out all variants with missing calls (--geno 0). We merged distinct ROHs within 100 Kb distance, and categorized all resulting ROHs into five length groups: 0.5–1 Mb, 1–2 Mb, 2–5 Mb, 5–10 Mb, and >10 Mb.
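The post-processing of PLINK output (merging nearby ROHs and binning by length) can be sketched as follows. This is an illustrative reimplementation, not the study's script. It assumes ROHs are given as (start, end) pairs in base pairs for one individual on one chromosome, and that length classes are half-open at the upper boundary, which the text does not specify.

```python
ROH_CLASSES = [
    (500_000, 1_000_000, "0.5-1 Mb"),
    (1_000_000, 2_000_000, "1-2 Mb"),
    (2_000_000, 5_000_000, "2-5 Mb"),
    (5_000_000, 10_000_000, "5-10 Mb"),
]


def merge_rohs(rohs, max_gap=100_000):
    """Merge ROHs on one chromosome separated by at most `max_gap` bp."""
    merged = []
    for start, end in sorted(rohs):
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(r) for r in merged]


def roh_length_class(start, end):
    """Assign a ROH to one of the five length groups (None if < 0.5 Mb)."""
    length = end - start
    for lower, upper, label in ROH_CLASSES:
        if lower <= length < upper:
            return label
    return ">10 Mb" if length >= 10_000_000 else None
```

Merging before binning matters: two 0.6 Mb segments separated by 50 Kb end up as a single ROH in the 1–2 Mb class instead of two in the 0.5–1 Mb class.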

Global FST

To infer population differentiation, we calculated genome-wide global FST for each pair of populations (after removing admixed individuals) using ‘--fst’ implemented in PLINK2.088. This calculates Hudson’s FST estimator, which is robust to differences in sample size between populations89. We further built a NeighborNet tree based on the pairwise global FST distances in order to show the relationships of the populations.
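For reference, Hudson's estimator can be written in a few lines. The sketch below follows the form given by Bhatia et al. (2013), combining SNPs as a ratio of averages. It is not PLINK's implementation. The inputs are assumed to be per-SNP allele frequencies in the two populations, and `n1` and `n2` are assumed to be the numbers of sampled chromosomes.

```python
def hudson_fst(freqs1, freqs2, n1, n2):
    """Genome-wide Hudson FST as a ratio of summed numerators and denominators."""
    numerator = 0.0
    denominator = 0.0
    for p1, p2 in zip(freqs1, freqs2):
        # Squared frequency difference, corrected for sampling variance.
        numerator += ((p1 - p2) ** 2
                      - p1 * (1 - p1) / (n1 - 1)
                      - p2 * (1 - p2) / (n2 - 1))
        denominator += p1 * (1 - p2) + p2 * (1 - p1)
    return numerator / denominator
```

Because the sampling correction is applied to each population separately, the estimate does not depend on the ratio of sample sizes, which is the robustness property referred to above.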

Population history and introgression analysis

TreeMix

To investigate the history of population splits and historical admixture events, we performed a TreeMix analysis90 using the imputed data mapped to the Waterbuffalo reference. Because relatedness among samples can cause covariance to be underestimated and lead to spurious inferences of migration in TreeMix, we applied a more stringent sample filter to the input datasets, using a KING kinship coefficient threshold of 0.1 to exclude potential second-degree relatives. In the TreeMix analysis we included the Indonesian cattle breeds, Bali cattle, captive banteng, East Asian zebu, East Asian taurine and European taurine, two gaurs, and the Waterbuffalo to root the tree (Supplementary Data 2). We ran TreeMix assuming 0–10 migration events. For each number of migration events (m), we ran 100 iterations using bootstrap (-bootstrap) and a block size of 1000 SNPs (-k 1000). We inferred the final, optimal number of migration edges (m) from the second-order rate of change in likelihood (Δm) weighted by the standard deviation using the ‘Evanno’ method implemented in OptM91.

D-statistics

To infer the evolutionary relationships and ancient admixture events, we calculated D-statistics (Patterson’s D, also called ABBA-BABA) using the R package ADMIXTOOLS292. We used datasets mapped to the Waterbuffalo reference to mitigate the effects of reference bias. We ran D-statistics analyses of the type (H1-H2-H3-H4) with two topologies, captive banteng-cattle-South Asian zebu-water buffalo and cattle-South Asian zebu-captive banteng-water buffalo, using the function ‘qpdstat’ implemented in ADMIXTOOLS2.
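The quantity underlying these tests can be written compactly from population allele frequencies. The sketch below is not the ADMIXTOOLS2 implementation (which also provides block-jackknife standard errors). It is a point estimate only, using a frequency-based form of Patterson's D with the sign convention that D is positive when H1 shares more alleles with H3 than H2 does. Both the function name and that convention are assumptions of this example.

```python
def d_statistic(p1, p2, p3, p4):
    """Patterson's D for (H1, H2; H3, H4) from per-SNP allele frequencies."""
    numerator = sum((a - b) * (c - d)
                    for a, b, c, d in zip(p1, p2, p3, p4))
    denominator = sum((a + b - 2 * a * b) * (c + d - 2 * c * d)
                      for a, b, c, d in zip(p1, p2, p3, p4))
    return numerator / denominator
```

With one haploid sequence per population the frequencies are 0 or 1, and the expression reduces to the familiar (nBABA − nABBA) / (nBABA + nABBA) count-based form under this sign convention.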

mtDNA, Y-chromosome DNA analyses, and phylogenetic tree

To infer maternal and paternal phylogenetic relationships between cattle, we inferred phylogenetic trees for both mitochondrial DNA (mtDNA) and the Y chromosome (Ychr). To identify the matrilines in the cattle, we generated consensus mitochondrial sequences. Briefly, we used the -doFasta 1 option of ANGSD93 with quality values specified as ‘-minMapQ 30 -minQ 30 -setMinDepth 5 -uniqueOnly 1 -remove_bads 1’ and also ‘-doCounts 1’ to generate consensus fasta from whole-genome sequencing data mapped to the BosTau9 mitochondrial scaffold (NC_006853). We aligned these fasta sequences using the FFT-NS-1 (fast) option of MAFFT94. We imported the aligned sequences into the Jalview alignment editor95, removed the regions in the alignment with a high proportion of ‘N’, and exported the edited sequences in fasta format. We then aligned these edited sequences using the G-INS-i (accurate) option of MAFFT and wrote the output in fasta format. For creating a haplotype network, we converted the fasta files to nexus format and imported them into POPART96 to create a minimum spanning haplotype network with an epsilon of 0. For creating a neighbor-joining tree, we imported the aligned sequences into TreeViewer97 and used a Hamming distance model with 5000 bootstrap replicates.

We built the phylogenetic tree for the Y chromosome (Ychr) using BEAST v1.10.498. We first generated the consensus Ychr sequences (only in males) using ANGSD from whole-genome sequencing data mapped to the taurine Ychr (GenBank: CM001061.2), with the same settings as for the mtDNA consensus sequences. We removed heteroplasmic sites by masking (as ‘N’) any Y chromosome site where <95% of reads carried the same base. After consensus calling, we performed phylogenetic analyses using the GTR + G + I substitution model and a coalescent Extended Bayesian Skyline Plot prior to avoid restricting the tree by imposing a confining demographic prior. We then ran the Markov chain Monte Carlo (MCMC) chain for 10⁷ steps, sampling trees and parameters every 1000 steps. We assessed convergence and proper mixing by visual inspection and by estimating parameter effective sample sizes using TRACER99. We used TreeAnnotator100 to make a maximum clade credibility tree, discarding the first 1000 trees. We used Figtree v1.4.4 (tree.bio.ed.ac.uk/software/figtree) to visualize the maximum clade credibility tree.
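The 95% masking rule can be sketched as follows. This is a simplified illustration on per-site base counts, not the ANGSD-based workflow; the function name `call_consensus_base` and the dictionary input format are assumptions.

```python
def call_consensus_base(base_counts, min_major_fraction=0.95):
    """Return the majority base, or 'N' if it is carried by too few reads.

    `base_counts` maps each base to its read count at one site
    (hypothetical input format).
    """
    total = sum(base_counts.values())
    if total == 0:
        return "N"  # no coverage at this site
    base, count = max(base_counts.items(), key=lambda kv: kv[1])
    # Mask the site when fewer than 95% of reads carry the same base.
    return base if count / total >= min_major_fraction else "N"


clean_site = call_consensus_base({"A": 30})
mixed_site = call_consensus_base({"A": 18, "G": 2})
```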

Divergence time

We used MSMC2101 to estimate the divergence time between zebu (Bos indicus), banteng (B. javanicus) and Bali cattle. We used the phased callable regions of two individuals per population, randomly sampling 10 million SNPs from the genome. We scaled the results for visualization by assuming a generation time of 5–7 years4,43,44, and a mutation rate of 1.26 × 10⁻⁸ per generation4.
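The scaling step can be written out explicitly. MSMC2 reports time in mutation-scaled units, so dividing by the per-generation mutation rate gives generations and multiplying by the generation time gives years. The sketch below uses the values stated above; the function name `scaled_time_to_years` and the example input are illustrative.

```python
def scaled_time_to_years(scaled_time, mutation_rate=1.26e-8, generation_time=6.0):
    """Convert a mutation-scaled coalescent time to years."""
    generations = scaled_time / mutation_rate
    return generations * generation_time


# Bracketing the stated generation-time range of 5 to 7 years
# for an arbitrary example value of scaled time.
example_scaled_time = 1.26e-4
years_low = scaled_time_to_years(example_scaled_time, generation_time=5.0)
years_high = scaled_time_to_years(example_scaled_time, generation_time=7.0)
```

Because the conversion is linear, uncertainty in generation time translates directly into a proportional range of divergence dates.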

Local ancestry inference in admixed populations

LOTER

After having identified cattle populations with signs of ancestral admixture, we did local ancestry inference using LOTER39. LOTER has been used for a wide range of species such as humans102,103, primates104,105, cattle7, and rapeseed106, and does not require prior knowledge such as recombination maps to be implemented. We used imputed and phased datasets with a total of 22,158,517 SNPs mapped to BosTau9 as input for LOTER. We used unadmixed zebu and banteng individuals as two ancestry source references: (1) the zebu reference population consisted of 14 Sumba Ongole and 11 South Asian zebu, and (2) the banteng reference population consisted of 19 Bali cattle from Bali, 12 Bali cattle from Australia, and five captive banteng. In the reference sets we considered that a low probability of being admixed was more important than having a larger reference set of individuals with more uncertain admixture profiles. We used a total of 90 individuals as target admixed individuals from the following populations with introgression from banteng or another Bos source: Aceh, Pesisir, Pasundan, Jabres, Madura, and East Asian zebu. In addition, we included Bali cattle from Kupang, because this population showed signs of introgression from cattle. We performed all analyses using the ‘lc.loter_smooth’ function, which enables a phase-correction module. We estimated the overall proportion of banteng ancestry in each individual by calculating the number of SNPs inferred to be derived from banteng and dividing by the total number of SNP sites. Afterwards, we used non-overlapping sliding windows of 50 Kb to consolidate the raw output from LOTER across individuals in each population. For each admixed population, we calculated the proportion of banteng ancestry in each 50 Kb window by calculating the proportion of SNPs inferred to be of banteng ancestry in each individual (two haplotypes per individual), and taking the mean of this value across all haplotypes in the population.
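The two summaries described above (per-individual genome-wide banteng proportion and per-window population means) can be sketched with NumPy. The sketch assumes the local ancestry calls are available as a 0/1 matrix with haplotypes in rows and SNPs in columns, with 1 coding banteng ancestry and the two haplotypes of each individual in consecutive rows; this layout and the function name are assumptions for illustration.

```python
import numpy as np


def banteng_proportions(ancestry, positions, window=50_000):
    """Summarize a haplotype-by-SNP 0/1 ancestry matrix (1 = banteng).

    Returns per-individual genome-wide proportions, window start
    coordinates, and the mean banteng proportion per window.
    """
    ancestry = np.asarray(ancestry, dtype=float)
    positions = np.asarray(positions)
    # Genome-wide proportion: banteng SNPs divided by all SNP sites.
    per_haplotype = ancestry.mean(axis=1)
    # Assumes an even number of rows, two consecutive haplotypes per individual.
    per_individual = per_haplotype.reshape(-1, 2).mean(axis=1)
    # Assign each SNP to a non-overlapping window by integer division.
    window_index = positions // window
    windows = np.unique(window_index)
    per_window = np.array(
        [ancestry[:, window_index == w].mean() for w in windows]
    )
    return per_individual, windows * window, per_window


anc = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 0], [1, 1, 1, 1]]
pos = [10, 20_000, 60_000, 90_000]
per_ind, starts, per_win = banteng_proportions(anc, pos)
```

Since every haplotype has the same SNPs in a window, the pooled mean over the window equals the mean of the per-haplotype proportions.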

As local ancestry inference can potentially be affected by the choice of reference genome we assessed whether mapping to a banteng reference would impact the LOTER analyses. We downloaded a recently available banteng reference genome (RefSeq: GCF_032452875.1, ARS-OSU_banteng_1.0) and performed mapping of all the raw data as described above to this reference genome. We then redid all steps described above in the “Site filtering” and “Genotype calling and imputation” sections separately for this alternative mapping, and performed a LOTER analysis as described above. Finally, we calculated the genome-wide proportion of inferred banteng ancestry based on this alternative mapping and plotted the count of SNPs inferred to be of banteng ancestry in each 50 Kb genomic window for each of the two mapped data sets, for two example individuals from Madura (N_911 and N_935). Comparability between the two mapped data sets was ensured by exploiting a liftover of genomic coordinates between the banteng and cattle reference genome, available on NCBI. The analyses found almost identical genome-wide banteng proportions using either mapped data set (Madura population as example shown in Supplementary Fig. 18a), and that the correlation between banteng-inferred SNPs across individual windows was also very high (Supplementary Fig. 18b). We therefore conclude that our findings are robust to the choice of reference genome.

F4 ratios

To estimate the ancestry proportions in each admixed individual, we calculated F4 admixture ratios using ‘qpadm’ implemented in ADMIXTOOLS292. This models a target population as a mixture of two source populations given a set of outgroup populations40. We used the same ancestry source references and target admixed populations as in the LOTER analysis. We then estimated F4 ratios in the form of α = f4 (taurine, water buffalo; banteng source, target) / f4 (taurine, water buffalo; banteng source, zebu source). We used 5 × 10⁶ as the SNP block size for jackknifing.
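The ratio above can be illustrated directly on allele frequencies. The sketch below is not the ADMIXTOOLS2 implementation and omits the block jackknife; the function names and the toy frequencies are hypothetical.

```python
import numpy as np


def f4(a, b, c, d):
    """f4(A, B; C, D) as the mean product of allele-frequency differences."""
    a, b, c, d = (np.asarray(x, dtype=float) for x in (a, b, c, d))
    return np.mean((a - b) * (c - d))


def f4_ratio(taurine, buffalo, banteng, target, zebu):
    """alpha = f4(taurine, buffalo; banteng, target) / f4(taurine, buffalo; banteng, zebu)."""
    return f4(taurine, buffalo, banteng, target) / f4(taurine, buffalo, banteng, zebu)


taurine = np.array([0.9, 0.2, 0.6])
buffalo = np.array([0.1, 0.7, 0.3])
banteng = np.array([0.2, 0.9, 0.1])
zebu = np.array([0.8, 0.3, 0.7])
# Toy target built as an exact 30:70 mixture of the banteng and zebu frequencies.
target = 0.3 * banteng + 0.7 * zebu
alpha = f4_ratio(taurine, buffalo, banteng, target, zebu)
```

In this toy construction, α equals the weight placed on the zebu source (0.7), so the banteng share is 1 − α. With real data the estimate additionally depends on the assumed population tree being correct.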

Hmmix

Additionally, we detected archaic introgressed segments in individual genomes of Indonesian cattle (Aceh, Pesisir, Pasundan, Jabres, Madura) and the East Asian zebu population, with South Asian zebu and Sumba Ongole as outgroups, using Hmmix v0.6.941. This approach is based on a hidden Markov model that identifies genomic regions with a high density of single nucleotide variants not seen in outgroup populations (non-admixed); therefore, it can be used without relying on ancestry reference sources. The rationale behind this approach is to identify regions of high SNP density after removing variation found in outgroup populations, because introgressed regions with higher SNP density have spent more time accumulating variation that is not found in the outgroup compared to non-introgressed regions. We first prepared the input files for this method (weights based on the sites retained after site filtering, local mutation rates, and individual observation files) from imputed datasets mapped to BosTau9, using scripts provided with the repository for Hmmix (github.com/LauritsSkov/Introgression-detection). We then applied the method to Indonesian cattle and East Asian zebu populations using the following different prior parameters as model training to detect the best-fitting hidden Markov model parameters: Aceh (starting_probabilities = 0.93, 0.07, transitions = 0.98, 0.02 and 0.25, 0.75, emissions = 2, 25); Pesisir (starting_probabilities = 0.88, 0.12, transitions = 0.99, 0.01 and 0.09, 0.91, emissions = 2, 25); Pasundan (starting_probabilities = 0.80, 0.20, transitions = 0.99, 0.01 and 0.05, 0.95, emissions = 2, 25); Jabres (starting_probabilities = 0.76, 0.24, transitions = 0.99, 0.01 and 0.04, 0.96, emissions = 2, 25); Madura (starting_probabilities = 0.63, 0.37, transitions = 0.98, 0.02 and 0.04, 0.94, emissions = 2, 25); East Asian zebu (starting_probabilities = 0.84, 0.16, transitions = 0.93, 0.07 and 0.36, 0.64, emissions = 2, 25). Subsequently, we decoded the data with the best hidden Markov model parameters that maximized the likelihood and identified the archaic introgressed segments. We annotated the archaic introgressed regions by potential source populations (or species) by calculating the ratio of inferred archaic SNPs in each archaic fragment that was shared with each of two possible source populations: 1) a banteng population consisting of five Javan banteng, 19 Bali cattle from Bali, and 12 Bali cattle from Australia, and 2) two gaur individuals. Moreover, we calculated the identity-by-state matrix for all pairs of admixed cattle individuals based on their overlapping archaic regions. We did all analyses using a 10 Kb window and retaining only the archaic regions with probability >0.9.

UX and related analyses

We also explored a metric proposed by Racimo et al.107 to identify putatively adaptively introgressed regions. The metric counts the sites that are nearly fixed for different alleles in cattle and banteng (banteng-specific alleles), and where the banteng-specific allele occurs at a high frequency in the admixed population. It is known as the UA,B,C(w,x,y) statistic107, where w is the maximum allele frequency in unadmixed cattle (population A), x is the minimum allele frequency in the target admixed population (population B), and y is the minimum allele frequency in banteng (population C). We calculated both UA,B,C(0.05,0.50,0.95) and UA,B,C(0.05,0.20,0.95) in non-overlapping 50 Kb windows in each admixed population. We used the same reference populations to represent unadmixed zebu and banteng as in the LOTER analysis. For practicality, we refer to these statistics as Uabc50 and Uabc20, respectively. However, it is challenging to decide on the value of the minimum allele frequency x in population B that gives the best discriminative power for adaptive introgression107, and we found that this problem is exacerbated when the original admixture proportions as well as sample sizes differ among populations. In addition, this count statistic is influenced by the genome-wide variation in absolute sequence divergence between populations A and C, leading to a potential decoupling of the UA,B,C count from the proportion of local ancestry. Therefore, we also calculated another statistic (UX) by calculating x in UA,B,C(0.05,x,0.95) for each admixed population in 50 Kb bins. In other words, we calculated the mean allele frequency of population B in sites that had <0.05 derived allele frequency in unadmixed cattle, and >0.95 across Bali cattle and banteng. UX, the mean allele frequency at banteng-diagnostic sites, has a more continuous distribution than Uabc20 and Uabc50, and had a higher correlation with local LOTER and Hmmix ancestry proportions than any of the UA,B,C statistics we examined, but still constitutes an independent approach for inferring regions of high banteng ancestry in admixed cattle.
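The difference between the count statistic and UX can be seen in a small sketch for a single window. It takes derived allele frequencies for unadmixed cattle (A), the admixed population (B), and banteng (C); the function names, the use of strict inequalities, and the toy frequencies are assumptions for illustration.

```python
import numpy as np


def u_abc(freq_a, freq_b, freq_c, w, x, y):
    """Count sites with frequency < w in A, > x in B, and > y in C."""
    freq_a, freq_b, freq_c = (np.asarray(f, dtype=float) for f in (freq_a, freq_b, freq_c))
    return int(np.sum((freq_a < w) & (freq_b > x) & (freq_c > y)))


def u_x(freq_a, freq_b, freq_c, w=0.05, y=0.95):
    """Mean frequency in B over banteng-diagnostic sites (A < w and C > y)."""
    freq_a, freq_b, freq_c = (np.asarray(f, dtype=float) for f in (freq_a, freq_b, freq_c))
    diagnostic = (freq_a < w) & (freq_c > y)
    # Windows without diagnostic sites have no defined value.
    return float(freq_b[diagnostic].mean()) if diagnostic.any() else float("nan")


cattle = [0.00, 0.02, 0.50, 0.01]
admixed = [0.60, 0.30, 0.90, 0.10]
banteng = [1.00, 0.97, 1.00, 0.99]
uabc50 = u_abc(cattle, admixed, banteng, 0.05, 0.50, 0.95)
uabc20 = u_abc(cattle, admixed, banteng, 0.05, 0.20, 0.95)
ux = u_x(cattle, admixed, banteng)
```

The two counts change with the chosen threshold x, whereas UX summarizes the same diagnostic sites without requiring a threshold in population B.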

Correlation between banteng ancestry and genomic features: recombination rates, coding region density, and conservation score

To investigate correlations between genomic features and banteng ancestry in admixed breeds, we compared the estimated proportion of banteng ancestry in each 50 Kb window with the mean recombination rate, the mean coding region density, and the number of conserved sites in the same window for the Madura population. For recombination rates, we obtained a sex-specific cattle recombination map108 and did linear interpolation for each SNP by using the ‘approx’ function in R. We then aggregated the recombination map to the 50 Kb window for both sexes. We also obtained information regarding the number of sites in coding regions and the number of conserved sites (phastCons30way) from the Ruminant Genome Database (RGDv2109) binned to the same window size. The latter was based on the BosTau9 version of the Bos taurus reference genome collated with the Y chromosome of the Btau5.0.1 version (ARS-UCD1.2_Btau5.0.1). For each comparison, we split the windows into 10 quantiles and then calculated the mean proportion of banteng ancestry for each quantile along with its standard error as a measure of uncertainty. We plotted the distributions across quantiles as both scatter plots overlaid with the means, as well as kernel density plots. We also applied a genome-wide Spearman rank correlation test of banteng ancestry proportions among each pair of admixed breeds across all windows and those in the top 5%.
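The quantile summary described above can be sketched with NumPy. Windows are split into ten equal-frequency bins of the genomic feature, and the mean banteng ancestry with its standard error is computed per bin. The function name `decile_summary` and the synthetic input are illustrative, and the binning details are assumptions rather than the authors' exact procedure.

```python
import numpy as np


def decile_summary(feature, ancestry, n_bins=10):
    """Mean and standard error of ancestry within quantile bins of a feature."""
    feature = np.asarray(feature, dtype=float)
    ancestry = np.asarray(ancestry, dtype=float)
    edges = np.quantile(feature, np.linspace(0, 1, n_bins + 1))
    # Map each window to a bin; the maximum value falls in the last bin.
    bins = np.clip(np.searchsorted(edges, feature, side="right") - 1, 0, n_bins - 1)
    summary = []
    for b in range(n_bins):
        values = ancestry[bins == b]
        mean = values.mean() if values.size else np.nan
        # Standard error of the mean, undefined for fewer than two windows.
        se = values.std(ddof=1) / np.sqrt(values.size) if values.size > 1 else np.nan
        summary.append((mean, se))
    return summary


# Synthetic example: ancestry increases linearly with the feature.
feature = np.arange(100)
ancestry = feature / 100
summary = decile_summary(feature, ancestry)
```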

Ancestry-specific population structure in admixed populations

To investigate each of the ancestry sources in admixed populations, we inferred population structure for zebu-specific and banteng-specific ancestry regions, using EMU, a method designed to be robust to both random and non-random missingness110. We used imputed datasets mapped to BosTau9 and performed stringent sample filtering by removing one individual of each pair with K1 > 0.2 identified by NgsRelate111,112 within each population (Supplementary Fig. 31). For inferring zebu ancestry structure, we first extracted SNPs inferred to be of zebu ancestry by LOTER per individual, and regions identified as ‘cattle’ by Hmmix per individual. We then merged all admixed individuals with LOTER-zebu and with Hmmix-zebu ancestry with the unadmixed zebu population (Sumba Ongole and South Asian Zebu) separately as inputs for EMU analyses. To infer population structure with EMU, we applied seven eigenvectors (--n_eig 7), maximum iterations of 1000 (--iter 1000), and a threshold for minor allele frequencies of 0.05 (-f 0.05). We estimated PCA for banteng ancestry structure in a similar way, by extracting SNP positions with banteng ancestry using LOTER and regions with ratio > 0.8 (number of banteng population SNPs ÷ derived number of SNPs) from Hmmix per admixed sample, respectively. We then performed EMU-PCA on this set of banteng-annotated ancestry regions in admixed individuals and the whole genome of unadmixed banteng populations (Bali cattle from Bali, Bali cattle from Australia, and captive banteng) using the same settings as above, except for eight eigenvectors (--n_eig 8).

As some additional whole-genome sequenced samples from Asian zebu became available during the preparation of the study that could potentially be of relevance to place the masked zebu ancestry of the Indonesian cattle into a geographical context, we downloaded data from Chen et al.113 and Chen et al.5 (Supplementary Data 7) and performed mapping, imputation and LOTER analysis as described for our original data set. We then included them in an additional EMU-PCA. These samples were placed on a cline towards the East Asian zebu samples in the zebu-specific EMU-PCA, consistent with their geographical placement in northernmost Southeast Asia along the shortest route between India and China (Supplementary Fig. 11b). While supporting the credibility of the EMU-PCA as a method to detect historical dispersals, this result does not enable us to further disentangle the origin of zebu cattle introduced to Indonesia. Only samples from further south in mainland Southeast Asia would have enabled this.

To check the performance of masking based on LOTER results, we recalculated D-statistics after masking all SNPs inferred to be introgressed in each sample. The resulting plot demonstrates that the LOTER-based masking removes all, or by far most, of the introgression from banteng-like Bos sources (Supplementary Fig. 11a).

Estimation of admixture time

To infer the timing of admixture events, we traced the ancestry of discrete genomic segments for all Indonesian cattle populations and East Asian zebu using Ancestry_HMM, a hidden Markov model-based method42,45. We fitted a single-pulse admixture model to the genome-wide variation data and used the mean ancestry proportions estimated by LOTER as the assumed admixture proportion (-p 1 100000 -0.6 -p 0 -500 -0.4). We also tried a two-pulse admixture model for the Javan breeds with the settings: ‘-p 1 100000 -0.6 -p 0 -1000 -0.2 -p 0 -200 -0.2’. However, the higher admixture time and lower admixture proportion (~0.01%) estimated for the first pulse when using the two-pulse model suggest that the two-pulse model is a poor fit to the data45 (Supplementary Fig. 32). We quantified uncertainties by doing 100 bootstrap replicates for each population using a block size of 5000 SNPs (-b 100 5000).

Overlapping introgressed segments among LOTER, Hmmix, and UX

To obtain the top 5% of windows with the highest inferred banteng ancestry in each of the five cattle groups (Aceh, Pesisir, Pasundan, Madura, and East Asian zebu), we merged the results from LOTER, Hmmix, and UX using non-overlapping 50 Kb windows defined on the taurine autosomal chromosomes (BosTau9) and annotated with genes from the BosTau9 reference-genome annotation (GFF, Ensembl version 106) using bedtools intersect114. For each of the 50 Kb windows, we calculated the proportion of banteng SNPs from LOTER (anc1), the number of SNPs in UX, and the mean proportion of archaic regions with probability ≥ 0.9 from Hmmix. For each statistic, we determined the top 5% quantile in each cattle group to filter out windows that do not contain the highest banteng ancestry. Due to the high number of windows with Uabc50 and Uabc20 = 0, the top 5% quantile from the UX statistics is 0 and was hence not used for this filtering step. Before obtaining the top 5% windows of highest banteng ancestry, we first fitted a linear model to explore the predictive power of Hmmix for LOTER and found that Hmmix can explain more than 50% of the variation in LOTER values (Supplementary Fig. 23). However, the consistently higher proportion of introgressed ancestry inferred by Hmmix than by LOTER and UX could be due to Hmmix finding introgression from a wider array of sources than LOTER (e.g. other bovines present in Southeast Asia), to a higher tendency towards false positives in Hmmix, or to a mixture of the two. We thus only kept windows that passed the top 5% quantile of LOTER values in each cattle group. We then listed these regions for each cattle group and annotated them with a gene list of BosTau9 from NCBI Bos taurus Annotation Release 106 (2019-12-18, GCF_002263795.1_ARS-UCD1.2_genomic.gff.gz115) and QTL information of Bos taurus ARS-UCD1.2 from QTLdb release 53116 to identify signs of adaptive introgression. We used values from Hmmix and UX for further validation, as an overall robustness check.

Gene enrichment analysis

To characterize the functional associations of these candidate regions for adaptive introgression, we performed GO-enrichment analyses for the outlier gene set from each breed. For each unique gene in the top 5% list, as well as genes in the zero-banteng regions, we performed an overrepresentation analysis for gene enrichment using the g:GOSt feature of the web-based g:Profiler117. Genes overlapping the top 5% banteng ancestry regions of each cattle population (Aceh, Pesisir, Pasundan, Madura, and East Asian zebu) were used as input to g:GOSt v110 using Bos taurus as the annotation set and a g:SCS significance threshold of 0.05. For each cattle population we collected the outcome of the gene enrichment based on molecular function, cellular component, biological process, as well as terms from KEGG, REAC, and HP if any (Supplementary Data 8). We then made an UpSet plot for genes in the top 5% using the ComplexUpset package118 in R v4.3.3119 to see how many outlier genes were shared across multiple cattle populations. To assess any overall overrepresentation or underrepresentation of genes contained in the top 5% banteng ancestry regions in each breed, we performed a simulation in which we randomly drew 5% of the genomic windows (2482 windows) out of the total of 49,624 genome-wide windows, repeated 1000 times. We then counted the number of genes for each iteration and plotted the results in a histogram.
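The resampling step can be sketched as follows. The sketch assumes each genomic window is represented by the set of gene identifiers overlapping it and counts unique genes per draw; the function name, the input format, and the use of rounding up to obtain the number of windows are assumptions (rounding up 5% of 49,624 does give the stated 2482 windows).

```python
import math
import random


def null_gene_counts(window_genes, fraction=0.05, n_iter=1000, seed=1):
    """Number of unique genes in random draws of a fraction of all windows.

    `window_genes` is a list with one set of gene IDs per window
    (hypothetical input format).
    """
    rng = random.Random(seed)
    n_draw = math.ceil(fraction * len(window_genes))
    counts = []
    for _ in range(n_iter):
        drawn = rng.sample(range(len(window_genes)), n_draw)
        # Union of gene sets so that genes spanning windows count once.
        genes = set().union(*(window_genes[i] for i in drawn))
        counts.append(len(genes))
    return counts


# Toy genome of 20 windows, so each draw contains a single window.
toy_windows = [{"g1"}, {"g1", "g2"}, set(), {"g3"}] * 5
counts = null_gene_counts(toy_windows)
```

The observed gene count in the top 5% windows can then be compared against the distribution in `counts`, for example by plotting it on the histogram.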

Haplotype structure

We visualized the haplotype structure of the ASIP gene on chromosome 13 between 63.64 Mb and 63.67 Mb using Haplostrips120. The software extracts the haplotype data from the phased genotypes, keeps the samples belonging to populations of interest, chooses only the most informative sites by eliminating variants with very low frequency in all the populations, and finally produces a plot that displays the haplotypes in rows, with each column representing a SNP within the region of interest. Populations of captive banteng and Bali cattle were lumped as banteng and were treated as the reference population. We also built the haplotype network for the ASIP gene using the same methods as described above for the mtDNA analyses.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.