Introduction

The Sahiwal, probably the heaviest milker of all indigenous cattle breeds and possessing a well-developed udder, originated in the dry Punjab region of the India-Pakistan subcontinent. It has become one of the dairy breeds of choice in both India and Pakistan owing to its high milk yield, heat tolerance, tick tolerance and high resistance to both internal and external parasites1. The breed has also gained recognition for its ability to yield sufficient milk in a subsistence agriculture system. Many nations have acknowledged the superior qualities of Sahiwal cattle, which are regarded as a transboundary breed, present in more than one country and selected for dairy and meat production worldwide. They have been imported by several nations for purebred raising as well as for the creation of synthetic breeds, including Australian-Friesian-Sahiwal, Australian Milking Zebu, Frieswal, Jamaica Hope, Karan Swiss, Mafriwal, Mpwapwa, and Taurindicus2.

In India, the breeding tract of Sahiwal lies along the Indo-Pakistan border in Ferozepur, Haryana, Amritsar district of Punjab, Sri Ganganagar district of Rajasthan and the adjoining areas. The breed is believed to be related to cattle of Afghanistan and to the Red Sindhi and Gir cattle breeds3. After the independence of India in 1947, many of the traditional Sahiwal breeders migrated to India, while the majority remained in Pakistan. While the Indian Sahiwal cattle were mostly reared for milk production, those in Pakistan were reared as a dual-purpose breed for both dairy and meat. After almost seven decades, which roughly corresponds to 7–8 generations, and with limited cross-border exchange of Sahiwal germplasm, the question arises whether natural and artificial selection have altered the proportion of variation in the Sahiwal populations reared for different purposes in the two countries, leaving differential signature patterns in their genomes. It is well known that selection for economic traits has left clear selection footprints in the genome of cattle. Identifying these selection signatures can help genomic breeding efforts aimed at increasing dairy cattle production.

Moreover, as long as the nation is focused on choosing the best animals with the highest potential for milk production, genomic regions of Sahiwal cattle will remain under strong selective pressure for a considerable amount of time4,5. To fully understand the molecular mechanisms influencing quantitative as well as other important traits, it is therefore essential to examine the genomic signatures of selection in Sahiwal cattle. Selection signatures are distinct patterns of DNA variation that result from changes at both selected and neutral loci in the genome of a species that has been under selection over time6. SNP genotypic data can help in the detection of selection signatures to uncover genes and advantageous mutations associated with ecologically and economically important traits. Genomic selection signatures show breed specificity and point to putative regions of importance in indigenous and endangered breeds that contain variants underlying phenotypic variation7. Beneficial mutations confer a selective advantage in a particular livestock population, and identifying them gives knowledge of the evolutionary history of different livestock breeds, which in turn highlights new targets for selection and genetic improvement of the breed.

Several studies on signatures of selection in Sahiwal have been conducted and have demonstrated their utility in locating polymorphisms and/or potential genes responsible for economically significant traits5,8,9,10. These discoveries are vital for understanding the processes that account for the variations in phenotype between breeds. However, in contrast to these earlier studies, which focused mainly on the assessment of selection signatures of Sahiwal cattle of either Indian or Pakistani descent, our study consisted of a comparative assessment of selection signatures in the genome of transboundary Sahiwal cattle using SNP genotype data from three different populations of Indian and Pakistani origin. The majority of the previous research relied on a number of single statistical tests, which have a lower efficiency in identifying selection signatures. Integrating several selection signature statistics into a multi-point statistic, viz. the De-Correlated Composite of Multiple Selection Signals (DCMS) framework, is therefore expected to be more reliable and robust. By combining multiple statistics within a single framework and accounting for their correlation, the DCMS method yields more efficient results than other statistical tests11, which in turn provides comparative genomic information on Sahiwal of different origins with better statistical power and resolution12. With this background, and to address the issue of transboundary Sahiwal populations, the present study was undertaken to analyze three genomic data sets of Sahiwal (two of Indian and one of Pakistani origin) to determine the comparative differences in the genomic signature patterns among these three populations for major economic traits.

Results

Quality control and effective population size (Ne)

The present study employed three Sahiwal populations, two from India (NDRI and Hisar) and one from Pakistan13, with a total of 240 animals. After quality control of genotypes for Hardy–Weinberg equilibrium, minor allele frequency, genotype call rate, and duplicated genotype parameters, a total of 39,040, 41,227, and 20,150 SNPs remained in the final datasets of NDRI, Hisar, and Pakistan Sahiwal, respectively. QC of genotype data minimizes biases in the data, and accordingly the final datasets were used for Ne estimation in the three Sahiwal herds. Using GONE software14, Ne in the first generation was estimated to be 54, 55, and 93 for NDRI, Hisar, and Pakistan Sahiwal, respectively (Fig. 1); these values correspond to the phasing parameters --effective-size 54, --effective-size 55 and --effective-size 93. The GONE software is an advancement over SNeP and similar methodologies, in which the value of LD between loci at a given genetic distance is determined by the combined effects of genetic drift and recombination that have accumulated throughout earlier generations. All of the Ne estimates in this study are higher than 50, indicating that the populations exceed the FAO’s minimum recommended level of 5015.

Fig. 1

Effective population size of NDRI, Hisar and Pakistan Sahiwal.

De-correlated composite of multiple selection signals (DCMS)

The DCMS method combines five statistics, viz. FST, Haplotype Homozygosity Statistics (H1), Modified Haplotype Homozygosity (H12), Tajima’s D index and Nucleotide Diversity (π), and was estimated for each of the Sahiwal populations using the MINOTAUR R package16. The selection analysis discovered 31, 24, and 15 genomic regions for NDRI, Hisar, and Pakistan Sahiwal, respectively. The total number of protein-coding genes was 115 for NDRI, 94 for Hisar, and 52 for the Pakistan Sahiwal population. Gene annotation of the identified regions (Table 1) revealed several recognized and new candidate genes linked to important economic traits. Genes such as PCP4, MCOLN2, Lpar3, DEK, LPIN2 and DLGAP1 were common to the Indian Sahiwal populations (NDRI and Hisar) and were associated with feed efficiency, immune response, mastitis resistance, resistance to bovine tuberculosis, milk protein and milk yield. Furthermore, in the NDRI and Hisar Sahiwal populations, a number of other candidate genes were found that were mostly related to reproduction and production traits, viz. SH3BGR, PSMG1 and BRWD1, associated with utero-embryonic development, fertility, reproductive functions and cellular response to heat stress; BPIFB1 for innate immune response; NEK11 for milk composition traits; KCNH3 for growth and milk protein content; MITF for coat colour; and PICALM for milk protein content and cheesemaking properties. In contrast, most of the candidate genes found in the Pakistan Sahiwal population were related to growth and carcass traits, viz. RALGAPA2 and CFAP61 for subcutaneous fat thickness, feed intake, conformation and weight; RNF111 for yearling weight; GALNT17 and AUTS2 for milk fat and protein and feed efficiency; and TCEA3 for bovine MDSCs (muscle-derived satellite cells). The distribution of the regions carrying signals of selection throughout the genomes of the three Sahiwal populations is presented in Fig. 2.

Table 1 Gene annotation through DCMS methods in three different Sahiwal populations.

Fig. 2

Manhattan plot showing the genomic areas identified by the DCMS (a) NDRI Sahiwal (b) Hisar Sahiwal (c) Pakistan Sahiwal. The blue lines represent q-value < 0.1.

Gene network analysis and hub genes identification

The protein network was constructed based on the information in the STRING v12.0 database (Fig. 3). To identify the most significant modules, the top-rank genes based on MCC values were identified for the three Sahiwal populations. The top genes with the highest rank were SH3BGR, BPIFB1, and RALGAPA2 for NDRI, Hisar, and Pakistan Sahiwal (Fig. 4a, b and c).

Fig. 3

Annotated gene network interactions (a) NDRI Sahiwal (b) Hisar Sahiwal (c) Pakistan Sahiwal.

Fig. 4

Hub genes identification (a) NDRI Sahiwal (b) Hisar Sahiwal (c) Pakistan Sahiwal.

Signatures of selection for milk and related traits

Sahiwal is a well-known milch breed with improved milk production qualities, a lean conformation and a brown coat colour. The DCMS analysis in our study identified genes in NDRI Sahiwal, viz. NEK11, HMGCS1, NIM1K and BTN1A1, located on BTA 1, 20 and 23, that are associated with milk traits (Table 1). HMGCS1 is involved in a lipid metabolism pathway regulated by PPARA (peroxisome proliferator-activated receptor alpha), whereas NIM1K was related to lactation persistency and the synthesis of milk cholesterol and lipid32,80. The BTN1A1 gene was found to be responsible for the secretion of milk fat and to affect fat percentage. Other genes such as FBP2, GRIN2C, DLGAP1, CCNT1 and LPIN2 were further identified as being associated with milk protein percentage.

The genes under selection in Hisar Sahiwal related to milk production traits include KCNH3, ATL1, MAP4K5, DLGAP1 and PICALM, located on BTA 5, 10, 24 and 29 (Table 1). The KCNH3 gene was found to be linked with milk protein content5. In humans, KCNH3 expression is linked to poor overall and disease-free survival in ovarian cancer patients81. The genes MAP4K5 and ATL1 on BTA 10 were also associated with milk production45,46. In Pakistan Sahiwal, the genes BACH2 and SLC24A3 were associated with milk fatty acids and milk production traits. ADAMTS9 and GALNT17 were likewise related to milk protein and fat processing.

Signatures of selection for reproduction traits

The genes associated with reproduction traits include SH3BGR, PSMG1, and BRWD1, all located on BTA 1 and having the highest scores in the hub gene identification for NDRI Sahiwal (Fig. 4a). PSMG1 and BRWD1 were associated with fertility and reproductive functions. Additionally, HMGN1 and B3GALT5 were involved in the bovine maternal-zygotic transition22 and heifer early calving. The USP3 gene on BTA 10 was found to be associated with embryo development in Hisar Sahiwal, whereas in Pakistan Sahiwal, SLC24A3 was identified on BTA 13; this gene has been related to fertility traits in the Rustaqi breed of Iraq72.

Selection signatures for immune response and disease resistance

The study revealed that the MCOLN2 gene on BTA 3 was involved in immune response in NDRI and Hisar Sahiwal. MCOLN2 (Mucolipin-2) contributes to the Arf6-associated recycling pathway and is thought to be mostly found in recycling endosomes. Its expression is minimal in most tissues but high in the thymus and spleen, which suggests a role in immunity25. The BPIFB1 gene on BTA 13 was found to be associated with innate immune response in Hisar Sahiwal and had the highest hub gene score (Fig. 4b). Several genes associated with resistance to bovine tuberculosis, including DEK and RNF144B, were discovered to be under selection in NDRI Sahiwal (Table 1). In Hisar Sahiwal, the DEK gene was also found to be involved in resistance to bovine tuberculosis. The Lpar3 gene, which has been associated with mastitis resistance in cattle26, was likewise identified in NDRI and Hisar Sahiwal. In Pakistan Sahiwal, MGAT5 was also identified as being linked with mastitis resistance.

Signatures of selection for body growth and feed efficiency

Several candidate genes were identified to be associated with growth and feed efficiency traits in Sahiwal cattle. NHL repeat containing E3 ubiquitin protein ligase 1 (NHLRC1) was one of the significant genes found on BTA 23 in NDRI Sahiwal cattle. We also identified PCP4 to be related to feed efficiency in NDRI and Hisar Sahiwal. TAL1 and CYP4A11 genes at BTA 3 were similarly found to be associated with growing skeletal muscle during puberty42 and growth and fat deposition43.

The most significant genes with the highest hub gene score in Pakistani Sahiwal cattle were located at BTA 13 namely RALGAPA2, CFAP61, SLC24A3, and RIN2 (Fig. 4c). These genes were associated with subcutaneous fat thickness, feed intake, conformation, and weight70,71. Several genes like TCEA3, TTC21B, UBXN4, and CNTNAP5 located at BTA 2 were identified for Bovine MDSCs (muscle-derived satellite cells), hip width/rump width, carcass traits and conformation and growth61,63,64,65. For feed intake, PTPRZ1 and CADPS2 genes were identified and AUTS2 at BTA 25 was associated with feed efficiency79. Additionally, it was discovered that at BTA 10, RNF111 was involved with yearling weight70, and CCNB2 (Cyclin B2) had a significant impact on the acceleration of the cell cycle and rumen development.

QTL identification and enrichment analysis

According to the QTL annotation, milk-type QTLs accounted for 52.3% of the QTLs annotated in the significant genomic regions of NDRI Sahiwal cattle, while the other QTL types, namely exterior, health, meat and carcass, production and reproduction, accounted for 8.77, 7.63, 10.11, 12.46 and 8.73%, respectively (Fig. 5a). These QTLs were mapped to BTA 1, 3, 20, 23 and 24. The most significant QTLs were associated with milk protein percentage, milk fat percentage, milk yield, milk lauric acid content, and lactation persistency (Fig. 5b).

Fig. 5

(a) Pie chart demonstrating percentage of six QTL classes that are annotated in the important genomic areas of NDRI Sahiwal cattle (b) QTL enrichment analysis for NDRI Sahiwal cattle.

In Hisar Sahiwal cattle, milk-type QTLs accounted for 55.54% of the QTLs annotated in the significant genomic regions. The other QTL types, namely exterior, health, meat and carcass, production, and reproduction, accounted for 5.56, 7.39, 11.81, 11.11, and 8.58%, respectively (Fig. 6a). These QTLs were mapped to BTA 3, 5, 10, 23, 24 and 25. The most significant QTLs were associated with milk fat yield, milk kappa-casein percentage, and milk protein yield (Fig. 6b).

Fig. 6

(a) Pie chart demonstrating percentage of six QTL classes that are annotated in the important genomic areas of Hisar Sahiwal cattle (b) QTL enrichment analysis for Hisar Sahiwal cattle.

In Pakistan Sahiwal cattle, milk-type QTLs accounted for 40.07% of the QTLs annotated in the significant genomic regions. The other QTL types, namely exterior, health, meat and carcass, production, and reproduction, accounted for 12.13, 8.59, 13.43, 13.91, and 11.87%, respectively (Fig. 7a). These QTLs were mapped to BTA 2, 10, 13, 14, 22 and 25. The most significant QTLs were associated with body weight, fat thickness, structural soundness, retail product yield, and milk alpha-casein percentage (Fig. 7b).

Fig. 7

(a) Pie chart demonstrating percentage of six QTL classes that are annotated in the important genomic areas of Pakistan Sahiwal cattle (b) QTL enrichment analysis for Pakistan Sahiwal cattle.

Discussion

Transboundary cattle are breeds that are distributed across borders, in contrast to regional breeds that are unique to a single nation. They have the potential to expand throughout countries and enhance the world’s supply of animal products82. In India, there are sixty indigenous breeds of cattle, eight regional transboundary breeds, and seven international transboundary breeds83. Sahiwal cattle originated in the dry Punjab region, which lies along the Indian-Pakistani border, and are one of the transboundary breeds present in more than one country and selected for dairy and meat production worldwide. They are considered one of the best Zebu cattle breeds and could potentially play the same role in tropical environments as Holstein2.

In our study, two Sahiwal populations from India (Karnal and Hisar) and one original herd from Pakistan (Punjab) were analyzed simultaneously, which may be the first comparative study to check the status of signatures of selection among these populations, using a robust multi-point statistical tool DCMS. Studies in the past have shown that composite measurements of signals of selection can offer an objective standard to more accurately identify variations under selection12. Thus, candidate genes can be discovered with more power and greater precision for forthcoming studies in medicine, agriculture, and animal breeding to identify signals of selection. According to our results, several genomic regions and genes related to economic traits, including milk production, growth, feed efficiency, and reproduction were discovered.

Significant genes such as HMGCS1 (20:314.31-314.56) and NIM1K (20:314.64-315.42) were associated with lactation persistency and the synthesis of milk cholesterol and lipid31,79 in NDRI Sahiwal. Moreover, the production of milk fat and its impact on fat percentage have also been linked to the BTN1A1 (23:315.85-315.91) gene. This gene exhibited polymorphism in the native cattle populations of Tharparkar, Sahiwal, Jhari, and Belahi as well as in Holstein Friesian and Jersey crossbreeds, whereas it was found to be monomorphic in the water buffalo populations of Murrah, Chilika, Gojri, Chhattisgarhi and Bargur35. Furthermore, the production and composition of fatty acids were found to be influenced by the fatty acid synthesis process carried out by the enzymes ACSF3 (18:142.03-142.08), CREB1 (2:958.46-958.97), and FADS6 (19:565.20-565.36), which all had high fatty acid levels. A role in milk protein was observed for LPIN2 (24:371.40-372.14) in both the NDRI and Hisar Sahiwal populations. Lipins (LPIN2) operate as transcriptional co-regulators of gene expression in addition to functioning as phosphatidate phosphatases, giving them dual roles in lipid metabolism84. According to research on mice, the LPIN2 gene is functional during the normal development of adipose tissue and might be involved in the metabolism of triglycerides in humans. This gene is a candidate gene for human lipodystrophy, a condition marked by insulin resistance, fatty liver, loss of body fat, and hypertriglyceridemia85. In Valle del Belice sheep, it was found to be connected to lipid and milk protein metabolism86.

In Pakistan Sahiwal, ADAMTS9, located on BTA 22 (367.60-369.22), was involved in the processing of milk proteins34 and has further been found to be associated with lipid metabolism in Sanjiang cattle87. This gene can also control the number of mitochondrial complexes in skeletal muscles and insulin sensitivity88, and it has been proposed as a helpful molecular marker to improve goat growth characteristics89,90. GALNT17 (25:291.33-295.63) was also shown to be connected to milk protein and fat. A GWAS analysis in Danish Jersey and Holsteins also found that GALNT17 was associated with milk fat and protein traits78. According to the QTL annotation analysis, milk-type QTLs accounted for 52.3% and 55.54% of the annotated QTLs in NDRI and Hisar Sahiwal, respectively, compared with 40.07% in Pakistan Sahiwal (Figs. 5a, 6a and 7a). These results show that the share of milk-type QTLs was higher in Indian Sahiwal (NDRI and Hisar) than in Pakistan Sahiwal. A genome-wide assessment of signatures of selection in Pakistan Sahiwal cattle10 likewise reported a lower share of milk-type QTLs (25.08%). Hence, we can deduce that Indian Sahiwal cattle have been selected mainly for milk production and related traits, whereas Sahiwal cattle from Pakistan have not undergone extensive selection for traits related to milk production.

For reproduction traits, SH3BGR (1:139.52–139.60), which is highly significant in NDRI Sahiwal, was associated with utero-embryonic development18. This gene has a connection to thioredoxin, which has the ability to stimulate growth hormone production in tissue-culture cells91. Reproduction traits contributed 8.73% of the annotated QTLs in NDRI Sahiwal according to the QTL annotation analysis (Fig. 5a). Reproductive functioning and fertility have also been linked to PSMG1 and BRWD1 (1:139.22–139.36). In Holstein cattle, these genes were previously shown to be under selection50. In mice, BRWD1 epigenetically regulates meiotic chromosomal stability, an essential process for female fertility92. HMGN1 (1:139.41-139.41) was found to be involved in the maternal-zygotic transition in cattle22. HMGNs influence neuronal, ocular, reproductive, and pancreatic cell development in addition to their role in embryogenesis93. The USP3 gene, a member of the ubiquitin-specific proteases (USPs) family on BTA 10 (464.13-465.13), was found in Hisar Sahiwal; it has been associated with embryo development in cattle49 and with protein degradation in river buffalo94. The significance of reproduction traits in Indian Sahiwal may be due to the fact that priority was given not only to milk production traits but also to reproduction traits, which are linked to the selection of superior germplasm and the creation of appropriate breeding programs for long-term genetic improvement.

Sahiwal is renowned for its strong resistance to internal and external parasites, as well as its ability to withstand heat and ticks. These distinctive traits are manifestations of the robust genetic makeup that drives innate immunity and its relationship to acquired/adaptive immunity95. The innate immune response was linked to the BPIFB1 (13:627.62-627.85) gene in Hisar Sahiwal. This gene shares structural similarities with LPS-binding protein and BPI protein, two innate immune molecules known for their functions in detecting and reacting to Gram-negative bacteria96. Additionally, several bovine tuberculosis resistance genes, including DEK and RNF144B (23:391.78–393.87), were discovered to be under selection in Hisar and NDRI Sahiwal. These protein-coding genes regulate NF-κB in human macrophages. The contribution of ring finger protein 144B (RNF144B) to bovine tuberculosis has also been documented in Holstein-Friesian cattle97. The ABT1 gene (23:315.20-315.23) was identified as being associated with resistance to the bovine leukaemia virus in NDRI Sahiwal. In Argentinean dairy cattle, the expression of the ABT1 transcription factor gene was higher in cows with a low proviral load than in cows with a high proviral load at a 95% significance level33. These results suggest that selection under different tropical environments has resulted in the presence of immunity-related genes under selection in the NDRI and Hisar Sahiwal cattle populations.

Body conformation and feed efficiency are complex characteristics that are controlled by multiple biological mechanisms. The PCP4 gene (1:139.93-104.005) was associated with feed efficiency in NDRI and Hisar Sahiwal. The TAL1 and CYP4A11 genes, located on BTA 3 (990.05-992.22), were found in Hisar Sahiwal and linked to growing skeletal muscle during puberty38 and to growth and fat deposition43. CYP4A11 is a significant omega hydroxylase of lauric acid (a medium-chain fatty acid) which plays a role in blood pressure regulation, fatty acid metabolism and the conversion of arachidonic acid to 20-hydroxyeicosatetraenoic acid (20-HETE)98. Several genes such as CREBBP (25:305.43-317.33), ADCY9 (25:324.69-334.91), MGRN1 (25:375.51-379.18) and TFAP4 (25:344.38-345.66) were further identified in Hisar Sahiwal as being associated with feed conversion54; lipid and meat characteristics and ear size55; bull fertility, through direct involvement in spermatogenesis, and meat tenderness57; and bovine satellite cells56. In pigs, the gene CREBBP was found to be crucial for growth and feed conversion54. Furthermore, in Pakistan Sahiwal, RALGAPA2 (13:399.01-401.88), which encodes the catalytic alpha subunit 2 (α2) of Ral GTPase-activating protein (RalGAP), was linked to the thickness of subcutaneous fat. Proteins utilised in membrane trafficking or cellular vehicles are encoded by the genes RALGAPA2 and Ras and Rab interactor 2 (RIN2)99. In Pakistan Sahiwal, these genes were found to be highly significant, with the highest contributions to the QTL enrichment analysis coming from body weight, fat thickness, and structural soundness (Fig. 7b). The RIN2 (13:393.64-395.73) gene was likewise implicated in muscle growth and production in small-tailed Han and Dorper × small-tailed Han crossbred sheep73. The hatching weight and fat features of chickens were found to be substantially correlated with this gene100. ASAP1 (14:102.95-106.04) and EYA1 (14:347.86-351.46) were similarly found to be related to scrotal circumference, beef quality, and production traits74,75. An investigation of gene expression revealed that EYA1 might play significant roles in both differentiated and undifferentiated bovine muscle101. Additionally, the QTL annotation for Pakistan Sahiwal showed that production traits contributed 13.91% (Fig. 7a), which was higher than in both NDRI and Hisar Sahiwal. Based on these findings, we can conclude that the Pakistani Sahiwal cattle were under intense selection for production traits, with important genes associated with these traits.

Conclusion

To our knowledge, this is the first attempt to study the patterns of comparative selection signatures in the genomes of transboundary Sahiwal cattle using the multi-point composite statistic DCMS. The study enriched our understanding of how selection for different utilities has left differential genomic footprints for various economic traits in these cattle across geographical boundaries. This study also demonstrated the power and reliability of the DCMS technique in the detection of selection signatures compared with univariate statistical techniques. The results revealed a number of major genes primarily related to milk production (NEK11, HMGCS1, BTN1A1, KCNH3) and reproductive traits (SH3BGR, PSMG1, BRWD1, B3GALT5) in the Indian Sahiwal, while the Pakistan Sahiwal population was selected mostly for growth and meat traits, with a different set of candidate genes (RALGAPA2, RIN2, CFAP61). Despite selection for different utilities, the Sahiwal retained their fundamental genomic signature patterns associated with milk, growth and reproduction. Our findings add further insight into the genomic footprint of the Sahiwal, one of the most significant international transboundary cattle breeds, which will benefit the ongoing genetic improvement programmes in these countries.

Materials and methods

Animal resources, SNP genotyping and quality control

This study considered two Sahiwal cattle populations from India, National Dairy Research Institute (NDRI) and Hisar, as well as one Sahiwal cattle population from Pakistan. A total of 240 genotyped samples, comprising NDRI Sahiwal (n = 193), Hisar Sahiwal (n = 30), and Pakistan Sahiwal (n = 17), were obtained for analysis (Table 2). The 50 K SNP data for NDRI Sahiwal and Hisar Sahiwal were obtained from NDDB, India, and were generated with a chip customized from the commercially available Illumina Bovine SNP chip (BovineSNP50K v3 Bead Chip). The Pakistan Sahiwal 50 K SNP data were sourced from a public data repository13.

Table 2 Descriptive statistics of the studied Sahiwal populations.

The quality control (QC) of the genotype data was implemented in the PLINK1.9 program105. Only SNPs located on autosomes were taken into consideration for analysis; unmapped SNPs and SNPs on the X and Y chromosomes were eliminated. SNPs with a Hardy–Weinberg equilibrium test p-value below 0.001, a minor allele frequency of less than 0.05, or a genotype call rate of less than 0.95 were excluded. Quality control of genotypes was performed again when phasing haplotypes with the SHAPEIT v2.r904 program106 to obtain high-quality SNPs. The versions and URLs of the software and packages used in this study are provided in Supplementary Table S1.
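The three SNP-level thresholds above can be illustrated with a minimal Python sketch. It is not the PLINK implementation: the function names are hypothetical, genotype counts are assumed to be available per SNP, and the Hardy–Weinberg p-value is approximated with a 1-df chi-square test rather than the exact test that PLINK uses.

```python
import math


def hwe_chi2_pvalue(n_aa, n_ab, n_bb):
    """Approximate HWE p-value from a 1-df chi-square goodness-of-fit test.

    Assumption: this stands in for the exact test used by PLINK.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))


def snp_passes_qc(n_aa, n_ab, n_bb, n_missing,
                  maf_min=0.05, call_rate_min=0.95, hwe_p_min=0.001):
    """Return True if a SNP survives the call-rate, MAF and HWE thresholds."""
    n_called = n_aa + n_ab + n_bb
    if n_called == 0:
        return False
    call_rate = n_called / (n_called + n_missing)
    freq_a = (2 * n_aa + n_ab) / (2 * n_called)
    maf = min(freq_a, 1.0 - freq_a)
    if call_rate < call_rate_min or maf < maf_min:
        return False
    return hwe_chi2_pvalue(n_aa, n_ab, n_bb) >= hwe_p_min
```

For example, `snp_passes_qc(50, 100, 50, 0)` keeps a SNP whose genotype counts match Hardy–Weinberg proportions, whereas a SNP with no heterozygotes among 200 animals is rejected by the HWE filter.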

De-correlated composite of multiple selection signals (DCMS)

In this study, the De-Correlated Composite of Multiple Selection Signals (DCMS) was used to integrate several statistics for selection signature detection while taking into consideration the correlation between the statistics. It comprises both intra-population and inter-population statistics to identify selection signatures, entailing five methods as described previously107:

  1. FST (Fixation index).

  2. Haplotype Homozygosity Statistics (H1).

  3. Modified Haplotype Homozygosity (H12).

  4. Tajima’s D index.

  5. Nucleotide Diversity (π).

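The DCMS statistic at position l was calculated as

$$\mathrm{DCMS}_l = \sum_{t=1}^{n} \frac{\log\left[\frac{1 - p_{lt}}{p_{lt}}\right]}{\sum_{i=1}^{n} \left| r_{it} \right|}$$

where p_lt is the p-value at position l for statistic t; r_it is the correlation between statistics i and t, whose summed absolute values give the weighting factor; and n is the total number of test statistics combined in the DCMS12,108,109,110,111. To produce the DCMS, all the statistics were transformed into p-values using one-tailed or two-tailed fractional ranks, which fall between 1/(n + 1) and n/(n + 1), where n here denotes the number of ranked values.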
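The composite described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the MINOTAUR code: the function names are hypothetical, ties in the ranking are not averaged, the natural logarithm is assumed, and a plain correlation matrix is passed in as a list of lists.

```python
import math


def fractional_rank_pvalues(values, tail="right"):
    """Map statistics to p-values in (0, 1) via fractional ranks rank/(n + 1).

    tail="right": large values are extreme; tail="left": small values are extreme.
    Assumption: ties are broken by input order instead of being averaged.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    pvals = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        frac = rank / (n + 1)
        pvals[idx] = 1.0 - frac if tail == "right" else frac
    return pvals


def dcms(pvalue_rows, corr):
    """Combine per-locus p-values (one row per locus, one column per statistic).

    Each statistic t is down-weighted by the sum of absolute correlations
    with all statistics, so strongly correlated statistics count less.
    """
    n_stats = len(corr)
    weights = [sum(abs(corr[i][t]) for i in range(n_stats))
               for t in range(n_stats)]
    return [sum(math.log((1.0 - p) / p) / w for p, w in zip(row, weights))
            for row in pvalue_rows]
```

With two uncorrelated statistics (identity correlation matrix), a locus with p-values of 0.1 for both receives a score of 2·log(9); if the two statistics were perfectly correlated, the same locus would receive only log(9), which is the de-correlation effect.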

The significance threshold for the DCMS values was set to q < 0.1, because only a few significant markers were found when q < 0.05 or q < 0.01 was used as the threshold.

Effective population size (Ne)

Effective population size is the size of an idealised population experiencing the same rate of genetic drift as the population under study112. A number of genetic analyses depend on haplotype phasing, which determines which genetic variants are physically situated on the same chromosome, and phasing requires the parameter --effective-size. Ne was therefore first estimated using GONE software14 for each of the three Sahiwal populations under study and was subsequently supplied as the phasing parameter to the SHAPEIT v2.r904 program106.

Fixation index (FST)

The fixation index, a population differentiation measurement, was computed using PLINK1.9 --fst and --within functions for every SNP and breed. The R program’s runmed function was used to smooth the FST values of each SNP after FST values less than 0 were transformed to zeros.
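The two post-processing steps for FST can be sketched as follows. This is a hedged Python stand-in for the R workflow, not a reimplementation of runmed: the function names are hypothetical, the window width k is left as a parameter, and the end rule used here (holding the first and last full-window medians) is only an approximation of R's endrule options.

```python
from statistics import median


def clamp_negative_to_zero(values):
    """Replace negative FST estimates with zero."""
    return [max(0.0, v) for v in values]


def running_median(values, k):
    """Running median with an odd window k; ends hold the nearest full-window median.

    Assumption: this end rule approximates, but is not guaranteed to match,
    the behaviour of R's runmed.
    """
    if k < 1 or k % 2 == 0:
        raise ValueError("k must be a positive odd integer")
    n = len(values)
    half = k // 2
    if n < k:
        return list(values)  # too short to smooth; returned unchanged
    smoothed = list(values)
    for i in range(half, n - half):
        smoothed[i] = median(values[i - half:i + half + 1])
    for i in range(half):
        smoothed[i] = smoothed[half]
        smoothed[n - 1 - i] = smoothed[n - 1 - half]
    return smoothed
```

A typical call would be `running_median(clamp_negative_to_zero(fst_values), k)` on the per-SNP values of one chromosome, so that windows never span chromosome boundaries.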

The FST (Fixation Index) is a fundamental concept in population genetics that measures genetic divergence between populations. It measures the proportion of genetic diversity caused by differences between populations rather than within them. As a result, FST is a pairwise comparison statistic and must be compared between populations. In our present study, the three Sahiwal populations i.e., NDRI, Hisar and Pakistan were merged into three groups pair-wise, viz. NDRI and Hisar were merged in Group1, NDRI and Pakistan were merged in Group2, and Pakistan and Hisar were merged in Group3. Subsequently, PLINK1.9 was used to estimate pair-wise FST values.

Haplotype homozygosity statistics (H1 and H12)

The SHAPEIT v2.r904 programme106 was used to phase each chromosome individually. The Haplotype Homozygosity Statistics (H1 and H12) were then obtained from the phased files using haplotype frequency spectrum statistics, including the LASSI composite likelihood ratio statistic113, H12 and H2/H1111, which were calculated using LASSIP v1.1.1 software.
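For readers unfamiliar with these statistics, the sketch below computes H1, H12 and H2/H1 for a single window from a list of phased haplotype strings, using their standard definitions (H1 is the sum of squared haplotype frequencies; H12 pools the two most frequent haplotypes; H2 is H1 without the most frequent haplotype). It is a minimal Python illustration with a hypothetical function name, not the LASSIP implementation, and window selection is assumed to have been done beforehand.

```python
from collections import Counter


def haplotype_homozygosity(haplotypes):
    """Return H1, H12 and H2/H1 for the haplotypes observed in one window."""
    counts = Counter(haplotypes)
    total = sum(counts.values())
    # Haplotype frequencies sorted from most to least common.
    freqs = sorted((c / total for c in counts.values()), reverse=True)
    h1 = sum(f * f for f in freqs)
    if len(freqs) > 1:
        # Pool the two most frequent haplotypes before squaring.
        h12 = (freqs[0] + freqs[1]) ** 2 + sum(f * f for f in freqs[2:])
    else:
        h12 = h1
    h2 = h1 - freqs[0] ** 2
    return {"H1": h1, "H12": h12, "H2/H1": h2 / h1}
```

With three haplotypes at frequencies 0.4, 0.4 and 0.2, H1 is 0.36 while H12 rises to 0.68, which illustrates why H12 is more sensitive than H1 to sweeps in which two haplotypes are common.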

Tajima’s D and nucleotide diversity (π)

The VCFTOOLS v0.1.16 program was used to estimate Tajima’s D and pi statistics114. Using the --TajimaD function, Tajima’s D statistics were determined for each breed and chromosome individually in non-overlapping windows of 300 kb (--TajimaD 300000). SNPs within a 300-kb bin were assigned the D value estimated for that bin, and missing values were replaced with zeros. For Nucleotide Diversity, the --site-pi option was used to calculate the pi statistics for each breed and chromosome independently. The outputs were then smoothed for each chromosome using R’s runmed function with a window size of 31 SNPs (k = 31, endrule = “constant”) in order to remove noise107.
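As a reminder of what the window-level statistic measures, the following Python sketch computes Tajima's D for one window from phased 0/1 haplotype strings using the standard formula, which contrasts the mean pairwise difference (π) with the number of segregating sites. It is illustrative only: the function name is hypothetical, and VCFTOOLS works on VCF genotypes rather than on haplotype strings.

```python
import math


def tajimas_d(haplotypes):
    """Tajima's D for one window; haplotypes are equal-length strings of '0'/'1'."""
    n = len(haplotypes)
    n_sites = len(haplotypes[0])
    seg_sites = 0
    pi = 0.0
    for s in range(n_sites):
        k = sum(1 for h in haplotypes if h[s] == "1")
        if 0 < k < n:
            seg_sites += 1
            # Expected pairwise difference contributed by this site.
            pi += 2.0 * k * (n - k) / (n * (n - 1))
    if seg_sites == 0:
        return float("nan")  # D is undefined without segregating sites
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / (i * i) for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / (a1 * a1)
    e1 = c1 / a1
    e2 = c2 / (a1 * a1 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - seg_sites / a1) / math.sqrt(variance)
```

A window dominated by singletons gives a negative D (an excess of rare alleles, as expected after a sweep), whereas a window of intermediate-frequency variants gives a positive D; this is why the left tail of D is the informative one for selection scans.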

Calculation of DCMS statistic

To create a composite signal known as DCMS, all five statistics for each SNP (H1, H12, FST, Tajima’s D index and Nucleotide Diversity (π)) were combined. Using the stat_to_pvalue function of the MINOTAUR package in the R environment, the statistics were converted to p-values based on fractional ranks: the left-tailed test was applied to Tajima’s D and π, while the right-tailed test was applied to H1, H12 and FST16. Subsequently, a correlation matrix of order n × n was calculated using the CovNAMcd function (alpha = 0.75, nsamp = 50,000) from the rrcovNA v.0.5-2 R package115. This matrix was imported into the MINOTAUR R package16 to compute the genome-wide DCMS values. Then, the MASS v.7.3–61 R package116 was used to fit the DCMS values to a normal distribution by applying a robust linear model (rlm)107. The fitted model’s outputs, i.e. Mu [mean] and SD [standard deviation], were input into the pnorm R function to determine the p-values of the DCMS statistics: dcms_pvalues = pnorm(q = dcms, mean = mu, sd = SD, lower.tail = FALSE). Finally, the p-values obtained were transformed into the corresponding q-values using the q-value R function following the Benjamini and Hochberg method117.
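The last two steps, upper-tail normal p-values followed by a Benjamini and Hochberg adjustment, can be sketched in Python as follows. This is a hedged illustration with hypothetical function names: the median and scaled MAD are used here as a simple robust stand-in for the rlm fit, which is an assumption and not the procedure of the MASS package.

```python
from statistics import NormalDist, median


def robust_location_scale(values):
    """Median and scaled MAD as robust estimates of mean and SD.

    Assumption: a stand-in for the robust linear model fit described in the text.
    """
    mu = median(values)
    mad = median([abs(v - mu) for v in values])
    return mu, 1.4826 * mad


def upper_tail_pvalues(dcms_values, mu, sd):
    """Equivalent of pnorm(q, mean, sd, lower.tail = FALSE); sd must be positive."""
    dist = NormalDist(mu, sd)
    return [1.0 - dist.cdf(v) for v in dcms_values]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

SNPs whose adjusted value falls below the chosen threshold (0.1 in this study) would then be retained as significant.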

Gene annotation and functional annotation

Genes located in genomic regions with a q-value lower than 0.1 were considered significant. The R package GALLO v1.4 (Genomic Annotation in Livestock for positional candidate Loci) was used to annotate the genes and QTLs118. To locate the genes and QTLs, the gene and QTL annotation files (.gtf and .gff files, respectively) produced from the ARS-UCD1.2 assembly119 and the Animal QTL Database120 were utilised. The gene and QTL enrichment analysis was conducted using the same GALLO v1.4 programme for all QTLs identified using the chromosome-based technique, and functional annotation was done through PANTHER v18.0121.

Network formation and hub genes identification

Using STRING v12.0122, we integrated the protein-coding genes to derive the biologically relevant interactions permitting the flow of information. The results were then visualized using CYTOSCAPE v3.10.1123 software, and the hub genes were identified based on their associations with other genes in the protein network using the widely used Maximal Clique Centrality (MCC) algorithm.
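To make the ranking criterion concrete, the sketch below scores nodes by Maximal Clique Centrality under its usual definition: the MCC of a node is the sum of (|C| − 1)! over all maximal cliques C that contain it, which reduces to the node degree when none of its neighbours are connected to each other. This is a minimal Python illustration on a toy edge list with hypothetical function names and node labels; it is not the Cytoscape plug-in code and uses a basic Bron–Kerbosch enumeration without pivoting.

```python
from math import factorial


def maximal_cliques(adj):
    """Enumerate maximal cliques of an undirected graph (basic Bron-Kerbosch)."""
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(current)
            return
        for v in list(candidates):
            expand(current | {v}, candidates & adj[v], excluded & adj[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(frozenset(), set(adj), set())
    return cliques


def mcc_scores(edges):
    """MCC score per node: sum of (|C| - 1)! over maximal cliques C containing it."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    scores = {node: 0 for node in adj}
    for clique in maximal_cliques(adj):
        for node in clique:
            scores[node] += factorial(len(clique) - 1)
    return scores
```

On a toy network made of a triangle A-B-C plus an extra edge A-D, node A scores 2! + 1! = 3, nodes B and C score 2, and D scores 1, so A would be ranked as the hub.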