Introduction

The increasing global population and the essential role of milk in meeting nutritional needs have made enhancing the performance of domestic animals, particularly dairy cows, a top priority in breeding goals and programs worldwide1. This focus centers on improving key economic traits like milk production. Milk production and udder health are crucial economic factors that significantly impact the profitability of dairy operations2. Improvements in milk production traits directly benefit these operations, while enhancing resistance to mastitis can reduce the financial burden associated with treatment1,3.

Over time, substantial progress has been made in enhancing the production performance of dairy cows. However, mastitis remains a significant challenge. This infectious disease, caused by environmental and management factors combined with the animal's often weakened resistance and immunity (primarily acquired) to pathogens, leads to substantial economic losses in the dairy industry and raises concerns about the quality of dairy products globally1,2. The high costs associated with mastitis have led to increased attention to mastitis resistance as a vital breeding goal, from both economic and animal welfare perspectives4. Direct recording of mastitis occurrences is not routine in most countries, and direct selection for mastitis resistance is uncommon1,5,6. Consequently, the somatic cell count (SCC) in milk, or its logarithmic transformation, the somatic cell score (SCS), is used as an indicator of mastitis because of its higher genetic variance, ease of recording, and strong positive correlation with mastitis incidence5. Milk production and mastitis resistance are complex traits that are affected by various factors, including management practices, environmental conditions, and the animal's physiological state. Control of these traits involves numerous genes and variants, each with minor effects on the observed phenotype1,6. Strong genetic selection and improved management and nutrition can lead to increased milk production and decreased mastitis prevalence. Research highlights a positive but unfavorable relationship (antagonism) between somatic cell count and production traits, notably milk yield1,2,7. The somatic cell count is a crucial indicator for assessing the quality and health of raw milk and is a factor in pricing. An elevated SCC negatively impacts the processing quality and overall quality of raw milk through changes in its composition, including fat, protein, lactose, and acidity levels1,2.

The advent of genome-wide panels of single nucleotide polymorphisms (SNPs) has revolutionized the field. SNPs are widely used for detecting and localizing quantitative trait loci (QTLs) for complex traits across diverse species1,2,8. They have proven robust and practical tools for identifying causal mutations linked to economically significant traits in livestock3,9,10 and human diseases11,12. Numerous studies over the years have focused on identifying QTLs for various traits in dairy cattle, leading to the discovery of many QTLs across different chromosomes9,13. New sequencing technologies have opened new opportunities to identify markers associated with economically important genes and milk production traits. Genome-wide association studies (GWAS) have emerged as a highly efficient strategy for uncovering candidate genes and markers associated with quantitative traits3. The primary aim of GWAS is to pinpoint the most likely genomic locations that control these traits14. Moreover, genome-wide scanning studies contribute to a deeper understanding of the genes and polymorphisms linked to economic traits, ultimately shedding light on the underlying mechanisms of the traits under investigation1,15. In dairy cattle, GWAS has been instrumental in identifying SNPs that influence production traits such as milk yield, fat yield, and protein percentage1,3,4,16,17, health traits such as mastitis and uterine health1,18, longevity within the herd19,20,21, and reproductive traits16,22,23,24,25.

While different studies have reported SNPs and genes affecting somatic cells and the occurrence of mastitis, these findings have often varied, with limited overlap in identified SNPs between studies. Several factors contribute to these discrepancies, including environmental conditions, the specific type of dairy management (industrial or semi-industrial), variations in native pathogens and the host's response, and the genetic background of the studied population. These factors significantly impact the relationship between genetic variants and genes across the genome and the resulting phenotype26,27. Notably, this is the first study conducted in Iranian Holstein cows using a GWAS approach to investigate milk production and mastitis traits. In this study, 150 animals from Herd 1 were genotyped using a 30K SNP array, and 60 animals from Herd 2 were genotyped using a 50K SNP array, totaling 210 animals. Genotype data were subsequently imputed to whole-genome sequence level using the 1000 Bull Genomes Project reference panel, resulting in 6,583,595 high-confidence imputed SNPs used for GWAS. Consequently, the primary objective of this study is to examine the association of genome-wide SNPs with somatic cell count, milk yield, milk fat (%), and milk protein (%) traits. This comprehensive analysis seeks to identify known and novel genes or genomic and chromosomal regions linked to the inheritance of these traits, individually or in combination, within the Iranian Holstein cattle population.

Results

Descriptive statistics and heritability estimates

Descriptive statistics for milk production traits and somatic cell count in the Iranian Holstein population are presented in Table 1, and the distribution of each milk production trait and somatic cell count is shown in Fig. 1. On average, Iranian Holstein cows had a test-day milk yield of 38.32 kg/d. The mean milk fat percentage, milk protein percentage, and somatic cell count were 3.304, 2.899, and 64.41, respectively. The coefficients of variation indicated adequate variability for these traits in Iranian Holstein cows, with values of 19.87, 7.24, 15.33, and 133.69 for milk yield, protein percentage, fat percentage, and somatic cell count, respectively. The estimates of variance components and heritability for the four traits (milk yield, milk protein, milk fat, and somatic cell score) from single-trait animal models are shown in Table 2. The heritability estimates for milk yield, milk protein, milk fat, and somatic cell score were 53%, 52%, 43%, and 39%, respectively.

Table 1 Summary of the data set used in this study.
Fig. 1 The distribution of milk yield (A), milk protein (B), milk fat (C), and somatic cell count (D) traits.

Table 2 Estimates of additive genetic variance (\({\widehat{\upsigma }}_{a}^{2}\)), variance of the random herd-year-month of testing effects (\({\widehat{\upsigma }}_{htm}^{2}\)), residual variance (\({\widehat{\upsigma }}_{e}^{2}\)), and heritability (\({\widehat{h}}^{2}\)) for milk yield, milk protein, milk fat, and somatic cell count traits in the Iranian Holstein cattle population.
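The two summary quantities used in this section can be sketched in a few lines. The example below is illustrative only: the input values are hypothetical placeholders rather than the estimates in Tables 1 and 2, and it assumes that the phenotypic variance is the sum of the three components listed in Table 2, which the text does not state explicitly.

```python
def coefficient_of_variation(mean: float, sd: float) -> float:
    """Coefficient of variation in percent: 100 * SD / mean."""
    return 100.0 * sd / mean


def heritability(var_a: float, var_htm: float, var_e: float) -> float:
    """Heritability as additive variance over total variance.

    Assumption: phenotypic variance = var_a + var_htm + var_e.
    """
    return var_a / (var_a + var_htm + var_e)


# Hypothetical inputs, chosen only so the arithmetic is easy to follow.
print(coefficient_of_variation(mean=40.0, sd=8.0))        # 20.0
print(heritability(var_a=4.0, var_htm=1.0, var_e=3.0))    # 0.5
```

If the herd-year-month variance were excluded from the denominator, the same additive variance would give a larger heritability, so the definition matters when comparing estimates across studies.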

GWAS for somatic cell count and milk production

The GWAS results for all studied traits (milk yield, fat percentage, protein percentage, and somatic cell count) are reported at a significance threshold of P < 1 × 10⁻⁸ (Supplementary 1). For milk yield (MY), 86 SNPs were identified on BTA2 (19), BTA5 (30), BTA17 (1), BTA21 (33), BTA24 (2), and BTA28 (1). For milk fat (MF), 18 SNPs were detected on BTA7 (11), BTA21 (6), and BTA26 (1). For milk protein (MP), 22 SNPs passed the significance threshold (P < 1 × 10⁻⁸) and were located on BTA7 (11), BTA21 (9), BTA22 (1), and BTA26 (1). For somatic cell count (SCC), the GWAS showed 58 marker-trait associations (P < 1 × 10⁻⁸) located on BTA2 (1), BTA17 (42), BTA19 (12), and BTA21 (1). The Manhattan and Q-Q plots for the studied traits are shown in Fig. 2. The Manhattan plots show distinct genomic regions of association for each trait, with particularly strong signals on BTA5 and BTA21 for milk traits and on BTA17 for SCC, suggesting potential QTL hotspots. The Q-Q plots show a strong deviation from the distribution expected under the null hypothesis, which is consistent with the presence of true genetic associations. Notably, several novel genes such as ATE1, FGFR2, LYZ, and MAML3 were identified near the top-associated SNPs, highlighting their potential roles in milk composition, udder health, and host defense mechanisms. These findings provide new insights into the genetic basis of production and health traits and offer promising targets for genomic selection and functional validation in dairy cattle.

Fig. 2 Manhattan plots of the genome-wide P values of association for milk yield (A), milk protein (B), milk fat (C), and somatic cell count (D) traits in Holstein cows. The solid line represents the P < 10⁻⁸ significance threshold.

QTL regions for somatic cell count and milk production

Table 3 summarizes the significant SNPs (P < 1 × 10⁻⁸) associated with milk production traits in Iranian Holstein cows that lie close to previously reported QTLs. For the milk fat percentage trait, QTLs associated with milk decenoic acid content (MFA-C10:1), milk capric acid content (MFA-C10:0), milk myristoleic acid content (MFA-C14:1), milk palmitoleic acid content (MFA-C16:1), milk lauroleic acid content (MFA-C12:1), milk myristic acid content (MFA-C14:0), milk palmitic acid content (MFA-C16:0), milk protein yield (PY), milk yield (MY), and yield grade (YGRADE) were identified near the significant SNPs on chromosomes BTA7, BTA21, and BTA26. Near the significant SNPs associated with the milk yield trait on chromosomes BTA2, BTA5, BTA17, BTA21, BTA24, and BTA28, QTLs related to milk fat yield (FY), somatic cell count (SCC), bovine respiratory disease susceptibility (BRDS), milk yield (MY), milk protein yield (PY), body weight (BW), fat percentage (FATP), bovine tuberculosis susceptibility (BTBS), clinical mastitis (CM), and age at first calving (AGEFC) were observed (Table 3). Additionally, QTLs associated with somatic cell count (SCC), body weight (BW), milk protein percentage (PP), milk yield (MY), muscularity (MUSC), and average daily gain (ADG) were identified for the SCS trait.
Regarding the milk protein percentage trait, several important QTLs, including milk protein yield (PY), milk decenoic acid content (MFA-C10:1), milk capric acid content (MFA-C10:0), milk myristoleic acid content (MFA-C14:1), milk palmitoleic acid content (MFA-C16:1), milk lauroleic acid content (MFA-C12:1), milk myristic acid content (MFA-C14:0), milk palmitic acid content (MFA-C16:0), calf size (CALFSZ), milk yield (MY), carcass weight (CWT), muscularity (MUSC), feed conversion ratio (FCR), average daily gain (ADG), and body weight (BW), were found close to the significant SNPs on chromosomes BTA7, BTA21, BTA22, and BTA26.

Table 3 QTLs located close to the most significant single nucleotide polymorphisms (SNPs) associated with milk yield, milk protein, milk fat, and somatic cell count traits in Holstein cows.

Gene ontology for somatic cell count and milk production

Over 137 genes associated with milk production and somatic cell count traits were identified using gene ontology analysis (Supplementary 2), 33 of which were considered key candidate genes (Table 4). For the milk fat percentage trait, five candidate genes were identified around SNPs 26:41,368,775 (2), 21:5,475,347, and 21:5,525,195 (2), which may influence the activity of the ATE1, FGFR2, ALDH1A3, CHSY1, and GABRG3 genes (Table 4). For the milk protein percentage trait, six candidate genes were identified: ATE1, FGFR2, ZNF346, FGFR4, TMEM40, and NTRK3 (Table 4). Furthermore, nine candidate genes were identified around SNPs 2:117,632,966 (2), 2:117,637,569, 24:33,558,520 (3), 24:42,643,763, and 5:19,359,629 (2) for the milk yield trait: FBXO36, PID1, TRIP12, CD52, WDTC1, MATN1, CIDEA, LYZ, and CPM (Table 4). Finally, 13 candidate genes were identified around the SNPs associated with the somatic cell count trait: FBXO42, MAML3, SGMS2, SCLT1, HADH, CYP2U1, DLK1, THRSP, ANKRD26, TMEM26, VEGFA, MED4, and VAV1 (Table 4).

Table 4 The candidate or nearest genes to the most significant single nucleotide polymorphisms (SNPs) in significant regions based on 5 × 10⁻⁸ for milk yield, milk protein, milk fat, and somatic cell count traits in Holstein cows.

Gene networks

The results of the gene network analysis are shown for milk yield (Fig. 3), milk protein percentage (Fig. 4), milk fat percentage (Fig. 5), somatic cell count (Fig. 6), and all traits combined (Fig. 7). A densely co-expressed network was drawn using GeneMANIA (Fig. 7). This network consisted of 137 genes with 1764 interactions. Among these genes, CAND1, VEGFA, AFGLS2, FGFR2, NUP107, and MPPE1 play roles in several intracellular transport processes. The candidate genes identified in our study therefore exhibited significant protein–protein interactions with each other or with related genes.

Fig. 3 Gene network analysis for the milk yield trait in Holstein cows. Dark circles with and without a slash represent candidate genes and associated genes, respectively. Arrows in pink, blue, red, and bone color represent co-expression, pathway, physical interactions, and shared protein domains, respectively.

Fig. 4 Gene network analysis for the milk protein percentage trait in Holstein cows. Dark circles with and without a slash represent candidate genes and associated genes, respectively. Arrows in pink, blue, red, and bone color represent co-expression, pathway, physical interactions, and shared protein domains, respectively.

Fig. 5 Gene network analysis for the milk fat percentage trait in Holstein cows. Dark circles with and without a slash represent candidate genes and associated genes, respectively. Arrows in pink, blue, red, and bone color represent co-expression, pathway, physical interactions, and shared protein domains, respectively.

Fig. 6 Gene network analysis for the somatic cell count trait in Holstein cows. Dark circles with and without a slash represent candidate genes and associated genes, respectively. Arrows in pink, blue, red, and bone color represent co-expression, pathway, physical interactions, and shared protein domains, respectively.

Fig. 7 Gene network analysis for all traits in Holstein cows. Dark circles with and without a slash represent candidate genes and associated genes, respectively. Arrows in pink, blue, red, and bone color represent co-expression, pathway, physical interactions, and shared protein domains, respectively.

Discussion

Phenotypes of milk production traits are primarily quantitative and governed by polygenic mechanisms. Extensive research has been conducted on milk traits over the years. For instance, in 1994, a study confirmed significant QTLs associated with protein yield and fat yield traits, linked to beta-lactoglobulin and kappa-casein, respectively28. Subsequent studies identified numerous QTLs associated with milk traits across 30 different bovine chromosomes1,3,4,29,30,31,32.

Despite numerous studies, the genetic mechanisms controlling these traits remain largely unclear, and further research to elucidate them is valuable. To this end, a GWAS was conducted on 210 Iranian Holstein cows, identifying several significant SNPs associated with milk production traits, including milk yield, milk fat, milk protein, and somatic cell count. In this study, significant milk yield SNPs were identified on chromosomes BTA2, BTA5, BTA17, BTA21, BTA24, and BTA28, consistent with previous research findings3,15,30,32,33. Eighteen marker-trait associations were found on chromosomes BTA7, BTA21, and BTA26 for milk fat percentage, corroborating earlier studies29,33,34. For milk protein percentage, 22 SNP markers were identified on BTA7, BTA21, BTA22, and BTA26, with some overlap with previous reports, which identified chromosomes 21 and 22 as the main contributors to this trait3,32,35,36. Several SNPs identified on chromosomes BTA2, BTA17, BTA19, and BTA21 for somatic cell count were also noted in prior research, though some significant SNPs discovered in this study had not been previously reported37,38,39,40,41.

Many genes were located alongside the identified markers, which may directly or indirectly influence the expression of genes associated with milk production traits. However, no reports have yet been published on the effects of some of these genes on milk production traits in cattle, indicating the need to expand our knowledge regarding the functions of these genes in bovines. On Chr26, two genes (ATE1 and FGFR2) associated with milk fat percentage and milk protein percentage were identified. The ATE1 gene, identified in this study near SNPs significantly associated with milk fat and milk protein percentage in Iranian Holstein cows, plays a critical role in protein post-translational modification through arginylation, a process essential for regulating protein stability and degradation. This gene is known to be involved in various cellular functions, including stress response, apoptosis, and cell cycle control. Its identification as a candidate gene in the context of milk production suggests that ATE1 may influence immune and inflammatory responses in the mammary gland, potentially affecting mastitis susceptibility. This makes ATE1 a promising target for further functional studies and a valuable marker for improving udder health in genomic selection programs42. The FGFR2 (Fibroblast Growth Factor Receptor 2) gene emerged as a candidate associated with supernumerary teats (SNT) in the GWAS of Iranian Holstein cows, suggesting a potential role in mammary gland morphology and development. FGFR2 is a key component of the fibroblast growth factor signaling pathway, which regulates cell growth, differentiation, and tissue development. Previous studies have linked FGFR2 to mammary gland proliferation and its dysregulation to breast cancer development. Specifically, FGFR2 expression has been observed in the endometrial and trophoblastic epithelium, and its activation has been shown to influence epithelial integrity and fertility.
These functions underscore FGFR2’s involvement in reproductive and mammary traits, making it a biologically plausible candidate gene for traits like supernumerary teats, which have implications for udder health, milkability, and the efficiency of mechanized milking systems42. ATE1 is a eukaryotic protein that plays a role in metabolism and apoptosis, reducing chromosomal aberrations through cell–cell contact43. A GWAS conducted by Fang et al.42 on Capra hircus demonstrated that the ATE1 gene is associated with udder size. Another gene identified in this study, FGFR2, has been linked to breast cancer44. Overexpression of growth hormone (GH) has been shown to promote mammary proliferation via FGFR2 and FGF742,45. On Chr21, several genes (ALDH1A3, CHSY1, and GABRG3) were found alongside significant markers for milk fat percentage. ALDH1A3 encodes the third enzyme of the aldehyde dehydrogenase 1 family, which plays a detoxification and antioxidant role by converting retinaldehyde to retinoic acid44. In a GWAS conducted on Chinese Holstein cows, ALDH1A3 was associated with milk production traits, such as fat and protein content46. The CHSY1 gene has been previously shown to contribute to bone growth47, and this study suggests that it may also be linked to milk-related traits. Another important gene identified is GABRG3, which has been associated with teat size48. In other GWAS studies on cattle, GABRG3 has also been linked to carcass traits and feed efficiency49,50,51.

On Chr2, several genes associated with milk yield traits were identified, including the FBXO36 gene, which was linked to milk yield in this population. FBXO36, a member of the F-box protein family, plays a role in protein ubiquitination and is involved in critical cellular functions such as nutrient sensing, signal transduction, circadian rhythms, and the cell cycle, contributing to mastitis resistance in Holstein cows52,53,54. The function of this gene has been demonstrated in various cattle populations, showcasing its multifunctional role. These associations include specific diseases, infections, and biological functions related to adaptation55,56. Additionally, on the same chromosome, the PID1 gene plays a role in human lipid metabolism, reducing the sensitivity of adipocytes to insulin through the interaction of the phosphotyrosine-binding domain 1 with the lipoprotein receptor57. A GWAS on cattle has identified a role for the PID1 gene in lipid metabolism and fatty acid synthesis58. TRIP12 is another gene that regulates the balance between protein synthesis and degradation and is involved in mammalian muscle differentiation59. A role has also been proposed for TRIP12 in intramuscular fat content in cattle58,60. Other important genes on this chromosome include CD52, WDTC1, and MATN1. The CD52 gene encodes a glycoprotein that reduces T-cell activation61. The WDTC1 gene regulates fat-related gene transcription62. Reduced expression of MATN1 has been associated with impaired muscle growth63. On Chr24, the CIDEA gene was found alongside significant markers. Previous reports have highlighted its role in lipid synthesis in milk, which is influenced by the complex regulation of multiple gene expressions. CIDEA is a protein expressed in adipose tissue and associated with lipid droplets64. High expression of this gene in the mammary glands of lactating mice has been linked to lipid secretion65.
Additionally, the CIDEA gene and several lipogenic enzymes are regulated post-partum in the mammary tissue of cattle66. On Chr5, the LYZ (lysozyme) gene was identified. Salehin et al.67 reported a significant effect of the LYZ gene on somatic cell count and milk production in cattle. The LYZ gene is important because of its strong antibacterial and immune-regulatory properties, particularly within the mammary gland of dairy animals. This gene encodes lysozyme, an antimicrobial enzyme abundantly secreted in milk, saliva, and other bodily fluids, where it plays a crucial role in the innate immune system by breaking down bacterial cell walls. In the context of dairy production, LYZ is highly expressed in the mammary gland of buffaloes, contributing to their enhanced resistance to mastitis compared to cattle68. Therefore, the LYZ gene is not only a key marker for udder health and milk quality but also a promising candidate for genomic selection and therapeutic applications aimed at improving disease resistance in dairy herds. Another gene identified on this chromosome was CPM. The CPM protein plays a role in adipose tissue differentiation and has been identified as a candidate gene for milk fatty acids in Holstein cows69.

On Chr19, the UCP1 gene was detected near SNPs significantly associated with the SCS trait. UCP1 encodes a mitochondrial carrier protein. Król et al.70 showed that the expression of UCP1 decreases during lactation in mice. An effect of UCP1 on milk protein percentage, milk fat percentage, and milk yield has also been reported71. The CYP2U1, SGMS2, and HADH genes contribute to milk fat secretion because they play an important role in the metabolism of lipids and fatty acids72. In a GWAS on cows, these three genes (CYP2U1, SGMS2, and HADH) were reported as candidate genes for milk fat73. In Iranian Holstein cattle, SNP 17:28,549,748 on BTA17 was associated with SCS. According to Duchemin et al.74, this region contains the SCLT1 gene, which affects the fatty acid composition of milk from Holstein cows. The THRSP gene was located in the vicinity of a significant SNP associated with the SCS trait. THRSP has been associated with chest circumference and body weight in goats75, with average daily weight gain, loin-eye area, and backfat thickness in pigs76, and, in cattle, with milk fatty acid composition74 and with water-holding capacity, which is correlated with meat tenderness77.

A new strategy in animal breeding programs, including for cattle, is using genomic information for economically important traits58. Identifying biological processes and genomic regions influencing milk production traits is essential for understanding the underlying genetic mechanisms. This study has identified novel genes as well as previously reported genes. In future breeding programs, the identified candidate gene variants can be utilized to improve milk production traits in dairy cattle. Additionally, validation studies involving gene expression analysis may be necessary in certain animal groups due to possible mutations in the identified candidate genes. This is essential for confirming the impact of these genes on the traits under investigation.

Conclusions

The genetic evaluation of milk production traits and somatic cell count in Holstein cows can be facilitated by combining genomic data in GWAS studies. We have identified several SNPs, important regions on various BTAs, and a list of candidate genes (both novel and known) that may contribute to variation in milk production traits and somatic cell count in Holstein cows. The genes ATE1, FGFR2, ALDH1A3, CHSY1, GABRG3, FBXO36, PID1, TRIP12, CD52, WDTC1, MATN1, CIDEA, LYZ, CPM, UCP1, MAML3, SGMS2, HADH, CYP2U1, SCLT1, and THRSP have been suggested as candidate genes for milk production traits and somatic cell count in Holstein cattle. These genes may support the identification of causal mutations, genomic prediction, and ultimately higher profitability for milk production traits and somatic cell count in dairy cattle. This study demonstrated the feasibility of genetic evaluation for milk production traits and somatic cell count in the Iranian Holstein population, and it should be incorporated into the selection index for Iranian dairy cows.

Materials and methods

Phenotypic data

Iranian Holstein cows from the dairy farm of Ferdous Pars Agriculture Development were used in this study. In total, 210 cows (150 from herd 1 and 60 from herd 2) were selected based on their breeding values for the milk production trait78. Animals were chosen using the two-tailed selection strategy outlined by Jiménez-Montero et al.79, which was based on estimated breeding values (EBVs) for milk yield. The EBVs were calculated by the National Animal Breeding Centre of Iran (Karaj, Iran) using a lactation model, as described in Eq. (1)80. The authors of the article confirm that the study was reported in accordance with the ARRIVE guidelines.

$${y}_{ij}=\mu +{hys}_{i}+{a}_{ij}+{e}_{ij}$$
(1)

In this model, \({y}_{ij}\) represents the milk yield, adjusted to a standard 305-day lactation period with twice-daily milking. The term \(\mu\) denotes the overall population mean, \({hys}_{i}\) accounts for the fixed effect of the ith herd-year-season group, \({a}_{ij}\) represents the breeding value of the jth animal within the ith herd-year-season group, and \({e}_{ij}\) captures the random residual error. The average accuracy of the estimated breeding values (EBVs) for milk yield was calculated to be 0.6180.

Two further criteria were applied during sampling. Pedigrees were analyzed using the CFC V9.0 SP7 software81, and animals with minimal kinship relationships were chosen to ensure high diversity within both herds80. A complete pedigree (given in Supplementary 3) and records were available for the selected animals, and it was ensured that the animals were not candidates for culling. Across the first to sixth lactations of the 210 Holstein cows, which were kept on one Iranian farm with two herds, 75,228 phenotypic records were collected from May 2013 to December 2020. The traits studied were test-day milk yield (MY; kg/d), somatic cell count (SCC, converted according to Ali and Shook82), milk protein percentage (PP, %), and milk fat percentage (FP, %). A summary of the phenotypic data is shown in Table 1.
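The SCC conversion mentioned above can be sketched as follows. This is a minimal example that assumes the commonly used form of the Ali and Shook transformation, SCS = log2(SCC / 100,000) + 3 with SCC in cells/mL; the source does not spell out the formula or the units of the recorded SCC, and the function name is hypothetical.

```python
import math


def somatic_cell_score(scc_cells_per_ml: float) -> float:
    """Convert somatic cell count (cells/mL) to somatic cell score.

    Assumed form: SCS = log2(SCC / 100,000) + 3.
    """
    if scc_cells_per_ml <= 0:
        raise ValueError("SCC must be positive")
    return math.log2(scc_cells_per_ml / 100_000) + 3.0


# Each doubling of SCC raises the score by one unit.
for scc in (100_000, 200_000, 400_000):
    print(scc, somatic_cell_score(scc))  # 3.0, 4.0, 5.0
```

The log transformation pulls in the long right tail of SCC, which is why the score is usually preferred to the raw count in linear models.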

Genotype imputation and quality control (QC)

One hundred fifty (150) animals from herd 1 and sixty (60) animals from herd 2 were genotyped with the GGP-LD v.4 SNP panel (30,108 SNPs) and the Affymetrix Axiom Bovine Array-50 K (51,987 SNPs), respectively.

Genotype quality control was performed in PLINK 2.0 using four criteria. Animals with more than 5% missing genotypes were excluded, as were SNPs with a minor allele frequency (MAF) below 5%, SNPs that were not genotyped in more than 5% of animals, and SNPs that deviated from Hardy–Weinberg equilibrium (chi-square test, P < 10⁻⁶). Minimac3 2.0.1 software was used for stepwise imputation, to check imputation accuracy, and to identify and remove markers with lower accuracy83.
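The four quality-control criteria can be sketched in plain Python. This is an illustration under stated assumptions, not the PLINK implementation: genotypes are coded as 0, 1, or 2 copies of one allele with `None` for a missing call, all function names are hypothetical, and Hardy–Weinberg equilibrium is tested with an asymptotic 1-df chi-square test rather than an exact test.

```python
import math


def call_rate(calls):
    """Fraction of non-missing calls (use per animal or per SNP)."""
    return sum(g is not None for g in calls) / len(calls)


def minor_allele_frequency(calls):
    """Minor allele frequency from 0/1/2 genotype codes."""
    obs = [g for g in calls if g is not None]
    if not obs:
        return 0.0
    p = sum(obs) / (2 * len(obs))
    return min(p, 1 - p)


def hwe_chi2_pvalue(calls):
    """P value of a 1-df chi-square test of Hardy-Weinberg proportions."""
    obs = [g for g in calls if g is not None]
    n = len(obs)
    if n == 0:
        return 1.0
    p = sum(obs) / (2 * n)
    q = 1 - p
    expected = [n * q * q, 2 * n * p * q, n * p * p]
    if min(expected) == 0:  # monomorphic SNP: nothing to test
        return 1.0
    observed = [obs.count(0), obs.count(1), obs.count(2)]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))


def snp_passes_qc(calls, min_call_rate=0.95, min_maf=0.05, hwe_p=1e-6):
    """Apply the SNP-level filters described in the text."""
    return (call_rate(calls) >= min_call_rate
            and minor_allele_frequency(calls) >= min_maf
            and hwe_chi2_pvalue(calls) >= hwe_p)


print(snp_passes_qc([0] * 25 + [1] * 50 + [2] * 25))  # True
print(snp_passes_qc([0] * 50 + [2] * 50))             # False (no heterozygotes)
```

The animal-level filter uses the same `call_rate` helper applied to one animal's genotypes across all SNPs, keeping animals with a call rate of at least 0.95.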

The 210 cows (150 from Herd 1 and 60 from Herd 2) were genotyped using two SNP panels: the GGP-LD v.4 (30,108 SNPs) and the Affymetrix Axiom Bovine Array-50 K (51,987 SNPs). These animals comprised the target population84. Genotypes were then imputed to whole-genome sequence level using a reference population of 234 animals from the 1000 Bull Genomes Project. This reference panel included key progenitors from four major breeds: Holstein–Friesian (n = 129), Fleckvieh (n = 43), Jersey (n = 15), and Angus (n = 47), each with both BovineHD BeadChip genotypes and whole-genome sequencing data85. Quality control was applied to both SNP chip and sequence data, resulting in 578,505 SNPs from the BovineHD chip and 12,063,146 SNPs from the sequence data after filtering. Genotype phasing was conducted using Eagle v2.3, and imputation was performed with Minimac3 for both reference and target populations78. After removing imputed SNPs with an accuracy (\(R^{2}\)) below 0.30, 6,583,595 high-confidence SNPs were retained and used in the genome-wide association analysis.

GWAS for somatic cell count and milk production

A mixed linear model implemented in EMMAX was used to test associations between imputed genotypes and the milk production and somatic cell count traits86. EMMAX adjusts for both population stratification and relatedness in the association analysis. The mixed model used in this study was as follows (Eq. 2):

$${\text{y}} = {\text{Xb}} + {\text{Zu}} + {\text{e}}$$
(2)

where y is an n × 1 vector of phenotypic measurements; X is an n × q matrix of fixed effects, including the overall mean, covariates, and the SNP being tested; b is a q × 1 vector of fixed-effect coefficients; Z is an n × t incidence matrix that relates phenotypes to the corresponding random polygenic effects; u is a t × 1 vector of random polygenic effects; and e is an n × 1 vector of residual effects. Furthermore, Var(u) = \({\sigma }_{g}^{2}K\) and Var(e) = \({\sigma }_{e}^{2}I\), where I is the identity matrix and K is the kinship matrix computed from all imputed sequence genotypes.
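Under these definitions, and assuming that u and e are independent, the covariance of the phenotypes and the generalized least squares estimate of the fixed effects take the standard mixed-model form below. This is a textbook sketch given for orientation, with V introduced here as shorthand for the phenotypic covariance matrix; it is not a description of the exact computations performed by EMMAX.

$$\text{Var}(\text{y}) = \text{V} = \sigma_{g}^{2}\,\text{Z}\text{K}\text{Z}^{\top} + \sigma_{e}^{2}\,\text{I}, \qquad \widehat{\text{b}} = (\text{X}^{\top}\text{V}^{-1}\text{X})^{-1}\text{X}^{\top}\text{V}^{-1}\text{y}$$

The SNP effect is the element of \(\widehat{\text{b}}\) that corresponds to the SNP column of X, so relatedness enters the test only through V.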

A Bonferroni-corrected genome-wide significance threshold of approximately 1 × 10⁻⁸ (0.05 divided by the total number of SNPs) was applied in the association analysis. Manhattan plots were drawn in R 4.3.2 using the qqman package87.
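The arithmetic behind the threshold is short enough to check directly. The sketch below assumes that the "total number of SNPs" is the 6,583,595 imputed SNPs retained for the GWAS; the variable names are illustrative.

```python
import math

ALPHA = 0.05
N_SNPS = 6_583_595  # imputed SNPs retained for GWAS, as reported in the text

# Bonferroni correction: family-wise alpha divided by the number of tests.
threshold = ALPHA / N_SNPS

print(f"{threshold:.2e}")                  # 7.59e-09
print(round(-math.log10(threshold), 2))    # 8.12
```

The exact value is slightly stricter than the rounded 1 × 10⁻⁸ line drawn on the Manhattan plots, which corresponds to 8 on the −log10 scale.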

Gene annotation

Ensembl annotations of the UMD3.1 genome assembly (http://www.ensembl.org/biomart/martview) were used to identify candidate genes within one megabase (1 Mb) of SNPs that passed the threshold of P < 1 × 10⁻⁸. Gene ontology analysis was performed using DAVID Bioinformatics Resources version 6.7 (http://david.abcc.ncifcrf.gov/). The cattle QTLdb (https://www.animalgenome.org/cgibin/QTLdb/BT/index) was used to identify QTLs located within 1 Mb of SNPs meeting the same threshold. GeneMANIA (http://genemania.org/) was then used to draw gene networks.
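The 1 Mb window search can be sketched as a simple interval overlap. The example below is a minimal illustration: the gene names and coordinates are hypothetical placeholders rather than Ensembl annotations, and it assumes that a gene counts as a candidate when any part of it overlaps the window around the SNP.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    chrom: str
    start: int
    end: int


def genes_near_snp(chrom, pos, genes, window=1_000_000):
    """Return names of genes overlapping [pos - window, pos + window]."""
    lo, hi = pos - window, pos + window
    return [g.name for g in genes
            if g.chrom == chrom and g.start <= hi and g.end >= lo]


# Hypothetical gene coordinates, for illustration only.
genes = [
    Gene("GENE_A", "26", 41_000_000, 41_100_000),
    Gene("GENE_B", "26", 43_000_000, 43_050_000),
    Gene("GENE_C", "21", 5_400_000, 5_450_000),
]

print(genes_near_snp("26", 41_368_775, genes))  # ['GENE_A']
print(genes_near_snp("21", 5_475_347, genes))   # ['GENE_C']
```

The same overlap test applies to QTL intervals, so one helper can serve both the gene and the QTL lookup if QTLs are stored with the same fields.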