Introduction

Cow’s milk contains approximately 3.4% total protein, of which 78–80% is casein1. Among the caseins, β-casein is the second most abundant in cow’s milk after αs1-casein, comprising 30–37% of the total casein content1,2. The β-casein gene (CSN2), located in the casein (CN) locus on chromosome 6, is highly polymorphic. At least 17 genetic variants have been described3,4, with A1 and A2 being the most prevalent in commercial dairy populations5,6. These two variants differ by a single nucleotide polymorphism (SNP) that results in a histidine (A1 β-casein) or a proline (A2 β-casein) at position 67 of the protein7,8, potentially affecting its digestibility and interaction with gastrointestinal enzymes. This SNP has attracted interest because the bioactive peptide β-casomorphin-7 (BCM-7)9,10 is formed during digestion of A1 β-casein and has been associated with gastrointestinal effects such as delayed transit and inflammation11,12,13,14,15,16. In contrast, milk containing exclusively A2 β-casein has been associated with softer stools17 and improved digestive comfort12,14,18,19,20,21. This type of milk is therefore already widely available commercially as an alternative for those with gastrointestinal discomfort12,19.

Despite some uncertainties, particularly the limited number of clinical studies in humans and the lack of sufficient scientific evidence regarding the health effects of A1 β-casein2,22,23, a trend toward selective breeding of A2A2 cows has emerged. Global semen trade companies have introduced the A2A2 genotype in their sire directories as a trait of interest in breeding programs. Furthermore, dairy products containing only the A2 variant, produced by cows carrying the A2A2 genotype, are already available in several regions, such as New Zealand, North America, Australia, China, the United Kingdom, the Netherlands, Italy, and Spain, often at premium prices2,24,25,26. This type of milk represents an opportunity to add value to dairy products and may offer a diversification strategy for small-scale dairy farms in a competitive milk market25,27.

Nevertheless, beyond consumer health, the inclusion of the A2 β-casein genotype as a selection criterion also raises questions about potential impacts on milk synthesis and composition. While numerous association studies have investigated the relationship between β-casein genotypes and milk production traits, results remain inconsistent across populations and should be interpreted with caution. Some studies found no significant associations28,29,30, whereas others linked A2A2 to higher milk/protein yield and lower fat percentage or somatic cell count27,31,32,33. Moreover, at the proteomic level, Wang et al.34 showed that specific proteins in casein micelles, whey and the milk fat globule membrane (MFGM) may vary with β-casein genotype. They observed that A1A1 milk was enriched in ceruloplasmin, protein S100-A9, and cathelicidin-2, whereas A2A2 milk showed higher levels of lactoferrin and CD5L, proteins involved in immune response and lipid metabolism. These compositional differences may reflect underlying molecular mechanisms regulated at the transcriptional level. Indeed, milk protein genes, although not regulatory per se, can carry SNPs within their coding sequence that may be in linkage disequilibrium with cis-acting regulatory elements or may influence transcript splicing and stability35,36. For example, Huang et al.35 reported associations between SNPs in casein genes and the concentrations of major caseins and total protein content, suggesting possible genotype-dependent variation in gene expression.

High-throughput transcriptomic technologies such as RNA-seq make it possible to investigate whether such genetic variation translates into differences in gene expression in the mammary gland37,38,39,40,41,42,43,44. Previous studies have demonstrated that milk fat globules (MFG) can entrap variable amounts of cytoplasmic material from mammary epithelial cells (MEC), providing a non-invasive alternative to tissue biopsies. This approach does not disrupt the normal lactation process and permits repeated sampling during lactation45,46,47, allowing representative transcriptomic profiling of the lactating mammary gland47,48.

This transcriptomic approach offers a promising framework to explore whether β-casein genotype is associated with differential gene expression patterns during lactation in the context of A2A2-based breeding strategies. Building on this rationale, we hypothesised that cows with different β-casein genotypes may differ in gene expression profiles or in key metabolic pathways involved in milk synthesis and composition. Accordingly, the aim of the present study was to compare the MFG transcriptome of Holstein cows with A1A1 and A2A2 genotypes, using conventional gene-level differential expression analysis.

Results

RNA-sequencing, mapping and gene expression estimation

Upon RNA-Seq analysis, a total of 740 million reads were generated from sequencing 14 cDNA libraries, resulting in an average of 52.7 million paired-end reads of 151 bp in length per sample. Results indicated that the quality of the sequencing data was high, with 94% of bases scoring Q30 or above for all samples. In addition, a high alignment rate was obtained with an average of 86.02% of reads mapped to unique positions in the protein coding regions of the bovine reference genome (Table 1).

Table 1 Summary of the RNA-sequencing (RNA-seq) data after sequencing reads mapping to the reference genome.

This alignment rate was similar to that obtained for RNA samples isolated from milk somatic cells of dairy sheep49. In contrast, lower alignment rates were reported by Cánovas et al.38 (60–75%) for RNA isolated from five different sources in the mammary gland (milk somatic cells, antibody-captured milk mammary epithelial cells (mMEC), laser microdissected mammary epithelial cells (LMEC), mammary gland tissue and MFG), and by Wickramasinghe et al.41 (65%) for RNA isolated from milk somatic cells of dairy cows.

Lowly expressed genes were removed, so that only genes with more than 10 counts in at least 6 samples were retained. This resulted in 11,180 genes identified using the DESeq2 package and 11,257 genes using the EdgeR package as expressed in the analysed MFG transcriptomes. The gene expression level distribution in the MFG transcriptome is represented in Fig. 1. Following other studies40,41,49, genes were categorised into normalised read count groups as follows: lowly expressed genes (< 10 FPKM), medium expressed genes (≥ 10 to < 500 FPKM), highly expressed genes (≥ 500 to < 4000 FPKM) and most highly expressed genes (≥ 4000 FPKM).
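The filtering rule and the expression categories described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the pipeline used in the study: the function names and the example counts are hypothetical, and the FPKM formula is the standard definition (fragments per kilobase of transcript per million mapped fragments).

```python
def passes_count_filter(counts, min_count=10, min_samples=6):
    """Return True if a gene has more than `min_count` reads
    in at least `min_samples` samples."""
    return sum(1 for c in counts if c > min_count) >= min_samples


def fpkm(fragments, gene_length_bp, total_mapped_fragments):
    """Standard FPKM: fragments per kilobase of transcript
    per million mapped fragments."""
    return fragments * 1e9 / (gene_length_bp * total_mapped_fragments)


def expression_category(mean_fpkm):
    """Assign the expression group used in the text."""
    if mean_fpkm < 10:
        return "low"
    if mean_fpkm < 500:
        return "medium"
    if mean_fpkm < 4000:
        return "high"
    return "most high"


# Hypothetical raw counts for one gene across 14 libraries.
example_counts = [25, 31, 0, 12, 40, 18, 9, 22, 3, 15, 27, 11, 8, 19]
print(passes_count_filter(example_counts))

# Mean FPKM values taken from the text (CSN1S1 and BTN1A1).
print(expression_category(90400))   # CSN1S1
print(expression_category(625.83))  # BTN1A1
```

Under these assumptions, CSN1S1 falls in the most highly expressed group and BTN1A1 in the highly expressed group, matching the thresholds stated in the text.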

Fig. 1

Gene expression level distribution in the milk fat globule (MFG) transcriptome analysed by RNA-sequencing (RNA-seq). Genes were categorised into fragments per kilobase per million mapped reads (FPKM) groups as follows: lowly expressed genes (< 10 FPKM), medium expressed genes (≥ 10 FPKM to 500 FPKM), highly expressed genes (≥ 500 to 4000 FPKM), and most highly expressed genes (≥ 4000 FPKM). The 11,257 expressed genes with EdgeR are represented.

Gene ontology enrichment analysis of highly expressed genes in the milk fat globules

We first examined the global transcriptomic activity of the mammary gland of A1A1 and A2A2 cows, focusing on the most highly expressed genes. A total of 142 genes were identified as highly expressed in the analysed samples, with an average FPKM ≥ 500. To further investigate the functional associations of these highly abundant genes in the bovine MFG, a GO term enrichment analysis was carried out with DAVID37,50,51. The significantly enriched GO terms were classified into 27 functional groups, and the 142 highly expressed genes were enriched for 8 molecular function terms and 15 biological process terms (FDR < 0.05). The most significantly enriched terms for each category are shown in Figs. 2 and 3.

Fig. 2

Enrichment analysis of Gene Ontology (GO) molecular functions terms in highly expressed genes using DAVID Bioinformatics Resources (average FPKM ≥ 500; FDR < 0.05). The number of genes in each GO category is indicated within the plot. FPKM: fragments per kilobase per million mapped reads.

Fig. 3

Enrichment analysis of Gene Ontology (GO) biological processes in highly expressed genes using DAVID Bioinformatics Resources (average FPKM ≥ 500; FDR < 0.05). The number of genes in each GO category is indicated within the plot. FPKM: fragments per kilobase per million mapped reads.

The specific genes involved in each function and process are detailed in Supplementary Tables S1 and S2. “Structural constituent of ribosome” was the most significantly enriched molecular function (FDR = 2.24 × 10⁻⁸²); it included 67 highly expressed genes and accounted for 50% of the molecular function terms. Other GO terms related to the protein synthesis machinery were “RNA binding” (FDR = 8.62 × 10⁻¹⁴), including 29 genes, followed by “rRNA binding” (FDR = 1.63 × 10⁻⁹) with 10 genes, “translation elongation factor activity” (FDR = 1.68 × 10⁻⁵) with 6 genes and “large ribosomal subunit rRNA binding” (FDR = 0.019) with 3 genes. In addition, two molecular functions related to the cellular energy supply machinery were detected: “NADH dehydrogenase (ubiquinone) activity” (FDR = 8.57 × 10⁻⁶) and “cytochrome-c oxidase activity” (FDR = 5.36 × 10⁻⁴), which comprised 6 and 4 genes, respectively. With respect to biological processes, “translation” was the most enriched GO term (FDR = 2.26 × 10⁻⁶²), including 52 genes. Other biological processes related to translation were “cytoplasmic translation” (FDR = 2.10 × 10⁻¹⁹), “ribosomal small subunit biogenesis” (FDR = 9.26 × 10⁻¹⁸) and “translation elongation” (FDR = 3.87 × 10⁻⁸), comprising 17, 15 and 7 genes, respectively. Furthermore, four biological processes related to hormonal regulation, namely “response to 11-deoxycorticosterone” (FDR = 3.36 × 10⁻⁷), “response to dehydroepiandrosterone” (FDR = 3.36 × 10⁻⁷), “response to progesterone” (FDR = 7.50 × 10⁻⁵) and “response to estradiol” (FDR = 8.71 × 10⁻⁵), were also significant; all of them included the milk protein genes encoding α-S1-casein (CSN1S1), α-S2-casein (CSN1S2), β-lactoglobulin (PAEP/LGB), β-casein (CSN2), κ-casein (CSN3) and α-lactalbumin (LALBA).
Finally, biological processes related to the mitochondria and the cellular energy supply were also enriched, with the GO terms “mitochondrial electron transport” (FDR = 5.32 × 10⁻⁴), “electron transport coupled proton transport” (FDR = 2.40 × 10⁻⁴), and “ATP synthesis coupled electron transport” (FDR = 2.40 × 10⁻⁴), which included 5, 4 and 4 genes, respectively.
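To illustrate how an enrichment P value and an FDR of this kind can arise, the following Python sketch computes an upper-tail hypergeometric probability and applies the Benjamini-Hochberg correction. It is a simplified stand-in and not the DAVID implementation, which relies on a modified Fisher exact test (the EASE score). The background size and the number of annotated background genes used below are hypothetical placeholders; only the 67 of 142 genes come from the text.

```python
from math import comb


def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) when n genes are drawn from a background of N genes,
    K of which carry the annotation of interest."""
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(n, K) + 1))
    return hits / total


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# 67 of the 142 highly expressed genes carry the ribosomal annotation (text).
# Assumptions: a background of 11,257 expressed genes, 150 of them annotated.
p_ribosome = hypergeom_upper_tail(k=67, n=142, K=150, N=11257)
print(p_ribosome < 0.05)

# Hypothetical raw P values for three terms, corrected together.
print(benjamini_hochberg([0.01, 0.04, 0.03]))
```

The actual FDR values reported in the text depend on the background and annotation sets chosen in DAVID, so the numbers produced by this sketch are not expected to reproduce them.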

The results of the DAVID functional annotation clustering (FAC) analysis (Figs. 5 and 6), carried out with the 142 highly expressed genes, showed that these genes are mainly related to protein synthesis and, more specifically, to the translation machinery, the translation process (including initiation, elongation and termination), translation regulation and post-translational modifications (annotation clusters 1, 2, 3, 4, 5, 7, 10, 11, 12). In addition, significant clusters were found for hormonal regulation of the mammary gland (annotation cluster 6) and mitochondrial processes involved in cellular energy supply (annotation clusters 8 and 9).

The most highly expressed genes in the MFG transcriptome are shown in Table 2 (top 25 transcripts with the highest abundance, FPKM ≥ 4000). The milk protein genes CSN1S1, PAEP/BLG and CSN2, together with the gene encoding the glycosylation-dependent cell adhesion molecule (GLYCAM1), showed the highest expression levels, with mean values of 90,400, 71,021, 53,578 and 49,494 FPKM, respectively. Although the other main milk protein genes, CSN1S2, CSN3 and LALBA, were also among the top 25 transcripts, they exhibited lower expression levels, with mean values of 30,790 FPKM, 25,112 FPKM, and 17,897 FPKM, respectively. In addition, several genes involved in cellular energy production were highly expressed, including the cytochrome c oxidase (COX) subunits 1, 2 and 3 (COX1, COX2 and COX3); the NADH-dehydrogenase (ND) subunits 1, 3 and 4 (ND1, ND3 and ND4); the ATP synthase F0 subunit 6 (ATP6); and the cytochrome b (CYTB) genes. In particular, COX1 and COX3 were very highly expressed and, although the differences between genotypes were not statistically significant, COX1 showed average expression levels of 25,253 FPKM in A2A2 cows and 54,066 FPKM in A1A1 cows, while COX3 showed 23,226 FPKM in A2A2 cows compared to 45,669 FPKM in A1A1 cows.

Table 2 Most highly expressed genes (FPKM ≥ 4000) in the milk fat globule transcriptome of cows with A1A1 and A2A2 β-casein genotypes1. Values represent individual FPKM for each animal and the average across genotypes. Abbreviations: CSN1S1 = α-S1-casein; PAEP/LGB = β-lactoglobulin; CSN2 = β-casein; GLYCAM1 = Glycosylation-dependent cell adhesion molecule; COX1 = Cytochrome c oxidase subunit I; COX3 = Cytochrome c oxidase subunit III; CSN3 = κ-casein; CSN1S2 = α-S2-casein; PLIN2 = perilipin-2; ATP6 = ATP synthase F0 subunit 6; ND3 = NADH dehydrogenase subunit 3; LALBA = α-lactalbumin; COX2 = Cytochrome c oxidase subunit II; ND1 = NADH dehydrogenase subunit 1; CYTB = cytochrome b; FABP3 = Fatty Acid Binding Protein 3; RPLP0 = Ribosomal Protein Lateral Stalk Subunit P0; TPT1 = Tumor protein translationally-controlled 1; XDH = Xanthine Dehydrogenase/Oxidase; ND4 = NADH dehydrogenase subunit 4; RPLP1 = Ribosomal Protein Lateral Stalk Subunit P1; RPS2 = Ribosomal protein S2; RACK1 = Receptor for activated C kinase 1; FTH1 = Ferritin heavy chain 1.

Furthermore, among the ribosomal protein genes, which accounted for 48.9% of the highly expressed genes, only three showed FPKM values higher than 4000: ribosomal protein lateral stalk subunit P0, ribosomal protein lateral stalk subunit P1 and ribosomal protein S2 (RPLP0, RPLP1, and RPS2). Regarding genes associated with milk fat, perilipin 2 (PLIN2) with 21,930 FPKM, fatty acid binding protein 3 (FABP3) with 10,090 FPKM, and xanthine dehydrogenase/oxidase (XDH) with 5,087 FPKM were found among the most highly expressed genes in the MFG transcriptome.

Differentially expressed genes between cows with different β-casein genotypes

Differential expression analysis was performed on all genes expressed in the MFG transcriptome: the 11,180 genes identified with the DESeq2 package and the 11,257 genes identified with the EdgeR package. The analysis with DESeq2 identified two differentially expressed genes (DEGs): 16S rRNA and the NADH dehydrogenase subunit 6 gene (ND6) (Padj < 0.05 and |log2FC| > 1). Both genes were down-regulated in A2A2 cows relative to A1A1 cows. With EdgeR, two DEGs were also identified: ND6, down-regulated in A2A2 cows relative to A1A1 cows, and the calpain-6 gene (CAPN6), which was up-regulated in A2A2 cows (FDR < 0.05 and |log2FC| > 1). The differential expression analysis results are shown in Table 3.
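The decision rule applied to each gene (adjusted P value below 0.05 and an absolute log2 fold change above 1) can be written as a short Python sketch. The function names are hypothetical and the adjusted P values are placeholders, since they are not given in the text; the log2FC values are those reported for the A2A2 vs. A1A1 contrast.

```python
def is_deg(padj, log2fc, alpha=0.05, lfc_threshold=1.0):
    """Apply the two thresholds used to call a gene differentially expressed."""
    return padj < alpha and abs(log2fc) > lfc_threshold


def direction(log2fc):
    """Direction of change in A2A2 cows relative to A1A1 cows."""
    return "up in A2A2" if log2fc > 0 else "down in A2A2"


# (gene, padj, log2FC). The padj values are hypothetical placeholders;
# the last gene is an invented example that fails the fold-change threshold.
results = [
    ("ND6", 0.01, -1.3567),
    ("16S rRNA", 0.01, -1.3086),
    ("CAPN6", 0.01, 3.7758),
    ("hypothetical_gene", 0.01, 0.40),
]

for gene, padj, lfc in results:
    if is_deg(padj, lfc):
        print(gene, direction(lfc))
```

Both conditions must hold at once: a gene with a small adjusted P value but a modest fold change, such as the invented last entry, is not reported as a DEG.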

Table 3 Differentially expressed genes (DEGs) identified in the milk fat globule (MFG) transcriptome of cows with different β-casein genotypes (A2A2 vs. A1A1) using the DESeq2 and EdgeR packages.

Discussion

The use of MFG as a source of RNA for transcriptomic profiling in the present study was based on evidence that this milk-derived material can entrap variable amounts of cytoplasmic material from MEC43,45, and on its unique advantage as a non-invasive and more ethical sampling method. Consequently, MFG can contain transcripts that originate directly from MEC, supporting their use as a proxy for the mammary gland transcriptome (Fig. 4). Indeed, they have been successfully used in previous studies to explore gene expression dynamics related to milk synthesis and activity in the mammary gland40,42,43,52,53,54,55. Cánovas et al.38 further demonstrated a strong correlation between gene expression profiles from MFG and mammary gland tissue (Pearson r = 0.88–0.92), highlighting their transcriptomic representativeness. Additionally, a recent study reported approximately 80% similarity in miRNAs between MFG and mammary tissue, further supporting their biological relevance.

Fig. 4

Milk fat globule (MFG) formation in mammary epithelial cells (MEC). During MFG formation, triacylglycerol (TAG) cores are surrounded by a phospholipid monolayer derived from the endoplasmic reticulum, giving rise to lipid droplets (LDs). These LDs are then enveloped by an outer bilayer membrane originating from the apical plasma membrane of MEC, forming the milk fat globules (MFG). The resulting MFGs are enclosed by a trilayered milk fat globule membrane (MFGM), which includes both the phospholipid monolayer from the ER and the bilayer from the plasma membrane of MEC. MFG may enclose variable amounts of MEC cytoplasm between the monolayer and bilayer membranes, forming cytoplasmic crescents, which may also contain RNA in varying amounts.

However, the MFG fraction also presents technical challenges. Due to their low cytoplasmic content and enrichment in small RNAs56, MFG contain low amounts of ribosomal RNA (18S and 28S)40,57, resulting in consistently low RIN values. In our study, the mean RIN was 3.59 ± 0.27, below the conventional thresholds established for RNA-seq studies. Nevertheless, other RNA quality indicators, such as RNA IQ and the A260/280 ratio, together with sequencing parameters (e.g., read depth, base quality, alignment rate), confirmed the suitability of the RNA extracts, as described previously58. Moreover, the limitations of RIN as a universal metric for RNA quality have been highlighted in other low-rRNA samples, such as spermatozoa, where small RNAs predominate and make RIN values unreliable as a sole quality criterion59. Similarly, transcriptomic studies of milk samples in other dairy contexts have reported acceptable RNA-seq performance despite low RIN values56,60. Altogether, these findings support the use of MFG as a biologically informative and technically viable RNA source. However, their particularities regarding RIN values should be taken into account, and it is essential to ensure that other quality parameters, such as the A260/280 ratio, RNA IQ, and sequencing metrics, are properly met.

During lactation, the demand for protein and energy increases substantially. Protein synthesis requires ribosomal components, amino acids, and large amounts of energy61. In this study, 48.9% of the 142 highly expressed genes (average FPKM ≥ 500) encoded ribosomal proteins, which are key for ribosome biogenesis, assembly, and translation, all essential processes to meet lactation demands37,62,63,64. The subsequent GO analysis showed that “structural constituent of ribosome” (GO:0003735) was the most enriched molecular function, grouping 67 of the 142 genes, while “translation” (GO:0006412) was the top biological process with 52 associated genes (Figs. 2 and 3). To aid biological interpretation, DAVID Functional Annotation Clustering (FAC) was conducted (Figs. 5 and 6) to group together genes that share similar annotations65. Both GO terms and related terms were grouped into cluster 1 (enrichment score: 74.42), including “ribosome” (bta03010), “ribosomal protein” (KW-0689), “ribonucleoprotein” (KW-0687), and “cytosolic large ribosomal subunit” (GO:0022625).

Fig. 5

Bubble map of enriched clusters of DAVID functional annotation clustering (FAC) of the highly expressed genes (average FPKM ≥ 500) in the milk fat globule (MFG) transcriptome of cows with A1A1 and A2A2 β-casein genotypes. The figure shows clusters with enrichment score values higher than 4. Clusters with an enrichment score > 1 and FDR < 0.05 were considered significant. Count refers to the number of genes involved in the term and the percentage is calculated as (involved genes/total genes). FDR = false discovery rate.

Fig. 6

Bubble map of enriched clusters of DAVID FAC of the highly expressed genes in the milk fat globule (MFG) transcriptome of cows with A1A1 and A2A2 β-casein genotypes. The figure shows clusters with enrichment score values lower than 4. Clusters with an enrichment score > 1 and FDR < 0.05 were considered significant. Count refers to the number of genes involved in the term and the percentage is calculated as (involved genes/total genes). FDR = false discovery rate.

Moreover, annotation clusters 2, 3, 4, 5, 7, 10, 11, and 12 were associated with protein synthesis. Cluster 2 (enrichment score: 29.39) grouped genes involved in post-translational modifications, including “isopeptide bond” (KW-1017), “crosslink Glycyl lysine isopeptide (Lys-Gly) interchain with G-Cter in SUMO2” (CROSSLINK), and “Ubl conjugation” (KW-0832). Cluster 11 (enrichment score: 1.64) focused on the ubiquitin system, with terms such as “ubiquitin protein ligase binding” (GO:0031625) and “ubiquitin-dom” (IPR019956), key to tagging proteins for degradation. Both clusters are linked by ubiquitin-like modifiers such as SUMO2, which conjugates to proteins via isopeptide bonds and participates in processes such as nuclear transport, DNA repair, and proteasomal degradation66,67. Cluster 3 (enrichment score: 21.09) included terms such as “cytosolic small ribosomal subunit” (GO:0022627), “ribosomal small subunit biogenesis” (GO:0042274), “nucleolus” (GO:0005730), and “small-subunit processome” (GO:0032040), all related to ribosomal small subunit biogenesis in the nucleolus68. Cluster 4 (enrichment score: 8.03) featured “RNA binding” (KW-0699) and “rRNA binding” (GO:0019843), reflecting the central role of RNA-binding proteins (including ribosomal proteins and ribosome biogenesis factors) and RNA molecules in ribosome function69. These RNA-binding proteins are also crucial for numerous biological processes, ranging from transcription and splicing to intracellular transport, translation, and decay69. Cluster 5 (enrichment score: 5.42) included “translation prot SH3 like sf” (IPR008991), referring to domains with SH3-like topology, which are found in proteins associated with the translation machinery. In addition, cluster 7 (enrichment score: 3.95) was related to translation elongation, whilst cluster 10 (enrichment score: 1.74) was involved in translation regulation.
Finally, cluster 12 (enrichment score: 1.46) featured zinc finger domains, which characterise a family of transcription factors with different DNA-binding domains involved in cell differentiation, development, or apoptosis70. Hence, the GO and FAC analyses of highly expressed genes in the bovine MFG transcriptome corroborated that most of them were associated with protein synthesis.
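For readers unfamiliar with the cluster enrichment scores quoted above, the sketch below assumes the definition given in the DAVID documentation: the score of an annotation cluster is the geometric mean of the P values (EASE scores) of its member terms, expressed on a −log10 scale. The function name and the example P values are hypothetical and do not correspond to any cluster in this study.

```python
from math import log10


def cluster_enrichment_score(pvalues):
    """-log10 of the geometric mean of the member-term P values,
    which equals the arithmetic mean of the -log10 P values."""
    return sum(-log10(p) for p in pvalues) / len(pvalues)


# Hypothetical member-term P values for one annotation cluster.
member_pvalues = [1e-2, 1e-4]
print(round(cluster_enrichment_score(member_pvalues), 2))  # 3.0
```

Under this definition, a score above 1.3 corresponds to a geometric mean P value below 0.05, which gives a rough sense of scale for the scores of clusters 1 to 12.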

Protein synthesis is an active and energy-intensive process involving transcription, translation, and protein folding, and therefore requires the expression of genes related to cellular energy production37,71,72. Lactation further increases these demands, presenting a significant metabolic challenge for female mammals due to hormonal fluctuations and rising energy requirements, particularly from early to peak lactation72,73. In this study, GO analysis revealed significant enrichment of the molecular function terms “NADH dehydrogenase (ubiquinone) activity” (GO:0008137) and “cytochrome-c oxidase activity” (GO:0004129), corresponding to respiratory chain complexes I and IV, respectively. Additionally, enriched biological processes in MFG included “mitochondrial electron transport” (GO:0006120), “electron transport coupled proton transport” (GO:0015990), and “ATP synthesis coupled electron transport” (GO:0042773). Moreover, annotation cluster 8 (enrichment score: 3.45) and cluster 9 (enrichment score: 2.43) grouped these GO terms with others from KEGG and UP_KW, such as “respiratory chain” (KW-0679), “electron transport” (KW-0249), “oxidative phosphorylation” (bta00190), “mitochondrial respiratory chain complex IV” (GO:0005751), and “mitochondrion” (KW-0496). These results highlight the central role of mitochondria in energy production to support milk synthesis, as they generate up to 90% of cellular ATP through oxidative phosphorylation72,73.

Caseins and whey proteins are the major milk proteins, and their synthesis is hormonally regulated41,61,71,74. In particular, prolactin, growth hormone, thyroid hormone and corticosteroids (including glucocorticoids and mineralocorticoids) play direct roles in milk synthesis71. GO analysis showed enrichment of biological processes such as “response to dehydroepiandrosterone” (GO:1903494), “response to 11-deoxycorticosterone” (GO:1903496), “response to progesterone” (GO:0032570) and “response to estradiol” (GO:0032355) (Fig. 3). The milk protein encoding genes CSN1S1, CSN1S2, CSN2, CSN3 and LALBA were associated with these processes. Furthermore, cluster 6 (enrichment score: 4.28) grouped these GO terms with INTERPRO and UP_KW entries related to milk proteins, such as “casein” (IPR001588), “milk protein” (KW-0494) or “secreted” (KW-0964), reinforcing their link to hormonal regulation. The results are consistent with previous studies; for instance, Jia et al.50 found similar hormonal process enrichment in a proteomic study of the MFGM in Holstein cows. Dehydroepiandrosterone (DHEA), 11-deoxycorticosterone (11DOC), estradiol and progesterone are endogenous steroid hormones that are released by endocrine glands and cells, distributed to various organs and tissues via the bloodstream, and regulate organ metabolism and physiological functions50,75. DHEA and 11DOC modulate responses to stressful stimuli50,76,77, with DHEA having anti-inflammatory and antioxidant properties with protective and regenerative functions50. Estradiol and progesterone promote MEC proliferation and differentiation in preparation for lactation74,78.

Regarding the most highly expressed genes in the MFG transcriptome (FPKM ≥ 4000; Table 2), 25 of the 11,257 detected genes (EdgeR) exhibited very high expression levels, accounting for 63.5% of the total reads. This result suggests that a small number of genes contribute the majority of the RNA transcripts from MFG. Notably, the top-expressed genes included those encoding major milk proteins, namely the four caseins (CSN1S1, CSN1S2, CSN2, and CSN3) and the two whey proteins (PAEP/BLG and LALBA). This aligns with previous reports from bovine mammary gland studies by Cánovas et al.38, Wickramasinghe et al.41 and Becket et al.40. Among them, the CSN1S1 and PAEP/BLG genes showed the highest FPKM values, consistent with Fang et al.79, who analysed RNA-seq data from 91 tissues and cell types, including mammary gland.

The GLYCAM1 gene was also among the most highly expressed genes, with an average FPKM of 49,494. GLYCAM1 is an MFGM-specific protein in milk and one of the most abundant in this fraction80. In the mammary gland, this gene may play a role in the cellular transport and secretion of MFG, while also providing a protective function for immunologically immature offspring81. The expression pattern of GLYCAM1 is similar to that of the milk protein genes, being induced during pregnancy and lactation under the hormonal control of prolactin and progesterone82.

Throughout lactation, milk fat yield increases due to enhanced de novo fatty acid synthesis in the mammary gland83. Indeed, several genes related to lipid metabolism were among the 25 most highly expressed. Among them, PLIN2 showed the highest expression (21,930 FPKM). This protein, located in the MFGM84,85,86,87,88, is involved in triglyceride accumulation and acts as a transcriptional regulator of other milk fat genes64. Additionally, the FABP3 gene, with 10,090 FPKM, plays a crucial role in triglyceride formation by transporting long-chain fatty acids into the nucleus and activating key receptors such as the peroxisome proliferator-activated receptor gamma (PPARγ)61,89. During milk fat synthesis, these triglyceride cores are initially enveloped by a monolayer membrane derived from the endoplasmic reticulum, forming lipid droplets. These droplets are then surrounded by an outer bilayer membrane from the apical plasma membrane of MEC, leading to the formation of MFG86,90,91,92. The XDH gene (5,808 FPKM), which encodes the redox enzyme xanthine dehydrogenase/oxidase, plays a key role in the incorporation of lipid droplets into the apical plasma membrane and the subsequent secretion of MFG. Vorbach et al.93 demonstrated that the knockout of XDH impairs MFG secretion, as it is essential for the proper function of proteins such as butyrophilin (BTN1A1)94. Therefore, the higher expression level of XDH compared to BTN1A1 (625.83 FPKM) is consistent with the essential role of XDH. BTN1A1 is also vital for the regulated secretion of lipid droplets, serving both as a structural and signalling receptor through its interaction with XDH94,95,96.

The presence of genes associated with the protein synthesis machinery (RPLP0, RPLP1, RPS2 and RACK1) among the most highly expressed is consistent with the increased protein synthesis demands during lactation, as previously mentioned. Additionally, the TPT1 gene, involved in functions essential for efficient milk production during lactation, such as cell survival, protein synthesis and calcium regulation97,98, also showed high expression (6,674 FPKM). Finally, the FTH1 gene, with an expression level of 4,260 FPKM, participates in iron homeostasis and is involved in various cellular functions, including cellular proliferation and immune responses99.

Several genes related to cellular energy supply were also abundantly expressed, notably components of the mitochondrial respiratory chain: cytochrome c oxidase (COX1, COX2 and COX3) and NADH-dehydrogenase (ND1, ND3 and ND4), along with the CYTB and ATP6 genes. These genes, which encode components of complexes IV, I, III and V, respectively, contribute to electron transport and proton pumping for aerobic energy generation in the form of ATP synthesis via mitochondrial oxidative phosphorylation73,100.

Taken together, all the highly expressed genes detected in our study are commonly expressed during lactation in cows61,86, and these results are consistent with previous studies conducted on MFG38,40,42.

After characterising the overall transcriptomic profile of the mammary gland through MFG analysis, we next assessed whether there were differential gene expression patterns between cows carrying the A1A1 and A2A2 β-casein genotypes. Only two DEGs were detected with each statistical package. The ND6 gene, involved in the energy production machinery via mitochondrial electron transport and proton pumping for ATP synthesis73, was down-regulated in A2A2 cows (log2FC = −1.3567; DESeq2). The average expression of ND6 was 86 FPKM. Similarly, the 16S Mt-rRNA gene, a key component of the mitochondrial ribosomes (mitoribosomes)101,102, showed lower expression in A2A2 cows (log2FC = −1.3086; DESeq2), with an average expression level of 6,484 FPKM in the A1A1 cows versus 2,607 FPKM in the A2A2 cows. These results may suggest slightly enhanced mitochondrial activity in A1A1 cows. In turn, EdgeR analysis also identified ND6 as down-regulated in A2A2 cows with respect to A1A1 cows (log2FC = −1.3433), together with the calpain-6 gene (CAPN6), which was up-regulated in A2A2 cows (log2FC = 3.7758). Calpain genes encode a family of complex multi-domain intracellular cysteine proteases that perform limited cleavage of target proteins in response to calcium signalling103. However, the specific role of this gene in the MEC remains unclear104. The overall DEG count was very low, with only two DEGs out of 11,257 genes expressed in MFG, as shown in the MA plot (Fig. 7). While the functional analysis of highly expressed genes revealed transcriptomic signatures related to milk synthesis and mammary epithelial activity, the minimal number of DEGs identified between genotypes limits the depth of genotype-specific functional interpretation. This may reflect either a true biological similarity between A1A1 and A2A2 cows or the limited sensitivity of differential gene expression analysis. Future studies using alternative analytical approaches, such as isoform-level differential expression analysis105,106, may help uncover additional regulatory differences not detectable in a gene-level analysis, or alternatively confirm the near absence of transcriptomic divergence between genotypes.
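As a reading aid for the log2FC values reported above, the following minimal Python sketch converts them into linear expression ratios (A2A2 relative to A1A1) and computes the unadjusted log2 ratio of the mean 16S Mt-rRNA FPKM values. The helper names fold_change and log2_ratio are illustrative and are not part of the study's pipeline. The unadjusted ratio is not expected to match the DESeq2 estimate exactly, since DESeq2 works on normalised counts rather than on mean FPKM values.

```python
import math


def fold_change(log2fc: float) -> float:
    """Convert a log2 fold change into a linear expression ratio."""
    return 2.0 ** log2fc


def log2_ratio(numerator: float, denominator: float) -> float:
    """Log2 of the ratio between two positive expression values."""
    return math.log2(numerator / denominator)


# log2FC values reported in the text (A2A2 relative to A1A1).
reported = {
    "ND6 (DESeq2)": -1.3567,
    "16S Mt-rRNA (DESeq2)": -1.3086,
    "ND6 (EdgeR)": -1.3433,
    "CAPN6 (EdgeR)": 3.7758,
}

for gene, lfc in reported.items():
    print(f"{gene}: log2FC = {lfc:+.4f} -> A2A2/A1A1 ratio = {fold_change(lfc):.2f}")

# Unadjusted log2 ratio from the mean FPKM values given for 16S Mt-rRNA
# (2,607 FPKM in A2A2 cows versus 6,484 FPKM in A1A1 cows).
raw_16s = log2_ratio(2607, 6484)
print(f"16S Mt-rRNA, ratio of mean FPKM: log2 = {raw_16s:.2f}")
```

Read this way, a log2FC close to −1.3 corresponds to expression in A2A2 cows at roughly 40% of the A1A1 level, whereas the CAPN6 estimate corresponds to an increase of more than tenfold.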

The distribution of the detected transcripts is shown as an MA plot (M = log ratio; A = mean average) in Fig. 7.

Fig. 7

MA plot (M = log ratio and A = mean average) of the log2 average normalised counts (logCPM) against the log2 fold-change (logFC) across genes, generated with the glimmaMA function in EdgeR. Each gene is represented by a grey dot. Significant differentially expressed genes (DEGs) are coloured in red (down-regulated) and in green (up-regulated).

Conclusion

The functional analyses of the highly expressed genes in MFG (FPKM ≥ 500) revealed significant enrichment of pathways associated with milk proteins and related hormones, milk fat secretion, ribosomal proteins, and cellular energy metabolism, which are commonly expressed during lactation and align with the high metabolic demands of this stage. In addition, key genes among the most highly expressed (FPKM ≥ 4000) were identified, including those involved in milk proteins (CSN1S1, CSN1S2, CSN2, CSN3; PAEP/BLG, LALBA), milk fat secretion (GLYCAM1, PLIN2, FABP3, XDH), ribosomal proteins (RPLP0, RPLP1, RPS2, RACK1), and, notably, components of the mitochondrial respiratory chain for energy production to support milk synthesis (COX, ND).

The differential expression analysis revealed highly similar gene-level transcriptomic profiles between A1A1 and A2A2 cows. Only two DEGs were identified between genotypes with each statistical package. Among them, ND6 and 16S Mt-rRNA are mitochondrial genes involved in OXPHOS and energy production, and their reduced expression may suggest slightly impaired mitochondrial activity in A2A2 animals. Overall, these results indicate that the β-casein genotype is not associated with substantial transcriptional differences in the mammary gland, a finding that may be relevant in the context of ongoing selective breeding strategies for A2A2 animals in the dairy sector. However, further research is needed to assess whether additional regulatory mechanisms, such as alternative splicing events or non-coding RNAs, may influence milk synthesis in the mammary gland beyond what is observable with a gene-level differential analysis.

Material and methods

Animals and housing

The cows used in this study belonged to a commercial farm (S.A.T Etxeberri, Navarre, Spain) and were housed in a freestall barn with individual cubicles equipped with sand bedding. Animals had free movement in the feeding and walking areas, with continuous access to fresh water. The barn was equipped with mechanical ventilation, including fans, to ensure air circulation, as well as cow brushes to promote comfort and welfare. Milking was performed using an Automated Milking System (AMS) (Lely Astronaut A3, Lely, Maassluis, the Netherlands). All cows received a total mixed ration (TMR) diet twice a day. The detailed ingredient composition, chemical composition and energy values of the TMR are provided in Table 4. In addition to the TMR, the cows received 6 kg of concentrate feed when being milked in the milking robot (204.4 crude protein (CP), 5.5 ether extract (EE), 55.4 crude fiber (CF), 361.6 starch, 11.3 Ca, 5.3 P, and 3.3 Mg, expressed as g/kg dry matter (DM)).

Table 4 Ingredient composition, chemical composition, and energy values of the total mixed ration (TMR) fed to dairy cows.

Since no animals were used, bred, or subjected to experimental procedures, specific ethical approval was not required. Only milk samples, provided by a commercial dairy farm, were used for analysis. The collection of milk samples by the responsible farm staff was performed in accordance with the regulations of the Ethics, Animal Experimentation and Biosafety Committee of the Public University of Navarre (2007–11-12), and the Spanish national regulation (RD 53/2013), which establishes the basic standards for animal safety in experimentation and other scientific purposes, including teaching107.

Animal selection and milk sample collection

A total of 14 healthy lactating Holstein cows were selected for the study based on β-casein genotype (CSN2 gene): A1A1 (n = 7) vs. A2A2 (n = 7). All cows were in their third lactation, with 225.8 ± 73.1 (mean ± SD) days in milk and an average age of 59.0 ± 3.8 months. They had similar milk production characteristics, with average milk yield (kg), fat percentage, and protein percentage of 47.8 ± 7.7, 3.83 ± 0.54, and 3.31 ± 0.22 for A1A1 cows, and 47.2 ± 5.7, 3.87 ± 0.55, and 3.56 ± 0.26 for A2A2 cows. These values were obtained from the official records (Spanish Federation of Holstein Cattle; CONAFE) for the test day closest to and preceding RNA sampling; they represent standardised phenotypic data widely used for genomic evaluations and phenotypic analyses. To increase genetic diversity and reduce potential biases arising from multiple daughters of the same sire, the selected cows were daughters of 12 different bulls. The study was conducted between mid-May and mid-June 2023, during the morning milking (9:00–11:00 am). Prior to sampling, all equipment used was disinfected with both 70% ethanol and RNaseZap (Ambion, Austin, TX, USA). Teats were also cleaned by applying both 70% ethanol and povidone iodine twice, followed by careful drying of the entire teat. The first portion of milk was discarded prior to sampling to avoid collecting foremilk and potential contamination, and to ensure sample representativeness. Milk was then manually collected from individual cows by the responsible farm staff and transferred into 50 mL RNase-free tubes. Approximately 20–25 tubes were collected from each animal to ensure a sufficient MFG fraction for RNA isolation. Whole milk samples were immediately snap-frozen in liquid nitrogen and transported to the laboratory, where they were stored at −80 °C until further analysis.

RNA isolation and quality assessment

RNA isolation from MFG was performed as described in detail by Jiménez-Montenegro et al. (2025)58. Briefly, milk samples stored at −80°C were thawed at room temperature and centrifuged at 6000 g for 10 min at 4°C. Then, the milk fat layer (the MFG fraction) was collected using a sterile spatula and transferred into 15 mL RNase-free tubes. The total volume of milk used from each cow varied (200 to 500 mL) depending on the MFG quantity collected from each cow (0.5–5 mg MFG fraction/50 mL tube). The collected MFG fraction was homogenised in Trizol LS reagent (Invitrogen, Waltham, MA, USA) at a proportion of 3 mL Trizol LS reagent per 1 mL of collected MFG, and the resulting mixture was stored at −80°C until used for RNA isolation. Total RNA was isolated from MFG using Trizol LS reagent (Invitrogen, Waltham, MA, USA) and the GenElute mammalian RNA isolation kit (Sigma-Aldrich, CA, USA). Finally, the isolated RNA was eluted into 40 µL of diethyl pyrocarbonate (DEPC) treated water. Then, the RNA quantity, quality and integrity were evaluated using different instruments to ensure accurate and reliable measurements, as previously described in Jiménez-Montenegro et al.58. The RNA quality and integrity were determined using the Nanodrop 2000 spectrophotometer (Thermo Scientific, Madrid, Spain), which determines the A260/280 and A260/230 ratios; the TapeStation 4200 RNA Screentape (Agilent Technologies, Santa Clara, CA, USA), which determines the RNA Integrity Number (RIN); and the Qubit 4 fluorometer (Life Technologies, CA, USA), which establishes the RNA Integrity and Quality (IQ) value. RNA quantity was measured with the Nanodrop 2000 spectrophotometer, the Qubit 4 fluorometer and the Victor Nivo Multimode Microplate Reader (PerkinElmer, Waltham, MA, USA). Briefly, the mean RNA concentration obtained with Nanodrop was 120.43 ± 22.27 ng/μL, while the Qubit HS assay and the Quant-iT RiboGreen assay yielded 102.87 ± 15.64 ng/μL and 109.43 ± 22.69 ng/μL, respectively. The mean A260/280 and A260/230 ratios were 2.03 ± 0.01 and 1.34 ± 0.06, respectively. The average RIN value of all the RNA extracts was 3.59 ± 0.27 (mean ± SE), which was below the conventional benchmark of 7 for RNA-seq studies108,109,110. In contrast, average IQ and A260/280 values were within the ranges reported in the literature as optimal for sequencing (9.51 ± 0.15 and 2.03 ± 0.01, respectively)111,112,113. However, as demonstrated in our previous methodological study using the same milk samples and animals58, milk fat globules are a non-conventional RNA source with inherently low rRNA content, leading to artificially low RIN values. Despite this, all samples passed the standard sequencing quality checks, including sequence read lengths (151 bp), base coverage (100%), base content (49%), base quality scores (36), and alignment rates (above 85%), indicating high-quality RNA suitable for transcriptome profiling.

Library preparation and RNA sequencing

Sequencing libraries were constructed using the Illumina TruSeq stranded mRNA kit (Illumina, San Diego, CA, USA)58. Briefly, messenger RNA (mRNA) molecules were purified from the total RNA and subsequently fragmented. The cleaved RNA fragments were reverse transcribed into first-strand complementary DNA (cDNA) using SuperScript II reverse transcriptase (Invitrogen, Waltham, MA, USA) and random primers. The second-strand cDNA was synthesised using DNA Polymerase I, RNase H and dUTP. Then, a single ‘A’ nucleotide was added to the 3’ ends of the blunt fragments to prevent them from ligating to each other during the subsequent adapter ligation reaction. The resulting products were then purified and enriched by PCR to create the final cDNA library. The libraries were quantified using KAPA Library Quantification kits for Illumina Sequencing platforms according to the qPCR Quantification Protocol Guide (Kapa Biosystems, Wilmington, MA, USA) and qualified using the TapeStation D1000 ScreenTape (Agilent Technologies, Santa Clara, CA, USA). Indexed libraries were then submitted to an Illumina NovaSeq6000 (Illumina, Inc., San Diego, CA, USA) platform for paired-end (2 × 150 bp) sequencing58.

Bioinformatics analysis

Quality control analysis of the raw sequencing data was performed using the FastQC tool version 0.11.9 (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/). Cutadapt software (version 4.0) was then used for trimming, in which indexing adapters and low-quality reads (those with ambiguous bases and/or shorter than 10 bases) were removed from the analysis. After trimming, the cleaned sequencing reads were again subjected to FastQC analysis. Subsequently, the cleaned reads were aligned to the bovine reference genome, ARS.UCD1.3 (Ensembl release 112), using the Spliced Transcripts Alignment to a Reference (STAR) aligner (version 2.7.11b). Within STAR, the quantMode option was used to quantify the number of sequencing reads aligned to each gene. Unmapped and multi-mapped reads were excluded from further analysis, and only uniquely mapped reads were considered.
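To illustrate the gene-level quantification step, the sketch below parses the per-gene count table that STAR writes when gene counting is enabled (ReadsPerGene.out.tab: a gene identifier followed by unstranded, forward-stranded and reverse-stranded counts, preceded by summary rows whose names start with "N_"). This is a minimal Python sketch and not the study's code: the function name, the toy table and its gene identifiers are hypothetical, and the choice of the reverse-stranded column is an assumption based on the dUTP-based stranded library preparation, since the column actually used is not stated.

```python
import io


def read_star_gene_counts(handle, column=3):
    """Parse a STAR ReadsPerGene.out.tab stream into {gene_id: count}.

    `column` is the 0-based field index: 1 = unstranded,
    2 = forward-stranded, 3 = reverse-stranded (assumed here).
    """
    counts = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        # Skip malformed lines and the summary rows
        # (N_unmapped, N_multimapping, N_noFeature, N_ambiguous).
        if len(fields) < 4 or fields[0].startswith("N_"):
            continue
        counts[fields[0]] = int(fields[column])
    return counts


# Hypothetical toy table in the ReadsPerGene.out.tab layout.
toy_table = io.StringIO(
    "N_unmapped\t1200\t1200\t1200\n"
    "N_multimapping\t800\t800\t800\n"
    "N_noFeature\t300\t5000\t350\n"
    "N_ambiguous\t40\t5\t38\n"
    "gene_a\t510\t12\t498\n"
    "gene_b\t0\t0\t0\n"
    "gene_c\t75\t3\t72\n"
)

gene_counts = read_star_gene_counts(toy_table)
print(gene_counts)
```

Reads flagged in the summary rows as unmapped or multi-mapping do not contribute to any gene, which is consistent with only uniquely mapped reads being carried forward.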

Outlier detection was carried out through a generalised linear model principal component analysis (GLM-PCA). Samples whose scores in the principal component space deviated substantially from the rest were considered potential outliers and removed from downstream analysis to avoid biasing the results114,115. As a consequence, one animal with the A1A1 genotype (cow 7) was considered an outlier and excluded.

A list of highly expressed genes was obtained by normalising the data through the calculation of FPKM with EdgeR (version 4.0.16). This FPKM value involved two normalisation steps: the number of fragments was normalised firstly by the sequencing depth of each sample in the read counts file, and secondly, by the transcript length of each gene in the bovine reference genome (ARS.UCD1.3; Ensembl release 112). Genes with an average FPKM value greater than 500 were considered highly expressed.
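The two normalisation steps described above correspond to the standard FPKM definition: the fragment count is divided by the library size in millions and by the gene length in kilobases. The following minimal sketch writes this out in plain Python rather than with the EdgeR function used in the study; all counts, gene lengths, library sizes and gene names are hypothetical toy values, and the 500 FPKM cut-off is the one stated in the text.

```python
def fpkm(fragments: float, gene_length_bp: float, library_size: float) -> float:
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragments * 1e9 / (gene_length_bp * library_size)


def mean_fpkm(counts, gene_length_bp, library_sizes):
    """Average FPKM of one gene across samples."""
    values = [fpkm(c, gene_length_bp, n) for c, n in zip(counts, library_sizes)]
    return sum(values) / len(values)


# Hypothetical toy data: two samples, two genes.
library_sizes = [10_000_000, 20_000_000]       # mapped fragments per sample
gene_lengths = {"gene_a": 1500, "gene_b": 2000}  # transcript length in bp
counts = {"gene_a": [15_000, 30_000], "gene_b": [1_000, 2_000]}

HIGH_EXPRESSION_CUTOFF = 500  # average FPKM threshold stated in the text

highly_expressed = [
    gene
    for gene in counts
    if mean_fpkm(counts[gene], gene_lengths[gene], library_sizes) > HIGH_EXPRESSION_CUTOFF
]
print(highly_expressed)
```

Because both the sequencing depth and the gene length appear in the denominator, a long gene and a short gene with the same raw count receive different FPKM values, which is what makes FPKM suitable for ranking genes by expression within a sample.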

Subsequently, a GO enrichment analysis of the highly expressed genes was performed using the Database for Annotation, Visualization and Integrated Discovery (DAVID). Multiple testing was accounted for using an FDR threshold of 0.05. Since DAVID supports various ontologies, a functional annotation clustering (FAC) was conducted for categories sharing a significant number of genes116,117. The use of a FAC reduces the complexity of handling similar redundant terms, allowing for a more focused biological interpretation at the group level117,118. The annotation categories considered in the FAC were: Functional Annotations (UP_KW_Molecular_Function, UP_KW_Biological_Process, UP_KW_Cellular_Component, UP_KW_PTM, and UP_SEQ_Feature), Gene Ontology (GOTERM_MF_Direct, GOTERM_BP_Direct, and GOTERM_CC_Direct), Interactions (UP_KW_Ligand), Pathways (KEGG_Pathway), and Protein Domains (INTERPRO, PIR_Superfamily and SMART). Annotation clusters with an enrichment score of at least 1 were considered significant.

To evaluate the DEGs from the RNA-seq read counts, two different Bioconductor packages run in RStudio (R version 4.3.2), DESeq2 (version 1.42.0)119 and EdgeR (version 4.0.16)120, were used. These two packages were selected because the literature supports their robustness and indicates that, when methods for identifying DEGs are integrated, their combination enhances sensitivity and leads to more reliable outcomes49,121. DESeq2 and EdgeR conduct pairwise comparisons between two or more groups using parametric tests, assuming that read counts follow a negative binomial distribution with a gene-specific dispersion parameter. The main differences between these packages lie in the estimation of the dispersion parameter and the normalisation methods employed49,121. Firstly, read counts were filtered and genes expressed at a very low level were removed from the analysis. Only genes with more than 10 counts in at least six samples were considered to be expressed. For this purpose, the fpm (fragments per million) function of the DESeq2 package (version 1.42.0) and the cpm (counts per million) function of EdgeR (version 4.0.16) were used, respectively. To determine DEGs between A1A1 and A2A2 cows for β-casein, after filtering, read counts were normalised by sequencing depth, gene expression dispersions were estimated, the model was fitted and the contrasts were performed. For DESeq2, genes with an adjusted p-value (Padj) < 0.05 and |log2 Fold Change (log2FC)| > 1 were considered differentially expressed. For EdgeR, genes with a False Discovery Rate (FDR) < 0.05 and |log2FC| > 1 were considered differentially expressed.
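The filtering and decision rules above can be sketched in a few lines of Python. This is an illustration under stated assumptions and not the DESeq2 or EdgeR implementation: the expression filter is applied here directly to read counts, the multiple-testing adjustment is assumed to be the Benjamini-Hochberg procedure (the default in both packages, although the text does not name the method), and all counts, p-values and gene names are hypothetical toy values. The 13 samples in the toy data mirror the 14 cows minus the excluded outlier.

```python
def is_expressed(counts, min_count=10, min_samples=6):
    """True if more than `min_count` reads occur in at least `min_samples` samples."""
    return sum(c > min_count for c in counts) >= min_samples


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest p-value, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def call_degs(log2fc, pvalues, alpha=0.05, lfc_cutoff=1.0):
    """Flag genes with adjusted p < alpha and |log2FC| > lfc_cutoff."""
    adjusted = benjamini_hochberg(pvalues)
    return [p < alpha and abs(l) > lfc_cutoff for l, p in zip(log2fc, adjusted)]


# Hypothetical toy counts for 13 samples.
toy_counts = {"gene_a": [25] * 13, "gene_b": [12] * 5 + [3] * 8}
kept = [gene for gene, c in toy_counts.items() if is_expressed(c)]

# Hypothetical test results for four genes that passed the filter.
toy_log2fc = [-1.4, 0.3, 2.1, -0.9]
toy_pvalues = [0.0004, 0.20, 0.03, 0.001]
flags = call_degs(toy_log2fc, toy_pvalues)

print(kept)
print(flags)
```

The fourth toy gene shows why both criteria matter: it remains significant after adjustment but is not called a DEG because its absolute log2FC does not exceed 1.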