Introduction

The widespread development of molecular techniques and bioinformatic algorithms has enabled the unraveling of the genetic mechanisms driving differences between individuals, which are mainly attributed to single-nucleotide variants (SNVs), small insertions and deletions (indels; < 50 bp), and structural variations (SVs)1,2,3. Structural variants are generally defined as inherited variations in DNA sequences larger than 50 bp that exhibit changes in copy number, orientation, and/or chromosomal location in relation to the reference genome4,5. The balanced SV types correspond to the mobile elements (inversions and translocations), which cause rearrangements of DNA sequences1,5. On the other hand, small DNA segments, such as insertions and deletions (indels) and copy number variants (CNVs), comprise the unbalanced SVs, in which gains or losses of DNA segments are observed. CNVs are the most studied SV types in livestock and encompass duplication and deletion events ranging in size from 1 Kb to 5 Mb3,6,7.

Compared to SNPs, SVs affect larger fractions of the genome and account for the greatest amount of total polymorphic content among individual genomes8. SVs may have potentially greater effects than SNPs, from modification of genes to disruptions in the regulation of gene expression, leading to recessive disorders6,9. A variety of array-based platforms, such as comparative genomic hybridization and SNP arrays, have been commonly used for genome-wide SV detection in livestock10. However, array-based approaches lack genome coverage, especially in segmental duplication regions, which are known hotspots for CNV formation11. The advantages of detecting SVs using whole genome sequencing (WGS) data include higher coverage and resolution, more accurate prediction of copy numbers, more precise detection of breakpoints, and higher power to identify novel SVs11. In cattle, the study of structural variations has received increasing attention because of their association with economically important phenotypes such as fatty acid profile12, meat tenderness13, feed conversion ratio14, growth traits15, milk production traits16, and milk somatic cell score17.

Although WGS has become more feasible in livestock, most studies on genome-wide structural variant detection are still conducted using SNP array-based approaches. Few published studies have explored structural diversity using whole genome sequencing data, all with fewer than 100 animals18,19,20,21,22,23,24. Furthermore, most have focused on the assessment of copy number variants (CNVs) rather than the other SV types. The detection of balanced and unbalanced structural variants using whole genome sequencing data may contribute to the unraveling of the molecular mechanisms underlying genetic diversity, breed adaptation, and evolution, which have not been well elucidated so far, especially in Nellore cattle. Therefore, the aims of this study were: (i) to detect structural variants using whole genome re-sequencing data of 151 representative Nellore bulls by combining calling algorithms; (ii) to discover non-redundant and highly frequent structural variant regions (SVR) in the analyzed bulls; (iii) to search for positional candidate genes and quantitative trait loci (QTL) overlapping the most frequent SVR in the population; and (iv) to assess the functional impact of positional candidate genes overlapping SVR through enriched gene ontology terms (GO terms) related to biological process (BP), molecular function (MF), cellular component (CC), and biochemical pathways.

Materials and methods

DNA samples and re-sequencing

A total of 151 Nellore bulls comprised the population analyzed in the current study. The ancestors were selected based on their contributions to the genetic diversity of the Brazilian Nellore population. For that, pedigree information of 2,688,124 related individuals, with a total of 9,811 sires, of which 6,040 were founders, and 915,371 dams, was used. The bulls selected for DNA re-sequencing were those least related to the others, with the highest number of progeny, and with the greatest contributions to the genotyped population, which consists of more than 60,000 genotyped animals from five Brazilian Nellore breeding programs: DeltaGen (http://deltagen.com.br/), Nelore Qualitas (https://qualitas.agr.br/), Cia de Melhoramento (https://www.ciademelhoramento.com.br), PAINT (https://www.paintmga.com.br/), and Centro de Pesquisa de Bovinos de Corte - Instituto de Zootecnia (https://iz.agricultura.sp.gov.br/). For further information regarding the selected ancestors, see Fernandes Junior et al.2.

Genomic DNA was extracted from semen samples (100 µL) using the DNeasy Blood and Tissue DNA Extraction Kit (Qiagen, Valencia, CA, USA). Each purified DNA sample was quality-controlled by agarose gel electrophoresis (1%). Extracted DNA was quantified based on the absorbance using the NanoDrop 1000 Spectrophotometer (Thermo Fisher Scientific, Santa Clara, CA, USA, 2007).

The whole-genome re-sequencing of the representative bulls was performed using Illumina sequencing by synthesis technology. Of all DNA samples, 52 were re-sequenced using the HiSeq X Ten and the other 99 with the NovaSeq 6000 System platform (Illumina Inc., San Diego, USA). DNA libraries were constructed with end-linked adapters, and the resulting libraries were re-sequenced by paired-end sequencing (sequencing of both ends of DNA fragments), which produced an average of 360 million reads per bull with a length of 150 bp (2 × 150 bp) and an average genome coverage of 14.5x, ranging from 7.8 to 26.3x. Poor quality reads were trimmed using SeqClean software (https://hpc.ilri.cgiar.org/seqclean/seqclean) with default parameters. SAMtools (v. 1.22)25 (https://www.htslib.org/download/) was used to index reads, and the alignment to the bovine reference genome ARS-UCD1.2 (http://www.ncbi.nlm.nih.gov/genome/?term=bos+taurus) was performed using HISAT2 (v. 2.2.0)26 (https://daehwankimlab.github.io/hisat2), in which 97.9% of reads were successfully paired. Further details regarding the re-sequencing and alignment can be found in Fernandes Junior et al.2 and Fernandes Junior et al.27.

Detection of structural variants

Detection of structural variants (SV) was conducted by adapting the workflow proposed by Letaief et al.28 in French dairy and beef cattle (Fig. 1). SV calling for each sample was performed using the Parliament2 software29, which integrates multiple bioinformatics tools for high-quality calling of structural variation events. The consensus strategy of Parliament2 (https://github.com/fritzsedlazeck/parliament2) provides information about the size, type, location, and calling method for each SV event. We used BAM and BAI files containing the whole genome nucleotide sequences of 151 Nellore bulls, as well as BAM and BAI files containing the bovine reference genome sequence (ARS-UCD1.2). For this dataset, we executed Parliament2 with default parameters for each sample through a single Docker image on the DNAnexus platform. A total of three bioinformatic tools were integrated through Parliament2: CNVnator (v. 0.4.1)30 (https://github.com/abyzovlab/CNVnator), Lumpy (v. 0.3.1)31 (https://github.com/arq5x/lumpy-sv), and Delly (v. 1.2.6)32 (https://github.com/dellytools/delly). All three are designed for detecting structural variants (SVs) from whole-genome sequencing (WGS) data. CNVnator identifies copy number variations (CNVs) by analyzing read depth. Lumpy utilizes a hybrid approach that combines paired-end, split-read, and read-depth signals to detect various types of structural variants, including deletions, duplications, inversions, and translocations. Delly, on the other hand, employs paired-end and split-read mapping techniques to accurately identify a wide range of structural variants, including complex rearrangements.

Fig. 1

Flowchart of the pipeline used to detect structural variants (SV). CNV copy number variants, RO reciprocal overlap, DEL deletions, DUP duplications, INV inversions. First, SVs were detected separately by each tool within each sample through the integration software Parliament2. Second, filtering was applied at the software level, following the suggested criteria for each tool. Only SVs that passed all filters were retained and compared within the sample to retain common CNVs and INVs detected by the different tools. CNVs should share at least 90% of their sequence between the three tools, and INVs should share at least 90% of their sequence between Delly and Lumpy calls. Own work.

After running Parliament2, we obtained VCF files containing the SVs called by each bioinformatic approach for each of the 151 Nellore bulls. Each bull had a total of 6 VCF files: 2 VCF files containing the variants detected by CNVnator and Lumpy, respectively, and 4 VCF files containing each variant type detected by Delly (duplications, deletions, insertions, and inversions).

Filtering of called structural variants

The filtering of detected SVs was performed at individual and population levels in order to remove unreliable raw SVs and reduce the false discovery rate. For the following analyses, different sets of utilities integrated by BCFtools33 and Bedtools34 were used. All called SVs were first filtered at the bioinformatic tool level, following specific criteria as suggested for each SV caller.

  1. CNVnator: we only retained the called structural variants that were significant in the t-test (p-value < 0.05), for which the null hypothesis is that the mean read depth signal in an SV region is the same as the mean read depth signal in the sample.

  2. Delly: only called SVs that passed the quality filters (i.e., flagged PASS) and were supported by more than 3 paired-end events (PE support > 3) were kept. After filtering, the VCF files containing the remaining DEL, DUP, and INS variants were joined using the option concat of BCFtools to obtain a single VCF containing the copy number variants (CNVs), while the VCF containing the filtered INV variants was kept separate.

  3. Lumpy: SVs supported by more than 1 split-read event or more than 3 paired-end events (SR > 1 or PE > 3) were retained.
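The three caller-level criteria above can be sketched as simple predicates. This is only an illustration, not the BCFtools commands used in the study: the record layout and the field names (`t_test_pvalue`, `filter`, `pe`, `sr`) are hypothetical stand-ins for the corresponding VCF fields, and the example records are invented.

```python
# Hypothetical, simplified call records. Field names are illustrative
# and do not mirror the exact VCF tags written by each caller.

def keep_cnvnator(call, alpha=0.05):
    """Keep CNVnator calls whose t-test p-value is below alpha."""
    return call["t_test_pvalue"] < alpha

def keep_delly(call, min_pe=3):
    """Keep Delly calls flagged PASS with more than min_pe paired-end events."""
    return call["filter"] == "PASS" and call["pe"] > min_pe

def keep_lumpy(call, min_sr=1, min_pe=3):
    """Keep Lumpy calls with SR > min_sr or PE > min_pe."""
    return call["sr"] > min_sr or call["pe"] > min_pe

FILTERS = {"cnvnator": keep_cnvnator, "delly": keep_delly, "lumpy": keep_lumpy}

def filter_calls(calls):
    """Apply the caller-specific predicate to every call."""
    return [c for c in calls if FILTERS[c["caller"]](c)]

# Invented records, used only to exercise the predicates.
example_calls = [
    {"id": "c1", "caller": "cnvnator", "t_test_pvalue": 0.01},
    {"id": "c2", "caller": "cnvnator", "t_test_pvalue": 0.20},
    {"id": "d1", "caller": "delly", "filter": "PASS", "pe": 5},
    {"id": "d2", "caller": "delly", "filter": "LowQual", "pe": 9},
    {"id": "d3", "caller": "delly", "filter": "PASS", "pe": 3},
    {"id": "l1", "caller": "lumpy", "sr": 2, "pe": 0},
    {"id": "l2", "caller": "lumpy", "sr": 1, "pe": 3},
]
kept_ids = [c["id"] for c in filter_calls(example_calls)]
```

Note that all thresholds are strict inequalities, so a Delly call with exactly 3 supporting paired-end events is discarded.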

The SVs that passed the described criteria were compared at the individual level (i.e., within the same sample) to identify a high-confidence set of SVs for each bull (composed only of SVs common to all bioinformatic tools). As each of the three callers used has a different power to detect a specific SV type, such as CNVnator, which only detects CNVs, the within-sample analysis was performed for each SV type separately:

  1. Identification of high-confidence CNVs: the consensus files containing the filtered variants from CNVnator and Lumpy, and the concatenated and filtered file containing duplications, deletions, and insertions from Delly, were compared against each other using the option intersect of Bedtools by fitting a minimal reciprocal overlap threshold of 90%.

  2. Identification of high-confidence INVs: the consensus-filtered file containing the SVs detected from Lumpy and the one containing the inversions detected from Delly were compared against each other using the option intersect of Bedtools by fitting a minimal reciprocal overlap threshold of 90%.

A reciprocal overlap threshold of 90% means that an SV detected by one caller is deemed a high-confidence SV for a given bull only if it overlaps an SV predicted by the other caller(s) and the shared segment covers at least 90% of the length of each of the compared SVs.
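A minimal sketch of the reciprocal overlap criterion for a pair of calls follows. It assumes half-open, BED-style coordinates and hypothetical `(chrom, start, end)` tuples, and it illustrates the pairwise rule only, not how the Bedtools comparisons were chained across the three callers.

```python
def reciprocal_overlap(a, b):
    """Smallest fraction of either interval that is covered by the other.

    a and b are (chrom, start, end) tuples with half-open coordinates.
    Returns 0.0 when the intervals are on different chromosomes or disjoint.
    """
    chrom_a, start_a, end_a = a
    chrom_b, start_b, end_b = b
    if chrom_a != chrom_b:
        return 0.0
    shared = min(end_a, end_b) - max(start_a, start_b)
    if shared <= 0:
        return 0.0
    return min(shared / (end_a - start_a), shared / (end_b - start_b))

def is_concordant(a, b, threshold=0.90):
    """True when the overlap covers at least `threshold` of both intervals."""
    return reciprocal_overlap(a, b) >= threshold

# Invented coordinates, for illustration only.
call_1 = ("1", 1000, 2000)   # 1,000 bp
call_2 = ("1", 1050, 2000)   # 950 bp, fully inside call_1
call_3 = ("1", 1000, 3000)   # 2,000 bp, contains call_1
```

Here `call_1` and `call_2` are concordant (the shared 950 bp cover 95% of `call_1` and 100% of `call_2`), whereas `call_1` and `call_3` are not: the shared 1,000 bp cover all of `call_1` but only 50% of `call_3`. This asymmetry is why the criterion is applied in both directions.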

After executing the within-sample filtering, we obtained two VCF files for each bull, one containing the high-confidence inversions (INVs) and the other containing the high-confidence copy number variants (CNVs). All VCF files containing each type of high-confidence SV were joined using the option concat of BCFtools to obtain two files containing the high-confidence CNVs and INVs, respectively, for all bulls. CNVs and INVs were then submitted to additional filtering, in which SVs located on the sex chromosomes and SVs shorter than 1 Kb or longer than 5 Mb were removed. The remaining CNVs and INVs were considered for downstream analysis.
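The additional filter on chromosome and length can be expressed as a single predicate. The sketch below assumes that autosomes are named "1" to "29" and that coordinates are half-open. Both are assumptions about the file layout rather than details given in the text.

```python
# Assumed naming of the 29 bovine autosomes: "1" .. "29".
AUTOSOMES = {str(i) for i in range(1, 30)}

def passes_length_and_autosome_filter(chrom, start, end,
                                      min_len=1_000, max_len=5_000_000):
    """Keep autosomal SVs whose length lies between 1 Kb and 5 Mb (inclusive)."""
    length = end - start
    return chrom in AUTOSOMES and min_len <= length <= max_len
```

For example, a 999 bp call on chromosome 5 or a 2 Kb call on chromosome X would both be removed under this rule.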

Identification of structural variant regions (SVRs)

The joining of structural variants into regions is required for population-wide analysis, as it allows the detection of DNA segments harboring different structural variation events35. In the present study, the structural variant regions (SVRs) were built by merging, separately for each of the two types of high-confidence SVs (CNVs and INVs), the calls that overlapped by at least 1 bp across all bulls. For this, we used the utility merge of Bedtools (https://github.com/arq5x/bedtools2) with operations such as count and collapse, which allowed the identification of the samples sharing a certain region and the SV events occurring in it. Regions that were not shared by at least 5% of the bulls (N = 7) were removed. As a result, we obtained two sets, one containing the regions of copy number variants (CNVRs) and the other the regions of inversions (INVRs), which were then joined in a single file using the option concat of BCFtools to facilitate the descriptive analysis of SVRs (Fig. 2).
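The region-building step can be illustrated with a small interval-merging routine. This is a sketch under stated assumptions rather than a reimplementation of Bedtools merge: calls are hypothetical `(chrom, start, end, sample, sv_type)` tuples with half-open coordinates, only calls overlapping by at least 1 bp are merged (book-ended calls are kept apart), and support is counted as the number of distinct samples in a region.

```python
def build_regions(calls, min_samples=7):
    """Merge overlapping calls into regions and keep well-supported ones.

    calls: iterable of (chrom, start, end, sample, sv_type) tuples.
    Returns a list of dicts with the merged coordinates, the set of
    samples carrying a call in the region, and the set of SV types seen.
    """
    regions = []
    current = None
    for chrom, start, end, sample, sv_type in sorted(
            calls, key=lambda c: (c[0], c[1], c[2])):
        overlaps = (current is not None
                    and chrom == current["chrom"]
                    and start < current["end"])   # at least 1 bp shared
        if overlaps:
            current["end"] = max(current["end"], end)
            current["samples"].add(sample)
            current["types"].add(sv_type)
        else:
            if current is not None:
                regions.append(current)
            current = {"chrom": chrom, "start": start, "end": end,
                       "samples": {sample}, "types": {sv_type}}
    if current is not None:
        regions.append(current)
    return [r for r in regions if len(r["samples"]) >= min_samples]

# Invented calls, with a lowered support threshold for the toy example.
toy_calls = [
    ("1", 100, 200, "bull_A", "DEL"),
    ("1", 150, 300, "bull_B", "DEL"),
    ("1", 300, 400, "bull_C", "DUP"),   # book-ended, not merged
    ("2", 100, 200, "bull_A", "DEL"),
]
toy_regions = build_regions(toy_calls, min_samples=2)
```

In the toy example only the region 100 to 300 on chromosome 1 is retained, because it is the only one carried by two different bulls.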

Fig. 2

Flowchart of the pipeline used to detect structural variant regions (SVR). CNV copy number variants, RO reciprocal overlap, DEL deletions, DUP duplications, INV inversions, SVR structural variant regions, CNVR regions of copy number variants, INVR regions of inversions. All sets of detected high-confidence CNVs and INVs were merged across all samples with a minimal reciprocal overlap of 1 bp. Own work.

Statistical analysis and graphical visualization of SVs and SVRs

All reported descriptive statistics and plots were produced using functions and utilities of the R package tidyverse36 through the Integrated Development Environment RStudio. The association of different metrics, such as the number, distribution, and (average) length of the predicted SVs and SVRs, with chromosome size and genome coverage rate, was assessed through the pairwise Pearson correlation method using the R package PerformanceAnalytics37.
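For readers unfamiliar with the method, the Pearson correlation coefficient and its associated t statistic can be computed as in the sketch below. It is written in Python for illustration only and is not the R code used in the study. The function names and the toy data are hypothetical. With one observation per autosome (n = 29), the t statistic has n - 2 = 27 degrees of freedom, and the p-value is obtained from the corresponding Student t distribution (not computed here).

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)

def t_statistic(r, n):
    """t statistic for testing r = 0, with n - 2 degrees of freedom (|r| < 1)."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))

# Toy data, unrelated to the study.
x = [1, 2, 3, 4]
y = [1, 3, 2, 4]
r = pearson_r(x, y)
t = t_statistic(r, len(x))
```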

Functional analysis of highly frequent SVR

The functional implications of the predicted SVR were assessed by searching for genes and QTLs overlapping the most frequent SVR in the 151 bulls (SVR present in more than 50% of bulls, N > 75) using the GALLO package of R38. Genes and QTLs were retrieved from the Ensembl Genes database39 and the cattle database of Animal QTLdb40, respectively. Only significant QTLs (p-value < 0.05) and protein-coding genes overlapping the most frequent SVR were reported in this study.

Functional enrichment analysis for positional candidate genes overlapping the most frequent SVR was performed through PANTHER 17.0 software (https://pantherdb.org/). PANTHER statistical overrepresentation test was conducted using, as an analyzed list, the set of positional candidate genes harbored by the most frequent SVR (a total of 1,164 genes) and, as a reference list, the Bos taurus genes contained in PANTHER databases. The annotation data sets were PANTHER GO-slim Molecular Function, PANTHER GO-slim Biological Process, PANTHER GO-slim Cellular Component, PANTHER Pathway, and Reactome Pathway. A Binomial test was applied to test whether the genes overlapping the SVR were enriched for the gene ontology categories and biochemical pathways. This assesses the probability of observing a certain number of genes related to a specific biological mechanism within the found gene set, given all genes in the Bos taurus reference genome. As different biological functions were tested at the same time, the Bonferroni correction method for multiple testing was used, which allows for reducing the probability of false positive associations of genes and gene ontologies and/or pathways. Only genes with significant associations (p-value < 0.05) with gene ontologies (GO) or pathways were reported. The Tidyverse (https://www.tidyverse.org/) and ggplot2 (https://ggplot2.tidyverse.org/) R packages were used to graphically represent the QTL type and trait ID of significant QTLs, as well as the significantly enriched GO and/or pathways, through pie charts and bar plots.
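The logic of a binomial overrepresentation test with Bonferroni correction can be sketched as follows. This is an illustration of the general technique, not PANTHER's implementation: the function names are hypothetical, the expected proportion is assumed to be the share of reference genes annotated to the category, and the counts in the usage example are invented.

```python
import math

def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), assuming 0 < p < 1.

    Terms are computed in log space to avoid overflow for large n.
    """
    total = 0.0
    for i in range(k, n + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1)
                   - math.lgamma(n - i + 1)
                   + i * math.log(p) + (n - i) * math.log(1.0 - p))
        total += math.exp(log_pmf)
    return min(total, 1.0)

def overrepresentation(hits, list_size, ref_hits, ref_size, n_tests):
    """Raw and Bonferroni-adjusted p-values for one annotation category.

    hits / list_size: annotated genes in the analyzed list.
    ref_hits / ref_size: annotated genes in the reference list.
    n_tests: number of categories tested simultaneously.
    """
    expected_p = ref_hits / ref_size
    raw = binom_upper_tail(hits, list_size, expected_p)
    return raw, min(1.0, raw * n_tests)

# Invented counts: 30 of 1,164 analyzed genes fall in a category that
# contains 200 of 20,000 reference genes, with 100 categories tested.
raw_p, adjusted_p = overrepresentation(30, 1164, 200, 20000, 100)
```

The Bonferroni adjustment multiplies each raw p-value by the number of tests (capped at 1), which is equivalent to comparing the raw p-value against 0.05 divided by the number of tests.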

Results and discussion

Genome-wide SV detection

Number of high-confidence SVs

A total of 215,031 high-confidence structural variants passed all the described filtering criteria in the analyzed bulls. Summary statistics for the number and length of high-confidence SVs, including CNVs (deletions and duplications) and inversions, are presented in Table 1. Although insertions were detected by the SV callers, none remained in the CNV set after applying the filtering criteria and the reciprocal overlap, as they were detected with low quality and corresponded to small insertions (average length of 1 bp), which are not considered structural variation.

Table 1 Statistics summary for the number and length of detected high-confidence SVs.

In this study, SVs were detected using three complementary callers, namely CNVnator (based on read-depth), Delly (based on paired-end mapping), and Lumpy (which integrates both paired-end and split-read signals). Each tool’s output was first filtered using tool-specific quality thresholds. Then, for each SV type, only the variants that were shared across all callers able to detect that type within the same individual, based on a 90% reciprocal overlap threshold, were retained as high-confidence SVs. Although this approach did not allow quantifying partial overlaps between specific pairs of tools (e.g., CNVnator vs. Delly), it ensured a highly stringent consensus set, minimizing the false discovery rate by focusing exclusively on SVs consistently detected by all methods. This strategy favored precision over sensitivity and reflected the complementary nature of the tools used, each with different strengths depending on SV type and detection principle. As such, the 215,031 high-confidence structural variants reported represent a conservative yet robust subset for downstream analyses (Supplementary Table S1).

Of the total number of detected SVs, 197,045 corresponded to CNVs, and the number of deletions was 13.06 times higher than the number of duplications (Table 1). This result is similar to that reported by Silva et al.22, who detected 11.8 times more deletions than duplications in Nellore cattle using whole genome sequencing data. It also agrees with the findings of Santana et al.14, in which the number of loss events was higher than the number of gain events for the CNVs detected from SNP data in Nellore cattle, although the ratio presented here is higher than that reported by the latter authors (1.18 times more losses than gains). Similarly, our result is in the same direction as the ratio reported by Braga et al.20 in Gyr cattle (1.96 times more deletions than duplications), who detected CNVs from WGS using CNVnator and Delly software, and agrees with the findings of Hu et al.41 in different cattle breeds, including Nellore, where the number of deletions was higher than the number of duplications. Interestingly, the authors pointed out that the proportion of observed deletions was higher in Bos indicus than in Bos taurus animals. In contrast, the ratio of deletions to duplications observed in this study diverges from what was reported by Peripolli et al.18, who used whole genome sequencing data from three Brazilian locally adapted taurine breeds, where the number of observed duplications was 1.85 times higher than the number of observed deletions.

The remarkably elevated number of deletions may be related to the mechanisms causing structural variations within the genome. Non-allelic homologous recombination (NAHR) is known as the main mechanism causing deletion and translocation events42. The findings of our study, supported by previous reports in Nellore and other Bos indicus breeds, may suggest that the deleted DNA fragments have driven the evolution and diversity of Nellore cattle genomes, but this can only be confirmed by investigating the genomes of a higher number of ancestors. Besides, it is plausible that the alignment of sequences to the Bos taurus reference genome led to bias. The comparison of copy number likely flagged sequences present in Bos taurus that are not present in the indicine cattle genomes, leading algorithms to consider those sequences as deleted fragments41. Thus, a better characterization of structural variation in indicine cattle requires the availability of a reliable Bos indicus reference genome.

Regardless of the algorithm used to detect structural variants from short-read sequencing, a higher number of deletions than duplications is often expected. The read pair approach detects duplicated copies from mapped reads located at a shorter distance than the fragment length, and when the length of the duplicated sequence is larger than the insert size, the algorithm tends to fail in the accurate detection of that duplication43. On the other hand, the read depth (RD) algorithm detects deletions more easily, as they are signaled by a decrease in the read depth or coverage at a certain genomic region1. Duplications, in contrast, may not always result in a straightforward increase in read depth, since the duplicated region may have variable copy numbers and irregular coverage, making detection more challenging43. Furthermore, increases in the read depth or coverage are not always indicative of duplicated copies. These could be caused by alignment artifacts that cannot be differentiated from a true duplication through the RD approaches44.

Combining different SV calling algorithms should reduce the false discovery rate compared to the use of a single approach. Furthermore, the application of a stringent merge strategy and filtering criteria, as conducted in this study, may ensure a more reliable identification of copy number variants28. The increase in the accuracy of detection of all types of SVs can be achieved through long-read sequencing (up to 80 Kb), in which single reads span complex structural variants, but its use on a large scale continues to be limited because of the high cost8.

In relation to the balanced structural variants, inversions represented 8.36% of the total predicted high-confidence SVs (Table 1). The proportion of inversions observed in the present study was lower than that reported by Boussaha et al.24, who used WGS data from 380 animals of different Bos taurus dairy and beef cattle breeds, in which the inversions represented more than 30% of the total number of detected SVs. Our results also differ from the report by Chen et al.45, where the proportion of inversions was higher than 20% of the total number of detected SVs in a study using WGS data from 308 Holstein animals. However, our results agree with the report by the latter authors on Jersey cattle, in which inversions represented 9% of the total number of SVs detected. The proportion of detected inversions in our study was higher than that reported by Lee et al.46 (~ 4%), who used RD and SR approaches to detect SVs from WGS data in Holstein cattle.

There is still a lack of knowledge regarding the characterization of balanced structural variants, such as inversions, in Nellore cattle. This is because most of the reported studies in this breed are based on SNP genotyping data12,14,22, from which it is not possible to identify mobile elements but only copy number gains and/or copy number losses4. Nevertheless, there is an increasing interest in creating structural variation maps in cattle, with a higher number of published studies using whole genome sequencing data in taurine dairy and beef cattle than in indicine cattle.

In evolutionary genomics, chromosomal inversions are known as a powerful force contributing to local adaptation and diversification, as they have the potential to protect inverted sequences from recombination, allowing the maintenance of favorable allele combinations47. Despite their evolutionary importance, there is not enough information to enable the characterization of this type of SV in cattle, and most studies continue to focus on deletion and duplication events. This is likely because the detection of SVs is often conducted through the RD approach, which only allows the identification of CNVs, since balanced rearrangements (inversions and translocations) do not cause changes in the read depth signature5,48. Furthermore, the complex nature of these variants makes their discovery and interpretation more challenging when compared to the other SV types. The breakpoints of inversions and translocations usually lie within complex repeats that reduce mappability, so their reliable detection requires the use of bioinformatics tools based on a combination of algorithms, such as split reads and paired reads48.

Length classification of high-confidence SVs

The high-confidence SVs have an average length of 86.9 ± 365.1 Kb, ranging from 1 Kb to 4,989.5 Kb (Table 1). Although the CNVs, particularly the deletions, account for most of the detected SVs, deletions are, on average, the smallest (22.68 ± 167.31 Kb), while inversions are the longest (613.03 ± 851.34 Kb), followed by duplications (250.61 ± 560.12 Kb) (Table 1). Analysis of the distribution of SVs according to different size classes revealed that most CNVs are 1–5 Kb long (73%) (Fig. 3a), while inversions are distributed in all size classes, but most of them are longer than 200 Kb (48%) (Fig. 3b). Among the predicted deletions, the highest proportions fall into the classes 1–5 Kb and 5–10 Kb (77.2 and 16%, respectively), whereas predicted duplications are distributed in all size classes, with the lowest proportion being 5–10 Kb long (7.2%) (Table 2).

Fig. 3

Distribution of the detected structural variants according to different size classes for copy number variants (CNVs—a) and inversions (INVs—b).

Table 2 Statistics summary of the number and proportion of each SV type according to different size classes.

The distribution of high-confidence CNVs in the established size classes agrees with the results of Bickhart et al.10, who observed significantly shorter CNVs (< 15 Kb) in a Nellore individual than the average length of the CNVs detected in different taurine individuals using WGS data. In contrast, our results differ from those reported by Lemos et al.12 in Nellore cattle, who observed that most of the CNVs detected from SNP data had lengths ranging from 10 to 500 Kb and that less than 30% of the CNVs were 1–10 Kb long. The results presented here also differ from those reported by Silva et al.22 in Nellore cattle, in which the highest number of CNVs predicted from SNP data ranged from 100 to 900 Kb.

It is important to note that the differences in CNV length between the report in the present study and the results reported in the literature are mostly attributed to the nature of the data used and, consequently, to the bioinformatic algorithm for CNV calling. WGS data provides higher resolution compared to SNP data, as it allows the identification of sequence reads spanning the entire genome. WGS-based algorithms enable the identification of smaller structural variants, especially CNVs, with greater power3. In contrast, CNV detection using SNP genotyping data often relies on analyzing changes in signal intensity at SNP probes. As there is a wide range of commercial SNP arrays, they all have different probe densities, which directly impact CNV results. SNP-based detection methods exhibit good performance in detecting large events, but compared to WGS-based algorithms, they tend to lack power in detecting smaller CNVs (< 10 Kb) because of the low SNP probe density and resolution4.

In general, the inversions are the largest structural variants detected, with an average length of 613.03 ± 851.34 Kb (median of 167.45 Kb) (Table 1). The length of inversions observed in this study is greater than that reported by Chen et al.45 in Holstein (median ~ 2.53 Kb) and Jersey (median ~ 1.34 Kb) cattle. Regarding the size distribution of inversions, 48% fall into the size class > 200 Kb, 22% into the size class 50–200 Kb, while a low proportion (3%) falls into the size class 5–10 Kb (Fig. 3b; Table 2). These results differ from those reported by Boussaha et al.24 in European dairy and beef cattle, in which more than 50% of the detected inversions were distributed into the size class corresponding to 1–10 Kb.

Fig. 4

(a) Number of high-confidence structural variants identified on each of the 29 autosomal chromosomes. DEL deletions, DUP duplications, INV inversions. (b) Average length (kb) of the predicted high-confidence structural variants across the 29 autosomes. (c) Average proportion (%) of each chromosome covered by high-confidence structural variants (average coverage rate).

It is noteworthy that the combination of tools and SV calling algorithms enabled the detection of large inversions, as they tend to be harder to identify. Little reliable information is available regarding inversions and translocations in livestock species, especially in cattle, which limits comparisons. More studies of inversions are available in humans than in cattle, as there is an increasing interest in assessing the impact of inverted chromosome regions on phenotypes and diseases49. The vast majority of inversions in the human genome available in the InvFEST public database50 are shorter than 100 Kb, of which variants shorter than 10 Kb are considered reliable, and those larger than 10 Kb are probably false positives49.

Nevertheless, genomic studies on inversions reviewed by Wellenreuther & Bernatchez47 showed that the average length of chromosomal inversion events in different species of plants and animals is 8.4 Mb, ranging from 130 Kb to 100 Mb. According to studies in evolutionary genomics in different species, including invertebrates and mammals, large chromosomal inversions are more favored by selection than shorter inversions, exhibiting lower rates of recombination between a larger number of genetically distant loci47,51.

Chromosomal location and distribution of high-confidence SVs

The predicted high-confidence structural variants were not uniformly distributed across the 29 autosomal chromosomes. The top five chromosomes harboring the highest number of predicted SVs are BTA1 (N = 14,996), BTA2 (N = 13,773), BTA5 (N = 12,834), BTA4 (N = 12,16), and BTA10 (N = 10,372), while the top five chromosomes harboring the smallest number of SVs are BTA25 (N = 2,770), BTA19 (N = 3,145), BTA22 (N = 3,676), BTA27 (N = 4,035), and BTA24 (N = 4,607). The distribution of each SV type along the 29 autosomes is shown in Fig. 4.

Regarding each SV type, BTA1, BTA2, and BTA5 carry the highest number of deletions, whereas BTA23, BTA15, and BTA10 carry the highest number of duplications, and BTA4, BTA10, and BTA5 harbor the highest number of inversions (Fig. 4a). Our findings regarding the distribution of SVs along the chromosomes are consistent with those observed by Peripolli et al.18 when analyzing WGS data from three Brazilian locally adapted taurine breeds, in which the largest number of structural variants was found on BTA1 and 15. However, the authors only predicted CNVs (duplications and deletions). The chromosomal locations observed here are also similar to those reported by Letaief et al.28 in different French dairy and beef cattle breeds, in which the largest number of SVs was observed on chromosomes 1, 6, and 5 and the smallest number on chromosome 25. The authors also combined different tools for SV calling. The use of different bioinformatic algorithms when analyzing WGS data, as well as the filtering and validation of SVs at software, individual, and population levels, can lead to the identification of high-confidence structural variants. In contrast, the chromosomal location of SVs was different from that reported in Nellore cattle from SNP and WGS data by Silva et al.22, in which most predicted structural variations were located on BTA19, BTA11, BTA2, and BTA18 for SNP-based SV calls, and on BTA12, BTA15, BTA1, and BTA2 for WGS-based SV calls. Note that the mentioned study focused only on CNV assessment.

Chromosomal extension covered by high-confidence SVs

BTA10 ranked fifth in the total number of SVs it carried, but the average length of its SVs (728.36 Kb) was notably high relative to the average length of SVs located on the other chromosomes, followed by SVs located on BTA9, 23, 5, and 15, with average lengths ranging from 102.98 Kb to 143.48 Kb (Fig. 4b). However, the Pearson correlation provided no statistical evidence of an association between the number of SVs and the average length of SVs harbored by each chromosome (Pearson correlation coefficient r = 0.197, p-value = 0.3053). Our findings were similar to those reported by Peripolli et al.18, in which the longest CNV, corresponding to deletion events, was observed on chromosome 10. BTA10 had the largest proportion of chromosome size covered by structural variation (average coverage rate = 50.4%), followed by BTA23, 9, 15, and 28, with average coverage rates ranging from 7.6 to 11% (Fig. 4c). The lowest average coverage rate was observed on BTA25, which was also the chromosome harboring the lowest number of structural variants and the smallest ones, followed by BTA17, 19, 16, 14, 8, and 3, with average coverage rates ranging from 0.9 to 1.1%. These results differ from those reported by Letaief et al.28, where the highest percentage of sequence covered by copy number variations in French dairy and beef cattle breeds was found on chromosome 27, but agree in that the sequence least covered by structural variants was located on BTA25.

Pearson correlation analysis revealed a significant, strong, and positive association between the number of structural variants and chromosome size (Pearson correlation coefficient r = 0.87, p-value < 6.8e-10), which agrees with the findings of Letaief et al.28 but differs from the results reported by Boussaha et al.24 in French dairy cattle. In contrast, no statistical evidence of a correlation between chromosome size and the average proportion of chromosome extension covered by SVs (average coverage rate) was found in this study (Pearson correlation coefficient r = 0.04, p-value = 0.833).
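The correlation tests reported above can be illustrated with a minimal sketch, assuming the usual t-test for a Pearson coefficient with n - 2 degrees of freedom across the 29 autosomes. The function names (`pearson_r`, `t_pdf`, `two_sided_p`) are hypothetical and the numerical integration is a simplification chosen for this illustration, not the software used in the study.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def two_sided_p(r, n, steps=2000):
    """Two-sided p-value for H0: rho = 0, using t = |r| * sqrt((n - 2) / (1 - r^2)).

    The central probability P(-t < T < t) is integrated with Simpson's rule
    (steps must be even), so very small p-values from strong correlations
    are not resolved accurately by this sketch.
    """
    if abs(r) >= 1:
        return 0.0
    df = n - 2
    t = abs(r) * math.sqrt(df / (1 - r * r))
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = 2 * (h / 3) * total
    return max(0.0, 1 - central)

# Reported coefficient for SV count vs. average SV length, n = 29 autosomes
p_count_vs_length = two_sided_p(0.197, 29)
```

Under these assumptions, a coefficient of 0.197 over 29 chromosomes yields a p-value of about 0.31, in line with the non-significant result reported in the text.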

At the individual level, an average of 1,437 SVs (± 894.7) was predicted for each bull, with a minimum of 89 and a maximum of 3,428 SVs. The descriptive statistics for the number and length of each SV type observed in the analyzed bulls are presented in Supplementary Table S2. Structural variation encompasses, on average, 4.81% (± 4.38) of the Nellore bulls’ total genome extension. The largest genome coverage rate of SVs observed in a bull was 20.27%, largely attributed to inversion events, and the lowest genome coverage rate was 0.03% (Fig. 5, Supplementary Table S2). A significant, moderate, and positive correlation was found between the number of SVs predicted per bull in each chromosome and the chromosomal coverage rate (Pearson correlation coefficient r = 0.36, p-value < 2.2e-16).

Fig. 5

Genome coverage rate: proportion (%) of whole genome extension covered by each detected high-confidence structural variant (SV) for each Nellore bull. Each bar represents a bull, and colors indicate the structural variant type. DEL deletions, DUP duplications, INV inversions.

Discovery of potential structural variant regions (SVR)

Number, chromosomal location and distribution of SVRs

High-confidence structural variants identified in the genome of bulls were used to detect genomic regions where different types of structural variants may occur. Specifically, a deletion may be observed in one bull and a duplication in another bull (across animals). Although less common, complex rearrangements may also lead to the coexistence of both deletion and duplication affecting overlapping segments within the same individual (within animal), as previously reported for certain types of structural variant architectures. The main purpose of translating structural variants into regions (SVR) is to characterize the genetic architecture underlying population history, to assess the potential impact of structural variations on gene function and expression, and to integrate information into genomic databases, allowing a more comprehensive understanding of differences between species, breeds, and populations35.

Due to the different nature of SV types, two datasets were formed: one containing the detected CNVs (encompassing duplication and deletion events), and the other containing only the inversions. This enabled the proper identification of the SV event occurring in a specific region. Thus, SVRs were constructed from those two datasets by merging the SVs found in the individual genomes of bulls that overlapped by at least one base pair, resulting in two types of SVRs: CNVRs (regions of copy number variants) and INVRs (regions of inversions). In addition, only SVRs present in at least 5% of the analyzed Nellore bulls were considered for further analysis. A total of 3,752 highly frequent SVRs were identified, of which 97% are CNVRs (N = 3,641) and 3% are INVRs (N = 111; Fig. 6a).
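The merging and frequency-filtering steps described above can be sketched as follows. This is a minimal illustration, assuming 1-based inclusive coordinates and one call per tuple; the function name `build_svrs`, the tuple layout, and the toy animal identifiers are hypothetical and do not reproduce the actual pipeline.

```python
from collections import defaultdict

def build_svrs(calls, n_animals, min_freq=0.05):
    """Merge overlapping SV calls into regions and keep the frequent ones.

    calls: iterable of (chrom, start, end, animal_id), all from one dataset
    (either CNVs or inversions). Two calls are merged when they share at
    least one base pair. Returns a list of (chrom, start, end, n_carriers).
    """
    by_chrom = defaultdict(list)
    for chrom, start, end, animal in calls:
        by_chrom[chrom].append((start, end, animal))

    regions = []
    for chrom, items in by_chrom.items():
        items.sort()
        cur_start, cur_end, carriers = None, None, set()
        for start, end, animal in items:
            if cur_start is not None and start <= cur_end:
                # Overlap of at least 1 bp: extend the current region
                cur_end = max(cur_end, end)
                carriers.add(animal)
            else:
                if cur_start is not None:
                    regions.append((chrom, cur_start, cur_end, len(carriers)))
                cur_start, cur_end, carriers = start, end, {animal}
        if cur_start is not None:
            regions.append((chrom, cur_start, cur_end, len(carriers)))

    # Keep regions carried by at least min_freq of the animals
    return [r for r in regions if r[3] / n_animals >= min_freq]

toy_calls = [
    ("1", 100, 200, "bull_a"),
    ("1", 200, 300, "bull_b"),
    ("1", 500, 600, "bull_a"),
    ("2", 10, 20, "bull_c"),
]
frequent = build_svrs(toy_calls, n_animals=3, min_freq=0.5)
```

In the toy data only the merged region on chromosome 1 (positions 100 to 300, carried by two of three animals) passes the 50% threshold used for illustration; the study itself applied a 5% threshold.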

Fig. 6

Proportion of each type of structural variant region (SVR). The proportion of SVRs corresponding to CNVR (regions of copy number variants) and INVR (regions of inversions) (a). Proportion of CNVRs corresponding to CNVR-DEL (regions of deletions), CNVR-DUP (regions of duplications), and CNVR-MIXED (regions of deletions and duplications) (b).

From the total number of detected CNVRs, the vast majority (90%, N = 3,286) corresponds to deletion events (CNVR-DEL), a small proportion (7%, N = 256) to mixed events in which both deletions and duplications occurred (CNVR-MIXED), and the lowest proportion (3%, N = 99) to only duplications (CNVR-DUP) (Fig. 6b).
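The three CNVR categories can be expressed as a simple rule on the set of event types merged into a region. The sketch below assumes that each region keeps track of the types of its member calls; `classify_cnvr` is a hypothetical helper, while the counts are those reported above.

```python
def classify_cnvr(event_types):
    """Label a CNVR from the SV types ('DEL', 'DUP') of the calls merged into it."""
    kinds = set(event_types)
    if kinds == {"DEL"}:
        return "CNVR-DEL"
    if kinds == {"DUP"}:
        return "CNVR-DUP"
    if kinds == {"DEL", "DUP"}:
        return "CNVR-MIXED"
    raise ValueError(f"unexpected event types: {sorted(kinds)}")

# Counts reported in the text for the 3,641 CNVRs
counts = {"CNVR-DEL": 3286, "CNVR-MIXED": 256, "CNVR-DUP": 99}
total = sum(counts.values())
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}
```

The computed shares (about 90.2%, 7.0%, and 2.7%) agree with the rounded percentages shown in Fig. 6b.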

The distribution of events in the detected CNVRs differs from that reported by Lemos et al.12 in Nellore cattle, in which the CNVRs were composed of the three types of events in more balanced proportions (38.5% gains, 33% mixed, and 28.5% losses). The results also diverge from the CNVRs reported by Silva et al.22 in Nellore cattle, where 65% of the CNVRs corresponded to mixed events, 31.5% to duplications, and a small proportion (2.89%) to deletion events. Although the CNVRs detected in Nellore cattle by Santana et al.14 were attributed in higher proportion to copy number losses than to copy number gains, the proportion of deletion events in the CNVRs observed in this study is higher than that reported by those authors (41.6% deletion events, 35% duplications, and 23% mixed events).

Using WGS data from Gyr cattle, Braga et al.20 found that the CNVRs detected from CNVnator calls were composed of more than 50% deletions, around 36% duplications, and 11% complex events (duplications and deletions). The results presented here are more consistent with the CNVRs that these authors detected from Delly calls, of which 70% were attributed to deletions, 7.8% to duplications, and 22.2% to complex events. Our findings also agree with the results of Hou et al.52 in different breeds, including taurine, indicine, and African cattle, in which most of the detected CNVRs were composed of loss events (61.3%), although the representation of deletion regions observed in this study is still higher.

A considerably higher number of deletions than duplications composing a CNVR was expected, given the total number of DEL calls obtained. Nevertheless, it is still surprising that a small fraction of duplications represents common events for all bulls. Interestingly, this finding is similar to that reported by Letaief et al.28 in French dairy and beef cattle, in which only 1.6% of CNVRs corresponded to duplications, 4.1% to complex events (deletions and duplications), and regions of deletions comprised more than 95%. Our results are also in concordance with those of a recent study in Simmental cattle combining WGS data from short and long reads, where only 4.2% of the CNVRs were constituted by duplication regions, 39.6% by mixed events, and more than 56% by deletions19. It is important to note that the duplications detected in this study corresponded only to tandem duplications, as the algorithms and bioinformatic tools used to conduct the SV calling do not enable the identification of discontinuous duplications. Further factors that have likely driven the small number of called duplications relative to deletions were already discussed: the greater ability of SV callers to identify sequence losses in short-read sequencing data43,45, the difficulty of differentiating duplications from alignment artifacts43, and molecular mechanisms such as NAHR that potentially lead to more deletions42.

The highest number of highly frequent SVRs was harbored by BTA1 (N = 261), followed by BTA6 and BTA2 with 211 SVRs each. BTA25, which carries the lowest number of SVs, has the lowest number of SVRs as well (N = 47), followed by BTA19 and 22 with 67 SVRs each (Table 3; Fig. 7). Regarding CNVRs, CNVR-DEL and CNVR-MIXED were uniformly distributed across the 29 autosomes, but CNVR-DUP was not observed on BTA20, BTA24, and BTA28 (Supplementary Figure S1). BTA1 harbored the highest number of CNVRs (N = 254), most of them corresponding to deletions (N = 235 CNVR-DEL), followed by mixed (N = 12 CNVR-MIXED) and duplication events (N = 7 CNVR-DUP) (Supplementary Table S3, Figure S1). This result is in concordance with previous observations of CNVR distribution in Nellore cattle reported by Lemos et al.12 and Silva et al.22, where the highest number of CNVRs was located on BTA1, but differs in that those studies located the smallest proportion of CNVRs on BTA29 and BTA27, respectively. Our findings are similar to the results of Letaief et al.28, where the highest and the lowest numbers of CNVRs were found on BTA1 and BTA25, respectively, and a similar distribution was observed by Peripolli et al.18 in two of the three locally adapted taurine breeds. Regarding the SVRs composed of inversion events, BTA15 and BTA18 harbored the highest number of INVRs, with 11 and 9, respectively, while BTA22 and BTA24 carried the lowest number of INVRs (N = 1). INVRs were distributed along the autosomes, except for BTA14, which did not harbor any INVR (Supplementary Table S3, Figure S2).

Fig. 7

Distribution of non-redundant structural variant regions (SVRs) present in more than 5% of the population across the 29 autosomes. CNVR regions of copy number variants, INVR regions of inversions.

Table 3 Statistics summary for the number of SVRs, length, and proportion of chromosome (Chr) extension covered by structural variant regions (SVRs) per autosomal chromosome.

Average length and autosomal genome extension covered by SVRs

The 3,752 SVRs exhibit an average length of 89.91 Kb (± 469.09) and a median of 2.08 Kb. The average length of INVR is 396.44 Kb (± 951.03 Kb) (Table 4). Regarding CNVRs, the length trend is similar to the single events, in which the region of deletions exhibits the highest number but the shortest length (25.95 ± 232.35 Kb), followed by regions of duplications (36.47 ± 91.23 Kb), and the regions of mixed events (799 ± 1.24 Kb) which represent the longest CNVR type (Table 4).

Table 4 Statistics summary for the number, length, and genome coverage rate for each structural variant region (SVR) type (present in at least 5% of bulls).

The longest SVR found has a length of 8,530.07 Kb, and the combined length of all SVRs comprises a total of 337,348.3 Kb, which represents 13.12% of the whole autosomal genome extension (11.40% attributed to CNVRs and 1.71% to INVRs) (Table 4). This result is in concordance with what Lemos et al.12 previously reported, where the predicted regions with duplication and deletion events (CNVRs) from SNP genotyping data represented 13% of the genome. Notably, the bulls analyzed here belong to the same Nellore population that those authors analyzed. Thus, similar results regarding the genome coverage of structural variations were expected. However, some of our findings differ from those previously reported by these authors, mainly due to differences in the number of animals analyzed, the nature of the data used, and the combination of different strategies and algorithms for structural variant calling.
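A coverage rate of this kind can be computed as the covered length divided by a reference length. The sketch below is an assumption-laden illustration: it counts bases shared by overlapping regions only once, which may or may not match how the reported totals were obtained, and the names `union_length` and `coverage_rate` are hypothetical.

```python
def union_length(intervals):
    """Total bases covered by 1-based inclusive (start, end) intervals,
    counting overlapping stretches once."""
    covered = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                covered += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start + 1
    return covered

def coverage_rate(intervals, reference_length):
    """Percentage of a reference length covered by the intervals."""
    return 100 * union_length(intervals) / reference_length

# Toy example: three regions on a 100-bp reference, two of them overlapping
toy_rate = coverage_rate([(1, 10), (5, 20), (30, 40)], reference_length=100)
```

As a consistency check on the reported figures, 337,348.3 Kb at a coverage rate of 13.12% implies an autosomal reference length of roughly 2.57 Gb.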

The total genomic coverage of CNVRs found in this study was higher than the coverage rates reported in other Nellore cattle populations. For example, Santana et al.14 observed a total CNVR coverage of 9% for predictions made from SNP data, and Silva et al.22 reported an autosomal CNVR coverage of 1.5% for predictions made from SNP and NGS data. Compared with results reported for other taurine and indicine breeds, our observed coverage rate was also higher than those of Bickhart et al.10 (2.1%), Bickhart et al.23 (3.1%), Keel et al.53 (6.7%), Hu et al.41 (1.5%), and Sun et al.19 (0.18%), in which the CNVRs reported were those shared by diverse cattle breeds including Nellore, Brahman, Gir, and different Bos taurus breeds.

As with the SV distribution by size class, most of the detected SVRs (73%) have lengths ranging from 1 to 5 Kb, 14% are 5–10 Kb long, 4% are 10–50 Kb long, 3% are 50–200 Kb long, and 6% are larger than 200 Kb (Table 5). CNVRs are mainly distributed into the classes 1–5 Kb and 5–10 Kb (Fig. 3a), while INVRs are more evenly distributed across three size classes (10–50 Kb, 50–200 Kb, and larger than 200 Kb; Fig. 8).
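The size-class breakdown can be reproduced with a simple binning step. In this sketch the class boundaries follow the text, but the treatment of boundary values (upper bound inclusive) and of any region shorter than 1 Kb (assigned to the first class) is an assumption, as are the names `size_class` and `class_proportions`.

```python
from collections import Counter

# (upper bound in Kb, label); upper bounds are treated as inclusive here
SIZE_CLASSES = [(5, "1-5 Kb"), (10, "5-10 Kb"), (50, "10-50 Kb"), (200, "50-200 Kb")]

def size_class(length_kb):
    """Return the size-class label for a region length given in Kb."""
    for upper, label in SIZE_CLASSES:
        if length_kb <= upper:
            return label
    return ">200 Kb"

def class_proportions(lengths_kb):
    """Percentage of regions falling into each observed size class."""
    counts = Counter(size_class(x) for x in lengths_kb)
    n = len(lengths_kb)
    return {label: 100 * c / n for label, c in counts.items()}

# Toy lengths (Kb), not taken from the study
toy_proportions = class_proportions([2, 3, 7, 300])
```

With these rules, the median SVR length reported in the text (2.08 Kb) falls in the 1-5 Kb class and the longest SVR (8,530.07 Kb) in the class above 200 Kb.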

Fig. 8

Number of structural variant regions (SVRs) according to different size classes. CNVR regions of copy number variants (a). INVR regions of inversions (b).

Table 5 Number and proportion of each type of detected SVR according to different size classes.

The top five longest CNVRs are located on BTA11, 26, 21, 2, and 24, with average lengths ranging from 186.99 to 116.22 Kb (Supplementary Figure S3a). Conversely, the top five shortest CNVRs are harbored by BTA6, 16, 25, 17, and 1, with average lengths ranging from 34.23 to 49.5 Kb. Notably, although BTA1 ranks first in the number of CNVRs, the CNVRs it harbors are short. This supports the results of the Pearson correlation test, which revealed a trend in which the higher the number of SVRs carried by a chromosome, the shorter the SVRs are (Pearson correlation coefficient r = -0.37, p-value = 0.04976). In turn, the top five longest regions of inversions are carried by BTA29, 10, 7, 12, and 20, while the shortest INVRs are located on BTA22, 2, 16, 1, and 6, respectively (Supplementary Figure S4a).

Regarding the proportion of total chromosome extension covered by SVRs, BTA29 is the chromosome most covered by regions with structural variation (chromosomal coverage rate of 30.53%), followed by chromosomes 26, 28, 20, and 21, with chromosomal coverage rates ranging from 22.46 to 24.39%. In contrast, BTA25 exhibits the lowest chromosomal coverage rate (5.21%), followed by BTA6, 16, 17, and 1, with chromosomal coverage rates ranging between 5.82 and 8.07% (Fig. 9; Table 3). BTA26 is the chromosome most covered by CNVRs at 24.34%, while BTA25 showed the lowest CNVR coverage at 4.68% (Supplementary Figure S3b).

Fig. 9

Chromosomal coverage rate: percentage (%) of chromosome extension covered by structural variant regions (SVRs).

The maximum percentage of chromosomal coverage observed in this study is higher than the maximum reported by Lemos et al.12 (24.26%) in Nellore cattle, and the chromosomal distribution of coverage rates also differs from their results. Specifically, Lemos et al.12 identified BTA19, BTA23, and BTA25 as the most covered chromosomes, whereas BTA9, BTA13, and BTA8 were the least covered by CNVRs. The results also differ from those reported by Silva et al.22 in Nellore cattle, where CNVRs on BTA24 exhibited the greatest total length and those on BTA19 had the shortest. In our study, the highest chromosomal coverage by INVRs was observed on BTA29, whereas BTA22 exhibited sparse coverage by inversions (Supplementary Figure S4b). Notably, BTA29 did not exhibit the highest number of SVRs, nor were the longest SVRs observed on this chromosome. Rather, the high coverage on BTA29 can be attributed to the presence of two large inversions, combined with its small size, which increases the proportion of its sequence affected by SVs. Our results coincided with those of Letaief et al.28 in French cattle, where chromosomes 25 and 1 were the least covered by structural variants. Pearson correlation analysis revealed a significant, strong, and positive association between the number of SVRs and chromosome size (Pearson correlation coefficient r = 0.93, p-value = 4.84e-13), which is consistent with the findings of Letaief et al.28 and Silva et al.22. In contrast, no significant correlation was detected between chromosome size and the proportion of chromosomal length covered by SVRs (chromosomal coverage rate).

Functional annotation and enrichment of SVR

Most of the identified SVRs were shared by less than 50% of bulls (Fig. 10), but a considerable number of SVRs (N = 532) were shared by more than half of the studied population. To assess the functional impact of structural variations on genetic mechanisms in Nellore cattle, genes and QTLs overlapping SVRs shared by more than 50% of bulls (N > 75 bulls) were retrieved from a publicly available genomic dataset, as described in the Materials and Methods section. Figure 11 illustrates the chromosomal location of the most frequent SVRs (N = 532). Overall, of the 532 most frequent SVRs, 96 overlapped a total of 130 QTLs, which were distributed in 6 QTL types (exterior, health, meat and carcass, milk, production, and reproduction) related to a total of 50 traits (Supplementary Table S4). Figure 12 shows the number of SVRs overlapping QTLs according to QTL type and associated trait.
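Retrieving the features that overlap each SVR amounts to an interval-intersection test per chromosome. The following sketch assumes simple tuples for regions and annotations; the function names and the two QTL records are hypothetical, and the CNVR1799 coordinates are rounded from the Mb values given in this section.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True when two inclusive intervals share at least one position."""
    return a_start <= b_end and b_start <= a_end

def annotate(svrs, features):
    """Map each SVR to the names of the features (genes or QTLs) it overlaps.

    svrs: list of (svr_id, chrom, start, end)
    features: list of (name, chrom, start, end)
    SVRs without any overlapping feature are omitted from the result.
    """
    hits = {}
    for svr_id, chrom, start, end in svrs:
        names = [
            name
            for name, f_chrom, f_start, f_end in features
            if f_chrom == chrom and overlaps(start, end, f_start, f_end)
        ]
        if names:
            hits[svr_id] = names
    return hits

# CNVR1799 (BTA6:72.92-74.15 Mb) against two invented QTL records
svrs = [("CNVR1799", "6", 72_920_000, 74_150_000)]
qtls = [
    ("hypothetical QTL A", "6", 73_000_000, 73_500_000),
    ("hypothetical QTL B", "7", 73_000_000, 73_500_000),
]
hits = annotate(svrs, qtls)
```

The nested scan is adequate for a few hundred regions; for genome-wide annotation sets, an interval tree or a sorted sweep would likely be preferable.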

Fig. 10

Distribution of the number of structural variant regions (SVRs) shared by different proportions of bulls.

Fig. 11

Chromosomal location of the most frequent structural variant regions (SVR). Blue dots represent regions of deletions, green dots are mixed regions, red dots are the regions of duplications, and black dots the regions of inversions.

Fig. 12

Distribution of structural variants (SVs) overlapping quantitative trait loci (QTLs). The number of QTL types identified (a) and the number of SVs per trait associated with the identified QTLs (b).

The highest number of SVRs overlapped QTLs related to residual feed intake (N = 21), followed by structural soundness (N = 16), multiple births (N = 9), clinical mastitis (N = 7), and milk energy yield (N = 7). The SVRs overlapping QTLs were distributed across all chromosomes except BTA24. BTA6 harbored the largest number of QTLs overlapping SVRs (N = 11), followed by BTA7 (N = 8) and BTA3, 4, 5, and 23 with 7 QTLs each. In addition, the region named CNVR1799 (BTA6:72.92–74.15 Mb), which was present in 75 bulls, overlapped a total of 6 QTLs distributed in two QTL types: production (residual feed intake) and exterior (teat placement, foot angle, udder quality, udder depth, and degree of spotting). Regarding the INVRs, 4 regions overlapped QTLs: INVR27 (BTA1:39.48–39.53 Mb), overlapping a QTL related to structural soundness; INVR131 (BTA3:20.94–22.58 Mb), overlapping a QTL related to milk yield; INVR549 (BTA15:78.78–80.14 Mb), overlapping a QTL related to liver abscess; and INVR557 (BTA16:3.11–3.12 Mb), overlapping a QTL related to structural soundness. Most of the 96 SVRs were linked to just one QTL.

Regarding gene annotation, of the 532 most frequent SVRs, a total of 205 overlapped known genes (Supplementary Table S5). A total of 1,164 positional candidate genes surrounding the SVRs were annotated. The most annotated region of copy number variants, CNVR3428 (BTA13:57.35–57.36 Mb), harbored a total of 84 genes, followed by the region CNVR4347 (BTA18:61.87–61.90 Mb), which harbored 58 genes, and the regions CNVR1601 (BTA5:112.69–112.71 Mb), CNVR4532 (BTA20:33.44–33.51 Mb), and CNVR4703 (BTA21:31.13–31.19 Mb), harboring 51 genes each (Supplementary Table S5).

Among the SVRs that overlapped positional candidate genes, there were 7 regions of inversions in total, of which INVR549 (BTA15:78.97–78.98 Mb) was the most annotated, harboring 44 genes, followed by INVR131 (BTA3:20.94–22.58 Mb), which harbored 24 positional candidate genes (Supplementary Table S5). The SVRs carrying known genes were distributed across all chromosomes, and the largest number of annotated genes was harbored by BTA21 (N = 116), followed by BTA5, 2, 13, and 18 with 109, 105, 100, and 78 annotated genes, respectively.

PANTHER overrepresentation analysis using the binomial test with Bonferroni correction showed that, of the 1,164 positional candidate genes that overlapped the most frequent SVRs, 695 were significantly enriched for three gene ontology categories and one biochemical pathway (Fig. 13). The three gene ontology categories were biological process (BP), with 632 genes grouped into 15 significant GO terms (Supplementary Table S6); molecular function (MF), with 139 genes grouped into 4 significant GO terms (Supplementary Table S7); and cellular component (CC), with 236 genes grouped into 2 significant GO terms (Supplementary Table S8). In addition, 26 genes were significantly associated with a biochemical pathway (Supplementary Table S9).
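The logic of a binomial overrepresentation test with Bonferroni correction can be sketched as below. This is a simplified one-sided version written for illustration; PANTHER's exact procedure and reference list may differ, the function names are hypothetical, and all numbers in the example call are invented rather than taken from the study.

```python
import math

def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p) with 0 < p < 1, summed in log space
    to avoid overflow of the binomial coefficients for large n."""
    total = 0.0
    for i in range(k, n + 1):
        log_pmf = (
            math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * math.log(p) + (n - i) * math.log(1 - p)
        )
        total += math.exp(log_pmf)
    return total

def overrepresentation(k, n, category_size, reference_size, n_tests):
    """One-sided binomial test for a GO term or pathway.

    k: input genes annotated to the term; n: input genes tested;
    category_size / reference_size: expected fraction under the null;
    n_tests: number of terms tested (Bonferroni multiplier).
    Returns (raw p-value, Bonferroni-adjusted p-value, fold enrichment).
    """
    p0 = category_size / reference_size
    raw = binomial_upper_tail(k, n, p0)
    adjusted = min(1.0, raw * n_tests)
    fold = (k / n) / p0
    return raw, adjusted, fold

# Invented example: 10 of 10 input genes in a term covering half the reference
raw_p, adj_p, fold = overrepresentation(10, 10, 50, 100, n_tests=10)
```

A term would be called significant when its adjusted p-value falls below the chosen threshold; the Bonferroni step simply multiplies each raw p-value by the number of terms tested and caps the result at 1.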

Fig. 13

Results from PANTHER enrichment analysis. Number of Biological Process, Cellular Components, Molecular Function, and Biochemical Pathways (a) and the number of genes associated with GO terms (b).

Regarding biological process GO terms, most of the enriched genes were significantly associated with cellular process (GO:0009987, N = 49), followed by biological regulation (GO:0065007, N = 386), regulation of biological process (GO:0050789, N = 369), regulation of cellular process (GO:0050794, N = 352), and response to stimulus (GO:0050896, N = 236). Enriched genes were also associated with biological processes related to cell signaling and response, including cell communication (GO:0007154, N = 195), signaling (GO:0023052, N = 194), signal transduction (GO:0007165, N = 184), G protein-coupled receptor signaling pathway (GO:0007186, N = 72), nervous system process (GO:0050877, N = 61), sensory perception of chemical stimulus (GO:0007606, N = 57), sensory perception (GO:0007600, N = 57), cellular response to stimulus (GO:0051716, N = 31), cellular response to endogenous stimulus (GO:0071495, N = 32), and response to endogenous stimulus (GO:0009719, N = 31) (Supplementary Table S6).

For Molecular Function, 63 genes were significantly associated with odorant binding (GO:0005549), 129 with transmembrane signaling receptor activity (GO:0004888), 130 with signaling receptor activity (GO:0038023), and 139 with molecular transducer activity (GO:0060089) (Supplementary Table S7). With respect to cellular components, 32 genes were significantly associated with the organelle subcompartment (GO:0031984) and 230 with the membrane (GO:0016020) (Supplementary Table S8). Finally, a total of 26 genes were significantly associated with the phospholipid metabolism pathway (Supplementary Table S9). Notably, the enriched set includes 91 genes belonging to the olfactory receptor (OR) family, which overlapped SVRs located on BTA3, 5, 7, 15, 23, and 29. Most of these OR genes are harbored by regions of inversions: 39 OR genes are located in the region named INVR549 (BTA15:78.78–80.14 Mb; Supplementary Table S5), which is shared by 76 bulls and overlaps a health-type QTL (liver abscess; Supplementary Table S4). The regions INVR131 (BTA3:20.94–22.58 Mb; Supplementary Table S5) and INVR240 (BTA5:58.73–58.94 Mb; Supplementary Table S5) harbored 2 OR genes each and are shared by 83 and 86 bulls, respectively. Additionally, INVR131 overlaps a milk-type QTL related to milk yield (Supplementary Table S4). The other enriched OR genes are mainly harbored by copy number regions of mixed events (deletions and duplications), such as the region CNVR1946 (BTA7:8.71–10.10 Mb; Supplementary Table S5), which harbored 19 OR genes. This region is shared by 101 bulls and overlaps a production-type QTL related to residual feed intake (Supplementary Table S4). Similarly, 15 OR genes are located in the region CNVR5606 (BTA29:27.14–27.60 Mb; Supplementary Table S5), which is observed in 87 bulls, and 7 OR genes are carried by the region CNVR4958 (BTA23:29.21–29.32 Mb; Supplementary Table S5), shared by 76 bulls.

Another 5 enriched OR genes overlapped the mixed CNV region CNVR1465 (BTA5:58.63–58.91 Mb; Supplementary Table S5), which is located near a reproduction QTL related to multiple births and is shared by 102 bulls. The OR5M10C gene is located in the region CNVR4931 (BTA23:17.75–22.14 Mb; Supplementary Table S5), which overlaps two reproduction QTLs (multiple births and interval to first estrus after calving; Supplementary Table S4) and one health-type QTL related to complement hemolytic activity (Supplementary Table S4). Finally, the OR5AE3 gene is harbored by a deletion region named CNVR2043 (BTA7:40.99–41.03 Mb), which is shared by 82 bulls.

Some authors have reported significant associations of SNPs located in the vicinity of OR genes with economically important traits in beef cattle. For example, the OR5M10 gene, which was observed in this study within the region INVR549, was reported by Higgins et al.54 near the SNP rs136789347, which was significantly associated with the average daily gain (ADG) in Irish beef cattle breeds. Here, the OR5M10 gene was significantly enriched for 12 of the 15 GO terms related to Biological Processes and for all GO terms related to Molecular Functions (Supplementary Table S6 and S7). Seabury et al.55 also reported significant associations of SNPs surrounding OR genes with average daily gain and residual feed intake in three different U.S. beef cattle breeds. In Nellore cattle, Olivieri et al.56 found 3 OR genes harbored by a genomic region located on BTA15 that explained more than 1% of additive genetic variance of dry matter intake (DMI).

The mixed region CNVR1465 (BTA5:58.63–58.91 Mb) carried the genes OR6C1, OR6C1N, and OR6C1S, which were also found by Santana et al.14, in a genotype-based study, overlapping CNV regions significantly associated with feed conversion ratio in Nellore cattle. Those genes were significantly enriched for the MF GO terms transmembrane signaling receptor activity (GO:0004888), signaling receptor activity (GO:0038023), and molecular transducer activity (GO:0060089) (Supplementary Table S7). This mixed CNV region overlapped a reproduction QTL related to multiple births. Although most studies in the literature point to a functional role of the OR gene family in feed efficiency-related traits, some members of the family have also been associated with reproductive performance in cattle57. The OR genes are also involved in sperm chemotaxis and the development of female germ cells and oocytes58.

Similarly, the gene set OR7A112, OR7A78, OR7A95, OR7A97, and OR7A99, harbored by the mixed region CNVR1946, was also reported by Braga et al.20 overlapping CNVRs detected from SNP and WGS data in Brazilian dairy Gyr cattle. In this study, those genes were significantly associated with 8 of the 15 BP-related GO terms, 3 MF-related GO terms, and the CC GO term membrane (GO:0016020). Among those genes, OR7A95 was also reported by Rafter et al.59 as a gene flanking a deletion region significantly associated with meat and carcass traits in Charolais, Holstein, and Limousin cattle.

The OR genes comprise the largest superfamily in the mammalian genome. They belong to the G-protein-coupled receptors (GPCRs) and are expressed mainly on olfactory sensory neurons within the nasal cavity60,61. These protein-coding genes play roles in the reception of a wide range of chemical molecules in the environment62. Olfactory mechanisms are essential for animals, as they send signals to the brain that help locate feed sources, danger, or potential mates62. Besides the nasal cavity, olfactory receptors can be expressed in other organs, where they bind to nutrients and metabolites to trigger physiological responses such as changes in gut motility, ventilation rate, and cellular migration61. The OR activity is mediated by the central nervous system and is influenced by hormonal, nutritional, and metabolic mechanisms regulating energy balance that trigger behaviors such as feed searching and intake63,64,65.

The structural conformation of the OR gene family has been well studied in several vertebrate species. There is evidence that the OR gene family has evolved through duplication mechanisms, which have guaranteed its maintenance and survival in vertebrate genomes1,66. Lee et al.62 observed a large variation in OR gene copy number in Korean Holstein and Angus cattle. The authors found via PCR that 40% of the tested OR loci exhibited CNVs, which suggests that copy number variations of OR genes in cattle are highly frequent and that their genetic diversity confers differences in olfaction capacity and, consequently, in fitness. Although duplications have been described as the main mechanism driving the maintenance of the OR genes, it is particularly interesting that the regions of inversions detected in this study harbored such a large number of genes belonging to the olfactory receptor superfamily. Given the functional relevance of OR genes for adaptation and fitness, it is plausible that, under selection pressure, the desirable alleles of OR genes were protected against recombination by being located within regions of inversion events, thus playing important roles in the adaptation of Nellore cattle to environmental conditions.

Another important gene family found overlapping the most frequent SVRs is the β-defensin gene family (DEFB), of which 6 genes (DEFB1, DEFB10, DEFB119, DEFB124, DEFB13, and DEFB7) were significantly associated with 11 biological process GO terms, mainly related to cellular communication, signaling, and response to stimulus (Supplementary Table S6). Defensins are encoded by protein-coding genes with key functions in the innate immune system of multicellular organisms. Besides playing important roles in antimicrobial activity, these molecules act in diverse biological mechanisms such as cell signaling, ion channel transport, toxic functions, and enzyme inhibitor activities in animals67. Like olfactory receptor genes, immunity-related genes are known to be more likely to be duplicated than other genes. Our results are in concordance with those reported by Butty et al.21, who observed members of the DEFB family in CNV regions predicted from genotypes and WGS data in a Holstein population.

Similarly, in a population of Holstein and Jersey cattle, Chen et al.45 found that genes belonging to expanded gene families, such as DEFB, were more often associated with SV regions than other genes. In particular, the gene DEFB1, which was found within the mixed copy number region CNVR5333 (BTA27:6.18–7.28 Mb), was also reported in the two mentioned studies. DEFB1 was significantly related to the GO terms cellular response to endogenous stimulus, signaling receptor activity, and molecular transducer activity. Important genes in the vicinity of structural variant regions have been previously reported in the literature for their association with economically important phenotypes. For example, members of the secretory phospholipase A2 family of genes (PLA2G2A, PLA2G2D1, PLA2G2E, and PLA2G5), which were harbored by the mixed copy number region CNVR717 (BTA2:132.53–133.81 Mb), were also found within CNV regions significantly associated with meat and carcass traits in Nellore cattle by Berton et al.68. In addition, four members of the adhesion G protein-coupled receptor family (ADGRF1, ADGRF2, ADGRF4, and ADGRF5) were harbored by the mixed region CNVR4931 (BTA23:17.75–22.14 Mb), which was linked to one health QTL related to complement hemolytic activity and to two reproduction QTLs related to multiple births and interval to first estrus after calving (Supplementary Table S4). Those genes were previously reported by Li et al.69 as significantly associated with feed efficiency indicator traits in Canadian crossbred cattle, such as residual feed intake, daily dry matter intake, average daily gain, and metabolic daily gain.

A large set of genes belonging to the zinc finger protein family was also found near the structural variant regions. In particular, the mixed regions CNVR4338 (BTA18:57.18–60.17 Mb) and CNVR4347 (BTA18:61.81–63.40 Mb) harbored various members of that gene family, such as ZNF524, ZNF581, and ZNF677, which were previously associated with dry matter intake, average daily gain, and feed efficiency in an experimental population of Nellore cattle by Olivieri et al.56. Zinc finger proteins usually regulate gene expression by binding to specific DNA sequences via their DNA-binding zinc finger domains and recruiting cofactors that mediate chromatin remodeling and transcriptional repression or activation. In a transcriptomic study in Bos indicus cattle, Samuel & Dinka70 observed that ZNF genes are likely involved in regulating the expression of olfactory receptor genes. Thus, ZNF genes may also play roles in cell signaling and communication through the sense of olfaction, which is essential for environmental adaptation. The genes identified near structural variant regions are mainly involved in functions related to biological and cellular regulation, response to stimulus, and cell communication and signaling. Those functions are essential drivers of the environmental adaptation of genomes. The literature on Nellore cattle supports the view that most of the described genes are potential candidates for feed efficiency indicator traits, which reflects the biochemical mechanisms in which they are involved and that have led to improved fitness.

Functional insights from genes related to high-frequency SVRs

Adaptation to tropical regions represents one of the most significant challenges for cattle breeds, requiring a coordinated response to persistent heat stress, seasonal nutritional limitations, and pathogen exposure. By analyzing the 532 most frequent SVRs in Nellore cattle, we identified genomic regions enriched for genes and QTLs associated with thermotolerance, immune function, metabolic regulation, and tissue resilience, all key attributes for tropical adaptation (Supplementary Tables S4 and S10).

The ability to handle elevated ambient temperatures is one of the most important attributes required for cattle raised in tropical regions. We found SVRs encompassing genes related to cellular protection and heat dissipation, which are important for thermal tolerance (Supplementary Table S10). Notably, HSP90AB1 (BTA 23, 17.75–17.76 Mb), a heat shock protein essential for maintaining proteostasis under thermal stress, has been directly associated with enhanced thermotolerance in cattle populations71,72. Similarly, PRLR (BTA 20, 38.97–39.10 Mb), particularly its slick hair variants, promotes shorter hair coats that improve heat loss through convection and radiation, an adaptation widely documented in tropical breeds73,74. Genes such as THRB (BTA 27, 41.46–41.90 Mb) and MC4R (BTA 24, 59.16–59.17 Mb), which modulate metabolic rate and appetite, respectively, play a pivotal role in limiting endogenous heat production during thermal stress75,76,77, while ion channel genes like KCNMA1 (BTA 28, 32.62–33.39 Mb) may facilitate vasodilation and peripheral cooling78.

The capacity to mitigate oxidative damage, often a by-product of heat exposure, appears to be influenced by SVRs mapping to the genes EGLN1 (BTA 28, 4.09–4.15 Mb), a key regulator of the hypoxia-inducible factor (HIF) pathway, as well as NFE2L2 (BTA 2, 19.62–19.65 Mb), JUN (BTA 3, 87.27–87.27 Mb), and ATF2 (BTA 2, 21.74–21.79 Mb), all of which are involved in activating antioxidant and cellular stress responses78,79,80. Additional support for cutaneous resilience in hot and pathogen-rich environments may be conferred by DEFB1 (BTA 27, 6.72–6.72 Mb), while energy-sensing mechanisms mediated by PRKAA1 (AMPK) (BTA 20, 33.67–33.7 Mb) enable rapid metabolic adaptation under heat load81,82. In addition to thermal regulation, SVRs surrounding immune competence genes are also involved in the expression of traits important for adaptation to tropical production systems. SVRs overlapping the BoLA complex, a cornerstone of adaptive immunity, have been repeatedly linked to resistance against mastitis, tick infestation, and other tropical diseases83. A set of innate immune genes (ALCAM, SYK, NLRP2, and FCAR) was identified within SVRs, together with immunoregulatory loci such as LAIR1, SIGLEC10, PILRA, and TNIP3. These immune functional gene sets contribute to a balanced immune response by mitigating chronic inflammation in challenging environments84,85. Genes regulating hematopoiesis and lymphocyte development, including GATA2, SP110, IL7R, LIFR, and OSMR, were also found86,87,88. In parallel, the presence of SLC11A2, which limits iron availability to intracellular pathogens, suggests the importance of nutritional immunity for disease resistance89.

Tropical forage systems often present marked seasonal fluctuations in nutrient content and availability. In this context, SVRs were enriched for genes supporting metabolic flexibility and nutritional efficiency. MC4R and PRKAA1 play a central role by regulating feed intake and energy homeostasis76. At the cellular level, nutrient sensing and growth coordination are enabled by IGF1R (BTA 21, 7.78–8.08 Mb), RPTOR (BTA 19, 51.69–52.01 Mb), and RICTOR (BTA 20, 35.36–35.5 Mb), key components of anabolic signaling pathways90,91. Genes such as SREBF2 (BTA 5, 112.83–112.89 Mb), PCK1 (BTA 13, 58.61–58.65 Mb), and GPD1 (BTA 22, 6.59–6.69 Mb) support lipid and glucose metabolism under nutritional stress92,93,94, while detoxification enzymes encoded by ALDH4A1 (BTA 2, 133.76–133.79 Mb), AOX1 (BTA 2, 89.1–89.17 Mb), and AHR (BTA 4, 25.81–25.86 Mb) may facilitate adaptation to bioactive compounds found in diverse tropical forages95.

Transcription factors such as MORC3, ETS2, JUN, GATA2, and NFE2L2, which are located within SVRs (Supplementary Table S10), regulate the expression of gene sets activated by environmental stressors96,97,98,99. Epigenetic regulators, such as MYSM1 (BTA 3, 87.35–87.39 Mb), transcriptional repressors like SIN3A (BTA 21, 33.38–33.45 Mb), and stress regulators like PML (BTA 21, 34.58–34.63 Mb), may further fine-tune immune and metabolic responses100,101. Moreover, NR1D2 (REV-ERBβ) (BTA 27, 41.99–42.01 Mb), a circadian regulator, suggests that alignment between metabolic activity and environmental light-dark cycles may also contribute to tropical resilience102.

The maintenance of cellular integrity under chronic environmental stress is another key adaptation. SVRs overlapping XRCC6, BLM, FANCI, and RUVBL1 point to active DNA repair mechanisms that prevent genomic instability under oxidative and thermal stress103,104,105,106. Tissue health is further supported by autophagy regulators (WDFY3 and OMA1) and anti-apoptotic factors such as BCL2L1, while redox regulators like TXNIP, NOX4, and SESN3 ensure intracellular oxidative balance107,108,109,110,111,112.

The structural integrity and plasticity of tissues, especially skin and connective tissue, are maintained by SVRs containing FSTL1, ADAMTS2, COL15A1, HMCN1, BMP7, XYLT1, FAT4, and CTNNA1. These genes contribute to extracellular matrix remodeling and efficient wound repair, enhancing resistance to ectoparasites and mechanical injury in extensive grazing systems111,112,113,114,115. Taken together, the results demonstrate that SVRs in Nellore cattle capture functionally important genomic regions related to heat resilience, immune defense, metabolic adaptability, and tissue robustness. Incorporating this layer of structural genomic information into tropical breeding programs holds substantial promise for accelerating the development of resilient, high-performing animals suited to increasingly demanding environments.
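The positional reasoning used throughout this section (a gene is treated as a candidate when its coordinates fall on a structural variant region) can be illustrated with a minimal interval-overlap sketch. It uses only the rounded Mb coordinates quoted in this Discussion; the names `Interval`, `overlaps`, and `genes_in_regions` are hypothetical and do not correspond to the annotation pipeline used in the study. Because the quoted coordinates are rounded to 0.01 Mb, overlaps at region boundaries should be read with caution.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    """A named genomic interval; coordinates are in Mb as quoted in the text."""
    name: str
    chrom: str
    start_mb: float
    end_mb: float


def overlaps(a: Interval, b: Interval) -> bool:
    """Two closed intervals overlap if they share a chromosome and neither ends before the other starts."""
    return a.chrom == b.chrom and a.start_mb <= b.end_mb and b.start_mb <= a.end_mb


def genes_in_regions(genes: list[Interval], regions: list[Interval]) -> dict[str, list[str]]:
    """Map each region name to the names of the genes overlapping it."""
    return {r.name: [g.name for g in genes if overlaps(g, r)] for r in regions}


# Regions and genes with coordinates taken from this Discussion (rounded values).
REGIONS = [
    Interval("CNVR5333", "27", 6.18, 7.28),
    Interval("CNVR717", "2", 132.53, 133.81),
    Interval("CNVR4931", "23", 17.75, 22.14),
]
GENES = [
    Interval("DEFB1", "27", 6.72, 6.72),
    Interval("ALDH4A1", "2", 133.76, 133.79),
    Interval("PRLR", "20", 38.97, 39.10),
]

HITS = genes_in_regions(GENES, REGIONS)
```

With these inputs, DEFB1 falls within CNVR5333, as stated above, whereas PRLR lies on a chromosome not covered by any of the three example regions. A genome-wide analysis would normally rely on base-pair coordinates and an indexed interval structure rather than this pairwise comparison.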

Conclusion

This study provides a comprehensive assessment of structural variation in the Nellore cattle genome, using WGS-based detection algorithms to identify deletions, duplications, and inversions. We identified 215,031 structural variants, including 183,032 deletions, 14,013 duplications, and 17,986 inversions, highlighting the widespread contribution of mobile elements to the genomic architecture of the Nellore population in Brazil. To our knowledge, this is the first report of inversion events in Nellore cattle. Among these, 3,752 structural variants were present in over 5% of bulls, and 532 were common to over 50% of the key ancestors analyzed. Of these 532 high-frequency SVRs, 97% mapped to copy number variation regions (CNVRs) and 3% were inversion regions (INVRs). Among the CNVRs, 90% were deletions, 3% were duplications, and 7% contained both.
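As a quick consistency check on the counts reported above, the following sketch confirms that the three SV classes sum to the stated total and derives the share of each class. The percentages it computes are derived here for illustration and are not values reported by the study; the variable names are hypothetical.

```python
# SV counts as reported in the Conclusion.
SV_COUNTS = {"deletion": 183_032, "duplication": 14_013, "inversion": 17_986}
REPORTED_TOTAL = 215_031

# The per-class counts should add up to the reported total.
total = sum(SV_COUNTS.values())

# Share of each SV class, in percent, rounded to one decimal place.
shares = {sv_type: round(100 * n / total, 1) for sv_type, n in SV_COUNTS.items()}

for sv_type, pct in shares.items():
    print(f"{sv_type}: {pct}% of {total} SVs")
```

Deletions make up the large majority of detected events, which matches the predominance of deletion-type CNVRs among the high-frequency regions.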

Functional annotation of these high-frequency SVRs identified 1,160 positional candidate genes overlapping QTLs involved in thermotolerance, immune regulation, metabolic flexibility, and tissue integrity, essential components of adaptation to tropical environments. The presence of such functionally relevant variants emphasizes the importance of structural genomic diversity in shaping adaptive phenotypes under environmental stressors such as heat, nutrient fluctuation, and pathogen pressure. These findings enhance our understanding of the genetic mechanisms underlying environmental adaptation in Nellore cattle through changes in functionally significant genes. The inclusion of SV-based information in genomic evaluations may enhance the accuracy of selection for environmentally adaptive traits, complementing SNP-based analyses and offering novel opportunities for precision breeding in beef cattle.