Background & Summary

The Nguni sheep are indigenous to Southern Africa and comprise four ecotypes, namely Zulu, Pedi, Swazi, and Landim sheep1,2. Their origin can be traced to around 200 and 400 AD during the migration period of the Nguni people from central Africa to the Southern part of the continent1,3,4,5. The Nguni sheep are hardy, can walk long distances in search of forage and water and are naturally resistant to gastro-intestinal parasites and tick-borne diseases that are endemic in the region6,7. They therefore rarely fall sick or need vaccination46. They also have a strong foraging instinct making them well adapted to the harsh climate that is usually hot and humid7. Due to these traits, the Nguni sheep are therefore low maintenance, making them highly attractive to local smallholder and emerging farmers. Throughout Africa, sheep are important and valuable assets that provide meat, milk, wool, hide, and manure8,9,46. Sheep also play a critical cultural role in many communities by enhancing cultural practices and ceremonies due to the phenotypic appearance of multi-colored coat colors and patterns7,10,11,12,13. However, in recent times, the introduction of larger-bodied exotic sheep breeds has resulted in the Nguni sheep becoming a vulnerable species largely due to farmers’ preference for the larger-bodied exotic breeds14. The Nguni sheep have small bodies when compared to the exotic breeds, giving the impression that they are a low-performance breed15. This perception resulted in indiscriminate crossbreeding practices that aimed to improve their so-called inferior traits, proving to be a great threat to indigenous Nguni sheep. This has prompted the establishment of conservation programs to save these unique and irreplaceable genetic resources14,16.

The Nguni sheep, like many African indigenous genetic resources17, does not have a high-quality reference genome. Having a high-quality reference genome that is accessible to researchers is an essential genomic tool to establish efficient breeding programs and conservation strategies to guard against the extinction of indigenous genetic resources18,19. There are few genetic studies conducted on Nguni sheep. These studies assessed the genetic structure of Nguni sheep using microsatellite markers and Ovine 50 K chip and found that the Nguni sheep, despite the different ecotypes, the genetic admixtures were detected with Damara and Dorper sheep2,3. These findings are one of the pieces of evidence that we need to conserve the genetics of the Nguni sheep. Selepe et al.14 also emphasized that the Nguni sheep is currently at a high risk of extinction as they reported the admixture of Nguni sheep with exotic breeds. Therefore, this breed has been reported as one of the important breeds within the South African government conservation program to avoid extinction20,21,22.

Concurrently, in recent years, climate change has shown the need to have breeds that are well-adapted to changing climatic environments23. In this context, the development of effective breeding programs to improve production performance and efforts aimed at the conservation of indigenous breeds needs urgent attention. Such efforts require establishing high-quality reference genomes19. To understand the organism’s genetic architecture requires a comprehensive evaluation of the genome and its function. The generation of an error-free, near-gapless reference genome is a step closer to realise such understanding. The majority of the indigenous species constitute a valuable genetic makeup that is under-studied. This is true in the African indigenous sheep. Indigenous African sheep genetic resources play a crucial role in the provision of food security and socio-economic components. The available sheep reference genome24 may not fully explain the unique genome characteristics of the African indigenous breeds such as the Nguni sheep. This may be due to factors such as genetic diversity, evolutionary history, and the adaptation to environmental factors such as different climates in which the animals are exposed. Understanding the genetics of Nguni sheep is important to safeguard their genetic diversity, improve their production, and enhance their resilience to climate change and conservation. The African BioGenome Project (AfricaBP) is a coordinated Pan-African effort to sequence the genomes of 100,000 endemic and indigenous African species17. As part of AfricaBP’s initiatives, we have produced a near error-free chromosome-scale complete genome of the indigenous Southern African Nguni sheep (Ovis aries). Here, a high-quality reference genome of a female Nguni sheep was generated using Pacific Biosciences (PacBio) HiFi and Dovetail genomics Omni-C technologies.

Methods

Ethics statement

The sample collection procedure including processing and handling of the animals used in this study was approved by the University of South Africa ethics committee (reference number: AREC-100818-024), Limpopo department of Agriculture and the Department of Agriculture Land Reform and Rural Development (DALRRD) under section 20 of the Animal Diseases Act 1984 (Act 35 of 84) (ref no 12/11/1/1/23 (6508 AC). Furthermore, the ethics procedure was guided by the AfricaBP policy on ethics which emphasises ethical, legal, and social issues throughout the research activities. Finally, this work benefited from compliance consultations with the Department of Forestry, Fisheries and Environment (DFFE) as the Competent National Authority for biodiversity framework and digital sequence information in South Africa. Given that this work is solely for academic purposes with no commercialisation intentions either now or in the future, and the samples were sourced and the genome sequenced in South Africa, a bioprospecting permit is not required from DFFE.

Sample collection and sequencing

In this study, a pure breed of ewe Nguni sheep (Fig. 1A) was selected from Mara research station (Fig. 1B) herd for sampling in Limpopo province of South Africa (23° 05′S, 29° 25′E). Blood was collected from the jugular vein in EDTA tubes by the veterinarian and placed on dry ice, immediately transported to Inqaba Biotech laboratories in Pretoria, South Africa, and stored at −80 °C freezer until further processing. High molecular weight genomic DNA was extracted from 200 ul blood using Nanobind protocol for whole blood high molecular weight (HMW) DNA extraction25 to construct a sequencing library. The protocol was optimised for extraction from 200 ul of whole blood. Library sequencing on the PacBio Sequel IIe platform was done using SMRTbell® prep kit 3.026 following the manufacturer’s instructions. Dovetail Omni-C library prep was performed from the same sample used for HiFi sequencing following the manufacturer’s instructions. The resulting library was sequenced on NovaSeq 6000 instrument (Illumina). The total HiFi output data was 99 Gb at a coverage of 30 × while the output for Omni-C was 300 million read pairs.

Fig. 1
Fig. 1
Full size image

A Nguni sheep (Brown Head and White body) at Mara Research Station, Limpopo Province of South Africa (A) and a map of Mara Research station located in Limpopo Province of South Africa (B).

Genome assembly

The genome assembly was conducted through a series of steps (Fig. 2) using the Vertebrate Genome Project (VGP) pipeline for genome assembly27. Firstly, the sequenced PacBio HiFi reads were profiled to assess genome, repeat contents and heterozygosity rate. We employed Cutadapt28 on the HiFi reads to remove adaptors sequences then used meryl to generate k-mers (k = 21) count distribution and GenomeScope29 to estimate genome for profiling the genome. This process was carried out using PacBio HiFi reads. Subsequent to this, the genome assembly was conducted using Hifiasm30 in Hi-C mode, incorporating HiFi and Omni-C reads. This generates two haplotypes (hap 1 and hap 2) which are essentially equivalent. The assembly was further subjected to scaffolding with YaHS31 then ran the decontamination using dual decontamination preparation workflow which make use of Kraken232 as shown in Fig. 3 below. The dual decontamination workflow generates scaffolds versus Omni-C reads contact map with both haplotypes for curations. PretextMap was used to manually curate the scaffolds and by orienting and correcting mis-assemblies using the genomic proximity signal of the Omni-C reads alignment against scaffolds. The genome assembly was submitted to the National Center for Biotechnology Information (NCBI) for validation purposes, to confirm that it had been thoroughly cleaned of any extraneous taxa sequences. The submitted genome was subsequently employed for downstream analyses. The final genome size of Nguni sheep is 2.9 Gb, with a heterozygosity rate of 0.236% (Fig. 3A). The genome was sequenced at a coverage of 30X. The quality of the assembly was further assessed using Merqury completeness, which was 99.6% for both assemblies (Table 1). The genome completeness was further assessed using BUSCO with a dataset lineage of cetartiodactyla_odb10, which had 96.1% complete genes, 0.8% fragmented, and 3.1% missing genes respectively, totalling 13,335 (Fig. 3B).

Fig. 2
Fig. 2
Full size image

The assembly workflow (A), Technology used (B).

Fig. 3
Fig. 3
Full size image

(A) The linear plot generated by GenomeScope showing the observed k-mer profile. The coverage is shown on the x-axis and coverage frequency on the y-axis. (B). BUSCO plot of the Nguni Sheep. The BUSCO percentages of the complete (single – light blue), complete (duplicates - dark blue), fragmented (yellow) and Missing (red) are shown along the x-axis.

Table 1 Assembly quality results of merqury completeness.

Genome annotation

The preliminary phase in genome annotation entails the identification and masking of repeat elements in the Nguni sheep genome. This process utilizes RepeatMasker33, a software tool that facilitates the identification and masking of repetitive elements within genomic sequences. The characterization of Nguni_Sheep-specific repeats was facilitated through the implementation of repeat modelling with RepeatModeler v0.9, a software that incorporates tools such as RepeatScout v1.0.634, RECON v1.5.035, and TRF v4.0936. The identified repeat elements were then subjected to annotation using the “one_code_to_find_them_all” Perl script37.

Protein coding genes for repeat masked assemblies were predicted using TIBERius v1.1.438. The resulting predicted protein coding genes were subsequently used to build orthologous gene sets among the selected sheep breeds, Hu (GCA_040805955.1)39, Tibetan (GCA_017524585.1)40, and Rambouillet (GCF_016772045.2)24 using Orthofinder41, which clusters shared orthologous genes.

The assembly results are reported in Table 2, showed a scaffold N50 above 99.6 Mb and a contig N50 of 74 Mb. Our assembly showed a comparable contiguity with the latest sheep reference genomes, which include Ramboiuillet sheep genome from the International Sheep Genomics Consortium24, as well as the Hu39 and Tibetan40 sheep genomes (Table 2). The variation among these assemblies is attributable to the sequencing coverage and the sequencing technology employed which likely influence the completeness and accuracy of the genome42. Additionally, the results provide the first comprehensive genome information for Nguni sheep, highlighting the genomic differences that may underpin phenotypic variations among breeds. Subsequent to scaffolding, the Nguni sheep genome assembly underwent a decontamination process to eliminate contaminants and further refine the genome. This refinement ensures the reliability of the assembly and enhances its utility as a resource for genomic and phenotypic studies.

Table 2 Assembly stats for Nguni sheep as compared to the assembly stats for Hu, Tibetan, and Rambouillet Sheep.

Genome annotation of nguni sheep

The repeat annotation analysis of the Nguni genome revealed the presence of approximately 1435.61 Mb of repetitive sequences, constituting 47.39% of the total sequence length. The major categories of repetitive elements identified in the Nguni genome are summarized in Fig. 4, while a distribution of repetitive elements across the sheep genomes are presented in Table 3. The genome annotation parameters are reported in Table 4. Among the repetitive elements, 91.73% were annotated as known repeat motifs, while 8.27% were classified as uncharacterized elements unique to the Nguni Sheep. The most prevalent class of known repeats was the long interspersed nuclear elements (LINEs), which constituted 19.02% of the genome. While the number of repeat elements can influence genome size and certain levels of duplication within the genome may play a significant role in shaping the diverse gene repertoires in the organisms43, this study has yet to determine the extent to which these genomic factors impact gene composition among sheep breeds.

Fig. 4
Fig. 4
Full size image

Repetitive element distribution in percentage for the Nguni sheep genome. LINEs and SINEs.

Table 3 Summary of repeat elements identified in the genome of Nguni sheep.
Table 4 Genome annotation parameters of the Nguni sheep.

The genome was annotated using a deep learning-based ab initio gene structure prediction tool, which generated gene predictions based on inherent structural evidence. This resulted in 25,926 predicted protein-coding genes. The BUSCO analysis, based on the cetartiodactyla_odb10 lineage dataset showed that 89.7% of the BUSCOs were complete, with 88.1% being single-copy and 1.6% duplicated. The annotation process also identified 1.2% fragmented BUSCOs and 9.1% missing BUSCOs, resulting in a total of 13,335 cetartiodactyla BUSCO groups that were searched.

Further experimental evidence may be necessary to provide a more robust demonstration of the distinct characteristics of the Nguni sheep. The primary objective of our study was to generate a high-quality reference genome. This information serves as the essential tool for studies aimed at improving the Nguni sheep’s production and for conservation purposes. The generated genome reference will be made publicly available and accessible to the researchers for use in further research.

Data Records

The raw sequences were submitted to the NCBI and assigned accession number for SRA (SRR33210015) data of PRJNA117566744. The final genome assembly for the Nguni sheep was submitted to the NCBI under the BioProject: PRJNA1175667, BioSample: SAMN44368269 and assigned the accession number JBLGTL00000000045. The annotation data of the genome is shared through figshare46.

Technical Validation

The reference genome assembly of Nguni Sheep was supported by VGP workflows for genome assembly which were designed to reduce human error by employing workflows that produces near error-free high-quality reference genome. The assembly was done using Hifiasm on a Hi-C mode. The pipeline is comprised of different workflows that make use of methods such as BUSCO and mercury to assess the integrity of the genome.