Background & Summary

Yanbian cattle (Bos taurus), locally referred to as Yanbian yellow cattle, are one of China’s five premier indigenous cattle breeds. Native to northeastern China, the breed exhibits exceptional adaptive traits, including cold tolerance, disease resistance, and remarkable roughage utilization efficiency1,2. In recent years, Yanbian cattle have gained increasing agricultural and commercial importance owing to their superior meat quality, particularly their highly marbled, tender beef with distinctive flavor profiles, which meets the growing demand for premium beef products in China’s evolving market3.

Current research has identified several candidate genes associated with the breed’s distinctive phenotypes. Studies have revealed key genetic factors (such as CORT, FGF5, and CD36) contributing to cold climate adaptation2, while polymorphisms in CAPN1 have been linked to meat quality traits4. However, existing investigations have primarily relied on the Hereford cattle reference genome5, leaving significant gaps in our understanding of Yanbian-specific genetic variations and their molecular mechanisms. The absence of a Yanbian-specific cattle reference genome substantially limits comprehensive genomic studies. Therefore, a complete, high-quality Yanbian cattle genome assembly would be essential, enabling precise identification of breed-specific genes, regulatory elements, and functional DNA regions, thereby facilitating deeper investigation into the genetic basis of its valuable traits.

In this study, we present the first high-quality chromosome-level genome assembly of Yanbian cattle, generated using PacBio HiFi circular consensus sequencing (106.7 Gb) and high-throughput chromosome conformation capture (Hi-C) scaffolding (341.51 Gb raw data), combined with RNA-seq short reads6,7,8. The final assembly comprises 30 chromosomes (scaffold N50 ~111.08 Mb) represented by 59 contigs (contig N50 ~86.41 Mb), covering 97.45% of the original genome length. Notably, 17 chromosomes were assembled completely without gaps, demonstrating high assembly continuity. The assembly also exhibits high completeness, with 93% of BUSCO genes identified as complete and only a small share missing (5.1%) or fragmented (2%). In summary, the first Yanbian cattle assembly (YB_JAAS) provides a resource for mining genetic variation specific to Yanbian cattle and for understanding the origin, domestication and development of the breed.

Methods

Blood and tissue sample collection

Blood samples were collected from a 10-year-old male Yanbian cattle at the Yanbian Livestock Development Corporation in Yanji City, Jilin Province, for HiFi and Hi-C sequencing, supplemented by whole-genome resequencing of five 20-month-old Yanbian cattle from the same region. Tissue samples (heart, liver, spleen, lung, kidney, muscle, small intestine, and rumen) from one 4-year-old Yanbian cattle were snap-frozen in liquid nitrogen and processed by Wuhan Frasergen Bioinformatics Co., Ltd for RNA-seq.

DNA and RNA extraction

High-quality genomic DNA was extracted using a modified CTAB method9. Briefly, whole blood samples were lysed in 4 × CTAB buffer containing β-mercaptoethanol, incubated at 65 °C for 1.5 h, and then cooled to room temperature. DNA was extracted with chloroform-isoamyl alcohol (24:1) and precipitated with ethanol. Total RNA was extracted using TRIzol reagent (Invitrogen, CA, USA). RNA purity and integrity were assessed using a NanoDrop spectrophotometer and an Agilent 2100 Bioanalyzer, and degradation was evaluated by 1.5% agarose gel electrophoresis.

Genome sequencing

HiFi sequencing was conducted on the PacBio Revio platform, with libraries constructed using the SMRTbell Express Template Prep Kit 2.0 (Pacific Biosciences). Raw sequencing data were processed with SMRT Link and the ccs tool to split subreads and generate HiFi reads, defined as consensus sequences supported by more than three full passes of the same ZMW (zero-mode waveguide) and with a consensus accuracy above 99%. The final output consisted of 6,503,032 high-quality reads totaling 106.7 Gb, with an average length of 16.41 kb (Table 1).
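The read statistics above can be cross-checked with simple arithmetic. The following is a minimal sketch, not part of the authors' pipeline: the function names are hypothetical, and the sequencing depth is only a rough estimate obtained by dividing the HiFi yield by the chromosome-anchored assembly length reported in the Genome assembly section.

```python
# Figures quoted in the text (Table 1 and the Genome assembly section).
TOTAL_BASES = 106.7e9            # HiFi yield in bases
READ_COUNT = 6_503_032           # number of HiFi reads
ASSEMBLY_LENGTH = 2_849_678_991  # anchored assembly length in bp


def mean_read_length_kb(total_bases, read_count):
    """Average read length in kilobases."""
    return total_bases / read_count / 1e3


def approx_depth(total_bases, genome_length):
    """Rough sequencing depth: total bases divided by genome length."""
    return total_bases / genome_length


mean_kb = mean_read_length_kb(TOTAL_BASES, READ_COUNT)
depth = approx_depth(TOTAL_BASES, ASSEMBLY_LENGTH)
print(f"mean read length: {mean_kb:.2f} kb")
print(f"approximate depth: {depth:.1f}x")
```

The mean read length computed this way agrees with the 16.41 kb reported in the text; the depth value is an illustrative estimate only and is not reported by the authors.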

Table 1 The HiFi reads statistics.

To resolve scaffolding ambiguities and improve assembly quality, we combined Hi-C data with PacBio HiFi long reads. Hi-C libraries were prepared using conventional methods10. Chromatin was cross-linked with 1% formaldehyde (10 min, room temperature), quenched with 0.125 M glycine, and lysed. After SDS treatment (0.3%) to inactivate nucleases, chromatin was digested with MboI (100 U), biotin-labeled, and ligated using T4 DNA ligase (50 U). Cross-links were reversed and the DNA was purified (QIAamp DNA Mini Kit). The DNA was sheared (300–500 bp), end-repaired, A-tailed, and adapter-ligated. Biotinylated fragments were enriched via streptavidin pull-down, PCR-amplified, and sequenced (BGI PE150). In total, 341.51 Gb of raw Hi-C data were generated. After trimming adaptors and filtering low-quality reads using Trimmomatic11, we obtained 330.78 Gb of clean reads (Table 2). Subsequent bioinformatic analyses were based on the clean data.

Table 2 The Hi-C data statistics after filtering.

For genome annotation support, polyadenylated mRNA was enriched from 1 μg total RNA using oligo(dT) magnetic beads. RNA-seq libraries were prepared with the VAHTS Universal V6 kit and sequenced in paired-end mode on a DNBSEQ-T7 system. Finally, the RNA-seq experiment yielded 26.23 Gb of high-quality paired-end reads for further genome annotation.

Genome assembly

Filtered HiFi reads were assembled into contigs using hifiasm v0.19.8-r603, yielding a contig-level genome12. A total of 2,215,797,704 clean Hi-C reads were aligned to the ARS-UCD2.0 reference genome using Juicer with default parameters to construct a scaffold-level assembly (Table 2)13,14. Heterozygous and contaminated contigs were filtered out using Juicer13,15. Finally, we anchored the scaffolds to 30 chromosomes (29 autosomes plus X) and obtained a Yanbian cattle assembly with a total length of 2,849,678,991 bp (Fig. 1, Table 3). The contig N50 and scaffold N50 are 86.41 Mb and 111.08 Mb, respectively. The Hi-C-assisted assembly results are shown in Table 3. Notably, most chromosomes consisted of a single contig and therefore contained no gaps, indicating high assembly integrity (Table 3)16.
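For readers less familiar with the N50 metric quoted above, the following is a minimal sketch of its standard definition: the length L such that sequences of length at least L together make up at least half of the total assembly. The function name and the toy lengths are illustrative assumptions, not the authors' data or code.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sequences are sorted from longest to shortest and accumulated until
    the running total reaches half of the summed length; the length of
    the sequence that crosses this point is the N50.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("n50() requires at least one sequence length")


# Toy contig lengths (arbitrary units), purely for illustration.
toy_contigs = [50, 40, 30, 20, 10]
print(n50(toy_contigs))
```

Applied to contig lengths the function gives the contig N50, and applied to scaffold or chromosome lengths it gives the scaffold N50.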

Fig. 1

Genome-wide Hi-C interaction map at 500-kb resolution. Colors from light to dark indicate increasing interaction intensity. The horizontal and vertical coordinates represent the N*bin position on the genome. The first 30 squares represent the 30 cattle chromosomes.

Table 3 The assembly result assisted by Hi-C.

Assembly quality assessment

The completeness of the assembled genome was assessed by BUSCO analysis with OrthoDB as the reference17. Overall, 93.0% complete BUSCOs were identified in the YB_JAAS assembly (Table 4). In addition, the Hi-C-anchored genome sequences were ordered according to the reference genome sequences using MUMmer to perform a collinearity analysis. The result indicated good collinearity with the reference genome (Fig. 2). Moreover, we mapped the whole-genome resequencing data of five Yanbian cattle to the YB_JAAS genome. The average mapping ratio was 99.56% and the average properly mapped ratio was 98.22% (Table 5). Together, this evidence confirms the high quality of our assembled genome (YB_JAAS).

Table 4 The quality assessment of assembly YB_JAAS via BUSCO.
Fig. 2

The result of the collinearity analysis using MUMmer. The horizontal axis represents the reference genome, and the vertical axis represents the genome after Hi-C anchoring. Red lines indicate forward matches, and blue lines indicate reverse-complement matches.

Table 5 The mapping results based on the whole genome re-sequencing data.

Annotation of repetitive sequences

To comprehensively characterize the repetitive elements in the YB_JAAS genome, we combined homology-based and de novo prediction strategies. For homology-based identification, known transposable elements (TEs) were annotated using RepeatMasker (version 4.1.2) with the Repbase TE database as the reference library18,19,20. Additionally, RepeatProteinMask (Revision 1.36) was used to detect repetitive elements against a curated TE protein database19. For de novo prediction, RepeatModeler was used to build a YB_JAAS-specific repeat library, which captured both known and novel repetitive elements, including TEs, low-complexity regions, and unclassified repeats, by integrating two complementary algorithms, RECON21 and RepeatScout22. To further improve LTR retrotransposon detection, we performed a dedicated search using LTR_FINDER (v1.0.7)22,23. Tandem repeats were identified using Tandem Repeats Finder (TRF)24, while low-complexity repeats, satellites and simple repeats were annotated with RepeatMasker20. Finally, after integrating the libraries derived from the homology-based and de novo approaches and performing a comprehensive repeat annotation, 51.94% of the genome was classified as repetitive sequence (Table 6). A total of 1.32 Gb of sequence was identified as combined TEs, accounting for 45.22% of the whole genome length (Table 7). Among TE classes, LINEs were the most abundant, accounting for 31.91% of the whole genome (Fig. 3).
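When annotations from several tools are combined, the same bases are often reported more than once, so a non-redundant repeat fraction requires merging overlapping intervals before summing. The sketch below illustrates that idea on a single sequence. It is an assumption about how such a combined percentage can be computed, not the authors' actual procedure, and the function name and coordinates are hypothetical toy values.

```python
def merged_length(intervals):
    """Total bases covered by half-open (start, end) intervals on one sequence.

    Overlapping or nested intervals are counted only once.
    """
    covered = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Close the previous merged block and open a new one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Extend the current block when the intervals overlap or touch.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered


# Toy annotations on a hypothetical 1,000-bp sequence.
homology_hits = [(0, 100), (150, 200)]
tandem_hits = [(90, 120), (300, 310)]
ltr_hits = [(180, 260)]

combined = merged_length(homology_hits + tandem_hits + ltr_hits)
fraction = 100 * combined / 1000
print(f"{combined} bp masked ({fraction:.1f}% of the sequence)")
```

For a whole genome, the same merge would be applied per chromosome and the covered lengths summed before dividing by the total genome length.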

Table 6 Statistical results of repetitive sequences using diverse methods.
Table 7 Statistics of classification results of TEs.
Fig. 3

The classification statistics of TEs.

Gene annotation

To comprehensively identify protein-coding genes, we applied three prediction methods to the YB_JAAS genome: ab initio modeling, homology-based alignment and RNA-seq-guided annotation. For homology-based prediction, we used TBLASTN to align the protein sequences of five related species, namely wild yak (Bos mutus), zebu (Bos indicus), water buffalo (Bubalus bubalis), American bison (Bison bison) and banteng (Bos javanicus), to the YB_JAAS genome25, and used Exonerate to predict gene structures26. Comparative analysis revealed conserved GC content distributions in both genes and coding sequences relative to these species, together with increased exon and intron numbers in the YB_JAAS genome (Fig. 4). Augustus (v3.3.1) and Genscan were used for ab initio gene prediction27,28. To accurately determine splice sites and exon regions, we first assembled clean RNA-seq reads into transcripts using Trinity29, and gene structures were then built using PASA30. MAKER (v3.00) was used to integrate the gene sets predicted by the different methods into a non-redundant and more complete gene set31, after which PASA, combined with the transcriptome data, was used to update the gene structures30. The gene prediction results obtained with the different software tools are detailed in Table 8. In total, the genome annotation identified 20,421 protein-coding genes, with a median gene length of 48.43 kb. Exon-intron structure analysis revealed an average of 9.85 exons per gene, with an average exon length of 322.13 bp and an average intron length of 5,040.47 bp (Table 8). The average CDS length is 1,688.59 bp (Table 8).
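Structure statistics such as exons per gene and mean exon and intron lengths can be derived directly from exon coordinates. The following is a minimal sketch under stated assumptions: exon coordinates are 1-based and inclusive (as in GFF3), each gene is represented by a single transcript, and the gene identifiers and coordinates are toy values rather than YB_JAAS annotations.

```python
from statistics import mean


def structure_stats(genes):
    """Summarize exon-intron structure.

    genes maps a gene ID to a list of (start, end) exon coordinates,
    1-based and inclusive. Introns are the gaps between consecutive exons.
    """
    exon_counts, exon_lengths, intron_lengths = [], [], []
    for exons in genes.values():
        ordered = sorted(exons)
        exon_counts.append(len(ordered))
        exon_lengths.extend(end - start + 1 for start, end in ordered)
        intron_lengths.extend(
            nxt[0] - cur[1] - 1 for cur, nxt in zip(ordered, ordered[1:])
        )
    return {
        "exons_per_gene": mean(exon_counts),
        "mean_exon_length": mean(exon_lengths),
        # Single-exon genes contribute no introns.
        "mean_intron_length": mean(intron_lengths) if intron_lengths else 0.0,
    }


# Toy gene models, purely for illustration.
toy_genes = {
    "geneA": [(1, 100), (201, 300), (501, 600)],
    "geneB": [(1000, 1199)],
}
stats = structure_stats(toy_genes)
print(stats)
```

In a real annotation with several transcripts per gene, one representative transcript (for example the longest) would typically be chosen before computing these averages; that choice is not specified in the text.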

Fig. 4

Comparison of gene characteristics between YB_JAAS (Bos taurus) and related species.

Table 8 The gene prediction results via different methods.

Functional annotation of protein-coding genes

Functional predictions were derived from sequence similarity searches using BLASTP (v2.6.0+)32,33 against curated protein databases (NCBI NR, TrEMBL34, Swiss-Prot34) and domain databases (InterPro35), supplemented with KEGG36 pathway mapping, applying a conservative e-value threshold of 1e-5 for all alignments. Gene Ontology (GO) IDs for each gene were obtained from Blast2GO37. In total, 19,880 (97.35%) of the predicted protein-coding genes of the YB_JAAS genome could be functionally annotated (Table 9), with the InterPro, GO, KEGG, Swiss-Prot, TrEMBL, and NR databases covering 85.08%, 84.25%, 96.42%, 93.99%, 95.54%, and 97.24% of genes, respectively. Within the GO database, the cellular component category contained the highest number of annotations (91,115 terms) and genes (11,567 genes), followed by molecular function (82,501 terms, 11,147 genes) and biological process (56,524 terms, 9,543 genes) (Table S1). The KEGG database assigned 56.54% of genes to Brite Hierarchies, primarily associated with genetic information processing, metabolism, and signaling and cellular processes (Table S2). Among pathways, Human Diseases represented the largest proportion (16.21%), followed by Organismal Systems (15.16%). Other categories, such as Environmental Information Processing, Cellular Processes, Metabolism and Genetic Information Processing, comprised ~10% each (Table S2).
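The e-value filtering step can be illustrated on BLAST tabular output. The sketch below assumes the standard 12-column tabular layout (query ID in column 1, subject ID in column 2, e-value in column 11, bit score in column 12) and keeps the highest-scoring hit per query at or below the 1e-5 threshold. Whether the authors kept one hit or several per gene is not stated, so the best-hit rule, the function name, and the toy records are assumptions.

```python
import csv
import io

EVALUE_CUTOFF = 1e-5  # threshold quoted in the text


def best_hits(handle, cutoff=EVALUE_CUTOFF):
    """Return {query: (subject, evalue, bitscore)} for the best passing hit."""
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue > cutoff:
            continue  # discard alignments weaker than the threshold
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Toy tabular records (hypothetical gene and protein IDs).
toy_table = "\n".join("\t".join(fields) for fields in [
    ["gene1", "P1", "80.0", "100", "20", "0", "1", "100", "1", "100", "1e-30", "120"],
    ["gene1", "P2", "95.0", "100", "5", "0", "1", "100", "1", "100", "1e-50", "200"],
    ["gene2", "P3", "30.0", "40", "28", "0", "1", "40", "1", "40", "0.01", "25"],
])
hits = best_hits(io.StringIO(toy_table))
print(hits)

# Share of annotated genes, using the counts reported in the text.
annotated_pct = round(100 * 19_880 / 20_421, 2)
print(f"{annotated_pct}% of predicted genes annotated")
```

In the toy input, gene2 is dropped because its only hit exceeds the threshold, which mirrors how genes without a significant match remain unannotated.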

Table 9 Gene functional annotation based on various databases.

Annotation of non-coding RNA genes

tRNA sequences in the genome were identified with tRNAscan-SE on the basis of the structural characteristics of tRNA38. RNAmmer (version 1.2) was used to predict rRNAs39. In addition, miRNA and snRNA sequences were predicted using the covariance models of the Rfam families (http://xfam.org/) and the INFERNAL software provided by Rfam40,41. In total, we identified 269,330 tRNAs, 2,946 rRNAs (including 14 18S, 14 28S and 2,918 5S), 982 miRNAs and 1,795 snRNAs (including 199 CD-box, 364 HACA-box, 1,201 splicing and 31 scaRNA). The annotation results for the various types of non-coding RNA are summarized in Table 10. Among these types, tRNAs accounted for the largest proportion (67.49%; Table 10).
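As a quick consistency check, the subtype counts quoted above can be summed and compared with the reported class totals. This is a small illustrative sketch using only the numbers in the text; the variable names are arbitrary.

```python
# Subtype counts as reported in the text.
rrna_subtypes = {"18S": 14, "28S": 14, "5S": 2_918}
snrna_subtypes = {"CD-box": 199, "HACA-box": 364, "splicing": 1_201, "scaRNA": 31}

# Class totals as reported in the text.
reported_totals = {"rRNA": 2_946, "snRNA": 1_795}

computed_totals = {
    "rRNA": sum(rrna_subtypes.values()),
    "snRNA": sum(snrna_subtypes.values()),
}

for rna_class, total in computed_totals.items():
    status = "matches" if total == reported_totals[rna_class] else "differs from"
    print(f"{rna_class}: subtype sum {total} {status} the reported total")
```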

Table 10 Annotation results of non-coding RNA.

Ethics statement

The blood and tissue samples were obtained from the Molecular Breeding Research Laboratory, Institute of Animal Husbandry and Veterinary Medicine, JAAS, China. This study was approved by the Institutional Animal Ethics Committee of Jilin Academy of Agricultural Sciences (JAAS) under permit number JNK20221218-02.

Data Records

The whole-genome sequence data reported in this paper have been deposited in the Genome Warehouse at the National Genomics Data Center42,43, Beijing Institute of Genomics, Chinese Academy of Sciences / China National Center for Bioinformation, under accession numbers SAMC4847615 and PRJCA037127, and are publicly accessible at the NGDC Genome Warehouse, https://ngdc.cncb.ac.cn/gwh/Assembly/92360/show (2025)44. The assembly is also available at DDBJ/ENA/GenBank45 under the accession JBPCCZ000000000. The raw resequencing and RNA-seq data reported in this paper have been deposited in the Genome Sequence Archive46 at the National Genomics Data Center43, China National Center for Bioinformation / Beijing Institute of Genomics, Chinese Academy of Sciences (GSA: CRA023772), and are publicly accessible at the NGDC Genome Sequence Archive, https://ngdc.cncb.ac.cn/gsa/browse/CRA023772 (2025)47, and in the NCBI SRA database under accession SRP591175 (https://www.ncbi.nlm.nih.gov/sra/SRP591175)48. The genome annotation file is available in the Figshare database (https://doi.org/10.6084/m9.figshare.29310821)49.

Technical Validation

BUSCO was used to search the assembly for the single-copy orthologous gene set in OrthoDB and to calculate the proportions of complete, fragmented, and missing genes, thereby assessing the completeness of gene regions in the overall assembly. Our genome assembly (YB_JAAS) achieved a BUSCO completeness score of 93% (vertebrata_odb10), with 91.4% single-copy and 1.6% duplicated genes. The numbers of fragmented and missing BUSCO genes were 66 and 170, respectively (Table 4). In addition, the whole-genome resequencing reads mapped to the assembly at a very high ratio, averaging 99.56% (Table 5). Functional annotations from six databases covered ~97.35% of genes (Table 9). Together, these analyses confirm the high continuity and completeness of YB_JAAS.
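The BUSCO percentages can be related to the reported gene counts with a short calculation. The sketch below is illustrative only: the total of 3,354 genes for the vertebrata_odb10 lineage is an assumption that is not stated in the text, while the fragmented and missing counts are taken from the paragraph above.

```python
# Assumption: vertebrata_odb10 contains 3,354 BUSCO genes (not stated in the text).
TOTAL_BUSCOS = 3_354
FRAGMENTED = 66   # reported in the text
MISSING = 170     # reported in the text


def busco_summary(total, fragmented, missing):
    """Derive the complete count and the C/F/M percentages from raw counts."""
    complete = total - fragmented - missing

    def pct(count):
        return round(100 * count / total, 1)

    return {
        "complete": complete,
        "complete_pct": pct(complete),
        "fragmented_pct": pct(fragmented),
        "missing_pct": pct(missing),
    }


summary = busco_summary(TOTAL_BUSCOS, FRAGMENTED, MISSING)
print(summary)
```

Under that assumed lineage size, the derived percentages agree with the 93.0% complete, 2% fragmented, and 5.1% missing values quoted in this paper.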