Background & Summary

Alfalfa (Medicago sativa L.) is a perennial, high-quality legume forage with diverse uses. It offers many advantages, including rich nutrient content, high biomass yield, strong stress resistance, and wide adaptability, making it valuable for both feeding and ecological purposes1,2. As one of the most important forages in the world, alfalfa is planted in more than 80 countries, covering over 33 million hectares3. In China, the rapid expansion of animal husbandry has led to a growing demand for high-quality alfalfa. However, intensive cultivation and continuous cropping have increased the prevalence of diseases, pests, and weeds. These challenges significantly reduce the yield and quality of alfalfa, causing substantial losses to China’s animal husbandry. To date, 31 fungal diseases, four bacterial diseases, one viral disease, and one phytoplasma disease have been identified in alfalfa4. Among them, alfalfa root rot is an important root disease, causing annual yield losses of up to 40% and field mortality rates exceeding 60%5.

Alfalfa root rot is caused by a variety of pathogens, including bacteria, fungi, viruses, and nematodes. The primary fungal pathogens responsible for this disease are Fusarium spp., Phytophthora spp., Rhizoctonia spp., and Aphanomyces spp. These pathogens can cause alfalfa root rot either individually or in combination5,6,7. Studies have shown that Fusarium is the predominant cause of root rot, with up to 20 species or varieties of Fusarium reported in association with alfalfa root rot worldwide5,8. Furthermore, new pathogen species continue to emerge. Environmental factors such as temperature9 and soil moisture10 can influence the diversity of pathogens, leading to variations under different ecological conditions. In China, the Fusarium species causing alfalfa root rot are mainly F. oxysporum, F. solani, and F. acuminatum5. In Egypt, the primary pathogens are F. oxysporum, F. semitectum, and F. catenatum11, while in Canada, F. oxysporum, F. semitectum, and F. solani are the most prevalent12.

Fusarium is a fungal genus within the phylum Ascomycota13. As one of the most significant groups of plant pathogens, Fusarium can infect a variety of crops, including maize and wheat, leading to major diseases14,15,16. Certain species produce mycotoxins during infection, causing food poisoning in humans and animals17. Additionally, some Fusarium species can directly infect humans and animals, resulting in serious diseases18,19. Fusarium is known for its high heterogeneity, which makes classification challenging. However, accurate species identification is crucial for biological, epidemiological, and toxicological research. Current identification methods rely on morphological characteristics and DNA-based molecular approaches20,21.

In this study, F. neocosmosporiellum strain CA18-1 was isolated from infected alfalfa roots in China. This species has been shown to be pathogenic to plants including soybean22, peanut23, and mango24. We sequenced the whole genome of strain CA18-1 using a combination of long-read and short-read sequencing. We evaluated the genome assembly, predicted coding genes, non-coding RNAs, and repetitive sequences, and annotated the genes using various databases. This study provides insight into the genome structure and function of strain CA18-1, contributing to our understanding of its biological traits and pathogenic mechanisms.

Methods

Sample collection and extraction of genomic DNA

In this study, the alfalfa root rot fungus strain CA18-1 was isolated from diseased alfalfa roots collected from fields in Arhorchin Banner, Chifeng, China (Fig. 1A,B). Its morphological characteristics align with those of F. neocosmosporiellum24 (Fig. 1C–E). Strain CA18-1 was inoculated into potato dextrose broth (PDB) and cultured at 25 °C with shaking at 180 rpm for 5 days. The culture was filtered through four layers of sterile gauze to collect the mycelia. Genomic DNA was extracted from strain CA18-1 using a plant genome DNA extraction kit (Tiangen Biotech, Beijing, China). DNA quality was assessed using a Qubit (Thermo Fisher Scientific, Waltham, MA) and a Nanodrop (Thermo Fisher Scientific, Waltham, MA).

Fig. 1

Alfalfa (Medicago sativa L.) root rot disease and morphology of the fungal strain CA18-1. (A) Plant wilting. (B) Browning and decaying tissue at the plant stem base. (C) Top view of the colony cultured on potato dextrose agar (PDA) plates after seven days of incubation. (D) Microconidia. (E) Ascospores with ornate walls. Scale bars are 10 μm (D) and 20 μm (E).

Genome sequencing, data quality control and de novo assembly

Whole-genome sequencing of F. neocosmosporiellum strain CA18-1 was performed by Genedenovo Biotechnology Co., Ltd (Guangzhou, China) using both Oxford Nanopore PromethION (long-read) and Illumina NovaSeq X Plus (short-read) platforms. For Nanopore sequencing, a sequencing library was constructed using the ONT library construction kit SQK-LSK110 (Oxford Nanopore Technologies, Oxford, UK). An ABI StepOnePlus Real-Time PCR System (Life Technologies, CA, USA) was used to check library quality, and an Agilent 2100 (Agilent, Santa Clara, CA) was used to evaluate the insert fragment size, followed by use of the ONT kit EXP-NBD104/114. Sequencing was performed on the Oxford Nanopore PromethION platform (Oxford Nanopore Technologies, Oxford, UK). The average Nanopore sequencing depth was 234X. In total, 1,449,861 reads were obtained, with a mean read length of 13,076.1 bp and a read N50 of 22,698 bp (Table 1).

Table 1 Summary of Fusarium neocosmosporiellum strain CA18-1 Nanopore sequencing data.

For Illumina sequencing, a sequencing library was constructed using the Illumina DNA Prep Kit (Illumina, CA, USA). A DNA library with 300-400 bp insert fragments was first prepared, and genome sequencing was subsequently performed on the Illumina NovaSeq X Plus platform (read length: 150 bp). The average depth for Illumina reads was 40X. The raw data were filtered using FASTP (version 0.20.0)25. After filtering, 18,235,894 reads were retained; 2,611,981,952 bases had quality values of at least Q20 and 2,548,654,550 bases had quality values of at least Q30 (Table 2).
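The relationship between the read counts, quality totals, and depth quoted above can be sketched as follows. This is an illustrative check rather than the authors' procedure. It assumes that every retained read is a full 150 bp (an upper bound, because filtering may shorten reads; the exact base total belongs to Table 2) and that the 63,424,297 bp assembly length is used as the genome size. The function and variable names are hypothetical.

```python
# Figures quoted in the text for the filtered Illumina data.
CLEAN_READS = 18_235_894
READ_LENGTH_BP = 150
Q20_BASES = 2_611_981_952
Q30_BASES = 2_548_654_550
ASSEMBLY_SIZE_BP = 63_424_297  # assumption: assembly length used as genome size


def mean_depth(total_bases: int, genome_size: int) -> float:
    """Average sequencing depth: sequenced bases divided by genome length."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


def quality_fraction(bases_passing: int, total_bases: int) -> float:
    """Fraction of bases whose quality value reaches a given threshold."""
    if total_bases <= 0:
        raise ValueError("total_bases must be positive")
    return bases_passing / total_bases


# Assumption: no retained read was shortened, so this is an upper bound
# on the number of clean bases.
max_total_bases = CLEAN_READS * READ_LENGTH_BP

print(f"depth (upper bound): {mean_depth(max_total_bases, ASSEMBLY_SIZE_BP):.1f}X")
print(f"Q20 fraction (lower bound): {quality_fraction(Q20_BASES, max_total_bases):.2%}")
print(f"Q30 fraction (lower bound): {quality_fraction(Q30_BASES, max_total_bases):.2%}")
```

Under these assumptions the depth comes out slightly above 40X, which is in line with the reported average; the exact figure depends on the true clean-base total.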

Table 2 Summary of Fusarium neocosmosporiellum strain CA18-1 Illumina sequencing data.

The clean Nanopore reads were corrected using FMLRC (version 1.0.0)26, and the corrected reads were then assembled using Canu (version 2.2)27. To verify the assembly quality and determine the final genomic sequence, the Nanopore reads were aligned to the assembled sequence and the assembly was corrected using Racon (version 1.4.10)28. The Illumina reads were then aligned to the Racon-corrected sequence, which was further corrected using Pilon (version 1.23)29 with default parameters, yielding the corrected genomic sequence together with information about the correction sites. The final assembly of the F. neocosmosporiellum strain CA18-1 genome is summarized in Table 3. A de novo scaffold-level genome assembly of 63,424,297 bp was generated from 17 contigs, with a contig N50 of 6,480,858 bp and a contig N90 of 3,230,245 bp. The genome had a GC content of 49.76%, with the longest contig measuring 7,768,507 bp and the shortest measuring 41,112 bp (Fig. 2; Table 4).
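For readers unfamiliar with the contiguity statistics reported above, the following minimal sketch shows how N50, N90, and GC content are conventionally computed. The contig lengths and the sequence in the example are toy values, not those of strain CA18-1, and the function names are hypothetical.

```python
def nx(lengths, percent):
    """Return the Nx contig length (percent=50 gives N50, percent=90 gives N90).

    Contigs are sorted from longest to shortest; Nx is the length of the
    contig at which the running total first reaches percent% of the
    total assembly length.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= percent * total:
            return length


def gc_content(sequence):
    """Fraction of G and C bases among the A, C, G, T bases of a sequence."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    if acgt == 0:
        raise ValueError("sequence contains no A, C, G, or T bases")
    return gc / acgt


# Toy contig lengths (hypothetical), total length 300.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print("N50:", nx(toy_contigs, 50))
print("N90:", nx(toy_contigs, 90))
print("GC:", gc_content("ATGCGC"))
```

Applied to the 17 contig lengths of the final assembly, the same calculation should reproduce the contig N50 and N90 listed in Table 3.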

Table 3 Summary of Fusarium neocosmosporiellum strain CA18-1 genome assembly.
Fig. 2

Circular map of Fusarium neocosmosporiellum strain CA18-1 genome assembly. (a) Physical locations of 17 contigs. Bar = 1 kb. (b,c) Protein-coding genes on the forward and reverse strands, annotated with KOG classes. (d) Gene density. (e) Transposable element (TE) density. (f) Tandem repeat density.

Table 4 Coding gene prediction of Fusarium neocosmosporiellum strain CA18-1.

Gene prediction and annotation

Coding genes represent the core functional regions of the genome of a species, encoding proteins necessary for the physiological and biochemical activities of the organism. Prior to gene prediction, the genome was subjected to repeat masking using BEDtools (version 2.28.0)30. The gene prediction analysis was then carried out on the repeat-masked genome utilizing Funannotate (version 1.8.9)31, resulting in a total of 28,006 predicted genes in the genome of F. neocosmosporiellum strain CA18-1 (Table 4). The total length of these genes was 51,444,891 bp, which accounted for 81.11% of the entire genome length. The distribution map of gene lengths showed that the largest number of genes fell within the >3000 bp range, totaling 2,159 genes (Fig. 3).
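The proportion of the genome covered by predicted genes, and a length distribution of the kind shown in Fig. 3, can be sketched as below. The 500 bp bin width and the bin labels are assumptions made for illustration (only the >3000 bp class is named in the text), and the gene lengths in the example are toy values.

```python
from collections import Counter

TOTAL_GENE_LENGTH_BP = 51_444_891  # from the text
ASSEMBLY_SIZE_BP = 63_424_297      # from the text


def bin_label(length, width=500, cap=3000):
    """Assign a gene length (bp) to a bin such as '1-500' or '>3000'.

    The bin width and cap are hypothetical choices for this sketch.
    """
    if length < 1:
        raise ValueError("length must be at least 1 bp")
    if length > cap:
        return f">{cap}"
    lower = ((length - 1) // width) * width
    return f"{lower + 1}-{lower + width}"


def length_histogram(lengths, width=500, cap=3000):
    """Count genes per length bin."""
    return Counter(bin_label(n, width, cap) for n in lengths)


# Share of the assembly covered by predicted genes, as a percentage.
gene_percent = 100 * TOTAL_GENE_LENGTH_BP / ASSEMBLY_SIZE_BP
print(f"genes cover {gene_percent:.2f}% of the assembly")

# Toy gene lengths (hypothetical).
toy_lengths = [320, 500, 501, 1499, 2999, 3000, 3001, 5200]
print(length_histogram(toy_lengths))
```

The percentage computed from the two totals agrees with the 81.11% stated in the text.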

Fig. 3

Length distribution of predicted protein-coding genes in Fusarium neocosmosporiellum strain CA18-1.

Among non-coding RNAs, rRNAs were predicted using RNAmmer (version 1.2)32 and tRNAs were identified using tRNAscan-SE (version 1.3.1)33. Small RNAs (sRNAs) and microRNAs (miRNAs) were predicted using cmscan (version 1.1.2)34 by comparison with the Rfam database34. This analysis identified 345 tRNAs, 3 sRNAs, 0 miRNAs, 38 28S rRNAs, 36 18S rRNAs, and 65 5S rRNAs in the genome of F. neocosmosporiellum strain CA18-1 (Table 5). Tandem repeats were predicted using Tandem Repeats Finder (TRF, version 4.09.1)35, while interspersed repeats were identified using EDTA (version 2.0.0)36. The repeat sequence predictions for strain CA18-1 are summarized in Table 6.

Table 5 Statistics of non-coding RNA in Fusarium neocosmosporiellum strain CA18-1.
Table 6 Statistics of repeat sequence in Fusarium neocosmosporiellum strain CA18-1.

Our genome annotation pipeline employed an integrated strategy to maximize functional insights. Initial gene predictions underwent multi-database validation: predicted protein sequences were first queried against NCBI’s non-redundant database37 using BLASTp (e-value cutoff 1 × 10^-5) for broad functional classification, followed by protein alignment against the Swiss-Prot database38. The KOG39, KEGG40,41, and GO42 databases were used for metabolic pathway reconstruction. Subsequent domain characterization used Pfam-Scan (version 1.6) with the Pfam library43 (version 32.0) to identify conserved protein motifs unique to Fusarium species. Through this multi-tiered annotation pipeline, we functionally characterized 15,389 high-confidence unigenes (54.9% of the 28,006 predicted genes); 98.36% of these unigenes received at least one functional assignment supported by evidence from multiple databases (Table 7).
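The e-value filtering step described above can be illustrated with a short sketch. It assumes that the BLASTp results are available in the standard 12-column tabular format (outfmt 6: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore); the output format used in this study is not stated, and the gene and subject identifiers below are hypothetical. Keeping the highest-scoring hit per query is likewise an illustrative choice.

```python
EVALUE_COLUMN = 10    # 0-based index of the e-value in outfmt 6
BITSCORE_COLUMN = 11  # 0-based index of the bit score in outfmt 6


def best_hits(lines, max_evalue=1e-5):
    """Keep, for each query, the highest-bit-score hit with e-value <= cutoff.

    Returns a dict mapping query ID to (subject ID, e-value, bit score).
    """
    best = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # skip malformed rows
        query, subject = fields[0], fields[1]
        evalue = float(fields[EVALUE_COLUMN])
        bitscore = float(fields[BITSCORE_COLUMN])
        if evalue > max_evalue:
            continue
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Toy BLASTp rows (hypothetical identifiers and values).
toy_rows = [
    "\t".join(["geneA", "subj1", "62.0", "300", "110", "2", "1", "300", "5", "304", "2e-40", "150"]),
    "\t".join(["geneA", "subj2", "75.5", "310", "70", "1", "1", "310", "1", "310", "1e-60", "210"]),
    "\t".join(["geneB", "subj3", "31.0", "80", "50", "3", "10", "90", "20", "100", "0.002", "35"]),
    "\t".join(["geneC", "subj4", "45.2", "150", "80", "2", "1", "150", "3", "152", "3e-06", "50"]),
]

hits = best_hits(toy_rows)
for query, (subject, evalue, bitscore) in sorted(hits.items()):
    print(query, subject, evalue, bitscore)

# Share of predicted genes with a functional characterization (from the text).
print(f"{100 * 15_389 / 28_006:.1f}% of predicted genes")
```

In this toy input, geneB is discarded because its only hit exceeds the cutoff, and geneA retains the hit with the higher bit score.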

Table 7 Genome function annotations of Fusarium neocosmosporiellum strain CA18-1 in different databases.

Data Records

The Illumina and Nanopore sequencing data have been deposited in the NCBI Sequence Read Archive database under the SRA accession SRP56537044. The assembled genome is available in GenBank under the accession number GCA_041296245.145. The genome annotation results have been deposited in the figshare46 database. The deposited CDS and protein FASTA files contain only the 15,389 high-confidence unigenes with complete coding sequences. The GFF file (CA18-1.gff) includes all 28,006 predicted gene models.

Technical Validation

The Illumina short reads served as input for Jellyfish (version 2.3.017)47, from which a k-mer (k = 17) frequency distribution was obtained (Fig. 4). After discarding k-mers with abnormal depth, the genome size was estimated using the formula: genome size = number of k-mers / average k-mer depth. The estimated genome size of F. neocosmosporiellum strain CA18-1 was 59.01 Mb, with a heterozygosity rate of 0.0807%. Assembly quality was evaluated based on contig length and BUSCO analysis. The contig N50 of the assembly was approximately 6.5 Mb (6,480,858 bp), with the longest contig reaching 7.8 Mb. BUSCO (version 5.2.2)48 was used to assess the completeness of the genome assembly against the ascomycota dataset. The assessment indicated that the assembly of CA18-1 was both complete and accurate, as single-copy orthologous genes in F. neocosmosporiellum strain CA18-1 matched 98.68% of the 758 complete core genes in the ascomycota dataset (Table 8).
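The genome size formula above can be sketched as follows. The input is assumed to be a two-column k-mer histogram (depth, number of distinct k-mers) of the kind written by `jellyfish histo`. Two simplifications are assumptions of this sketch and may differ from the procedure used here: only low-depth k-mers below a hypothetical cutoff are discarded as abnormal, and the depth of the main peak is taken as the average k-mer depth. The histogram values are toy numbers.

```python
def parse_histo(lines):
    """Parse 'depth count' lines into a dict mapping depth to k-mer count."""
    histo = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        histo[int(parts[0])] = int(parts[1])
    return histo


def estimate_genome_size(histo, min_depth):
    """Estimate genome size as total k-mers divided by the main-peak depth.

    K-mers with depth below min_depth are treated as sequencing errors and
    dropped (a hypothetical cutoff for this sketch).
    """
    kept = {depth: count for depth, count in histo.items() if depth >= min_depth}
    if not kept:
        raise ValueError("no k-mers left after filtering")
    peak_depth = max(kept, key=kept.get)  # depth with the most distinct k-mers
    total_kmers = sum(depth * count for depth, count in kept.items())
    return total_kmers / peak_depth


# Toy histogram (hypothetical): an error tail at low depth and a peak at 20X.
toy_histo_lines = [
    "1 5000",
    "2 300",
    "18 100",
    "19 400",
    "20 1000",
    "21 400",
    "22 100",
]

histo = parse_histo(toy_histo_lines)
print("estimated genome size:", estimate_genome_size(histo, min_depth=5))
```

With the toy data, the retained k-mers total 40,000 and the peak lies at 20X, giving an estimate of 2,000 positions; the same arithmetic applied to the real histogram underlies the 59.01 Mb estimate.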

Fig. 4

k-mer (k = 17) frequency distribution generated using Illumina sequencing data of Fusarium neocosmosporiellum strain CA18-1. The x-axis shows k-mer depth (coverage), while the y-axis represents frequency. The primary peak at ~40X indicates the dominant homozygous genome component, with the minor peak at ~20X representing heterozygous regions (heterozygosity rate: 0.0807%).

Table 8 Evaluation of protein-coding genes in Fusarium neocosmosporiellum strain CA18-1 using the BUSCO library.