Background & Summary

Flemingia macrophylla is a perennial shrub of the genus Flemingia in the family Fabaceae1,2. This evergreen species exhibits climbing or trailing growth habits2, trifoliate compound leaves bearing ovate to elliptical leaflets, and vibrant papilionaceous flowers with a tubular corolla base (Fig. 1a). It displays considerable ecological plasticity and is commonly found in open grasslands, shrublands, sunny forest margins, and along valley roadsides3,4. It is native to tropical and subtropical regions of Asia, including southern China (notably Guizhou, Yunnan, and Guangxi provinces), Southeast Asia, and India5, and has also spread to Africa and South America6.

Fig. 1 Photos and genomic characteristics of F. macrophylla. (a) The leaves, roots, and flowers of F. macrophylla. (b) Genomic characteristics of F. macrophylla. The tracks from outer to inner circle represent the eleven chromosomes (Chr1-Chr11), gene density, GC content, LAI score distribution, LTR content and syntenic gene blocks within the genome indicated by connecting lines. (c) K-mer depth distribution for genome size estimation of F. macrophylla. (d) The Hi-C interaction heatmap for F. macrophylla.

Flemingia macrophylla has a long history of traditional use and a growing body of scientific evidence supporting its diverse pharmacological activities. In traditional Chinese medicine (TCM), it has been employed to dispel wind and eliminate dampness, promote blood circulation, and detoxify7. Its roots and stems are traditionally used to treat rheumatism and alleviate bone pain2,8. In Indian folk medicine, the leaves are commonly used in diabetes management7,9. Modern pharmacological studies further support its therapeutic potential by identifying bioactive compounds, such as flavonoids, that exhibit significant in vitro antioxidant10, anti-inflammatory, and antitumor activities2. In addition, the plant’s extracts are rich in legume-specific isoflavones11, which show neuroprotective potential against Alzheimer’s disease8,12 and therapeutic potential for osteoporosis13,14.

Although previous studies have assembled the chloroplast genome15 and nuclear genome16 of F. macrophylla, providing genetic insights into this medicinal plant, research at the nuclear genome level remains insufficient. In this study, we completed a chromosome-level genome assembly and annotation of F. macrophylla using high-fidelity (HiFi) long reads generated on the Pacific Biosciences (PacBio) platform, combined with chromosome conformation capture (Hi-C) data, providing a high-quality genomic resource that complements the previously published Nanopore-based assembly by Ding et al.16. In terms of contiguity, the genome assembled in this study has a total size of 1.13 Gb, with a contig N50 of 68.75 Mb and a scaffold N50 of 105.36 Mb, both higher than those of the previously published version (59.43 Mb and 100.63 Mb, respectively16) (Table 1). Compared to previous studies that relied on Nanopore sequencing and multiple rounds of error correction, our approach leveraged highly accurate PacBio HiFi reads and the hifiasm assembler optimized for diploid genomes, resulting in a more contiguous and accurate assembly with fewer redundant sequences and minimal polishing steps17. Finally, 1.06 Gb (93.29%) of the assembled sequences were successfully anchored and oriented onto 11 pseudochromosomes (Fig. 1b), thereby reducing assembly fragmentation. By integrating transcriptome-based, homology-based, and de novo prediction approaches, this study predicted 28,548 protein-coding genes, with a BUSCO completeness of 97.8% (Table 2), representing an improvement over the previously published 97.6%16. A total of 27,936 genes (97.86%) were annotated across multiple databases (Table 3), outperforming the previously reported annotation rate of 95.01%16. The successful construction of a high-quality reference genome for F. macrophylla enriches the genomic resources of the Fabaceae, providing a solid foundation for future genomic and evolutionary studies of the genus Flemingia. This achievement ultimately contributes to the sustainable development and utilization of medicinal plant resources.

Table 1 Comparison of the F. macrophylla genome assembly with the previously published Nanopore-based assembly.
Table 2 BUSCO assessment results of F. macrophylla.
Table 3 Statistics of F. macrophylla genome assembly and annotation.

Methods

Sample collection and sequencing

In November 2023, young healthy roots of F. macrophylla were collected from one individual at the Guangxi Botanical Garden of Medicinal Plants, Nanning, Guangxi, China (22°51′30″ N, 108°22′39″ E). Leaves were cleaned, flash-frozen in liquid nitrogen, preserved on dry ice, and subsequently used for genomic DNA extraction with the cetyltrimethylammonium bromide (CTAB) method18. For PacBio HiFi sequencing, two 20-kb SMRTbell libraries were prepared and sequenced on the PacBio Sequel II platform in Circular Consensus Sequencing (CCS) mode using two SMRT cells, generating 45.88 Gb of high-quality filtered data (Table 4). Roots were used for Hi-C library preparation (chromatin cross-linking, MboI digestion, end repair, proximity ligation, and purification), and the library was sequenced in paired-end mode (2 × 150 bp) on the Illumina NovaSeq 6000 platform. RNA was extracted from the roots using TRIeasy™ Total RNA Extraction Reagent (Yeasen, China). RNA-seq libraries were constructed and then sequenced in paired-end mode (2 × 150 bp) on the Illumina NovaSeq 6000 platform, generating high-quality transcriptomic data for gene prediction and functional annotation.

Table 4 HiFi sequencing data statistics.

Genome survey

To estimate the genome size, heterozygosity and repeat content, a 21-mer frequency analysis was performed using Jellyfish v2.3.019 on high-quality filtered HiFi reads. The k-mer frequency distribution was then modeled with GenomeScope v2.020 under a diploid assumption (-p 2). The analysis estimated a genome size of approximately 1.07 Gb, with a low heterozygosity rate of 0.001% and a repeat content of 59.7%. The unique sequence portion accounted for 41.5% of the genome, and the major k-mer peak occurred at a coverage depth of ~18.8×. The estimated sequencing error rate was 0.156%, and the model exhibited a high goodness-of-fit (100%), indicating that the data were well suited for genome characterization (Fig. 1c).
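The idea behind k-mer-based size estimation can be illustrated with a deliberately simplified Python sketch. It is not GenomeScope's mixture model: it assumes a two-column "depth count" histogram of the kind written by `jellyfish histo`, treats the most populated bin above an error cutoff as the main peak, and divides the total number of k-mers by that peak depth. The `min_depth` cutoff and the toy histogram are hypothetical values chosen for illustration.

```python
from typing import Dict


def parse_histogram(text: str) -> Dict[int, int]:
    """Parse 'depth count' lines into a {depth: number of distinct k-mers} map."""
    histogram: Dict[int, int] = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            histogram[int(fields[0])] = int(fields[1])
    return histogram


def estimate_genome_size(histogram: Dict[int, int], min_depth: int = 5) -> float:
    """Naive estimate: total k-mers above the error cutoff / depth of the main peak."""
    # Bins below min_depth are assumed to be dominated by sequencing errors.
    solid = {depth: n for depth, n in histogram.items() if depth >= min_depth}
    if not solid:
        raise ValueError("no histogram bins at or above min_depth")
    peak_depth = max(solid, key=solid.get)
    total_kmers = sum(depth * n for depth, n in solid.items())
    return total_kmers / peak_depth


# Toy histogram (hypothetical numbers): an error tail plus one peak at depth 19.
TOY_HISTO = "1 1000\n2 500\n18 10\n19 100\n20 10"
print(estimate_genome_size(parse_histogram(TOY_HISTO)))  # 120.0
```

A model-based tool remains preferable for real data, because heterozygous and repeat-derived peaks distort this single-peak estimate.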

De novo genome assembly

HiFi long reads generated by PacBio sequencing technology were de novo assembled using hifiasm v0.25.021 with default parameters, which are optimized for diploid genomes. The ~45.88 Gb of filtered HiFi data correspond to an estimated ~43× coverage of the ~1.07 Gb genome, providing a solid basis for the assembly. The primary assembly output was then processed with Purge Haplotigs v1.0.422 to remove residual redundancies, yielding a non-redundant haploid assembly. The F. macrophylla genome assembly totaled 1.13 Gb, with a contig N50 of 68.75 Mb (Table 3). To improve genome assembly contiguity23, draft contigs were scaffolded into a chromosome-scale assembly using the 3D-DNA pipeline24, guided by chromatin interaction data derived from uniquely mapped Hi-C reads25. The workflow was as follows:

Hi-C data preprocessing and integration: Hi-C sequencing data were processed using Juicer26 to generate a genome-wide contact frequency matrix. Leveraging the principle that physically proximal genomic regions exhibit higher interaction frequencies, contigs were preliminarily assigned to putative chromosome groups based on their interaction patterns.

Chromosomal scaffolding: the 3D-DNA software was employed to construct chromosome-scale scaffolds by ordering and orienting contigs and estimating the gaps between them.

Manual curation: using Juicebox27, Hi-C contact heatmaps were examined to manually adjust scaffold orientations, correct misassemblies, and validate the contig order, ensuring agreement with the physical interaction patterns captured by Hi-C.

Ultimately, a chromosome-level genome assembly was successfully constructed (Fig. 1d). Assembly statistics were computed using QUAST v5.3.028. A total of 1.06 Gb of sequences were anchored to eleven putative chromosomes (Table 5), with an anchoring rate of 93.29%. The scaffold N50 of the final chromosome-level genome reached 105.36 Mb, representing a 53% improvement over the contig N50 (68.75 Mb) from the preliminary assembly. This result clearly demonstrates the effectiveness of Hi-C technology in facilitating chromosome-scale genome assembly by capturing long-range genomic interactions.
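For readers unfamiliar with the N50 statistic, the following Python sketch shows how it is defined and reproduces the relative gain quoted above. It is an illustration only, not QUAST's implementation; the toy contig lengths are hypothetical, while 105.36 Mb and 68.75 Mb are the scaffold and contig N50 values reported in this study.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the length L such that sequences of length >= L hold at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no sequence lengths given")


def percent_gain(new: float, old: float) -> float:
    """Relative increase of `new` over `old`, in percent."""
    return (new - old) / old * 100


# Toy contig set (hypothetical): total 300, so the N50 is reached at 80 + 70 = 150.
print(n50([80, 70, 50, 40, 30, 20, 10]))      # 70

# Scaffold N50 versus contig N50 reported for this assembly (Mb).
print(round(percent_gain(105.36, 68.75), 1))  # 53.3
```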

Table 5 Summary of the eleven pseudochromosomes.

Repetitive sequence annotation

Repetitive sequence regions in genomes can compromise the accuracy of gene prediction and increase computational burden. A combination of de novo and homology-based prediction approaches was therefore employed to identify and mask repetitive sequences in the F. macrophylla genome prior to structural annotation. De novo prediction was performed using RepeatModeler v2.0.529, which integrates the RepeatScout v1.0.730 and RECON31 tools to identify, refine, and classify potential repetitive elements32, thereby constructing a custom repeat library. RepeatMasker v4.1.033 was subsequently applied to annotate repetitive sequences using a combined repeat library consisting of the custom library and the Dfam 3.1 database34. In F. macrophylla, repetitive sequences accounted for approximately 59.58% of the genome, with LTR retrotransposons representing the most abundant class at 39.25% (Table 6).

Table 6 Statistical results of repetitive sequences in F. macrophylla.

Gene structure prediction

Gene structure prediction for the F. macrophylla genome was performed using GETA v2.4.12 (https://github.com/chenlianfu/geta), which integrates three approaches: transcriptome-based, homology-based, and de novo predictions. For transcriptome-based prediction, raw reads were quality-trimmed using Trimmomatic35 and aligned to the genome using HISAT236, and coding sequences were predicted with TransDecoder v5.7.1 (https://github.com/TransDecoder/TransDecoder). Homology-based prediction was performed using GeneWise v2.4.137, with protein sequences from five closely related species (Lupinus albus, Cicer arietinum, Glycine max, Phaseolus acutifolius, and Lotus japonicus) as queries. De novo gene prediction was carried out using AUGUSTUS v3.5.038. By integrating these three lines of evidence, GETA produced a final set of 28,548 protein-coding gene models (Table 3). BUSCO assessment showed 97.8% complete BUSCOs, further indicating a high-quality annotation (Table 2).

Gene functional annotation

Protein sequences of F. macrophylla were aligned against the National Center for Biotechnology Information (NCBI) non-redundant (NR) and Swiss-Prot protein databases using DIAMOND BLASTP v2.1.10.16439, with an E-value cutoff of 1e-5, to retrieve sequence similarity and functional annotation information. Functional annotations were further assigned using eggNOG-mapper v2.1.1240 based on the eggNOG database, which also provided Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway information. InterPro annotations were obtained using InterProScan v5.54-87.041. Gene Ontology (GO) terms were integrated from the annotation results of both eggNOG-mapper and InterProScan (Table 3).
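The E-value filter described above can be illustrated with a short Python sketch. It assumes the alignments were written in the 12-column BLAST tabular layout (E-value in column 11, bit score in column 12). The output format and the way multiple hits per protein were resolved are not stated here, so keeping the highest-scoring hit per query is an assumption, and all identifiers in the sample are hypothetical.

```python
import csv
import io
from typing import Dict, Tuple

Hit = Tuple[str, float, float]  # (subject ID, E-value, bit score)


def best_hits(tsv_text: str, max_evalue: float = 1e-5) -> Dict[str, Hit]:
    """Keep, for each query, the highest bit-score hit with E-value <= max_evalue."""
    best: Dict[str, Hit] = {}
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if len(row) < 12:
            continue  # skip blank or malformed lines
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue > max_evalue:
            continue  # fails the E-value cutoff
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Hypothetical alignments: geneA has two passing hits, geneB only a weak one.
SAMPLE = "\n".join([
    "\t".join(["geneA", "protX", "92.1", "300", "24", "0", "1", "300", "1", "300", "1e-120", "540"]),
    "\t".join(["geneA", "protY", "61.0", "280", "109", "2", "5", "284", "9", "288", "3e-40", "160"]),
    "\t".join(["geneB", "protZ", "35.0", "60", "39", "1", "10", "69", "4", "63", "2e-3", "38"]),
])
print(best_hits(SAMPLE))
```

In this sample only geneA retains an annotation, because the single geneB hit has an E-value above 1e-5.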

Non-coding RNA annotation

The transfer RNA (tRNA) genes were predicted using tRNAscan-SE v2.0.1242 with default parameters. Ribosomal RNA (rRNA) and other non-coding RNAs (ncRNAs) were annotated using Infernal v1.1.543 in combination with the Rfam 15.044 database. In total, 1,116 rRNA genes, 2,265 small nuclear RNA (snRNA) genes, 124 microRNA (miRNA) genes, 583 tRNA genes, and 8 small RNA (sRNA) genes were identified in the F. macrophylla genome (Table 3).

Data Records

The sequencing reads generated in this study have been deposited in the NCBI Sequence Read Archive (SRA) under the BioProject accession number PRJNA1308524 (Hi-C reads: SRR3519686345, PacBio HiFi reads: SRR3519686446, and RNA-Seq reads: SRR3519685847, SRR3519685948, SRR3519686049, SRR3519686150, SRR3519686251). The chromosome-level genome assembly and associated annotation files have been deposited in the Figshare database (https://doi.org/10.6084/m9.figshare.29986939.v4)52.

Technical Validation

QUAST v5.3.028 was employed to evaluate the genome assembly quality, focusing on assembly size and continuity. The assembled genome size reached 1.13 Gb, with a contig N50 of 68.75 Mb and a scaffold N50 of 105.36 Mb (Table 3). Genome assembly completeness was assessed using Benchmarking Universal Single-Copy Orthologs (BUSCO) v5.8.353 with the embryophyta_odb10 dataset54. A total of 93.4% of BUSCOs were identified as complete and single-copy, 3.5% as complete and duplicated, 1.2% as fragmented, and 1.9% as missing (Table 2). The high overall completeness (96.9%, the sum of single-copy and duplicated BUSCOs) and low fragmentation rate indicate that the genome assembly of F. macrophylla is highly complete and reliable55. The LTR Assembly Index (LAI) was further used to evaluate the assembly quality of LTR retrotransposon regions, with higher scores reflecting greater structural integrity56. Using LTR_retriever v3.0.157, the assembled genome achieved an LAI score of 14.31, exceeding the threshold of 10 for a moderately high-quality LTR assembly and thus indicating high structural integrity in these regions56. Additionally, BUSCO assessment of the predicted gene set revealed 97.8% complete BUSCOs against the benchmark set of 2,326 conserved genes (Table 2).
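The arithmetic behind these quality metrics can be checked with a small Python sketch. The BUSCO percentages and the LAI score are the values reported above; the three LAI categories (draft below 10, reference from 10 to below 20, gold from 20 upward) follow the commonly cited classification proposed with the index and are an assumption here rather than something defined in this study.

```python
def busco_complete(single: float, duplicated: float,
                   fragmented: float, missing: float) -> float:
    """Return complete BUSCOs (%) after checking that the four categories sum to ~100%."""
    total = single + duplicated + fragmented + missing
    if abs(total - 100.0) > 0.05:
        raise ValueError(f"BUSCO categories sum to {total:.1f}%, expected 100%")
    return round(single + duplicated, 1)


def lai_category(lai: float) -> str:
    """Classify an LAI score using the assumed draft/reference/gold thresholds."""
    if lai < 10:
        return "draft"
    if lai < 20:
        return "reference"
    return "gold"


print(busco_complete(93.4, 3.5, 1.2, 1.9))  # 96.9
print(lai_category(14.31))                  # reference
```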