Background & Summary

The vetch aphid, Megoura crassicauda, is a worldwide agricultural pest that primarily infests legumes and alfalfa plants, including but not limited to Vicia faba, Pisum sativum, and Lathyrus quinquenervius1,2,3,4. Both nymphs and adults of M. crassicauda feed preferentially on phloem sap from tender leaves and stems, inflicting greater damage on legumes and alfalfa than many co-occurring pests5 (Fig. 1). This feeding disrupts essential physiological processes in host plants, severely impairing growth, development, and reproduction, particularly flowering and fruiting3,6. Under heavy infestation, crops may wither and die, resulting in substantial yield losses and reduced quality7. Additionally, M. crassicauda transmits multiple plant-pathogenic viruses among different crops6.

Fig. 1
Genome assembly of Megoura crassicauda. (A) Morphology of M. crassicauda. (B) Genome scope profiles of 17-mer analysis. (C) Hi-C interactive heatmap of 6 linkage pseudo-chromosomes in M. crassicauda genome. Color indicates the intensity of the interaction signal. The darker the color, the higher the intensity. (D) Circle genome landscape of M. crassicauda. Circle a represents chromosomes, while circles b-e indicate gene density, DNA transposon density, long terminal repeat retrotransposon density and GC content of each respective chromosome, respectively.

As a significant pest of legumes, M. crassicauda has a broad global distribution, with notable prevalence in China, Russia, the Korean Peninsula, and Japan2,3. Traditional strategies to control this pest rely mainly on insecticides. However, prolonged and excessive application accelerates the evolution of resistance in M. crassicauda to diverse pesticides and triggers adverse ecological consequences, including pesticide residues, environmental contamination, and suppression of natural enemies3,8,9,10. Elucidating the molecular mechanisms underpinning pesticide resistance is therefore imperative for developing innovative approaches to managing this destructive pest. Nevertheless, the genetics of M. crassicauda remain poorly characterized: existing research is largely confined to basic biological investigations and lacks deeper exploration of gene function. Consequently, generating a high-quality reference genome is essential to advance genetic understanding and enable effective control of this pest.

Aphids (Hemiptera: Aphididae) represent a diverse insect group comprising over 4,700 described species globally11,12. Despite this diversity, whole-genome assemblies are available for only 86 species, of which merely 22 have been assembled to chromosomal resolution; most of the remaining genomes are still at the scaffold or even contig level13. Notably, genomes have been obtained for many aphids that feed on major crops such as wheat, corn, cotton, and sorghum, including Diuraphis noxia14, Sitobion avenae15, Schizaphis graminum16, Rhopalosiphum padi17, Sitobion miscanthi18, Rhopalosiphum maidis19, Aphis gossypii20, Melanaphis sacchari21, Myzus persicae22, and Eriosoma lanigerum23, with some species possessing multiple chromosomal assemblies. In contrast, genomic resources for legume-specialized aphids remain scarce: only Acyrthosiphon pisum13 and Aphis glycines24 have been sequenced. This critical genomic gap impedes fundamental research on legume aphids and constrains broader aphid studies. Generating a high-quality M. crassicauda genome is therefore imperative to advance research on its genetics, biology, and ecology, ultimately providing a theoretical foundation for optimized management strategies against this significant pest.

In this study, we integrated short-read sequencing, PacBio high-fidelity (HiFi) sequencing, and high-resolution chromosome conformation capture (Hi-C) techniques to generate a high-quality chromosome-level genome assembly for M. crassicauda (Table 1). The assembly spans 440.80 Mb and comprises 179 contigs with an N50 length of 41.57 Mb (Table 2). The GC content was 29.77%, and 94.82% of the assembly was anchored to 6 chromosomes (Table 3). The predicted transposable elements and tandem repeats constituted 36.24% and 4.15% of the genome, respectively (Table 4). Furthermore, we predicted 19,687 protein-coding genes (Table 5) and annotated them against six databases: NR, InterPro, GO, KEGG, TrEMBL, and Swiss-Prot. These annotations yielded the following gene counts: NR (19,352), TrEMBL (19,114), KEGG (19,154), InterPro (13,914), Swiss-Prot (12,100), and GO (8,233) (Table 6). Collectively, 19,424 genes received functional annotations, accounting for 98.66% of all protein-coding genes.

Table 1 Statistics of sequencing data of M. crassicauda genome.
Table 2 Statistics of genome assembly of M. crassicauda at the chromosomal level.
Table 3 Statistics of Hi-C assembly results.
Table 4 Classification of repeat elements in M. crassicauda genome.
Table 5 Gene annotation statistics of M. crassicauda genome.
Table 6 Functional annotation statistics of M. crassicauda genome.

Methods

Sample material

A Megoura crassicauda colony, originally collected in Anyang city (114°20′ E, 36°6′ N), China, was reared on Vicia faba in the laboratory under controlled conditions of 25 ± 0.5 °C, 60 ± 5% relative humidity, and a photoperiod of 16 h light: 8 h dark25. Before sampling, the adults were soaked in 1% sodium hypochlorite solution for 5 min, rinsed in sterile water, immersed twice in 70% ethanol, and then washed again in sterile water. Before extraction of genomic DNA and RNA, the samples were quickly transferred to collection tubes, flash-frozen in liquid nitrogen, and stored at −80 °C.

Genome size and heterozygosity were assessed using pooled DNA from 300 adults, and whole-genome sequencing was performed on the PacBio Sequel II platform. In addition, 500 adults were used for Hi-C library construction and sequencing, and more than 800 individuals, including apterous and alate nymphs and adults at different developmental stages, were used for transcriptome library construction and sequencing.

Genome sequencing

High-quality genomic DNA was extracted by the CTAB method and purified using a QIAGEN® kit (QIAGEN, Germany). The extracted DNA was run on a 1% agarose gel (140 V, 15 min) to check its integrity and detect possible degradation. Next, sample purity was checked with a NanoDrop One (Thermo Fisher Scientific, USA); samples with an OD260/280 ratio of 1.8–2.0 and an OD260/230 ratio of 2.0–2.2 were considered acceptable. Finally, the DNA concentration was measured using an Invitrogen Qubit 4.0 fluorometer (Thermo Fisher Scientific, USA).
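The purity criteria above amount to a simple range check on two absorbance ratios. The following is a minimal sketch of that check; the function name and the sample readings are hypothetical and are not taken from the study.

```python
def dna_passes_purity_check(od260_280: float, od260_230: float) -> bool:
    """Return True when both absorbance ratios fall inside the accepted ranges
    stated in the text (OD260/280 of 1.8-2.0 and OD260/230 of 2.0-2.2)."""
    return 1.8 <= od260_280 <= 2.0 and 2.0 <= od260_230 <= 2.2


# Hypothetical NanoDrop readings: sample name -> (OD260/280, OD260/230).
readings = {
    "sample_A": (1.85, 2.10),
    "sample_B": (1.65, 2.05),  # fails: OD260/280 is below 1.8
}

accepted = [name for name, (r280, r230) in readings.items()
            if dna_passes_purity_check(r280, r230)]
print(accepted)
```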

For the genome survey, genomic DNA was sheared into fragments of 300–500 bp using a Covaris ultrasonic disruptor, and the library was then built through end repair, A-tailing, sequencing adapter ligation, purification, and PCR amplification. The constructed library was sequenced using the ‘PE150’ strategy on the BGI MGISEQ platform (Beijing Genomics Institute, China). For long-read sequencing, the genomic DNA was sheared with g-TUBEs (Covaris, USA), and only DNA fragments smaller than 15 kb were used to construct the library. Single-stranded overhangs were then removed, damaged DNA fragments were repaired, and the A-tail and PacBio adaptor were added. The library was processed with the SMRTbell Enzyme Cleanup Kit (Pacific Biosciences, USA) and then purified with AMPure PB Beads. Target fragments were size-selected with BluePippin (Sage Science, USA), and the purified library fragments were then assessed with an Agilent 2100 (Agilent Technologies, USA). Long-read sequencing was performed on the PacBio Sequel II platform with Sequencing Primer V2 and the Sequel II Binding Kit at Wuhan Gene Read Biotechnologies Co. Ltd.

For Hi-C library construction, 1 g of tissue was cross-linked with fresh formaldehyde at a final concentration of 1% for 10 min, and the reaction was then quenched with 0.2 M glycine for 5 min. The cross-linked cells were lysed in lysis buffer. After that, 663 μL of DNase/RNase-free water, 120 μL of 10 × blunt-end ligation buffer, 100 μL of 10% Triton X-100, and 20 U of T4 DNA ligase were added to the solution. Ligation was carried out at 16 °C for 4 h. After ligation, the cross-linking was reversed overnight at 65 °C with 200 μg/mL proteinase K (Thermo Fisher Scientific, USA). The DNA was purified using the QIAamp DNA Mini Kit (QIAGEN, Germany) according to the manufacturer’s instructions. The Hi-C library for Illumina sequencing was prepared with the NEBNext® Ultra™ II DNA Library Prep Kit for Illumina (New England Biolabs, USA) according to the manufacturer’s instructions26. The final library was sequenced on the Illumina HiSeq X Ten platform (San Diego, USA) with 150 bp paired-end reads.

Transcriptomic sequencing and analysis

Total RNA was extracted from the collected samples using TRIzol reagent (Thermo Fisher Scientific, USA). RNA integrity, DNA contamination, and purity were analyzed by agarose gel electrophoresis, an Agilent 2100 bioanalyzer, and a NanoPhotometer (Implen, Germany), respectively. Oligo(dT) magnetic beads were used to enrich mRNA. First-strand cDNA was synthesized using fragmented mRNA as the template, and second-strand cDNA was then synthesized from dNTPs. The purified double-stranded cDNA was subjected to end repair, A-tailing, and sequencing adapter ligation. cDNA of approximately 200 bp was selected with AMPure XP beads, amplified by PCR, and the PCR product was again purified with AMPure XP beads. The library was constructed using the NEBNext® Ultra™ RNA Library Prep Kit for Illumina (New England Biolabs, USA). The constructed library was preliminarily quantified using a Qubit 2.0 Fluorometer (Thermo Fisher Scientific, USA) and its insert size was checked with the Agilent 2100 bioanalyzer (Agilent Technologies, USA). RT-qPCR was then used to measure the effective concentration of the library and thereby verify its quality.

After passing inspection, each library was required to meet the standards for effective concentration and target data volume. The libraries were sequenced on the MiSeq platform, and 150 bp paired-end reads were collected. The dataset was filtered with fastp (v0.23.2)27 using the parameters ‘--cut_front --cut_tail --cut_window_size 4 -a auto --cut_mean_quality 20 --length_required 36’. An index was built for the reference genome, and the paired-end clean reads were mapped against it with HISAT2 (v2.2.1)28 using the parameters ‘--phred33 --no-mixed --no-discordant’. Differential expression analysis between the two comparison groups was performed using DESeq2 (v1.16.1)29 with a filtering threshold of padj < 0.05 and |log2FoldChange| > 1. P values were adjusted using the Benjamini and Hochberg method to control the false discovery rate.
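To make the filtering threshold concrete, the sketch below applies a plain Benjamini and Hochberg adjustment to a list of raw P values and then keeps genes with padj < 0.05 and |log2FoldChange| > 1. It is only an illustration of the two rules named in the text: DESeq2 performs additional steps (such as its own statistical testing and filtering) that are not reproduced here, and the gene names and values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values in the original order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])  # indices by ascending P
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest P value down so adjusted values stay monotonic.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


def is_differentially_expressed(padj, log2_fold_change,
                                padj_cutoff=0.05, lfc_cutoff=1.0):
    """Apply the thresholds from the text: padj < 0.05 and |log2FC| > 1."""
    return padj < padj_cutoff and abs(log2_fold_change) > lfc_cutoff


# Hypothetical genes: name -> (raw P value, log2 fold change).
genes = {
    "gene1": (0.001, 2.3),
    "gene2": (0.01, -1.5),
    "gene3": (0.03, 0.4),   # significant, but fold change too small
    "gene4": (0.5, 3.0),    # large fold change, but not significant
}
names = list(genes)
padj = benjamini_hochberg([genes[g][0] for g in names])
hits = [g for g, q in zip(names, padj)
        if is_differentially_expressed(q, genes[g][1])]
print(hits)
```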

Genome size estimation and assembly

We used SOAPnuke (v2.1.0) to filter the raw reads30. First, reads containing adaptor sequence were filtered out. Then, duplicated reads caused by PCR amplification were removed. When the N content of either read in a pair exceeded 10% of the read length, or the number of low-quality (quality ≤ 5) bases exceeded 50% of the read length, the read pair was removed, and clean reads were finally obtained. The main parameters were: -lowQual = 20, -nRate = 0.005, -qualRate = 0.5, with all other parameters left at their defaults. We then used FastQC (v0.12)31 to evaluate sequencing quality in four respects: 1) quality inspection of the sequencing data; 2) sequencing error rate distribution; 3) base frequency distribution; and 4) GC content distribution. K-mer counting and statistics on the sequence files were performed with GCE (v1.0.2)32 to estimate the genome size, heterozygosity, and repeat content. Genome assembly was performed using hifiasm (v0.14.2)33. The final draft genome was obtained by removing redundant heterozygous sequences from the corrected assembly. To evaluate the integrity and consistency of the assembly, BWA (v0.7.17-r1188)34 was used to align the short-read sequencing data back to the assembled genome, and the read mapping rate, genome coverage, and depth distribution were calculated (98.42% mapping rate; 28.89 × depth; 99.38% coverage). Finally, based on the single-copy orthologous gene sets in OrthoDB, BUSCO (v5.2.2)35 was used to search for these genes and calculate the proportions that were complete, fragmented, or missing. In this way, the completeness of the gene space in the chromosome-level genome was assessed.
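The idea behind k-mer based genome size estimation can be illustrated with the basic relationship: genome size ≈ total number of k-mers / depth of the main k-mer peak. The sketch below implements only this simplified formula on a toy histogram; GCE fits a more elaborate model that also accounts for heterozygosity and repeats, and the `min_depth` cutoff for discarding error k-mers is an assumption made for this example.

```python
def estimate_genome_size(kmer_hist, min_depth=5):
    """Estimate genome size from a k-mer depth histogram.

    kmer_hist maps a k-mer depth to the number of distinct k-mers observed
    at that depth. Depths below min_depth are treated as sequencing errors
    (a hypothetical cutoff) and ignored.
    """
    usable = {d: c for d, c in kmer_hist.items() if d >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    peak_depth = max(usable, key=usable.get)          # main (homozygous) peak
    total_kmers = sum(d * c for d, c in usable.items())
    return total_kmers / peak_depth, peak_depth


# Toy histogram: low-depth error k-mers, a main peak at depth 20,
# and a small repeat peak at depth 40.
toy_hist = {1: 1000, 2: 200, 18: 50, 19: 80, 20: 100, 21: 80, 22: 50, 40: 10}
size, peak = estimate_genome_size(toy_hist)
print(size, peak)
```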

Hi-C data were used to bring the genome assembly to the chromosome level26. The Hi-C reads were quality-filtered using fastp (v0.23.2)27, removing reads mapped to the ends of the genome sequences and other unusable data. The remaining reads were mapped to the polished genome using BWA (v0.7.17-r1188)34 with default parameters. Read pairs whose two ends mapped to different contigs (or scaffolds) were used as Hi-C links for scaffolding, and the agglomerative hierarchical clustering method in LACHESIS was then applied to group, order, and orient the sequences. Finally, we used JuiceBox (v1.8.8)36 to visualize the contact map and correct assembly errors.
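The grouping step can be pictured as follows: count the read pairs that link each pair of contigs, then repeatedly merge the two groups that share the most links until the desired number of groups remains. The sketch below is a toy average-linkage version of that idea written for illustration only; it is not the LACHESIS implementation, which does considerably more (for example, ordering and orienting contigs within each group). All contig names and read pairs are hypothetical.

```python
from collections import Counter
from itertools import combinations


def count_links(read_pairs):
    """Count Hi-C read pairs whose two ends map to different contigs."""
    links = Counter()
    for contig_a, contig_b in read_pairs:
        if contig_a != contig_b:  # intra-contig pairs carry no grouping signal
            links[frozenset((contig_a, contig_b))] += 1
    return links


def cluster_contigs(contigs, links, n_groups):
    """Agglomerative clustering: merge the two groups with the highest
    average number of links per contig pair until n_groups remain."""
    groups = [{c} for c in contigs]

    def average_links(g1, g2):
        total = sum(links[frozenset((a, b))] for a in g1 for b in g2)
        return total / (len(g1) * len(g2))

    while len(groups) > n_groups:
        i, j = max(combinations(range(len(groups)), 2),
                   key=lambda ij: average_links(groups[ij[0]], groups[ij[1]]))
        groups[i] |= groups[j]  # i < j, so deleting j keeps index i valid
        del groups[j]
    return groups


# Hypothetical mapped read pairs, given as (contig of read 1, contig of read 2).
pairs = ([("c1", "c2")] * 9 + [("c3", "c4")] * 8
         + [("c2", "c3")] * 1 + [("c1", "c1")] * 5)
links = count_links(pairs)
groups = cluster_contigs(["c1", "c2", "c3", "c4"], links, n_groups=2)
print(sorted(sorted(g) for g in groups))
```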

In this study, a 300–500 bp library of the M. crassicauda sample was constructed on the Illumina platform, and a total of 66.98 Gb of clean reads were obtained. Both the sequencing quality and the sequencing error rate were within the normal range. Comparison against the NT database showed no obvious exogenous contamination. K-mer analysis estimated the genome size to be 405.43 Mb, with an estimated heterozygosity of 1.72%, a repetitive sequence proportion of 36.24%, and a GC content of about 31.85%. For long-read sequencing, after filtering out low-quality data, we obtained 63.58 Gb of HiFi reads. The average length and N50 of the HiFi reads were 15,470 bp and 15,142 bp, respectively. The final assembly was 440.80 Mb in size and comprised 179 contigs with a contig N50 of 41.57 Mb. The GC content was 29.77% (Tables 1, 2).
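For readers unfamiliar with the N50 statistic reported here, it is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total length. A minimal sketch of the calculation, using hypothetical contig lengths rather than the actual assembly, is:

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    total = sum(lengths)
    running = 0
    # Accumulate from the longest sequence down until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("lengths must not be empty")


# Hypothetical contig lengths (arbitrary units); the total is 290, so the
# half-way point of 145 is reached within the second-longest contig.
toy_contigs = [80, 70, 50, 40, 30, 20]
print(n50(toy_contigs))
```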

A total of 72.68 Gb of clean reads were generated by Hi-C sequencing for chromosome-level genome assembly. These reads were mapped to the genome, retaining 411,704,643 paired-end (PE) reads, of which 350,328,429 were valid interaction PE reads. Of the total sequence length of the 179 contigs, 94.82% was anchored to 6 linkage groups, with lengths ranging from 22.38 Mb to 128.47 Mb, and the average sequencing depth was 28.89 × (Table 2).

Gene prediction and annotation

Genome prediction and annotation comprised three parts: repetitive sequence identification, non-coding RNA prediction, and gene structure prediction with functional annotation. Each part is described below.

Repetitive sequence annotation

Repetitive sequences were identified by combining a homology-based method using the RepBase library (http://www.girinst.org/repbase) with RepeatMasker (open-4.0.9)37 and RepeatProteinMask (open-4.0.9)38, and de novo methods based on self-sequence alignment (RepeatModeler (open-1.0.11)39) and on repetitive sequence features (LTR-FINDER40). In addition, TRF (v4.09)41 was used to find tandem repeats in the genome.
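When annotations from several repeat-finding tools are combined, the same bases are often reported more than once, so the total repeat length is normally computed on merged, non-overlapping intervals rather than by summing each tool's output. The sketch below shows this merging for a single sequence. It is an assumption about how such totals are usually derived, not a description of the pipeline used in this study, and the coordinates are hypothetical.

```python
def merged_length(intervals):
    """Total bases covered by half-open (start, end) intervals on one
    sequence, counting overlapping regions only once."""
    covered = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Close the previous merged block before starting a new one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered


# Hypothetical annotations on one chromosome from three tools.
homology_hits = [(0, 100), (250, 300)]
tandem_hits = [(80, 120)]
ltr_hits = [(90, 110), (400, 450)]

all_hits = homology_hits + tandem_hits + ltr_hits
naive_total = sum(end - start for start, end in all_hits)
print(naive_total, merged_length(all_hits))
```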

Gene structure prediction

Gene structure prediction in this study combined homology-based prediction, ab initio prediction, and RNA-seq-guided prediction. First, for homology-based prediction, protein sequences of closely related species were obtained from NCBI, and Exonerate (v2.4.0)42 was then used to predict transcripts and coding regions based on the alignment results. In addition, hundreds of genomes were sampled using BUSCO (v5.2.2)35, and genes with single-copy orthologs > 90% were selected to construct gene sets for 6 major phylogenetic branches, which served as indirect homology evidence. Ab initio prediction was performed using Augustus (v3.3)43, Genscan (v1.0)44, and GlimmerHMM (v3.0.4)45. For transcriptome-based prediction, RNA-seq data were processed with gmap (v2020-10-24)46, Stringtie (v2.1.1)47, and cd-hit (v4.8.1) to reconstruct the transcripts, and TransDecoder (v5.7.0)48, PASA (v2.5.3)49, and other software were then used to predict the coding frames. Finally, the gene evidence sets predicted by the various methods were integrated and filtered with MAKER (v3.00)50 to form a non-redundant and more complete gene set.

Gene function prediction

The proteins in the predicted gene set were functionally annotated using BLASTP and InterProScan based on similarity to external protein databases (Swiss-Prot51, TrEMBL, KEGG52, InterPro53, GO54, and NR). In addition, we used GlimmerHMM (v3.0.4)45 for domain prediction to obtain information about conserved sequences, motifs, and protein domains.

For non-coding RNA annotation, tRNAscan-SE (v1.3.1)55 was used to find tRNA sequences in the genome according to the structural characteristics of tRNA. Because rRNA is highly conserved, rRNA sequences of closely related species were selected as references to search for rRNA in the genome by BLASTN alignment. In addition, miRNA and snRNA sequences in the genome were predicted with INFERNAL (v1.1.2) using the covariance models of the Rfam database56.

Repetitive sequences are an important part of the genome and fall into two main categories: tandem repeats and interspersed repeats. Tandem repeats include microsatellite sequences, minisatellite sequences, and so on; interspersed repeats are also called transposable elements. In this study, a total of 156.93 Mb of repetitive sequences were identified, of which a large proportion (33.93%) consisted of transposable elements, followed by tandem repeats (4.15%) (Table 4). The amount of repetitive sequence affects the size of the genome. The proportion of repeats in unknown categories was relatively high, which may be due to the limited research on hemipteran insects.

In this study, gene structure prediction combined three methods: homology-based prediction, de novo prediction, and transcript-based prediction. A total of 19,687 genes were annotated. The average lengths of genes, CDSs, exons, and introns were 7,821.84 bp, 1,382.38 bp, 285.02 bp, and 1,468.9 bp, respectively, and the average number of exons per gene was 5.3 (Table 5). Of the 19,687 predicted genes, 19,424 were functionally annotated through multiple gene function databases, accounting for 98.66% of all annotated genes. Specifically, 19,352 genes were annotated in the NR database and 19,114 in the TrEMBL database, while 19,154, 13,914, 8,233, and 12,100 genes were annotated in the KEGG, InterPro, GO, and Swiss-Prot databases, respectively (Table 6).

We also annotated a total of 1,092 non-coding RNAs, comprising 294 rRNAs, 657 tRNAs, 94 snRNAs, and 47 miRNAs (Table 7).

Finally, we identified 97.1% of the genes in the BUSCO Hemiptera dataset (hemiptera_odb10) (single-copy BUSCOs: 95.7%; duplicated BUSCOs: 1.40%), indicating that the genome has good assembly completeness (Table 8). In addition to assembly completeness, we used BUSCO to evaluate the gene annotation. A total of 2,407 BUSCO genes were completely matched, accounting for 95.9% of the BUSCO gene set. These comprised 2,372 single-copy genes and 35 duplicated genes, accounting for 94.5% and 1.4% of the gene set, respectively. A further 99 BUSCO genes could not be matched, accounting for 3.9% of the gene set.
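The percentages reported above can be cross-checked from the counts with a few lines of arithmetic. In the sketch below, the gene and non-coding RNA counts are taken from the text, whereas the total of 2,510 BUSCO groups in hemiptera_odb10 is an assumption that is not stated in this section; it is used only because it reproduces the reported percentages.

```python
def percent(part, whole, digits=2):
    """Return part/whole as a percentage rounded to the given digits."""
    return round(100 * part / whole, digits)


# Functional annotation rate (counts from the text).
predicted_genes = 19687
annotated_genes = 19424
annotation_rate = percent(annotated_genes, predicted_genes)
print(annotation_rate)

# Non-coding RNA classes should add up to the reported total of 1,092.
ncrna = {"rRNA": 294, "tRNA": 657, "snRNA": 94, "miRNA": 47}
ncrna_total = sum(ncrna.values())
print(ncrna_total)

# BUSCO evaluation of the gene set. ASSUMPTION: hemiptera_odb10 contains
# 2,510 BUSCO groups; this total does not appear in the text.
busco_total = 2510
busco_counts = {"complete": 2407, "single_copy": 2372,
                "duplicated": 35, "missing": 99}
busco_pct = {name: percent(count, busco_total, 1)
             for name, count in busco_counts.items()}
print(busco_pct)
```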

Table 7 Statistics of noncoding RNA in M. crassicauda genome.
Table 8 Evaluation index of M. crassicauda genome.

Data Records

The genomic Illumina sequencing data, PacBio sequencing data, and Hi-C sequencing data were deposited in the Sequence Read Archive at NCBI under accession number PRJNA122693357. The final assembled Megoura crassicauda genome was deposited at NCBI under accession number ASM5016992v158. The annotation files of the Megoura crassicauda genome have been deposited at figshare59.

Technical Validation

DNA integrity. The concentration of extracted DNA was measured with a NanoDrop 2000 spectrophotometer (Thermo Fisher Scientific, USA) and a Qubit™ 3 Fluorometer (Thermo Fisher Scientific, USA). The absorbance ratios at 260/280 nm and 260/230 nm were approximately 1.8. Agarose gel electrophoresis was used to assess the quality of the genomic DNA. The main band of the DNA fragments was at about 23 kb, and the degradation bands were above 5 kb. The sample wells showed no contamination, indicating good integrity of the DNA molecules used in the present study.