Background & Summary

Brevicoryne brassicae, commonly known as the cabbage aphid, is a notorious pest that specializes in plants of the Brassicaceae family, including crops such as rapeseed, cabbage, and broccoli. B. brassicae damages plants directly by sucking sap from phloem tissues and indirectly by transmitting several plant viruses, which together cause significant yield losses in many Brassicaceae crops worldwide. B. brassicae is a non-host-alternating species, meaning its entire life cycle is completed on the herbaceous plants that typically serve as secondary hosts for host-alternating aphids1. The life cycle includes a sexual generation and several asexual generations. During the winter, B. brassicae produces sexual forms and overwinters in the egg stage. In warm seasons and regions, the life cycle simplifies to parthenogenetic reproduction. Winged females (alates) (Fig. 1a) emerge when population density increases and host quality declines. Alates migrate to distant crops and produce offspring via parthenogenesis, leading to exponential population expansion and escalating aphid damage in the fields2.

Fig. 1

Morphological and genomic characteristics of B. brassicae. (a) Pictures of parthenogenetic B. brassicae: nymph, apterous adult and alate adult. Bar = 1 mm. (b) Karyotype analysis of B. brassicae. Chromosomes (purple) were stained with Gurr’s Giemsa R66 (Giemsa). Diploid B. brassicae has 16 chromosomes (2n (♀) = 16). Bar = 10 μm. (c) Genome-wide contact matrix of B. brassicae generated using Hi-C data. The colour bar indicates the intensity of Hi-C interaction: yellow indicates low and red indicates high. (d) Circos plot overview of features of the B. brassicae genome. Rings from inside to outside (A-E) represent gene density (A), GC ratio (B), GC skew (C), transposable elements (D), and the number of SNPs (E).

Since the first aphid genome, that of Acyrthosiphon pisum, was published in 20103, dozens of aphid genomes have become available, including those of important agricultural pests such as Myzus persicae4,5, Aphis gossypii6, and Diuraphis noxia7, and of valuable resource insects such as Schlechtendalia chinensis8. These genomes have greatly facilitated research on these aphids, leading to a deeper understanding of the molecular mechanisms of aphid biology. In contrast, studies on B. brassicae have been limited, largely due to the lack of genomic resources. Although a genome of B. brassicae is available9, its quality and annotation need further improvement. Therefore, we constructed a high-quality, chromosome-level B. brassicae genome using PacBio HiFi long reads and high-throughput chromosome conformation capture (Hi-C) data. We annotated the genome for both protein-coding genes and long non-coding RNA (lncRNA) transcripts, and performed phylogenetic and evolutionary analyses with other aphid genomes. These resources will support a deeper understanding of B. brassicae and future studies of aphids.

After quality control and filtering, we obtained a total of 29.00 Gb (~67.44× depth) of PacBio long reads and 42.78 Gb (~99.49× depth) of Illumina short reads. These reads were assembled into 131 contigs with an N50 length of 16.79 Mb (Table 1). Chromosome scaffolding based on Hi-C data resulted in eight chromosomes that contained 96.19% of scaffold sequences. The chromosome number was confirmed by karyotype analysis (Fig. 1b). The final assembly comprises eight chromosomes with a total size of 429.99 Mb (Fig. 1c, Table 1). Chromosome lengths ranged from 16.74 Mb to 125.10 Mb (Fig. 1d). The genome assembly is highly complete in terms of gene content, with 99.24% of Hemiptera BUSCO (Benchmarking Universal Single-Copy Orthologs) genes and 95.16% of CEGMA (Core Eukaryotic Genes Mapping Approach) genes present (Table 1), indicating a comprehensive representation of the gene set expected for this taxonomic group. Altogether, the B. brassicae genome assembly is contiguous, accurate, and complete.
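The depth figures quoted above can be reproduced with simple arithmetic. The following is a minimal sketch, assuming that depth was computed as sequenced bases divided by the 429.99 Mb assembly size and that 1 Gb = 1000 Mb; the function name is illustrative and not part of the original pipeline.

```python
def sequencing_depth(total_bases_gb: float, genome_size_mb: float) -> float:
    """Return mean sequencing depth as sequenced bases / genome size."""
    # Assumption: 1 Gb = 1000 Mb, so both values are on the same scale.
    return total_bases_gb * 1000.0 / genome_size_mb


GENOME_SIZE_MB = 429.99  # assembly size reported in the text

pacbio_depth = sequencing_depth(29.00, GENOME_SIZE_MB)
illumina_depth = sequencing_depth(42.78, GENOME_SIZE_MB)

print(f"PacBio depth:   {pacbio_depth:.2f}x")
print(f"Illumina depth: {illumina_depth:.2f}x")
```

Under these assumptions, the two values round to the reported ~67.44× and ~99.49×.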

Table 1 Statistics for the genome assembly and annotation of B. brassicae. N.A.: not available.

We used a phylogenomic approach to assess the phylogenetic relationships among B. brassicae and 14 other hemipteran insects. Phylogenetic analysis revealed that B. brassicae diverged from D. noxia approximately 49.9 million years ago (MYA) and from other Macrosiphini species about 53.9 MYA (Fig. 2a). We also identified chromosome 1, the largest chromosome (125.10 Mb) in the B. brassicae genome, as the X chromosome, since it showed extensive synteny with the X chromosomes of M. persicae and A. pisum10 (Fig. 2b).

Fig. 2

Comparative analysis of the B. brassicae genome. (a) Phylogenetic tree constructed based on the 484 single-copy genes of 15 aphids (Myzus cerasi, Myzus persicae, Sitobion avenae, Metopolophium dirhodum, Acyrthosiphon pisum, B. brassicae, Diuraphis noxia, Rhopalosiphum padi, Rhopalosiphum maidis, Aphis fabae, Aphis gossypii, Aphis glycines, Sipha flava, Cinara cedri, and Daktulosphaira vitifoliae). (b) Synteny of the genomes of A. pisum, B. brassicae and M. persicae. (c) The number of genes related to host adaptation in B. brassicae. Chemosensory-related gene families, including gustatory receptors (GR), olfactory receptors (OR), ionotropic receptors (IR), odorant-binding proteins (OBP), and chemosensory proteins (CSP); detoxification gene families, including cytochrome P450 (P450), carboxylesterases (CCE), glutathione S-transferases (GST), UDP-glucuronosyltransferases (UGT), ATP-binding cassette transporters (ABC), and myrosinase (MYR) genes, as well as the lncRNA Ya gene family.

Using the chromosome-level genome assembly, we annotated protein-coding genes and lncRNA transcripts using evidence from 90.31 Gb of RNA sequencing (RNA-seq) data (63.60 Gb un-stranded and 26.71 Gb stranded). In total, 22,671 protein-coding genes and 3,594 lncRNA genes were annotated (Table 1). Among them, 22 lncRNA genes in the B. brassicae genome were identified as homologs of Ya1 in M. persicae, previously known as a virulence factor11. Conservation of protein-coding sequences is high among different aphid species, whereas lncRNA sequences tend to be more divergent (Fig. 2a). We also identified 141.19 Mb of repetitive sequences, accounting for 32.84% of the genome assembly (Table 2). Additionally, we annotated 154 miRNA loci, 332 tRNA loci, 837 rRNA loci, and 124 snRNA loci (Table 3).

Table 2 Statistics of the transposable elements in the genome of B. brassicae.
Table 3 Genomic annotation of miRNA, tRNA, rRNA, and snRNA loci in the genome of B. brassicae.

We annotated chemosensory-related genes encoding gustatory receptors (GRs), odorant receptors (ORs), ionotropic receptors (IRs), odorant-binding proteins (OBPs), and chemosensory proteins (CSPs). The B. brassicae genome encodes 33 GRs, 25 ORs, 17 IRs, 10 OBPs, and 10 CSPs. In addition, we annotated detoxification-related genes and identified 48 genes for cytochrome P450 (P450), 28 for carboxyl/choline esterase (CCE), 21 for glutathione S-transferase (GST), 49 for UDP-glycosyltransferase (UGT), 77 for ATP-binding cassette transporters (ABC), and 7 for myrosinases (MYR) (Fig. 2c).

This study presents a high-quality chromosome-level genome assembly of B. brassicae with comprehensive annotations. Together, they provide a genomic resource for understanding the genetic, evolutionary, and ecological aspects of the cabbage aphid and can support the development of integrated management strategies for this pest.

Methods

Sample collection and genome sequencing

B. brassicae nymphs were collected from rapeseed fields in Lanzhou (35°56′39.062″ N, 104°8′49.009″ E), Gansu province, China. Subsequently, these aphids were reared on the Brassica napus variety Zhongshuang11 in a growth chamber set to 22 °C and a 16/8 h light/dark cycle in our laboratory. Genomic DNA was extracted from 50 adult insects using the CTAB (cetyltrimethylammonium bromide) method12 for Illumina, PacBio, and Hi-C sequencing. The quality of the genomic DNA was validated by 1.5% agarose gel electrophoresis and a NanoDrop 2000C spectrophotometer. Briefly, the fragmented genomic DNA sample with a size of 350 bp was end-polished, A-tailed, and then ligated with the full-length adapter following the manual of the SMRTbell Express Template Prep Kit 2.0 (Pacific Biosciences, Menlo Park, CA, USA). Libraries with 350 bp inserts were constructed and sequenced on the Illumina NovaSeq 6000 platform. Raw reads were quality-filtered by removing adapter sequences and low-quality reads, and 42.78 Gb of clean data were ultimately obtained for subsequent analysis.

For PacBio HiFi sequencing, the genomic DNA was sheared into ~15 kb fragments using g-Tubes (Covaris, Woburn, MA, USA) and then purified with 0.45× AMPure PB beads (Beckman Coulter, Brea, CA, USA). The cleaned DNA fragments were used to construct SMRTbell libraries as described above. Fragments with sizes of 15–18 kb were selected using BluePippin (Sage Science, Beverly, MA, USA). After annealing the primers and binding Sequel DNA polymerase to the SMRTbell templates, the libraries were sequenced using one SMRT Cell 1M on the Sequel System (Biomarker Technologies). In total, 29.00 Gb of subreads with an average read length of 10.45 kb were obtained, giving an overall 67.44× coverage of the B. brassicae genome. The Hi-C technique was used to achieve chromosome-level assembly by identifying contacts between different regions of the chromatin. The Hi-C library was constructed following the standard library preparation protocol and sequenced on the Illumina NovaSeq 6000 platform. In total, 60.78 Gb of 150-bp paired-end clean reads were obtained.

RNA extraction and transcriptome sequencing

Parthenogenetic aphids were collected for RNA-seq. Thirty first-instar nymphs, 15 apterous adults, and 15 alate adults collected from rapeseed plants in the lab were used separately for RNA extraction. Total RNA was extracted using TRIzol reagent (Invitrogen, Carlsbad, CA, USA). RNA quality was evaluated using 1.5% agarose gel electrophoresis, and the concentration was measured with a NanoDrop 2000C spectrophotometer (Thermo Fisher Scientific, Pittsburgh, PA, USA). RNA integrity was assessed with an Agilent 5400 Fragment Analyzer (Agilent, Santa Clara, CA, USA) following the manufacturer’s instructions. RNA-seq libraries were constructed using the NEBNext® Ultra™ RNA Library Prep Kit (NEB, Ipswich, MA, USA) following the manufacturer’s instructions. Libraries were then sequenced on the Illumina NovaSeq 6000 platform, and 63.60 Gb of un-stranded and 26.71 Gb of stranded 150-bp paired-end reads were obtained and used for gene prediction.

Genome estimation and assembly

PacBio HiFi long-read data were used to generate a contig-level assembly of the B. brassicae genome. WTDBG2 v2.513 was used to generate a preliminary assembly and Pilon v1.2314 was used for short-read correction. The resulting B. brassicae genome assembly consisted of 131 contigs with a total length of 429.99 Mb and a contig N50 of 16.79 Mb. Hi-C clean data (29.00 Gb), with low-quality reads and adaptor sequences removed, were mapped to the draft B. brassicae genome by BWA v0.7.1015 with the default parameters. Invalid read pairs, including dangling ends, re-ligation, self-circle, and dumped pairs, were identified and removed from the uniquely aligned read pairs by HiC-Pro v2.10.016. Valid interaction pairs for scaffold correction were processed by LACHESIS v2e27abb17 with the default parameters to cluster, order, and orient the contigs onto chromosomes. The draft assembly was examined for contamination by manually inspecting taxon-annotated GC content coverage plots generated using BlobTools v1.0.118,19. Ultimately, eight chromosomes with a scaffold N50 of 93.31 Mb were constructed, covering a span of 429.99 Mb and representing 96.19% of the draft genome assembly.
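For readers unfamiliar with the N50 statistic used above, the following is a minimal sketch of how it is conventionally defined: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length. The contig lengths in the example are made-up toy values, not data from this assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    if not lengths:
        raise ValueError("at least one sequence length is required")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence downwards until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Toy example with hypothetical contig lengths (arbitrary units).
toy_contigs = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(toy_contigs))  # -> 8
```

The same function applies to contigs or scaffolds; only the input list changes.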

Genomic repeat annotation

Repeat sequences mainly include tandem and interspersed repeats, the latter being primarily transposable elements (TEs). The TE sequences were annotated by a combination of homology-based and de novo approaches. Initially, a de novo repeat library based on the assembly sequences was generated using RepeatModeler v2.0.2a20, LTR_FINDER21, and RepeatScout22 with default parameters. Subsequently, the predicted repeats were further classified by PASTEClassifier v1.023 and combined with the Dfam v3.224 database to construct a non-redundant, species-specific TE library. This library was used as the database for homology-based identification of TE sequences in the assembled genome with RepeatMasker v4.1025 and RepeatProteinMask25. Ultimately, 141.19 Mb of TE sequences were identified, accounting for 32.84% of the genome assembly. Long terminal repeats (LTR) were the largest category of transposable elements, representing 17.47% of the genome, followed by DNA transposons, representing 8.81% of the genome. Unknown repeated sequences and long interspersed nuclear elements (LINE) accounted for 6.62% and 1.51% of the whole genome, respectively (Table 2).
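The repeat fraction reported above is the ratio of annotated repeat bases to assembly size. Because repeat annotations from different tools can overlap, a common way to count repeat bases is to merge overlapping intervals before summing. The sketch below illustrates that idea on toy coordinates; it is an illustration under that assumption, not the procedure used by RepeatMasker or by this study.

```python
def covered_bases(intervals):
    """Count bases covered by half-open (start, end) intervals on one sequence,
    counting overlapping regions only once."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Close the previous merged block and open a new one.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent interval: extend the current block.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


# Toy annotations (hypothetical coordinates): two overlap, one is separate.
toy_repeats = [(0, 100), (50, 150), (200, 250)]
print(covered_bases(toy_repeats))  # -> 200

# Fraction implied by the totals reported in the text.
repeat_pct = 141.19 / 429.99 * 100
print(f"{repeat_pct:.2f}% of the assembly")
```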

Protein-coding gene annotation

An integrated approach based on the B. brassicae transcriptome and protein homologs from other aphids was used to predict protein-coding genes on the repeat-masked reference genome. The RNA-seq data from pooled wingless/winged asexual females and nymphs were used. Reads were mapped to the reference genome using HISAT2 v2.2.126 with default parameters and processed by SAMtools v0.1.1827. The alignment results were provided to Braker128, which generated a transcriptome-based gene set. Furthermore, protein sequences of related species, including M. persicae, A. pisum, and A. gossypii, and the model insect Drosophila melanogaster, were filtered to remove redundant isoforms and provided to Braker229, which generated a homolog-based gene set. The two independent gene sets were compared at both exon and transcript levels to generate a consensus gene set. To do this, models unique to one set, with no overlapping model in the other, were selected first. Models that differed between the two approaches were then checked against the homolog-alignment and transcriptome evidence, and the best-supported model was retained.
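The consensus step described above can be sketched roughly as follows. This is a simplified, locus-level illustration only: the `GeneModel` class, the `support` score, and the rule "keep the higher-scoring model" are hypothetical stand-ins for the exon- and transcript-level evidence checks, and the sketch only handles simple one-to-one conflicts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneModel:
    name: str
    seqid: str
    start: int
    end: int
    support: float  # hypothetical evidence score (higher = better supported)


def overlaps(a, b):
    """True if two models share at least one base on the same sequence."""
    return a.seqid == b.seqid and a.start <= b.end and b.start <= a.end


def consensus(set_a, set_b):
    kept = []
    # Step 1: keep models that have no overlapping counterpart in the other set.
    for own, other in ((set_a, set_b), (set_b, set_a)):
        kept.extend(m for m in own if not any(overlaps(m, o) for o in other))
    # Step 2: for conflicting loci, keep the best-supported model.
    for a in set_a:
        rivals = [b for b in set_b if overlaps(a, b)]
        if rivals:
            best = max(rivals + [a], key=lambda m: m.support)
            if best not in kept:
                kept.append(best)
    return sorted(kept, key=lambda m: (m.seqid, m.start))


# Toy gene sets with made-up coordinates and scores.
transcript_set = [GeneModel("a1", "chr1", 100, 500, 0.9),
                  GeneModel("a2", "chr1", 1000, 1500, 0.4)]
homolog_set = [GeneModel("b1", "chr1", 1100, 1600, 0.8),
               GeneModel("b2", "chr2", 10, 90, 0.5)]

merged = consensus(transcript_set, homolog_set)
print([m.name for m in merged])
```

In this toy case, `a1` and `b2` are unique to their sets, while `a2` and `b1` conflict and the better-supported `b1` is retained.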

Phylogenetic tree construction and genome synteny analyses

We identified 484 single-copy genes using OrthoFinder v2.5.430 based on protein sequences from 15 aphid genomes, including 7 Macrosiphini species (M. cerasi9, M. persicae5, Sitobion avenae31, Metopolophium dirhodum32, A. pisum10, B. brassicae, and D. noxia7), 5 Aphidini species (Rhopalosiphum maidis33, Rhopalosiphum padi9, Aphis fabae9, A. gossypii6, and Aphis glycines9), and one species each from Chaitophorinae (Sipha flava34), Lachnini (Cinara cedri35), and Phylloxeridae (Daktulosphaira vitifoliae36). The protein sequences of the single-copy genes were concatenated and aligned automatically by OrthoFinder to generate a multiple sequence alignment file, which was used for phylogenetic analysis. For phylogenetic tree reconstruction, ProtTest v3.237 was first used to select the best-fit model, “JTT + I + G4”, which was then used for maximum likelihood tree reconstruction with RAxML v8.2.1238. To estimate divergence dates, we used the topology derived from the maximum likelihood (ML) analysis of first and second position nucleotides as the input tree. We incorporated a calibration point of 23.9 million years ago (MYA) between Metopolophium dirhodum and Acyrthosiphon pisum39. This calibration was employed as the minimum age in soft-bound uniform priors, which were then applied in a Bayesian MCMCTree molecular dating analysis using PAML (Phylogenetic Analysis by Maximum Likelihood), with the requirement that the sites be present in at least 95% of the taxa40. We used iTOL v641 for tree visualization.
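Selecting single-copy genes amounts to keeping orthogroups in which every species contributes exactly one gene. The sketch below shows that filter on a toy gene-count table. The column layout (an `Orthogroup` column, one column per species, and a `Total` column) is assumed to resemble OrthoFinder's gene-count output; the species columns and counts are invented for illustration.

```python
import csv
import io


def single_copy_orthogroups(tsv_text):
    """Return orthogroup IDs where every species column has a count of exactly 1."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    species = [c for c in reader.fieldnames if c not in ("Orthogroup", "Total")]
    return [row["Orthogroup"]
            for row in reader
            if all(int(row[s]) == 1 for s in species)]


# Toy table with three species and made-up counts.
rows = [
    ["Orthogroup", "B_brassicae", "M_persicae", "A_pisum", "Total"],
    ["OG0000001", "1", "1", "1", "3"],
    ["OG0000002", "2", "1", "1", "4"],
    ["OG0000003", "1", "0", "1", "2"],
    ["OG0000004", "1", "1", "1", "3"],
]
toy_table = "\n".join("\t".join(r) for r in rows) + "\n"

print(single_copy_orthogroups(toy_table))
```

Only orthogroups with a strict 1:1:1 pattern pass; groups with duplications or missing species are excluded.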

For the genome synteny analysis, the 1:1:1 orthologs among the B. brassicae, M. persicae, and A. pisum genomes were extracted from the OrthoFinder results and passed to MCScanX_h42, which was run with the “-b 2” option to obtain the inter-species collinearity among B. brassicae, M. persicae, and A. pisum. SynVisio43 was used to visualize the genome synteny.

Annotation of long non-coding RNAs

The process of identifying lncRNA genes in the B. brassicae genome was divided into three main steps. Firstly, the stranded RNA-seq reads were mapped to the B. brassicae genome. The raw reads were subjected to quality control using Fastp v0.23.444 with default parameters to ensure data integrity for downstream analyses. The processed reads were aligned to the B. brassicae genome using HISAT2 v2.2.126. The aligned reads were assembled into transcripts by StringTie v2.2.145. Gffread v0.12.746 with the parameters “-V -H -U -N -P -J -M -K -Q -Y -Z -F --keep-exon-attrs” was used to extract the assembled transcript sequences. Secondly, LGC (Long Genomic Region Classifier) v1.047 was used to identify transcripts with non-coding features based on the relationship between ORF (open reading frame) length and GC content. Meanwhile, the assembled transcripts were submitted to CPC2 (Coding Potential Calculator 2)48 to calculate their coding potential. The intersection of the LGC and CPC2 results was taken as the set of putative lncRNAs. Thirdly, the putative lncRNA transcripts were screened against Rfam v14.349 to eliminate housekeeping RNAs, such as rRNA, tRNA, and snoRNA.
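The final selection logic of the three steps above can be expressed as simple set operations: intersect the transcripts called non-coding by both classifiers, then remove anything matching a housekeeping RNA family. The sketch below assumes each tool's output has already been reduced to a list of transcript IDs; the IDs and the function name are hypothetical.

```python
def putative_lncrnas(lgc_noncoding, cpc2_noncoding, rfam_hits):
    """Transcripts called non-coding by both LGC and CPC2, minus Rfam matches."""
    return (set(lgc_noncoding) & set(cpc2_noncoding)) - set(rfam_hits)


# Toy transcript IDs (hypothetical), as might be parsed from each tool's output.
lgc_ids = ["tx1", "tx2", "tx3", "tx5"]
cpc2_ids = ["tx2", "tx3", "tx4", "tx5"]
rfam_ids = ["tx5"]  # e.g. a transcript matching an rRNA or tRNA family

lncrnas = sorted(putative_lncrnas(lgc_ids, cpc2_ids, rfam_ids))
print(lncrnas)
```

Requiring agreement between two independent coding-potential predictors trades some sensitivity for a lower false-positive rate.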

To annotate the Ya gene family in the B. brassicae genome, the sequence of M. persicae Ya1 (MpYa1) was used as a query. BLASTn50 was used to align the MpYa1 sequence against the annotated lncRNA transcripts with an E-value cutoff of less than 10^-5 and an 80% similarity threshold. The final alignment identified 22 Ya genes in B. brassicae.
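The thresholds described above can be applied to BLAST tabular output with a few lines of code. The sketch below assumes the standard 12-column tabular format (`-outfmt 6`), in which column 3 is percent identity and column 11 is the E-value, and it interprets the "80% similarity threshold" as percent identity of at least 80; both points are assumptions, and the hit lines are invented.

```python
def filter_hits(lines, max_evalue=1e-5, min_identity=80.0):
    """Return unique subject IDs passing the E-value and identity thresholds."""
    passed = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        subject = fields[1]
        identity = float(fields[2])   # column 3: percent identity
        evalue = float(fields[10])    # column 11: E-value
        if evalue < max_evalue and identity >= min_identity:
            passed.append(subject)
    # Preserve first-seen order while removing duplicates.
    return list(dict.fromkeys(passed))


# Toy hits (made-up IDs and values): query, subject, pident, ..., evalue, bitscore
toy_hits = [
    "MpYa1\tlnc001\t92.5\t300\t20\t2\t1\t300\t5\t304\t1e-80\t310",
    "MpYa1\tlnc002\t75.0\t280\t60\t5\t1\t280\t9\t288\t1e-20\t120",
    "MpYa1\tlnc003\t88.0\t50\t6\t0\t1\t50\t3\t52\t0.002\t40",
    "MpYa1\tlnc001\t85.0\t100\t15\t0\t310\t409\t400\t499\t1e-15\t95",
]

print(filter_hits(toy_hits))
```

Here only `lnc001` passes: `lnc002` fails the identity threshold and `lnc003` fails the E-value cutoff.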

Gene family identification

To annotate the detoxification- and chemosensory-related genes of B. brassicae, amino acid sequences of those genes reported in other aphid species were used as queries in DIAMOND BLAST v0.8.2951 to identify putative homologs with an E-value of less than 10^-5. The identified sequences were further validated by domain annotation using PfamScan52 and by annotation with the Protein BLAST tool at the National Center for Biotechnology Information (NCBI)53. The sequences of the detoxification-related genes, including cytochrome P450 (P450), carboxylesterases (CCE), glutathione S-transferases (GST), UDP-glucuronosyltransferases (UGT), ATP-binding cassette transporters (ABC), and myrosinase (MYR) genes, were downloaded from InsectBase v2.054. The sequences of the chemosensory-related genes, including gustatory receptors (GRs), olfactory receptors (ORs), ionotropic receptors (IRs), odorant-binding proteins (OBPs), and chemosensory proteins (CSPs), were obtained from published papers55,56.

Karyotype analysis

The number of chromosomes was confirmed by karyotype analysis, indicating that diploid B. brassicae has 16 chromosomes (2n = 16). The Gurr’s Giemsa R66 chromosome staining method57 was used. Briefly, chromosome squash preparations were made from young embryos dissected from parthenogenetic adult aphids. The embryos were treated in 0.75% potassium chloride and then fixed in a freshly prepared mixture of absolute methanol and glacial acetic acid (3:1, v/v) for 10 minutes. Next, the embryos were carefully picked up on the tip of a pin, transferred to a clean microscope slide with a small drop of 45% propionic acid (5 minutes), squashed with a coverslip, and then dried for 24 hours at room temperature. HCl solution (0.2 M) was applied dropwise for 30 minutes at room temperature, followed by rinsing with distilled water and immersion in a 5% saturated Ba(OH)2 solution. The sample was then treated in a 60 °C constant-temperature water bath for 3 minutes. After this treatment, it was briefly immersed in HCl solution (0.2 M) to stop the reaction with the strong base, rinsed with distilled water, and air-dried at room temperature. Subsequently, the sample was stained with 5% Giemsa stain (pH 7.0) for 30 minutes, air-dried at room temperature, and examined and photographed under an optical microscope.

Data Records

The genome sequencing data (PacBio, Illumina and Hi-C) of B. brassicae have been submitted to the Sequence Read Archive (SRA) at the NCBI with accession numbers SRR2889272758, SRR2889272659, and SRR2889272560 under BioProject PRJNA1099426. The assembled genome is deposited under the same BioProject at NCBI (JBHUPR000000000.1)61. The RNA-seq data generated in this study have been deposited in the SRA at the NCBI under BioProject accession number PRJNA1104693. This submission includes nine un-stranded RNA-seq datasets with accession numbers SRR2882902362, SRR2882902263, SRR2882902164, SRR2882902065, SRR2882901966, SRR2882901867, SRR2882901768, SRR2882901669 and SRR2882901570, and three stranded RNA-seq datasets with accession numbers SRR2889265571, SRR2889265472 and SRR2889265373. The B. brassicae genome assembly FASTA and GFF files, the annotation GTF files of protein-coding genes, the functional annotation files including PFAM, KEGG and GO, the annotation files of transposable elements, lncRNAs and miRNAs, the annotation files of tRNA, rRNA, and snRNA loci, and the protein sequences of detoxification- and chemosensory-related genes have been deposited in the Figshare database (https://doi.org/10.6084/m9.figshare.25583814.v3)74.

Technical Validation

Assessing the validity of gene prediction and annotation

The number of chromosomes was confirmed by karyotype analysis, indicating that diploid B. brassicae has 16 chromosomes (2n = 16). BUSCO v5.7.075 was used for completeness analysis. In genome mode, the assembly was assessed against the Hemiptera database: a total of 99.24% complete BUSCOs were identified, including 97.61% single-copy and 1.63% duplicated BUSCOs, with 0.36% fragmented and 0.40% missing BUSCOs. Similar results were obtained in protein mode: a total of 98.65% complete BUSCOs were identified, including 96.73% single-copy and 1.91% duplicated BUSCOs, with 0.72% fragmented and 0.64% missing BUSCOs. In addition, CEGMA analysis based on 248 ultra-conserved core eukaryotic genes (CEGs) indicated 95.16% completeness.
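BUSCO categories are related by simple arithmetic: complete = single-copy + duplicated, and complete + fragmented + missing should total 100%. The sketch below checks the percentages reported above against those identities, allowing a small tolerance for rounding of published values. The tolerance and function name are illustrative choices.

```python
def busco_consistent(complete, single, duplicated, fragmented, missing, tol=0.015):
    """Check that BUSCO percentages satisfy the expected identities within rounding."""
    complete_ok = abs((single + duplicated) - complete) <= tol
    total_ok = abs((single + duplicated + fragmented + missing) - 100.0) <= tol
    return complete_ok and total_ok


# Percentages as reported in the text.
genome_mode = dict(complete=99.24, single=97.61, duplicated=1.63,
                   fragmented=0.36, missing=0.40)
protein_mode = dict(complete=98.65, single=96.73, duplicated=1.91,
                    fragmented=0.72, missing=0.64)

print("genome mode consistent: ", busco_consistent(**genome_mode))
print("protein mode consistent:", busco_consistent(**protein_mode))
```

The protein-mode single-copy and duplicated values sum to 98.64 rather than 98.65, a difference within the rounding of the published figures.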