Abstract
The cabbage aphid, Brevicoryne brassicae, is a major pest on Brassicaceae plants, causing significant yield losses annually. However, the lack of genomic resources has hindered progress in understanding this pest at the molecular level. Here, we present a high-quality, chromosomal-level genome assembly for B. brassicae, based on PacBio HiFi long-read sequencing and Hi-C data. The final assembled genome size was 429.99 Mb, with a scaffold N50 of 93.31 Mb. Notably, 96.19% of the assembled sequences were anchored to eight chromosomes. The genome covered 99.24% of BUSCO genes and 95.16% of CEGMA genes, indicating a high level of completeness. By integrating high-coverage transcriptome data, we annotated 22,671 protein-coding genes and 3,594 lncRNA genes. Preliminary comparative genomic analyses focused on genes related to host colonization, such as chemosensory- and detoxification-related genes, as well as cross-kingdom lncRNA Ya. In summary, this study presents a contiguous and complete genome for B. brassicae, which will advance our understanding of the molecular mechanisms underlying its host adaptation, pest behavior, and interaction with Brassicaceae plants.
Background & Summary
Brevicoryne brassicae, commonly known as the cabbage aphid, is a notorious pest that specializes in plants of the Brassicaceae family, including crops such as rapeseed, cabbage, and broccoli. B. brassicae damages plants directly by sucking sap from phloem tissues and indirectly by transmitting several plant viruses, which together cause significant yield losses in many Brassicaceae crops worldwide. B. brassicae is a nonhost-alternating species, meaning its entire life cycle is completed on the herbaceous plants that typically serve as secondary hosts for host-alternating aphids1. The life cycle includes a sexual generation and several asexual generations. During the winter, B. brassicae produces sexual forms and overwinters in the egg stage. In warm seasons and regions, the life cycle simplifies to parthenogenetic reproduction. Winged females (alates) (Fig. 1a) emerge when population density increases and host quality declines. Alates migrate to distant crops and produce offspring via parthenogenesis, leading to exponential population expansion and escalating aphid damage in the field2.
Fig. 1 Morphological and genomic characteristics of B. brassicae. (a) Pictures of parthenogenetic B. brassicae: nymph, apterous adult and alate adult. Bar = 1 mm. (b) Karyotype analysis of B. brassicae. Chromosomes (purple) were stained with Gurr’s Giemsa R66 (Giemsa); diploid B. brassicae has 16 chromosomes (2n (♀) = 16). Bar = 10 μm. (c) Genome-wide contact matrix of B. brassicae generated using Hi-C data. The colour bar indicates the intensity of Hi-C interaction: yellow indicates low and red indicates high intensity. (d) Circos plot overview of features of the B. brassicae genome. Rings from inside to outside (A-E) represent gene density (A), GC ratio (B), GC skew (C), transposable elements (D), and the number of SNPs (E).
Since the first aphid genome, that of Acyrthosiphon pisum, was published in 2010 (ref. 3), dozens of aphid genomes have become available, including those of important agricultural pests such as Myzus persicae4,5, Aphis gossypii6, and Diuraphis noxia7, and of valuable resource insects like Schlechtendalia chinensis8. These genomes have greatly facilitated research on these aphids, leading to a deeper understanding of the molecular mechanisms of aphid biology. In contrast, studies on B. brassicae have been limited, largely due to the lack of genomic resources. Although a genome of B. brassicae is available9, its quality and annotation need further improvement. Therefore, we constructed a high-quality B. brassicae genome at the chromosomal level using PacBio HiFi long reads and high-throughput chromosomal conformation capture (Hi-C) data. We annotated the genome for both protein-coding genes and long non-coding RNA (lncRNA) transcripts, and performed phylogenetic and evolutionary analyses with different aphid genomes. Our efforts will offer substantial support for a deeper understanding of B. brassicae and for future studies of aphids.
After quality control and filtering, we obtained a total of 29.00 Gb (~67.44 × depth) of PacBio long reads and 42.78 Gb (~99.49 × depth) of Illumina short reads. These reads were assembled into 131 contigs with an N50 length of 16.79 Mb (Table 1). Chromosome scaffolding based on Hi-C data resulted in eight chromosomes that contained 96.19% of scaffold sequences. The chromosome number was confirmed by karyotype analysis (Fig. 1b). The final assembly has a total size of 429.99 Mb and includes eight chromosomes (Fig. 1c, Table 1). Chromosome lengths ranged from 16.74 Mb to 125.10 Mb (Fig. 1d). The genome assembly is highly complete in terms of gene content, with 99.24% of Hemiptera BUSCO (Benchmarking Universal Single-Copy Ortholog) genes and 95.16% of CEGMA (Core Eukaryotic Genes Mapping Approach) genes being present (Table 1), indicating a comprehensive representation of the gene set expected for this taxonomic group. Altogether, the assembly of the B. brassicae genome is contiguous, accurate, and complete.
We used a phylogenomic approach to assess the phylogenetic relationships among B. brassicae and 14 other hemipteran insects. Phylogenetic analysis revealed that B. brassicae diverged from D. noxia approximately 49.9 million years ago (MYA) and from other Macrosiphini species about 53.9 MYA (Fig. 2a). We also identified chromosome 1, the largest one (125.10 Mb) in the B. brassicae genome, as the X chromosome, since it showed extensive synteny with the X chromosomes of M. persicae and A. pisum10 (Fig. 2b).
Fig. 2 Comparative analysis of the B. brassicae genome. (a) Phylogenetic tree constructed based on the 484 single-copy genes of 15 aphids (Myzus cerasi, Myzus persicae, Sitobion avenae, Metopolophium dirhodum, Acyrthosiphon pisum, B. brassicae, Diuraphis noxia, Rhopalosiphum padi, Rhopalosiphum maidis, Aphis fabae, Aphis gossypii, Aphis glycines, Sipha flava, Cinara cedri, and Daktulosphaira vitifoliae). (b) Synteny of the genomes of A. pisum, B. brassicae and M. persicae. (c) The number of genes related to host adaptation in B. brassicae. Chemosensory-related gene families, including gustatory receptors (GR), olfactory receptors (OR), ionotropic receptors (IR), odorant-binding proteins (OBP), and chemosensory proteins (CSP); detoxification gene families, including cytochrome P450 (P450), carboxylesterases (CCE), glutathione S-transferases (GST), UDP-glucuronosyltransferases (UGT), ATP-binding cassette transporters (ABC), and myrosinase (MYR) genes, as well as the lncRNA Ya gene family.
Using the chromosome-level genome assembly, we annotated protein-coding genes and lncRNA transcripts using evidence from 90.31 Gb of RNA sequencing (RNA-seq) data (63.60 Gb un-stranded and 26.71 Gb stranded). In total, 22,671 protein-coding genes and 3,594 lncRNA genes were annotated (Table 1). Among them, 22 lncRNA genes in the B. brassicae genome were identified as homologs of Ya1 in M. persicae, previously known as a virulence factor11. Conservation of protein-coding sequences is high among different aphid species, whereas lncRNA sequences tend to be more divergent (Fig. 2a). We also identified 141.19 Mb of repetitive sequences, accounting for 32.84% of the genome assembly (Table 2). Additionally, we annotated 154 miRNA loci, 332 tRNA loci, 837 rRNA loci, and 124 snRNA loci (Table 3).
We annotated chemosensory-related genes for gustatory receptors (GRs), odorant receptors (ORs), ionotropic receptors (IRs), and odorant-binding proteins (OBPs) as well as chemosensory proteins (CSPs). The B. brassicae genome encodes 33 GRs, 25 ORs, 17 IRs, 10 OBPs, and 10 CSPs. In addition, we annotated detoxification-related genes and identified 48 genes for cytochrome P450 (P450), 28 for carboxyl/choline esterase (CCE), 21 for glutathione-S-transferase (GST), 49 for UDP-Glycosyltransferase (UGT), 77 for ATP-binding cassette transporters (ABC), and 7 for myrosinases (MYR) (Fig. 2c).
This study presents a high-quality chromosome-level genome assembly of B. brassicae with comprehensive annotations, providing a valuable genomic resource for understanding the genetic, evolutionary, and ecological aspects of the cabbage aphid, which may in turn help in implementing integrated management of this pest.
Methods
Sample collection and genome sequencing
The B. brassicae nymphs were collected from rapeseed fields in Lanzhou (35°56′39.062″ N, 104°8′49.009″ E), Gansu province, China. Subsequently, these aphids were reared on the Brassica napus variety Zhongshuang11 in a growth chamber set to 22 °C with a 16/8 h light/dark cycle in our laboratory. Genomic DNA was extracted from 50 adult insects using the CTAB (cetyltrimethylammonium bromide) method12 for Illumina, PacBio, and Hi-C sequencing. The quality of the genomic DNA was validated by 1.5% agarose gel electrophoresis and a NanoDrop 2000C spectrophotometer. Briefly, the genomic DNA fragmented to a size of 350 bp was end-polished, A-tailed, and then ligated with the full-length adapter following the manual of the SMRTbell Express Template Prep Kit 2.0 (Pacific Biosciences, Menlo Park, CA, USA). Libraries with 350 bp inserts were constructed and sequenced on the Illumina NovaSeq 6000 platform. Raw reads were quality-checked by removing adapter sequences and low-quality reads; ultimately, 42.78 Gb of clean data were obtained for subsequent analysis.
For PacBio HiFi sequencing, the genomic DNA was sheared into ~15 kb fragments using g-Tubes (Covaris, Woburn, MA, USA) and then purified with 0.45 × AMPure PB beads (Beckman Coulter, Brea, CA, USA). The cleaned DNA fragments were used to construct SMRTbell libraries as described above. Fragments with sizes of 15–18 kb were selected using BluePippin (Sage Science, Beverly, MA, USA). After annealing the primers and binding Sequel DNA polymerase to the SMRTbell templates, the libraries were sequenced using one SMRT Cell 1M on the Sequel System (Biomarker Technologies). In total, 29.00 Gb of subreads with an average read length of 10.45 kb were obtained, producing an overall 67.44 × coverage of the B. brassicae genome. The Hi-C technique was used to achieve chromosome-level assembly by identifying the contacts between different regions of chromatin filaments. The Hi-C library was constructed following the standard library preparation protocol and sequenced on the Illumina NovaSeq 6000 platform. In total, 60.78 Gb of 150-bp paired-end clean reads were obtained.
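The coverage values quoted for the PacBio and Illumina data follow from dividing the sequenced bases by the assembled genome size. A minimal sketch of that arithmetic is given below; the function name is illustrative, and the conversion assumes 1 Gb = 1,000 Mb.

```python
def sequencing_depth(total_bases_gb: float, genome_size_mb: float) -> float:
    """Return fold coverage: sequenced bases divided by genome size.

    Assumes 1 Gb = 1,000 Mb (decimal units).
    """
    return total_bases_gb * 1000.0 / genome_size_mb


GENOME_MB = 429.99  # assembled genome size reported in this study

pacbio_depth = sequencing_depth(29.00, GENOME_MB)    # PacBio long reads
illumina_depth = sequencing_depth(42.78, GENOME_MB)  # Illumina short reads

print(f"PacBio: {pacbio_depth:.2f}x, Illumina: {illumina_depth:.2f}x")
```

Under these assumptions the two values round to the reported 67.44 × and 99.49 × depths.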
RNA extraction and transcriptome sequencing
Parthenogenetic aphids were collected for RNA-seq. 30 first-instar nymphs, 15 apterous adults, and 15 alate adults collected from rapeseed plants in the lab were used separately for RNA extraction. Total RNA was extracted using TRIzol reagent (Invitrogen, Carlsbad, CA, USA). RNA quality was evaluated using 1.5% agarose gel electrophoresis and the concentration was measured by a NanoDrop 2000C spectrophotometer (Thermo Fisher Scientific, Pittsburgh, PA, USA). RNA integrity was quantified by an Agilent 5400 Fragment Analyzer (Agilent, Santa Clara, CA, USA) following the manufacturer’s instructions. RNA-seq libraries were constructed using the NEBNext® Ultra™ RNA Library Prep Kit (NEB, Ipswich, MA, USA) following the manufacturer’s instructions. Libraries were then sequenced on the Illumina NovaSeq 6000 platform, and 63.60 Gb un-stranded and 26.71 Gb stranded 150-bp paired-end reads were obtained and used for gene prediction.
Genome estimation and assembly
PacBio HiFi long-read data were used to generate a contig-level assembly of the B. brassicae genome. WTDBG2 v2.5 (ref. 13) was used to generate a preliminary assembly, and Pilon v1.23 (ref. 14) was used for short-read correction. The resulting B. brassicae genome assembly consisted of 131 contigs with a total length of 429.99 Mb and a contig N50 of 16.79 Mb. A total of 29.00 Gb of Hi-C clean data, with low-quality reads and adaptor sequences removed, were mapped to the draft B. brassicae genome by BWA v0.7.10 (ref. 15) with the default parameters. Invalid read pairs, including dangling ends, re-ligation, self-cycle, and dumped pairs, were identified and removed from the uniquely aligned read pairs by HiC-Pro v2.10.0 (ref. 16). Valid interaction pairs for scaffold correction were processed by LACHESIS v2e27abb (ref. 17) with the default parameters to cluster, order, and orient the contigs onto chromosomes. The draft assembly was examined for contamination by manually inspecting taxon-annotated GC content coverage plots generated using BlobTools v1.0.1 (refs. 18, 19). Ultimately, eight chromosomes with a scaffold N50 of 93.31 Mb were constructed; they contained 96.19% of the sequences in the 429.99 Mb assembly.
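For readers unfamiliar with the contiguity metric used above, the N50 is the sequence length at which the cumulative length of the sequences, sorted from longest to shortest, first reaches half of the total assembly length. The sketch below illustrates the definition on made-up lengths; it is not the software used in this study.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    Sequences are sorted from longest to shortest; the N50 is the length of
    the sequence at which the running total first reaches half of the
    summed length.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("cannot compute N50 of an empty collection")


# Toy example with made-up lengths (arbitrary units), total = 150.
toy_lengths = [50, 40, 30, 20, 10]
toy_n50 = n50(toy_lengths)
print(toy_n50)  # 50 + 40 = 90 >= 75, so the N50 is 40
```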
Genomic repeat annotation
Repeat sequences mainly include tandem and interspersed repeats, the latter being primarily transposable elements (TEs). The TE sequences were annotated by a combination of homology-based and de novo approaches. Initially, a de novo repeat library based on the assembly sequences was generated using RepeatModeler v2.0.2a (ref. 20), LTR_FINDER21, and RepeatScout22 with default parameters. Subsequently, the predicted repeats were further classified by the PASTE Classifier v1.0 (ref. 23) and combined with the Dfam v3.2 database (ref. 24) to construct a non-redundant, species-specific TE library. This library was used as the database for identifying TE sequences in the assembled genome by homology searching with RepeatMasker v4.10 (ref. 25) and RepeatProteinMask25. Ultimately, 141.19 Mb of TE sequences were identified, accounting for 32.84% of the genome assembly. Long terminal repeats (LTRs) were the largest category of transposable elements, representing 17.47% of the genome, followed by DNA transposons (8.81%), unknown repeated sequences (6.62%), and long interspersed nuclear elements (LINEs; 1.51%) (Table 2).
Protein-coding gene annotation
An integrated approach based on the B. brassicae transcriptome and protein homologs from other aphids was used to predict protein-coding genes on the repeat-masked reference genome. The RNA-seq data from pooled wingless/winged asexual females and nymphs were used. Reads were mapped to the reference genome using HISAT2 v2.2.1 (ref. 26) with default parameters and processed by SAMtools v0.1.18 (ref. 27). The alignment results were provided to BRAKER1 (ref. 28), which generated a transcriptome-based gene set. Furthermore, the protein-coding genes of related species, including M. persicae, A. pisum, and A. gossypii, and of the model insect Drosophila melanogaster, were filtered to remove alternative isoforms and provided to BRAKER2 (ref. 29), which generated a second, homolog-based gene set. The two independent gene sets were compared at both exon and transcript levels to generate a consensus gene set. To do this, models from one set that did not overlap any model in the other set were selected first; models that differed between the two approaches were then checked against the homolog-alignment and transcriptome evidence, and the best-supported model was retained.
Phylogenetic tree construction and genome synteny analyses
We identified 484 single-copy genes using OrthoFinder v2.5.4 (ref. 30) based on protein sequences from 15 aphid genomes, including 7 Macrosiphini species (M. cerasi9, M. persicae5, Sitobion avenae31, Metopolophium dirhodum32, A. pisum10, B. brassicae, and D. noxia7), 5 Aphidini species (Rhopalosiphum maidis33, Rhopalosiphum padi9, Aphis fabae9, A. gossypii6, and Aphis glycines9), and one species each from Chaitophorinae (Sipha flava34), Lachnini (Cinara cedri35), and Phylloxeridae (Daktulosphaira vitifoliae36). The protein sequences of the single-copy genes were concatenated and aligned automatically by OrthoFinder, generating a multiple sequence alignment file that was used for phylogenetic analysis. For phylogenetic tree reconstruction, ProtTest v3.2 (ref. 37) was first used to select the best-fit model, “JTT + I + G4”, which was then used for maximum likelihood tree reconstruction with RAxML v8.2.12 (ref. 38). To estimate divergence dates, we utilized the topology derived from the maximum likelihood (ML) analysis of first and second position nucleotides as the input tree. We incorporated a calibration point of 23.9 million years ago (MYA) between Metopolophium dirhodum and Acyrthosiphon pisum39. This calibration was employed as the minimum age in soft-bound uniform priors, which were then applied in a Bayesian MCMCTree molecular dating analysis using PAML (Phylogenetic Analysis by Maximum Likelihood), with the requirement that the sites be present in at least 95% of the taxa40. We used iTOL v6 (ref. 41) for tree visualization.
For the genome synteny analysis, the 1:1:1 orthologs among the B. brassicae, M. persicae, and A. pisum genomes were extracted from the OrthoFinder results and fed to MCScanX_h42, which was run with the “-b 2” option to obtain the inter-species collinearity among B. brassicae, M. persicae, and A. pisum. SynVisio43 was used to visualize the genome synteny.
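The selection of 1:1:1 orthologs described above can be illustrated with a short sketch. It assumes the orthogroups have already been loaded into a dictionary; the species labels and gene IDs are hypothetical, and this is not the script used in this study.

```python
def one_to_one_orthologs(orthogroups, species):
    """Return orthogroups with exactly one gene in every listed species.

    orthogroups: dict mapping orthogroup ID -> {species label: [gene IDs]}
    species:     labels of the species that must each contribute one gene
    """
    selected = {}
    for og_id, members in orthogroups.items():
        if all(len(members.get(sp, [])) == 1 for sp in species):
            selected[og_id] = tuple(members[sp][0] for sp in species)
    return selected


# Toy input with made-up gene IDs (hypothetical, not real annotations).
toy_orthogroups = {
    "OG0000001": {"Bbra": ["Bb_g1"], "Mper": ["Mp_g7"], "Apis": ["Ap_g3"]},
    "OG0000002": {"Bbra": ["Bb_g2", "Bb_g9"], "Mper": ["Mp_g8"], "Apis": ["Ap_g4"]},
    "OG0000003": {"Bbra": ["Bb_g5"], "Mper": ["Mp_g6"]},
}

triplets = one_to_one_orthologs(toy_orthogroups, ["Bbra", "Mper", "Apis"])
print(triplets)
```

Only the first toy orthogroup is kept: the second contains two B. brassicae copies and the third lacks an A. pisum gene.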
Annotation of long non-coding RNAs
The process of identifying lncRNA genes in the B. brassicae genome was divided into three main steps. First, the stranded RNA-seq reads were mapped to the B. brassicae genome. The raw reads were subjected to quality control using fastp v0.23.4 (ref. 44) with default parameters to ensure data integrity for downstream analyses. The processed reads were aligned to the B. brassicae genome using HISAT2 v2.2.1 (ref. 26). The aligned reads were assembled into transcripts by StringTie v2.2.1 (ref. 45). GffRead v0.12.7 (ref. 46) with the parameters “-V -H -U -N -P -J -M -K -Q -Y -Z -F --keep-exon-attrs” was used to extract the assembled transcript sequences. Second, LGC (Long Genomic Region Classifier) v1.0 (ref. 47) was used to identify transcripts with non-coding features based on the relationship between ORF (open reading frame) length and GC content. Meanwhile, the assembled transcripts were subjected to CPC2 (Coding Potential Calculator 2)48 to calculate their coding potential. The intersection of the LGC and CPC2 results was taken as the set of putative lncRNAs. Third, the putative lncRNA transcripts were screened against Rfam v14.3 (ref. 49) to eliminate housekeeping RNAs, such as rRNA, tRNA, and snoRNA.
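The logic of the second and third steps can be summarized as set operations on transcript IDs: keep transcripts called non-coding by both LGC and CPC2, then discard those matching Rfam housekeeping families. The sketch below assumes each tool's output has already been parsed into a collection of IDs; the function name and transcript IDs are hypothetical.

```python
def screen_lncrnas(lgc_noncoding, cpc2_noncoding, rfam_housekeeping):
    """Return transcript IDs supported as lncRNAs by both classifiers.

    lgc_noncoding:     IDs classified as non-coding by LGC
    cpc2_noncoding:    IDs classified as non-coding by CPC2
    rfam_housekeeping: IDs matching housekeeping RNA families in Rfam
    """
    putative = set(lgc_noncoding) & set(cpc2_noncoding)  # intersection of both tools
    return putative - set(rfam_housekeeping)             # drop rRNA, tRNA, snoRNA hits


# Toy transcript IDs (hypothetical).
lgc_calls = {"tx1", "tx2", "tx3", "tx5"}
cpc2_calls = {"tx2", "tx3", "tx4", "tx5"}
rfam_hits = {"tx5"}

lncrnas = screen_lncrnas(lgc_calls, cpc2_calls, rfam_hits)
print(sorted(lncrnas))
```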
To annotate the Ya gene family in the B. brassicae genome, the sequence of M. persicae Ya1 (MpYa1) was used as a query. BLASTn50 was used to align the MpYa1 sequence against the annotated lncRNA transcripts with an E-value cutoff of 10^-5 and an 80% similarity threshold. The final alignment resulted in the identification of 22 Ya genes in B. brassicae.
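The two thresholds described above can be applied to BLAST tabular output as sketched below. This sketch assumes the standard 12-column tabular format (percent identity in column 3, E-value in column 11) and assumes that the 80% similarity threshold corresponds to the percent-identity column; the hit lines are invented for illustration.

```python
def filter_blast_hits(tabular_lines, max_evalue=1e-5, min_identity=80.0):
    """Return subject IDs of hits passing the E-value and identity thresholds.

    Expects the standard 12-column BLAST tabular layout:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send
    evalue bitscore
    """
    kept = set()
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        identity = float(fields[2])
        evalue = float(fields[10])
        if evalue < max_evalue and identity >= min_identity:
            kept.add(fields[1])
    return kept


# Invented hits for illustration only (not real results).
toy_hits = [
    "MpYa1\tlnc_0001\t91.2\t300\t20\t2\t1\t300\t5\t304\t2e-30\t150",
    "MpYa1\tlnc_0002\t75.0\t280\t60\t5\t1\t280\t1\t280\t1e-12\t80",
    "MpYa1\tlnc_0003\t88.0\t40\t4\t0\t1\t40\t10\t49\t0.02\t30",
]

ya_candidates = filter_blast_hits(toy_hits)
print(sorted(ya_candidates))
```

In this toy input only the first hit passes: the second falls below 80% identity and the third exceeds the E-value cutoff.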
Gene family identification
To annotate the detoxification- and chemosensory-related genes of B. brassicae, amino acid sequences of those genes reported in other aphid species were used as queries in DIAMOND BLAST v0.8.29 (ref. 51) to identify putative homologs with an E-value less than 10^-5. The identified sequences were further validated by domain annotation using PfamScan52 and by annotation with the Protein BLAST tool at the National Center for Biotechnology Information (NCBI)53. The sequences of the detoxification-related genes, including cytochrome P450 (P450), carboxylesterases (CCE), glutathione S-transferases (GST), UDP-glucuronosyltransferases (UGT), ATP-binding cassette transporters (ABC), and myrosinase (MYR) genes, were downloaded from InsectBase v2.0 (ref. 54). The sequences of the chemosensory-related genes, including gustatory receptors (GRs), olfactory receptors (ORs), ionotropic receptors (IRs), odorant-binding proteins (OBPs), and chemosensory proteins (CSPs), were obtained from published papers55,56.
Karyotype analysis
The number of chromosomes was confirmed by karyotype analysis, indicating that diploid B. brassicae has 16 chromosomes (2n = 16). The Gurr’s Giemsa R66 chromosome staining method57 was used. Briefly, chromosome squash preparations were made from young embryos dissected from parthenogenetic adult aphids. The embryos were treated in 0.75% potassium chloride and then fixed in a freshly prepared mixture of absolute methanol and glacial acetic acid (3:1 by volume) for 10 minutes. Next, the embryos were carefully transferred on the tip of a pin to a clean microscope slide with a small drop of 45% propionic acid (5 minutes), squashed with a coverslip, and then dried for 24 hours at room temperature. HCl solution (0.2 M) was applied dropwise for 30 minutes at room temperature, followed by rinsing with distilled water and immersion in a 5% saturated Ba(OH)2 solution. The sample was then treated in a 60 °C constant-temperature water bath for 3 minutes. After the treatment, it was briefly processed in HCl solution (0.2 M) to stop the reaction with the strong base, rinsed with distilled water, and air-dried at room temperature. Subsequently, the sample was stained with 5% Giemsa stain (pH 7.0) for 30 minutes, air-dried at room temperature, and examined and photographed under an optical microscope.
Data Records
The genome sequencing data (PacBio, Illumina and Hi-C) of B. brassicae have been submitted to the Sequence Read Archive (SRA) at the NCBI with accession numbers SRR28892727, SRR28892726 and SRR28892725 (refs. 58–60) under the BioProject PRJNA1099426. The assembled genome is deposited under the same BioProject at NCBI (JBHUPR000000000.1; ref. 61). The RNA-seq data generated in this study have been deposited in the SRA at the NCBI under the BioProject accession number PRJNA1104693. This submission includes 9 un-stranded RNA-seq datasets with accession numbers SRR28829023, SRR28829022, SRR28829021, SRR28829020, SRR28829019, SRR28829018, SRR28829017, SRR28829016 and SRR28829015 (refs. 62–70) and 3 stranded RNA-seq datasets with accession numbers SRR28892655, SRR28892654 and SRR28892653 (refs. 71–73). The B. brassicae genome assembly FASTA and GFF files, the annotation GTF files of protein-coding genes, the annotation files including PFAM, KEGG and GO, the annotation files of several regulatory elements including transposable elements, lncRNAs and miRNAs, the annotation files of tRNA, rRNA, and snRNA loci, and the protein sequences of detoxification- and chemosensory-related genes have been deposited in the Figshare database (https://doi.org/10.6084/m9.figshare.25583814.v3)74.
Technical Validation
Assessing the validity of gene prediction and annotation
The number of chromosomes was confirmed by karyotype analysis, indicating that diploid B. brassicae has 16 chromosomes (2n = 16). BUSCO v5.7.0 (ref. 75) was used for completeness analysis. Genome completeness was assessed in genome mode against the Hemiptera database. A total of 99.24% complete BUSCOs were identified, including 97.61% single-copy BUSCOs and 1.63% duplicated BUSCOs, together with 0.36% fragmented BUSCOs and 0.40% missing BUSCOs. Similar results were obtained in protein mode: a total of 98.65% complete BUSCOs were identified, including 96.73% single-copy BUSCOs and 1.91% duplicated BUSCOs, together with 0.72% fragmented BUSCOs and 0.64% missing BUSCOs. CEGMA analysis based on 248 ultra-conserved core eukaryotic genes (CEGs) indicated 95.16% completeness.
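The BUSCO categories reported above can be checked for internal consistency: complete BUSCOs are the sum of single-copy and duplicated BUSCOs, and complete, fragmented, and missing BUSCOs should sum to 100%. The sketch below performs this check on the reported percentages; the function name is illustrative.

```python
def busco_totals(single, duplicated, fragmented, missing):
    """Return (complete %, overall %) from the four BUSCO category percentages.

    complete = single-copy + duplicated
    overall  = complete + fragmented + missing (expected to be 100)
    """
    complete = single + duplicated
    overall = complete + fragmented + missing
    return round(complete, 2), round(overall, 2)


# Percentages as reported in the text.
genome_mode = busco_totals(97.61, 1.63, 0.36, 0.40)
protein_mode = busco_totals(96.73, 1.91, 0.72, 0.64)

print("genome mode:", genome_mode)
print("protein mode:", protein_mode)
```

For genome mode the categories reproduce the reported 99.24% exactly. For protein mode the rounded categories sum to 98.64% rather than the reported 98.65%, a 0.01 difference that is presumably due to rounding of the individual percentages.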
Code availability
No specific script was used in this work. All commands and pipelines used in data processing were executed according to the manual and protocols of the corresponding bioinformatic software.
References
Pal, M. & Singh, R. Biology and ecology of the cabbage aphid, Brevicoryne brassicae (Linn.) (Homoptera: Aphididae): a review. J Aphidol. 27, 59–78 (2013).
Hughes, R. H. Population dynamics of the cabbage aphid, Brevicoryne brassicae (L.). J. Anim. Ecol. 32, 393–424 (1963).
The International Aphid Genomics Consortium. Genome sequence of the pea aphid Acyrthosiphon pisum. PLoS Biol. 8, e1000313 (2010).
Mathers, T. C. et al. Rapid transcriptional plasticity of duplicated gene clusters enables a clonally reproducing aphid to colonise diverse plant species. Genome Biol. 18, 27 (2017).
Mathers, T. C. et al. Chromosome-scale genome assemblies of aphids reveal extensively rearranged autosomes and long-term conservation of the X chromosome. Mol. Biol. Evol. 38, 856–875 (2020).
Zhang, S. et al. Chromosome-level genome assemblies of two cotton-melon aphid Aphis gossypii biotypes unveil mechanisms of host adaption. Mol Ecol Resour. 22, 1120–1134 (2022).
Nicholson, S. J. et al. The genome of Diuraphis noxia, a global aphid pest of small grains. BMC Genom. 16, 429 (2015).
Wei, H. Y. et al. Chromosome-level genome assembly for the horned-gall aphid provides insights into interactions between gall-making insect and its host plant. Ecol. Evol. 12, e8815 (2022).
Mathers, T. C. et al. Aphidinae comparative genomics resource (Version v2) [Data set]. Zenodo. (2022).
Li, Y., Park, H., Smith, T. E. & Moran, N. A. Gene family evolution in the pea aphid based on chromosome-level genome assembly. Mol. Biol. Evol. 36, 2143–2156 (2019).
Chen, Y. et al. An aphid RNA transcript migrates systemically within plants and is a virulence factor. Proc. Natl. Acad. Sci. USA. 117, 12763–12771 (2020).
Chen, H., Rangasamy, M., Tan, S. Y., Wang, H. & Siegfried, B. D. Evaluation of five methods for total DNA extraction from western corn rootworm beetles. PLoS One. 5, e11963 (2010).
Ruan, J. & Li, H. Fast and accurate long-read assembly with wtdbg2. Nat Methods. 17, 155–158 (2020).
Walker, B. J. et al. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS One. 9, e112963 (2014).
Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 25, 1754–1760 (2009).
Servant, N. et al. HiC-Pro: an optimized and flexible pipeline for Hi-C data processing. Genome Biol. 16, 259 (2015).
Burton, J. N. et al. Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions. Nat. Biotechnol. 31, 1119–1125 (2013).
Kumar, S. et al. Blobology: exploring raw genome data for contaminants, symbionts and parasites using taxon-annotated GC-coverage plots. Front. Genet. 4, 237 (2013).
Laetsch, D. R. & Blaxter, M. L. BlobTools: interrogation of genome assemblies. F1000Research. 6, 1287 (2017).
Flynn, J. M. et al. RepeatModeler2 for automated genomic discovery of transposable element families. Proc. Natl. Acad. Sci. USA. 117, 9451–9457 (2020).
Xu, Z. & Wang, H. LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic Acids Res. 35, W265–W268 (2007).
Price, A. L., Jones, N. C. & Pevzner, P. A. De novo identification of repeat families in large genomes. Bioinformatics. 21, i351–i358 (2005).
Hoede, C. et al. PASTEC: an automatic transposable element classification tool. PLoS One. 9, e91929 (2014).
Wheeler, T. J. et al. Dfam: a database of repetitive DNA based on profile hidden Markov models. Nucleic Acids Res. 41, D70–D82 (2012).
Tempel, S. Using and Understanding RepeatMasker. Methods Mol. Biol. 859, 29–51 (2012).
Kim, D., Paggi, J. M., Park, C., Bennett, C. & Salzberg, S. L. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat. Biotechnol. 37, 907–915 (2019).
Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 25, 2078–2079 (2009).
Hoff, K. J., Lange, S., Lomsadze, A., Borodovsky, M. & Stanke, M. BRAKER1: unsupervised RNA-seq-based genome annotation with GeneMark-ET and AUGUSTUS. Bioinformatics. 32, 767–769 (2016).
Brůna, T., Hoff, K. J., Lomsadze, A., Stanke, M. & Borodovsky, M. BRAKER2: Automatic eukaryotic genome annotation with GeneMark-EP+ and AUGUSTUS supported by a protein database. NAR Genom Bioinform. 3, lqaa108 (2021).
Emms, D. M. & Kelly, S. OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biol. 20, 1–14 (2019).
Byrne, S. et al. Genome sequence of the English grain aphid, Sitobion avenae and its endosymbiont Buchnera aphidicola. G3-Genes Genom Genet. 12, jkab418 (2021).
Zhu, B. et al. A high-quality chromosome-level assembly genome provides insights into wing dimorphism and xenobiotic detoxification in Metopolophium dirhodum (Walker). Res Sq. 1–24 (2022).
Chen, W. B. et al. Genome sequence of the corn leaf aphid (Rhopalosiphum maidis Fitch). Gigascience. 8, giz033 (2019).
Smith, T. E., Li, Y., Perreau, J. & Moran, N. A. Elucidation of host and symbiont contributions to peptidoglycan metabolism based on comparative genomics of eight aphid subfamilies and their Buchnera. PLoS Genet. 18, e1010195 (2022).
Julca, I. et al. Phylogenomics identifies an ancestral burst of gene duplications predating the diversification of aphidomorpha. Mol. Biol. Evol. 37, 730–756 (2019).
Li, Z. et al. Phylloxera and aphids show distinct features of genome evolution despite similar reproductive modes. Mol. Biol. Evol. 40, msad271 (2023).
Darriba, D., Taboada, G. L., Doallo, R. & Posada, D. ProtTest 3: fast selection of best-fit models of protein evolution. Bioinformatics. 27, 1164–1165 (2011).
Stamatakis, A. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics. 30, 1312–1313 (2014).
Hardy, N. B., Peterson, D. A. & Von Dohlen, C. D. The evolution of life cycle complexity in aphids: Ecological optimization or historical constraint? Evolution. 69, 1423–1432 (2015).
Yang, Z. PAML 4: phylogenetic analysis by maximum likelihood. Mol. Biol. Evol. 24, 1586–1591 (2007).
Letunic, I. & Bork, P. Interactive Tree of Life (iTOL) v6: recent updates to the phylogenetic tree display and annotation tool. Nucleic Acids Res. gkae268 (2024).
Wang, Y. P. et al. MCScanX: a toolkit for detection and evolutionary analysis of gene synteny and collinearity. Nucleic Acids Res. 40 (2012).
Bandi, V. K. SynVisio: a multiscale tool to explore genomic conservation. Thesis. Saskatoon, Saskatchewan, Canada: University of Saskatchewan. (2020).
Chen, S., Zhou, Y., Chen, Y. & Gu, J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics. 34, i884–i890 (2018).
Pertea, M. et al. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat. Biotechnol. 33, 290–295 (2015).
Pertea, G. & Pertea, M. GFF utilities: GffRead and GffCompare. F1000Res. 9, 304 (2020).
Wang, G. et al. Characterization and identification of long non-coding RNAs based on feature relationship. Bioinformatics. 35, 2949–2956 (2019).
Kang, Y. J. et al. CPC2: A fast and accurate coding potential calculator based on sequence intrinsic features. Nucleic Acids Res. 45, W12–W16 (2017).
Kalvari, I. et al. Rfam 14: expanded coverage of metagenomic, viral and microRNA families. Nucleic Acids Res. 49, D192–D200 (2021).
Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990).
Buchfink, B., Xie, C. & Huson, D. H. Fast and sensitive protein alignment using DIAMOND. Nat Methods. 12, 59–60 (2015).
Paysan-Lafosse, T. et al. InterPro in 2022. Nucleic Acids Res. 51, D418–D427 (2023).
Mcginnis, S. & Madden, T. L. BLAST: at the core of a powerful and diverse set of sequence analysis tools. Nucleic Acids Res. 32, 20–25 (2004).
Yin, C. et al. InsectBase: a resource for insect genomes and transcriptomes. Nucleic Acids Res. 44, D801–D807 (2016).
Robertson, H. M., Robertson, E. C. N., Walden, K. K. O., Enders, L. S. & Miller, N. J. The chemoreceptors and odorant binding proteins of the soybean and pea aphids. Insect Biochem. Mol. Biol. 105, 69–78 (2019).
Kuang, Y. et al. Candidate odorant-binding protein and chemosensory protein genes in the turnip aphid Lipaphis erysimi. Arch. Insect Biochem. 113, e22022 (2023).
Blackman, R. L. Chromosome numbers in the Aphididae and their taxonomic significance. Syst. Entomol. 5, 7–25 (1980).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR28892727 (2024).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR28892726 (2024).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR28892725 (2024).
NCBI GenBank https://identifiers.org/ncbi/insdc:JBHUPR000000000.1 (2024).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR28829023 (2024).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR28829022 (2024).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR28829021 (2024).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR28829020 (2024).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR28829019 (2024).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR28829018 (2024).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR28829017 (2024).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR28829016 (2024).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR28829015 (2024).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR28892655 (2024).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR28892654 (2024).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR28892653 (2024).
Chen, Y. Annotated reference genome of Brevicoryne brassicae. figshare https://doi.org/10.6084/m9.figshare.25583814.v3 (2024).
Simão, F. A., Waterhouse, R. M., Ioannidis, P., Kriventseva, E. V. & Zdobnov, E. M. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics. 31, 3210–3212 (2015).
Acknowledgements
This project is funded by the National Key Research and Development Program of China (project No. 2023YFF1000703) and supported by Hubei Hongshan Laboratory (project No. 2022hszd026 to YC), the Startup Foundation for Advanced Talents at HZAU (to YC), the First Class Discipline Construction Funds of the College of Plant Science and Technology, Huazhong Agricultural University (project No. 2022ZKPY003 to YC), and the Wuhan Yingcai Talent Program (to YC).
Author information
Contributions
Y.C. conceived and led the research. Y.C., G.L., J.W., Z.L. and Y.Z. were involved in sample collection, preparation and genome assembly. S.Z., Y.C., J.W. and G.L. contributed to gene prediction, annotation and data analysis. W.Y. conducted the karyotype analysis. Y.C. and R.H. contributed to data management. Y.C. and J.W. wrote the manuscript, and all authors read, revised and approved the final version. Y.C. supervised the project.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Wu, J., Li, G., Lin, Z. et al. A chromosome-level genome assembly of the cabbage aphid Brevicoryne brassicae. Sci Data 12, 167 (2025). https://doi.org/10.1038/s41597-025-04501-2
DOI: https://doi.org/10.1038/s41597-025-04501-2