Background & Summary

Schizaphis graminum (Rondani), the greenbug, is a major pest of cereal crops such as wheat, barley, and sorghum, and also infests several wild grasses; its phloem feeding causes severe direct damage, including leaf chlorosis and plant stunting1. It exhibits a complex life history with cyclical parthenogenesis, typically reproducing viviparously for multiple generations during the crop-growing season and then switching to sexual reproduction to produce overwintering eggs in winter. The pest causes substantial yield losses through direct feeding damage and, more notably, by serving as an efficient vector of barley yellow dwarf viruses (BYDV)2,3,4. The greenbug feeds on the phloem sap of hosts such as wheat and oats (Fig. 1). Infestations typically originate on the lower leaves and progressively advance upward to the younger foliage. Dense colonies on the abaxial leaf surface, accompanied by extensive honeydew secretion, inhibit photosynthesis and disrupt normal plant growth. Symptoms include leaf reddening, reduced spike size, impaired grain filling, and, in severe cases, complete sterility, resulting in shriveled kernels. Such damage substantially compromises both the yield and quality of affected crops5,6.

Fig. 1

Adults of Schizaphis graminum and associated infestation symptoms. (A) Adult of S. graminum. (B) Infestation symptoms on wheat after seven days of feeding. (C) Infestation symptoms on wheat after fourteen days of feeding.

Recent advances in high-quality genome assemblies of aphids, such as Aphis glycines7, Brevicoryne brassicae8 and Therioaphis trifolii9, have greatly enhanced our understanding of aphid biology, particularly in areas such as polyphenism, host adaptation, and insecticide resistance. However, genomic resources for S. graminum have remained limited, hindering the elucidation of the genetic mechanisms governing its rapid adaptation to agricultural environments, the development of insecticide resistance, and its highly efficient vectoring capacity10. The population dynamics and pest severity of S. graminum are strongly shaped by abiotic stresses, such as drought, temperature extremes, and chemical control practices11. Transcriptomic studies have implicated key gene families involved in stress responses, such as cytochrome P450s and cuticular proteins5,12. However, the lack of a chromosome-level reference genome has impeded comprehensive genome-wide analyses of these gene families, their evolutionary trajectories, and the regulatory mechanisms controlling their expression.

Here, we present the first high-quality, chromosome-level genome assembly of S. graminum, which offers a robust platform for comparative genomic and functional analyses. Based on this assembly, we systematically identified and characterized expanded gene families implicated in insecticide metabolism and environmental adaptation. The final assembled genome spans 380.30 Mb, with 379.43 Mb (99.77%) anchored to four pseudochromosomes. The assembly exhibits high continuity, as reflected in scaffold and contig N50 values of 105.17 Mb and 48.79 Mb, respectively, and a BUSCO completeness score of 97.10%.

This chromosome-scale genome assembly provides critical insights into the genomic architecture underlying the pest status of S. graminum in cereal crops. It establishes a valuable resource for elucidating the molecular mechanisms of host specialization, particularly on wheat, oat, and barley, and facilitates the development of RNA interference-based management strategies. Collectively, this genomic dataset offers a foundational platform for investigating the molecular basis of aphid virulence, insecticide resistance, and virus transmission, thereby supporting the design of sustainable pest management strategies.

Methods

Sample collection and rearing

The lab-reared colonies of S. graminum used in this study were originally collected from Yangling, Shaanxi Province, China (34°27′27″N, 108°08′11″E) in 2016. Subsequently, these aphids were maintained on common oats (Avena sativa) under controlled conditions of 26 ± 1 °C, a photoperiod of 16 h light and 8 h darkness, and a relative humidity of 65 ± 5%.

Adult S. graminum were first transferred to culture dishes and starved for 24 h under controlled laboratory conditions. This treatment was applied to eliminate plant-derived materials and reduce transient microorganisms from gut contents, ensuring that the sequencing data accurately represent the endogenous gene expression of the aphids rather than transcripts originating from the host plant. Approximately 1000 live individuals were then collected for each of Illumina, PacBio HiFi, and Hi-C sequencing. In addition, 500 winged and wingless adults were selected, flash-frozen in liquid nitrogen, and stored at −80 °C for ONT sequencing.

Genome sequencing

Genomic DNA was extracted using the CTAB (cetyltrimethylammonium bromide) method. For PacBio HiFi long-read sequencing, libraries were prepared with the SMRTbell® Express Template Prep Kit 3.0 (Pacific Biosciences, CA, USA) targeting an insert size of ~20 kb. High-quality DNA samples were sheared into ~10 kb fragments using a Megaruptor B06010001 (Diagenode, Liège, Belgium), concentrated with AMPure® PB Beads (Pacific Biosciences, CA, USA), and processed with the SMRTbell 3.0 kit (Pacific Biosciences, CA, USA) for library construction. This workflow included removal of single-stranded overhangs, DNA damage repair, end repair, A-tailing, adapter ligation, and exonuclease digestion. Size selection of SMRTbell libraries was performed using the SageELF ELF000 system (Sage Science, MA, USA). PacBio HiFi 10 kb library preparation was carried out by Berry Genomics Corporation (Beijing, China).

For short-read sequencing, whole-genome libraries were prepared using the Agencourt AMPure XP-Medium Kit (Beckman Coulter, CA, USA) with an insert size of 200–400 bp. Libraries were subsequently sequenced on the DNBSEQ-T7 platform. Short-read library construction was also performed by Berry Genomics Corporation (Beijing, China).

Hi-C library construction was performed by Berry Genomics Corporation. The procedure involved formaldehyde cross-linking of chromatin, restriction enzyme digestion, end repair, DNA cyclization, and DNA purification. MboI was used as the restriction enzyme during the digestion step. The integrity and quality of the extracted genomic DNA were assessed using 1.0% agarose gel electrophoresis, a NanoDrop One Spectrophotometer (NanoDrop Technologies, Wilmington, DE) and a Qubit 3.0 Fluorometer (Life Technologies, Carlsbad, CA, USA). The extracted DNA exhibited optimal quality parameters: concentration = 88.2 ng/μL, OD260/280 value = 1.82, and OD260/230 = 2.26.

The sequencing output and depth of coverage are summarized in Table 1.

Table 1 Library sequencing data for genome assembly of S. graminum.

Genome size estimation

The primary objective of the genome survey analysis was to estimate genome size, heterozygosity, and repeat content, thereby guiding the selection of appropriate assembly strategies and parameter optimization. Raw short-read data generated from the BGI platform were quality-controlled and trimmed using fastp v0.23.413 with the following parameters: -q 20 -D -g -x -u 10 -5 -r -c. Specifically, bases with quality scores below Q20 were discarded, duplicate reads were removed (-D), poly-G/X tails were trimmed (-g -x), reads with more than 10% low-quality bases were filtered out (-u 10), and base correction was performed using overlapping read information (-c).
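
The trimming step can be sketched as a small command builder. This is an illustrative sketch only: the flags are those quoted above, whereas the input and output file names are hypothetical placeholders and the helper function is not part of the original workflow.

```python
import shlex

def build_fastp_command(r1, r2, out1, out2):
    """Assemble a fastp argument list using the trimming flags quoted in the text."""
    return [
        "fastp",
        "-i", r1, "-I", r2,      # paired-end inputs (hypothetical file names)
        "-o", out1, "-O", out2,  # trimmed outputs (hypothetical file names)
        "-q", "20",              # base quality threshold of Q20
        "-D",                    # remove duplicate reads
        "-g", "-x",              # trim poly-G and poly-X tails
        "-u", "10",              # maximum percentage of low-quality bases per read
        "-5", "-r",              # sliding-window quality cutting (front and right)
        "-c",                    # overlap-based base correction for paired reads
    ]

cmd = build_fastp_command(
    "raw_R1.fq.gz", "raw_R2.fq.gz", "clean_R1.fq.gz", "clean_R2.fq.gz"
)
print(shlex.join(cmd))  # the list could also be passed to subprocess.run
```

Building the command as a list rather than a single string avoids shell-quoting problems if file names contain spaces.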

Genome survey analyses were conducted based on the k-mer frequency distribution. K-mer counts were generated using khist.sh (part of the BBTools v38.90 package14), with the k-mer size set to 21. Genome features were further analyzed using GenomeScope v2.015 with parameters -k 21 -p 2 -m 10000, where the maximum k-mer coverage cutoff was set to 10,000 (Fig. 2).

Fig. 2

Genome size estimation of S. graminum using Illumina reads.

BBTools estimated the genome size of S. graminum at 390,080,888 bp, with a heterozygosity of 1.79%. The k-mer frequency distribution revealed a heterozygous peak at 44× that was higher than the main peak at 88×, indicating a highly heterozygous genome. These genomic features provide a critical foundation for future in-depth studies on S. graminum, including functional annotation and comparative genomic analyses.
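
The logic behind a k-mer-based size estimate can be illustrated with a simple peak-based calculation: the total number of k-mer occurrences divided by the depth of the homozygous peak. This is a simplified sketch and not the model fitted by BBTools or GenomeScope; the histogram values below are made up, and only the peak positions (44× and 88×) are taken from the text.

```python
def estimate_genome_size(hist, homozygous_peak, min_depth=5):
    """Estimate haploid genome size as total k-mer occurrences / homozygous peak depth.

    hist maps k-mer depth -> number of distinct k-mers observed at that depth.
    Depths below min_depth are treated as sequencing errors and ignored.
    """
    total_kmers = sum(depth * count for depth, count in hist.items() if depth >= min_depth)
    return total_kmers / homozygous_peak

# Toy histogram (made-up counts): error k-mers at 1x, a heterozygous peak at 44x,
# and a homozygous peak at 88x.
toy_hist = {1: 5_000_000, 44: 2_000_000, 88: 1_000_000}
size = estimate_genome_size(toy_hist, homozygous_peak=88)
print(f"{size:,.0f} bp")
```

In this toy case the heterozygous k-mers sit at half the homozygous depth, so each contributes half a position to the haploid estimate.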

Genome assembly

High-quality HiFi reads were generated using pbccs v6.4.0 (https://github.com/PacificBiosciences/ccs). The initial assemblies were produced with Hifiasm v0.24.016 using the parameter ‘-I 3’. Low-depth contigs are likely to represent contaminants or assembly errors; therefore, only contigs with sequencing depth >12× for S. graminum were retained, while contigs below one-tenth of the average sequencing depth were discarded. Given the extremely high quality of the HiFi reads, long contigs exhibited QV scores of approximately 60, and thus additional polishing was not required. To address potential haplotype redundancy in the assemblies arising from heterozygosity in diploid or wild populations, redundant sequences were removed using Purge_dups v1.2.517, which leverages both contig similarity and sequencing depth. Sequence alignments were performed with Minimap2 v2.2918,19 (-x map-hifi for mapping HiFi reads to the assembly; -x asm5 -DP for self-to-self assembly alignment). Purge_dups was run with default parameters (-2 -a 60).
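
The depth-based contig filter described above can be sketched as follows. The sketch assumes that per-contig mean depths have already been computed; the contig names, depth values, and the mean depth of 120× are hypothetical, chosen so that one-tenth of the mean equals the 12× cutoff stated in the text.

```python
def filter_contigs_by_depth(contig_depth, mean_depth, fraction=0.1):
    """Keep contigs whose depth exceeds a fraction of the genome-wide mean depth."""
    cutoff = mean_depth * fraction
    kept = {name: depth for name, depth in contig_depth.items() if depth > cutoff}
    return kept, cutoff

# Hypothetical per-contig mean depths (x coverage).
toy_depths = {"ctg1": 118.4, "ctg2": 61.0, "ctg3": 3.2}
kept, cutoff = filter_contigs_by_depth(toy_depths, mean_depth=120.0)
print(sorted(kept), f"cutoff ~ {cutoff:.1f}x")
```

Here the low-depth contig is dropped as a likely contaminant or assembly error, while a half-depth contig (a possible haplotig) is retained for the later purging step.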

Chromosome-scale scaffolding was achieved using Hi-C data and the YAHS v1.2 pipeline20. Hi-C reads were processed with chromap v0.2.621, including read alignment, duplicate removal, and Hi-C contact extraction. Two rounds of scaffolding were conducted with YAHS using default parameters. The preliminary scaffolds were manually curated with Juicebox v1.11.0822 to correct assembly errors such as misjoins, translocations, and inversions, followed by a second round of YAHS scaffolding to generate the final assembly. Sequencing depth for each pseudochromosome was assessed using SAMtools v1.1023, with alignment files generated by Minimap2 (-ax map-hifi for HiFi reads; -ax sr for short-read WGS data). Hi-C contact maps (Fig. 3) confirmed the high quality of the scaffolding, yielding chromosome-level assemblies of four pseudochromosomes for S. graminum.

Fig. 3

Hi-C heatmap of S. graminum genome assembly. The boundary indicates that the genome contains 4 chromosomes.

Genome completeness was evaluated using BUSCO v5.7.124 with the insecta_odb10 reference dataset (n = 1,367 single-copy orthologs). Additionally, raw short-read genome data and transcriptome reads (both second- and third-generation) were mapped back to the assembled genome to assess data utilization and assembly completeness. Read mapping was performed using Minimap2, and mapping rates were calculated with SAMtools. Potential contaminants in the assembly were identified using MMseqs2 v1325 through a BLASTN-like search against the NCBI nt and UniVec databases. Base-level quality scores (QV) and k-mer spectra of the genome were assessed using Merqury v1.326.
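
For readers less familiar with consensus quality values, a QV is a Phred-scaled error rate, QV = −10·log10(E). The short sketch below converts between the two; it illustrates the scale only and does not reproduce how Merqury derives the error rate from k-mers. The function names are illustrative.

```python
import math

def qv_to_error_rate(qv):
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-qv / 10)

def error_rate_to_qv(error_rate):
    """Convert a per-base error probability to a Phred-scaled quality value."""
    return -10 * math.log10(error_rate)

# QV 60 corresponds to about one error per million bases.
rate = qv_to_error_rate(60)
expected_errors = rate * 380.30e6  # assembly size taken from the text
print(f"error rate ~ {rate:.1e}; expected errors in the assembly ~ {expected_errors:.0f}")
```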

The final S. graminum genome assembly metrics are summarized in Table 2. The assembled genome size was 380.30 Mb, which is smaller than those of B. brassicae (429.99 Mb) and T. trifolii (541.25 Mb), but larger than that of A. glycines (324 Mb)7,8,9. The final assembly comprised 13 scaffolds and 37 contigs. The maximum scaffold and contig lengths reached 108.054 Mb and 102.518 Mb, respectively, with scaffold and contig N50 values of 105.17 Mb and 48.793 Mb, reflecting the high continuity of the assembly. The overall GC content was 27.65%. The genome assembly was evaluated using BUSCO, which showed a high completeness of 97.1%. This level of completeness is comparable to those reported for other aphid species, including A. glycines (97.2%), B. brassicae (96.19%), and T. trifolii (96.6%)7,8,9. The proportion of duplicated BUSCOs was only 3.1%, suggesting minimal redundancy in the assembly.
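
As a reminder of how the N50 statistic is defined, the sketch below computes it for a list of sequence lengths: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total. The input lengths are toy values, not the actual scaffold or contig lengths of this assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:  # at least half of the total length is covered
            return length
    raise ValueError("lengths must not be empty")

toy_lengths = [100, 80, 60, 40, 20]  # arbitrary units
print(n50(toy_lengths))
```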

Table 2 Summary statistics of the genome assembly of S. graminum.

Four pseudochromosomes accounted for 379.43 Mb of sequence, representing 99.77% of the total assembly. Detailed statistics on chromosome length, sequencing depth, and QV scores are provided in Table 3.

Table 3 Genome assembly summary of length, sequencing coverage and QV value for each chromosome.

Genome annotation

A species-specific de novo repeat library was constructed using RepeatModeler v2.0.527 with the additional LTR structural search enabled (-LTRStruct), based on both structural features of repetitive elements and de novo prediction. This library was then combined with the Dfam 3.828 and RepBase-2018102629 databases to generate the final reference repeat database. Repeat annotation was performed using RepeatMasker v4.1.530, aligning the assembled genome against the final repeat database to identify and classify repetitive elements.

A total of 746,658 repetitive elements (125,823,370 bp) were identified in the S. graminum genome, corresponding to 33.09% of the assembly, indicative of a genome with a moderate repeat content. This proportion is comparable to those observed in three other aphid species, A. glycines (32.06%), B. brassicae (32.84%), and T. trifolii (36.86%)7,8,9. The six most abundant repeat classes were as follows: Unknown (12.70%), DNA transposons (10.73%), simple repeats (3.52%), LTR retrotransposons (3.16%), LINEs (1.81%), and rolling-circle elements (0.40%). Detailed statistics are provided in Table 4.
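
The reported repeat fraction can be reproduced from the figures given above. The sketch below is a simple arithmetic check; it uses the assembly size rounded to 0.01 Mb as stated in the text, so the result is approximate, and the share attributed to the remaining classes is derived here rather than reported.

```python
repeat_bp = 125_823_370   # total repeat length from the text
assembly_bp = 380.30e6    # assembly size from the text, rounded to 0.01 Mb
repeat_pct = 100 * repeat_bp / assembly_bp

# Percentages of the six most abundant classes, as listed in the text.
class_pct = {
    "Unknown": 12.70,
    "DNA transposons": 10.73,
    "Simple repeats": 3.52,
    "LTR retrotransposons": 3.16,
    "LINEs": 1.81,
    "Rolling-circle": 0.40,
}
listed = sum(class_pct.values())
remaining = repeat_pct - listed  # share left for all other repeat classes

print(f"repeat content: {repeat_pct:.2f}%")
print(f"six listed classes: {listed:.2f}%; remaining classes: {remaining:.2f}%")
```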

Table 4 Summary of repetitive elements.

Non-coding RNAs (ncRNAs) play critical roles in a wide range of biological processes; transfer RNAs (tRNAs) and ribosomal RNAs (rRNAs), for example, are indispensable for protein biosynthesis. The ncRNAs in the genome were annotated using two complementary approaches. First, rRNAs, small nuclear RNAs (snRNAs), and microRNAs (miRNAs) were identified through sequence homology searches against the Rfam database using Infernal v1.1.531, which employs covariance models to detect conserved RNA secondary structures. Second, tRNA genes were predicted using tRNAscan-SE v2.0.1232 with default parameters. To improve annotation accuracy, low-confidence tRNA predictions were excluded using the built-in EukHighConfidenceFilter script (Table 5).

Table 5 Summary of non-coding RNAs.

The MAKER v3.01.0433 annotation pipeline predicted 13,659 protein-coding genes in S. graminum, with an average gene length of 11,989.0 bp. Genes comprised an average of 7.6 exons (mean length 310.3 bp) and 6.6 introns (mean length 1,582.8 bp). On average, 7.2 coding sequences (CDSs) were identified per gene, with a mean CDS length of 217.6 bp (Table 6).
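
The exon and intron averages are linked by a simple structural rule: a gene with n exons has n − 1 introns, which is consistent with the reported means of 7.6 exons and 6.6 introns per gene. The sketch below derives introns from exon coordinates for a single toy gene; the coordinates are made up and the helper is illustrative, not part of the MAKER pipeline.

```python
def introns_from_exons(exons):
    """Return intron intervals (1-based, inclusive) between sorted, non-overlapping exons."""
    ordered = sorted(exons)
    return [
        (prev_end + 1, next_start - 1)
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
        if next_start - prev_end > 1  # skip directly adjacent exons
    ]

def mean_length(intervals):
    """Mean length of 1-based inclusive intervals."""
    return sum(end - start + 1 for start, end in intervals) / len(intervals)

# Toy gene with three exons (made-up coordinates).
toy_exons = [(1, 100), (301, 400), (1001, 1200)]
toy_introns = introns_from_exons(toy_exons)
print(len(toy_exons), "exons;", len(toy_introns), "introns")
print("mean intron length:", mean_length(toy_introns))
```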

Table 6 Statistics of coding gene structure annotation of the S. graminum genome.

Functional annotations were systematically assigned to all predicted genes by integrating evidence from several major biological databases, including the Kyoto Encyclopedia of Genes and Genomes (KEGG)34, UniProt35, Gene Ontology (GO)36 and InterPro37. For the S. graminum genome, 13,569 genes (99.34%) matched entries in the UniProtKB database. InterPro analysis identified protein domains in 10,573 protein-coding genes, which is fewer than those reported in three other aphid species, A. glycines (20,781 genes), B. brassicae (22,671 genes), and T. trifolii (13,684 genes)7,8,9. In addition, integrated InterPro and eggNOG-mapper annotations assigned GO terms to 9,390 genes and KEGG pathway entries to 4,668 genes (Table 7). A circos plot integrating genome structure and annotation was generated using TBTools-II v2.0963338. The results revealed that the longest chromosome measured 108.05 Mb, whereas the shortest was 62.85 Mb (Fig. 4).
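
The annotation rates follow directly from the gene counts. The sketch below recomputes them against the 13,659 predicted protein-coding genes; only the UniProtKB percentage (99.34%) is stated in the text, so the other percentages are derived here for illustration and should be checked against Table 7.

```python
def percent_annotated(annotated, total):
    """Percentage of genes with at least one hit in a given database."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100 * annotated / total

TOTAL_GENES = 13_659  # predicted protein-coding genes, from the text
counts = {"UniProtKB": 13_569, "InterPro": 10_573, "GO": 9_390, "KEGG": 4_668}

rates = {db: percent_annotated(n, TOTAL_GENES) for db, n in counts.items()}
for db, rate in rates.items():
    print(f"{db}: {rate:.2f}%")
```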

Table 7 Statistics of coding gene functional annotation of the S. graminum genome.
Fig. 4

Circos plot of genomic features in the S. graminum genome. From the outermost to the innermost, the tracks represent chromosome length (a), GC content (b), gene density (c), DNA transposon density (d), SINE density (e), LINE density (f), LTR density (g), and simple repeat density (h).

Although this study provides a high-quality genomic resource for S. graminum, it is based on a single geographical population collected from Yangling, Shaanxi Province. Therefore, potential genetic differentiation among populations from different regions was not considered. Future studies incorporating genomic data from multiple wheat-producing areas, such as North China and Northwest China, will be essential to elucidate the population genetic structure and adaptive divergence of S. graminum, thereby improving the representativeness and applicability of the genomic resource.

Data Records

The genome sequencing data generated in this study have been deposited in the NCBI Sequence Read Archive (SRA) under BioProject accession number PRJNA1282330. The PacBio, Illumina DNA short-read, Hi-C, and transcriptome data used for the genome assembly have been deposited under the accession numbers SRR3555301739, SRR3555301840, SRR3555301641 and SRR3555301542. The genome assembly has been deposited at the NCBI under the accession number JBPJBB00000000043. The genome annotation files are available in the Figshare database44.

Technical Validation

Two approaches were employed to evaluate the quality of the genome assembly. First, genome completeness was assessed using BUSCO v5.7.1 with the insecta_odb10 database (n = 1,367). The S. graminum assembly achieved a BUSCO completeness of 97.0%, consisting of 76.7% single-copy, 20.3% duplicated, 0.3% fragmented, and 2.7% missing BUSCOs. Second, assembly accuracy was evaluated by mapping PacBio, Illumina, and RNA-seq reads to the genome, yielding mapping rates of 99.51%, 95.69%, and 64.39%, respectively. Together, these results demonstrate the high quality and reliability of the assembled genome.
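
The BUSCO categories are related by simple arithmetic: complete BUSCOs are the sum of single-copy and duplicated ones, and complete, fragmented, and missing together account for all searched orthologs. The sketch below parses a one-line summary in the usual BUSCO notation and checks these relations; the summary string is reconstructed here from the percentages above and is not copied from an actual output file.

```python
import re

def parse_busco_summary(line):
    """Parse a one-line BUSCO summary of the form C:..%[S:..%,D:..%],F:..%,M:..%,n:N."""
    pattern = r"C:([\d.]+)%\[S:([\d.]+)%,D:([\d.]+)%\],F:([\d.]+)%,M:([\d.]+)%,n:(\d+)"
    match = re.search(pattern, line)
    if match is None:
        raise ValueError("unrecognized BUSCO summary line")
    c, s, d, f, m = (float(x) for x in match.groups()[:5])
    return {
        "complete": c, "single": s, "duplicated": d,
        "fragmented": f, "missing": m, "n": int(match.group(6)),
    }

# Summary line reconstructed from the values reported in the text.
summary = parse_busco_summary("C:97.0%[S:76.7%,D:20.3%],F:0.3%,M:2.7%,n:1367")
complete_check = summary["single"] + summary["duplicated"]
total = summary["complete"] + summary["fragmented"] + summary["missing"]
print(f"S + D = {complete_check:.1f}%; C + F + M = {total:.1f}%")
```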