Background & Summary

The family Schizodactylidae Brullé, 1835 represents one of the most morphologically distinctive lineages within the suborder Ensifera (e.g. crickets, katydids, and their allies). Members of this family are easily recognized by their conspicuously expanded, lobiform tarsi and are uniquely adapted to psammophilous (sand-dwelling) environments1. They are nocturnal burrowers, remaining concealed in self-excavated tunnels during the day and emerging at night for feeding and activity2,3. Extant diversity within Schizodactylidae is extremely limited, comprising only 20 described species4 in two genera: Comicus (11 species) and Schizodactylus (9 species). Both genera are restricted to dynamic dune environments characterized by poor vegetation cover and high wind activity, which continuously reshapes their habitats5,6. Even so, the two genera exhibit a disjunct biogeographic pattern. Comicus is confined to the arid sand-dune systems of southern Africa, including the Namib and Kalahari deserts, and all species are wingless throughout all developmental stages7. Schizodactylus, on the other hand, is mainly distributed across semi-stable or shifting dunes in South and Southeast Asia countries such as China, India, Pakistan, Sri Lanka, and Myanmar, with only one brachypterous species, Schizodactylus inexpectatus, endemic to Turkey8,9. Very interestingly, all South and Southeast Asian species are characterized by fully developed or reduced wings. The family has a deep evolutionary history, with fossil representatives dating back to the Lower Cretaceous, although phylogenetic evidence suggests an even more ancient origin10.

Among the two extant genera, detailed morphological investigations of the wing-bearing thoracic structures have only been conducted for the species of Comicus11, but no related published data are currently available for Schizodactylus. Most species within Schizodactylus exhibit a particularly distinctive wing-folding morphology that sets them apart from other orthopterans (Fig. 1). Although only limited molecular data, such as partial mitochondrial genomes or short gene fragments, have been reported for certain species of Schizodactylidae and related orthopterans12, these resources remain too sparse and fragmented to elucidate the genetic mechanisms underlying their unique morphological traits or to establish a robust phylogenomic framework for the group. In China, the family Schizodactylidae was undocumented until the recent discovery of a new species, Schizodactylus jimo, in Yunnan Province in 20219. To investigate its evolutionary significance and ecological adaptations, we assembled a high-quality chromosome-level genome for S. jimo. Comprehensive annotation was performed, including the identification of repetitive elements, non-coding RNA genes, and protein-coding genes. This genomic assembly provides resources for investigating the evolutionary trajectory, habitat specialization, and functional genomics of Schizodactylidae, and serves as a foundation for future comparative and ecological studies of dune-adapted orthopterans.

Fig. 1
Fig. 1
Full size image

Schizodactylus jimo He & Liu, 2021. (a) Dorsal view of specimen showing the right wing fully extended and the left wing in a folded position. The scale bar is 2 cm. (b) Field photograph of a living individual in its natural dune habitat, with both wings fully folded over the abdomen.

Methods

Sample information

Schizodactylus jimo adult individuals were collected from sandy dune habitats at the riverside of the Nujiang River, Mangkuan Township, Baoshan City, Yunnan Province, China, during 2022–2024 (98.8891°E, 25.4412°N, alt. 735 m) by Zhiwei Dong and Zhengzhong Huang. Two male adults were dissected to remove guts and mouthparts to reduce microbial contamination and then their tissue samples were flash-frozen in liquid nitrogen and stored at −80 °C until use. The tissue of one male adult was used for Illumina, PacBio HiFi, and Hi-C sequencing. The tissue of another individual was used for RNA sequencing. Voucher specimens were deposited in both Insect Collection of Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, China (KIZ 0139654) and Institute of Zoology, Chinese Academy of Sciences, China (IOZ 20250710).

Illumina, PacBio HiFi, Hi-C, and RNA sequencing

Genomic DNA was extracted from one adult male individual using a CTAB-based protocol optimized for insect tissues to obtain high-molecular-weight DNA. DNA degradation and contamination were monitored on 1% agarose gels. DNA purity was checked using the NanoPhotometer®spectrophotometer (IMPLEN, CA, USA). DNA concentration was measured using Qubit®DNA Assay Kit in Qubit®2.0 Fluorometer (Life Technologies, CA, USA).

Short-read libraries were generated using the Truseq Nano DNA HT Sample Preparation Kit (Illumina, USA) following the manufacturer’s protocol. The libraries were then sequenced on an Illumina NovaSeq 6000 platform, generating 150 bp paired-end reads, with post-sequencing validation confirming an average insert size of ~350 bp.

PacBio HiFi read libraries were prepared using the SMRTbell® prep kit 3.0 according to the manufacturer’s instructions. After DNA quality control, DNA shearing and cleanup, DNA repair and A-tailing reactions, adapter ligation and cleanup, nuclease treatment, size selection, and sequencing was performed using the PacBio Revio system according to the manufacturer’s operating manual.

Hi-C libraries were also generated using the TruSeq Nano DNA HT Sample Preparation Kit (Illumina, USA) following the manufacturer’s protocol. After cross-linking, cells were lysed and digested with the restriction enzyme DpnII, followed by end repair, DNA cyclization and DNA purification. Finally, the Hi-C libraries were sequenced on the Illumina NovaSeq 6000 platform.

RNA sequencing libraries were prepared using the TruSeq RNA Library Preparation Kit (Illumina, USA), and then sequenced on an Illumina NovaSeq 6000 with 150 bp paired-end reads.

All sequencing was performed at NextOmics (Wuhan, China). In total, the sequencing effort produced 83.51 Gb of PacBio HiFi reads, 110.19 Gb of Hi-C reads, 327.65 Gb of Illumina reads and 11.71 Gb of RNA-seq reads (Table 1).

Table 1 Summary of sequence reads.

Genome survey and assembly

Before assembling the S. jimo genome, raw sequencing reads were processed with fastp v0.23.4 under stringent parameters (-q 20 -D -g -x -u 10 -5 -r -c) to remove low-quality bases, duplicates, and artifacts, thereby reducing sequencing-level contamination, genome survey analysis was performed using a 21-mer frequency distribution of short-reads generated with the script “kit.sh” in BBTools (v38.90)13. The genome size was estimated to be 1,288,053,473 bp (i.e., 1.29 Gb) with 1.04% heterozygosity and 63.41% repetitive sequences.

High-quality HiFi reads (Q20 or higher) were assembled for the draft genome by Hifiasm (v0.24.0)14. In the depth-based filtration during assembly, with the ‘-l 3’ parameter to explicitly retain contigs having a sequencing depth above 3× and discard very low-depth contigs (<3×), which are highly likely to represent foreign contamination or assembly errors. Haplotypic duplication sequences were removed using Purge_dups (v1.2.5)15 pipeline with the default parameters (‘-2 -a 70’). These two steps were applied sequentially to address distinct problems. The first step performs depth-based elimination of contamination, and the second step removes haplotigs in order to obtain a non-redundant primary haplotype assembly. To further construct the chromosome-level genome, Hi-C data were subsequently mapped to these contigs using Chromap (v0.2.6)16. The uniquely mapped reads were then used to anchor the contigs onto the pseudochromosomes using YAHS (v1.2) software17. Manual correction of misjoins was conducted in Juicebox (v1.11.08)18. Explicit contamination screening of the final assembly was systematically performed using MMseqs2 v13 with a blastn-like search strategy against two comprehensive databases—the NCBI nt database and the UniVec database for common vectors and contaminants. This process includes an integrated step wherein NCBI’s Foreign Contamination Screen (FCS) tool is employed to identify and filter out contaminant sequences.

The final chromosome-level genome is 1.33 Gb, with scaffold and contig numbers of 111 and 164, closely matches our 1.29 Gb preliminary estimate, supporting the robustness of both estimation and assembly methods. The longest scaffold/contig length is 275.07/139.07 Mb, and the scaffold/contig N50 length is 216.22/45.88 Mb. The assembly exhibits high continuity, as supported by the metrics in Table 2. The GC content was 39.82%. A total of 9 pseudochromosomes (1.23 Gb) were assembled, with an anchoring rate of 92.12% (Fig. 2).

Table 2 Genome assembly summary.
Fig. 2
Fig. 2
Full size image

Assembly of chromosome-level genome of Schizodactylus jimo. (a) Circos plot with a window size of 100k bp. (b) The heatmap displays all interactions among 9 chromosomes. The outer to inner circle of (a) were chromosome length, GC content, gene density, DNA transposon density, SINE transposon density, LINE transposon density, LTR transposon density, and Simple repeats density, respectively.

The sequencing depth of each chromosome was assessed using the Minimap2 (v2.29)19 and SAMtools (v1.10)20. With referring to the sex determination mechanism of male in the suborder Ensifera being XO21,22, in which the sequencing depth of male X chromosome is theoretically only half of that of autosomes, chromosome 2 is identified as X chromosome (Table 3). The single-base QV was assessed using Merqury v1.314, with values around 60 for each chromosome, corresponding to an estimated error rate of approximately 1 × 10−7. These results indicate that the assembly achieves near single-base accuracy while retaining the structural completeness provided by long-read sequencing.

Table 3 Genome assembly summary of length, sequencing coverage and QV value for each chromosome.

Genome completeness was assessed using BUSCO v5.7.123 against the Insecta_odb10 database (1,367 genes), revealing 98.1% completeness with only 1.7% multi-copy BUSCOs, indicating minimal assembly redundancy. The mapping ratio of short-reads, HiFi-reads, and RNA-seq reads was 99.01%, 98.99% and 97.04%, respectively, indicating extremely high sequencing quality. All the above data indicate that the assembled genome has reached an extremely high level of continuity and integrity.

Repeat annotation

The repeat sequence library was constructed based on the principle of repeat sequence-specific structure and de novo prediction by using RepeatModeler v2.0.524 with LTR search process (‘-LTRStruct’). This library was then merged with the Dfam 3.81625 and RepBase-201810261726 databases to form the final repeat sequence reference database.

The repeat sequence content of S. jimo was annotated using RepeatMasker v4.1.527 and the custom repeat library described above. This analysis identified 609.87 Mb of repetitive sequences, accounting for 45.74% of the total genome, which is a typical medium to high repetitive genome (Table S1).

The top six types of repetitive sequences in terms of proportion are: LINE (23.30%), Unknown (11.85%), SINEs (3.10%), Simple repeats (2.66%), DNA (2.97%), and LTRs (0.59%). As can be seen from Fig. 3, the transposons in the genome are mainly caused by two expansions. The distant expansion (the peak on the right) is mainly contributed by Unknown repeats, while the recent expansion is mainly formed by LINE repeats (the peak on the left) (Fig. 3).

Fig. 3
Fig. 3
Full size image

Interspersed repeat landscape of the S. jimo genome based on Kimura 2-parameter (K2P) divergence (CpG-adjusted). (a) Repeat landscape of all annotated repeat classes, including DNA transposons, LINEs, LTRs, and unclassified/unknown repeat elements. The x-axis represents K2P divergence, while the y-axis shows the proportion of the genome contributed by each repeat class. Peaks indicate periods of transposable element expansion. (b) Same K2P landscape with unclassified/unknown repeat elements removed to highlight the dynamics of well-annotated TE families.

Non-coding RNA genes

Non-coding RNA annotation was performed by comparing sequences with the Rfam database using Infernal v1.1.528 to identify various ncRNAs including rRNA, snRNA and miRNA, while tRNA prediction was conducted using tRNAscan-SE v2.0.1229 with low-confidence tRNAs filtered by the built-in EukHighConfidenceFilter script.

The genomic non-coding RNA annotation identified a total of 5,395 ncRNAs, consisting of 1,699 rRNAs, 74 miRNAs, 1,306 snRNAs (including 1,289 spliceosomal RNAs [U1, U2, U4, U5, U6, U11], 2 minor spliceosomal RNAs [U4atac, U6atac], 12 C/D box snoRNAs, and 1 HACA-box snoRNA), 1,475 tRNAs, 4 ribozymes, and 2 lncRNAs (Table S2).

Gene structure prediction

The protein-coding genes (PCGs) of the repeat-masked genome were predicted based on de novo prediction, homology prediction, and RNA-seq prediction. For de novo prediction, BRAKER30 automatically trained two ab initio prediction tools, Augustus3.4.031 and GeneMark-ETP32, and integrated arthropod protein sequences (extracted from the OrthoDB12v1 database33) along with transcriptome data to enhance prediction accuracy. (The transcriptome alignment was performed by mapping the RNA-seq short-read transcriptome data to the genome using minimap2 to generate BAM alignment files with the parameter ‘-x splice:sr’). For homology-based prediction, protein datasets from the holometabolous insect Drosophila melanogaster (Diptera, GCA_000001215.4) and four neopteran insects: Anabrus simplex (Orthoptera, GCA_040414725.1), Schistocerca nitens (Orthoptera, GCA_023898315.2), Bacillus rossius redtenbacheri (Phasmatodea, GCA_032445375.1), and Periplaneta americana (Blattodea, GCA_040183065.1) were used in GeMoMa (v1.9; parameters: GeMoMa.c = 0.4, GeMoMa.m = 130000)34 to perform gene prediction based on sequence homology. For RNA-seq prediction, we aligned RNA-seq using minimap219,35 (‘-x splice:sr’) and assembled unigenes using StringTie v2.2.136, and the coding sequences were identified using GeneMark-ETP.

We predicted a total of 12,612 PCGs (Table 4 and Table S3) with an average gene length of 27,906.6 bp in the genome by MAKER (v3.01.04)37. On average, each gene contained 8.7 exons (355.6 bp each), 7.7 introns (3,463.4 bp each), and 8.3 coding sequences (209.8 bp each). The integrity of the gPCGs evaluated by BUSCO software (v5.7.1)23 was 97.9%, which is close to 98.1% of the genomic evaluation. It can be seen that the annotation quality of protein coding genes is excellent.

Table 4 Summary of gene structure prediction.

Gene functional prediction

Functional annotation of the predicted proteins was performed by aligning against UniProtKB v202503 (SwissProt + TrEMBL) using DIAMOND v2.1.7.16138 (--very-sensitive, --evalue 1e-5), searching the Pfam39 database via InterProScan 5.70-102.040, and annotating with eggNOG-mapper v2.1.1241 against eggNOG v5.0.242 to predict conserved domains, Gene Ontology (GO) terms, and pathways (KEGG, Reactome).

The result showed that 11761 (93.25%) genes in the genome matched records from the UniProtKB database. InterPro identified the protein domains of 10438 protein coding genes. InterPro and eggNOG mapper identified 9921 GO genes and 4738 KEGG pathway entries (Table 5).

Table 5 Summary of gene functional prediction.

Data Records

The dataset is available at the NCBI under the BioProject PRJNA127977643. RNA-seq data were deposited in the Sequence Read Archive (SRA) at NCBI under accession number SRR3459078044. PacBio HiFi long-read sequencing data were deposited in the SRA under accession numbers SRR3459078145. Hi-C sequencing data were deposited in the SRA under accession numbers SRR3459078246. Illumina sequencing data were deposited in the SRA under accession numbers SRR3459078347. The assembled genome is available at Genbank48. The genome assembly and genome annotation results have additionally been deposited in the Figshare repository under the https://doi.org/10.6084/m9.figshare.2988023949.

Technical Validation

The completeness and integrity of the chromosome-level genome assembly for S. jimo were assessed using two complementary approaches. First, all original short-read and long-read sequencing data were aligned back to the assembly using Minimap2 (v2.29)19. The resulting alignment files were processed with SAMtools (v1.10)20 to calculate the coverage depth across the assembled chromosomes, verifying the effective incorporation of raw data. Second, the completeness of the gene space was benchmarked with BUSCO (v5.7.1)23 against the Insecta_odb10 dataset, which contains 1,367 conserved single-copy orthologs. This analysis identified 98.1% of the benchmark genes as complete and 0.4% as fragmented, collectively confirming a high level of assembly completeness.