Background & Summary

Trilocha varians (Lepidoptera: Bombycidae; Fig. 1a) is a bombycid moth. Although in Japan this species was first recorded in Okinawa in 20011, T. varians is widely distributed in South and Southeast Asia2. Because T. varians inhabits low-latitude regions, it is a completely non-dormant insect: it does not enter dormancy under any rearing conditions. T. varians feeds mainly on leaves of the banyan, Ficus microcarpa, whereas the domesticated silkworm, Bombyx mori, feeds mainly on mulberry leaves. A notable characteristic of this insect is its short generation time. T. varians takes about 30 days at 25 °C and 22 days at 30 °C from hatching to eclosion3. In addition, at 25 °C, eggs hatch in 5 days. Compared with other lepidopteran model species such as Samia ricini4, a generation time of approximately 30 days is remarkably short, which is a great advantage for a model species.

Fig. 1

General genome annotation information. (a) Dorsal view of final instar larva of T. varians. (b) The 21-mer distribution for estimation of genome heterozygosity of T. varians. (c) Summary of T. varians genome characteristics. The outermost to the innermost circle show the following: I. chromosome ideograms; II. GC content; III. GC skew; IV. LTR element density; V. non-LTR retrotransposon density; VI. DNA transposon density; VII. rolling circle density; and VIII. gene model density.

We have recently published a chromosome-scale female genome assembly of T. varians (NCBI acc: GCA_030269945.2)5,6. Although the T. varians genome retains micro- and macro-synteny with the B. mori genome despite several chromosome fusion and fission events, the W chromosome of T. varians shows no homology to the W chromosome of B. mori. The W chromosome of both species is derived from the Z chromosome5, but it is still uncertain whether the W chromosomes of the two species are “orthologous” or not. As described above, B. mori and T. varians have different physiological and genetic characteristics, even though they are members of the same family, Bombycidae. Providing genome annotation information for T. varians will therefore be useful for research on the evolution of the family Bombycidae.

T. varians is a 2n = 52 species3; females have 25 pairs of autosomes, a Z chromosome, and a W chromosome. The T. varians used in this study is an inbred strain descended from females captured on Ishigaki Island, Japan, in 2010. Consistent with this inbreeding, heterozygosity in the genome was low, at 0.12% (Fig. 1b). In preparing the annotation information, we first attempted to locate the nucleolar organizer region (NOR), because the NOR is a region of long repetitive sequences6, which often hinder chromosome-scale genome assembly. Transcriptome-based gene prediction identified 16,226 protein-coding genes in the T. varians genome. Functional annotation was then performed using EnTAP7. Although CRISPR/Cas9-mediated genome editing has not yet been reported in T. varians, establishing genome editing techniques is a prerequisite for promoting further use of T. varians as a model species. Since Cas9 is known to be less efficient in heterochromatic regions8, we performed embryonic ATAC-seq to identify open chromatin regions.

piRNAs are known to be involved in the early development of lepidopteran insects. Although piRNAs were originally discovered in germline cells9,10,11,12, lepidopteran piRNAs are also present in early embryos. A prominent example of the involvement of piRNAs in early development is the Fem piRNA of B. mori13, which functions as the master determinant of femaleness. Although T. varians does not have Fem, W-derived piRNAs are likewise responsible for female determination in the diamondback moth, Plutella xylostella14. So far, there is no report that embryonic piRNAs are involved in developmental processes other than sex determination. However, the abundance of embryonic piRNAs does not rule out such a possibility. To contribute to future piRNA research in T. varians, we performed small RNA-seq in early embryos, pupal testes, and pupal ovaries to identify piRNA clusters.

Methods

Insects

T. varians (NBRP strain, derived from individuals caught on Ishigaki Island, Japan)3 was provided by the National BioResource Project-Wild moths (NBRP-Wild moths; http://shigen.nig.ac.jp/wildmoth/). T. varians larvae were fed fresh leaves of F. microcarpa and reared under a long-day condition (16 h light/8 h dark) at 25 °C.

Estimation of genome heterozygosity

Heterozygosity of the female T. varians genome was estimated using a k-mer (k = 21) analysis. Genomic paired-end short-read data derived from a female T. varians (accession number: DRR452104)15 were down-sampled to one-tenth and subjected to Jellyfish (v2.3.0)16 to count k-mers. The k-mer counts were plotted with GenomeScope17. The k-mer distribution displays a single peak, and the estimated heterozygosity in the genome was 0.119% (Fig. 1b).
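For readers unfamiliar with k-mer profiling, the counting step can be sketched in a few lines of Python. This is an illustrative sketch only, not the Jellyfish implementation: the function names `canonical` and `kmer_spectrum` are hypothetical, and the sketch assumes that reads are plain strings held in memory. It tabulates how many distinct canonical k-mers occur at each multiplicity, which is the kind of histogram that GenomeScope takes as input.

```python
from collections import Counter

# Translation table for complementing unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmer_spectrum(reads, k=21):
    """Count canonical k-mers, then tabulate the number of distinct
    k-mers observed at each multiplicity (the k-mer histogram)."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Skip k-mers containing N or other ambiguity codes.
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    # Map multiplicity -> number of distinct k-mers with that multiplicity.
    return Counter(counts.values())
```

In a heterozygous diploid genome, such a histogram typically shows an additional peak at about half the main coverage, so a single peak is consistent with the low heterozygosity expected of an inbred strain.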

Repetitive elements annotation in the genomes

Repeat annotation of the T. varians genome was previously described by our group5. For the reader's convenience, the process is briefly summarized here. Repetitive elements in the genome assembly were identified using RepeatModeler (v 2.0.4)18 with the “-LTRstruct” option for performing an LTR structural search. The annotated elements were masked using RepeatMasker v 4.1.2 (http://www.repeatmasker.org) with default settings. Among the repetitive elements, LTR elements, non-LTR elements (LINE or SINE), DNA transposons, and rolling circles were extracted, and the density of each group was visualized with circlize (v 0.4.16)19 (Fig. 1c). GC content and GC skew did not differ significantly among chromosomes, with GC content averaging about 35.6% (Fig. 1c). However, the GC content was higher in the W chromosome, at about 39.0%. This may reflect the tendency of W chromosomes to accumulate transposons5 (Fig. 1c).
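The two base-composition tracks in Fig. 1c follow the standard definitions, GC content = (G + C)/(A + C + G + T) and GC skew = (G − C)/(G + C). A minimal Python sketch of a windowed calculation is given below. The function names and the 100-kb window size are assumptions made for illustration; the window size used for the figure is not stated here.

```python
def gc_stats(seq: str):
    """Return (GC content, GC skew) for a sequence.

    Ambiguous bases such as N are ignored in both statistics.
    """
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = g + c + at
    gc_content = (g + c) / total if total else 0.0
    gc_skew = (g - c) / (g + c) if (g + c) else 0.0
    return gc_content, gc_skew


def windowed_gc(seq: str, window: int = 100_000):
    """Return a list of (window start, GC content, GC skew) tuples for
    non-overlapping windows; the final window may be shorter."""
    return [
        (start, *gc_stats(seq[start:start + window]))
        for start in range(0, len(seq), window)
    ]
```

Applied chromosome by chromosome, the resulting per-window values are the kind of input that a circular plotting package can draw as tracks.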

Construction of a T. varians BAC library

Bacterial artificial chromosome (BAC) library construction was carried out as previously described5, following the method of Okumura et al.20 with slight modifications. Male genomic DNA was extracted from T. varians pupae (600 mg) and digested with HindIII (8–12 U/ml) at 37 °C for 25 min. The digested fragments were fractionated and collected using a CHEF Mapper XA pulsed-field gel electrophoresis system (Bio-Rad). The extracted DNA fragments were ligated to the pBeloBAC11 vector, and the ligation products were transformed by electroporation (GenePulser II, Bio-Rad) into DH5α Electro-Cells (TaKaRa). The electroporated cells were spread on LB plates containing 12.5 mg/l chloramphenicol (Cm), X-gal, and isopropyl β-D-thiogalactopyranoside. White colonies were picked into 384-well plates, which were stored at −80 °C until further use.

Chromosome preparations

Chromosome specimens were prepared using the method described in Yoshido et al.21. Briefly, ovaries and testes of last-instar T. varians larvae were dissected in a physiological solution, and testes and ovaries were treated with 75 mM and 100 mM KCl solution for 15 min, respectively. After the hypotonic treatment, the testes and ovaries were fixed in Carnoy’s fixative (ethanol: chloroform: acetic acid, 6:3:1) for 10 min. Spermatocytes and oocytes were transferred into a drop of 60% acetic acid on a glass slide and spread at 55 °C using a heating plate. The preparations were passed through a 70%, 80%, and 99% ethanol series, air dried, and stored at −20 °C until use.

BAC-FISH mapping

BACs were selected by PCR with STS primer pairs according to the method described in Yoshido et al.21. The PCR-selected BACs were cultured in 1.5 ml of LB medium containing 20 mg/l chloramphenicol (Cm) for 16 h at 37 °C in a shaking incubator (Bio Shaker BR-23FH, Taitec). Plasmid DNA was then extracted using a standard alkaline SDS method. BAC-DNA probe labeling (18N21 on Chr11; 15F13 and 17O20 on Chr20; and Pieris brassicae 01A06 for NOR detection19) and BAC-FISH were performed according to the method described in Yoshido et al.21. The FISH preparations were counterstained and mounted with Vectashield Antifade Mounting Medium with DAPI (Vector Laboratories). A Leica DM6000B fluorescence microscope (Leica Microsystems) and a DFC350FX black-and-white charge-coupled device camera (Leica Microsystems) were used for observation and image capture. The images were processed with Adobe Photoshop 2022. As a result, we located the NOR of T. varians on chromosome 20 (Fig. 2), whereas in B. mori the NOR is located on chromosome 11.

Fig. 2

BAC-FISH mapping of the NOR. BAC-FISH mapping was performed using BAC probes 17O20 (yellow), 01A06 (green), 18N21 (red), and 15F13 (magenta) on T. varians chromosomes to identify the chromosomal location of the NOR. The images of chromosomes 11 and 20 on the right are the results of a previous BAC-FISH analysis using the same BACs (17O20, 18N21, 15F13) (see Lee et al.5, Fig. 3). The T. varians NOR is located in the middle of chromosome 20, whereas the NOR of Bombyx mori is located on chromosome 11. Significant chromosome elongation can be observed in the region near the NOR. The picture of chromosome 20 in the top right is from a different sample and shows that the two probes, 17O20 and 15F13, paint the same chromosome. For comparison, a picture of chromosome 11 from the same sample is also shown in the bottom right.

RNA sample preparation for sRNA-seq, RNA-seq, and Iso-seq

All RNA samples were prepared exactly as previously described5. Total RNA was extracted from pooled embryos and from larval, pupal, and adult tissues using TRIzol reagent (Invitrogen) according to the manufacturer’s protocol. Embryos were sampled 72 hours after oviposition. Testis- and ovary-derived RNA samples were subjected to RNA-seq and sRNA-seq. Embryonic RNA samples were subjected to sRNA-seq, RNA-seq, and Iso-seq.

Library preparation for sRNA-seq, RNA-seq, and Iso-seq

The sRNA-seq library was prepared using the TruSeq small RNA kit (Illumina) according to the manufacturer’s protocol with a slight modification. To target piRNAs, a region of 147–158 nucleotides was extracted with BluePippin (Sage Science, USA) in the purification step of the cDNA construct. The constructed library was sequenced on the Illumina HiSeq 2500 platform (Illumina, USA). The RNA-seq library was prepared using the NEBNext Poly(A) mRNA Magnetic Isolation Module (New England BioLabs) and the NEBNext® Ultra™ II Directional RNA Library Prep Kit (New England BioLabs) according to the manufacturer’s protocol. The constructed library was sequenced on the Illumina NovaSeq 6000 platform (Illumina, USA). For Iso-seq, the library was constructed using Sequel Iso-seq Express Template Prep (Pacific Biosciences, USA) according to the manufacturer’s protocol. The constructed library was sequenced on the PacBio Sequel platform (PacBio, USA).

Transcriptome-based gene prediction

BRAKER3 (v 3.0.8) was used for gene prediction22,23. The RNA-seq and Iso-seq data were submitted to BRAKER3 separately24, and the respective predictions were finally merged with TSEBRA25. Details of the transcriptome data are summarised in Table 1. Quality trimming of the short-read data was conducted using fastp (v 0.20.1)26 with the options ‘-q 28 -l 80’. Trimmed short-read data were submitted to BRAKER3 using the ‘--rnaseq_sets_ids’ option, and the short reads were then aligned to the genome assembly by hisat2 (v 2.2.1)27. The alignment rates of the short-read data to the genome assembly are summarised in Table 2. For the Iso-seq data, a consensus sequence was generated for each read cluster according to the following procedure28. Iso-seq subreads were converted to circular consensus sequences (CCSs) using ccs v 6.4.0 with the options ‘--minLength 10 --maxLength 100000 --minPasses 0 --minSnr 2.5 --minPredictedAccuracy 0.0’. lima (v 2.7.1) was used to remove primer sequences from the CCSs with the options ‘--isoseq --peek-guess --ignore-biosamples’. After adaptor trimming, poly(A) tail trimming and concatemer removal were performed by isoseq3 v 3.8.2 in ‘refine’ mode with the option ‘--require-polya’. Finally, isoform-level clustering was conducted by isoseq3 in ‘cluster’ mode with the option ‘--use-qvs’. The resulting clustered.bam file was submitted to BRAKER3. Prior to gene prediction with the Iso-seq data, BUSCO analysis of the genome assembly was conducted to obtain complete and single-copy BUSCO sequences5,29, which were submitted to BRAKER3 together with the Iso-seq-derived bam file. Since we had two Iso-seq datasets (Table 1), we ran BRAKER3 for each of them separately. BUSCO analysis29 of the constructed gene models gave a completeness score of 98.6% (Fig. 3a). Basic statistics of the predicted gene models are summarised in Table 3.

Table 1 Transcriptome data used in this study.
Table 2 Mapping rates of RNA-seq data.
Table 3 Statistical summary of the constructed gene models.

Functional annotation of gene models

The deduced amino acid sequences of the gene models were submitted to EnTAP6 for functional annotation. A protein similarity search was conducted against the latest complete UniProtKB/TrEMBL and UniProtKB/Swiss-Prot protein data sets using diamond (v 0.9.14)30. A protein orthology search was also conducted against the EggNOG databases31 to assign Gene Ontology (GO) terms, KEGG terms, and protein domains from pfam32 and smart33. An additional family and domain search was performed against tigrfam34, sfld35, hamap36, cdd37, superfamily38, prints39, panther40, and gene3d41 using InterProScan (v 5.68–100)42. The results of the functional annotation are summarised in Table 4. The top 10 GO terms assigned to the gene models are shown in Fig. 3b without distinguishing between molecular function, biological process, and cellular component. The top 10 GO terms for each category are shown in Fig. 3c–e.

Table 4 Brief summary of functional annotation.
Fig. 3

BUSCO assessment and the top 10 GO assignments of transcriptome-based predicted gene models. (a) BUSCO scores of the gene models (top) and the genome assembly (bottom). (b) Overall top 10 GO assignments to gene models. (c) top 10 GO assignments of “biological process” in gene models. (d) top 10 GO assignments of “cellular component” in gene models. (e) top 10 GO assignments of “molecular function” in gene models.

ATAC library preparation and data processing

A separate batch of the early embryo samples used for RNA-seq, Iso-seq, and sRNA-seq was subjected to ATAC-seq. Fragmentation and amplification of the ATAC-seq libraries were conducted according to Buenrostro et al.43. The constructed libraries were sequenced on the Illumina HiSeq. ATAC-seq reads were preprocessed with fastp and mapped to the genome with bwa-mem2 (v 2.2.1)44. Alignments containing mismatches were then removed using bamutils (v 0.5.9)45. Next, duplicated reads were removed using GATK MarkDuplicates (v 4.1.7)46. The resulting bam files were converted to bigwig files using deepTools (v 3.5.1)47 bamCoverage with a bin width of 10 bp. The number of reads per bin was normalised by the “reads per genomic content” (RPGC) method. A heatmap was created using deepTools computeMatrix, with the start of each gene model set as the reference point (Fig. 4).
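As we understand the deepTools documentation, RPGC normalisation divides the read count in each bin by the sequencing depth, where depth = (total number of mapped reads × fragment length) / effective genome size, so that a value of 1 corresponds to 1× average coverage. The arithmetic can be sketched in Python as follows. This is not the deepTools code: the function names are hypothetical, and the numbers in the usage note are invented for illustration and are not taken from this study.

```python
def rpgc_scale_factor(mapped_reads: int, fragment_length: float,
                      effective_genome_size: int) -> float:
    """Return the factor that rescales raw per-bin values to 1x average coverage.

    depth = mapped_reads * fragment_length / effective_genome_size
    The scale factor is the reciprocal of that depth.
    """
    depth = mapped_reads * fragment_length / effective_genome_size
    return 1.0 / depth


def normalise_bins(bin_values, mapped_reads, fragment_length,
                   effective_genome_size):
    """Apply the RPGC scale factor to a sequence of per-bin values."""
    factor = rpgc_scale_factor(mapped_reads, fragment_length,
                               effective_genome_size)
    return [value * factor for value in bin_values]
```

For example, with a hypothetical library of 1,000,000 mapped reads, a fragment length of 100 bp, and an effective genome size of 50 Mb, the depth is 2×, so `normalise_bins([4, 2, 0], 1_000_000, 100, 50_000_000)` halves each bin value.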

Fig. 4

Heatmap around gene bodies of ATAC-seq on early embryos. The Y axis of the profile plot on the top indicates the normalised read counts per bin (10-bp).

Small RNA mapping

The small RNA reads were trimmed using Trim Galore v 0.6.6 (https://github.com/FelixKrueger/TrimGalore) in small RNA mode. The trimmed small RNA reads were mapped to the assembled transcriptome using Hisat2 (v 2.1.0)27 and ngsutils (v 0.5.9)45, allowing up to 3 nucleotide mismatches. Information on each library is summarized in Table 1.

piRNA cluster detection

piRNA cluster (piC) detection was performed as previously described5. proTRAC (v 2.4.4)48 was used with the options ‘-clsize 5000 -pimin 23 -pimax 29 -1Tor10A 0.3 -1Tand10A 0.3 -clstrand 0.0 -clsplit 1.0 -distr 1.0-99.0 -spike 90-1000 -nomotif -pdens 0.05’. As a result, we identified a total of 517 piRNA clusters in the three tissues (Fig. 5). The identity of a piC was defined by its two nearest (upstream and downstream) gene models. If multiple piCs were predicted between the same pair of gene models, they were treated as a single piC. The genomic positions of the piCs identified in testes, ovaries, and early embryos were visualized with RIdeogram (v 0.2.2)49 (Fig. 5a). The sharing of piCs among tissues was visualized with ComplexUpset (v 1.3.3)50 (Fig. 5b).
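The identity rule described above can be illustrated with a short Python sketch. It is a simplified reading of the rule rather than the code used in this study: the function names, the gene identifiers in the usage note, and the choice to define “upstream” and “downstream” by genomic coordinate (ignoring strand, and assuming that piCs do not overlap gene models) are all assumptions.

```python
from collections import Counter, defaultdict


def flanking_genes(genes, start, end):
    """Return (upstream_id, downstream_id) for an interval on one chromosome.

    genes: list of (gene_start, gene_end, gene_id) tuples.
    A side with no gene model is reported as None.
    """
    upstream = max((g for g in genes if g[1] < start),
                   key=lambda g: g[1], default=None)
    downstream = min((g for g in genes if g[0] > end),
                     key=lambda g: g[0], default=None)
    return (upstream[2] if upstream else None,
            downstream[2] if downstream else None)


def assign_identities(pics, genes_by_chrom):
    """Group piCs by (chromosome, upstream gene, downstream gene).

    pics: iterable of (chrom, start, end).
    piCs sharing a key are treated as one piC.
    """
    merged = defaultdict(list)
    for chrom, start, end in pics:
        up, down = flanking_genes(genes_by_chrom.get(chrom, []), start, end)
        merged[(chrom, up, down)].append((start, end))
    return dict(merged)


def shared_identities(per_tissue):
    """Count piC identities for each exact combination of tissues,
    i.e. the exclusive intersections drawn as bars in an UpSet plot.

    per_tissue: dict mapping tissue name -> set of piC identities.
    """
    membership = defaultdict(set)
    for tissue, identities in per_tissue.items():
        for identity in identities:
            membership[identity].add(tissue)
    return dict(Counter(frozenset(t) for t in membership.values()))
```

With hypothetical gene models `g1` (100–200), `g2` (1000–1200), and `g3` (5000–5500) on one chromosome, two embryonic piCs at 300–400 and 600–700 collapse into a single piC keyed by (`g1`, `g2`), and a testis piC at 350–450 receives the same key and is therefore counted as shared between the two tissues.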

Fig. 5

piRNA clusters in the T. varians genome. (a) Distribution of piCs detected in early embryos (box), pupal ovary (circle), and pupal testis (triangle). (b) UpSet plot of the piCs assigned to each tissue. The vertical bars correspond to the intersections. When the circles corresponding to tissues are connected by a line, the bar above the circles represents the number of piCs identified in common in those tissues. The identity of a piC was defined by its two nearest gene models: when comparing piCs identified in different tissues, piCs with the same nearest upstream and downstream gene models were treated as the same piC.

Data Records

The raw sequence data reported in this paper have been deposited in DDBJ. RNA-seq and Iso-seq data derived from tissues other than early embryos were registered under the accession code PRJDB941951. Embryonic Iso-seq, embryonic RNA-seq, small RNA-seq, and ATAC-seq data are available under the accession code PRJDB1395552 [DRR396187, DRR396188, DRR396189, DRR396190, DRR396191, DRR515037]. Annotated gene models have been deposited in the figshare repository53.

Technical Validation

To assess the quality of the gene models, BUSCO v. 5.4.66 with the lepidoptera_odb10 lineage dataset was used. For comparison, the results are summarized in Fig. 3a, together with the results of the BUSCO analysis of the genome assembly. 98.58% of the complete and single-copy BUSCO sequences were present in the gene models, while 98.66% of the complete and single-copy BUSCO sequences were present in the genome assembly. BUSCO completeness scores were almost the same between the genome assembly and the gene models, suggesting that the gene prediction recovered nearly all of the conserved genes present in the assembly. The mapping rates of the RNA-seq data to the genome assembly are summarized in Table 2. The mapping rates ranged from 87.5% to 93.7% across all samples. The mapping rates and the above-mentioned BUSCO completeness scores support the quality of both the RNA-seq data and the genome assembly.