Background & Summary

Castanea is a genus of plants in the family Fagaceae that has significant ecological and economic value1, including excellent nutritional quality2. The genus comprises seven species: the Chinese chestnut (Castanea mollissima), Seguin chestnut (Castanea seguinii), Henry chestnut (Castanea henryi), and Japanese chestnut (Castanea crenata) in East Asia; the American chestnut (Castanea dentata) and chinkapin (Castanea pumila) in North America; and the European chestnut (Castanea sativa) in Europe. Chestnuts play an important role in nut production and forest ecosystem services3. Chinese chestnuts are widely cultivated in 26 provinces of China4, with a nut yield of 1,562,685 tons in 2022, which accounted for 74% of global chestnut production and ranked first in the world5.

Recent advancements in high-throughput sequencing technologies have enabled the continuous accumulation of genomic data for Castanea species. However, existing resources are relatively scattered and lack integration, limiting researchers’ ability to conduct in-depth studies on Castanea genetics, functional genomics, and evolutionary biology.

To address this gap, we have gathered and analyzed a comprehensive collection of genomic datasets for Castanea. The datasets include genomic information from eight Castanea genomes, 213 RNA-Seq samples, and 330 resequencing samples, and are publicly available on Figshare. This curation and validation process ensures the reliability and utility of the data for researchers.

Additionally, the Castanea Genome Database (CGD) serves as a complementary platform to enhance the usability of these datasets. The CGD provides a user-friendly interface and a suite of advanced data mining and analysis tools, including BLAST, Batch Query, GO/KEGG Enrichment Analysis, and Synteny Viewer. These tools facilitate the exploration and analysis of the curated datasets, enabling researchers to delve deeper into Castanea genetics and functional genomics.

We envision the CGD as an essential resource for advancing functional genomic research and understanding the evolutionary relationships within the Castanea genus. By making these datasets and tools readily accessible, we aim to facilitate collaborative research and drive innovation in the field.

Methods

Data collection

The CGD contains eight chestnut genomes: seven from C. mollissima (‘HBY-2’6,7, ‘N11-1’8, ‘Sun’9, ‘Vanuxem’10, ‘drought-resistant’(H7)11, ‘early-maturing’(ZS)11 and ‘easy-pruning’(YH)11) and one from C. crenata12 (Table 1). Genomic information for chestnut, including genome sequences, mRNA sequences, gene structure annotations in general feature format (GFF), and the coding (CDS) and protein sequences of protein-coding genes, was obtained from previously published articles. We collected the genome of ‘Vanuxem’ from the Hardwood Genomics website (now relocated to the TreeGenes Database, https://treegenesdb.org/), which includes several transcriptome libraries from trees affected by chestnut blight disease10,13. The TreeGenes Database also contains data for eight chestnut species; however, it is not specific to chestnut and covers many other tree species. Note that we previously generated the C. mollissima cv. HBY-2 genome using Pacific Biosciences single-molecule sequencing technology6 and subsequently performed Hi-C analysis on it to improve the assembly and its contiguity7. This version therefore exhibits superior genome contiguity and sequence quality compared with the previous version. In addition, RNA-Seq data from samples representing different tissues, developmental stages, and cultivars (Supplementary Table 1) were downloaded from the NCBI SRA database, and resequencing datasets from 330 accessions were downloaded from the NCBI database7,9.

Table 1 List of collected Castanea genomes.

Data processing

Gene functional annotation

In our previous study, we developed a standard pipeline for comprehensively annotating predicted protein-coding genes14. In brief, the protein sequences of the predicted genes were searched against the UniProt (Swiss-Prot and TrEMBL), NCBI nonredundant (nr), and Arabidopsis protein (TAIR) databases using DIAMOND15 with an E-value cutoff of 1e-4. All protein sequences were also compared against the InterPro database using InterProScan16 to identify functional domains. To generate GO and KEGG pathway annotations for functional enrichment analyses, protein sequences were aligned with the EggNOG database using eggnog-mapper17. The GO terms assigned to genes/transcripts from the eggnog-mapper results were converted into the GO Annotation File (GAF) format, and KEGG pathways unrelated to plants were discarded from the eggnog-mapper results. The iTAK program was employed to identify transcription factors (TFs), transcriptional regulators (TRs), and protein kinases (PKs) among the predicted protein-coding genes and to classify them into families18.
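As an illustration of the homology-search filtering step described above, the following minimal Python sketch keeps, for each query protein, the highest-scoring hit that passes the 1e-4 E-value cutoff. It assumes the standard 12-column BLAST-style tabular output (E-value in column 11, bit score in column 12). The function name best_hits, the toy records, and the use of the bit score to choose among passing hits are illustrative assumptions and are not taken from the CGD pipeline itself.

```python
import csv
import io

EVALUE_CUTOFF = 1e-4  # cutoff stated in the text


def best_hits(tsv_handle, evalue_cutoff=EVALUE_CUTOFF):
    """Return {query_id: (subject_id, evalue, bitscore)} for the best passing hit.

    Assumes 12-column tabular output: qseqid, sseqid, pident, length, mismatch,
    gapopen, qstart, qend, sstart, send, evalue, bitscore.
    """
    best = {}
    for row in csv.reader(tsv_handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comments
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue > evalue_cutoff:
            continue  # hit is too weak under the assumed "at or below" rule
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Made-up records for demonstration only (hypothetical gene and subject IDs).
example = "\n".join([
    "geneA\tsubj1\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t200",
    "geneA\tsubj2\t95.0\t150\t7\t0\t1\t150\t1\t150\t1e-80\t310",
    "geneB\tsubj3\t30.0\t40\t28\t0\t1\t40\t1\t40\t1e-2\t35",
    "geneC\tsubj4\t45.0\t80\t44\t0\t1\t80\t1\t80\t5e-5\t50",
])
hits = best_hits(io.StringIO(example))
for gene, (subject, evalue, bitscore) in sorted(hits.items()):
    print(gene, subject, evalue, bitscore)
```

In this toy input, geneB has no hit below the cutoff and is therefore absent from the result, while geneA is assigned its higher-scoring subject.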

After the above analysis, we obtained files containing homologous genes identified by BLAST, protein functional domains, AHRD-based functional descriptions, and GO/KEGG annotations, which have been shared in our database and on Figshare.

RNA-Seq analysis

A dedicated pipeline was used to process and analyze hundreds of RNA-Seq datasets downloaded from the NCBI SRA database. First, the quality of the raw RNA-Seq reads was evaluated using the FastQC software (v0.11.9)19, and adaptor and low-quality sequences were removed using Trimmomatic20. Trimmed reads shorter than 80% of their initial length were discarded, and the remaining cleaned reads were aligned to the reference genome (HBY-2) using STAR (version 2.7.10b)21. Finally, read counts for each gene were calculated from the alignments and normalized to fragments per kilobase of transcript per million mapped fragments (FPKM) values22.
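The two numerical rules in this pipeline, the 80% length filter and the FPKM normalization, can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used to build the CGD matrices: FPKM is taken as 10^9 × C / (N × L), where C is the fragment count of a gene, L is its length in base pairs, and N is assumed here to be the sum of counts over all genes in the sample (the text does not specify how N or L were defined). The function names and the toy counts are hypothetical.

```python
def keep_trimmed_read(initial_len, trimmed_len, min_percent=80):
    """Keep a read only if it retains at least min_percent of its initial length."""
    # Integer arithmetic avoids floating-point surprises at the boundary.
    return 100 * trimmed_len >= min_percent * initial_len


def fpkm(counts, lengths_bp):
    """Convert per-gene fragment counts to FPKM.

    counts:     {gene_id: fragment count}
    lengths_bp: {gene_id: gene (transcript) length in base pairs}
    N is assumed to be the total count across all genes in the sample.
    """
    total = sum(counts.values())
    if total == 0:
        return {gene: 0.0 for gene in counts}
    return {
        gene: counts[gene] * 1e9 / (lengths_bp[gene] * total)
        for gene in counts
    }


# Made-up example: two genes in one sample.
toy_counts = {"g1": 100, "g2": 300}
toy_lengths = {"g1": 1000, "g2": 2000}
toy_fpkm = fpkm(toy_counts, toy_lengths)
print(toy_fpkm)
print(keep_trimmed_read(150, 120), keep_trimmed_read(150, 119))
```

Although g2 has three times as many fragments as g1, its FPKM is only 1.5 times higher because it is twice as long, which is the length correction that makes values comparable across genes.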

Following the analysis, we generated the raw and normalized expression matrices and made them available both in our database and on Figshare.

Variant identification

To identify variants, a pipeline developed by Sentieon Inc.23 was used with default parameters. First, read quality was evaluated using the FastQC software (v0.11.9)19, and the resequencing reads were processed to remove adapter and low-quality sequences using Trimmomatic20. Then, the cleaned reads were aligned to the reference genome (HBY-2) using the BWA-MEM algorithm with default parameters24. Following alignment, duplicated reads were removed using the ‘LocusCollector’ and ‘Dedup’ algorithms of Sentieon. Finally, variants were called using the ‘Haplotyper’ algorithm of the Sentieon software (https://www.sentieon.com/).

As a result of the analysis, we produced VCF files and have shared them through our database and Figshare.

Data Records

All functional annotations, expression profiles, and called variant data have been uploaded to Figshare (https://figshare.com/s/d8505ab07724a111b1f3)25, where researchers can access the required files. Additionally, the expression profiles are available on the Gene Expression Omnibus (GEO) under accession numbers GSE28451026, GSE28451627, GSE28451728, and GSE28451829. The variant data are accessible through the European Variation Archive (EVA)30. On Figshare, we have organized the data into three main categories: genomic data, gene expression data, and variant data.

The genomic data category includes comprehensive functional annotations for all collected genes. These annotations cover homologous genes identified by BLAST, protein functional domains, AHRD-based functional descriptions, and GO/KEGG annotations. The files are stored in a compressed archive named “genome_anno.tgz,” with subfolders named according to the respective genomes. Additionally, the genome files for HBY-2 have also been uploaded in this section. The gene expression data category consists of both the raw and normalized expression matrices. These files are stored in a compressed archive named “gene_expression.zip.” The variant data category includes VCF files for variant data, all starting with “variants_data.” It encompasses 330 samples from the GBS (Genotyping-by-Sequencing) and resequencing projects.

Additionally, all genomic sequences, functional annotations, expression profiles, and called variant data can be downloaded from the CGD under the “Tools -> Download” module. The sample information for the hierarchical clustering heatmap can be found in Supplementary Table 1. Notably, the updated data for HBY-2 has also been shared on Figshare (https://figshare.com/articles/dataset/Wild_Chinese_chestnut_genome_V2_/28098758).

By organizing the data in this structured manner, we aim to facilitate easy access and usability for researchers interested in chestnut genomics and related studies.

Technical Validation

The integrity of genomes

We evaluated genome integrity using BUSCO (version 3)31, which assesses completeness based on orthologous genes conserved among species. The eight genomes were analyzed together, and the proportions of complete genes identified from the BUSCO database (1,614 genes in total) were 98.0%, 97.6%, 94.0%, 98.5%, 95.6%, 94.3%, 98.5%, and 92.6%, respectively (Table 2). These results indicate that the integrity of the genomes is satisfactory.

Table 2 BUSCO assessment of the completeness of genomes.

RNA-Seq data

Across the 213 samples, 91.81% of the clean reads were mapped to the reference genome (HBY-2). To confirm that the expression data accurately reflect the diverse tissues and developmental stages, hierarchical clustering analysis (Fig. 1) was conducted in R based on the FPKM values of the chestnut RNA-Seq samples (Supplementary Table 1). The heatmap is divided into four color blocks from top to bottom: the first block represents seed kernel and embryo at relatively early stages, the second block represents leaves, buds, galled leaves, etc., at similar stages, the third block indicates seed kernel and embryo at relatively later stages, and the fourth block includes somatic embryo, embryo, root, and callus at similar stages. Among the 213 samples shown in the heatmap, nine are not clustered within the expected blocks (marked with red horizontal lines in the figure). Overall, samples from the same tissue at similar developmental stages clustered together. For example, seed kernel tissue is similar to embryo tissue, and samples of the two at similar developmental stages cluster together. The same holds for galled leaves, leaves, insect galls, and some bud tissue, which supports the reliability of the data we collected.

Fig. 1

Heatmap of the hierarchical clustering of different RNA-Seq samples based on FPKM. The four color blocks from top to bottom respectively represent: (1) seed kernel and embryo at relatively early stages, (2) leaves at similar stages, buds, galled leaves, etc., (3) seed kernel and embryo at relatively later stages, (4) somatic embryo at similar stages, embryo, root and callus, etc. Note: There is one cluster of leaf tissue in the third color block, and four clusters of leaf tissue and four clusters of bud tissue in the fourth color block, all marked with red lines on the blocks.
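To make the clustering check more concrete, the following self-contained Python sketch groups samples by expression profile. It is only an illustration: the analysis reported here was performed in R, and the text does not specify the transformation, distance measure, or linkage method, so the log2(FPKM + 1) transform, the 1 − Pearson correlation distance, and average linkage used below are all assumptions. Sample names and FPKM values are made up, and each profile is assumed to be non-constant so that the correlation is defined.

```python
import math


def log_profile(fpkm_values):
    """log2(FPKM + 1) transform (an assumed, commonly used choice)."""
    return [math.log2(v + 1.0) for v in fpkm_values]


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_linkage(profiles, n_clusters):
    """Agglomerative clustering of samples, stopped at n_clusters groups."""
    names = list(profiles)
    dist = {
        (a, b): 1.0 - pearson(profiles[a], profiles[b])
        for a in names for b in names if a != b
    }
    clusters = [[name] for name in names]

    def cluster_distance(c1, c2):
        # Average of all pairwise sample distances between the two clusters.
        return sum(dist[(a, b)] for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs,
                   key=lambda p: cluster_distance(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return [sorted(c) for c in clusters]


# Made-up FPKM values for four genes in four hypothetical samples.
toy_fpkm = {
    "leaf_1": [100, 90, 1, 2],
    "embryo_1": [1, 3, 200, 150],
    "leaf_2": [120, 80, 2, 1],
    "embryo_2": [2, 1, 180, 170],
}
toy_profiles = {name: log_profile(v) for name, v in toy_fpkm.items()}
groups = average_linkage(toy_profiles, n_clusters=2)
print(sorted(groups))
```

With these toy values the two leaf samples and the two embryo samples fall into separate groups, mirroring the tissue-wise blocks described for Fig. 1.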

Resequencing data

To verify the accuracy of Sentieon in detecting variants, several samples were randomly selected from the resequencing data, and the GATK pipeline was used to identify SNPs in them. Briefly, the cleaned reads were aligned to the reference genome (HBY-2) using the BWA-MEM algorithm with default parameters24; 93.81% of the clean reads were mapped. Duplicated reads were then removed from the alignments using the MarkDuplicates algorithm from Picard. Variants were called using the ‘HaplotypeCaller’ algorithm of GATK3.8 and cross-validated against the Sentieon calls, yielding 12,631,432 and 12,314,460 SNPs, respectively (Table 3). The concordance rates of the common SNPs identified by Sentieon and GATK3.8 are 99.943% and 97.435%, respectively, indicating a high level of consistency between the results.
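One way to read the concordance figures is as the fraction of each caller's SNPs that is also reported by the other caller. The short Python sketch below illustrates that calculation on made-up calls; the definition itself, the function name concordance, and the (chromosome, position, reference, alternate) key used to match SNPs are assumptions, since the text does not state how the rates were computed. The final lines check that, under this assumed definition and pairing the smaller SNP total with the higher percentage, the totals and percentages given in the text imply nearly the same number of shared SNPs.

```python
def concordance(calls_a, calls_b):
    """Return (shared/|A|, shared/|B|) for two sets of SNP keys."""
    shared = calls_a & calls_b
    return len(shared) / len(calls_a), len(shared) / len(calls_b)


# Made-up SNP keys: (chromosome, position, reference allele, alternate allele).
sentieon_calls = {
    ("chr1", 101, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr2", 77, "G", "A"),
}
gatk_calls = sentieon_calls | {("chr2", 900, "T", "C")}

rate_sentieon, rate_gatk = concordance(sentieon_calls, gatk_calls)
print(rate_sentieon, rate_gatk)

# Consistency check on the figures in the text (assumed pairing, see lead-in):
# both products estimate the number of SNPs shared by the two callers.
implied_shared_a = 12_314_460 * 0.99943
implied_shared_b = 12_631_432 * 0.97435
print(round(implied_shared_a), round(implied_shared_b))
```

Because the caller with fewer SNPs necessarily has the higher rate when both rates share the same numerator, the two percentages differ even though they describe the same overlap.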

Table 3 Statistics of SNPs identified by Sentieon and GATK.

Usage Notes

The CGD serves as a complementary platform to primary datasets hosted on Figshare and other repositories, enhancing data accessibility and utility. The following sections provide a concise overview of how the CGD enables efficient analysis and exploration of the aforementioned datasets.

Each gene feature page in the CGD offers basic gene details, an interactive genome browser, and access to mRNA, protein, and genomic sequences, along with homologous genes, functional annotations, protein domains, and syntenic blocks (Fig. 2). To facilitate dataset storage, browsing, and querying, we have implemented a versatile gene search tool that supports flexible queries based on selected genomes from a drop-down menu. Users can quickly locate specific genes of interest by entering keywords such as gene names, functional descriptions, transcription factor or protein kinase family names, or GO/Pfam terms. The search interface is user-friendly, featuring auto-suggestions to streamline the selection process (Fig. 3A).

Fig. 2

Gene feature page in Castanea Genome Database. (A) Screenshot of the gene page including basic information and gene structure. (B) Screenshot of the gene page containing gene, mRNA, CDS, and protein sequences. (C) Screenshot of the homolog genes and sequence alignments from the BLAST results. (D) Screenshot of the GO terms assigned to the gene. (E) Screenshot of the functional domains predicted from the protein sequence of the gene.

Fig. 3

Query interfaces of data mining tools in Castanea Genome Database. (A) Screenshot of the ‘Search’ interface and result page. (B) Screenshot of the ‘BLAST’ result page. (C) Screenshot of ‘Synteny Viewer’ result page.

The CGD also offers a suite of advanced data mining and analysis tools to maximize the utility of our datasets for comparative genomics, gene function discovery, and molecular breeding. For BLAST searches, the indexed databases are organized into nucleotide and protein categories. Nucleotide databases include indexes for genomic, mRNA, and CDS sequences, while protein databases cover all available protein sequence indexes (Fig. 3B).

Furthermore, we have introduced the Synteny Viewer, a tool designed to simplify the identification and visualization of homologs within specific regions of the Castanea genome (Fig. 3C).

For more information, please visit the CGD at http://castaneadb.net. Detailed descriptions of the site’s features and functionalities are available directly on the platform.